10 Jan 2017
About 2 years ago, I wrote a little application, @fsibot. @fsibot is a Twitter bot which, when it receives a Tweet that is a valid F# expression, will evaluate it and return the result to the sender. Got to code FizzBuzz in an interview? Impress your audience, and send a Tweet from your cell phone to @fsibot.
It was very fun to write, rather pointless, but turned out to be an interesting exercise, which taught me a lot. And, in spite of its simplicity, it’s a decent sample app, which touches on many aspects a real-world app might encounter.
After some hiccups early on, @fsibot has been running pretty smoothly, until I noticed issues recently. Rather than trying to figure out what the hell was going on, I decided to port it over to Azure Functions, which sounded like a better fit for it. While at it, I also made a couple of changes to the bot. If you are interested, you can find the code on GitHub.
25 Sep 2016
The intent of this post is primarily practical. During the Kaggle Home Depot competition, we ended up using the Random Forest implementation of ALGLIB, which worked quite well for us. Taylor Wood did all the work figuring out how to use it, and I wanted to document some of its aspects, as a reminder to myself, and to provide a starting point for others who might need a Random Forest from F#.
The other reason I wanted to do this is that I have been quite interested lately in the idea of developing a DSL to specify a machine learning model, which could be fed to various algorithm implementations via simple adapters. In that context, I thought taking a look at ALGLIB and how they approached data modelling could be useful.
I won’t discuss the Random Forest algorithm itself; my goal here will be to “just use it”. In order to do this, I will be using the Titanic dataset from the Kaggle “Machine Learning from Disaster” competition. I like that dataset because it’s not too big, but it hits many interesting problems: missing data, features of different types, … I will be using it in two ways, for classification (as is usually the case), but also for regression.
Let’s dive into the ALGLIB random forest. The library is available as a nuget package, alglibnet2. To use it, simply reference the assembly with #r @"alglibnet2/lib/alglibnet2.dll"; you can then immediately train a random forest, using the alglib.dfbuildrandomdecisionforest method - no need to open any namespace. The training method comes in 2 flavors, alglib.dfbuildrandomdecisionforest and alglib.dfbuildrandomdecisionforestx1. The first one is a specialization of the second one, which takes an additional argument; therefore, I’ll work with the second, most general version.
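To make the call shape concrete, here is a minimal sketch of training and using a forest with the more general method. The toy data and all value names are made up for illustration, and the argument order reflects my understanding of the alglibnet2 signature (data, sample size, number of features, number of classes, number of trees, features tried per split, proportion of the sample used per tree), so check it against the version you reference.

```fsharp
#r @"alglibnet2/lib/alglibnet2.dll"

// Toy training set (made up): 2 features per row, followed by the label
// in the last column. For classification, labels are class indices (0 or 1).
let trainingSet =
    array2D [
        [ 0.0; 0.1; 0.0 ]
        [ 0.2; 0.0; 0.0 ]
        [ 0.9; 1.0; 1.0 ]
        [ 1.0; 0.8; 1.0 ]
    ]

let sampleSize = Array2D.length1 trainingSet  // number of rows
let features = 2           // number of input columns
let classes = 2            // assumption: 1 here would mean regression
let trees = 50
let featuresPerSplit = 1   // the additional argument of the x1 variant
let proportion = 0.5       // share of the sample used to grow each tree

// The out parameters of the .NET method come back as a tuple in F#.
let info, forest, report =
    alglib.dfbuildrandomdecisionforestx1(
        trainingSet, sampleSize, features, classes,
        trees, featuresPerSplit, proportion)

// dfprocess fills the array with one value per class.
let mutable probabilities = Array.empty<float>
alglib.dfprocess(forest, [| 0.95; 0.9 |], &probabilities)
```

The info value signals whether training succeeded, and report carries error estimates; both are worth inspecting before trusting the forest.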
03 Sep 2016
Today, we’ll close our exploration of Gradient Boosting. First, we looked into a simplified form of the approach, and saw how to combine weak learners into a decent predictor. Then, we implemented a very basic regression tree. Today, we will put all of this together. Instead of stumps, we will progressively fit regression trees to the residuals left by our previous model; and rather than using plain residuals, we will leverage DiffSharp, an F# automatic differentiation library, to generalize the approach to arbitrary loss functions.
I won’t go back over the whole setup again here; instead I will just recap what we have at our disposal so far. Our goal is to predict the quality of a bottle of wine, based on some of its chemical characteristics, using the Wine Quality dataset from the UCI Machine Learning repository. (References: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.)
Gist available here
We are using a couple of types to model our problem:
type Wine = CsvProvider<"data/winequality-red.csv",";",InferRows=1500>
type Observation = Wine.Row
type Feature = Observation -> float
type Example = Observation * float
type Predictor = Observation -> float
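As a rough illustration of how these types fit together, the sketch below swaps the CSV type provider for a hand-written record, so that it runs without the data file. The record fields, the sample values, and the helper names (alcohol, toExample, learnMean, cost) are assumptions made for the example, not part of the original code.

```fsharp
// Stand-in for Wine.Row, so the sketch runs without the CSV file.
// The field names are hypothetical.
type Observation = { Alcohol: float; Quality: float }

type Feature = Observation -> float
type Example = Observation * float
type Predictor = Observation -> float

// A feature extracts one measurement from an observation.
let alcohol : Feature = fun obs -> obs.Alcohol

// An example pairs an observation with the value we want to predict.
let toExample (obs: Observation) : Example = obs, obs.Quality

// The crudest possible predictor: always answer the average label.
let learnMean (sample: Example list) : Predictor =
    let average = sample |> List.averageBy snd
    fun _ -> average

// Mean squared error of a predictor over a sample.
let cost (sample: Example list) (predictor: Predictor) =
    sample
    |> List.averageBy (fun (obs, label) -> pown (label - predictor obs) 2)

let sample =
    [ { Alcohol = 9.4; Quality = 5.0 }
      { Alcohol = 9.8; Quality = 5.0 }
      { Alcohol = 11.2; Quality = 8.0 } ]
    |> List.map toExample

let baseline = learnMean sample
```

Because a Predictor is just a function, any learner that returns an Observation -> float can be compared to this baseline using the same cost function.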
28 Aug 2016
One of the reasons I use F# so much is that it’s an awesome scripting language to Get Stuff Done. Case in point: this blog. I recently decided to switch from BlogEngine.NET to Jekyll, which meant porting over nearly 9 years of blog posts (about 300), extracting html-formatted content from SQL and converting it to markdown. After a couple of weeks of manual process, I realized that at the current cadence, it would take me about a year to complete, and that by then I would probably have lost my mind out of boredom. Time for some automation with F# scripts!
14 Aug 2016
In our previous installment, we began exploring Gradient Boosting, and outlined how by combining extremely crude regression models - stumps - we could iteratively create a decent prediction model for the quality of wine bottles, using one Feature, one of the chemical measurements we have available.
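The idea of stacking stumps on a single Feature can be sketched as follows. This is a simplified illustration, not the code from the previous post: the Observation record is a hypothetical stand-in for the wine data, and the threshold search simply tries every observed feature value.

```fsharp
type Observation = { Alcohol: float; Quality: float }  // hypothetical stand-in
type Feature = Observation -> float
type Example = Observation * float
type Predictor = Observation -> float

// A stump splits on one threshold and predicts the average label on each side.
let learnStump (sample: Example list) (feature: Feature) (threshold: float) : Predictor =
    let average (fallback: float) (examples: Example list) =
        match examples with
        | [] -> fallback
        | _ -> examples |> List.averageBy snd
    let overall = sample |> List.averageBy snd
    let under, over =
        sample |> List.partition (fun (obs, _) -> feature obs <= threshold)
    let underValue = average overall under
    let overValue = average overall over
    fun obs -> if feature obs <= threshold then underValue else overValue

// Mean squared error of a predictor over a sample.
let cost (sample: Example list) (predictor: Predictor) =
    sample
    |> List.averageBy (fun (obs, label) -> pown (label - predictor obs) 2)

// Replace each label by what the current predictor still gets wrong.
let residuals (sample: Example list) (predictor: Predictor) : Example list =
    sample |> List.map (fun (obs, label) -> obs, label - predictor obs)

// One boosting step: fit the best stump to the residuals, add it to the model.
let boostStep (sample: Example list) (feature: Feature) (current: Predictor) : Predictor =
    let remaining = residuals sample current
    let thresholds =
        sample |> List.map (fun (obs, _) -> feature obs) |> List.distinct
    let bestStump =
        thresholds
        |> List.map (learnStump remaining feature)
        |> List.minBy (cost remaining)
    fun obs -> current obs + bestStump obs

// Start from a predictor that always answers 0, then boost for a few rounds.
let boost (rounds: int) (sample: Example list) (feature: Feature) : Predictor =
    let start : Predictor = fun _ -> 0.0
    List.fold (fun model _ -> boostStep sample feature model) start [ 1 .. rounds ]

let sample : Example list =
    [ 9.0, 5.0; 10.0, 5.0; 11.0, 7.0; 12.0, 7.0 ]
    |> List.map (fun (alcohol, quality) ->
        { Alcohol = alcohol; Quality = quality }, quality)

let model = boost 3 sample (fun obs -> obs.Alcohol)
```

Each round only has to explain what the previous rounds left unexplained, which is why even a model as weak as a stump keeps lowering the cost on the training sample.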
In and of itself, this is an interesting result: the approach allows us to aggregate mediocre indicators together into a predictor that is better than its individual parts. However, so far, we are using only a tiny subset of the information available. Why restrict ourselves to a single Feature, and not use all of them? And, if the approach works with something as weak as a stump, perhaps we can do better, by aggregating less trivial prediction models?
This will be our goal today: we will create a Regression Tree, which we will use in place of stumps in our Boosting procedure in a future installment.