Field notes: Coursera Machine Learning class
30 Jun 2013

I just completed the Coursera Machine Learning class this week, and enjoyed the experience very much. Let’s get the obvious out of the way: getting a high-quality class, for free, wherever you are, at your own pace, is pretty amazing, and I can put up with a sometimes flaky video player for that. Every quarter in college, I would agonize over which classes to pick from my limited allotment, worried that I might never get the chance to take them again once I graduated. Coursera removes that worry – now I know I can keep learning forever. Thank you!
One thing I was hoping to get from this class was a broader perspective on machine learning. What I knew of the topic, I had learnt from various disparate sources, collecting ideas, algorithms and recipes along the way. As a result, my knowledge was a bit of a hodge-podge. The class really delivered on that front. For the most part, the lectures progressed logically, from linear regression to logistic regression, neural networks and support vector machines, all in a unified manner: define a cost function, use regularization to compensate for over-fitting, and fit the parameters using gradient descent. I really enjoyed the coherence of that progression, which helped me see the commonalities between all the approaches.
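To make that recipe concrete, here it is written out for regularized linear regression – my reconstruction in roughly the course’s notation, not a verbatim slide:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
          + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2,
\qquad
\theta_j \leftarrow \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}
```

Swap in a different hypothesis h and its corresponding cost, and the same template yields logistic regression or a neural network; the descent step barely changes.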
Following that thought, my biggest take-away was the emphasis on over- and under-fitting. I understood the concepts before the class, but they weren’t as prominently on my mind as they are now. This is probably a side-effect of my past experience in optimization and statistics, where the data was easier to visualize and the goal was mostly to find the optimum fit, potentially leading to fragile solutions which wouldn’t generalize – over-fitting wasn’t a problem I gave much thought to. In a space like machine learning, where datasets are too large to get a visual sense of what is going on, keeping that question in mind is important. Relatedly, I found the discussions on how to diagnose a model in order to focus efforts extremely valuable: while more data is usually better, there are situations where it won’t help. With large, hard-to-comprehend datasets, understanding what is potentially going wrong in a model, and why, is key to avoid wasting effort in the wrong direction – or simply to figure out a direction when you are stuck.
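To illustrate the diagnostic idea, here is a minimal learning-curve sketch in Python with NumPy (the course homework was in Octave, so this is my translation, on made-up synthetic data): fit on growing subsets of the training set and watch how training and validation errors evolve.

```python
# A minimal learning-curve sketch (NumPy; the course itself used Octave).
# Train regularized linear regression on growing subsets of the training
# data and compare training vs. validation error at each step.
import numpy as np

def fit_ridge(X, y, lam):
    """Regularized normal equation: solve (X'X + lam * I) theta = X'y."""
    n = X.shape[1]
    reg = lam * np.eye(n)
    reg[0, 0] = 0.0                      # by convention, don't penalize the intercept
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def cost(X, y, theta):
    """Mean squared error, halved - the usual linear regression cost."""
    return np.mean((X @ theta - y) ** 2) / 2

rng = np.random.default_rng(0)
m = 200
x = rng.uniform(-3.0, 3.0, m)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, m)   # noisy synthetic linear data
X = np.column_stack([np.ones(m), x])          # prepend an intercept column

X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

print(" m   train error   validation error")
for size in (10, 25, 50, 100, 150):
    theta = fit_ridge(X_train[:size], y_train[:size], lam=1.0)
    print(f"{size:3d}   {cost(X_train[:size], y_train[:size], theta):11.4f}"
          f"   {cost(X_val, y_val, theta):16.4f}")
```

If both errors plateau high, the model is under-fitting and more data won’t rescue it; a persistent gap between the two suggests over-fitting, where more data or stronger regularization can actually help.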
One aspect I found interesting is that, while quite a few of the models discussed have a long history in statistics (linear and logistic regression, for instance), there was virtually no mention of statistics in the entire class – no goodness-of-fit statistics, no null hypotheses, no discussion of how the parameters are distributed, nothing like that. Instead, it felt much closer to my background in operations research: define the function you are trying to minimize, and minimize it. I am not sure whether this has any implications for where the field of statistics is headed, but I found it intriguing.
Finally, the other aspect that struck me was the emphasis on linear algebra. In essence, one of the messages of the class was “if you want high performance, express your problem in a vectorized form”. I can understand why, from two perspectives: first, computers are really, really good at linear algebra operations, and second, expressing a problem with matrices and vectors is typically nicely compact. Coming from operations research, this is something I am fairly comfortable with. At the same time, the explosion of indices, sub- and super-scripts wasn’t the most pleasant part of the class, and I spent an inordinate amount of time in the programming homework just trying to figure out whether a particular element was a row or a column vector, tinkering with transpositions until the bloody product would just work. I found myself really missing some hints on what the shape of a particular element was, or an early warning that an operation wasn’t going to work. Relatedly, while there is a nice consistency in working in a world where everything is a vector or matrix of floats, I felt slightly disturbed at representing true and false as 1s and 0s (for instance), and missed cleaner functional operations like filter or map.
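For what it’s worth, here is the kind of shape hygiene I ended up wishing for, sketched in Python with NumPy rather than the course’s Octave: assert the expected shapes up front, so a bad transpose fails loudly instead of quietly producing a wrong-but-plausible result.

```python
# The shape hygiene I ended up wishing for, sketched in NumPy:
# assert expected shapes up front so a bad transpose fails loudly,
# instead of broadcasting silently into a wrong-but-plausible result.
import numpy as np

def predict(theta, X):
    """Vectorized hypothesis: h(x) = X @ theta, for all examples at once."""
    m, n = X.shape
    assert theta.shape == (n,), f"expected theta of shape ({n},), got {theta.shape}"
    return X @ theta                    # one matrix product instead of a loop over rows

X = np.array([[1.0, 2.0],
              [1.0, 3.0],
              [1.0, 5.0]])              # 3 examples, intercept column included
theta = np.array([0.5, 2.0])            # intercept and slope
print(predict(theta, X))                # -> [ 4.5  6.5 10.5]
```

In NumPy, for instance, subtracting a (1, 3) row from a (3, 1) column silently broadcasts into a 3×3 matrix – exactly the kind of bug an early shape check catches.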
That’s it – overall, this was time very well spent. I am very glad I did it, and would recommend the class to anyone who is not allergic to math and wants a good introduction to the topic!