7 August 2017
Recently I was having a conversation with a fellow data scientist, someone with impressive credentials and ample experience. We were discussing the pros and cons of various algorithms. Different strokes for different folks: everyone has their own set of “favorite” tools and techniques. Our profession is (still) so young that there is rarely compelling evidence for the clear superiority of one technique over another. That being said, the strengths and weaknesses of the various algorithms are generally well understood.
What seems less well understood is that an algorithm can only improve a predictive result within the limits of a particular dataset. And there’s the rub. “Predictive quality” is an inherently multi-dimensional attribute of a model; what “better” means can be measured along several attributes. “Accuracy”, i.e. how well the model separates the classes of the outcome variable, is one of them. And even for that single measure there are many ways to skin a cat: accuracy over all cases, or only for the top-scoring 1%, 10%, or 50%? Etc.
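To make the “accuracy for the top 1-10-50%” idea concrete, here is a minimal sketch (with made-up scores and labels, not data from any real model): the same classifier can look quite different depending on which top-scoring slice you evaluate.

```python
def accuracy_at_top(scores, labels, fraction):
    """Accuracy among the top `fraction` of cases ranked by model score,
    treating every top-ranked case as a predicted positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top = ranked[:k]
    # Fraction of the top-ranked cases that are actual positives.
    return sum(label for _, label in top) / k

# Hypothetical model scores and true outcomes for ten cases.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1,    1,    0,    1,    0,    1,    0,    0,    0,    0]

print(accuracy_at_top(scores, labels, 0.10))  # top 10% (the single best-scored case)
print(accuracy_at_top(scores, labels, 0.50))  # top 50% (the five best-scored cases)
```

Which slice matters depends on the business question: a marketing campaign that can only contact the top decile cares about accuracy there, not overall.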
My deeper point is that when you compare multiple algorithms on predictive quality, each will do better along one or more dimensions, and without exception worse along some others. For example: a neural network (deep learning) often delivers the highest predictive accuracy, and in side-by-side comparisons higher accuracy is almost always desirable. But neural networks have drawbacks, too: they require more stringent data preparation, and their inherent black-box mechanics make model effectiveness harder to monitor.
If you look at a prediction problem one level higher, you need to ask questions like: “When should this prediction be optimal?”, “How often will I reuse this model, and what are the corresponding deployment costs?”, “How important is insight into the mathematical dynamics of this model?”, “How easy is it to test and monitor this model?”, etc. The list goes on and on.
After doing this work for some twenty-odd years, I have learned that “data” is usually a far more powerful lever than “algorithm.” Sure, you can always arrive at a more accurate prediction by spending longer tweaking your model. But going back to the primary process and enriching the dataset, doing a more thorough job of cleaning it, and extracting additional features to feed into the model are usually, imho, far more important than tinkering longer with an algorithm, or spending more time comparing algorithms to select “the best” one. But that’s just me, of course…
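What “extracting additional features” looks like in practice can be sketched in a few lines. This is a hypothetical illustration (the record schema and field names are invented for the example): deriving new columns from a raw field the model would otherwise see as opaque.

```python
from datetime import date

def enrich(record, as_of=date(2017, 8, 7)):
    """Derive extra features from a raw customer record (hypothetical schema).
    The raw `last_purchase` date becomes several model-ready signals."""
    d = record["last_purchase"]
    return {
        **record,
        "days_since_purchase": (as_of - d).days,  # recency signal
        "purchase_weekday": d.weekday(),          # 0 = Monday … 6 = Sunday
        "is_weekend": d.weekday() >= 5,           # weekend-shopper flag
    }

row = {"customer_id": 42, "last_purchase": date(2017, 7, 28)}
print(enrich(row))
```

None of these derived features requires a fancier algorithm; they simply hand the model information it could not easily reconstruct from the raw date on its own.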