The blessing of dimensionality

The cover story in Wired by Chris Anderson, “The End of Theory”, relies on a silent assumption, which may be obvious but is still worth stating. The reason that such a “petabyte approach” works is that reality occupies only a tiny fraction of the space of all possibilities.  For example, the human genome consists of about three billion base pairs.  However, not every three-billion-length string of four symbols corresponds to a viable organism, much less an existing one or a human individual.  In other words, the intrinsic dimensionality of your sample (the human population) is much smaller than the raw dimensionality of the space of possibilities (about 4^3,000,000,000 strings).
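
Just to drive home how extreme that gap is, here is a back-of-the-envelope calculation (a quick Python sketch; the population figure is a round illustrative number):

```python
import math

GENOME_LENGTH = 3_000_000_000  # base pairs in the human genome (roughly)
ALPHABET_SIZE = 4              # A, C, G, T
POPULATION = 8_000_000_000     # illustrative round number of humans

# Express the count of possible strings as a power of ten:
# log10(4^(3e9)) = 3e9 * log10(4)
raw_exponent = GENOME_LENGTH * math.log10(ALPHABET_SIZE)
sample_exponent = math.log10(POPULATION)

print(f"possible genomes: ~10^{raw_exponent:,.0f}")   # ~10^1,806,179,974
print(f"actual sample:    ~10^{sample_exponent:.0f}")  # ~10^10
```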

I won’t try to justify “traditional” models. But I also wouldn’t go so far as to say that models will disappear, just that many will be increasingly statistical in nature. If you can throw the dice a large enough number of times, it doesn’t matter whether “God” plays them or not.  Einstein’s famous quote suggests that quantum mechanics was originally seen as a cop-out by some: we can’t find the underlying “truth”, so we settle for probability distributions for position and momentum.  However, this was only the beginning.
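
To make the dice remark concrete, here is a minimal Python sketch of the law of large numbers: with enough throws, the empirical mean of a fair die settles at 3.5, no matter who is throwing.

```python
import random

random.seed(42)

# With enough throws, the empirical mean of a fair die
# approaches the theoretical mean of 3.5.
for n in (10, 1_000, 100_000, 1_000_000):
    throws = (random.randint(1, 6) for _ in range(n))
    print(f"{n:>9,} throws: mean = {sum(throws) / n:.4f}")
```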

Still, we need models.  Going back to the DNA example, I suspect that few people model the genome as a single, huge, three-billion-symbol string.  That is not a very useful random variable.  Chopping it up into pieces with different functional significance and coming up with the appropriate random variables, so that one can draw statistical inferences, sounds very much like modeling to me.
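
To sketch what I mean (everything below is hypothetical: the sequence, the region names, and the coordinates are made up for illustration), the modeling is precisely in choosing which slices of the string to treat as random variables:

```python
from collections import Counter

genome = "ACGTACGTTTGACCGTAACG"  # stand-in; the real thing is ~3 billion symbols

# The modeling step: decide which substrings are meaningful units.
# These names and coordinates are purely illustrative.
regions_of_interest = {
    "gene_A": (0, 8),
    "gene_B": (12, 20),
}

for name, (start, end) in regions_of_interest.items():
    segment = genome[start:end]
    # Each region is a far smaller random variable whose distribution
    # across a population can actually be estimated.
    print(name, segment, Counter(segment))
```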

Furthermore, hypothesis testing and confidence intervals won’t go away either.  After all, anyone who has taken a course in experimental physics knows that repeating a measurement and calculating confidence intervals based on multiple data points is a fundamental part of the process (and also the main motivating force in the original development of statistics).  Now we can collect petabytes of data points.  Maybe there is a shift in balance between theory (in the traditional, Laplacian sense, which I suspect is what the article really refers to) and experiment.  But the fundamental principles remain much the same.
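
As a reminder of what that looks like in practice, a small Python sketch (the “measurements” are simulated here, standing in for repeated readings of some physical quantity):

```python
import math
import random

random.seed(0)

# Simulate repeating a noisy measurement of a quantity
# whose true value is 9.81 (think: g, in m/s^2).
measurements = [random.gauss(9.81, 0.05) for _ in range(30)]

n = len(measurements)
mean = sum(measurements) / n
# Sample standard deviation
std = math.sqrt(sum((x - mean) ** 2 for x in measurements) / (n - 1))
# Approximate 95% confidence interval for the mean
# (normal approximation; a t-correction would widen it slightly)
half_width = 1.96 * std / math.sqrt(n)

print(f"mean = {mean:.3f} +/- {half_width:.3f} (95% CI)")
```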

So, perhaps more is not fundamentally different after all, and we still need to be careful not to overfit.  I’ll leave you with a quote from “A Random Walk down Wall Street” by Burt Malkiel (emphasis mine):

[…] it’s sometimes possible to correlate two completely unrelated events.  Indeed, Mark Hulbert reports that stock-market researcher David Leinweber found that the indicator most closely correlated with the S&P 500 Index is the volume of butter production in Bangladesh.
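
Such spurious correlations are easy to manufacture: search through enough unrelated series and one of them will fit.  A quick Python sketch, with random walks standing in for butter production and stock indices:

```python
import random

random.seed(1)

def random_walk(n):
    """A random walk: cumulative sum of +/-1 steps."""
    walk, pos = [], 0.0
    for _ in range(n):
        pos += random.choice((-1.0, 1.0))
        walk.append(pos)
    return walk

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

index = random_walk(250)  # stand-in for a stock index

# Generate many unrelated "indicators" and keep the best-correlated one.
best = max((random_walk(250) for _ in range(1_000)),
           key=lambda series: abs(correlation(index, series)))
print(f"best |correlation| among 1,000 unrelated series: "
      f"{abs(correlation(index, best)):.2f}")
```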

Dimensionality may be a blessing, but it can still be a curse sometimes.