Comparing data storage options in Python

When it comes to numerical computing, I have always given in to the unparalleled convenience of Matlab, which I think is the best IDE for that purpose. If your data consists of matrices or vectors and fits in main memory, it's very hard to beat Matlab's smooth workflow for interactive analysis and quick iteration. Also, with judicious use of MEX, performance is more than good enough. However, over the past two years, I've been increasingly using Python (with numpy, matplotlib, scipy, ipython, and scikit-learn), for three reasons: (i) I'm already a big Python fan; (ii) it's open-source, so it's easier for others to reuse your code; and (iii) more importantly, it can easily handle non-matrix data types (e.g., text, complex graphs) and has a large collection of libraries for almost anything you can imagine. In fact, even when using Matlab, I had a separate set of scripts to collect and/or parse raw data and turn it into a matrix. Juggling both Python and Matlab code can get pretty messy, so why not do everything in Python?

Before I continue, let me say that, yes, I know Matlab has cell arrays and even objects, but still… you wouldn't really use Matlab for, say, text processing or web scraping. Yes, I know Matlab has distributed computing toolboxes, but I'm only considering main memory here; these days 256GB of RAM is not hard to come by, and that's good enough for 99% of (non-production) data exploration tasks. Finally, yes, I know you can interface Java with Matlab, but that's still two languages and two codebases.

Storing matrix data in Matlab is easy. The .MAT format works great: it is pretty efficient and can be read from almost any language (including Python). At the other extreme, arbitrary objects can be stored in Python as pickles (the de-facto Python standard?); however, (i) they are notoriously inefficient (even with cPickle), and (ii) they are not portable. I could perhaps live with (ii), but (i) is a problem. At some point, I tried out SQLAlchemy (on top of SQLite), which is quite feature-rich but also quite inefficient, since it does a lot of things I don't need. I had expected to pay a performance penalty, but hadn't realized how large it was until I measured it. So, I decided to do some quick-n-dirty measurements of various options.
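
For concreteness, a quick-n-dirty measurement of this kind can look something like the sketch below (illustrative only, not the actual benchmark; the matrix size and file names are arbitrary):

```python
# Rough sketch of a storage micro-benchmark: pickle vs. NumPy's .npy vs. .MAT
# for a single dense matrix.  (Illustrative only; not the actual benchmark.)
import pickle
import time

import numpy as np
import scipy.io

data = np.random.rand(2000, 2000)  # ~32 MB of float64; fits comfortably in RAM

def timed(label, fn):
    """Run fn() once and print wall-clock time -- crude, but enough for rough comparisons."""
    t0 = time.time()
    fn()
    print("%-8s %6.3f sec" % (label, time.time() - t0))

def dump_pickle():
    with open("data.pkl", "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

timed("pickle", dump_pickle)
timed("npy", lambda: np.save("data.npy", data))
timed("mat", lambda: scipy.io.savemat("data.mat", {"data": data}))
```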


Quick hack: visualizing RBF bandwidth

A few weeks ago, I was explaining the general concepts behind support vector machine classifiers and kernels. The majority of the audience had no background in linear algebra, so I had to rely on a lot of analogies and pictures. I had previously introduced the notions of decision function and decision boundary (the zero-crossing of the decision function), and described dot products, projections, and linear equations as simply as possible.

The overview of SVMs was centered on the observation that the decision function is, ultimately, a weighted additive superposition (linear combination) of evaluations of “things that behave like projections into a higher-dimensional space via a non-linear mapping” (kernel functions) over the support vectors (a subset of the training samples, chosen based on the idea of “fat margins”).
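
In code, that superposition is easy to spell out and check against scikit-learn's own decision_function; here is a small sketch (the toy dataset and the value of gamma are arbitrary choices, purely for illustration):

```python
# The decision function as a weighted sum of kernel evaluations over the
# support vectors: f(x) = sum_i (alpha_i * y_i) * k(s_i, x) + b.
# (Illustrative sketch; toy dataset and gamma are arbitrary.)
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 1.0
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def decision(x):
    # RBF kernel evaluations between x and every support vector
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

x0 = np.array([0.5, 0.0])
print(decision(x0), clf.decision_function([x0])[0])  # the two should agree
```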

Most of the explanations and pictures were based on linear functions, but I wanted to give an idea of what these kernels look like, what their “superposition” looks like, and how the kernel parameters change the picture (and may relate to overfitting). For that, I chose radial basis functions. I found myself doing a lot of handwaving in the process, until I realized that I could whip up an animation. Following that class, I had 1.5 hours during another midterm, so I did just that (Python with Matplotlib animations, FTW!!). The result follows.

Here is how the decision boundary changes as the bandwidth becomes narrower:

For large radii, there are fewer support vectors and the kernel evaluations each cover a large swath of the space. As the radii shrink, all points become support vectors, and the SVM essentially devolves into a “table model” (i.e., the “model” is the data, and only the data, with no generalization ability whatsoever).
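
Frames of that kind can be generated along these lines (a minimal sketch with a toy dataset and an arbitrary range of gammas; this is not the actual script):

```python
# Sketch of the boundary-vs-bandwidth frames: fit an RBF SVM for each gamma
# and draw the zero-level contour of its decision function.
# (Toy dataset and gamma range are arbitrary.)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 300), np.linspace(-1.5, 2, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

# Narrower bandwidth (smaller radius) corresponds to larger gamma.
for gamma in [0.1, 1, 10, 100]:
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
    Z = clf.decision_function(grid).reshape(xx.shape)
    plt.figure()
    plt.contour(xx, yy, Z, levels=[0], colors="k")            # decision boundary
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=20)
    plt.title("gamma = %g, #SV = %d" % (gamma, len(clf.support_)))
plt.show()
```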

This decision boundary is the zero-crossing of the decision function, which can also be fully visualized in this case. One way to understand this is that the non-linear feature mapping “deforms” the 2D plane into a more complex surface (where, however, we can still talk about “projections”, in a way), such that I can still use a plane (z=0) to separate the two classes. Here is how that surface changes, again as the bandwidth becomes narrower:
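
The kind of surface shown in each frame can be rendered along these lines (one bandwidth only; the toy dataset, gamma, and grid limits are arbitrary):

```python
# Sketch of the decision-function surface for one bandwidth; the decision
# boundary is where the surface crosses the z = 0 plane.
# (Toy dataset; gamma and grid limits are arbitrary.)
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 -- registers the 3D projection
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=10.0, C=1.0).fit(X, y)

xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(xx, yy, Z, cmap="coolwarm", linewidth=0, alpha=0.8)
ax.contour(xx, yy, Z, levels=[0], colors="k", offset=0)  # boundary, drawn on z = 0
plt.show()
```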

Finally, in order to justify that, for this dataset, a really large radius is the appropriate choice, I ran the same experiments with multiple random subsets of the training data and showed that, for large radii, the decision boundaries are almost the same across all subsets, but for smaller radii, they start to diverge significantly.
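
A rough sketch of such a stability check (toy data again; the subset size, number of subsets, and the two gammas shown are arbitrary):

```python
# Sketch of the stability check: refit on random subsets of the training data
# and overlay the resulting decision boundaries, for a wide and a narrow kernel.
# (Toy dataset; subset size, number of repeats, and gammas are arbitrary.)
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(-2, 3, 200), np.linspace(-1.5, 2, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
rng = np.random.RandomState(0)

for gamma in [1.0, 100.0]:              # large radius vs. small radius
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", s=10, alpha=0.3)
    for _ in range(10):                 # ten random half-sized subsets
        idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X[idx], y[idx])
        Z = clf.decision_function(grid).reshape(xx.shape)
        plt.contour(xx, yy, Z, levels=[0], colors="k", alpha=0.4)
    plt.title("gamma = %g" % gamma)
plt.show()
```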

Here is the source code I used (warning: this is raw and uncut, no cleanup for release!). One of these days (or, at least, by next year's class) I'll get around to making multiple concurrent, alpha-blended animations for different subsets of the training set, to illustrate the last point better (I used static snapshots instead), and also to give a nice visual illustration of model testing and the ideas behind cross-validation. Of course, feel free to play with the code. ;)
