## Comparing data storage options in Python

When it comes to numerical computing, I always gave in to the unparalleled convenience of Matlab, which I think is the best IDE for that purpose. If your data consists of matrices or vectors and fits in main memory, it’s very hard to beat Matlab’s smooth workflow for interactive analysis and quick iteration. Also, with judicious use of MEX, performance is more than good enough. However, over the past two years, I’ve been increasingly using Python (with numpy, matplotlib, scipy, ipython, and scikit-learn), for three reasons: (i) I’m already a big Python fan; (ii) it’s open-source, hence it’s easier for others to reuse your code; and (iii) more importantly, it can easily handle non-matrix data types (e.g., text, complex graphs) and has a large collection of libraries for almost anything you can imagine. In fact, even when using Matlab, I had a separate set of scripts to collect and/or parse raw data, and then turn it into a matrix. Juggling both Python and Matlab code can get pretty messy, so why not do everything in Python?

Before I continue, let me say that, yes, I know Matlab has cell arrays and even objects, but still… you wouldn’t really use Matlab for, say, text processing or web scraping. Yes, I know Matlab has distributed computing toolboxes, but I’m only considering main memory here; these days 256 GB of RAM is not hard to come by, and that’s good enough for 99% of (non-production) data exploration tasks. Finally, yes, I know you can interface Java to Matlab, but that’s still two languages and two codebases.

Storing matrix data in Matlab is easy. The .MAT format works great, is pretty efficient, and can be used with almost any language (including Python). At the other extreme, arbitrary objects can be stored in Python as pickles (the de facto Python standard?). However, (i) they are notoriously inefficient (even with cPickle), and (ii) they are not portable. I could perhaps live with (ii), but (i) is a problem. At some point, I tried out SQLAlchemy (on top of SQLite), which is quite feature-rich but also quite inefficient, since it does a lot of things I don’t need. I had expected to pay a performance penalty, but hadn’t realized how large it was until I measured it. So, I decided to do some quick-n-dirty measurements of various options.
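To make the idea concrete, here is a minimal sketch of what such a quick-n-dirty measurement could look like. It is an illustration under assumptions, not the benchmark behind this post: it uses only the Python 3 standard library (where `pickle` picks up the C accelerator automatically), plain `sqlite3` rather than SQLAlchemy, and a raw dump of C doubles as a stand-in for an efficient binary matrix format. The helper names (`time_it`, `make_matrix`, `roundtrip_*`) and the one-row-per-cell table are hypothetical choices made for the example.

```python
import pickle
import sqlite3
import time
from array import array


def time_it(fn, repeats=3):
    """Return the best wall-clock time (in seconds) over `repeats` calls to fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def make_matrix(rows, cols):
    """Hypothetical test data: a dense rows x cols matrix as a list of lists."""
    return [[float(r * cols + c) for c in range(cols)] for r in range(rows)]


def roundtrip_pickle(matrix):
    """Serialize with pickle (highest protocol) in memory and load it back."""
    blob = pickle.dumps(matrix, protocol=pickle.HIGHEST_PROTOCOL)
    return pickle.loads(blob)


def roundtrip_raw(matrix):
    """Flatten into one array of C doubles, dump to bytes, and rebuild the rows."""
    cols = len(matrix[0])
    blob = array("d", (x for row in matrix for x in row)).tobytes()
    restored = array("d")
    restored.frombytes(blob)
    return [list(restored[i:i + cols]) for i in range(0, len(restored), cols)]


def roundtrip_sqlite(matrix):
    """Store one cell per table row in an in-memory SQLite database, then read back."""
    cols = len(matrix[0])
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cells (r INTEGER, c INTEGER, v REAL)")
    conn.executemany(
        "INSERT INTO cells VALUES (?, ?, ?)",
        ((r, c, v) for r, row in enumerate(matrix) for c, v in enumerate(row)),
    )
    conn.commit()
    values = [v for (v,) in conn.execute("SELECT v FROM cells ORDER BY r, c")]
    conn.close()
    return [values[i:i + cols] for i in range(0, len(values), cols)]


if __name__ == "__main__":
    matrix = make_matrix(200, 50)  # small on purpose; scale up for real tests
    for name, fn in [
        ("pickle", roundtrip_pickle),
        ("raw", roundtrip_raw),
        ("sqlite", roundtrip_sqlite),
    ]:
        seconds = time_it(lambda: fn(matrix))
        print(f"{name:>8}: {seconds * 1000:.2f} ms")
```

Everything here runs in memory, so it measures serialization overhead rather than disk I/O, and the absolute numbers will depend on the machine and the Python version. The one-row-per-cell schema is deliberately naive; a real comparison would also want numpy arrays and on-disk files.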