Comparing data storage options in Python
When it comes to numerical computing, I always gave in to the unparalleled convenience of Matlab, which I think is the best IDE for that purpose.  If your data consists of matrices or vectors and fits in main memory, it’s very hard to beat Matlab’s smooth workflow for interactive analysis and quick iteration.  Also, with judicious use of MEX, performance is more than good enough.  However, over the past two years, I’ve been increasingly using Python (with numpy, matplotlib, scipy, ipython, and scikit-learn), for three reasons: (i) I’m already a big Python fan; (ii) it’s open-source hence it’s easier for others to reuse your code; and (iii) more importantly, it can easily handle non-matrix data types (e.g., text, complex graphs) and has a large collection of libraries for almost anything you can imagine.  In fact, even when using Matlab, I had a separate set of scripts to collect and/or parse raw data, and then turn it into a matrix.  Juggling both Python and Matlab code can get pretty messy, so why not do everything in Python?
Before I continue, let me say that, yes, I know Matlab has cell arrays and even objects, but still… you wouldn’t really use Matlab for e.g., text processing or web scraping. Yes, I know Matlab has distributed computing toolboxes, but I’m only considering main memory here; these days 256GB RAM is not hard to come by and that’s good enough for 99% of (non-production) data exploration tasks. Finally, yes, I know you can interface Java to Matlab, but that’s still two languages and two codebases.
Storing matrix data in Matlab is easy.  The .MAT format works great, is pretty efficient, and can be read from almost any language (including Python).  At the other extreme, arbitrary objects can be stored in Python as pickles (the de-facto Python standard?); however, (i) they are notoriously inefficient (even with cPickle), and (ii) they are not portable.  I could perhaps live with (ii), but (i) is a problem.  At some point, I tried out SqlAlchemy (on top of sqlite), which is quite feature-rich, but also quite inefficient, since it does a lot of things I don’t need.  I had expected to pay a performance penalty, but hadn’t realized how large it was until measuring it.  So, I decided to do some quick-n-dirty measurements of various options.
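For concreteness, here is a minimal sketch of the two extremes just mentioned; the filenames and the variable name X are made up, and it assumes Python 2 (hence cPickle):

```python
import scipy.io
import cPickle as pickle  # Python 2; on Python 3 this would simply be pickle

# .MAT files can be read from Python via scipy.io
mat = scipy.io.loadmat('data.mat')  # dict mapping variable names to numpy arrays
X = mat['X']

# ...whereas arbitrary Python objects can be dumped as pickles (flexible, but slow)
with open('data.pkl', 'wb') as f:
    pickle.dump(X, f, protocol=pickle.HIGHEST_PROTOCOL)
```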
The goal was to compare Python overheads (due to the interpreted nature of Python, the GIL, etc.), not raw I/O performance.  Furthermore, I’m looking for a simple data storage solution, not for a distributed, fault-tolerant, scalable solution (so DHTs, column stores, and their kin, such as memcached, Riak, Redis, HBase, Cassandra, Impala, MongoDB, and Neo4j, are out).  Also, I’m looking for something that’s as “Pythonic” as possible, with reasonably mature options (so I left things like LevelDB and Tokyo Cabinet out).  And, in any case, this is not meant to be an exhaustive list (or a comprehensive benchmark, for that matter); I had to stop somewhere.
In the end, I compared the following storage options:
- sqlite: Load rows as tuples, directly using the sqlite DB-API interface.
- sqlite-obj: Similar, but also convert the loaded tuples into Python objects (one member variable per column); see the sketch after this list.
- sqlalchemy-core: Load rows as tuples, using SqlAlchemy core (on top of sqlite) instead of DB-API directly.
- sqlalchemy-orm: Use the full-blown ORM and session features from SqlAlchemy (on top of sqlite).
- csv: CSV files, manually parsed (file iteration and string split).
- csv-numpy: CSV files parsed using the I/O functions in numpy.
- cpickle: Store records as tuples, using cPickle.
- cpickle-obj: Store records as Python objects (one member variable per column), using cPickle.
- zodb: The Zope Object Database (ZODB), using BTrees.
- pytables: A library for tabular data built on top of HDF5.
- h5py: Another, lower-level library for HDF5 files.
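To make the first two variants concrete, here is a rough sketch of what “load rows as tuples” vs. “convert into objects” means; the table and column names are made up, and this is not the actual benchmark code (that’s in the dump below):

```python
import sqlite3

class Record(object):
    # One member variable per column (the *-obj variants).
    def __init__(self, a, b, c, d, e):
        self.a, self.b, self.c, self.d, self.e = a, b, c, d, e

conn = sqlite3.connect('data.db')
cur = conn.cursor()

# "sqlite": keep the raw tuples yielded by the DB-API cursor
rows = list(cur.execute('SELECT a, b, c, d, e FROM data'))

# "sqlite-obj": additionally construct one Python object per row
objs = [Record(*row) for row in cur.execute('SELECT a, b, c, d, e FROM data')]

conn.close()
```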
Furthermore, I also wanted to get an idea of how easily Python code can be optimized.  In the past, I’d hand-coded C extensions when really necessary, played a little bit with cython, and heard of PyPy (but never tried it).  So, while at it, I also considered the following Python implementations and tools:
- cpython: The usual implementation of Python.
- cython: A superset of Python that compiles to C extensions; no substantial benefit compared to cpython (at least without investing major effort: I only annotated some of the loop variables and left the rest of the code unmodified), so results are omitted.
- pypy: Python JIT compiler; some libraries won’t run on this, so those are missing from the results.
The dataset used was very simple, consisting of five columns/fields of random floating-point numbers (so the data are, hopefully, incompressible), with sizes of up to 500M records.  The dataset size is quite modest, but should be sufficient for the goals stated above (comparing relative Python overheads, not actual disk I/O performance).  File sizes (relative to sqlite, again) are shown below.  For the record, the ‘raw’ data (500,000 rec x 5 floats/rec x 8 bytes/float) would have stood at 0.74, the same as pytables, which has zero overhead (well, 64KB to be exact); sqlite has a 36% overhead.  The ZODB size includes the index, but that’s just 2.7% of the total (caveat: although records were only added, never deleted, I’m not familiar with ZODB and didn’t check whether I should still have done any manual garbage collection).
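For reference, this is roughly how such a dataset can be generated and stored in, e.g., sqlite (again, table and column names are made up; the actual scripts are in the dump below):

```python
import sqlite3
import numpy as np

n = 500000  # one of the dataset sizes
data = np.random.rand(n, 5)  # five columns of random floats

conn = sqlite3.connect('data.db')
conn.execute('CREATE TABLE data (a REAL, b REAL, c REAL, d REAL, e REAL)')
conn.executemany('INSERT INTO data VALUES (?, ?, ?, ?, ?)',
                 (tuple(float(x) for x in row) for row in data))
conn.commit()
conn.close()
```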
Runs were performed on a machine with an ext4-formatted Samsung EVO850 1TB SSD, Ubuntu 14.04 LTS and, FWIW, a Core i7-480K at 3.7GHz.  RAM was 64GB and, therefore, the buffercache was more than large enough to fit all dataset sizes.  One run was used to warm up the cache, and results shown are from a second run.  Note that, particularly in this setting (i.e., reading from memory), many (most?) of the libraries appear to be CPU-bound (due to serialization/deserialization and object construction overheads), not I/O-bound.  I cautiously say “appear to be” since this statement is based on eyeballing “top” output, rather than any serious profiling.
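The measurement scheme is essentially the following (a simplified sketch, where load stands in for any of the loaders above; in the actual scripts the warm-up and timed runs are not necessarily in the same process):

```python
import time

def measure(load):
    # One pass to warm up the buffer cache...
    load()
    # ...and a second, timed pass, which is what's reported.
    start = time.time()
    load()
    return time.time() - start
```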
For full disclosure, here’s a dump of the source files and timing data, provided as-is (so, caveat: far from release quality, not intended for reuse, really messy, undocumented, etc etc—some bits need to be run manually through an iPython prompt and/or commented-in/commented-out, don’t ask me which, I don’t remember :). If, however, anyone finds anything stupid there, please do let me know.
First, a sanity check: wall-clock time vs. dataset size is perfectly linear, as expected:
The next plot shows average wall-clock time (over all dataset sizes) for both cpython and pypy, normalized against that of raw sqlite with cpython:
As usual, I procrastinated several weeks before posting any of this. In the meantime, I added a second EVO850 and migrated from ext4 to btrfs with RAID-0 for data and LZO compression.  Out of curiosity I reran the code.  While at it, I added ZODB to the mix. Here are the results (cpython only, normalized against sqlite on btrfs):
Pytables is, oddly, faster!  For completeness, here are the speedups observed with striping across two disks, vs. a single disk.
Remember that these are (or should be) hot buffercache measurements, so disk I/O bandwidth should not matter, only memory bandwidth.  Not quite sure what is going on here; I don’t believe PyTables uses multiple threads in its C code (and, even if it did, why would the number of threads depend on the… RAID level??).  Maybe some profiling is in order (and, if you have any ideas, please let me know).
Comparing Python implementations. Woah, look at PyPy go!  When it works, it really works.  SqlAlchemy goes from 2.5x slower (with the low-level core API) and 25x slower (with all the heavyweight ORM machinery) to almost directly competitive with raw sqlite and 6x slower (a 4x speedup), respectively.  Similarly, manual object re-construction on top of raw sqlite now has negligible overhead.  However, most libraries unfortunately do not (yet) run on PyPy.  More importantly, the frameworks I need for data analysis also do not support PyPy (I’m aware there is a special version of NumPy, but matplotlib, SciPy, etc. are still far from being compatible).  Also, I’m not quite sure why pickles were noticeably slower with PyPy.
Comparing data formats. Sqlite is overall the most competitive option.  This is good; you can never really go wrong with a relational data format, so it should serve as a general basis. PyTables is also impressively fast (it’s pretty much the only option that beats raw sqlite, for this simple table of all-floats).  Finally, I was somewhat surprised that NumPy’s CSV I/O is that slow (yes, it has to deal with all the funky quotation, escapes, and formatting variations, and CSV text is not exactly a high-performance format, but still…).
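For a sense of why the HDF5-based options do so well here, this is roughly what the pytables, h5py, and csv-numpy loaders boil down to (file, node, and dataset names are made up): reading a homogeneous float table back is essentially one bulk copy, whereas the CSV path has to parse text value by value.

```python
import tables
import h5py
import numpy as np

# pytables: read the whole table into a numpy (structured) array
h5 = tables.open_file('data_pytables.h5', mode='r')
arr_pt = h5.root.data.read()
h5.close()

# h5py: read the whole dataset into memory
with h5py.File('data_h5py.h5', 'r') as f:
    arr_h5 = f['data'][:]

# csv-numpy: parse the text back into floats, value by value
arr_csv = np.loadtxt('data.csv', delimiter=',')
```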
For the time being, I’ll probably stick with sqlite, but get rid of the SqlAlchemy ORM bits that I’ve been using (or, perhaps, keep them only for small datasets). The nice thing is that, since the sqlite “core” itself appears reasonably efficient, I can keep my data files and perhaps just look for a better abstraction layer than DB-API. Eventually, however, I’d like to have something like the relationship feature of the ORM (but without all the other heavyweight machinery for sessions, syncing, etc.), so I can easily persist graph data, with arbitrary node and edge attributes (FWIW, I currently use NetworkX once the data is loaded; I know it’s slow, but it’s very Pythonic and convenient, and I rarely resort to iGraph or other libraries, at least so far—but that’s another story).