The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.” Although this is a fairly obvious statement, the article is well-written and worth a read. The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.
However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source for inspiration, as well as target of polemic by prominent database researchers.
In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes. In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS) but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it
- Provides data de-clustering and replication across many machines, and
- Allows computations to execute on local copies of the data.
Arguably, MapReduce is powerful both for the features it provides, as well as for the features it omits, in order to provide a clean and simple programming abstraction, which facilitates improved usability, efficiency and fault-tolerance.
Most of the fundamental ideas for distributed data processing are not new. For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma“—and he’s probably right. Reading the original Google papers, none make a claim to fundamental discoveries. Focusing on “academic novelty” (whatever that may mean) is irrelevant. Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point. The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision, on clusters of commodity hardware.
Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?
Data types: In database speak, these are known as “schemas.” Google’s protocol buffers the underlying API for data storage and exchange. This is also nothing radically new; in essence, it is a binary XML representation, somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct). In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc), it would be the lack of such a common data representation format. Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on). Interestingly, the rcc record compiler component (which seems to have fallen in disuse) was once called Jute with possibly plans grander than what came to be. So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.
Data access: In the beginning there was BigTable, which has been recently followed by HBase and Hypertable. It started fairly simple, as a “is a sparse, distributed, persistent multidimensional sorted map” to quote the original paper. It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1 was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far. While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. Turns out this has already been done (JIRA entry and wiki entries). I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).
Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiminig to be the equivalent of Make or Ant for distributed processing of huge datasets. Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy). As I have used neither Cascading nor Pig at the moment, I will reserve any further comparisons. It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.
Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic. I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.
A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more). All of them are, of course, open source.