Data harvesting with MapReduce

Combine harvesters
(original image source)

“The combine harvester, […] is a machine that combines the tasks of harvesting, threshing and cleaning grain crops.” If you have acres upon acres of wheat and want to separate the grain from the chaff, a group of combines is what you really want. If you have a bonsai tree and want to trim it, a harvester may be less than ideal.

MapReduce is like a pack of harvesters, well-suited for weeding through a huge volumes of data, residing on a distributed storage system. However, a lot of machine learning work is more akin to trimming bonsai into elaborate patterns. Vice versa, it’s not uncommon to see trimmers used to harvest a wheat field. Well-established and respected researchers, as recently as this year write in their paper “Planetary Scale Views on a Large Instant-messaging Network“:

We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day.

Doing the math, that’s five full days of processing to parse and compress the data on a beast of a machine. Even more surprisingly, I found this exact quote singled out among all the interesting results in the paper! Let me make clear that I’m not criticizing the study; in fact, both the dataset and the exploratory analysis are interesting in many ways. However, it is somewhat surprising that, at least among the research community, such a statement is still treated more like a badge of honor rather than an admission of masochism.

The authors should be applauded for their effort. Me, I’m an impatient sod. Wait one day for the results, I think I can do that. Two days, what the heck. But five? For an exploratory statistical analysis? I’d be long gone before that. And what if I found a serious bug half way down the road? That’s after more than two days of waiting, in case you weren’t counting. Or what if I decided I needed a minor modification to extract some other statistic? Wait another five days? Call me a Matlab-spoiled brat, but forget what I said just now about waiting one day. I changed my mind already. A few hours, tops. But we need a lot more studies like this. Consequently, we need the tools to facilitate them.

Read the rest of this entry »

Comments (1)

“Beyond Relational Databases”

The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.”  Although this is a fairly obvious statement, the article is well-written and worth a read.  The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.

However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source for inspiration, as well as target of polemic by prominent database researchers.

In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes.  In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS) but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it

  • Provides data de-clustering and replication across many machines, and
  • Allows computations to execute on local copies of the data.

Arguably, MapReduce is powerful both for the features it provides, as well as for the features it omits, in order to provide a clean and simple programming abstraction, which facilitates improved usability, efficiency and fault-tolerance.

Most of the fundamental ideas for distributed data processing are not new.  For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma“—and he’s probably right.  Reading the original Google papers, none make a claim to fundamental discoveries.  Focusing on “academic novelty” (whatever that may mean) is irrelevant.  Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point.  The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision, on clusters of commodity hardware.

Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?

Data types: In database speak, these are known as “schemas.” Google’s protocol buffers the underlying API for data storage and exchange.  This is also nothing radically new; in essence, it is a binary XML representation,  somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct).  In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc), it would be the lack of such a common data representation format.  Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on).  Interestingly, the rcc record compiler component (which seems to have fallen in disuse) was once called Jute with possibly plans grander than what came to be.  So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.

Data access: In the beginning there was BigTable, which has been recently followed by HBase and Hypertable.  It started fairly simple, as a “is a sparse, distributed, persistent multidimensional sorted map” to quote the original paper.  It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1 was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far.  While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. Turns out this has already been done (JIRA entry and wiki entries).  I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).

Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiminig to be the equivalent of Make or Ant for distributed processing of huge datasets.  Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy).  As I have used neither Cascading nor Pig at the moment, I will reserve any further comparisons.  It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.

Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic.  I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.

A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more).  All of them are, of course, open source.

Comments