“Beyond Relational Databases”

The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.”  Although this is a fairly obvious statement, the article is well-written and worth a read.  The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.

However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source for inspiration, as well as target of polemic by prominent database researchers.

In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes.  In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS) but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it

  • Provides data de-clustering and replication across many machines, and
  • Allows computations to execute on local copies of the data.

Arguably, MapReduce is powerful both for the features it provides, as well as for the features it omits, in order to provide a clean and simple programming abstraction, which facilitates improved usability, efficiency and fault-tolerance.

Most of the fundamental ideas for distributed data processing are not new.  For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma“—and he’s probably right.  Reading the original Google papers, none make a claim to fundamental discoveries.  Focusing on “academic novelty” (whatever that may mean) is irrelevant.  Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point.  The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision, on clusters of commodity hardware.

Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?

Data types: In database speak, these are known as “schemas.” Google’s protocol buffers the underlying API for data storage and exchange.  This is also nothing radically new; in essence, it is a binary XML representation,  somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct).  In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc), it would be the lack of such a common data representation format.  Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on).  Interestingly, the rcc record compiler component (which seems to have fallen in disuse) was once called Jute with possibly plans grander than what came to be.  So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.

Data access: In the beginning there was BigTable, which has been recently followed by HBase and Hypertable.  It started fairly simple, as a “is a sparse, distributed, persistent multidimensional sorted map” to quote the original paper.  It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1 was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far.  While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. Turns out this has already been done (JIRA entry and wiki entries).  I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).

Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiminig to be the equivalent of Make or Ant for distributed processing of huge datasets.  Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy).  As I have used neither Cascading nor Pig at the moment, I will reserve any further comparisons.  It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.

Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic.  I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.

A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more).  All of them are, of course, open source.


The Fall of CAPTCHAs – really?

I recently saw a Slashdot post dramatically titled “Fallout From the Fall of CAPTCHAs“, citing an equally dramatic article about “How CAPTCHA got trashed“.  Am I missing something? Ignoring their name for a moment, CAPTCHAs are computer programs, following specific rules, and therefore they are subject to the same cat-and-mouse games that all security mechanisms go through. Where exactly is the surprise? So Google’s or Yahoo’s current versions were cracked.  They’ll soon come up with new tricks, and still newer ones after those are cracked, and so on.

In fact, I was always confused about one aspect of CAPTCHAs. I thought that a Turing test is, by definition, administered by a human, so a “completely-automated Turing-test” is an oxymoron, something like a “liberal conservative”. An unbreakable authentication system based on Turing tests should rely fully on human computation: humans should also be at the end that generates the tests. Let humans come up with questions, using references to images, web site content, and whatever else they can think of.  Then match these to other humans who can gain access to a web service by solving the riddles. Perhaps the tests should also be somehow rated, lest the simple act of logging in turns into an absurd treasure hunt. I’m not exactly sure if and how this could be turned into an addictive game, but I’ll leave that to the experts.  The idea is too obvious to miss anyway.

CAPTCHAs, even in their current form, have led to numerous contributions.  A non-exclusive list, in no particular order:

  1. They have a catchy name. That counts a lot. Seriously. I’m not joking; if you don’t believe me, repeat out loud after me: “I have no idea what ‘onomatopoeia’ is—I’d better MSN-Live it” or “… I’d better Yahoo it.”  Doesn’t quite work, does it?
  2. They popularized an idea which, even if not entirely new, was made accesible to webmasters the world over, and is now used daily by thousands if not millions of people.  What greater measure of success can you think of for a technology?
  3. Sowed the seeds for Luis von Ahn’s viral talk on human computation, which has featured in countless universities, companies and conferences.  Although not professionally designed, the slides’ simplicity matches their content in a Jobs-esque way. As for delivery and timing, Steve might even learn something from this talk (although, in fairness, Steve Jobs probably doesn’t get the chance to introduce the same product hundreds of times).

So is anyone really surprised that the race for smarter tests and authentication mechanisms has not ended, and probably never will? (Incidentally, the lecture video above is from 2006, over three years after the first CAPTCHAs were succesfully broken by another computer program—see also CVPR 2003 paper—.  There are no silver bullets, no technology is perfect, but some are really useful. Perhaps CAPTCHAs are, to some extent, victim of their own hype which, however, is instrumental and perhaps even necessary for the wide adoption of any useful technology.  I’m pretty sure we’ll see more elaborate tests soon, not less.