{"id":32,"date":"2008-08-21T11:20:59","date_gmt":"2008-08-21T16:20:59","guid":{"rendered":"http:\/\/www.bitquill.net\/blog\/?p=32"},"modified":"2008-11-22T12:22:04","modified_gmt":"2008-11-22T17:22:04","slug":"beyond-relational-databases","status":"publish","type":"post","link":"https:\/\/bitquill.net\/blog\/beyond-relational-databases\/","title":{"rendered":"&#8220;Beyond Relational Databases&#8221;"},"content":{"rendered":"<p>The article &#8220;<a title=\"Beyond Relational Databases (CACM)\" href=\"http:\/\/doi.acm.org\/10.1145\/1364782.1364797\">Beyond Relational Databases<\/a>&#8221; by <a title=\"Margo Seltzer's homepage\" href=\"http:\/\/www.eecs.harvard.edu\/~margo\/\">Margo Seltzer<\/a> in the July 2008 issue of CACM claims that &#8220;there is more to data access than SQL.&#8221; Although this is a fairly obvious statement, the article is well-written and worth a read. The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author calls for solutions based on a modular approach, using open components.<\/p>\n<p><strong>However, the article offers no concrete examples at all, so I&#8217;ll venture a suggestion. <\/strong>In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, <a title=\"MapReduce: Simplified Processing on Large Clusters (OSDI 2004)\" href=\"http:\/\/labs.google.com\/papers\/mapreduce.html\">MapReduce<\/a> is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. 
It has attracted a lot of attention, proving to be both a source of <a title=\"Pig (Apache Incubator)\" href=\"http:\/\/incubator.apache.org\/pig\/\">inspiration<\/a> and a target of <a title=\"MapReduce: A Major Step Backwards\" href=\"http:\/\/www.databasecolumn.com\/2008\/01\/mapreduce-a-major-step-back.html\">polemic<\/a> by prominent database researchers.<\/p>\n<p>In database terminology, <strong>MapReduce is an execution engine, largely unconcerned with data models and storage schemes<\/strong>. In the simplest case, data reside on a distributed file system (e.g., <a title=\"The Google Filesystem (SOSP 2003)\" href=\"http:\/\/labs.google.com\/papers\/gfs.html\">GFS<\/a>, <a title=\"Hadoop Distributed Filesystem\" href=\"http:\/\/hadoop.apache.org\/core\/docs\/current\/hdfs_design.html\">HDFS<\/a>, or <a title=\"Kosmos Distributed Filesystem\" href=\"http:\/\/hadoop.apache.org\/core\/docs\/current\/hdfs_design.html\">KFS<\/a>), but nothing prevents pulling data from a large data store like <a title=\"BigTable (OSDI 2006)\" href=\"http:\/\/labs.google.com\/papers\/bigtable.html\">BigTable<\/a> (or <a title=\"HBase\" href=\"http:\/\/hadoop.apache.org\/hbase\/\">HBase<\/a>, or <a title=\"Hypertable\" href=\"http:\/\/www.hypertable.org\/\">Hypertable<\/a>), or any other storage engine, as long as it<\/p>\n<ul>\n<li>Provides data de-clustering and replication across many machines, and<\/li>\n<li>Allows computations to execute on local copies of the data.<\/li>\n<\/ul>\n<p>Arguably, <strong>MapReduce is powerful both for the features it provides and for the features it <em>omits<\/em><\/strong> in order to provide a clean and simple programming abstraction, which facilitates improved usability, <a title=\"Apache Hadoop Wins Terabyte Sort Benchmark\" href=\"http:\/\/developer.yahoo.com\/blogs\/hadoop\/2008\/07\/apache_hadoop_wins_terabyte_sort_benchmark.html\">efficiency<\/a>, and fault tolerance.<\/p>\n<p>Most of the fundamental ideas for 
distributed data processing are not new. For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that &#8220;people think there is something new in all this; there isn&#8217;t, it&#8217;s all <a title=\"The Gamma Database Machine Project (IEEE TKDE)\" href=\"http:\/\/dx.doi.org\/10.1109\/69.50905\">Gamma<\/a>,&#8221; and he&#8217;s probably right. None of the <a title=\"The Google Filesystem (SOSP 2003)\" href=\"http:\/\/labs.google.com\/papers\/gfs.html\">original<\/a> <a title=\"MapReduce: Simplified Processing on Large Clusters (OSDI 2004)\" href=\"http:\/\/labs.google.com\/papers\/mapreduce.html\">Google<\/a> <a title=\"BigTable (OSDI 2006)\" href=\"http:\/\/labs.google.com\/papers\/bigtable.html\">papers<\/a> makes a claim to fundamental discoveries. Focusing on &#8220;academic novelty&#8221; (whatever that may mean) is irrelevant. Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted <a title=\"MapReduce: A Major Step Backwards\" href=\"http:\/\/www.databasecolumn.com\/2008\/01\/mapreduce-a-major-step-back.html\">blog post<\/a> and <a title=\"MapReduce II\" href=\"http:\/\/www.databasecolumn.com\/2008\/01\/mapreduce-continued.html\">its follow-up<\/a> miss the point. <strong>The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer&#8217;s vision<\/strong> on clusters of commodity hardware.<\/p>\n<p>Michael Stonebraker and David DeWitt do have a valid point: we should <em>not<\/em> fixate on MapReduce; greater things are happening. 
<strong>So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?<\/strong><\/p>\n<p><strong>Data types:<\/strong> In database speak, these are known as &#8220;schemas.&#8221; Google&#8217;s <a title=\"Protobuf (Google Code)\" href=\"http:\/\/code.google.com\/p\/protobuf\/\">protocol buffers<\/a> are the underlying API for data storage and exchange. This is also nothing radically new; in essence, it is a <a title=\"XML Binary Characterization (W3C)\" href=\"http:\/\/www.w3.org\/XML\/Binary\/\">binary XML<\/a> representation, somewhere between the simple <a title=\"An Evaluation of Binary XML Encoding Optimizations for Fast Stream Based XML Processing (ACM DL)\" href=\"http:\/\/doi.acm.org\/10.1145\/988672.988719\">XTalk<\/a> protocol which underpins <a title=\"Vinci (PDF)\" href=\"http:\/\/www.bitquill.net\/pdf\/comnet02_vinci.pdf\">Vinci<\/a> and the <a title=\"WAP Binary XML Content Format (W3C)\" href=\"http:\/\/www.w3.org\/TR\/wbxml\/\">WBXML<\/a> tokenized representation (both slightly predating protocol buffers and both now largely defunct). In fact, if I had to name a major weakness in the open source versions of Google&#8217;s infrastructure (Hadoop, HBase, etc.), it would be the lack of such a common data representation format. Hadoop has <tt>Writable<\/tt>, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to duplication of effort in many projects that rely on Hadoop (such as <a title=\"Lucene Nutch (Apache)\" href=\"http:\/\/lucene.apache.org\/nutch\/\">Nutch<\/a>, Pig, Cascading, and so on). Interestingly, the <tt>rcc<\/tt> record compiler component (which seems to have fallen into disuse) was once called <a title=\"JIRA on rcc naming\" href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-1069\">Jute<\/a>, with 
<em>possibly<\/em> grander plans than what came to be. So, I was pleasantly surprised when Google <a title=\"Protocol Buffers (Google Open Source Blog)\" href=\"http:\/\/google-opensource.blogspot.com\/2008\/07\/protocol-buffers-googles-data.html\">decided to open-source protocol buffers<\/a> a few days ago, although it may now turn out to be too little, too late.<\/p>\n<p><strong>Data access:<\/strong> In the beginning there was BigTable, which has recently been followed by HBase and Hypertable. BigTable started out fairly simple, as a &#8220;sparse, distributed, persistent multidimensional sorted map,&#8221; to quote the original paper. It is now part of the <a title=\"Google App Engine (Google Code)\" href=\"http:\/\/code.google.com\/appengine\/\">Google App Engine<\/a> and even has support for general <a title=\"Google App Engine - Datastore API - Transactions\" href=\"http:\/\/code.google.com\/appengine\/docs\/datastore\/transactions.html\">transactions<\/a>. HBase, at least as of version 0.1, was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team&#8217;s excellent track record so far. While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. 
Turns out this has already been done (<a title=\"Replace HQL with an HBase-friendly jirb or jython shell (JIRA)\" href=\"https:\/\/issues.apache.org\/jira\/browse\/HBASE-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel\">JIRA entry<\/a> and <a href=\"http:\/\/wiki.apache.org\/hadoop\/Hbase\/HbaseShell\">wiki<\/a> <a href=\"http:\/\/wiki.apache.org\/hadoop\/Hbase\/Shell\/Replacement\">entries<\/a>). I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don&#8217;t have their place).<\/p>\n<p><strong>Job and schema management:<\/strong> <a title=\"Pig (Yahoo! Research)\" href=\"http:\/\/research.yahoo.com\/node\/90\">Pig<\/a>, from the database community, is described as a <a title=\"Automatic Optimization of Parallel Dataflow Programs (USENIX 2008)\" href=\"http:\/\/www.cs.cmu.edu\/~olston\/publications\/usenix08.pdf\">parallel dataflow engine<\/a> and employs yet another special-purpose language which <a title=\"Pig-Latin: A Not-So-Foreign Language for Data Processing (SIGMOD 2008, industrial track)\" href=\"http:\/\/www.cs.cmu.edu\/~olston\/publications\/sigmod08.pdf\">tries to look a little like SQL<\/a> (but it is no secret that <a title=\"Chris Olston's blog comment on imperative vs. declarative approach\" href=\"http:\/\/www.databasecolumn.com\/2008\/01\/mapreduce-continued.html#comment-849\">it isn&#8217;t<\/a>). <a title=\"Cascading\" href=\"http:\/\/www.cascading.org\/\">Cascading<\/a> has received no attention in the research community, but it merits a closer look. 
It is based on a &#8220;build system&#8221; metaphor, aiming to be the equivalent of Make or Ant for distributed processing of huge datasets. Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (currently, Groovy). As I have so far used neither Cascading nor Pig, I will hold off on any further comparisons. It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, <a title=\"Interpreting the Data: Parallel Analysis with Sawzall\" href=\"http:\/\/labs.google.com\/papers\/sawzall.html\">Sawzall<\/a> deserves an honorable mention, but I won&#8217;t discuss it further as it is a closed technology.<\/p>\n<p><strong>Indexing:<\/strong> Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic. I suspect that IR-style indices, such as <a title=\"Lucene Java (Apache)\" href=\"http:\/\/lucene.apache.org\/java\/\">Lucene<\/a>, have much to offer (something that <a title=\"Build a Lucene index on top of an HBase table (JIRA)\" href=\"https:\/\/issues.apache.org\/jira\/browse\/HBASE-270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528212\">has not gone unnoticed<\/a>); more on this in another post.<\/p>\n<p>A number of other projects are also worth keeping an eye on, such as <a title=\"CouchDB (Apache Incubator)\" href=\"http:\/\/incubator.apache.org\/couchdb\/\">CouchDB<\/a>, Amazon&#8217;s <a title=\"Amazon S3\" href=\"http:\/\/aws.amazon.com\/s3\">S3<\/a>, Facebook&#8217;s <a title=\"Hive as a contrib module (JIRA)\" href=\"https:\/\/issues.apache.org\/jira\/browse\/HADOOP-3601\">Hive<\/a>, and <a title=\"JAQL homepage\" href=\"http:\/\/www.jaql.org\/\">JAQL<\/a> (and I&#8217;m sure I&#8217;m missing many more). All of them are, of 
course, open source.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The article &#8220;Beyond Relational Databases&#8221; by Margo Seltzer in the July 2008 issue of CACM claims that &#8220;there is more to data access than SQL.&#8221; Although this is a fairly obvious statement, the article is well-written and worth a read. The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[45],"tags":[29,50,49,48,7,30,31,28,58],"class_list":["post-32","post","type-post","status-publish","format-standard","hentry","category-scitech","tag-cloud-computing","tag-commentary","tag-computer-science","tag-data-management","tag-development","tag-distributed","tag-hadoop","tag-mapreduce","tag-opinion"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p7x9xm-w","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/posts\/32","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/post
s"}],"about":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/comments?post=32"}],"version-history":[{"count":0,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/posts\/32\/revisions"}],"wp:attachment":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/media?parent=32"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/categories?post=32"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/tags?post=32"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}