Bad planning

Some visions do not really translate into plans.  Becoming rich, established or happy, for example.  It’s like saying “I want to be a superhero!” How do you go about that?

How to become Spiderman

Nope.  Not much of a plan.


“Beyond Relational Databases”

The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.”  Although this is a fairly obvious statement, the article is well-written and worth a read.  The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.

However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as the predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source of inspiration and a target of polemic for prominent database researchers.

In database terminology, MapReduce is an execution engine, largely unconcerned with data models and storage schemes.  In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS), but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it

  • Provides data de-clustering and replication across many machines, and
  • Allows computations to execute on local copies of the data.

Arguably, MapReduce is powerful both for the features it provides and for those it omits in order to keep its programming abstraction clean and simple, which facilitates usability, efficiency, and fault tolerance.
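To make the abstraction concrete, here is a toy word count (the “hello world” of MapReduce) in Python. This is a single-process sketch: in a real deployment, the framework runs the map and reduce phases on many machines and handles the partitioning, sorting, and fault tolerance between them.

    # Toy word count in the MapReduce style, simulated in one process.
    # A real framework runs many mappers and reducers in parallel and
    # handles partitioning, sorting, and recovery between the phases.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(lines):
        # Emit a (word, 1) pair for every word in the input.
        for line in lines:
            for word in line.split():
                yield (word.lower(), 1)

    def reduce_phase(pairs):
        # Group pairs by word and sum the counts; input must be sorted.
        for word, group in groupby(sorted(pairs), key=itemgetter(0)):
            yield (word, sum(count for _, count in group))

    if __name__ == "__main__":
        lines = ["the quick brown fox", "the lazy dog jumps", "the fox"]
        for word, count in reduce_phase(map_phase(lines)):
            print(word, count)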

Most of the fundamental ideas for distributed data processing are not new.  For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma”—and he’s probably right.  Reading the original Google papers, one finds no claim of fundamental discoveries.  Focusing on “academic novelty” (whatever that may mean) is irrelevant.  Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point.  The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision on clusters of commodity hardware.

Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?

Data types: In database speak, these are known as “schemas.” Google’s protocol buffers are the underlying API for data storage and exchange.  This is also nothing radically new; in essence, it is a binary XML representation, somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct).  In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc.), it would be the lack of such a common data representation format.  Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on).  Interestingly, the rcc record compiler component (which seems to have fallen into disuse) was once called Jute, possibly with grander plans than what came to be.  So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.
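To see just how low-level the Writable contract is, here is a rough Python analogue (an illustrative sketch; the real interface is Java, with write(DataOutput) and readFields(DataInput)). Note what never appears on the wire: no schema, no field names, no type tags; the reader must know the exact byte layout in advance.

    # A rough Python analogue of Hadoop's Writable contract (illustrative
    # sketch; the real interface is Java). Everything below the object is
    # raw bytes: no schema, no field names, no versioning.
    import io
    import struct

    class IntPairWritable:
        def __init__(self, first=0, second=0):
            self.first, self.second = first, second

        def write(self, out):
            # Two big-endian 32-bit ints; nothing else goes on the wire.
            out.write(struct.pack(">ii", self.first, self.second))

        def read_fields(self, inp):
            self.first, self.second = struct.unpack(">ii", inp.read(8))

    buf = io.BytesIO()
    IntPairWritable(3, 7).write(buf)
    buf.seek(0)
    pair = IntPairWritable()
    pair.read_fields(buf)
    assert (pair.first, pair.second) == (3, 7)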

Data access: In the beginning there was BigTable, which has recently been followed by HBase and Hypertable.  It started fairly simply, as “a sparse, distributed, persistent multidimensional sorted map,” to quote the original paper.  It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1, was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far.  While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. It turns out this has already been done (JIRA entry and wiki entries).  I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).
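Incidentally, the “sorted map” definition above is compact enough to sketch in a few lines. Here is a toy, single-machine rendering of just the data model, a map from (row, column, timestamp) to value; it deliberately omits everything that makes the real systems interesting (distribution, persistence, compactions).

    # Toy rendering of the BigTable data model: a sparse, sorted map from
    # (row, column, timestamp) to value. The hard parts (distribution,
    # persistence, compaction) are omitted on purpose.
    import time

    class ToyTable:
        def __init__(self):
            self.cells = {}  # (row, column) -> {timestamp: value}

        def put(self, row, column, value, ts=None):
            ts = time.time() if ts is None else ts
            self.cells.setdefault((row, column), {})[ts] = value

        def get(self, row, column):
            # Return the most recent version of a cell, or None.
            versions = self.cells.get((row, column))
            return versions[max(versions)] if versions else None

        def scan(self, row_prefix=""):
            # Iterate cells in sorted (row, column) order, like a real scan.
            for (row, column) in sorted(self.cells):
                if row.startswith(row_prefix):
                    yield row, column, self.get(row, column)

    t = ToyTable()
    t.put("com.example/index", "anchor:home", "Example")
    t.put("com.example/index", "contents:", "<html>...</html>")
    for row, column, value in t.scan("com.example"):
        print(row, column, value)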

Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language, which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiming to be the equivalent of Make or Ant for distributed processing of huge datasets.  Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy).  Since I have used neither Cascading nor Pig yet, I will reserve any further comparisons.  It is worth noting that both projects build upon Hadoop core and do not, at the moment, integrate with other components, such as HBase.  Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.
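To convey the flavor of the dataflow metaphor, though, here is a made-up miniature of such an API in Python. This is purely illustrative and not Cascading’s actual interface (which is Java and speaks of taps, pipes, and flows).

    # A made-up Python miniature of the dataflow/pipe metaphor (purely
    # illustrative; Cascading's real API is Java: taps, pipes, flows).
    class Pipe:
        def __init__(self, steps=None):
            self.steps = steps or []

        def each(self, fn):
            # Apply fn to every record flowing through the pipe.
            return Pipe(self.steps + [lambda recs: map(fn, recs)])

        def filter(self, pred):
            # Keep only records matching the predicate.
            return Pipe(self.steps + [lambda recs: (r for r in recs if pred(r))])

        def run(self, source):
            records = iter(source)
            for step in self.steps:
                records = step(records)
            return list(records)

    flow = Pipe().each(str.strip).filter(lambda line: not line.startswith("#"))
    print(flow.run(["  a  ", "# comment", "b"]))  # ['a', 'b']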

Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic.  I suspect that IR-style indices, such as Lucene’s, have much to offer (something that has not gone unnoticed)—more on this in another post.

A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more).  Most of them are, of course, open source.


A dog’s life

I recently returned from two weeks in Greece, which included four days in Santorini.  Even the dogs sunbathe, enjoying the scenery.  For the most part, I gladly followed their example.

Dog’s life in Santorini

Upon coming back to New York, somewhat to my surprise, it was the US that felt comparatively dinky.  A few years ago, it used to be the other way around.  In particular, the contrast between the new development around the 2004 Olympics and New York’s crumbling infrastructure was quite noticeable this time.


The Fall of CAPTCHAs – really?

I recently saw a Slashdot post dramatically titled “Fallout From the Fall of CAPTCHAs“, citing an equally dramatic article about “How CAPTCHA got trashed“.  Am I missing something? Ignoring their name for a moment, CAPTCHAs are computer programs, following specific rules, and therefore they are subject to the same cat-and-mouse games that all security mechanisms go through. Where exactly is the surprise? So Google’s or Yahoo’s current versions were cracked.  They’ll soon come up with new tricks, and still newer ones after those are cracked, and so on.

In fact, I was always confused about one aspect of CAPTCHAs. I thought that a Turing test is, by definition, administered by a human, so a “completely automated Turing test” is an oxymoron, something like a “liberal conservative.” An unbreakable authentication system based on Turing tests should rely fully on human computation: humans should also be at the end that generates the tests. Let humans come up with questions, using references to images, web site content, and whatever else they can think of.  Then match these to other humans, who can gain access to a web service by solving the riddles. Perhaps the tests should also be somehow rated, lest the simple act of logging in turn into an absurd treasure hunt. I’m not exactly sure if and how this could be turned into an addictive game, but I’ll leave that to the experts.  The idea is too obvious to miss anyway.
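For what it’s worth, here is a toy sketch of such a riddle exchange in Python (an entirely hypothetical design, names and all): humans contribute riddles, other humans solve one to gain access, and simple ratings weed out riddles that nobody can solve.

    # Toy sketch of a human-powered challenge exchange (hypothetical
    # design): humans contribute riddles; other humans answer one to
    # gain access; ratings demote riddles that turn out to be unsolvable.
    import random

    class RiddlePool:
        def __init__(self):
            self.riddles = []  # each: {"q": ..., "a": ..., "score": float}

        def contribute(self, question, answer):
            self.riddles.append({"q": question, "a": answer, "score": 1.0})

        def challenge(self):
            # Prefer riddles that real humans have been able to solve.
            weights = [r["score"] for r in self.riddles]
            return random.choices(self.riddles, weights=weights, k=1)[0]

        def attempt(self, riddle, answer):
            solved = answer.strip().lower() == riddle["a"].strip().lower()
            # Solvable riddles gain weight; unsolvable ones fade away.
            riddle["score"] *= 1.1 if solved else 0.9
            return solved

    pool = RiddlePool()
    pool.contribute("What color is the sky in my profile photo?", "blue")
    r = pool.challenge()
    print(pool.attempt(r, "Blue"))  # True -> grant access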

CAPTCHAs, even in their current form, have led to numerous contributions.  A non-exhaustive list, in no particular order:

  1. They have a catchy name. That counts for a lot. Seriously. I’m not joking; if you don’t believe me, repeat out loud after me: “I have no idea what ‘onomatopoeia’ is—I’d better MSN-Live it” or “… I’d better Yahoo it.”  Doesn’t quite work, does it?
  2. They popularized an idea which, even if not entirely new, was made accessible to webmasters the world over, and is now used daily by thousands if not millions of people.  What greater measure of success can you think of for a technology?
  3. They sowed the seeds for Luis von Ahn’s viral talk on human computation, which has been featured at countless universities, companies, and conferences.  Although the slides are not professionally designed, their simplicity matches the content in a Jobs-esque way. As for delivery and timing, Steve might even learn something from this talk (although, in fairness, Steve Jobs probably doesn’t get the chance to introduce the same product hundreds of times).

So is anyone really surprised that the race for smarter tests and authentication mechanisms has not ended, and probably never will? (Incidentally, the lecture video above is from 2006, over three years after the first CAPTCHAs were successfully broken by another computer program—see also the CVPR 2003 paper.)  There are no silver bullets and no technology is perfect, but some are really useful. Perhaps CAPTCHAs are, to some extent, victims of their own hype which, however, is instrumental and perhaps even necessary for the wide adoption of any useful technology.  I’m pretty sure we’ll see more elaborate tests soon, not fewer.


Life with three cats

I admit it. For the past year I’ve been running a zoo.  Let me introduce the protagonists, in alphabetical order.

Cats: Portraits

Two of them grew up together (Aki and Kiki), but all three have lived together at different times in the past.  For the most part, they get along just fine. Aki and Alan get into a tussle from time to time, both being a little insecure in their own way.  Kiki doesn’t care much—after all, whenever there is a bad thunderstorm, he’s the one that climbs onto the back of the couch and stares intently at the lightning and thunder, while Alan tries to bury himself behind the toilet seat (as for Aki, I still haven’t managed to figure out exactly what he does at moments like this; he simply seems to disappear).

Anyway, it’s a full house and the cats run it. They have all picked their territories by now, and they have trained me to stick to mine. A queen size bed is too small sometimes.

Cats: Pajama party

The couch can get crowded too.  They haven’t gotten hold of the remote yet, but one of these days they probably will.

Cats: Couch potatoes

Of course, it’s not always like that; we sometimes have more “intimate” moments. Things can get quite hairy (literally: the bed, the clothes, my face—all things).

Cats: Alan intimate

A year ago I used to bemoan our crowded co-existence; now I think the apartment would be too big without them. If you’ve never had a cat and wonder what it might be like, I’ll let George Carlin describe it in his unique way.

I’ve experienced all these (except the outdoor activities, since our cats don’t get the chance), but not all from the same cat. Aki is an obsessive self-petter. Alan is on a bad drug too often (but at least sometimes I manage to convince him to do his thing on my shoulders: a pretty good massage). Kiki is the proudest (and most active) of the three, and his ass button is impressive.  I wonder how many cats George Carlin had.  Must have been too many—really, who would have expected this from him?


Web science: what and how?

From the article “Web Science: An Interdisciplinary Approach to Understanding the Web” in the July issue of CACM (which, by the way, looks quite impressive after the editorial overhaul!):

At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. […] The macro system, that is, the use of the micro system by many users interacting with one another in often-unpredicted ways, is far more interesting in and of itself and generally must be analyzed in ways that are different from the micro system. […] The essence of our understanding of what succeeds on the Web and how to develop better Web applications is that we must create new ways to understand how to design systems to produce the effect we want.  The best we can do today is design and build in the micro, hoping for the best, but how do we know if we’ve built in the right functionality to ensure the desired macroscale effects? How do we predict other side effects and the emergent properties of the macro? […] Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology and economics.

This is a noble goal indeed. The Wikipedia article on sociology sounds quite similar in many respects:

Sociologists research macro-structures and processes that organize or affect society […] And, they research micro-processes […] Sociologists often use  quantitative methods—such as social statistics or network analysis—to investigate the structure of a social process or describe patterns in social relationships. Sociologists also often use qualitative methods—such as focused interviews, group discussions and ethnographic methods—to investigate social processes.

First, we have to keep in mind that the current Western notion of “science” is fairly recent.  Furthermore, it has not always been the case that technology follows science. As an example, in the book “A People’s History of Science” by Clifford Conner, one can find the following quotation from Galileo’s Two New Sciences, about Venice’s weapons factory (the Arsenal):

Indeed, I myself, being curious by nature, frequently visit this place for the mere pleasure of observing the work of those who, on account of their superiority over other artisans, we call “first rank men.” Conference with them has often helped me in the investigation of certain effects, including not only those which are striking, but also those which are recondite and almost incredible.

Later on, Conner says (p. 284), again quoting Galileo from the same source:

[Galileo] demonstrated mathematically that “if projectiles are fired … all having the same speed, but each having a different elevation, the maximum range … will be obtained when the elevation is 45°: the other shots, fired at angles greater or less, will have a shorter range.” But in recounting how he arrived at that conclusion, he revealed that his initial inspiration came from discussions at the Arsenal: “From accounts given by gunners, I was already aware of the fact that in the use of cannons and mortars, the maximum range, that is the one in which the shot goes the farthest, is obtained when the elevation is 45°.” Although Galileo’s mathematical analysis of the problem was a valuable original contribution, it did not tell workers at the Arsenal anything they had not previously learned by empirical tests, and had little effect on the practical art of gunnery.

In any case, facilitating “technology” or “engineering” is certainly not the only good reason to pursue scientific knowledge. Conversely, although “pure science” certainly has an important role, it is not the only ingredient of technological progress (something I’ve alluded to in a previous post about, essentially, the venture capital approach to research).  Furthermore, some partly misguided opinions about the future of science have brightly shot through the journalistic sphere.

However, if, for whatever reason, we decide to go the way of science (a worthy pursuit), then I am reminded of the following interview of Richard Feynman by the BBC in 1981 (full programme):

Privacy concerns notwithstanding, the web gives us unprecedented opportunities to collect measurements in quantities and levels of detail that simply were not possible when the venerable state of the art involved, e.g., passing around written notes among a few people. So, perhaps we can now check hypotheses more rigorously and eventually formulate universal laws (in the sense of physics).  Perhaps the web will allow us to prove Feynman wrong.

I’m not entirely convinced that it is possible to get quantitative causal models (a.k.a. laws) of this sort. But if it is, then we need an appropriate experimental apparatus for large-scale data analysis to test hypotheses—what would be, say, the LHC-equivalent for web science?  (Pure science, after all, seems to have an increasing need for powerful apparatuses.) I’ll write some initial thoughts and early observations on this in another post.

I’m pretty sure that my recent posts have been circling around something, but I’m not quite sure yet what that is.  In any case, all this seems an interesting direction worth pursuing.  Even though Feynman was sometimes a very hard critic, we should perhaps remember his words along the way.


The blessing of dimensionality

The cover story in Wired by Chris Anderson, “The End of Theory,” relies on a tacit assumption, which may be obvious but is still worth stating. The reason that such a “petabyte approach” works is that reality occupies only a tiny fraction of the space of all possibilities.  For example, the human genome consists of about three billion base pairs.  However, not every three-billion-length string over four symbols corresponds to a viable organism, much less an existing one or a human individual.  In other words, the intrinsic dimensionality of your sample (the human population) is much smaller than the raw dimensionality of the space of possibilities (about 4^3,000,000,000 strings).
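A quick numerical illustration of the point, on synthetic data: embed a single latent degree of freedom in a 100-dimensional space, and the spectrum of the data immediately betrays that the intrinsic dimensionality is far below the ambient one.

    # Sketch: intrinsic vs. ambient dimensionality on synthetic data.
    # 1000 points in 100 dimensions, but generated from a single latent
    # parameter; the singular values reveal the low intrinsic dimension.
    import numpy as np

    rng = np.random.default_rng(0)
    t = rng.uniform(-1, 1, size=(1000, 1))        # one latent degree of freedom
    direction = rng.normal(size=(1, 100))         # embedded in 100-D
    data = t @ direction + 0.01 * rng.normal(size=(1000, 100))  # tiny noise

    singular_values = np.linalg.svd(data - data.mean(0), compute_uv=False)
    print(singular_values[:3])  # one dominant value, the rest near zero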

I won’t try to justify “traditional” models. But I also wouldn’t go so far as to say that models will disappear, just that many will be increasingly statistical in nature. If you can throw the dice a large enough number of times, it doesn’t matter whether “God” plays them or not.  The famous quote by Einstein suggests that quantum mechanics was originally seen as a cop-out by some: we can’t find the underlying “truth,” so we settle for probability distributions for position and momentum.  However, this was only the beginning.

Still, we need models.  Going back to the DNA example, I suspect that few people model the genome as a single, huge, billion-length string.  That is not a very useful random variable.  Chopping it up into pieces with different functional significance and coming up with the appropriate random variables, so one can draw statistical inferences, sounds very much like modeling to me.

Furthermore, hypothesis testing and confidence intervals won’t go away either.  After all, anyone who has taken a course in experimental physics knows that repeating a measurement and calculating confidence intervals based on multiple data points is a fundamental part of the process (and also the main motivating force in the original development of statistics).  Now we can collect petabytes of data points.  Maybe there is a shift in balance between theory (in the traditional, Laplacian sense, which I suspect is what the article really refers to) and experiment.  But the fundamental principles remain much the same.
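The recipe is short enough to state as a sketch with simulated measurements: repeat the measurement, average, and report the mean with a confidence interval. More data narrows the interval; it does not change the logic.

    # Sketch: repeated measurements of the same quantity, summarized by a
    # mean and an (approximate, normal-theory) 95% confidence interval.
    import numpy as np

    rng = np.random.default_rng(1)
    measurements = 9.81 + 0.05 * rng.normal(size=100)  # noisy readings of g

    mean = measurements.mean()
    stderr = measurements.std(ddof=1) / np.sqrt(len(measurements))
    print(f"{mean:.3f} +/- {1.96 * stderr:.3f}")  # interval shrinks as n grows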

So, perhaps more is not fundamentally different after all, and we still need to be careful not to overfit.  I’ll leave you with a quote from “A Random Walk down Wall Street” by Burt Malkiel (emphasis mine):

[…] it’s sometimes possible to correlate two completely unrelated events.  Indeed, Mark Hulbert reports that stock-market researcher David Leinweber found that the indicator most closely correlated with the S&P 500 Index is the volume of butter production in Bangladesh.
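That sort of coincidence is easy to manufacture. In the sketch below, none of the candidate series has any relation to the target, yet searching ten thousand of them turns up at least one that tracks it closely.

    # Sketch: spurious correlation by search. None of these series has any
    # relation to the target, yet the best of 10,000 correlates strongly.
    import numpy as np

    rng = np.random.default_rng(2)
    target = rng.normal(size=250).cumsum()            # a random "index"
    candidates = rng.normal(size=(10000, 250)).cumsum(axis=1)

    correlations = [np.corrcoef(target, c)[0, 1] for c in candidates]
    print(f"best |correlation| = {max(map(abs, correlations)):.2f}")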

Dimensionality may be a blessing, but it can still be a curse sometimes.


Research and new media: the academic clowd

I have a little secret: Slashdot may have lost its lustre now, but back in 2001, shortly after returning from my refreshing internship at Almaden, I posted a question to “Ask Slashdot” for the first and last time. I posed the question rather poorly and was ignored. Although I could not find exactly what I wrote back then, it was something along the lines of “why aren’t academic venues more like SourceForge?” You have to remember that this was the early 2000s, when large and transparent user communities existed only in the technical sphere, and things like SourceForge were the prototypical sites for focused online communities. So why couldn’t academia and the research community open things up a bit more, and leverage new media to set up virtual forums for lively worldwide discussions and collaborations?

Fast-forward seven years. I got a feeling of déjà vu when I saw two recent blog posts and a Slashdot post. The first two question specific aspects of current publishing practices, while the “Ask Slashdot” post wonders whether academic journals are obsolete. The technologies and media have changed dramatically since then, but the essence remains the same.

Going over the comments on Slashdot, even though there are some surprisingly (for Slashdot) insightful ones, there is also one fundamental misconception, and I was genuinely surprised at its prevalence. Many commenters seem to identify the general notion of “peer evaluation” with the specific mechanisms currently employed to implement it. Is the current way of doing things so deeply entrenched that people are blind to other possibilities?

Quoting a random vicious comment: “The purpose of restricting published work to that which has passed peer review is to ensure that results do not become obsolete. They must uphold the same quality standards that we expect from all scientific disciplines—not blog-style fads that have become popular and at some stage will cease to be popular.” I wonder if the commenter has ever written a blog himself, or whether he has even taken a look at, say, Technorati: there are over four million blogs out there and 99% have just one reader (the author). Very few blogs are popular (i.e., actually read by a significant number of people). An explosion in the quantity of published content does not imply a proportional explosion in its consumption; quite the contrary. If anything, there is more competition for attention, not less.

Another commenter said that “there isn’t any direct communication between reviewers and submitters.” Not so. Take a look at Julian Besag’s “On the Statistical Analysis of Dirty Pictures” (unfortunately JSTOR is restricted-access, but maybe your institution has a subscription), published in the Journal of the Royal Statistical Society as recently as 1986. The actual paper is 21 pages, while another 23 pages are devoted to an open discussion. This looks oddly familiar (déjà vu again): it resembles very popular blogs, which often have comment sections longer than the original posts. A free and open discussion of ideas has always been an organic part of the research process. A few centuries ago, scientific articles appeared with the date on which they were “read” to the community (just take a look at, e.g., an issue of the Philosophical Transactions of the Royal Society).

Research on the web

Reaching far out into the long tail of ideas, which I also discussed in a previous post, should arguably be a top priority for research. In other endeavors it is an important means to success (financial or otherwise), but in academia and the research community it is usually an end in itself. The web itself was originally conceived as a venue for the exchange of scientific ideas, but even its creators probably did not envision the full potential nor realize all the implications of democratizing publication.

Modern technology allows more researchers (whether they work for startups, academic institutions, or large corporations) to try out more ideas. In other words, the production of research output is scaling up to unprecedented levels. However, I strongly suspect that traditional ways for evaluating research will not scale for much longer, being unable to keep up with the explosive growth in the rate of new ideas.

The typical process for evaluating and disseminating research—at least in computer science, with which I am familiar—seems to be the following (with perhaps a few exceptions). First you come up with an interesting idea. Next, you build a story around it and do the minimal work needed to support that story. If everything works out, you write it up and submit it to a conference or, more rarely, a journal. On average, three people (chosen largely at random) review your work, making some comments in private. Once your work is published, you move on to the next paper.

I would simply name two artifacts as the main “products” of computer science research: papers and software. The latter is often overlooked, but it is at least as important as the former. Anyway, what might be the state-of-the-art media for each of these artifacts?

There are some well-known efforts to use the web for the former. For example, there is arXiv for physics and other sciences, CoRR for computer science, and PLOS for the life sciences. There is also VideoLectures for open access to some talks. All of these, however, largely mirror the established ways of doing things: they are still built around the paradigm of a “library.” Although very important steps in the right direction, they perhaps play second fiddle to traditional media (there is a reason that arXiv is called a “pre-print server”) and thus fail to fully realize the potential offered by the rapidly emerging social media.

Things are perhaps a little more advanced for software artifacts. There are SourceForge, Google Code, and countless other similar sites for hosting source code, tracking issues, and holding online discussions. There are also Freshmeat, Ohloh, and other project directories, as well as source code search engines such as Koders. However, none of these (or, as far as I know, anything similar) has been widely embraced by the research community.

Enough about today. It is more interesting to try and imagine how all these things, and more, may come together in the future.

The academic clowd scenario

Shamelessly copying this post, let’s imagine the academic clowd (cloud + crowd).

You have a great new idea and decide to try it out. You write a proof-of-concept implementation and run it on the cloud, using large datasets that also live out there. The implementation itself is available to the clowd, which can analyze the revision control logs and find out who really worked on what.

Your idea works and you decide to write a research article about it. The clowd knows what papers you have written, who your co-authors are, and which conferences and journals you publish in (cf. DBLP). It also knows the content of your papers (cf. CiteSeer). So, when you publish your new article, it compares it with the existing literature and finds the most relevant experts (in terms of content, co-citations, venues of publication, etc.) to evaluate your work. It knows who your close friends and relatives are (from Facebook) and automatically excludes them from the list of potential reviewers. It also excludes your co-authors from the past three years. Then, it solicits reviews from those experts. Of course, it also allows others who are interested to participate in the discussion.
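A back-of-the-envelope sketch of the matching step (all names and data hypothetical): score candidate experts by the textual overlap of their publications with the submission, then strike anyone with a conflict of interest.

    # Hypothetical sketch of clowd-style reviewer matching: rank experts
    # by textual overlap with the submission, then drop conflicts of
    # interest (friends, relatives, recent co-authors).
    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        # Bag-of-words cosine similarity between two texts.
        ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
        dot = sum(ca[w] * cb[w] for w in ca)
        norm = sqrt(sum(v * v for v in ca.values())) * \
               sqrt(sum(v * v for v in cb.values()))
        return dot / norm if norm else 0.0

    def match_reviewers(submission, experts, conflicts, k=3):
        # experts: {name: concatenated abstracts}; conflicts: set of names.
        eligible = {n: text for n, text in experts.items() if n not in conflicts}
        ranked = sorted(eligible,
                        key=lambda n: cosine(submission, eligible[n]),
                        reverse=True)
        return ranked[:k]

    experts = {"alice": "distributed storage replication",
               "bob": "sorting networks",
               "carol": "replication consistency"}
    print(match_reviewers("replication in distributed storage systems",
                          experts, conflicts={"bob"}))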

In addition to the original paper, all review comments are public and can be moderated (say, similar to Digg or to Slashdot, but perhaps in a more principled and civilized manner). Thus, the review comments are ranked for their correctness, originality and usefulness. These rankings propagate to the papers they refer to.

You present your work in public and the video of your lecture is on the clowd, exposing you to a much larger audience. Anyone can also comment on it and respond to it. The videos are linked to each other, as well as to the articles and to the implementations. They are organized into thousands of “virtual research tracks” with several tens of talks in each. “Best of” virtual conference compilations appear on the clowd.

Rising papers and their authors get introduced to each other by the clowd. You can easily find ten potential new collaborators with mutual interests. You try out more things together, write more articles, and so on …until one day you all save the world together (well, maybe not, but it would be nice! :-).

So, what will the future really look like?

Well, who knows? I’m pretty sure the above scenario will seem as ridiculous in ten years as the SourceForge ideal looks today (what was I thinking back then?). Nonetheless, I believe it should be part of the current vision for research. I don’t think that the web and social media will lead to less selection via peer evaluation. Quite the contrary. Nor do I think that they will lead to less elitism. This follows from simple math. Taking the simplistic but common measure of “acceptance ratio,” the numerator cannot grow much, because people’s capacity to absorb information will not grow that much. But if the potential to produce published content makes the denominator grow without bound, then the ratio has to approach zero. Methods for evaluating research output need to scale up to this level of filtering, and I simply don’t think that the current way of evaluating research can achieve this.
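In symbols (a trivial formalization, with A for the roughly constant attention of readers and S for the number of submissions):

    % A: papers the community can actually read (bounded by attention).
    % S: papers submitted (growing without bound).
    \lim_{S \to \infty} \frac{A}{S} = 0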

