I have stopped worrying what can be inferred about me, because I’ve accepted the simple fact that, given enough time (data) and resources, anything can be inferred. Consider, as an example, “location privacy.” A number of approaches rely on adaptively coarsening the detail of reported location (using all sorts of criteria to decide detail, from mobility patterns, to spatial query workload characteristics, etc). For example, instead of revealing my exact location, I can reveal my location at a city-block level. In an area like NYC, this would conflate me with hundreds of other people that happen to be on the same block, but a block-level location is still accurate enough to be useful (e.g., for finding nearby shops and restaurants). This might work if I’m reporting my location just once. However, if I travel from home to work, then my trajectory over a few days, even at a city-block granularity, is likely sufficient to distinguish me from other people. I could perhaps counter this by revealing my location at a city-level or state-level. Then a few days worth of data might not be enough to identify me. However, I often travel and data over a period of, say, a year, would likely be enough to identify me even if location detail is quite coarse. Of course, I could take things to the extreme and just reveal that “I am on planet Earth”. But that’s the same as not publishing my location, since this fact is true for everyone.
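The trajectory argument can be made concrete with a toy sketch (hypothetical people and coordinates, just to illustrate the idea):

```python
# Toy illustration (entirely made-up data): even at coarse granularity,
# a multi-day trajectory can uniquely identify a person.

def coarsen(point, cell_size):
    """Snap an (x, y) location to the corner of its grid cell."""
    x, y = point
    return (x // cell_size * cell_size, y // cell_size * cell_size)

# Exact daily trajectories for three fictional people.
trajectories = {
    "alice":   [(12, 34), (56, 78), (90, 12)],
    "bob":     [(13, 35), (55, 79), (21, 11)],
    "charlie": [(40, 2),  (8, 64),  (30, 77)],
}

# A single block-level (cell_size=10) report conflates Alice and Bob...
assert coarsen((12, 34), 10) == coarsen((13, 35), 10)

# ...but their full coarsened trajectories already differ, because they
# end up in different places: each coarse trajectory is unique.
coarse = {
    name: tuple(coarsen(p, 10) for p in traj)
    for name, traj in trajectories.items()
}
print(len(set(coarse.values())))  # 3: every trajectory is distinct
```

Growing the cell size just delays the inevitable; with enough reports, the coarse trajectories separate again.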
I’m far from an expert in economics, politics, or history; quite the contrary. Which is why I try to cast events in more familiar, anthropomorphic terms, and also why such analogies are dangerous. Caveat lector—now, let’s get on with the story.
There once was a large family, with many brothers, uncles, and cousins spread over many different places. Each of them led their own lives. The extended family spanned all sorts of lifestyles, from successful businessmen, dignified and well-dressed, to smart but somewhat irresponsible bon viveurs. They lived in many different places and they occasionally exchanged gifts and money, some more frequently than others (admittedly, this part is rather weak in its simplicity, but a single analogy can only be taken so far). But they were getting tired of running to Western Union, paying transaction fees, losing money on currency conversions due to volatility in exchange rates, and so on. Furthermore, some of the more powerful family members had gotten into nasty feuds (world wars).
So, under the leadership of some of the more powerful siblings (Germany and France) they thought: well, we have enough money to go down to an international bank and open a common family account in a solid currency, say, dollars (they in fact created their own currency and bank, perhaps to avoid associations with existing institutions, but it’s probably safe to say that they heavily mirrored those of one of the leading siblings). Then it will be so much easier to do the same things much more efficiently. The richer craftsmen and businessmen among them could send their stuff with less hassle and waste [e.g., paragraph seven], and the poorer ones could gain a bit by wisely using their portion of the funds and an occasional advance withdrawal.
The leading siblings knew how to keep their checkbooks balanced, and it seemed reasonable to assume that these methods were general enough and suitable for everyone. So, after opening the family account with all of them as joint holders, they shook hands and simply agreed to use the money wisely, pretty much in the way that had worked well for the richer and more productive ones (stability and growth pact). Once in a while they might briefly meet and agree on some further rules of how the money should be used, but basically each one of them went their way, living the life they always had, managing their portion of the family funds. One of the more cynical siblings (England) was a bit skeptical about opening a family account while living their separate lives apart, so it chose to stay out, at least for a while. Times were good for several years, but they didn’t last forever.
The post I wrote a few days ago about Android is all over the place. The right elements are in that post, but my composition and conclusions are somewhat incoherent. Perhaps I have been partly infected by the conventional thinking (of, e.g., various older, big corporations) and missed the obvious.
Every piece of content has a creator and owner (in this post, I will assume they are by default the same entity). I do not mean ownership in the traditional sense of, e.g., stashing a piece of paper in a drawer, but in the metaphysical sense that each artifact is forever associated with one or more “creators.”
This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement. This has two immediately identifiable “creators”: the payer (you) and the payee. The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (more than the earth’s population) per year—all this from just one activity on one website.
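For the record, the arithmetic behind those figures (the daily query volume is my assumption, roughly 100 million queries per day, not a number from any source):

```python
# Back-of-the-envelope check of the figures above. The total daily
# search volume is an assumed round number, not a quoted statistic.
daily_queries = 100_000_000
new_fraction = 0.20            # "over 20% of search terms ... are new"

new_per_day = daily_queries * new_fraction
new_per_year = new_per_day * 365

print(f"{new_per_day:,.0f} new terms/day")    # 20,000,000
print(f"{new_per_year:,.0f} new terms/year")  # 7,300,000,000
```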
When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights on the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on. However, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.
Update: I’ll keep this post for the record, even though I’ve completely changed my mind.
I recently upgraded to a T-Mobile G1 (aka. HTC Dream), running Android. The G1 is a very nice and functional device. It’s also compact and decent looking, but perhaps not quite a fashion statement: unlike the iPhone my girlfriend got last year, which was immediately recognizable and a stare magnet, I pretty much have to slap people on the face with the G1 to make them look at it. Also, battery life is acceptable, but just barely. But this post is not about the G1, it’s about Android, which is Google’s Linux-based, open-source mobile application platform.
I’ll start with some light comments, by one of the greatest entertainers out there today: Monkey Boy made fun of the iPhone in January, stating that “Apple is selling zero phones a year“. Now he’s making similar remarks about Android, summarized by his eloquent “blah dee blah dee blah” argument. Less than a year after that interview, the iPhone is ahead of Windows Mobile in worldwide market share of smartphone operating systems (7M versus 5.5M devices). Yep, this guy sure knows how to entertain—even if he makes a fool of himself and Microsoft.
Furthermore, Monkey Boy said that “if I went to my shareholder meeting […] and said, hey, we’ve just launched a new product that has no revenue model! […] I’m not sure that my investors would take that very well. But that’s kind of what Google’s telling their investors about Android.” Even if this were true, perhaps no revenue model is better than a simian model.
Anyway, someone from Microsoft should really know better—and quite likely he does, but can’t really say it out loud. There are some obvious parallels between Microsoft MS-DOS and Google Android.
The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.” Although this is a fairly obvious statement, the article is well-written and worth a read. The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.
However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source of inspiration, as well as a target of polemic by prominent database researchers.
In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes. In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS) but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it
- Provides data de-clustering and replication across many machines, and
- Allows computations to execute on local copies of the data.
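Given those two properties, the programming model itself is tiny. Here is a toy, single-process sketch (word count, the canonical example from the original paper; a real engine would run the map and reduce phases in parallel on local data copies):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc):
    """Map: emit (word, 1) for every word in a document."""
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for a given word."""
    return (word, sum(counts))

def mapreduce(docs, map_fn, reduce_fn):
    # Map phase (would run in parallel, close to the data).
    pairs = [kv for doc in docs for kv in map_fn(doc)]
    # Shuffle: group intermediate pairs by key.
    pairs.sort(key=itemgetter(0))
    # Reduce phase (also parallel, one key per call).
    return dict(
        reduce_fn(k, (v for _, v in group))
        for k, group in groupby(pairs, key=itemgetter(0))
    )

result = mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn)
print(result["the"])  # 2
```

Everything else (partitioning, retrying failed tasks, speculative execution) lives in the runtime, which is precisely the point of the abstraction.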
Arguably, MapReduce is powerful both for the features it provides, as well as for the features it omits, in order to provide a clean and simple programming abstraction, which facilitates improved usability, efficiency and fault-tolerance.
Most of the fundamental ideas for distributed data processing are not new. For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma“—and he’s probably right. Reading the original Google papers, none make a claim to fundamental discoveries. Focusing on “academic novelty” (whatever that may mean) is irrelevant. Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point. The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision, on clusters of commodity hardware.
Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?
Data types: In database speak, these are known as “schemas.” Google’s protocol buffers are the underlying API for data storage and exchange. This is also nothing radically new; in essence, it is a binary XML representation, somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct). In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc), it would be the lack of such a common data representation format. Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on). Interestingly, the rcc record compiler component (which seems to have fallen into disuse) was once called Jute, possibly with plans grander than what came to be. So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.
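To make the “binary representation” point concrete, here is a toy re-implementation of the varint encoding at the heart of the protocol buffers wire format (unsigned integers only; the real library does vastly more, this is just a sketch of the idea):

```python
# Varint: 7-bit groups, least significant first; the high bit of
# each byte marks "more bytes follow". Small numbers stay small.

def encode_varint(n):
    """Encode a non-negative integer as a protobuf-style varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    """Inverse of encode_varint; returns the decoded integer."""
    n, shift = 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return n
        shift += 7
    raise ValueError("truncated varint")

assert encode_varint(300) == b"\xac\x02"  # the docs' classic example
assert decode_varint(encode_varint(2**40)) == 2**40
```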
Data access: In the beginning there was BigTable, which has been recently followed by HBase and Hypertable. It started fairly simple, as “a sparse, distributed, persistent multidimensional sorted map,” to quote the original paper. It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1, was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far. While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. It turns out this has already been done (JIRA entry and wiki entries). I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).
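That one-sentence data model from the paper fits in a few lines of code. A minimal, in-memory sketch (none of the distribution, persistence, or compaction machinery, of course; the class and example row are my own invention):

```python
# BigTable's data model in miniature: a sparse, sorted map from
# (row, column, timestamp) to an uninterpreted byte string.

class TinyTable:
    def __init__(self):
        self._cells = {}  # (row, column, timestamp) -> value

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column):
        """Return the most recent value for (row, column), or None."""
        versions = [
            (ts, v) for (r, c, ts), v in self._cells.items()
            if r == row and c == column
        ]
        return max(versions)[1] if versions else None

    def scan(self, start_row, end_row):
        """Rows are sorted, so range scans are cheap in the real thing."""
        return sorted(
            (key, v) for key, v in self._cells.items()
            if start_row <= key[0] < end_row
        )

t = TinyTable()
t.put("com.cnn.www", "contents", 1, b"<html>v1</html>")
t.put("com.cnn.www", "contents", 2, b"<html>v2</html>")
print(t.get("com.cnn.www", "contents"))  # b'<html>v2</html>'
```

Note what is absent: no schema enforcement, no joins, no secondary indices. The sparseness and the sorted row keys are the whole contract.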
Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiming to be the equivalent of Make or Ant for distributed processing of huge datasets. Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy). As I have used neither Cascading nor Pig, I will reserve any further comparisons for the moment. It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.
Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic. I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.
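The flavor of what IR-style indexing brings is easy to show in miniature. A toy inverted index of the kind Lucene maintains (documents and queries are hypothetical; conjunctive search is just intersection of postings lists):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "sparse distributed sorted map",
    2: "distributed execution engine",
    3: "sorted map scan",
}
index = build_index(docs)

def query_and(index, *terms):
    """AND query: intersect the postings lists of all terms."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(query_and(index, "sorted", "map"))  # [1, 3]
```

The real work in Lucene is in keeping such postings compressed, on disk, and incrementally updatable, which is exactly where it could complement a row-key-only store.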
I recently saw a Slashdot post dramatically titled “Fallout From the Fall of CAPTCHAs“, citing an equally dramatic article about “How CAPTCHA got trashed“. Am I missing something? Ignoring their name for a moment, CAPTCHAs are computer programs, following specific rules, and therefore they are subject to the same cat-and-mouse games that all security mechanisms go through. Where exactly is the surprise? So Google’s or Yahoo’s current versions were cracked. They’ll soon come up with new tricks, and still newer ones after those are cracked, and so on.
In fact, I was always confused about one aspect of CAPTCHAs. I thought that a Turing test is, by definition, administered by a human, so a “completely-automated Turing-test” is an oxymoron, something like a “liberal conservative”. An unbreakable authentication system based on Turing tests should rely fully on human computation: humans should also be at the end that generates the tests. Let humans come up with questions, using references to images, web site content, and whatever else they can think of. Then match these to other humans who can gain access to a web service by solving the riddles. Perhaps the tests should also be somehow rated, lest the simple act of logging in turns into an absurd treasure hunt. I’m not exactly sure if and how this could be turned into an addictive game, but I’ll leave that to the experts. The idea is too obvious to miss anyway.
CAPTCHAs, even in their current form, have led to numerous contributions. A non-exclusive list, in no particular order:
- They have a catchy name. That counts a lot. Seriously. I’m not joking; if you don’t believe me, repeat out loud after me: “I have no idea what ‘onomatopoeia’ is—I’d better MSN-Live it” or “… I’d better Yahoo it.” Doesn’t quite work, does it?
- They popularized an idea which, even if not entirely new, was made accessible to webmasters the world over, and is now used daily by thousands if not millions of people. What greater measure of success can you think of for a technology?
- Sowed the seeds for Luis von Ahn’s viral talk on human computation, which has been featured at countless universities, companies, and conferences. Although not professionally designed, the slides’ simplicity matches their content in a Jobs-esque way. As for delivery and timing, Steve might even learn something from this talk (although, in fairness, Steve Jobs probably doesn’t get the chance to introduce the same product hundreds of times).
So is anyone really surprised that the race for smarter tests and authentication mechanisms has not ended, and probably never will? (Incidentally, the lecture video above is from 2006, over three years after the first CAPTCHAs were successfully broken by another computer program; see also the CVPR 2003 paper.) There are no silver bullets and no technology is perfect, but some are really useful. Perhaps CAPTCHAs are, to some extent, victims of their own hype which, however, is instrumental and perhaps even necessary for the wide adoption of any useful technology. I’m pretty sure we’ll see more elaborate tests soon, not less.
From the article “Web Science: An Interdisciplinary Approach to Understanding the Web” in the July issue of CACM (which, by the way, looks quite impressive after the editorial overhaul!):
At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. […] The macro system, that is, the use of the micro system by many users interacting with one another in often-unpredicted ways, is far more interesting in and of itself and generally must be analyzed in ways that are different from the micro system. […] The essence of our understanding of what succeeds on the Web and how to develop better Web applications is that we must create new ways to understand how to design systems to produce the effect we want. The best we can do today is design and build in the micro, hoping for the best, but how do we know if we’ve built in the right functionality to ensure the desired macroscale effects? How do we predict other side effects and the emergent properties of the macro? […] Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology and economics.
This is a noble goal indeed. The Wikipedia article on sociology sounds quite similar on many aspects:
Sociologists research macro-structures and processes that organize or affect society […] And, they research micro-processes […] Sociologists often use quantitative methods—such as social statistics or network analysis—to investigate the structure of a social process or describe patterns in social relationships. Sociologists also often use qualitative methods—such as focused interviews, group discussions and ethnographic methods—to investigate social processes.
First, we have to keep in mind that the current Western notion of “science” is fairly recent. Furthermore, it has not always been the case that technology follows science. As an example, in the book “A People’s History of Science” by Clifford Conner, one can find the following quotation from Galileo’s Two New Sciences, about Venice’s weapons factory (the Arsenal):
Indeed, I myself, being curious by nature, frequently visit this place for the mere pleasure of observing the work of those who, on account of their superiority over other artisans, we call “first rank men.” Conference with them has often helped me in the investigation of certain effects, including not only those which are striking, but also those which are recondite and almost incredible.
Later on, Conner says (p.284), again quoting Galileo himself from the same source:
[Galileo] demonstrated mathematically that “if projectiles are fired … all having the same speed, but each having a different elevation, the maximum range … will be obtained when the elevation is 45°: the other shots, fired at angles greater or less, will have a shorter range.” But in recounting how he arrived at that conclusion, he revealed that his initial inspiration came from discussions at the Arsenal: “From accounts given by gunners, I was already aware of the fact that in the use of cannons and mortars, the maximum range, that is the one in which the shot goes the farthest, is obtained when the elevation is 45°.” Although Galileo’s mathematical analysis of the problem was a valuable original contribution, it did not tell workers at the Arsenal anything they had not previously learned by empirical tests, and had little effect on the practical art of gunnery.
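In modern notation (a sketch ignoring air resistance, as Galileo’s idealized analysis also did), the 45° result falls out of the standard range formula:

```latex
% Projectile launched with speed v at elevation \theta:
x(t) = v t \cos\theta, \qquad y(t) = v t \sin\theta - \tfrac{1}{2} g t^2 .

% Time of flight: solve y(T) = 0 for T > 0:
T = \frac{2 v \sin\theta}{g} .

% Range: R = x(T), using 2\sin\theta\cos\theta = \sin 2\theta:
R = \frac{2 v^2 \sin\theta \cos\theta}{g} = \frac{v^2 \sin 2\theta}{g} .

% \sin 2\theta attains its maximum of 1 at 2\theta = 90^\circ,
% i.e. \theta = 45^\circ, and decreases symmetrically on either side,
% so shots fired at greater or lesser angles have shorter range.
```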
In any case, facilitating “technology” or “engineering” is certainly not the only good reason to pursue scientific knowledge. Conversely, although “pure science” certainly has an important role, it is not the only ingredient of technological progress (something I’ve alluded to in a previous post about, essentially, the venture capital approach to research). Furthermore, some partly misguided opinions about the future of science have brightly shot through the journalistic sphere.
However, if, for whatever reason, we decide to go the way of science (a worthy pursuit), then I am reminded of the following interview of Richard Feynman by the BBC in 1981 (full programme):
Privacy concerns notwithstanding, the web gives us unprecedented opportunities to collect measurements in quantities and levels of detail that simply were not possible when the venerable state-of-the-art involved, e.g., passing around written notes among a few people. So, perhaps we can now check hypotheses more rigorously and eventually formulate universal laws (in the sense of physics). Perhaps the web will allow us to prove Feynman wrong.
I’m not entirely convinced that it is possible to get quantitative causal models (aka. laws) of this sort. But if it is, then we need an appropriate experimental apparatus for large-scale data analysis to test hypotheses—what would be, say, the LHC-equivalent for web science? (Because, pure science seems to have an increasing need for powerful apparatuses.) I’ll write some initial thoughts and early observations on this in another post.
I’m pretty sure that my recent posts have been circling around something, but I’m not quite sure yet what that is. In any case, all this seems an interesting direction worth pursuing. Even though Feynman was sometimes a very hard critic, we should perhaps remember his words along the way.