## Expectations of privacy

I have stopped worrying what can be inferred about me, because I’ve accepted the simple fact that, given enough time (data) and resources, anything can be inferred. Consider, as an example, “location privacy.”  A number of approaches rely on adaptively coarsening the detail of reported location (using all sorts of criteria to decide detail, from mobility patterns, to spatial query workload characteristics, etc).  For example, instead of revealing my exact location, I can reveal my location at a city-block level. In an area like NYC, this would conflate me with hundreds of other people that happen to be on the same block, but a block-level location is still accurate enough to be useful (e.g., for finding nearby shops and restaurants).  This might work if I’m reporting my location just once.  However, if I travel from home to work, then my trajectory over a few days, even at a city-block granularity, is likely sufficient to distinguish me from other people.  I could perhaps counter this by revealing my location at a city-level or state-level.  Then a few days worth of data might not be enough to identify me.  However, I often travel and data over a period of, say, a year, would likely be enough to identify me even if location detail is quite coarse.  Of course, I could take things to the extreme and just reveal that “I am on planet Earth”.  But that’s the same as not publishing my location, since this fact is true for everyone.

If it’s technically possible to infer my identity (given a long enough period of observation, and enough resources and time to piece the various, possibly inaccurate, pieces of information together), someone (with enough patience and resources) will likely do it. Therefore, as the amount of data about me tends to infinity (which, on the Internet, it probably does), the fraction that I have to hide in order to maintain my privacy tends to one: you have long-term privacy only if you never reveal anything.  There are various ways of not revealing anything.  One is to simply not do it.  Another might be to keep it to yourself and never put it in any digital media.  Yet another might be encrypting the information.

Although I learned not to worry about what can be inferred about me, I am perhaps somewhat worried about knowing who is accessing my data (and making inferences), and how they are using it. Particularly if this is done by parties that have far more resources and determination than myself.  However, who uses my information and how is also another piece of information (data) itself.  Although everything is information, there seems to be an asymmetry: when my information is revealed and used, it may be called “intelligence”, but when the information that it was used is revealed, it may be called “whistleblowing” or even “treason“.  This asymmetry does not seem to have any technical grounding—one might make valid arguments on political, legal, moral, etc grounds, but not on technical grounds. Seen in this context, Zuckerberg’s calls for “more transparency” make perfect sense—he’s calling for less asymmetry.

More generally, privacy does not really seem to be a technical problem, much like DRM isn’t really a technical problem.  That privacy can be guaranteed by technical means seems to be a delusion and, perhaps, a dangerous one, because it gives a false sense of security. Privacy is, for the most part, a social, political and legal problem about how data can be used (any and all data!) and by whom. The apparent technical infeasibility of privacy had led me to believe that people will, eventually, get over the idea. After all, privacy is a 200-300 year old concept (at least in the western world; interestingly, Greek did not have a corresponding word until very recently). I may have missed something obvious, however: if privacy is attainable via a centralized, trusted gatekeeper, then perhaps privacy is the “killer app” for centralization and “walled gardens”. “I want full control over your data” is tougher to sell than “I want to protect your privacy”. Which is why Eric Schmidt’s recent backpedaling is somewhat worrying, even if the goal is noble (and there currently isn’t any evidence to believe otherwise).

I don’t think there are any (technical) solutions to privacy.  Also, enforcing transparency is perhaps almost as hard as enforcing privacy, although I have slightly more hope for the former—but that’s a separate discussion.  Privacy is cat-and-mouse game, much like “piracy” and DRM. However, our expectations should be tempered by the reality of near-zero-cost transmission, collection, and storage of “inifinitely” growing amounts of information, and we should perhaps re-examine existing notions of privacy under this light. I find that many non-technical people are still surprised when I explain the simple example in the opening paragraph, even though they consider it obvious in retrospect.

Personally, I find it safer to just assume that I have no privacy. Saves me the aggravation.

## The pesky cousin from Greece

I’m far from an expert in economics, politics, or history; quite the contrary.  Which is why I try to cast events in more familiar, anthropomorphic terms, and also why such analogies are dangerous. Caveat lector—now, let’s get on with the story.

There once was a large family, with many brothers, uncles, and cousins spread over many different places. Each of them led their own lives.  The extended family spanned all sorts of lifestyles, from successful businessmen, dignified and well-dressed, to smart but somewhat irresponsible bon viveurs.  They lived in many different places and they occasionally exchanged gifts and money, some more frequently than others (admittedly, this part is rather weak in its simplicity, but a single analogy can only be taken so far). But they were getting tired of running to Western Union, paying transaction fees, losing money on currency conversions due to volatility in exchange rates, and so on. Furthermore, some of the more powerful family members had gotten into nasty feuds (world wars).

So, under the leadership of some of the more powerful siblings (Germany and France) they thought: well, we have enough money to go down to an international bank and open a common family account in a solid currency, say, dollars (they in fact created their own currency and bank, perhaps to avoid associations with existing institutions, but it’s probably safe to say that they heavily mirrored those of one of the leading siblings).  Then it will be so much easier to do the same things much more efficiently.  The richer craftsmen and businessmen among them could send their stuff with less hassle and waste [e.g., paragraph seven], and the poorer ones could gain a bit by wisely using their portion of the funds and an occasional advance withdrawal.

The leading siblings knew how to keep their checkbooks balanced, and it seemed reasonable to assume that these methods were general enough and suitable for everyone.  So, after opening the family account with all of them as joint holders, they shook hands and simply agreed to use the money wisely, pretty much in the way that had worked well for the richer and more productive ones (stability and growth pact).  Once in a while they might briefly meet and agree on some further rules of how the money should be used, but basically each one of them went their way, living the life they always had, managing their portion of the family funds.  One of the more cynical siblings (England) was a bit skeptical about opening a family account while living their separate lives apart, so it chose to stay out, at least for a while.  Times were good for several years, but they didn’t last forever.

The first to get into trouble would be one of the younger cousins (Greece), who generally valued time more than money (he occasionally complains about that himself, but to little effect so far). Using some money from the family account, he did a few renovations to make his home look better and bought some decent clothes. Using the family account to boost his creditworthiness and sporting a sharper new look, he managed to get a credit card with a promotional  0% APR (Euro membership).  He even threw a big party that impressed many (Olympic games). But after a few years, the credit card companies came back asking for payment, and he found himself in deeper trouble than before the good times had begun.

Some of the other relatives had also started getting into trouble, even if not all of them had been as irresponsible.  But the immediate problem was that cousin.  What was the family to do?  Other people had started noticing, and were beginning to have some questions.  “What kind of family are you?”  Your cousin deserves what he gets, but did you really think it was that simple to run a family with such a diverse crowd? Obviously the little cousin should be taught a lesson and become more mature and responsible.  But it should also be a lesson that could be repeated on other relatives, if necessary.

One option would be to kick him out (bankruptcy). It might get him to change his ways (or not), but a homeless relative does not make the family look good, even if he’s largely responsible for his predicament (which he is, by the way). And what would happen to the other relatives that weren’t doing that great either?  A 0% promotional APR cannot last forever, and it’s not hard to shoot yourself in the foot with it, even if you aren’t irresponsible.  Will other relatives head for the door too?  If they do, will they come back?  And is it possible to neatly untangle the finances, after decades of using a common account?  Furthermore, the cousin may start hanging out with “strangers”, some of which may be of questionable character (IMF, Russia, etc). In fact, keeping him out of undesirable company might have played a role in inviting him to the extended family account in the first place.

Another option would be to bring him and his family into the home of a richer and more dignified family member, force him into a suit, grab him by the hand (or neck), and teach him how behave like a grown up under close supervision. But the other members of the household (citizens), who contribute to its finances (pay taxes) and get food and shelter in return (welfare and other benefits) would rightfully protest. “Who is this noisy, scruffy guy in our home?  Why do we have to feed him and pay so much attention to him?”  The cousin’s family, who also valued time over money (e.g., preferring a relaxing lifestyle on modest means over hard work), was also not very happy. “I just wish we could go down to the beach and spend 2-3 hours enjoying coffee under the sun like we used to.  And why is your big cousin telling us what to do anyway?”  In addition, it was always likely that other, equally noisy and scruffy distant relatives might show up knocking at the door of the mansion, and demand the same attention.  This was certainly more than big cousin had signed up for when opening the family bank account.

Then there is a third option, which does not so much focus on teaching a lesson, but on saving face and postponing the worst trouble. Just give the little cousin a scolding and some pocket money to pay the rent and interest for a few months.  At least he wouldn’t be out in the street.  And, who knows, he might change his ways on his own in the meantime.  Sweeping the mess under the rug is unlikely (although not provably impossible) that it’ll lead to any long-term solution, but it’s the option easiest to swallow by everyone involved.

Anyway, I’ll stop the anthropomorphic analogies here.  Using a different analogy, I’ll add that tweaking the knobs (fiscal policy targets) and, perhaps, changing batteries (bail-out loans) won’t do much good in the long run if the machine is basically broken.  But it’s hard to fix it if getting down to the cogs and gears that make it work (politics) is taboo, perhaps even more than it used to be (compare Victor Hugo’s vision of the “United States of Europe” more than a century ago, with the Lisbon treaty).

Although it’s a rather overloaded term, you can probably call me a technocrat.  As such, Deng Xiaoping’s famous quote (“it doesn’t matter if it’s a black cat or a white cat, it’s a good cat as long as it catches mice”) is basically appealing.  Cats competing with each other and against mice sounds like a “natural” situation, so it’s easy to overlook whether it’s the only possible state of affairs.  However,  if they’re domesticated and not out in the wild, it’s not hard to imagine the mice and both cats colluding to, basically, take it easy. Sometimes what is “natural” should be examined more closely.

Greece is the first to draw wide attention to such questions, but I don’t think it will be the last, nor is it the first mishap along European integration.  I’d venture that, unless the EU collapses, everyone will find their place in it.  Eventually.

I’ll finish with an annotated graph (original source via metablogging.gr, and public Google spreadsheet with subset of the data), showing Greek public debt (central government) as % of GDP over the past 40 years. I’ll just point out that 1981 looks like a particularly interesting year, for various reasons.

Postscript. It’s often mentioned that “Greece has been in default for 50 years during the past two centuries.”  This is true; after independence in 1821, Greece was bankrupt starting at the end of the 19th century under Charilaos Trikoupis, and ending after WWII.  During this period, Greece was involved in a number of wars in the Balkans and Asia Minor, growing and shrinking in size a few times. Obviously, this didn’t help financial matters, but I don’t think it bears much similarity to the current situation.

I’ve also been puzzled somewhat about the role of corruption.  Obviously, it’s not good and I’m not trying to justify it in any way.  On the other hand, it doesn’t seem to be the sole cause of trouble, as is often suggested.  Several East Asian countries (notably China, although it’s not the first neither the only one) have shown progress despite corruption. I don’t have an answer, but it seems to me that, when you steal money, it matters where you steal it from.  If I swipe some cash from my little brother’s wallet, it will make my brother poorer and angrier, but it probably won’t bankrupt the household; someone earned that money, even if it wasn’t me.  However, if I pocket an advance withdrawal using the credit card our father gave us, it’ll get everyone in trouble, eventually.

Finally, as for 0% APR credit cards, it’s rather different if, say, Bill Gates (US) gets one versus if I get one (not that I’m that irresponsible : ).  One of us has deeper pockets and that makes a difference on whether we deserve it, on the kind of trouble we can get in, and even on the moral hazards we face.  As long as the card is used wisely for an appropriate period of time, it isn’t necessarily bad.  Any comparisons between US and Greece are, at best, premature.

## Revised thoughts on Android

The post I wrote a few days ago about Android is all over the place. The right elements are in that post, but my composition and conclusions are somewhat incoherent. Perhaps I have been partly infected by the conventional thinking (of, e.g., various older, big corporations) and missed the obvious.

First, in a networked environment, it is common standards, rather than a single, common software platform, which further enable information sharing. So, Google may be doing Android for precisely the opposite reason than I originally suggested: to avoid the emergence of a single, dominant, proprietary platform. Chrome may exist for a similar reason. After all, Android serves a purpose similar to a browser, but for mobile devices with various sensing modalities.

Finally, mobile is arguably an important area and Google probably wants to encourage diversity and experimentation which, as I wrote in a previous post, is a pre-requisite for innovation. This is in contrast to the established mentality summarized by the quote I previously mentioned, to “find an idea and ask yourself: is the potential market worth at least one billlion dollars? If not, then walk away.” In fairness, this approach is appropriate to preserve the status quo. (By the way, in the same public speech, the person who gave this advice also responded to a question about competition by saying with commendable directness that “Look: we’ll all be dead some day.  But there’s a lot of money to be made until then.”)  But for innovation of any kind, one should “ask ‘why not?’ instead of ‘why should we do it?'” as Jeff Bezos said, or “innovate toward the light, not against the darkness” as Ray Ozzie said.

## On data ownership in a networked world

Every piece of content has a creator and owner (in this post, I will assume they are by default the same entity). I do not mean ownership in the traditional sense of, e.g., stashing a piece of paper in a drawer, but in the metaphysical sense that each artifact is forever associated with one or more “creators.”

This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement.  This has two immediately identifiable “creators”: the payer (you) and the payee.  The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (over twice the earth’s population) per year—all this from just one activity on one website.

When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights on the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on.  However, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.

#### Unique identity

People call me “Spiros”, my identity documents list me as “Spyridon Papadimitriou” and on most online sites I’m registered as spapadim.  However, sometimes I’m s_papadim or spiros_papadimitriou, and so on.  Like most people, I lost track of all my accounts a time ago.  Vice versa, I’m not the only “Spiros Papadimitriou” in the real world.  For example, I occasionally get confused with my cousin, and receive comments about my interesting architectural designs!  Nor am I the only spapadim on the net.

A framework and mechanisms that allow (but do not enforce) asserting and verifying which of those labels (i.e., names, userids, etc) refer to the same entity (i.e., me) is missing. However, this is a prerequisite: how can we talk about data ownership and tackle portability, transparency and accountability, if we have to jump through countless hoops just to prove identity?

Some people, especially in the US, may object or even outright panic at the thought of such a global identifier.  In Greece, and in much of Europe, we’ve had national identity cards for decades.  Which is fine, as long as you know they exist and what are permissible uses-in other words, as long as transparency is ensured.  Furthermore, the illusion of privacy should not be confused with privacy itself—if in doubt, I suggest reading “Database Nation” (official site).  Its examples are largely US-centric, but the lessons are not.

Both high-profile individual developers and major companies are involved in these efforts.  For example, Yahoo! already supports OpenID and plans to support OAuth as well, while Google supports OAuth directly and OpenID indirectly in various ways.  Wide adoption of these standards would be a major step forwards for data portability and web interoperability.  However, I suspect they fall slightly short of providing a truly permanent and global personal identity.  What if, for any reason, my Yahoo! account disappears, either because I decided to shut it down or because Yahoo! went bust?

I was going to suggest a DNS-based solution and I was surprised when I found that the generic top-level domain .name has been instituted since 2001 to provide URIs for personal identities. You can register for a free three-month trial on FreeYourID (after that, it’s $11/year). What’s more, their service already provides OpenID authentication. In principle, this should allow easy switching of authentication and authorization service providers. Just as I can still keep the “label” for this site even if I move to a different web host, I can still keep my personal “label” no matter who I choose to manage my personal information. So, now my universal username is spiros.papadimitriou.name, any emails sent to spiros@papadimitriou.name will find their way to me, you can call me on Skype using spiros.papadimitriou.name/call, and so on. With such a unique identity tied to authorization and authentication services, the Giant Global Graph and its materializations would be one step closer to becoming really useful. If I want to use my identity to log and controll access to my data, I should be able to prove my claims. Currently, FOAF and XFN allow assertion of relationshipt but provide no way to verify them. #### Data portability The point of this mental exercise so far is the following: A unique identity that can be verifiably associated with each and every data item that I produce is a prerequisite for making data ownership claims. Subsequently, we need to ask what fundamental rights should be associated with data ownership. The first is the right to keep my information with me or, in other words, “data portability”. Just as I can freely move my money from one financial institution to another, I should be able to move any of my information from one data warehouse to another. For example, consider my web search history. I don’t think I need to argue about the importance of historical information to improve search quality. If I decide for any reason to move to another search provider, I should be able to carry along all the information that’s directly associated with me. This should include my search keyword history, as well as any additional information I may have contributed. The actual details, however, may not be that straightforward. Take, say, the third hit on a Google search. Who is the “creator”? Me by entering the search keywords, Google by producing the search results in response to those keywords, or the person who wrote the web page that contains them in the first place? Similarly, when I buy gas, who is the “creator” of the transaction entry: me, Mobil, or American Express? Even though intuition can often be wrong, my intuitive response to the Google search example would be that both I and Google have an ownership claim on this particular search, which includes the query keywords as well as a ranking of URLs. On the other hand, the person who wrote the contents of, say, the third URL has ownership claims only on those, and not the search results. Furthermore, the thousands of people that provided feedback to Google’s ranking algorithms by clicking on this URL on similar searches have ownership claims on those searches, but not on mine. Finally, those two ownership claims (on keywords and on rankings) should probably not be treated the same. If they were, then, say, MS Live could effectively copy Google by getting many users to move. It seems reasonable to have the right to move my search history, but not the actual search results. However, I can imagine that some form of ownership claim on the rankings may be useful for other personal rights. This is a highly idealized example and I’m not sure what an appropriate litmus test for ownership is, but some form of legal consensus must be in place. #### Transparency The second fundamental right is that I should know who is using my personal information and how. For example, if an insurance company accesses my credit history to give me a rate quote, I can find this out. It may not be a completely painless process but it is certainly possible today, with a regulatory framework that ensures this. Similar regulations should be instituted to cover any and all forms of access to personal information. Data access should be fully transparent to all parties involved. If the an insurance company accesses my medical records, I should know this. If the government does a background check on me, I should know this too. Transparency is a prerequisite for accountability. Otherwise, individuals have very limited power to protect themselves from improper uses of their personal information. #### Concluding remarks Much of the privacy research in computer science seems to assume that we can keep the existing legal and regulatory frameworks intact. Computer scientists taking such a position is even sadder than lawyers doing so; we have no excuse of failing to understand the technical issues. We cannot and should not make this assumption. Technical solutions should be subsidiary to new regulations. But that doesn’t mean technologists cannot lead. We should work towards supporting full transparency (for both individuals, as well as governments and corporations) rather than opacity and I’m currently in favor of a “shoot first, ask questions later” approach (and help lawmakers figure out the answers). After all, if there is anything that the DRM wars have taught us, it’s that information really wants to be free. Why do we think it’s technically hard (to say the least) to prevent copying of music, movies and software but we still think it may be possible to prevent copying of personal information? As I pointed out in an older post, it’s usually the use and not the possession of information that’s the problem. My point in this post is simple: we should not fight the wrong war. Instead, we need an easy way to make data ownership claims, and use this to enforce at least two fundamental rights: the ability to keep any personal data with us, and the ability to know who is using this data and how. Postscript. This post was wallowing for a while as a draft (originally separated from this post, then forgotten). Since then, a recent MIT TR article discusses some aspects of data ownership. Even better, I have since found an excellent short piece in the same issue by Esther Dyson, with which I could not agree more. Update. After posting this last night, I did some further Googling and found another piece by Esther Dyson in the Scientific American. If you’ve read through my ramblings so far, then I’d urge you to read her article; she’s a much better writer than me, and has apparently been thinking about these issues for almost a decade, way before many people even knew what the Internet is. I should probably follow her more closely myself, as I agree disturbingly often with what I’ve read from her so far. ## First thoughts on Android Update: I’ll keep this post for the record, even though I’ve completely changed my mind. I recently upgraded to a T-Mobile G1 (aka. HTC Dream), running Android. The G1 is a very nice and functional device. It’s also compact and decent looking, but perhaps not quite a fashion statement: unlike the iPhone my girlfriend got last year, which was immediately recognizable and a stare magnet, I pretty much have to slap people on the face with the G1 to make them look at it. Also, battery life is acceptable, but just barely. But this post is not about the G1, it’s about Android, which is Google’s Linux-based, open-source mobile application platform. I’ll start with some light comments, by one of the greatest entertainers out there today: Monkey Boy made fun of the iPhone in January, stating that “Apple is selling zero phones a year“. Now he’s making similar remarks about Android, summarized by his eloquent “blah dee blah dee blah” argument. Less than a year after that interview, the iPhone is ahead of Windows Mobile in worldwide market share of smartphone operating systems (7M versus 5.5M devices). Yep, this guy sure knows how entertain—even if he makes a fool of himself and Microsoft. Furthermore, Monkey Boy said that “if I went to my shareholder meeting […] and said, hey, we’ve just launched a new product that has no revenue model! […] I’m not sure that my investors would take that very well. But that’s kind of what Google’s telling their investors about Android.” Even if this were true, perhaps no revenue model is better than a simian model. Anyway, someone from Microsoft should really know better—and quite likely he does, but can’t really say it out loud. There are some obvious parallels between Microsoft MS-DOS and Google Android: • Disruptive technology: In the 80s, it was the personal computer. Today, many think it is “cloud computing” (or “services”, or “ubiquitous computing”, or “utility computing”, or whatever else you want to call it). • Commodity infrastructure: In the 80s, PC-compatibles became a commodity through standardization of the hardware platform and fierce competition that drove prices (and profit margins) down. Today, network infrastructure (the Internet at the core, and mobile devices on the fringes) as well as systems software (LAMP) are facing similar pressures. • Common software platform: MS-DOS was the engine that fueled the growth of the personal computer. For cloud computing, there is still some way to go (which Android hopes to help pave). • Revenue model: Microsoft made a profit out of every PC sold. In today’s networked world, profit should come from services offered over the network and accessed via a multitude of devices (including mobile phones), rather than from selling software licenses. An executive once said that money is really made by controlling the middleware platform. Lower levels of the stack face high competition and have low profit margins. Higher levels of the stack (except perhaps some key applications) are too special-purpose and more of a niche. The sweet-spot lies somewhere in the middle. This is where MS-DOS was and where Android wants to be. Microsoft established itself by providing the platform for building applications on the “revolution” of its day, the personal computer. MS-DOS became the de-facto standard, much more open than anything else at that time. Subsequently, Microsoft took a cut of the profits out of each PC sold ever since. Taiwanese “PC-compatibles” helped fuel Microsoft’s (as well as Intel’s) growth. The rest is history. In “cloud” computing, the ubiquitous, commodity infrastructure is the network. This enables access to applications and information from any networked device. Even though individual components matter, it is common standards, rather than a single, common software platform, which further enable information sharing. If you believe that the future will be the same as the past, i.e., selling shrink-wrapped applications and software licenses, then Android not only has no revenue model, but has no hope of ever coming up with one. Ballmer would be absolutely right. But if there is a shift towards network-hosted data and applications, money can be made whenever users access those. There are plenty of obvious examples which could be profitable: geographically targeted advertising, smart shopping broker/assistant (see below), mobile office and add-on services, online games (location based or not), and so on. It’s not clear whether Google plans to get directly involved in those (I would doubt it), or just stay mostly on the back end and provide an easy-to-use “cloud infrastructure” for application developers. The services provided by network operators are becoming commodities. This is nothing new. A quote I liked is that “ISPs have nothing to offer other than price and speed“. I wouldn’t really include security in their offerings, as it is really an end-to-end service. As for devices, there is already evidence that commoditization similar to that of PC-compatibles may happen. Just one month after Android was open-sourced, Chinese manufacturers have started deploying it on smartphones. Even big manufacturers are quickly getting in the game; for example, Huawei recently announced an Android phone. Most cellphones are already manufactured in China anyway. The iPhone is assembled in Shenzhen, where Huawei’s headquarters are also located (coincidence?). The Chinese already have a decent track record when it comes to building hardware and it’s only a matter of time until they fully catch up. So, it’s quite simple: Android wants to be for ubiquitous services as MS-DOS was for personal computers. But Microsoft in the 80s did not really start out by saying “our revenue model is this: we’ll build a huge user base at all costs, which will subsequently allow us to get$200 out of each and every PC sold”?  Not really.  Similarly, Google is not going to say that “we want to build a user base, so we can make a profit from all services hosted on the [our?] cloud and accessed via mobile devices [and set-top boxes, and cars, and…].”  Such an announcement would be premature, and one of the surest ways to scare off your user base: unless Google first provides more evidence that it means no evil, the general public will tend to assume the worst.

The most interesting feature of Android is it’s component-based architecture, as pointed out by some of the more insightful blog posts. Components are like iGoogle gadgets, only Android calls them “activities.” Applications themselves are built using a very browser-like metaphor: a “task” (which is Android-speak for running applications) is simply a stack of activites, which users can navigate backwards and forwards. The platform already has a set of basic activities that handle, e.g., email URLs, map URLs, calendar URLs, Flickr URLs, Youtube URLs, photo capture, music files, and so on. Any application can seamlessly invoke any of these reusable activities, either directly or via a registry of capabilities (which, roughly speaking, are called “intents”). The correspondence between a task and an O/S process is not necessarily one-to-one. Processes are used behind the scenes, for security and resource isolation purposes. Activities invoked by the same task may or may not run in the same process.

In addition to activities and intents, Android also supports other types of components, such as “content providers” (to expose data sources, such as your calendar or todo list, via a common API), “services” (long-running background tasks, such as a music player, which can be controlled via remote calls) and “broadcast receivers” (handlers for external events, such as receiving an SMS).

I think that Google is really pushing Android because it needs a component-based platform, and not so much to avoid the occasional snafu. If embraced by developers, this is the major ace up Android’s sleeve.  Furthermore, the open source codebase is the strongest indication (among several) that Google has no intention to tightly regulate application frameworks like Apple, or to leverage it’s position to attack the competition like Microsoft has done in the past.  Google wants to give itself enough leverage to realize it’s cloud-based services vision. If others benefit too, so much the better—Google is still too young to be “evil“.  After all, as Jeff Bezos said, “like our retail business, [there] is not going to be one winner. […] Important industries are rarely made by single companies.” I find the comparison to retail interesting. In fact, it is quite likely that many “cloud services” themselves will also become commodities.

I’d wager that really successful Android applications won’t be just applications, but components with content provided over the network. A shopping list app is nice. It was exciting in the PalmPilot era, a decade ago. But a shopping list component, accessible from both my laptop and my cellphone, able to automatically pull good deals from a shopping component, and allow a navigation component to alert me that the supermarket I’m about to drive by has items I need—well, that would be great! Android is built with that vision in mind, even though it’s not quite there yet.

It’s kind of disappointing, but not surprising, that many app developers do not yet think in terms of this component-based architecture. In fairness, there are already efforts, such as OpenIntents, to build collections of general-purpose intents. Furthermore, the sync APIs are not (yet) for the faint of heart. Even Google-provided services could perhaps be improved. For example, Google Maps does not synchronize stored locations with the web-based version. When I recently missed a highway exit on the way to work and needed to get directions, I had to pull over and re-type the full address. Neither does it expose those locations via a data provider. When I installed Locale, I had to manually re-enter most of “My Locations” from the web version of Google Maps. So, there are clearly some rough edges that I’m sure will be smoothed out.  After all, there have been other rough edges, such as forgotten debugging hooks, something I find more amusing than alarming or embarrassing and certainly not the “Worst. Bug. Ever.

Android has a lot of potential, but it still needs work and Google should move fast. The top two items on my wish list would be:

1. Release a “signature” device (or two), like the Motorola Razr was a couple of years ago and the Apple iPhone was last year. The G1 is really nice, but not enough.  A device that people desire may be neither a necessary nor a sufficient condition for success, but it will sure help as a vehicle to move Android forward in market share.
2. Expand the set of available activities and content providers, and release an easy-to-use data sync service and API. In principle, everything that is an iGoogle gadget should also be an Android activity, sharing the same data sources. This is at the core of what “cloud computing” is about.  After all, you could think of Android as a glorified modern browser for devices with small screens, intermittent network connectivity, location sensors, and so on.

I suspect it might not be that hard to build a Google gadget container for Android.  Google Gears is already there and some form of interaction with the local device via Javascript is already allowed.  Many gadgets don’t need that much screen real estate anyway, so this may be an interesting hack to try out.

But not many people will buy an Android device for what it could do some day. Google has created a lot of positive buzz, backed by a few actual features. Now it needs some sexy devices and truly interesting apps, to really jumpstart the necessary network effect. Building the smart shopping list app should be as easy as building the dumb one. In the longer run, the set of devices on which Android is deployed should be expanded.  Move beyond cell phones, to in-car computers, set-top boxes, and so on (Microsoft Windows does both cars and set-top boxes already, but with limited success so far)—in short, anything that can be used to access network-hosted data and applications.

## “Beyond Relational Databases”

The article “Beyond Relational Databases” by Margo Seltzer in the July 2008 issue of CACM claims that “there is more to data access than SQL.”  Although this is a fairly obvious statement, the article is well-written and worth a read.  The main message is simple: bundling data storage, indexing, query execution, transaction control, and logging components into a monolithic system and wrapping them with a veneer of SQL is not the best solution to all data management problems. Consequently, the author makes a call for solutions based on a modular approach, using open components.

However, the article offers no concrete examples at all, so I’ll venture a suggestion. In a growing open source ecosystem of scalable, fault-tolerant, distributed data processing and management components, MapReduce is emerging as a predominant elementary abstraction for distributed execution of a large class of data-intensive processing tasks. It has attracted a lot of attention, proving both a source for inspiration, as well as target of polemic by prominent database researchers.

In database terminology, MapReduce is an execution engine, largely unconcerned about data models and storage schemes.  In the simplest case, data reside on a distributed file system (e.g., GFS, HDFS, or KFS) but nothing prevents pulling data from a large data store like BigTable (or HBase, or Hypertable), or any other storage engine, as long as it

• Provides data de-clustering and replication across many machines, and
• Allows computations to execute on local copies of the data.

Arguably, MapReduce is powerful both for the features it provides, as well as for the features it omits, in order to provide a clean and simple programming abstraction, which facilitates improved usability, efficiency and fault-tolerance.

Most of the fundamental ideas for distributed data processing are not new.  For example, a researcher involved in some of the projects mentioned once said, with notable openness and directness, that “people think there is something new in all this; there isn’t, it’s all Gamma“—and he’s probably right.  Reading the original Google papers, none make a claim to fundamental discoveries.  Focusing on “academic novelty” (whatever that may mean) is irrelevant.  Similarly, most of the other criticisms in the irresponsibly written and oft (mis)quoted blog post and its followup miss the point.  The big thing about the technologies mentioned in this post is, in fact, their promise to materialize Margo Seltzer’s vision, on clusters of commodity hardware.

Michael Stonebraker and David DeWitt do have a valid point: we should not fixate on MapReduce; greater things are happening. So, if we are indeed witnessing the emergence of an open ecosystem for scalable, distributed data processing, what might be the other key components?

Data types: In database speak, these are known as “schemas.” Google’s protocol buffers the underlying API for data storage and exchange.  This is also nothing radically new; in essence, it is a binary XML representation,  somewhere between the simple XTalk protocol which underpins Vinci and the WBXML tokenized representation (both slightly predating protocol buffers and both now largely defunct).  In fact, if I had to name a major weakness in the open source versions of Google’s infrastructure (Hadoop, HBase, etc), it would be the lack of such a common data representation format.  Hadoop has Writable, but that is much too low-level (a data-agnostic, minimalistic abstraction for lightweight, mutable, serializable objects), leading to replication of effort in many projects that rely on Hadoop (such as Nutch, Pig, Cascading, and so on).  Interestingly, the rcc record compiler component (which seems to have fallen in disuse) was once called Jute with possibly plans grander than what came to be.  So, I was pleasantly surprised when Google decided to open-source protocol buffers a few days ago—although it may now turn out to be too little too late.

Data access: In the beginning there was BigTable, which has been recently followed by HBase and Hypertable.  It started fairly simple, as a “is a sparse, distributed, persistent multidimensional sorted map” to quote the original paper.  It is now part of the Google App Engine and even has support for general transactions. HBase, at least as of version 0.1 was relatively immature, but there is a flurry of development and we should expect good things pretty soon, given the Hadoop team’s excellent track record so far.  While writing this post, I remembered an HBase wish list item which, although lower priority, I had found interesting: support for scripting languages, instead of HQL. Turns out this has already been done (JIRA entry and wiki entries).  I am a fan of modern scripting languages and generally skeptical about new special-purpose languages (which is not to say that they don’t have their place).

Job and schema management: Pig, from the database community, is described as a parallel dataflow engine and employs yet another special-purpose language which tries to look a little like SQL (but it is no secret that it isn’t). Cascading has received no attention in the research community, but it merits a closer look. It is based on a “build system” metaphor, aiminig to be the equivalent of Make or Ant for distributed processing of huge datasets.  Instead of introducing a new language, it provides a clean Java API and also integrates with scripting languages that support functional programming (at the moment, Groovy).  As I have used neither Cascading nor Pig at the moment, I will reserve any further comparisons.  It is worth noting that both projects build upon Hadoop core and do not integrate, at the moment, with other components, such as HBase. Finally, Sawzall deserves an honorable mention, but I won’t discuss it further as it is a closed technology.

Indexing: Beyond lookups based on row keys in BigTable, general support for indexing is a relatively open topic.  I suspect that IR-style indices, such as Lucene, have much to offer (something that has not gone unnoticed)—more on this in another post.

A number of other projects are also worth keeping an eye on, such as CouchDB, Amazon’s S3, Facebook’s Hive, and JAQL (and I’m sure I’m missing many more).  All of them are, of course, open source.

## The Fall of CAPTCHAs – really?

I recently saw a Slashdot post dramatically titled “Fallout From the Fall of CAPTCHAs“, citing an equally dramatic article about “How CAPTCHA got trashed“.  Am I missing something? Ignoring their name for a moment, CAPTCHAs are computer programs, following specific rules, and therefore they are subject to the same cat-and-mouse games that all security mechanisms go through. Where exactly is the surprise? So Google’s or Yahoo’s current versions were cracked.  They’ll soon come up with new tricks, and still newer ones after those are cracked, and so on.

In fact, I was always confused about one aspect of CAPTCHAs. I thought that a Turing test is, by definition, administered by a human, so a “completely-automated Turing-test” is an oxymoron, something like a “liberal conservative”. An unbreakable authentication system based on Turing tests should rely fully on human computation: humans should also be at the end that generates the tests. Let humans come up with questions, using references to images, web site content, and whatever else they can think of.  Then match these to other humans who can gain access to a web service by solving the riddles. Perhaps the tests should also be somehow rated, lest the simple act of logging in turns into an absurd treasure hunt. I’m not exactly sure if and how this could be turned into an addictive game, but I’ll leave that to the experts.  The idea is too obvious to miss anyway.

CAPTCHAs, even in their current form, have led to numerous contributions.  A non-exclusive list, in no particular order:

1. They have a catchy name. That counts a lot. Seriously. I’m not joking; if you don’t believe me, repeat out loud after me: “I have no idea what ‘onomatopoeia’ is—I’d better MSN-Live it” or “… I’d better Yahoo it.”  Doesn’t quite work, does it?
2. They popularized an idea which, even if not entirely new, was made accesible to webmasters the world over, and is now used daily by thousands if not millions of people.  What greater measure of success can you think of for a technology?
3. Sowed the seeds for Luis von Ahn’s viral talk on human computation, which has featured in countless universities, companies and conferences.  Although not professionally designed, the slides’ simplicity matches their content in a Jobs-esque way. As for delivery and timing, Steve might even learn something from this talk (although, in fairness, Steve Jobs probably doesn’t get the chance to introduce the same product hundreds of times).

So is anyone really surprised that the race for smarter tests and authentication mechanisms has not ended, and probably never will? (Incidentally, the lecture video above is from 2006, over three years after the first CAPTCHAs were succesfully broken by another computer program—see also CVPR 2003 paper—.  There are no silver bullets, no technology is perfect, but some are really useful. Perhaps CAPTCHAs are, to some extent, victim of their own hype which, however, is instrumental and perhaps even necessary for the wide adoption of any useful technology.  I’m pretty sure we’ll see more elaborate tests soon, not less.

## Web science: what and how?

From the article “Web Science: An Interdisciplinary Approach to Understanding the Web” in the July issue CACM (which, by the way, looks quite impressive after the editorial overhaul!):

At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. […] The macro system, that is, the use of the micro system by many users interacting with one another in often-unpredicted ways, is far more interesting in and of itself and generally must be analyzed in ways that are different from the micro system. […] The essence of our understanding of what succeeds on the Web and how to develop better Web applications is that we must create new ways to understand how to design systems to produce the effect we want.  The best we can do today is design and build in the micro, hoping for the best, but how do we know if we’ve built in the right functionality to ensure the desired macroscale effects? How do we predict other side effects and the emergent properties of the macro? […] Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology and economics.

This is a noble goal indeed. The Wikipedia article on sociology sounds quite similar on many aspects:

Sociologists research macro-structures and processes that organize or affect society […] And, they research micro-processes […] Sociologists often use  quantitative methods—such as social statistics or network analysis—to investigate the structure of a social process or describe patterns in social relationships. Sociologists also often use qualitative methods—such as focused interviews, group discussions and ethnographic methods—to investigate social processes.

First, we have to keep in mind that the current Western notion of “science” is fairly recent.  Furthermore, it has not always been the case that technology follows science. As an example, in the book “A People’s History of Science” by Clifford Conner, one can find the following quotation from Gallileo’s Two New Sciences, about Venice’s weapons factory (the Arsenal):

Indeed, I myself, being curious by nature, frequently visit this place for the mere pleasure of observing the work of those who, on account of their superiority over other artisans, we call “first rank men.” Conference with them has often helped me in the investigation of certain effects, including not only those which are striking, but also those which are recondite and almost incredible.

Later on, Conner says (p.284), quoting again Gallileo himself from the same source:

[Gallileo] demonstrated mathematically that “if projectiles are fired … all having the same speed, but each having a different elevation, the maximum range … will be obtained when the elevation is 45°: the other shots, fired at angles greater or less will have a shorter range. But in recounting how he arrived at that conclusion, he revealed that his initial inspiration came from discussions at the Arsenal: “From accounts given by gunners, I was already aware of the fact that in the use of cannons and mortars, the maximum range, that is the one in which the shot goes the farthest, is obtained when the elevation is 45°.” Although Gallileo’s mathematical analysis of the problem was a valuable original contribution, it did not tell workers at the Arsenal anything htey had not previously learned by empirical tests, and had little effect on the practical art of gunnery.

In any case, facilitating “technology” or “engineering” is certainly not the only good reason to pursue scientific knowledge. Conversely, although “pure science” certainly has an important role, it is not the only ingredient of technological progress (something I’ve alluded to in a previous post about, essentially, the venture capital approach to research).  Furthermore, some partly misguided opinions about the future of science have brightly shot through the journalistic sphere.

However, if, for whatever reason, we decide to go the way of science (a worthy pursuit), then I am reminded of the following interview of Richard Feynman by the BBC in 1981 (full programme):

Privacy concerns notwithstanding, the web gives us unprecedented opportunities to collect measurements in quantities and levels of detail that simply were not possible when the venerable state-of-the-art involved, e.g., passing around written notes among a few people. So, perhaps we can now check hypotheses more vigorously and eventually formulate universal laws (in the sense of physics).  Perhaps the web will allow us to prove Feynman wrong.

I’m not entirely convinced that it is possible to get quantitative causal models (aka. laws) of this sort. But if it is, then we need an appropriate experimental apparatus for large-scale data analysis to test hypotheses—what would be, say, the LHC-equivalent for web science?  (Because, pure science seems to have an increasing need for powerful apparatuses.) I’ll write some initial thoughts and early observations on this in another post.

I’m pretty sure that my recent posts have been doing circles around something, but I’m not quite sure yet what that is.  In any case, all this seems an interesting direction worth pursuing.  Even though Feynman was sometimes a very hard critic, we should pehaps remember his words along the way.