So long and thanks for all the fish

On a recent trip to Seattle, we were waiting for someone in front of Pike Place Fish Market. This is a stall that has been selling fish for over 40 years. Whenever someone buys a fish, the staff put on a show, throwing the fish around and chanting. I can see how this could work as a marketing trick: people want to see the show, so they end up buying more fish. They might also tell their friends, who also visit, and so on.

Well, it seems that’s not quite how it is these days. While we were waiting there, for over twenty minutes, there was a huge crowd with cameras of all sorts (I counted at least eight cameras on just one side of the stall), taking aim and ready to shoot the action. However, nobody was actually buying any fish! And why should they? Just search YouTube to find over a hundred videos, some much better than you would have ever taken yourself. As for your friends, well, you can just send them that link.

Participatory, interactive advertising requires other tricks. Perhaps the customers should be allowed to throw the fish themselves. But that would be messy. Instead, you can put your wares on the web, and reach a much wider audience. Live webcam? Check. Motivational books and t-shirts? Check. Free delivery of fish to your hotel room? Check (hmm…). Delivery via UPS anywhere in the US? Check.


The long tail of ideas

It’s not clear how you measure the size of an idea. Is it billions of generated revenue? Is it number of papers published? Number of citations? Brain-ounces (whatever that is)? But, let’s say that based on any or all of these measures, some ideas are big and some are small. The long tail applies here too: a few ideas make it big, but most of them remain small. But how do we find these big ideas?

I’ve heard the following piece of advice several times over the past few years, and more recently in a talk by one VP. It goes something like this: “Find an idea and ask yourself: is the potential market worth at least one billion [sic] dollars? If not, then walk away.” This is very similar to something I read in one of Seth Godin’s riffs, about a large consulting company recommending to a large book publisher that they should “only publish bestsellers.” They would, if they knew in advance which books would be bestsellers. But, in reality, this advice is simply absurd.

For example, who thought back in the 90’s that search would be so important, with search marketing worth about 10 billion and expected to exceed 80 billion within 10 years? Nobody, and perhaps following the above advice, projects such as CLEVER and its follow-up, WebFountain (which put a “business intelligence” spin on search), went nowhere. The only thing that went somewhere is the researchers; they moved to Google, Yahoo! and Microsoft.

When I once talked to someone from WebFountain, two things he said struck me and I still remember them after several years. First, the cost of crawling a reasonable fraction of the web and maintaining an index was quite small (a handful of machines, a T1 pipe and maybe one sysadmin). Second, it turns out that back then WebFountain received some flak for “starting small.”

In other words, the engineers seemed to recognize they were starting at the far end of the tail, and decided to put some wheels on their idea and see how to move it up towards the head, growing along the way. But management wanted more to justify the project. Four wheels is just a car, but how about four hundred? “Is this something really big or isn’t it? If it is, then why 10 machines and not 300? Why 5 people and not 50? Why just make 3 features that work, instead of designing and advertising 30 or 50?” As far as I can tell as an outsider, this is what happened, and such over-planning (combined, perhaps, with rather poor execution on the development side) did not lead to the expected results.

Perhaps such a mentality would have made sense a few decades ago, when computing power was far from a commodity, barriers to entry were large, and supercomputing was thriving. These days however, instead of asking “how much is this idea worth”, it’s better to ask “how much does it cost to try this idea?” and strive to make the answer “almost zero.” You don’t know in advance what will be big—that was always true. You should not start big because it’s not cost-efficient—that was not always possible, but it is today. Start big and you will likely end up small (like countless startups from the bubble-era); but start small and you may end up big. Google just appeared one day and did only one thing (search) for many years; Amazon sold only books; and so on.

Barriers to entry should not be made artificially high. Some companies seem to recognize this better than others (although this may be changing as they grow), and strive to provide an infrastructure, environment and culture that makes it easy to try out many new things by starting small and cheap. And other companies are enabling the masses to do the same.

I’m not saying that you should try every idea, even if it seems clearly unpromising. I’m also not saying that any idea can become big or that, once an idea becomes big, it will still cost zero to scale it up. But, technology, if properly used and combined with the right organizational structures, allows more ideas from the long tail to be tried out, at minimal cost. You’re expected to fail most of the time, but if the cost to try is near-zero, it doesn’t matter.

Animal abuse

The cafeteria served “Jamaican jerk chicken”—again. Horrible. Not only did they kidnap, slaughter, quarter and cook it, they’re also calling it names. Did the chicken really deserve this? Maybe I should become vegetarian.


Data Mining: “I’m feeling lucky” ?

In an informal presentation on MapReduce that I recently gave, I included the following graphic, to summarize the “holy grail” of systems vs. mining:

[Figure: Systems vs. data mining]

This was originally inspired by a quote that I read sometime ago:

Search is more about systems software than algorithms or relevance tricks.

How often do you click the “lucky” button, instead of “search”? Incidentally, I would be very interested in finding some hard numbers on this (I couldn’t)—but that button must exist for good reason, so a number of people must be using it. Anyway, I believe it’s a safe assumption that “search” gets clicked more often than “lucky” by most people. And when you click “search”, you almost always expect to get something relevant, even if not perfectly so.

In machine learning or data mining, the holy grail is to invent algorithms that “learn from the data” or that “discover the golden nugget of information in the massive rubble of data”. But how often have you taken a random learning algorithm, fed it a random dataset, and expected to get something useful? I’d venture a guess: not very often.

So it doesn’t quite work that way. The usefulness of the results is a function of both the data and the algorithm. That’s common sense: drawing any kind of inference involves both (i) making the right observations, and (ii) using them in the right way. I would argue that in most successful applications, it’s the data that takes center stage, rather than the algorithms. Furthermore, mining aims to develop the analytic algorithms, but systems development is what enables running those algorithms on the appropriate and, often, massive data sets. So, I do not see how the former makes sense without the latter. In research, however, we sometimes forget this, and simply pick our favorite hammer and clumsily wield it in the air, ignoring both (i) the data collection and pre-processing step, and (ii) the systems side.

It may be that “I’m feeling lucky” often hits the target (try it, you may be surprised). However, in machine learning and mining research, we sometimes shoot the arrow first, and paint the bullseye around it. There are many reasons for this, but perhaps one stands out. A well-known European academic (from way up north) once said that his government’s funding agency had criticized him for succeeding too often. Now, that’s something rare!


The shift from private to public channels of information

Many discussions about privacy these days obsess over the shifting balance between public and private channels of information, while missing the real issues and opportunities.

The information landscape is unquestionably changing. We are experiencing the emergence and rapid proliferation of social media, such as instant messaging (e.g., IRC, Jabber et al., AIM, MSN, Skype), sharing sites (e.g., Flickr, Picasa, YouTube, Plaxo), blogs (e.g., Blogger, WordPress, LiveJournal) and forums (e.g., Epinions), wikis (e.g., Wikipedia, PBWiki), microblogs (e.g., Twitter, Jaiku), social networks (e.g., MySpace, Facebook, Ning), and so on. Also, much financial information (e.g., your bank’s website or Quicken) as well as health records are or soon will be online.

A rather obvious distinction is between public vs. private channels of information or content:

  • In public channels, the default policy on data sharing is “opt-in”.
  • In private channels, the default is “opt-out” (along with some, hopefully enforceable, guarantees that this is the case).

Most people, at least of a certain age, take the former for granted. However, this is changing. Just a couple of decades back, schoolchildren would keep journals (you know, those with a lock and “Hello Kitty” or “Transformers” on the cover). These days they are on MySpace and Twitter, and they do not assume “opt-out” is the default. Quoting from the article “The Talk of Town: You” (subscriber-only access) in the MIT Technology Review:

New York’s reporter made a big deal about how “the kids” made her “feel very, very old.” Not only did they casually accept that the record of their lives could be Googled by anyone at any time, but they also tended to think of themselves as having an audience. Some even considered their elders’ expectations about privacy to be a weird, old-fogey thing—a narcissistic hang-up.

Said differently, an increasing fraction of content is produced in public, rather than private, channels and “opt-in” is becoming the norm rather than the exception. Social aggregation sites, such as Profilactic, are a step towards easy access to this corpus. Despite some alarmism about blogs, Twitter, MySpace profiles, etc., all this information is, by definition, in public channels. Perhaps soon 99% of information will be in public channels.

So, which information channels should be perceived as public? Many people have a knee-jerk reaction when it comes to thinking of what should be private. For example, this blog is clearly a public channel. But how about your health records? In an interesting opinion about making health records public, most commenters expressed a fear of being denied health coverage by an insurance company. However, this is more an indication of a broken healthcare system than of a problem with making this data public. Most countries (the U.S. included) are behind in this area, but others (such as the Scandinavians or Koreans) are making important steps forward. Now, how about your financial records? For example, credit reporting already relies on aggregation and analysis of publicly available data. How about your company’s financial records? Or how about your phone call records? Or your images captured by surveillance cameras? The list can go on forever.

We should avoid that knee-jerk reaction and carefully consider what can be gained by moving to public channels, as well as what technology and regulation are required to make this work. The benefits can be substantial; for example, the success of the open source movement is largely due to switching to public, transparent channels of communication, as well as open standards. Openness is usually a good thing.

Even in the enterprise world of grownups, tools such as SmallBlue (aka. Atlas) are effectively changing the nature of intra-company email from a private to a (partially) public channel. The alternative would be to establish new public channels and favor their use over the older, “traditional” (and usually private) channels. Both approaches are equivalent.

Moreover, how should we deal with the information in private channels? The danger with private channels arises when privacy is breached. If that happens, not only do you get a false sense of security when you have none, but you may also have a very hard time proving that it happened. However, the notion itself of a “breach” in public channels is clearly meaningless. In that sense, public channels are a safer option and should be carefully considered.

Even when the data itself is private, who is accessing it and for what purpose should be public information. The MIT TR article continues to mention David Brin’s opinion that

“[…] our only real choice is between a society that offers the illusion of privacy, by restricting the power of surveillance to those in power, and one where the masses have it too.”

The need for full transparency on data and how they are used is more pressing than ever. Ensuring that individuals’ rights are not violated requires less secrecy, not more. A recent CACM article by a gang of CS authority figures makes a similar case (although their proposal for an ontology-based heavyweight scheme for all data out there is somewhat dubious; it might make sense for the 1% niche of sensitive data, though). Interestingly, one of their key examples is essentially about health records and they also come to the same conclusion, i.e., that the problem is inappropriate use of the data.

I actually look forward to the day I’ll be able to type “” on Google (as well as any other search engine) and find all the content that I ever produced. And going one step beyond that, also find the “list of citations” (i.e., all the content that referenced or used my data), like I can find for my research papers on Google Scholar, or for posts on this blog with trackbacks. Although I cannot grasp all the implications, it would at least mean we’ve addressed most of these issues and the world is a more open, democratic place. McLuhan’s notion of the global village is more relevant than ever, but his doom and gloom is largely misplaced; let’s focus on the positive potential instead.

Hello world!

The reason for this site is pretty simple: it is one way to actively establish an identity in the “global village” (aka. Internet), which is something I have avoided doing for too long. Speaking of identity, this blog supports OpenID which, despite some weaknesses, is emerging as a long-needed standard. Page footers have links to my identity on other sites; when I find a satisfactory OpenID provider, perhaps those will go away.

Why, in the age of cloud computing, would I bother to set up relational databases, CGI binaries, and so on? I could just say “for the heck of it”, and it would be true. I like to see how things work first-hand; for example, I am (still) the kind of guy that actually prefers to run his own Hadoop instance on a handful of machines, rather than use e.g., EC2! Beyond that, I’m not sure I have a really solid answer.

But enough for a “hello world” post! Welcome, and I hope to see a few people around for things to follow.

