I have stopped worrying what can be inferred about me, because I’ve accepted the simple fact that, given enough time (data) and resources, anything can be inferred. Consider, as an example, “location privacy.” A number of approaches rely on adaptively coarsening the detail of reported location (using all sorts of criteria to decide detail, from mobility patterns, to spatial query workload characteristics, etc). For example, instead of revealing my exact location, I can reveal my location at a city-block level. In an area like NYC, this would conflate me with hundreds of other people that happen to be on the same block, but a block-level location is still accurate enough to be useful (e.g., for finding nearby shops and restaurants). This might work if I’m reporting my location just once. However, if I travel from home to work, then my trajectory over a few days, even at a city-block granularity, is likely sufficient to distinguish me from other people. I could perhaps counter this by revealing my location at a city-level or state-level. Then a few days worth of data might not be enough to identify me. However, I often travel and data over a period of, say, a year, would likely be enough to identify me even if location detail is quite coarse. Of course, I could take things to the extreme and just reveal that “I am on planet Earth”. But that’s the same as not publishing my location, since this fact is true for everyone.
Every piece of content has a creator and owner (in this post, I will assume they are by default the same entity). I do not mean ownership in the traditional sense of, e.g., stashing a piece of paper in a drawer, but in the metaphysical sense that each artifact is forever associated with one or more “creators.”
This is certainly true of the end-products of intellectual labor, such as the article you are reading. However, it is also true of more mundane things, such as checkbook register entries or credit card activity. Whenever you pay a bill or purchase an item, you implicitly “create” a piece of content: the associated entry in your statement. This has two immediately identifiable “creators”: the payer (you) and the payee. The same is true for, e.g., your email, your IM chats, your web searches, etc. Interesting tidbit: over 20% of search terms entered daily in Google are new, which would imply roughly 20 million new pieces of content per day, or over 7 billion (over twice the earth’s population) per year—all this from just one activity on one website.
When I spend a few weeks working on, say, a research paper, I have certain expectations and demands about my rights as a “creator.” However, I give almost no thought to my rights on the trail of droppings (digital or otherwise) that I “create” each day, by searching the web, filling up the gas tank, getting coffee, going through a toll booth, swiping my badge, and so on. However, with the increasing ease of data collection and distribution in digital form, we should re-think our attitudes towards “authorship”.
Many discussions about privacy these days obsess over the shifting balance between public and private channels of information, while missing the real issues and opportunities.
The information landscape is unquestionably changing. We are experiencing the emergence and rapid proliferation of social media, such as instant messaging (e.g., IRC, Jabber et al., AIM, MSN, Skype), sharing sites (e.g., Flickr, Picasa, YouTube, Plaxo), blogs (e.g., Blogger, WordPress, LiveJournal) and forums (e.g., Epinions), wikis (e.g., Wikipedia, PBWiki), microblogs (e.g., Twitter, Jaiku), social networks (e.g., MySpace, Facebook, Ning), and so on. Also, much financial information (e.g., your bank’s website or Quicken) as well as health records are or soon will be online.
A rather obvious distinction is between public vs. private channels of information or content:
- In public channels, the default policy on data sharingis “opt-in”.
- In private channels, the default is “opt-out” (along with some, hopefully enforceable, guarantees that this is the case).
Most people, at least of a certain age, take the former for granted. However, this is changing. Just a couple of decades back, schoolchildren would keep journals (you know, those with a locket and “Hello Kitty” or “Transformers” on the cover). These days they are on MySpace and Twitter, and they do not assume “opt-out” is the default. Quoting from the article “The Talk of Town: You” (subscriber-only access) in the MIT Technology Review:
New York‘s reporter made a big deal about how “the kids” made her “feel very, very old.” Not only did they casually accept that the record of their lives could be Googled by anyone at any time, but they also tended to think of themselves as having an audience. Some even considered their elders’ expectations about privacy to be a weird, old-fogey thing—a narcissistic hang-up.
Said differently, an increasing fraction of content is produced in public, rather than private channels and “opt-in” is becoming the norm rather than the exception. Social aggregation sites, such as Profilactic, are a step towards easy access to this corpus. Despite some alarmism about blogs, Twitter, MySpace profiles, etc, all this information is, by definition, in public channels. Perhaps soon 99% of information will be in public channels.
So, which information channels should be perceived as public? Many people have a knee-jerk reaction when it comes to thinking of what should be private. For example, this blog is clearly a public channel. But how about your health records? In an interesting opinion about making health records public, most commenters’ expressed a fear of being denied health coverage by an insurance company. However, this is more an indication of a broken healthcare system, than of a problem with making this data public. Most countries (the U.S. included) are behind in this area, but others (such as the Scandinavians or Koreans) are making important steps forward. Now, how about your financial records? For example, credit reporting already relies on aggregation and analysis of publicly available data. How about your company’s financial records? Or how about your phonecall records? Or your images captured by surveillance cameras? The list can go on forever.
We should avoid that knee-jerk reaction and carefully consider what can be gained by moving to public channels, as well as what technology and regulation is required to make this work. The benefits can be substantial; for example, the success of the open source movement is largely due to switching to public, transparent channels of communication, as well as open standards. Openness is usually a good thing.
Even in the enterprise world of grownups, tools such as SmallBlue (aka. Atlas) are effectively changing the nature of intra-company email from a private to a (partially) public channel. The alternative would be to establish new public channels and favor their use over the older, “traditional” (and usually private) channels. Both approaches are equivalent.
Moreover, how should we deal with the information in private channels? The danger with private channels arises when privacy is breached. If that happens, not only do you get a false sense of security when you have none, but you may also have a very hard time proving that it happened. However, the notion itself of a “breach” in public channels is clearly meaningless. In that sense, public channels are a safer option and should be carefully considered.
Even when the data itself is private, who is accessing it and for what purpose should be public information. The MIT TR article continues to mention David Brin’s opinion that
“[…] our only real choice is between a society that offers the illusion of privacy, by restricting the power of surveillance to those in power, and one where the masses have it too.”
The need for full transparency on data how they are used is more pressing than ever. Ensuring that individuals’ rights are not violated requires less secrecy, not more. A recent CACM article by a gang of CS authority figures makes a similar case (although their proposal for an ontology-based heavyweight scheme for all data out there is somewhat dubious; it might make sense for the 1% niche of sensitive data, though). Interestingly, one of their key examples is essentially about health records and they also come to the same conclusion, i.e., that the problem is inappropriate use of the data.
I actually look forward to the day I’ll be able to type “creator:email@example.com” on Google (as well as any other search engine) and find all the content that I ever produced. And going one step beyond that, also find the “list of citations” (i.e., all the content that referenced or used my data), like I can find for my research papers on Google scholar, or for posts on this blog with trackbacks. Although I cannot grasp all the implications, it would at least mean we’ve addressed most of these issues and the world is a more open, democratic place. McLuhan’s notion of the global village is more relevant than ever, but his doom and gloom is largely misplaced; let’s focus on the positive potential instead.