Web science: what and how?

From the article “Web Science: An Interdisciplinary Approach to Understanding the Web” in the July issue of CACM (which, by the way, looks quite impressive after the editorial overhaul!):

At the micro scale, the Web is an infrastructure of artificial languages and protocols; it is a piece of engineering. […] The macro system, that is, the use of the micro system by many users interacting with one another in often-unpredicted ways, is far more interesting in and of itself and generally must be analyzed in ways that are different from the micro system. […] The essence of our understanding of what succeeds on the Web and how to develop better Web applications is that we must create new ways to understand how to design systems to produce the effect we want.  The best we can do today is design and build in the micro, hoping for the best, but how do we know if we’ve built in the right functionality to ensure the desired macroscale effects? How do we predict other side effects and the emergent properties of the macro? […] Given the breadth of the Web and its inherently multi-user (social) nature, its science is necessarily interdisciplinary, involving at least mathematics, CS, artificial intelligence, sociology, psychology, biology and economics.

This is a noble goal indeed. The Wikipedia article on sociology sounds quite similar in many respects:

Sociologists research macro-structures and processes that organize or affect society […] And, they research micro-processes […] Sociologists often use  quantitative methods—such as social statistics or network analysis—to investigate the structure of a social process or describe patterns in social relationships. Sociologists also often use qualitative methods—such as focused interviews, group discussions and ethnographic methods—to investigate social processes.

First, we have to keep in mind that the current Western notion of “science” is fairly recent.  Furthermore, it has not always been the case that technology follows science. As an example, in the book “A People’s History of Science” by Clifford Conner, one can find the following quotation from Galileo’s Two New Sciences, about Venice’s weapons factory (the Arsenal):

Indeed, I myself, being curious by nature, frequently visit this place for the mere pleasure of observing the work of those who, on account of their superiority over other artisans, we call “first rank men.” Conference with them has often helped me in the investigation of certain effects, including not only those which are striking, but also those which are recondite and almost incredible.

Later on, Conner says (p.284), again quoting Galileo himself from the same source:

[Galileo] demonstrated mathematically that “if projectiles are fired … all having the same speed, but each having a different elevation, the maximum range … will be obtained when the elevation is 45°: the other shots, fired at angles greater or less will have a shorter range.” But in recounting how he arrived at that conclusion, he revealed that his initial inspiration came from discussions at the Arsenal: “From accounts given by gunners, I was already aware of the fact that in the use of cannons and mortars, the maximum range, that is the one in which the shot goes the farthest, is obtained when the elevation is 45°.” Although Galileo’s mathematical analysis of the problem was a valuable original contribution, it did not tell workers at the Arsenal anything they had not previously learned by empirical tests, and had little effect on the practical art of gunnery.

In any case, facilitating “technology” or “engineering” is certainly not the only good reason to pursue scientific knowledge. Conversely, although “pure science” certainly has an important role, it is not the only ingredient of technological progress (something I’ve alluded to in a previous post about, essentially, the venture capital approach to research).  Furthermore, some partly misguided opinions about the future of science have brightly shot through the journalistic sphere.

However, if, for whatever reason, we decide to go the way of science (a worthy pursuit), then I am reminded of the following interview of Richard Feynman by the BBC in 1981 (full programme):

Privacy concerns notwithstanding, the web gives us unprecedented opportunities to collect measurements in quantities and levels of detail that simply were not possible when the venerable state-of-the-art involved, e.g., passing around written notes among a few people. So, perhaps we can now check hypotheses more vigorously and eventually formulate universal laws (in the sense of physics).  Perhaps the web will allow us to prove Feynman wrong.

I’m not entirely convinced that it is possible to get quantitative causal models (a.k.a. laws) of this sort. But if it is, then we need an appropriate experimental apparatus for large-scale data analysis to test hypotheses. What would be, say, the LHC equivalent for web science?  (After all, pure science seems to have an increasing need for powerful apparatus.) I’ll write some initial thoughts and early observations on this in another post.

I’m pretty sure that my recent posts have been circling around something, but I’m not quite sure yet what that is.  In any case, all this seems an interesting direction worth pursuing.  Even though Feynman was sometimes a very harsh critic, we should perhaps remember his words along the way.


The blessing of dimensionality

The cover story in Wired by Chris Anderson, “The End of Theory”, relies on a silent assumption, which may be obvious but is still worth stating. The reason that such a “petabyte approach” works is that reality occupies only a tiny fraction of the space of all possibilities.  For example, the human genome consists of about three billion base pairs.  However, not every billion-length string of four symbols corresponds to a viable organism, much less an existing one or a human individual.  In other words, the intrinsic dimensionality of your sample (the human population) is much smaller than the raw dimensionality of the possibilities (about 4^3,000,000,000 strings).
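To get a feel for how lopsided that comparison is, here is a rough back-of-the-envelope sketch. It works with logarithms, since the number of possible strings is far too large to write out. The population bound of 10^10 is my own generous assumption, not a figure from the article, and the function name is purely illustrative.

```python
import math

def log10_sequence_count(length, alphabet_size=4):
    """Return log10 of the number of distinct strings of the given length."""
    return length * math.log10(alphabet_size)

GENOME_LENGTH = 3_000_000_000  # "about three billion base pairs", as in the text
POPULATION = 10**10            # assumption: generous upper bound on people alive

log10_space = log10_sequence_count(GENOME_LENGTH)

# Number of decimal digits needed to write down 4^GENOME_LENGTH.
decimal_digits = math.floor(log10_space) + 1

# log10 of the largest fraction of the space that the population could occupy,
# even if every individual had a distinct genome.
log10_fraction_observed = math.log10(POPULATION) - log10_space

print(f"4^{GENOME_LENGTH:,} has about {decimal_digits:,} decimal digits")
print(f"log10(fraction of the space occupied) <= {log10_fraction_observed:,.0f}")
```

The count of possibilities has well over a billion digits, while the population fits in eleven, so whatever structure we observe lives in a vanishingly small corner of the raw space.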

I won’t try to justify “traditional” models. But I also wouldn’t go so far as to say that models will disappear, just that many will be increasingly statistical in nature. If you can throw the dice a large enough number of times, it doesn’t matter whether “God” plays them or not.  The famous quote by Einstein suggests that quantum mechanics was originally seen as a cop-out by some: we can’t find the underlying “truth”, so we settle for probability distributions for position and momentum.  However, this was only the beginning.

Still, we need models.  Going back to the DNA example, I suspect that few people model the genome as a single, huge, billion-length string.  That is not a very useful random variable.  Chopping it up into pieces with different functional significance and coming up with the appropriate random variables, so one can draw statistical inferences, sounds very much like modeling to me.

Furthermore, hypothesis testing and confidence intervals won’t go away either.  After all, anyone who has taken a course in experimental physics knows that repeating a measurement and calculating confidence intervals based on multiple data points is a fundamental part of the process (and also the main motivating force in the original development of statistics).  Now we can collect petabytes of data points.  Maybe there is a shift in balance between theory (in the traditional, Laplacian sense, which I suspect is what the article really refers to) and experiment.  But the fundamental principles remain much the same.
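As a reminder of what that lab-course routine looks like, here is a minimal sketch. It simulates repeated noisy measurements of a hypothetical quantity and computes an approximate 95% confidence interval for the mean. The true value, the noise level and the normal approximation (z = 1.96) are all assumptions made for illustration.

```python
import math
import random
import statistics

def confidence_interval(samples, z=1.96):
    """Approximate 95% CI for the mean (normal approximation, many samples)."""
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - z * stderr, mean + z * stderr

random.seed(0)
TRUE_VALUE, NOISE = 9.81, 0.5  # hypothetical quantity and measurement noise

for n in (100, 10_000):
    # Each "measurement" is the true value plus Gaussian noise.
    samples = [random.gauss(TRUE_VALUE, NOISE) for _ in range(n)]
    low, high = confidence_interval(samples)
    print(f"n={n:>6}: [{low:.3f}, {high:.3f}]  width={high - low:.4f}")
```

More data points shrink the interval roughly as one over the square root of n, but the procedure itself is the same whether you have a hundred measurements or a petabyte of them.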

So, perhaps more is not fundamentally different after all, and we still need to be careful not to overfit.  I’ll leave you with a quote from “A Random Walk Down Wall Street” by Burton Malkiel (emphasis mine):

[…] it’s sometimes possible to correlate two completely unrelated events.  Indeed, Mark Hulbert reports that stock-market researcher David Leinweber found that the indicator most closely correlated with the S&P 500 Index is the volume of butter production in Bangladesh.
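The effect in the quote is easy to reproduce with purely synthetic data. The sketch below is my own illustration, not Leinweber’s analysis: it generates one random “index” and then searches through many unrelated random walks for the best-correlated “indicator”. All names and parameters are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def random_walk(rng, steps):
    """Cumulative sum of independent Gaussian steps."""
    level, path = 0.0, []
    for _ in range(steps):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path

def best_spurious_correlation(num_candidates, steps=40, seed=0):
    """Largest |correlation| between a target walk and unrelated candidates."""
    rng = random.Random(seed)
    target = random_walk(rng, steps)
    return max(abs(pearson(target, random_walk(rng, steps)))
               for _ in range(num_candidates))

for k in (1, 10, 1000):
    print(f"{k:>5} candidates: best |r| = {best_spurious_correlation(k):.3f}")
```

Every candidate is independent of the target by construction, yet the best match improves as the number of candidates grows. Searching a large enough haystack will always turn up some butter production in Bangladesh.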

Dimensionality may be a blessing, but it can still be a curse sometimes.
