Thingiverse remix graph: visualizing the net of physical things

I recently became a happy owner of a Solidoodle 2 3D printer. This has been the start of a beautiful addiction, but more on the hardware hacking aspects in another post.

If you haven’t heard of it before, 3D printing refers to a family of manufacturing methods, originally developed for rapid prototyping, the first of which appeared almost three decades ago. Much like mainframe computers in the 1960s, professional 3D printers cost up to hundreds of thousands of dollars. Starting with the RepRap project a few years ago, home 3D printers are now becoming available, in the few-hundred to couple-of-thousand dollar price range. For now, these are targeted mostly at tinkerers, much closer to an Altair or, at best, an Apple II than to a MacBook. Despite the hype that currently surrounds 3D printing, empowering average users to turn bits into atoms (and vice versa) will likely have profound effects, similar to those witnessed when content (music, news, books, etc.) went digital, as Chris Anderson eloquently argues with his usual captivating, dramatic flair. Personally, I’m as excited about this as I was about “big data” (for lack of a better term) around 2006 and mobile around 2008, so I’ll take this as a good sign. :)

One of the key challenges, however, is finding things to print! This is crucial for 3D printing to really take off. Learning CAD software and successfully designing 3D objects takes substantial time, effort, and skill. Affordable 3D scanners (like the ones from Matterform, CADscan, and MakerBot) are beginning to appear. However, the most common way to find things is via online sharing of designs. Thingiverse is the most popular online community for “thing” sharing. Thingiverse items are freely available (usually under Creative Commons licenses), but there is also commercial potential: companies like Shapeways offer both manufacturing (using industrial 3D printers and manual post-processing) and marketing services for “thing” designs.

I’ve become a huge fan of Thingiverse. You can check out my own user profile to find things that I’ve designed myself, or things that I’ve virtually “collected” because I thought they were really cool or useful (or both). Thingiverse is run by MakerBot, which manufactures and sells 3D printers and therefore needs to help people find things to print. It’s a social networking site centered around “thing” designs. Consequently, the main entities are people (users) and things, and links/relationships revolve around people creating things, people liking things, people downloading and making things, people virtually collecting things, and so on. Besides people-thing relationships, links can also represent people following other people (à la Twitter or Facebook), and things remixing other things (more on this soon). Each thing also has a number of associated files (polygon meshes for 3D printing, vector paths for lasercutting, original CAD files—anything that’s needed to make the thing).

The data is quite rich and interesting. I chose to start with the remix relationships. When a user uploads a new design, they can optionally enter one or more things that their design “remixes”. In a sense, a remix is similar to a citation, and it conflates a few related meanings. It can indicate an original source of inspiration; e.g., I see a design for 3D-printable chainmail and decide that I could use a similar link shape and pattern to make a chain link bracelet. I could design the bracelet from scratch, using just the chainmail idea, or perhaps I could download the original chainmail CAD files (if their creator made them available) and re-use part of the code/design. A remix could also indicate partial relatedness: I download and make a 3D printer (yes, it’s possible, if you have the time—or, in this case, you can buy it instead) and decide to design a small improvement to a part. Finally, a remix may indicate use of a component library (e.g., for embossed text, gears, electronic components, and much more).

Remix links can also be created automatically by apps. Like any good social networking platform, Thingiverse also has an API for 3rd-party apps. The most popular Thingiverse app is the Customizer: anyone who can write parametric CAD designs may upload them and allow other users to create custom instances of the general design by choosing specific parameter values (which can be dimensions, angles, text or photos to emboss, etc.). For example, the customizable iPhone case allows you to choose your iPhone model, the case thickness, and the geometric patterns on the back. Another popular parametric design is the wall plate customizer, which allows you to choose the configuration of various cutouts (for power outlets, switches, Ethernet jacks, etc.) and print a custom-fit wallplate. A parametric design is essentially a simple computer program that describes a solid shape (via constructive solid geometry and extrusion operators). The Customizer will execute this program and render the solid on a server, generating a new thing, which will automatically have a remix link to the original parametric design.
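To make the “simple computer program” point concrete, here is a toy sketch of the idea (my own illustration, not the Customizer’s actual machinery; the shape and dimensions are made up): a small Python function that emits OpenSCAD-style source for a plate with a round cutout. A “customized” instance is just this program evaluated with the parameter values a user picks.

```python
# Toy illustration of a parametric design: parameters in, solid description out.
# (Hypothetical example; not Thingiverse's Customizer code.)

def plate_with_hole(width_mm=70.0, height_mm=115.0, thickness_mm=3.0, hole_r_mm=17.0):
    """Return OpenSCAD source for a rectangular plate with one circular cutout."""
    return f"""
difference() {{
    cube([{width_mm}, {height_mm}, {thickness_mm}]);            // plate body
    translate([{width_mm / 2}, {height_mm / 2}, -1])
        cylinder(r={hole_r_mm}, h={thickness_mm + 2}, $fn=64);  // the cutout
}}
"""

if __name__ == "__main__":
    # Picking different parameter values yields a different, derived "thing".
    print(plate_with_hole(width_mm=80.0, hole_r_mm=20.0))
```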

So let’s get back to the remix relationship. While I was waiting for my 3D printer to arrive, I spent some time browsing Thingiverse. I noticed that I was frequently following remix hyperlinks to find related things, but following a trail was getting tedious and I was losing track. So, I decided to make something that gives a bird’s-eye view of those relationships. What are people creating, and how are they reusing both ideas and their digital representations? Last week I hacked together a visualization (using the absolutely amazing D3 libraries) to begin answering this question. Here is the largest connected component of the remix graph, which consists of about 3,500 things (nodes). If you think about it, it’s pretty amazing: more than 5% of the things (or at least those in my data) are somehow related to one another. It may not seem like much at first, but check out the variety of things and you’ll see some pretty unexpected relationships (direct or indirect).
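In case you’re curious how the components are computed, a minimal sketch follows (this is not the actual code behind the visualization; the remix_pairs list stands in for edge data pulled from Thingiverse, e.g. by crawling thing pages).

```python
# Hedged sketch: build the remix graph from (source, derived) pairs and find
# its connected components. The pairs below are placeholders for real data.
import networkx as nx

remix_pairs = [(1234, 5678), (1234, 9012), (5678, 3456)]  # "derived remixes source"

G = nx.DiGraph()
G.add_edges_from(remix_pairs)

# Connectivity is defined on the undirected version of the graph.
components = sorted(nx.connected_components(G.to_undirected()), key=len, reverse=True)
largest = G.subgraph(components[0])
print(f"{G.number_of_nodes()} things total; largest component has "
      f"{largest.number_of_nodes()} things")
```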

Remix graph: largest connected component

Clicking on the hyperlink or the image above will take you to an interactive visualization (if you’re on an iPad, you may want to grab your laptop for this component; D3 is pretty darn fast, but 3,500 nodes on an iPad is pushing it a bit :).  You can click-drag (on a blank area) to pan, and turn your scroll wheel (or two-finger scroll on a touchpad, or pinch on an iPad) to zoom. Nodes in red are things that a site editor/curator chose to feature on Thingiverse. Each featured thing is prominently displayed on the site’s frontpage for about a day. Graph links (edges) are directed and represent remix relationships (from a source thing to a derived thing).  If you mouse over a node, you’ll see some basic information in a tooltip, and outgoing links (i.e., links to things inspired or derived from that node) will be highlighted in green, whereas incoming links will be highlighted in orange. You can open the corresponding Thingiverse page to check out a thing by clicking on a graph node.  Finally, on the right-hand panel you can tweak a few more visualization parameters, or choose another connected component of the remix graph.

Before moving on to other components, a few remarks on the graph above. Although cycles are conceivable (I see thing X and it inspires me to remix it into thing Y, then the creator of X sees the remix action in his newsfeed, checks out my thing, and incorporates some of my ideas back into X, adding a remix link annotation in the process), it seems that this never happens: the remix graph (or at least the fraction in this visualization, which is substantial) is, in practice, a DAG (directed acyclic graph). Next, many of the large star subgraphs are centered around customizable (parametric) things; for example, the large star on the left is the iPhone case (noted above) and its variations. Most of the remixes are simple instances of the parametric design, but some sport more involved modifications (e.g., cases branded with company logos). However, not all stars fall in this category. For example, the star graph with many red nodes near the bottom left is centered around a 3D scan of Stephen Colbert, made on the set of the show. This has inspired many remixes, into things like ColberT-Rex, or Cowbert. Most of these remixes have one parent node, but some combine more than one 3D model; for example, a cross between Colbert and the Stanford bunny is the Colberabbit, and a cross between Colbert and an octopus is Colberthulu. The original Colbert scan and most of its remixes were featured on Thingiverse’s frontpage (apparently the site editors are huge Colbert fans?).
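Both observations are easy to check programmatically; a quick sketch, assuming the DiGraph G built in the earlier snippet:

```python
# Verify the "no cycles in practice" observation and find the star hubs
# (parametric designs and popular scans show up as high out-degree nodes).
import networkx as nx

print("Remix graph is a DAG:", nx.is_directed_acyclic_graph(G))

out_degrees = dict(G.out_degree())
hubs = sorted(out_degrees.items(), key=lambda kv: kv[1], reverse=True)[:10]
print("Top remix sources (thing_id, direct remixes):", hubs)
```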

So, anyway, how about the other connected components? The distribution of component sizes follows a power law (again, click on the image for an interactive plot—singleton components are not included), no surprises here:

Connected component size distribution (log-log; click for interactive version)
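The distribution itself takes only a few lines to compute and plot; a sketch, again assuming the DiGraph G from above:

```python
# Count components of each size (singletons excluded) and plot on log-log axes.
from collections import Counter
import matplotlib.pyplot as plt
import networkx as nx

sizes = [len(c) for c in nx.connected_components(G.to_undirected()) if len(c) > 1]
size_counts = Counter(sizes)

xs, ys = zip(*sorted(size_counts.items()))
plt.loglog(xs, ys, "o")
plt.xlabel("component size (number of things)")
plt.ylabel("number of components")
plt.show()
```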

Components beyond the giant one are also interesting (as always, click on each image for the interactive visualization).  For example, the component on the left below consists of things inspired by a 3D-printable chainmail design, which also includes things like link bracelets, etc.  The component on the right contains various designs for… catapults!

Component 12: Chainmail inspired things
Component 13: Catapults

Some components contain pretty useful stuff, such as the one with items for kids’ parties (e.g., coasters, cookie cutters) — on the left. Since many people in the community are tinkerers, there are many 3D-printable parts for… 3D printers! An example is the component on the right, which is centered around the design files for the original MakerBot Replicator, with related items (like covers and other modifications) around it.

Component 23: Kid's party stuff
Component 18: MakerBot Replicator

Other components contain cool, geeky things, such as the small but well-featured component on the left, with figures and items from the Star Wars universe (including Darth Vader, as well as Yoda, remixed into a “gangsta” and other things).  Finally, not all components consist of 3D-printable things.  The component on the right has designs for lasercutting plywood so it can be folded, which was remixed into book covers, Kindle covers, and other things:

Component 17: Star Wars
Component 34: Foldable lasercut wood designs

All this is just a fraction of what’s out there. Thingiverse is also growing at an amazing pace: around March, when I collected some of this data, there were about 60,000 things and now there are over 100,000 things (the latter number is based simply on what appear to be linearly assigned thing IDs). That’s a roughly two-thirds increase in four months, or a doubling every five to six months; the exponential trend is going strong! This is quite impressive given the small (but fast-growing) size of the home 3D printer market.
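For what it’s worth, the back-of-the-envelope growth arithmetic behind those figures:

```python
# ~60,000 things in March, ~100,000 things about four months later.
import math

things_march, things_now, months = 60_000, 100_000, 4
monthly = (things_now / things_march) ** (1 / months) - 1   # ~13-14% per month
doubling = math.log(2) / math.log(1 + monthly)              # ~5-6 months
print(f"~{monthly:.0%} growth per month, doubling roughly every {doubling:.1f} months")
```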

Visualizing just the remix aspect of the Thingiverse is a start. For example, another thing I found myself doing when browsing Thingiverse is following indirect same-collection links (rather than direct remix links) to find related items. Once I get over gaping at the graph and all the stuff on Thingiverse (some of which I’ve printed on my Solidoodle), there are a few things to try in terms of data/graph properties as well as in terms of improving the visualization as a tool to browse the Thingiverse and discover interesting and useful things more effectively. If anyone is actually reading this :) and there is something you’d like to see, please chip in a comment.

Postscript: My favorite cluster among those I spotted in the visualization is probably the one related to Colbert (see above), with the Colberabbit (“a godless killing machine”) a particular favorite.  I’ll be printing one of those soon. :)


Data harvesting with MapReduce

Combine harvesters

“The combine harvester, [...] is a machine that combines the tasks of harvesting, threshing and cleaning grain crops.” If you have acres upon acres of wheat and want to separate the grain from the chaff, a group of combines is what you really want. If you have a bonsai tree and want to trim it, a harvester may be less than ideal.

MapReduce is like a pack of harvesters, well-suited for weeding through huge volumes of data residing on a distributed storage system. However, a lot of machine learning work is more akin to trimming bonsai into elaborate patterns. Conversely, it’s not uncommon to see trimmers used to harvest a wheat field. Well-established and respected researchers, as recently as this year, write in their paper “Planetary Scale Views on a Large Instant-messaging Network”:

We gathered data for 30 days of June 2006. Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day.

Doing the math, that’s five full days of processing to parse and compress the data on a beast of a machine. Even more surprisingly, I found this exact quote singled out among all the interesting results in the paper! Let me make clear that I’m not criticizing the study; in fact, both the dataset and the exploratory analysis are interesting in many ways. However, it is somewhat surprising that, at least among the research community, such a statement is still treated more like a badge of honor than an admission of masochism.
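For reference, the arithmetic behind that figure:

```python
# 12 hours to copy the logs, plus 4-5 hours of parsing per day of data, for 30 days.
copy_hours = 12
parse_hours_low, parse_hours_high = 4 * 30, 5 * 30           # 120-150 hours of parsing
print(f"parsing alone: {parse_hours_low / 24:.1f}-{parse_hours_high / 24:.1f} days")
print(f"end to end: up to {(copy_hours + parse_hours_high) / 24:.1f} days")
```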

The authors should be applauded for their effort. Me, I’m an impatient sod. Wait one day for the results, I think I can do that. Two days, what the heck. But five? For an exploratory statistical analysis? I’d be long gone before that. And what if I found a serious bug halfway down the road? That’s after more than two days of waiting, in case you weren’t counting. Or what if I decided I needed a minor modification to extract some other statistic? Wait another five days? Call me a Matlab-spoiled brat, but forget what I said just now about waiting one day. I changed my mind already. A few hours, tops. But we need a lot more studies like this. Consequently, we need the tools to facilitate them.

Hence my decision to frolic with Hadoop. This post focuses on exploratory data analysis tasks: the kind I usually do with Matlab or IPython/SciPy scripts, which involve many iterations of feature extraction, data summarization, model building and validation. This may be contrary to Hadoop’s design priorities: it is not intended for quick turnaround or interactive response times with modestly large datasets. However, it can still make life much easier.

Scale up on large datasets

First, we start with a very simple benchmark, which scans a 350GB text log. Each record is one line, consisting of a comma-separated list of key=value pairs. The job extracts the value for a specific key using a simple regular expression and computes the histogram of the corresponding values (i.e., how many times each distinct value appears in the log). The input consists of approximately 500M records and the chosen key is associated with about 130 distinct values.
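The original job was written against Hadoop’s Java API (version 0.16.2, see below), but the map/reduce logic is simple enough to sketch here in Hadoop Streaming style. This is a hedged illustration: the key name (“status”) and the exact log format are made up.

```python
#!/usr/bin/env python
# mapper.py -- extract the value of one key from each comma-separated record
# and emit <value, 1>. (Illustrative sketch; the real key name differs.)
import re
import sys

PATTERN = re.compile(r"(?:^|,)status=([^,]*)")

for line in sys.stdin:
    match = PATTERN.search(line)
    if match:
        # Hadoop Streaming splits key and value on the tab character.
        print(f"{match.group(1)}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per distinct value to produce the histogram.
# Streaming delivers lines sorted by key, so equal keys arrive contiguously.
import sys

current, count = None, 0
for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = key, 0
    count += int(value)
if current is not None:
    print(f"{current}\t{count}")
```

Such scripts would be launched with the Hadoop Streaming jar (passing them as -mapper and -reducer); the measurements below, however, come from the Java implementation, not from this sketch.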

Scalability: histogram

The plot above shows aggregate throughput versus number of nodes. HDFS and MapReduce cluster sizes are always equal, with HDFS rebalanced before each run. The job uses a split size of 256MB (or four HDFS blocks) and one reducer. All machines have a total of four cores (most Xeon, a few AMD) and one local disk. Disks range from ridiculously slow laptop-type drives (the most common type), to ridiculously fast SAS drives. Hadoop 0.16.2 (yes, this post took a while to write) and Sun’s 1.6.0_04 JRE were used in all experiments.

For such an embarrassingly parallel task, scaleup is linear. No surprises here, but it’s worth pointing out some numbers. As you can see from the plot, extracting simple statistics from this 350GB dataset took less than ten minutes with 39 nodes, down from several hours on one node. Without knowing the details of how the data were processed, if we assume similar throughput, then processing time of the raw instant messaging log could be roughly reduced from five days to just a few hours. In fact, when parsing a document corpus (about 1TB of raw text) to extract a document-term graph, we witnessed similar scale-up, going down from well over a day on a beast of a machine, to a couple of hours on the Hadoop cluster.
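The rough numbers behind that comparison, assuming the quoted figures and similar per-byte costs:

```python
# Aggregate and per-node throughput for the 350GB scan, and a naive projection
# onto the 4.5TB instant-messaging logs (indicative only; parsing is heavier work).
gb_scanned, minutes, nodes = 350, 10, 39
aggregate_mb_s = gb_scanned * 1024 / (minutes * 60)           # ~600 MB/s in aggregate
per_node_mb_s = aggregate_mb_s / nodes                        # ~15 MB/s per node
hours_for_4_5_tb = 4.5 * 1024 * 1024 / aggregate_mb_s / 3600  # ~2 hours vs. ~5 days
print(f"{aggregate_mb_s:.0f} MB/s aggregate, {per_node_mb_s:.0f} MB/s per node, "
      f"~{hours_for_4_5_tb:.1f} h for 4.5 TB")
```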

Hadoop is also reasonably simple to program with. Its main abstraction is natural, even if your familiarity with functional programming concepts is next to none. Furthermore, most distributed execution details are, by default, hidden: if the code runs correctly on your laptop (with a smaller dataset, of course), then it will run correctly on the cluster.

Single core performance

Linear scaleup is good, but how about absolute performance? I implemented the same simple benchmark in C++, using Boost for regex matching. For a rough measure of sustained sequential disk throughput, I simply cat the same large file to /dev/null.

I collected measurements from various machines I had access to: (i) a five year old Mini-ITX system I use with my television at home, running Linux FC8 for this experiment, (ii) a two year old desktop at work, again with FC8, (iii) my three year old Thinkpad running Windows XP and Cygwin, and (iv) a recent IBM blade running RHEL4.

Single core performance

The hand-coded version in C++ is about 40% faster on the older machines and 33% faster on the blade [Note: I'm missing the C++ times for my laptop, and its drive has crashed since then -- I was too lazy to reload the data and rerun everything, so I simply extrapolated from single-thread Hadoop assuming a 40% improvement, which seems reasonable enough for these back-of-the-envelope calculations]. Not bad, considering that Hadoop is written in Java and also incurs additional overheads to process each file split separately.

Perhaps I’m flaunting my ignorance but, surprisingly, this workload was CPU-bound and not I/O-bound—with the exception of my laptop, which has a really crappy 2.5″ drive (and Windows XP). Scanning raw text logs is a rather representative workload for real-world data analysis (e.g., AWK was built at AT&T for this purpose).

The blade has a really fast SAS drive (suspiciously fast, unless perhaps it runs at 15K RPM) and the results are particularly instructive. The drive reaches 120MB/sec sustained read throughput. Stated differently, the 3GHz CPU can only dwell on each byte for about 24 cycles on average, if it’s to keep up with the drive’s read rate. Even on the other machines, the break-even point is between 30 and 60 cycles [Note: The laptop drive seems to be an exception, but I wouldn't be so sure that at least part of the inefficiency isn't due to Cygwin].
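The break-even calculation is simple enough to write down explicitly (the exact figure depends on whether “MB” means 10^6 or 2^20 bytes):

```python
# Cycles the CPU can spend per byte before it, rather than the disk, becomes
# the bottleneck: clock rate divided by sustained read rate.
def breakeven_cycles_per_byte(cpu_hz, disk_bytes_per_s):
    return cpu_hz / disk_bytes_per_s

print(breakeven_cycles_per_byte(3.0e9, 120 * 2**20))   # ~23.8 for the blade's SAS drive
print(breakeven_cycles_per_byte(3.0e9, 60 * 2**20))    # ~47.7 for a slower desktop drive
```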

On the other hand, the benchmark throughput translates into 150-500 cycles per byte, on average. If I get the chance, I’d like to instrument the code with PAPI, validate these numbers and perhaps obtain a breakdown (into average cycles for regex state machine transition per byte, average cycles for hash update per record, etc). I would never have thought the numbers to be so high and I still don’t quite believe it. In any case, if we believe these measurements, at least 4-6 cores are needed to handle the sequential read throughput from a single drive!

The common wisdom in algorithms and databases textbooks, as far as I remember, was that when disk I/O is involved, CPU cycles can be more or less treated as a commodity. Perhaps this is an overstatement, but I didn’t expect it to be so off the mark.

This also raises another interesting question, which was the original motivation for measuring on a broad set of machines: what would be the appropriate cost-performance balance between CPU and disk for a purpose-built machine? I thought one might be able to get away with a setup similar to active disks: a really cheap and power-efficient Mini-ITX board, attached to a couple of moderately priced drives. For example, see this configuration, which was once used in the WayBack machine (I just found out that the VIA-based models have apparently been withdrawn, but the pages are still there for now). This does not seem to be the case.

The blades may be ridiculously expensive, perhaps even a complete waste of money for a moderately tech-savvy person. However, you can’t just throw together any old motherboard and hard disk, and magically turn them into a “supercomputer.” This is common sense, but some of the hype might have you believe the opposite.

Performance on smaller datasets

Once the original, raw data is processed, the representation of the features relevant to the analysis task typically occupies much less space. In this case, a bipartite graph extracted from the same 350GB text logs (the details don’t really matter for this discussion) takes up about 3GB, or two orders of magnitude less space.

Scalability: coclustering iteration

The graph shows aggregate throughput for one iteration of an algorithm similar to k-means clustering. This is fundamentally very similar to computing a simple histogram. In both cases, the output size is very small compared to the input size: the histogram has size proportional to the number of distinct values, whereas the cluster centers occupy space proportional to k. Furthermore, both computations iterate over the entire dataset and perform a hash-based group-by aggregation. For k-means, each point is “hashed” based on its distance to the closest cluster center, and the aggregation involves a vector sum.
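To make the analogy concrete, here is an in-memory sketch of one such iteration written as an explicit map and reduce step (the real job was a co-clustering variant running on Hadoop; this only shows the shape of the computation):

```python
# Hedged sketch of one k-means-style iteration as map/reduce: group points by
# nearest center ("hashing"), then aggregate with a vector sum per group.
from collections import defaultdict

def nearest(point, centers):
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def map_phase(points, centers):
    # Emit <center_index, (point, 1)> for each point.
    for point in points:
        yield nearest(point, centers), (point, 1)

def reduce_phase(pairs, dim):
    # Vector-sum the points assigned to each center, then divide by the counts.
    sums, counts = defaultdict(lambda: [0.0] * dim), defaultdict(int)
    for j, (point, one) in pairs:
        counts[j] += one
        for i, x in enumerate(point):
            sums[j][i] += x
    return {j: [s / counts[j] for s in sums[j]] for j in sums}

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (4.9, 5.0)]
centers = [(0.0, 0.0), (5.0, 5.0)]
print(reduce_phase(map_phase(points, centers), dim=2))  # updated cluster centers
```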

Nothing much to say here, except that the linear scaleup tapers off after about 10-15 nodes, essentially due to lack of data: the fixed per-split overheads start dominating the total processing time. Hadoop is not really built to process datasets of modest size, but fundamentally I see nothing to prevent MapReduce from doing so. More importantly, when the dataset becomes really huge, I would expect Hadoop to scale almost-linearly with more nodes.

Hadoop can clearly help pre-process the raw data quickly. Once the relevant features are extracted, they may occupy at least an order of magnitude less space. It may be possible to get away with single-node processing on the appropriate representation of the features, at least for exploratory tasks.  Sometimes it may even be better to use a centralized approach.

Summary

My focus is on exploratory analysis of large datasets, which is a prerequisite for the design of mining algorithms. Such tasks typically involve (i) raw data pre-processing and feature extraction stages, and (ii) model building and testing stages. Distributed data processing platforms and, in particular, Hadoop are well-suited for such tasks, especially the feature extraction stages. In fact, tools such as Sawzall (which is akin to AWK, but on top of Google’s MapReduce and protocol buffers) excel at the feature extraction and summarization stages.

The original, raw data may reside in a traditional database, but more often than not they don’t: packet traces, event logs, web crawls, email corpora, sales data, issue-tracking ticket logs, and so on. Hadoop is especially well-suited for “harvesting” those features out of the original data. In its present form, it can also help in model building stages, if the dataset is really large.

In addition to reducing processing time, Hadoop is also quite easy to use. My experience is that the programming effort compares very favorably to the usual approach of writing my own, quick Python scripts for data pre-processing. Furthermore, there are ongoing efforts for even further simplification (e.g., Cascading and Pig).

I was somewhat surprised by the CPU vs. I/O trade-offs for what I would consider real-world data processing tasks. Perhaps I was also influenced by the original work on active disks (one of the inspirations for MapReduce), which suggested using the disk controller to process data. However, there is a cross-over point for the performance of active disks versus centralized processing; I was way off with my initial guess on how much CPU power it takes for a reasonably low cross-over point (which is workload-dependent, of course, and any results herein should be treated as indicative rather than conclusive).

Footnote: For what it’s worth, I’ve put up some of the code (and hope to document it sometime). Also, thanks to Stavros Harizopoulos for pointing out the simple cycles-per-byte metric.


Data Mining: “I’m feeling lucky” ?

In an informal presentation on MapReduce that I recently gave, I included the following graphic, to summarize the “holy grail” of systems vs. mining:

Systems vs. Data mining

This was originally inspired by a quote that I read sometime ago:

Search is more about systems software than algorithms or relevance tricks.

How often do you click the “lucky” button, instead of “search”? Incidentally, I would be very interested in finding some hard numbers on this (I couldn’t)—but that button must exist for good reason, so a number of people must be using it. Anyway, I believe it’s a safe assumption that “search” gets clicked more often than “lucky” by most people. And when you click “search”, you almost always expect to get something relevant, even if not perfectly so.

In machine learning or data mining, the holy grail is to invent algorithms that “learn from the data” or that “discover the golden nugget of information in the massive rubble of data”. But how often have you taken a random learning algorithm, fed it a random dataset, and expected to get something useful? I’d venture a guess: not very often.

So it doesn’t quite work that way. The usefulness of the results is a function of both the data and the algorithm. That’s common sense: drawing any kind of inference involves both (i) making the right observations, and (ii) using them in the right way. I would argue that in most successful applications, it’s the data that takes center stage, rather than the algorithms. Furthermore, mining aims to develop the analytic algorithms, but systems development is what enables running those algorithms on the appropriate and, often, massive datasets. So, I do not see how the former makes sense without the latter. In research, however, we sometimes forget this, and simply pick our favorite hammer and clumsily wield it in the air, ignoring both (i) the data collection and pre-processing step, and (ii) the systems side.

It may be that “I’m feeling lucky” often hits the target (try it, you may be surprised). However, in machine learning and mining research, we sometimes shoot the arrow first, and paint the bullseye around it. There are many reasons for this, but perhaps one stands out. A well-known European academic (from way up north) once said that his government’s funding agency criticized him for succeeding too often. Now, that’s something rare!
