## Mobile OCR input: “Fully automatic” and reality

Recently I’ve been toying around with WordSnap OCR (project page, source code, app on Android Market), an app for OCR-based camera input on Android. In the process, I found out a few things about “smart” versus “fast”.

At least in data mining, “fully automatic” is an often unquestioned holy grail.  There are certainly several valid reasons for this, such as if you’re trying to scan huge collections of books such as this, or index images from your daily life like this.  In this case, you use all the available processing power to make as few errors as possible (i.e., maximize accuracy).

However, if the user is sitting right in front of your program, watching your algorithms and their output, things are a little different. No matter how smart your algorithm is, some errors will occur. This tends to annoy users. In that sense, actively involved users are a liability. However, they can also be an asset: since they’re sitting there anyway, waiting for results, you may as well get them really involved. If you have cheap but intelligent labor ready and willing, use it! The results will be better or, at the very least, no worse.  Also, users tend to remember the failures. So, even if end results were similar on average, allowing users to correct failures as early as possible will make them happier.

Instead of making algorithms as smart as possible, the goal now is to make them as fast as possible, so that they produce near-realtime results that don’t have to be perfect; they just shouldn’t be total garbage. When I started playing with the idea for WordSnap, I was thinking how to make the algorithms as smart as possible.  However, for the reasons above, I soon changed tactics.

The rest of this post describes some of the successful design decisions but,  more importantly, the failures in the balance between “automatic” and “realtime guidance”. The story begins with the following example image:

Incidentally, this image was the inspiration for WordSnap: I wanted to look up “inimical” but I was too lazy to type. Also, for the record, WordSnap uses camera preview frames, which are semi-planar YUV data at HVGA resolution (480×320). This image is a downsampled (512×384) full-resolution photograph taken with the G1 camera (2048×1536); most experiments here were performed before WordSnap existed in any usable form. Finally, I should point out that OCR isn’t really my area; what I describe below is based on common sense rather than knowledge of prior art, although just before writing this post I did try a quick review of the literature.

### Binarization

A basic operation for OCR is binarization: mapping grayscale intensities between 0 and 255 to just two values: black (0) and white (1).  Only then can we start talking about shapes (lines, words, characters, etc).  One of the most widely used binarization algorithms is Otsu’s method.  It picks a single, global threshold so that it maximizes the within-class (black/white) variance, or equivalently maximizes the across-class variance. This is very simple to implement, very fast and works well for flatbed scans, which have uniform illumination.

However, camera images are not uniformly illuminated. The example image may look fine to human eyes, but it turns out that even for this image no global threshold is suitable (click on image for animation showing various global thresholds):

If you looked at the animation carefully, you might have noticed that at some point, at least the word of interest (“inimical”) is correctly binarized in this picture.  However, if the lighting gradient were steeper, this would not be possible. Incidentally, ZXing uses Otsu’s method for binarization, because of it is fast. So, if you wondered why barcode scanning sometimes fails, now you know.

So, a slightly smarter approach is needed: instead of using one global threshold, the threshold should be determined individually for each pixel (i,j). A natural threshold t(i,j) is the mean intensity μw(i,j) of pixels within a w×w neighborhood around pixel (i,j).  The key operation here is mean filtering: convolving the original image with a w×w matrix with constant entries 1/w2.

The problem is that, using pure Java running on Dalvik, mean filtering is prohibitively slow.  First, Dalvik is fully interpreted (no JIT, yet). Firthermore, the fact that Java bytes are always signed doesn’t help: casting to int and masking off the 24 most significant bits almost doubles running time.

Method Dalvik (msec) JNI (msec) Speedup
Naïve 109,882 ± 4,813 1,712 ± 261 64×
Sliding 2,435 ± 141 71 ± 19 34×

JNI to the rescue. The table above shows speedups for two implementations. The naïve approach uses a triple nested loop and has complexity O(w2mn), where m and n is the image height and width, respectively (m = 348, n = 512 in this example). The 1-D equivalent would simply be:

for i = 0 to N-1:
s = 0
for j = max(i-r,0) to min(i+r,N-1):
s += a[j]

where w = 2r+1 is the window size. The second implementation updates the sums incrementally, based on the values of adjacent windows. The complexity now is just O(mn). An interesting aside is the relative performance of two implementations for sliding window sums (where w = 2r+1 is the window size). The first checks border conditions inside each iteration:

Initialize s = sum(a[0]..a[r])
for i = 1 to N-1:
if i > r:
s -= a[i-r-1]
if i < N-r:
s += a[i+r]

The second moves the border condition checks outside the loop which, if you think about it for a second, amounts to:

Initialize s = sum(a[0]..a[r])
for i = 1 to r:
s += a[i+r]
for i = r+1 to N-r-1:
s -= a[i-r-1]
s += a[i+r]
for i = N-r to N-1:
s -= a[i-r-1]

Among these two, the first one is faster, at least on a laptop running Sun’s JVM with JIT (I didn’t time Dalvik or JNI). I’m guessing that the second one messes loop unrolling, but I haven’t checked my guess.

It turns out that there is a very similar approach in the literature, called Sauvola’s method. Furthermore, there are efficient methods to compute it, using integral images. These are simply the 2-D generalization of partial sums. In 1-D, if partial sums are pre-computed, window sums can be estimated in O(1) time using the simple observation that sum(i…j) = sum(1..j) – sum(1..i-1).

Savuola’s method also computes local variance σw(i,j), and uses a relative threshold t(i,j) = μw(i,j)(1 + λσw(i,j)/127). WordSnap uses the global variance and an additive threshold t(i,j) = μw(i,j) + λσglobal, but after doing a contrast stretch of the original image (i.e., linearly mapping minimum intensity to 0 and maximum to 255). Doing floating point math or 64-bit integer arithmetic is much more expensive, hence the additive threshold. Furthermore, WordSnap does not use integral images because the same runtime can be achieved without the need to allocate a large buffer. Memory allocation on a mobile device is not cheap: the time needed to allocate a 480×320 buffer of 32-bit integers (about 600KB total) varies significantly depending on how much system memory is available, whether the garbage collector is triggered and so on, but on average it’s about half a second on the G1. Even though most buffers can be allocated once, startup time is important for this application: if it takes more than 2-3 seconds to start scanning, the user might as well have typed the result.

Anyway, here is the final result of locally adaptive thresholding:

Conclusion: In this case we needed the slightly smarter approach, so we invested the time to implement it efficiently. WordSnap currently uses a 21×21 neighborhood.  Altogether, binarization takes under 100ms.

### Skew

Another problem is that the orientation of the text lines may not be aligned with image edges.  This is called skew and makes recognition much harder.

Initially, I set out to find a way to correct for skew.  After a few searches on Google, I came across the Hough transform.  The idea is simple.  Sayyou want to detect a curve desribed by a set of parameters. E.g., for a line, those would be distance ρ from origin and slope θ. For each black pixel, find the parameter values for all possible curves to which this pixel may belong. For a line, that’s all angles θ from 0 to 180 degrees, and all distances ρ from 0 to sqrt(m2+n2).  Then, compute the density distribution of parameter tuples.  If a line (ρ00) is present in the image, then the parameter density distribution should have a local maximum at (ρ00).

If we apply this approach to our example image, the first maximum is detected at an angle of 20 degrees. Here is the image counter-rotated by that amount:

Success!  However, computing the Hough transform is too slow!  Typical implementations bucketize the parameter space. This would require a buffer of about 180×580 32-bit integers (for a 480×320 image), or about 410KB. In addition, it would require trigonometric operations or lookups to find the buckets for each pixel, not to mention counter-rotation. There are obvious optimizations one can try, such as computing histograms at multiple resolutions to progressively prune the parameter space.  Still, the cost implied by back-of-the envelope calculations put me off from even trying to implement this on the phone. Instead, why not just try to use the users:

Conclusion: Simple approach with help from user wins, and the computer doesn’t even have to do anything to solve the problem! Incidentally, the guideline width is determined by the size of typical newsprint text at the smallest distance that the G1’s camera can focus.

### Font size

Next, we need to detect individual words.  The approach WordSnap uses is to dilate the binary image with a rectangular structuring element (in the following image, the size 7×7), and then expand a rectangle (shown in green) until it covers the connected component which, presumably, is one word.

However, the size of the structuring element should really depend on the inter-word spacing, which in turn depends on the typeface as well as the distance of the camera from the text.  For example, if we use a 5×5 element, we would get the following:

I briefly toyed with two ideas for font size detection.  The first is to do a Fourier transform.  Presumably the first spatial frequency mode would correspond to inter-word and/or inter-line spacing and the second mode to inter-character spacing. But that assumes we apply Fourier to a “large enough” portion of the image, and things start becoming complicated.  Not to mention computationally expensive.

The second approach (which also appears to be the most common?) is to to hierarchical grouping. First expand rectangles to cover individual letters (or, sometimes, ligatures), then compute histogram of horizontal distances and re-group into word rectangles, and so on.  This is also non-trivial.

Instead, WordSnap uses a fixed dilation radius.  The implementation is optimized to allow near-realtime annotation of the detected word extent.  This video should give you an idea:

Conclusion: Simple wins again, but this time we have to do something (and let the user help with the rest). But, instead of trying to be smart and find the best parameters given the camera position, we try to be fast: fix the parameters and let the user find the camera position that works given the parameters. WordSnap uses a 5×5 rectangular structuring element, although you can change that to 3×3 or 7×7 in the preferenfces screen. Altogether, word extent detection takes about 150-200ms, although it could be significantly optimized, if necessary, by using only JNI only, instead of a mix of pure Java and JNI calls.

I’m now looking into the possibility of moving OCR into the “live” loop: as you move the camera, the phone shows not only the word extent rectangle, but also the recognized word.  Perhaps as a hyperlink to Google, or along with Google Translate results.  Then I can justifiably use the buzzword of the day, “augmented reality”!  It looks that it might just be possible, but let me get back to you in a week or two.  :)

Postscript: Some of the papers referenced were pointed out to me by Hideaki Goto, who started and maintains WeOCR. Also, skew detection and correction experiments are based on this quick-n-dirty Python script (needs OpenCV and it ain’t pretty!). Update (9/2): Fixed really stupid mistake in parametrization of line.

## Revised thoughts on Android

The post I wrote a few days ago about Android is all over the place. The right elements are in that post, but my composition and conclusions are somewhat incoherent. Perhaps I have been partly infected by the conventional thinking (of, e.g., various older, big corporations) and missed the obvious.

First, in a networked environment, it is common standards, rather than a single, common software platform, which further enable information sharing. So, Google may be doing Android for precisely the opposite reason than I originally suggested: to avoid the emergence of a single, dominant, proprietary platform. Chrome may exist for a similar reason. After all, Android serves a purpose similar to a browser, but for mobile devices with various sensing modalities.

Finally, mobile is arguably an important area and Google probably wants to encourage diversity and experimentation which, as I wrote in a previous post, is a pre-requisite for innovation. This is in contrast to the established mentality summarized by the quote I previously mentioned, to “find an idea and ask yourself: is the potential market worth at least one billlion dollars? If not, then walk away.” In fairness, this approach is appropriate to preserve the status quo. (By the way, in the same public speech, the person who gave this advice also responded to a question about competition by saying with commendable directness that “Look: we’ll all be dead some day.  But there’s a lot of money to be made until then.”)  But for innovation of any kind, one should “ask ‘why not?’ instead of ‘why should we do it?'” as Jeff Bezos said, or “innovate toward the light, not against the darkness” as Ray Ozzie said.

## First thoughts on Android

Update: I’ll keep this post for the record, even though I’ve completely changed my mind.

I recently upgraded to a T-Mobile G1 (aka. HTC Dream), running Android.  The G1 is a very nice and functional device. It’s also compact and decent looking, but perhaps not quite a fashion statement: unlike the iPhone my girlfriend got last year, which was immediately recognizable and a stare magnet, I pretty much have to slap people on the face with the G1 to make them look at it.  Also, battery life is acceptable, but just barely.  But this post is not about the G1, it’s about Android, which is Google’s Linux-based, open-source mobile application platform.

I’ll start with some light comments, by one of the greatest entertainers out there today: Monkey Boy made fun of the iPhone in January, stating that “Apple is selling zero phones a year“. Now he’s making similar remarks about Android, summarized by his eloquent “blah dee blah dee blah” argument.  Less than a year after that interview, the iPhone is ahead of Windows Mobile in worldwide market share of smartphone operating systems (7M versus 5.5M devices). Yep, this guy sure knows how entertain—even if he makes a fool of himself and Microsoft.

Furthermore, Monkey Boy said that “if I went to my shareholder meeting [...] and said, hey, we’ve just launched a new product that has no revenue model! [...] I’m not sure that my investors would take that very well. But that’s kind of what Google’s telling their investors about Android.”  Even if this were true, perhaps no revenue model is better than a simian model.

Anyway, someone from Microsoft should really know better—and quite likely he does, but can’t really say it out loud. There are some obvious parallels between Microsoft MS-DOS and Google Android:

• Disruptive technology: In the 80s, it was the personal computer.  Today, many think it is “cloud computing” (or “services”, or “ubiquitous computing”, or “utility computing”, or whatever else you want to call it).
• Commodity infrastructure: In the 80s, PC-compatibles became a commodity through standardization of the hardware platform and fierce competition that drove prices (and profit margins) down. Today, network infrastructure (the Internet at the core, and mobile devices on the fringes) as well as systems software (LAMP) are facing similar pressures.
• Common software platform: MS-DOS was the engine that fueled the growth of the personal computer.  For cloud computing, there is still some way to go (which Android hopes to help pave).
• Revenue model: Microsoft made a profit out of every PC sold. In today’s networked world, profit should come from services offered over the network and accessed via a multitude of devices (including mobile phones), rather than from selling software licenses.

An executive once said that money is really made by controlling the middleware platform. Lower levels of the stack face high competition and have low profit margins.  Higher levels of the stack (except perhaps some key applications) are too special-purpose and more of a niche.  The sweet-spot lies somewhere in the middle. This is where MS-DOS was and where Android wants to be.

Microsoft established itself by providing the platform for building applications on the “revolution” of its day, the personal computer. MS-DOS became the de-facto standard, much more open than anything else at that time. Subsequently, Microsoft took a cut of the profits out of each PC sold ever since. Taiwanese “PC-compatibles” helped fuel Microsoft’s (as well as Intel’s) growth. The rest is history.

In “cloud” computing, the ubiquitous, commodity infrastructure is the network.  This enables access to applications and information from any networked device. Even though individual components matter, it is common standards, rather than a single, common software platform, which further enable information sharing. If you believe that the future will be the same as the past, i.e., selling shrink-wrapped applications and software licenses, then Android not only has no revenue model, but has no hope of ever coming up with one. Ballmer would be absolutely right.  But if there is a shift towards network-hosted data and applications, money can be made whenever users access those.  There are plenty of obvious examples which could be profitable: geographically targeted advertising, smart shopping broker/assistant (see below), mobile office and add-on services, online games (location based or not), and so on. It’s not clear whether Google plans to get directly involved in those (I would doubt it), or just stay mostly on the back end and provide an easy-to-use “cloud infrastructure” for application developers.

The services provided by network operators are becoming commodities. This is nothing new. A quote I liked is that “ISPs have nothing to offer other than price and speed“.  I wouldn’t really include security in their offerings, as it is really an end-to-end service. As for devices, there is already evidence that commoditization similar to that of PC-compatibles may happen. Just one month after Android was open-sourced, Chinese manufacturers have started deploying it on smartphones. Even big manufacturers are quickly getting in the game; for example, Huawei recently announced an Android phone. Most cellphones are already manufactured in China anyway.  The iPhone is assembled in Shenzhen, where Huawei’s headquarters are also located (coincidence?). The Chinese already have a decent track record when it comes to building hardware and it’s only a matter of time until they fully catch up.

So, it’s quite simple: Android wants to be for ubiquitous services as MS-DOS was for personal computers. But Microsoft in the 80s did not really start out by saying “our revenue model is this: we’ll build a huge user base at all costs, which will subsequently allow us to get \$200 out of each and every PC sold”?  Not really.  Similarly, Google is not going to say that “we want to build a user base, so we can make a profit from all services hosted on the [our?] cloud and accessed via mobile devices [and set-top boxes, and cars, and...].”  Such an announcement would be premature, and one of the surest ways to scare off your user base: unless Google first provides more evidence that it means no evil, the general public will tend to assume the worst.

The most interesting feature of Android is it’s component-based architecture, as pointed out by some of the more insightful blog posts. Components are like iGoogle gadgets, only Android calls them “activities.” Applications themselves are built using a very browser-like metaphor: a “task” (which is Android-speak for running applications) is simply a stack of activites, which users can navigate backwards and forwards. The platform already has a set of basic activities that handle, e.g., email URLs, map URLs, calendar URLs, Flickr URLs, Youtube URLs, photo capture, music files, and so on. Any application can seamlessly invoke any of these reusable activities, either directly or via a registry of capabilities (which, roughly speaking, are called “intents”). The correspondence between a task and an O/S process is not necessarily one-to-one. Processes are used behind the scenes, for security and resource isolation purposes. Activities invoked by the same task may or may not run in the same process.

In addition to activities and intents, Android also supports other types of components, such as “content providers” (to expose data sources, such as your calendar or todo list, via a common API), “services” (long-running background tasks, such as a music player, which can be controlled via remote calls) and “broadcast receivers” (handlers for external events, such as receiving an SMS).

I think that Google is really pushing Android because it needs a component-based platform, and not so much to avoid the occasional snafu. If embraced by developers, this is the major ace up Android’s sleeve.  Furthermore, the open source codebase is the strongest indication (among several) that Google has no intention to tightly regulate application frameworks like Apple, or to leverage it’s position to attack the competition like Microsoft has done in the past.  Google wants to give itself enough leverage to realize it’s cloud-based services vision. If others benefit too, so much the better—Google is still too young to be “evil“.  After all, as Jeff Bezos said, “like our retail business, [there] is not going to be one winner. [...] Important industries are rarely made by single companies.” I find the comparison to retail interesting. In fact, it is quite likely that many “cloud services” themselves will also become commodities.

I’d wager that really successful Android applications won’t be just applications, but components with content provided over the network. A shopping list app is nice. It was exciting in the PalmPilot era, a decade ago. But a shopping list component, accessible from both my laptop and my cellphone, able to automatically pull good deals from a shopping component, and allow a navigation component to alert me that the supermarket I’m about to drive by has items I need—well, that would be great! Android is built with that vision in mind, even though it’s not quite there yet.

It’s kind of disappointing, but not surprising, that many app developers do not yet think in terms of this component-based architecture. In fairness, there are already efforts, such as OpenIntents, to build collections of general-purpose intents. Furthermore, the sync APIs are not (yet) for the faint of heart. Even Google-provided services could perhaps be improved. For example, Google Maps does not synchronize stored locations with the web-based version. When I recently missed a highway exit on the way to work and needed to get directions, I had to pull over and re-type the full address. Neither does it expose those locations via a data provider. When I installed Locale, I had to manually re-enter most of “My Locations” from the web version of Google Maps. So, there are clearly some rough edges that I’m sure will be smoothed out.  After all, there have been other rough edges, such as forgotten debugging hooks, something I find more amusing than alarming or embarrassing and certainly not the “Worst. Bug. Ever.

Android has a lot of potential, but it still needs work and Google should move fast. The top two items on my wish list would be:

1. Release a “signature” device (or two), like the Motorola Razr was a couple of years ago and the Apple iPhone was last year. The G1 is really nice, but not enough.  A device that people desire may be neither a necessary nor a sufficient condition for success, but it will sure help as a vehicle to move Android forward in market share.
2. Expand the set of available activities and content providers, and release an easy-to-use data sync service and API. In principle, everything that is an iGoogle gadget should also be an Android activity, sharing the same data sources. This is at the core of what “cloud computing” is about.  After all, you could think of Android as a glorified modern browser for devices with small screens, intermittent network connectivity, location sensors, and so on.

I suspect it might not be that hard to build a Google gadget container for Android.  Google Gears is already there and some form of interaction with the local device via Javascript is already allowed.  Many gadgets don’t need that much screen real estate anyway, so this may be an interesting hack to try out.

But not many people will buy an Android device for what it could do some day. Google has created a lot of positive buzz, backed by a few actual features. Now it needs some sexy devices and truly interesting apps, to really jumpstart the necessary network effect. Building the smart shopping list app should be as easy as building the dumb one. In the longer run, the set of devices on which Android is deployed should be expanded.  Move beyond cell phones, to in-car computers, set-top boxes, and so on (Microsoft Windows does both cars and set-top boxes already, but with limited success so far)—in short, anything that can be used to access network-hosted data and applications.