WordSnap OCR

What it does

"Making print clickable"
Reading something and want to look up a word on Google or Wikipedia, but too lazy to type it? Now you can just point the camera at it and let this app do the rest.

This blog post describes progress so far and gives a few details on how WordSnap currently works.

REQUEST FOR FEEDBACK: The app has been tested only on the G1; feedback from users of other hardware is especially welcome (the emulator is pretty much useless for debugging camera capture). Please send email, as I cannot follow up on Market comments. Also, if you want to help improve the image preprocessing algorithms, you can enable image data logging to SD card.

Quick instructions

  • Point the camera so that the word you want to recognize sits inside the viewfinder guide. It doesn't have to be perfect, but try to put the center of the word near the center of the viewfinder rectangle, and to roughly align the text line with the red guide lines.
  • You can trigger capture with either the hardware camera button or a tap on the screen. The hardware button works as expected. Tapping on the screen initiates an autofocus attempt; once autofocus either succeeds or takes too long, the touch triggers an image capture (see the sketch after this list).
  • Once the results come back, a number of action buttons will appear (search Google, search Wikipedia, or copy to clipboard). Tapping on any of them will perform the corresponding action and close WordSnap (long-press the home key if you want to go back to it).
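
For the curious, here is a rough Java sketch of how such a tap-to-capture flow can be wired up with the Android Camera API. The class, method names, and timeout value are illustrative, not taken from the WordSnap source:

    import android.hardware.Camera;
    import android.os.Handler;

    // Illustrative sketch: capture when autofocus completes, or after a
    // timeout if it is taking too long (the behavior described above).
    class TapCapture {
        private static final long FOCUS_TIMEOUT_MS = 2000;  // assumed value
        private final Handler handler = new Handler();
        private boolean captured;

        void onScreenTap(final Camera camera, final Camera.PictureCallback jpeg) {
            captured = false;
            // Fire when autofocus completes, successfully or not...
            camera.autoFocus(new Camera.AutoFocusCallback() {
                public void onAutoFocus(boolean success, Camera cam) {
                    capture(cam, jpeg);
                }
            });
            // ...or when it has taken too long.
            handler.postDelayed(new Runnable() {
                public void run() { capture(camera, jpeg); }
            }, FOCUS_TIMEOUT_MS);
        }

        private synchronized void capture(Camera camera, Camera.PictureCallback jpeg) {
            if (captured) return;  // the callback and the timeout may race; fire once
            captured = true;
            camera.takePicture(null, null, jpeg);  // JPEG callback gets the image bytes
        }
    }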

How it works

First, the image is preprocessed (to determine background/foreground polarity, the extent of the word in the image, etc.). The main purpose of preprocessing is to reduce the amount of data sent over the air: a JPEG-encoded 8-bit grayscale image is around 20 KB, but after preprocessing and sending just the word as a binary image, the typical size is 1–2 KB. In the future, preprocessing could also involve more sophisticated interaction, such as interactive selection of the text extent (so the user can run OCR on more than just one word).
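
As a toy illustration of this step (not the actual WordSnap code, which does its image work via JNI), the following Java sketch contrast-stretches an 8-bit grayscale buffer, binarizes it with a simple global threshold, and packs the result to one bit per pixel, roughly an 8x reduction even before any compression:

    // Illustrative only: contrast-stretch an 8-bit grayscale buffer, then
    // binarize and pack to 1 bit/pixel.
    static byte[] binarize(byte[] gray) {
        int min = 255, max = 0;
        for (int i = 0; i < gray.length; i++) {
            int v = gray[i] & 0xFF;              // bytes are signed in Java
            if (v < min) min = v;
            if (v > max) max = v;
        }
        int range = Math.max(1, max - min);
        byte[] packed = new byte[(gray.length + 7) / 8];
        for (int i = 0; i < gray.length; i++) {
            int v = ((gray[i] & 0xFF) - min) * 255 / range;  // stretch to [0, 255]
            if (v >= 128)                        // set the bit for bright pixels
                packed[i >> 3] |= 0x80 >> (i & 7);
        }
        return packed;
    }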

After the image is preprocessed, it is sent to an OCR webservice to extract the text. I'm using WeOCR as the backend, a free service that people at Tohoku University in Japan have been running since 2005. Among other things, it provides a cgi-bin frontend to various open source OCR engines (Tesseract, Ocrad, GOCR, etc.). You can select a different engine in the application preferences.
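
As a rough sketch of what this round trip could look like in Java, the snippet below POSTs the preprocessed image to a CGI endpoint and reads back the text. The URL and form field names are placeholders, not WeOCR's actual interface; consult the WeOCR documentation for the real parameters:

    import java.io.*;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustrative only: multipart POST of an image to an OCR CGI endpoint.
    public final class OcrClient {
        private static final String BOUNDARY = "----wordsnap";  // arbitrary

        public static String recognize(String endpoint, byte[] imageBytes) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type",
                    "multipart/form-data; boundary=" + BOUNDARY);

            OutputStream out = conn.getOutputStream();
            Writer w = new OutputStreamWriter(out, "UTF-8");
            w.write("--" + BOUNDARY + "\r\n");
            // "imagefile" is a placeholder field name, not WeOCR's.
            w.write("Content-Disposition: form-data; name=\"imagefile\"; filename=\"word.png\"\r\n");
            w.write("Content-Type: application/octet-stream\r\n\r\n");
            w.flush();
            out.write(imageBytes);               // the preprocessed binary image
            w.write("\r\n--" + BOUNDARY + "--\r\n");
            w.flush();

            // Read the recognized text from the response body.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) sb.append(line).append('\n');
            in.close();
            return sb.toString().trim();
        }
    }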

Download

Binary release

The app is available on Android Market, under the name WordSnap OCR.

Source release

The source is available on Google Code and is distributed under the terms of the GPLv3.

Known issues

  • Crash on resume from suspend (background thread not re-initialized appropriately?) -- reported by Alain Tuor.
  • The method that determines whether the text is light-colored on a dark background, or vice versa, assumes a uniform image (it should probably segment the image instead?); a sketch of this kind of naive heuristic appears after this list. For example, the method may fail if non-text parts are in the frame (such as a dark picture or part of a dark table), and it may also fail on glossy paper with strong reflections, even if those reflections are not near the scanned word.
  • Word extents are rectangles; this means it must be possible to cover a word with a rectangle that overlaps no other word. This is not always true for text with tight line spacing (e.g., newspaper print). Should it really use the natural contour of the dilated text image, rather than a rectangle?
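
For the first issue above, here is a Java sketch of the kind of naive polarity heuristic in question (not the actual WordSnap code): assume the background dominates the frame, so if most pixels are brighter than the mean, the text must be dark on light. A dark picture in the frame skews both the mean and the pixel counts, which is exactly how it fails:

    // Illustrative only: decide text/background polarity by assuming the
    // background occupies the majority of the frame.
    static boolean isDarkTextOnLightBackground(byte[] gray) {
        long sum = 0;
        for (byte b : gray) sum += (b & 0xFF);
        long mean = sum / gray.length;
        int brighter = 0;
        for (byte b : gray)
            if ((b & 0xFF) > mean) brighter++;
        // Breaks down when dark non-text regions dominate, as described above.
        return brighter > gray.length / 2;
    }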

Changelog

  • v0.1 8/20/2009
    • Initial release
  • v0.2 8/22/2009
    • Contrast stretch captured preview image
  • v0.3 8/28/2009
    • Live capture mode (continuous capture with live word extent annotation)
    • UI warnings (extent too large, low contrast, out of focus)
    • Warning alerts, configurable based on network connectivity (e.g., when on EDGE and the word extent is too large, so too much data would be sent, actively ask for user confirmation)
    • Allow user to edit recognized text before sending out (Google, Wikipedia, Clipboard); user-configurable, default off

To do

  • Contrast stretching before binarization. [v0.2alpha]
  • Add some feedback (e.g., low lighting/contrast, excessive shake/out-of-focus, extent too large when on EDGE, etc). [v0.3]
  • Allow text editing before sending to Google or Wikipedia (preferences option). [v0.3]
  • Move OCR-related stuff to a separate thread with an event loop (à la ZXing), allocate image buffers once. [v0.3]
  • Continuous ("live") mode for word detection. [v0.3]
  • Clean up main scanning activity to separate scanning logic and state from UI state (code rewrite, no additional features until done) [in progress]
  • Look into porting one of the lighter-weight engines (GOCR, perhaps via Conjecture) to Android
  • Touch triggering of autofocus and capture needs some polishing [after UI rewrite]
  • Expose main activity as a callable component (similar to ZXing), so other applications can use it.
  • User-extensible actions (so anyone can add, e.g., online dictionaries, etc) [by Caleb Marcus]
  • Do whatever tricks the Android Camera application does to speed up startup [after UI rewrite]
  • Interactive selection of word extents (?)
  • Separate JNI image operations into self-contained library (?)
  • More testing under different lighting conditions and backgrounds; should it use the distribution of pixel intensities, rather than the overall variance, to determine the adaptive filter threshold offset? [probably no]
  • Set up feedback collection, to collect opt-in "training" data (??) [probably no]

BTW, due to limited time, updates will be infrequent and I may be sluggish in responding to email (but I appreciate all feedback and *will* read your emails).