{"id":119,"date":"2009-09-01T14:06:08","date_gmt":"2009-09-01T18:06:08","guid":{"rendered":"http:\/\/www.bitquill.net\/blog\/?p=119"},"modified":"2016-05-12T14:45:37","modified_gmt":"2016-05-12T19:45:37","slug":"mobile-ocr-input-fully-automatic-and-reality","status":"publish","type":"post","link":"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/","title":{"rendered":"Mobile OCR input: &#8220;Fully automatic&#8221; and reality"},"content":{"rendered":"<p>Recently I&#8217;ve been toying around with\u00c2\u00a0<a title=\"WordSnap OCR\" href=\"http:\/\/www.bitquill.net\/trac\/wiki\/Android\/OCR\">WordSnap OCR<\/a> (<a title=\"WordSnap OCR - Wiki\" href=\"http:\/\/www.bitquill.net\/trac\/wiki\/Android\/OCR\">project page<\/a>, <a title=\"WordSnap OCR - Google Code\" href=\"http:\/\/code.google.com\/p\/wordsnap-ocr\/source\/browse\/#svn\/trunk\">source code<\/a>, <a title=\"WordSnap OCR - Cyrket\" href=\"http:\/\/www.cyrket.com\/package\/net.bitquill.ocr\">app on Android Market<\/a>), an app for OCR-based camera input on Android. In the process, I found out a few things about <strong>&#8220;smart&#8221; versus &#8220;fast&#8221;<\/strong>.<\/p>\n<p>At least in data mining, &#8220;fully automatic&#8221; is an often unquestioned holy grail. \u00c2\u00a0There are certainly several valid reasons for this, such as\u00c2\u00a0if you&#8217;re trying to scan huge collections of books such as\u00c2\u00a0<a title=\"Google Books\" href=\"http:\/\/books.google.com\/\">this<\/a>, or index images from your daily life like <a title=\"Evernote\" href=\"http:\/\/evernote.com\/\">this<\/a>. \u00c2\u00a0<strong>In this case, you use all the available processing power to make as few errors as possible<\/strong> (i.e., maximize accuracy).<\/p>\n<p>However, if the user is sitting right in front of your program, watching your algorithms and their output, things are a little different. <strong>No matter how smart your algorithm is, some errors will occur.<\/strong> This tends to annoy users. In that sense, actively involved users are a liability. However, they can also be an asset: since they&#8217;re sitting there anyway, waiting for results, you may as well get them <em>really<\/em> involved.<strong> If you have cheap but intelligent labor ready and willing, use it!<\/strong> The results will be better or, at the very least, no worse. \u00c2\u00a0<strong>Also, users tend to remember the failures.<\/strong> So, even if end results were similar <em>on average<\/em>, allowing users to correct failures as early as possible will make them happier.<\/p>\n<p><strong>Instead of making algorithms as smart as possible, the goal now is to make them as fast as possible, so that they produce near-realtime results that don&#8217;t have to be perfect; they just shouldn&#8217;t be total garbage.\u00c2\u00a0<span style=\"font-weight: normal;\">When I started playing with the idea for WordSnap, I was thinking how to make the algorithms as smart as possible. \u00c2\u00a0However, for the reasons above, I soon changed tactics.<\/span><\/strong><\/p>\n<p><strong><span style=\"font-weight: normal;\"><span style=\"font-weight: normal;\"> <\/span><\/span><span style=\"font-weight: normal;\">The rest of this post describes some of the successful design decisions but, \u00c2\u00a0more importantly, the failures in the balance between &#8220;automatic&#8221; and &#8220;realtime guidance&#8221;.<\/span><span style=\"font-weight: normal;\"><span style=\"font-weight: normal;\"> <\/span><\/span><span style=\"font-weight: normal;\">The story begins with the following example image:<\/span><\/strong><\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"149\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/skew_original\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_original.png?fit=512%2C384&amp;ssl=1\" data-orig-size=\"512,384\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Original image\" data-image-description=\"&lt;p&gt;Original grayscale camera image (scaled down to 512&amp;#215;384)&lt;\/p&gt;\n\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_original.png?fit=512%2C384&amp;ssl=1\" class=\"size-full wp-image-149 alignnone\" title=\"Original grayscale image\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_original.png?resize=512%2C384\" alt=\"Original image\" width=\"512\" height=\"384\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_original.png?w=512&amp;ssl=1 512w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_original.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p>Incidentally, this image was the inspiration for WordSnap: I wanted to look up &#8220;inimical&#8221; but I was too lazy to type. Also, for the record, WordSnap uses camera preview frames, which are semi-planar YUV data at HVGA resolution (480\u00c3\u2014320). This image is a downsampled (512\u00c3\u2014384) full-resolution photograph taken with the G1 camera (2048\u00c3\u20141536); most experiments here were performed before WordSnap existed in any usable form. Finally, I should point out that OCR isn&#8217;t really my area; what I describe below is based on common sense rather than knowledge of prior art, although just before writing this post I did try a quick review of the literature.<\/p>\n<p><!--more--><\/p>\n<h3>Binarization<\/h3>\n<p>A basic operation for OCR is binarization: mapping grayscale intensities between 0 and 255 to just two values: black (0) and white (1). \u00c2\u00a0Only then can we start talking about shapes (lines, words, characters, etc). \u00c2\u00a0One of the most widely used binarization algorithms is\u00c2\u00a0<a title=\"Otsu's method - Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Otsu's_method\">Otsu&#8217;s method<\/a>. \u00c2\u00a0It picks a single, global threshold so that it maximizes the within-class (black\/white) variance, or equivalently maximizes the across-class variance. This is very simple to implement, very fast and works well for flatbed scans, which have uniform illumination.<\/p>\n<p><strong>However, camera images are not uniformly illuminated.<\/strong> The example image may look fine to human eyes, but it turns out that even for this image no global threshold is suitable (click on image for <a title=\"Global thresholding - Animation\" href=\"http:\/\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global.gif\">animation showing various global thresholds<\/a>):<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global.gif\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"157\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/skew_bin_global_static\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global_static.png?fit=512%2C384&amp;ssl=1\" data-orig-size=\"512,384\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Binarization with global threshold\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global_static.png?fit=512%2C384&amp;ssl=1\" class=\"alignnone size-full wp-image-157\" title=\"Binarization with global threshold\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global_static.png?resize=512%2C384\" alt=\"Binarization with global threshold\" width=\"512\" height=\"384\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global_static.png?w=512&amp;ssl=1 512w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_global_static.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/a><\/p>\n<p>If you looked at the animation carefully, you might have noticed that at some point, at least the word of interest (&#8220;inimical&#8221;) is correctly binarized in this picture. \u00c2\u00a0However, if the lighting gradient were steeper, this would not be possible. Incidentally, <a title=\"ZXing - Google COde\" href=\"http:\/\/code.google.com\/p\/zxing\/\">ZXing<\/a> uses Otsu&#8217;s method for binarization, because of it is fast. So, if you wondered why barcode scanning sometimes fails, now you know.<\/p>\n<p>So, a slightly smarter approach is needed: instead of using one global threshold,<strong> the threshold should be determined individually for each pixel (i,j)<\/strong>. A natural threshold t(i,j) is the mean intensity \u00ce\u00bc<sub>w<\/sub>(i,j) of pixels within a w\u00c3\u2014w neighborhood around pixel (i,j). \u00c2\u00a0The key operation here is mean filtering: convolving the original image with a\u00c2\u00a0w\u00c3\u2014w matrix with constant entries 1\/w<sup>2<\/sup>.<\/p>\n<p>The problem is that, using pure Java running on Dalvik, mean filtering is prohibitively slow. \u00c2\u00a0First, Dalvik is fully interpreted (no JIT, yet). Firthermore, the fact that Java bytes are always signed doesn&#8217;t help: casting to int and masking off the 24 most significant bits almost doubles running time.<\/p>\n<table border=\"0\" cellspacing=\"3\" cellpadding=\"2\">\n<tbody>\n<tr style=\"background-color: #dddddd\">\n<th> Method<\/th>\n<th colspan=\"3\" align=\"center\">Dalvik (msec)<\/th>\n<th colspan=\"3\" align=\"center\">JNI (msec)<\/th>\n<th align=\"center\"> Speedup<\/th>\n<\/tr>\n<tr>\n<td>Na\u00c3\u00afve<\/td>\n<td align=\"right\">109,882<\/td>\n<td>\u00c2\u00b1<\/td>\n<td>4,813<\/td>\n<td align=\"right\">1,712<\/td>\n<td>\u00c2\u00b1<\/td>\n<td>261<\/td>\n<td align=\"center\">64\u00c3\u2014<\/td>\n<\/tr>\n<tr>\n<td>Sliding<\/td>\n<td align=\"right\">2,435<\/td>\n<td>\u00c2\u00b1<\/td>\n<td>141<\/td>\n<td align=\"right\">71<\/td>\n<td>\u00c2\u00b1<\/td>\n<td>19<\/td>\n<td align=\"center\">34\u00c3\u2014<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>JNI to the rescue. The table above shows speedups for two implementations.  The na\u00c3\u00afve approach uses a triple nested loop and has complexity O(w<sup>2<\/sup>mn), where m and n is the image height and width, respectively (m = 348, n = 512 in this example).  The 1-D equivalent would simply be:<\/p>\n<pre>for i = 0 to N-1:\r\n   s = 0\r\n   for j = max(i-r,0) to min(i+r,N-1):\r\n      s += a[j]<\/pre>\n<p>where w = 2r+1 is the window size. The second implementation updates the sums incrementally, based on the values of adjacent windows. The complexity now is just O(mn). An interesting aside is the relative performance of two implementations for sliding window sums (where w = 2r+1 is the window size).  The first checks border conditions inside each iteration:<\/p>\n<pre>Initialize s = sum(a[0]..a[r])\r\nfor i = 1 to N-1:\r\n   if i &gt; r:\r\n      s -= a[i-r-1]\r\n   if i &lt; N-r:\r\n      s += a[i+r]<\/pre>\n<p>The second moves the border condition checks outside the loop which, if you think about it for a second, amounts to:<\/p>\n<pre>Initialize s = sum(a[0]..a[r])\r\nfor i = 1 to r:\r\n   s += a[i+r]\r\nfor i = r+1 to N-r-1:\r\n   s -= a[i-r-1]\r\n   s += a[i+r]\r\nfor i = N-r to N-1:\r\n   s -= a[i-r-1]<\/pre>\n<p>Among these two, the <em>first<\/em> one is faster, at least on a laptop running Sun&#8217;s JVM with JIT (I didn&#8217;t time Dalvik or JNI).<strong> I&#8217;m guessing that the second one messes loop unrolling<\/strong>, but I haven&#8217;t checked my guess.<\/p>\n<p>It turns out that there is a very similar approach in the literature, called <em>Sauvola&#8217;s method<\/em>. Furthermore, there are <a href=\"http:\/\/www.dfki.uni-kl.de\/~shafait\/papers\/Shafait-efficient-binarization-SPIE08.pdf\">efficient methods<\/a> to compute it, using integral images. These are simply the 2-D generalization of partial sums. In 1-D, if partial sums are pre-computed, window sums can be estimated in O(1) time using the simple observation that sum(i&#8230;j) = sum(1..j) &#8211; sum(1..i-1).<\/p>\n<p>Savuola&#8217;s method also computes local variance \u00cf\u0192<sub>w<\/sub>(i,j), and uses a relative threshold t(i,j) = \u00ce\u00bc<sub>w<\/sub>(i,j)(1 + \u00ce\u00bb\u00cf\u0192<sub>w<\/sub>(i,j)\/127).  <strong>WordSnap uses the global variance and an additive threshold<\/strong> t(i,j) = \u00ce\u00bc<sub>w<\/sub>(i,j) + \u00ce\u00bb\u00cf\u0192<sub>global<\/sub>, but after doing a contrast stretch of the original image (i.e., linearly mapping minimum intensity to 0 and maximum to 255).  Doing floating point math or 64-bit integer arithmetic is much more expensive, hence the additive threshold.  Furthermore, WordSnap does not use integral images because the same runtime can be achieved without the need to allocate a large buffer. <strong>Memory allocation on a mobile device is not cheap:<\/strong> the time needed to allocate a 480\u00c3\u2014320 buffer of 32-bit integers (about 600KB total) varies significantly depending on how much system memory is available, whether the garbage collector is triggered and so on, but on average it&#8217;s about half a second on the G1. Even though most buffers can be allocated once, <strong>startup time is important<\/strong> for this application: if it takes more than 2-3 seconds to start scanning, the user might as well have typed the result.<\/p>\n<p>Anyway, here is the final result of locally adaptive thresholding:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"153\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/skew_bin\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin.png?fit=512%2C384&amp;ssl=1\" data-orig-size=\"512,384\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Binarization with local mean filter\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin.png?fit=512%2C384&amp;ssl=1\" class=\"alignnone size-full wp-image-153\" title=\"Binarization with local mean filter\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin.png?resize=512%2C384\" alt=\"Binarization with local mean filter\" width=\"512\" height=\"384\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin.png?w=512&amp;ssl=1 512w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p><span style=\"font-family: 'Lucida Grande', Tahoma, Arial, sans-serif; font-weight: bold; font-size: 1em\">Conclusion:<\/span> In this case we needed the slightly smarter approach, so we invested the time to implement it efficiently. WordSnap currently uses a 21\u00c3\u201421 neighborhood. \u00c2\u00a0Altogether, binarization takes under 100ms.<\/p>\n<h3>Skew<\/h3>\n<p>Another problem is that the orientation of the text lines may not be aligned with image edges. \u00c2\u00a0This is called skew and makes recognition much harder.<\/p>\n<p>Initially, I set out to find a way to correct for skew. \u00c2\u00a0After a few searches on Google, I came across the\u00c2\u00a0<a title=\"Hough transform - Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Hough_transform\">Hough transform<\/a>. \u00c2\u00a0The idea is simple. \u00c2\u00a0Sayyou want to detect a curve desribed by a set of parameters. E.g., for a line, those would be distance \u00cf\u0081 from origin and slope \u00ce\u00b8. For each black pixel, find the parameter values for all possible curves to which this pixel may belong. For a line, that&#8217;s all angles\u00c2\u00a0\u00ce\u00b8\u00c2\u00a0from 0 to 180 degrees, and all distances \u00cf\u0081 from 0 to sqrt(m<sup>2<\/sup>+n<sup>2<\/sup>). \u00c2\u00a0Then, compute the density distribution of parameter tuples. \u00c2\u00a0If a line (\u00cf\u0081<sub>0<\/sub>,\u00ce\u00b8<sub>0<\/sub>)\u00c2\u00a0is present in the image, then the parameter density distribution should have a local maximum at (\u00cf\u0081<sub>0<\/sub>,\u00ce\u00b8<sub>0<\/sub>).<\/p>\n<p>If we apply this approach to our example image, the first maximum is detected at an\u00c2\u00a0angle of 20 degrees. Here is the image counter-rotated by that amount:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"150\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/skew_bin_rot\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot.png?fit=512%2C384&amp;ssl=1\" data-orig-size=\"512,384\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Skew correction\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;After rotating by angle detected using Hough transform&lt;\/p&gt;\n\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot.png?fit=512%2C384&amp;ssl=1\" class=\"size-full wp-image-150 alignnone\" title=\"Skew correction using Hough transform\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot.png?resize=512%2C384\" alt=\"After rotating by angle detected using Hough transform\" width=\"512\" height=\"384\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot.png?w=512&amp;ssl=1 512w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p>Success! \u00c2\u00a0However, computing the Hough transform is too slow! \u00c2\u00a0Typical implementations bucketize the parameter space. This would require a buffer of about 180\u00c3\u2014580 32-bit integers (for a 480\u00c3\u2014320 image), or about 410KB. In addition, it would require trigonometric operations or lookups to find the buckets for each pixel, not to mention counter-rotation. There are obvious optimizations one can try, such as computing histograms at multiple resolutions to progressively prune the parameter space. \u00c2\u00a0Still, the cost implied by back-of-the envelope calculations put me off from even trying to implement this on the phone. Instead, why not just try to use the users:<\/p>\n<p><a href=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/finder.png\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"181\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/finder\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/finder.png?fit=480%2C320&amp;ssl=1\" data-orig-size=\"480,320\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Finder alignment guides\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/finder.png?fit=480%2C320&amp;ssl=1\" class=\"alignnone size-full wp-image-181\" title=\"Finder alignment guides\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/finder.png?resize=480%2C320\" alt=\"Finder alignment guides\" width=\"480\" height=\"320\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/finder.png?w=480&amp;ssl=1 480w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/finder.png?resize=300%2C200&amp;ssl=1 300w\" sizes=\"auto, (max-width: 480px) 100vw, 480px\" \/><\/a><\/p>\n<p><span style=\"font-family: 'Lucida Grande', Tahoma, Arial, sans-serif; font-weight: bold; font-size: 1em\">Conclusion:<\/span> Simple approach with help from user wins, and<strong> the computer doesn&#8217;t even have to do <em>anything<\/em> to solve the problem!<\/strong> Incidentally, the guideline width is determined by the size of typical newsprint text at the smallest distance that the G1&#8217;s camera can focus.<\/p>\n<h3>Font size<\/h3>\n<p>Next, we need to detect individual words. \u00c2\u00a0The approach WordSnap uses is to <a title=\"Dilation - Mathematical morphology - Wikipedia\" href=\"http:\/\/en.wikipedia.org\/wiki\/Mathematical_morphology#Dilation\">dilate<\/a> the binary image with a rectangular structuring element (in the following image, the size 7\u00c3\u20147), and then expand a rectangle (shown in green) until it covers the connected component which, presumably, is one word.<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"154\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/skew_bin_rot_dilate3\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate3.png?fit=512%2C384&amp;ssl=1\" data-orig-size=\"512,384\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Dilation with 7&amp;#215;7 rectangle\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate3.png?fit=512%2C384&amp;ssl=1\" class=\"alignnone size-full wp-image-154\" title=\"Dilation with 7x7 rectangle\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate3.png?resize=512%2C384\" alt=\"Dilation with 7x7 rectangle\" width=\"512\" height=\"384\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate3.png?w=512&amp;ssl=1 512w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate3.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p>However, the size of the structuring element should really depend on the inter-word spacing, which in turn depends on the typeface as well as the distance of the camera from the text. \u00c2\u00a0For example, if we use a 5\u00c3\u20145 element, we would get the following:<\/p>\n<p><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"155\" data-permalink=\"https:\/\/bitquill.net\/blog\/mobile-ocr-input-fully-automatic-and-reality\/skew_bin_rot_dilate2\/\" data-orig-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate2.png?fit=512%2C384&amp;ssl=1\" data-orig-size=\"512,384\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;}\" data-image-title=\"Dilation with 5&amp;#215;5 rectangular element\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate2.png?fit=512%2C384&amp;ssl=1\" class=\"alignnone size-full wp-image-155\" title=\"Dilation with 5x5 rectangular element\" src=\"https:\/\/i0.wp.com\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate2.png?resize=512%2C384\" alt=\"Dilation with 5x5 rectangular element\" width=\"512\" height=\"384\" srcset=\"https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate2.png?w=512&amp;ssl=1 512w, https:\/\/i0.wp.com\/bitquill.net\/blog\/wp-content\/uploads\/2009\/08\/skew_bin_rot_dilate2.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><\/p>\n<p>I briefly toyed with two ideas for font size detection. \u00c2\u00a0The first is to do a Fourier transform. \u00c2\u00a0Presumably the first spatial frequency mode would correspond to inter-word and\/or inter-line spacing and the second mode to inter-character spacing. But that assumes we apply Fourier to a &#8220;large enough&#8221; portion of the image, and things start becoming complicated. \u00c2\u00a0Not to mention computationally expensive.<\/p>\n<p>The second approach (which also appears to be the most common?) is to to hierarchical grouping. First expand rectangles to cover individual letters (or, sometimes, ligatures), then compute histogram of horizontal distances and re-group into word rectangles, and so on. \u00c2\u00a0This is also non-trivial.<\/p>\n<p>Instead, WordSnap uses a fixed dilation radius. \u00c2\u00a0The implementation is optimized to allow near-realtime annotation of the detected word extent. \u00c2\u00a0This video should give you an idea:<br \/>\n<object width=\"425\" height=\"344\" data=\"http:\/\/www.youtube.com\/v\/GhUOWbOmn6s&amp;hl=en&amp;fs=1&amp;rel=0\" type=\"application\/x-shockwave-flash\"><param name=\"allowFullScreen\" value=\"true\" \/><param name=\"allowscriptaccess\" value=\"always\" \/><param name=\"src\" value=\"http:\/\/www.youtube.com\/v\/GhUOWbOmn6s&amp;hl=en&amp;fs=1&amp;rel=0\" \/><param name=\"allowfullscreen\" value=\"true\" \/><\/object><\/p>\n<p><span style=\"font-family: 'Lucida Grande', Tahoma, Arial, sans-serif; font-weight: bold; font-size: 1em\">Conclusion:<\/span> Simple wins again, but this time we have to do <em>something<\/em> (and let the user help with the rest). But,<strong> instead of trying to be <em>smart<\/em> and find the best parameters given the camera position, we try to be <em>fast<\/em>: fix the parameters and let the user find the camera position that works given the parameters. <\/strong>WordSnap uses a 5\u00c3\u20145 rectangular structuring element, although you can change that to 3\u00c3\u20143 or 7\u00c3\u20147 in the preferenfces screen. Altogether, word extent detection takes about 150-200ms, although it could be significantly optimized, if necessary, by using only JNI only, instead of a mix of pure Java and JNI calls.<\/p>\n<hr \/>\n<p>I&#8217;m now looking into the possibility of moving OCR into the &#8220;live&#8221; loop: as you move the camera, the phone shows not only the word extent rectangle, but also the recognized word. \u00c2\u00a0Perhaps as a hyperlink to Google, or along with Google Translate results. \u00c2\u00a0Then I can justifiably use the buzzword of the day, <strong>&#8220;augmented reality&#8221;<\/strong>! \u00c2\u00a0It looks that it might just be possible, but let me get back to you in a week or two.\u00c2\u00a0\u00c2\u00a0:)<\/p>\n<p><strong>Postscript:<\/strong> Some of the papers referenced were pointed out to me by <a title=\"Hideaki Goto - Homepage\" href=\"http:\/\/www.sc.isc.tohoku.ac.jp\/~hgot\/\">Hideaki Goto<\/a>, who started and maintains <a title=\"WeOCR - Homepage\" href=\"http:\/\/weocr.ocrgrid.org\/\">WeOCR<\/a>. Also, skew detection and correction experiments are based on this <a href=\"http:\/\/www.bitquill.net\/blog\/wp-content\/uploads\/2009\/09\/test_skew.txt\">quick-n-dirty Python script<\/a> (needs <a title=\"OpenCV - Homepage\" href=\"http:\/\/opencv.willowgarage.com\/wiki\/\">OpenCV<\/a> and it ain&#8217;t pretty!). <em>Update (9\/2):<\/em> Fixed really stupid mistake in parametrization of line.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently I&#8217;ve been toying around with\u00c2\u00a0WordSnap OCR (project page, source code, app on Android Market), an app for OCR-based camera input on Android. In the process, I found out a few things about &#8220;smart&#8221; versus &#8220;fast&#8221;. At least in data mining, &#8220;fully automatic&#8221; is an often unquestioned holy grail. \u00c2\u00a0There are certainly several valid reasons [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[45],"tags":[59,49,7,62,57,11,61,63],"class_list":["post-119","post","type-post","status-publish","format-standard","hentry","category-scitech","tag-android","tag-computer-science","tag-development","tag-machine-learning","tag-mobile-devices","tag-research","tag-sci-tech","tag-user-interfaces"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p7x9xm-1V","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/posts\/119","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/comments?post=119"}],"version-history":[{"count":77,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/posts\/119\/revisions"}],"predecessor-version":[{"id":709,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/posts\/119\/revisions\/709"}],"wp:attachment":[{"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/media?parent=119"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/categories?post=119"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bitquill.net\/blog\/wp-json\/wp\/v2\/tags?post=119"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}