Advances in Multilingual OCR
Bringing OCR technology into the next generation is something of a nightmarish challenge. We may be able to teach computers to drive without committing mass vehicular manslaughter, but teaching them to read has proven a far more challenging process. Teaching them to read multiple languages, especially older documents with patchy inkwork, handwriting, and other anomalies? You might as well just forget it. And yet, more than a few researchers are trying to leap this seemingly uncrossable chasm.
One approach to this is the EU's IMPACT project (IMProving ACcess to Text), which is aimed specifically at that most dreaded of demographics: old, damaged, time-beaten documents that even human eyes struggle to decipher. Needless to say, it's no small task to create a program capable of digitizing these texts with any acceptable level of accuracy, yet that's precisely what IMPACT seeks to achieve. Much of the work is still in the theoretical stage, but the idea is to rely less on a hard set of external character recognition patterns and instead to collect a wide range of sample characters, then compare a suspect letter or word against this litany of examples. The case they cite is a classic curse of handwriting: rn versus m. While a typical OCR program might look at the shape and simply say m, or just plain get confused, IMPACT's OCR would look into its database of similar character clusters and at how they were resolved, allowing it to reach a fairly accurate resolution of the conundrum.
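To make that comparison idea concrete, here is a minimal sketch of sample-based matching. This is not the actual IMPACT implementation; the database layout, the pixel-distance metric, and names like sample_db and match_glyph are all my own assumptions for illustration.

```python
import numpy as np

# Hypothetical database: each entry pairs a normalized glyph image
# (e.g. a 32x32 grayscale array) with the text it was resolved to,
# such as "m" or "rn".
sample_db = []  # list of (np.ndarray of shape (32, 32), resolved_text)

def match_glyph(glyph: np.ndarray, k: int = 5) -> str:
    """Resolve an ambiguous glyph by comparing it to stored examples.

    Uses plain pixel-wise Euclidean distance and a majority vote over
    the k nearest stored samples -- a stand-in for whatever similarity
    measure the real system actually uses.
    """
    if not sample_db:
        raise ValueError("sample_db must contain labeled glyph examples")
    distances = [
        (np.linalg.norm(glyph - sample), text) for sample, text in sample_db
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = [text for _, text in distances[:k]]
    # Return whichever resolution ("m", "rn", ...) dominates the neighbors.
    return max(set(nearest), key=nearest.count)
```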
Nonetheless, errors would still be unavoidable, and some human oversight and correction would be required. Instead of making this an arduous process of skimming through a text with extreme paranoia, the IMPACT OCR would guide individuals through each conflict it encountered, having the user either approve or correct the OCR's decision. Thus not only are errors caught, but this system of approvals steadily improves the accuracy of the OCR as a whole. It becomes an adaptive system, constantly growing and improving with each text it processes. While its initial products may be underwhelming, in theory it will steadily improve to the point where it can vastly outpace any other form of OCR (the article itself estimates it could be as fast as 30 minutes for a short book that would take 4 hours to transcribe by hand).
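The review loop might look roughly like the sketch below: the system queues only the glyphs it is unsure about, a human approves or corrects each one, and the confirmed answer is folded back into the sample database so later documents benefit. Again, all names here are hypothetical, not drawn from the IMPACT software itself.

```python
def review_conflicts(conflicts, sample_db):
    """Walk a human reviewer through flagged conflicts.

    conflicts: list of (glyph_image, proposed_text) the OCR flagged
    sample_db: the shared list of (glyph_image, confirmed_text) examples
    """
    for glyph, proposed in conflicts:
        answer = input(f"OCR proposes '{proposed}'. Press Enter to accept "
                       "or type the correct text: ").strip()
        confirmed = answer if answer else proposed
        # Every human decision becomes a new training example, which is
        # what makes the system adaptive rather than static.
        sample_db.append((glyph, confirmed))
```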
The second approach, put forward by Kae, Smith, and Learned-Miller, similarly eschews a reliance on hard, static databases of standardized typefaces. Their method is in some ways reminiscent of IMPACT's, but it goes further, taking a cryptographic tack that abandons a reliance on predefined characters altogether. It may seem an odd idea to build character recognition technology that doesn't use characters, but the methodology is nothing short of ingenious. The basic idea is to use processes similar to those used to break codes.
Needless to say, when you're spying on your enemies it's often far easier to intercept their communications than it is to actually decipher them. Nonetheless, any coding system, no matter how complex, has to correlate to some sort of meaning, and this meaning can be deciphered through an analysis of patterns. In the much simpler case of a pre-existing language that is merely badly written rather than outright encoded, the job gets easier still, because one can exploit letter frequency. As Kae et al. mention in the article, e is the most frequent letter in English. Given this, the symbol that shows up most often can be assumed to be an e, even if it would not be recognizable as such to the casual observer. Similarly, English has only a few one-letter words, namely a and I. Thus, if a glyph is detected in isolation, it can be assumed to be either an a or an I, and since those two symbols are relatively easy to distinguish, it becomes all the easier to steadily "decode" the document: compare the patterns of these glyphs to our knowledge of English frequencies, use that knowledge to decipher other instances of the same glyph in words, and thus decipher those words, and so on. To be sure, this process gets into a level of math that is well beyond me, but it is fascinating nonetheless.
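Here is a toy illustration of that frequency-based "decoding" step, assuming the page has already been segmented into glyph clusters and each cluster has been given an arbitrary ID. This is not the authors' algorithm, just the core intuition: rank cluster IDs by how often they occur and align that ranking with known English letter frequencies.

```python
from collections import Counter

# English letters ordered from most to least frequent (approximate).
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def guess_letters(cluster_ids):
    """cluster_ids: sequence of arbitrary glyph-cluster IDs from a page.

    Returns a provisional mapping {cluster_id: letter} based purely on
    relative frequency -- the starting point that word-level and context
    constraints would then refine.
    """
    counts = Counter(cluster_ids)
    ranked = [cid for cid, _ in counts.most_common()]
    return {cid: ENGLISH_BY_FREQUENCY[i]
            for i, cid in enumerate(ranked)
            if i < len(ENGLISH_BY_FREQUENCY)}
```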
So, in the end, you have on one hand a project that seeks to build a vast library of character recognition by analyzing a multitude of non-standard texts, with human correction reinforcing and improving the system so it can adapt and steadily become more accurate. On the other, you have one that approaches a document as a code to be deciphered, building its database of characters from the ground up with each document and relying on complex algorithms to divine the true contents of difficult-to-read documents. Both are still highly experimental, but together they could steadily allow us to digitize more and more of the vast historical backlog of documents that would otherwise remain isolated or even be lost to time. It may even be possible to synthesize the two, using these cryptographic methods to further decrease the need for human intervention in the IMPACT system while allowing it to adapt and learn even more quickly. It remains to be seen, but in ten or twenty years we may reach that dream point where nearly any document can be readily and quickly digitized and shared with the world.
Works Cited
"The Digitization of Historic European Texts". 2010. International Journal of Micrographics & Optical Technology. 28 (3). http://uncg.worldcat.org/oclc/680027844
Kae, Andrew, David A. Smith, and Erik Learned-Miller. 2011. "Learning on the fly: a font-free approach toward multilingual OCR". International Journal of Document Analysis and Recognition (IJDAR). 14 (3): 289-301. http://uncg.worldcat.org/oclc/750097978