yargo - Slashdot User

Comment OCR, a Primer (Score 1) 188

by yargo on Saturday June 17, 2000 @07:50AM (#997805) Attached to: From Paper to PDF?

I've worked on OCR software for a number of companies, from a Unix-based desktop OCR application at Vividata to a high-end form processing system at Oyster Software. Doing what you want to do is far from an uncommon wish. Doing it in an easy, systematic, scalable and open source way is just not a reality at this point.

To start with, you need a good OCR engine. I've used several that are very good (from Caere, Nestor, Mitek or CGK). These companies all offer libraries for putting together your own document processing engine. The libraries return the text, often return font and point size information, and even report a confidence value for what they return. You could use a full-blown app and try to wrap it, but OCR Shop from Vividata is the only app with a command line interface, which you'd need to handle any reasonable volume.
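To make the kind of result these libraries hand back more concrete, here is a minimal Python sketch. It is not any vendor's actual API: the `RecognizedWord` type, its field names, the 0.0 to 1.0 confidence scale and the 0.8 threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognizedWord:
    """One word as a hypothetical OCR engine might report it."""
    text: str
    confidence: float                 # assumed scale: 0.0 (guess) to 1.0 (certain)
    font: Optional[str] = None        # engines often, but not always, report this
    point_size: Optional[float] = None


def plain_text(words: List[RecognizedWord]) -> str:
    """Join the recognized words back into a single line of text."""
    return " ".join(w.text for w in words)


def low_confidence_words(words: List[RecognizedWord],
                         threshold: float = 0.8) -> List[RecognizedWord]:
    """Return the words a human should probably double-check."""
    return [w for w in words if w.confidence < threshold]


words = [
    RecognizedWord("Invoice", 0.97, "Times", 12.0),
    RecognizedWord("t0tal", 0.41, "Times", 12.0),
    RecognizedWord("due", 0.93),
]
suspects = low_confidence_words(words)
```

The point of keeping the confidence around is that a batch job can route only the doubtful words to a person instead of proofreading every page.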

From this, you can generate all sorts of output with the correct formatting. OCR Shop, which I worked on for Vividata, allowed output to many different formats, including HTML, Word and FrameMaker. Depending on the complexity of your document, you can do a fairly good job of outputting what you see on the page. Outputting to PDF wouldn't be all that hard. We set that up as an output format for our scanning software at Vividata. Granted, it was a CCITT G4 bitmap wrapped in a PDF shell, but Acrobat is about as close as you can find to a cross-platform image viewer that most people will have installed.

So how do you make it searchable? You can go the route of saving the bitmap image and searching on the accompanying OCR text output. This way, you get the formatting of the document right, but you end up using a lot more space. Or you can try to get the formatting right in the text document, fix up any typos (or not) and use that. Both have advantages.
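The first route (keep the bitmap, search the OCR text that goes with it) can be sketched in a few lines of Python. This is only an illustration under assumptions: the file names, the `build_index` and `search` helpers, and the simple "every query word must appear" matching are all invented here, not taken from any product mentioned above.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each token to the set of page images whose OCR text contains it.

    `pages` maps an image file name to the OCR text produced for that image.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for image_name, ocr_text in pages.items():
        for token in tokenize(ocr_text):
            index[token].add(image_name)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the images whose OCR text contains every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


pages = {
    "page1.tif": "Invoice number 42",
    "page2.tif": "Shipping invoice",
}
index = build_index(pages)
hits = search(index, "invoice 42")
```

The user searches the text but is shown the image, so OCR typos cost you some missed hits rather than a garbled document.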

I've gotten the itch on several occasions to put together an open source OCR program, with both command line and GUI interfaces. A lot of the pieces are already there. The best free OCR engine I know of is the NIST OCR Engine. It's a bit old, the code needs some polishing and one would need to train some memories for standard fonts, but it would make for a pretty nice little app. Then it's just a matter of creating some internal representation of the formatting and writing some output functions for the different types of output (HTML, PDF, RTF...). But my copious free time has not yet given me that opportunity.
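For what the "internal representation plus output functions" idea might look like, here is a rough Python sketch. The `Run` and `Paragraph` types and the two writers are hypothetical and far simpler than a real app would need (no fonts, columns, tables or images); only plain text and HTML are shown, with PDF and RTF left out.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Run:
    """A stretch of text that shares the same formatting."""
    text: str
    bold: bool = False
    italic: bool = False


@dataclass
class Paragraph:
    runs: List[Run] = field(default_factory=list)


def to_text(paragraphs: List[Paragraph]) -> str:
    """Plain text writer: drop formatting, separate paragraphs by a blank line."""
    return "\n\n".join("".join(r.text for r in p.runs) for p in paragraphs)


def run_to_html(run: Run) -> str:
    """Escape the text, then wrap it in <i> and/or <b> as needed."""
    out = escape(run.text)
    if run.italic:
        out = f"<i>{out}</i>"
    if run.bold:
        out = f"<b>{out}</b>"
    return out


def to_html(paragraphs: List[Paragraph]) -> str:
    """HTML writer: one <p> element per paragraph."""
    return "\n".join(
        "<p>" + "".join(run_to_html(r) for r in p.runs) + "</p>"
        for p in paragraphs
    )


doc = [Paragraph([Run("Total: "), Run("5 < 6", bold=True)])]
html_out = to_html(doc)
text_out = to_text(doc)
```

Each new output format is then one more function over the same structure, which keeps the OCR side from having to know anything about HTML or RTF.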
