Re: All OCR vendors are BATSHITE INSANE (Score 4, Informative)
I've used tesseract with ghostscript as a front end to OCR PDF documents. In my experience, tesseract is OK if you have pretty high-quality original images (300 DPI minimum), printed in standard fonts with pretty standard layouts (the newest versions mostly work OK with a basic two-column format). You'll still only get accuracy in the high-90% range (which sounds good but is actually pretty atrocious compared to high-end OCR systems that are well up into the 9's for reliability). Oh, and even though you specify a language, tesseract has very little contextual knowledge of what it is scanning, so you'll regularly see it run together two letters in properly spelled words to come up with misspelled words.
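For anyone who wants to try the same kind of setup, here is a rough Python sketch of a ghostscript-then-tesseract pipeline. It is not my exact script. The file names, the `page-%03d.png` pattern, the grayscale `pnggray` device and the helper names are all assumptions picked for illustration; the 300 DPI default is just the minimum mentioned above.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a grayscale PNG."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pnggray",             # assumption: grayscale is enough for printed text
        f"-r{dpi}",                     # render resolution in DPI
        f"-sOutputFile={out_pattern}",  # %03d is expanded to the page number by gs
        str(pdf_path),
    ]


def tesseract_cmd(image_path, out_base, lang="eng"):
    """Build a Tesseract command; the result is written to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, workdir):
    """Render a PDF to page images, OCR each page, and return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf_path, workdir / "page-%03d.png"), check=True)

    texts = []
    for png in sorted(workdir.glob("page-*.png")):
        out_base = png.with_suffix("")  # page-001.png -> page-001
        subprocess.run(tesseract_cmd(png, out_base), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

Calling `ocr_pdf("scan.pdf", "work")` needs both `gs` and `tesseract` on the PATH. Bumping `dpi` above 300 costs time and disk but is the first thing to try when the output looks bad.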
Oh, and you have to keep a blacklist of characters, since tesseract is absolutely in love with the letter A with the circle on top (Å), even when you tell it that you are specifically scanning English documents where you just have the plain ordinary letter "A". A few other characters are like that too.
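A minimal sketch of the blacklist idea, in Python. Tesseract has a `tessedit_char_blacklist` config variable that can be set with `-c` on the command line. The list of characters and the helper names here are assumptions for illustration, not my actual list. Depending on the tesseract version and engine the setting may not be honored, so the sketch also includes a crude post-filter as a fallback.

```python
# Hypothetical starting blacklist: A/a with ring above. Extend it with
# whatever other characters keep showing up in your own scans.
BLACKLIST = "\u00c5\u00e5"

# Fallback mapping applied to the OCR output if the blacklist is ignored.
FALLBACK = str.maketrans({"\u00c5": "A", "\u00e5": "a"})


def tesseract_cmd_with_blacklist(image_path, out_base, lang="eng", blacklist=BLACKLIST):
    """Build a Tesseract command that asks the engine to avoid the given characters."""
    return [
        "tesseract", str(image_path), str(out_base),
        "-l", lang,
        "-c", f"tessedit_char_blacklist={blacklist}",
    ]


def scrub(text):
    """Map blacklisted lookalikes back to plain ASCII letters after OCR."""
    return text.translate(FALLBACK)
```

The post-filter only makes sense if you really are scanning English-only text; on documents that legitimately contain those characters it would corrupt them.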
If, however, you leave the reservation of high-quality scans of standard black & white printed text with normal layouts, tesseract quickly turns into a lovely random noise generator.