Comment Re:Finally... NOT so final... (Score 2, Informative) 212
Actually, GOCR works very well (100%) on the image-based text
that some sites use to prevent screen scraping.
1. Download and save the image.
2. If it's a gif, convert it to a jpg.
gif2jpg -a tmp.gif
3. Reduce the colors to 2 (black & white).
djpeg -colors 2 -greyscale -dither none tmp.jpg > tmp.pnm
4. If there is a border, crop it off.
pnmcut a b c d tmp.pnm > OCR.pnm
(The dimensions a, b, c, d can be determined with any tool that reports an image's size. In general, removing 1 or 2 pixels from each edge gets rid of the border.)
5. OCR it.
gocr -n 1 OCR.pnm >> OCR.txt
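For step 4, here is a rough Python sketch of how the crop numbers could be worked out. It assumes a, b, c, d mean left, top, width, height (my reading of pnmcut's positional arguments, so check your man page), and pnm_size and crop_box are names made up for this example.

```python
def pnm_size(data):
    """Read (width, height) from the header of a PNM file given as bytes."""
    tokens = []
    for line in data.split(b"\n"):
        # Header comments start with '#' and run to the end of the line.
        tokens.extend(line.split(b"#", 1)[0].split())
        if len(tokens) >= 3:
            break
    magic_numbers = (b"P1", b"P2", b"P3", b"P4", b"P5", b"P6")
    if len(tokens) < 3 or tokens[0] not in magic_numbers:
        raise ValueError("not a PNM header")
    return int(tokens[1]), int(tokens[2])


def crop_box(width, height, border=2):
    """Return (left, top, new_width, new_height) with `border` pixels trimmed per edge."""
    new_width = width - 2 * border
    new_height = height - 2 * border
    if border < 0 or new_width <= 0 or new_height <= 0:
        raise ValueError("border does not fit the image")
    return border, border, new_width, new_height


# Usage: size comes from the file written in step 3.
# with open("tmp.pnm", "rb") as f:
#     box = crop_box(*pnm_size(f.read()), border=2)
```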
Of course, this is all automated within the screen scraper; I just broke it out here to explain the steps.
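To give an idea of what the automation might look like, here is a minimal Python sketch that chains the same commands with subprocess. This is not the actual scraper code. build_steps and run_steps are hypothetical names, the file names are the ones from the steps, and it assumes gif2jpg leaves its result in tmp.jpg and that djpeg writes the image to stdout.

```python
import subprocess


def build_steps(crop, gif="tmp.gif"):
    """Return (argv, output_file, open_mode) tuples for steps 2 through 5.

    `crop` holds the four pnmcut values a, b, c, d in the order used in step 4.
    """
    a, b, c, d = (str(value) for value in crop)
    return [
        (["gif2jpg", "-a", gif], None, None),
        # Assumption: djpeg sends the converted image to stdout.
        (["djpeg", "-colors", "2", "-greyscale", "-dither", "none", "tmp.jpg"],
         "tmp.pnm", "wb"),
        (["pnmcut", a, b, c, d, "tmp.pnm"], "OCR.pnm", "wb"),
        # "ab" mirrors the >> append in step 5.
        (["gocr", "-n", "1", "OCR.pnm"], "OCR.txt", "ab"),
    ]


def run_steps(steps):
    """Run each command in order, stopping at the first one that fails."""
    for argv, out_path, mode in steps:
        if out_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(out_path, mode) as out:
                subprocess.run(argv, check=True, stdout=out)


# Usage (needs the tools installed and tmp.gif already downloaded):
# run_steps(build_steps((2, 2, 196, 46)))
```

Keeping the command list separate from the code that runs it makes it easy to print the steps or swap one out when a site changes its image format.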
For CAPTCHAs, you have to demorph the severely distorted images after step 4, before you OCR them. I'm still working on the demorpher, but it's about 50% accurate now. Basically, it unstretches long strings of pixels to the average length of the other strings of pixels along the x and y axes. It works even better if you determine the angle of the pixel string and shrink along that, along with some rotation to the nearest x or y axis.
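A toy Python sketch of the unstretching idea, to make it concrete. This is only one simple reading of the description, not the real demorpher: it treats the image as a list of rows of 0/1 pixels with 1 as ink, clips every ink run longer than the image-wide average down to that average, then does the same on the columns. It skips the angle and rotation part, and all the names are made up for the example.

```python
from itertools import groupby


def runs(row):
    """Run-length encode a row: [0, 1, 1] -> [(0, 1), (1, 2)]."""
    return [(value, len(list(group))) for value, group in groupby(row)]


def unstretch_rows(image, ink=1):
    """Clip ink runs longer than the average ink run length, row by row."""
    lengths = [n for row in image for value, n in runs(row) if value == ink]
    if not lengths:
        return [list(row) for row in image]
    target = max(1, round(sum(lengths) / len(lengths)))
    shrunk = []
    for row in image:
        out = []
        for value, n in runs(row):
            out.extend([value] * (min(n, target) if value == ink else n))
        shrunk.append(out)
    # Rows now differ in length; pad on the right with background pixels.
    width = max(len(row) for row in shrunk)
    background = 1 - ink
    return [row + [background] * (width - len(row)) for row in shrunk]


def unstretch(image, ink=1):
    """Apply the clipping along x, then along y by transposing the image."""
    along_x = unstretch_rows(image, ink)
    transposed = [list(column) for column in zip(*along_x)]
    along_y = unstretch_rows(transposed, ink)
    return [list(row) for row in zip(*along_y)]
```

Padding on the right shifts everything after a clipped run to the left, which is crude. Something closer to what is described would shrink along the stroke angle instead.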