Slashdot
Google Pushes Open Source OCR
Posted by
Zonk
on Tue Apr 10, 2007 01:02 PM
from the google-has-taken-all-knowledge-to-be-its-provice dept.
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.

Sign of times to come? (Score:3, Interesting)
Re:Sign of times to come? (Score:5, Insightful)
(http://www.talklets.com/)
Build instructions are outdated (Score:2, Informative)
(Last Journal: Saturday January 20 2007, @10:37AM)
svn co http://ocropus.googlecode.com/svn/trunk/ ocropus
More build info; Ubuntu Feisty (Score:5, Informative)
(http://www.hyperlogos.org/ | Last Journal: Wednesday July 18, @08:19PM)
Here's some other information you might need/want to build this software; note that I am on Ubuntu Feisty.
To build tesseract-ocr you must install autoconf.
If you are smart you will figure out a way to omit ocropus/data-test-pages from your checkout. Do you really want to use their pages to test? No! You want to use your own data.
I built with gcc 4.1.2, YMMV. Some people have reported errors trying to just compile tesseract.
To build ocropus I ended up installing jam, libaspell-dev and libtiff-dev.
The goal of the project (Score:5, Insightful)
(http://www.puremango.co.uk/)
among other things, sure, but it's got to be a high priority for google.
Re:The goal of the project (Score:4, Insightful)
(Last Journal: Monday February 04 2002, @03:31PM)
Small price if it helps email spam. (Score:5, Insightful)
(http://kadin.sdf-us.org/ | Last Journal: Tuesday October 16, @01:46PM)
True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.
Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.
Re:captcha's (Score:5, Funny)
(http://blog.macb.net/ | Last Journal: Monday March 05 2007, @04:38PM)
You mean as in:
Describe what the following expression does in 30 words or less:
{"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}
Man, I'll never get into forum postings if they do that!
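For anyone who does want to answer the quiz: the expression reads as concatenation of two sets of strings, where every string on the left is joined with every string on the right. A minimal Python sketch of that reading follows; the function name concat_sets is made up for illustration and is not from the post.

```python
from itertools import product

def concat_sets(left, right):
    """Join every string in `left` with every string in `right`."""
    return {a + b for a, b in product(left, right)}

# The two sets from the expression in the comment above.
result = concat_sets({"ab", "c"}, {"d", "ef"})
print(sorted(result))  # ['abd', 'abef', 'cd', 'cef']
```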
Re:captcha's (Score:5, Funny)
(http://nystrom.nl/ | Last Journal: Sunday April 03 2005, @02:17PM)
{"ab", "c"}{"d", "ef"} = {"abd", "abef", "cd", "cef"}
Answer: Makes my head hurt...
*click* Access to MySpace granted, have a nice day.
Which forum were you talking about again?
Re:Small price if it helps email spam. (Score:5, Interesting)
(http://www.livejournal.com/~pxtl)
Captchas are by far the better solution.
The problem is that, long term, they will eventually be cracked. I'm imagining the ultimate solution will only be to allow users with email addresses from "approved" major ISPs.
Well... (Score:5, Informative)
required_hits 5
score SARE_GIF_ATTACH 5
I don't see image spam any more. I resorted to that after I was getting a hundred or so of them a day.
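Why those two lines work, roughly: SpamAssassin adds up the scores of every rule a message hits and flags the message once the total reaches the threshold, so giving a single rule a score equal to the threshold makes that rule decisive on its own. The sketch below is a simplified model of that arithmetic, not SpamAssassin's actual code; HYPOTHETICAL_RULE and its score are invented for contrast.

```python
# Simplified model of threshold-based spam scoring.
REQUIRED_HITS = 5.0  # threshold, from the config lines above

RULE_SCORES = {
    "SARE_GIF_ATTACH": 5.0,    # score from the config lines above
    "HYPOTHETICAL_RULE": 1.2,  # invented rule, for illustration only
}

def is_spam(rules_hit):
    """Return True when the summed scores of the matched rules reach the threshold."""
    total = sum(RULE_SCORES.get(rule, 0.0) for rule in rules_hit)
    return total >= REQUIRED_HITS

print(is_spam(["SARE_GIF_ATTACH"]))    # the GIF rule alone reaches the threshold
print(is_spam(["HYPOTHETICAL_RULE"]))  # a low-scoring rule alone does not
```

The trade-off is that any legitimate mail hitting that rule is flagged too, which is presumably acceptable if nobody sends you GIF attachments on purpose.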
The CAPTCHA solution (Score:4, Interesting)
(http://www.rogertheshrubber.net/)
To defeat *this*, you would need someone with a greater command of the English language than simple recognition of characters, or very advanced image recognition software. I wouldn't worry about the software anytime soon if you chose images carefully.
Re:The goal of the project (Score:4, Interesting)
(http://www.ajs.com/~ajs/)
among other things, sure, but it's got to be a high priority for google.
Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.
I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
Re:The goal of the project (Score:5, Insightful)
Why Google wouldn't want this:
1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.
IANIGHQ (In Google's HQ) but I don't see the value of getting sued at this point in time. Besides, if Google is doing this under appropriate conditions there shouldn't be concern of suits - but I suppose their Chinese plagiarism case doesn't support this point.
Re:The goal of the project (Score:4, Insightful)
(http://www.ajs.com/~ajs/)
1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
So much for captcha (Score:2, Redundant)
(Last Journal: Friday November 10 2006, @02:16PM)
Re:So much for captcha (Score:4, Informative)
(http://127.0.0.1/ | Last Journal: Thursday September 20, @12:52PM)
The beginning of the end? (Score:4, Insightful)
Re:The beginning of the end? (Score:5, Insightful)
(http://www.puremango.co.uk/)
If the text is parsable, it takes nothing to google it.
I mean, those two examples you give; just slap it into google and screenscrape it. So you're going to need harder questions than that.
So the next generation of crapchas will ask "what color is the sky".
Go and take a glance at ultraHal or another relatively advanced NLP AI; a large knowledgebase is not hard to construct. When it doesn't know, it guesses. If it gets it right, then the knowledgebase increases by one fact.
So then, what, you have to ask "Given that all bleeps are blue, and blank is a bleep, is blank blue?"
Not only is that also easily computationally solved, but also a lot of people aren't going to be able to answer (smartass questions about stopping spam and idiots aside)
So *then* I suppose you have to ask "In the first mathematical antinomy, does Kant conclusively prove both that there can have been no beginning to time and that there must have been a beginning to time?"
and give the user a 255 character textarea to put their answer in.
So... please, text-question-based captchas are DOOMED TO FAIL. Stop thinking that they could work. They can't.
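To back up the "easily computationally solved" claim for the bleep question: once the sentence is parsed into facts, answering it is just following links. The sketch below assumes the parsing has already been done by hand (that is the harder part); the function and dictionary names are invented for illustration.

```python
def entails_property(individual, prop, is_a, all_are):
    """Follow 'X is a Y' and 'all Y are Z' facts to see whether `individual` reaches `prop`."""
    seen = set()
    frontier = [individual]
    while frontier:
        term = frontier.pop()
        if term == prop:
            return True
        if term in seen:
            continue
        seen.add(term)
        # Queue up everything this term is known to be.
        frontier.extend(is_a.get(term, []))
        frontier.extend(all_are.get(term, []))
    return False

# "All bleeps are blue, and blank is a bleep."
is_a = {"blank": ["bleep"]}
all_are = {"bleep": ["blue"]}

print(entails_property("blank", "blue", is_a, all_are))  # is blank blue?
```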
Re:The beginning of the end? (Score:5, Informative)
(http://lawpoop.blogspot.com/ | Last Journal: Friday May 28 2004, @06:51PM)
Part of what makes OCR work is that it assumes that the text was written to communicate meaning. It has regular characters in alignment, forming common words and abbreviations in more or less grammatical sentences with close-to-proper punctuation.
A good captcha has a nonsense string of characters in various cases, all skewed and distorted, with extra geometric elements obscuring the characters. This renders unavailable somewhere around half of the clues that an OCR uses.
the presidential papers (Score:4, Funny)
(http://www.atomjax.com/)
So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.
Finally... (Score:3, Interesting)
(http://www.phoboid.net/)
Re:Finally... (Score:5, Insightful)
(Last Journal: Friday January 03 2003, @03:39PM)
Captchas (Score:2)
So will something like this eventually render captchas used as a security/anti-spam measure obsolete?
Not like something wasn't bound to eventually come out to counter that idea, anyway.
Very cool. (Score:5, Insightful)
(http://kadin.sdf-us.org/ | Last Journal: Tuesday October 16, @01:46PM)
There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR [sourceforge.net] and a commercial offering called OCRShop and, at least among the ones I've run across, that's about it. Nothing is really on par with Omnipage or the commercial packages for other platforms.
I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there are really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.
Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.
Orcopus? (Score:4, Funny)
Level: 15
Race: Fell Marine
HP: 290/290
EP: 200/200
Water elemental
Drops: Tentacle
Wonderful! (Score:4, Insightful)
One thing leads to another... (Score:2, Insightful)
captchas (Score:5, Insightful)
(http://evil.google.com/)
Re:captchas (Score:4, Informative)
Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.
searchable pdfs (Score:5, Interesting)
(http://www.music.mcgill.ca/~sinclair)
(Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)
Perhaps this library could be used to build such an application if none exists...
Language? (Score:5, Interesting)
Re:Language? (Score:5, Funny)
(http://rustyp.freeshell.org/ | Last Journal: Tuesday April 29 2003, @09:22AM)
Remember kids, there are no stupid questions.
Only people who don't RTFA who ask questions.
Mathematics? (Score:1)
(http://obsessivemathsfreak.org/ | Last Journal: Friday June 09 2006, @08:15PM)
Open Source Ballot Scanning! (Score:2)
(http://bolson.org/ | Last Journal: Friday May 20 2005, @03:44PM)
Comics (Score:3, Interesting)
(http://www.dieblinkenlights.com/)
I would love that!
Sheesh.... (Score:1)
(http://home.comcast.net/~rickrich1/)
if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -g -O2 -MT tface.o -MD -MP -MF ".deps/tface.Tpo" -c -o tface.o tface.cpp; \
then mv -f ".deps/tface.Tpo" ".deps/tface.Po"; else rm -f ".deps/tface.Tpo"; exit 1; fi
make[3]: *** [tface.o] Error 1
make[3]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/rick/tesseract-ocr'
Where's the Package? (Score:2)
(http://slashdot.org/~Doc%20Ruby/journal | Last Journal: Thursday March 31 2005, @01:48PM)
clara - Free OCR program for Unix Systems
gocr - A command line OCR
ocrad - Optical Character Recognition program
unpaper - post-processing tool for scanned pages
Will this Google OCR really work, and can I install it with APT?
Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Recognition? How come spelling, grammar and phrase frequency (including typos etc) aren't used to error correct at a symbolic level higher than pixels?
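A rough sketch of what word-level correction could look like, as a post-processing pass over the raw character output: each token is snapped to the closest entry in a word list when one is similar enough. This uses Python's standard difflib and is only an illustration of the idea, not how any of the packages above work; the tiny vocabulary and the sample tokens are made up.

```python
import difflib

# Toy vocabulary; a real system would load a full dictionary
# and weight candidates by word and phrase frequency.
VOCAB = ["the", "son", "time", "beware", "sword"]

def correct_word(token, vocab=VOCAB, cutoff=0.6):
    """Replace `token` with its closest vocabulary word, or keep it if nothing is close."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

def correct_line(line):
    """Apply word-level correction to each whitespace-separated token."""
    return " ".join(correct_word(token) for token in line.split())

print(correct_line("thg sqn timg"))  # the son time
```

The catch is the cutoff: set it too low and real but unusual words get "corrected" into dictionary words, which is exactly the wrong behavior for names or nonsense verse.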
Re:Where's the Package? (Score:4, Insightful)
(http://www.hyperlogos.org/ | Last Journal: Wednesday July 18, @08:19PM)
Just for you, I made one, because I'm that fucking cool.
Prepare to be unimpressed, because Results follow:
JABBERWOCKY Lewis Carroll
(from Through the Looking-Glass and What Alice Found There, 1872) `Twas bri11ig,_ andjghe 4s1it_hy toyes Digl gyre amid gimblejn thg wabe: All xiiimsy wei^e thg borogovgs, And theamome raths outigrabe. ''ggwqre thg Jalgbervvpck,_my sqn! The jaw; that bijtel the clayksathat catch! Bgyvaiie the Jubjub bird, anti shun The frumidus Bandersnatch!' I-Ie took his yorpal sword in hand: Long timg tlgewmangome foe he sought So rgSted he by the Tu_mtum tree, And stood awhile in thought. And, as in uffish thought he stood, The Jalgbgjwoclg, with eyes of flame, Cqmgwhjfflixgg through fhe tulgey wood, And burbled as it came! Qne, two! One, two! And through and thIi`Ollgh The jrorpgal b]ade went; snicaker-snack! I-Ie left iifdead, and with its head He went galumphing back. ''And, has thou slain thejabbexfwpck? Cpmg to my a_rxps!_my ljgaxjgishboyl Ojralqjousi dwgy! Qalladhl Callayl' He chortled in his joy. S
\ A S
X A ?`^s :
, ' Was ga. ka%#* mm. -- M 1 1 Q at ) a iv 2. `Ail A it 3*,* `i 2 (V H ;. ````( * 4 ^Nq@ Eu..*s..%im X M is ? lgh ~ ``A? S [ A Fax I /),2*gE it ^`* 4 ~ *: ' X A mg x ix, ,t~;;;..: v' it ix '~ t ~ ^ ,4~ ---= =-^ A A i gv ; * XX, x> . . N S A ft 1 A-`A 3; `> ' ''YY \Jh ^***`(?i* , ~~ x `* at -;v- *<~ ' H ~~~-=.- ; `Twas bri11ig,_ and_the 4s1it_hy toyes Dig gyre arid gimblejn the wabe; All Qiixjnsy wei^e thq borogovgs, And thdmome raths outvgrabe.
dshaw@iabbenNockv.com
Return to Glorious Nonsense Return to Lewis Carroll
Results End.
Beautiful, eh? I also tried a 100 dpi grayscale scan, which came out even more like hash (one big paragraph) and a 300 dpi bitmap (1bpp) which was about the same as the 100 dpi gray scan in quality, though a bit better.
Looks like ocropus has a while to go before it can slay the Jabberwock instead of thejabbexfwpck.
Apache license is incompatible with GPLv3 (Score:2)
Even though v3 no longer has the anti-google Affero provisions, Google still chooses Apache instead of GPLv3 or even v2 with a rider to upgrade to v3. You gotta believe the Google lawyers were thinking about this issue before release...
Ocropus? (Score:3, Funny)
Advancement in AI (Score:1)
It is interesting that web forms have become a measure of AI strength in the world wide web. As soon as Captchas are largely solved, there will be new and improved human tests. I am guessing the next step will be identifying logos, or some sort of symbol. Eventually that problem will be solved too. So what do we do when we can't tell a human from a machine?
Please send me your registered DNA sequence, a voice recording reading this message, and a picture of you in the current location...
I guess a central database of information (identification and secure communication channel) is going to be the only way to ensure you are who you say you are.
Eventually I guess it won't really matter if you are human.
Patents? (Score:2)
Re:From? (Score:3, Funny)