Google Releases Tesseract as Open Source
An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at HP Labs from 1985-1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at SourceForge."
I take back every bad thing I said about Google (Score:5, Interesting)
Sonny Bono pwned Gutenberg (Score:2)
Should we praise technology that helps Project Gutenberg run out of pre-1923 books faster? Once all notable pre-1923 books are scanned, OCR'd, and cleaned up, then what does PG do?
Un-Finishable (Score:5, Interesting)
Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can. Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration. I think this would be covered by fair use law even if the work was still protected. Perhaps this sort of archival work is not exactly the aim of PG, but it's still critically important.
With that said, I don't mean to in any way excuse the disgusting abuse of our political and legal system that was and is the "Sonny Bono Copyright Term Extension Act." That thing is a disgusting example of pretty much everything that's wrong with our government today.
Re:Un-Finishable (Score:5, Insightful)
Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)
I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...
Re: (Score:2, Interesting)
One enormous area I'm personally interested in is sheet music. Some of the music I'm interested in playing came out of copyright decades or even centuries ago. No one is going to reclaim copyright on Mozart's Requiem, for instance. Yet it is by and large not available to the public because translati
Re: (Score:3, Informative)
This is just so untrue. In the United States (the only place that Project Gutenberg worries about), nothing is entering the public domain except unpublished manuscripts whose authors died 70 years ago. Nothing else will enter the public domain until 2019. Congress has effectively frozen the public domain.
Re: (Score:3, Informative)
Just because it's not common (or likely) doesn't mean it can't happen.
Chastity Bono's next step is life+100 (Score:3, Insightful)
Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.
Re: (Score:2)
Re: (Score:2)
Indeed it has. And as their scanning FAQ [gutenberg.org] explains, they recommend you buy an OCR software package. I'm all for having the right tools for the job, even if it means going non-OSS, but if these packages are available for free, it encourages more people to participate. Surely that's a good thing?
Re: (Score:2)
Re: (Score:2, Interesting)
as some commercial ones out there. ABBYY FineReader seems to be the OCR software of choice for Distributed Proofreaders, at least.
Tesseract just has ASCII support (for now, as they like to add), so it ignores italics, accents, etc.
In the case of the book I'm working on, it had a very hard time with the ff ligature and had some trouble with b and c: "but" became "hut", "he" became "be", and c was often read as an o or e.
The words diffi
Re: (Score:2, Informative)
Anti-spam (Score:3, Interesting)
Re: (Score:2)
Let me let you in on a little secret. CAPTCHAs were broken a long time ago. They're the equivalent of writing your password on a sticky note and putting it under your keyboard.
I recommend authenticating people with strong cryptography, which is how people can post to my blog [jrock.us].
I call bullshit (Score:5, Interesting)
And after all, it's not about authentication, it's about making a service accessible only for humans.
BTW, it's funny that you praise your own cryptography solution in your blog, but it's obvious that you have the problem of replay attacks, you even mention it in the "caveat" section below the text box.
Re: (Score:2)
The human brain is NOT capable of coping with an arbitrary level of distortion. Many people have remarked that recent captchas are sometimes difficult to read due to the very heavy distortion.
This is true at least for letters and numbers. "Pictures of things" might do better, but they require an enormous amount of work compared to a little program spitting out JPGs of text.
Re: (Score:2)
To lay this out clearly: human recognition is still much better than that of computer programs, and that's what CAPTCHAs exploit. Generally, every AI-hard problem can be used for distinguishing between humans and computers, which also means that every time a CAPTCHA built on an AI-hard problem has been broken, an AI-hard problem has been solved (provided no implementation errors have been used to bypass the need of solving the actua
Re:I call bullshit (Score:4, Informative)
Re: (Score:2)
Re: (Score:3, Interesting)
Yes, absolutely, and spammers are already using image obfuscation techniques: using italic difficult-to-read fonts spaced very close together (difficult to separate the image into individual characters and difficult to identify each character once you do), using colored backgrounds to make the text very low-contrast when converted into a monochrome image the OCR
improvements (Score:5, Funny)
i.e., added AdSense to the OCR output.
Hoping OCR will improve? (Score:3, Insightful)
Re: (Score:2)
More likely the computer vision research community, actually. "Many eyes" help a lot with bugs and bugfixes, but, ironically, not so well on nontrivial vision tasks.
Finally! (Score:3, Funny)
(Credit to S.G.)
From the Project (Score:5, Insightful)
> It was open-sourced by HP and UNLV in 2005.
So Google basically did what? Fix bit-rot? Google has re-released some open source code, essentially forking off the original?
> License: (None Listed)
I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see, after I see some code, is a license.
So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.
Re: (Score:3, Informative)
License (Score:3, Informative)
This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and
at Hewlett Packard Co, Greeley Colorado, the majority of the code
in this distribution is now licensed under the Apache License:
** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the Licen
Re: (Score:2)
Re: (Score:3, Interesting)
Anybody know how important this headache library is to the software, and how easily replaced it is?
Re: (Score:3, Informative)
I'm sorry Dave... (Score:5, Funny)
Yeah, but how is it on lip-reading? That's when we really need to worry.
Re: (Score:3, Interesting)
Given that my laptop has a microphone, I was a bit worried about the recent article on Google sampling sound on people's computers. But my wife's laptop also has a webcam. Should I tell my wife not to google in bed? If the mic is off, will they still catch what she is talking about?
Dave why don't you take a stress pill and lie down. If you are looking for something to read there is always google news.
Hosting (Score:5, Interesting)
Re:Hosting (Score:5, Funny)
Re: (Score:2)
Re:Hosting (Score:5, Funny)
i hope it can augment the SpamAssassin OCR plugin (Score:2, Informative)
Yay! (Score:2)
No binaries! Only source code! Good luck getting it to compile on Windows; I gave up after I got several dozen obscure errors I had never seen before from the compiler.*
* If anyone can get VC++2K5 to compile it, please post.
No luck for OS X either (Score:2)
Re: (Score:2)
my thoughts (Score:4, Interesting)
I just checked out tesseract. One thing I have to look at more is the license. It appears to be the Apache license, which seems like a decent free license. But it also includes MITRE's aspirin. I'm not sure how dependent it is on aspirin and what the license restrictions of aspirin are.
The two best free OCR engines out right now are clara and gocr. While they are the best, they are not that great yet. I just ran the same tiff I had run with those two (I also have the document in pbm and other formats). Tesseract did not read it, it bailed with "IMAGE::check_legal_access:Error:Can't seek backwards in a buffered image!"
Clara and GOCR are written in C, Tesseract is written in C++, a language I don't know. Tesseract did well in the UNLV challenge so it probably has some good features. It does say it has no page layout analysis though.
Hopefully this can be improved, or good parts of it can be borrowed and incorporated into gocr or clara. It couldn't handle my test that both clara and gocr could, but it probably has strengths the other two don't. One day hopefully we'll have a free OCR that handles things as automagically as the commercial ones do. I will see what I can contribute to that as well. Although this is C++ and I don't know that language.
Re: (Score:2)
Vividata works quite well (Score:2, Interesting)
HP decided to get out of the OCR business? (Score:5, Funny)
Actually, shortly thereafter, HP decided to get out of the technology innovation business, and into the printer ink business.
W0W1 (Score:3, Funny)
THAHKS, G00GLL!1!!!
What about "rough ocr" (Score:2)
Lately I've been thinking about computerizing these documents into a web based system, so that any of the club executive can search and pull out a document they need etc, we could also flag documents as "general release" so that people
Re: (Score:3, Insightful)
Re: (Score:2)
Non-English Charsets? (Score:4, Interesting)
Re: (Score:3, Informative)
License issue: not free software (Score:2, Interesting)
The piece in question is a neural
Test example of tesseract. (Score:2, Interesting)
Input image: it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. Having sat on the shelf gathering dust for so many years, Google cleaned up some of the more outdated portions of the code
Output text: ii has been lamed as one of lhe mos! accurale Oplical Characler Recognilion (OCR) programs available. Having sat on lhe shelf galhering dusk for so many ye
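For anyone who wants to put a number on a result like this, the usual yardstick is character error rate: edit distance between the true text and the OCR output, divided by the length of the true text. Below is a minimal sketch in plain Python. The function names are my own (nothing here is part of Tesseract), and the two strings are just the opening words of the input and output quoted above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Opening words of the input image and of the OCR output quoted above.
reference = "it has been touted as one of the most accurate"
hypothesis = "ii has been lamed as one of lhe mos! accurale"

print(f"character error rate: {char_error_rate(reference, hypothesis):.1%}")
```

Running the same measurement on a screenshot and on a real scanned page would give a fairer comparison than eyeballing the output.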
Re: (Score:2)
meh. a _screenshot_ contains perfectly regular characters - if it can't ace _that_ then I don't _want_ to see what it does with a scanned page.
Re: (Score:3, Interesting)
License? (Score:2)
An interesting demonstration (Score:2)
Google's business interest in releasing this as open source is obvious: the greater the value of the materials available to the Internet, the greater the value of its service.
Re:As much as I like open source software ... (Score:5, Informative)
Re: (Score:2)
Re:As much as I like open source software ... (Score:4, Insightful)
Re: (Score:3, Funny)
Comment removed (Score:4, Insightful)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Insightful)
Re: (Score:3, Informative)
trivial about generating good binary images from images taken in the field (in my case, images of boxes moving down a conveyor belt or hand imaged by workers). Even if you disregard such problems as uneven lighting, glare, and distortion due to the unavoidable vibration inherent to plant settings, most forms that are interesting to OCR are handwritten and not designed to be OCR friendly. Hopefully this will change as the peop
Re:As much as I like open source software ... (Score:5, Funny)
Re:As much as I like open source software ... (Score:5, Insightful)
Re: (Score:3, Informative)
Image spam (Score:3, Interesting)
If not because of spam, then because of the idiotic format. Images are for illustrations, but using them to transfer major amounts of text is just stupid and inefficient.
Re: (Score:3, Insightful)
Re:As much as I like open source software ... (Score:4, Funny)
Re: (Score:2, Insightful)
The SCAA must be the ones responsible for not letting Java be open sourced.
Re: (Score:2, Flamebait)
Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).
and also the GNAA (Gay Nigger Association of America)
Don't ask me what's my point in mentioning this because I have no fucking idea :-) have a good day!
Re: (Score:3, Insightful)
NFB owns you (Score:5, Interesting)
They're already useless if installing one will subject your business to boycotts and/or lawsuits from the National Federation of the Blind [nfb.org] and other advocates for people with disabilities.
Re:NFB owns you (Score:5, Informative)
Re: (Score:2)
I'm blind and deaf, you insensitive clod!
(Not really, but someone could be...)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2, Funny)
I'm tired of all of the anti-Americanism on /. If you want to exclude Americans from your site, go ahead; but don't rub our noses in it.
Re: (Score:2, Funny)
Audible captchas (Score:2)
Re: (Score:2)
Re: (Score:3, Interesting)
Two reasons (Score:5, Insightful)
The other constraint is that your problem has to be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwitch" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.
By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/ [hotcaptcha.com]
Re: (Score:2)
Since you ask, here's why: (Score:4, Insightful)
1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spam to avoid the CR altogether.
3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
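Point 2 is easy to see in a toy simulation. The sketch below is hypothetical throughout: `ChallengeResponder`, `simulate`, and the second address are made up for illustration and do not model any real CR product. Two mailboxes each challenge every sender they have not whitelisted, so each challenge triggers a counter-challenge until something outside the protocol (here, a message cap) stops it.

```python
from collections import deque


class ChallengeResponder:
    """Toy mailbox that challenges every sender it does not already trust."""

    def __init__(self, address):
        self.address = address
        self.whitelist = set()

    def receive(self, sender, subject):
        """Return a challenge message (sender, recipient, subject), or None
        if the sender is whitelisted and the mail is simply delivered."""
        if sender in self.whitelist:
            return None
        return (self.address, sender,
                f"Re: {subject} -- please prove you are a human")


def simulate(max_messages=10, whitelisted=False):
    """Count messages exchanged after one initial mail from bob to steve."""
    bob = ChallengeResponder("bob@example.com")
    steve = ChallengeResponder("steve@example.org")  # placeholder address
    if whitelisted:
        bob.whitelist.add(steve.address)
        steve.whitelist.add(bob.address)
    boxes = {bob.address: bob, steve.address: steve}

    queue = deque([(bob.address, steve.address, "Hello")])
    delivered = 0
    while queue and delivered < max_messages:
        sender, recipient, subject = queue.popleft()
        delivered += 1
        reply = boxes[recipient].receive(sender, subject)
        if reply is not None:
            queue.append(reply)  # the challenge is itself mail from a stranger
    return delivered


print(simulate(max_messages=10))                    # only the cap ends it
print(simulate(max_messages=10, whitelisted=True))  # trust ends it at once
```

The only thing that ends the exchange early is prior trust, and any automatic way of granting that trust is exactly the shortcut a spammer would go looking for.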
Re:As much as I like open source software ... (Score:4, Insightful)
In order to generate it, you're going to end up using a grammar.
Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.
Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.
The image tests have the advantage that done properly, it takes more than just patience and computer science fundamentals to crack, it would require fundamental advances in the art.
(Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)
Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.
(You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
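To make the "running grammars in reverse" point concrete, here is a toy sketch. Everything in it is made up for illustration (the templates, the word lists, the `generate` and `solve` names); it is not taken from any real text-CAPTCHA system. The generator expands a tiny grammar into a question, and the solver is little more than the same grammar written as a regular expression.

```python
import operator
import random
import re

NUMBER_WORDS = ["zero", "one", "two", "three", "four",
                "five", "six", "seven", "eight", "nine"]
OPERATIONS = {"plus": operator.add,
              "minus": operator.sub,
              "times": operator.mul}
TEMPLATES = ["What is {a} {op} {b}?",
             "Please compute {a} {op} {b}.",
             "How much is {a} {op} {b}?"]


def generate(rng):
    """Expand the grammar once: return (question text, expected answer)."""
    a, b = rng.randrange(10), rng.randrange(10)
    op = rng.choice(sorted(OPERATIONS))
    question = rng.choice(TEMPLATES).format(
        a=NUMBER_WORDS[a], op=op, b=NUMBER_WORDS[b])
    return question, OPERATIONS[op](a, b)


# The "reverse" grammar: one pattern covering every sentence generate() emits.
_NUM = "|".join(NUMBER_WORDS)
_OP = "|".join(OPERATIONS)
PATTERN = re.compile(rf"\b({_NUM}) ({_OP}) ({_NUM})\b")


def solve(question):
    """Parse a generated question and compute its answer."""
    match = PATTERN.search(question)
    if match is None:
        raise ValueError("question is outside the known grammar")
    a_word, op, b_word = match.groups()
    return OPERATIONS[op](NUMBER_WORDS.index(a_word),
                          NUMBER_WORDS.index(b_word))


rng = random.Random(0)
question, answer = generate(rng)
print(question, "->", solve(question), "(expected", str(answer) + ")")
```

A richer grammar only makes the pattern longer; the attacker's job stays mechanical as long as the full set of question shapes can be enumerated by sampling the generator.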
Re: (Score:2, Interesting)
My wife would fail this test. My father will fail this test. My step-mother will fail this test. My children will fail this test.
A computer will very easily get this test right one time in 26.
In one word: Useless.
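Taking the one-in-26 figure at face value, and assuming each guess is independent and failed guesses cost the bot nothing (both are assumptions), the arithmetic for a blind-guessing bot can be sketched like this. The function names are made up for the example.

```python
import math


def success_probability(choices, attempts):
    """Chance that at least one of `attempts` independent blind guesses
    hits the right answer out of `choices` equally likely options."""
    return 1.0 - (1.0 - 1.0 / choices) ** attempts


def attempts_needed(choices, target):
    """Smallest number of blind guesses whose combined success chance
    reaches `target` (a probability strictly between 0 and 1)."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - 1.0 / choices))


for attempts in (1, 10, 100):
    print(f"{attempts:3d} guesses: {success_probability(26, attempts):.3f}")
print("guesses for an even chance:", attempts_needed(26, 0.5))
```

A bot submitting thousands of forms does not need to understand the question at all; only a lockout or a per-attempt cost changes the picture.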
No Wrinkle in Time comments? (Score:2, Interesting)
Re: (Score:2)
Captchas are designed to be difficult to OCR. Besides there are plenty of OCR apps around already, if you hadn't noticed. I don't think spammers have been holding out for a GPL one.
Re: (Score:2)
Re: (Score:3, Funny)
You're just not avant-garde enough.
Re:Music OCR (Score:4, Interesting)
I really should ask google to help buy this technology and set it free.
Re: (Score:2)
Re: (Score:2)
outputFilename.raw #???
outputFilename.map # seems to be a location map of 0/1's where 1's are valid text and 0's aren't
outputFilename.txt # the text from the OCR event
I also found that the tessdata directory did not get installed into the
Without "batch", it tries to bring up an X wind
Re: (Score:2)
Re: (Score:2)