Google Releases Tesseract as Open Source

An anonymous reader writes "Google recently released Tesseract as open source. Originally developed at HP Labs between 1985 and 1995, it has been touted as one of the most accurate Optical Character Recognition (OCR) programs available. After it sat on the shelf gathering dust for many years, Google cleaned up some of the more outdated portions of the code and released it for general consumption. You can download Tesseract over at Sourceforge."

  • by smileytshirt ( 988345 ) on Monday September 04, 2006 @11:34PM (#16041733) Homepage
    My guess is that they are doing this in the hope the open source community will build on and improve OCR technology. This would be in Google's interest, as it can then index text from images (such as their own Books project) more accurately and efficiently.
  • by Carthag ( 643047 ) on Monday September 04, 2006 @11:35PM (#16041739) Homepage
    OCR is most effective when the letter boundaries are clear and well-defined, such as fixed-width text, or text that is at least on a straight line. Most CAPTCHAs put the letters on a curved path, as well as distorting the letters so they are no longer within a clearly defined rectangular shape. This makes it very hard to identify which parts of the images are letters and which parts are not, making OCRing CAPTCHAs a non-trivial problem.
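
The point about letter boundaries can be illustrated with a minimal sketch. It assumes a naive segmenter that splits a black-and-white bitmap wherever a column contains no ink; this is a generic textbook approach, not a description of how Tesseract works. The function name and the two tiny bitmaps are invented for illustration.

```python
def segment_columns(bitmap):
    """Split a binary bitmap into glyph spans using empty columns.

    bitmap is a list of equal-length strings; '#' is ink, '.' is background.
    Returns (start, end) column ranges, end exclusive.
    """
    width = len(bitmap[0])
    inked = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    spans, start = [], None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                  # a glyph begins at this column
        elif not has_ink and start is not None:
            spans.append((start, x))   # an empty column closes the glyph
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Illustrative input: two blocks on a straight line with a clear gap.
CLEAN = [
    "##..##",
    "##..##",
    "##..##",
]

# Illustrative input: slanted ink, so no column is ever empty.
DISTORTED = [
    "###...",
    ".####.",
    "...###",
]

if __name__ == "__main__":
    print(segment_columns(CLEAN))      # [(0, 2), (4, 6)]
    print(segment_columns(DISTORTED))  # [(0, 6)]
```

On the clean bitmap the gap yields two spans; on the slanted one the segmenter sees a single blob, which is the failure the comment describes.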
  • From the Project (Score:5, Insightful)

    by Gopal.V ( 532678 ) on Monday September 04, 2006 @11:43PM (#16041772) Homepage Journal

    > It was open-sourced by HP and UNLV in 2005.

    So Google basically did what? Fixed bit-rot? Google has re-released some open source code, essentially forking the original?

    > License: (None Listed)

    I'm a fan of the FOSS idea. Basically, it makes sure that the whole work to which I contributed always remains available to me (and others). It might not always work for a company, but as a developer it makes sense to me. And the second thing I need to see, after I see some code, is a license.

    So explain to me how exactly this is open source (other than the "compile, but don't touch" version of it) and *then* I might think of downloading it and probably fix a few bugs or write docs.

  • by djtack ( 545324 ) on Monday September 04, 2006 @11:51PM (#16041822)
    Plus, good OCR could help recognize image spam (where they send the text in an image attachment, to avoid filtering, and fill the message body with "bayes poison").
  • Two reasons (Score:5, Insightful)

    by patio11 ( 857072 ) on Tuesday September 05, 2006 @12:49AM (#16042108)
    You've got two constraints. One is that you have to be able to compose an arbitrarily large number of CAPTCHAs algorithmically. For example, the example you just used is human-composed. If it's the only CAPTCHA you have, the following program gets me a job at Google: gawk 'BEGIN{print "b"}'. If you have 100 CAPTCHAs, I only need to add a switch statement and some elbow grease, and then I get to break your CAPTCHA a trillion times.

    The other constraint is that your problem has to be trivially solvable by humans. I know plenty of people who cannot solve the CAPTCHA you have given: one obvious example would be, umm, all of my coworkers, because I live in Japan and "sub sandwitch" is not generally on the Japanese English curriculum. Similarly, you could pose any number of parsing problems which are very difficult for machines ("Here are 10 pictures chosen from HotOrNot. Click the three hot chicks.") but which may also be difficult for some users, such as Slashdotters who have never met a girl before.

    By the way, you can find an implementation of that CAPTCHA at http://www.hotcaptcha.com/ [hotcaptcha.com]
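
The "switch statement and some elbow grease" attack on a fixed pool of hand-written challenges can be sketched roughly as follows. The questions, answers, and function names are all invented for illustration; the sketch assumes the attacker has had a human solve each challenge once.

```python
import random

# Hypothetical fixed pool of hand-written challenges (question -> answer).
# These entries are made up; they do not come from the thread.
CHALLENGE_POOL = {
    "What is two plus three?": "5",
    "Which is larger, 7 or 4?": "7",
    "What color is a ripe banana?": "yellow",
}


def serve_challenge(rng):
    """Site side: pick one of the fixed challenges at random."""
    return rng.choice(sorted(CHALLENGE_POOL))


def bot_answer(question, harvested):
    """Attacker side: no parsing at all, just a table lookup."""
    return harvested.get(question)


def break_rate(trials, seed=0):
    """Fraction of served challenges the lookup-table bot answers correctly."""
    rng = random.Random(seed)
    harvested = dict(CHALLENGE_POOL)  # each challenge solved once by a human
    wins = 0
    for _ in range(trials):
        question = serve_challenge(rng)
        if bot_answer(question, harvested) == CHALLENGE_POOL[question]:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    print(break_rate(1000))  # 1.0
```

The one-time cost of harvesting grows with the pool size, but every later attempt is free, which is why the pool has to be generated rather than written by hand.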
  • by Jerf ( 17166 ) on Tuesday September 05, 2006 @12:54AM (#16042128) Journal
    In order to pose the question, you have to generate it randomly. If it's not random, you already lost.

    In order to generate it, you're going to end up using a grammar.

    Running grammars in reverse is merely a matter of patience (to explore the space of problems the test program will pose) and the right tools; it's a fundamental bit of computer science.

    Granted, expecting spammers to be conversant with the fundamental elements of computer science is a pretty high bar, but it only takes one to leap it and the rest to buy the program from him.

    The image tests have the advantage that, done properly, they take more than just patience and computer science fundamentals to crack; cracking them would require fundamental advances in the art.

    (Note that nowhere in this message do I claim that image tests are perfect; in fact everything I know is vulnerable to the "feed it to a human in another context (viz, 'porn') and let them do the work" attack, and there are also points to be made about how widespread any given grammar/image test becomes; I know a website where the image test actually is a constant and so far it doesn't seem to be a problem because of scale issues. My point is that text tests have an additional disadvantage. It's not an intrinsically bad idea, though.)

    Google wouldn't be interested in hiring people who could crack this, merely because they can crack this. Might make a decent interview question, though.

    (You might also be tempted to think that you could just use a really complicated grammar, but you are constrained by two things, the human supposedly reading and taking the test, and the complexity of the human language itself. By the time you write some problem generator that could reliably throw off a parser, you'll be reliably confusing the hell out of your human users, too.)
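
As a rough sketch of "running the grammar in reverse", assume a toy text CAPTCHA whose questions all follow the pattern "What is NUMBER OP NUMBER?". The grammar, word lists, and function names below are hypothetical; the point is only that the same rules used to generate a question can be used to parse and answer it.

```python
import random
import re

# Hypothetical toy grammar:
#   question -> "What is" NUMBER OP NUMBER "?"
NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
           "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
OPS = {"plus": lambda a, b: a + b,
       "minus": lambda a, b: a - b,
       "times": lambda a, b: a * b}

PATTERN = re.compile(r"What is (\w+) (\w+) (\w+)\?")


def generate(rng):
    """Forward direction: expand the grammar into (question, answer)."""
    a = rng.choice(list(NUMBERS))
    b = rng.choice(list(NUMBERS))
    op = rng.choice(list(OPS))
    return f"What is {a} {op} {b}?", OPS[op](NUMBERS[a], NUMBERS[b])


def solve(question):
    """Reverse direction: parse the same grammar and evaluate it."""
    match = PATTERN.fullmatch(question)
    if match is None:
        return None
    a, op, b = match.groups()
    if a not in NUMBERS or b not in NUMBERS or op not in OPS:
        return None
    return OPS[op](NUMBERS[a], NUMBERS[b])


if __name__ == "__main__":
    rng = random.Random(42)
    question, answer = generate(rng)
    print(question, "->", solve(question), "(expected", answer, ")")
```

A richer grammar makes the parser longer, not fundamentally harder to write, and it also makes the questions harder for the humans who have to read them.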
  • by Otto ( 17870 ) on Tuesday September 05, 2006 @01:01AM (#16042165) Homepage Journal
    Or write up a quick script to cut the images in half down the middle and save them as a series of other images.
  • Re:Un-Finishable (Score:5, Insightful)

    by mrchaotica ( 681592 ) * on Tuesday September 05, 2006 @01:58AM (#16042365)
    In all honesty, I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

    Your argument makes the fundamentally flawed assumption that the "new rules" will remain constant. The reality is that Copyright will continue getting extended so that new content never comes into Public Domain. (I hope the copyright fuckers are the first against the wall when the revolution comes!)

    Even if they can't disseminate the information, they could still scan documents in and store them for later OCR-ing, thus preserving them against deterioration.

    I'm sure they could even OCR them... they just couldn't make them available to the public. Of course, given the community-driven mechanism by which Project Gutenberg works, they couldn't legally distribute them to the volunteers either...

  • by Anonymous Coward on Tuesday September 05, 2006 @02:11AM (#16042423)
    You're a secretary? Do you do anal? If so, I can double your pay.
  • by illuminatedwax ( 537131 ) <stdrange@nOsPAm.alumni.uchicago.edu> on Tuesday September 05, 2006 @02:23AM (#16042472) Journal
    Also: *AA includes the MAA (Mathematical Association of America), the ADAA (Anxiety Disorders Association of America), the MSAA (Multiple Sclerosis Association of America), and the SCAA (Specialty Coffee Association of America).

    The SCAA must be the ones responsible for not letting Java be open sourced.
  • Re:Image spam (Score:3, Insightful)

    by maxwell demon ( 590494 ) on Tuesday September 05, 2006 @04:16AM (#16042878) Journal
    Unless it's a scanned page, where you might be interested in more than just the raw text, or simply don't want to risk errors in converting it to text (think official documents).
  • Yes, by using contrasting colors that convert to the same tone in grayscale. A side effect is that most such techniques also shut out colorblind people...
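
A small worked example of two colors that contrast strongly in full color but collapse to the same gray. It assumes the grayscale conversion uses the common Rec. 601 luma weights (0.299, 0.587, 0.114); other conversions would need a different color pair. The two colors and the helper names are chosen for illustration only.

```python
def luma(rgb):
    """Grayscale value using Rec. 601 weights (an assumed conversion)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def channel_distance(c1, c2):
    """Crude measure of how far apart two colors are in RGB."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


TEXT = (255, 0, 0)        # pure red, illustrative foreground
BACKGROUND = (0, 130, 0)  # mid green, illustrative background

if __name__ == "__main__":
    print(channel_distance(TEXT, BACKGROUND))  # 385
    print(luma(TEXT), luma(BACKGROUND))        # 76 76
```

An OCR pipeline that converts to grayscale first sees a uniform field here, and, as the comment notes, so may a reader with red-green color blindness.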
  • by patio11 ( 857072 ) on Tuesday September 05, 2006 @07:56AM (#16043541)
    The system you propose is called challenge/response (CR). CR is not a good idea, for the following reasons:

    1) It says "My time is more important than yours" to all your correspondents, because you're not willing to look at a few spams getting past your Bayesian filter every day so instead you offload that time burden to people who want to talk to you.
    2) Dueling CR systems ("Hey, bob@example.com, I don't recognize you. Please prove you are a human" "Re: Hey, bob -- steve@stupid.com, I don't recognize you. Please prove you are a human"). Even more fun in a potentially infinite loop. Any system you can make to short-circuit this loop can be abused by spammers to avoid the CR altogether.
    3) Doesn't survive the Chinese Sweatshop Spam Attack, which will be ubiquitous if CR becomes popular. (Take poor Chinese person, teach them 10 words of English, pay them 2 cents an hour to answer CAPTCHAs so you get guaranteed delivery of your Maximize Your Mr. Wiggly offers.)
    4) Breaks legitimate bulk mail senders, such as Amazon, Paypal, eBay, mailing lists, etc etc. Mailing lists in particular are going to be very fun, since a lot of CR systems would spam the entire list -- perhaps provoking 100 challenges! Which then leads to combinatorial hilarity!
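
The dueling-systems loop in point 2 can be simulated in a few lines. This is a simplified model, assuming each CR mailbox challenges every sender it has not whitelisted, including the sender of a challenge; the mailbox names echo the example addresses above, and the `max_messages` cap exists only to stop the simulation.

```python
from collections import deque


def simulate(auto_challenge, max_messages=20):
    """Count messages exchanged after steve mails bob for the first time.

    auto_challenge maps each mailbox name to True if it challenges every
    unknown sender (including senders of challenges). Nobody is whitelisted.
    """
    known = {name: set() for name in auto_challenge}
    queue = deque([("steve", "bob")])  # (sender, recipient)
    sent = 0
    while queue and sent < max_messages:
        sender, recipient = queue.popleft()
        sent += 1
        if auto_challenge[recipient] and sender not in known[recipient]:
            # The recipient holds the mail and challenges the sender.
            queue.append((recipient, sender))
    return sent


if __name__ == "__main__":
    print(simulate({"steve": False, "bob": True}))  # 2
    print(simulate({"steve": True, "bob": True}))   # 20 (hits the cap)
```

With one CR system the exchange ends after the challenge; with two, it only ends because of the artificial cap.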
  • by tepples ( 727027 ) <tepples.gmail@com> on Tuesday September 05, 2006 @09:47AM (#16044102) Homepage Journal
    I doubt Project Gutenberg will have run out of pre-1923 books by the time that new stuff starts coming out of Copyright under the new rules.

    Are you insinuating that the 115th Congress won't try to enact a Chastity Bono Copyright Term Extension Act? Given Mexico's life plus 100 copyright term, the next step of "harmonization" for the United States and its trading partners is life plus 100 or, in the case of works made for hire, 125 years after publication.

    Just assuming that somehow they did manage to digitize everything that was out of copyright, then I think what they should do is start archiving everything that they can.

    Who's to say that publishers won't fight back against Gutenberg the way (ObTopic) they did against Google [eweek.com]? It's only fair use if you can pay a judge to tell you that it is and if you can pay your lawyer to tell the judge to tell you that it is.
