News for nerds, stuff that matters

Google Pushes Open Source OCR

Posted by Zonk on Tue Apr 10, 2007 01:02 PM
from the google-has-taken-all-knowledge-to-be-its-province dept.
SocialWorm writes "Google has just announced work on OCRopus, which it says it hopes will 'advance the state of the art in optical character recognition and related technologies.' OCRopus will be available under the Apache 2.0 License. Obviously, there may be search and image search implications from OCRopus. 'The goal of the project is to advance the state of the art in optical character recognition and related technologies, and to deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis, and general desktop use. In addition, we are structuring the system in such a way that it will be easy to reuse by other researchers in the field.'"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Sign of times to come? (Score:3, Interesting)

    by Anonymous Coward on Tuesday April 10 2007, @01:05PM (#18679067)
    Now that they will be able to recognize and tag our images, I wonder if Picasa will finally get increased storage. Google will be able to deliver targeted ads based on our pictures.
  • Build instructions are outdated (Score:2, Informative)

    by What the Frag (951841) * on Tuesday April 10 2007, @01:06PM (#18679073)
    (Last Journal: Saturday January 20 2007, @10:37AM)
    Use this line to check out ocropus:

    svn co http://ocropus.googlecode.com/svn/trunk/ ocropus
  • The goal of the project (Score:5, Insightful)

    by user24 (854467) on Tuesday April 10 2007, @01:07PM (#18679089)
    (http://www.puremango.co.uk/)
    The goal of the project is to stop the damn email image spammers.

    among other things, sure, but it's got to be a high priority for google.
    • Re:The goal of the project (Score:4, Insightful)

      by sammy baby (14909) on Tuesday April 10 2007, @01:15PM (#18679253)
      (Last Journal: Monday February 04 2002, @03:31PM)
      And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)
      [ Parent ]
      • And of course, as a side effect they'll probably wind up with a lovely distributed system for solving captcha. ;)

        True, but CAPTCHAs always seemed like a bit of an inelegant hack anyway. First, they're horrible from a disabled-access standpoint, and second they're really not all that effective against a concerted enemy when there's a lot of money on the line. Spammers can just pay a few kids in some Third World country to sit there all day and solve CAPTCHAs if they want to.

        Since message boards, which are the major users of CAPTCHAs, are practically by design little fiefdoms, I don't think they're nearly as hard to patrol as a common-carrier network like email. The solution to message-board spam is to either institute a moderator-delay (for small blogs and boards), or simply make enough admins with IP-ban powers so that the second someone starts spamming, they get banned and the spam gets deleted. Lameness filters working on the same principles as email spam-filters are probably helpful, too.
        [ Parent ]
      • Re:The goal of the project by UbuntuDupe (Score:3) Tuesday April 10 2007, @01:53PM
    • Re:The goal of the project (Score:4, Interesting)

      by ajs (35943) <ajs@aj s . com> on Tuesday April 10 2007, @02:25PM (#18680307)
      (http://www.ajs.com/~ajs/)

      The goal of the project is to stop the damn email image spammers.

      among other things, sure, but it's got to be a high priority for google.
      I don't buy either one. I think the goal of the project is to get sued.

      Google knows darned well that there are tons of patents around OCR, so they're not going to roll their own internally. Instead, they'll open source the project and make as much noise about enhancing the state of the art through collaboration as possible. Then, when they get sued (and they will), they can bring this case front-and-center in the debate surrounding patent reform, citing this as the textbook example of how the promotion of the sciences and useful arts (as specified by the Constitution) is hobbled by current patent law surrounding software.

      I could be wrong, but they'd be stupid to think that high-profile, open source OCR software won't be challenged by those who hold the patents....
      [ Parent ]
      • Re:The goal of the project (Score:5, Insightful)

        by slashbob22 (918040) on Tuesday April 10 2007, @02:50PM (#18680745)
        Ok, I'll bite and play DA for a bit.

        Why Google wouldn't want this:
        1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
        2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.

        IANIGHQ (In Google's HQ) but I don't see the value of getting sued at this point in time. Besides, if Google is doing this under appropriate conditions there shouldn't be concern of suits - but I suppose their Chinese plagiarism case doesn't support this point.

        // End DA
        [ Parent ]
        • Re:The goal of the project (Score:4, Insightful)

          by ajs (35943) <ajs@aj s . com> on Tuesday April 10 2007, @03:36PM (#18681451)
          (http://www.ajs.com/~ajs/)

          Why Google wouldn't want this:
          1) Google's own patents on search techniques, distributing advertisement, etc. Yes, I understand that you are talking about the need for certain limits to patent law - but this could as easily hurt Google as help them.
          Google takes the same stand on patent reform as IBM, as far as I know: the current law hurts innovation. They're not looking to have all of their patents stripped, just to reform the system so that innovation is encouraged. At the very least, IBM has (and I think Google too) lobbied for open source exemption. Keep in mind that IBM and Google hold tons of patents, but they mostly use them as a "warchest" to dissuade others from filing patent-related suits.

          2) They are already challenging the copyright laws with GooTube, I don't see any sense in tackling both at the same time.
          I don't buy that one. Patent and copyright law are radically different, and in the copyright case Google is just trying to argue for existing interpretation of the law, not a change.

          Google is doing this under appropriate conditions there shouldn't be concern of suits
          That's not how patent law works. If someone holds a patent on looking at the pixel to the left of the one you're evaluating, and Google's software does that, then the holder could sue. What's more, there are many dozens of such simple patents surrounding OCR. It's probably the second-most over-patented area of CS next to color-space management.[1] [google.com]
          [ Parent ]
      • HP Tesseract by chill (Score:2) Tuesday April 10 2007, @07:17PM
    • Re:The goal of the project by w_mute (Score:1) Tuesday April 10 2007, @02:25PM
    • Re:The goal of the project by cheater512 (Score:2) Tuesday April 10 2007, @05:06PM
  • So much for captcha (Score:2, Redundant)

    by Red Flayer (890720) on Tuesday April 10 2007, @01:08PM (#18679111)
    (Last Journal: Friday November 10 2006, @02:16PM)
    Oh great. I, for one, do not welcome the increase in message board spamming.
  • The beginning of the end? (Score:4, Insightful)

    by Iphtashu Fitz (263795) on Tuesday April 10 2007, @01:08PM (#18679113)
    ... for Captchas [wikipedia.org]? If Google is pushing OCR I could see it eventually becoming good enough to parse at least some types of captchas.
  • the presidential papers (Score:4, Funny)

    by User 956 (568564) on Tuesday April 10 2007, @01:10PM (#18679157)
    (http://www.atomjax.com/)
    The goal of the project is to ... deliver a high quality OCR system suitable for document conversions, electronic libraries, vision impaired users, historical document analysis

    So, will it work on documents written in crayon? It would be a tragic loss for Dubya's presidential documents to get lost in the sands of time. On the scale of the library of Alexandria. No, seriously.
  • Finally... (Score:3, Interesting)

    by Searinox (833879) on Tuesday April 10 2007, @01:10PM (#18679159)
    (http://www.phoboid.net/)
    An OCR system that runs on Linux. I've been waiting for quite some time for something like this.
  • Captchas (Score:2)

    by Radon360 (951529) on Tuesday April 10 2007, @01:11PM (#18679189)

    So will something like this eventually render captchas used as a security/anti-spam measure obsolete?

    Not like something wasn't bound to eventually come out to counter that idea, anyway.

  • Very cool. (Score:5, Insightful)

    I've been hoping that someone with deep pockets (Google, IBM, Sun) would take on this area for a while.

    There is a major need for an OSS OCR package, and right now the field is pretty bare. There's GOCR [sourceforge.net], and a commercial offering called OCRShop, and, at least among the ones I've run across, that's about it. Nothing really on par with Omnipage, or other commercial packages for other platforms.

    I think there are some really neat applications for OCR that have never really been investigated, because it's so expensive to build that capability into other products. A free OCR engine that really worked could lead to some very neat book-scanning applications, just for starters. I don't think that there are really any integrated packages around for helping people scan books and manuscripts. (Right now you have to photograph the pages, keep them organized, then OCR them and proofread the text against the images. Bit of a nightmare.) I'd love to see a free application for libraries that let a user batch scan (via a digital camera -- let's not get into what I think of SANE and scanners generally) a book, and then provided a nice interface for proofreading the OCRed text against the original image.

    Something like that could have a huge social impact. There are a lot of libraries where I'm sure they'd love to scan some of their out-of-copyright assets and provide them to patrons in a digital form, but it's just too technically complicated. An easy-to-use program that let the proofreading be done by nontechnical users (maybe remotely, as long as we're dreaming) could vastly increase the volume of digital materials available.
  • Orcopus? (Score:4, Funny)

    by voice_of_all_reason (926702) on Tuesday April 10 2007, @01:23PM (#18679377)
    Orcopus:

    Level: 15
    Race: Fell Marine
    HP: 290/290
    EP: 200/200
    Water elemental
    Drops: Tentacle
  • Wonderful! (Score:4, Insightful)

    by jshriverWVU (810740) on Tuesday April 10 2007, @01:27PM (#18679461)
    This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.
    • Re:Wonderful! by Anonymous Coward (Score:1) Tuesday April 10 2007, @01:59PM
  • One thing leads to another... (Score:2, Insightful)

    by jojoba_oil (1071932) on Tuesday April 10 2007, @01:29PM (#18679493)
    Okay, so one thing will lead to another and soon Google will be creating technology to recognize non-symbol shapes... How long before I can login to my G-Accounts by smiling at my computer?
  • captchas (Score:5, Insightful)

    by gEvil (beta) (945888) on Tuesday April 10 2007, @01:40PM (#18679651)
    (http://evil.google.com/)
    All you people who are worried about this breaking captchas seem to be missing something--there have been a number of fairly decent OCR packages out there for a long time. The goal of this Google project is to create an open-sourced one that does a good job deciphering HUMAN-READABLE TEXT. Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.
    • Re:captchas (Score:4, Informative)

      by arrrrg (902404) on Tuesday April 10 2007, @01:59PM (#18679977)
      Captchas are far from human-readable (the good ones at least), and I seriously doubt this project will help very much in that arena.

      Um, the whole point of CAPTCHAS is that they are human-readable (albeit often not easily so) but not machine-readable.
      [ Parent ]
      • Re:captchas by gEvil (beta) (Score:2) Tuesday April 10 2007, @02:17PM
    • Re:captchas by MoriaOrc (Score:2) Tuesday April 10 2007, @02:21PM
    • Re:captchas by AeroIllini (Score:2) Tuesday April 10 2007, @02:51PM
      • Re:captchas by ChaosDiscord (Score:2) Tuesday April 10 2007, @03:50PM
        • Re:captchas by drinkypoo (Score:2) Tuesday April 10 2007, @04:18PM
          • Re:captchas by user24 (Score:2) Tuesday April 10 2007, @05:00PM
          • Re:captchas by ChaosDiscord (Score:2) Tuesday April 10 2007, @05:05PM
            • Re:captchas by drinkypoo (Score:2) Tuesday April 10 2007, @05:09PM
        • Re:captchas by AeroIllini (Score:2) Wednesday April 11 2007, @10:25AM
  • searchable pdfs (Score:5, Interesting)

    by radarsat1 (786772) on Tuesday April 10 2007, @01:44PM (#18679705)
    (http://www.music.mcgill.ca/~sinclair)
    Anyone know of an open source utility that can convert scanned image-based PDF files into searchable PDFs ?
    (Extra points if it somehow re-generates the actual file so it looks nice instead of pixelated.)

    Perhaps this library could be used to build such an application if none exists...
  • Language? (Score:5, Interesting)

    by ceeam (39911) on Tuesday April 10 2007, @01:45PM (#18679731)
    English only I suppose?
  • And will it be able to recognise and LaTeXify handwritten mathematics? The world and its mother can do OCR, but I've yet to see an honest attempt at making writing mathematics papers easier.
  • by Soong (7225) on Tuesday April 10 2007, @02:05PM (#18680049)
    (http://bolson.org/ | Last Journal: Friday May 20 2005, @03:44PM)
    Ok, I got excited too early. Actually, ballot scanning is a specialized task and general purpose OCR probably doesn't play much of a part in that, but if any part of it does apply, then this is still awesome.
  • Comics (Score:3, Interesting)

    by rbanffy (584143) on Tuesday April 10 2007, @02:26PM (#18680339)
    (http://www.dieblinkenlights.com/)
    Will I be able to search my comic strips (downloaded since forever) by keyword?

    I would love that!
  • Sheesh.... (Score:1)

    by Rick Richardson (87058) on Tuesday April 10 2007, @02:54PM (#18680799)
    (http://home.comcast.net/~rickrich1/)
    make[3]: Entering directory `/home/rick/tesseract-ocr/wordrec'
    if g++ -DHAVE_CONFIG_H -I. -I. -I.. -I../ccstruct -I../ccutil -I../cutil -I../classify -I../image -I../dict -I../viewer -g -O2 -MT tface.o -MD -MP -MF ".deps/tface.Tpo" -c -o tface.o tface.cpp; \
    then mv -f ".deps/tface.Tpo" ".deps/tface.Po"; else rm -f ".deps/tface.Tpo"; exit 1; fi
    ../cutil/globals.h:46: error: previous declaration of 'int optind' with 'C++' linkage
    ../ccutil/getopt.h:23: error: conflicts with new declaration with 'C' linkage
    ../cutil/globals.h:47: error: previous declaration of 'char* optarg' with 'C++' linkage
    ../ccutil/getopt.h:24: error: conflicts with new declaration with 'C' linkage
    make[3]: *** [tface.o] Error 1
    make[3]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
    make[2]: *** [all-recursive] Error 1
    make[2]: Leaving directory `/home/rick/tesseract-ocr/wordrec'
    make[1]: *** [all-recursive] Error 1
    make[1]: Leaving directory `/home/rick/tesseract-ocr'
  • All the OCR packages available through APT on my Ubuntu 6.10 (Edgy) are worthless (< 50% correct characters) when I try them on real scans (usually faxes) that are perfectly clear to my eye:

    clara - Free OCR program for Unix Systems
    gocr - A command line OCR
    ocrad - Optical Character Recognition program
    unpaper - post-processing tool for scanned pages

    Will this Google OCR really work, and can I install it with APT?

    Meanwhile, why is it all Optical Character Recognition, when the accuracy we expect is really Optical Word Recognition? How come spelling, grammar and phrase frequency (including typos etc.) aren't used to error-correct at a symbolic level higher than pixels?
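
    For what it's worth, word-level correction of the kind asked about here can be roughed out in a few lines. The following is only a minimal sketch, not how OCRopus or any of the packages above work: the function name `correct_words`, the tiny vocabulary, and the 0.6 similarity cutoff are all assumptions made for illustration, and it uses Python's standard `difflib` module rather than any real language model.

```python
import difflib


def correct_words(ocr_text, vocabulary, cutoff=0.6):
    """Replace each OCR token with its closest vocabulary word, if one is close enough.

    `vocabulary` is a hypothetical set of known lowercase words; a real system
    would use a large dictionary plus word and phrase frequencies.
    """
    candidates = sorted(vocabulary)  # sorted only to make results deterministic
    corrected = []
    for token in ocr_text.split():
        if token.lower() in vocabulary:
            corrected.append(token)  # already a known word, keep as-is
            continue
        # get_close_matches ranks candidates by difflib's similarity ratio
        matches = difflib.get_close_matches(token.lower(), candidates, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


if __name__ == "__main__":
    vocab = {"the", "sword", "in", "hand"}
    print(correct_words("thg sword in hcnd", vocab))
```

    The obvious weakness is that a dictionary pass like this would "correct" deliberate nonsense words too, which is exactly why Jabberwocky makes a good torture test for it.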
    • Re:Where's the Package? by drinkypoo (Score:2) Tuesday April 10 2007, @04:20PM
      • Re:Where's the Package? by Doc Ruby (Score:2) Tuesday April 10 2007, @04:36PM
        • Re:Where's the Package? (Score:4, Insightful)

          Do you have a result from scanning Jabberwocky (or other verse in a similar vein) with Google's OCR?

          Just for you, I made one, because I'm that fucking cool.

          1. Visited http://www.jabberwocky.com/carroll/jabber/jabberwocky.html [jabberwocky.com].
          2. Printed page 1 (all but one link at the bottom of the page) with default settings on a HP LaserJet 2300.
          3. Scanned on an Epson 3170 as a 300 dpi grayscale PNG with otherwise default settings. (God DAMN this scanner is fast. But then my scanner at home is a shitty Mustek 1200UB since I broke my Canon LiDe.) 2528x3281 pixels.
          4. mespinoza@sec2lpt7-linux:~/ocropus/ocropus-cmd$ ./ocropus ocr ~/Desktop/out.png | tee /home/mespinoza/Desktop/jabberwocky.html (lots of output)

          Prepare to be unimpressed, because Results follow:

          JABBERWOCKY Lewis Carroll

          (from Through the Looking-Glass and What Alice Found There, 1872) `Twas bri11ig,_ andjghe 4s1it_hy toyes Digl gyre amid gimblejn thg wabe: All xiiimsy wei^e thg borogovgs, And theamome raths outigrabe. ''ggwqre thg Jalgbervvpck,_my sqn! The jaw; that bijtel the clayksathat catch! Bgyvaiie the Jubjub bird, anti shun The frumidus Bandersnatch!' I-Ie took his yorpal sword in hand: Long timg tlgewmangome foe he sought So rgSted he by the Tu_mtum tree, And stood awhile in thought. And, as in uffish thought he stood, The Jalgbgjwoclg, with eyes of flame, Cqmgwhjfflixgg through fhe tulgey wood, And burbled as it came! Qne, two! One, two! And through and thIi`Ollgh The jrorpgal b]ade went; snicaker-snack! I-Ie left iifdead, and with its head He went galumphing back. ''And, has thou slain thejabbexfwpck? Cpmg to my a_rxps!_my ljgaxjgishboyl Ojralqjousi dwgy! Qalladhl Callayl' He chortled in his joy. S

          \ A S

          X A ?`^s :

          , ' Was ga. ka%#* mm. -- M 1 1 Q at ) a iv 2. `Ail A it 3*,* `i 2 (V H ;. ````( * 4 ^Nq@ Eu..*s..%im X M is ? lgh ~ ``A? S [ A Fax I /),2*gE it ^`* 4 ~ *: ' X A mg x ix, ,t~;;;..: v' it ix '~ t ~ ^ ,4~ ---= =-^ A A i gv ; * XX, x> . . N S A ft 1 A-`A 3; `> ' ''YY \Jh ^***`(?i* , ~~ x `* at -;v- *<~ ' H ~~~-=.- ; `Twas bri11ig,_ and_the 4s1it_hy toyes Dig gyre arid gimblejn the wabe; All Qiixjnsy wei^e thq borogovgs, And thdmome raths outvgrabe.

          dshaw@iabbenNockv.com

          Return to Glorious Nonsense Return to Lewis Carroll

          Results End.

          Beautiful, eh? I also tried a 100 dpi grayscale scan, which came out even more like hash (one big paragraph) and a 300 dpi bitmap (1bpp) which was about the same as the 100 dpi gray scan in quality, though a bit better.

          Looks like ocropus has a while to go before it can slay the Jabberwock instead of thejabbexfwpck.
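
          To put a number on "how bad" rather than eyeballing it, one common approach is character accuracy based on edit distance between the known text and the OCR output. This is a rough sketch only: the helper names `edit_distance` and `char_accuracy` are made up for illustration, and this is not a metric shipped with ocropus as far as this thread shows.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(truth, ocr):
    """Fraction of ground-truth characters recovered, clamped at zero."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


if __name__ == "__main__":
    truth = "And hast thou slain the Jabberwock?"
    ocr = "And, has thou slain thejabbexfwpck?"
    print(f"character accuracy: {char_accuracy(truth, ocr):.2f}")
```

          Running the same comparison over each scan setting (300 dpi gray, 100 dpi gray, 1bpp) would give a like-for-like score instead of "more like hash".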

          [ Parent ]
    • Re:Where's the Package? by timmarhy (Score:1) Tuesday April 10 2007, @05:00PM
    • Faxes seem to be worthless for affordable OCR by name_already_taken (Score:2) Wednesday April 11 2007, @08:45AM
  • by morganew (194299) * on Tuesday April 10 2007, @04:36PM (#18682367)
    It's fascinating that Google has chosen the Apache license for the release of this product. Given that Eben Moglen has explicitly stated that the Apache License is incompatible with GPLv3, what does this mean for mixing this code into other projects?

    Even though v3 no longer has the anti-google Affero provisions, Google still chooses Apache instead of GPLv3 or even v2 with a rider to upgrade to v3. You gotta believe the Google lawyers were thinking about this issue before release...

  • Ocropus? (Score:3, Funny)

    by 6Yankee (597075) on Tuesday April 10 2007, @05:10PM (#18682791)
    Is that a Chinese mispronunciation? ;)
  • Advancement in AI (Score:1)

    by chip33550336 (614139) on Tuesday April 10 2007, @08:37PM (#18684529)

    It is interesting that web forms have become a measure of AI strength in the world wide web. As soon as Captchas are largely solved, there will be new and improved human tests. I am guessing the next step will be identifying logos, or some sort of symbol. Eventually that problem will be solved too. So what do we do when we can't tell a human from a machine?

    Please send me your registered DNA sequence, a voice recording reading this message, and a picture of you in the current location...

    I guess a central database of information (identification and secure communication channel) is going to be the only way to ensure you are who you say you are.

    Eventually I guess it won't really matter if you are human.

  • Patents? (Score:2)

    by BubbaFett (47115) on Wednesday April 11 2007, @11:55AM (#18691215)
    One would assume that OCR is a heavily patented space, and a patent search seems to agree. Caere could make things difficult [google.com] for the competition.
  • Re:From? (Score:3, Funny)

    by MightyYar (622222) on Tuesday April 10 2007, @03:25PM (#18681311)
    Ha! It even works on the whole string:

    [ Parent ]