News for nerds, stuff that matters

Fill Out CAPTCHAs, Digitize Books At The Same Time

Posted by Zonk on Thu May 24, 2007 06:15 PM
from the i-would-like-to-subscribe-to-your-newsletter dept.
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."

Related Stories

IT: Carnegie Mellon CAPTCHA Digitization Project Now Underway 119 comments
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but it was prior to it getting off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie and one of which isn't. If the user correctly deciphers the known word, then the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Verification? (Score:5, Insightful)

    by traindirector (1001483) * on Thursday May 24, @06:16PM (#19262111)

    CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text were as sophisticated as the guessing computer, the guess would match.

    I imagine the computer sending the picture of the image of hard-to-read text will further obfuscate the image in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of / implementing this system in a functional way. The article really should cover this...

    • Re:Verification? (Score:5, Informative)

      by greatgregg (1106739) on Thursday May 24, @06:19PM (#19262153)
      From recaptcha.net [recaptcha.net]: "But if a computer can't read such a CAPTCHA, how does the system know the correct answer to the puzzle? Here's how: Each new word that cannot be read correctly by OCR is given to a user in conjunction with another word for which the answer is already known. The user is then asked to read both words. If they solve the one for which the answer is known, the system assumes their answer is correct for the new one. The system then gives the new image to a number of other people to determine, with higher confidence, whether the original answer was correct."
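A minimal Python sketch of the two-word check described in that quote. The names `Challenge` and `check_response`, the image ID, and the case-insensitive comparison are all assumptions made for illustration; this is not reCAPTCHA's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Challenge:
    control_answer: str    # word whose transcription is already known
    unknown_image_id: str  # hypothetical ID of the word OCR could not read


def check_response(
    challenge: Challenge, control_typed: str, unknown_typed: str
) -> Optional[Tuple[str, str]]:
    """Return (image_id, guess) if the control word was solved, else None."""
    if control_typed.strip().lower() != challenge.control_answer.lower():
        # Failed the known word: treat the visitor as a bot and discard the guess.
        return None
    # Passed the known word: keep the guess as one vote for the unknown word.
    return (challenge.unknown_image_id, unknown_typed.strip().lower())
```

Only the control word gates access. The guess for the unknown word is merely collected, which is why it still has to be confirmed by other people before it is trusted.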
      • Official reCAPTCHA site (Score:5, Informative)

        by traindirector (1001483) * on Thursday May 24, @06:23PM (#19262213)

        I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/ [recaptcha.net]

      • Re:Verification? by nwbvt (Score:2) Thursday May 24, @06:42PM
        • Re:Verification? (Score:5, Insightful)

          by Falkkin (97268) on Thursday May 24, @07:00PM (#19262635)
          (http://colinm.org/)
          "So if you want to screw with it, all you have to do is intentionally get exactly one word wrong each time."

          Well... sort of. Multiple agreements are required before the system will accept that it knows the spelling of a previously unknown word. So you're not going to singlehandedly subvert the system; at the very least you need a cabal of friends. But with millions of words available in the system, the chance that you and a bunch of friends will all get the same word and write in the same bogus data is pretty close to zero. I'm not saying this system is impossible to game, but I think it'd be a heck of a lot easier (and more rewarding, if it's the sort of thing that floats your boat) to vandalize Wikipedia instead.
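A back-of-the-envelope version of the "pretty close to zero" claim, as a Python sketch. It assumes words are served uniformly at random and that each person gets a single draw; the pool size of one million is only a stand-in for "millions of words", not a figure from the project.

```python
from fractions import Fraction


def same_word_probability(pool_size: int, group_size: int) -> Fraction:
    """Chance that every other member of a group is served the same word
    as the first member, with one uniform random draw per person."""
    if pool_size < 1 or group_size < 1:
        raise ValueError("pool_size and group_size must be positive")
    # The first person's word is free; each of the others must match it.
    return Fraction(1, pool_size) ** (group_size - 1)
```

Under those assumptions, `same_word_probability(1_000_000, 3)` is one in 10^12 for a cabal of three. Repeated attempts raise the odds, but the same multiplication applies to every extra colluder that is required.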
      • Re:Verification? (Score:5, Funny)

        by bugnuts (94678) on Thursday May 24, @06:48PM (#19262503)
        (Last Journal: Friday November 09, @05:49PM)
        The problem is that any unsophisticated captcha interpreter can spit out the text that's known, and make a (bad) guess at what is hard to read. Then, if there is any significant number of spammers, we end up with exactly the same issue - computers having trouble with OCR.

        e.g., /. puts in a captcha to translate the following two sections:
        12345
        l1il1

        The captcha software knows the "12345"
        but it doesn't know the "l1il1". A human could figure out both.

        But spammer captcha deciphering can figure out 12345, and is allowed to incorrectly guess 11ii1 for the 2nd part. End result is
        • a spammer is posting something as indecipherable as this message except insults your penis size
        • some OCRed book is now committed to a false interpretation
        • I have to change the password on my luggage.

      • Re:Verification? by Mr. Underbridge (Score:3) Thursday May 24, @07:07PM
      • Re:Verification? by jambarama (Score:2) Thursday May 24, @07:32PM
      • Re:Verification? by Bluedove (Score:2) Friday May 25, @02:43AM
    • Re:Verification? by mikee805 (Score:1) Thursday May 24, @06:19PM
    • Re:Verification? by 26199 (Score:1) Thursday May 24, @06:20PM
    • Re:Verification? by anarchy_man3 (Score:1) Thursday May 24, @06:21PM
    • Re:Verification? by Solder Fumes (Score:2) Thursday May 24, @06:21PM
    • Re:Verification? by redwoodtree (Score:2) Thursday May 24, @06:21PM
    • Exactly what I was wondering by raygundan (Score:2) Thursday May 24, @06:23PM
      • Re:Exactly what I was wondering (Score:4, Informative)

        by Falkkin (97268) on Thursday May 24, @06:29PM (#19262293)
        (http://colinm.org/)
        The system serves two words to the user. The system knows the correct answer to one of these words -- this is the one used to test whether the user is a human or a bot. If the user got the test word right, then there's a good chance they also got the unknown word right. If a bunch of humans all agree on the same transcription of a given unknown word, the system will eventually "know" the correct spelling of the unknown word and can then serve it as a "known" word in the future.
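The "bunch of humans all agree" step could be sketched in Python like this. The class name, the default threshold of three agreements, and the lower-casing of guesses are assumptions for illustration; the comment does not say how many agreements the real system requires.

```python
from collections import Counter, defaultdict


class UnknownWordPool:
    """Collects guesses for unread words and promotes them once enough agree."""

    def __init__(self, agreements_needed=3):
        self.agreements_needed = agreements_needed  # hypothetical threshold
        self.votes = defaultdict(Counter)  # image_id -> tally of guesses
        self.known = {}                    # image_id -> accepted spelling

    def record(self, image_id, guess):
        """Record one guess from a user who passed the test word.

        Returns the accepted spelling once enough users agree, else None.
        """
        if image_id in self.known:
            return self.known[image_id]
        tally = self.votes[image_id]
        tally[guess.strip().lower()] += 1
        word, count = tally.most_common(1)[0]
        if count >= self.agreements_needed:
            # Promote: from now on this word can be served as a "known" word.
            self.known[image_id] = word
            del self.votes[image_id]
            return word
        return None
```

A single wrong or malicious guess only adds one vote to a losing tally, so it delays promotion instead of corrupting the accepted text.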
        • Re:Exactly what I was wondering (Score:4, Insightful)

          by hpavc (129350) on Thursday May 24, @07:17PM (#19262869)
          It likely has a good idea of the 'unknown' word as well. In the example "This aged portion of society were distinguished from", the OCR didn't cut it, but it did kick-start a guess. At least with "This -> niis" it can easily see that it's not 'ZOMG' or 'Fark'.

          Also it wouldn't take much to add some grammar to pad the guessing. While we see two words, the system sees them in at least two contexts.

          Obviously it has the actual dictionary to help it basically spell check the words we submit to it. If the words we give it are complete garbage, it's unlikely to go for it. That is how it knows that "niis" needs a correction.
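A rough Python sketch of that kind of sanity filter: accept a typed word only if it is a dictionary word and it resembles the failed OCR guess. The function name, the tiny word set, and the 0.4 similarity threshold are all assumptions; whether the real system filters this way is the poster's speculation, not something confirmed here.

```python
from difflib import SequenceMatcher


def is_plausible(submission, ocr_guess, dictionary, min_similarity=0.4):
    """Accept a typed word only if it is a dictionary word that
    looks at least somewhat like the OCR engine's failed guess."""
    word = submission.strip().lower()
    if word not in dictionary:
        return False  # complete garbage is rejected outright
    similarity = SequenceMatcher(None, word, ocr_guess.lower()).ratio()
    return similarity >= min_similarity
```

With a dictionary containing "this" and "zomg" and an OCR guess of "niis", the submission "This" passes, while "ZOMG" is rejected because it shares no characters with the guess.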
      • Re:Exactly what I was wondering by camperdave (Score:2) Thursday May 24, @09:29PM
    • Re:Verification? by penguinbroker (Score:1) Thursday May 24, @06:24PM
    • Re:Verification? by alstor (Score:1) Thursday May 24, @06:30PM
    • Look up the human computation google talk by Rix (Score:2) Thursday May 24, @06:52PM
    • Re:Verification? by DeathElk (Score:2) Thursday May 24, @07:50PM
    • Working as intended by mythar (Score:1) Thursday May 24, @09:10PM
  • Better links (Score:5, Informative)

    by Falkkin (97268) on Thursday May 24, @06:21PM (#19262165)
    (http://colinm.org/)
    The article is lacking some information. Here are some better links:

    Official reCAPTCHA site [recaptcha.net]
    Hide your email address with reCAPTCHA [recaptcha.net] (super easy!)
    A more detailed blog post about how the system works [blogspot.com]

    Disclaimer: I work with Luis von Ahn [cmu.edu], who's the professor running the reCAPTCHA project.
  • idea (Score:1)

    by brunascle (994197) on Thursday May 24, @06:27PM (#19262271)
    someone set up a database of what the words really say along with what we should type instead, and make it public. it'll be fun! like mad libs!
    • Re:idea by Kaetemi (Score:1) Friday May 25, @09:36AM
  • More than just digitizing text (Score:3, Informative)

    by penguinbroker (1000903) on Thursday May 24, @06:37PM (#19262397)
    This would also be a great approach to a lot of NLP/Translation annotation tasks. Although these types of tasks generally require a robustness (knowing which answers to trust and which to ignore) that anonymity makes difficult.

    http://yro.slashdot.org/article.pl?sid=07/04/03/2211258 [slashdot.org]

    I believe amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user ID. They then claim to use the average accuracy over the history of that user to whittle away the worst answers (2 out of 10, I think the patent says).

    I'm sure some form of quality control/check will be needed, and I wonder if such a solution would infringe on this patent?
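As a sketch of the idea as described in this comment (not of the patent itself, whose details are only the poster's recollection): rank answers by each user's historical accuracy, drop the worst few, and take a majority vote over the rest. The function name and the default of dropping 2 answers are illustrative assumptions.

```python
from collections import Counter


def trimmed_vote(answers, accuracy, drop=2):
    """Majority vote after discarding answers from the least reliable users.

    answers:  list of (user_id, answer) pairs
    accuracy: dict mapping user_id to historical accuracy in [0, 1];
              unknown users are treated as 0.0
    drop:     how many of the lowest-ranked answers to discard
    """
    ranked = sorted(
        answers, key=lambda pair: accuracy.get(pair[0], 0.0), reverse=True
    )
    kept = ranked[: max(len(ranked) - drop, 0)]
    if not kept:
        return None
    return Counter(answer for _, answer in kept).most_common(1)[0][0]
```

This needs a stable user ID for every contributor, which is exactly the robustness that anonymous CAPTCHA solvers make difficult.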

  • Booger (Score:2, Insightful)

    What if the OCR cannot read a word because there was a booger on it during the scan? A human won't be able to determine it either because it will be mostly a blotch. How are they gonna know the difference between human-decipherable words and lost-cause words (such as booger blotches)?
    • Re:Booger by penguinbroker (Score:1) Thursday May 24, @06:43PM
      • Re:Booger by Tablizer (Score:1) Thursday May 24, @09:39PM
        • Re:Booger by penguinbroker (Score:1) Thursday May 24, @10:06PM
  • How it could work (Score:3, Insightful)

    I can see how this would work, but in order to also provide security, extra letters or words would also need to be in the captcha. I.e. if there's an un-OCRable word "between", the captcha could contain "frog between" or something like that, and the first word could be a previous un-OCRable word that has been validated by enough people.

    Another method might be to separate out the un-OCRable letters from words and sprinkle them with known letters, though this might be less effective since people can often recognize words far better than individual letters. If one or two letters in a word cannot be interpreted, a person can often still read the entire word.
  • A spam tactic? (Score:1)

    by yogikoudou (806237) on Thursday May 24, @06:48PM (#19262501)
    Spammers are already using CAPTCHA techniques to automate account creations on protected websites:
    1. Have a person subscribe to a porn website by typing in a CAPTCHA image that comes from a legitimate website.
    2. The user provides the correct word while subscribing
    3. Not even a "???" step
    4. Profit! The protected website is spammed.
    I'm wondering whether this system will be used for legitimate OCR purposes or for more spam...
  • A better scheme (Score:1)

    by Yossarian45793 (617611) on Thursday May 24, @06:53PM (#19262555)
    A better scheme would be to give out the same CAPTCHA to 2 or more users. If they agree on the answer, then there's a better chance that the text is correct.
  • ...Wait till you see these new CAPTCHAS.

    Mushed text with letters that slide into each other, bad lighting and every other kind of bad scanning you can imagine. Hell, you'd be lucky if you can recognize letters at all.

    Question is, if the machine couldn't figure out what the word is, how will it verify your answer? Is it going to be something along the lines of "by the popular vote"?

    Something is very not right in all this.
  • A pain for users (Score:2, Insightful)

    by EssenceLumin (755374) on Thursday May 24, @07:11PM (#19262787)
    Great, so now I would have to fill out two of those stupid things instead of one. Why would a company want to inflict this on its users?
  • by jpellino (202698) on Thursday May 24, @07:39PM (#19263133)
    owha tajer kiam

  • This explains (Score:1, Funny)

    by Joebert (946227) on Thursday May 24, @07:50PM (#19263253)
    Please type the characters you see in the following image to register.

    George Bush

    Type: Miserable Failure

    Thank you, click here to proceed.
  • by scott_karana (841914) on Thursday May 24, @08:39PM (#19263729)
    This project isn't the first of its sort: Amazon has the Mechanical Turk project, where users perform various tasks similar to CAPTCHAs for amazon.com credit.

    http://www.mturk.com/ [mturk.com]
  • How stupid (Score:1)

    by holophrastic (221104) on Thursday May 24, @09:35PM (#19264319)
    So, let me get this straight. There are systems out there, in the wild so to speak, that offer security by presenting a task that humans can do easily but machines have trouble doing. And now, this very same system is going to assist machines in solving the very inability upon which the system is based.

    That's the dumbest most retarded (traditional sense of the word) thing that I've ever heard.
  • Missed opportunity (Score:1)

    by h4ter (717700) on Thursday May 24, @10:51PM (#19264987)
    (http://www.tie-rack.org/)
    From the security page of the reCAPTCHA site [recaptcha.net]: "if somebody writes a program that can read our distorted images, we can add more distortions in very little time"

    If someone can write a program to solve the distorted images of OCR-unreadable words, don't you just hire that guy to do your OCR and get out of the CAPTCHA business?
  • Image spam (Score:5, Interesting)

    by JeremyR (6924) on Thursday May 24, @11:12PM (#19265151)
    (http://users.frii.net/jeremy/)
    Maybe this technique can be adapted to fight image spam more effectively :-)
  • CAPTCHA+CAPTCHA (Score:1, Redundant)

    by k3vlar (979024) on Friday May 25, @01:59AM (#19266443)
    I thought the point of CAPTCHAs was to compare what a user types with information stored on the hosting server. If the hosting server doesn't know what the book says, then how can it validate the CAPTCHA?
  • Hmmm, That Looks Like A... (Score:3, Funny)

    by WiseWeasel (92224) on Friday May 25, @03:35AM (#19266963)
    Damnit, where's the smushed bug key?!?
  • by kilgoretrout99 (1025702) on Friday May 25, @05:52AM (#19267599)
    The second request is by definition not a CAPTCHA, since the answer is not known. They're using you to try and determine that answer. This after they've met their security criteria by using a real CAPTCHA. That means this is just unpaid labour! Wait 'till my union rep finds out about this, there'll be trouble!!
  • Any method of anti-spam that causes the user to jump through hoops is a bad design. CAPTCHAs are no more effective than a battery of tests against content at preventing spam, period. While an unscrupulous website operator can lift the CAPTCHA and get unwitting users to submit it, they can't fool systems like, say, Spam Karma [unknowngenius.com] that test for the characteristics of spam. I've been using it for quite a while and it's been 100% accurate in telling me what is or is not spam while providing zero inconvenience to the end user. About the only way for spammers to sneak it by is to *gasp* leave comments using a real person, a task so expensive that it's not worth it.
  • by unablepostAC (1044474) on Friday May 25, @10:52AM (#19270941)
    How do they know if what I type is the real text, if they don't know in advance what it says?
    And if they already know what it says, then why would they need someone else to type it for the first time?
    The extent to which academics can be so out of touch with reality.
  • Maybe they can help piece together secrets from East Germany [bbc.co.uk].
  • This class of CAPTCHA is not always going to work first time, every time. It depends upon the subjective opinion or skill of the user. In my view, the ultimate CAPTCHA has been released:

    www.hotcaptcha.com [hotcaptcha.com]
  • It can instead be a "little job" that must be done before you get to the pr0n.

    However the Iron Internet Law of "lolz > human decency" applies ... and we can look forward to books being translated as "chucknorrischucknorrischucknorrischurknorris..."

  • by wsanders (114993) on Thursday May 24, @06:52PM (#19262547)
    OK, for the humor impaired:

    BUSH IS AN IDIOT

    then you can leave off the Obama part.

    Oh, come on, somebody mod this funny - it's even on-topic. Puhleeez?
  • by multipart/mixed (163409) on Thursday May 24, @07:02PM (#19262661)
    Constitution, consititution...

    Oh! You mean the "E. Plebnista?"
  • Come on people, start using your brains please!, just a little!, half the posters have been asking the same 2 stupid questions, or even worse, posting the same 2 stupid questions with question mark removed, as if they were facts.

    We should put a CAPTCHA system on slashdot:

    When you want to post, you get to type in a CAPTCHA. The image for this is generated in this way:

      - The links to the article/s actually link to a page with a javascript wrapper that loads the article text, but replaces certain words with the graphical representation of that word, in the form of a CAPTCHA.
      - These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one randomly.

    This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)