Slashdot
Fill Out CAPTCHAs, Digitize Books At The Same Time
Posted by
Zonk
on Thu May 24, 2007 06:15 PM
from the i-would-like-to-subscribe-to-your-newsletter dept.
alphadogg wrote with a link to a Networld article about a noble endeavor: putting CAPTCHAs to work for the good of humanity. A scientist at Carnegie Mellon is looking to create a new type of security check that will assist in a project meant to digitize and make searchable text from books and printed materials. Above and beyond that, the offering would probably be more secure than most current systems. "Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project."
Related Stories
IT: Carnegie Mellon CAPTCHA Digitization Project Now Underway 119 comments
tomandlu writes "The BBC is reporting that Carnegie Mellon University has found a novel use for CAPTCHAs: deciphering old texts. We've discussed this project before, but it was prior to it getting off the ground. Users entering text act as a sort of distributed computing project. Basically, the CAPTCHA is made up of two words, one of which is known to Carnegie and one of which isn't. If the user correctly deciphers the known word, then the unknown word is assumed to be correct. Well, almost. Two different users must give the same answer to the same unknown CAPTCHA before it is taken off the list. 'Using the reCAPTCHA system von Ahn's team is digitizing documents and manuscripts as fast as the Internet Archive can supply them, and the good news for book lovers (and bad news for spammers) is that the supply of reCAPTCHAs is not likely to dry up any time soon.'"
This discussion has been archived.
No new comments can be posted.
121 comments
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Verification? (Score:5, Insightful)
CAPTCHAs work because the computers sending them already know what the text says; they start with it in text form and change it into a hard-to-read image. In the system discussed in the article, how will the computer verify that the user response actually matches the text? Sure, it could compare the response to its best guess, but if a program trying to guess the text were as sophisticated as the guessing computer, its guess would match.
I imagine the computer sending the image of hard-to-read text will further obfuscate it in a way that makes it even more difficult for the computer on the receiving end to decipher, but the article doesn't acknowledge that this is one of the first logical questions in conceiving of and implementing this system in a functional way. The article really should cover this...
Re:Verification? (Score:5, Informative)
Official reCAPTCHA site (Score:5, Informative)
I originally missed the link to the official site - D'oh. The article also doesn't mention that the system is already in use! http://recaptcha.net/ [recaptcha.net]
Re:Official reCAPTCHA site (Score:5, Informative)
(Last Journal: Sunday November 06 2005, @11:51PM)
Re:Verification? (Score:5, Insightful)
(http://colinm.org/)
Well... sort of. Multiple agreements are required before the system will accept that it knows the spelling of a previously unknown word. So you're not going to singlehandedly subvert the system; at the very least you need a cabal of friends. But with millions of words available in the system, the chance that you and a bunch of friends will all get the same word and write in the same bogus data is pretty close to zero. I'm not saying this system is impossible to game, but I think it'd be a heck of a lot easier (and more rewarding, if it's the sort of thing that floats your boat) to vandalize Wikipedia instead.
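For anyone who wants to see the agreement rule spelled out, here is a minimal sketch of the scheme as the related story describes it: one known control word, one unknown word, and an unknown word is only accepted once two different users give the same answer. The class name, method names, and the threshold of 2 are illustrative assumptions, not the actual reCAPTCHA implementation.

```python
from collections import defaultdict

# Assumption: the threshold of 2 comes from the related story summary,
# not from any published reCAPTCHA code.
REQUIRED_AGREEMENTS = 2


class UnknownWordPool:
    """Hypothetical tracker for answers to words the OCR could not read."""

    def __init__(self, required=REQUIRED_AGREEMENTS):
        self.required = required
        # word_id -> normalized answer -> set of user ids who gave it
        self.votes = defaultdict(lambda: defaultdict(set))
        # word_id -> accepted transcription
        self.accepted = {}

    def submit(self, user_id, control_answer, control_truth, word_id, unknown_answer):
        """Return True if the user passes the CAPTCHA (control word is right)."""
        if control_answer.strip().lower() != control_truth.strip().lower():
            # Failed the known word, so the unknown answer is not trusted.
            return False
        if word_id not in self.accepted:
            answer = unknown_answer.strip().lower()
            voters = self.votes[word_id][answer]
            voters.add(user_id)  # a set, so one user cannot vote twice
            if len(voters) >= self.required:
                self.accepted[word_id] = answer
        return True
```

Because votes are stored per user, a single person repeating the same bogus answer never reaches the threshold alone, which is the "cabal of friends" point above.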
Re:Verification? (Score:5, Funny)
(Last Journal: Friday November 09, @05:49PM)
e.g.,
12345
l1il1
The captcha software knows the "12345"
but it doesn't know the "l1il1". A human could figure out both.
But spammer captcha deciphering can figure out 12345, and is allowed to incorrectly guess 11ii1 for the 2nd part. End result is
Re:Verification? (Score:5, Insightful)
(http://www.broune.com/)
Re:Verification? (Score:4, Informative)
Yeah, but it's not like you're only allowed to present a given unknown word once. Present it many times, and use the word with the most hits.
--Rob
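A rough sketch of "present it many times and use the word with the most hits", assuming the answers for one unknown word have already been collected into a list. The function name, the `min_votes` threshold, and the tie handling are my own assumptions for illustration.

```python
from collections import Counter


def most_common_answer(answers, min_votes=3):
    """Return the winning transcription, or None if there is no clear winner.

    min_votes is an assumed threshold; ties are treated as "no winner".
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    (top, top_count), *rest = Counter(normalized).most_common(2)
    if top_count < min_votes:
        return None  # not enough people have seen the word yet
    if rest and rest[0][1] == top_count:
        return None  # tie between the two leading answers
    return top
```

With this, a handful of wrong guesses from a bot simply get outvoted, as long as most of the people shown the word are honest humans.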
Re:Exactly what I was wondering (Score:4, Informative)
(http://colinm.org/)
Re:Exactly what I was wondering (Score:4, Insightful)
Also it wouldn't take much to add some grammar to pad the guessing. While we see two words, the system sees them in at least two contexts.
Obviously it has the actual dictionary to help it basically spell check the words we submit to it. If the words we give it are complete garbage, it's unlikely to go for it. Which is where knowing that "niis" needs a correction.
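To make the spell-check idea concrete, here is a small sketch using Python's standard `difflib`. Whether the real system does anything like this is the poster's guess, not something the article confirms; the function name, the word list, and the 0.8 cutoff are all assumptions.

```python
import difflib


def plausible_word(answer, dictionary, cutoff=0.8):
    """Return a dictionary word matching the answer, or None if it looks like garbage.

    dictionary is any set of known lowercase words (hypothetical here).
    """
    word = answer.strip().lower()
    if not word:
        return None
    if word in dictionary:
        return word
    # Fall back to the closest dictionary word, if one is similar enough.
    close = difflib.get_close_matches(word, sorted(dictionary), n=1, cutoff=cutoff)
    return close[0] if close else None
```

A near miss gets pulled back to a real word, while random keyboard mashing matches nothing and can be thrown out. The catch is that old books are full of names and archaic spellings that no dictionary has, so this could only ever be a hint, not a hard filter.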
Better links (Score:5, Informative)
(http://colinm.org/)
Official reCAPTCHA site [recaptcha.net]
Hide your email address with reCAPTCHA [recaptcha.net] (super easy!)
A more detailed blog post about how the system works [blogspot.com]
Disclaimer: I work with Luis von Ahn [cmu.edu], who's the professor running the reCAPTCHA project.
Re:Better links (Score:5, Interesting)
(http://slashdot.org/ | Last Journal: Wednesday January 29 2003, @02:50AM)
idea (Score:1)
More than just digitizing text (Score:3, Informative)
http://yro.slashdot.org/article.pl?sid=07/04/03/2211258 [slashdot.org]
I believe Amazon.com has filed a patent for a solution to this problem which attributes every annotation input to a unique user ID. They then claim to use that user's average accuracy over their history to whittle away the worst answers (2 out of 10, I think the patent says).
I'm sure some form of quality control/check will be needed, and I wonder if such a solution would infringe on this patent?
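Just to illustrate the kind of filtering I mean (this is my own sketch, not what the patent claims and not anything reCAPTCHA is known to do): rank the answers by each user's historical accuracy and drop the bottom fraction. The function name, the 0.5 default for users with no history, and the data shapes are assumptions.

```python
import math


def drop_least_reliable(answers, accuracy, drop_fraction=0.2, default_accuracy=0.5):
    """Discard answers from the least accurate users.

    answers:  list of (user_id, answer) pairs for one item
    accuracy: dict of user_id -> historical accuracy in [0, 1] (hypothetical data)
    """
    # Sort ascending by accuracy so the worst users come first.
    ranked = sorted(answers, key=lambda pair: accuracy.get(pair[0], default_accuracy))
    n_drop = math.floor(len(ranked) * drop_fraction)
    return ranked[n_drop:]
```

With ten answers and the default fraction, the two answers from the least reliable users are discarded before any voting happens.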
Booger (Score:2, Insightful)
(http://www.geocities.com/tablizer | Last Journal: Saturday March 15 2003, @01:22PM)
How it could work (Score:3, Insightful)
(http://www.doofus.org/)
Another method might be to separate out the un-OCRable letters from words and sprinkle them with known letters, though this might be less effective since people can often recognize words far better than individual letters. If one or two letters in a word cannot be interpreted, a person can often still read the entire word.
A spam tactic? (Score:1)
- Have a person subscribe to a porn website by typing in a CAPTCHA image that comes from a legitimate website.
- The user provides the correct word while subscribing
- Not even a "???" step
- Profit! The protected website is spammed.
I'm wondering whether this system will be used for legitimate OCR purposes or for more spam...
A better scheme (Score:1)
So... If you thought that CAPTCHAs were hard... (Score:2)
(http://vitalyb.wordpress.com/)
Mushed text with letters that slide into each other, bad lighting and every other kind of bad scanning you can imagine. Hell, you'd be lucky if you can recognize letters at all.
Question is, if the machine couldn't figure out what the word is, how will it verify your answer? Is it going to be something along the lines of "by the popular vote"?
Something is very not right in all this.
A pain for users (Score:2, Insightful)
Here's an early test phrase... (Score:2)
This explains (Score:1, Funny)
Type: Miserable Failure
Thank you, click here to proceed.
Amazon's Mechanical Turk (Score:1)
http://www.mturk.com/ [mturk.com]
How stupid (Score:1)
That's the dumbest, most retarded (traditional sense of the word) thing that I've ever heard.
Missed opportunity (Score:1)
(http://www.tie-rack.org/)
If someone can write a program to solve the distorted images of OCR-unreadable words, don't you just hire that guy to do your OCR and get out of the CAPTCHA business?
Image spam (Score:5, Interesting)
(http://users.frii.net/jeremy/)
CAPTCHA+CAPTCHA (Score:1, Redundant)
Hmmm, That Looks Like A... (Score:3, Funny)
You're all missing the point (Score:1)
CAPTCHAs are bad design (Score:2)
(http://www.coolestfamilyever.com/)
This sounds like it won't work (Score:1)
And if they already know what it says, then why would they need someone else to type it for the first time?
the extent of how academics can be so out of touch with reality.
Source Material (Score:2)
(http://www.trevorstone.org/)
World's Best CAPTCHA (Score:2)
(http://www.beggarandbird.com/)
www.hotcaptcha.com [hotcaptcha.com]
A captcha doesn't have to function as a password (Score:1)
However the Iron Internet Law of "lolz > human decency" applies ... and we can look forward to books being translated as "chucknorrischucknorrischucknorrischurknorris..."
Re:OK, how to defeat - (Score:1)
BUSH IS AN IDIOT
then you can leave off the Obama part.
Oh, come on, somebody mod this funny - it's even on-topic. Puhleeez?
Re:I got my digitized copy of the US Constitution (Score:3, Funny)
Oh! You mean the "E. Plebnista?"
Great CAPTCHA solution to solve people not RTFA! (Score:5, Interesting)
We should put a CAPTCHA system on Slashdot:
When you want to post, you get to type in a CAPTCHA. The image for this is generated in this way:
- The links to the article/s actually link to a page with a JavaScript wrapper that loads the article text, but replaces certain words with the graphical representation of that word, in the form of a CAPTCHA.
- These words form a phrase that the user must type in if he wants to post. There are different combinations of phrases selected from the article, and each poster gets one randomly.
This technology should be called CAPSSAA (for Completely Automated Public Stupidity test to tell Slashdotters and Assholes Apart)