
Journal: Fighting Spam: Testing for Human Beings
I wrote a script to test whether a web visitor is a human being or not. For background info and technical nitty-gritty, read below.
Programs known as "robots" have been masquerading as real web surfers since the beginning of the Web. Robots can navigate between web pages and servers using hyperlinks just as people do, and can do so very quickly. Most commonly, robots are used to search out data to include in search engines.
Robots can also be used for more nefarious ends, like gathering e-mail addresses or signing up for thousands of e-mail accounts for use in spamming. The key to fighting spam robots is to include a question or problem in the sign-up process that only a human being can easily answer. This problem has to be easy for a computer to generate, but difficult for a computer to solve.
For the most part, websites that want to inhibit automated sign-ups include an image in the page containing a code or key. The text in the image is obfuscated to make digital recognition of the key more difficult for OCR software. Check out an example from AOL's AIM sign-up page or one from Yahoo's.
What other questions can computers create that they themselves can't easily solve, yet that are simple enough to use to verify the humanity of web surfers? I asked myself this question, and the result is a PHP script illustrating one approach that uses Google's vast image search feature.
My script queries Google Image Search for a group of thumbnail images that match a search for something like "3 kittens." It may also try other nouns and other numbers, for searches like "4 pots" or "5 women." Out of the group of thumbnails Google returns, one is randomly chosen and displayed to the user. The script already knows how many of the object are supposed to be in the picture, because that number was part of the query it sent. For the client, though, counting the objects in the picture is computationally very difficult. Not only must the would-be robot know what the object looks like, it must know what it looks like from all angles and in all sizes.
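To make the flow above concrete, here is a minimal sketch in Python rather than the PHP of my actual script. It is not the script itself: the `search` callable stands in for whatever fetches thumbnails from the image search, and `fake_search`, `make_challenge`, `check_answer`, the noun list, and the count range are all names and values made up for illustration.

```python
import random

# Example nouns taken from the searches mentioned above; the list is illustrative.
NOUNS = ["kittens", "pots", "women"]
# Assumed range of counts a person is willing to count (hypothetical choice).
COUNTS = list(range(2, 11))


def make_challenge(search, rng=random):
    """Build one challenge: pick a count and a noun, search, pick a thumbnail.

    `search` is a hypothetical callable that takes a query string and
    returns a list of thumbnail URLs. It is injected so that no real
    search API is assumed here.
    """
    count = rng.choice(COUNTS)
    noun = rng.choice(NOUNS)
    query = f"{count} {noun}"
    thumbnails = search(query)
    if not thumbnails:
        raise LookupError(f"no thumbnails returned for {query!r}")
    # The server keeps the answer; only the image and noun go to the visitor.
    return {"image": rng.choice(thumbnails), "noun": noun, "answer": count}


def check_answer(challenge, submitted):
    """Return True if the visitor's submitted count matches the expected one."""
    try:
        return int(submitted) == challenge["answer"]
    except (TypeError, ValueError):
        return False


def fake_search(query):
    """Stand-in for the image search: returns made-up thumbnail names."""
    slug = query.replace(" ", "-")
    return [f"thumb-{i}-for-{slug}.jpg" for i in range(5)]


if __name__ == "__main__":
    challenge = make_challenge(fake_search)
    print(f"How many {challenge['noun']} are in {challenge['image']}?")
```

The important design point is that the expected count never leaves the server. The visitor only sees the thumbnail and the noun, and the answer is compared on the server side.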
My method is limited by the number of objects a person would be willing to count, which I suspect is somewhere around 10. This means that a robot guessing at random would succeed at least 10% of the time (1 guess in 10), and more often if fewer counts are used. It's also limited by the "inaccuracy" of Google Image Search; sometimes the random thumbnail that's chosen has nothing to do with the query. I have yet to collect data as to what percentage of the Google results are relevant, and whether or not this statistic makes the approach viable when the potential for brute-forcing is considered.
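The brute-force arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope model under stated assumptions: the robot guesses uniformly among the possible counts, and a human answers correctly only when the thumbnail is relevant. The function names are made up for illustration, and the relevance figure is a parameter, since I have no measured value for it yet.

```python
def bot_pass_rate(num_counts):
    """Chance that one uniform random guess matches the expected count."""
    return 1.0 / num_counts


def bot_pass_within(num_counts, attempts):
    """Chance that at least one of `attempts` independent guesses succeeds."""
    return 1.0 - (1.0 - bot_pass_rate(num_counts)) ** attempts


def human_pass_rate(relevance):
    """Simplifying assumption: a human passes exactly when the thumbnail
    is relevant to the query. `relevance` is a fraction between 0 and 1
    and is a placeholder, not measured data."""
    return relevance


if __name__ == "__main__":
    print(bot_pass_rate(10))        # one guess among 10 possible counts
    print(bot_pass_within(10, 10))  # ten independent guesses
```

Under this model, the approach is only useful if the fraction of relevant thumbnails sits well above the robot's guess rate, which is exactly the data I still need to collect.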
I hope to develop the method and keep track of which thumbnails are presented. Future versions may ask the user whether or not he or she was able to recognize the object. There may be other ways to exploit Google Image Search's potential to test for real people, which I also hope to explore.