Developer Creates Infinite Maze That Traps AI Training Bots
An anonymous reader quotes a report from 404 Media: A pseudonymous coder has created and released an open source "tar pit" to indefinitely trap AI training web crawlers in an infinite, randomly generated series of pages, wasting their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants that trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped, or can be deployed "offensively" as a honeypot to waste AI companies' resources.
"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop." You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop." You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
Great! (Score:5, Funny)
Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!
Re: (Score:1)
Re: (Score:3)
"AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."
Ignoring a robots.txt file is not "violating" anything nor is anything stolen.
Re: (Score:1)
Viewing what a server broadcasts (even via crawler) is definitely not stealing-aka-infringement. ToS noise at most.
AI training in a vacuum probably isn't infringement; there's no redistribution, and no one is claiming the results as their own.
Sellers of the "remix" might be. Courts are agonizing over this point.
Re: (Score:2)
It's unauthorised access.
You have stated via the robots.txt file that certain things (ie bots) are NOT authorised to access any other content on the site.
Re:Great! (Score:4, Interesting)
The same argument can be used to make ad blockers illegal. If a website tells you not to use an ad blocker and you use one anyways, is it unauthorized access?
Thankfully, in most jurisdictions, ad blocking is not illegal.
Re: (Score:3)
AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.
That is a nice fantasy you have there. Not enforceable.
Re: (Score:2)
"If a web-developer wants his site not crawled"
By crawlers that honor the robots.txt file.
"...all you need is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."
But they don't, and there's zero chance they will any time soon.
"This is how search"
...some...
"engines already work. This works well"
Except when it doesn't.
It would be less egregious if the assholes running the crawlers would limit the number of simultaneous connections to something reasonable to avoid running the CPU at 100% until it catches on fire. But that's not going to happen, either.
Re: (Score:1)
If a web-developer wants his site not crawled, all you need is a robots.txt file.
How cute. You think that the sleazy scumbags who run tech companies give a shit about your robots.txt file?
Re: Great! (Score:2)
Re: Great! (Score:1)
so let's all make webpages and make the web bigger
go humanity
Re: Great! (Score:2)
Archive.org respects my robots.txt file, so no big deal. Search engine crawlers and AI that don't honor web standards can choke on a firehose of generated crap.
Re:Great! (Score:5, Interesting)
I have zero problem with giving things like robots.txt some teeth, and will be checking Nepenthes out with a view to adding it to the honeypot/tarpit system we already routinely deploy for our clients to blackhole hostile/malicious traffic and provide a softer target that can give a heads-up before the actual systems get hit. When you get right down to it, it's putting the house rules on the door where you can see them on entry: respect them and you'll be fine, but if you choose to ignore them then you shouldn't be surprised when you get bitten.
Re: (Score:2)
Re: (Score:2)
If a web-developer wants his site not crawled, all you need is a robots.txt file.
I bet you believe there is a tooth fairy too.
Re:Great! (Score:4, Funny)
Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!
Put them in a round room and tell them that there's a government subsidy in the corner.
Who (Score:4, Funny)
Perhaps I need to have some young coder teach me how the internet ecology works these days.
Re: (Score:3)
Re: (Score:2, Insightful)
an AI Adventure... (Score:5, Funny)
lock them into a maze of twisty little passages, all alike
Re: (Score:1)
Make sure it's dark. Grues gotta eat.
Nice! (Score:2)
I was not aware crawlers are _this_ primitive. Well, you learn something new every day if you care to.
news that matters? (Score:3)
We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.
Re: (Score:2)
This is literally malware.
So are most crawlers.
Re: (Score:3)
Most of us have figured out (Score:3)
I don't think the Amish had the right idea as far as technology goes. But the basic concept of saying no to something that is inherently destructive isn't a bad idea. It's just a matter of degrees at that point.
Or as one person put it...
Re: (Score:2)
You know what though
There will be drug discoveries
Done by systems that aren't LLMs.
And LLMs will be attributed these discoveries, because their main feature is that people who don't want to be able to tell the difference can't, and want it to be someone else's problem.
Re: news that matters? (Score:2)
They own you.
Re: (Score:2)
If the crawlers were less aggressive then people wouldn't do this.
Re: (Score:3)
We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.
It is neither literally - nor figuratively - malware.
It is the equivalent of a logon screen. AI is unable to productively go where it is unwanted. Authorized users (humans) get past. Good authentication directories are resistant to brute force attack by doing things like temporary lockouts, or increasing the time before the "logon failed" screen shows up, tying up the attacker's resources.
At the human level, it is at worst the equivalent of engaging a telemarketer in conversation knowing full well you have no intention of buying anything.
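As a hypothetical sketch of that lockout/slow-fail idea (the source tracking and the cap are arbitrary choices here): each consecutive failure from a source doubles the delay before the "logon failed" response, so a brute-force attacker's connections sit idle.

    import time
    from collections import defaultdict

    failed = defaultdict(int)  # source address -> consecutive failed attempts

    def check_login(source, supplied, expected):
        if supplied == expected:
            failed.pop(source, None)  # success resets the counter
            return True
        failed[source] += 1
        time.sleep(min(2 ** failed[source], 60))  # 2s, 4s, 8s ... capped at 60s
        return False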
Be a nice netizen.. (Score:2)
Be a nice netizen... and make sure your robots.txt blocks access to this recursive nightmare.
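For instance, assuming the tar pit is mounted under a /nepenthes/ prefix (the actual path is whatever you chose at deploy time):

    User-agent: *
    Disallow: /nepenthes/

Of course, this only keeps out crawlers that honor robots.txt; the ones that ignore it are the intended prey.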
Year 2000 called. They want their teergrube back (Score:5, Informative)
Eesh.
This is an ancient idea. Probably discussed at length right here in the 2000s. No crawler worth its salt is going to even notice this.
Like oh, this:
https://it.slashdot.org/story/... [slashdot.org]
or this:
https://developers.slashdot.or... [slashdot.org]
Re: (Score:2)
Isn't it *interesting* when disruptors blow away an entire industry and then complain that they need to relearn lessons from decades ago.
While you are correct... it's novel that it still works like this. Or works again like this, as the case may be.
Been done (Score:2)
This has been done before, though in the context of trapping spammers who scrape web sites for email addresses: You generate endless pages with random mailto: URLs. And I'm pretty sure any self-respecting bot will have countermeasures to detect and avoid these sorts of traps.
Re: (Score:3)
Replying to self... an obvious counter-measure is to occasionally request a random file and if you don't get a 404, then you're probably in a spider trap. (Just change the last bit of the URL path to a 32-character random string or something.)
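A hypothetical sketch of that probe, using only the Python standard library: swap the last path segment for 32 random characters and see whether the server still answers 200 for a URL it never linked to.

    import random
    import string
    import urllib.error
    import urllib.request
    from urllib.parse import urlsplit, urlunsplit

    def looks_like_spider_trap(url):
        parts = urlsplit(url)
        junk = ''.join(random.choices(string.ascii_lowercase, k=32))
        base = parts.path.rsplit('/', 1)[0]  # drop the last path segment
        probe = urlunsplit(parts._replace(path=base + '/' + junk))
        try:
            with urllib.request.urlopen(probe, timeout=10) as resp:
                return resp.status == 200  # a normal site should 404 here
        except urllib.error.HTTPError:
            return False  # got an error status, behaves like a real site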
Re: (Score:2)
Yeah, but the easy counter to that is 404ing links that you didn't generate and hand out yourself. That involves remembering the links you handed out, but it's not too onerous.
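A hypothetical sketch of that counter, reusing random_slug() from the tar pit sketch above (render_links() is a stand-in for the HTML generation): serve only slugs the maze itself handed out and 404 everything else, so a random probe behaves exactly as it would on a real site.

    issued = set()  # every slug the maze has ever linked to

    def maze_page(slug):
        if slug and slug not in issued:
            return None  # caller responds 404, just like a real site would
        new_slugs = [random_slug() for _ in range(10)]
        issued.update(new_slugs)  # remember what we handed out
        return render_links(new_slugs)

In practice you'd cap or expire the set, since it otherwise grows without bound.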
Re: (Score:2)
If it's for AI training bots (Score:3)
Then each randomly generated URL should deliver randomly generated garbage for the specific purpose of poisoning the training.
If someone were to write a module for Apache that would do this automatically, after identifying the source IP of an AI training crawler, I'd install it in a heartbeat.
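Short of writing a full module, stock mod_rewrite gets close today, assuming the crawler self-identifies by User-Agent (GPTBot, CCBot, and ClaudeBot are real AI-crawler user agents; /garbage is a hypothetical generator endpoint you would have to provide). Keying on source IP, as the parent suggests, would need a maintained address list instead.

    # vhost config or .htaccess, with mod_rewrite enabled
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot) [NC]
    RewriteRule ^ /garbage [PT,L]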
Re: (Score:2)
If you go to the link, you'll see that each page is unique.
Re: (Score:3)
Rewrites, and/or a module
https://www.usenix.org.uk/cont... [usenix.org.uk]
This stuff is kind of cute (Score:4, Interesting)
Re: (Score:2)
It's like trying to fight a nation state with guerilla warfare. It doesn't really work.
The Taliban and the Viet Cong say hi.
Bravo, now duck and cover. (Score:2)
I would suspect, with the chumminess between the tech bros and Trump, that you may not want to brag too much about wasting AI web crawlers' resources. If one of the broligarchy gets upset with you, you may just get disappeared. I expect we'll hear of some new regulations relatively quickly now that this story is out there.
The circle is closing. (Score:3)
I read some interesting articles published in the '50s and '60s (shocking!) that predicted the ONLY way for humans to control thinking software would be to introduce binary-level viruses just to slow them down enough for us humans to interact with them. And I thought..... haha, we would have to be, haha, pretty STUPID, hahaha, to let that happen.....
wouldn't it be better.. (Score:1)
Did this to email scrapers (Score:2)
Made a "directory" of email addresses that were all garbage with a next page link that would load infinitely more garbage. It was about poisoning the well so they couldn't sell the lists to spam mailers for much.