
Developer Creates Infinite Maze That Traps AI Training Bots
An anonymous reader quotes a report from 404 Media: A pseudonymous coder has created and released an open source "tar pit" to indefinitely trap AI training web crawlers in an infinite series of randomly generated pages, wasting their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants that trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped, or can be deployed "offensively" as a honeypot trap to waste AI companies' resources.
"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop." You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop." You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
Great! (Score:5, Funny)
Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!
Re: (Score:1)
Re: (Score:3)
"AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."
Ignoring a robots.txt file is not "violating" anything nor is anything stolen.
Re: (Score:1)
Viewing what a server broadcasts (even via crawler) is definitely not stealing-aka-infringement. ToS noise at most.
AI training in a vacuum probably isn't infringement; there's no redistribution, and no one is claiming the results as their own.
Sellers of the "remix" might be. Courts are agonizing over this point.
Re: (Score:2)
It's unauthorised access.
You have stated via the robots.txt file that certain things (i.e. bots) are NOT authorised to access any other content on the site.
Re:Great! (Score:4, Interesting)
The same argument can be used to make ad blockers illegal. If a website tells you not to use an ad blocker and you use one anyways, is it unauthorized access?
Thankfully, in most jurisdictions, ad blocking is not illegal.
Re:Great! (Score:5, Insightful)
No, because an ad is just extra content you wanted me to fetch. I.e., you stick an ad on your web page, and you hope I will fetch it. I have the option to not fetch it if I so desire.
robots.txt is more like a keep-out sign. The web site is telling you NOT to fetch certain elements - perhaps those elements are not worth fetching, or are expensive to serve. In this case it's more like unauthorized access, because I told you not to go there, but you did anyways.
It's like trespass - if a property owner says "no trespassing" you are not allowed on that property (unless otherwise authorized). That's robots.txt.
Ad blocking is merely deciding to not look at something on the property. If the property owner puts up a billboard, ad-blocking is basically choosing not to look at the billboard as you walk by on the public street. Whether you deliberately chose not to look at it or didn't look at it because you weren't looking at the property is irrelevant. In this case no trespass will ever exist because you never stepped on their property.
Ad blocking can never be illegal - it's simply choosing not to retrieve a piece of content. Now, some sites might retaliate and say if you don't fetch this content, you can't get other content, but that's a different issue.
robots.txt blocking is telling others to keep out - like putting up a "no soliciting" sign to keep out salespeople.
Re: (Score:1)
It’s a good opinion, but is not in fact the law.
Re: (Score:3)
Maybe that's the law where you are, or in your head. This has gone to court multiple times, and they've always found in favor of the end user. It's no more illegal than getting up during an ad-supported network show to take a piss.
GTFOH with your 10 years of watching Law & Order expertise.
Re: Great! (Score:2)
But maybe they decided not to fetch it.
Re: (Score:2)
It's like trespass - if a property owner says "no trespassing" you are not allowed on that property (unless otherwise authorized). That's robots.txt.
That's a pretty silly analogy. You can't "trespass" on a website. It needs to be served to you. It's as much trespass as having a sign on a door saying "Slashdot user tlhlngan is not welcome." while the owner of the door holds the door open for you and ushers you inside.
Enforcing robots.txt restrictions (Was Re:Great!) (Score:2)
Shouldn't be too hard to update this concept for today's technology.
Re: (Score:3, Insightful)
AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.
That is a nice fantasy you have there. Not enforceable.
Re: Great! (Score:2)
The main thing is sites are already doing this for spam, so we may as well get on board. I saw it when I downloaded one of the uncensored horror and adult models from Hugging Face: the gay content it generated was loaded with in-story references to actual gay sites. Somehow spammers embedded them in the adult stories it trained on, knowing bots would train on the text and pass the spam along.
Re: (Score:2)
"If a web-developer wants his site not crawled [by crawlers that honor the robots.txt file], all you need is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."

But they don't, and there's zero chance they will any time soon.

"This is how [some] search engines already work. This works well"

Except when it doesn't.
It would be less egregious if the assholes running the crawlers would limit the number of simultaneous connections to something reasonable to avoid running the CPU at 100% until it catches on fire. But that's not going to happen, either.
Re: (Score:1)
If a web-developer wants his site not crawled, all you need is a robots.txt file.
How cute. You think that the sleazy scumbags who run tech companies give a shit about your robots.txt file?
Re: Great! (Score:2)
Re: Great! (Score:1)
So let's all make webpages and make the web bigger.
Go humanity.
Re: Great! (Score:2)
Archive.org respects my robots.txt file, so no big deal. Search engine crawlers and AI that don't honor web standards can choke on a firehose of generated crap.
Re:Great! (Score:5, Interesting)
I have zero problem with giving things like robots.txt some teeth, and will be checking Nepenthes out with a view to adding it to the honeypot/tarpit system we already routinely deploy for our clients to blackhole hostile/malicious traffic and provide a softer target that can give a heads-up before the actual systems get hit. When you get right down to it, it's putting the house rules on the door where you can see them on entry; respect them and you'll be fine, but if you choose to ignore them then you shouldn't be surprised when you get bitten.
Re: (Score:2)
Re: (Score:3)
Re: (Score:1)
Two things:
- with the level of anti-AI/... hate there is, I would be surprised if this thing hadn't already been deployed by lots of people without any robots.txt blocking
- I wouldn't be surprised if crawlers already have mitigations against this type of thing; it can't be that new, and if this gets any traction, they will detect it easily enough
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
If a web-developer wants his site not crawled, all you need is a robots.txt file.
I bet you believe there is a tooth fairy too.
Re:Great! (Score:5, Funny)
Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!
Put them in a round room and tell them that there's a government subsidy in the corner.
Re: Great! (Score:2)
Re: (Score:2)
Isn't this what the "Cryptocurrency working group" and "Project Stargate" are for?
Who (Score:3, Funny)
Perhaps I need to have some young coder teach me how the internet ecology works these days.
Re: (Score:3)
Re: (Score:2, Insightful)
Re: (Score:2)
an AI Adventure... (Score:5, Funny)
lock them into a maze of twisty little passages, all alike
Re: (Score:2)
Make sure it's dark. Grues gotta eat.
Re: (Score:2)
"You've found the Wumpus"
Nice! (Score:2)
I was not aware crawlers are _this_ primitive. Well, you learn something new every day if you care to.
Re:Nice! (Score:5, Funny)
These are the AI-generated crawlers. They only have the python stack exchange examples, not the hand-crafted perl bots of yesteryear.
news that matters? (Score:2, Interesting)
We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.
Re: (Score:2)
This is literally malware.
So are most crawlers.
Re: (Score:3)
Re: news that matters? (Score:2)
Re: (Score:2)
Most of us have figured out (Score:5, Interesting)
I don't think the Amish had the right idea as far as technology goes. But the basic concept of saying no to something that is inherently destructive isn't a bad idea. It's just a matter of degrees at that point.
Or as one person put it, I want AI to do my laundry and dishes so I can paint pictures and write, but instead I got AI that paints pictures so that I can do laundry and dishes...
Re: (Score:3)
You know what though
There will be drug discoveries
Done by systems that aren't LLMs.
And LLMs will be attributed these discoveries, because their main feature is that people who don't want to be able to tell the difference can't, and want it to be someone else's problem.
Re: news that matters? (Score:3)
They own you.
Re: (Score:2)
If the crawlers were less aggressive then people wouldn't do this.
Re:news that matters? (Score:5, Interesting)
We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.
It is neither literally - nor figuratively - malware.
It is the equivalent of a logon screen. AI is unable to productively go where it is unwanted. Authorized users (humans) get past. Good authentication directories are resistant to brute force attack by doing things like temporary lockouts, or increasing the time before the "logon failed" screen shows up, tying up the attacker's resources.
At the human level, it is at worst the equivalent of engaging a telemarketer in conversation knowing full well you will never buy their product. You are deliberately denying them access to your funds while simultaneously wasting their time so they cannot hassle others.
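That throttling trick is easy to sketch. A toy Python version (verify() is a placeholder callback, not any real auth library's API) doubles the delay after each failed attempt:

# Toy sketch of login throttling: every failed attempt doubles the delay,
# so a brute-force client spends most of its time waiting.
import time

def check_with_backoff(verify, attempts, max_delay=300.0):
    delay = 1.0
    for username, password in attempts:
        if verify(username, password):
            return username
        time.sleep(delay)  # the attacker's connection is tied up here
        delay = min(delay * 2, max_delay)
    return None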
Re: (Score:2)
It's like eating until you explode, which apparently AI crawlers are programmed to do.
Re: (Score:2)
Meh. As long as they put the "tar pit" into the robots.txt.
Ignore robots.txt, get tarpitted. A nice little incentive.
Re: (Score:2)
Be a nice netizen.. (Score:2)
Be a nice netizen..and make sure your robots.txt blocks access to this recursive nightmare...
Year 2000 called. They want their teergrube back (Score:5, Informative)
Eesh.
This is an ancient idea. Probably discussed at length right here in the 2000s. No crawler worth its salt is going to even notice this.
Like oh, this:
https://it.slashdot.org/story/... [slashdot.org]
or this:
https://developers.slashdot.or... [slashdot.org]
Re: (Score:2)
Isn't it *interesting* when disruptors blow away an entire industry and then complain that they need to relearn lessons from decades ago.
While you are correct... it's novel that it still works like this. Or works again like this, as the case may be.
Re: (Score:2)
No, this countermeasure isn't going to do anything useful. Any web spider is going to run into thousands of these, and therefore already is going to be coded to tolerate it fine.
These countermeasures were made to trip up worms and naively coded Perl scripts used by mom and pop spamming operations. That is useful to an extent, but isn't going to do absolutely anything against the bigger players.
All you have to do is keep track of stats like depth and performance, notice that a branch is doing badly, and prune it.
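Roughly like this, in Python (fetch_links is a stand-in for the crawler's fetch-and-parse step, and the depth cap is an illustrative value, not anyone's production setting):

# Sketch of the mitigation above: cap how deep the crawler follows
# same-site links, so an endless self-referential branch gets pruned.
from collections import deque
from urllib.parse import urlparse

MAX_SAME_SITE_DEPTH = 4

def crawl(seed, fetch_links):
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        site = urlparse(url).netloc
        for link in fetch_links(url):  # hypothetical fetch-and-parse callback
            if link in seen:
                continue
            if urlparse(link).netloc == site:
                if depth + 1 > MAX_SAME_SITE_DEPTH:
                    continue           # prune the branch: too deep on one site
                queue.append((link, depth + 1))
            else:
                queue.append((link, 0))  # depth counter resets across sites
            seen.add(link)
    return seen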
Re: (Score:3)
I was going to say the same - I seem to remember people doing the same sort of thing back in the day, but with a view to gaining 'SEO' for something or other. That is, a billion computer generated pages all talking about $subject and linking to $site would somehow make Altavista rank them higher. The 'proper' crawlers took this as a bug report and fixed their bugs - to this day, Google only follows links on the same site to a specific 'depth' (4, IIRC).
However, this is a great idea, and will cause some problems.
Been done (Score:2)
This has been done before, though in the context of trapping spammers who scrape web sites for email addresses: You generate endless pages with random mailto: URLs. And I'm pretty sure any self-respecting bot will have countermeasures to detect and avoid these sorts of traps.
Re: (Score:3)
Replying to self... an obvious counter-measure is to occasionally request a random file and if you don't get a 404, then you're probably in a spider trap. (Just change the last bit of the URL path to a 32-character random string or something.)
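Roughly, in Python (illustrative only):

# Sketch of the probe described above: request a URL that shouldn't exist;
# anything but a 404 suggests the site is a spider trap.
import random
import string
import urllib.error
import urllib.parse
import urllib.request

def looks_like_spider_trap(page_url: str) -> bool:
    slug = "".join(random.choices(string.ascii_lowercase + string.digits, k=32))
    probe = urllib.parse.urljoin(page_url, slug)  # swap the last path segment
    try:
        urllib.request.urlopen(probe, timeout=10).close()
        return True                # a random URL "exists": suspicious
    except urllib.error.HTTPError as err:
        return err.code != 404     # a genuine 404 looks like a normal site
    except OSError:
        return False               # network trouble: inconclusive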
Re: (Score:2)
yeah, but the easy counter to that is 404ing links that you didn't generate and hand out. Which involves remembering the links you handed out but that's not too onerous.
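Sketch of that counter (illustrative Python, not Nepenthes' actual behavior):

# The tarpit remembers every slug it has issued and returns a real 404
# for anything else, defeating the random-URL probe above.
import random
import string

issued_slugs: set[str] = set()

def new_trap_link() -> str:
    slug = "".join(random.choices(string.ascii_lowercase, k=12))
    issued_slugs.add(slug)
    return f"/{slug}"

def status_for(path: str) -> int:
    return 200 if path.lstrip("/") in issued_slugs else 404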
Re: (Score:3)
Re: (Score:2)
Yeah, back then it brought the nightmare of backscatter :)
If it's for AI training bots (Score:3)
Then each randomly generated URL should deliver randomly generated garbage for the specific purpose of poisoning the training.
If someone were to write a module for Apache that would do this automatically, after identifying the source IP of an AI training crawler, I'd install it in a heartbeat.
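An Apache module is beyond a comment box, but the garbage generator itself is a few lines. A rough Python sketch (purely illustrative; the seed corpus is whatever real text you have lying around):

# A tiny Markov babbler that turns any seed text into plausible-looking
# nonsense for each generated trap page.
import random

def build_chain(corpus: str) -> dict[str, list[str]]:
    words = corpus.split()
    chain: dict[str, list[str]] = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain: dict[str, list[str]], length: int = 200) -> str:
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        word = random.choice(chain.get(word) or list(chain))
        out.append(word)
    return " ".join(out)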
Re: (Score:2)
if you go to the link, you'll see that each page is unique.
Re: (Score:3)
Rewrites, and/or a module
https://www.usenix.org.uk/cont... [usenix.org.uk]
This stuff is kind of cute (Score:4, Insightful)
Re:This stuff is kind of cute (Score:5, Insightful)
It's like trying to fight a nation state with guerilla warfare. It doesn't really work.
The Taliban and the Viet Cong say hi.
Re: (Score:2)
The Viet Cong, as an effective fighting force, disappeared after Tet '68.
Re:This stuff is kind of cute (Score:4, Insightful)
Not really. This tarpit targets a key weakness of LLMs - if you train them on AI generated information, you get devolution of the model. The more AI generated content the model is trained on, the worse the end result.
And many AI crawlers are not obeying robots.txt and hammering websites with way more traffic than they normally get. This is because in the big AI rush, they're just trying to get data.
So I see no reason why you can't sprinkle a ton of ChatGPT produced content (among other things) into hidden web pages. You can exclude them with robots.txt, so proper crawlers (which include most search engines) will never see them, but AI crawlers likely will in their quest for dominance.
Meanwhile, if we're deploying a great chunk of AI-generated crap for AI crawlers to read, we're also destroying the models, which is a bonus.
Bravo, now duck and cover. (Score:2)
I would suspect, with the chumminess between the tech bros and Trump, that you may not want to brag too much about wasting AI web crawlers' resources. If one of the broligarchy gets upset with you, you may just get disappeared. I expect we'll hear of some new regulations relatively quickly now that this story is out there.
The circle is closing. (Score:3)
I read some interesting articles published in the '50s and '60s (shocking!) that predicted the ONLY way for humans to control thinking software would be to introduce binary-level viruses just to slow them down enough for us humans to interact with them. And I thought..... haha, we would have to be haha, pretty STUPID, hahaha, to let that happen.....
Re: (Score:2)
Right. What you do is tell crawlers in your robots.txt to not go into the tarpit's endpoint.
That way only the LLMs that ignore your robots.txt will fall into the rabbit hole of endless crud.
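For example, a robots.txt along these lines (the /trap/ path is just a placeholder for wherever the tarpit is mounted, not a Nepenthes requirement):

# Hypothetical robots.txt: well-behaved crawlers skip the trap;
# those that ignore the file wander into /trap/ and get stuck.
User-agent: *
Disallow: /trap/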
wouldn't it be better.. (Score:1)
Did this to email scrapers (Score:3, Insightful)
Made a "directory" of email addresses that were all garbage with a next page link that would load infinitely more garbage. It was about poisoning the well so they couldn't sell the lists to spam mailers for much.
Re: (Score:3)
me too! and they were all hotmail addresses too.
took a census list of names and randomly combined them.
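Something like this, roughly (the name lists here are placeholders, not the census data):

# Sketch of the trick described above: recombine names from a public list
# into endless fake mailto: links, poisoning any scraped address list.
import random

FIRST = ["james", "mary", "robert", "patricia", "john", "jennifer"]
LAST = ["smith", "johnson", "williams", "brown", "jones", "garcia"]

def fake_address() -> str:
    return f"{random.choice(FIRST)}.{random.choice(LAST)}@hotmail.com"

def directory_page(count: int = 50) -> str:
    items = "".join(
        f'<li><a href="mailto:{addr}">{addr}</a></li>'
        for addr in (fake_address() for _ in range(count))
    )
    return f'<ul>{items}</ul><a href="?page=next">next page</a>'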
Re: Did this to email scrapers (Score:2)
A suggestion (Score:2)
Maybe have it also dump some random generated text as well? Or even random images?
That will hopefully screw some of the AI text / image generation capabilities.
Re: (Score:2)
Re: (Score:2)
Don't ML systems feeding on other ML systems' outputs degrade over time?
Maybe put a bunch of ML-generated junk over there on "relevant topics" and let it all feed on each other, lol.
Or if your site has the resources, have it generate the junk live as the bots crawl.
Re: (Score:2)
ML systems degrade on low quality content. That includes: generative content (like here), uncurated AI output that includes bad results, spam, slashdot comments, conspiracy theorists, etc.
What one does against this is pre-filter the dataset after crawling. For example, images are assigned an aesthetics score; if your image scores too low, it is not included in the training data (even though it was downloaded). A good filter will recognize everything on the list above; a bad one may not.
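In Python terms the filtering step amounts to something like this (score() stands in for whatever aesthetics/quality model is actually used; the threshold is an arbitrary example):

# Sketch of the pre-filtering step: score each crawled item and keep only
# those above a quality threshold before they enter the training set.
def filter_dataset(items, score, threshold=0.5):
    return [item for item in items if score(item) >= threshold]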
Anyone think about the consequences? (Score:2)
As in "Wasting Resources" - particularly energy hungry "Resources" .
On the face of it this seems clever, but ultimately it cant be seen as a viable solution.
Pointless (Score:2)
That doesn't mean it's not easy to avoid, it just means you haven't caused them enough pain to bother. The very second you do cause them enough pain they'll implement checks to detect the trap and bail out. Limit the recursive link depth, or the number of pages fetched from any one site.