Developer Creates Infinite Maze That Traps AI Training Bots

An anonymous reader quotes a report from 404 Media: A pseudonymous coder has created and released an open source "tar pit" that indefinitely traps AI training web crawlers in an infinite series of randomly generated pages to waste their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants that trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped, or can be deployed "offensively" as a honeypot to waste AI companies' resources.

"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop."
You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
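As described in the quote, the trap needs almost no logic of its own: every page is just a list of random links that point back into the same maze, served slowly. A minimal sketch of that idea in Python (my own illustration, not the actual Nepenthes code) might look like:

```python
# Sketch of a Nepenthes-style tarpit (an illustration, not the real
# Nepenthes implementation): every page is nothing but random links
# that lead back into the same maze, served with a deliberate delay.
import random
import string
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

def random_path(depth=3, seg_len=8):
    """Build a random-looking relative URL like /kqznwpma/xcvtyuio/abcdwxyz."""
    segs = ("".join(random.choices(string.ascii_lowercase, k=seg_len))
            for _ in range(depth))
    return "/" + "/".join(segs)

def make_page(n_links=20):
    """Render an HTML page whose only content is links deeper into the maze."""
    links = "\n".join(
        f'<a href="{random_path()}">more</a>' for _ in range(n_links)
    )
    return f"<html><body>\n{links}\n</body></html>"

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # loads slowly on purpose, as the article notes
        body = make_page().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

def serve(port=8080):
    """Start the maze; blocks forever serving self-referential link pages."""
    HTTPServer(("", port), TarpitHandler).serve_forever()
```

Since every generated link resolves to another freshly generated page, a naive download-every-link crawler never runs out of URLs to fetch.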

Comments Filter:
  • Great! (Score:5, Funny)

    by jenningsthecat ( 1525947 ) on Thursday January 23, 2025 @04:45PM (#65113433)

    Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!

    • If a web-developer wants his site not crawled, all you need is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole. This is how search engines already work. This works well
      • by dfghjk ( 711126 )

        "AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."

        Ignoring a robots.txt file is not "violating" anything nor is anything stolen.

        • by Anonymous Coward

          Viewing what a server broadcasts (even via crawler) is definitely not stealing-aka-infringement. ToS noise at most.

          AI training in a vacuum probably isn't infringement, there's no redistribution or anyone claiming the results as their own.

          Sellers of the "remix" might be. Courts are agonizing over this point.

      • by gweihir ( 88907 )

        AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.

        That is a nice fantasy you have there. Not enforceable.

      • by taustin ( 171655 )

        If a web-developer wants his site not crawled

        By crawlers that honor the robots.txt file.

        , all you need is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.

        But they don't, and there's zero chance they will any time soon.

        This is how search

        some

        engines already work. This works well

        Except when it doesn't.

        It would be less egregious if the assholes running the crawlers would limit the number of simultaneous connections to something reasonable to avoid running the CPU at 100% until it catches on fire. But that's not going to happen, either.

      • If a web-developer wants his site not crawled, all you need is a robots.txt file.

        How cute. You think that the sleazy scumbags who run tech companies give a shit about your robots.txt file

      • Too late. Microsoft already said they are going to ingest the entire web and that is their interpretation of fair use.
      • Re:Great! (Score:5, Interesting)

        by Zocalo ( 252965 ) on Thursday January 23, 2025 @05:43PM (#65113637) Homepage
        Since there's no real obligation to honour robots.txt you have it backwards. If enough users deploy Nepenthes, and it doesn't need to be everyone - just enough to cause pain, that becomes "if an AI crawler (or any other crawler for that matter) wants to avoid getting tarpitted then they *need* to honor robots.txt".

        I have zero problem with giving things like robots.txt some teeth, and will be checking Nepenthes out with a view to adding it to the honeypot/tarpit system we already routinely deploy to our clients to blackhole hostile/malicious traffic and provide a softer target that can give us a heads-up before the actual systems get hit. When you get right down to it, it's putting the house rules on the door where you can see them on entry; respect them and you'll be fine, but if you choose to ignore them then you shouldn't be surprised when you get bitten.
        • I wonder how quickly something like this gets your site on some blacklists shared by the "crawler industry" because surely they already have something like this for other problematic site behaviors
      • If a web-developer wants his site not crawled, all you need is a robots.txt file.

        I bet you believe there is a tooth fairy too.

    • Re:Great! (Score:4, Funny)

      by PPH ( 736903 ) on Thursday January 23, 2025 @06:29PM (#65113759)

      Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!

      Put them in a round room and tell them that there's a government subsidy in the corner.

  • Who (Score:4, Funny)

    by dpille ( 547949 ) on Thursday January 23, 2025 @04:58PM (#65113465)
    Who is going to want to "protect their own content" by dropping off the first hundred pages of web search results?

    Perhaps I need to have some young coder teach me how the internet ecology works these days.
    • by Zocalo ( 252965 )
      I routinely deploy tarpits as part of a honeypot system and most of the big search engine crawlers, with a few exceptions, do actually honour robots.txt, although there are some that have multiple crawlers with some that honour it and some that don't. Even so, it's trivial enough to put the stuff you want crawled under one set of rules and the rest under another and only tarpit things that go off piste in the latter set - put the entrances to the tarpit a few levels down from your homepage and on a hidden
      • Re: (Score:2, Insightful)

        by Anonymous Coward
        But how do the good spiders get "the stuff you want crawled" yet the AI training ones don't?
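One common answer is per-agent rules in robots.txt, which only works for crawlers that honor it, which is this thread's whole point. A sketch (user-agent strings such as OpenAI's "GPTBot" and Google's "Googlebot" are the vendors' published tokens; the paths are hypothetical):

```
# Hypothetical robots.txt: let a search crawler in, keep an AI trainer out.
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /private/
```

Crawlers that ignore these rules and wander into the disallowed paths are exactly the ones the tarpit is meant to catch.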
  • by guygo ( 894298 ) on Thursday January 23, 2025 @05:02PM (#65113471)

    lock them into a maze of twisty little passages, all alike

  • I was not aware crawlers are _this_ primitive. Well, you learn something new every day if you care to.

  • by dfghjk ( 711126 ) on Thursday January 23, 2025 @05:10PM (#65113501)

    We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.

    • by taustin ( 171655 )

      This is literally malware.

      So are most crawlers.

      • Most? If there is any AI company of note whose crawler is disregarding robots.txt I'd be interested to hear about it.
    • That AI is not going to benefit us. For every drug discovery we get, there's going to be a thousand AIs designed to take our jobs and make us redundant, and a thousand more designed to post to social media and whip up the uneducated into a frenzy of violence and stupidity.

      I don't think the Amish had the right idea as far as technology goes. But the basic concept of saying no to something that is inherently destructive isn't a bad idea. It's just a matter of degrees at that point.

      Or as one person put it
      • You know what though
        There will be drug discoveries

        Done by systems that aren't LLMs.

        And LLMs will be attributed these discoveries, because their main feature is that people who don't want to be able to tell the difference can't, and want it to be someone else's problem.

    • What isn't malware these days? I hope you don't use any Microsoft, Apple, Google, Amazon, Adobe, Etc

      They own you.
    • If the crawlers were less aggressive then people wouldn't do this.

    • We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.

      It is neither literally - nor figuratively - malware.

      It is the equivalent of a logon screen. AI is unable to productively go where it is unwanted. Authorized users (humans) get past. Good authentication directories are resistant to brute force attack by doing things like temporary lockouts, or increasing the time before the "logon failed" screen shows up, tying up the attacker's resources.

      At the human level, it is at worst the equivalent of engaging a telemarketer in conversation knowing full well yo

  • Be a nice netizen..and make sure your robots.txt blocks access to this recursive nightmare...
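Following that advice, a robots.txt along these lines (assuming the tarpit is mounted at a hypothetical /nepenthes/ path) keeps well-behaved crawlers out of the maze:

```
# Hypothetical robots.txt for a site running a tarpit under /nepenthes/
User-agent: *
Disallow: /nepenthes/
```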

  • by vadim_t ( 324782 ) on Thursday January 23, 2025 @05:12PM (#65113519) Homepage

    Eesh.

    This is an ancient idea. Probably discussed at length right here in the 2000s. No crawler worth its salt is going to even notice this.

    Like oh, this:

    https://it.slashdot.org/story/... [slashdot.org]

    or this:

    https://developers.slashdot.or... [slashdot.org]

    • Isn't it *interesting* when disruptors blow away an entire industry and then complain that they need to relearn lessons from decades ago.

      While you are correct... it's novel that it still works like this. Or works again like this, as the case may be.

  • This has been done before, though in the context of trapping spammers who scrape web sites for email addresses: You generate endless pages with random mailto: URLs. And I'm pretty sure any self-respecting bot will have countermeasures to detect and avoid these sorts of traps.

    • by dskoll ( 99328 )

      Replying to self... an obvious counter-measure is to occasionally request a random file and if you don't get a 404, then you're probably in a spider trap. (Just change the last bit of the URL path to a 32-character random string or something.)

      • by suutar ( 1860506 )

        yeah, but the easy counter to that is 404ing links that you didn't generate and hand out. Which involves remembering the links you handed out but that's not too onerous.
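The remember-what-you-handed-out counter described above can be sketched like this (my own illustration; the path scheme is made up):

```python
# Sketch of the counter-countermeasure: remember every link the maze
# hands out and 404 anything else, so a bot probing a random URL gets
# the 404 it would expect from a legitimate site.
import random
import string

issued = set()  # every maze path we have actually generated

def issue_link():
    """Generate a maze link and record it as legitimate."""
    path = "/maze/" + "".join(random.choices(string.ascii_lowercase, k=12))
    issued.add(path)
    return path

def status_for(path):
    """200 for links we generated, 404 for anything else (e.g. a probe)."""
    return 200 if path in issued else 404
```

The set grows with the number of links served, which is the "not too onerous" memory cost the comment mentions.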

  • by taustin ( 171655 ) on Thursday January 23, 2025 @05:17PM (#65113549) Homepage Journal

    Then each randomly generated URL should deliver randomly generated garbage for the specific purpose of poisoning the training.

    If someone were to write a module for Apache that would do this automatically, after identifying the source IP of an AI training crawler, I'd install it in a heartbeat.
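No such Apache module is mentioned in the article; as a hypothetical stand-in, the garbage-generation half of the idea could be sketched in Python like this:

```python
# Sketch of the poisoning idea (an illustration, not a real Apache
# module): once a client is flagged as an AI training crawler, every
# URL returns plausible-looking but meaningless filler text.
import random

WORDS = "the of and a to in is was for on that with as it at by".split()

def garbage_paragraph(n_words=80, seed=None):
    """A paragraph of statistically bland, meaningless word salad."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n_words))

def garbage_page(n_paragraphs=5, seed=None):
    """An HTML page of garbage paragraphs to feed a flagged crawler."""
    rng = random.Random(seed)
    paras = (garbage_paragraph(seed=rng.random()) for _ in range(n_paragraphs))
    return "<html><body><p>" + "</p><p>".join(paras) + "</p></body></html>"
```

The hard part, identifying which requests come from AI training crawlers in the first place, is left open here, just as it is in the comment.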

  • by rsilvergun ( 571051 ) on Thursday January 23, 2025 @05:28PM (#65113599)
    But it's written by hobbyists while the AI is written by professionals working around the clock. It will be defeated because they just have more resources than you do. It's like trying to fight a nation state with guerilla warfare. It doesn't really work. Not unless a third party steps in to fund the guerrillas like a color revolution. And somehow I don't think that's happening here. All the major AI companies are pretty well aligned and while they're competing to see who's top dog they're not going to actively attack each other or help people who do.
    • It's like trying to fight a nation state with guerilla warfare. It doesn't really work.

      The Taliban and the Viet Cong say hi.

  • I would suspect with the chumminess between the tech bros and Trump, you may not want to brag too much about wasting AI web crawlers' resources. If one of the broligarchy gets upset with you, you may just get disappeared. I expect we'll hear of some new regulations relatively quickly now that this story is out there.

  • by dowhileor ( 7796472 ) on Thursday January 23, 2025 @06:23PM (#65113747)

    I read some interesting articles published in the 60's and 50's (shocking!) that predicted the ONLY way for humans to control thinking software will be to introduce binary level viruses just to slow them down enough for us humans to interact with them. And I thought..... haha, we would have to be haha, pretty STUPID, hahaha, to let that happen.....

  • to trap the crawler in an argument that results in the AI itself changing its mind and becoming a goatherder?
  • Made a "directory" of email addresses that were all garbage with a next page link that would load infinitely more garbage. It was about poisoning the well so they couldn't sell the lists to spam mailers for much.
