Developer Creates Infinite Maze That Traps AI Training Bots

An anonymous reader quotes a report from 404 Media: A pseudonymous coder has created and released an open source "tar pit" to indefinitely trap AI training web crawlers in an infinite, randomly generated series of pages, wasting their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed "offensively" as a honeypot trap to waste AI companies' resources.

"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop."
You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
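
The mechanism the quote describes is tiny. A minimal sketch in Python of that behaviour - not the actual Nepenthes code; the port, link count, and delay are illustrative:

    import random
    import string
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class TarpitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Every page is nothing but random links leading back into the maze.
            links = "".join(
                '<a href="/%s">more</a><br>\n'
                % "".join(random.choices(string.ascii_lowercase, k=12))
                for _ in range(20)
            )
            body = ("<html><body>%s</body></html>" % links).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            time.sleep(2)  # loads slowly on purpose, like the demo link above
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), TarpitHandler).serve_forever()

A naive crawler that blindly follows every link never runs out of URLs here, which is exactly the minotaur trap Aaron B describes.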
  • Great! (Score:5, Funny)

    by jenningsthecat ( 1525947 ) on Thursday January 23, 2025 @05:45PM (#65113433)

    Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!

    • If a web developer wants his site not crawled, all he needs is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole. This is how search engines already work. This works well.
      • by dfghjk ( 711126 )

        "AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."

        Ignoring a robots.txt file is not "violating" anything nor is anything stolen.

        • by Anonymous Coward

          Viewing what a server broadcasts (even via crawler) is definitely not stealing-aka-infringement. ToS noise at most.

          AI training in a vacuum probably isn't infringement, there's no redistribution or anyone claiming the results as their own.

          Sellers of the "remix" might be. Courts are agonizing over this point.

          • by Bert64 ( 520050 )

            It's unauthorised access.
            You have stated via the robots.txt file that certain things (ie bots) are NOT authorised to access any other content on the site.

            • Re:Great! (Score:4, Interesting)

              by GuB-42 ( 2483988 ) on Thursday January 23, 2025 @09:12PM (#65113993)

              The same argument can be used to make ad blockers illegal. If a website tells you not to use an ad blocker and you use one anyways, is it unauthorized access?

              Thankfully, in most jurisdictions, ad blocking is not illegal.

              • Re:Great! (Score:5, Insightful)

                by tlhIngan ( 30335 ) <slashdot@@@worf...net> on Thursday January 23, 2025 @11:31PM (#65114141)

                The same argument can be used to make ad blockers illegal. If a website tells you not to use an ad blocker and you use one anyways, is it unauthorized access?

                No, because an ad is just extra content you wanted me to fetch. I.e., you stick an ad on your web page, and you hope I will fetch it. I have the option to not fetch it if I so desire.

                robots.txt is more like a keep out sign. The web site is telling you to NOT fetch certain elements - perhaps those elements are not worth fetching or cost a lot to fetch. In this case it's more like unauthorized access because I told you to not go there, but you did anyways.

                It's like trespass - if a property owner says "no trespassing" you are not allowed on that property (unless otherwise authorized). That's robots.txt.

                Ad blocking is merely deciding to not look at something on the property. If the property owner puts up a billboard, ad-blocking is basically choosing not to look at the billboard as you walk by on the public street. Whether you deliberately chose not to look at it or didn't look at it because you weren't looking at the property is irrelevant. In this case no trespass will ever exist because you never stepped on their property.

                Ad blocking can never be illegal - because it's simply choosing not to retrieve a web page or piece of content. Now, some sites might retaliate and say if you don't get this content, you can't get other content, but that's a different issue.

                robots.txt blocking is telling others to keep out - like putting up a "no soliciting" sign to keep out salespeople.

                • by 1nt3lx ( 124618 )

                  It’s a good opinion, but it is not in fact the law.

                  • by aergern ( 127031 )

                    Maybe that's the law where you are, or in your head. This has gone to court multiple times, and courts have always found in favor of the end user. It's no more illegal than getting up during an ad-supported network show so you can take a piss.

                    GTFOH with your 10 years of watching Law & Order expertise.

                • But maybe they decided not to fetch it.

                • It's like trespass - if a property owner says "no trespassing" you are not allowed on that property (unless otherwise authorized). That's robots.txt.

                  That's a pretty silly analogy. You can't "trespass" on a website. It needs to be served to you. It's as much trespass as having a sign on a door saying "Slashdot user tlhlngan is not welcome." while the owner of the door holds the door open for you and ushers you inside.

        • Many years ago I was a client of a web hosting service that blocked rogue web crawlers by a simple expedient. Its robots.txt file banned a directory and there was an invisible link on the page footers to a file in that directory. Anything accessing that file had its IP address blocked.

          Shouldn't be too hard to update this concept for today's technology.
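
          A rough sketch of that concept with today's tooling, assuming an nginx combined-format access log and an nftables set named "blocklist" (the trap path and blocking command are illustrative, not from the original setup):

            import re
            import subprocess

            TRAP_PATH = "/private/trap.html"  # disallowed in robots.txt, linked invisibly
            LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)')

            def block(ip):
                # Add the offender to a firewall blocklist; adapt to your firewall.
                subprocess.run(["nft", "add", "element", "inet", "filter",
                                "blocklist", "{", ip, "}"], check=False)

            with open("/var/log/nginx/access.log") as log:
                for line in log:
                    m = LOG_LINE.match(line)
                    if m and m.group(2).startswith(TRAP_PATH):
                        block(m.group(1))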
      • Re: (Score:3, Insightful)

        by gweihir ( 88907 )

        AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.

        That is a nice fantasy you have there. Not enforceable.

      • by taustin ( 171655 )

        If a web developer wants his site not crawled

        By crawlers that honor the robots.txt file.

        , all he needs is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.

        But they don't, and there's zero chance they will any time soon.

        This is how [some] search engines already work. This works well.

        Except when it doesn't.

        It would be less egregious if the assholes running the crawlers would limit the number of simultaneous connections to something reasonable to avoid running the CPU at 100% until it catches on fire. But that's not going to happen, either.

      • If a web developer wants his site not crawled, all he needs is a robots.txt file.

        How cute. You think that the sleazy scumbags who run tech companies give a shit about your robots.txt file?

      • Too late. Microsoft already said they are going to ingest the entire web, and that is their interpretation of fair use.
      • Re:Great! (Score:5, Interesting)

        by Zocalo ( 252965 ) on Thursday January 23, 2025 @06:43PM (#65113637) Homepage
        Since there's no real obligation to honour robots.txt, you have it backwards. If enough users deploy Nepenthes - and it doesn't need to be everyone, just enough to cause pain - then it becomes "if an AI crawler (or any other crawler for that matter) wants to avoid getting tarpitted then they *need* to honour robots.txt".

        I have zero problem with giving things like robots.txt some teeth, and will be checking Nepenthes out with a view to adding it to the honeypot/tarpit system we already routinely deploy for our clients to blackhole hostile/malicious traffic and provide a softer target that can give us a heads-up before the actual systems get hit. When you get right down to it, it's putting the house rules on the door where you can see them on entry; respect them and you'll be fine, but if you choose to ignore them then you shouldn't be surprised when you get bitten.
        • I wonder how quickly something like this gets your site on some blacklists shared by the "crawler industry", because surely they already have something like this for other problematic site behaviors.
          • by Zocalo ( 252965 )
            AFAICT, none that matter. Firstly, these are not typically consumer-facing sites that need to be discovered at random via search, and secondly, there's generally a pretty clear line between the actual and honeypot servers. Often that line means adjacent IP addresses/ranges to the active servers, and where they are down in the weeds of an active site, it'll typically align with the section being on a different sub-domain and IP, doing DB-driven data provisioning that has zero value in being indexed anyway.
        • by Anonymous Coward

          2 things:
          - with the level of anti-AI/... hate there is, I would be surprised if this thing hadn't been deployed by lots of people already, without any robots.txt blocking
          - I wouldn't be surprised if crawlers already have mitigations against this type of thing; it can't be that new, and if this gets any traction, they will detect it easily enough

          • by Zocalo ( 252965 )
            Most of the big crawlers, like Google's, do indeed seem to have mitigations. Honeypots and tarpits have been around for years, so there's usually some mechanism to detect things are taking too long and terminate the connection and move to the next site/URL. Script-kiddie level stuff, not so much - some of that you can keep trapped indefinitely. That's all fine. The idea is to gather data on genuinely malicious actors for blocklisting from production servers, and any tarpitting is also usually very asymmetric.
        • by jsonn ( 792303 )
          I mean you can combine things: have Nepenthes and also a robots.txt that blacklists it...
      • If a web developer wants his site not crawled, all he needs is a robots.txt file.

        I bet you believe there is a tooth fairy too.

    • Re:Great! (Score:5, Funny)

      by PPH ( 736903 ) on Thursday January 23, 2025 @07:29PM (#65113759)

      Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!

      Put them in a round room and tell them that there's a government subsidy in the corner.

    • Isn't this what the "Cryptocurrency working group" and "Project Stargate" are for?

  • Who (Score:3, Funny)

    by dpille ( 547949 ) on Thursday January 23, 2025 @05:58PM (#65113465)
    Who is going to want to "protect their own content" by dropping off the first hundred pages of web search results?

    Perhaps I need to have some young coder teach me how the internet ecology works these days.
    • by Zocalo ( 252965 )
      I routinely deploy tarpits as part of a honeypot system, and most of the big search engine crawlers, with a few exceptions, do actually honour robots.txt, although there are some that have multiple crawlers, some of which honour it and some of which don't. Even so, it's trivial enough to put the stuff you want crawled under one set of rules and the rest under another, and only tarpit things that go off piste in the latter set - put the entrances to the tarpit a few levels down from your homepage and on a hidden link.
      • Re: (Score:2, Insightful)

        by Anonymous Coward
        But how do the good spiders get "the stuff you want crawled" yet the AI training ones don't?
        • by Zocalo ( 252965 )
          Crawlers are just automated web browsers and identify themselves just like web browsers do and, just like web browsers, they can also fake this. So, you need to make a judgement call on how you handle that info, and whether you allow the crawler to proceed or not. It also depends on the site data; it may be that you're fine with anyone indexing your home page and so on (which will get you your search rankings), but everything else can be off limits because it has limited SEO value, such as some online datasets.
  • by guygo ( 894298 ) on Thursday January 23, 2025 @06:02PM (#65113471)

    lock them into a maze of twisty little passages, all alike

  • I was not aware crawlers are _this_ primitive. Well, you learn something new every day if you care to.

  • news that matters? (Score:2, Interesting)

    by dfghjk ( 711126 )

    We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.

    • by taustin ( 171655 )

      This is literally malware.

      So are most crawlers.

    • by rsilvergun ( 571051 ) on Thursday January 23, 2025 @06:31PM (#65113607)
      That AI is not going to benefit us. For every drug discovery we get, there's going to be a thousand AIs designed to take our jobs and make us redundant, and a thousand more designed to post on social media and whip up the uneducated into a frenzy of violence and stupidity.

      I don't think the Amish had the right idea as far as technology goes. But the basic concept of saying no to something that is inherently destructive isn't a bad idea. It's just a matter of degrees at that point.

      Or as one person put it, I want AI to do my laundry and dishes so I can paint pictures and write, but instead I got AI that paints pictures so that I can do laundry and dishes...
      • You know what, though?
        There will be drug discoveries

        Done by systems that aren't LLMs.

        And LLMs will be attributed these discoveries, because their main feature is that people who don't want to be able to tell the difference can't, and want it to be someone else's problem.

    • What isn't malware these days? I hope you don't use anything from Microsoft, Apple, Google, Amazon, Adobe, etc.

      They own you.
    • If the crawlers were less aggressive then people wouldn't do this.

    • by PsychoSlashDot ( 207849 ) on Thursday January 23, 2025 @06:48PM (#65113647)

      We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.

      It is neither literally - nor figuratively - malware.

      It is the equivalent of a logon screen. AI is unable to productively go where it is unwanted. Authorized users (humans) get past. Good authentication directories are resistant to brute force attack by doing things like temporary lockouts, or increasing the time before the "logon failed" screen shows up, tying up the attacker's resources.

      At the human level, it is at worst the equivalent of engaging a telemarketer in conversation knowing full well you will never buy their product. You are deliberately denying them access to your funds while simultaneously wasting their time so they cannot hassle others.

    • If anyone enters an unknown maze and gets lost, while ignoring the fact that you can just stop advancing or participating, whose fault is that really?

      It's like eating until you explode, which apparently AI crawlers are programmed to do.
    • Meh. As long as they put the "tar pit" into the robots.txt.

      Ignore robots.txt, get tarpitted. A nice little incentive.

      • I haven’t run a personal webserver for at least 20 years, but when I did, I had exactly that: an infinite amount of generated web content, with mail addresses that would get you added to a spam blacklist, that was -only- mentioned as ‘do not crawl’ in the robots.txt.
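
        For reference, the robots.txt side of such a setup is a couple of lines; any crawler that enters the disallowed path has self-selected for the trap (the path name is illustrative):

          User-agent: *
          Disallow: /maze/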
  • Be a nice netizen... and make sure your robots.txt blocks access to this recursive nightmare...

  • by vadim_t ( 324782 ) on Thursday January 23, 2025 @06:12PM (#65113519) Homepage

    Eesh.

    This is an ancient idea. Probably discussed at length right here in the 2000s. No crawler worth its salt is going to even notice this.

    Like oh, this:

    https://it.slashdot.org/story/... [slashdot.org]

    or this:

    https://developers.slashdot.or... [slashdot.org]

    • Isn't it *interesting* when disruptors blow away an entire industry and then complain that they need to relearn lessons from decades ago.

      While you are correct... it's novel that it still works like this. Or works again like this, as the case may be.

      • by vadim_t ( 324782 )

        No, this countermeasure isn't going to do anything useful. Any web spider is going to run into thousands of these, and therefore already is going to be coded to tolerate it fine.

        These countermeasures were made to trip up worms and naively coded Perl scripts used by mom-and-pop spamming operations. That is useful to an extent, but isn't going to do anything at all against the bigger players.

        All you have to do is keep track of stats like depth and performance, notice that a branch is doing badly, and prune it.
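
        A sketch of that kind of crawler-side bookkeeping, with hypothetical fetch() and extract_links() hooks and made-up limits:

          from collections import deque
          from urllib.parse import urlparse

          MAX_DEPTH = 4             # how deep to follow links from the seed
          MAX_PAGES_PER_HOST = 500  # per-host page budget

          def crawl(seed, fetch, extract_links):
              pages = {}
              seen = {seed}
              queue = deque([(seed, 0)])
              while queue:
                  url, depth = queue.popleft()
                  host = urlparse(url).netloc
                  if depth > MAX_DEPTH or pages.get(host, 0) >= MAX_PAGES_PER_HOST:
                      continue  # prune the badly performing branch
                  pages[host] = pages.get(host, 0) + 1
                  for link in extract_links(fetch(url)):
                      if link not in seen:
                          seen.add(link)
                          queue.append((link, depth + 1))

        An infinite self-referential maze exhausts its host budget quickly and simply stops being crawled.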

    • I was going to say the same - I seem to remember people doing the same sort of thing back in the day, but with a view to gaining 'SEO' for something or other. That is, a billion computer generated pages all talking about $subject and linking to $site would somehow make Altavista rank them higher. The 'proper' crawlers took this as a bug report and fixed their bugs - to this day, Google only follows links on the same site to a specific 'depth' (4, IIRC).

      However, this is a great idea, and will cause some problems.

  • This has been done before, though in the context of trapping spammers who scrape web sites for email addresses: You generate endless pages with random mailto: URLs. And I'm pretty sure any self-respecting bot will have countermeasures to detect and avoid these sorts of traps.

    • by dskoll ( 99328 )

      Replying to self... an obvious counter-measure is to occasionally request a random file and if you don't get a 404, then you're probably in a spider trap. (Just change the last bit of the URL path to a 32-character random string or something.)

      • by suutar ( 1860506 )

        Yeah, but the easy counter to that is 404ing links that you didn't generate and hand out. That involves remembering the links you handed out, but it's not too onerous.
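
        A minimal sketch of that bookkeeping - an in-memory set for illustration; a real deployment would need persistence across requests:

          import random
          import string

          issued = set()

          def new_link():
              # Generate a maze link and remember that we handed it out.
              token = "".join(random.choices(string.ascii_lowercase, k=12))
              issued.add(token)
              return "/maze/" + token

          def status_for(path):
              # 200 only for links we generated; a crawler probing a random
              # URL gets a genuine-looking 404.
              return 200 if path.removeprefix("/maze/") in issued else 404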

    • Yeah, back then it brought the nightmare of backscatter :)

  • by taustin ( 171655 ) on Thursday January 23, 2025 @06:17PM (#65113549) Homepage Journal

    Then each randomly generated URL should deliver randomly generated garbage for the specific purpose of poisoning the training.

    If someone were to write a module for Apache that would do this automatically, after identifying the source IP of an AI training crawler, I'd install it in a heartbeat.
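
    The garbage itself is cheap to produce. A minimal Markov-chain sketch of the idea (not an Apache module, and the seed corpus and chain order are arbitrary choices):

      import random

      def build_chain(text, order=2):
          # Map each word pair to the words observed to follow it.
          words = text.split()
          chain = {}
          for i in range(len(words) - order):
              chain.setdefault(tuple(words[i:i + order]), []).append(words[i + order])
          return chain

      def babble(chain, order=2, length=120):
          # Emit text that is locally plausible but globally meaningless.
          out = list(random.choice(list(chain)))
          for _ in range(length):
              options = chain.get(tuple(out[-order:]))
              if not options:  # dead end: jump to a fresh starting key
                  out.extend(random.choice(list(chain)))
                  continue
              out.append(random.choice(options))
          return " ".join(out)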

  • by rsilvergun ( 571051 ) on Thursday January 23, 2025 @06:28PM (#65113599)
    But it's written by hobbyists while the AI is written by professionals working around the clock. It will be defeated because they just have more resources than you do. It's like trying to fight a nation state with guerilla warfare. It doesn't really work. Not unless a third party steps in to fund the guerrillas like a color revolution. And somehow I don't think that's happening here. All the major AI companies are pretty well aligned and while they're competing to see who's top dog they're not going to actively attack each other or help people who do.
    • by Kernel Kurtz ( 182424 ) on Thursday January 23, 2025 @07:53PM (#65113841)

      It's like trying to fight a nation state with guerilla warfare. It doesn't really work.

      The Taliban and the Viet Cong say hi.

    • by tlhIngan ( 30335 ) <slashdot@@@worf...net> on Thursday January 23, 2025 @11:38PM (#65114145)

      But it's written by hobbyists while the AI is written by professionals working around the clock. It will be defeated because they just have more resources than you do. It's like trying to fight a nation state with guerilla warfare. It doesn't really work. Not unless a third party steps in to fund the guerrillas like a color revolution. And somehow I don't think that's happening here. All the major AI companies are pretty well aligned and while they're competing to see who's top dog they're not going to actively attack each other or help people who do.

      Not really. This tarpit targets a key weakness of LLMs - if you train them on AI generated information, you get devolution of the model. The more AI generated content the model is trained on, the worse the end result.

      And many AI crawlers are not obeying robots.txt and hammering websites with way more traffic than they normally get. This is because in the big AI rush, they're just trying to get data.

      So I see no reason why you can't sprinkle a ton of ChatGPT produced content (among other things) into hidden web pages. You can exclude them with robots.txt, so proper crawlers (which include most search engines) will never see them, but AI crawlers likely will in their quest for dominance.

      Meanwhile if we're deploying a great chunk of AI generated crap for AI crawlers to read, you're also destroying the models, which is a bonus.

  • I would suspect, with the chumminess between the tech bros and Trump, you may not want to brag too much about wasting AI web crawlers' resources. If one of the broligarchy gets upset with you, you may just get disappeared. I expect we'll hear of some new regulations relatively quickly now that this story is out there.

  • by dowhileor ( 7796472 ) on Thursday January 23, 2025 @07:23PM (#65113747)

    I read some interesting articles published in the '60s and '50s (shocking!) that predicted the ONLY way for humans to control thinking software would be to introduce binary-level viruses just to slow them down enough for us humans to interact with them. And I thought..... haha, we would have to be haha, pretty STUPID, hahaha, to let that happen.....

  • to trap the crawler in an argument that results in the AI itself changing its mind and becoming a goatherder?
  • by Wokan ( 14062 ) on Thursday January 23, 2025 @08:31PM (#65113905) Journal

    Made a "directory" of email addresses that were all garbage with a next page link that would load infinitely more garbage. It was about poisoning the well so they couldn't sell the lists to spam mailers for much.

  • Maybe have it also dump some randomly generated text as well? Or even random images?

    That will hopefully screw up some of the AI text/image generation capabilities.

    • Instead of random text, misuse an AI system to create a fake (and wrong) article about something. Keep adding to the library of fake articles: many layers of what looks like valuable training information that is actually misinforming the AI being trained by the robot crawler. This makes it very difficult to automate a process to reject these 'nasty' sites. Get a nerd social network (Slashdot/Imgur/even FB) to have us nerds vying to create more and more outrageous articles, sharing them and adding them to the library.
      • Don't ML systems feeding on other ML systems' outputs degrade over time?

        Maybe put a bunch of ML-generated junk over there on "relevant topics" and let it all feed on each other, lol.

        Or if your site has the resources, have it generate the junk live as the bots crawl.

        • by allo ( 1728082 )

          ML systems degrade on low-quality content. That includes generative content (like here), uncurated AI output that includes bad results, spam, Slashdot comments, conspiracy theorists, etc.
          What one does against this is to pre-filter the dataset after crawling. For example, images are assigned an aesthetics score. If your image scores too low, it is not included in the training data (even when it was downloaded). A good filter will recognize everything in the list above; a bad one may not.
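
          In code, that filtering step is essentially the following, with score() standing in for a real aesthetics/quality model and an arbitrary threshold:

            def filter_dataset(samples, score, threshold=0.5):
                # Keep only crawled samples that clear the quality bar.
                return [s for s in samples if score(s) >= threshold]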

  • As in "Wasting Resources" - particularly energy-hungry resources.
    On the face of it this seems clever, but ultimately it can't be seen as a viable solution.

  • On a Hacker News thread, someone claiming to be an AI company CEO said a tarpit like this is easy to avoid; Aaron B told 404 Media: "If that's true, I've several million lines of access log that says even Google Almighty didn't graduate" to avoiding the trap.

    That doesn't mean it's not easy to avoid, it just means you haven't caused them enough pain to bother. The very second you do cause them enough pain, they'll implement checks to detect the trap and bail out. Limit the recursive link depth, or the number of pages fetched from a single site, and the trap stops working.

"It's what you learn after you know it all that counts." -- John Wooden

Working...