
Developer Creates Infinite Maze That Traps AI Training Bots
An anonymous reader quotes a report from 404 Media: A pseudonymous coder has created and released an open source "tar pit" to indefinitely trap AI training web crawlers in an infinite series of randomly generated pages, wasting their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants that trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped, or can be deployed "offensively" as a honeypot trap to waste AI companies' resources.
"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop." You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
"It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself -- the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself," Aaron B, the creator of Nepenthes, told 404 Media. "Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time," they added. "But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop." You can try Nepenthes via this link (it loads slowly and links endlessly on purpose).
Great! (Score:5, Funny)
Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!
Re: (Score:1)
Re: (Score:3)
"AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."
Ignoring a robots.txt file is not "violating" anything nor is anything stolen.
Re: (Score:1)
Viewing what a server broadcasts (even via crawler) is definitely not stealing-aka-infringement. ToS noise at most.
AI training in a vacuum probably isn't infringement; there's no redistribution, and no one is claiming the results as their own.
Sellers of the "remix" might be. Courts are agonizing over this point.
Re: (Score:2)
It's unauthorised access.
You have stated via the robots.txt file that certain things (i.e. bots) are NOT authorised to access any other content on the site.
Re:Great! (Score:4, Interesting)
The same argument can be used to make ad blockers illegal. If a website tells you not to use an ad blocker and you use one anyways, is it unauthorized access?
Thankfully, in most jurisdictions, ad blocking is not illegal.
Re:Great! (Score:5, Insightful)
No, because an ad is just extra content you wanted me to fetch. I.e., you stick an ad on your web page, and you hope I will fetch it. I have the option to not fetch it if I so desire.
robots.txt is more like a keep-out sign. The web site is telling you NOT to fetch certain elements - perhaps those elements are not worth fetching, or are expensive to serve. In this case it's more like unauthorized access, because I told you not to go there, but you did anyways.
It's like trespass - if a property owner says "no trespassing" you are not allowed on that property (unless otherwise authorized). That's robots.txt.
Ad blocking is merely deciding to not look at something on the property. If the property owner puts up a billboard, ad-blocking is basically choosing not to look at the billboard as you walk by on the public street. Whether you deliberately chose not to look at it or didn't look at it because you weren't looking at the property is irrelevant. In this case no trespass will ever exist because you never stepped on their property.
Ad blocking can never be illegal - it's simply choosing not to retrieve a piece of content. Now, some sites might retaliate and say if you don't fetch this content, you can't get other content, but that's a different issue.
robots.txt blocking is telling others to keep out - like putting up a "no soliciting" sign to keep out salespeople.
Re: (Score:1)
It’s a good opinion, but is not in fact the law.
Re: (Score:3)
Maybe that's the law where you are, or in your head. This has gone to court multiple times, and they've always found in favor of the end user. It's no more illegal than getting up during an ad-supported network show to take a piss.
GTFOH with your 10 years of watching Law & Order expertise.
Re: Great! (Score:2)
But maybe they decided not to fetch it.
Re: (Score:2)
It's like trespass - if a property owner says "no trespassing" you are not allowed on that property (unless otherwise authorized). That's robots.txt.
That's a pretty silly analogy. You can't "trespass" on a website. It needs to be served to you. It's as much trespass as having a sign on a door saying "Slashdot user tlhlngan is not welcome." while the owner of the door holds the door open for you and ushers you inside.
Enforcing robots.txt restrictions (Was Re:Great!) (Score:2)
Shouldn't be too hard to update this concept for today's technology.
Re: (Score:3, Insightful)
AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole.
That is a nice fantasy you have there. Not enforceable.
Re: Great! (Score:2)
The main thing is sites are already doing this for spam, so we may as well get on board. I saw it when I downloaded one of the uncensored horror and adult models from Hugging Face: the gay content it generated was loaded with in-story references to actual gay sites. Somehow spammers embedded them in the adult stories it trained on, knowing bots would train on the text and pass the spam along.
Re: (Score:2)
"If a web-developer wants his site not crawled [by crawlers that honor the robots.txt file], all you need is a robots.txt file. AI companies that violate the robots.txt should pay massive fines to those website holders far in excess of the value of the data they stole."

But they don't, and there's zero chance they will any time soon.

"This is how [some] search engines already work. This works well"

Except when it doesn't.
It would be less egregious if the assholes running the crawlers would limit the number of simultaneous connections to something reasonable to avoid running the CPU at 100% until it catches on fire. But that's not going to happen, either.
Re: (Score:1)
If a web-developer wants his site not crawled, all you need is a robots.txt file.
How cute. You think that the sleazy scumbags who run tech companies give a shit about your robots.txt file?
Re: Great! (Score:2)
Re: Great! (Score:1)
So let's all make webpages and make the web bigger.
Go humanity.
Re: Great! (Score:2)
Archive.org respects my robots.txt file, so no big deal. Search engine crawlers and AI that don't honor web standards can choke on a firehose of generated crap.
Re:Great! (Score:5, Interesting)
I have zero problem with giving things like robots.txt some teeth, and will be checking Nepenthes out with a view to adding it to the honeypot/tarpit system we already routinely deploy for our clients to blackhole hostile/malicious traffic and provide a softer target that can give a heads-up before the actual systems get hit. When you get right down to it, it's putting the house rules on the door where you can see them on entry; respect them and you'll be fine, but if you choose to ignore them then you shouldn't be surprised when you get bitten.
Re: (Score:2)
Re: (Score:3)
Re: (Score:1)
Two things:
- with the level of anti-AI/... hate there is, I would be surprised if this thing hadn't already been deployed by lots of people without any robots.txt blocking
- I wouldn't be surprised if crawlers already have mitigations against this type of thing; it can't be that new, and if this gets any traction, they will detect it easily enough
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
If a web-developer wants his site not crawled, all you need is a robots.txt file.
I bet you believe there is a tooth fairy too.
Re:Great! (Score:5, Funny)
Now if only we could have a maze and a tar pit that would trap the CEOs of AI companies indefinitely!
Put them in a round room and tell them that there's a government subsidy in the corner.
Re: Great! (Score:2)
Re: (Score:2)
Isn't this what the "Cryptocurrency working group" and "Project Stargate" are for?
Who (Score:3, Funny)
Perhaps I need to have some young coder teach me how the internet ecology works these days.
Re: (Score:3)
Re: (Score:2, Insightful)
Re: (Score:2)
an AI Adventure... (Score:5, Funny)
lock them into a maze of twisty little passages, all alike
Re: (Score:2)
Make sure it's dark. Grues gotta eat.
Re: (Score:2)
"You've found the Wumpus"
Nice! (Score:2)
I was not aware crawlers are _this_ primitive. Well, you learn something new every day if you care to.
Re:Nice! (Score:5, Funny)
These are the AI-generated crawlers. They only have the python stack exchange examples, not the hand-crafted perl bots of yesteryear.
news that matters? (Score:2, Interesting)
We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.
Re: (Score:2)
This is literally malware.
So are most crawlers.
Re: (Score:3)
Re: news that matters? (Score:2)
Re: (Score:2)
Most of us have figured out (Score:5, Interesting)
I don't think the Amish had the right idea as far as technology goes. But the basic concept of saying no to something that is inherently destructive isn't a bad idea. It's just a matter of degrees at that point.
Or as one person put it, I want AI to do my laundry and dishes so I can paint pictures and write, but instead I got AI that paints pictures so that I can do laundry and dishes...
Re: (Score:3)
You know what though
There will be drug discoveries
Done by systems that aren't LLMs.
And LLMs will be attributed these discoveries, because their main feature is that people who don't want to be able to tell the difference can't, and want it to be someone else's problem.
Re: news that matters? (Score:3)
They own you.
Re: (Score:2)
If the crawlers were less aggressive then people wouldn't do this.
Re:news that matters? (Score:5, Interesting)
We're now celebrating a hack designed to accomplish nothing other than to inflict damage? This is literally malware.
It is neither literally - nor figuratively - malware.
It is the equivalent of a logon screen. AI is unable to productively go where it is unwanted. Authorized users (humans) get past. Good authentication directories are resistant to brute force attack by doing things like temporary lockouts, or increasing the time before the "logon failed" screen shows up, tying up the attacker's resources.
At the human level, it is at worst the equivalent of engaging a telemarketer in conversation knowing full well you will never buy their product. You are deliberately denying them access to your funds while simultaneously wasting their time so they cannot hassle others.
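That throttling trick is easy to sketch. A toy Python version (verify() is a placeholder callback, not any real auth library's API) doubles the delay after each failed attempt:

# Toy sketch of login throttling: every failed attempt doubles the delay,
# so a brute-force client spends most of its time waiting.
import time

def check_with_backoff(verify, attempts, max_delay=300.0):
    delay = 1.0
    for username, password in attempts:
        if verify(username, password):
            return username
        time.sleep(delay)  # the attacker's connection is tied up here
        delay = min(delay * 2, max_delay)
    return None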
Re: (Score:2)
It's like eating until you explode, which apparently AI crawlers are programmed to do.
Re: (Score:2)
Meh. As long as they put the "tar pit" into the robots.txt.
Ignore robots.txt, get tarpitted. A nice little incentive.
Re: (Score:2)
Be a nice netizen.. (Score:2)
Be a nice netizen..and make sure your robots.txt blocks access to this recursive nightmare...
Year 2000 called. They want their teergrube back (Score:5, Informative)
Eesh.
This is an ancient idea. Probably discussed at length right here in the 2000s. No crawler worth its salt is going to even notice this.
Like oh, this:
https://it.slashdot.org/story/... [slashdot.org]
or this:
https://developers.slashdot.or... [slashdot.org]
Re: (Score:2)
Isn't it *interesting* when disruptors blow away an entire industry and then complain that they need to relearn lessons from decades ago.
While you are correct... it's novel that it still works like this. Or works again like this, as the case may be.
Re: (Score:2)
No, this countermeasure isn't going to do anything useful. Any web spider is going to run into thousands of these, and therefore already is going to be coded to tolerate it fine.
These countermeasures were made to trip up worms and naively coded Perl scripts used by mom and pop spamming operations. That is useful to an extent, but isn't going to do absolutely anything against the bigger players.
All you have to do is keep track of stats like depth and performance, notice that a branch is doing badly, and prune it.
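Roughly like this, in Python (fetch_links is a stand-in for the crawler's fetch-and-parse step, and the depth cap is an illustrative value, not anyone's production setting):

# Sketch of the mitigation above: cap how deep the crawler follows
# same-site links, so an endless self-referential branch gets pruned.
from collections import deque
from urllib.parse import urlparse

MAX_SAME_SITE_DEPTH = 4

def crawl(seed, fetch_links):
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        site = urlparse(url).netloc
        for link in fetch_links(url):  # hypothetical fetch-and-parse callback
            if link in seen:
                continue
            if urlparse(link).netloc == site:
                if depth + 1 > MAX_SAME_SITE_DEPTH:
                    continue           # prune the branch: too deep on one site
                queue.append((link, depth + 1))
            else:
                queue.append((link, 0))  # depth counter resets across sites
            seen.add(link)
    return seen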
Re: (Score:3)
I was going to say the same - I seem to remember people doing the same sort of thing back in the day, but with a view to gaining 'SEO' for something or other. That is, a billion computer generated pages all talking about $subject and linking to $site would somehow make Altavista rank them higher. The 'proper' crawlers took this as a bug report and fixed their bugs - to this day, Google only follows links on the same site to a specific 'depth' (4, IIRC).
However, this is a great idea, and will cause some problems.
Been done (Score:2)
This has been done before, though in the context of trapping spammers who scrape web sites for email addresses: You generate endless pages with random mailto: URLs. And I'm pretty sure any self-respecting bot will have countermeasures to detect and avoid these sorts of traps.
Re: (Score:3)
Replying to self... an obvious counter-measure is to occasionally request a random file and if you don't get a 404, then you're probably in a spider trap. (Just change the last bit of the URL path to a 32-character random string or something.)
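Roughly, in Python (illustrative only):

# Sketch of the probe described above: request a URL that shouldn't exist;
# anything but a 404 suggests the site is a spider trap.
import random
import string
import urllib.error
import urllib.parse
import urllib.request

def looks_like_spider_trap(page_url: str) -> bool:
    slug = "".join(random.choices(string.ascii_lowercase + string.digits, k=32))
    probe = urllib.parse.urljoin(page_url, slug)  # swap the last path segment
    try:
        urllib.request.urlopen(probe, timeout=10).close()
        return True                # a random URL "exists": suspicious
    except urllib.error.HTTPError as err:
        return err.code != 404     # a genuine 404 looks like a normal site
    except OSError:
        return False               # network trouble: inconclusive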
Re: (Score:2)
yeah, but the easy counter to that is 404ing links that you didn't generate and hand out. Which involves remembering the links you handed out but that's not too onerous.
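Sketch of that counter (illustrative Python, not Nepenthes' actual behavior):

# The tarpit remembers every slug it has issued and returns a real 404
# for anything else, defeating the random-URL probe above.
import random
import string

issued_slugs: set[str] = set()

def new_trap_link() -> str:
    slug = "".join(random.choices(string.ascii_lowercase, k=12))
    issued_slugs.add(slug)
    return f"/{slug}"

def status_for(path: str) -> int:
    return 200 if path.lstrip("/") in issued_slugs else 404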
Re: (Score:3)
Re: (Score:2)
Yeah, back then it brought the nightmare of backscatter :)
If it's for AI training bots (Score:3)
Then each randomly generated URL should deliver randomly generated garbage for the specific purpose of poisoning the training.
If someone were to write a module for Apache that would do this automatically, after identifying the source IP of an AI training crawler, I'd install it in a heartbeat.
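An Apache module is beyond a comment box, but the garbage generator itself is a few lines. A rough Python sketch (purely illustrative; the seed corpus is whatever real text you have lying around):

# A tiny Markov babbler that turns any seed text into plausible-looking
# nonsense for each generated trap page.
import random

def build_chain(corpus: str) -> dict[str, list[str]]:
    words = corpus.split()
    chain: dict[str, list[str]] = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def babble(chain: dict[str, list[str]], length: int = 200) -> str:
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        word = random.choice(chain.get(word) or list(chain))
        out.append(word)
    return " ".join(out)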
Re: (Score:2)
if you go to the link, you'll see that each page is unique.
Re: (Score:3)
Rewrites, and/or a module
https://www.usenix.org.uk/cont... [usenix.org.uk]
This stuff is kind of cute (Score:4, Insightful)
Re:This stuff is kind of cute (Score:5, Insightful)
It's like trying to fight a nation state with guerilla warfare. It doesn't really work.
The Taliban and the Viet Cong say hi.
Re: (Score:2)
The Viet Cong, as an effective fighting force, disappeared after Tet '68.
Re:This stuff is kind of cute (Score:4, Insightful)
Not really. This tarpit targets a key weakness of LLMs - if you train them on AI generated information, you get devolution of the model. The more AI generated content the model is trained on, the worse the end result.
And many AI crawlers are not obeying robots.txt and hammering websites with way more traffic than they normally get. This is because in the big AI rush, they're just trying to get data.
So I see no reason why you can't sprinkle a ton of ChatGPT produced content (among other things) into hidden web pages. You can exclude them with robots.txt, so proper crawlers (which include most search engines) will never see them, but AI crawlers likely will in their quest for dominance.
Meanwhile, if we're deploying a great chunk of AI-generated crap for AI crawlers to read, we're also destroying the models, which is a bonus.
Bravo, now duck and cover. (Score:2)
I would suspect, with the chumminess between the tech bros and Trump, that you may not want to brag too much about wasting AI web crawlers' resources. If one of the broligarchy gets upset with you, you may just get disappeared. I expect we'll hear of some new regulations relatively quickly now that this story is out there.
The circle is closing. (Score:3)
I read some interesting articles published in the '50s and '60s (shocking!) that predicted the ONLY way for humans to control thinking software would be to introduce binary-level viruses just to slow them down enough for us humans to interact with them. And I thought..... haha, we would have to be haha, pretty STUPID, hahaha, to let that happen.....
Re: (Score:2)
Right. What you do is tell crawlers in your robots.txt to not go into the tarpit's endpoint.
That way only the LLMs that ignore your robots.txt will fall into the rabbit hole of endless crud.
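For example, a robots.txt along these lines (the /trap/ path is just a placeholder for wherever the tarpit is mounted, not a Nepenthes requirement):

# Hypothetical robots.txt: well-behaved crawlers skip the trap;
# those that ignore the file wander into /trap/ and get stuck.
User-agent: *
Disallow: /trap/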
wouldn't it be better.. (Score:1)
Did this to email scrapers (Score:3, Insightful)
Made a "directory" of email addresses that were all garbage with a next page link that would load infinitely more garbage. It was about poisoning the well so they couldn't sell the lists to spam mailers for much.
Re: (Score:3)
me too! and they were all hotmail addresses too.
took a census list of names and randomly combined them.
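Something like this, roughly (the name lists here are placeholders, not the census data):

# Sketch of the trick described above: recombine names from a public list
# into endless fake mailto: links, poisoning any scraped address list.
import random

FIRST = ["james", "mary", "robert", "patricia", "john", "jennifer"]
LAST = ["smith", "johnson", "williams", "brown", "jones", "garcia"]

def fake_address() -> str:
    return f"{random.choice(FIRST)}.{random.choice(LAST)}@hotmail.com"

def directory_page(count: int = 50) -> str:
    items = "".join(
        f'<li><a href="mailto:{addr}">{addr}</a></li>'
        for addr in (fake_address() for _ in range(count))
    )
    return f'<ul>{items}</ul><a href="?page=next">next page</a>'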
Re: Did this to email scrapers (Score:2)
A suggestion (Score:2)
Maybe have it also dump some random generated text as well? Or even random images?
That will hopefully screw some of the AI text / image generation capabilities.
Re: (Score:2)
Re: (Score:2)
Don't ML systems feeding on other ML systems' outputs degrade over time?
Maybe put a bunch of ML-generated junk over there on "relevant topics" and let it all feed on each other, lol.
Or if your site has the resources, have it generate the junk live as the bots crawl.
Re: (Score:2)
ML systems degrade on low quality content. That includes: generative content (like here), uncurated AI output that includes bad results, spam, slashdot comments, conspiracy theorists, etc.
What one does against this is pre-filter the dataset after crawling. For example, images are assigned an aesthetics score; if your image scores too low, it is not included in the training data (even though it was downloaded). A good filter will recognize everything on the list above; a bad one may not.
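In Python terms the filtering step amounts to something like this (score() stands in for whatever aesthetics/quality model is actually used; the threshold is an arbitrary example):

# Sketch of the pre-filtering step: score each crawled item and keep only
# those above a quality threshold before they enter the training set.
def filter_dataset(items, score, threshold=0.5):
    return [item for item in items if score(item) >= threshold]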
Anyone think about the consequences? (Score:2)
As in "Wasting Resources" - particularly energy hungry "Resources" .
On the face of it this seems clever, but ultimately it cant be seen as a viable solution.
Pointless (Score:2)
That doesn't mean it's not easy to avoid, it just means you haven't caused them enough pain to bother. The very second you do cause them enough pain they'll implement checks to detect the trap and bail out. Limit the recursive link depth, or the number of pages fetched from any one site.