Apache Software

Stopping SpamBots With Apache

primetyme writes: "Sick of email-harvesting spam robots cruising your Apache-based site? Here's an in-depth article that shows one way you can configure a base Apache installation to keep those nasty bots off your site - and the spam out of your Inbox." Anything that helps annoy spammers is a good thing.
  • One Way... (Score:3, Funny)

    by ekrout ( 139379 ) on Friday October 19, 2001 @01:03AM (#2450452) Journal
    is to give your site a terrible color scheme, like purple & brown. ;-)
  • A better way! (Score:1, Interesting)

    by Anonymous Coward
    I'd say add at least the following email addresses to your webpage and strike back that way (somehow):

    president@whitehouse.gov, abuse@127.0.0.1, some MAP address.
    • Why bash the president? Let's fvck bin Laden with:

      worthlessPOS@taliban.gov, ROOTofallevil@taliban.gov, some MAP address
    • Another important address to add: uce@ftc.gov

      That's the complaint address at the Federal Trade Commission for spam; granted, intelligent Email harvesters would check for and discard that address (sending spam to it would be tantamount to turning yourself in), but not all spammers meet the intelligence test ;)
    • "abuse@127.0.0.1"

      This is incorrect. You want to use abuse@[127.0.0.1] as the address.

  • is to not install Apache at all. Instead, throw up a year- or two-year-old copy of Microsoft IIS and watch the virii propagate. You won't have enough bandwidth or enough minutes of uptime to be able to serve pages with email addresses on them ;-)
  • First it was the hack to reboot systems asking for your default.ida file. Now it is code to trap and kill spiders...

    What is an Apache admin to do? It is so configurable that there doesn't appear to be anything it can't do. What's next - using Apache to brew my morning coffee (well, there is the coffee pot cam - anyone know what webserver it ran on?), writing my website for me, solving world hunger???

    WHY WHY WHY do people run IIS anyway? I would love to see what it would take to do this with IIS - any takers?
  • It won't work long (Score:3, Insightful)

    by anothernobody ( 204957 ) on Friday October 19, 2001 @01:09AM (#2450465) Homepage
    Checking the user agent won't work for long - how hard will it be for the spammers to change the user agent to "Mozilla..."

    Using some client side Javascript would be harder for them to deal with (although if your browser can view it they will be able to also).

    I guess graphics would be next...
    • Checking the user agent won't work for long - how hard will it be for the spammers to change the user agent to "Mozilla..."

      Certainly not as hard as convincing every web site on the net that displays email addresses to dick around with this..

  • Also useful for... (Score:5, Informative)

    by dpete4552 ( 310481 ) <slashdot@tuxcont[ ].com ['act' in gap]> on Friday October 19, 2001 @01:10AM (#2450467) Homepage
    I have been using this method for a long time - I don't know how new that article is, but I used it long ago to block not only all the spambots I could find, but also all of the software for mirroring my webpage.

    Here is a longer list of common spam bots and mirror bots that I have been able to find:

    SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
    SetEnvIfNoCase User-Agent "EmailWolf" bad_bot
    SetEnvIfNoCase User-Agent "CherryPickerSE" bad_bot
    SetEnvIfNoCase User-Agent "CherryPickerElite" bad_bot
    SetEnvIfNoCase User-Agent "Crescent" bad_bot
    SetEnvIfNoCase User-Agent "EmailCollector" bad_bot
    SetEnvIfNoCase User-Agent "MCspider" bad_bot
    SetEnvIfNoCase User-Agent "bew" bad_bot
    SetEnvIfNoCase User-Agent "Deweb" bad_bot
    SetEnvIfNoCase User-Agent "FEZhead" bad_bot
    SetEnvIfNoCase User-Agent "Fetcher" bad_bot
    SetEnvIfNoCase User-Agent "Getleft" bad_bot
    SetEnvIfNoCase User-Agent "GetURL" bad_bot
    SetEnvIfNoCase User-Agent "HTTrack" bad_bot
    SetEnvIfNoCase User-Agent "IBM_Planetwide" bad_bot
    SetEnvIfNoCase User-Agent "KWebGet" bad_bot
    SetEnvIfNoCase User-Agent "Monster" bad_bot
    SetEnvIfNoCase User-Agent "Mirror" bad_bot
    SetEnvIfNoCase User-Agent "NetCarta" bad_bot
    SetEnvIfNoCase User-Agent "OpaL" bad_bot
    SetEnvIfNoCase User-Agent "PackRat" bad_bot
    SetEnvIfNoCase User-Agent "pavuk" bad_bot
    SetEnvIfNoCase User-Agent "PushSite" bad_bot
    SetEnvIfNoCase User-Agent "Rsync" bad_bot
    SetEnvIfNoCase User-Agent "Shai" bad_bot
    SetEnvIfNoCase User-Agent "Spegla" bad_bot
    SetEnvIfNoCase User-Agent "SpiderBot" bad_bot
    SetEnvIfNoCase User-Agent "SuperBot" bad_bot
    SetEnvIfNoCase User-Agent "tarspider" bad_bot
    SetEnvIfNoCase User-Agent "Templeton" bad_bot
    SetEnvIfNoCase User-Agent "WebCopy" bad_bot
    SetEnvIfNoCase User-Agent "WebFetcher" bad_bot
    SetEnvIfNoCase User-Agent "WebMiner" bad_bot
    SetEnvIfNoCase User-Agent "webvac" bad_bot
    SetEnvIfNoCase User-Agent "webwalk" bad_bot
    SetEnvIfNoCase User-Agent "w3mir" bad_bot
    SetEnvIfNoCase User-Agent "XGET" bad_bot
    SetEnvIfNoCase User-Agent "Wget" bad_bot
    SetEnvIfNoCase User-Agent "WebReaper" bad_bot
    SetEnvIfNoCase User-Agent "WUMPUS" bad_bot
    SetEnvIfNoCase User-Agent "FAST-WebCrawler" bad_bot
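
    These SetEnvIfNoCase lines only tag matching requests with the bad_bot variable; you still need a directive that acts on the tag. A minimal way to do that (assuming mod_setenvif and mod_access are loaded, and a DocumentRoot of /var/www/html - adjust the path to your own setup) is something like:

    # Refuse anything tagged bad_bot above
    <Directory "/var/www/html">
        Order Allow,Deny
        Allow from all
        Deny from env=bad_bot
    </Directory>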
  • by CmdrTroll ( 412504 ) on Friday October 19, 2001 @01:16AM (#2450472) Homepage
    The premise behind this article is patently ridiculous. Spambots are voluntarily identifying themselves, and any spambot author with an ounce of common sense will simply change their user-agent string to the standard "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)" string that every Windows client uses. A well-designed spambot is indistinguishable from a valid user, or Google, or ht://Dig.

    On the other hand, there are ways to fight spambots; they just don't rely on trusting the user. Here's one way:

    • Buy a domain.
    • Set up a CGI that generates a unique email address @ that domain for every visitor (a rough sketch follows this list). Log the address used, the date/time of visit, the visitor's IP, and other characteristics (user-agent?) of the visitor.
    • Use the logged data to block the user when spam mail gets sent to one of the random accounts.
    • Use the logged data as evidence to present to the offender's ISP, to get their fast connection pulled.
    • Find a way to automate this on a large scale, then get a bunch of sysadmins together to sue and prosecute the spammer for abuse of resources.
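
    A rough sketch of the second step in PHP - everything here (the spamtrap.example.com domain, the log path, the address format) is made up for illustration:

    <?php
    // Hand every visitor a unique address at a domain you own, and log who got it.
    $token   = uniqid('v', true);                        // unique per visit
    $address = $token . '@spamtrap.example.com';         // hypothetical trap domain
    $agent   = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-';
    $line    = date('c') . ' ' . $_SERVER['REMOTE_ADDR'] . ' ' . $agent . ' ' . $address . "\n";
    file_put_contents('/var/log/trap.log', $line, FILE_APPEND);   // assumed log location
    echo 'Contact us: <a href="mailto:' . $address . '">' . $address . '</a>';
    ?>

    When spam later shows up for one of those one-off addresses, the log line tells you exactly which visit (IP, time, user-agent) harvested it.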

    There are good ways to deal with spammers, but this isn't one of them. It *might* work on a small scale, and it definitely won't work on a medium or large scale. It's about as useful as the Sendmail "MX/domain validation" trick that Eric Raymond and the rest of the Sendmail team thought would stop spammers dead in their tracks. (It didn't.) Instead he was "surprised by spam."

    -CT

    • That's pretty much what I do in the Hook, line, and sinker section of the article. By capturing the user-agents and IPs of the spiders that *blatantly* disregard the robots.txt file, it's like shootin' fish in a barrel (a minimal version of that setup is sketched below).

      In the next installment of this article, I'm working on a script that grabs the netblock of a bot that goes against the robots.txt file, does an ARIN lookup on that block, and emails the administrator of that block about the problem. Comments have been made that any bot can switch its user-agent string, which is true. If a spider does that, though, it's more than likely also going to run through the parts of a site that you *specifically* tell it, in the robots.txt file, it can't go into. When it does that, it's a lot easier to block its user-agent, email the admin of its netblock, or block its class C IP block altogether.

      It's like a honeypot for black-hats if you think about it. And that's one of the *best* ways to find the problem spiders and block them out, without blocking any good-natured bot :)
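
      For reference, the basic hook can be wired up with nothing more than robots.txt plus Apache's conditional logging - the /trap/ path and the log name below are examples, not what lists.evolt.org actually uses, and the "combined" LogFormat is assumed to be defined as in the stock config:

      # robots.txt - well-behaved spiders are told to stay out of the trap
      User-agent: *
      Disallow: /trap/

      # httpd.conf - anything that goes in anyway gets its own log file
      SetEnvIf Request_URI "^/trap/" robots_violator
      CustomLog logs/robots_violators_log combined env=robots_violator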

    • by Anonymous Coward
      A variation on that idea would be to add a link labeled something like "If you are a spam bot, click HERE". Anyone who follows it gets a warning page explaining that whoever clicks the next link will be blocked as a spam program, and that next link triggers the spambot-blocking script. Of course, use search-engine-detecting tech to hide that warning page from the legitimate search engines.

      (posting anonymously so as to not tell spambots what tech I'm using on my site)

  • by Anonymous Coward on Friday October 19, 2001 @01:17AM (#2450477)

    "Here are a couple of the User-Agents that fell for our trap that I pulled out of last months access_log for lists.evolt.org:

    Wget/1.6"

    Email spider, my ass! Wget is a damn useful HTTP downloader utility which is great for obtaining large files as it can resume interrupted transfers. It can also mirror web sites, which I assume is why it fell into the honeypot. Oh, and you can also change what it says it is on the command line.

    And to add my 2 cents to the email problems, one other solution I've seen is to translate email addresses into an image and drop that onto the page. It's not a fantastic solution for those still using Lynx, and you can no longer just click to send mail to somebody, but at least it doesn't go the Javascript route and should be a sufficient technical hurdle to stop automated harvesters for a couple of years at least.

    - Anonymous and happy.

    • Another way to hide mail addresses is to throw in the words AT and DOT where the "@" and "." go (like most /.ers do). Some simple Perl scripting should do it (an example script is at http://gooberguy.homeip.net/cgi-bin/email.cgi?email=your_emailaddress@your_domain):

      $email_address =~ s/\@/ AT /g;   # you@example.com becomes "you AT example.com"
      $email_address =~ s/\./ DOT /g;  # and then "you AT example DOT com"

      D/\ Gooberguy
  • Thank ghod the article only mentioned wget 1.6 as a spambot, I'm running 1.5.3, which doesn't have the --evil-bastard or --potted-meat options.
    • Yes, but he blocked /^wget/ which will match them all. I believe you can change your user-agent string in wget however; most Linux browsers allow you to do this (Lynx, Links, Mozilla, etc.). Or just pass your requests through a proxy like JunkBuster [junkbuster.com], which can strip out and/or change headers like the User-Agent. You can run JB on your own machine.

      I guess he thinks wget is a bot because it can be made to recursively download a whole website, following all anchor tags like a bot even though it is being controlled by a human.
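
      For example, something like this makes a recursive wget run look like an ordinary IE hit (assuming your wget build has the -U/--user-agent option; the URL is a placeholder):

      wget -r -U "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)" http://www.example.com/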
  • My php solution (Score:2, Informative)

    by sphix42 ( 144155 )
    I used the tip from the article and put
    Disallow: /email-addresses/
    in my robots.txt then in my .htaccess:

    ForceType application/x-httpd-php

    and in email-addresses/ a small PHP script that records each visitor's IP in my .htaccess. I chgrp'd .htaccess to the web user's group so the script can write to it. This gives me a list of unique IPs in my .htaccess.
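
    The script under /email-addresses/ only needs a few lines. A sketch of the idea (paths and the bait address are placeholders, and it assumes the .htaccess already carries "Order allow,deny" / "Allow from all" so the appended Deny lines take effect):

    <?php
    // Runs for every hit under /email-addresses/ thanks to the ForceType line above.
    $htaccess = dirname(__FILE__) . '/../.htaccess';    // path is a guess - adjust
    $ip = $_SERVER['REMOTE_ADDR'];
    $rules = file_get_contents($htaccess);
    if (strpos($rules, "Deny from $ip") === false) {    // record each IP only once
        $fh = fopen($htaccess, 'a');                    // needs group write - hence the chgrp
        fwrite($fh, "Deny from $ip\n");
        fclose($fh);
    }
    echo "bait@example.invalid";                        // something for the harvester to chew on
    ?>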
  • I do selective agent blocking using mod_rewrite directives in .htaccess files. The article claims that mod_rewrite is difficult to learn, but I disagree, and its major advantage is especially visible in shared/virtual hosting environments. If Apache is compiled with mod_rewrite support, anyone on the system can create their own set of agent filters and place them in an .htaccess file. You don't need access to httpd.conf!

    The syntax is simple:

    #Send filesucking programs to hell
    RewriteEngine on
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} "^Offline Explorer.*" [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^wget.* [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar.* [NC]
    RewriteRule ^.*$ /nofilesucking.php [L]

    Seems effective enough for me, and it ain't tough to learn when you can find an example. Of course this does rely on the idea that filesucking programs (or email harvesting bots) identify themselves, but I think naysayers would be surprised at how many of them do just that.

    Shaun
  • I've had a spambot-trap on my web site for over a year, and while I've had around 10,000 page views each day during that time, I've never gotten one single spam to the email addresses featured in the trapped space.

    Or does this mean that the spam bots are sufficiently sophisticated that they recognize my trap for what it is? It's meant to be obvious to humans.
  • Long ago I heard of a CGI script by the name of WebPoison. It would generate a page of random text; the first set of text would be random words that all linked, via differently formed URLs, right back to the same page. The second and much longer set of text was a long list of randomly generated bogus e-mail addresses. Because the recursive links were all different (and random), it would theoretically cause a spambot to continually follow a circular path and constantly retrieve hundreds of fake e-mail addresses (thus the name WebPoison -- it poisons their list).

    There were some flaws. You'd need a webserver that lets you run CGI scripts without .cgi necessarily showing up in the URL (to fool the spambots), and you'd need some mechanism to check that the random addresses didn't use real domains. It might also use up your bandwidth as bots got stuck, but you could then use their IP to file a complaint with their ISP (and ban them from hitting your server in the future).

    Sadly, I've not found any information on it recently. Perhaps someone could hack out a more efficient version of it that addresses the potential problems and bugs.
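
    Something in that spirit is only a few lines of PHP. A toy sketch - the script name and the counts are arbitrary, and the bogus addresses use the reserved .invalid TLD so no real domain ever gets hit, which sidesteps the "real domains" problem above:

    <?php
    // poison.php - an endless supply of fake addresses plus links back to itself
    // (the self-links rely on PATH_INFO-style URLs reaching the script)
    function word() { return substr(str_shuffle('abcdefghijklmnopqrstuvwxyz'), 0, mt_rand(4, 9)); }
    for ($i = 0; $i < 50; $i++) {
        echo word() . '@' . word() . ".invalid<br>\n";                       // harvest fodder
    }
    for ($i = 0; $i < 10; $i++) {
        echo '<a href="poison.php/' . word() . '">' . word() . "</a><br>\n"; // recursive bait links
    }
    ?>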
    • by Anonymous Coward
      It's not exactly what you mean, but something similar is The Book of Infinity [eleves.ens.fr]. It doesn't generate email addresses, but it does generate an infinite website.
    • All you do is set the domains to a machine on your network that has its SMTP port firewalled. No bandwidth gets lost from the spam and you don't have to worry about the domain being valid.

    • It's still at:
      http://www.monkeys.com/wpoison/ [monkeys.com].
    • by asackett ( 161377 )

      It's called wpoison, and it's found at http://www.monkeys.com/wpoison/ [monkeys.com]. The problem is that it's very easy to detect -- note the lack of punctuation marks, the scarcity of two- and three-letter words, capital letters, and verbs... and the fact that there's a four-second pause in the same place, page after page... in short, it would be easy enough to spot a wpoison-generated page.

      I've coded up an alternative that suffers none of those obvious defects, and instead of throwing out bogus email addresses, it throws out valid spamcatcher addresses. Any SMTP host that sends a message to one of those addresses is blocked (via DJB's rbldns) from sending mail into my domain for a month. The blocklist is self-maintaining, so I never need to mess with it.

      It's been in place for about three months now, and my blocklist contains 125 entries right now -- five of which are netblocks I've manually added. The URL, sure to catch a bucketful of bad spiders thanks to this link, is http://www.artsackett.com/personnel/ [artsackett.com] and it is intentionally as slow as the rectification of sin.

  • Does anyone know how to have the webserver return a constantly running stream of garbage?

    I had heard of a guy taking chargen or /dev/random data and delivering it to some hacker who had broken into his system (instead of what the hacker was trying to siphon off). This would keep the connection open and, if enough people implemented it, would seriously limit their throughput.
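
    With Apache the easy way is a script that simply never finishes. A rough PHP sketch (whether the bytes actually trickle out as written depends on your PHP/Apache output-buffering settings, and it ties up one of your own server processes per victim, so use with care):

    <?php
    // tarpit.php - drip pseudo-random junk forever at whoever asked for it
    set_time_limit(0);                            // don't let PHP kill the script
    header('Content-Type: text/plain');
    while (true) {
        echo bin2hex(pack('N', mt_rand())), "\n"; // a few random bytes, hex-encoded
        flush();                                  // push it down the wire
        sleep(1);                                 // a slow drip keeps the connection busy longer
    }
    ?>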
  • Most spambots don't identify themselves. A few do, but most don't, and those that do won't for long if this info gets acted on.

    What does work is building a nice static list of email addresses and names. Link to another page and have it full of the same info. Do this on several virtual servers and make sure the web bots can find it.

    You can also be nice to the real search engines and tell them not to visit your spam traps; since robots.txt is often read by the spam bots, telling Google not to search that page works out well for both sides of the spider wars.

    The next thing is to lock down your mail server once it detects any of the spam traps. There are several good ways of doing this, depending on how you pay for bandwidth. Two of the best options are either to play dead on the connection or to return a "user mailbox is full" error. Both of these tie up resources on the spammer's end. The other choice is to reject 99.99% of the mail and hope they pull your domain out of their lists for being full of junk.

    I run @abnormal.com, which tends to sort near the top, has lots of bogus addresses, and has been running spam traps for years. Every day I get hit by spammers working from sorted address lists.

    One thing to keep in mind is that most bots are run by people who only sell lists, not the spammers themselves. Because of that, there is no direct link between the searching bots and the mail host that spams later.

    I wonder if it's time to make an RBL-like thing that is just for poisoned addresses.
    • I'm doing this ("RBL-like thing that is just for poisoned addresses") locally -- I don't have the resources available to offer such a thing publicly, but I have some ideas for a distributed system that would not only share the load (and therefore the cost) but would be difficult to attack, because there would be no central node. If there were sufficient interest and enough talent to make it a viable project, I'd set up a mailing list and whatnot to assist the effort.

      You interested? If so, email me by replacing slashdot with the user name asackett in the email address above.
  • If you're running Apache, you could have your web site display e-mail addresses as graphics. You could have it match the same fonts your site is using, so it would look like normal text but a spider couldn't read it.

    Sample PHP Script [planetsourcecode.com]
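
    For the curious, PHP's GD extension makes this a few lines. A sketch (it assumes GD is compiled in; the address, size, and built-in font are placeholders, and matching your site's real fonts would take imagettftext and a TTF file instead):

    <?php
    // email-image.php - serve an address as a PNG instead of text
    $address = 'you@example.com';                      // hypothetical address
    $im = imagecreate(strlen($address) * 8 + 10, 20);  // rough width for built-in font 3
    $bg = imagecolorallocate($im, 255, 255, 255);      // first allocation becomes the background
    $fg = imagecolorallocate($im, 0, 0, 0);
    imagestring($im, 3, 5, 3, $address, $fg);          // built-in bitmap font, no TTF needed
    header('Content-Type: image/png');
    imagepng($im);
    imagedestroy($im);
    ?>

    Then drop <img src="email-image.php" alt="our address"> wherever the mailto link used to go.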
  • Some questions:
    • I've implemented a post gate on my site, as described here [turnstep.com]. Unfortunately, the mail account attached to it is already getting SPAM, so I can't tell if it's working. Does anyone know if the 'bots that SPAMers use these days are sophisticated enough to handle post methods?
    • The SPAMbot Beware site [turnstep.com] has a lot of other suggestions, and any page titled SPAMbot Harassment [turnstep.com] gets my vote. I do wonder how effective these dated (~1999) techniques would be today. Any opinions?
    • Finally, I was thinking of implementing a SPAMbot CGI trap that sleeps, say, five seconds before posting a page of bogus addresses (and domains) and a link to another page that's simply a soft link to itself. Does this sound like it ought to work? After all, if I like recursion, shouldn't a SPAMbot? ;)
