Unusual HTTP Requests For robots.txt?
Apache Posted by Cliff on Friday September 22, @01:50PM
from the suspicious-behavior dept.
Fooster asks: "I edit several (mostly) unrelated Web sites hosted on a Linux virtual hosting machine running Apache. Often in an idle moment between edits, I'll watch my logs with a 'tail -f access &'. Today, I started to get bursts of requests for robots.txt from several different major service provider IP blocks that were almost simultaneous. Some time later, I'd get another burst, with some of the requests coming from different IPs. All in all, I had over 100 times more requests for robots.txt today than ever before in one day. Unlike most search engine robots.txt requests, there was no info in the referrer field and a reverse DNS lookup did not lead me back to a search engine info provider. I found the requests to be coming from blocks owned by ISPs like Qwest, AT&T, BBN and others. A cursory examination of the literature revealed no reports of exploits based on robots.txt, so I decided to 'Ask Slashdot.' Have any other Webmasters noticed this? Am I just being paranoid? Take a look at the logs yourself, and let me know please."

This discussion has been archived. No new comments can be posted.
    Looks like IP Spoofing (Score:5, Informative)
    by scotpurl on Friday September 22, @02:04PM EDT (#1)
    (User #28825 Info) http://homepage.davesworld.net/~scott
    I think someone's using you as a test case for some IP spoofing. Awful lot of .41 and .81 ending IP addresses in there, but from vastly different subnets. Looks too similar for me to believe it's coincidence. I think the exploit works like this: one box sends hundreds of spoofs, then another box (somewhere else) receives the response. Some responses go to legitimate boxes (which didn't ask for the info), some to unused IP space, and one to the actual box you wanted the results to go to. The exploiter is hoping you won't figure out which of the hundreds of requests actually went to a box you can trace back to them.

    Also, since your robots.txt file says what not to index, that's frequently the list of directories with tasty things that people would most like to hack into. Think about it. What's in your robots.txt file? Things that change too often to be listed in search engine results, or the sorts of things that you don't want out there.

    I think you're being probed. Make sure your backups are up to date, and that the box is secured. :-)
    Most likely (Score:3, Interesting)
    by Dast (cfy1@ra.msstate.eduspamtodevnullplease) on Friday September 22, @03:50PM EDT (#7)
    (User #10275 Info) http://www2.msstate.edu/~cfy1/pub/decss.tar.gz
    it is looking for some insecure cgi type package (search bugtraq for the many possibilities) that puts something in robots.txt. Whatever it puts in there could be used to identify whether the package is installed on the server, letting the cracker know the box can be compromised.

    Better double check your security.

    This sig is false.

    Re:Looks like IP Spoofing (Score:1)
    by qux.net (jeremy@qux.net.spam) on Saturday September 23, @01:09AM EDT (#13)
    (User #107853 Info) http://qux.net/
    What's in your robots.txt file? Things that change too often to be listed in search engine results, or the sorts of things that you don't want out there.
    I've always taken the approach that robots.txt shouldn't contain anything that reveals semi-private sites on the server. Since it matches anything after the string listed, you don't really have to put even a whole directory name, just enough to make it not exclude things you might want listed...
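The prefix matching described above can be checked with Python's standard `urllib.robotparser`. This is only a minimal sketch: the `/priv` rule and the example paths are made up for illustration and do not come from anyone's real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a short prefix instead of a full directory name.
RULES = """\
User-agent: *
Disallow: /priv
""".splitlines()

parser = RobotFileParser()
parser.parse(RULES)


def allowed(path):
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", path)


# Both /priv/ and /private-reports/ start with "/priv", so both are blocked,
# while the rule itself never spells out either directory name.
for path in ("/private-reports/q3.html", "/priv/", "/public/index.html"):
    print(path, "->", "allowed" if allowed(path) else "blocked")
```

The trade-off is the one the comment mentions: the shorter the prefix, the less it reveals, but the more likely it is to exclude pages you did want indexed.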
    Spoofing is mostly impossible with TCP (Score:2)
    by Broccolist on Wednesday September 27, @08:09PM EDT (#19)
    (User #52333 Info)
    I'm fairly well versed in the workings of TCP/IP and I don't think what you describe is technically possible. Because TCP uses a 3-way handshake, the only packet that can be spoofed is the initial SYN packet (which can be useful for port scanning in ways pretty much like you described). But in order to send the request string for robots.txt, a full TCP connection must be established.

    Say host A is connecting to host B. This needs to happen in order to have a successful connection:

    1. A sends initial SYN packet to B requesting connection
    2. B sends back ACK,SYN packet accepting connection, with a random sequence number
    3. A sends back an ACK packet acknowledging the sequence number given to it in step 2 (the acknowledgment field is that number plus one)
    Only after this is done can text like "GET /robots.txt" be sent. As you can see, step 3 can't be spoofed, because the correct sequence number is required and the only way to get it (barring router-level spying, which most attackers can't do) is to actually be the host A which receives packet 2. Check out RFC793 for excessive detail :).

    So, I would say a bunch of hosts really are requesting robots.txt for some weird reason (still perhaps security-related, but not spoofing). Someone correct me if I'm wrong, but I'm pretty sure about this.
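The argument above can be illustrated with a toy model of the handshake. This is a simplified sketch, not real TCP: the `Server` class, the helper functions, and the addresses are all invented for the example, and it assumes the server's initial sequence number really is unpredictable.

```python
import secrets


class Server:
    """Toy model of host B: tracks half-open connections by claimed source."""

    def __init__(self, isn_source=None):
        # isn_source is a hook for supplying sequence numbers (hypothetical,
        # only here so the behaviour can be tested deterministically).
        self.isn_source = isn_source or (lambda: secrets.randbits(32))
        self.pending = {}         # claimed source -> sequence number we sent
        self.established = set()  # sources that completed the handshake

    def on_syn(self, claimed_src):
        # Step 2: choose a sequence number and send the SYN-ACK to the
        # *claimed* source address, wherever the SYN really came from.
        isn = self.isn_source()
        self.pending[claimed_src] = isn
        return claimed_src, isn

    def on_ack(self, claimed_src, ack_number):
        # Step 3: the ACK must acknowledge our sequence number plus one.
        isn = self.pending.get(claimed_src)
        if isn is not None and ack_number == (isn + 1) % 2**32:
            del self.pending[claimed_src]
            self.established.add(claimed_src)
            return True
        return False


def honest_connect(server, my_addr):
    """A real client receives the SYN-ACK, so it knows what to acknowledge."""
    dest, isn = server.on_syn(my_addr)
    assert dest == my_addr
    return server.on_ack(my_addr, (isn + 1) % 2**32)


def blind_spoof(server, forged_addr, guess):
    """A spoofer never sees the SYN-ACK (it goes to forged_addr) and must guess."""
    server.on_syn(forged_addr)
    return server.on_ack(forged_addr, guess)


server = Server()
print("honest client connected:", honest_connect(server, "192.0.2.1"))
print("blind spoofer connected:", blind_spoof(server, "10.0.0.99", guess=0))
```

Only a connection that reaches `established` could go on to send "GET /robots.txt", which is why each logged request implies a host that actually received the server's reply.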

    batten down the hatches, cap'n (Score:1)
    by j_d (theeaterofsocks@hotmail.com) on Friday September 22, @02:21PM EDT (#2)
    (User #26865 Info)
    those are all isps that care more about $ than their reputation, and will let anyone go amok in their sandboxes. you've probably disagreed with someone in some forum, and you're going to be punished for it.
    Either that, or someone's abusing robots.txt by culling its info, and noting it for interesting things for manual perusal at a later date.
    shields up, red alert.
    That's normal. (Score:1, Funny)
    by Anonymous Coward on Friday September 22, @02:31PM EDT (#3)
    To avoid the RIAA, people have been changing their extensions on their mp3 files to avoid detection; the latest one is .txt.

    There's been a retro-80's movement going on lately, so everyone's looking for that 'robots.txt' mp3; I think it's by Styx.

    Just put up a notice on your pages that says in big letters "We don't have the 'robots.txt' mp3; look for it on eBay". That should do it.
    Re:That's normal. (Score:1)
    by epodrevol (peckert@aopsolutions.com) on Friday September 22, @04:35PM EDT (#8)
    (User #219315 Info)
    There's a Styx album that's good????

    who knew....
    "Our life is frittered away by detail...simplify, simplify." -WW

    Some days ago I suffer from the same (Score:3, Informative)
    by overlord on Friday September 22, @02:46PM EDT (#4)
    (User #5277 Info) http://sonrisas.8k.com
    Some days ago I got the following logs:

    206.229.153.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"
    206.64.105.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"
    206.98.113.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"
    208.47.242.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"
    208.47.242.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"
    12.27.166.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"
    route.ocy.pnap.net - - [19/Sep/2000:15:14:05 -0300] "GET /robots.txt" 200 37 "-" "-"
    route.ocy.pnap.net - - [19/Sep/2000:15:14:05 -0300] "GET /robots.txt" 200 37 "-" "-"
    207.86.73.121 - - [19/Sep/2000:15:14:08 -0300] "GET /robots.txt" 200 37 "-" "-"
    4.20.90.121 - - [19/Sep/2000:15:14:17 -0300] "GET /robots.txt" 200 37 "-" "-"

    Seems to be pretty similar.
    Basically it was repeated every hour.
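One way to confirm a pattern like this without reading the log by eye is to group robots.txt requests by minute and list the minutes in which several different hosts asked at once. The following is a rough sketch that assumes Apache's common/combined log layout as shown in the lines above; the function name and the `min_hosts` threshold are illustrative choices, and the sample lines are copied from the log excerpt.

```python
import re
from collections import defaultdict
from datetime import datetime

# host ident user [time] "request" status size ...
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def robots_bursts(lines, min_hosts=3):
    """Map each minute to the hosts that requested /robots.txt in it,
    keeping only minutes with at least min_hosts distinct hosts."""
    buckets = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        parts = m.group("request").split()
        if len(parts) < 2 or parts[1] != "/robots.txt":
            continue
        when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        buckets[when.replace(second=0)].add(m.group("host"))
    return {minute: hosts for minute, hosts in buckets.items()
            if len(hosts) >= min_hosts}


SAMPLE = [
    '206.229.153.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"',
    '206.64.105.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"',
    '206.98.113.121 - - [19/Sep/2000:15:14:01 -0300] "GET /robots.txt" 200 37 "-" "-"',
]

for minute, hosts in sorted(robots_bursts(SAMPLE).items()):
    print(minute.isoformat(), len(hosts), "hosts:", ", ".join(sorted(hosts)))
```

Run over a full day's log, the gaps between the reported minutes would show whether the bursts really recur hourly.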

    A test for a DoS?

    Bye

    OverLord


    Re:Some days ago I suffer from the same (Score:2)
    by ptomblin on Saturday September 23, @10:41AM EDT (#15)
    (User #1378 Info) http://xcski.com/~ptomblin/
    I've got the same damn thing:
    /var/log/httpd/access_log.4:204.123.28.10 - - [20/Sep/2000:02:10:05 -0400] "GET /robots.txt HTTP/1.0" 404 278 "-" "Mercator-1.0"
    /var/log/httpd/access_log.4:208.47.242.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:12.27.166.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:206.229.153.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:206.98.113.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:4.20.90.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:206.64.105.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:216.52.254.37 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:216.52.254.37 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:208.47.242.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"
    /var/log/httpd/access_log.4:207.95.133.41 - - [20/Sep/2000:02:23:02 -0400] "GET /robots.txt" 404 - "-" "-"

    I think it would be useful to blackhole any attempt to get robots.txt from anybody who doesn't give a referrer string. Not just give them a 404, but just don't respond at all to the request. Is this possible in Apache?

    My company is hiring 20 C/C++ and Java Solaris developers in Rochester NY. Email (text, not MS-Word) if interested.
    incident list (Score:3, Informative)
    by po_boy (amoore at dynodns dot net) on Friday September 22, @03:14PM EDT (#6)
    (User #69692 Info) http://dynodns.net
    I personally don't believe this is a security related incident, but if you do, you may want to take this up on the incidents list at INCIDENTS (at) SECURITYFOCUS.COM. Head over to securityfocus.com and check out the list. It's like BUGTRAQ, but for reporting/discussing incidents.

    Hope it helps.

    I've seen this too.. (Score:2, Interesting)
    by Tairan (john@johncglass.com) on Friday September 22, @05:13PM EDT (#9)
    (User #167707 Info) http://www.johncglass.com
    I log a few things on my server (what time, request, referrer) and have been noticing lots of requests for robots.txt. I've only registered my site with the One True Search Engine, so I would expect a hit to it once a month. But, I am getting 15 or so a day! I just assumed it was some stupid script kiddy who thought the robots.txt file would have something way cool like my root password...I placed a blank file in the web root, and that's it. Anyone else?
    John Glass /. is a commercial entity. goto slashdot.com
    Re:That's normal... (Score:2, Interesting)
    by Tairan (john@johncglass.com) on Saturday September 23, @12:29AM EDT (#12)
    (User #167707 Info) http://www.johncglass.com
    Right, I understand this. But would one search engine index your site 15 times a day, every day? Even if all the search engines decided to index my site, that leaves it covered in 2-3 days. There is probably something suspicious going on, as suggested by someone else. I just hope I am not part of it!
    John Glass /. is a commercial entity. goto slashdot.com
    IE "Make Available Offline" (Score:3, Interesting)
    by whyDNA? (whydna@fuckspam.hotmail.com) on Friday September 22, @10:15PM EDT (#11)
    (User #9312 Info) http://dcaff.com
    I realize a large majority of the audience avoids products like MSIE... but I believe that that's the source of the problem...

    When a user bookmarks a page, they are given an option to "Make Available Offline" which, if selected, pops up some configuration dialog boxes (where they get to choose how many layers deep, etc). It essentially grabs all the code, graphics, etc. and saves it locally.

    Personally, I use this function when I don't know if the content is likely to be around for a while. As it is processing, it shows that it is grabbing all sorts of robots.txt files from all over the damned place (especially if it follows a number of links deep).

    It's not the brightest of MS's "wizards", so it probably keeps requesting the same one repeatedly when links lead to the same server. Try to check what the HTTP_USER_AGENT says about that robots.txt file.

    If your logs can't tell you, make PHP process .txt files (in your Apache settings or via a .htaccess file) and run a little script in your robots.txt file that'll log the HTTP_USER_AGENT to a db or text file, etc.
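The same idea, serving robots.txt from a script so the user agent gets recorded, can be sketched in Python instead of the PHP setup suggested above. This is an illustration under assumptions: it uses a small WSGI application rather than Apache's handler configuration, and the names `robots_app`, `serve`, and the placeholder robots.txt body are all invented here.

```python
import logging

# Placeholder content that allows everything; substitute your real rules.
ROBOTS_BODY = b"User-agent: *\nDisallow:\n"

log = logging.getLogger("robots")


def robots_app(environ, start_response):
    """Serve /robots.txt and record who asked for it."""
    if environ.get("PATH_INFO") != "/robots.txt":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    # WSGI exposes request headers as HTTP_* keys, much like CGI does.
    log.info(
        "robots.txt requested by %s, user agent %r, referrer %r",
        environ.get("REMOTE_ADDR", "-"),
        environ.get("HTTP_USER_AGENT", "-"),
        environ.get("HTTP_REFERER", "-"),
    )
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(ROBOTS_BODY))),
    ])
    return [ROBOTS_BODY]


def serve(port=8000):
    """Run the app with the standard library's reference server (testing only)."""
    from wsgiref.simple_server import make_server
    logging.basicConfig(level=logging.INFO)
    make_server("", port, robots_app).serve_forever()
```

Calling `serve()` starts a local test server; in practice the application would sit behind whatever WSGI setup the host already provides. Requests that arrive with no user agent at all, like the "-" entries in the logs posted in this thread, would show up as '-' in the recorded line.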

    The HTTP_USER_AGENT /should/ be in blocks of the same type (more or less)

    -Andy
    Here's some more info (Score:1)
    by Fooster on Saturday September 23, @03:56AM EDT (#14)
    (User #100239 Info) http://www.jbuff.org
    Part of the puzzle to me was why the requests were generating 403s (Forbidden to access).

    It turns out that all the requests were for the robots.txt file in the default web space my host sets up for every account. I have five domains registered and working under that account, but had never paid any attention to, published any links to, or placed any files into that default directory. What's more, I never even made it world readable, thus the 403s. I've since fixed all of that, and placed a redirection page in that directory to shuffle requests off to my vanity page, but I haven't seen any more requests like those. I have seen a few browser requests from /. readers, but no more request bursts like those. Thanks for all your suggestions, even the stupid ones gave me a laugh.


    The wait for tech support doubles every 18 months... Any likelihood they can solve your problem halves. Foosters

    Here's an actual robots.txt (Score:1)
    by apm on Saturday September 23, @06:20PM EDT (#16)
    (User #212573 Info)
    It looks like whoever is going after your system may be looking for hidden data to crack into. At least, that's the best explanation I can think of. Here's a copy of an actual robots.txt file that I found on one of the sites I work on:

    User-agent: *
    Disallow: /snapshots/
    Disallow: /cvsweb/
    Disallow: /cgi-bin/
    Disallow: /pub/
    Disallow: /doc/

    Re:Here's an actual robots.txt (Score:1)
    by Fooster on Sunday September 24, @12:30AM EDT (#17)
    (User #100239 Info) http://www.jbuff.org
    It seems that is a very likely scenario. I've had a few other reports. It might be worth setting up a honey trap to see if anyone goes for it.


    The wait for tech support doubles every 18 months... Any likelihood they can solve your problem halves. Foosters

    Count me in.... got the weirdass logs too (Score:1)
    by dash_t (chris@oneSPAMALICIOUSwolf.com) on Sunday September 24, @12:46AM EDT (#18)
    (User #235968 Info) http://www.onewolf.com
    They all came, in a burst, September 17, 14:28:42, then again September 18, 23:21:54 (Central Time):

    208.51.235.81
    4.20.90.81
    206.229.153.81
    206.64.105.81
    206.191.170.226
    206.98.113.81
    12.27.166.81
    [snip]

    68 hits total. Most of the addresses seem to belong to AT&T or Internap.
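The shared final octet that several posters noticed (.41, .81, .121) is easy to tally. A small sketch using the addresses listed in this comment; the function name is made up, and the count covers only the seven addresses shown before the [snip], not all 68 hits.

```python
from collections import Counter
import ipaddress

# Addresses as posted above (the list was truncated by the poster).
ADDRESSES = [
    "208.51.235.81", "4.20.90.81", "206.229.153.81", "206.64.105.81",
    "206.191.170.226", "206.98.113.81", "12.27.166.81",
]


def last_octet_counts(addresses):
    """Count how often each final octet occurs among IPv4 addresses."""
    counts = Counter()
    for text in addresses:
        addr = ipaddress.IPv4Address(text)  # raises ValueError if malformed
        counts[addr.packed[-1]] += 1
    return counts


print(last_octet_counts(ADDRESSES).most_common())
# → [(81, 6), (226, 1)]
```

A strong skew toward one final octet across unrelated networks suggests the same host number inside many different provider blocks, which fits the pattern reported in the other comments.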