Efficient RSS Throttling
This is exactly what we do on Slashdot, of course. Every hit, whether to a dynamically-generated perl script page or to a static page, gets logged to the database during Apache's cleanup phase.
(By putting it in the cleanup phase, we ensure it doesn't affect page delivery times at all; it just means a few more milliseconds that the httpd child is occupied instead of being available to deliver pages, but the only resource it's taking up is RAM.)
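For concreteness, the per-hit logging that the cleanup phase does might amount to nothing more than a single insert along these lines -- table and column names here are made up for illustration, not our actual schema:

-- one row per hit; log whatever request details you care about
INSERT INTO accesslog (ts, host_addr, uri, user_agent)
VALUES (NOW(), '203.0.113.9', '/index.rss', 'SomeReader/1.0');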
I'm uncomfortable with this solution because it's hard to make it scale. First, you have to hit a database (of some kind) to cross-reference the client IP address with its last fetch time. Maybe that's not a big deal; after all, you're hitting the database to read your website data too. But then you have to write to the database in order to record the new fetch time (if the RSS feed has changed), and database writes are slow.
I'll grant that our accesslog traffic is pretty I/O intensive. But if you were only talking about logging RSS hits and nothing else, it'd be a piece of cake. The table just needs three columns (timestamp, IP address, numeric autoincrement primary key). You expire old entries by deleting off one end of the table while you insert into the other. That way inserts never block, even under MyISAM (though I'd recommend InnoDB).
You only need to keep about an hour of the table around anyway, so it's going to be really small. How many RSS hits can you get in an hour? A hundred thousand? That's peanuts, especially since each row is fixed size. Crunch that IP address down to a 32-bit int before writing it and each row is 12 bytes, give or take. Throw in the indexes and the whole table is a few megabytes. Even a slow disk should be able to keep up -- but if you're concerned about performance, heck, throw it in RAM.
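A minimal sketch of that table, assuming MySQL and made-up names:

-- Fixed-size rows: 4-byte id + 4-byte timestamp + 4-byte packed IP = 12 bytes.
CREATE TABLE rss_hits (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ts TIMESTAMP NOT NULL,
  ip INT UNSIGNED NOT NULL,  -- store INET_ATON(addr), not the dotted string
  INDEX (ts)
) ENGINE=InnoDB;             -- or ENGINE=MEMORY to keep it entirely in RAM

-- Expire off one end while new hits insert onto the other:
DELETE FROM rss_hits WHERE ts < NOW() - INTERVAL 1 HOUR;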
To catch bandwidth hogs, you create a secondary table that doesn't have so much churn. It has an extra column for the count of RSS hits, so if some miscreant nails your webserver 1,000 times in a minute, the secondary table only gets 1 row. You periodically (every minute or two) check the max id on the primary table, then
INSERT INTO secondary_table SELECT ip, MAX(ts), COUNT(*) FROM table WHERE id BETWEEN last_checked+1 AND current_max GROUP BY ip
By limiting the id to a range, again, there is no blocking issue with the ongoing inserts. After doing that, you trim off rows from secondary_table older than your time window (an hour, say), and then you're ready to run the only query that even approaches being expensive:
SELECT ip, SUM(hitcount) AS s FROM secondary_table GROUP BY ip HAVING s > your_limit
and you have your list of IP addresses that have exceeded your limit.
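Stitched together, the whole periodic pass might look like this in MySQL terms, with @last_checked carried over from the previous run and secondary_table assumed to have ip, ts, and hitcount columns:

-- Grab the current high-water mark on the primary table.
SELECT MAX(id) INTO @current_max FROM rss_hits;

-- Roll up everything between the two marks (the INSERT ... SELECT above).
INSERT INTO secondary_table (ip, ts, hitcount)
SELECT ip, MAX(ts), COUNT(*) FROM rss_hits
WHERE id BETWEEN @last_checked + 1 AND @current_max
GROUP BY ip;

SET @last_checked = @current_max;

-- Trim rollup rows that have aged out of the window.
DELETE FROM secondary_table WHERE ts < NOW() - INTERVAL 1 HOUR;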
What we do is use that data to update a table that keeps track of IP addresses that need to be banned from RSS, and have a PerlAccessHandler function that checks a (heavily cached) copy of that table to see whether the incoming IP gets to proceed to the response phase or not.
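The ban bookkeeping can live in SQL too, so the PerlAccessHandler only ever runs (and caches) one cheap lookup. A sketch, again with hypothetical names and a made-up limit:

-- Hypothetical ban table, refreshed from the limit query above.
CREATE TABLE rss_banned (
  ip INT UNSIGNED NOT NULL PRIMARY KEY,
  banned_at DATETIME NOT NULL
);

REPLACE INTO rss_banned
SELECT ip, NOW() FROM secondary_table
GROUP BY ip HAVING SUM(hitcount) > 1000;  -- your_limit; 1000 is invented

-- The access handler's per-request check:
SELECT 1 FROM rss_banned WHERE ip = INET_ATON('203.0.113.9');

A row coming back means the handler denies the request instead of letting it proceed to the response phase.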
Slashdot's resource requirements are actually a lot higher than this, since we log every hit instead of just RSS, we log the query string, user-agent, and so on -- and also because we've voluntarily taken on the privacy burden of MD5'ing incoming IP addresses so we don't know where users are coming from. That makes our IP address field 28 bytes longer than it has to be (a hex MD5 digest is 32 bytes, versus 4 for a packed address). But even so, we don't have performance issues. Slashdot's secondary table processing takes about 10-15 seconds every 2 minutes.
As for Dan's concern about IP addresses hidden behind address translation -- yep, that's a concern. (We don't bother checking user-agent because idiots writing RSS-bombing scripts would just spam us with random agents.) The good news is that you can set your limits pretty high and still function, since a large chunk of your incoming bandwidth is that top fraction of a percent of hits that are poorly-written scripts. Even a large number of RSS readers behind one proxy shouldn't add up to that magnitude of traffic. We do get reader complaints, though, and for a sample of them, anyone thinking about doing this might want to read this thread first.