
jamie's Journal: Efficient RSS Throttling

Dan Sandler has an article from a few days ago about RSS throttling, where he discusses the solution of having the server keep track of which clients have hit RSS feeds recently, so it knows when a client crosses the line and needs to be banned.

This is exactly what we do on Slashdot, of course. Every hit, whether to a dynamically generated Perl script page or to a static .shtml or .rss page, triggers an Apache PerlCleanupHandler which inserts a row into the 'accesslog' table in our MySQL database.

(By putting it in the cleanup phase, we ensure it doesn't affect page delivery times at all; it just means a few more milliseconds that the httpd child is occupied instead of being available to deliver pages, but the only resource it's taking up is RAM.)

Dan writes:

I'm uncomfortable with this solution because it's hard to make it scale. First, you have to hit a database (of some kind) to cross-reference the client IP address with its last fetch time. Maybe that's not a big deal; after all, you're hitting the database to read your website data too. But then you have to write to the database in order to record the new fetch time (if the RSS feed has changed), and database writes are slow.

I'll grant that our accesslog traffic is pretty I/O intensive. But if you were only talking about logging RSS hits and nothing else, it'd be a piece of cake. The table just needs three columns (timestamp, IP address, numeric autoincrement primary key). You expire old entries by deleting off one end of the table while you insert into the other. That way inserts never block, even under MyISAM (though I'd recommend InnoDB).
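To make the three-column table concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL so it can run anywhere; the table name rss_hits and the helper names are made up for illustration and are not Slashdot's actual schema or code.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with a hypothetical three-column hit log."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE rss_hits (
               id INTEGER PRIMARY KEY AUTOINCREMENT,  -- grows at the insert end
               ts INTEGER NOT NULL,                   -- Unix timestamp of the hit
               ip INTEGER NOT NULL                    -- IPv4 address as a 32-bit int
           )"""
    )
    return conn


def log_hit(conn, ip_int, now=None):
    """Append one row for an RSS hit."""
    ts = int(time.time()) if now is None else now
    conn.execute("INSERT INTO rss_hits (ts, ip) VALUES (?, ?)", (ts, ip_int))
    conn.commit()


def expire_old(conn, now, keep_seconds=3600):
    """Delete off the old end of the table; returns the number of rows removed.

    Assumes ids increase along with timestamps, so everything up to the
    newest too-old id can be dropped by primary key.
    """
    cutoff = now - keep_seconds
    newest_old_id = conn.execute(
        "SELECT MAX(id) FROM rss_hits WHERE ts < ?", (cutoff,)
    ).fetchone()[0]
    if newest_old_id is None:
        return 0
    cur = conn.execute("DELETE FROM rss_hits WHERE id <= ?", (newest_old_id,))
    conn.commit()
    return cur.rowcount
```

In this sketch you would call log_hit() from whatever runs after the response is sent, and expire_old() from a periodic job.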

You only need to keep about an hour of the table around anyway, so it's going to be really small. How many RSS hits can you get in an hour? A hundred thousand? That's peanuts, especially since each row is fixed size. Crunch that IP address down to a 32-bit int before writing it and each row is 12 bytes, give or take. Throw in the indexes and the whole table is a few megabytes. Even a slow disk should be able to keep up -- but if you're concerned about performance, heck, throw it in RAM.
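The arithmetic is easy to check. The sketch below assumes IPv4 addresses and three unsigned 32-bit fields per row (id, timestamp, IP); a real storage engine adds per-row and index overhead on top of this, so treat the result as a floor rather than an exact figure.

```python
import ipaddress
import struct


def ip_to_int(dotted):
    """Crunch a dotted-quad IPv4 address down to a 32-bit integer."""
    return int(ipaddress.IPv4Address(dotted))


# Hypothetical fixed-size row: id, timestamp, IP, each an unsigned 32-bit int.
ROW_FORMAT = "<III"
ROW_BYTES = struct.calcsize(ROW_FORMAT)      # 12 bytes per row

HITS_PER_HOUR = 100_000                      # the guess from the text
TABLE_BYTES = ROW_BYTES * HITS_PER_HOUR      # 1,200,000 bytes before indexes

packed = struct.pack(ROW_FORMAT, 1, 1_100_000_000, ip_to_int("192.168.0.1"))
```

So an hour of raw data at that rate is a bit over a megabyte, which is consistent with "a few megabytes" once indexes are counted.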

To catch bandwidth hogs, you create a secondary table that doesn't have so much churn. It has an extra column for the count of RSS hits, so if some miscreant nails your webserver 1,000 times in a minute, the secondary table only gets 1 row. You periodically (every minute or two) check the max id on the primary table, then run:

INSERT INTO secondary_table SELECT ip, MAX(ts), COUNT(*) FROM table WHERE id BETWEEN last_checked+1 AND current_max GROUP BY ip

By limiting the id to a range, again, there is no blocking issue with the ongoing inserts. After doing that, you trim off rows from secondary_table older than a certain age, and then you're ready to do the only query that even approaches being expensive:

SELECT ip, SUM(hitcount) AS s FROM secondary_table GROUP BY ip HAVING s > your_limit

and you have your list of IP addresses that have exceeded your limit.

What we do is use that data to update a table that keeps track of IP addresses that need to be banned from RSS, and have a PerlAccessHandler function that checks a (heavily cached) copy of that table to see whether the incoming IP gets to proceed to the response phase or not.
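The cached check might look something like the following. This is a language-neutral sketch in Python, not mod_perl: BanCache, access_check, and the load_banned callback are hypothetical names, and the real handler would return Apache status constants rather than bare HTTP codes.

```python
import time


class BanCache:
    """Keep a local copy of the ban list, reloading it at most once per TTL."""

    def __init__(self, load_banned, ttl_seconds=60, clock=time.monotonic):
        self._load = load_banned      # callable returning an iterable of banned IPs
        self._ttl = ttl_seconds
        self._clock = clock
        self._banned = frozenset()
        self._loaded_at = None

    def is_banned(self, ip):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            # Only this branch touches the database; every other request
            # is answered from memory.
            self._banned = frozenset(self._load())
            self._loaded_at = now
        return ip in self._banned


def access_check(cache, ip):
    """Decide before the response phase: 403 for banned IPs, 200 otherwise."""
    return 403 if cache.is_banned(ip) else 200
```

The point of the cache is that the per-request cost is a set lookup; the ban table itself is read only once per TTL per process.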

Slashdot's resource requirements are actually a lot higher than this, since we log every hit instead of just RSS, we log the query string, user-agent, and so on -- and also because we've voluntarily taken on the privacy burden of MD5'ing incoming IP addresses so we don't know where users are coming from. That makes our IP address field 28 bytes longer than it has to be. But even so, we don't have performance issues. Slashdot's secondary table processing takes about 10-15 seconds every 2 minutes.
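For the curious, the 28-byte figure falls out if you assume the MD5 is stored as a 32-character hex string and compare it with a 4-byte integer. That storage format is an assumption here; the sketch just checks the arithmetic.

```python
import hashlib


def hashed_ip(dotted):
    """Return the MD5 of an IP address as a hex string (assumed storage form)."""
    return hashlib.md5(dotted.encode("ascii")).hexdigest()


HEX_MD5_BYTES = len(hashed_ip("192.168.0.1"))   # 32 characters
INT_IP_BYTES = 4                                # a 32-bit integer
EXTRA_BYTES = HEX_MD5_BYTES - INT_IP_BYTES      # 28 bytes more per row
```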

As for Dan's concern about IP addresses hidden behind address translation -- yep, that's a concern. (We don't bother checking user-agent because idiots writing RSS-bombing scripts would just spam us with random agents.) The good news is that you can set your limits pretty high and still function, since a large chunk of your incoming bandwidth is that top fraction of a percent of hits that come from poorly written scripts. Even a large number of RSS readers behind a proxy shouldn't be that magnitude of traffic. We do get reader complaints, though, and for a sample of them, anyone thinking about doing this might want to read this thread first.

Comments:
  • I would imagine the same scenario (or, minimally, similar) could be used with managing trackbacks and pings then.

    It's funny how someone posts an article about "how such and such should be done" and slashcode's already been dealing with it for years.
  • I've been concerned that you'd wind up losing large numbers of aggregators behind big proxies or anonymizers. If you have 2,000 AOL users running the same aggregator software that tries to check on the top of the hour for feeds -- only the first 50 or 100 get the update? That doesn't seem to scale for RSS the way it would for Slashdot's throttling mechanism.
    • It's pretty easy to check that on our site; a pretty much fixed percentage of our users create accounts, log in, post comments, and generally contribute to the site. We log that grouped by IP as well, so when we see an IP whose RSS is blocked and which has activity from n logged-in users, we can estimate that there are k*n actual users behind it and it's probably a proxy. We manually look at those IPs, and allow the proxies a lot more RSS hits.

      If your site doesn't support logged-in user participation it's
