When RSS Traffic Looks Like a DDoS
An anonymous reader writes "Infoworld's CTO Chad Dickerson says he has a love/hate relationship with RSS. He loves the changes to his information production and consumption, but he hates the behavior of some RSS feed readers. Every hour, Infoworld "sees a massive surge of RSS newsreader activity" that "has all the characteristics of a distributed DoS attack." So many requests in such a short period of time are creating scaling issues." We've seen similar problems over the years. RSS (or, as it should be called, "Speedfeed") is such a useful thing; it's unfortunate that it's ultimately just very stupid.
netcraft article (Score:5, Informative)
Simple HTTP Solution (Score:3, Informative)
Call me stupid (Score:5, Informative)
Over the years? How about over the weekend? (Score:5, Informative)
And it seems to have gotten worse since the new code was installed; I get 503 errors at the top of every hour now on Slashdot.
"it's the connection overhead, stupid" (Score:5, Informative)
...is what one would say to the designers of RSS.
Mainly, even if your client is smart enough to communicate that it only needs part of the page, the pages themselves, especially after gzip compression (which, with mod_gzip, can be done ahead of time), are tiny. The real overhead is all the nonsense, both at the protocol level and in server CPU time, of opening and closing a TCP connection.
It's also the designers' fault for not including strict rules in the standard about how frequently a client is allowed to check back. And, obviously, clients shouldn't be user-configured to check at common times, like on the hour.
Bram figured this out with BitTorrent: the server can instruct the client on when it should next check back.
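A client honoring such a server-specified interval might look like the sketch below. The JSON "interval" field is purely an assumption for illustration (borrowed from the BitTorrent tracker idea); RSS itself defines no such response format:

```python
import json

def next_check(response_body, default_interval=3600):
    """Honor a server-specified re-check interval, BitTorrent-tracker style.
    The 'interval' field name is an assumption, not part of any RSS spec."""
    try:
        interval = json.loads(response_body).get("interval", default_interval)
    except ValueError:
        interval = default_interval
    # Never hammer the server faster than once a minute, whatever it says.
    return max(60, int(interval))
```

The point is that the publisher, who knows its own update schedule and load, decides the interval, instead of every client guessing.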
Re:Simple HTTP Solution (Score:4, Informative)
Someone did a nice write-up about doing so [pastiche.org] back in 2002.
Re:Still haven't tried these newfangled RSS reader (Score:5, Informative)
Re:Still haven't tried these newfangled RSS reader (Score:3, Informative)
Re:Simple HTTP Solution (Score:3, Informative)
http://www.infoworld.com/rss/rss_info.html
Trying the top news feed, got back:
date -u ; curl --head http://www.infoworld.com/rss/news.xml
Tue Jul 20 19:51:44 GMT 2004
HTTP/1.1 200 OK
Date: Tue, 20 Jul 2004 19:48:30 GMT
Server: Apache
Accept-Ranges: bytes
Content-Length: 7520
Content-Type: text/html; charset=UTF-8
How do I write an RSS reader that only downloads this feed if the data has changed?
Jerry
Re:Still haven't tried these newfangled RSS reader (Score:2, Informative)
bbh
Publish/Subscribe (Score:5, Informative)
http://www.mod-pubsub.org/ [mod-pubsub.org]
The apache module mod_pubsub might be a solution.
From the mod_pubsub FAQ:
What is mod_pubsub?
mod_pubsub is a set of libraries, tools, and scripts that enable publish and subscribe messaging over HTTP. mod_pubsub extends Apache by running within its mod_perl Web Server module.
What's the benefit of developing with mod_pubsub?
Real-time data delivery to and from Web Browsers without refreshing; without installing client-side software; and without Applets, ActiveX, or Plug-ins. This is useful for live portals and dashboards, and Web Browser notifications.
Jabber also saw a publish/subscribe [jabber.org] mechanism as an important feature.
Re:Over the years? How about over the weekend? (Score:2, Informative)
The folks over at Netscape and/or UserLand should have studied the CDF [w3.org] standard first. Then they would have realized the value of specifying schedule information.
Common Sense? (Score:4, Informative)
I won't argue with those who have posted here that some alternative to the "pull" technology of RSS would be very useful. But...
The biggest problem I see isn't newsreaders but blogs. Somebody throws together a blog, inserts a little gizmo to display one of my feeds & then the page draws down the RSS every time the page is reloaded. Given the back-and-forth nature of a lot of folks' web browsing pattern, that means a single user might draw down one of my feeds 10-15 times in a 5 minute span. Now, why couldn't the blogger's software be set to load & cache a copy of the newsfeed according to a schedule?
The honorable mention for RSS abuse goes to the system administrator who set up a newsreader screen saver that pulled one of my feeds. He then installed the screen saver on every PC in every office of his company. Every time the screen saver activated, POW! one feed drawn down...
Re:What about a scheduler? (Score:3, Informative)
RSS already supports the <ttl> element type [harvard.edu], which indicates how long a client should wait before looking for an update. Additionally, HTTP servers can provide this information through the Expires header.
Furthermore, well-behaved clients issue a "conditional GET" that only requests the file if it has been updated, which cuts back on bandwidth quite a bit, as only a short response saying it hasn't been updated is necessary in most cases.
Re:Push, not pull! (Score:4, Informative)
The short version is that ICE is far more bandwidth efficient than RSS because:
- the syndicator and subscriber can negotiate whether to push or pull the content. So if the network allows for true push, the syndicator can push the updates, which is most efficient. This eliminates all of the "check every hour" that crushes RSS syndicators. And while many home users are behind NAT, web sites aren't, and web sites generate tons of syndication traffic that could be handled way more efficiently by ICE. Push means that there are many fewer updates transmitted, and that the updates that are sent are more timely.
- ICE supports incremental updates, so the syndicator can send only the new or changed information. This means that the updates that are transmitted are far more efficient. For example, rather than responding to 99% of updates with "here are the same ten stories I sent you last time" you can reply with a tiny "no new stories" message.
- ICE also has a scheduling mechanism, so you can tell a subscriber exactly how often you update (e.g. hourly, 9-5, Monday-Friday). This means that even for polling, you're not wasting resources being polled all night. This saves tons of bandwidth for people doing pull updates.
PulpFiction (Score:3, Informative)
I recommend PulpFiction for an RSS/Atom reader on OS X [freshsqueeze.com]. I much prefer the interface and how it treats the news compared to NNW.
Re:Still haven't tried these newfangled RSS reader (Score:2, Informative)
Re:Can't this be throttled? (Score:3, Informative)
I think the problem is the peak load: unfortunately, the RSS readers all download at the same time (they should be spread more uniformly within the minimum update period). This means you have to design your system to cope with the peak load, but then all that capacity sits idle the rest of the time.
The electricity production system has the same problem.
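One way to get that uniform spread is for each reader to derive its polling offset from some stable identifier instead of firing on the hour. A sketch; the hashing scheme is my own illustration, not anything RSS clients actually standardize on:

```python
import hashlib

def polling_offset(client_id, period=3600):
    """Spread clients uniformly across the polling period by hashing a
    stable client identifier to a per-client offset within the period."""
    digest = hashlib.sha256(client_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % period
```

Each client polls at its own fixed offset every hour, so the publisher sees a flat trickle instead of a top-of-the-hour spike.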
Re:Idea (Score:2, Informative)
Re:Oh, come on (Score:3, Informative)
Re:Not to flame... (Score:2, Informative)
IP can multicast, but it needs support from the network to do that. The problem is that the internet is not under one authority that can say "from today onwards, we do multicast in such and such a way." There have been experiments with multicasting (MBone), but some things cannot be solved easily: how do you register as a multicast client, and (the important part) how do you make every router from source to destination know about it and act accordingly? (Remember, those routers are NOT under the same authority.) So even though you could multicast with UDP/IP, logistics problems make it very difficult to do.
However, within an autonomous system (which IS under a single authority) you can multicast, provided the network supports it. In fact, both standard interior routing protocols (OSPF and RIP), as well as NTP, use multicast and have multicast groups assigned to them.
It's too bad, but that's how the real world is....
Re:Not to flame... (Score:1, Informative)
Imagine being able to send a ping with a forged return header to the IP address *.*.*.* and getting four billion replies sent to the person who owns the forged address.
Re:Simple HTTP Solution (Score:2, Informative)
Not all web servers provide Last-Modified or ETag headers. Infoworld doesn't, so even a well-written RSS reader has to pull down the whole feed, since it has no way to know whether it has changed.
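Absent those headers, about the best a reader can do is download the feed anyway and hash it, so at least it skips reparsing an unchanged feed. A sketch of that fallback (function and cache names are hypothetical):

```python
import hashlib

def feed_changed(body, seen_digests):
    """Fallback when a server sends no Last-Modified or ETag: hash the
    downloaded body and compare it to the digest from the previous fetch.
    This still pays for the full download, only the reprocessing is saved."""
    digest = hashlib.sha256(body).hexdigest()
    if seen_digests.get("last") == digest:
        return False
    seen_digests["last"] = digest
    return True
```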
Feed on Feeds (Web based) (Score:2, Informative)
It's a "PHP/MySQL server side RSS/Atom aggregator", so you can read your feeds wherever you are, you only need a web browser on the client side.
Pros:
1) you don't need to synchronize the state between the multiple workstations you might use.
2) no platform/os problem on the client side.
Cons:
1) you need some web hosting with PHP and MySQL available (I pay 45 a year for my domain name + 30MB Webspace + 30MB FTP + 30MB MySQL base + 100*25MB pop/imap accounts + SSL everywhere).
2) no installer, so you'll need some computing skills to set it up (not that hard).
3) no automated update; you have to click "Update", so you may miss some news when you're offline (i.e., away from any internet access) for a long period...
It changed my online life, as I no longer have to install anything on the client side (useful when away from your home/office) or synchronize my feeds either with removable storage (my USB key failed after 250+ daily syncs) or through the net (BottomFeeder [cincomsmalltalk.com], a Smalltalk implementation which works on every platform I've ever come across, allows syncing with an FTP location).
Regards,
Poulpy.
Re:Simple HTTP Solution (Score:2, Informative)
Pastiche knows when the document was last modified and can support my writing an RSS reader that checks Last-Modified:
curl --head http://fishbowl.pastiche.org/nerdfull.xml
HTTP/1.1 200 OK
Date: Tue, 20 Jul 2004 22:16:33 GMT
Server: Apache/1.3.26 (Unix) Debian GNU/Linux mod_gzip/1.3.19.1a mod_jk/1.1.0
Last-Modified: Mon, 19 Jul 2004 02:52:46 GMT
ETag: "28620-8faa-40fb377e"
Accept-Ranges: bytes
Content-Length: 36778
Content-Type: text/xml
But Infoworld does not. As far as I can tell from the headers I displayed in the previous post, Infoworld's server does not provide such data. Without Last-Modified or ETag or something similar, there is no way to make a conditional GET, because there is nothing to base the condition on; and most likely the server couldn't evaluate the condition anyway, since it clearly is not keeping track of when the document was last modified.
I could easily be getting the syntax wrong, but whenever I ask it to send the XML feed only if it has been modified in the last fraction of a second, I still get the page back:
date > datestamp; curl --time-cond datestamp http://www.infoworld.com/rss/news.xml
This returns a bunch of xml.
Running the same command on Pastiche's xml feed returns, as I would expect, absolutely nothing:
date > datestamp; curl --time-cond datestamp http://fishbowl.pastiche.org/nerdfull.xml
Jerry
Re:Not to flame... (Score:3, Informative)
There is a multicast overlay on top of the internet which consists of routers that can handle this load.
But the combination of no hardware/software support in the network, and no real huge push for this technology left multicast high and dry.
Brief idea of how multicast works:
1) A source sends out an "I have a multicast feed" message to its immediate routers. Those routers 'publish' this feed to their connected routers until every segment on the internet has seen the feed broadcast.
2) At the end points, individual computers see this message on their segment. They can subscribe to the feed by sending a message to their upstream router. This router places an entry in its table saying, "Someone on segment X wants feed Y, which I get from segment Z." It then sends a subscribe message to the router it got the original broadcast from, which does the same thing on upward until it hits the originating server.
3) Each router, when it sees a multicast packet, consults its table to see which (if any) segments it should forward the packet to. Eventually the packet makes its way to all the endpoints of the network.
4) The publish broadcast is repeated periodically. Each router also periodically checks its table for entries that haven't received a re-subscribe message since the last publish broadcast; if no one resubscribes, the entry is not refreshed. There is no unsubscribe: if you no longer want the feed, just ignore it, and it will go away once no one else on your segment wants it. Only one subscriber per segment needs to subscribe, so if both my co-worker and I want the feed and I see his subscribe packet before sending mine, I don't send mine at all, since the feed will be put on my segment anyway.
It's quite elegant, but when a router is dealing with 40+ Gbps of packets it barely has time to figure out where each packet goes, never mind statefully inspecting multicast packets and forwarding them appropriately. Not impossible, but it hasn't been rolled out, and few providers see any money in supporting it.
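The per-router state in steps 1 through 3 above can be sketched as a toy model; class and field names are illustrative, not taken from any real router implementation:

```python
class Router:
    """Toy model of per-router multicast state: which segment a feed
    arrives from, and which segments have subscribers for it."""

    def __init__(self, name):
        self.name = name
        self.table = {}  # feed -> {"upstream": segment, "downstream": set()}

    def publish(self, feed, upstream_segment):
        # Step 1: remember which segment a feed's broadcast arrived from.
        self.table.setdefault(feed, {"upstream": upstream_segment,
                                     "downstream": set()})

    def subscribe(self, feed, segment):
        # Step 2: record that someone on `segment` wants `feed`.
        if feed in self.table:
            self.table[feed]["downstream"].add(segment)

    def forward_to(self, feed):
        # Step 3: a multicast packet for `feed` is copied to these segments.
        entry = self.table.get(feed)
        return sorted(entry["downstream"]) if entry else []
```

The real difficulty, as noted, is doing these table lookups at line rate, not the bookkeeping itself.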
-Adam