When RSS Traffic Looks Like a DDoS
An anonymous reader writes "Infoworld's CTO Chad Dickerson says he has a love/hate relationship with RSS. He loves the changes to his information production and consumption, but he hates the behavior of some RSS feed readers. Every hour, Infoworld 'sees a massive surge of RSS newsreader activity' that 'has all the characteristics of a distributed DoS attack.' So many requests in such a short period of time are creating scaling issues." We've seen similar problems over the years. RSS (or as it should be called, "Speedfeed") is such a useful thing; it's unfortunate that it's ultimately just very stupid.
RSS maybe (Score:3, Funny)
Yesterday (Score:3, Interesting)
Re:Yesterday (Score:2, Interesting)
Re:Yesterday (Score:2)
Read about Scandal@Gmail [slashdot.org]
netcraft article (Score:5, Informative)
Can't this be throttled? (Score:3, Interesting)
Re:Can't this be throttled? (Score:3, Insightful)
Also, I doubt that the major problem here is bandwidth; it's more the number of requests the server has to deal with. RSS feeds are quite small (just text, most of the time). The server would still have to run that PHP script you suggest.
Hmmmm. (Score:2)
Re:Can't this be throttled? (Score:3, Informative)
I think the problem is the peak load: unfortunately, the RSS readers all download at the same time (they should be spread more uniformly within the minimum update period). This means you have to design your system to cope with the peak load, but then all that capacity sits idle the rest of the time.
The electricity production system has the same problem
Re:Can't this be throttled? (Score:5, Insightful)
Or rather, anyone that programs an RSS reader so horribly as to make it so that every client downloads information every hour on the hour would probably also barf on the input of a 500 or 404 error.
Most RSS readers *should* just download every hour from the time they start, making the download intervals between users more or less random and well-dispersed. And if you want it more often than every hour, well, then edit the source and compile it yourself.
Re:Can't this be throttled? (Score:2, Insightful)
What would really, really be effective would be a valid RSS feed that contained an error message in-line describing why your request was rejected. A few big sites doing this would rapidly get the rest of the users and clients to be updated.
Re:Can't this be throttled? (Score:5, Funny)
Re:Can't this be throttled? (Score:5, Insightful)
That's also a problem, though, since most people start work at their computer desks on the hour, or very close to it. The better solution would be for the client (1) to check once at startup, then (2) pick a random number between one and sixty (or thirty or whatever) and (3) start checking the feed, hourly, after that many minutes. That's the only way to ensure a decently random distribution of hits.
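The parent's three steps can be sketched in a few lines of Python. This is a hypothetical reader's scheduler; none of the names come from any real client:

```python
import random

def poll_times(start_minute, interval=60, count=4, rng=None):
    """Return the first `count` poll times, as minutes since midnight,
    for a client that (1) checks once at startup, (2) draws a random
    offset within the interval, and (3) polls hourly after that."""
    rng = rng or random.Random()
    times = [start_minute % (24 * 60)]      # (1) check once at startup
    offset = rng.randint(1, interval)       # (2) random 1..interval minutes
    t = start_minute + offset               # (3) then poll every interval
    while len(times) < count:
        times.append(t % (24 * 60))
        t += interval
    return times
```

With many clients each drawing their own offset, hits spread roughly uniformly across the hour instead of piling up at :00.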
Re:Can't this be throttled? (Score:5, Insightful)
Simple HTTP Solution (Score:3, Informative)
Re:Simple HTTP Solution (Score:5, Insightful)
This "optimization" will not have any long-lasting benefits. There are at least three variables in this equation:
1. Number of users
2. Number of RSS feeds
3. Size of each request
This optimization only addresses #3, which is the least likely to grow as time goes on.
Re:Simple HTTP Solution (Score:3, Insightful)
1. Number of users
2. Number of RSS feeds
3. Size of each request
And I'll add:
4. Time at which each request occurs
If RSS requests were evenly distributed throughout the hour, the problems would be minimal. When every single RSS reader assumes that updates should be checked exactly at X o'clock on the hour, you get problems.
Re:Simple HTTP Solution (Score:4, Informative)
Someone did a nice write-up about doing so [pastiche.org] back in 2002.
Re:Simple HTTP Solution (Score:3, Informative)
http://www.infoworld.com/rss/rss_info.html
Trying the top news feed, got back:
date -u ; curl --head http://www.infoworld.com/rss/news.xml
Tue Jul 20 19:51:44 GMT 2004
HTTP/1.1 200 OK
Date: Tue, 20 Jul 2004 19:48:30 GMT
Server: Apache
Accept-Ranges: bytes
Content-Length: 7520
Content-Type: text/html; charset=UTF-8
How do I write an RSS re
Re:Simple HTTP Solution (Score:4, Insightful)
Still haven't tried these newfangled RSS readers.. (Score:2, Interesting)
Re:Still haven't tried these newfangled RSS reader (Score:4, Interesting)
Re:Still haven't tried these newfangled RSS reader (Score:5, Informative)
Re:Still haven't tried these newfangled RSS reader (Score:3, Informative)
PulpFiction (Score:3, Informative)
I recommend PulpFiction for an RSS/Atom reader on OS X [freshsqueeze.com]. I much prefer the interface and how it treats the news compared to NNW.
Re:Still haven't tried these newfangled RSS reader (Score:2, Informative)
bbh
Call me stupid (Score:5, Informative)
Editorializing in the blurb (Score:3, Insightful)
Re:Editorializing in the blurb (Score:2, Funny)
Over the years? How about over the weekend? (Score:5, Informative)
And it seems to have gotten worse since the new code was installed: I get 503 errors at the top of every hour now on Slashdot.
Re:Over the years? How about over the weekend? (Score:2)
Re:Over the years? How about over the weekend? (Score:3, Interesting)
What about a scheduler? (Score:5, Interesting)
Of course, this depends on the client to respect the request, but we already have systems that do (robots.txt), and they seem to work fairly well, most of the time.
Re:What about a scheduler? (Score:2)
Re:What about a scheduler? (Score:5, Funny)
Related to this is the fact that most traffic accidents happen "on the twenties." Human nature is a curious and seemingly very predictable thing.
Re:What about a scheduler? (Score:3, Insightful)
I think most major cities have traffic reports more often than just on the 20s and 40s.
Re:What about a scheduler? (Score:3, Informative)
RSS already supports the <ttl> element type [harvard.edu], which indicates how long a client should wait before looking for an update. Additionally, HTTP servers can provide this information through the Expires header.
Furthermore, well-behaved clients issue a "conditional GET" that only requests the file if it has been updated, which cuts back on bandwidth quite a bit, as only a short response saying it hasn't been updated is necessary in most cases.
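A conditional GET is easy to sketch. Below is a hypothetical helper (the dict keys and function name are this sketch's own, not any standard API) that builds the validator headers a polite reader would send; a 304 Not Modified response then means "feed unchanged, keep your cached copy":

```python
def conditional_headers(cached):
    """Build HTTP conditional-GET headers from whatever validators were
    saved after the last successful fetch of the feed."""
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]        # server's ETag
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers

# A reader would then do something like:
#   req = urllib.request.Request(feed_url, headers=conditional_headers(cached))
# and treat a 304 response as "nothing new".
```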
RSS needs better TCP stacks (Score:4, Interesting)
The reason this needs better TCP stacks is that every open connection is stored in kernel memory. That's not necessary. Once you have the connecting IP, port, and sequence number, those should go into a database, to be pulled out later when the content has been updated.
-russ
Re:RSS needs better TCP stacks (Score:2)
Basically, you use the TCP connection as a subscription. Call it "repeated confirmation of opt-in" if you want. Every time the user re-connects to get the next update (which they will probably do immediately; may as well) that's an indication that th
Re:RSS needs better TCP stacks (Score:3, Insightful)
It's like saying, "Java is great, except let's make it compiled and platform-specific."
Re:RSS needs better TCP stacks (Score:3, Funny)
For starters, how about the readers play nice and spread their updates around a bit instead of all clamoring at the same time.
Re:RSS needs better TCP stacks (Score:2)
Correct me if I'm wrong.
Re:RSS needs better TCP stacks (Score:5, Insightful)
Leaving thousands upon thousands of connections open on the server is a terrible idea no matter how well-implemented the TCP stack is. The real solution is to use some sort of distributed mirroring facility so everyone could connect to a nearby copy of the feed and spread the load. The even better solution would be to distribute asynchronous update notifications as well as data, because polling always sucks. Each client would then get a message saying "xxx has updated, please fetch a copy from your nearest mirror" only when the content changes, providing darn near optimal network efficiency.
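The parent's "notify, then fetch from a mirror" idea can be sketched in-process (a toy with invented names; a real system would push these messages over the network):

```python
class UpdateHub:
    """Toy sketch of asynchronous update notification: clients register a
    callback, and when the feed changes the hub pushes one small message
    pointing at a mirror, instead of being polled all day."""
    def __init__(self, feed_name):
        self.feed_name = feed_name
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish_update(self, version, mirrors):
        # One tiny message per subscriber, sent only when content changes.
        note = {"feed": self.feed_name, "version": version,
                "fetch_from": mirrors[0] if mirrors else None}
        for cb in self.subscribers:
            cb(note)
```

The payload is just "xxx has updated, fetch from here"; the bulk transfer happens against the nearest mirror, not the origin.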
Re:RSS needs better TCP stacks (Score:3, Interesting)
I'd like to hear one person, just one person say "Hmmm.... I wonder why Russ didn't suggest asynchronous update notifications?" And then maybe go on to answer themselves by sa
Wrong target, but good solution (Score:2)
Reimplementing TCP using a database is excessive. Making a light connectionless protocol that does similar to what you described would be a lot simpler and not require reimplementing everyone's TCP stack.
Also, as much as I hate the fad of labelling everything P2P, having a P2P-ish network for this would help, too. The original server can just hand out MD5's, and clients propagate the actual text throughout the network.
Of course (and this relates to the P2P stuff), every newfangled toy these days is ju
Re:RSS needs better TCP stacks (Score:2, Insightful)
Re:RSS needs better TCP stacks (Score:2)
Scheduling (Score:2)
Re:Scheduling (Score:2)
Idea (Score:5, Interesting)
Re:Idea (Score:5, Interesting)
I feel an overlay-network scheme would work better than a BitTorrent/tracker-based system. In an overlay network, each group of networks has its own ultrapeer (a JXTA rendezvous) which acts as the tracker for all files in that network. I wanted to do this for the Slashdot effect (p2pbridge.sf.net), but somehow the project has been delayed for a long time.
Re:Idea (Score:2)
One hour interval? (Score:2, Insightful)
Or did the RSS reader authors hope that their applications wouldn't be used by anybody except for a few geeks?
Won't help (Score:2, Interesting)
We have way too much traffic from dumb P2P schemes today, considering the relatively small volume of new content being distributed.
Re:One hour interval? (Score:2)
You mean RSS readers are programmed to fetch the feed at hour xx:00??
That's fantastic
Some programmers should be shot...
Re:One hour interval? (Score:2)
One good idea would be for the protocols to allow each feed to suggest a default refresh rate. That way slow changing or overloaded sites could ask readers to slow down a little. A minimum refresh warning rate would be good too. (i.e. Refreshing faster than that rate might get you nuked.) I k
"it's the connection overhead, stupid" (Score:5, Informative)
...is what one would say to the designers of RSS.
Mainly: even IF your client is smart enough to communicate that it only needs part of the page, guess what? The pages are tiny, especially after gzip compression (which, e.g. with mod_gzip, can be done ahead of time). The real overhead is all the nonsense, both at the protocol level and for the server in terms of CPU time, of opening and closing a TCP connection.
It's also the fault of the designers for not including strict rules as part of the standard for how frequently the client is allowed to check back, and, duh, the client shouldn't be user-configured to check at common times, like on the hour.
Bram figured this out with BitTorrent- the server can instruct the client on when it should next check back.
Re:"it's the connection overhead, stupid" (Score:2)
-russ
Re:"it's the connection overhead, stupid" (Score:2)
Inevitably the most popular clients are the most poorly-written ones which ignore as much of the spec as possible. Telling them what they should do is useless, because they don't listen.
As an example, consider all the broken BitTorrent implementations out there.
it's the PULL,stupid (Score:4, Interesting)
Err, did I miss the meeting where that was declared as the Web's original promise?
Anyway, the trouble is pretty obvious: RSS is just a polling mechanism to do fakey Push. (Wired had an interesting retrospective on their infamous "PUSH IS THE FUTURE" hand cover about PointCast.) And that's expensive: the cyber equivalent of a horde of screaming children asking "Are we there yet? Are we there yet? How about now? Are we there yet now? Are we there yet?" It would be good if we had an equally widely used "true Push" standard, where remote clients would register as listeners, and then the server could actually publish new content to the remote sites. However, in today's heavily firewall'd internet, I dunno if that would work so well, especially for home users.
I dunno. I kind of admit to not really grokking RSS, for me, the presentation is too much of the total package. (Or maybe I'm bitter because the weird intraday format that emerged for my own site [kisrael.com] doesn't really lend itself to RSS-ification...)
it's not stupid (Score:2)
Proposed Solution (Score:2, Interesting)
As a self-appointed representative of RSS, ... (Score:2)
I'd really, really like to.
Obviously, I can't, but boy would I like to.
Stupid RSS.
Server side and client side fixes. (Score:2)
On the client side, the software needs to be written to check whether the data has been updated before pulling it. This will lessen the burden.
The
random check intervals? (Score:2, Insightful)
User starts the program at 3:15 and it checks the RSS feed.
User sets the check interval to 1 hour.
rand()%60 minutes later (let's say 37), it checks the feed.
Every hour after that, it checks the feed.
Simplistic, sure, but isn't RSS in general?
As an aside: any of you (few) non-programmers interested in creating RSS feeds, I put out some software that facilitates it.
hunterdavis.com/ssrss.html
Push, not pull! (Score:5, Interesting)
That's just plain retarded.
What they *should* do...
1) Content should be pushed from the source, so only *necessary* traffic is generated. It should be encrypted with a certificate so that clients can be sure they're getting content from the "right" server.
2) Any RSS client should also be able to act as a server, NTP style. Because of the certificate used in #1, this could be done easily while still ensuring that the content came from the "real" source.
3) Subscription to the RSS feed could be done on a "hand-off" basis. In other words, a client makes a request to be added to the update pool on the root RSS server. It either accepts the request, or redirects the client to one of its already-set-up clients. Whereupon the process starts all over again. The client requests subscription to the service, and the request is either accepted or deferred. Wash, rinse, repeat until the subscription is accepted.
The result of this would be a system that could scale to just about any size, easily.
Anybody want to write it? (Unfortunately, my time is TAPPED!)
Re:Push, not pull! (Score:3, Interesting)
Too many upstream bandwidth restrictions, especially on home connections. Last thing people want is getting AUPped because they're mirroring slashdot headlines.
My solution? Multicast IPs. Multicast IPs solve every problem that's ever been encountered by mankind. Join the multicast group, listen till you've heard all the headlines (which repeat ad nauseam), move on with life. Heck, keep listening if ya want. All we have to do is make it work.
Frank
Re:Push, not pull! (Score:4, Informative)
The short version is that ICE is far more bandwidth efficient than RSS because:
- the syndicator and subscriber can negotiate whether to push or pull the content. So if the network allows for true push, the syndicator can push the updates, which is most efficient. This eliminates all of the "check every hour" that crushes RSS syndicators. And while many home users are behind NAT, web sites aren't, and web sites generate tons of syndication traffic that could be handled way more efficiently by ICE. Push means that there are many fewer updates transmitted, and that the updates that are sent are more timely.
- ICE supports incremental updates, so the syndicator can send only the new or changed information. This means that the updates that are transmitted are far more efficient. For example, rather than responding to 99% of updates with "here are the same ten stories I sent you last time" you can reply with a tiny "no new stories" message.
- ICE also has a scheduling mechanism, so you can tell a subscriber exactly how often you update (e.g. hourly, 9-5, Monday-Friday). This means that even for polling, you're not wasting resources being polled all night. This saves tons of bandwidth for people doing pull updates.
Re:Push, not pull! (Score:2, Insightful)
"Push" is dead. "Push" was stillborn. The very climate w.r.t. internet security is not disposed to "hey, let's let remote servers push stuff into our network!"
I seem to remember... (Score:5, Interesting)
Comment removed (Score:5, Interesting)
Re:Oh, come on (Score:3, Informative)
It just ain't broadcast.. (Score:5, Interesting)
There used to be a system where you could pull a list of recently posted articles off of a server that your ISP had installed locally, get only the newest headers, and then decide which article bodies to retrieve. The articles could even contain rich content, like HTML and binary files. And to top it off, articles posted by someone across the globe were transmitted from ISP to ISP, spreading over the world like an expanding mesh.
They called this.. USENET..
I realize that RSS is "teh hotness" and Usenet is "old and busted", and that "push is dead" etc. But for Pete's sake, don't send a unicast protocol to do a multicast (even if it is at the application layer) protocol's job!
It would of course be great if there were a "cache" hierarchy on Usenet. Newsgroups could be styled after content providers' URLs (e.g. cache.com.cnn, cache.com.livejournal.somegoth) and you could just subscribe to crap that way. There's nothing magical about what RSS readers do that requires the underlying stuff to be all RSS-y and HTTP-y.
For real push you could even send the RSS via SMTP, and you could use your ISPs outgoing mail server to multiply your bandwidth (i.e. BCC).
Re:It just ain't broadcast.. (Score:5, Insightful)
You make some very good points. The old saying "When all you have is a hammer, everything looks like a nail" seems to ring true time and time again. These days it seems that everyone wants to use HTTP for everything and quite frankly it's not equipped to do that.
RSS over SMTP sounds pretty cool. Heck, just sending a list of subscribers an email of RSS and let their mail clients sort it out would be pretty nice.
Heh, my favorite posts are when someone suggests something that sounds totally novel and then someone else points out "Yeah! Like <insert old and underused technology>. It seems to do that damn well." The internet cannot forget its roots!
Re:It just ain't broadcast.. (Score:3, Insightful)
2) The RSS developer community can't picture themselves using anything except HTTP. I've tried mentioning other protocols to them; they don't respond.
3) NOBODY MAKE
Too many problems. (Score:2)
You know what would be nice.. (Score:3, Interesting)
RSS+DNS (Score:2)
This would be an ideal solution, since most RSS feeds are a few K. There's room for a lot of RSS in 1 megabyte.
Of course, a caching proxy server would do the same thing.
Speedfeed? (Score:2)
Yeah, it is stupid, which is why most of us just call it RSS.
RSS is like a DDoS attack on my brain (Score:5, Interesting)
Re:RSS is like a DDoS attack on my brain (Score:3, Interesting)
DNS Polling? (Score:2)
The concept is that every time you updated your blog, you'd do a Dynamic DNS push to a RSS name, say, rss.www.slashdot.org's TXT record, containing the Unix time in seconds of the last update (alternatively, and this is how I'd probably implement it in my custom server, lookups to rss.www.slashdot.org would cause a date-check on the entry). The TTL of
Re:DNS Polling? (Score:2)
--Dan
If-Modified-Since and ETag headers (Score:2)
Similar problems with other stuff? (Score:2)
RSS scalability (Score:2)
But the primary solution will end up being caching. With the exception
Wasn't this the whole point of Konspire2b? (Score:2)
a blog with unlimited bandwidth
Blogs are software systems that allow you to easily post a series of documents to your website over time. Many people use blogs to display daily thoughts, rants, news stories, or pictures. If you run a blog, your readers can return to your site regularly to see the new content that you have posted. Before blogs came along, maintaining a website (and updating it regularly) was a relatively tedious process. Some might call blogging a social revolution--
Wrong solution to wrong problem (Score:2)
I have what, to my 10 minutes of thought on the subject, appears to be a better solution: every web site that currently publishes an RDF page should instead push new entries to an NNTP newsgroup. I'd suggest that a hierarchy be created for it, with the group name being sort of a reverse of the URL, like rdf.org.slashdot or rdf.uk.co.theregister. Then the articles get propagated in a distributed manner and
RSS doesn't scale (Score:2)
Read this [stevex.org] for some more thoughts on this..
Publish/Subscribe (Score:5, Informative)
http://www.mod-pubsub.org/ [mod-pubsub.org]
The apache module mod_pubsub might be a solution.
From the mod_pubsub FAQ:
What is mod_pubsub?
mod_pubsub is a set of libraries, tools, and scripts that enable publish and subscribe messaging over HTTP. mod_pubsub extends Apache by running within its mod_perl Web Server module.
What's the benefit of developing with mod_pubsub?
Real-time data delivery to and from Web Browsers without refreshing; without installing client-side software; and without Applets, ActiveX, or Plug-ins. This is useful for live portals and dashboards, and Web Browser notifications.
Jabber also saw a publish/subscribe [jabber.org] mechanism as an important feature.
Common Sense? (Score:4, Informative)
I won't argue with those who have posted here that some alternative to the "pull" technology of RSS would be very useful. But...
The biggest problem I see isn't newsreaders but blogs. Somebody throws together a blog, inserts a little gizmo to display one of my feeds & then the page draws down the RSS every time the page is reloaded. Given the back-and-forth nature of a lot of folks' web browsing pattern, that means a single user might draw down one of my feeds 10-15 times in a 5 minute span. Now, why couldn't the blogger's software be set to load & cache a copy of the newsfeed according to a schedule?
The honorable mention for RSS abuse goes to the system administrator who set up a newsreader screen saver that pulled one of my feeds. He then installed the screen saver on every PC in every office of his company. Every time the screen saver activated, POW! One feed drawn down...
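The schedule-based caching the parent asks for — blog software keeping a local copy of the feed instead of re-fetching it on every page load — can be sketched like this (class and parameter names are made up for illustration; the fetch function is injected):

```python
import time

class FeedCache:
    """Cache one upstream feed so that page renders reuse a local copy
    instead of hitting the origin server on every reload."""
    def __init__(self, fetch, ttl=1800):
        self.fetch = fetch        # callable returning the raw feed text
        self.ttl = ttl            # seconds to keep serving the cached copy
        self._body = None
        self._fetched_at = 0.0

    def get(self, now=None):
        """Return the feed body, re-fetching only when the TTL expires."""
        now = time.time() if now is None else now
        if self._body is None or now - self._fetched_at >= self.ttl:
            self._body = self.fetch()
            self._fetched_at = now
        return self._body
```

Ten page reloads in five minutes then cost the feed publisher one request, not ten.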
Not to flame... (Score:4, Interesting)
Re:Not to flame... (Score:3, Informative)
There is a multicast overlay on top of the internet which consists of routers that can handle this load.
But the combination of no hardware/software support in the network, and no real huge push for this technology left multicast high an
Solution: HTTP 503 Response for Flow Control (Score:4, Insightful)
One completely backwards-compatible fashion to add flow-control to RSS would be to use the HTTP 503 response when server load is getting too high for your RSS files. The server simply sends an HTTP 503 response with a Retry-After header indicating how long the requesting client should wait before retrying.
Clients that ignore the retry interval or are overly aggressive could be punished by further 503 responses thus basically denying those aggressive clients access to the RSS feeds. Users of overly aggressive clients would soon find that they actually provide less fresh results and would place pressure on implementors to fix their implementations.
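On the client side, honoring the 503 could be as simple as this sketch (function name and defaults are hypothetical):

```python
def next_poll_delay(status, retry_after=None, interval=3600, cap=86400):
    """Seconds a polite client should wait before its next poll.
    On a 503, honor the server's Retry-After header (delta-seconds form);
    otherwise fall back to the normal polling interval."""
    if status == 503 and retry_after is not None:
        try:
            return min(int(retry_after), cap)   # never wait absurdly long
        except ValueError:
            return interval  # Retry-After may also be an HTTP-date; punt
    return interval
```

A server under load sends `503` plus `Retry-After: 120`, and well-behaved readers back off; the ones that don't can keep getting 503s.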
Told Ya So (Score:4, Interesting)
Yes, it's "cool" that I can set up a page (or now use a browser plug-in) to automatically get a lot of content from hundreds of web pages at a time when I really opened up the browser to check my e-mail.
What would have REALLY been cool would be some sort of technology that would notify me when something CHANGED. No effort on my part, no *needless* effort on the server's part.
Oh wait... we HAD that, didn't we? I think they were called listservers, and they worked just fine. (Still do, actually, as I get a number of updates, including Slashdot, that way.) RSS advocates (and I won't mention any names) keep making pronouncements like "e-mail is dead!" simply because they have gotten themselves and their hosting companies on some black-hole lists. Cry me a river now that your bandwidth costs are going through the roof and yet nobody is clicking through on your web page ads, because, guess what? Nobody is visiting your page. They have all they need to know about your updates via your RSS feeds.
Re:RHS (Score:2)
Re:RHS (Score:2)
Re:RHS (Score:3, Funny)
That's nothing compared to RMS, which (according to RMS) stands for GNU/Recursive Meta-Syndication.
Re:Revision of the Standard (Score:2, Interesting)
Re:Are they sure (Score:2)
Yes, of course! That's it! They should have known better than to run Infoworld off a 286 and a DSL conx in some guy's basement.