How Much Bandwidth is Required to Aggregate Blogs?

How Much Bandwidth is Required to Aggregate Blogs? 209

Posted by CmdrTaco on Sunday August 14, 2005 @06:28PM from the more-than-a-little dept.

Kevin Burton writes "Technorati recently published that they're seeing 900k new posts per day. PubSub says they're seeing 1.8M. With all these posts per day how much raw bandwidth is required? Due to innefficiencies in RSS aggregation protocols a little math is required to understand this problem." And more importantly, with millions of posts, what percentage of them have any real value, and how do busy people find that .001%?

How Much Bandwidth is Required to Aggregate Blogs?

This discussion has been archived. No new comments can be posted.

Search 209 Comments Log In/Create an Account

Comments Filter:

All at once (Score:5, Interesting)

by someonewhois ( 808065 ) * writes: on Sunday August 14, 2005 @06:30PM (#13318003) Homepage

It would make a lot more sense to have a protocol where you check one file that has a list of links to another XML file, and then the aggregator figures out which of those URLs has NOT been aggregated, then it downloads the other XML file which has the post-specific info, which it proceeds to display. That would save a lot of bandwidth, I'm sure.

Comment removed (Score:5, Interesting)

by account_deleted ( 4530225 ) writes: on Sunday August 14, 2005 @06:31PM (#13318011)

Comment removed based on user account deletion

Don't forget the robots (Score:5, Interesting)

by astrashe ( 7452 ) writes: on Sunday August 14, 2005 @06:33PM (#13318024) Journal

I used to have a blog that I recently shut down because no one read it.

No one read it, but I got a ton of hits -- all from indexing services. WordPress pings a service that lets lots of indexing systems know about new posts. Some of them -- Yahoo, for example, were contstantly going through my entire tree of posts, and hitting links for months, subjects, and so on.

It didn't bother me, because the bandwidth wasn't an issue, and it wasn't like they were hammering my vps or anything. It mostly just made it really hard to read the logs, because finding human readers was like looking for a needle in a haystack.

But bandwidth is cheap, and RSS is really useful, so it seems at least as good of a use for the resource as p2p movie exchanges.

Rather than assuming... (Score:5, Interesting)

by llZENll ( 545605 ) writes: on Sunday August 14, 2005 @06:33PM (#13318026)

Rather than a making all these assumptions why not just email Bob Wyman and ask him?

"How much data is this? If we assume that the average HTML post is 150K this will work out to about 135G. Now assuming we're going to average this out over a 24 hour period (which probably isn't realistic) this works out to about 12.5 Mbps sustained bandwidth.

Of course we should assume that about 1/3 of this is going to be coming from servers running gzip content compression. I have no stats WRT the number of deployed feeds which can support gzip (anyone have a clue?). My thinking is that this reduce us down to about 9Mbps which is a bit better.

This of course assumes that you're not fetching the RSS and just fetching the HTML. The RSS protocol is much more bloated in this regard. If you have to fetch 1 article from an RSS feed your forced to fetch the remaining 14 addition posts that were in the past (assuming you're not using the A-IM encoding method which is even rarer). This floating window can really hurt your traffic. The upside is that you have to fetch less HTML.

Now lets assume you're only fetching pinged blogs and you don't have to poll (polling itself has a network overhead). The average blog post would probably be around 20k I assume. If we assume the average feed has 15 items, only publishes one story, and has a 10% overhead we're talking about 330k per fetch of an individual post.

If we go back to the 900k posts per day figure we're talking a lot of data - 297G most of which is wasted. Assuming gzip compression this works out to 27.5Mbps.

Thats a lot of data and a lot of bloat which is unnecessary. This is a difficult choice for smaller aggregator developers as this much data costs a lot of money. The choice comes down to cheap HTML index ing with the inaccuracy that comes from HTML or accurate RSS which costs 2.2x more.

Update: Bob Wyman commented that he's seeing 2k average post size with 1.8M posts per day. If we are to use the same metrics as above this is 54G per day or around 5Mbps sustained bandwidth for RSS items (assuming A-IM differentials aren't used)."

Re:Bandwidth wasted for non-xhtml pages? (Score:5, Interesting)

by A beautiful mind ( 821714 ) writes: on Sunday August 14, 2005 @06:44PM (#13318079)

Normally you would be right, but now you're banging open doors. CmdrTaco and others are actively working on a new CSS-using formatting of slashdot.

Slashdot = blog = ironic (Score:4, Interesting)

by Lovejoy ( 200794 ) writes: <danlovejoy AT gmail DOT com> on Sunday August 14, 2005 @06:45PM (#13318082) Homepage

Does anyone else wonder why Slashdot editors seem to have it in for blogs? Is it because in Internet years, Slashdot is as old and sclerotic as the Dinomedia? Is Slashdot the Dinomedia of the new media?

Does anyone else consider it ironic that the Slashdot editorship HATES blogs, but Slashdot is actually a blog?

Anyone else getting tired of these questions?

Re:Bandwidth wasted for non-xhtml pages? (Score:5, Interesting)

by A beautiful mind ( 821714 ) writes: on Sunday August 14, 2005 @06:50PM (#13318113)

Oh yea, here is the link [slashdot.org] about it.

Re:All at once (Score:2, Interesting)

by G-Licious! ( 822746 ) writes: on Sunday August 14, 2005 @07:01PM (#13318166) Homepage

I don't think you need a list of links or even a separate file. An easier solution might be to just pass a format string in a separate link-tag on the html page announcing the feed. For example, right now we have: (taken straight form the linked article)

<link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.feedblog.org/atom.xml" /> <link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.feedblog.org/index.rdf" />

And we could introduce a new relationship type, say "recent-feed", with a strftime-like format string:

<link rel="recent-feed" type="application/atom+xml" title="Atom" href="http://www.feedblog.org/atom.xml?date=%Y-%m- %d&time=%H:%M:%S" /> <link rel="recent-feed" type="application/rss+xml?date=%Y-%m-%d&time=%H:%M :%S" title="RSS" href="http://www.feedblog.org/index.rdf" />

Ofcourse, that'd require the blog feed to be a dynamic page of some sort (PHP, Python, Ruby, Perl, whatever..), but that shouldn't be a problem; I can't think of a single blog with a bandwidth problem that is using static pages.

Value (Score:5, Interesting)

by lakin ( 702310 ) writes: on Sunday August 14, 2005 @07:01PM (#13318167)

what percentage of them have any real value

I had for a while held the view that most blogs out there are pointless. Some can be insightful and some are basically used as company press releases, but most are people talking about their days activities that few people really care about, and a few of my friends have blogs like these. When I asked one whats the point, she said she just blogs stuff she would normally mention to many people on msn throughout the day. Its not meant to have value to anyone on slashdot, be hugely insightful, or detail some breathtaking new hack, its simply another way for her to talk to friends (that doesnt involve repeating herself).

Re:How much? If everyone GZipped, a lot less! (Score:5, Interesting)

by ZorbaTHut ( 126196 ) writes: on Sunday August 14, 2005 @07:06PM (#13318193) Homepage

As I remember, www.livejournal.com has experimented with gzip compression several times. They've discovered that the price of the CPU far exceeds the price of the bandwidth.

Bandwidth is cheap. Computers, not so much.

Re:All at once (Score:4, Interesting)

by broward ( 416376 ) writes: <browardhorne@@@gmail...com> on Sunday August 14, 2005 @07:21PM (#13318229) Homepage

The bandwidth isn't going to matter much.

The blog wave is close to an inflection point,
probably within six to twelve months...
which means that total bandwidth will probably
top out at about TWICE the current rate.

http://www.realmeme.com/Main/miner/preinflection/b logDejanews.png [realmeme.com]

I suspect that even now, many blogs are
starved for readership as new blogs come online
and steal mental bandwidth.

Re:Rod Serling says (Score:1, Interesting)

by Anonymous Coward writes: on Sunday August 14, 2005 @07:38PM (#13318284)

that if Google hadn't destroyed Usenet we'd not have all these goofy blogs and millions of people trying to make a dime off of them.

Re:How much? If everyone GZipped, a lot less! (Score:3, Interesting)

by grcumb ( 781340 ) writes: on Sunday August 14, 2005 @08:13PM (#13318441) Homepage Journal

"Compared to keeping a connection state, gzipping is _way_ more expensive. I find it very hard to believe that there is a case where keeping the connection longer was more expensive than gzipping the content."

I'm prone to agree. But I also suspect that my CTO is going to agree that it's cheaper to pay once for more processing power than it is to pay every day for higher bandwidth use. YMMV, of course. Bandwidth is relatively cheap in some parts of the US, but in other parts of the world it's hideously expensive.

In short, I agree with your conclusion, but I think that the GP is right, if not for the reasons he provided. In some cases it actually does make sense to cope with a little less efficiency in one part of the system than it is to cope with constantly higher costs in another.

semi off topic (Score:3, Interesting)

by cookiepus ( 154655 ) writes: on Sunday August 14, 2005 @08:26PM (#13318502) Homepage

Since we're on the subject of blog aggregation, can someone recomend a GOOD way to aggregate?

Every single RSS aggregator I've come across treats my RSS world similar to an e-mail reader, where each blog is a 'folder' and each entry is equivalent to an e-mail.

This is decidedly NOT what I want and I don't understand why everyone's writing the same thing.

My friend is running PLANET, which builds a frontpage out of the RSS feeds (looks kind of like the slasdot frontpage where adjacent stores come from different sources and are sorted in chronolocial order (newest on top)

PLANET seems to be a server-side implementation. My buddy's running Linux and he made a little page for me but it's not right for me to bug him every time I want to add a feed.

Is there anything like what I want that would run on Windows? And if not, why the heck not?

By the same token, why doesn't del.icio.us have any capacity to know when my links have been updated?

For what it's worth, here's my del.icio.us BLOGS area with some blogs I find good.

http://del.icio.us/eduardopcs/BLOG [del.icio.us]

Re:If you poll, at least do it well... (Score:2, Interesting)

by bobwyman ( 175558 ) writes: on Monday August 15, 2005 @12:55AM (#13319332) Homepage

Baricom: What you're looking for is the "cloud" interface defined at: http://blogs.law.harvard.edu/tech/soapMeetsRss [harvard.edu]
The documentation there is, I think, about as good as you'll find. While it says that it can be implemented in either XML-RPC or SOAP, I am aware only of XML-RPC implementations.

The cloud provides a means for blogs to notify subscribers of updates and should eliminate the need for polling -- except that the subscriptions must be renewed at least every 25 hours. Of course, this cloud stuff isn't terribly useful in most cases since it relies on the blog server being able to send an HTTP message to a remote client (subscriber). In most cases, those messages would be blocked by firewalls. This is, of course, why the "Atom over XMPP" [xmpp.org] stuff makes sense. It relies on a connection established from the client to the server -- in the same manner as is done with instant messaging clients. Thus, there are many fewer issues with firewalls.

Of course, having lots of session open between a client program and all of the various blogs it reads probably doesn't make much sense. Neither does it make sense for every blog to maintain a list of all of its "cloud" readers and go to the work of sending them all messages whenever the blog is updated. Thus, the most sensible way to do this push business is to have the individual blogs publish to a common network of aggregating servers and then have clients establish connections to the common service. Overall bandwidth consumption is thus reduced to the absolute minimum. That's what we're building at PubSub.com.

bob wyman

Miski Client-Server-Server-Client protocol (Score:2, Interesting)

by Philip Dorrell ( 804510 ) writes: on Monday August 15, 2005 @04:18AM (#13319817) Homepage Journal
As I explained (as long ago as 2000) in Miski: A White Paper [archive.org], we need a system with the following features:
- Each producer of link suggestions has a unique address, something like channel/user@example.com. (This implies resolution via DNS, but probably people will end up using the URL of an XML file.)
- The channel address points to the producer's server.
- The subscriber to a channel tells their server to subscribe to the channel. The subscriber's server talks to the producer's server.
- When the producer makes a new link suggestion, their client pushes it to their server, which pushes it to all the servers whose subscribers have subscribed to the channel.
- Each server pushes the link suggestion to their clients (by whatever means).
The pattern of client to server to server to client is a bit like the architecture of email, but it is quite spam-proof because you only ever receive what you asked for.
Additionally, subscribers can instantly "repost" a suggestion to their own channel, which will be read by their subscribers. To avoid reading duplicate posts, servers will optionally filter out duplicates. However, this has a major consequence, which is that subscribers are only ever guaranteed to see the URL, which means that anything you want to say about the content of a new page has to go into the URL. The current system of RSS titles and descriptions will not work under reposting and duplicate filtering.
The combination of real-time pushing and reposting could lead to a speeded up Internet, where exciting new ideas spread from one user to the next in a matter of minutes, without having to go through the bottlenecks of centralised attention and popular websites (such as Slashdot). This could be enough to turn the Internet into a "Global Brain", and perhaps even trigger the Technological Singularity.
I invented Miski to solve the problem of getting people to take notice of new ideas without having to engage in a massive publicity effort, but unfortunately I've failed to get anyone to take any notice of the Miski idea.
Re:How much? If everyone GZipped, a lot less! (Score:2, Interesting)

by JPDeckers ( 559434 ) writes: on Monday August 15, 2005 @04:50AM (#13319891) Homepage

Another nice and strange problem is that IE totally ignores ETag headers on gzipped pages (it does not send a If-None-Matched header back).
So effectively IE requests each and every page again if it's gzipped.
Nice to know that this bandwidthreduction-solution has the opposite effect...
See my blog [blogspot.com] for more info.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

How Much Bandwidth is Required to Aggregate Blogs? 209

How Much Bandwidth is Required to Aggregate Blogs? More Login

How Much Bandwidth is Required to Aggregate Blogs?

All at once (Score:5, Interesting)

Comment removed (Score:5, Interesting)

Don't forget the robots (Score:5, Interesting)

Rather than assuming... (Score:5, Interesting)

Re:Bandwidth wasted for non-xhtml pages? (Score:5, Interesting)

Slashdot = blog = ironic (Score:4, Interesting)

Re:Bandwidth wasted for non-xhtml pages? (Score:5, Interesting)

Re:All at once (Score:2, Interesting)

Value (Score:5, Interesting)

Re:How much? If everyone GZipped, a lot less! (Score:5, Interesting)

Re:All at once (Score:4, Interesting)

Re:Rod Serling says (Score:1, Interesting)

Re:How much? If everyone GZipped, a lot less! (Score:3, Interesting)

semi off topic (Score:3, Interesting)

Re:If you poll, at least do it well... (Score:2, Interesting)

Miski Client-Server-Server-Client protocol (Score:2, Interesting)

Re:How much? If everyone GZipped, a lot less! (Score:2, Interesting)

Related Links Top of the: day, week, month.

Slashdot Top Deals

Slashdot