How Much Bandwidth is Required to Aggregate Blogs? 209
Kevin Burton writes "Technorati recently published that they're seeing 900k new posts per day. PubSub says they're seeing 1.8M. With all these posts per day, how much raw bandwidth is required? Due to inefficiencies in RSS aggregation protocols, a little math is required to understand this problem." And more importantly, with millions of posts, what percentage of them have any real value, and how do busy people find that .001%?
All at once (Score:5, Interesting)
Re:All at once (Score:5, Insightful)
Re:All at once (Score:2)
Re:All at once (Score:2)
Re:All at once (Score:2, Interesting)
I don't think you need a list of links or even a separate file. An easier solution might be to just pass a format string in a separate link tag on the HTML page announcing the feed. For example, right now we have: (taken straight from the linked article)
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.feedblog.org/atom.xml" />
<link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.feedblog.org/index.rdf" />
And we could introduce a new r
Re:All at once (Score:4, Interesting)
The blog wave is close to an inflection point,
probably within six to twelve months...
which means that total bandwidth will probably
top out at about TWICE the current rate.
http://www.realmeme.com/Main/miner/preinflection/
I suspect that even now, many blogs are
starved for readership as new blogs come online
and steal mental bandwidth.
Re:All at once (Score:3, Informative)
How much? If everyone GZipped, a lot less! (Score:5, Insightful)
So, my plea to the internet community today: make sure your web server is configured to send gzipped content. TFA says he doesn't know how many RSS feeds can support gzip. The answer is easy, really: any feed being served by Apache (plus a LOT of other web servers; AOLserver even added gzip support recently). Here's how to set up Apache [whatsmyip.org] and here's where to check [whatsmyip.org] whether your site is using gzip and get an idea of the bandwidth savings you should see. If your site isn't gzipping, show your admin (if it's someone else) the how-to above and ask them to implement it -- it's an absolute no-brainer win-win for everyone, and it takes almost no time to set up. It's really absurd, IMO, that it's not enabled in Apache by default.
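If you want a feel for how much gzip buys you on a typical feed, Python's standard gzip module gives a quick estimate. The feed content below is mocked up purely for illustration; real savings depend on your actual markup, but XML's repetitive tags compress very well:

```python
import gzip

# A mocked-up RSS item: verbose XML with lots of repeated tag names.
rss_item = (
    '<item><title>Post title</title>'
    '<link>http://example.org/post</link>'
    '<description>Some post body text goes here.</description></item>'
)
feed = '<rss><channel>' + rss_item * 15 + '</channel></rss>'

raw = feed.encode('utf-8')
compressed = gzip.compress(raw)

print(f'raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes')
print(f'savings: {100 * (1 - len(compressed) / len(raw)):.0f}%')
```

Run this against a copy of your own feed's XML to see what your server is leaving on the table if it isn't compressing.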
Re:How much? If everyone GZipped, a lot less! (Score:4, Insightful)
Re:How much? If everyone GZipped, a lot less! (Score:3, Insightful)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:3, Interesting)
"Compared to keeping a connection state, gzipping is _way_ more expensive. I find it very hard to believe that there is a case where keeping the connection longer was more expensive than gzipping the content."
I'm prone to agree. But I also suspect that my CTO is going to agree that it's cheaper to pay once for more processing power than it is to pay every day for higher bandwidth use. YMMV, of course. Bandwidth is relatively cheap in some parts of the US, but in other parts of the world it's hideously exp
Re:How much? If everyone GZipped, a lot less! (Score:4, Insightful)
That results in a 10 times shorter transfer time,
Which results in 10 times fewer simultaneous connections,
Which results in 10 times fewer apache processes,
Which results in massively reduced memory and processor requirements.
That unused processor and memory is what would be used to perform the gzip operations. Let's say, for argument's sake, that compressing the output doubles the processor usage (a ridiculously high number). Cutting the number of Apache processes by an order of magnitude then only has to reduce CPU requirements by 50% to come out on top.
If the gzip operation only inflicts a 10% overhead, cutting the Apache processes by ten only needs to free up more than 9% of the CPU to come out on top.
Look at your server: would cutting the number of Apache processes from 400 to 40 save more than 10% of the CPU usage? Would it save more than 50%?
[All numbers in this post were selected for ease of calculation, not for their real-world precision.]
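The breakeven argument above can be written out explicitly. A toy sketch, using the parent's illustrative numbers (none of these are measurements, and the linear CPU model is a deliberate simplification):

```python
def gzip_worthwhile(processes_before, processes_after,
                    per_process_cpu, gzip_overhead):
    """Rough breakeven test: does the CPU freed by dropping
    connection-holding Apache processes exceed the CPU spent gzipping?"""
    cpu_freed = (processes_before - processes_after) * per_process_cpu
    cpu_spent = processes_after * per_process_cpu * gzip_overhead
    return cpu_freed > cpu_spent

# Parent's example: 400 processes down to 40, gzip doubling per-request CPU.
print(gzip_worthwhile(400, 40, 1.0, 1.0))   # → True: still comes out on top
# Same process reduction, a more realistic 10% gzip overhead:
print(gzip_worthwhile(400, 40, 1.0, 0.1))   # → True
```

The point of the sketch is just that an order-of-magnitude cut in process count dominates even a pessimistic compression cost.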
Re:How much? If everyone GZipped, a lot less! (Score:4, Insightful)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Anyone running Apache can install mod_gzip, which compresses the served content and sends it to the browser, which decompresses and renders it. For further info, see this rather old article [webreference.com].
Re:How much? If everyone GZipped, a lot less! (Score:2)
On that note, using a standard web cache like Squid which supports compression in front of your webserver could solve the same problem.
Re:How much? If everyone GZipped, a lot less! (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Just have your web application gzip its output to a file with a
This assumes of course your web application is not dynamic at runtime on a visitor to visitor basis.
Re:How much? If everyone GZipped, a lot less! (Score:2)
CPU time (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:5, Informative)
So I wouldn't say ANY site using Apache... but probably most. The real problem there is the compression load on the servers... gzip compression doesn't just happen, you know; it takes CPU cycles that could otherwise be used to just push data rather than encode it.
Re:How much? If everyone GZipped, a lot less! (Score:2)
Most web clients take gzipped content, so if it's static you should gzip by default and store compressed on the filesystem.
For browsers accepting compressed content (most of them), serve it as-is; for those that don't, you can uncompress the content on the
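Pre-compressing static feeds as the parent suggests is only a few lines. A minimal sketch (the filename is invented for illustration; a real setup would also need the server configured to prefer the .gz sibling for gzip-capable clients):

```python
import gzip
import tempfile
from pathlib import Path

def precompress(path: Path) -> Path:
    """Write a .gz sibling of a static file so the web server can hand
    the pre-compressed copy to clients sending Accept-Encoding: gzip."""
    gz_path = path.parent / (path.name + '.gz')
    gz_path.write_bytes(gzip.compress(path.read_bytes()))
    return gz_path

tmp = Path(tempfile.mkdtemp())
feed = tmp / 'index.rdf'
feed.write_text('<rss><channel><title>example feed</title></channel></rss>')
gz = precompress(feed)
print(gz.name, gz.stat().st_size, 'bytes')
```

Run this whenever the feed is regenerated and the CPU cost of compression is paid once per update instead of once per request.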
Re:How much? If everyone GZipped, a lot less! (Score:3, Insightful)
Re:How much? If everyone GZipped, a lot less! (Score:2, Interesting)
So, effectively, IE requests each and every page again if it's gzipped.
Nice to know that this bandwidth-reduction solution has the opposite effect...
See my blog [blogspot.com] for more info.
Re:How much? If everyone GZipped, a lot less! (Score:2)
Where the hell do you get the idea these two processes are comparable cpu-wise?
*shakes head*
Re:How much? If everyone GZipped, a lot less! (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:2)
I'm sure they pass the cost of electricity on to you, the customer. If their bills start going up I'd bet they raise your rates.
Re:How much? If everyone GZipped, a lot less! (Score:2, Funny)
I'm too tired to explain to you how retarded that comment is in the context of a multi-million-dollar business like a datacenter. You think they care if you are using 30 more watts of electricity, which doesn't equate to an extra 100 dollars on their power bill? They don't care / would never raise rates because of their power bill... They only raise rates when bandwidth availability / rackspace becomes a premium or their demand
Re:How much? If everyone GZipped, a lot less! (Score:3, Funny)
1) you're trying to have a conversation about two separate topics w/ 2 separate people
2) you've mixed up both the topics and the people already
3) you've replied to your OWN posts when you meant to reply to someone else's
4) you really like the word 'semantics'
Have to say that I'm really enjoying the fact that you work in IT but get pissed off by ppl arguing over linguistics. The irony is maxing my CPU out.
Re:How much? If everyone GZipped, a lot less! (Score:2)
I do not think that word means what you think it means.
(Oh, the irony.)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Nah, dude, zlib. Two minutes, max.
=P
Re:How much? If everyone GZipped, a lot less! (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:5, Interesting)
Bandwidth is cheap. Computers, not so much.
Re:How much? If everyone GZipped, a lot less! (Score:3, Insightful)
One thing is for certain though, for many users bandwidth is NOT cheap.
Re:How much? If everyone GZipped, a lot less! (Score:5, Insightful)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Re:How much? If everyone GZipped, a lot less! (Score:2)
Not so useful on a dynamic site.
Gzip helps, but the real win is conditional get (Score:5, Informative)
Charles Miller [pastiche.org] explained this well a few years ago.
(I run the spiders at Technorati).
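The conditional-GET win boils down to one comparison the server makes per request. A minimal sketch of the server-side decision in Python (simplified: a real server also handles ETag/If-None-Match, and the function names here are invented for illustration):

```python
from email.utils import formatdate, parsedate_to_datetime

def respond(feed_modified_ts, if_modified_since=None):
    """Return (status, body_needed). A 304 tells the client to reuse its
    cached copy, so the feed body only goes on the wire when it changed."""
    if if_modified_since is not None:
        client_ts = parsedate_to_datetime(if_modified_since).timestamp()
        if feed_modified_ts <= client_ts:
            return 304, False          # Not Modified: headers only
    return 200, True                   # full feed body is transferred

feed_ts = 1_000_000
fresh = formatdate(feed_ts, usegmt=True)   # what the client cached
print(respond(feed_ts, fresh))             # unchanged → (304, False)
print(respond(feed_ts + 3600, fresh))      # feed updated → (200, True)
print(respond(feed_ts))                    # no header sent → (200, True)
```

Since most polls find nothing new, the 304 path is the common case, which is why conditional GET beats compression alone for aggregator traffic.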
One problem - mod_ssl (Score:2)
In theory, the two should work together seamlessly. In practice, they don't.
Re:What about static html pages? (Score:2)
Yes, just use mod_gzip [sourceforge.net] with apache.
Running Apache as root (Score:2)
I'm not going to make the argument another fellow did (it can't corrupt your disk) 'cos it's not true, though you'd have to be pretty darn unlucky.
The more important point is - you were running Apache as root?! If so, I don't blame your boss. If not, how exactly did it corrupt the disk? I'd be putting my money on an unrelated error (without information to the contrary) personally - early disk failure, etc.
In general, though, firing someone for implementing software they've approved on a productio
Re:we've tried gzip on our server... (Score:2)
That said, it's not overly likely, and you'd have to be pretty unlucky to have something like that happen even when testing it and having it crash repeatedly.
The more important point is "What? You were running Apache as root?"
Re:How much? If everyone GZipped, a lot less! (Score:2)
That said, HTTP is pretty much made for it with its support for `Content-Encoding:', and the clear headers/body separation. You still have to do things like make sure the client understands gzip, of course.
The job would also be vastly simplified by zlib. I'd consider implementing your own gzip compressor pretty extreme when zlib is free and extremely well tested.
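The `Content-Encoding:` negotiation mentioned above really is the easy part once zlib/gzip does the compression. A simplified sketch (the header parsing is deliberately naive; real negotiation parses q-values in Accept-Encoding):

```python
import gzip

def build_response(body: bytes, accept_encoding: str = ''):
    """Compress the body only if the client advertised gzip support."""
    headers = {'Content-Type': 'application/rss+xml'}
    if 'gzip' in accept_encoding.lower():
        body = gzip.compress(body)
        headers['Content-Encoding'] = 'gzip'
    headers['Content-Length'] = str(len(body))
    return headers, body

feed = b'<rss><channel><title>example</title></channel></rss>' * 20
h1, b1 = build_response(feed, 'gzip, deflate')
h2, b2 = build_response(feed, '')
print(h1.get('Content-Encoding'), len(b1))  # compressed path
print(h2.get('Content-Encoding'), len(b2))  # plain path, full size
```

The clean headers/body split in HTTP is what makes this a drop-in change: nothing about the feed itself has to know it is being compressed.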
Comment removed (Score:5, Interesting)
Re:Bandwidth wasted for non-xhtml pages? (Score:2, Informative)
"Though a few KB doesn't sound like a lot of bandwidth, let's add it up. Slashdot's FAQ, last updated 13 June 2000, states that they serve 50 million pages in a month. When you break down the figures, that's ~1,612,900 pages per day or ~18 pages per second. Bandwidth savings are as fo
Re:Bandwidth wasted for non-xhtml pages? (Score:5, Interesting)
Re:Bandwidth wasted for non-xhtml pages? (Score:5, Interesting)
Answer: Not much (Score:4, Funny)
Slashdot is switching to html+css for the front page, but not for any dynamic pages like the one you're on now. Because slashcode was written by totally incompetent programmers, the markup for comment pages is not separated from the logic. Making any changes is therefore a huge undertaking and the people who wrote it are far too busy maintaining the high journalistic standards slashdot is known for to do it.
Please, mod... (Score:2)
Re:Bandwidth wasted for non-xhtml pages? (Score:2, Informative)
It has absolutely sod-all to do with XHTML. HTML 4.01 and XHTML 1.0 are functionally identical. You can use table layouts and <font> elements with XHTML 1.0 and you can use CSS with HTML 4.01.
You are referring to separating the content and the presentation through the use of stylesheets. This has nothing to do with XHTML, although it would save a hell of a lot of bandwidth if Slashdot implemented it. They are implementing it [slashdot.org].
Re:Bandwidth wasted for non-xhtml pages? (Score:2)
and from here [alistapart.com].
Ask an IT person if they know what Slashdot's tagline is and they'll reply, "News for Nerds. Stuff that Matters." Slashdot is a very prominent site, but underneath the hood you will find an old jalopy that could benefit from a web standards mechanic.
This is going to sound like a flame - and it isn't meant to be one. But it seems obvious at this point that the people run
Re:Bandwidth wasted for non-xhtml pages? (Score:2)
Whaddya mean? Haven't you seen the glorious new IT color scheme? [slashdot.org]
Re:Bandwidth wasted for non-xhtml pages? (Score:2, Funny)
Slashdot? (Score:4, Insightful)
On slashdot.... Oh wait....
Re:Slashdot? (Score:2, Funny)
The Answer (Score:2)
Like most of life, building networks of trust takes time. Aren't issues like this really part of the problem? Charging for bandwidth... My server has something like 100gig of transfer, and unless I get Slashdotted several times a month is this really a problem? And, if I do, why aren't I getting some ads in place to pay for it?
900k a day, not 9m (Score:2, Informative)
Don't forget the robots (Score:5, Interesting)
No one read it, but I got a ton of hits -- all from indexing services. WordPress pings a service that lets lots of indexing systems know about new posts. Some of them -- Yahoo, for example -- were constantly going through my entire tree of posts, and hitting links for months, subjects, and so on.
It didn't bother me, because the bandwidth wasn't an issue, and it wasn't like they were hammering my vps or anything. It mostly just made it really hard to read the logs, because finding human readers was like looking for a needle in a haystack.
But bandwidth is cheap, and RSS is really useful, so it seems at least as good a use for the resource as p2p movie exchanges.
Re:Don't forget the robots (Score:2)
Almost none.
Don't worry about it, guys. If people ever start clamoring for MORE blog posts, you'll know.
Re:Don't forget the robots (Score:3, Informative)
See AWStats [sourceforge.net]
Re:Don't forget the robots (Score:3, Insightful)
Rather than assuming... (Score:5, Interesting)
"How much data is this? If we assume that the average HTML post is 150K, this will work out to about 135G. Now, assuming we're going to average this out over a 24-hour period (which probably isn't realistic), this works out to about 12.5 Mbps sustained bandwidth.
Of course, we should assume that about 1/3 of this is going to be coming from servers running gzip content compression. I have no stats WRT the number of deployed feeds which can support gzip (anyone have a clue?). My thinking is that this reduces us down to about 9 Mbps, which is a bit better.
This of course assumes that you're not fetching the RSS and just fetching the HTML. The RSS protocol is much more bloated in this regard. If you have to fetch 1 article from an RSS feed, you're forced to fetch the remaining 14 additional posts from the past (assuming you're not using the A-IM encoding method, which is even rarer). This floating window can really hurt your traffic. The upside is that you have to fetch less HTML.
Now let's assume you're only fetching pinged blogs and you don't have to poll (polling itself has a network overhead). The average blog post would probably be around 20k, I assume. If we assume the average feed has 15 items, only publishes one story, and has a 10% overhead, we're talking about 330k per fetch of an individual post.
If we go back to the 900k posts per day figure, we're talking a lot of data - 297G, most of which is wasted. Assuming gzip compression, this works out to 27.5 Mbps.
That's a lot of data and a lot of unnecessary bloat. This is a difficult choice for smaller aggregator developers, as this much data costs a lot of money. The choice comes down to cheap HTML indexing, with the inaccuracy that comes from HTML, or accurate RSS, which costs 2.2x more.
Update: Bob Wyman commented that he's seeing 2k average post size with 1.8M posts per day. If we are to use the same metrics as above this is 54G per day or around 5Mbps sustained bandwidth for RSS items (assuming A-IM differentials aren't used)."
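The arithmetic in the quoted estimate is easy to check. A sketch using the article's own assumed averages (the per-post sizes are the article's guesses, not measurements):

```python
def sustained_mbps(posts_per_day, bytes_per_fetch):
    """Average a day's total transfer into a sustained bit rate."""
    bits_per_day = posts_per_day * bytes_per_fetch * 8
    return bits_per_day / 86_400 / 1_000_000   # seconds/day, bits → Mb

# 900k posts/day at ~150KB of HTML each (~135GB/day):
print(sustained_mbps(900_000, 150_000))   # → 12.5 Mbps
# 900k posts at ~330KB per 15-item RSS fetch (~297GB/day):
print(sustained_mbps(900_000, 330_000))   # → 27.5 Mbps
```

Note that 330KB per fetch to retrieve a single ~20KB new post is where the quoted "2.2x more" for RSS over HTML indexing comes from: most of each fetch is the redundant 14-item window.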
Re:Rather than assuming... (Score:2)
Some Answers (Score:4, Insightful)
Less than it currently takes, what with pull, HTTP, and XML used instead of more efficient technologies.
``what percentage of them have any real value, and how do busy people find that
Using a scoring system, like Slashdot's?
It's not like all of this is rocket science. It's just that people go along with the hyped technology that's "good enough for any conceivable purpose", ignoring the superior technology that had been invented before and wasn't hyped as much. Nothing new here.
Your sig says (Score:2)
You got your facts wrong. When feed readers use conditional GET and respect HTTP Last-Modified headers, and when feed publishers use gzip encoding (XML, like most plain text formats, compresses wonderfully), the bandwidth requirement for aggregation is minimal; the technologies themselves, then, are not inefficient; the inefficiency is in how they are being used. And the alternative you hint at, push, is nowhere near being "more efficient" since it would requi
Re:Your sig says (Score:2)
I can't say I agree with you, though. Maybe I should clarify my points a bit.
1. Push distribution should be more efficient than pull distribution, because it only sends when something has actually changed. You could argue that pull distribution can be more efficient, because multiple updates can be bundled, but the same can obviously be done with push distribution as well.
2. XML is more verbose than, for example, s-expressions. RSS is not terrible, but when I eye
Re:Your sig says (Score:2)
9 M? (Score:2)
the cited article discusses volumes of 900k, i.e.: thousands...
from whence comes this discrepancy ?
Re:9 M? (Score:2)
Aforementioned discrepancy cometh from thine arse, which be white as the first winter snow.
-1, redundant (Score:2)
So, you can say:
Whence comes this discrepancy?
but please don't use
From whence...
because it's redundant.
Definition of quality and value == arbitrary (Score:3, Insightful)
In actuality, my guess is that there are few blogs you might decide to visit, and of those you do, several may have content you find worthwhile. Remember, worthwhile is all in the perception of the reader - there is no real definition for quality or value. Perhaps through trial and error - in essence digital tinkering - you find and derive your own value.
cheers, --dave
Slashdot = blog = ironic (Score:4, Interesting)
Does anyone else consider it ironic that the Slashdot editorship HATES blogs, but Slashdot is actually a blog?
Anyone else getting tired of these questions?
Re:Slashdot = blog = ironic (Score:2)
The average blog is just some random joe telling us about his day or various bits of intellectual sophistry about things he doesn't understand (politics, science, etc).
Sorry, quantity != quality. A million monkeys at a million typewriters; only a few of them are producing the works of Shakespeare.
s/blog/website (Score:3, Insightful)
Time to ditch the World Wide Web, right?
That's 900,000 posts (Score:4, Informative)
Finding the Worthwhile Content in Blogs (Score:5, Insightful)
If a friend is going through cancer treatment, her blog is worthwhile. If you find a youth group leader like yourself and can learn from his posts, his blog is worthwhile. A mother fighting for her health so that she can take care of her two sons and husband can share insights that are worthwhile. Someone fighting depression might have a worthwhile blog. A grandmother might have a view of the world that makes her blog worthwhile, just to get a different view. Perhaps a blog by someone who totally disagrees with you will be worthwhile, just to stretch your mind.
I've just described why I read the blogs on my blog roll. You can choose differently.
Top political blogs? You can find them easily among Technorati's top 100 list. Tags at Technorati will let you pick out specialties like science or "Master Blasters" or diabetes or the Tour de France. Google will turn up blogs if you search right, which is the trick for using Google.
"Worthwhile" is a much more difficult variable to calculate than "bandwidth." Perhaps it's the sheer variety of blogs that makes them interesting, because they are so individual and someone, somewhere will speak to your mind or your heart.
Worthwhile is what's worthwhile to you, and maybe to very few others. Not everyone will agree, and that's not a bad thing.
Do we even understand what "value" means? (Score:2)
Either I don't understand this question, or it's a completely idiotic question. What the fuck does "real value" mean? The maxim "One man's trash is another man's treasure " is especially important when talking about information--the asymmetry of value from person to person is even bigger than when you're talking about physical goods.
Considering the second half of the question, though, one might re-phrase t
Value (Score:5, Interesting)
I had for a while held the view that most blogs out there are pointless. Some can be insightful, and some are basically used as company press releases, but most are people talking about their day's activities, which few people really care about, and a few of my friends have blogs like these. When I asked one what's the point, she said she just blogs stuff she would normally mention to many people on MSN throughout the day. It's not meant to have value to anyone on Slashdot, be hugely insightful, or detail some breathtaking new hack; it's simply another way for her to talk to friends (that doesn't involve repeating herself).
Rod Serling says (Score:2)
You've just woken up in....The Blogosphere! De-de-de-de, de-de-de-de.....
The answer to the article's question is: nothing; there's no point in wading through the output of b
Judgemental much? (Score:2)
The days when 9 megabytes or 5 Mbps sustained for a popular server was considered out of line are long gone. People want to communicate, and they will use whatever resources are needed. How many resources do we use so that we can guarantee that Tuan will get his present from grandma? How many resources do we use so that an arbitrary firm can mail a postcard to everyone in the country? How many resourc
Wheat from chaff (Score:5, Funny)
Re:Wheat from chaff (Score:5, Funny)
Ah! I guess you missed the following blog entry then:
Hi everybody, it's Sunday today and I'm bored. So I guess I'll get on with my homemade engine that runs on water. As you know, it's almost finished, and I expect it to put out as much as 1337 horsepower. The reliability of the motor should be good too: my friend, Ray Kewl in engineering, said it should provide well beyond 10,000 TEH (total engine hours).
Update: the engine is in the car, and it runs! on nothing but water! OMG I'm so happy! check the pictures and the diagrams to build your own. I can't wait to get my drivers license renewed so I can take it for a spin!
Re:Wheat from chaff (Score:3, Insightful)
Oops! (Score:2)
It's going to.
Someone is going to link to the original post [slashdot.org] on their blog. That article will be recopied a few times until any link to Slashdot is lost.
Some news reporter, hoping to pick up on the "next big thing" will take it to be a legitimate report.
When you watch the cable news and see an over-hyped story about a car that runs on water, ask yourself if it started out as a joke on Slashdot.
Re:Oops! (Score:2)
Re:Wheat from chaff (Score:2)
Wait--real life is more humorous--the GOP [rnc.org] is the first listing!
semi off topic (Score:3, Interesting)
Every single RSS aggregator I've come across treats my RSS world like an e-mail reader, where each blog is a 'folder' and each entry is equivalent to an e-mail.
This is decidedly NOT what I want and I don't understand why everyone's writing the same thing.
My friend is running PLANET, which builds a front page out of the RSS feeds (it looks kind of like the Slashdot front page, where adjacent stories come from different sources and are sorted in chronological order, newest on top).
PLANET seems to be a server-side implementation. My buddy's running Linux and he made a little page for me but it's not right for me to bug him every time I want to add a feed.
Is there anything like what I want that would run on Windows? And if not, why the heck not?
By the same token, why doesn't del.icio.us have any capacity to know when my links have been updated?
For what it's worth, here's my del.icio.us BLOGS area with some blogs I find good.
http://del.icio.us/eduardopcs/BLOG [del.icio.us]
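At its core, a Planet-style "river of news" front page is just a merge over per-feed entry lists sorted by date; there's no deep reason it couldn't run client-side on Windows. A minimal sketch (the feed data is invented; a real version would fetch and parse actual feeds, e.g. with the third-party feedparser library):

```python
from heapq import merge
from operator import itemgetter

# Each feed's entries, newest first, as (unix_timestamp, source, title).
feed_a = [(1700000300, 'blog-a', 'Third post'),
          (1700000100, 'blog-a', 'First post')]
feed_b = [(1700000200, 'blog-b', 'Hello world')]

# Merge the already-sorted feeds into one newest-first river.
river = list(merge(feed_a, feed_b, key=itemgetter(0), reverse=True))
for ts, source, title in river:
    print(ts, source, title)
```

Adding a feed is then just adding another sorted list to the merge, no server-side regeneration required.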
We don't (Score:2)
Busy people don't waste time on blogs. Blogs are the realm of internet kooks ranting about the latest conspiracy behind secret intelligence memos, not sane people with limited free time.
Re:We don't (Score:2)
Well, Slashdot is a blog (which is why I always chuckle at comments saying "I don't read blogs"), and if it wasn't the first website to use the "newest article appears at the top, pushes previous ones down" format that characterises a blog, then it was certainly among the first.
How? you know how... (Score:2)
They don't, they really have better things to do. The media actually does that for us already... what me worry?
Miski Client-Server-Server-Client protocol (Score:2, Interesting)
Re:Miski Client-Server-Server-Client protocol (Score:2)
Th e long tail (Score:4, Informative)
This effect is called the long tail [wikipedia.org] effect, and is visible all over the web. For instance, Amazon.com says that every day it sells more copies of books that didn't sell yesterday than of books that *also* sold yesterday. In other words, they sell (in sum) more of the items selling less than one copy every other day than of the items selling more than that.
Eivind.
The Big Count (Score:2)
PubSub later admitted they may have been double-counting.
Re:busy people read 9000 blogs per day?? (Score:3, Informative)
Secondly, that would be posts; I'm assuming the intelligent stuff tends to come not as 90 separate posts, but as multiple intelligent posts from the same person.
Third, since the original poster somehow messed up and cited the number 9 million instead of the correct number, 900,000, that number is reduced to 9 posts a day, a reasonable amount to read.
Re:busy people read 9000 blogs per day?? (Score:2)
Re:If you poll, at least do it well... (Score:2)
Dave, where are you when we need you?
Re:If you poll, at least do it well... (Score:2, Interesting)
The documentation there is, I think, about as good as you'll find. While it says that it can be implemented in either XML-RPC or SOAP, I am aware only of XML-RPC implementations.
The cloud provides a means for blogs to notify subscribers of updates and should eliminate the need for polling -- except that the subscriptions must be renewed at least every 25 hours. Of course, this cloud stuff isn'
Re:What people even blog (Score:2)
Have you considered writing a blog? People would read it.