Slashdot Log In
Running The Numbers: Why Gnutella Can't Scale
Posted by
timothy
on Wed Feb 14, 2001 03:58 PM
from the hello-to-www.monkey.org/~dugsong dept.
from the hello-to-www.monkey.org/~dugsong dept.
jordan (one of the founding developers of Napster), writes: "As the rumour mill churns over Napster's future, many folks see Gnutella as the next best hope for the music loving file sharing community. Problem is, Gnutella can't scale . [Note: if that URL doesn't work, try this mirror.] Almost all research on Gnutella up till now has been based on observations of the system in the wild, but this paper discusses the technical merits of that statement through a detailed mathematical analysis of the Gnutella architecture." The kind of numbers that you may not like to read if you figure networks expand to accomodate traffic at a never-ending pace. Update: 02/15 12:24 AM by T : Jordan also points to this mirror for your reading pleasure.
This discussion has been archived.
No new comments can be posted.
Running The Numbers: Why Gnutella Can't Scale
|
Log In/Create an Account
| Top
| 287 comments
(Spill at 50!) | Index Only
| Search Discussion
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Re:I challenge you... (Score:5)
I havent started on d and f, but they could be added.
This project is called The ALPINE Network [cubicmetercrystal.com]
It scales linearly, and provides a query mechanism that rivals the performance of a centralized directory. (Although the bandwidth is more than a centralized query, but at least you have direct control over how much bandwidth you use and how).
At any rate, I could use development assistance a great deal. Let me know if anyone is interested.
Regards...
There is FUD there. (Score:3)
All the author really does is take an example of a mathmatical formula which grows exponentially and show how quickly he gets "scary" numbers. No effort is made to show whether or not the efficiency of Gnutella breaks down as the network increases in size. No effort is made show how much work is done per search or per result. He just makes assumptions about the gnutella network which results in exponential growth in the number of users, and then shows how the aggregate traffic also grows exponentially. Duh. What did you expect? By this logic, nothing scales.
Don't get me wrong, I don't think Gnutella scales either. But you don't need to wave around all the FUDdy math that this guy does to prove it. The argument why it doesn't scale is simple:
The reason is doesn't scale is that every search request (optimally) gets delivered to every client. We don't even have to look at how those searched get delivered. We'll completely ignore the amount of traffic in the backbone, and only count the traffic that has to exist on the last hop to each client. Let's assume that the requests are 100 bytes a piece, or about 1000 bits once we have all the overhead of UDP/IP/ethernet/PPP/ATM/whatever on top. If each search is 1000 bits, and the average client has a 56K modem, the whole thing falls apart when the search rate is 56 searches / second. If we assume 1 million users, each one can only perform a search about once every 5 hours on average before the modem links are 100% full.
The problem here is the broadcast of every search to every client. Any distributed search network needs to either assume very high bandwidth connections for all the clients (because they are all servers to the whole network) or have some hierarchy of caches / servers. The amount of bandwith being used at each client increases as more clients connect. If the number of users goes up by 1000%, the traffic on my local link goes up 1000%. This is why it doesn't scale. It has nothing to do with how many GB of traffic the network as a whole has to handle. It's the simple fact that the traffic at every client increases as more clients connect. This is the problem that has to be corrected, and Jordan's paper never even mentions this fact, relying instead on big scary numbers. His claim at the end that gnutella generates 2.4GBps of traffic for 1 million users is the ultimate FUD. How much traffic does Napster generate when it has 1 million people connected? He probably doesn't know because their servers go down first.
How About... (Score:3)
Of course, you'd have to work out how to prevent hostile clients and servers from corrupting your indexes, but I'm sure that's a much more easily solved problem than working out how to prevent some skript kiddie from flooding napsters servers off the net with a DDOS.
Who needs Gnutella? Move Napster off-shore. (Score:4)
In all seriousness, I don't condone mass piracy, but the RIAA has been screwing people for decades and I have to admit that I enjoy watching them squirm. What could the RIAA conceivably do if Napster were located offshore, preferably in a country not bound by the terms of the Berne convention?
Re:Well, duh (Score:3)
Maybe Gnutella needs to take the meta-internet approach. A "new" internet on top of the current internet?
(I dunno. I ask because I'm curious. How is Gnutella in general different than the internet in particular?)
Re:Scaling... (Score:3)
I even share the equations and methodologies I used, and try to poke holes in my own conclusions.
Further, I'm not a competitor. I haven't worked for Napster in 3 months. Before Napster my background was in poking holes in things anyway. All I did was finish a personal project I started a long time ago.
You actually sound more like FUD than anything. :-)
--jordan
Re:Scaling... (Score:3)
Re:Well, duh (Score:3)
You build something that uses a distributed algorithm to build a spanning tree. The nodes near the top of the spanning tree become the servers. You build the algorithm so that parents in your spanning tree will naturally have more bandwidth than you do.
I've been thinking about this for a long while.
Building the spanning tree isn't hard. Every node just selects one and only one parent node. They tell the parent that they're a child of that parent. You prevent cycles having a parent refuse to be a parent unless it also has a parent. If it loses its connection to its parent, it tells all the children that it no longer is a parent. One node 'seeds' the network as a root by saying it can be a parent without being a parent and not looking for a parent. Eventually it can delegate roothood to a child that has proven high bandwidth. It cannot cease being a root without doing this delegation.
You can have connections to nodes that are neither parents nor children, but search requests should not be propogated to those nodes unless you have no parent. Eventually a search request will make it onto the spanning tree and be efficiently distributed.
You can eventually elect servers who are near the top of the spanning tree. Nodes should, in general, elect parents that have more bandwidth than they do. This means that nodes near the top of the spanning tree should have the most bandwidth.
A solution (Score:3)
I really want to build this with my StreamModule system [omnifarious.org], but nobody is helping me with it, and I don't have the time to hack it out, especially since I'm so ridiculously methodical when it comes to code.
You build something that uses a distributed algorithm to build a spanning tree. The nodes near the top of the spanning tree become the servers. You build the algorithm so that parents in your spanning tree will naturally have more bandwidth than you do.
I've been thinking about this for a long while.
Building the spanning tree isn't hard. Every node just selects one and only one parent node. They tell the parent that they're a child of that parent. You prevent cycles having a parent refuse to be a parent unless it also has a parent. If it loses its connection to its parent, it tells all the children that it no longer is a parent. One node 'seeds' the network as a root by saying it can be a parent without being a parent and not looking for a parent. Eventually it can delegate roothood to a child that has proven high bandwidth. It cannot cease being a root without doing this delegation.
You can have connections to nodes that are neither parents nor children, but search requests should not be propogated to those nodes unless you have no parent. Eventually a search request will make it onto the spanning tree and be efficiently distributed.
You can eventually elect servers who are near the top of the spanning tree. Nodes should, in general, elect parents that have more bandwidth than they do. This means that nodes near the top of the spanning tree should have the most bandwidth.
Re:Scaling... (Score:3)
Ok, this time I did a bit more thorough check of the numbers. I agree with the first half, the traffic generated by the request half of the message. What I'm not as convinced of is the response side of the equation.
I don't know what the typical percentage of Gnutella users sharing files is, so I'll accept your figure of 30%. But 40% of those sharing files having a match? Even with your reduced number here I think it's high. If 40% of people sharing files had a match that would mean with default settings you'd get: (N=4, T=5) 484*(0.3*0.4) = 58 people finding a match. And with the numbers you use later of 10 matches a person you'd get 580 matching entries. I've never received anything near that high. But if I did, I certainly would have no motivation to increase T or N.
What happens if it's only 10% of those sharing that have a match? With the default settings you'd still get 14 people matching, or about 140 matching entries. That's still a *lot* of responses, more than I've ever received.
If all your default numbers are used, your nightmare scenario would yield 0.3*0.4*7,686,400*10 "found" responses to your query. That's 9 million 223 thousand 680 "grateful dead live" songs (though not unique) shared among 900 thousand deadheads who are all simultaneously online. Whoa.
I'm not an expert in human psychology by any means, but let me suggest this. With most tools, people don't feel any need to "tweak" them unless they're not working right. With 480 songs returned, I don't think many people would feel a need to tweak their settings. If someone was having a hard time finding something they might then change their settings -- but if they were having a hard time finding it they wouldn't get so many responses returned.
The only way I can imagine those monstrous amounts of data resulting from querries is if it happens by maliciousness or mistake.
Am I missing something?
Re:Scaling... (Score:3)
Not everything is practical just because there is a need for it.
Great straw-man rebuttal! How about if you try a more rational analogy? Going from gas combustion engines to teleportation or fusion power is a tad bigger leap than going from Napster to a similar service! And Napster ceasing to exist versus gas prices climbing higher is not analogous either...
A better analogy would be:
"If we run out of petroleum-based fuel, a similar or better form of energy will come to the forefront."
And that's ABSOLUTELY TRUE, reasonably proven through a huge mound of empirical evidence.
Where have I seen this before? (Score:5)
Why things are developed (long) (Score:4)
Warning: Rant Ahead!
Partially true. In your example, you said that if price of gasoline went up, teleportation or fusion-powered cars wouldn't be developed. I agree. However, if the price of gasoline went to $20/gallon tomarrow (an outrageous rate, but its just an example), then we'd either see a changeover to natural gas/electric or some other alternative energy source vehicle, or cars would be developed that got 400 miles/gallon.
So why would gas/electictric cars be implimented and not fusion or teleportation? Well, first we have a demand for transportation. The demand for transportation is rather high, at least in the developed world, and especially in the US, since all of us seem to want to live in the woods and commute to the city. Therefore, if the demand is high, we *will* find something to fulfill the need, as long as the cost of fulfilling the demand is not so great that we have to sacrific other, equally important demands. We don't commute to work via helicopters because the time, money, and energy we would have to exert to be able to use them isn't worth the extra few minutes we'd shave from our commute time. We don't commute to work with buses because we prefer living in areas with lower population densities (e.i. suburbs) which make buses impractical and we don't like the inconvience of having to conform to the bus's schedule and having to interact with other members of our community. We are looking for something that fulfills our need to get from point A to point B with the lowest oppertunity cost to us. This is the economics/social side of the scale. On the other side of the scale is the harsh laws of science and technology, which dictate what has been done, what is possible, and what is impossible, and what the costs for doing each are. Say we have a possible solution set such as this { car (gasoline), car (electric), walking, teleportation, car (fusion) }. Science tells us the teleportation looks impossible. Therefore, we eliminate it. Technology tells us that fusion powered cars haven't been done yet, and considering everything that we know about "hot" fusion, its doubtful we could ever fit a fusion reactor in a vehicle the size of a car. We are now left with gasoline-powered cars, electric-cars, and walking, in this simplified example. Walking is too much of an inconvience to us, science doesn't have a problem with it, but human nature, and the time it would take, plus distance that would have to be traveled, make it impossible. On the economic/psycology/social side, walking isn't happening. So what will it be, electric or gasoline? The technology that's in place makes gasoline-powered vehicle cheaper then electric, and gasoline, even at the high prices that it is lately, is still an economical means of transport. Plus, we have human nature, gasoline is tried and true, electric isn't. Electric also has some problems with travelling long distance, and infrastructure doesn't support electric right now. Therefore gas is the best solution to our problem. In the future, if electric becomes more ideal then gasoline (enough to override our habit of sticking with what we know), we will switch.
So, we learn this. Each problem/solution pair depends on economics, human nature (psycological/social), science, and technology.
Lets apply this to Napster, OpenNap, Gnutella, and the rest of the field. Napster was nice and easy, a lot of us became accustomed to using it, and the technology (on our end) was cheap. However, Napster is either dead or moving towards a fee-based service. All of a sudden, from the economics viewpoint, Napster is less ideal. OpenNap is simular to Napster, there is the additionaly hassle of finding a server, but since Napster is having trouble, OpenNap seems a lot more attractive. However, OpenNap from the social viewpoint, is insecure, it has a central server, it can be attacked. Therefore, what do we have left? Gnutella is free of cost, and cannot be shut down through elimination of a central server. It is harder to use, and technology says it won't scale in the current format. Plus, it eats up bandwidth like a hog.
The above was a rant, and presented simplified examples. I didn't mention gyro-driven cars, monorails, carts hauled by penguins, or bicycles, amoung other things, because I was trying to keep the examples simple (and carts hauled by penguins aren't really practical). I didn't mention stuff like how critical user mass applies to file sharing systems because it didn't pertain to the topic of the comment. So please, don't flame me with a comment how widget-driven cars are the ideal solution, or that file sharing also depends on bandwidth. Nitpicking just wastes both of our time. On the other hand, valid comments are appreciated.
Well what about Freenet then? (Score:4)
Freenet is also very well architected, unlike bogus Gnutella. It's designed to scale up, so that popular stuff gets cached all over the place. Like, more people downloading means that your connections go FASTER. This is cool.
Mojo Nation aims to do exactly that (Score:3)
Yes!
By using an internal microcredit/payment system (called mojo) and localized reputations Mojo Nation [mojonation.net] aims to do exactly that. Better connected brokers (peers) will naturally become more "server like" due to having a better uptime, lower latency and a lower mojo cost overall for other brokers (peers) to use.
The resources in the system are allocated dynamically. No strict heirarchy needs to be defined, it will establish itself appropriately for each individual peer as it is needed.
PS a new version (0.950) was released today.
OpenNap (Score:4)
Life will find a way..... (Score:3)
Well, duh (Score:5)
Anyone who understands how Gnutella works (unfortunately, too few people) knows that Gnutella is horribly broken, will never work, and is basically unfixable.
The more relevent question is whether you can have a peer-to-peer network without central servers that *can* scale. And the answer is "no".
However, the REAL question is whether you can have a peer-to-peer network with decentralized servers, i.e., with clients that automatically establish a heirarchy among all the clients, and certain clients become more "server like". They only way to make a Gnutella work is by making it heirarchical, but the heirarchy needs to be automatic for it have the same general "virtual network" aspect of Gnutella.
Is it possible? I don't know. You would probably have to have automatic bandwidth measurements, depth probes, all kinds of things to make it work. I simply don't know if it would be possible to automate something like that.
--
Audiogalaxy. (Score:3)
At first I was put off by the web interface, but:
1) It remembers everything you request in a queue and will get it when available. (A must for dial-up users)
2) Auto-resume using temp files.
3) A small app in your system tray/console only sends/receive when you have it running.
The greatest advantage is that ZDnet/CNet/MSNBC and other DON'T mention audiogalaxy in their "quest for Napster clone" articles, so the quality of users, and therefore the music, is excellent.
Unfortunately, it is a centralized system, but so far, it seems the mainstream media/RIAA have ignored it.
Freenet (Score:3)
I understand that there are basically three reasons for Freenet:
- Abolition of censorship
- Archival of documents based on their percieved "usefulness"
- Elimination of standard bottlenecks in most peer-only networks (I hate the term peer-to-peer, but won't digress into the rant behind that statement)
So, do we really care that Gnutella lasts any longer than the time that it takes to get Freenet everywhere?Re:Scaling... (Score:3)
--jordan
RIAA is advertising piracy... (Score:4)
Now there will be media coverage (other than internet) mentionning other alternatives like IRC, Gnutella, search engines, etc etc, this is really a stupid move... not counting the many people that is going to be pissed off at RIAA and stop buying CDs.
RIAA should have worked closely with napster to bring a decent buisness model instead of bashing on them, they might have actually profited from that. They've shown how many "copyright material" were leeched every second (around 10,000) but did they show EVIDENCE that their sales decreased DUE to napster? no, they didn't have to, but if they would have, things wouldn't be that way. You bet after napster shuts down, their sales will decrease, I, for a start, will not buy anymore CDs.
I hope a company picks on big artists for digital distribution and doing something like stephen king, a buck a download, money would go STRAIGHT to them and the record label would stop it's own piracy (i.e. ripping many artists off and taking the public for complete morons).
For now Gnutella will do for most people, and if people SHARE, maths or not, it will work, not as nicely as napster did, but there will be a bunchload of alternatives if gnutella isn't doing the job.
Alternative Mirror (Score:3)
--jordan
A Future Alternative (and its scales linearly too) (Score:5)
The key aspects of this network will be:
- No forwarding. This is currently eating gnutella alive. A UDP based multiplexed transport protocol is used to maintain hundreds of thousands of direct connections to all the peers you want to communicate with. You can also tailor your peering groups precisely to what you desire, as far as quality, reliability, etc.
- Low Communication Overhead. All queries that are broadcast are performed with minimal overhead within UDP packets. A typical napster breadth query (10,000 peers) would take a few minutes on a modem, and seconds on a DSL line.
- Adaptive Configuration. Peers that have better or more responsive content will gravitate towards the top of your query list, thus, over time you will have a large collection of high quality peers which will greatly increase the chance of you finding what you need.
There are a number of other features, however too much to detail here.
Also, this is under heavy development, and not operational. I am going solo on this at the moment, and so progress is slow. However, once completed, it *should* be a scalable alternative to completely decentralized searching / location.
Re:Scaling... (Score:4)
But if Napster gets squeezed, you can bet your last dollar that it will be made to. Or something like freenet or audiogalaxy will take over.
But if the price of gasoline goes up, you can bet your last dollar that teleportation will be made practical. Or that cars that use fusion will be developed.
Not everything is practical just because there is a need for it.
--
Re:Well what about Freenet then? (Score:5)
Freenet is also very well architected, unlike bogus Gnutella.
The problem with Gnutella is not the transferring of files, it's the searching. You'll note that Freenet conspicuously avoids the subject of searching, except for "yeah, we're thinking about it... real soon now!"
--
I challenge you... (Score:5)
Can we expect therefore to see an equally interesting and thorough discussion of how Napster/Gnutella can grow, evolve and perhaps merge, to provide the "ideal compromise" where we will not need 100Gb networks, but where:
a) The destruction of any significant %age of the network is transparrently ignored or healed.
b) The network will not segment as GnutellaNet can.
c) Bandwith requirements are low[er]
d) Anonymity of participants is maintained where required.
e) The law can't shut it down so easily.
f) Data can be secured, encrypted and/or signed (etc.) for specific users
And MY personal wish:
g) The end result is so globally accepted for file exchange and storage, that FTP dies a death, and we all live without buffer-overflow exploits for the rest of out lives
Note that Napster and Gnutella were very one-sided in their freedom with files. There was no facility available to ensure that the law wasn't honoured where desired.
--
Re:Well what about Freenet then? (Score:4)
A few months ago, I tried to find a simple, lucid discussion of exactly how FreeNet works with IP anonymity. On a technical level, but without having to plow through the code. Anybody want to try? I'm not disputing that it works... I just don't see how you can prevent someone from sitting on your router and watching packets fly, and correlate the IP on the other end to a single system.
--
Evan
Hey, dammit! (Score:3)
*sigh*
omega_rob -- friend of the dread pirate Napster
More info: improving your connection (Score:3)
1. Never connect to more than 5 hosts at a time. There's no need for it and you'll only hurt yourself by doing so. I used to spend a lot of time in the gnutella.wego.com discussion area, and then the GnutellaNews boards, helping out new users. Time after time someone would come in and say, "Gnutella is shit! I type in a search and I don't get results for 10 minutes!" Me: "How many connections do you have open?" Them: "50, and if I try with 100, it goes even slower!!"
The more active connections you have, the slower your Gnutella experience will be... And by being a congested node, you're adding latency to the network for everyone else. Set your max connections to 5. That gives me, on average, an overhead of 6-10K/sec in background chatter, not counting uploads/downloads.
If you're on dialup, max your connections out at 2 and (it hurts to say this) don't share files or you won't be able to do anything else online. If you really want to share - and that's a good thing - cap your uploads at 1. Leave routing up to the people with the fatter pipes.
2. Go for diversity in your connections. If you load up your client and see that you're connected to 5 RoadRunner nodes, dump a few of them and try to connect to other networks. Peer-to-peer file sharing relies a lot on peering, after all. Connecting across ISPs, networks, and even across countries is a good thing.
3. Don't share junk files. Please. Every time I search for Pink Floyd and get a ton of under-1MB MP3s in the results, I want to kill someone. Know which directories, if any, you're sharing... And clean them out from time to time. All those incomplete downloads you made are being sent out as search results, but nobody is going to download them from you. Those are a lot of wasted bytes coming through your query hits.
4. Perhaps most importantly, use a good client. See the parent for details.
Shaun
What about Usenet? (Score:3)
Tools like NewsShark and NewsGrabber make it easy to post or obtain binary formatted files such as multimedia and there is plenty of it available. No waiting for downloads, no acne-faced punk kids aborting them, and you can batch and resume at your convenience.
Usenet isn't that hard to use and there is a lot of music that can be found from your ISP's news server. Grab a client and check it out!
-Pat
Re:Well what about Freenet then? (Score:4)
The other thing you could do would be to take over a node on the network, and request the material you're interested in, but since freenet uses relay nodes, you can never be sure that the information you're recieving came directly from the node you are talking to or through N relay nodes. Also the data is encrypted on the harddrive of the node operator, so you, provably, cannot know if you are storing illegal data or a copy of Johney's essay for school.
Hope that helps,
--
Remove the rocks to send email
Re:Audiogalaxy? (Score:3)
Downloads are generally horribly slow. Generally most of my downloads on Napster/OpenNap servers come in around 25 - 100Kps. Audiogalaxy claims you get the fastest for your location, but I can't see how I'm getting 1-2.5Kps downloads if that's really true.
Selections not too bad, but you can't find the obscure stuff that you'll find on a network the size of Napster. It's organization is a step in the right direction, better than Napsters, but could be better.
What I really don't like about it is the fact that you have to choose which version each time. Sure, you're supposed to get the most popular version, but myself I don't like 128K mp3's. I prefer 192K files. So each time I download a song I have to choose which one I want. It would be nice if I could tell it I prefer 192K songs, and it would default to those.
With Napster I can find an entire album in a search and queue it up quickly. With AudioGalaxy it takes several clicks. You also don't have the ability to browse a users files. This is one of my favorite things about Napster is the ability to browse other users files. Sure Audio Galaxy gives you logical choices of other music, but I frequently find things I like that don't logically go with the song I initially searched for.
Basically I think AudioGalaxy is a good idea. I'd like to see a better client, maybe standalone or a Java client so it would have a little more flexibility, and I'd like to see more potential interaction between users.
Re:Well what about Freenet then? (Score:3)
Freenet is a totally peer-to-peer system. It is not possible to tell weather I'm sending the file directly to you or am just transmitting at the request of a node behind me. And if it's possible to procecute for that, then ISPs everywhere are in BIG trouble
--
Remove the rocks to send email
Re:Well, duh (Score:3)
This is how server on demand would work. A client starts and searchs for a server. It could be a local broadcast, searching a list of last known servers, searching a list of servers from a config file, asking the user for a server address, etc. If no server is found, the client automatically becomes a server. If a server is located, the client can begin to request data. If the server is under heavy load, it will ask a client (meeting certain requirements) to become a server. Of course, clients have a choice, perhaps a little config switch to either allower server status or never allow it. Anyway, the client needs to be worthy of server status. Some criteria might be uptime, bandwidth, available storage space, number of hops, etc. The burdened server would give the invitation. One thing that might be implamented is an automatic server invitation for clients which are located along the same route as more distant clients (say, somewhere in the middle of a 50 hop route based on response times). The middle server would then handle the requests for the more distant clients.
Obviously, we need to maintain a list of the quasi-centralized servers. Clients can maintain their own list of servers. Servers can hold lists of other servers. Search requests are never handed to clients. Only servers are searchable. Therefore, clients may publish available files to server databases. Servers may ignore these if clients do not meet certain guidlines.
We could incorporate a trust level into the server list. Once a client is deemed worthy to be a server, it is trusted. Certain bad events (dropping from the network too often, loosing storage space, etc) could reduce the trust rating. If that rating goes too low, server status is revoked and a new server can be appointed. Good events might raise that trust.
Anyway, that's a bit of my rambling. I'm not a network engineer so I couldn't describe the scalability of this method. It might work or might not. So comments are welcome! ^_^