Faster P2P By Matching Similar Files?
Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files.
"SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""
Summary: (Score:5, Informative)
Comment removed (Score:5, Informative)
Re:Right.... (Score:3, Informative)
Should work just fine... (Score:2, Informative)
This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.
Assuming that the hashing scheme being used is "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.
Should work just fine.
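A minimal sketch of the idea described above, assuming fixed-size clusters and SHA-1 as the hash. The chunk size, the function names, and the two sample "files" are hypothetical and chosen only for illustration; they are not taken from SET or any real client.

```python
import hashlib

CHUNK_SIZE = 16  # hypothetical; real clients use much larger chunks


def chunk_hashes(data: bytes, size: int = CHUNK_SIZE) -> list[str]:
    """Split data into fixed-size chunks and return each chunk's SHA-1 hex digest."""
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def shared_chunks(wanted: bytes, other: bytes) -> list[int]:
    """Indexes of chunks in `wanted` whose hash also appears in `other`."""
    available = set(chunk_hashes(other))
    return [i for i, h in enumerate(chunk_hashes(wanted)) if h in available]


# Two made-up files that differ only in their first 16-byte "header".
body = bytes(range(48))
file_a = b"TITLE: Song A..." + body
file_b = b"TITLE: Song B..." + body

# Chunk 0 (the header) differs; the remaining chunks can come from either file.
print(shared_chunks(file_a, file_b))
```

Note that this only works here because both headers are the same length. With fixed-size chunks, a header of a different length would shift every later chunk boundary and none of the hashes would match.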
Re:Right.... (Score:5, Informative)
(Yes, I'm one of the authors.)
Re:TorrentSoup (Score:5, Informative)
Re:Nickelback? (Score:1, Informative)
Re:Nickelback? (Score:4, Informative)
For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.
Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)
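To make the mix-versus-CD-rip example concrete, here is a rough sketch that assumes each song is a single chunk and that some index maps chunk hashes to the shares holding them. The share names, the song contents, and the `sources_for` helper are all hypothetical; how SET actually builds or queries such an index is not described here.

```python
from collections import defaultdict
import hashlib


def song_hash(song: bytes) -> str:
    """SHA-1 hex digest of one song (treated as a single chunk for simplicity)."""
    return hashlib.sha1(song).hexdigest()


# Hypothetical shares: two different uploads with two songs in common.
shares = {
    "dance_mix": [b"song-1", b"song-2", b"song-3"],
    "cd_rip": [b"song-2", b"song-3", b"song-4"],
}

# Index from chunk hash to every share that holds that chunk.
index = defaultdict(set)
for name, songs in shares.items():
    for song in songs:
        index[song_hash(song)].add(name)


def sources_for(share_name: str) -> list[list[str]]:
    """For each chunk of the wanted share, list every share that can supply it."""
    return [sorted(index[song_hash(s)]) for s in shares[share_name]]


# The CD rip shows up as a partial seed for the two songs held in common.
print(sources_for("dance_mix"))
```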
Re:TorrentSoup (Score:4, Informative)
Already been done a long time ago (Score:3, Informative)
Unfortunately, there are some issues with it:
-Only Shareaza supports it, other clients didn't want to play along.
-Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
-Only mp3 is supported, so no ogg, aac or wma
So this paper isn't as revolutionary (if that's what they mean).
This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.
Re:Nickelback? (Score:3, Informative)
Tested on uTorrent 1.6.0 (old version) on my and my roommate's computers. Incidentally, the process isn't any faster than downloading the file on one computer, and copying it over to the other one afterwards.
You are correct about the tags, though.
Re:Problem with variable insertions? (Score:5, Informative)
It will still be WORSE. (Score:3, Informative)
I grab an mp3 from person A. I then clean up the tag and rename it to suit me.
You want to download that same song with a different name and different tag.
You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.
Using the new model, you could also connect to person A and myself and download the blocks that are the same.
So instead of only...
99 people in the swarm and 1 seeder
you'd have 99 people in the swarm, 1 seeder, person A and myself.
But in order to FIND person A and myself, you'd have to go through A MILLION OTHER PEOPLE to find if they have any blocks that you are looking for.
The CRITICAL PART THAT THEY LEFT OUT is the amount of bandwidth you'd be using to search A MILLION unrelated systems with unrelated files looking for those blocks.
This works in their lab because they have very few machines with very few files and they've already pre-loaded those machines with the files they want to be found.
Re:Right.... (Score:4, Informative)
Re:TorrentSoup (Score:1, Informative)
If H(x) = H(y), then H(x+z) = H(y+z)!
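This property holds for iterated hash constructions when the two colliding inputs leave the hash in the same internal state. The sketch below shows it with a deliberately weak toy hash (8-bit state, no length padding) so that a collision can be found by brute force. `toy_hash` and `find_collision` are made up for this illustration and say nothing about how hard collisions are to find in any real hash function.

```python
from itertools import product


def toy_hash(msg: bytes, state: int = 7) -> int:
    """Toy iterated hash with a tiny state and no length padding (NOT a real hash)."""
    for b in msg:
        state = (state * 31 + b) % 251
    return state


def find_collision(length: int = 2):
    """Brute-force two distinct messages of equal length with the same toy hash."""
    seen = {}
    for combo in product(range(256), repeat=length):
        m = bytes(combo)
        h = toy_hash(m)
        if h in seen:
            return seen[h], m
        seen[h] = m
    return None


x, y = find_collision()
z = b"any common suffix"

# x and y differ, yet they collide, and the collision survives appending z,
# because hashing z starts from the same internal state in both cases.
print(x != y, toy_hash(x) == toy_hash(y), toy_hash(x + z) == toy_hash(y + z))
```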
Re:Nickelback? (Score:3, Informative)
By "Some", you mean "Every Single Frickin' One Of Them", right?
No. Only the most brain-dead P2P protocols will. "tree" hashes are in use by several P2P protocols. Some are just old or primitive, and have a large number of old servants around that don't understand newer hashes.
Huh? (Score:2, Informative)
So does ed2k, Bittorrent, ..., ..., .... this is hardly news. Even plain ol' FTP and HTTP can do this to a degree.
As far as the 99.9% similar "speedup"... I seriously doubt that you'll see any gains other than in lab conditions. MP3 is about the only format that might be agreeable to this, since I imagine it's reasonably common for people to fix ID3's and then share the modified file. I just don't see it happening for other formats (.avi, .zip, .rar). And unless you've still got a 14.4 modem I doubt you'll notice the speedup even with MP3's, since they are so small to begin with.
Re:Ok, I read the paper (Score:3, Informative)
It's already been addressed: files encoded with different codecs, bitrates, compression ratios, what-have-you look completely different, have vastly different checksums, and even if named exactly the same and with the exact same file size, would never be confused for each other by any algorithm that's comparing what's actually IN the files.
-l