Faster P2P By Matching Similar Files?
Posted by
CmdrTaco
on Wed Apr 11, 2007 10:45 AM
from the something-doesn't-jive-here dept.
Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files.
"SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Nickelback? (Score:5, Funny)
Well, sure, if you're only looking at Nickelback [thewebshite.net] songs.
Re:Nickelback? (Score:5, Informative)
Being that the only difference is just the text (ID3, ID3-2) tags, the rest of the song is exactly the same, so why can't you use that as a download source too? I personally organize all of my music, and because of this P2P programs believe that it's an entirely new file, when really it was just renamed and the header information was changed (generally to be grammatically correct.)
Parent
Re:Nickelback? (Score:5, Interesting)
DHT even supports partially corrupted files, your client just discards the corrupt data.
My question is, why would I want to use SET over DHT? Does SET not need a centralized server, or does it have any other advantage at all?
TFA is really short on technical details, but it sounds to me as though SET is just a re-design of DHT. Still, I imagine SET support will be in the next builds of all the major bittorrent clients if it ends up being worth something.
Parent
Re:Nickelback? (Score:4, Informative)
For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.
Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)
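The mix-versus-CD-rip case above can be sketched in a few lines. This is only an illustration, not how SET itself is implemented: it assumes whole-file SHA-1 hashes are enough to recognise a shared song, and the names `song_index` and `shared_songs` are made up for the example.

```python
import hashlib
from typing import Dict


def song_index(collection: Dict[str, bytes]) -> Dict[str, str]:
    """Map content hash -> file name for every file in one upload."""
    return {hashlib.sha1(data).hexdigest(): name
            for name, data in collection.items()}


def shared_songs(wanted: Dict[str, bytes], other: Dict[str, bytes]) -> Dict[str, str]:
    """Return {name in wanted upload: name in other upload} for identical files."""
    wanted_idx = song_index(wanted)
    other_idx = song_index(other)
    common = wanted_idx.keys() & other_idx.keys()
    return {wanted_idx[h]: other_idx[h] for h in common}


# Hypothetical uploads: file names differ, one song's bytes are identical.
mix = {"01 - song_a.mp3": b"AAAA", "02 - song_b.mp3": b"BBBB"}
rip = {"track07.mp3": b"BBBB", "track08.mp3": b"CCCC"}
print(shared_songs(mix, rip))
```

Anyone downloading the mix could then treat the rip's uploaders as extra sources for the one song the two uploads have in common.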
Parent
Re:Nickelback? (Score:5, Interesting)
Some P2P protocols allow looking up a file by a hash which does not take filename into account, but this will not handle the case where the files differ in only one small section. The best example is the following:
Person downloads an MP3.
Person finds that the MP3 is not properly tagged (for example, has a comment field saying who ripped it/released the rip, and has no track number.)
Person changes the MP3's ID3 tag
Now, nearly all existing P2P protocols will treat the new file as a completely different file, when in reality the most important contents (the audio itself) have not changed, only the file's metadata.
Other users will go for the "full-file" match with the largest number of sources, thus causing the mistagged MP3 to propagate more than a "fixed" one.
So a P2P system that ignores the ID3 tag when hashing would have significant advantages, in which the user could download the file from many sources and then choose which source to get their metadata from.
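A minimal sketch of hashing that ignores the ID3 tag, assuming the file carries at most one ID3v2 tag at the start and one ID3v1 tag at the end (other tag formats, such as APE, are not handled). The function names `strip_id3` and `audio_hash` are invented for the example and do not belong to any existing client.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return MP3 bytes without a leading ID3v2 tag or a trailing ID3v1 tag."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)  # only 7 bits per byte are used
        start = 10 + size
        if data[5] & 0x10:  # footer flag (ID3v2.4) adds another 10 bytes
            start += 10
    # ID3v1 tag: fixed 128-byte block at the end of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def audio_hash(data: bytes) -> str:
    """Hash only the audio portion, so retagged copies share one hash."""
    return hashlib.sha1(strip_id3(data)).hexdigest()


# Synthetic stand-in for audio frames, plus made-up tags around it.
audio = b"\xff\xfb" + bytes(range(256)) * 4
tag_v2 = b"ID3" + b"\x03\x00" + b"\x00" + b"\x00\x00\x00\x0a" + b"A" * 10
tag_v1 = b"TAG" + b"x" * 125
tagged = tag_v2 + audio + tag_v1
print(audio_hash(tagged) == audio_hash(audio))
```

A client using something like this could list every retagged copy as a source for the audio, then take the tag bytes from whichever source the user prefers.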
Parent
Re: (Score:3, Informative)
Tested on u
Re: (Score:3, Informative)
By "Some", you mean "Every Single Frickin' One Of Them", right?
No. Only the most brain-dead P2P protocols will. "Tree" hashes are in use by several P2P protocols. Some are just old or primitive, and have a large number of old servents around that don't understand newer hashes.
Re:Nickelback? (Score:4, Interesting)
I think what he's talking about may be more like the document fingerprinting algorithms used to pare search engine results, or to detect plagiarism in student papers.
In some cases you will be downloading components of a file from two sources, neither of which has the other's component. The example in TFA was downloading the video portion of a movie from a foreign language site and the audio from a site with the language you speak but less bandwidth.
I suppose another example would be that if you were downloading an anthology of stories, you could take a particular story from a server that hosted a different anthology including that story. Or maybe you are downloading the new distro; you could take some of your files from sites offering the distro version you are looking for, some from sites only offering the files you need to upgrade to that version, and some from entirely different distros or much older versions if they happened to be the same.
I guess it could be thought of as a kind of "fuzzy akamai".
It's an interesting idea, but I don't see any commercial support for it. In fact I see commercial opposition under the current regime of copyright laws and royalty based business models.
Parent
Re: (Score:3, Insightful)
The last thing I want is a "similar" file.
What would be a "similar" file to a FreeBSD ISO? It would either be a corrupted file or one with an introduced exploit.
Re:TorrentSoup (Score:5, Informative)
Parent
Re:TorrentSoup (Score:4, Interesting)
It would be interesting if the implementing software could also look for possible matches within your existing file structure and reduce the data downloaded automatically, kind of like using diff and just downloading the patch.
Parent
Re:TorrentSoup (Score:4, Informative)
Parent
Re:TorrentSoup (Score:4, Insightful)
Parent
grea tide a (Score:5, Funny)
The music kids listen to (Score:3, Funny)
That'll work great with (Score:2)
No wait, hear me out. Most porn is going to be largely white or black skin colour (particularly with Friesian Cows if you're into that sort of thing), so the P2P can just find a chunk with a similar amount of that colour and download that!
Summary: (Score:5, Informative)
Tiny chunks, Large files (Score:2, Insightful)
The article talks about 16kb chunks, which for a dvd image would take more than the torrent protocol currently allows.
The client would spend more time communicating its chunk lists around than actually getting data.
(If I remember rightly, torrents can have a max of 65535 chunks and some servers prevent huge
meh (Score:2)
there's definitely potential for problems here. what if those files really aren't supposed to be the same? a swapped byte here and there could have huge effects on the end result.
Overhead (Score:2)
I'm sure smarter people than I have already thought this out,
but that seems to be the most immediate downside.
"SET divides a one-gigabyte file into 64,000 16-kilobyte chunks"
In other words: instead of seeking out the one master-hash for the file,
your P2P is looking up the thousands of chunk hashes.
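A back-of-the-envelope check of that overhead, as a sketch only. The chunk size and file size come from the quote; the 20-byte digest per chunk is an assumption (the article does not say which hash SET uses), and this counts only the size of the hash list, not the cost of looking each hash up.

```python
CHUNK_SIZE = 16 * 1024   # 16 KB chunks, as quoted from the article
FILE_SIZE = 1024 ** 3    # one gigabyte, taken here as 2**30 bytes
DIGEST_SIZE = 20         # assumption: one 160-bit digest per chunk

chunks = FILE_SIZE // CHUNK_SIZE
hash_bytes = chunks * DIGEST_SIZE
overhead = hash_bytes / FILE_SIZE

print(chunks)             # 65536 (the article rounds this to 64,000)
print(hash_bytes)         # 1310720 bytes, i.e. 1.25 MiB of hashes
print(f"{overhead:.4%}")  # 0.1221%
```

Under these assumptions the hash list itself is small next to the file; the open question raised here is the number of lookups, not the bytes.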
Container-aware P2P (Score:2)
Should work just fine... (Score:2, Informative)
This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.
Assuming that the hashing scheme being used is "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.
Should work just fine.
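Roughly, the cluster-matching idea looks like the sketch below. It assumes fixed-size chunks and SHA-1, and the helper names (`chunk_hashes`, `build_index`, `sources_for`) are hypothetical; a real system would need some distributed lookup instead of an in-memory dictionary.

```python
import hashlib
from collections import defaultdict

CHUNK_SIZE = 16 * 1024  # 16 KB, the chunk size quoted from the article


def chunk_hashes(data, chunk_size=CHUNK_SIZE):
    """Hash each fixed-size chunk of a file."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]


def build_index(shares, chunk_size=CHUNK_SIZE):
    """Map chunk hash -> set of peers holding a chunk with that hash."""
    index = defaultdict(set)
    for peer, data in shares.items():
        for h in chunk_hashes(data, chunk_size):
            index[h].add(peer)
    return index


def sources_for(wanted_hashes, index):
    """For each wanted chunk, list the peers that can supply it."""
    return [sorted(index.get(h, ())) for h in wanted_hashes]


# Toy data with 4-byte chunks: the two shares differ only in the first chunk.
target = b"HDR1" + b"AAAA" + b"BBBB"
shares = {"peer_exact": target, "peer_similar": b"HDR2" + b"AAAA" + b"BBBB"}
plan = sources_for(chunk_hashes(target, 4), build_index(shares, 4))
print(plan)
```

Note that fixed-size chunks only line up when the differing region has the same length in both files; an insertion that shifts the data would defeat this simple version.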
Its all just ones and zeros (Score:2)
Shows the real trouble with P2P (Score:2)
This is just an illustration of the fact that P2P is an incredibly inefficient way of transferring files around. Most of the material is not only pirated, but a big fraction of the pirated material is the same stuff. P2P "peers" aren't necessarily nearby, either in a physical or bandwidth sense. So huge amounts of bandwidth are being spent shipping the same stuff around.
If it weren't for the piracy issue, the daily output of the RIAA, which is a few gigabytes, could be distributed efficiently by putti
Blurring the Line Between Data (Score:3, Interesting)
In essence, if you were downloading a music CD, the data chunks could be coming from someone who has an Ubuntu ISO image, or some type of copyrighted material.
With this system it essentially becomes impossible to tell who is uploading what.
Re: (Score:3, Interesting)
That being said, about 90% of the commenters missed the point completely, though it is somewhat understandable given how vague and nontechnical the linked article and summa
Could really work! (Score:4, Funny)
Already been done a long time ago (Score:3, Informative)
Unfortunately, there are some issues with it:
-Only Shareaza supports it, other clients didn't want to play along.
-Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
-Only mp3 is supported, so no ogg, aac or wma
So this paper isn't as revolutionary (if that's what they mean).
This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.
An interesting licensing issue (Score:4, Interesting)
I tried it (Score:3, Funny)
Don't see how it could work on a large scale. (Score:3, Interesting)
How do you know a file is similar? By hashing? There's no guarantee that a particular chunk of a file with an md5 hash (for example) contains the same bytes as that of another file.
There are 2^256 possible chunks of 256 bits of data. There are 2^16 possible hashes (using a 16-bit binary key) for that same data. That means that for every hash match, the data has a 1 in 16 chance of actually matching.
You can extend the key length to reduce this ratio, but you'll end up with a key length equal to your data size before you're sure the data is not a collision.
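The counting can be worked through directly. This is a rough sketch that assumes the hash behaves like a uniformly random function, which the comment does not state; the 160-bit digest and the 2^40 chunk count are illustrative figures, not numbers from the article.

```python
from fractions import Fraction


def chunks_per_hash(chunk_bits, hash_bits):
    """Pigeonhole count: how many possible chunks share each hash value."""
    return 2 ** (chunk_bits - hash_bits)


def pair_collision_probability(hash_bits):
    """Chance that two different chunks get the same hash (uniform-hash assumption)."""
    return Fraction(1, 2 ** hash_bits)


def expected_colliding_pairs(num_chunks, hash_bits):
    """Expected number of colliding pairs among num_chunks distinct chunks."""
    pairs = num_chunks * (num_chunks - 1) // 2
    return Fraction(pairs, 2 ** hash_bits)


# Toy parameters from the comment: 256-bit chunks, 16-bit hash.
print(chunks_per_hash(256, 16) == 2 ** 240)
print(pair_collision_probability(16))
# Illustrative scale: 2**40 chunks in the network, 160-bit digests.
print(expected_colliding_pairs(2 ** 40, 160) < Fraction(1, 2 ** 80))
```

With the toy numbers, each 16-bit hash value is shared by 2^240 possible chunks, and two unrelated chunks collide with probability 1 in 65,536. At digest lengths of 128 bits or more, the expected number of accidental collisions stays negligible under the same assumption, even though collisions must exist in principle.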
The problem gets worse if the chunks of data aren't equal in size.
This can only work if you have a centralized database of every possible file combination on your network. It's workable for a small number of files, but will grow exponentially in a real environment. Not to mention, the centralized database would have to handle a significant amount of traffic, reducing the speed gains possible.
Count me as skeptical.
perhaps 100% similar (Score:3, Interesting)
Similarity detection rules.
Re: (Score:3, Informative)
mod this up -- angio...this is the problem (Score:3, Insightful)
A 200MB, 30min video that was compressed at 1000kbps DiVX is not the "same file with minor changes" as a 200MB, 30min video that was compressed at 900kbps DiVX. They ARE different files and should be treated as such. You also can't deduce anything from their filenames, play length, or any other characteristic so how would you determine which ones can go together and which ones can't? I did not see codecs or compression mentioned at al
Re: (Score:3, Interesting)
Re:Right.... (Score:5, Informative)
(Yes, I'm one of the authors.)
Parent
Problem with variable insertions? (Score:3, Interesting)
Re:Problem with variable insertions? (Score:5, Informative)
Parent
It will still be WORSE. (Score:3, Informative)
I grab an mp3 from person A. I then clean up the tag and rename it to suit me.
You want to download that same song with a different name and different tag.
You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.
Using the new model, you could also connect to person A and myself and download the blocks that are the same.
So instead of only...
99 people in the swarm and 1 seeder
you'd
Re:Right.... (Score:4, Informative)
Parent
Re: (Score:3, Interesting)
I disagree, this could be done with No Change to any protocol, client, or server: the only thing that needs to be created is a torrent creator located on a machine that had a full version of every "similar" torrent to be shared. The torrents would all be linked read-only to the same DVD image (for example); only part of that DVD image would be labeled as "downloaded" and you would put the
Re: (Score:3, Informative)
It's already been addressed: files encoded with different codecs, bitrates, compression ratios, what-have-you look completely different, have vastly different checksums, and even if named exactly the same and with the exact same file size, would never be confused for each other by any algorithm that's comparing what's actually IN the files.
-l
Re: (Score:2)
However, there is no
It gets worse. (Score:3)
To paraphrase Morbo: "DOW
Re: (Score:3, Funny)
Re:Snakeoil (Score:4, Funny)
May I suggest you don't open your e-mail and refrain from answering the phone for today? I usually fill up my bullshit quota with those two media alone. Slashdot is just the icing on the cake. ;)
Parent
Re: (Score:2, Insightful)
Okay, I'll admit that there's a few MP3s that have different ID3 tags but the actual audio is the same. A few. The large majority of duplicate songs are NOT the same audio data. It's been re-ripped, transcoded, or some other horrid thing done to it and is not the same data anymore.
Now, even assuming that there ARE tons of very-alike files out there, you'd have to write an intelligent comparer for each one so that it knew how to deal with the file and what informati
Re: (Score:3, Interesting)
It's more than a few. Most people use the default settings in their audio ripper/compression program, and it's all from the same CD. Even more people never use an audio ripper and/or compressor, and simply download the file from the Internet. Not that many people bother to change ID3 tags either, but every single person who does produces a different file.
Re:This could work for some files, but not for oth (Score:3, Interesting)
But the article makes it sound like their custom software breaks up media files into their component streams, which clients can download separately as they desire.
Re:This could work for some files, but not for oth (Score:3, Interesting)