Catch up on stories from the past week (and beyond) at the Slashdot story archive

 



Forgot your password?
typodupeerror
×
Software

Faster P2P By Matching Similiar Files? 222

Andreaskem writes "A Carnegie Mellon University computer scientist says transferring large data files, such as movies and music, over the Internet could be sped up significantly if peer-to-peer (P2P) file-sharing services were configured to share not only identical files, but also similar files. "SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source. Even then, downloads can be slow because these networks can't find enough sources to use all of a receiver's download bandwidth. That's why SET takes the additional step of identifying files that are similar to the desired file... No one knows the degree of similarity between data files stored in computers around the world, but analyses suggest the types of files most commonly shared are likely to contain a number of similar elements. Many music files, for instance, may differ only in the artist-and-title headers, but are otherwise 99 percent similar.""
This discussion has been archived. No new comments can be posted.

Faster P2P By Matching Similiar Files?

Comments Filter:
  • Summary: (Score:5, Informative)

    by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Wednesday April 11, 2007 @11:53AM (#18690153) Homepage
    instead of sharing files, divide them into 16KB chunks and share those, to help work around files that get renamed or trivially altered (eg a website tagging their url to all the files you upload).
  • Comment removed (Score:5, Informative)

    by account_deleted ( 4530225 ) on Wednesday April 11, 2007 @11:54AM (#18690193)
    Comment removed based on user account deletion
  • Re:Right.... (Score:3, Informative)

    by Icarus1919 ( 802533 ) on Wednesday April 11, 2007 @11:55AM (#18690209)
    Ok, perhaps you're not certain how files work. But things compressed with different codecs and bitrates look VERY different when you actually look at the coding in the file as opposed to the same file named differently or with minor changes.
  • by Anonymous Coward on Wednesday April 11, 2007 @11:58AM (#18690257)
    I think everyone posting above saying "it won't work" should RTFA.

    This works by breaking files down into clusters and hashing the clusters (like Bittorrent already does). Then it searches for other shares that have clusters with the same hash value, and requests them.

    Assuming that the hashing scheme being used it "good" in that there are no collisions, two clusters with the same hash will contain the *exact same* information.

    Should work just fine.
  • Re:Right.... (Score:5, Informative)

    by angio ( 33504 ) on Wednesday April 11, 2007 @12:01PM (#18690311) Homepage
    Take a peek at the paper [cmu.edu] - it actually does work, and we demonstrated it. The intuition: people make small changes to files like changing the artist or title in the MP3 header, and then BitTorrent and other systems treat this as a "different" file, when in fact it's 99.9% similar.
    (Yes, I'm one of the authors.)
  • Re:TorrentSoup (Score:5, Informative)

    by joe_cot ( 1011355 ) on Wednesday April 11, 2007 @12:06PM (#18690417) Homepage
    It would still work the same way as it does now: an md5 of each specific block, and an md5 of the whole thing. If the md5 for the block doesn't match, it's not going to download, and if it's someone using collision to inject a block with the same md5, 1) it's not going to pass the md5 on the whole thing, 2) you're already vulnerable to it. The reason this will work is that they'll be lots of people sharing incomplete or corrupted versions of your FreeBSD iso; you'll get the blocks that are good, and skip the blocks that aren't, making "similar" files very useful. Not too difficult to understand, and no need for tin foil hats.
  • Re:Nickelback? (Score:1, Informative)

    by Anonymous Coward on Wednesday April 11, 2007 @12:20PM (#18690647)
    Shareaza already discards ID3 tags when hashing files. (It's an option if it's off by default.)
  • Re:Nickelback? (Score:4, Informative)

    by Robotech_Master ( 14247 ) * on Wednesday April 11, 2007 @12:46PM (#18691075) Homepage Journal
    If I understand the article right, SET looks at individual files within a particular download. DHT just looks at the whole download.

    For instance, if I'm uploading my "Songs I Like to Dance To" mp3 mix, and someone else is uploading an "All-Time Greatest Dance Hits" CD rip, and there are a couple of songs both uploads have in common, SET would enable someone downloading my MP3 mix to treat the CD rip as a partial seed (and vice versa), and pull down the songs held in common from either one.

    Whereas DHT would simply enable people to pull down my mix from other people uploading the mix, or the CD rip from other people uploading the CD rip, even if the tracker was down. (If I understand what DHT does correctly. Which it is possible I don't.)
  • Re:TorrentSoup (Score:4, Informative)

    by CastrTroy ( 595695 ) on Wednesday April 11, 2007 @12:58PM (#18691255)
    I'm not sure if this would work if you changed the byte offset though. Sure both ISOs may contain a lot of the same data, but I think it's very unlikely that the data would be at the same byte-offset in the file. I don't think that you'd be able to accomplish this for different byte offsets, because for a 100 MB File, assuming 5 MB chunks, You're looking at about 2,000,000,000 chunks to calculate (20 chunks, calculated at each byte offset).
  • by J0nne ( 924579 ) on Wednesday April 11, 2007 @01:13PM (#18691459)
    Shareaza has been doing this for years. When hashing MP3 files, it disregards what(s in the id3 tags, and just computes a hash for the audio information. This means that files with different id3 tags will still be added to the swarm, whicj is great.
    Unfortunately, there are some issues with it:
    -Only Shareaza supports it, other clients didn't want to play along.
    -Shareaza has/d a bug where it would fail to reconstruct the id3 tag after downloading, giving you files with empty tags
    -Only mp3 is supported, so no ogg, aac or wma

    So this paper isn't as revolutionary (if that's what they mean).

    This will only work with identical files that have metadata that is frequently changed by end-users, because there's no way you're going to be able to get a good file if you try to mix a cam with a dvdrip, or an ogg with an mp3, or an xvid file with a divx file. It just doesn't work that way.
  • Re:Nickelback? (Score:3, Informative)

    by thepotoo ( 829391 ) <thepotoospam@@@yahoo...com> on Wednesday April 11, 2007 @01:20PM (#18691571)
    Actually, DHT doesn't care about file names. You can test this yourself if you have a LAN. Grab a torrent, start downloading it on one computer (save as a different filename), get a little of the file downloaded, and start the same torrent on a different computer. Use your firewall to block the second computer's access to the tracker. DHT will kick in, your computers will log in, get each other's IPs, and computer 2 will get an insanely fast download speed until it catches up with computer 1.
    Tested on uTorrent 1.6.0 (old version) on my and my roomates computers. Incidentally, the process isn't any faster than downloading the file on one computer, and copying it over to the other one afterwards.

    You are correct about the tags, though.

  • by angio ( 33504 ) on Wednesday April 11, 2007 @01:38PM (#18691895) Homepage
    We define chunk boundaries using Rabin fingerprinting. It's a cute trick - not one of our own invention - that is relatively insensitive to insertions and deletions. It was used in some of the other work in this area, such as the Low Bandwidth File System (LBFS). There's a family of work in this area called "shingling" that can also apply to sequence similarity.
  • by khasim ( 1285 ) <brandioch.conner@gmail.com> on Wednesday April 11, 2007 @01:39PM (#18691901)
    Let's go with a fairly vanilla scenario:

    I grab an mp3 from person A. I then clean up the tag and rename it to suit me.

    You want to download that same song with a different name and different tag.

    You connect to person B sharing it. If you're using BitTorrent, you can also connect to any of 99 other people trying to download it from person B.

    Using the new model, you could also connect to person A and myself and download the blocks that are the same.

    So instead of only...
    99 people in the swarm and 1 seeder
    you'd have 99 people in the swarm, 1 seeder, person A and myself.

    But in order to FIND person A and myself, you'd have to go through A MILLION OTHER PEOPLE to find if they have any blocks that you are looking for.

    The CRITICAL PART THAT THEY LEFT OUT is the amount of bandwidth you'd be using to search A MILLION unrelated systems with unrelated files looking for those blocks.

    This works in their lab because they have very few machines with very few files and they've already pre-loaded those machines with the files they want to be found.
  • Re:Right.... (Score:4, Informative)

    by angio ( 33504 ) on Wednesday April 11, 2007 @01:41PM (#18691943) Homepage
    Similar in spirit - except rsync looks at files on your local hard drive by the same name, so there's only one possible candidate to draw from. SET looks at all of the files that everyone else is currently downloading, so we had to develop a much more efficient technique for locating useful files.
  • Re:TorrentSoup (Score:1, Informative)

    by Anonymous Coward on Wednesday April 11, 2007 @02:39PM (#18692857)
    Your 1) is not true. MD5 has the property that:

    If H(x) = H(y)

    H(x+z) = H(y+z)

    !
  • Re:Nickelback? (Score:3, Informative)

    by evilviper ( 135110 ) on Wednesday April 11, 2007 @02:49PM (#18692997) Journal

    Some P2P protocols allow looking up a file by a hash which does not take filename into account,

    By "Some", you mean "Every Single Frickin' One Of Them", right?

    nearly all existing P2P protocols will treat the new file as a completely different file,

    No. Only the most brain-dead P2P protocols will. "tree" hashes are in use by several P2P protocols. Some are just old or primitive, and have a large number of old servants around that don't understand newer hashes.

  • Huh? (Score:2, Informative)

    by sofla ( 969715 ) on Wednesday April 11, 2007 @03:47PM (#18693839)

    SET speeds up data transfers by simultaneously downloading different chunks of a desired data file from multiple sources, rather than downloading an entire file from one slow source.

    So does ed2k, Bittorrent, ..., ..., .... this is hardly news. Even plain ol' FTP and HTTP can do this to a degree.

    As far as the 99.9% similar "speedup"... I seriously doubt that you'll see any gains other than in lab conditions. MP3 is about the only format that might be agreeable to this, since I imagine its reasonably common for people to fix ID3's and then share the modified file. I just don't see it happening for other formats (.avi, .zip, .rar). And unless you've still got a 14.4 modem I doubt you'll notice the speedup even with MP3's, since they are so small to begin with.

  • by Loligo ( 12021 ) on Wednesday April 11, 2007 @07:29PM (#18696433) Homepage

    It's already been addressed: files encoded with different codecs, bitrates, compression ratios, what-have-you look completely different, have vastly different checksums, and even if named exactly the same and with the exact same file size, would never be confused for each other by any algorithm that's comparing what's actually IN the files.

      -l

"A car is just a big purse on wheels." -- Johanna Reynolds

Working...