MD5 of Song Data: Finding Duplicate MP3's


Journal by MyHair on Wednesday December 31, 2003 @01:10PM

Update: I started this project but still can't strip the ID3 tags temporarily. I'll put the details and code snippets in a post.

Update 2: I more or less gave up on stripping the ID3 tags for now since it will take too much work. However, all ID3v1 tags are at the end of the file. ID3v2 tags can be, and usually are, at the beginning of the file, but they are far less common in my collection, so I used the following command to get the MD5 sum of the first 512 KiB of each file. It will miss identical songs that have had ID3v2 tags added or altered at the beginning of the file, but I was able to delete another 1.2 gigs of duplicate MP3s with this test:

head -c 512k <mp3 file> | md5sum -b

md5sum in this command returns a filename of "-", but I used some shell trickery to figure out which sum went with which file.
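A minimal sketch of one way to pair each sum with its file name, assuming GNU coreutils (head, md5sum, cut). The function name partial_md5 is made up for illustration, and this is not necessarily the trickery used above:

```shell
#!/bin/sh
# partial_md5 FILE...: print "<md5>  <path>" for the first 512 KiB of each file.
# (partial_md5 is a hypothetical helper name, not from the original post.)
partial_md5() {
    for f in "$@"; do
        # md5sum names its input "-" when reading stdin, so keep only the
        # hash field and print the real path next to it ourselves.
        sum=$(head -c 524288 < "$f" | md5sum -b | cut -d' ' -f1)
        printf '%s  %s\n' "$sum" "$f"
    done
}
```

Piping the output through "sort | uniq -w32 -D" (GNU uniq) should then list only the lines whose first 32 characters, the hash, occur more than once.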

***************

Foreword: Really my only question here is "how do I calculate the MD5 sum of MP3 song data only--that is, not including ID3v1 or ID3v2 tags?" The rest is just chatting about what I'm doing.

I'm a network admin who thinks he knows enough about programming to pull this off. (I had several programming courses in college and occasionally play around with one language or another but never had a developer job or a serious project.)

I have disorganized, reorganized and relabeled MP3 files. I want to find duplicates. Here's my idea to accomplish this:

I'm going to copy information from each MP3 into a database, either directly or through an intermediate delimited text file. The major info will be the path & file name, the MD5 sum of the whole file, and the MD5 sum of the song data only (minus the metadata/ID3 tags). I'll probably also try to get the MD5 sum of the first X seconds or X KiB of song data to detect duplicate beginnings (in case of a truncated duplicate). Secondary info will include all the other stuff like file size, song length and ID3 tag data; this info will just be for me to locate duplicates that aren't byte-exact and to help me decide which of the duplicates to delete.
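As a rough illustration of the intermediate delimited file, here is a sketch that emits one tab-separated record per MP3. The column layout (path, size in bytes, whole-file MD5) and the name catalog_line are assumptions made for this example, and it assumes GNU md5sum:

```shell
#!/bin/sh
# catalog_line FILE: print "path<TAB>size<TAB>md5-of-whole-file".
# (catalog_line and this column order are hypothetical choices.)
catalog_line() {
    f=$1
    # $(( ... )) strips any padding some wc implementations add.
    size=$(($(wc -c < "$f")))
    # Read from stdin so md5sum never has to print or escape the file name.
    full=$(md5sum -b < "$f" | cut -d' ' -f1)
    printf '%s\t%s\t%s\n' "$f" "$size" "$full"
}
```

Something like "for f in *.mp3; do catalog_line "$f"; done > catalog.tsv" would build the file. Paths containing tabs or newlines would break this simple format.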

I could probably do this with a shell script and a couple of utility programs, but it may be simpler to grab a couple of modules from CPAN and use Perl. I don't anticipate having trouble with the metadata or MD5 sums of the entire file, but I don't yet see an easy way to calculate the MD5 sum of only the song portion of the MP3. I'm browsing through CPAN, and there are tons of modules which read and edit ID3 tags but nothing quite like what I have in mind. I did find one that strips the ID3 tags, but it seems to alter the file directly, and I only want to have the tags stripped just long enough to get the MD5 sum.

Sure, I could read the data structure of the MP3 file and parse it myself, but I intuit that somebody has made a Perl module or command line utility that can nondestructively present me the tag-stripped MP3 so I can MD5 it and leave the original file unchanged. Anybody know of such a module or utility?
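For what it's worth, parsing just enough of the file to skip the tags is not much code. The sketch below is not an existing utility. It is a hand-rolled approximation that assumes a single ID3v2 tag at the very start (a 10-byte header whose last four bytes hold a "synchsafe" size, 7 bits per byte) and an optional 128-byte ID3v1 tag at the end beginning with "TAG". It ignores other tag formats such as APEv2 or Lyrics3, and the name audio_md5 is made up:

```shell
#!/bin/sh
# audio_md5 FILE: print the MD5 of FILE with a leading ID3v2 tag and a
# trailing ID3v1 tag skipped. The file itself is never modified.
# (audio_md5 is a hypothetical helper name.)
audio_md5() {
    f=$1
    size=$(($(wc -c < "$f")))
    start=0
    end=$size

    # ID3v2 header: "ID3" (73 68 51), 2 version bytes, 1 flags byte,
    # then 4 size bytes of which only the low 7 bits each are used.
    set -- $(head -c 10 < "$f" | od -An -v -tu1)
    if [ "$#" -ge 10 ] && [ "$1 $2 $3" = "73 68 51" ]; then
        start=$((10 + ((($7 * 128 + $8) * 128 + $9) * 128 + ${10})))
        # Flag bit 0x10 (ID3v2.4) means a 10-byte footer follows the tag.
        if [ $(($6 & 16)) -ne 0 ]; then
            start=$((start + 10))
        fi
    fi

    # ID3v1: the last 128 bytes begin with "TAG" (84 65 71).
    if [ $((end - start)) -ge 128 ]; then
        set -- $(tail -c 128 < "$f" | od -An -v -tu1 -N3)
        if [ "$1 $2 $3" = "84 65 71" ]; then
            end=$((end - 128))
        fi
    fi

    # Guard against a corrupt header claiming a tag larger than the file.
    if [ "$end" -lt "$start" ]; then
        end=$start
    fi

    # Hash only the bytes between the two tags.
    tail -c "+$((start + 1))" < "$f" | head -c "$((end - start))" \
        | md5sum | cut -d' ' -f1
}
```

Two copies of a song that differ only in these tags should then produce the same sum. Audio that happens to have "TAG" exactly 128 bytes from the end would be misread as tagged, which is unlikely but possible.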

My app won't be pretty or user friendly. It will probably take a path as input and either output a delimited file or update a MySQL or PostgreSQL database directly. Then I will query the database to find what I need. I'm not only finding duplicates; I'll also be looking for files that exist on only one storage device, because I have most of my MP3s on both my main PC and my MP3 player (w/20G HDD), but some are on one device and not the other, not to mention some on my Linux box and some on my work PC. I can import from all sources into different tables and play with queries and eventually consolidate and catalog my MP3 collection.
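Before any database is involved, the "on one device but not the other" question can be approximated with awk. This swaps in a flat-file comparison for the SQL queries described above, and it assumes each device has a tab-delimited catalog with the MD5 sum in the third column. That layout and the name only_in_first are hypothetical:

```shell
#!/bin/sh
# only_in_first A.tsv B.tsv: print the records of A whose hash (column 3)
# does not appear anywhere in B.
# (only_in_first and the column layout are assumptions for illustration.)
only_in_first() {
    # awk reads B first and remembers its hashes, then filters A against them.
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$3] = 1; next }
        !($3 in seen)
    ' "$2" "$1"
}
```

Running it once in each direction gives the two one-sided lists. Comparing by hash rather than by path means renamed copies still count as the same song.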

I know there are various apps that try to do some of these things, but I like working at the command line and having my hands on the raw data; every time I try to deal with an MP3 manager application I hate it.

EDIT: Yeah, I could just delete all my MP3's and re-rip them, but what fun would that be? Mucking about with Perl and SQL sounds so much more fun.
