6

How can I create a checksum of only the media data, without the metadata, to get a stable identifier for a media file? Preferably a cross-platform approach using a library that supports many formats, e.g. VLC, ffmpeg or mplayer.

(Media files would be audio and video in common formats; images would be nice to have too.)

yawniek
  • I think you are looking for some fingerprinting algorithm... Which would be interesting as it can be used to identify similar media, too. Looking forward to answers, bumping. +1 – hurikhan77 Mar 07 '10 at 09:51
  • Fingerprinting is interesting too; there is e.g. libofa [1] for audio (which I could not compile on OS X despite patches), but I want something more generic that identifies duplicate files, not duplicate songs/movies. [1] http://code.google.com/p/musicip-libofa/ – yawniek Mar 07 '10 at 09:55
  • by "without the tags" do you mean "without the metadata"? if so, saying "media data" may confuse things. – cregox Apr 08 '10 at 19:48

2 Answers

3

I don't know of any existing platform-independent software that will accomplish this, but I do know a way it could be done in a platform-independent language such as Java.

Essentially, we simply need to strip any metadata (tags) from the file, demultiplexing video files beforehand. Theoretically after demux and removing metadata, one could hash the file and compare against another file that has undergone the same process to match identical files despite having different tags. Unlike a fingerprint, this would not identify similar songs/movies but identical files (imagine you might want the 10 different versions or bitrates of a given song you've archived, but don't want 2 identical copies of any of them floating around).
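As a rough illustration of the strip-then-hash idea for the simplest case, here is a minimal Python sketch that drops a leading ID3v2 tag and a trailing ID3v1 tag from an MP3 byte string and hashes what is left. The function names `strip_id3` and `media_checksum` are made up for this example, and it deliberately ignores other tag formats (APEv2, Lyrics3) and does no demuxing, so treat it as a starting point rather than a complete solution.

```python
import hashlib


def strip_id3(data: bytes) -> bytes:
    """Return data without a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)

    # ID3v2: 10-byte header "ID3", version (2 bytes), flags (1 byte),
    # then a 4-byte "synchsafe" size (7 significant bits per byte).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for b in data[6:10]:
            size = (size << 7) | (b & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag adds another 10 bytes
            start += 10
        start = min(start, end)

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128

    return data[start:end]


def media_checksum(data: bytes) -> str:
    """SHA-1 of the payload that remains after stripping ID3 tags."""
    return hashlib.sha1(strip_id3(data)).hexdigest()
```

With this, two copies of the same MP3 that differ only in their ID3 tags should produce the same digest, e.g. `media_checksum(open("snd.mp3", "rb").read())`.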

The most troubling part of this is removing the tags: there are many different tag format specifications, and applications do not necessarily implement them the same way. The same audio file given identical tags by two different applications may not result in identical output files. That alone is harmless once the tags are stripped. The only way it could be fatal to the concept of an audio-only checksum is if popular tagging software changes the binary audio portion of the file, or pads the audio in a non-standard way.

Taking a checksum is trivial, but off the top of my head I'm not aware of any platform-independent libraries to demux and detag MPEG files. I know that in *nix environments, mpgtx is a great command-line tool that could perform the demux and detag, but obviously that is not a platform-independent solution.

Maybe someone out there feels ambitious?

defines
  • This is the way to go. In the meantime I wrote a patch for ffmpeg to calculate SHA-1 hashes instead of an Adler-32 checksum. This essentially does the trick. If anyone would like to help me bring this into ffmpeg, that would be great. – yawniek Apr 30 '10 at 12:06
0

One possible solution I found uses VLC:

./VLC -I rc snd.mp3 :sout='#std{mux=raw,access=file,dst=-}' vlc://quit | sha1sum
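
A similar result may be possible with ffmpeg alone. The sketch below assumes a reasonably recent ffmpeg build that includes the `hash` muxer, and wraps it in Python so the digest can be used programmatically. The helper names (`build_hash_command`, `parse_hash_line`, `stream_checksum`) are hypothetical, and the default stream selection `0:a` (audio only) is an assumption chosen so that embedded cover art does not end up in the hash.

```python
import subprocess


def build_hash_command(path, algo="SHA256", streams=("0:a",)):
    """Build an ffmpeg command that hashes stream packets without re-encoding."""
    cmd = ["ffmpeg", "-v", "error", "-i", path]
    for spec in streams:
        cmd += ["-map", spec]  # e.g. "0:a" for audio, "0:V" for real video
    # "-c copy" passes the compressed packets through untouched; the hash
    # muxer then writes a single "ALGO=hexdigest" line to stdout ("-").
    cmd += ["-c", "copy", "-f", "hash", "-hash", algo, "-"]
    return cmd


def parse_hash_line(line):
    """Split an 'ALGO=hexdigest' line into (algo, hexdigest)."""
    algo, _, digest = line.strip().partition("=")
    return algo, digest


def stream_checksum(path, algo="SHA256", streams=("0:a",)):
    """Run ffmpeg (must be on PATH) and return the hex digest of the streams."""
    result = subprocess.run(
        build_hash_command(path, algo, streams),
        check=True, capture_output=True, text=True,
    )
    return parse_hash_line(result.stdout)[1]
```

For a video file one would probably pass `streams=("0:V", "0:a")`. Whether the digest stays stable across different ffmpeg versions or after remuxing into another container is something to verify on your own files.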
yawniek