
I am thinking of building an API that would let a program submit a "fingerprint" of an academic publication, match this against a database of articles from Open Access journals, and if found, send the user the canonical citation information. Initially this would be for a specific small research field, so it wouldn't necessarily need to deal with 20 million papers to be successful (even if the 1000 most commonly cited papers in the field were covered, that would be a huge boon for productivity and collaboration).

I wonder what library (ideally one that can interface with Ruby) would be best for doing this "fingerprinting". I've seen Lucene's fuzzy match, but that seems to work at the word level, whereas in this case we would probably want to submit a much larger subset of the document. The reason to do fuzzy matches is that some people might have a Word .doc preprint, some might have the final PDF, etc.

I really appreciate some of the ideas here. Googling for "perceptual hash" got me into a bunch of new material. I've tried to summarize many of my findings here.

It seems like SimHash (for example, the C implementation) would be the way to go, but I still need to experiment more.
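For my own notes, here is roughly what a SimHash fingerprint would look like in plain Ruby (just a sketch I put together to understand the algorithm, not the C implementation mentioned above; all names are mine):

require 'digest'

# Build a 64-bit SimHash: hash each word, let each word "vote" +1/-1 on every
# bit position, and keep the sign of each total as one fingerprint bit.
def simhash(text, bits = 64)
  counts = Array.new(bits, 0)
  text.downcase.scan(/\w+/).each do |word|
    h = Digest::MD5.hexdigest(word).to_i(16)
    bits.times { |i| counts[i] += (h[i] == 1 ? 1 : -1) } # Integer#[] reads bit i of h
  end
  counts.each_with_index.inject(0) { |acc, (c, i)| c > 0 ? acc | (1 << i) : acc }
end

# Near-duplicate documents (e.g. a preprint and the final PDF text) should give
# fingerprints with a small Hamming distance.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count('1')
end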

Stian Håklev
  • This is a post that could be interesting: http://stackoverflow.com/questions/8544583/designing-a-noise-filter-for-plagiarism-detection-engine-in-ruby – Michael Kohl Feb 14 '12 at 15:46
  • An alternative to providing a library is to detail an algorithm, although I doubt it would be fast enough in pure Ruby. One idea a friend mentioned was to use some kind of diff to see how "different" the text is from each of the texts in the database. With thousands of texts in the database this might not scale, but we could perhaps search on some of the words in the text to quickly reduce the set to match against to a manageable number (a rough sketch of that idea follows below). I tried different word-diffs, but didn't find any that robustly report the number of "differences"/transformations etc. without a lot of other output. – Stian Håklev Mar 06 '12 at 13:50
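A rough sketch of that prefiltering idea (names and the corpus structure are hypothetical, assuming an array of { id:, text: } hashes): narrow the candidates down by word overlap first, then run the more expensive comparison on only those.

require 'set'

def word_set(text)
  text.downcase.scan(/\w+/).to_set
end

# Keep only the `keep` corpus documents with the highest word overlap (Jaccard
# similarity) with the query; the finer-grained diff/hash comparison then only
# has to look at those.
def candidates(query_text, corpus, keep = 50)
  q = word_set(query_text)
  corpus.max_by(keep) do |doc|
    w = word_set(doc[:text])
    (q & w).size.to_f / (q | w).size
  end
end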

1 Answer


You can use pHash for this kind of job.

And this gem will help you get started:

require 'phash/text'
# `%` compares the text hashes of the two files
Phash::Text.new('first.txt') % Phash::Text.new('second.txt')
fl00r
  • This is very neat, the closest I've come to something useful. I see that the C library has a built-in data store which will let you submit hashes and then match new files against all submitted hashes. I don't see any interface to this through Ruby though, so I'm not sure how I would do this in practice. I'd also love more info about scalability (what if I want to match against 100k files, for example?). – Stian Håklev Mar 21 '12 at 19:01
  • I prefer to store hashes in Lucene and match pHashes by Levenshtein distance. Lucene can handle millions of hashes pretty fast. – fl00r Mar 21 '12 at 19:27
  • Interesting, could you provide a bit more detail? How do you retrieve the hashes for matching? How many do you calculate Levenshtein distance for? – Stian Håklev Mar 22 '12 at 20:26
  • In this particular gem, the `text_hash` method returns a hash for a file containing text. The hash is a sequence of `1`s and `0`s (`text_hash(file).to_s(2)` returns this sequence), so you store that bit sequence as a string in your data store. In Lucene you then send another bit sequence and it matches it against all stored sequences. Lucene can calculate Levenshtein distance out of the box, I believe, so we just wrote some rules defining what we call "similar hashes": if the distance is no more than 15%, we decide the files are pretty similar. We have about 15 million hashes. – fl00r Mar 23 '12 at 14:43
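To make that threshold concrete, here is a minimal sketch of the decision rule described in the comment above (the helper name is hypothetical): pad the two bit strings to the same length, count differing positions, and accept the pair when no more than 15% differ. For equal-length strings this position-by-position count is a simple stand-in for the Levenshtein distance fl00r mentions.

# bits_a and bits_b are bit strings such as text_hash(file).to_s(2)
def similar?(bits_a, bits_b, threshold = 0.15)
  len = [bits_a.length, bits_b.length].max
  a = bits_a.rjust(len, '0')   # pad with leading zeros so positions line up
  b = bits_b.rjust(len, '0')
  diffs = a.chars.zip(b.chars).count { |x, y| x != y }
  diffs.to_f / len <= threshold
end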