
I've been working with Double Metaphone and Caverphone2 for string comparisons, and they work well on things like names, addresses, etc. (Caverphone2 is working best for me). However, they produce far too many false positives on numeric values such as phone numbers, IP addresses, and credit card numbers.

So I've looked at the Luhn and Verhoeff algorithms, and they describe essentially what I want, but not quite. They seem good at validation, but they don't appear to be built for fuzzy matching. Is there anything that behaves like Luhn and Verhoeff, which could detect single-digit errors and transpositions of two adjacent digits, for encoding and comparison purposes similar to the fuzzy string algorithms?

I'd like to encode a number, then compare it to 100,000 other numbers to find close matches. So something like 7041234 would match 7041324 as a possible transcription error, but something like 4213704 would not.
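A distance measure that counts exactly those two error types (a single-digit change and a transposition of two adjacent digits) as one edit each is the optimal string alignment, or restricted Damerau-Levenshtein, distance. A rough Python sketch for illustration (the function name is just a placeholder):

```python
def osa_distance(a, b):
    """Optimal string alignment (restricted Damerau-Levenshtein) distance.

    Counts insertions, deletions, single-character substitutions, and
    transpositions of two adjacent characters as one edit each.
    """
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[len(a)][len(b)]

print(osa_distance("7041234", "7041324"))  # 1 -- a single adjacent transposition
print(osa_distance("7041234", "4213704"))  # much larger, so rejected with a threshold of 1
```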


1 Answer


Levenshtein and friends may be good for finding the distance between two specific strings or numbers. However, if you want to build a spelling corrector, you don't want to run through your entire word database at every query.
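For reference, the brute-force version is straightforward: a small dynamic program for the pairwise distance, applied to every stored entry. A rough Python sketch, where `stored_numbers` is a hypothetical list of your ~100,000 values:

```python
def levenshtein(a, b):
    """Classic Levenshtein distance via dynamic programming, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def brute_force_matches(query, stored_numbers, max_distance=1):
    """Naive search: compare the query against every stored number."""
    return [n for n in stored_numbers if levenshtein(query, n) <= max_distance]
```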

Peter Norvig wrote a very nice article on a simple "fuzzy matching" spelling corrector based on some of the technology behind Google's spelling suggestions.

If your dictionary has N entries, and the average word has length L, the "brute force Levenshtein" approach would take time O(N*L^2). Peter Norvig's approach instead generates all words within a certain edit distance of the input and looks them up in the dictionary. Hence it achieves roughly O(L^k), where k is the largest edit distance considered.
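A rough sketch of that idea, adapted to digit strings rather than words (this is an adaptation, not Norvig's code): generate every string within one edit of the query, including adjacent transpositions, and intersect the candidates with a set of stored numbers.

```python
DIGITS = "0123456789"

def edits1(number):
    """All strings one edit away from `number`: deletions, adjacent
    transpositions, substitutions, and insertions (digits only)."""
    splits = [(number[:i], number[i:]) for i in range(len(number) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + d + r[1:] for l, r in splits if r for d in DIGITS}
    inserts = {l + d + r for l, r in splits for d in DIGITS}
    return deletes | transposes | replaces | inserts

def known_matches(query, stored_numbers):
    """Look up the query and all of its single-edit variants in a set of
    stored numbers -- O(L) candidates per query instead of scanning all N."""
    candidates = {query} | edits1(query)
    return candidates & stored_numbers

# Example with a hypothetical set of stored numbers.
stored = {"7041234", "5551212", "4213704"}
print(known_matches("7041324", stored))  # {'7041234'} via one adjacent transposition
```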

Thomas Ahle
    Just wanted to say thank you for the answer. I plan to review the article, but for the moment, Daniel's answer above got me what I needed. – JeffG Jan 06 '12 at 14:54