
I have a question: can we normalize the Levenshtein edit distance by dividing the distance by the lengths of the two strings? I am asking because, if we compare two strings of unequal length, the difference between the lengths is counted as well. For example, ed('has a', 'has a ball') = 5 and ed('has a', 'has a ball that is round') = 19 (when one string is a prefix of the other, the distance is simply the difference in lengths). As the strings get longer, the edit distance grows even though they are similar, so I cannot set a threshold for what a good edit distance value should be.

Naufal Khalid

2 Answers


Yes, normalizing the edit distance is one way to put the differences between strings on a single scale from "identical" to "nothing in common".

A few things to consider:

  1. Whether or not the normalized distance is a better measure of similarity between strings depends on the application. If the question is "how likely is this word to be a misspelling of that word?", normalization is the way to go. If it's "how much has this document changed since the last version?", the raw edit distance may be a better option.
  2. If you want the result to be in the range [0, 1], you need to divide the distance by the maximum possible distance between two strings of the given lengths. That is, length(str1) + length(str2) for the LCS distance and max(length(str1), length(str2)) for the Levenshtein distance (see the sketch after this list).
  3. The normalized distance is not a metric, as it violates the triangle inequality.
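
To make point 2 concrete, here is a minimal C++ sketch of the max-length normalization (the function names `levenshtein` and `normalizedDistance` are illustrative, not from any particular library):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance with unit costs
// for insertion, deletion, and substitution.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({ prev[j] + 1,           // delete a[i-1]
                                 curr[j - 1] + 1,       // insert b[j-1]
                                 prev[j - 1] + cost }); // substitute
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Distance divided by the maximum possible distance for the two lengths,
// so the result lies in [0, 1]: 0 = identical, 1 = nothing in common.
double normalizedDistance(const std::string& a, const std::string& b) {
    const std::size_t len = std::max(a.size(), b.size());
    return len == 0 ? 0.0 : double(levenshtein(a, b)) / double(len);
}
```

Plugged into the question's example, ed('has a', 'has a ball') = 5 normalizes to 5/10 = 0.5, and identical strings always score 0.0 regardless of their length, so the threshold problem from the question goes away.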
Anton
  • What I want to do is rank the words on the basis of their edit distances, so in my case the normalized edit distance is better. If it violates the triangle inequality, is there a better way to do this? I was going through some papers, e.g. [Normalized edit distance](http://www.csie.ntu.edu.tw/~b93076/Computation%20of%20Normalized%20Edit%20Distance%20and%20Applications.pdf), but did not understand the algorithm used. Thanks! – Naufal Khalid Aug 22 '17 at 20:31
  • @NaufalKhalid Violation of the triangle inequality is not necessarily a problem, especially if you're only interested in pairwise differences (and not, say, the diameter of a set of strings). I would start from the normalized Levenshtein distance and only switch to something else if I ran into some specific problem. – Anton Aug 22 '17 at 20:56
  • @NaufalKhalid The paper you linked describes a different kind of normalization. While the distance divided by the length of the longest string can be roughly described as "mistake rate" (number of differences per character), the distance divided by the length of the edit path measures how serious the average mistake is. If all operations in your Levenshtein distance variation have the same cost, `W(P)/L(P)` will be the same for all non-identical strings. – Anton Aug 22 '17 at 20:56
  • Ok. I get it. Thank you! – Naufal Khalid Aug 23 '17 at 09:18
  • Why is this approach better than `fDist = float(len - levenshteinDistance(s1, s2)) / float(len);`, i.e. the other approach? It looks like this answer is saying the normalized Levenshtein distance is `levenshteinDistance(s1, s2) / max(s1.length(), s2.length())`? – Exploring Sep 29 '20 at 05:32

I used the following successfully:

const std::size_t len = std::max(s1.length(), s2.length());
// normalize by length; a higher score means more similar
const float fDist = float(len - levenshteinDistance(s1, s2)) / float(len);

Then choose the highest score; 1.0 means an exact match.
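
As a side note, this score is exactly 1 minus the max-length normalization from the accepted answer, so the snippet can be wrapped up as follows (a minimal sketch assuming a `levenshteinDistance` function such as the one sketched above; the `similarity` name is illustrative):

```cpp
#include <algorithm>
#include <string>

// Assumed to be provided, e.g. the DP implementation sketched earlier.
std::size_t levenshteinDistance(const std::string& s1, const std::string& s2);

// Similarity in [0, 1]: 1.0 is an exact match, 0.0 means nothing in common.
// Equivalent to 1 - levenshteinDistance(s1, s2) / max(|s1|, |s2|).
float similarity(const std::string& s1, const std::string& s2) {
    const std::size_t len = std::max(s1.length(), s2.length());
    if (len == 0) return 1.0f; // two empty strings are an exact match
    return float(len - levenshteinDistance(s1, s2)) / float(len);
}
```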

Martin