8

The function levenshtein in PHP works on strings with maximum length 255. What are good alternatives to compute a similarity score of sentences in PHP.

Basically I have a database of sentences, and I want to find approximate duplicates. similar_text function is not giving me expected results. What is the easiest way for me to detect similar sentences like below:

$ss="Jack is a very nice boy, isn't he?";
$pp="jack is a very nice boy is he";

$ss=strtolower($ss);  // convert to lower case as we dont care about case
$pp=strtolower($pp);

$score=similar_text($ss, $pp);
echo "$score %\n";  // Outputs just 29 %

$score=levenshtein ( $ss, $pp );
echo "$score\n";  // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(
anon
  • 81
  • 1
  • 4

3 Answers3

10

The levenshtein algorithm has a time complexity of O(n*m), where n and m are the lengths of the two input strings. This is pretty expensive and computing such a distance for long strings will take a long time.

For whole sentences, you might want to use a diff algorithm instead, see for example: Highlight the difference between two strings in PHP

Having said this, PHP also provides the similar_text function which has an even worse complexity (O(max(n,m)**3)) but seems to work on longer strings.

Community
  • 1
  • 1
Ferdinand Beyer
  • 58,119
  • 13
  • 142
  • 141
  • Thanks. Could you tell me if I am interpreting results of similar_text wrongly? I want to be able to detect similar sentences like in the example. – anon Feb 23 '11 at 15:36
  • `similar_text` returns the number of matching characters, not a percentage. There's a third optional parameter that can be used to find the percentage. – Spencer Hakim Feb 23 '11 at 15:51
  • Oh okay.. I thought the third parameter is just if you want to pass the result variable by reference :). – anon Feb 24 '11 at 05:17
3

I've found the Smith Waterman Gotoh to be the best algorithm for comparing sentences. More info in this answer. Here is the PHP code example:

class SmithWatermanGotoh 
{
    private $gapValue;
    private $substitution;

    /**
     * Constructs a new Smith Waterman metric.
     * 
     * @param gapValue
     *            a non-positive gap penalty
     * @param substitution
     *            a substitution function
     */
    public function __construct($gapValue=-0.5, 
                $substitution=null) 
    {
        if($gapValue > 0.0) throw new Exception("gapValue must be <= 0");
        //if(empty($substitution)) throw new Exception("substitution is required");
        if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0);
        else $this->substitution = $substitution;
        $this->gapValue = $gapValue;
    }

    public function compare($a, $b) 
    {
        if (empty($a) && empty($b)) {
            return 1.0;
        }

        if (empty($a) || empty($b)) {
            return 0.0;
        }

        $maxDistance = min(mb_strlen($a), mb_strlen($b))
                * max($this->substitution->max(), $this->gapValue);
        return $this->smithWatermanGotoh($a, $b) / $maxDistance;
    }

    private function smithWatermanGotoh($s, $t) 
    {   
        $v0 = [];
        $v1 = [];
        $t_len = mb_strlen($t);
        $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0));

        for ($j = 1; $j < $t_len; $j++) {
            $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue,
                    $this->substitution->compare($s, 0, $t, $j));

            $max = max($max, $v0[$j]);
        }

        // Find max
        for ($i = 1; $i < mb_strlen($s); $i++) {
            $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0));

            $max = max($max, $v1[0]);

            for ($j = 1; $j < $t_len; $j++) {
                $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue,
                        $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j));

                $max = max($max, $v1[$j]);
            }

            for ($j = 0; $j < $t_len; $j++) {
                $v0[$j] = $v1[$j];
            }
        }

        return $max;
    }
}

class SmithWatermanMatchMismatch
{
    private $matchValue;
    private $mismatchValue;

    /**
     * Constructs a new match-mismatch substitution function. When two
     * characters are equal a score of <code>matchValue</code> is assigned. In
     * case of a mismatch a score of <code>mismatchValue</code>. The
     * <code>matchValue</code> must be strictly greater then
     * <code>mismatchValue</code>
     * 
     * @param matchValue
     *            value when characters are equal
     * @param mismatchValue
     *            value when characters are not equal
     */
    public function __construct($matchValue, $mismatchValue) {
        if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > matchValue");

        $this->matchValue = $matchValue;
        $this->mismatchValue = $mismatchValue;
    }

    public function compare($a, $aIndex, $b, $bIndex) {
        return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue
                : $this->mismatchValue);
    }

    public function max() {
        return $this->matchValue;
    }

    public function min() {
        return $this->mismatchValue;
    }
}

$str1 = "Jack is a very nice boy, isn't he?";
$str2 = "jack is a very nice boy is he";
$o = new SmithWatermanGotoh();
echo $o->compare($str1, $str2);
Community
  • 1
  • 1
joshweir
  • 4,409
  • 2
  • 31
  • 51
2

You could try using similar_text.
It can get quite slow with 20,000+ characters (3-5 seconds) but your example you mention using only sentences, this will work just fine for that usage.

One thing to note is when comparing string of different sizes you will not get 100%. For example if you compare "he" with "head" you would only get a 50% match.

Shawn
  • 303
  • 3
  • 4
  • 12