I want code that returns the total length of all matching sequences in two strings. I wrote the following code, but it only counts one of them:

from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a,b):
    c = SequenceMatcher(None,a.lower(),b.lower()).get_matching_blocks()
    return sum( [c[i].size if c[i].size>1 else 0 for i in range(0,len(c)) ] )
print(similar(a,b))

The output is:

6

I expect it to be: 11

Mostafa Ghafoori

3 Answers

Printing each matching block shows where the 6 comes from:

from difflib import SequenceMatcher
a='Apple Banana'
b='Banana Apple'
def similar(a,b):
    c = SequenceMatcher(None,a.lower(),b.lower()).get_matching_blocks()
    for block in c:
        print("a[%d] and b[%d] match for %d elements" % block)
similar(a,b)  # the function prints the blocks itself and returns nothing

a[6] and b[0] match for 6 elements

a[12] and b[12] match for 0 elements

DevBot

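get_matching_blocks() first finds the longest contiguous matching subsequence, then only looks for further matches to the left and to the right of it, so every block it returns appears in the same order in both strings. Here the longest match is 'banana', with length 6. 'apple' comes before 'banana' in the first string but after it in the second, so it can never be reported as a second block. Hence the result is 6.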
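As a rough illustration of that ordering behaviour, the sketch below compares the block sizes for a pair of strings where the words keep their order with the swapped pair from the question. The helper name `block_sizes` and the 'Apple Kiwi Banana' string are made up for this example.

```python
from difflib import SequenceMatcher

def block_sizes(a, b):
    """Return the sizes of all non-empty matching blocks between a and b."""
    blocks = SequenceMatcher(None, a.lower(), b.lower()).get_matching_blocks()
    # The last block is always a zero-size dummy, so filter it out
    return [block.size for block in blocks if block.size > 0]

# Words in the same order: both 'apple' and ' banana' are reported
same_order = block_sizes('Apple Banana', 'Apple Kiwi Banana')

# Words swapped: only the longest block, 'banana', is reported
swapped = block_sizes('Apple Banana', 'Banana Apple')

print(same_order)
print(swapped)
```

When the order is preserved, both words show up as blocks. When it is swapped, only 'banana' survives.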

Try this instead:

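from difflib import SequenceMatcher

def similar(a,b):
    c = 'something' # Initialize this to anything to make the while loop condition pass for the first time
    total = 0 # Named total so the built-in sum() is not shadowed

    while(len(c) != 1):
        c = SequenceMatcher(lambda x: x == ' ',a.lower(),b.lower()).get_matching_blocks()

        sizes = [block.size for block in c]
        i = sizes.index(max(sizes)) # Index of the longest matching block
        total += max(sizes)

        # Cut the longest block out of both strings before matching again
        a = a[0:c[i].a] + a[c[i].a + c[i].size:]
        b = b[0:c[i].b] + b[c[i].b + c[i].size:]

    return total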

This "subtracts" the longest matching part from both strings and matches them again, until len(c) is 1. That happens when there are no more matches left, because get_matching_blocks() always ends its result with a zero-size dummy block.
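To see which pieces get subtracted on each pass, here is a minimal sketch of the same idea that records the removed substrings instead of only their lengths. It is a variant, not the code above: it calls `find_longest_match()` directly rather than picking the largest entry from `get_matching_blocks()`, and the name `matched_pieces` is invented for this example.

```python
from difflib import SequenceMatcher

def matched_pieces(a, b):
    """Repeatedly cut the longest common block out of both strings and collect it."""
    a, b = a.lower(), b.lower()
    pieces = []
    while True:
        matcher = SequenceMatcher(None, a, b)
        m = matcher.find_longest_match(0, len(a), 0, len(b))
        if m.size == 0:
            break  # nothing left in common
        pieces.append(a[m.a:m.a + m.size])
        # Remove the matched block from both strings, then match again
        a = a[:m.a] + a[m.a + m.size:]
        b = b[:m.b] + b[m.b + m.size:]
    return pieces

pieces = matched_pieces('AppleBanana', 'BananaApple')
print(pieces)
print(sum(len(p) for p in pieces))
```

For the space-free strings this collects 'banana' first and 'apple' second, which adds up to the 11 the question asks for.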

However, this script doesn't ignore spaces, even though the lambda marks them as junk. To ignore them, I used the suggestion from this other SO answer: preprocess the strings before you pass them to the function, like so:

a = 'Apple Banana'.replace(' ', '')
b = 'Banana Apple'.replace(' ', '')

You can include this part inside the function too.
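A small sketch of why the junk lambda alone does not seem to be enough: once 'banana' and 'apple' have been cut out of the original strings, a single space is left on each side, and difflib still reports it as a match because junk characters can be absorbed into a block. The helper name `is_space` is made up here, and the one-space strings stand in for that leftover.

```python
from difflib import SequenceMatcher

def is_space(ch):
    # Same junk rule as the lambda used in the answer above
    return ch == ' '

# Leftover after the real words are removed: one space on each side
blocks = SequenceMatcher(is_space, ' ', ' ').get_matching_blocks()
space_match_size = blocks[0].size

# With spaces stripped up front, only the zero-size dummy block remains
stripped_blocks = SequenceMatcher(
    is_space, ' '.replace(' ', ''), ' '.replace(' ', '')
).get_matching_blocks()

print(space_match_size)
print(len(stripped_blocks))
```

So the space would add to the total unless it is removed before matching.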

Antimony

I made a small change to your code and it works like a charm. Thanks, @Antimony!

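from difflib import SequenceMatcher

def similar(a,b):
    # Strip spaces first so they are never counted as matches
    a=a.replace(' ', '')
    b=b.replace(' ', '')

    c = 'something' # Initialize this to anything to make the while loop condition pass for the first time
    total = 0 # Named total so the built-in sum() is not shadowed
    while(len(c) != 1):
        c = SequenceMatcher(lambda x: x == ' ',a.lower(),b.lower()).get_matching_blocks()
        sizes = [block.size for block in c]
        i = sizes.index(max(sizes)) # Index of the longest matching block
        total += max(sizes)
        # Cut the longest block out of both strings before matching again
        a = a[0:c[i].a] + a[c[i].a + c[i].size:]
        b = b[0:c[i].b] + b[c[i].b + c[i].size:]
    return total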
Mostafa Ghafoori