3

I have 2 lists of potentially overlapping movie titles, but possibly written in a different form.
They are in 2 different dataframes from pandas. So I have tried to use the map() function with the fuzzywuzzy library like so:

df1.title.map(lambda x: process.extractOne(x, choices=df2.title, score_cutoff=95))

Which gives some good quality results. But the time complexity is such that I can only run it on a very small subset of both data frames. When I try incrementing the size of the data frames then it rapidly becomes unusable.

Then I tried to replace fuzzywuzzy with difflib. And it is much faster. But I can't get the results I want.

At first I tried:

df1.title.map(lambda x: difflib.get_close_matches(x, df2.title, n=1)

And that was fast but the quality of the results was poor. Even missing some simple uppercase / lowercase changes. Playing with the cutoff did not help.

So I thought I was using the wrong tool. In the docs and examples I saw get_close_matches used on single words. In titles there are various words.

Is SequenceMatcher a better choice?

And if yes, then how do I fit it into the map() so it does the same as the aforementioned functions: return only the best result, and only if the result is above a certain ratio?

Bastian
  • 4,813
  • 10
  • 39
  • 61

2 Answers2

1

To eliminate the possibility of low-score matches as a result of case-differences, I'd suggest applying .upper() or .lower() to the columns you're matching. After adjusting the case, you could compile a list of all titles into ThisList and apply the following function (relying, as you suggested, on SequenceMatcher) with a given tolerance.

def fuzzy_group_list_elements(ThisList,Tolerance):
    from difflib import SequenceMatcher
    Groups = {}
    TempList = ThisList.copy()
    for Elmt in TempList:
        if Elmt not in Groups.keys():
            Groups[Elmt] = []
        for OtherElmt in TempList:
            if SequenceMatcher(None,Elmt,OtherElmt).quick_ratio() > Tolerance:
                Groups[Elmt] = Groups[Elmt] + [OtherElmt]
                TempList.remove(OtherElmt)
    Groups[Elmt] = list(set(Groups[Elmt]))
    return dict((v,k) for k in Groups for v in Groups[k])

You can then apply the above function to the dataframe columns containing the movie titles:

Mapping = fuzzy_group_list_elements(ThisList,0.85)
df['Matched Title'] = df['Title'].replace(Mapping)
DrTRD
  • 1,381
  • 8
  • 17
1

I have written a Python package which aims to solve this problem. Amongst other things, it addresses the n^2 complexity of the problem (e.g. with two datasets of length 100, your code needs 10,000 comparisons).

You can install it using pip install fuzzymatcher

You can find the repo here and docs here.

Basic usage:

Given two dataframes df_left and df_right, which you want to fuzzy join, you can write the following:

from fuzzymatcher import link_table, left join

# Columns to match on from df_left
left_on = ["fname", "mname", "lname",  "dob"]

# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]

# The link table potentially contains several matches for each record
fuzzymatcher.link_table(df_left, df_right, left_on, right_on)

Or if you just want to link on the closest match:

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on, right_on)
RobinL
  • 9,064
  • 6
  • 39
  • 54
  • Much needed package. But it requires "Visual C++ Build Tools" which is 4 Gb on HDD. – user2978216 Jan 20 '18 at 12:46
  • Interesting - I guess you're on Windows? Do you already have pandas/numpy installed? Could you provide details of where/when you get the error message? – RobinL Jan 21 '18 at 08:47
  • I'm on Windows, pandas/numpy are installed. Installation attempt throws [error message](https://imgur.com/a/DXD5g) – user2978216 Jan 21 '18 at 11:58
  • Thanks, didn't realise the Levenstein package needed that. I've opened an issue to remind myself to get rid of that dependency in the next release. – RobinL Jan 21 '18 at 12:51
  • Note: this is now fixed in the latest code on Github – RobinL Oct 14 '19 at 08:57