I have 2 lists of potentially overlapping movie titles, but possibly written in a different form.
They are in 2 different dataframes from pandas. So I have tried to use the map()
function with the fuzzywuzzy
library like so:
df1.title.map(lambda x: process.extractOne(x, choices=df2.title, score_cutoff=95))
Which gives some good quality results. But the time complexity is such that I can only run it on a very small subset of both data frames. When I try incrementing the size of the data frames then it rapidly becomes unusable.
Then I tried to replace fuzzywuzzy
with difflib
. And it is much faster. But I can't get the results I want.
At first I tried:
df1.title.map(lambda x: difflib.get_close_matches(x, df2.title, n=1)
And that was fast but the quality of the results was poor. Even missing some simple uppercase / lowercase changes. Playing with the cutoff
did not help.
So I thought I was using the wrong tool. In the docs and examples I saw get_close_matches
used on single words. In titles there are various words.
Is SequenceMatcher
a better choice?
And if yes, then how do I fit it into the map()
so it does the same as the aforementioned functions: return only the best result, and only if the result is above a certain ratio?