0

I want to find the most similar value from a dataframe column to a specified string , e.g. a='book'. Let's say the dataframe looks like: df

col1
wijk 00 book
Wijk a 
test

Now I want to return wijk 00 book since this is the most similar to a. I am trying to do this with the fuzzywuzzy package.

Therefore, I have a dataframe A with the values I want to have a similar one for. Then I use:

A['similar_value'] = A.col1.apply(lambda x: [process.extract(x, df.col1, limit=1)][0][0][0])  

But when comparing a lot of strings, this takes too much time. Does anyone knows how to do this quickly?

baqm
  • 57
  • 4
  • How you define similarity here? – Zalak Bhalani Apr 26 '21 at 16:05
  • @ZalakBhalani the strings in the dataframe column should contain the string `a` – baqm Apr 26 '21 at 16:09
  • what's your current code with `fuzzywuzzy`? we can try to optimize that – tdy Apr 26 '21 at 16:10
  • I added my code – baqm Apr 26 '21 at 16:16
  • What is the process variable defined as? – Nate Rush Apr 26 '21 at 16:17
  • @baqm _"the strings in the dataframe column should contain the string `a`"_ - that is **not** a similarity check. You could do it with a simple `.contains` check - `df[df['col1'].str.contains("book")]`, [from this answer](https://stackoverflow.com/a/11531402/1431750). – aneroid Apr 26 '21 at 16:21
  • Does this answer your question? [Select by partial string from a pandas DataFrame](https://stackoverflow.com/questions/11350770/select-by-partial-string-from-a-pandas-dataframe) – aneroid Apr 26 '21 at 16:21

3 Answers3

1

I would use rapidfuzz:

from rapidfuzz import process, fuzz

df = pd.DataFrame(['wijk 00 book', 'Wijk a', 'test'], columns=['col1'])

search_str = 'book'
most_similar = process.extractOne(search_str, df['col1'], scorer=fuzz.WRatio)

Output:

most_similar
('wijk 00 book', 90.0, 0)

This gives you the most similar string in the column as well as a score for how similar it is to your search string.

0

You can use 'str.contains' method to get the string which exact substring

df[df["column_name"].str.contains("book")].values[0][0]
Zalak Bhalani
  • 484
  • 2
  • 10
0

Try fuzz.ratio() with Series.idxmax(). This will locate the col1 value with the highest fuzzy score against a:

from rapidfuzz import fuzz

a = 'book'
df.loc[df.col1.apply(lambda x: fuzz.ratio(x, a)).idxmax()]

# col1    wijk 00 book
# Name: 0, dtype: object

Update: The process.extractOne() method from @lolliesaurus is faster:

>>> %timeit process.extractOne(a, df.col1, scorer=fuzz.WRatio)
11.6 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

>>> %timeit df.loc[df.col1.apply(lambda x: fuzz.ratio(x, a)).idxmax()]
353 µs ± 3.45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
tdy
  • 11,122
  • 1
  • 8
  • 24