I'm trying to find the closest match between two lists of strings (listA and listB) to create a listC.
The reason is that I have to clean a dataframe with a column of strings, where each string represents a fruit and some entries have spelling mistakes that I need to fix.
The actual column that I want to fix is called test:
print(test)
Output:
0 lychee
1 strawberry
2 nectarine
3 lychee
4 lychee
5 banana
6 raspberry
7 loga!!n
....
37497 grape
37498 apple
37499 rockmelon
Name: fruit_ate, Length: 37500, dtype: object
Then I converted the test column into a list called newTest and created a list of fruits with the correct names:
newTest = list(test)
fruits = ['lychee',
'strawberry',
'nectarine',
'banana',
'raspberry',
'kiwi',
'apple',
'durian',
'pear',
'logan',
'jackfruit',
'grape',
'peach',
'watermelon',
'rockmelon',
'orange']
I wrote a for loop that goes through newTest, takes each element, and returns its closest match in the fruits list. However, I thought it would be easier to get the code working on a small list first, and then use it to fix the newTest list.
So I created listA and listB. I copied some values from the test column into listB, and I created listA with values from the fruits list.
The way I managed to do that was:
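import difflib as diff

listA = ['apple', 'banana', 'coco', 'grape', 'pear']
listB = ['ba88tana', 'peeaar', 'apple', 'ggra))pe']
listC = []
for i in listB:
    # keep only the single closest match (n=1) with a similarity of at least 0.5
    listC.append(diff.get_close_matches(i, fruits, n=1, cutoff=0.5))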
output: [['banana'], ['pear'], ['apple'], ['grape']]
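Since get_close_matches returns a list, the loop above produces a list of one-element lists. A minimal sketch of how it could be wrapped in a helper that returns a plain string instead; closest_fruit is a hypothetical name, and falling back to the original word when nothing clears the cutoff is an assumption about the desired behaviour:

```python
import difflib

fruits = ['lychee', 'strawberry', 'nectarine', 'banana', 'raspberry',
          'kiwi', 'apple', 'durian', 'pear', 'logan', 'jackfruit',
          'grape', 'peach', 'watermelon', 'rockmelon', 'orange']


def closest_fruit(word, choices=fruits, cutoff=0.5):
    """Hypothetical helper: closest entry in choices, or word itself if none is close enough."""
    matches = difflib.get_close_matches(word, choices, n=1, cutoff=cutoff)
    # get_close_matches returns a (possibly empty) list, so unwrap it here
    return matches[0] if matches else word


listB = ['ba88tana', 'peeaar', 'apple', 'ggra))pe']
listC = [closest_fruit(w) for w in listB]
print(listC)
```

With this, listC holds the matched names directly rather than nested lists.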
When I run this it works fine, but when I apply the same code to my newTest list and the fruits list it doesn't work. It fails with: TypeError: 'float' object is not iterable.
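A minimal sketch of one way this error can arise, assuming (this is not confirmed from the data shown above) that the column contains missing values. pandas stores those as NaN, which is a float, and difflib tries to iterate over the word it is given. The sample list and the shortened fruits list below are made up for illustration:

```python
import difflib

fruits = ['lychee', 'banana', 'logan', 'grape']  # shortened list for the example

# Assumption: one entry is a missing value, which appears as float NaN
sample = ['lychee', 'loga!!n', float('nan'), 'ggra))pe']

error_message = None
try:
    difflib.get_close_matches(sample[2], fruits, n=1, cutoff=0.5)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# One possible workaround: only match real strings, leave everything else as-is
cleaned = []
for value in sample:
    if isinstance(value, str):
        match = difflib.get_close_matches(value, fruits, n=1, cutoff=0.5)
        cleaned.append(match[0] if match else value)
    else:
        cleaned.append(value)  # non-string (e.g. NaN) is kept untouched
print(cleaned)
```

Whether skipping, dropping, or filling the non-string entries is the right choice depends on how the missing values should be treated in the cleaned column.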
If someone knows how to fix this, or knows another way I could do it, that would be very helpful.