
I'm trying to get the closest match between two lists of strings (listA and listB) to create a listC.

I need this because I have to clean a dataframe with one column of strings, where each string represents a fruit and some entries have spelling mistakes that I need to fix.

The actual column that I want to fix is called test:

print(test)

Output:

0             lychee
1         strawberry
2          nectarine
3             lychee
4             lychee
5             banana
6          raspberry
7            loga!!n
....
37497          grape
37498          apple
37499      rockmelon
Name: fruit_ate, Length: 37500, dtype: object

Then I converted the test column into a list called newTest and I created a list of fruits with the correct names:

newTest = list(test)

fruits = ['lychee',
      'strawberry',
      'nectarine',
      'banana',
      'raspberry',
      'kiwi',
      'apple',
      'durian',
      'pear',
      'logan',
      'jackfruit',
      'grape',
      'peach',
      'watermelon',
      'rockmelon',
      'orange']

I wrote a for loop that goes through newTest, takes each element, and returns its closest match in the fruits list. However, I thought it would be easier to get the code working on a small list first and only then use it to fix the newTest list.

So I created listA and listB. I copied some values from the test column into listB, and I filled listA with values from the fruits list.

The way I managed to do that was:

import difflib as diff

listA = ['apple', 'banana', 'coco', 'grape', 'pear']
listB = ['ba88tana', 'peeaar', 'apple', 'ggra))pe']
listC = []

for i in listB:
    # n=1 keeps only the best match; cutoff=0.5 is the minimum similarity ratio
    listC.append(diff.get_close_matches(i, fruits, n=1, cutoff=0.5))

output: [['banana'], ['pear'], ['apple'], ['grape']]

When I run this it works fine, but when I apply the same code to my newTest list and the fruits list it fails with: TypeError: 'float' object is not iterable.

If someone knows how to fix this, or knows another way to do it, that would be very helpful.

martineau

2 Answers


Without seeing the entire code, I would guess that newTest contains a float when you use it with your real data.

In other words, in the line:

listC.append(diff.get_close_matches(i, fruits, n=1, cutoff=0.5))

diff.get_close_matches may be receiving a float instead of a string, for example:

diff.get_close_matches(32.0, fruits, n=1, cutoff=.5)

Instead of:

diff.get_close_matches('32.0', fruits, n=1, cutoff=.5)

This may be the case if your data contains floats and not only strings. Converting each element with str() before matching avoids the error:

for i in newTest:
    listC.append(diff.get_close_matches(str(i), fruits, n=1, cutoff=.5))
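
One common source of floats in a column of strings is a missing value, which pandas stores as a float NaN. The sketch below is only an illustration under that assumption: the shortened fruits list, the sample newTest values, and the helper name closest_fruit are hypothetical and not taken from the question's data. It skips anything that is not a string rather than converting it, because str(float('nan')) is the text 'nan', which difflib may then match to a real fruit.

```python
import difflib

# Shortened reference list, for illustration only
fruits = ['lychee', 'banana', 'logan', 'grape']

# Hypothetical sample: float('nan') stands in for a missing value in the column
newTest = ['lychee', 'loga!!n', float('nan'), 'ba88tana']


def closest_fruit(value, choices, cutoff=0.5):
    """Return the best match for value, or None when value is not a
    string or when no choice reaches the cutoff."""
    if not isinstance(value, str):
        return None  # leave missing or numeric entries unresolved
    matches = difflib.get_close_matches(value, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


cleaned = [closest_fruit(v, fruits) for v in newTest]
print(cleaned)
```

Entries that come back as None can then be inspected or dropped separately instead of being silently mapped to a fruit.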

Posting relevant parts of the actual test would aid in diagnosis.

Ethan Henderson
  • I added more details to clarify what I am trying to do. The cutoff argument only controls how close the match needs to be, so I don't think it is the problem, especially because I took it off to check and it gave me the same error :/ – Thays Britto Dec 03 '17 at 22:51

Dependencies

pip install editdistance

code (closest.py)

import editdistance

listA = ['apple', 'banana', 'coco', 'grape', 'pear']
listB = ['ba88tana', 'peeaar', 'apple', 'ggra))pe']
listC = []

for i in listB:
    res = None
    # Initial bound: only candidates within len(i) edits can be selected
    distance = len(i) + 1
    for j in listA:
        diff = editdistance.eval(i, j)  # edit distance between i and j
        if diff < distance:
            distance = diff
            res = j
    listC.append(res)

print(listC)
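
If installing editdistance is not an option, the same nearest-by-edit-distance idea can be sketched with the standard library alone. This is not the editdistance package's implementation, just a plain dynamic-programming Levenshtein distance; the function names levenshtein and closest and the max_distance parameter are assumptions introduced here for illustration.

```python
def levenshtein(a, b):
    """Edit distance counting single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def closest(word, candidates, max_distance=None):
    """Return the candidate with the smallest edit distance to word.
    Ties go to the earliest candidate. If max_distance is given and even
    the best candidate is farther away than that, return None."""
    best = min(candidates, key=lambda c: levenshtein(word, c))
    if max_distance is not None and levenshtein(word, best) > max_distance:
        return None
    return best


listA = ['apple', 'banana', 'coco', 'grape', 'pear']
listB = ['ba88tana', 'peeaar', 'apple', 'ggra))pe']
listC = [closest(word, listA) for word in listB]
print(listC)
```

The optional max_distance threshold is there so that an entry very far from every known fruit can be flagged instead of being forced onto the nearest one.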
Tilak Putta