
How can I tell difflib.get_close_matches() to ignore case? I have a dictionary which has a defined format which includes capitalisation. However, the test string might have full capitalisation or no capitalisation, and these should be equivalent. The results need to be properly capitalised, however, so I can't use a modified dictionary.

import difflib

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
    'Acacia koa A.Gray var. waianaeensis H.St.John',
    'Acacia koaia Hillebr.',
    'Acacia kochii W.Fitzg. ex Ewart & Jean White',
    'Acacia kochii W.Fitzg.']
s = 'Acacia kochi W.Fitzg.'

# base case: proper capitalisation
print(difflib.get_close_matches(s,names,1,0.9))

# this should be equivalent from the perspective of my program
print(difflib.get_close_matches(s.upper(),names,1,0.9))

# this won't work because of the dictionary formatting
print(difflib.get_close_matches(s.upper().capitalize(),names,1,0.9))

Output:

['Acacia kochii W.Fitzg.']
[]
[]
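
The empty results follow from difflib comparing characters exactly, so 'A' and 'a' count as different. A minimal sketch of this, using the same SequenceMatcher ratio that get_close_matches relies on (the helper name similarity is made up for illustration):

```python
import difflib

name = 'Acacia kochii W.Fitzg.'
s = 'Acacia kochi W.Fitzg.'

def similarity(candidate, word):
    """Case-sensitive similarity ratio in [0, 1] between a candidate and the search word."""
    return difflib.SequenceMatcher(None, candidate, word).ratio()

exact_case = similarity(name, s)               # differs by a single character
upper_case = similarity(name, s.upper())       # most letters no longer match
lowered = similarity(name.lower(), s.lower())  # case removed on both sides

print(exact_case, upper_case, lowered)
```

Only the first and last ratios clear the 0.9 cutoff used above, which is why lowering both sides is enough to make the search case-insensitive.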

Working code:

Based on Hugh Bothwell's answer, I have modified the code as follows to get a working solution (which should also work when more than one result is returned):

import difflib

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
    'Acacia koa A.Gray var. waianaeensis H.St.John',
    'Acacia koaia Hillebr.',
    'Acacia kochii W.Fitzg. ex Ewart & Jean White',
    'Acacia kochii W.Fitzg.']
test = {n.lower():n for n in names}    
s1 = 'Acacia kochi W.Fitzg.'   # base case
s2 = 'ACACIA KOCHI W.FITZG.'   # test case

results = [test[r] for r in difflib.get_close_matches(s1.lower(),test,1,0.9)]
results += [test[r] for r in difflib.get_close_matches(s2.lower(),test,1,0.9)]
print(results)

Output:

['Acacia kochii W.Fitzg.', 'Acacia kochii W.Fitzg.']
rudivonstaden
  • Sorry to reboot an old post, but I found this interesting. For the final search product, I'm reading the code and it seems like you would not need the s1 and first results list. Is that correct? It seems the algorithm would produce the result you wanted without those lines. – Tyler Russell Dec 22 '17 at 00:59
  • @TylerRussell that's correct. The purpose was to verify that the capitalisation of the search term did not influence the result. The fact that searching with s1 and searching with s2 produced the same result showed that the algorithm worked. Generally you would only use one search term. – rudivonstaden Jan 09 '18 at 07:51

3 Answers


I don't see any quick way to make difflib do case-insensitive comparison.

The quick-and-dirty solution seems to be

  • make a function that converts the string to some canonical form (for example: upper case, single spaced, no punctuation)

  • use that function to make a dict of {canonical string: original string} and a list of [canonical string]

  • run .get_close_matches against the canonical-string list, then plug the results through the dict to get the original strings back
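
The three steps above can be sketched roughly as follows. The names canonical and close_matches_canonical are hypothetical, and the particular canonical form (upper case, punctuation replaced by spaces, runs of whitespace collapsed) is just one possible choice:

```python
import difflib
import re

def canonical(text):
    """One possible canonical form: upper case, no punctuation, single spaces."""
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return " ".join(text.upper().split())  # upper-case and collapse whitespace

def close_matches_canonical(word, possibilities, n=3, cutoff=0.6):
    """Match on canonical strings, then map the hits back to the originals."""
    lookup = {canonical(p): p for p in possibilities}
    hits = difflib.get_close_matches(canonical(word), list(lookup), n, cutoff)
    return [lookup[h] for h in hits]

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
         'Acacia koaia Hillebr.',
         'Acacia kochii W.Fitzg. ex Ewart & Jean White',
         'Acacia kochii W.Fitzg.']
print(close_matches_canonical('ACACIA KOCHI W.FITZG.', names, 1, 0.9))
# prints ['Acacia kochii W.Fitzg.']
```

Note that if two originals share a canonical form, the dict keeps only the last one, so this sketch assumes canonical forms are unique.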

Hugh Bothwell

After a lot of searching around I am sadly surprised to see no simple pre-canned answer to this obvious use case.

The only alternative seems to be the "FuzzyWuzzy" library. Yet it is built on the same kind of similarity ratio as Python's difflib (which uses a Ratcliff/Obershelp-style SequenceMatcher rather than true Levenshtein distance), and its API did not strike me as production quality. Its more obscure methods are indeed case-insensitive, but it provides no direct or simple replacement for get_close_matches.

So here is the simplest implementation I can think of:

import difflib

def get_close_matches_icase(word, possibilities, *args, **kwargs):
    """ Case-insensitive version of difflib.get_close_matches """
    lword = word.lower()
    lpos = {p.lower(): p for p in possibilities}
    lmatches = difflib.get_close_matches(lword, lpos.keys(), *args, **kwargs)
    return [lpos[m] for m in lmatches]
gatopeich

@gatopeich had the right idea, but the problem is that there may be many strings which differ only in capitalization. We surely want them all in our results, not just one of them!

The following adaptation manages to do this:

import difflib
import itertools

def get_close_matches_icase(word, possibilities, *args, **kwargs):
    """ Case-insensitive version of difflib.get_close_matches """
    lword = word.lower()
    lpos = {}
    for p in possibilities:
        if p.lower() not in lpos:
            lpos[p.lower()] = [p]
        else:
            lpos[p.lower()].append(p)
    lmatches = difflib.get_close_matches(lword, lpos.keys(), *args, **kwargs)
    ret = [lpos[m] for m in lmatches]
    ret = itertools.chain.from_iterable(ret)
    return set(ret)
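
A usage sketch of the same grouping idea, written as a variant that returns a list in ranked order instead of a set (that ordering is my assumption about what callers want; the function name close_matches_all_cases is made up for illustration):

```python
import difflib
from collections import defaultdict

def close_matches_all_cases(word, possibilities, n=3, cutoff=0.6):
    """Case-insensitive matching that keeps every original spelling of a hit."""
    groups = defaultdict(list)
    for p in possibilities:
        groups[p.lower()].append(p)   # group originals by lower-cased key
    hits = difflib.get_close_matches(word.lower(), list(groups), n, cutoff)
    # Flatten the groups, best match first, input order within each group.
    return [original for h in hits for original in groups[h]]

names = ['Acacia kochii W.Fitzg.',
         'ACACIA KOCHII W.FITZG.',
         'Acacia koaia Hillebr.']
print(close_matches_all_cases('acacia kochi w.fitzg.', names, 1, 0.9))
# prints ['Acacia kochii W.Fitzg.', 'ACACIA KOCHII W.FITZG.']
```

Here n limits the number of distinct lower-cased keys that match, not the length of the returned list, so more than n strings can come back when several spellings share a key.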
viuser