-1

In difflib.get_close_matches(word, possibilities[, n][, cutoff]), what is the use of cutoff? How does it affect the word matches?

Jon Clements
Sana Jain

2 Answers

1

From the documentation:

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

Trying the example from the documentation:

In [11]: import difflib

In [12]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
Out[12]: ['apple', 'ape']

In [13]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.1)
Out[13]: ['apple', 'ape', 'puppy']

In [14]: difflib.get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], cutoff=0.9)
Out[14]: []
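To see why each cutoff gives those results, it helps to print the similarity score of every candidate. This is only a sketch: the helper name `similarity` is made up for illustration, and it assumes the score that get_close_matches compares against cutoff is `SequenceMatcher.ratio()` with the candidate as the first sequence and the word as the second.

```python
from difflib import SequenceMatcher

word = 'appel'
candidates = ['ape', 'apple', 'peach', 'puppy']

def similarity(candidate, word):
    # Hypothetical helper: score one candidate against the word.
    # None means no characters are treated as junk.
    return SequenceMatcher(None, candidate, word).ratio()

scores = {c: similarity(c, word) for c in candidates}
for candidate, score in scores.items():
    print(candidate, round(score, 2))
```

Here 'apple' scores 0.8 and 'ape' 0.75, while 'peach' and 'puppy' both score 0.4. So the default cutoff of 0.6 keeps only the first two, cutoff=0.9 rejects everything, and cutoff=0.1 lets all four through, after which the default n=3 caps the result at three entries.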

Details about the algorithm are noted in the article "Pattern Matching: The Gestalt Approach".

  • What is cutoff? How do I find it? Is it related to edit distance? – Sana Jain Jan 30 '15 at 09:26
  • Please read the documentation at https://docs.python.org/3.4/library/difflib.html –  Jan 30 '15 at 09:28
  • I read that document. It says the optional argument cutoff (default 0.6) is a float in the range [0, 1]. What does it indicate? – Sana Jain Jan 30 '15 at 09:34
  • Please read all of https://docs.python.org/3.4/library/difflib.html, especially the description of the algorithm. It is all described in detail. –  Jan 30 '15 at 09:35
1

I came across the same question and found that "difflib.get_close_matches" is built on the approach called "Gestalt pattern matching", described by Ratcliff and Obershelp (link below).

The function "difflib.get_close_matches" is based on the class "SequenceMatcher", which is described in the source code like this: "SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people."
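As a rough illustration of how those matching subsequences turn into a number, the sketch below rebuilds the score by hand. It relies on the documented definition of `SequenceMatcher.ratio()` as 2.0*M/T, where M is the number of matched elements and T is the total length of both sequences. The variable names are my own.

```python
from difflib import SequenceMatcher

a, b = 'apple', 'appel'
matcher = SequenceMatcher(None, a, b)

# Each block is (start in a, start in b, size); the final block is a
# zero-length sentinel, so it is filtered out here.
blocks = [blk for blk in matcher.get_matching_blocks() if blk.size > 0]
for blk in blocks:
    print(repr(a[blk.a:blk.a + blk.size]), 'at', blk.a, 'in a and', blk.b, 'in b')

matched = sum(blk.size for blk in blocks)  # M: matched characters
total = len(a) + len(b)                    # T: combined length
manual_ratio = 2.0 * matched / total

print(manual_ratio, matcher.ratio())
```

For this pair the longest common run is 'app', and one more single character is matched to its right, so M is 4 and the score is 2*4/10 = 0.8. That is also why cutoff is not an edit distance: it is a threshold on this 0-to-1 similarity score.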

About the "cutoff": it sets how similar a candidate must be to count as a match. With "1", only a word identical to the input qualifies, and lower values are progressively more relaxed. With "0", every candidate passes, so you get the n highest-scoring words back even if none of them is really similar, which rarely makes sense. The default of "0.6" tends to give meaningful results, but the right value depends on your use case, so test what works for your vocabulary and specific scenario.
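A quick way to check the two extremes yourself is something like the following. The vocabulary and the nonsense word 'zzz' are just made-up test data.

```python
from difflib import get_close_matches

vocabulary = ['ape', 'apple', 'peach', 'puppy']

# cutoff=1.0: only a candidate identical to the word survives
exact_only = get_close_matches('apple', vocabulary, cutoff=1.0)
no_exact = get_close_matches('appel', vocabulary, cutoff=1.0)

# cutoff=0.0: nothing is filtered out, so even a word that shares no
# letters with the vocabulary still gets n results back
anything = get_close_matches('zzz', vocabulary, n=1, cutoff=0.0)

print(exact_only)
print(no_exact)
print(anything)
```

With cutoff=1.0 the misspelled 'appel' finds nothing, while 'apple' finds itself. With cutoff=0.0 the call for 'zzz' still returns one word from the list, although it has nothing in common with it.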

PATTERN MATCHING: THE GESTALT APPROACH http://collaboration.cmc.ec.gc.ca/science/rpn/biblio/ddj/Website/articles/DDJ/1988/8807/8807c/8807c.htm

Hope this helps you to understand "difflib.get_close_matches" better.

juanman