12

I have run into a speed issue searching through a very large list. I have a file with a lot of errors and very strange words in it. I am trying to use difflib to find the closest match in a dictionary file I have that has 650,000 words in it. This approach below works really well but is very very slow and I was wondering if there is a better way to approach this problem. This is the code:

from difflib import SequenceMatcher
headWordList = [ #This is a list of 650,000 words]


openFile = open("sentences.txt","r")

for line in openFile:
    sentenceList.append[line]

percentage = 0
count = 0

for y in sentenceList:
      if y not in headwordList:

         for x in headwordList:
             m = SequenceMatcher(None, y.lower(), x)

             if m.ratio() > percentage:
                 percentage = m.ratio()

                 word = x

         if percentage > 0.86:        
             sentenceList[count] = word
count=count+1

Thanks for the help, software engineering is not even close to my strong suit. Much appreciated.

English Grad
  • 1,245
  • 3
  • 19
  • 38
  • 2
    I don't agree. He's more or less looking for alternative approaches – keyser Dec 23 '13 at 21:15
  • 2
    This is a data structures problem. – wim Dec 23 '13 at 21:17
  • Did you mean `m = SequenceMatcher(None, y.lower(), x)` – wim Dec 23 '13 at 21:18
  • 2
    One thing I can immediately see is to make `headwordList` a set instead of a list, to get better look-up performance for those `in` checks. – wim Dec 23 '13 at 21:19
  • Also you could parallelise this. Of course any serial optimisation is good but ultimately if you break it down into a few threads it may be faster to write it such that it's fast enough like that. – Aleksander Lidtke Dec 23 '13 at 21:21
  • 1
    Is this based on an existing algorithm, or are you sort of hacking something together? The specific `0.86` makes me think if we know the original problem, we might be able to suggest a better way to do it overall. – Izkata Dec 23 '13 at 21:21
  • Are you sure it is correct and works? percentage is non decreasing on each step, and counter is incremented only once – alko Dec 23 '13 at 21:21
  • wim, I need to check every word in the list and compare it using SequenceMatcher. Does making it a set help this? I would have though that it would be the same because you need to check each value. – English Grad Dec 23 '13 at 21:22
  • 1
    @EnglishGrad it will greatly help with `if y not in headwordList:` part – alko Dec 23 '13 at 21:22
  • 1
    I think you have an indentation issue as well - `count=count+1` doesn't make any sense where it is. Can you double-check that everything made it into the question correctly? – Izkata Dec 23 '13 at 21:24
  • [Everyone else has also overlooked this](http://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher.ratio) - `This is expensive to compute if get_matching_blocks() or get_opcodes() hasn’t already been called, in which case you may want to try quick_ratio() or real_quick_ratio() first to get an upper bound.` (a thought like this is why I asked about the algorithm being used - an upper-bound on the ratio might be sufficient for the `if` statement in the innermost loop, then only get the true ratio afterwards) – Izkata Dec 23 '13 at 21:27

4 Answers4

6

Two things that might provide some small help:

1) Use the approach in this SO answer to read through your large file the most efficiently.

2) Change your code from

for x in headwordList:
    m = SequenceMatcher(None, y.lower(), 1)

to

yLower = y.lower()
for x in headwordList:
    m = SequenceMatcher(None, yLower, 1)

You're converting each sentence to lower 650,000 times. No need for that.

Community
  • 1
  • 1
Dillon Welch
  • 461
  • 4
  • 15
4

You should change headwordList into a set.

The test word in headwordList will be very slow. It must do a string comparison on each word in headwordList, one word at a time. It will take time proportional to the length of the list; if you double the length of the list, you will double the amount of time it takes to do the test (on average).

With a set, it always takes the same amount of time to do the in test; it doesn't depend on the number of elements in the set. So that will be a huge speedup.

Now, this whole loop can be simplified:

     for x in headwordList:
         m = SequenceMatcher(None, y.lower(), x)

         if m.ratio() > percentage:
             percentage = m.ratio()

             word = x

     if percentage > 0.86:        
         sentenceList[count] = word

All this does is find the word from headwordList that has the highest ratio, and keep it (but only keep it if the ratio is over 0.86). Here's a faster way to do this. I'm going to change the name headwordList to just headwords as I want you to make it be a set and not a list.

def check_ratio(m):
    return m.ratio()

y = y.lower()  # do the .lower() call one time
m, word =  max((SequenceMatcher(None, y, word), word) for word in headwords, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    setence_list.append(word)

This might seem a bit tricky but it is the fastest way to do this in Python. We will call the built-in max() function to find the SequenceMatcher result that has the highest ratio. First, we build a "generator expression" that tries all the words in headwords, calling SequenceMatcher() on each. But when we are done, we also want to know what the word was. So the generator expression produces tuples, where the first value in the tuple is the SequenceMatcher result and the second value is the word. The max() function cannot know that what we care about is the ratio, so we have to tell it that; we do this by making a function that tests what we care about, then passing that function as the key= argument. Now max() finds the value with the highest ratio for us. max() consumes all the values produced by the generator expression and returns a single value, which we then unpack into the varaibles m and word.

In Python, it is best practice to use variable names like sentence_list rather than sentenceList. Please see these guidelines: http://www.python.org/dev/peps/pep-0008/

It is not good practice to use an incrementing index variable and assign into indexed positions in a list. Rather, start with an empty list and use the .append() method function to append values.

Also, you might do better to build a dictionary of words and their ratios.

Note that your original code seems to have a bug: as soon as any word has a percentage over 0.86, all words are saved in sentenceList no matter what their ratio is. The code I wrote, above, only saves words where the word's own ratio was high enough.

EDIT: This is to answer a question about generator expressions needing to be parenthesized.

Whenever I get that error message, I usually split out the generator expression by itself and assign it to a variable. Like this:

def check_ratio(m):
    return m.ratio()

y = y.lower()  # do the .lower() call one time
genexp = ((SequenceMatcher(None, y, word), word) for word in headwords)
m, word =  max(genexp, key=check_ratio)
percentage = max(percentage, m.ratio())  # remember best ratio
if m.ratio() > 0.86:
    setence_list.append(word)

That's what I suggest. But if you don't mind a complicated line looking even busier, you can simply add an extra pair of parentheses as the error message suggests, so the generator expression is fully parenthesized. Like so:

m, word =  max(((SequenceMatcher(None, y, word), word) for word in headwords), key=check_ratio)

Python lets you omit the explicit parentheses around a generator expression when you pass the expression to a function, but only if it is the only argument to that function. As we are also passing a key= argument, we need a fully parenthesized generator expression.

But I think it's easier to read if you split out the genexp on its own line.

EDIT: @Peter Wood pointed out that the documentation suggests reusing a SequenceMatcher for speed. I don't have time to test this, but I think this is the right way to do it.

Happily, the code got simpler! Always a good sign.

EDIT: I just tested the code. This code works for me; see if it works for you.

from difflib import SequenceMatcher

headwords = [
# This is a list of 650,000 words
# Dummy list:
    "happy",
    "new",
    "year",
]


def words_from_file(filename):
    with open(filename, "rt") as f:
        for line in f:
            for word in line.split():
                yield word

def _match(matcher, s):
    matcher.set_seq2(s)
    return (matcher.ratio(), s)

ratios = {}
best_ratio = 0

matcher = SequenceMatcher()

for word in words_from_file("sentences.txt"):
    matcher.set_seq1(word.lower())
    if word not in headwords:
        ratio, word =  max(_match(matcher, word.lower()) for word in headwords)
        best_ratio = max(best_ratio, ratio)  # remember best ratio
        if ratio > 0.86:
            ratios[word] = ratio

print(best_ratio)
print(ratios)
steveha
  • 67,444
  • 18
  • 86
  • 112
  • Steveha, I find this approach interesting and I am trying to use it but I am getting and error message that says: "generator expression must be parenthesized if not sole argument" ...any ideas? – English Grad Dec 23 '13 at 22:15
  • Also, [the docs suggest reusing a `SequenceMatcher`](http://docs.python.org/2/library/difflib.html#sequencematcher-objects): 'SequenceMatcher computes and caches detailed information about the second sequence, so if you want to compare one sequence against many sequences, use set_seq2() to set the commonly used sequence once and call set_seq1() repeatedly, once for each of the other sequences.' – Peter Wood Dec 23 '13 at 23:02
3

1) I would store headwordList as a set, not a list, allowing for faster access as it is a hashed data structure.

2) You have sentenceList defined as a list then attempt to use it as a dictionary with sentenceList[x] = y. I would define a different structure specifically for counts.

3) You construct sentenceList which doesn't need to be done.

for line in file:
   if line not in headwordList...

4) You never tokenize line which means you store the entire line before the newline character in sentenceList and see if it is in a wordlist

C.B.
  • 7,548
  • 5
  • 16
  • 32
0

This is a data structures question. What you want to do, is to turn your list into something with faster element lookup speed, for example a binary search tree would work great here: time complexity is only O (log n) as opposed to O (n) in a list (which is in comparison incredibly fast).

There's a fairly simple explanation here:

http://interactivepython.org/runestone/static/pythonds/Trees/balanced.html

But if you are not familiar with tree concepts, you might want to start few chapters earlier:

http://interactivepython.org/runestone/static/pythonds/Trees/trees.html

Ruslan Osipov
  • 5,305
  • 3
  • 25
  • 42
  • 2
    This is not a useful answer to a Python question. Python includes several built-ins that can be usefully applied to this question (`set` and `dict`). @English Grad says that he/she is not an expert software engineer, so a page showing how to implement a data structure is not going to help, and even an expert software developer would be better served by using a Python built-in than by implementing a tree data structure. – steveha Dec 23 '13 at 21:42