11

I have a list of sentences such as this:

errList = [ 'Ragu ate lunch but didnt have Water for drinks',
            'Rams ate lunch but didnt have Gatorade for drinks',
            'Saya ate lunch but didnt have :water for drinks',
            'Raghu ate lunch but didnt have water for drinks',
            'Hanu ate lunch but didnt have -water for drinks',
            'Wayu ate lunch but didnt have water for drinks',
            'Viru ate lunch but didnt have .water 4or drinks',

            'kk ate lunch & icecream but did have Water for drinks',
            'M ate lunch &and icecream but did have Gatorade for drinks',
            'Parker ate lunch icecream but didnt have :water for drinks',
            'Sassy ate lunch and icecream but didnt have water for drinks',
            'John ate lunch and icecream but didnt have -water for drinks',
            'Pokey ate lunch and icecream but didnt have Water for drinks',
            'Laila ate lunch and icecream but did have water 4or drinks',
        ]

I want to find the counts of the longest phrases/parts (a phrase must be more than 2 words) across the sentences in the list. For the example above, the output would look something like this (longest phrase as key and count as value):

{ 'ate lunch but didnt have': 7,
  'water for drinks': 7,
  'ate lunch and icecream': 4,
  'didnt have water': 3,
  'didnt have Water': 2    # case sensitive
}

Using the re module is out of the question, since the problem is closer to sequence matching; perhaps something for nltk, or maybe scikit-learn? I have some familiarity with NLP and scikit-learn, but not enough to solve this. If I solve it, I will post the solution here.

NullException
  • How would you define the phrase? In your example, the phrase 'ate lunch' appears in all sentences. What if the phrase is only one word? – Aechlys May 23 '18 at 22:34
  • 1
    How does this relate to https://en.wikipedia.org/wiki/Longest_common_subsequence_problem? – Bill Bell May 24 '18 at 03:10

4 Answers

6

It's not too painful with scikit-learn, with a bit of numpy foo as well. A word of warning though: I've just used the defaults for preprocessing here; if you're interested in the punctuation in your dataset then you will need to tweak this (a sketch of that follows the output below).

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Find all phrases of more than 2 words, up to the length of the longest sentence
cv = CountVectorizer(ngram_range=(3, max([len(x.split(' ')) for x in errList])))
# Get the counts of the phrases
err_counts = cv.fit_transform(errList)
# Get the sum of each of the phrases
err_counts = err_counts.sum(axis=0)
# Mess about with the types, sparsity is annoying
err_counts = np.squeeze(np.asarray(err_counts))
# Retrieve the actual phrases that we're working with
feat_names = np.array(cv.get_feature_names())

# We don't have to sort here, but it's nice to if you want to print anything
err_counts_sorted = err_counts.argsort()[::-1]
feat_names = feat_names[err_counts_sorted]
err_counts = err_counts[err_counts_sorted]

# This is the dictionary that you were after
err_dict = dict(zip(feat_names, err_counts))
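
For reference, the listing below can be printed from the sorted arrays with a short loop (my addition, not part of the original answer; note that on scikit-learn 1.0 and later, get_feature_names() is deprecated in favour of get_feature_names_out()):

# Print the ten most common phrases with their counts
for count, phrase in zip(err_counts[:10], feat_names[:10]):
    print(count, phrase)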

Here's the output for the top few:

11 but didnt have
10 have water for drinks
10 have water for
10 water for drinks
10 but didnt have water
10 didnt have water
9 but didnt have water for drinks
9 but didnt have water for
9 didnt have water for drinks
9 didnt have water for
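
If the punctuation and case in the dataset do matter (the question's expected output is case sensitive, and some tokens carry a leading : or -), the vectorizer's defaults can be overridden. A minimal sketch, assuming the same errList; this is my addition, not part of the original answer:

# Keep case and punctuation instead of the lowercasing defaults
cv = CountVectorizer(
    lowercase=False,         # 'Water' and 'water' stay distinct
    token_pattern=r'\S+',    # keep punctuation attached to the tokens
    ngram_range=(3, max(len(x.split()) for x in errList)),
)
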
piman314
5

If you don't want to bother with external libraries, you can get this done with just the stdlib (although it may well be slower than some alternatives):

import collections
import itertools

def gen_ngrams(sentence):
    words = sentence.split()  # or re.findall(r'\b\w+\b'), or whatever
    n_words = len(words)
    for i in range(n_words - 2):
        # j is an exclusive slice end, so it must reach n_words to
        # include the ngrams that end with the last word
        for j in range(i + 3, n_words + 1):
            yield ' '.join(words[i:j])  # assumes spaces are already normalized


def count_ngrams(sentences):
    return collections.Counter(
        itertools.chain.from_iterable(
            gen_ngrams(sentence) for sentence in sentences
        )
    )

counts = count_ngrams(errList)
dict(counts.most_common(10))

Which gets you:

{'but didnt have': 11,
 'ate lunch but': 7,
 'ate lunch but didnt': 7,
 'ate lunch but didnt have': 7,
 'lunch but didnt': 7,
 'lunch but didnt have': 7,
 'icecream but didnt': 4,
 'icecream but didnt have': 4,
 'ate lunch and': 4,
 'ate lunch and icecream': 4}
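
To see what gen_ngrams yields on its own, here is a quick check on a toy sentence (my example, not from the answer):

list(gen_ngrams('a b c d'))
# ['a b c', 'a b c d', 'b c d']
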
2

Not the entire solution, but to help you a bit on the way, the following will get you a dictionary of ngrams and their counts. The next step is then, as pointed out by Bill Bell in the comments, to filter out the shorter subsequences. This (as also pointed out in the comments) means deciding on your maximum length, or indeed on what defines a phrase... a rough sketch of that filtering step follows the code.

from nltk import ngrams, word_tokenize
from collections import defaultdict
min_ngram_length = 1
# Longest possible ngram spans the longest tokenized sentence
max_ngram_length = max(len(word_tokenize(x)) for x in errList)
d = defaultdict(int)
for item in errList:
    tokens = word_tokenize(item)  # tokenize each sentence once
    for i in range(min_ngram_length, max_ngram_length + 1):
        for ngram in ngrams(tokens, i):
            d[ngram] += 1
for pair in sorted(d.items(), key = lambda x: x[1], reverse=True):
    print(pair)
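
As a rough sketch of that filtering step (my own addition, assuming the d built above): keep only the "maximal" phrases of at least 3 words, i.e. those not contained in an equally frequent longer phrase. The space padding guards against matches across word boundaries.

phrases = {' '.join(k): v for k, v in d.items() if len(k) >= 3}

def is_maximal(phrase, count):
    # Hypothetical helper: a phrase is dropped if a different phrase
    # with at least the same count contains it as whole words
    return not any(
        other != phrase and c >= count and ' %s ' % phrase in ' %s ' % other
        for other, c in phrases.items()
    )

longest = {p: c for p, c in phrases.items() if is_maximal(p, c)}
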
Igor
1

Using tools from the third-party library more_itertools:

Given

import itertools as it
import collections as ct

import more_itertools as mit


data = [ 
    "Ragu ate lunch but didnt have Water for drinks",
    "Rams ate lunch but didnt have Gatorade for drinks",
    "Saya ate lunch but didnt have :water for drinks",
    "Raghu ate lunch but didnt have water for drinks",
    "Hanu ate lunch but didnt have -water for drinks",
    "Wayu ate lunch but didnt have water for drinks",
    "Viru ate lunch but didnt have .water 4or drinks",
    "kk ate lunch & icecream but did have Water for drinks",
    "M ate lunch &and icecream but did have Gatorade for drinks",
    "Parker ate lunch icecream but didnt have :water for drinks",
    "Sassy ate lunch and icecream but didnt have water for drinks",
    "John ate lunch and icecream but didnt have -water for drinks",
    "Pokey ate lunch and icecream but didnt have Water for drinks",
    "Laila ate lunch and icecream but did have water 4or drinks",
]

Code

ngrams = []
for sentence in data:
    words = sentence.split()
    for n in range(3, len(words) + 1):
        ngrams.extend(mit.windowed(words, n))

counts = ct.Counter(ngrams)
dict(counts.most_common(5))

Output

{('but', 'didnt', 'have'): 11,
 ('ate', 'lunch', 'but'): 7,
 ('lunch', 'but', 'didnt'): 7,
 ('ate', 'lunch', 'but', 'didnt'): 7,
 ('lunch', 'but', 'didnt', 'have'): 7}
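
The keys here are tuples of words; if string keys are preferred, as in the question's expected output, they can be joined (my addition):

{' '.join(ngram): count for ngram, count in counts.most_common(5)}
# {'but didnt have': 11, 'ate lunch but': 7, ...}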

Alternatively

sentences = [sentence.split() for sentence in data]
ngrams = mit.flatten(
    mit.windowed(w, n) for w in sentences for n in range(3, len(w) + 1)
)
counts = ct.Counter(ngrams)
dict(counts.most_common(5))
pylang