I've searched and searched but haven't found a topic/answer that specifically matches what I'm looking for, so here goes.

I want to implement a (bad/profanity) word filter, to wildcard match any word from a list of words within a string, and return the match, if found.

It's not as simple as "is word in string": some words are naughty on their own but acceptable at the start, middle and/or end of a longer word, e.g. "Scunthorpe"!

My concerns (from the very little I know) are that this involves a lot of repetition/iteration over relatively long strings (up to 2048 characters), and that the pattern list is rebuilt on every call. Is there a way to have any of it cached?

In a chat application this function could be called very often, with bad-word lists of 300+ words, so efficiency is key.

Here's what I currently have, with examples of different matches, and it works perfectly - but as a Python newcomer I have no idea whether this is efficient or not, so I was hoping an expert could offer some insight.

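import fnmatch

def badWordMatch(string):
    # Shell-style wildcards: * matches any run of characters, ? exactly one
    bad_words = ["poo", "wee", "barsteward*", "?orrible"]
    data = string.split()
    for each in bad_words:
        # fnmatch.filter returns the words in data that match this pattern
        l = fnmatch.filter(data, each)
        if l:
            # Return the pattern with its wildcard characters stripped
            return each.replace("?","").replace("*","")
    return None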

string_input = "Please do not wee in the swimming pool you 'orrible naughty barstewards!" # Matched: "wee"
#string_input = "Please do not dive in the swimming pool you 'orrible naughty barstewards!" # Matched: "barsteward"
#string_input = "Please do not dive in the swimming pool you 'orrible naughty kids!" # Matched: "orrible"
#string_input = "Please do not dive in the swimming pool you horrible naughty kids!" # Matched: "orrible"
#string_input = "Please do not dive in the swimming pool you naughty kids!" # No match!

isbadword = badWordMatch(string_input)

if isbadword is not None:
    print("Matched: %s" % (isbadword))
else:
    print("No match, string is clean!")

Update: Regular expression version:

import re

bad_words = ["poo$", "wee$", "barsteward.*", ".orrible"]

string_input = "Please do not poo & wee in the swimming pool you horrible naughty barstewards! Shouldn't match: week, xbarsteward xhorrible"

strings = string_input.split()

def test3():
    r = re.compile('|'.join('(?:%s)' % p for p in bad_words))
    for s in strings:
        t = r.match(s)
        if t:
            print("Matched! " + t.group())

test3()

Result:

Matched! poo
Matched! wee
Matched! horrible
Matched! barstewards!

Rob.H
  • Have you looked into using [regular expressions](http://stackoverflow.com/questions/4736/learning-regular-expressions)? – tyteen4a03 Mar 09 '17 at 22:11
  • @tyteen4a03 Hi. Yes, but as I mentioned I've not come across anything yet that matches this specific situation. The code I posted is the only working solution I've found so far, so I'd love to see any different ways to accomplish the same results, and/or any variations/improvements on the existing code to make it as fast/efficient as possible. – Rob.H Mar 09 '17 at 22:31

1 Answer

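In Python 3.2+, fnmatch's pattern compilation step is wrapped in an LRU cache decorator, which means the compiled form of the most recent 256 patterns is cached (it is the patterns that are cached, not the results of fnmatch.filter calls). Outside of this, fnmatch does not do much caching of its own. However, fnmatch uses re internally: your patterns are translated to regexes and compiled with re, so they also benefit from re's own cache of compiled expressions.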

You're still better off building a single regex from your list of bad words: according to this answer, one (explicitly compiled) regex is much faster than several hundred (implicitly compiled) regexes, which is what your example amounts to.
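
A minimal sketch of the single-regex idea, assuming the regex-style patterns from the question's update. The combined pattern is compiled once at import time, so each call only has to run the match. The names BAD_WORD_PATTERNS, BAD_WORDS_RE and find_bad_word are hypothetical, chosen for illustration.

```python
import re

# Patterns copied from the question; each is tried at the start of a word.
BAD_WORD_PATTERNS = ["poo$", "wee$", "barsteward.*", ".orrible"]

# Compiled once at import time and reused by every call.
BAD_WORDS_RE = re.compile("|".join("(?:%s)" % p for p in BAD_WORD_PATTERNS))


def find_bad_word(text):
    """Return the first whitespace-separated word that matches, or None."""
    for word in text.split():
        m = BAD_WORDS_RE.match(word)
        if m:
            return m.group()
    return None


print(find_bad_word("Please do not wee in the swimming pool"))
print(find_bad_word("Please do not dive in the swimming pool"))
```

Keeping the compiled object in a module-level name means the cost of joining and compiling 300+ patterns is paid once, not per chat message.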

tyteen4a03
  • Hi, thanks for the update. I did spot that answer, but couldn't show just the matched regex *pattern*. `import re bad_words = ["poo$", "wee$", "barsteward.*", ".orrible"] input = "Please do not poo & wee in the swimming pool you horrible naughty barstewards! Shouldn't match: week, xbarsteward xhorrible" strings = input.split() def test3(): r = re.compile('|'.join('(?:%s)' % p for p in bad_words)) for s in strings: t = r.match(s) if t: print "Match: " + t.group() test3()` Result: # Matched! poo # Matched! wee # Matched! horrible # Matched! barstewards! – Rob.H Mar 10 '17 at 00:06
  • Great, the code didn't parse and I can't edit it - D'oh! – Rob.H Mar 10 '17 at 00:12
  • Update: Added the regex version to the original question. It seems to work OK other than I can't find a way to return/display the exact pattern that matched, only the matched string. Any ideas? – Rob.H Mar 10 '17 at 00:21
  • If you *really* need the exact pattern that matched, you'll need to use separate regex patterns. – tyteen4a03 Mar 10 '17 at 00:31
  • I thought so. Will having multiple regex patterns be better (performance-wise) than using fnmatch? I'm not advanced enough yet to create such a code to compare. I have just tried "timeit" on the fnmatch version, and it was significantly quicker on the second lookup (Python 2.7). – Rob.H Mar 10 '17 at 00:40
  • Try [`timeit`](http://stackoverflow.com/questions/8220801/how-to-use-timeit-module). – tyteen4a03 Mar 10 '17 at 00:41
  • I think you misunderstood, I did use timeit (on the fnmatch version) and got a much quicker time on the second lookup, so it's obviously caching somewhere. What I meant was I'm not advanced enough to know how to make a version of the code that has multiple regex matches. Assuming a loop to build a list of compiled bad word regex's, which I think I can do - but after that? How to match all against a string? Some example code would be helpful, I'm struggling at this stage! – Rob.H Mar 10 '17 at 01:13
  • Yes, you would match every single regex pattern in a `for` loop. You'd need to run a benchmark to see which approach is faster (my bet is on the `re` one) – tyteen4a03 Mar 10 '17 at 01:15
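
The loop described in the comments might look like the sketch below, which reuses the question's patterns. find_bad_pattern keeps one compiled regex per pattern and tries each in turn. find_bad_pattern_single is an alternative not raised in the thread: it wraps each pattern in its own capturing group and reads the match object's lastindex to find which alternative matched, which only works if the patterns contain no capturing groups of their own. All names here are hypothetical.

```python
import re

BAD_WORD_PATTERNS = ["poo$", "wee$", "barsteward.*", ".orrible"]

# One compiled regex per pattern, built once and kept next to its source text.
COMPILED = [(p, re.compile(p)) for p in BAD_WORD_PATTERNS]

# Alternative: a single regex with one capturing group per pattern.
# Assumes the patterns themselves contain no capturing groups.
COMBINED = re.compile("|".join("(%s)" % p for p in BAD_WORD_PATTERNS))


def find_bad_pattern(text):
    """Try each pattern against each word; return (pattern, word) or None."""
    words = text.split()
    for pattern, regex in COMPILED:
        for word in words:
            if regex.match(word):
                return pattern, word
    return None


def find_bad_pattern_single(text):
    """Same result shape, but one regex call per word."""
    for word in text.split():
        m = COMBINED.match(word)
        if m:
            # lastindex is the 1-based number of the group that matched.
            return BAD_WORD_PATTERNS[m.lastindex - 1], m.group()
    return None


sample = "you horrible naughty barstewards!"
print(find_bad_pattern(sample))
print(find_bad_pattern_single(sample))
```

The two functions can report different hits for the same input, because the first walks the pattern list in order while the second walks the words in order. Which one is faster for a 300-word list is something only a timeit run on realistic messages can settle.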