I've searched and searched but haven't found a topic/answer that specifically matches what I'm looking for, so here goes.
I want to implement a bad-word (profanity) filter that wildcard-matches any word from a list against the words in a string and returns the match, if one is found.
It's not as simple as "is word in string", as some words may be naughty on their own but acceptable at the start, middle and/or end of a longer word, e.g. "Scunthorpe"!
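To illustrate the problem, here is a minimal sketch of the naive substring approach I want to avoid. The function name `naive_contains` and the sample sentence are made up for this example only:

```python
# Hypothetical illustration: a plain substring test flags harmless words.
bad_words = ["poo", "wee"]

def naive_contains(text):
    """Return the first bad word found anywhere in text, or None."""
    lowered = text.lower()
    for word in bad_words:
        # "in" looks for the word anywhere, even inside a longer word
        if word in lowered:
            return word
    return None

print(naive_contains("The swimming pool is open all week"))  # prints: poo
```

Here "pool" triggers the filter even though the sentence is clean, which is why I need matching per word with optional wildcards.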
My concerns (from the very little I know) are that this involves a lot of repetition/iteration over relatively long strings (up to 2048 characters), and that the pattern list is re-created and re-evaluated on every call. Is there a way to have any of it cached?
In a chat application this function could be called very often, with bad-word lists of 300+ words, so efficiency is key.
Here's what I currently have, with examples of different matches, and it works perfectly - but as a Python newcomer I have no idea whether this is efficient or not, so I was hoping an expert could offer some insight.
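import fnmatch

def badWordMatch(string):
    # Wildcard patterns: "*" matches any run of characters, "?" exactly one
    bad_words = ["poo", "wee", "barsteward*", "?orrible"]
    data = string.split()
    for each in bad_words:
        # Keep only the words that match the current pattern
        l = fnmatch.filter(data, each)
        if l:
            # Return the pattern with its wildcard characters stripped
            return each.replace("?","").replace("*","")
    return None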
string_input = "Please do not wee in the swimming pool you 'orrible naughty barstewards!" # Matched: "wee"
#string_input = "Please do not dive in the swimming pool you 'orrible naughty barstewards!" # Matched: "barsteward"
#string_input = "Please do not dive in the swimming pool you 'orrible naughty kids!" # Matched: "orrible"
#string_input = "Please do not dive in the swimming pool you horrible naughty kids!" # Matched: "orrible"
#string_input = "Please do not dive in the swimming pool you naughty kids!" # No match!
isbadword = badWordMatch(string_input)
if isbadword is not None:
    print("Matched: %s" % (isbadword))
else:
    print("No match, string is clean!")
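For what it's worth, here is a rough sketch of the kind of caching I have in mind: translating each wildcard pattern to a regular expression once, at import time, instead of on every call. The names `BAD_WORDS`, `COMPILED` and `bad_word_match_cached` are my own inventions, and I'm assuming that `fnmatch.translate()` plus `re.compile()` is a fair stand-in for `fnmatch.filter()` (this version is always case-sensitive, which may differ from `fnmatch.filter()` on some platforms):

```python
import fnmatch
import re

BAD_WORDS = ["poo", "wee", "barsteward*", "?orrible"]

# Built once when the module is loaded, not on every call.
# Each entry pairs the wildcard-free label with a compiled regex;
# fnmatch.translate() turns a shell-style wildcard into a regex string.
COMPILED = [
    (word.replace("?", "").replace("*", ""), re.compile(fnmatch.translate(word)))
    for word in BAD_WORDS
]

def bad_word_match_cached(text):
    """Return the label of the first pattern matching any word, or None."""
    tokens = text.split()
    for label, regex in COMPILED:
        # The translated regex is anchored at the end, and match()
        # anchors at the start, so each token must match as a whole.
        if any(regex.match(token) for token in tokens):
            return label
    return None

print(bad_word_match_cached("Please do not dive in the pool you horrible kids!"))  # prints: orrible
```

I haven't measured whether this is actually faster than the original, so treat it as an idea rather than a result.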
Update: Regular expression version:
import re

bad_words = ["poo$", "wee$", "barsteward.*", ".orrible"]
string_input = "Please do not poo & wee in the swimming pool you horrible naughty barstewards! Shouldn't match: week, xbarsteward xhorrible"
strings = string_input.split()

def test3():
    # Join all patterns into a single alternation of non-capturing groups
    r = re.compile('|'.join('(?:%s)' % p for p in bad_words))
    for s in strings:
        # match() only anchors at the start of each word
        t = r.match(s)
        if t:
            print("Matched! " + t.group())

test3()
Result:
Matched! poo
Matched! wee
Matched! horrible
Matched! barstewards!
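A possible variation on the regex version, sketched under my own assumptions: compile the combined pattern once at module level and wrap each pattern in its own capturing group, so that `lastindex` on the match object tells me which pattern fired. `COMBINED` and `first_bad_word` are hypothetical names, and this only works if the individual patterns contain no capturing groups of their own:

```python
import re

BAD_WORDS = ["poo$", "wee$", "barsteward.*", ".orrible"]

# Compiled once at import time. One capturing group per pattern,
# so exactly one group takes part in any successful match.
COMBINED = re.compile("|".join("(%s)" % p for p in BAD_WORDS))

def first_bad_word(text):
    """Return (matched_word, pattern) for the first offending word, else None."""
    for token in text.split():
        m = COMBINED.match(token)
        if m:
            # lastindex is the 1-based number of the group that matched
            return m.group(), BAD_WORDS[m.lastindex - 1]
    return None

print(first_bad_word("Please do not poo & wee in the swimming pool"))  # prints: ('poo', 'poo$')
```

Returning on the first hit also means the rest of a long message is never scanned, which might matter for the 2048-character case.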