How to remove ' in strings with RegexpTokenizer

Question

from nltk.tokenize import RegexpTokenizer
text="That's some text, you know!"
tokens=[]
tokenizer = RegexpTokenizer(r'\w+')
tokens+=tokenizer.tokenize(text.lower())

Currently returns: tokens = ['that', 's', 'some', 'text', 'you', 'know']

I need it to return: tokens = ['thats', 'some', 'text', 'you', 'know'] (with "thats" as one word)

Possible duplicate of [Best way to strip punctuation from a string in Python](http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python) — OneCricketeer, Feb 05 '17 at 20:26
@hansaplast This is text processing, a simple replace might replace other `'` which are not apostrophes, which is why they are using nltk. — Moses Koledoye, Feb 05 '17 at 20:31

score 4 · Accepted Answer · answered Feb 05 '17 at 20:32

There are two solutions. Either strip the apostrophes from your text variable before tokenizing:

text = text.replace("'", "")

or match "that's" as a single token (the apostrophe stays in the token) by allowing apostrophes in the pattern:

tokenizer = RegexpTokenizer(r'[\w\']+')
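To compare the two options side by side, here is a minimal sketch. It uses the standard library's re.findall as a stand-in for RegexpTokenizer, on the assumption that the tokenizer in its default (non-gap) mode returns every match of the pattern; the tokenize helper is hypothetical and not part of nltk.

```python
import re

text = "That's some text, you know!"


def tokenize(pattern, s):
    # Hypothetical stand-in for RegexpTokenizer(pattern).tokenize(s),
    # assuming the default mode simply returns all matches of the pattern.
    return re.findall(pattern, s)


# Option 1: remove apostrophes first, then match runs of word characters.
stripped = tokenize(r"\w+", text.lower().replace("'", ""))

# Option 2: allow apostrophes inside a token, so "that's" is not split.
kept = tokenize(r"[\w']+", text.lower())

print(stripped)  # ['thats', 'some', 'text', 'you', 'know']
print(kept)      # ["that's", 'some', 'text', 'you', 'know']
```

Only the first option produces 'thats' as asked in the question; the second keeps the word whole but leaves the apostrophe in it.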

aldarel
