from nltk.tokenize import RegexpTokenizer

text = "That's some text, you know!"
tokens = []
tokenizer = RegexpTokenizer(r'\w+')
tokens += tokenizer.tokenize(text.lower())

Currently returns: tokens = ['that', 's', 'some', 'text', 'you', 'know']

I need it to return: tokens = ['thats', 'some', 'text', 'you', 'know'] ("thats" should be a single token)

OneCricketeer
Sledro
  • Why don't you just remove the `'` with `replace("'", "")`? – hansaplast Feb 05 '17 at 20:26
  • Possible duplicate of [Best way to strip punctuation from a string in Python](http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python) – OneCricketeer Feb 05 '17 at 20:26
  • @hansaplast This is text processing. A simple replace might also remove `'` characters that are not apostrophes, which is why they are using nltk. – Moses Koledoye Feb 05 '17 at 20:31

1 Answer


There are two solutions. Either strip the apostrophes from your text variable before tokenizing, which gives the 'thats' token you asked for:

text = text.replace("'", "")

or, if you would rather keep the apostrophe inside the token, match "that's" as a single word with this modification:

tokenizer = RegexpTokenizer(r'[\w\']+')
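
To compare the two options side by side, here is a minimal sketch. It assumes that `re.findall` from the standard library is a fair stand-in for `RegexpTokenizer(pattern).tokenize` when the pattern has no capturing groups, so it runs without nltk. The `tokenize` helper and the third pattern (apostrophes only between word characters, which may help with the quote-mark concern raised in the comments) are illustrative additions, not part of the original code.

```python
import re


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text and return all regex matches."""
    return re.findall(pattern, text.lower())


text = "That's some text, you know!"

# Option 1: remove apostrophes first, then match runs of word characters.
stripped = tokenize(text.replace("'", ""), r"\w+")

# Option 2: allow the apostrophe inside a token.
kept = tokenize(text, r"[\w']+")

# Variation (assumption): only accept apostrophes that sit between word
# characters, so surrounding single quotes are not glued onto a token,
# then drop the apostrophes from each token.
quoted = "She said 'that's fine'"
inner = [t.replace("'", "") for t in tokenize(quoted, r"\w+(?:'\w+)*")]

print(stripped)  # ['thats', 'some', 'text', 'you', 'know']
print(kept)      # ["that's", 'some', 'text', 'you', 'know']
print(inner)     # ['she', 'said', 'thats', 'fine']
```

Option 1 produces exactly the output requested in the question, while option 2 keeps "that's" intact as one token, so pick whichever form the later processing steps expect.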
aldarel