
I preprocess a lot of text data and have run into a problem with a regex. The texts are in a dataframe column and need to end with proper punctuation, e.g. a dot, to avoid problems in later analysis (summarization). In some texts, the last sentence ends with a dot, but it is followed by a short reference of 2-3 letters. It looks like this:

col
blablabla. dpa
blablabla. AB
bla blabla
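
For a reproducible example, the column can be built as follows. This is only a minimal sketch: the names `df` and `col` follow the code in this question, and the three rows are the sample values shown above, not real data.

```python
import pandas as pd

# Sample rows copied from the table above; real texts are much longer.
df = pd.DataFrame({"col": ["blablabla. dpa", "blablabla. AB", "bla blabla"]})

print(df)
```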

I want to find these references and delete them. However, when I use the code below to inspect those cases first, it also returns the third row, even though its last word has more than 3 characters. This is the code I tried:

df.loc[df.col.str.contains("\w{2,3}$")]

or

df.loc[df.col.str.contains("\b\w{2,3}$\b")]

I hope someone has a suggestion on where I went wrong in my regex. Thank you very much in advance!
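
One possible explanation, offered as a sketch rather than a definitive answer: `\w{2,3}$` is not anchored on the left, so it also matches the last three characters of a longer word (the trailing `bla` in `blabla`). In the second attempt, `"\b"` inside a non-raw Python string is a backspace character, not a word boundary, so a raw string is needed. The example below compares the patterns on the sample rows. The stricter pattern `after_dot` and the column name `cleaned` are assumptions introduced here for illustration: they presume every reference follows a dot and whitespace.

```python
import pandas as pd

df = pd.DataFrame({"col": ["blablabla. dpa", "blablabla. AB", "bla blabla"]})

# Unanchored on the left: also matches the tail "bla" of "blabla".
loose = df["col"].str.contains(r"\w{2,3}$")

# Raw string, so \b is a word boundary rather than a backspace character.
# The whole last word must now be 2-3 characters long.
bounded = df["col"].str.contains(r"\b\w{2,3}$")

# Stricter variant (assumption: the reference always follows a dot).
# The lookbehind keeps the dot and removes only the whitespace and reference.
after_dot = r"(?<=\.)\s+\w{2,3}$"
df["cleaned"] = df["col"].str.replace(after_dot, "", regex=True)

print(loose.tolist())
print(bounded.tolist())
print(df["cleaned"].tolist())
```

Note that `\b\w{2,3}$` alone would also flag a text that simply ends in a short word without a dot, which is why the dot-anchored variant may be the safer choice for deletion.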

Wiktor Stribiżew
