I preprocess a lot of text data and have run into a problem with a regex. I have texts in a dataframe column that need to end with proper punctuation, e.g. a dot, to avoid problems in later analysis (summarization). In some texts, the last sentence ends with a dot, but it is followed by a short reference of 2-3 letters. It looks like this:
col |
---|
blablabla. dpa |
blablabla. AB |
bla blabla |
I want to find these references and delete them. However, when I use the code below to inspect those cases first, it also returns the third row, even though its last word has more than 3 characters. This is the code I tried:
df.loc[df.col.str.contains("\w{2,3}$")]
or
df.loc[df.col.str.contains("\b\w{2,3}$\b")]
I hope someone has a suggestion on where I went wrong in my regex. Thank you very much in advance!
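For anyone who wants to reproduce this, here is a minimal sketch using the standard `re` module on the three sample rows. It assumes the pandas string methods treat these patterns the same way `re.search` and `re.sub` do. The `rows` list is only a stand-in for my real column, and the rule in the last pattern ("only strip a short token that directly follows a dot") is my guess at what I need, not a confirmed solution.

```python
import re

# Sample rows from the table above (stand-ins for the real column).
rows = ["blablabla. dpa", "blablabla. AB", "bla blabla"]

# No leading boundary: the last three characters of "blabla" also match.
unanchored = [bool(re.search(r"\w{2,3}$", s)) for s in rows]

# Raw string, so \b reaches the regex engine as a word boundary
# instead of being read by Python as a backspace character.
bounded = [bool(re.search(r"\b\w{2,3}$", s)) for s in rows]

# Assumed cleanup rule: drop a 2-3 character token only when it follows a dot.
cleaned = [re.sub(r"(?<=\.)\s*\w{2,3}\s*$", "", s) for s in rows]

print(unanchored)
print(bounded)
print(cleaned)
```

If that assumption holds, the same raw-string patterns could be passed to `df.col.str.contains(...)` for inspection and to `df.col.str.replace(..., "", regex=True)` for the cleanup.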