Python find words with certain number of characters

Asked May 18 '21 at 08:24

Active May 18 '21 at 08:26

Viewed 18 times

-1

I am preprocessing a lot of text data and have run into a problem with a regex. I have texts in a dataframe column that need to end with proper punctuation, e.g. a dot, to avoid problems in later analysis (summarization). In some texts, the last sentence ends with a dot, but it is followed by a short reference of 2-3 letters. It looks like this:

col
blablabla. dpa
blablabla. AB
bla blabla

I want to find these references and delete them. However, when I use the code below to inspect those cases first, it also returns the third row, even though its last word has more than 3 characters. This is the code I tried:

df.loc[df.col.str.contains("\w{2,3}$")]

df.loc[df.col.str.contains("\b\w{2,3}$\b")]

I hope someone has a suggestion on where I went wrong in my regex. Thank you very much in advance!

edited May 18 '21 at 08:25

Wiktor Stribiżew


asked May 18 '21 at 08:24

Arcticweasel

1

Use `r"\b\w{2,3}\b"`: in a regular (non-raw) string literal, `"\b"` is a BACKSPACE char, and `$` matches the end-of-string position. – Wiktor Stribiżew May 18 '21 at 08:25
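To illustrate the comment, here is a minimal sketch built on the three sample rows from the question. It assumes the `$` anchor from the question should be kept so that only the last word is tested, and that a trailing reference always follows a dot; the names `pattern`, `matches` and `cleaned` are made up for this example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"col": ["blablabla. dpa", "blablabla. AB", "bla blabla"]})

# In a normal string literal "\b" is the backspace character (0x08),
# so the regex engine never sees a word-boundary assertion.
print("\b" == "\x08")  # True

# Without a leading word boundary, the last 3 characters of a longer
# word ("bla" in "blabla") also satisfy \w{2,3}$, so every row matches.
print(df["col"].str.contains(r"\w{2,3}$").tolist())  # [True, True, True]

# Raw string: \b is now a regex word boundary, and $ anchors at the end.
pattern = r"\b\w{2,3}$"
matches = df.loc[df["col"].str.contains(pattern)]
print(matches["col"].tolist())  # ['blablabla. dpa', 'blablabla. AB']

# Assumption: a reference only counts if it directly follows a dot.
# The lookbehind keeps the dot and removes the whitespace and reference.
cleaned = df["col"].str.replace(r"(?<=\.)\s+\w{2,3}$", "", regex=True)
print(cleaned.tolist())  # ['blablabla.', 'blablabla.', 'bla blabla']
```

The lookbehind in the last step is a deliberate restriction: plain `\b\w{2,3}$` would also strip an ordinary short final word such as "it" from a text that has no trailing reference at all.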


0 Answers