
Below is a column with the data I have and another column with the de-duplicated data I want.


I honestly don't even know how to start doing this in Python code. I've read a couple of posts on this in R, but not in Python.

PineNuts0

1 Answer


If you're looking to get rid of consecutive duplicates only, this should suffice:

df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken

Details

\b        # word boundary
(\w+)     # 1st capture group of a single word
( 
\s+       # 1 or more spaces
\1        # reference to first group 
)+        # one or more repeats
\b

Regex from here.
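The same pattern can be exercised outside pandas with the standard `re` module; a minimal sketch on plain strings:

```python
import re

# Collapses a run of the same word ("Dog Dog Dog Dog") into a single
# occurrence, using the pattern described above.
pattern = re.compile(r'\b(\w+)(\s+\1)+\b')

def dedupe_consecutive(text):
    # The replacement \1 keeps only the first word of each repeated run
    return pattern.sub(r'\1', text)

print(dedupe_consecutive('Dog Dog Dog Dog'))   # Dog
print(dedupe_consecutive('Cat Cat'))           # Cat
print(dedupe_consecutive('Rat Fox Chicken'))   # Rat Fox Chicken
```

Note that the match is case-sensitive, so `'dog Dog'` would not be collapsed, and only consecutive repeats are removed.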


To remove non-consecutive duplicates, I'd suggest a solution involving the OrderedDict data structure:

from collections import OrderedDict

df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
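On Python 3.7+, plain dicts preserve insertion order, so `dict.fromkeys` can stand in for `OrderedDict.fromkeys`; a minimal sketch of the same idea applied to a single string:

```python
def dedupe_words(text):
    # dict.fromkeys keeps the first occurrence of each word, in order
    # (insertion order is guaranteed for plain dicts since Python 3.7)
    return ' '.join(dict.fromkeys(text.split()))

print(dedupe_words('Dog Dog Dog Dog'))  # Dog
print(dedupe_words('Rat Fox Rat'))      # Rat Fox
```

Unlike the regex approach, this removes non-consecutive duplicates as well, at the cost of normalizing all whitespace to single spaces.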
cs95