filter based on partial string for multiple columns headers

Question

Consider the following Dataframe:

           AAA3 ABB3 DAT4 DEE3 ABB4 AAA4 DAA3 EAV3 DAC4 DEE4
01/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
02/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
03/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
04/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5

The values are not important, so I am giving all columns the same value.

What I want to do is see if the alphabets characters part of my columns headers, has an match among the headers, and if it does, remove the header that has an 4, leaving only the name that has a 3.

For example:

There is an AAA3, as well as an AAA4. I want to drop the AAA4 column, leaving only AAA3.

Note that there is a column named DAC4, but there isn't a DAC3. So I want to keep my DAC4 column.

I couldn't solve my problem with the following question : Select by partial string from a pandas DataFrame

score 2 · Answer 1 · answered Jul 26 '20 at 23:00

Step 1: Get a dictionary of similar columns :

from collections import defaultdict
from itertools import chain

d = defaultdict(list)
for entry in df.columns:
    d[entry[:-1]].append(entry)

d

defaultdict(list,
            {'AAA': ['AAA3', 'AAA4'],
             'ABB': ['ABB3', 'ABB4'],
             'DAT': ['DAT4'],
             'DEE': ['DEE3', 'DEE4'],
             'DAA': ['DAA3'],
             'EAV': ['EAV3'],
             'DAC': ['DAC4']})

Step 2: Get columns that end with 4 :

from itertools import chain

cols_to_drop = list(chain.from_iterable([[ent for ent in value 
                                         if ent.endswith("4")]
                                         for key,value in d.items() 
                                         if len(value) > 1]))

cols_to_drop
['AAA4', 'ABB4', 'DEE4']

Step 3: Drop columns :

df.drop(columns=cols_to_drop)

    AAA3    ABB3    DAT4    DEE3    DAA3    EAV3    DAC4
0   01/01/2020  1   1.1 1.5 1.2 1   0.9
1   02/01/2020  1   1.1 1.5 1.2 1   0.9
2   03/01/2020  1   1.1 1.5 1.2 1   0.9
3   04/01/2020  1   1.1 1.5 1.2 1   0.9

This solution also worked perfectly for me! thank you for your time helping me — khouzam, Jul 26 '20 at 23:26

score 2 · Accepted Answer · answered Jul 26 '20 at 23:13

2

Create a mask on duplicated of alphabet part. Create another mask where the last char is 3. Finally slicing using these masks

m = df.columns.str.extract(r'(^[A-Za-z]+)').duplicated(keep=False)
m1 = df.columns.str.endswith('3')
df_final =  df.loc[:,(~m | m1).values]

Out[146]:
            AAA3  ABB3  DAT4  DEE3  DAA3  EAV3  DAC4
01/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
02/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
03/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
04/01/2020     1   1.1   1.5   1.2   1.2     1   0.9

answered Jul 26 '20 at 23:13

Andy L.

23,082
3
11
23

Thank you! Could you explain what (r'(^[A-Za-z]+)') means ? – khouzam Jul 26 '20 at 23:24
@khouzam: it's the regex pattern for getting all alphabetic chars both uppercase and lowercase ignore digits. The `^` to indicate starting from the front. – Andy L. Jul 26 '20 at 23:25

filter based on partial string for multiple columns headers

2 Answers2