2

Consider the following Dataframe:

           AAA3 ABB3 DAT4 DEE3 ABB4 AAA4 DAA3 EAV3 DAC4 DEE4
01/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
02/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
03/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
04/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5

The values are not important, so I am giving all columns the same value.

What I want to do is see if the alphabets characters part of my columns headers, has an match among the headers, and if it does, remove the header that has an 4, leaving only the name that has a 3.

For example: 

There is an AAA3, as well as an AAA4. I want to drop the AAA4 column, leaving only AAA3.

Note that there is a column named DAC4, but there isn't a DAC3. So I want to keep my DAC4 column.

I couldn't solve my problem with the following question : Select by partial string from a pandas DataFrame

khouzam
  • 245
  • 3
  • 11

2 Answers2

2

Step 1: Get a dictionary of similar columns :

from collections import defaultdict
from itertools import chain

d = defaultdict(list)
for entry in df.columns:
    d[entry[:-1]].append(entry)

d

defaultdict(list,
            {'AAA': ['AAA3', 'AAA4'],
             'ABB': ['ABB3', 'ABB4'],
             'DAT': ['DAT4'],
             'DEE': ['DEE3', 'DEE4'],
             'DAA': ['DAA3'],
             'EAV': ['EAV3'],
             'DAC': ['DAC4']})

Step 2: Get columns that end with 4 :

from itertools import chain

cols_to_drop = list(chain.from_iterable([[ent for ent in value 
                                         if ent.endswith("4")]
                                         for key,value in d.items() 
                                         if len(value) > 1]))

cols_to_drop
['AAA4', 'ABB4', 'DEE4']

Step 3: Drop columns :

df.drop(columns=cols_to_drop)

    AAA3    ABB3    DAT4    DEE3    DAA3    EAV3    DAC4
0   01/01/2020  1   1.1 1.5 1.2 1   0.9
1   02/01/2020  1   1.1 1.5 1.2 1   0.9
2   03/01/2020  1   1.1 1.5 1.2 1   0.9
3   04/01/2020  1   1.1 1.5 1.2 1   0.9
sammywemmy
  • 14,894
  • 2
  • 11
  • 22
2

Create a mask on duplicated of alphabet part. Create another mask where the last char is 3. Finally slicing using these masks

m = df.columns.str.extract(r'(^[A-Za-z]+)').duplicated(keep=False)
m1 = df.columns.str.endswith('3')
df_final =  df.loc[:,(~m | m1).values]

Out[146]:
            AAA3  ABB3  DAT4  DEE3  DAA3  EAV3  DAC4
01/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
02/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
03/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
04/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
Andy L.
  • 23,082
  • 3
  • 11
  • 23
  • Thank you! Could you explain what (r'(^[A-Za-z]+)') means ? – khouzam Jul 26 '20 at 23:24
  • @khouzam: it's the regex pattern for getting all alphabetic chars both uppercase and lowercase ignore digits. The `^` to indicate starting from the front. – Andy L. Jul 26 '20 at 23:25