2

I read csv with more than 300 columns. Many columns have similar names.
Shortened example from the csv:

index   d     c     a     b     b     a     a     a
0       data  data  data  data  data  data  data  data

Python automatically adds numbers at the end of columns names (if they are similar) as a suffix to ensure each column has a unique name.
Example:

index   d     c     a     b     b.1   a.1   a.2   a.3
0       data  data  data  data  data  data  data  data

My assignment is to sort the columns alphabetically and to add zeros (leading zeros) to the suffix.
Desired output:

index   a     a.01  a.02  a.03  b     b.01  c     d
0       data  data  data  data  data  data  data  data

My code below (taken from stackoverflow) can sort the column. But I have no idea how to make suffix with leading zeros?

import pandas as pd
df= pd.read_csv(r"C:\Users\.....\file.csv", skipinitialspace=False, sep=';', header= 0, index_col = 'DateTime', low_memory=False)

df = df.sort_index(axis=1) # sort the column
pd.set_option('display.max_columns', None) # view all columns without truncated

Any ideas?

Edited question
After sorting the columns, I want,

a.01 a02 a12

rather than

a.01 a.012 a.02
k.ko3n
  • 844
  • 5
  • 17

4 Answers4

3

You could do in this way:

import re
df = df.rename(columns=lambda x: re.sub('\.','.0',x))

or

df = df.rename(columns=lambda x: x.replace('.','.0'))

Output:

      a  a.01  a.02  a.03     b  b.01     c     d
0  data  data  data  data  data  data  data  data

Or another option in which you can decide the number of leading zeros:

df = df.rename(columns=lambda x: x.split('.')[0] + '.' + x.split('.')[1].zfill(2) if '.' in x else x)

With this last option if the name of a column is b.10, it wont be changed.

Joe
  • 9,817
  • 4
  • 32
  • 43
  • 2
    This will turn `a.10` to `a.010`, worth clarifying what the desired behaviour for double digit numbers is – Dan Nov 12 '18 at 14:07
  • it would be a lot simpler to just use string interpolation with `02d`. So `lambda x: f"{x.split('.')[0]}.{x.split('.')[1]:02d}"` but piRSquared's solution doesn't require calling `split` twice – Dan Nov 12 '18 at 14:26
  • @Dan, i believe `df.columns.str` will be simple & easy approach. – Karn Kumar Nov 12 '18 at 14:48
3

rename

def f(x):
  a, *b = x.rsplit('.', 1)

  return f"{a}.{int(b[0]):02d}" if b else a

df.rename(columns=f).sort_index(1)

          a  a.01  a.02  a.03     b  b.01     c     d
index                                                
0      data  data  data  data  data  data  data  data
piRSquared
  • 240,659
  • 38
  • 359
  • 510
2

With df.columns.str

There are answers posted, but would like to add one more easy solution:

Since df.columns is an Index object, we can use the .str accessor.

Here is he Doc refence Doc

Just a short simulated example:

>>> df
    b.1   a.1   a.2
0  data  data  data
1  data  data  data

Replace the desired columns directly to DataFrame as follows:

>>> df.columns = df.columns.str.replace('.','.0')
>>> df
   b.01  a.01  a.02
0  data  data  data
1  data  data  data

OR below already given but better to use inplace=True rather assigning to df.

df.rename(columns=lambda x: x.replace('.','.0'), inplace=True)

Another way around using axis:

df.rename(lambda x: x.replace('.','.0'), axis='columns', inplace=True)
print(df)
   b.01  a.01  a.02
0  data  data  data
1  data  data  data
Karn Kumar
  • 5,809
  • 1
  • 18
  • 33
1

Using a string formatting and mainly a single row code:

import pandas as pd

df = pd.DataFrame( [["data", "data", "data", "data", "data", "data", "data", "data"], 
                    ["data", "data", "data", "data", "data", "data", "data", "data"]], 
                   columns=["a", "a.1", "a.2", "a.3", "b", "b.1", "c", "d"])

df.columns = [c.replace(c[c.find(".")+1], format(int(c[c.find(".")+1]),'02') if "." in c else c) for c in df.columns]

print(df) 

Out:

      a  a.01  a.02  a.03     b  b.01     c     d
0  data  data  data  data  data  data  data  data
1  data  data  data  data  data  data  data  data
Geeocode
  • 5,009
  • 2
  • 15
  • 30