Pandas: parse merged header columns from Excel

Question

The data in excel sheets is stored as follows:

   Area     |          Product1     |      Product2        |      Product3
            |      sales|sales.Value|   sales |sales.Value |  sales |sales.Value
  Location1 |    20     | 20000     |      25 |  10000     |   200  | 100
  Location2 |    30     | 30000     |      3  | 12300      |   213  | 10

the product name is a merge of 2 cells of two rows "no of sales" and "sales value" for each of 1000 or so areas for a given month. Similarly there are separate files for each month for the last 5 years. Further, new products have been added and removed in different months. So a different month file might look like:

   Area     |          Product1     |      Product4        |      Product3

Can the forum suggest the best way to read this data using pandas? I can't use index since the product columns are different each month

Ideally, I would like to convert the initial format above to:

 Area      | Product1.sales|Product1.sales.Value| Product2.sales |Product2.sales.Value | 
 Location1 | 20            | 20000              | 25             | 10000               |  
 Location2 | 30            | 30000              | 3              | 12300               |

import pandas as pd
xl_file = read_excel("file path", skiprow=2, sheetname=0)
/* since the first two rows are always blank */


                  0            1        2               3                      4
      0          NaN          NaN      NaN       Auto loan                    NaN
      1  Branch Code  Branch Name   Region  No of accounts  Portfolio Outstanding
      2         3000       Name1  Central               0                      0
      3         3001       Name2  Central               0                      0

I want to convert it to Auto loan.No of account, Auto loan.Portfolio Outstanding as the headers.

Can you post an example of what the DataFrame looks like when you load a file with `df = pd.read_excel(...)`? What is `df.index` and `df.columns`? — unutbu, Dec 11 '14 at 10:28
thx, i figured it out. There are only 4 combination of meta columns in the 5x12=60 files. So I am just using a dictionary for all 4 combinations. — Samarth Bharadwaj, Dec 11 '14 at 11:01
@unutbu: does my EDIT express my requirement clearly? your help is appreciated, since my solution is not elegant. — Samarth Bharadwaj, Dec 11 '14 at 11:20
The edit explains the desired result (good), but it's still not clear (at least to me) what the original DataFrames look like. Do the original DataFrames have MultiIndex columns? Please post your code even if it is not elegant. It may help us understand what your DataFrames look like. — unutbu, Dec 11 '14 at 11:26
@unutbu I have added in EDIT2 the dataframe head() view. As you can see I would like to convert it to "Auto loan.No of accounts" or something similar that is uniquely query able. — Samarth Bharadwaj, Dec 11 '14 at 12:20

score 10 · Accepted Answer · edited Dec 12 '14 at 01:59

Suppose your DataFrame is df:

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    (nan, nan, nan, 'Auto loan', nan)
    , ('Branch Code', 'Branch Name', 'Region', 'No of accounts'
       , 'Portfolio Outstanding')
    , (3000, 'Name1', 'Central', 0, 0)
    , (3001, 'Name2', 'Central', 0, 0)
])

so that it looks like this:

             0            1        2               3                      4
0          NaN          NaN      NaN       Auto loan                    NaN
1  Branch Code  Branch Name   Region  No of accounts  Portfolio Outstanding
2         3000       Name1  Central               0                      0
3         3001       Name2  Central               0                      0

Then first forward fill the NaNs in the first two rows (thus propagating 'Auto loan', for example).

df.iloc[0:2] = df.iloc[0:2].fillna(method='ffill', axis=1)

Next fill in the remaining NaNs with empty strings:

df.iloc[0:2] = df.iloc[0:2].fillna('')

Now join the two rows together with . and assign that as the column level values:

df.columns = df.iloc[0:2].apply(lambda x: '.'.join([y for y in x if y]), axis=0)

And finally, remove the first two rows:

df = df.iloc[2:]

This yields

  Branch Code Branch Name   Region Auto loan.No of accounts  \
2        3000      Name1  Central                        0   
3        3001      Name2  Central                        0   

  Auto loan.Portfolio Outstanding  
2                               0  
3                               0

Alternatively, you could create a MultiIndex column instead of creating a flat column index:

import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame([
    (nan, nan, nan, 'Auto loan', nan)
    , ('Branch Code', 'Branch Name', 'Region', 'No of accounts'
       , 'Portfolio Outstanding')
    , (3000, 'Name1', 'Central', 0, 0)
    , (3001, 'Name2', 'Central', 0, 0)
])
df.iloc[0:2] = df.iloc[0:2].fillna(method='ffill', axis=1)
df.iloc[0:2] = df.iloc[0:2].fillna('Area')

df.columns = pd.MultiIndex.from_tuples(
    zip(*df.iloc[0:2].to_records(index=False).tolist()))
df = df.iloc[2:]

Now df looks like this:

         Area                           Auto loan                      
  Branch Code Branch Name   Region No of accounts Portfolio Outstanding
2        3000      Name1  Central              0                     0
3        3001      Name2  Central              0                     0

the column is a MultiIndex:

In [275]: df.columns
Out[275]: 
MultiIndex(levels=[[u'Area', u'Auto loan'], [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region']],
           labels=[[0, 0, 0, 1, 1], [0, 1, 4, 2, 3]])

The column has two levels. The first level has values [u'Area', u'Auto loan'], the second has values [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region'].

You can then access a column by specifing the value from both levels:

print(df.loc[:, ('Area', 'Branch Name')])
# 2    Name1
# 3    Name2
# Name: (Area, Branch Name), dtype: object

print(df.loc[:, ('Auto loan', 'No of accounts')])
# 2    0
# 3    0
# Name: (Auto loan, No of accounts), dtype: object

One advantage of using a MultiIndex is that you can easily select all columns which have a certain level value. For instance, to select the sub-DataFrame having to do with Auto loans you could use:

In [279]: df.loc[:, 'Auto loan']
Out[279]: 
  No of accounts Portfolio Outstanding
2              0                     0
3              0                     0

For more on selecting rows and columns from a MultiIndex, see MultiIndexing Using Slicers.

Pandas: parse merged header columns from Excel

1 Answers1

Linked

Related