# import modules, set seed
import random
import numpy as np
import pandas as pd
random.seed(42)

The problem

I have a dataframe df. Each row holds a value that is the input to a function, and that function produces a variable number of outputs. The maximum number of outputs is not known a priori. The outputs should go in the same row as their input, creating new columns as necessary. Unfilled cells should be filled with NaN.


Reproducible setup

Let's create a dataframe:

df = pd.DataFrame(pd.Series([random.randint(1,10) for _ in range(5)]),columns=['randomnums'])

This looks like:

   randomnums
0           2
1           1
2           5
3           4
4           4


What I have done

I created a dataframe (auxiliarydf) holding the values that should fill the new columns of the original df, using from_dict(), apply(), a lambda function, and dict and list comprehensions:

auxiliarydf = pd.DataFrame.from_dict(
                {index: pd.Series(array) for index, array in zip(
                         df.index,
                         df['randomnums'].apply(
                                          lambda r: 
                                          # here I apply some function on the row.
                                          # The output will be a list of variable length
                                          # for the sake of an example:
                                          np.array([x for x in range(r)])))},
                orient='index')

auxiliarydf then has one row per row of df and one column per output position (0 to 4 here), with NaN wherever a row produced fewer outputs.

I then concat() df with auxiliarydf:

pd.concat([df, auxiliarydf], axis=1)

The result is df's randomnums column followed by the new columns 0 to 4, with NaN in the unfilled cells, which is as expected.


The question

Is there an easier, maybe built-in, pandas way to do the process above? It works, but this seems like a problem that comes up often enough to expect a neater solution.


Colab notebook available here with the code above.

zabop

1 Answer


You can also build the auxiliary dataframe directly with the pd.DataFrame constructor: call to_list() on the resulting series of arrays, pass the existing dataframe's index, and then attach the result with df.join():

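# Series holding one variable-length array per row
auxillary_df = df['randomnums'].apply(lambda r: np.array([x for x in range(r)]))
# Build a NaN-padded frame on the same index, then join it to df
df.join(pd.DataFrame(auxillary_df.to_list(),index=df.index))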

   randomnums  0    1    2    3    4
0           2  0  1.0  NaN  NaN  NaN
1           1  0  NaN  NaN  NaN  NaN
2           5  0  1.0  2.0  3.0  4.0
3           4  0  1.0  2.0  3.0  NaN
4           4  0  1.0  2.0  3.0  NaN
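
The NaN padding here comes from the DataFrame constructor itself, which pads shorter rows when it is handed a list of sequences of unequal length. A minimal sketch with plain Python lists, assuming the hard-coded lengths match the randomnums column shown above (`ragged` and `padded` are just illustrative names):

```python
import pandas as pd

# Rows of unequal length, mimicking range(r) for r in [2, 1, 5, 4, 4].
ragged = [list(range(r)) for r in [2, 1, 5, 4, 4]]

# The constructor aligns every row to the longest one and fills gaps with NaN.
padded = pd.DataFrame(ragged)
print(padded)
```

Since no index is passed in this sketch, the frame gets a default RangeIndex; passing index=df.index, as in the answer, is what keeps the join aligned when df has a non-default index.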

Of course, you can chain them into a one-liner, though readability comes first :)

df.join(pd.DataFrame(df['randomnums'].apply(lambda r:
    np.array([x for x in range(r)])).to_list(),index=df.index))
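
If this comes up repeatedly, the same idea can be wrapped in a small helper. The following is only a sketch: `expand_outputs` and the `out_` column prefix are hypothetical names, not part of pandas, and it assumes the function returns something iterable for every row.

```python
import pandas as pd

def expand_outputs(df, column, func, prefix="out_"):
    """Join func's variable-length outputs to df as NaN-padded columns."""
    # One list per row; the lengths may differ from row to row.
    outputs = [list(func(value)) for value in df[column]]
    # Reusing df.index keeps the new rows aligned with the original ones.
    expanded = pd.DataFrame(outputs, index=df.index).add_prefix(prefix)
    return df.join(expanded)

# Hard-coded values instead of the seeded random numbers, for reproducibility.
df = pd.DataFrame({"randomnums": [2, 1, 5, 4, 4]})
result = expand_outputs(df, "randomnums", range)
print(result)
```

Prefixing the new columns avoids mixing integer and string column labels in the joined frame, which tends to make later column selection less surprising.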
anky