
I have a DataFrame structured as follows (it is the result of a JSON normalization):

mydf

id    colA    colB    ...    colArray
foo   a1      b1             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
bar   a2      b2             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
fooz  a3      b3             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
barz  a4      b4             [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
  • date values are timestamps
  • the arrays in colArray have a different length on each row, but their elements all share the exact same structure
  • ['id', 'colA', 'colB'] is an example of the columns that I would like to use as a unique index

I would like to convert these data in order to use them as time series. My desired output would be something like:

id    colA    colB    ...    date               data1    data2 ... data n
foo   a1      b1             '1st timestamp'   'flex'   0.1
foo   a1      b1             '...'   
...
foo   a1      b1             'last_timestamp'   
bar   a2      b2             '1st timestamp'   'zorg'
bar   a2      b2             '...'   
...   
bar   a2      b2             'last_timestamp'   
fooz  a3      b3             '...'   
fooz  a3      b3             '...'   
...
fooz  a3      b3             '...'   
etc.

That would allow me to plot/analyze time series based on tuples such as [foo, a1, b1], for example:
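Something like this is what I have in mind (the name flat for the desired output frame is made up):

flat_foo = flat[(flat['id'] == 'foo') & (flat['colA'] == 'a1') & (flat['colB'] == 'b1')]  # one key tuple
flat_foo.plot(x='date', y='data2')  # requires matplotlib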

To me, this looks very similar to Flatten nested pandas dataframe, but the accepted answer is frustrating: the JSON/dict data is not actually expanded into a DataFrame with the correct columns.


Does anyone have any advice on how to achieve this?


First approach

Use the following, which is close to what I want:

from pandas import json_normalize  # pandas >= 1.0; on older versions: from pandas.io.json import json_normalize

tmpdf = pd.DataFrame(mydf['colArray'].tolist())  # one column per array position
json_normalize(tmpdf[0])  # expands only the dicts in the first position

But there are 2 issues:

  1. I lose the ['id', 'colA', 'colB'] tuple that I would like to use as a unique identifier.
  2. I need to repeat the operation for each column of tmpdf (one per array position); a sketch that might address both issues follows this list.
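For reference, a sketch that might address both issues at once, assuming pandas >= 1.0 so that pd.json_normalize accepts record_path and meta:

import pandas as pd

# Round-trip through records so json_normalize can expand colArray while
# carrying the identifying columns along on every generated row.
flat = pd.json_normalize(
    mydf.to_dict('records'),
    record_path='colArray',       # one output row per array element
    meta=['id', 'colA', 'colB'],  # repeat the key columns
)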

Second method

Based on Accessing nested JSON data as dataframes in Pandas

pd.concat(pd.DataFrame.from_dict(array) for array in mydf['colArray'])

It gives me a DataFrame with all my arrays flattened and the correct column names, but I lose the corresponding keys (['id', 'colA', 'colB']). I feel this is the correct approach, but I can't figure out how to keep the indexing columns (so that I can filter each resulting time series by them).

Too bad there is no "json_melt" function. The closest substitute I can think of is sketched below.
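A rough equivalent, assuming pandas >= 0.25 for explode and positional alignment after resetting the index:

import pandas as pd

# One row per array element with the key columns repeated, then the dicts
# are expanded into real columns and glued back on positionally.
exploded = mydf.explode('colArray').reset_index(drop=True)
flat = pd.concat(
    [exploded.drop(columns='colArray'),
     pd.json_normalize(exploded['colArray'].tolist())],
    axis=1,
)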

Third method

Based on this question: Flatten nested pandas dataframe. I can preserve my indexing columns, but the array elements are still dicts, indexed as [0, 1, 2, ...] by position. I will have trouble dealing with the variable lengths (lots of NA values in the higher column indexes). A possible workaround is sketched below.
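A continuation of that attempt that might avoid the NA padding (untested sketch): stack() drops the NaN cells before the dicts are expanded.

import pandas as pd

# stack() silently removes the NaN cells created by the unequal array lengths.
stacked = (mydf.set_index(['id', 'colA', 'colB'])['colArray']
               .apply(pd.Series)  # columns 0, 1, 2, ... holding dicts or NaN
               .stack())          # Series of dicts, padding gone
flat = (pd.DataFrame(stacked.tolist(), index=stacked.index)
          .reset_index(['id', 'colA', 'colB'])  # keys back as columns
          .reset_index(drop=True))              # drop the position level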


Bibliography:

  • Create a Pandas DataFrame from deeply nested JSON: the solution is based on processing the original JSON, whereas I would like to do this on an existing DataFrame.
  • Accessing nested JSON data as dataframes in Pandas: this is pretty close to what I want.
  • Flatten nested pandas dataframe: the result looks like my first attempt, but the underlying JSON data is not really expanded into the dataframe.
  • A rather complex and not satisfying approach.

EDIT: This question asks the same thing, but at the time of asking I could not find it via search. Listed here for future reference.

LoneWanderer

1 Answer


Use a dictionary comprehension with pop to extract the original column, and concat to build a MultiIndex:

df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})

An alternative is to use the keys parameter:

df = pd.concat([pd.DataFrame(array) for array in mydf.pop('colArray')], keys=mydf.index)
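To see what keys does, a minimal standalone sketch (toy frames, not the question's data):

import pandas as pd

# Each input frame is tagged with its key as a new outer index level.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3]})
print (pd.concat([a, b], keys=['foo', 'bar']))
#        x
# foo 0  1
#     1  2
# bar 0  3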

Then remove the second index level, so it is possible to join with the original DataFrame:

df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)

Sample:

mydf = pd.DataFrame({'id': ['foo', 'bar', 'fooz', 'barz'],
                     'colA': ['a1', 'a2', 'a3', 'a4'],
                     'colB': ['b1', 'b2', 'b3', 'b4'],
                     'colArray': [[{'date': 's', 'data1': 't', 'data2': 0.1},
                                   {'date': 'd', 'data1': 'r', 'data2': 0.8}],
                                  [{'date': 'd', 'data1': 'y', 'data2': 0.1}],
                                  [{'date': 'g', 'data1': 'u', 'data2': 0.1}],
                                  [{'date': 'h', 'data1': 'i', 'data2': 0.1}]]})
print (mydf)
     id colA colB                                           colArray
0   foo   a1   b1  [{'date': 's', 'data1': 't', 'data2': 0.1}, {'...
1   bar   a2   b2        [{'date': 'd', 'data1': 'y', 'data2': 0.1}]
2  fooz   a3   b3        [{'date': 'g', 'data1': 'u', 'data2': 0.1}]
3  barz   a4   b4        [{'date': 'h', 'data1': 'i', 'data2': 0.1}]

df = pd.concat({k: pd.DataFrame(array) for k, array in mydf.pop('colArray').items()})
print (df)
    data1  data2 date
0 0     t    0.1    s
  1     r    0.8    d
1 0     y    0.1    d
2 0     u    0.1    g
3 0     i    0.1    h

df = df.reset_index(level=1, drop=True).join(mydf).reset_index(drop=True)
print (df)
  data1  data2 date    id colA colB
0     t    0.1    s   foo   a1   b1
1     r    0.8    d   foo   a1   b1
2     y    0.1    d   bar   a2   b2
3     u    0.1    g  fooz   a3   b3
4     i    0.1    h  barz   a4   b4
jezrael
  • Thanks for your quick answer, it works like a charm! To be honest, I don't really get the magic that `concat` + the `keys=...` param do. I mean, I get that `pop` returns a new DF with indexing preserved. Then the `concat` part does the magic (quoting the concat doc: *If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected*). If I had to say it in my own words: using [comprehension + keys=], or using [.pop().items()], feeds `concat` with the right data for the magic to happen? – LoneWanderer Feb 05 '19 at 16:33
  • Erf, I had all the intel needed at hand (for the first part), but couldn't figure it out from the documentation... I get the reset_index mechanic, but would have searched for a long time to find it. – LoneWanderer Feb 05 '19 at 16:34
  • I read the Pandas documentation again (http://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html, http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html), and still don't find the `keys=` argument's behavior very clearly explained. Anyway, I'm probably missing it. – LoneWanderer Feb 05 '19 at 16:52
  • @LoneWanderer I think the best explanation is [this Q/A](https://stackoverflow.com/questions/49620538/what-are-the-levels-keys-and-names-arguments-for-in-pandas-concat-functio) – jezrael Feb 05 '19 at 17:04
  • Damn it, https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas and https://stackoverflow.com/questions/49671693/pandas-dataframe-normalize-one-json-column-and-merge-with-other-columns already cover the issue... can't believe I did not find them! I'll add them to the bibliography. – LoneWanderer Feb 06 '19 at 18:23