I have a DataFrame structured as follows (it is the result of JSON normalization):
mydf
id colA colB ... colArray
foo a1 b1 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
bar a2 b2 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
fooz a3 b3 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
barz a4 b4 [{'date': '...', 'data1': '...', 'data2': 0.1 ...}, ...]
The date values are timestamps. Each array in colArray has a different length, but the array elements all have exactly the same structure. ['id', 'colA', 'colB'] is an example of the columns that I would like to use as a unique index.
I would like to convert these data in order to use them as time series. My desired output would be something like:
id colA colB ... date data1 data2 ... data n
foo a1 b1 '1st timestamp' 'flex' 0.1
foo a1 b1 '...'
...
foo a1 b1 'last_timestamp'
bar a2 b2 '1st timestamp' 'zorg'
bar a2 b2 '...'
...
bar a2 b2 'last_timestamp'
fooz a3 b3 '...'
fooz a3 b3 '...'
...
fooz a3 b3 '...'
etc.
That would allow me to plot/analyze time series based on tuples such as [foo, a1, b1].
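For context, here is a sketch of the target shape using DataFrame.explode plus json_normalize (sample values are invented for illustration; explode needs pandas ≥ 0.25 and pd.json_normalize needs pandas ≥ 1.0):

```python
import pandas as pd

# Toy data mimicking the structure above (values are made up)
mydf = pd.DataFrame({
    'id': ['foo', 'bar'],
    'colA': ['a1', 'a2'],
    'colB': ['b1', 'b2'],
    'colArray': [
        [{'date': '2021-01-01', 'data1': 'flex', 'data2': 0.1},
         {'date': '2021-01-02', 'data1': 'flex', 'data2': 0.2}],
        [{'date': '2021-01-01', 'data1': 'zorg', 'data2': 0.3}],
    ],
})

# One row per array element; the key columns are repeated per element
exploded = mydf.explode('colArray').reset_index(drop=True)

# Expand each dict into its own columns, then glue the key columns back on
flat = pd.concat(
    [exploded.drop(columns='colArray'),
     pd.json_normalize(exploded['colArray'].tolist())],
    axis=1,
)
```

After this, flat has one row per timestamp with id/colA/colB repeated alongside date, data1, data2.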
To me, this looks very similar to Flatten nested pandas dataframe, but the accepted answer there is frustrating: the JSON/dict data is not actually expanded to produce a DataFrame with the correct data.
Does anyone have any advice on how to achieve this?
First approach
Use the following, which is close to what I want:
import pandas as pd

tmpdf = pd.DataFrame(mydf['colArray'].tolist())
pd.json_normalize(tmpdf[0])  # flattens only the first array element of each row
But there are 2 issues:
- I lost the ['id', 'colA', 'colB'] tuple that I would like to use as a unique identifier.
- I need to repeat the operation for each column of my tmpdf.
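Both issues can be sidestepped by letting json_normalize unroll the arrays itself, via its record_path and meta arguments, after converting the DataFrame back to records (a sketch assuming pandas ≥ 1.0; the sample values are invented):

```python
import pandas as pd

# Toy data mimicking the structure in the question (values are made up)
mydf = pd.DataFrame({
    'id': ['foo', 'bar'],
    'colA': ['a1', 'a2'],
    'colB': ['b1', 'b2'],
    'colArray': [
        [{'date': '2021-01-01', 'data1': 'flex', 'data2': 0.1},
         {'date': '2021-01-02', 'data1': 'flex', 'data2': 0.2}],
        [{'date': '2021-01-01', 'data1': 'zorg', 'data2': 0.3}],
    ],
})

# record_path unrolls each 'colArray' list into rows, while meta
# repeats the identifying columns next to every unrolled element
flat = pd.json_normalize(
    mydf.to_dict('records'),
    record_path='colArray',
    meta=['id', 'colA', 'colB'],
)
```

This handles arrays of different lengths naturally: each array simply contributes as many rows as it has elements.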
Second method
Based on Accessing nested JSON data as dataframes in Pandas
pd.concat(pd.DataFrame.from_dict(arr) for arr in mydf['colArray'])
It gives me a dataframe with all my arrays flattened and the correct column names, but I lost the corresponding keys (['id', 'colA', 'colB']).
I feel this is the correct approach, but I can't figure out how to keep indexing columns (so that I can filter each resulting time series by indexing columns).
Too bad there is no "json_melt" function
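The lost correspondence could be restored with the keys (and names) arguments of pd.concat, which label each concatenated sub-frame with its key tuple; this is roughly the "json_melt" behaviour wished for here. A sketch with invented sample values:

```python
import pandas as pd

# Toy data mimicking the structure in the question (values are made up)
mydf = pd.DataFrame({
    'id': ['foo', 'bar'],
    'colA': ['a1', 'a2'],
    'colB': ['b1', 'b2'],
    'colArray': [
        [{'date': '2021-01-01', 'data1': 'flex', 'data2': 0.1},
         {'date': '2021-01-02', 'data1': 'flex', 'data2': 0.2}],
        [{'date': '2021-01-01', 'data1': 'zorg', 'data2': 0.3}],
    ],
})

# One sub-frame per original row, labelled by its (id, colA, colB) tuple;
# the tuples become levels of a MultiIndex on the result
flat = pd.concat(
    (pd.DataFrame(arr) for arr in mydf['colArray']),
    keys=list(mydf[['id', 'colA', 'colB']].itertuples(index=False, name=None)),
    names=['id', 'colA', 'colB'],
)

# Each time series can then be selected by its key tuple
one_series = flat.loc[('foo', 'a1', 'b1')]
```

Filtering a single series is then just a .loc lookup on the key tuple, which is exactly the per-tuple access described above.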
Third method
Based on this question: Flatten nested pandas dataframe. I can preserve my indexing columns, but the array elements remain as JSON dicts and are indexed as [0, 1, 2, ...]. I will have trouble dealing with the variable lengths (lots of NA values for the higher column indexes).
Bibliography:
- Create a Pandas DataFrame from deeply nested JSON: but the solution is based on processing the original JSON, whereas I would like to do this on an existing DataFrame.
- Accessing nested JSON data as dataframes in Pandas: this is pretty close to what I want.
- Flatten nested pandas dataframe: the result looks like my first attempt, but the underlying JSON data is not really "matrixed" into the dataframe.
A rather complex and not satisfying approach.
EDIT: This question is the same, but at the time of asking I could not find it via search. Linked here for future reference.