
I am trying to split a column containing a JSON array of objects into multiple columns and create multiple rows. Below is the sample data:

signalid    monthyear   readings
5135         201901    [{"v":"90","q":1,"t":1546444800000,"g":0}]
5135         201901    [{"v":"50","q":1,"t":1546444900000,"g":0}]
5135         201901    [{"v":"40","q":1,"t":1546445800000,"g":0}]
5135         201901    [{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]

Details about data:

  1. Data is fetched from the Cassandra database.

  2. The column data type in Cassandra is text.

  3. dataframe.readings[0]

'[{"v":"9817","q":1,"t":1548979150085,"g":0},{"v":"9821","q":1,"t":1548979151475,"g":0}]'

  4. dataframe.readings.dtype

    dtype('O')

Normalize code:

normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
                            meta=["signalid", "monthyear"])

Expected output:

signalid    monthyear  v     q         t           g
5135         201901    90    1    1546444800000    0
5135         201901    50    1    1546444900000    0
5135         201901    40    1    1546445800000    0
5135         201901    30    1    1546446800000    0
5135         201901    35    1    1546446900000    0

Result:

TypeError                                 Traceback (most recent call last)
<ipython-input-13-486775fb3431> in <module>
     60 #normalizeddataframe = pandas.read_json(dataframe.readings)
     61 normalizeddataframe = json_normalize(data=dataframe, record_path='readings', 
---> 62                             meta=["signalid", "monthyear", "fromtime", "totime", "avg", "insertdate", "max", "min"])
     63 #normalizeddataframe = json_normalize(pandas.Series.to_json(dataframe.readings))
     64 print("Processig completed");

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
    260                 records.extend(recs)
    261 
--> 262     _recursive_extract(data, record_path, {}, level=0)
    263 
    264     result = DataFrame(records)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _recursive_extract(data, path, seen_meta, level)
    236         else:
    237             for obj in data:
--> 238                 recs = _pull_field(obj, path[0])
    239 
    240                 # For repeating the metadata later

C:\ProgramData\Anaconda3\lib\site-packages\pandas\io\json\normalize.py in _pull_field(js, spec)
    183                 result = result[field]
    184         else:
--> 185             result = result[spec]
    186 
    187         return result

TypeError: string indices must be integers
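The traceback ends in `_pull_field` indexing into a string, which suggests `json_normalize` is iterating over plain strings rather than parsed records: the `readings` values come out of Cassandra as JSON text, and passing the DataFrame itself (rather than a list of record dicts) makes things worse. A minimal sketch of one way to get the expected output, assuming the sample data from the question (parse the text with `json.loads` first, then pass records to `json_normalize`):

```python
import json
import pandas as pd

try:
    from pandas import json_normalize          # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas, as in the traceback

# Sample data mirroring the question; readings is JSON *text*, as stored in Cassandra
dataframe = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

# Parse the JSON text into Python lists of dicts before normalizing
dataframe["readings"] = dataframe["readings"].apply(json.loads)

# Hand json_normalize a list of record dicts, not the DataFrame itself
normalized = json_normalize(
    dataframe.to_dict(orient="records"),
    record_path="readings",
    meta=["signalid", "monthyear"],
)
print(normalized)
```

Each element of a row's `readings` list becomes its own output row, with `signalid` and `monthyear` repeated as metadata, matching the expected output above.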
Nithin B

2 Answers


Let's assume we have a data frame with a column named readings whose values are

[{"v":"90","q":1,"t":1546444800000,"g":0}]

[{"v":"50","q":1,"t":1546444900000,"g":0}]

[{"v":"40","q":1,"t":1546445800000,"g":0}]

for row in data['readings']:
    for value_dict in row:
        for key in value_dict:
            data[key] = value_dict[key]

In the case of multiple JSON objects in the list, you did not mention your expected output. I assume this might help you; if you mention the desired output, I can modify my code accordingly.

bumblebee
  • Added expected output to the question. – Nithin B Feb 06 '19 at 05:31
  • I think the problem is that the row variable is a string object and not a JSON object, therefore value_dict and key come out as single characters like [. – Nithin B Feb 06 '19 at 05:42
  • You cannot add multiple column values to the same row. It will overwrite the previous one. – bumblebee Feb 06 '19 at 06:59
  • But the v,q,t,g values differ within the array, so two rows need to be created as I see it. This link shows the same thing that I am trying to do: https://www.kaggle.com/jboysen/quick-tutorial-flatten-nested-json-in-pandas . I think the problem is that it is a string instead of a JSON object. – Nithin B Feb 06 '19 at 08:37
  • The above code gives TypeError: string indices must be integers error – Nithin B Feb 06 '19 at 09:18
  • Use json_normalize as you mentioned in the above link and append the normalized data. Refer this https://stackoverflow.com/questions/49671693/pandas-dataframe-normalize-one-json-column-and-merge-with-other-columns – bumblebee Feb 06 '19 at 10:29
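Following the comment thread above: the loop fails because each `readings` value is still a string, so the inner loops iterate over characters, and assigning `data[key]` overwrites the column on every pass instead of producing one row per reading. A hedged rework of the loop (sample data assumed) that parses each string first and collects one output row per dict:

```python
import json
import pandas as pd

# Sample data mirroring the question; readings is JSON text
data = pd.DataFrame({
    "signalid": [5135, 5135],
    "monthyear": [201901, 201901],
    "readings": [
        '[{"v":"90","q":1,"t":1546444800000,"g":0}]',
        '[{"v":"30","q":1,"t":1546446800000,"g":0},{"v":"35","q":1,"t":1546446900000,"g":0}]',
    ],
})

rows = []
for _, row in data.iterrows():
    # Parse the JSON text so we iterate over dicts, not characters
    for value_dict in json.loads(row["readings"]):
        # One output row per reading, carrying the metadata along
        rows.append({"signalid": row["signalid"],
                     "monthyear": row["monthyear"],
                     **value_dict})

result = pd.DataFrame(rows)
print(result)
```

Collecting dicts in a list and building the DataFrame once at the end avoids the overwrite problem the comments describe.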

Maybe there is a simpler solution, but this one should work. The idea is to convert each entry of the readings column into a DataFrame whose index is given by signalid and monthyear.

This function converts a dictionary into a DataFrame and sets the index:

def to_df(index, x):
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(
        pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))

The next function triggers the call of to_df for each reading in a row:

def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

We now apply concat to each row. The result is a Series of DataFrames, which we can concatenate to obtain the final data:

s = df.apply(concat, axis=1)
pd.concat(s.tolist())
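Note that this approach assumes each readings entry has already been parsed into a list of dicts; since the question says the values arrive from Cassandra as text, one way (an assumption, not part of the answer) is to run json.loads over the column first. A self-contained sketch combining the two:

```python
import json
import pandas as pd

def to_df(index, x):
    # One dict -> one-row DataFrame, indexed by (signalid, monthyear)
    return pd.DataFrame.from_dict(x, orient='index').T.set_index(
        pd.MultiIndex.from_arrays(index, names=['signalid', 'monthyear']))

def concat(y):
    return pd.concat(map(lambda x: to_df([[y.signalid], [y.monthyear]], x), y.readings))

# Sample row mirroring the question; readings is JSON text
df = pd.DataFrame({
    'signalid': [5135],
    'monthyear': [201901],
    'readings': ['[{"v":"30","q":1,"t":1546446800000,"g":0},'
                 '{"v":"35","q":1,"t":1546446900000,"g":0}]'],
})

# Parse the text first so y.readings is a list of dicts, not a string
df['readings'] = df['readings'].apply(json.loads)

s = df.apply(concat, axis=1)
out = pd.concat(s.tolist())
print(out)
```

With the parsing step in place, the two readings in the sample row become two output rows under the same (signalid, monthyear) index.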
JoergVanAken