Using pandas and json_normalize to flatten nested JSON API response

Question

I have a deeply nested JSON that I am trying to turn into a Pandas Dataframe using json_normalize.

A generic sample of the JSON data I'm working with looks looks like this (I've added context of what I'm trying to do at the bottom of the post):

{
    "per_page": 2,
    "total": 1,
    "data": [{
            "total_time": 0,
            "collection_mode": "default",
            "href": "https://api.surveymonkey.com/v3/responses/5007154325",
            "custom_variables": {
                "custvar_1": "one",
                "custvar_2": "two"
            },
            "custom_value": "custom identifier for the response",
            "edit_url": "https://www.surveymonkey.com/r/",
            "analyze_url": "https://www.surveymonkey.com/analyze/browse/",
            "ip_address": "",
            "pages": [
                {
                    "id": "103332310",
                    "questions": [{
                            "answers": [{
                                    "choice_id": "3057839051"
                                }
                            ],
                            "id": "319352786"
                        }
                    ]
                },
                {
                    "id": "44783164",
                    "questions": [{
                            "id": "153745381",
                            "answers": [{
                                    "text": "some_name"
                                }
                            ]
                        }
                    ]
                },
                {
                    "id": "44783183",
                    "questions": [{
                            "id": "153745436",
                            "answers": [{
                                    "col_id": "1087201352",
                                    "choice_id": "1087201369",
                                    "row_id": "1087201362"
                                }, {
                                    "col_id": "1087201353",
                                    "choice_id": "1087201373",
                                    "row_id": "1087201362"
                                }
                                ]
                            }
                        ]
                }
            ],
            "date_modified": "1970-01-17T19:07:34+00:00",
            "response_status": "completed",
            "id": "5007154325",
            "collector_id": "50253586",
            "recipient_id": "0",
            "date_created": "1970-01-17T19:07:34+00:00",
            "survey_id": "105723396"
        }
    ],
    "page": 1,
    "links": {
        "self": "https://api.surveymonkey.com/v3/surveys/123456/responses/bulk?page=1&per_page=2"
    }
}

I'd like to end up with a dataframe that contains the question_id, page_id, response_id, and response data like this:

    choice_id      col_id      row_id       text   question_id       page_id      response_id
0  3057839051         NaN         NaN        NaN     319352786     103332310       5007154325
1         NaN         NaN         NaN  some_name     153745381      44783164       5007154325
2  1087201369  1087201352  1087201362        NaN     153745436      44783183       5007154325
3  1087201373  1087201353  1087201362        NaN     153745436      44783183       5007154325

I can get close by running the following code (Python 3.6):

df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions'], meta='id', record_prefix ='question_')
print(df)

Which returns:

                                    question_answers question_id          id
0                      [{'choice_id': '3057839051'}]   319352786  5007154325
1                            [{'text': 'some_name'}]   153745381  5007154325
2  [{'col_id': '1087201352', 'choice_id': '108720...   153745436  5007154325

But if I try to run json_normalize at a deeper nest and keep the 'question_id' data from the above result, I can only get the page_id values to return, not true question_id values:

answers_df = json_normalize(data=so_survey_responses['data'], record_path=['pages', 'questions', 'answers'], meta=['id', ['questions', 'id'], ['pages', 'id']])
print(answers_df)

Returns:

    choice_id      col_id      row_id       text          id questions.id   pages.id
0  3057839051         NaN         NaN        NaN  5007154325    103332310  103332310
1         NaN         NaN         NaN  some_name  5007154325     44783164   44783164
2  1087201369  1087201352  1087201362        NaN  5007154325     44783183   44783183
3  1087201373  1087201353  1087201362        NaN  5007154325     44783183   44783183

A complicating factor may be that all the above (question_id, page_id, response_id) are 'id:' in the JSON data.

I'm sure this is possible, but I can't get there. Any examples of how to do this?

Additional context: I'm trying to create a dataframe of SurveyMonkey API response output.

My long term goal is to re-create the "all responses" excel sheet that their export service provides.

I plan to do this by getting the response dataframe set up (above), and then use .apply() to match responses with their survey structure API output.

I've found the SurveyMonkey API pretty lackluster at providing useful output, but I'm new to Pandas so it's probably on me.

Did you ever succeed in formatting the responses from the Survey Monkey API in a ways that (even vaguely) resembles the Excel sheet from their export service? If so, would you save my week by sharing the code? — MajorBriggs, Jan 20 '19 at 19:10

score 14 · Accepted Answer · answered Nov 13 '18 at 00:58

You need to modify the meta parameter of your last option, and, if you want to rename columns to be exactly the way you want, you could do it with rename:

answers_df = json_normalize(data=so_survey_responses['data'],
                        record_path=['pages', 'questions', 'answers'],
                        meta=['id', ['pages', 'questions', 'id'], ['pages', 'id']])\
.rename(index=str,
        columns={'id': 'response_id', 'pages.questions.id': 'question_id', 'pages.id': 'page_id'})

score 2 · Answer 2 · answered Nov 12 '18 at 19:55

There is no way to do this in a completely generic way using json_normalize(). You can use the record_path and meta arguments to indicate how you want the JSON to be processed.

However, you can use the flatten package to flatten your deeply nested JSON and then convert that to a Pandas dataframe. The page has example usage of how to flatten a deeply-nested JSON and convert to a Pandas dataframe.

Using pandas and json_normalize to flatten nested JSON API response

2 Answers2