How to access nested attribute without passing parent attribute in pyspark json

Question

I am trying to access the inner attributes of the following JSON using PySpark:

[
 {
    "432": [
        {
            "atttr1": null,
            "atttr2": "7DG6",
            "id":432,
            "score": 100
        }
    ]
},
 {
    "238": [
        {
            "atttr1": null,
            "atttr2": "7SS8",
            "id":432,
            "score": 100
        }
    ]
}
]

In the output, I am looking for something like the following, in CSV form:

atttr1,atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100

I understand I can get these details as shown below, but I don't want to pass 432 or 238 in the lambda expression, because in a bigger JSON these top-level keys will vary. I want to iterate over all available keys.

print(inputDF.rdd.map(lambda x: x['432']).first())
print(inputDF.rdd.map(lambda x: x['238']).first())

I also tried registering a temp table named "test", but it failed with an error saying that element._id doesn't exist.

inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")

Any help will be highly appreciated. I am using Spark 2.4.

what is `peopleDF`? Could you show the output of `peopleDF.show()`? — mck, Apr 08 '21 at 12:40
That's the input df. Renamed it. Also, the output of .show() is:

+--------------------+--------------------+
|                 238|                 432|
+--------------------+--------------------+
|                null|[[, 7DG6, 432, 100]]|
|[[, 7SS8, 432, 100]]|                null|
+--------------------+--------------------+

— sparkingmyself, Apr 09 '21 at 05:49

score 1 · Answer 1 · answered Apr 08 '21 at 12:57

Without using pyspark features, you can do it like this:

import json

data = json.loads(json_str)  # or however you're getting the data

columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers

for item in data:
    for obj in list(item.values())[0]:  # each dict has a single key, so take its only value (a list of objects)
        print(','.join(str(obj[col]) for col in columns))

Output:

atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100

Or, if each list always holds exactly one object:

for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in list
    print(','.join(str(obj[col]) for col in columns))

FYI, you can store those rows in a variable or write them out to a CSV file instead of (or in addition to) printing them.

And if you're just looking to dump that to csv, see this answer.
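As a rough illustration of the "write it out to csv" option, here is a minimal sketch that uses the standard-library csv module and never names the top-level keys. The names `json_str` (filled with the sample from the question), `iter_objects`, and `csv_text` are only illustrative and not part of the original answer. One difference from the print-based version: `csv.DictWriter` writes `None` as an empty field rather than the string "None".

```python
import csv
import io
import json

# Sample input copied from the question; in practice this could come from a file.
json_str = """
[
  {"432": [{"atttr1": null, "atttr2": "7DG6", "id": 432, "score": 100}]},
  {"238": [{"atttr1": null, "atttr2": "7SS8", "id": 432, "score": 100}]}
]
"""

columns = ["atttr1", "atttr2", "id", "score"]


def iter_objects(data):
    """Yield every inner object, whatever the top-level keys are called."""
    for item in data:                   # each item is a dict such as {"432": [...]}
        for objects in item.values():   # no need to know the key name
            for obj in objects:         # also works if a list holds several objects
                yield obj


data = json.loads(json_str)

# Write into an in-memory buffer; pass an open file object instead to write to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns,
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
for obj in iter_objects(data):
    writer.writerow(obj)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to a file instead, something like `open("out.csv", "w", newline="")` can replace the `io.StringIO()` buffer (the file name is just an example).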
