
I tried to read a CSV file that contains a nested column.

Example:

name,age,addresses_values
person_1,30,["France","street name",75000]

When reading, I tried to assign a schema as follows:

from pyspark.sql.types import StructType, StructField, StringType, LongType

csv_schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', LongType(), True),
    StructField('addresses_values', StructType([
        StructField('country', StringType(), True),
        StructField('street', StringType(), True),
        StructField('ZipCode', StringType(), True),
    ]), True),
])

path = "file:///path_to_my_file"

dataset_df = spark.read.csv(path=path, header=True, schema=csv_schema)

This exception is raised:

pyspark.sql.utils.AnalysisException: CSV data source does not support struct<country:string,street:string,ZipCode:string> data type.;

Majdi

2 Answers


Here's a nasty way of parsing it. Inspired by this and this.

import pyspark.sql.functions as F

# Read each row as plain text, drop the header line, split on commas that sit
# outside square brackets, then split the bracketed part on commas that sit
# outside double quotes, and finally assemble the struct.
df = spark.read.text('arr.csv') \
    .filter("value != 'name,age,addresses_values'") \
    .select(F.split('value', r',(?=(?:[^\[\]]*\[[^\[\]]*\])*[^\[\]]*$)').alias('value')) \
    .selectExpr('value[0] name', 'value[1] age', "split(trim('[]' from value[2]), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)') addresses_values") \
    .selectExpr('name', 'age',
        """struct(trim('"' from addresses_values[0]) as country,
                  trim('"' from addresses_values[1]) as street,
                  addresses_values[2] as zipcode)
           as addresses_values
        """)

df.show(truncate=False)
+--------+---+----------------------------+
|name    |age|addresses_values            |
+--------+---+----------------------------+
|person_1|30 |[France, street name, 75000]|
+--------+---+----------------------------+

df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- addresses_values: struct (nullable = false)
 |    |-- country: string (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- zipcode: string (nullable = true)
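Note that age stays a string with this approach. If you want the LongType from your original schema, a cast afterwards works (a minimal follow-up sketch, not part of the parsing itself):

df = df.withColumn('age', F.col('age').cast('long'))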
mck

A bit tricky, but here's a way to parse those CSV values. You need to specify quote="[" to be able to read the column addresses_values, as it contains commas; since [ then serves as the quote character, the reader also strips it from the front of the value, which is why it gets re-added with concat below. Then use from_json to parse the value as an array of strings, and finally create the desired struct from the array elements:

from pyspark.sql import functions as F

# With "[" as the quote character, the commas inside the bracketed list no
# longer split the field; the opening bracket itself is consumed on read.
df = spark.read.csv(input_path, header=True, quote="[")

df1 = df.withColumn(
    "addresses_values",
    # Put the leading "[" back and parse the value as a JSON array of strings.
    F.from_json(
        F.concat(F.lit("["), F.col("addresses_values")),
        "array<string>"
    )
).withColumn(
    "addresses_values",
    # Turn the 3-element array into the desired struct.
    F.struct(
        F.col("addresses_values")[0].alias("country"),
        F.col("addresses_values")[1].alias("street"),
        F.col("addresses_values")[2].alias("ZipCode"),
    )
)

df1.show(truncate=False)

#+--------+---+----------------------------+
#|name    |age|addresses_values            |
#+--------+---+----------------------------+
#|person_1|30 |[France, street name, 75000]|
#+--------+---+----------------------------+

df1.printSchema()

#root
# |-- name: string (nullable = true)
# |-- age: string (nullable = true)
# |-- addresses_values: struct (nullable = false)
# |    |-- country: string (nullable = true)
# |    |-- street: string (nullable = true)
# |    |-- ZipCode: string (nullable = true)
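If you later want the struct members back as top-level columns, selecting the nested fields flattens them (a small usage sketch on top of df1; the name df2 is just for illustration):

df2 = df1.select("name", "age", "addresses_values.*")

df2.columns
# ['name', 'age', 'country', 'street', 'ZipCode']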
blackbishop