
I am trying to rename a nested field within a Dataset of case classes using Spark 2.0. An example is as follows, where I am trying to rename "element" to "address" (maintaining its position within the data structure):

df.printSchema
//Current Output:
root
 |-- companyAddresses: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- addressLine: string (nullable = true)
 |    |    |-- addressCity: string (nullable = true)
 |    |    |-- addressCountry: string (nullable = true)
 |    |    |-- url: string (nullable = true)

//Desired Output:
root
 |-- companyAddresses: array (nullable = true)
 |    |-- address: struct (containsNull = true)
 |    |    |-- addressLine: string (nullable = true)
 |    |    |-- addressCity: string (nullable = true)
 |    |    |-- addressCountry: string (nullable = true)
 |    |    |-- url: string (nullable = true)

For reference, the following do not work:

df.withColumnRenamed("companyAddresses.element","companyAddresses.address") 
df.withColumnRenamed("companyAddresses.element","address") 
Nick
  • http://stackoverflow.com/questions/35592917/renaming-column-names-of-a-data-frame-in-spark-scala – banjara Aug 22 '16 at 16:17
  • @shekhar That link only contains solutions for flat Datasets, the issue I'm having is where the field is nested. – Nick Aug 22 '16 at 16:21

2 Answers


What you're asking for here is not possible. companyAddresses is an array, and element is simply not a column. It is just an indicator of the schema of the array members. It cannot be selected, and it cannot be renamed.

You can only rename the parent container:

df.withColumnRenamed("companyAddresses", "foo")

or the names of the individual fields by modifying the schema. In simple cases it is also possible to use struct and select:

df.select(struct($"foo".as("bar"), $"bar".as("foo")))

but obviously this is not applicable here.
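For illustration only (this is not from the original answer): a minimal sketch of what renaming the individual fields by modifying the schema can look like, here by casting the array column to a new element type. The new field names below are assumptions, and the cast matches struct fields by position, so the "element" label itself still prints as "element":

import org.apache.spark.sql.types._

// Illustrative target element type: same field order as before, new names.
val newElement = StructType(Seq(
  StructField("line", StringType),
  StructField("city", StringType),
  StructField("country", StringType),
  StructField("url", StringType)))

// Struct casts are resolved positionally, so this renames the inner fields
// while leaving the array/element structure untouched.
val renamed = df.withColumn("companyAddresses",
  df("companyAddresses").cast(ArrayType(newElement)))

renamed.printSchema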

zero323

You could write a small recursive function for that, and use a map:

final JavaRDD<Row> rdd = df.toJavaRDD().map(row -> ....); // build a Row with the renamed layout here


// Recursively copies every leaf value of a (possibly nested) Row into
// outValues, keyed by its fully qualified name ("parent_child_...").
private static void flatDocument(Row input, Map<String, Object> outValues, String fqn)
{
    final StructType schema = input.schema();

    for (StructField field : schema.fields())
    {
        final String fieldName = field.name();

        // Fully qualified key for this field.
        String key = fqn == null ? fieldName : fqn + "_" + fieldName;

        Object buffer = input.getAs(fieldName);

        if (field.dataType() instanceof StructType)
        {
            // Nested struct: recurse (null structs are simply skipped).
            if (buffer != null) {
                flatDocument((Row) buffer, outValues, key);
            }
        }
        else
        {
            // Leaf (or non-struct type such as an array): copy as-is.
            outValues.put(key, buffer);
        }
    }
}

But you then need a schema to transform it back into a Dataset :/
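For completeness, a minimal, hedged sketch of that last step, assuming the nested data is a plain struct (so flatDocument recurses into it), an illustrative two-field target schema, and an existing SparkSession named spark (all assumptions, not part of this answer):

// Requires imports from org.apache.spark.sql (Dataset, Row, RowFactory),
// org.apache.spark.sql.types (DataTypes, StructType),
// org.apache.spark.api.java.JavaRDD and java.util (LinkedHashMap, Map).

// Illustrative schema: it must list exactly the keys produced by flatDocument,
// in the same order as the values passed to RowFactory.create below.
StructType newSchema = new StructType()
        .add("company_addressLine", DataTypes.StringType, true)
        .add("company_addressCity", DataTypes.StringType, true);

JavaRDD<Row> rows = df.toJavaRDD().map(row -> {
    Map<String, Object> values = new LinkedHashMap<>();
    flatDocument(row, values, null);
    return RowFactory.create(
            values.get("company_addressLine"),
            values.get("company_addressCity"));
});

// spark is an existing SparkSession (assumed).
Dataset<Row> rebuilt = spark.createDataFrame(rows, newSchema);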

Thomas Decaux
  • Thomas, the original answer was correct, in that element isn't a column and is therefore not renameable. – Nick Apr 21 '17 at 14:10
  • Nope, you said this is not possible, whereas it is! I achieved it by using an RDD (breaking up the DF, working on the RDD, then rebuilding the DF). – Thomas Decaux Apr 22 '17 at 07:57