
So I have a dataframe of values that need to be summed together and then collected into a Map[String,Long] to save into Cassandra.

The code below works, but I was wondering if the map could be created from an arbitrary list of columns. (Looking at the source code for Spark's functions only confuses me further.)

// in spark-shell (in an application, also import spark.implicits._)
import org.apache.spark.sql.functions._

val cols = Array("key", "v1", "v2")
val df = Seq(("a",1,0),("b",1,0),("a",1,1),("b",0,0)).toDF(cols: _*)
val df1 = df.groupBy(col(cols(0))).
  agg(map(lit(cols(1)), sum(col(cols(1))), lit(cols(2)), sum(col(cols(2)))) as "map")

This is both my desired format for the dataframe and the current output of the code above:

scala> df1.show(false)
+---+---------------------+
|key|map                  |
+---+---------------------+
|b  |Map(v1 -> 1, v2 -> 0)|
|a  |Map(v1 -> 2, v2 -> 1)|
+---+---------------------+

I would like a function that returns the same result as above, but lets me pass in the columns programmatically by name. E.g.:

val columnNames = Array("v1", "v2")
df.groupBy(col(cols(0))).agg(create_sum_map(columnNames) as "map")

Is this even remotely possible in Spark?


2 Answers


There is no need to use a slow UDF; you can achieve this with pure built-in Spark functions and varargs, see e.g. Spark SQL: apply aggregate functions to a list of columns. That solution builds a list of columns for the aggregation to be applied on. Here it is a bit more involved, since you want a map in the final output, which requires an extra step.

First, create the expressions (columns) to be used in the aggregation:

val exprs = cols.tail.flatMap(c => Seq(lit(c), sum(col(c))))
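
For the example data, exprs is just the key/aggregate pairs written out programmatically; it is equivalent to the following hand-written list (shown only for illustration):

// what exprs expands to for cols = Array("key", "v1", "v2")
Seq(lit("v1"), sum(col("v1")), lit("v2"), sum(col("v2")))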

Apply the group by and use the created exprs:

val df2 = df.groupBy(col(cols.head)).agg(exprs.head, exprs.tail:_*)
  .select(col(cols.head), map(cols.tail.flatMap(c => Seq(col(c), col(s"sum($c)"))):_*).as("map"))

An extra select is required after the aggregation to create the map; cols.tail.flatMap(c => Seq(col(c), col(s"sum($c)"))) simply lists the key/value column pairs that should go into the map.
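
Note that the exprs.head, exprs.tail:_* split in the agg call is necessary because agg requires at least one positional Column argument before the varargs; its signature on a grouped dataset is roughly:

def agg(expr: Column, exprs: Column*): DataFrame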

Resulting output is identical to before:

+---+---------------------+
|key|map                  |
+---+---------------------+
|b  |Map(v1 -> 1, v2 -> 0)|
|a  |Map(v1 -> 2, v2 -> 1)|
+---+---------------------+
    Aha! So `Array:_*` turns the array into a set of parameters. I wonder why Spark disallows `.agg(exprs:_*)` though. This helps greatly - thanks. – ChiMo Jul 01 '19 at 01:55

So I figured out how to produce my desired result, based on @Shaido's answer.

def create_sum_map(cols: Array[String]): Column = 
  map(cols.flatMap(c => Seq(lit(c), sum(col(c)))):_*)

df.groupBy(col(cols.head)).agg(create_sum_map(columnNames) as "map")
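
For reference, a minimal self-contained sketch of the whole solution (assuming a spark-shell session, where the spark SparkSession and its implicits are already in scope):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, map, sum}

// builds a single map(k1, sum(k1), k2, sum(k2), ...) aggregate expression
def create_sum_map(cols: Array[String]): Column =
  map(cols.flatMap(c => Seq(lit(c), sum(col(c)))): _*)

val cols = Array("key", "v1", "v2")
val columnNames = Array("v1", "v2")
val df = Seq(("a",1,0),("b",1,0),("a",1,1),("b",0,0)).toDF(cols: _*)

df.groupBy(col(cols.head))
  .agg(create_sum_map(columnNames) as "map")
  .show(false)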

I assume this works because create_sum_map() returns a single aggregate Column expression (built from sum over the affected columns), which is exactly what .agg() expects.
