
My question is quite simple, but I can't seem to find a proper solution. I can hack it together with ugly code, but I'd like to find something elegant.

Here is my line of code:

    val summedDF = dataFrame.groupBy(colsNamesGroupBy.head, colsNamesGroupBy.tail : _*).sum(colsNamesSum:_*)

It does a groupBy on an array of column names, and then sums a few columns.

Everything works fine, but the resulting columns are named sum(xxxx). I would like to rename them on the go, maybe with a map operation, so that I only keep the "xxxx" name.

Does anyone have an idea?

EDIT :

I'm trying something like this, but I get "cannot resolve symbol agg with this signature":

    val summedDF = dataFrame.groupBy(colsNamesGroupBy.head, colsNamesGroupBy.tail : _*).agg(colsNamesSum.map(c => sum(c).as(c)))

2 Answers


I would try something like this:

    import org.apache.spark.sql.functions.{sum, col}

    val aggregateExpr = colsNamesSum.map(c => sum(col(c)).as(c))

    val summedDF = dataFrame.groupBy(colsNamesGroupBy.head, colsNamesGroupBy.tail: _*).agg(aggregateExpr.head, aggregateExpr.tail: _*)
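For reference, here is a minimal self-contained sketch of how this plays out end to end. The sample data and the contents of `colsNamesGroupBy`/`colsNamesSum` are assumptions made up for illustration; the aliasing trick itself (`sum(col(c)).as(c)`) is what keeps the original column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, col}

object SumRenameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sum-rename").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data
    val dataFrame = Seq(
      ("a", 1, 10),
      ("a", 2, 20),
      ("b", 3, 30)
    ).toDF("key", "x", "y")

    // Hypothetical column-name lists, as in the question
    val colsNamesGroupBy = Seq("key")
    val colsNamesSum     = Seq("x", "y")

    // Alias each aggregate back to its source column name
    val aggregateExpr = colsNamesSum.map(c => sum(col(c)).as(c))

    val summedDF = dataFrame
      .groupBy(colsNamesGroupBy.head, colsNamesGroupBy.tail: _*)
      .agg(aggregateExpr.head, aggregateExpr.tail: _*)

    // Columns come out as "key", "x", "y" rather than "sum(x)", "sum(y)"
    summedDF.show()

    spark.stop()
  }
}
```

The `head`/`tail` splitting is needed because `agg` takes a `(Column, Column*)` varargs signature, not a `Seq[Column]` directly, which is also why the attempt in the question's EDIT fails to compile.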

You need to add

    import org.apache.spark.sql.functions._

so that you can use .agg.
