
I'm looking to do a groupBy-agg on the Spark DataFrame below and get the mean, max, and min of each of the col1, col2, and col3 columns:

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*': 'max'}) to at least get the max of every column, but I run into an error.
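As far as I can tell, the only aggregate the dict form of `agg` will accept for `'*'` is `count`, which does run:

```python
# count is the one aggregate that works with '*' in the dict form of agg()
sp.groupBy('id').agg({'*': 'count'}).show()
```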

I've also tried sp.groupBy('id').agg({'col1': ['max', 'min', 'mean']}), which is more the traditional Pandas way of doing it, but it doesn't work in Spark.
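If I understand the dict form of `agg` correctly, it maps each column to a single function name (the dict keys are unique), so the closest it gets me is one aggregate per column:

```python
# Dict-style agg allows exactly one aggregate function per column,
# so max, min, and mean of the same column can't all be requested here.
sp.groupBy('id').agg({'col1': 'max', 'col2': 'max', 'col3': 'max'}).show()
```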

The output I'm after looks like this:

id  max(col1)  max(col2)  max(col3)  min(col1)  min(col2)  min(col3)  mean(col1) ..
a   4          7          7          2          4          5          3
b   6          4          9          2          0          4          4
c   4          4          9          4          4          9          4
Matt W.

Comments:

• Possible duplicate of [Spark SQL: apply aggregate functions to a list of columns](https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns) and [Multiple Aggregate operations on the same column of a spark dataframe](https://stackoverflow.com/questions/34954771/multiple-aggregate-operations-on-the-same-column-of-a-spark-dataframe) – pault Jun 20 '19 at 19:13
• Full details in the duplicates, but you want to do: `from pyspark.sql.functions import max as max_` and then `sp.groupBy('id').agg(*[max_(c) for c in sp.columns[1:]])` - you can expand this to also include the `mean` and `min`. – pault Jun 20 '19 at 19:15
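A runnable sketch of pault's suggestion, expanded to cover min and mean as the comment describes (max and min are imported under aliases so they don't shadow the Python built-ins; the `exprs` name is just illustrative):

```python
# Build one aggregate expression per (function, column) pair
# and unpack the list into agg().
from pyspark.sql.functions import max as max_, min as min_, mean

exprs = [f(c) for f in (max_, min_, mean) for c in sp.columns[1:]]
sp.groupBy('id').agg(*exprs).show()
```

With the loops in this order, the result columns come out grouped function-first (all the maxes, then mins, then means), matching the desired output above; swapping the two loops in the comprehension would group them column-first instead.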

0 Answers