
I'm looking to do a groupBy-agg on the Spark DataFrame below and get the mean, max, and min of each of the col1, col2, and col3 columns:

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*': 'max'}) to at least get the max of every column, but I run into an error.
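As far as I can tell, the only aggregate the dict form of `agg` will accept for `'*'` is `count`, which does run:

```python
# count is the one aggregate that works with '*' in the dict form of agg()
sp.groupBy('id').agg({'*': 'count'}).show()
```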

I've also tried sp.groupBy('id').agg({'col1': ['max', 'min', 'mean']}), which is more the traditional Pandas way of doing it, but it doesn't work in Spark.
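If I understand the dict form of `agg` correctly, it maps each column to a single function name (the dict keys are unique), so the closest it gets me is one aggregate per column:

```python
# Dict-style agg allows exactly one aggregate function per column,
# so max, min, and mean of the same column can't all be requested here.
sp.groupBy('id').agg({'col1': 'max', 'col2': 'max', 'col3': 'max'}).show()
```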

The output I'm after looks like this:

id  max(col1)  max(col2)  max(col3)  min(col1)  min(col2)  min(col3)  mean(col1) ..
a   4          7          7          2          4          5          3
b   6          4          9          2          0          4          4
c   4          4          9          4          4          9          4
Matt W.

Comments:

• Possible duplicate of [Spark SQL: apply aggregate functions to a list of columns](https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns) and [Multiple Aggregate operations on the same column of a spark dataframe](https://stackoverflow.com/questions/34954771/multiple-aggregate-operations-on-the-same-column-of-a-spark-dataframe) – pault Jun 20 '19 at 19:13
• Full details in the duplicates, but you want to do: `from pyspark.sql.functions import max as max_` and then `sp.groupBy('id').agg(*[max_(c) for c in sp.columns[1:]])` - you can expand this to also include the `mean` and `min`. – pault Jun 20 '19 at 19:15
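A runnable sketch of pault's suggestion, expanded to cover min and mean as the comment describes (max and min are imported under aliases so they don't shadow the Python built-ins; the `exprs` name is just illustrative):

```python
# Build one aggregate expression per (function, column) pair
# and unpack the list into agg().
from pyspark.sql.functions import max as max_, min as min_, mean

exprs = [f(c) for f in (max_, min_, mean) for c in sp.columns[1:]]
sp.groupBy('id').agg(*exprs).show()
```

With the loops in this order, the result columns come out grouped function-first (all the maxes, then mins, then means), matching the desired output above; swapping the two loops in the comprehension would group them column-first instead.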

0 Answers