I am running PySpark on Spark 2.0 to aggregate data. Below is the raw DataFrame (`df`) as received in Spark:
DeviceID  TimeStamp         IL1  IL2  IL3   VL1  VL2  VL3
1001      2019-07-14 00:45  2.1  3.1  2.25  235  258  122
1002      2019-07-14 01:15  3.2  2.4  4.25  240  250  192
1003      2019-07-14 01:30  3.2  2.0  3.85  245  215  192
1003      2019-07-14 01:30  3.9  2.8  4.25  240  250  192
Now I want to apply a groupby by DeviceID and aggregate the other columns. There are several posts about this on StackOverflow; in particular, this and this link are of interest. With the help of those posts I created the following script:
from pyspark.sql import functions as F
groupby = ["DeviceID"]
agg_cv = ["IL1","IL2","IL3","VL1","VL2","VL3"]
func = [min, max]
expr_cv = [f(F.col(c)) for f in func for c in agg_cv]
df_final = df.groupby(*groupby).agg(*expr_cv)
The above code fails with the error
Column is not iterable
I am not able to understand why this error occurs. However, when I use the following code instead:
from pyspark.sql.functions import min, max, col
expr_cv = [f(col(c)) for f in func for c in agg_cv]
then the code runs fine.
My question is: how can I fix the above-mentioned error?
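For intuition on where the message comes from, here is a minimal sketch that needs no Spark at all. It uses a hypothetical `FakeColumn` stand-in for `pyspark.sql.Column`: the real class raises `TypeError("Column is not iterable")` when you try to iterate it, and the builtin `min()`/`max()` called with a single argument do exactly that, whereas `pyspark.sql.functions.min`/`max` accept a `Column` directly.

```python
# Minimal sketch, no Spark required: a hypothetical stand-in for
# pyspark.sql.Column, which (like the real class) refuses iteration.
class FakeColumn:
    def __iter__(self):
        # The real pyspark Column raises this same TypeError.
        raise TypeError("Column is not iterable")

# The builtin min() called with a single argument tries to iterate it,
# which is what triggers the error in the original script, where
# func = [min, max] picked up the Python builtins.
try:
    min(FakeColumn())
except TypeError as err:
    print(err)  # Column is not iterable
```

This is why importing `min` and `max` from `pyspark.sql.functions` makes the list comprehension work: those names then refer to Spark aggregate functions rather than the Python builtins.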