
I am using spark-sql 2.4.1 with Java 8.

I have the following scenario:

import org.apache.spark.sql.types.DoubleType
import spark.implicits._   // needed for $ and toDF outside spark-shell

val df = Seq(
  ("0.9192019",  "0.1992019",  "0.9955999"),
  ("0.9292018",  "0.2992019",  "0.99662018"),
  ("0.9392017",  "0.3992019",  "0.99772000")).toDF("item1_value","item2_value","item3_value")
  .withColumn("item1_value", $"item1_value".cast(DoubleType))
  .withColumn("item2_value", $"item2_value".cast(DoubleType))
  .withColumn("item3_value", $"item3_value".cast(DoubleType))

df.show(20)

I need expected output something like this:

--------------------------------------------------------------------------------
col_name      | sum_of_column      | avg_of_column      | vari_of_column
--------------------------------------------------------------------------------
"item1_value" | sum("item1_value") | avg("item1_value") | variance("item1_value")
"item2_value" | sum("item2_value") | avg("item2_value") | variance("item2_value")
"item3_value" | sum("item3_value") | avg("item3_value") | variance("item3_value")
--------------------------------------------------------------------------------

How do I achieve this dynamically? Tomorrow I may have more columns.

BdEngineer
  • What will be the size of this dataframe (number of columns and rows)? – Nikk Jan 16 '20 at 13:40
  • What do you mean by dynamically? – thebluephantom Jan 16 '20 at 13:51
  • Does this answer your question? [Spark SQL: apply aggregate functions to a list of columns](https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns) – user10938362 Jan 16 '20 at 13:52
  • Got it. Will post tonight, as there is an answer for that. Others may be earlier. – thebluephantom Jan 16 '20 at 14:10
  • https://stackoverflow.com/questions/33882894/spark-sql-apply-aggregate-functions-to-a-list-of-columns No, he is right, I think. – thebluephantom Jan 16 '20 at 14:48
  • @BdEngineer I have replied to the issue "how to get non-aggregated columns here?". Please check https://stackoverflow.com/questions/59845157/how-to-get-non-aggregated-columns-in-spark-after-aggregation-performed – Nikk Jan 22 '20 at 07:23

1 Answer

This is sample code that can achieve this. You can make the column list dynamic (a sketch follows the output below) and add more functions if needed.

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._   // lit, sum, avg, variance

val df = Seq(
  ("0.9192019",  "0.1992019",  "0.9955999"),
  ("0.9292018",  "0.2992019",  "0.99662018"),
  ("0.9392017",  "0.3992019",  "0.99772000")).
  toDF("item1_value","item2_value","item3_value").
  withColumn("item1_value", $"item1_value".cast(DoubleType)).
  withColumn("item2_value", $"item2_value".cast(DoubleType)).
  withColumn("item3_value", $"item3_value".cast(DoubleType))

val aggregateColumns = Seq("item1_value","item2_value","item3_value")

// One single-row aggregate DataFrame per column, tagged with the column name
val aggDFs = aggregateColumns.map( c => {
    df.groupBy().agg(lit(c).as("col_name"), sum(c).as("sum_of_column"), avg(c).as("avg_of_column"), variance(c).as("var_of_column"))
})

// Stack the per-column results into a single DataFrame
val combinedDF = aggDFs.reduce(_ union _)

This returns the following output:

scala> df.show(10,false)
+-----------+-----------+-----------+
|item1_value|item2_value|item3_value|
+-----------+-----------+-----------+
|0.9192019  |0.1992019  |0.9955999  |
|0.9292018  |0.2992019  |0.99662018 |
|0.9392017  |0.3992019  |0.99772    |
+-----------+-----------+-----------+


scala> combinedDF.show(10,false)
+-----------+------------------+------------------+---------------------+
|col_name   |sum_of_column     |avg_of_column     |var_of_column        |
+-----------+------------------+------------------+---------------------+
|item1_value|2.7876054         |0.9292018         |9.999800000999957E-5 |
|item2_value|0.8976057000000001|0.2992019         |0.010000000000000002 |
|item3_value|2.9899400800000002|0.9966466933333334|1.1242332201333484E-6|
+-----------+------------------+------------------+---------------------+
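
To make the column list itself dynamic (the "tomorrow I may have more columns" case), one option is to derive it from the schema instead of hard-coding it. A minimal sketch, assuming every DoubleType column should be aggregated:

// Sketch: pick up all DoubleType columns automatically, so newly added
// numeric columns are included without code changes.
val dynamicColumns: Seq[String] =
  df.schema.fields.filter(_.dataType == DoubleType).map(_.name).toSeq

val dynamicDF = dynamicColumns.map( c => {
    df.groupBy().agg(lit(c).as("col_name"), sum(c).as("sum_of_column"), avg(c).as("avg_of_column"), variance(c).as("var_of_column"))
}).reduce(_ union _)
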
Ramdev Sharma
  • What was the dynamic all about then? – thebluephantom Jan 16 '20 at 18:37
  • 1
    @BdEngineer There is nothing specifc here for 'var'. aggDFs.reduce(_ union _) is doing union of all Data frames per column aggregation. – Ramdev Sharma Jan 16 '20 at 19:24
  • 1
    @BdEngineer I couldn't check code in java but you make simple union. Spark will optimize it while running it. Regarding each column in Seq, it is only transformation stage. When action will called then only all transformation happens, spark will run them in parallel for each column. You can validate same in Spark UI - SQL tab – Ramdev Sharma Jan 21 '20 at 18:46
  • @Ramdev Sharma can you tell me what is wrong with this broadcast variable accessing ? https://stackoverflow.com/questions/64003697/spark-broadcast-variable-map-giving-null-value – BdEngineer Sep 22 '20 at 05:50
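
Following up on the parallelism point in the comments above: the union approach launches one aggregation per column, which Spark can plan together. An alternative is to compute every aggregate in a single pass and then unpivot the one-row result with Spark SQL's stack() function. This is a sketch, not part of the original answer; the *_sum/*_avg/*_var aliases are illustrative names:

// Compute all aggregates in one agg() call (a single pass over the data)...
val aggExprs = aggregateColumns.flatMap(c =>
  Seq(sum(c).as(s"${c}_sum"), avg(c).as(s"${c}_avg"), variance(c).as(s"${c}_var")))
val oneRow = df.agg(aggExprs.head, aggExprs.tail: _*)

// ...then unpivot the single row back into one row per column name.
val stackArgs = aggregateColumns.map(c => s"'$c', ${c}_sum, ${c}_avg, ${c}_var").mkString(", ")
val singlePassDF = oneRow.selectExpr(
  s"stack(${aggregateColumns.size}, $stackArgs) as (col_name, sum_of_column, avg_of_column, var_of_column)")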