
I have a DataFrame and I need to aggregate it. I do it this way:

df
  .groupBy($"key1", $"key2")
  .agg(
    sum($"total_in").as("total_in"),
    sum($"val1_in").as("val1_in"),
    sum($"val2_in").as("val2_in"),
    sum($"val3_in").as("val3_in"),
    sum($"val4_in").as("val4_in"),
    sum($"total_out").as("total_out"),
    sum($"val1_out").as("val1_out"),
    sum($"val2_out").as("val2_out"),
    sum($"val3_out").as("val3_out"),
    sum($"val4_out").as("val4_out"))

It doesn't look all that bad, but if the number of operations in agg grows, or a case when or some other logic has to be added to every sum, this code gets really cumbersome.

I know it can be done, maybe by passing the columns to a function and iterating over them. Something like

.agg(
  map(x => func())
)

but this is where I'm stuck. How do I pass only the columns I need? Is it possible to alias them in the same place?

gjin

2 Answers


Use vararg syntax:

In Python:

from pyspark.sql import functions as F

cols = ['total_in', 'val1_in', 'val2_in', ...]
df.groupBy('key1', 'key2').agg(*(F.sum(c).alias(c) for c in cols))

In Scala:

import org.apache.spark.sql.functions.{col, sum}

val cols = Seq("total_in", "val1_in", "val2_in", ...)
df.groupBy($"key1", $"key2")
  .agg(
    sum(col(cols.head)).as(cols.head),
    cols.tail.map(x => sum(col(x)).as(x)): _*
  )
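
If a case when has to be folded into every sum, as the question anticipates, the same map can produce the conditional expressions. A minimal sketch, assuming a hypothetical boolean column is_valid that marks the rows to include:

import org.apache.spark.sql.functions.{col, lit, sum, when}

// is_valid is an assumed flag column; replace it with whatever condition you need
val cols = Seq("total_in", "val1_in", "val2_in")
val conditionalSums = cols.map(c => sum(when(col("is_valid"), col(c)).otherwise(lit(0))).as(c))

df.groupBy($"key1", $"key2")
  .agg(conditionalSums.head, conditionalSums.tail: _*)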
Psidom

I would do it like this:

import org.apache.spark.sql.functions.{col, sum}

// Build a sum(...).as(...) expression for every column that is not part of the grouping key
val aggExpressions =
  df.columns
    .filterNot(Seq("key1", "key2").contains(_))
    .map(s => sum(col(s)).as(s))

df.groupBy($"key1", $"key2")
  .agg(aggExpressions.head, aggExpressions.tail: _*)
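
In both answers the first expression is passed on its own because agg is declared as agg(expr: Column, exprs: Column*), so at least one aggregate expression must be supplied before the rest can be expanded with : _*.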
Raphael Roth