
I have a Spark DataFrame of around 60M rows. I want to create a single-row DataFrame that contains the max of each individual column.

I tried the following options, but each has its own disadvantages:

  1. df.select(col_list).describe().filter("summary = 'max'").show()

    -- This query doesn't return the string columns, so the original dimension of the DataFrame is reduced.

  2. df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show()

    -- This query works, but it is unwieldy when I have around 700 columns.

Can someone suggest a better syntax?

Mohamed Thasin ah
Teresa
  • Refer to this: https://stackoverflow.com/questions/33224740/best-way-to-get-the-max-value-in-a-spark-dataframe-column – bumblebee Feb 20 '19 at 05:22
  • How would you aggregate string columns? What’s your logic for that? What’s the max of string columns? Clarify. – cph_sto Feb 20 '19 at 06:00
  • Use selectExpr and a map on df.columns – sramalingam24 Feb 20 '19 at 06:19 (see the sketch after this comment thread)
  • @cph_sto my idea is to return non-null values for the entire dataframe. For the numeric columns, max() is pretty straightforward; for the string columns there is no particular logic, it can return any non-null value. – Teresa Feb 20 '19 at 07:09
  • @sramalingam24 can you kindly share the exact syntax? I wanted to try selectExpr, but couldn't figure out the exact syntax. – Teresa Feb 20 '19 at 07:11
  • Returning any `non-NULL` value for Strings? So, returning the first value for instance should be ok too, right? – cph_sto Feb 20 '19 at 07:18
  • @cph_sto yes, returning the first value for string should be good too! – Teresa Feb 20 '19 at 07:46
  • @Teresa If my answer was helpful, don't forget to accept it - click on the check mark (✓) beside the answer to toggle it from greyed out to filled in. Thanks. – cph_sto Feb 20 '19 at 10:46
  • @cph_sto your answer was super helpful! Thanks a ton ! :) – Teresa Feb 20 '19 at 13:36
  • All the best ... – cph_sto Feb 20 '19 at 13:37
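
For reference, the selectExpr approach mentioned by sramalingam24 was never spelled out in the thread; a minimal sketch of what it might look like is below. This is an assumption, not code from the comments, and note that Spark's max() compares string columns lexicographically rather than returning the first non-null value, so its result for string columns differs slightly from what the OP ultimately asked for.

# Sketch of the selectExpr idea from the comments (assumed syntax, not from the thread)
# Spark SQL's max() also accepts string columns, comparing them lexicographically
exprs = ["max({0}) as {0}".format(c) for c in df.columns]
df.selectExpr(*exprs).show()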

1 Answer


The code below works irrespective of how many columns there are or what mix of datatypes they contain.

Note: The OP clarified in the comments that, for string columns, taking the first non-null value is acceptable.

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  10|   5| null|  50|
|  Bob|  15|  15|Simon|  10|
| Jack|   5|   1| Timo|   3|
+-----+----+----+-----+----+

# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
    ['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
    ['col1', 'col4']

# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
print(non_string_columns)
    ['col2', 'col3', 'col5']

Read about first() and its ignorenulls argument in the PySpark functions documentation.

# Aggregate: max() for non-string columns, first non-null value for string columns
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],
               *[first(col(c), ignorenulls=True).alias(c) for c in string_columns])

# Restore the original column order
df = df.select(seq_of_columns)
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  15|  15|Simon|  50|
+-----+----+----+-----+----+
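
The same whole-DataFrame aggregation can also be written with agg(), which is shorthand for groupBy().agg(). A minimal equivalent sketch, assuming df still refers to the original (pre-aggregation) DataFrame:

# Equivalent sketch using agg() on the whole DataFrame (df = original DataFrame here)
aggs = [max(col(c)).alias(c) for c in non_string_columns] + \
       [first(col(c), ignorenulls=True).alias(c) for c in string_columns]
df.agg(*aggs).select(seq_of_columns).show()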
cph_sto
  • Thanks, the list comprehension worked! This saved us from repeating this 700 times: df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show() – Teresa Feb 20 '19 at 13:38