
I have a Spark DataFrame of around 60M rows. I want to create a single-row DataFrame that contains the max of each individual column.

I tried the following options, but each has its own disadvantages:

  1. df.select(col_list).describe().filter("summary = 'max'").show()

    -- This query doesn't return the string columns, so the original dimension of the DataFrame is reduced.

  2. df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show()

    -- This query works, but it is unwieldy when I have around 700 columns.

Can someone suggest a better syntax?

Mohamed Thasin ah
Teresa
  • Refer to this: https://stackoverflow.com/questions/33224740/best-way-to-get-the-max-value-in-a-spark-dataframe-column – bumblebee Feb 20 '19 at 05:22
  • How would you aggregate string columns? What’s your logic for that? What’s the max of string columns? Clarify. – cph_sto Feb 20 '19 at 06:00
  • Use selectExpr and a map on df.columns – sramalingam24 Feb 20 '19 at 06:19 (see the sketch after this comment thread)
  • @cph_sto my idea is to return non-null values for the entire dataframe. For the numeric columns, max() is pretty straightforward; for the string columns there is no particular logic, it can return any non-null value. – Teresa Feb 20 '19 at 07:09
  • @sramalingam24 can you kindly share the exact syntax? I wanted to try selectExpr, but couldn't figure out the exact syntax. – Teresa Feb 20 '19 at 07:11
  • Returning any `non-NULL` value for Strings? So, returning the first value for instance should be ok too, right? – cph_sto Feb 20 '19 at 07:18
  • @cph_sto yes, returning the first value for string should be good too! – Teresa Feb 20 '19 at 07:46
  • @Teresa If my answer was helpful, don't forget to accept it - click on the check mark (✓) beside the answer to toggle it from greyed out to filled in. Thanks. – cph_sto Feb 20 '19 at 10:46
  • @cph_sto your answer was super helpful! Thanks a ton ! :) – Teresa Feb 20 '19 at 13:36
  • All the best ... – cph_sto Feb 20 '19 at 13:37
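
For reference, the selectExpr approach mentioned by sramalingam24 was never spelled out in the thread; a minimal sketch of what it might look like is below. This is an assumption, not code from the comments, and note that Spark's max() compares string columns lexicographically rather than returning the first non-null value, so its result for string columns differs slightly from what the OP ultimately asked for.

# Sketch of the selectExpr idea from the comments (assumed syntax, not from the thread)
# Spark SQL's max() also accepts string columns, comparing them lexicographically
exprs = ["max({0}) as {0}".format(c) for c in df.columns]
df.selectExpr(*exprs).show()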

1 Answer


The code below works irrespective of how many columns there are or what mix of datatypes they contain.

Note: The OP clarified in the comments that, for string columns, taking the first non-null value is acceptable.

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  10|   5| null|  50|
|  Bob|  15|  15|Simon|  10|
| Jack|   5|   1| Timo|   3|
+-----+----+----+-----+----+

# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
    ['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
    ['col1', 'col4']

# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
print(non_string_columns)
    ['col2', 'col3', 'col5']

Read about first() and its ignorenulls argument in the PySpark functions documentation.

# Aggregate: max() for non-string columns, first non-null value for string columns
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],
               *[first(col(c), ignorenulls=True).alias(c) for c in string_columns])

# Restore the original column order
df = df.select(seq_of_columns)
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  15|  15|Simon|  50|
+-----+----+----+-----+----+
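
The same whole-DataFrame aggregation can also be written with agg(), which is shorthand for groupBy().agg(). A minimal equivalent sketch, assuming df still refers to the original (pre-aggregation) DataFrame:

# Equivalent sketch using agg() on the whole DataFrame (df = original DataFrame here)
aggs = [max(col(c)).alias(c) for c in non_string_columns] + \
       [first(col(c), ignorenulls=True).alias(c) for c in string_columns]
df.agg(*aggs).select(seq_of_columns).show()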
cph_sto
  • Thanks, the list comprehension worked! This saved us from repeating this 700 times: df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show() – Teresa Feb 20 '19 at 13:38