You can use pyspark.sql.DataFrame.describe()
to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
[(1, "a"),(2, "b"), (3, "a"), (4, None), (None, "c")],["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| mean| 2.5|null|
#| stddev|1.2909944487358056|null|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
As you can see, these statistics ignore any null
values.
If you're using spark version 2.3, there is also pyspark.sql.DataFrame.summary()
which supports the following aggregates:
count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (eg, 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg()
. For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), c)).alias("nulls_"+c) for c in df.columns]
df.agg(
*(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#| 4| 3| 4| c| 1| a| 1| 1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Though this method will return one row, rather than one row per statistic as describe()
and summary()
do.