I have edited this question to provide an example.
I have a list of column names:
colnames = ['col1', 'col2', 'col3']
I need to pass these to a DataFrame function one after another and get a value back for each. I do not want to use the groupBy function, so this is not a duplicate of the other question. I just need the max, min, and sum of all values in each column of my DataFrame.
Code example -
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import max as sparkMax

sc = SparkContext("local[2]", "Count App")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(1, 100, 200), (100, 200, 100), (100, 200, 100), (-100, 50, 200)],
    ("col1", "col2", "col3"))
df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 100| 200|
| 100| 200| 100|
| 100| 200| 100|
|-100| 50| 200|
+----+----+----+
colnames = ['col1', 'col2', 'col3']

maxval = map(lambda x: df.agg(sparkMax(df[x]).alias('max_of_{}'.format(x))), colnames)
## This gives me a list of DataFrames, NOT a single DataFrame as required

for x in maxval:
    x.show()
+-----------+
|max_of_col1|
+-----------+
| 100|
+-----------+
+-----------+
|max_of_col2|
+-----------+
| 200|
+-----------+
+-----------+
|max_of_col3|
+-----------+
| 200|
+-----------+
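For what it's worth, I can pull each aggregate back to the driver as a plain Python value instead (a minimal sketch; first() returns the single Row of the aggregate and [0] picks out its only field):

maxvals = [df.agg(sparkMax(df[c])).first()[0] for c in colnames]
print(maxvals)   # [100, 200, 200]

But that gives me a Python list, not a DataFrame, and still runs one job per column.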
How do I get a single DataFrame back from my lambda function, instead of a list of DataFrames? Something like this -
+-----------+----+
|Column_name| Max|
+-----------+----+
|max_of_col1| 100|
|max_of_col2| 200|
|max_of_col3| 200|
+-----------+----+
I'm guessing I need something like a flatMap? Any help appreciated.
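For reference, the closest workaround I can come up with is to tag each single-row aggregate with its column name and union them together (a sketch, assuming Spark 2.x where DataFrame.union exists; older versions call it unionAll):

from functools import reduce
from pyspark.sql.functions import lit

# Build one single-row DataFrame per column, with columns (Max, Column_name)
per_col = [
    df.agg(sparkMax(df[c]).alias('Max'))
      .withColumn('Column_name', lit('max_of_{}'.format(c)))
    for c in colnames
]

# Union the single-row DataFrames into one, then reorder the columns
result = reduce(lambda a, b: a.union(b), per_col).select('Column_name', 'Max')
result.show()

This still launches one Spark job per column, though, so I'm hoping there is a cleaner single-pass way.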