
I have edited this question to provide an example -

I have a list of column names:

colnames = ['col1','col2','col3']

I need to pass these to a DataFrame function one after another to return values for each. I do not want to use the groupBy function, so this is not a duplicate of the other question. I just need the max, min, and sum of all values of each column in my DataFrame.

Code example -

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import max as sparkMax

sc = SparkContext("local[2]", "Count App")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(1, 100, 200), (100, 200, 100), (100, 200, 100), (-100, 50, 200)],
    ("col1", "col2", "col3"))

df.show()


+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1| 100| 200|
| 100| 200| 100|
| 100| 200| 100|
|-100|  50| 200|
+----+----+----+

colnames = ['col1','col2','col3']

maxval = map(lambda x: df.agg(sparkMax(df[x]).alias('max_of_{}'.format(x))), colnames)

## This gives me a list of DataFrames, NOT a single DataFrame as required
for x in maxval:
    print(x.show())



+-----------+
|max_of_col1|
+-----------+
|        100|
+-----------+

None
+-----------+
|max_of_col2|
+-----------+
|        200|
+-----------+

None
+-----------+
|max_of_col3|
+-----------+
|        200|
+-----------+

How do I get a single DataFrame back from my lambda function, instead of a list of DataFrames? Something like this -

+-----------+---+
|Column_name|Max|
+-----------+---+
|max_of_col1|100|
|max_of_col2|200|
|max_of_col3|200|
+-----------+---+

I'm guessing something like a flatMap? Any help appreciated.

  • I'm not sure I understand your question, but instead of `df.x`, try `df[x]`. Also you're not using `map()` correctly - it should be `map(func, iterable)` - so putting it all together, perhaps you're looking for `newdf = map(lambda x: df.agg(sparkMax(length(df[x]))), colnames )` – pault May 24 '18 at 01:35
  • Yes! This helped create a new list, but I don't see the values I was expecting as integers. Rather I see a list of this - DataFrame[max(length(col1)): int] DataFrame[max(length(col2)): int] DataFrame[max(length(col3)): int] – aau22 May 24 '18 at 01:57
  • If you could [edit] your question with an [mcve] that shows a small sample DataFrame as well as your desired result, people may be able to better understand your issue and provide alternative (perhaps more elegant) solutions. – pault May 24 '18 at 02:07
  • Edited my question for clarity. Thanks! – aau22 May 24 '18 at 15:16

1 Answer


The map function in Python takes two arguments: the first is a function and the second is an iterable.

newdf = map(lambda x: len(x), colnames)  # a map object (lazy iterator) in Python 3, a list in Python 2

This might be helpful - http://book.pythontips.com/en/latest/map_filter.html
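
Applied to the question: mapping over colnames produces one single-row DataFrame per column. To collapse these into the single DataFrame asked for, one option is to give every aggregate the same schema and then fold the list together with union. This is only a sketch, assuming the df and colnames from the question (DataFrame.union is the Spark 2.x name; on 1.x use unionAll):

from functools import reduce

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit, max as sparkMax

# One single-row DataFrame per column, all sharing the
# (Column_name, Max) schema so they can be unioned
parts = map(
    lambda c: df.agg(sparkMax(df[c]).alias('Max'))
                .select(lit('max_of_{}'.format(c)).alias('Column_name'), 'Max'),
    colnames)

# union combines positionally; reduce folds the parts into one DataFrame
result = reduce(DataFrame.union, parts)
result.show()

+-----------+---+
|Column_name|Max|
+-----------+---+
|max_of_col1|100|
|max_of_col2|200|
|max_of_col3|200|
+-----------+---+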

df.x will not work: df is an object, and df.x looks up an attribute literally named x on the df object, not the column whose name is stored in the variable x.

Have a look at the following example.

obj = type("MyObj", (object,), {'name':1})
a = obj()
print a.name

The above example prints the value of the attribute name, which is 1.

However, if I try the following:

obj = type("MyObj", (object,), {'name':1})
a = obj()
var = 'name'
print a.var

this raises an AttributeError, because the object a has no attribute called var; the value stored in the variable is not looked up.
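
The same thing happens with the DataFrame in the question: when the column name is stored in a variable, use bracket indexing (or getattr) instead of dot access. A short illustration, assuming the df from the question:

x = 'col1'

# df.x raises AttributeError: 'DataFrame' object has no attribute 'x'
print(df[x])           # Column<b'col1'> - bracket indexing resolves the variable
print(getattr(df, x))  # equivalent, though df[x] is the usual idiom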

Achintha Gunasekara