
I have edited this question to provide an example -

I have a list of column names:

colnames = ['col1','col2','col3']

I need to pass these to a DataFrame function one after another to return values for each. I do not want to use the groupBy function, so this is not a duplicate of the other question. I just need the max, min, and sum of all values of each column in my DataFrame.

Code example -

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import max as sparkMax

sc = SparkContext("local[2]", "Count App")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(1, 100, 200), (100, 200, 100), (100, 200, 100), (-100, 50, 200)],
    ("col1", "col2", "col3"))

df.show()


+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1| 100| 200|
| 100| 200| 100|
| 100| 200| 100|
|-100|  50| 200|
+----+----+----+

colnames = ['col1','col2','col3']

maxval = map(lambda x: df.agg(sparkMax(df[x]).alias('max_of_{}'.format(x))), colnames)

## This gives me a list of DataFrames, NOT a single DataFrame as required
for x in maxval:
    print(x.show())



+-----------+
|max_of_col1|
+-----------+
|        100|
+-----------+

None
+-----------+
|max_of_col2|
+-----------+
|        200|
+-----------+

None
+-----------+
|max_of_col3|
+-----------+
|        200|
+-----------+

How do I get a single DataFrame back from my lambda function, instead of a list of DataFrames? Something like this -

+-----------+---+
|Column_name|Max|
+-----------+---+
|max_of_col1|100|
|max_of_col2|200|
|max_of_col3|200|
+-----------+---+

I'm guessing something like a flatMap? Any help appreciated.

  • I'm not sure I understand your question, but instead of `df.x`, try `df[x]`. Also you're not using `map()` correctly - it should be `map(func, iterable)` - so putting it all together, perhaps you're looking for `newdf = map(lambda x: df.agg(sparkMax(length(df[x]))), colnames )` – pault May 24 '18 at 01:35
  • Yes! This helped create a new list, but I don't see the values I was expecting as integers. Rather I see a list of this - DataFrame[max(length(col1)): int] DataFrame[max(length(col2)): int] DataFrame[max(length(col3)): int] – aau22 May 24 '18 at 01:57
  • If you could [edit] your question with an [mcve] that shows a small sample DataFrame as well as your desired result, people may be able to better understand your issue and provide alternative (perhaps more elegant) solutions. – pault May 24 '18 at 02:07
  • Edited my question for clarity. Thanks! – aau22 May 24 '18 at 15:16

1 Answer


The map function in Python takes two arguments: the first is a function and the second is an iterable.

newdf = map(lambda x: len(x), colnames)  # a map object (lazy iterator) in Python 3, a list in Python 2

This might be helpful - http://book.pythontips.com/en/latest/map_filter.html
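
Applied to the question: mapping over colnames produces one single-row DataFrame per column. To collapse these into the single DataFrame asked for, one option is to give every aggregate the same schema and then fold the list together with union. This is only a sketch, assuming the df and colnames from the question (DataFrame.union is the Spark 2.x name; on 1.x use unionAll):

from functools import reduce

from pyspark.sql import DataFrame
from pyspark.sql.functions import lit, max as sparkMax

# One single-row DataFrame per column, all sharing the
# (Column_name, Max) schema so they can be unioned
parts = map(
    lambda c: df.agg(sparkMax(df[c]).alias('Max'))
                .select(lit('max_of_{}'.format(c)).alias('Column_name'), 'Max'),
    colnames)

# union combines positionally; reduce folds the parts into one DataFrame
result = reduce(DataFrame.union, parts)
result.show()

+-----------+---+
|Column_name|Max|
+-----------+---+
|max_of_col1|100|
|max_of_col2|200|
|max_of_col3|200|
+-----------+---+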

df.x will not work: df is an object, and df.x looks up an attribute literally named x on the df object, not the column whose name is stored in the variable x.

Have a look at the following example.

obj = type("MyObj", (object,), {'name':1})
a = obj()
print a.name

The above example prints the value of the attribute name, which is 1.

However, if I try the following:

obj = type("MyObj", (object,), {'name':1})
a = obj()
var = 'name'
print a.var

this raises an AttributeError, because the object a has no attribute called var; the value stored in the variable is not looked up.
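
The same thing happens with the DataFrame in the question: when the column name is stored in a variable, use bracket indexing (or getattr) instead of dot access. A short illustration, assuming the df from the question:

x = 'col1'

# df.x raises AttributeError: 'DataFrame' object has no attribute 'x'
print(df[x])           # Column<b'col1'> - bracket indexing resolves the variable
print(getattr(df, x))  # equivalent, though df[x] is the usual idiom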

Achintha Gunasekara