
Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column? I imported the Kaggle Quora training set (over 400,000 rows) and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value. I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html. I also tried df.toPandas() and then calculating the max in pandas (you guessed it, df.toPandas() took a long time). The only thing I did not try yet is the RDD way.

Before I provide some test code (I have to find out how to generate dummy data in Spark; a rough sketch of what I mean follows the list below), I'd like to know

  • can you give me a pointer to an article discussing this difference?
  • is Spark more sensitive to memory constraints on my computer than pandas?
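
For reference, here is a rough sketch of the kind of test code I have in mind (assuming a local PySpark 2.x session; the column name and row count are just placeholders):

```python
import random
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master('local[*]').appName('max-test').getOrCreate()

# Dummy data: one numeric column, roughly the size of the Quora training set.
data = [(random.random(),) for _ in range(400000)]
df = spark.createDataFrame(data, ['feature'])

# Find the maximum of the column and scale by it.
col_max = df.agg(F.max('feature')).collect()[0][0]
df_scaled = df.withColumn('feature_scaled', F.col('feature') / col_max)
```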
    did you have a look [here](http://www.kdnuggets.com/2016/01/python-data-science-pandas-spark-dataframe-differences.html)? Also my $0.02 is to use pandas unless there's a specific reason to use Spark. Pandas is fantastic in its simplicity and power. [this](http://stackoverflow.com/questions/34625410/why-does-my-spark-run-slower-than-pure-python-performance-comparison) was also another post on SO – MattR Apr 28 '17 at 17:32
  • Well, my reason at the moment is that I want to learn Spark :-) Thanks for the links. The KDnuggets article is interesting, but older. I was playing with the Kaggle dataset because it was too big to handle performantly on my laptop; I had to do quite some tweaking to avoid too much memory swapping. So I thought Spark might be interesting. And on the row-by-row calculations it seems to work better, but not when the whole dataframe is required. (I tried the RDD way too now and it was not helpful either.) Looks like I have to dig in deeper to get a better understanding. – Johan Steunenberg Apr 28 '17 at 19:59

2 Answers


As @MattR has already said in the comments, you should use Pandas unless there's a specific reason to use Spark.

Usually you don't need Apache Spark unless you encounter MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has overhead, because it first needs to split your data set, then process the distributed chunks, then join the "processed" data, collect it on one node and return it back to you.
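
To make that overhead concrete: even a one-line aggregation like the maximum of a column gets planned and executed as a distributed job in Spark, while Pandas does the same thing in a single process. A rough sketch (the file and column names are made up):

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Pandas: everything happens in one process, in memory.
pdf = pd.read_csv('train.csv')
pandas_max = pdf['feature'].max()

# Spark: the same question triggers a full distributed job
# (split the data, compute partial maxima per partition, merge them).
sdf = spark.read.csv('train.csv', header=True, inferSchema=True)
spark_max = sdf.agg(F.max('feature')).collect()[0][0]
```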

MaxU
  • To be perfectly honest, I've never used Spark. Memory is relatively cheap, so any company with "big data" can normally just beef up the server and run with their current solution. Have you ever used it @MaxU? – MattR Apr 28 '17 at 17:46
  • @MattR, I used it, but only a very few times... What if you need to process 5-500TB of data? ;-) – MaxU Apr 28 '17 at 17:47
  • `pandas.read_csv('PATH',chunksize=xxxxxxx)`? haha, but in all seriousness, I guess it makes sense. But I've never needed to handle that much data in my career. This may be off-topic, but did you run Spark from Scala or Python? If Python, do you have any good links to where you learned? – MattR Apr 28 '17 at 17:53
  • I have encountered `MemoryError` in Pandas on my 16GiB RAM notebook quite a few times already... I used PySpark, because Spark SQL DataFrames are very similar to Pandas DFs – MaxU Apr 28 '17 at 17:54

@MaxU, @MattR, I found an intermediate solution that also makes me reassess Spark's laziness and understand the problem better.

`sc.accumulator` helps me define a global variable, and with a separate `AccumulatorParam` object I can calculate the maximum of the column on the fly.
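
Roughly what I ended up with (a sketch, assuming an existing SparkContext `sc` and a DataFrame `df` with a numeric column called 'feature'):

```python
from pyspark.accumulators import AccumulatorParam

class MaxAccumulatorParam(AccumulatorParam):
    """Accumulator that keeps the largest value seen so far."""
    def zero(self, initial_value):
        return initial_value

    def addInPlace(self, acc1, acc2):
        return max(acc1, acc2)

# Start at -infinity so any real value replaces it.
max_acc = sc.accumulator(float('-inf'), MaxAccumulatorParam())

def track_max(row):
    # Updating the accumulator is a side effect of scanning the rows,
    # so the maximum is collected "on the fly".
    max_acc.add(row['feature'])

df.foreach(track_max)
col_max = max_acc.value
```

The usual `df.agg(F.max('feature'))` gives the same number; the accumulator just lets me pick up the maximum as a side effect of a pass over the rows that I was making anyway.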

In testing this I noticed that Spark is even lazier than expected, so this part of my original post, 'I like what Spark is doing when it comes to row-wise feature extraction', boils down to 'I like that Spark is doing nothing quite fast'.
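
What I mean by 'doing nothing quite fast': the per-row transformations return immediately because Spark only records a plan; the work happens when an action forces it. Illustrative only (assumes a DataFrame `df` with a text column like question1 from the Quora set):

```python
from pyspark.sql import functions as F

# This returns almost instantly: it only adds a step to the query plan.
df2 = df.withColumn('q1_len', F.length('question1'))

# Only an action such as count() or collect() actually runs the job,
# so this is where the time really goes.
df2.count()
```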

On the other hand, a lot of the time spent on calculating the maximum of the column was presumably really spent calculating the intermediate values.

Thanks for your input; this topic really got me much further in understanding Spark.