
I am running my Spark Streaming application using the direct Kafka API with a batch interval of 1 minute, and I also use Pandas inside my PySpark application code.

Below is my cluster configuration: 3 data nodes, each with 8 cores and 12 GB RAM.

I submit the job with spark-submit using the parameters below:

    --master yarn
    --deploy-mode cluster
    --executor-memory 2G
    --total-executor-cores 4
    --num-executors 11
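For reference, a sketch of what the full invocation might look like (the script name and the numbers are placeholders, not tuned recommendations). One thing worth noting: `--total-executor-cores` is honored only by standalone and Mesos masters, so on YARN the per-executor core count is set with `--executor-cores` instead:

    # Hypothetical spark-submit invocation; values are illustrative only.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 5 \
      --executor-cores 4 \
      --executor-memory 2G \
      --driver-memory 4G \
      my_streaming_app.py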

Based on some posts/questions already answered, I have also set the options below in my Spark configuration to avoid jobs going into QUEUED status:

    spark = SparkSession.builder \
        .config("spark.streaming.backpressure.enabled", "true") \
        .config("spark.streaming.kafka.maxRatePerPartition", "200") \
        .getOrCreate()

But the Spark UI still shows my active batches going into QUEUED status.

Please correct me if I am wrong at any stage of the application's processing.

  • If you run big computations in `Pandas` you'll want to increase the driver's memory (`--driver-memory`) as well, since everything will be done "locally" – MaFF Aug 28 '17 at 12:15
  • But in cluster mode, will it not be distributed even when using Pandas? My input stream can have at most 1,000 records, and when I run the same application described above in parallel for two distinct customers, my process gets queued! – Mohammad Umar Farooq Aug 28 '17 at 16:00
  • You're using 2G/executor × 11 executors for a total of 22G; if you only have 36G, it's normal that the second application gets queued. My remark on the driver's memory was just to say that it's also something you have to take into consideration if you want your applications to run faster, since Pandas won't distribute – MaFF Aug 28 '17 at 21:48
  • OK, got it, thanks for the info! In production, should we specify these parameters, or should YARN dynamically allocate based on volume? – Mohammad Umar Farooq Aug 29 '17 at 14:15
  • There should be a Spark configuration file on your platform with default values for all of these: `spark-defaults.conf` – MaFF Aug 29 '17 at 17:56
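To illustrate that last comment: cluster-wide defaults typically live in `conf/spark-defaults.conf` under the Spark installation. A hypothetical snippet (the values are examples only, and note that dynamic allocation on YARN also requires the external shuffle service):

    # Example spark-defaults.conf entries; values are illustrative.
    spark.driver.memory                4g
    spark.executor.memory              2g
    spark.serializer                   org.apache.spark.serializer.KryoSerializer
    spark.dynamicAllocation.enabled    true
    spark.shuffle.service.enabled      true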

1 Answer


First of all, as mentioned by @Marie in the comments, the Pandas part will execute locally, meaning on the driver. If you want to do that, `--driver-memory` has to be increased, which somewhat defeats the purpose of distributed processing. That said, it's a good idea to experiment with your batch interval, starting from 5-10 seconds and taking it up slowly.

In addition to the parameters you have tuned, there is also `spark.streaming.concurrentJobs`, which is deliberately not mentioned in the docs, for the reason discussed here. Increase this value incrementally, starting from the default of 1 and working up toward 10, to see what suits best. There are a lot of blog posts on optimizing streaming applications that go over these settings, some of which you have already applied. You might also want to set `"spark.serializer": "org.apache.spark.serializer.KryoSerializer"`, the benefits of which are explained here.
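A minimal PySpark sketch pulling these suggestions together; the 10-second batch interval and the `concurrentJobs` value of 2 are illustrative starting points, not tested recommendations:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .set("spark.streaming.backpressure.enabled", "true")
            .set("spark.streaming.kafka.maxRatePerPartition", "200")
            # Undocumented: number of batch jobs run concurrently (default 1).
            # Raise cautiously; it weakens ordering guarantees between batches.
            .set("spark.streaming.concurrentJobs", "2")
            .set("spark.serializer",
                 "org.apache.spark.serializer.KryoSerializer"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches to start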

– darthsidious