I am running a Spark Streaming application that uses the direct Kafka API with a batch interval of 1 minute, and my PySpark application code also uses Pandas.
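For reference, here is a minimal sketch of the kind of direct-stream setup I mean (the topic name, broker address, and app name are placeholders, not my real values):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-direct-demo")
ssc = StreamingContext(sc, 60)  # 1-minute batch interval

# Direct approach: Spark tracks offsets itself, one RDD partition per Kafka partition
stream = KafkaUtils.createDirectStream(
    ssc,
    ["my_topic"],                              # hypothetical topic
    {"metadata.broker.list": "broker1:9092"}   # hypothetical broker list
)

def process(rdd):
    # application logic goes here (this is where my Pandas code runs)
    pass

stream.foreachRDD(process)
ssc.start()
ssc.awaitTermination()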
Below is my cluster configuration: 3 data nodes, each with 8 cores and 12 GB of RAM.
I have submitted the job with the following spark-submit parameters:
--master yarn
--deploy-mode cluster
--executor-memory 2G
--total-executor-cores 4
--num-executors 11
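(For reference, the cluster totals 3 × 8 = 24 cores and 3 × 12 = 36 GB of RAM. Assuming the default YARN memory overhead of max(384 MB, 10% of executor memory), each 2 GB executor container requests roughly 2.4 GB, so 11 executors ask YARN for about 26.4 GB and at least 11 cores in total.)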
But the Spark UI shows my active batches going into QUEUED status.

Based on some posts and answered questions I read, I have set the options below in my Spark configuration to stop jobs going into QUEUED status:

spark = SparkSession.builder \
    .config("spark.streaming.backpressure.enabled", "true") \
    .config("spark.streaming.kafka.maxRatePerPartition", "200") \
    .getOrCreate()
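As I understand it, spark.streaming.kafka.maxRatePerPartition caps ingestion per Kafka partition in records per second, so assuming, for illustration, a topic with 3 partitions (my real partition count may differ), one 60-second batch reads at most 200 × 3 × 60 = 36,000 records.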
Please correct me if I am wrong at any stage of the application's processing.