We have a Spark Streaming job that runs every 30 mins and takes 15s to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 mins so that the EMR cluster terminates after the 15-second job and is recreated. Is that the recommended approach?
-
If it runs every 30 mins, it is more likely a batch case than a streaming one. How exactly do you use Spark Streaming? – iTech Feb 03 '19 at 03:30
-
Ok, so we have a batch job that runs every morning, and then the streaming job handles the delta of changes over the last 30 mins. – RockerZ Feb 03 '19 at 03:46
2 Answers
For a job that takes 15 seconds, running it on EMR is a waste of time and resources; you will likely wait several minutes just for an EMR cluster to bootstrap.
AWS Data Pipeline or AWS Batch makes sense only if you have a long-running job.
First, make sure that you really need Spark, since from what you described it could be overkill.
A Lambda function with a CloudWatch Events schedule might be all you need for such a quick job, with no infrastructure to manage.
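As a minimal sketch of that suggestion: a Lambda handler invoked by a CloudWatch Events schedule (e.g. `rate(30 minutes)`). The `process_records` body and the event shape here are hypothetical placeholders for whatever the 15-second job actually does.

```python
# Minimal sketch of a Lambda handler for a short scheduled job.
# The processing logic and event payload are hypothetical placeholders.
import json

def process_records(records):
    """Placeholder for the actual delta-processing logic."""
    return len(records)

def handler(event, context=None):
    # For a cron-style CloudWatch Events trigger, the event payload is
    # mostly metadata; real input would typically be fetched from S3/SQS.
    records = event.get("records", [])
    count = process_records(records)
    return {"statusCode": 200, "body": json.dumps({"processed": count})}
```

The scheduling itself lives entirely in the CloudWatch Events rule, so there is nothing to keep running between invocations.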
-
Lambda may not be best suited, as there's a lot of IO happening from S3 for this job. But we can make it work by setting the right batch size. – RockerZ Feb 03 '19 at 03:45
For streaming-related jobs, the key in your case would be to avoid IO, since the job seems to take only 15 seconds. Push your messages to a queue (AWS SQS). Have an AWS Step Function triggered by a CloudWatch event (which implements a schedule, like cron; in your case every 30 mins) read messages from SQS and process them, ideally in a Lambda.
So one option (serverless):
Streaming messages --> AWS SQS --> (every 30 mins a CloudWatch event triggers a Step Function) --> which triggers a Lambda to process all messages in the queue
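The Lambda step in that flow boils down to draining the queue. Below is a sketch of that drain loop with the SQS client passed in as a parameter, so it can be exercised without AWS credentials; the queue URL is a made-up placeholder.

```python
# Sketch of the "process all messages in the queue" Lambda step.
# QUEUE_URL is a hypothetical placeholder; the sqs client would
# normally be boto3.client("sqs").
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/deltas"

def drain_queue(sqs, handle_message, queue_url=QUEUE_URL):
    """Receive, process, and delete messages until the queue is empty."""
    processed = 0
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained
        for msg in messages:
            handle_message(msg["Body"])
            # Delete only after successful processing so failures are retried.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```

Deleting each message only after it has been handled means a crash mid-batch leaves the unprocessed messages on the queue for the next scheduled run.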
Option 2:
Streaming messages ---> AWS SQS --> process messages using a Python application or a Java Spring application with a scheduled task that wakes up every 30 mins, reads messages from the queue, and processes them in memory.
I have used option 2 for solving analytical problems, although my analytical job took 10 mins and was data-intensive. Option 2 additionally requires monitoring the virtual machine (or container) where the process runs, whereas option 1 is serverless. Finally, it all comes down to the software stack you already have in place and the software needed to process the streaming data.
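A sketch of option 2's scheduled task using only the Python standard library's `sched` module: the process stays up and wakes on a fixed interval to process whatever has accumulated. The `fetch_messages` callable and `process_batch` body are hypothetical stand-ins for the SQS read and the actual processing.

```python
# Sketch of Option 2: a long-running process with an in-process
# scheduler that wakes on a fixed interval. fetch_messages and
# process_batch are hypothetical placeholders for the queue read
# and the real processing logic.
import sched
import time

INTERVAL_SECONDS = 30 * 60  # every 30 minutes in production

def process_batch(messages):
    """Placeholder: process the delta of changes in memory."""
    return [m.upper() for m in messages]

def run_periodically(fetch_messages, interval, iterations):
    """Call process_batch on freshly fetched messages every `interval` seconds."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    results = []

    def tick(remaining):
        results.append(process_batch(fetch_messages()))
        if remaining > 1:
            # Re-arm the timer for the next wake-up.
            scheduler.enter(interval, 1, tick, argument=(remaining - 1,))

    scheduler.enter(0, 1, tick, argument=(iterations,))
    scheduler.run()  # blocks until no events remain
    return results
```

In a real deployment the loop would run indefinitely (no `iterations` cap) and the process would be supervised by systemd, a container orchestrator, or similar, which is exactly the monitoring overhead mentioned above.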