We have a Spark Streaming job that runs every 30 mins and takes 15s to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 mins so that the EMR cluster terminates after the 15-second job and is recreated. Is that the recommended approach?
-
If it runs every 30 mins, it is more likely a batch case than a streaming one. How exactly do you use Spark Streaming? – iTech Feb 03 '19 at 03:30
-
Ok, so we have a batch job that runs every morning, and then the streaming job handles the delta of changes over the last 30 mins. – RockerZ Feb 03 '19 at 03:46
2 Answers
For a job that takes 15 seconds, running it on EMR is a waste of time and resources; you will likely wait several minutes just for an EMR cluster to bootstrap.
AWS Data Pipeline or AWS Batch makes sense only if you have a long-running job.
First, make sure that you really need Spark, since from what you described it could be overkill.
A Lambda function with a CloudWatch Events schedule might be all you need for such a quick job, with no infrastructure to manage.
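As a minimal sketch of that suggestion: a Lambda handler invoked by a CloudWatch Events schedule (e.g. `rate(30 minutes)`). The `process_records` body and the event shape here are hypothetical placeholders for whatever the 15-second job actually does.

```python
# Minimal sketch of a Lambda handler for a short scheduled job.
# The processing logic and event payload are hypothetical placeholders.
import json

def process_records(records):
    """Placeholder for the actual delta-processing logic."""
    return len(records)

def handler(event, context=None):
    # For a cron-style CloudWatch Events trigger, the event payload is
    # mostly metadata; real input would typically be fetched from S3/SQS.
    records = event.get("records", [])
    count = process_records(records)
    return {"statusCode": 200, "body": json.dumps({"processed": count})}
```

The scheduling itself lives entirely in the CloudWatch Events rule, so there is nothing to keep running between invocations.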
-
Lambda may not be best suited, as there's a lot of IO happening from S3 for this job. But we can make it work by setting the right batch size. – RockerZ Feb 03 '19 at 03:45
For streaming-related jobs, the key in your case would be to avoid IO, since the job seems to take only 15 seconds. Push your messages to a queue (AWS SQS). Have an AWS Step Function triggered by a CloudWatch event (which implements a schedule, like cron; in your case every 30 mins) read messages from SQS and process them, ideally in a Lambda.
So one option (serverless):
Streaming messages --> AWS SQS --> (every 30 mins a CloudWatch event triggers a Step Function) --> which triggers a Lambda to process all messages in the queue
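The Lambda step in that flow boils down to draining the queue. Below is a sketch of that drain loop with the SQS client passed in as a parameter, so it can be exercised without AWS credentials; the queue URL is a made-up placeholder.

```python
# Sketch of the "process all messages in the queue" Lambda step.
# QUEUE_URL is a hypothetical placeholder; the sqs client would
# normally be boto3.client("sqs").
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/deltas"

def drain_queue(sqs, handle_message, queue_url=QUEUE_URL):
    """Receive, process, and delete messages until the queue is empty."""
    processed = 0
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained
        for msg in messages:
            handle_message(msg["Body"])
            # Delete only after successful processing so failures are retried.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```

Deleting each message only after it has been handled means a crash mid-batch leaves the unprocessed messages on the queue for the next scheduled run.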
Option 2:
Streaming messages ---> AWS SQS --> process messages using a Python application or a Java Spring application with a scheduled task that wakes up every 30 mins, reads messages from the queue, and processes them in memory.
I have used option 2 for solving analytical problems, although my analytical job took 10 mins and was data-intensive. Option 2 additionally requires monitoring the virtual machine (or container) where the process runs, whereas option 1 is serverless. Finally, it all comes down to the software stack you already have in place and the software needed to process the streaming data.
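A sketch of option 2's scheduled task using only the Python standard library's `sched` module: the process stays up and wakes on a fixed interval to process whatever has accumulated. The `fetch_messages` callable and `process_batch` body are hypothetical stand-ins for the SQS read and the actual processing.

```python
# Sketch of Option 2: a long-running process with an in-process
# scheduler that wakes on a fixed interval. fetch_messages and
# process_batch are hypothetical placeholders for the queue read
# and the real processing logic.
import sched
import time

INTERVAL_SECONDS = 30 * 60  # every 30 minutes in production

def process_batch(messages):
    """Placeholder: process the delta of changes in memory."""
    return [m.upper() for m in messages]

def run_periodically(fetch_messages, interval, iterations):
    """Call process_batch on freshly fetched messages every `interval` seconds."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    results = []

    def tick(remaining):
        results.append(process_batch(fetch_messages()))
        if remaining > 1:
            # Re-arm the timer for the next wake-up.
            scheduler.enter(interval, 1, tick, argument=(remaining - 1,))

    scheduler.enter(0, 1, tick, argument=(iterations,))
    scheduler.run()  # blocks until no events remain
    return results
```

In a real deployment the loop would run indefinitely (no `iterations` cap) and the process would be supervised by systemd, a container orchestrator, or similar, which is exactly the monitoring overhead mentioned above.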