
We have a Spark Streaming job that runs every 30 minutes and takes 15 seconds to complete. What are the suggested best practices in this scenario? I am thinking I can schedule AWS Data Pipeline to run every 30 minutes, so that EMR terminates after the 15-second job and is recreated. Is that the recommended approach?

RockerZ
  • If it runs every 30 mins, it is more likely a batch case, not a streaming one. How do you use Spark Streaming exactly? – iTech Feb 03 '19 at 03:30
  • Ok, so we have a batch job that runs every day in the morning, and then the streaming job handles the delta of changes over the last 30 mins. – RockerZ Feb 03 '19 at 03:46

2 Answers


For a job that takes 15 seconds, running it on EMR is a waste of time and resources; you will likely wait several minutes just for the EMR cluster to bootstrap.

AWS Data Pipeline or AWS Batch makes sense only if you have a long-running job.

First, make sure that you really need Spark, since from what you described it could be overkill.

Lambda with CloudWatch Events scheduling might be all you need for such a quick job, with no infrastructure to manage.
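As a minimal sketch of that idea: a Lambda handler fired by a CloudWatch Events (EventBridge) schedule every 30 minutes, which derives the S3 prefix of the latest delta from the event timestamp. The bucket layout, the `deltas/` prefix, and the helper names here are assumptions for illustration, not anything from the question.

```python
# Hypothetical Lambda handler for a 30-minute CloudWatch Events schedule.
# Scheduled events carry an ISO-8601 "time" field, e.g. "2019-02-03T03:30:00Z".

def build_s3_prefix(event, prefix="deltas/"):
    """Derive the S3 prefix to process from the scheduled event time."""
    ts = event.get("time", "")   # e.g. "2019-02-03T03:30:00Z"
    date_part = ts[:10]          # "2019-02-03"
    hour_part = ts[11:13]        # "03"
    return f"{prefix}{date_part}/{hour_part}/"

def handler(event, context):
    key_prefix = build_s3_prefix(event)
    # In a real deployment you would list and read the delta objects here:
    #   s3 = boto3.client("s3")
    #   resp = s3.list_objects_v2(Bucket="my-bucket", Prefix=key_prefix)
    #   ... process resp.get("Contents", []) ...
    return {"statusCode": 200, "prefix": key_prefix}
```

The schedule itself is a one-line EventBridge rule (`rate(30 minutes)`) with the Lambda as its target, so there is no cluster to start or tear down.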

iTech
  • Lambda may not be best suited, as there's a lot of I/O happening from S3 for this job. But we can make it work by setting up the right batch size. – RockerZ Feb 03 '19 at 03:45

For streaming-related jobs, the key in your case is to avoid I/O, since the job takes only 15 seconds. Push your messages to a queue (AWS SQS). Then have a CloudWatch event rule (which implements a cron-like schedule, every 30 minutes in your case) trigger an AWS Step Function that reads the messages from SQS and processes them, ideally in a Lambda.

So one option (serverless):

Streaming messages -> AWS SQS -> (every 30 minutes a CloudWatch event triggers a Step Function) -> which triggers a Lambda that processes all messages in the queue

https://aws.amazon.com/getting-started/tutorials/scheduling-a-serverless-workflow-step-functions-cloudwatch-events/
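The queue-draining Lambda in option 1 could look roughly like the sketch below. The queue URL and message shape are assumptions; the SQS client is passed in as a parameter so the polling loop can be exercised without AWS credentials.

```python
# Hypothetical sketch of the Lambda that drains the SQS queue (option 1).

def drain_queue(sqs, queue_url, process, max_batches=100):
    """Poll SQS until empty, handing each message body to `process`."""
    handled = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue drained
        for msg in messages:
            process(msg["Body"])
            # Delete only after successful processing, so failures are retried.
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            handled += 1
    return handled

def handler(event, context):
    import boto3
    sqs = boto3.client("sqs")
    # Placeholder queue URL, not from the original answer.
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
    return drain_queue(sqs, queue_url, process=print)
```

Deleting each message only after it is processed means a Lambda crash mid-batch leaves the unprocessed messages in the queue for the next 30-minute run.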

Option 2:

Streaming messages -> AWS SQS -> process the messages with a Python application (or a Java Spring application) that has a scheduled task which wakes up every 30 minutes, reads the messages from the queue, and processes them in memory.
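Option 2's scheduled task can be as simple as a loop in a long-running process. This is a sketch under assumptions: the fetch and process callables are injected (in a real deployment `fetch_batch` would read from SQS), and the `cycles` parameter exists only so the loop can terminate in tests.

```python
# Hypothetical sketch of option 2: a long-running process whose scheduled
# task wakes every 30 minutes, drains pending messages, and processes the
# batch entirely in memory.
import time

def run_scheduled(fetch_batch, process_batch, interval_s=30 * 60, cycles=None):
    """Every `interval_s` seconds, fetch pending messages and process them.

    `cycles=None` runs forever; a number limits iterations (useful in tests).
    """
    done = 0
    while cycles is None or done < cycles:
        batch = fetch_batch()            # e.g. read from SQS
        if batch:
            process_batch(batch)         # all in memory, no intermediate I/O
        done += 1
        if cycles is not None and done >= cycles:
            break
        time.sleep(interval_s)
    return done
```

In Java Spring the equivalent would be a `@Scheduled(fixedRate = ...)` method; the structure is the same either way.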

I have used option 2 for solving analytical problems, although my analytical problem took 10 minutes and was data intensive. Option 2 additionally requires monitoring the virtual machine (or container) where the process runs; option 1, on the other hand, is serverless. In the end it comes down to the software stack you already have in place and the software needed to process the streaming data.