
New to AWS. I have a requirement to create a daily batch pipeline:

  1. Read 6-10 CSV files of 1 GB+ each. (Each file is an extract of a table from a SQL database.)
  2. Transform each file with some logic and join all files to create one item per id.
  3. Load the joined data into a single DynamoDB table with upsert logic.

The current approach I have started with: we have an EC2 instance available for such tasks, so I am writing Python code to (1) read all the CSVs, (2) convert them to a denormalised JSON file, and (3) import it into DynamoDB using boto3.
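Steps (1) and (2) can be sketched in plain Python; this is a minimal, assumption-laden version (it assumes every extract has an `id` column and that the merged dict of all ids fits in memory, which may not hold at 10x the data):

```python
import csv
import io

def join_by_id(*files):
    # Stream each CSV extract row by row and merge all columns
    # into one denormalised dict per id.
    merged = {}
    for f in files:
        for row in csv.DictReader(f):
            merged.setdefault(row["id"], {}).update(row)
    return merged

# Example with two tiny in-memory "files":
a = io.StringIO("id,name\n1,alice\n2,bob\n")
b = io.StringIO("id,score\n1,10\n2,20\n")
print(join_by_id(a, b)["1"])  # {'id': '1', 'name': 'alice', 'score': '10'}
```

Reading is streamed, but the merged result is held in memory, which is the first place a single-machine script like this hits a wall.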

My question is that I am concerned whether my data is "Big Data". Is processing 10 GB of data with a single Python script OK? And if the file sizes grow 10x down the line, will I face scaling issues? I have only worked with GCP in the past, and in this scenario I would have used Dataflow to get the task done. Is there an equivalent in AWS terms? Would be great if someone could share some thoughts. Thanks for your time.
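For step (3), the boto3 load could look like the sketch below. The table name and key are assumptions; note that boto3's DynamoDB resource API rejects Python floats (values must be `Decimal`), and that a `put_item` with the full item is effectively an upsert-by-replace:

```python
from decimal import Decimal

def to_ddb_item(row):
    # boto3's DynamoDB resource API rejects float; convert to Decimal.
    return {k: (Decimal(str(v)) if isinstance(v, float) else v)
            for k, v in row.items()}

def load_items(table, rows):
    # batch_writer groups puts into BatchWriteItem calls (25 items each)
    # and retries unprocessed items; overwrite_by_pkeys dedupes within a batch.
    with table.batch_writer(overwrite_by_pkeys=["id"]) as batch:
        for row in rows:
            batch.put_item(Item=to_ddb_item(row))

# Usage (assumes a table keyed on "id" already exists):
# import boto3
# table = boto3.resource("dynamodb").Table("joined_items")
# load_items(table, joined_rows)
```

Because each put replaces the whole item, this only behaves like an upsert if every load carries the complete item; partial updates would need `update_item` instead.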

archie297
  • It could be useful to take a deeper look into the service AWS Data Pipeline – lvthillo Dec 05 '20 at 15:23
  • Thank you, I was looking at that, but it's not being used in the org yet. I haven't worked with Hadoop, so I'm a bit intimidated by that since it uses EMR. I'm just curious to know what data counts as big data. At what point does a single-threaded Python script on a file become a candidate for Hadoop processing? – archie297 Dec 05 '20 at 21:57
  • You are asking the right questions -- but, hey: if it's stupid and it works, it's not that stupid ;) Sure, a single-threaded Python script is going to eventually hit a scaling cap, and when you get there you will have to replace it. But more complex solutions may have more operational overhead. Finally: don't be intimidated by EMR or "big data" – Mike Dinescu Dec 07 '20 at 06:54

2 Answers


The AWS equivalent of Google Cloud Dataflow is AWS Glue. The documentation isn't clear on this, but Glue can write to DynamoDB.
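As a rough sketch of what that looks like, a Glue job can read the CSV extracts from S3 and write the result to DynamoDB through Glue's DynamoDB connector. The bucket path and table name below are placeholders, and this only runs inside the Glue runtime, not locally:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read one CSV extract from S3 (path is a placeholder).
frame = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/extracts/orders/"]},
    format="csv",
    format_options={"withHeader": True},
)

# ... transform and join the extracts here ...

# Write the joined frame to DynamoDB (table name is a placeholder).
glue_context.write_dynamic_frame_from_options(
    frame=frame,
    connection_type="dynamodb",
    connection_options={"dynamodb.output.tableName": "joined_items"},
)
```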

Steven Ensslen

A more appropriate equivalent of Dataflow in AWS is Kinesis Data Analytics, which supports Apache Beam's Java SDK.

You can see an example of an Apache Beam pipeline running on their service.

Apache Beam is able to write to DynamoDB.

Good luck!

Pablo