New to AWS. I have a requirement to build a daily batch pipeline that will:
- Read 6-10 CSV files of 1GB+ each (each file is an extract of a table from a SQL db).
- Transform each file with some logic and join all the files to produce one item per id (see the sketch after this list).
- Load the joined data into a single DynamoDB table with upsert logic.
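Roughly what I have in mind for the transform/join step, as a minimal pandas sketch. The file names, the per-file transform, and the `id` join key are all placeholders; the real logic is more involved:

```python
import pandas as pd
from functools import reduce

# Hypothetical file names -- the real extracts are 6-10 tables from the SQL db.
FILES = ["customers.csv", "orders.csv", "payments.csv"]

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the per-file logic; here it just normalises column names.
    df.columns = [c.lower() for c in df.columns]
    return df

frames = [transform(pd.read_csv(f)) for f in FILES]

# Outer-join everything on "id" so each id ends up as one wide record.
joined = reduce(lambda left, right: left.merge(right, on="id", how="outer"), frames)

# One dict per id, ready to be written to DynamoDB.
records = joined.to_dict(orient="records")
```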
The approach I have started with: we have an EC2 instance available that is used for such tasks, so I am writing a Python script to (1) read all the CSVs, (2) convert them into a denormalised JSON structure, and (3) import that into DynamoDB using boto3.
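And a rough sketch of step (3), continuing from the `records` list above. From what I understand, DynamoDB's `put_item` already replaces any existing item with the same key, so a plain batch write should give me the upsert behaviour. The table name here is hypothetical, and the `Decimal`/NaN handling is there because boto3 rejects Python floats and the outer join leaves NaNs:

```python
import math
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("joined_items")  # hypothetical table name, keyed on "id"

def clean(record: dict) -> dict:
    # boto3's DynamoDB resource rejects Python floats, and the outer join
    # leaves NaNs, so convert floats to Decimal and drop missing values.
    out = {}
    for key, value in record.items():
        if isinstance(value, float):
            if math.isnan(value):
                continue
            value = Decimal(str(value))
        out[key] = value
    return out

def load(records: list) -> None:
    # batch_writer chunks the writes into 25-item BatchWriteItem calls and
    # retries unprocessed items; overwrite_by_pkeys de-duplicates repeated
    # ids within a batch. put_item replaces any existing item with the same
    # key, which is the upsert behaviour I need.
    with table.batch_writer(overwrite_by_pkeys=["id"]) as batch:
        for record in records:
            batch.put_item(Item=clean(record))

load(records)  # "records" is the per-id list from the join sketch above
```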
My question is whether my data counts as "Big Data". Is processing ~10GB with a single Python script okay? And if the file sizes grow 10x down the line, will I face scaling issues? I have only worked with GCP in the past, where I would have used Dataflow for this task. Is there an AWS equivalent? Would be great if someone could share some thoughts. Thanks for your time.