I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. The file is generated daily, so I need to read and process it and load the data into Postgres.
I want to process this huge file in parallel to save time and memory.
I came up with two designs, but I am a little confused about Spark, since the Spark context requires a connection to be open to the S3 bucket.
- Use Spark Streaming to read the CSV from S3, process it, convert each row into JSON, and append the JSON data to a JSONB column in Postgres.
- Use Spring and Java: download the file to the server, then process it and convert it into JSON.
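For context, the row-by-row CSV-to-JSON conversion at the heart of either design could look something like the plain-Java sketch below (the column names are made up for illustration, and the escaping is deliberately minimal; in practice a library like Jackson would handle this):

```java
public class CsvToJson {
    // Convert one CSV row into a JSON object string suitable for
    // inserting into a Postgres JSONB column.
    static String rowToJson(String[] header, String[] row) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < header.length; i++) {
            if (i > 0) sb.append(",");
            sb.append('"').append(escape(header[i])).append("\":\"")
              .append(escape(row[i])).append('"');
        }
        return sb.append("}").toString();
    }

    // Minimal JSON escaping: backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Hypothetical header and data row from the daily CSV.
        String[] header = {"order_id", "customer", "amount"};
        String[] row = {"1001", "Alice", "42.50"};
        System.out.println(rowToJson(header, row));
        // prints {"order_id":"1001","customer":"Alice","amount":"42.50"}
    }
}
```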
Can anyone point me in the right direction?