Reading a CSV file from S3 using Spark

Question

I am new to Spark. I have a scenario where I need to read and process a CSV file from AWS S3. The file is generated daily, so each day I need to read it, process it, and load the data into Postgres.

I want to process this huge file in parallel to save time and memory.

I came up with two designs, but I am a little confused about Spark, as the Spark context seems to require an open connection to every S3 bucket.

1. Use Spark Streaming to read the CSV from S3, convert it to JSON row by row, and append the JSON data to a JSONB column in Postgres.
2. Use Spring and Java: download the file to the server, then process it and convert it into JSON.

Can anyone point me in the right direction?

What's the estimated size of the S3 objects you will be dealing with? – piy26, Apr 26 '18 at 11:07
You should definitely go for Option 1, i.e., stream the file content and process it. This approach is far better than Option 2, as you won't need many resources for a streaming solution. – himanshuIIITian, Apr 26 '18 at 12:32
@himanshuIIITian Can you please provide a reference? – ManojP, Apr 26 '18 at 12:53

score 1 · Answer 1 · answered Apr 26 '18 at 13:25

If it's daily, and only 100MB, you don't really need much in the way of large-scale tooling. I'd estimate under a minute for basic download and processing, even remotely, after which comes the Postgres load. Which Postgres offers

Try doing this locally: use aws s3 cp to copy the file to your local system, then try the load into Postgres.
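As a rough illustration of the local approach, here is a minimal Python sketch (standard library only) that turns CSV content into a single JSON document of the kind the question describes. The function name csv_to_json_document and the sample columns are made up for illustration, and it assumes the file has a header row. Inserting the result into a JSONB column is left out, since that depends on the Postgres driver you pick.

```python
import csv
import io
import json


def csv_to_json_document(lines):
    """Convert CSV lines (with a header row) into one JSON array string.

    `lines` can be any iterable of text lines, e.g. an open file object
    for the copy downloaded from S3. Hypothetical helper, for illustration.
    """
    reader = csv.DictReader(lines)
    rows = list(reader)  # fine for a ~100MB file; all values stay strings
    return json.dumps(rows)


# Stand-in for a locally downloaded file; the columns are invented.
sample = io.StringIO("id,name\n1,alice\n2,bob\n")
document = csv_to_json_document(sample)
print(document)  # [{"id": "1", "name": "alice"}, {"id": "2", "name": "bob"}]
```

For a real file, pass open(path, newline="") instead of the StringIO object.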

I wouldn't bother with any parallel tooling; even Spark is going to want to work with 32-64MB blocks, so you won't get more than 2-3 workers. And if the file is .gz, you get exactly one.

That said, if you want to learn spark, you could do this in spark-shell. Download locally first though, just to save time and money.

stevel

I have 1000+ customers, each of which produces a CSV file on S3 every day. I need to fetch each file, process it, build one JSON document from all of its rows, and insert it into a JSONB column. – ManojP Apr 26 '18 at 14:10
So, you have 1000 x 100 MB CSV files? Yes, Spark will work with the built-in csv module. Uncompressed CSV files can be partitioned into blocks (for s3a://, set fs.s3a.block.size to 25MB for 4 workers/file). .gzip files: one worker per file. Going to have fun getting it all into Postgres. As suggested: play with spark-shell in local mode first – stevel Apr 27 '18 at 12:58
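
The block arithmetic in the comment above can be sketched as a ceiling division. This is only a rough estimate, assuming one worker per block; Spark's actual split planning also depends on other settings. The helper name estimated_splits is hypothetical and not part of any Spark API.

```python
MB = 1024 * 1024


def estimated_splits(file_size_bytes, block_size_bytes, gzipped=False):
    """Rough count of parallel splits for one file (illustrative only)."""
    if gzipped:
        # A gzip stream is not splittable, so one worker reads the whole file.
        return 1
    # Ceiling division: a partial trailing block still needs its own split.
    return max(1, -(-file_size_bytes // block_size_bytes))


print(estimated_splits(100 * MB, 25 * MB))                # 4
print(estimated_splits(100 * MB, 25 * MB, gzipped=True))  # 1
```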