
I have a task of comparing two big CSV files and writing the comparison result to a new file. File 1 has 200k rows, and File 2 could have 200k rows or fewer. Both have 200 columns. The files are not sorted and their rows can be in any order. I am using Java 8 and Spring version 4.

Question

I am using Spring Batch in my project. Is there any way I can achieve this with a customized Spring Batch ItemReader and ItemWriter, or should I use a tasklet and plain Java code to compare the files? I also want to do it in the fastest way possible. The volume of data will be really huge, maybe 2-4 GB, so I don't want to load it all into memory. The file structures look something like this:

File1:
regn_nbr,name,address1,countrycode,regn_date
2345,John,4332 JFK Boulevard,US,02-12-2011
2347,mark,4332 Maryland Avenue,US,04-27-2015
2348,Smith,4332 JFK road,US,07-30-2011
2302,Andy,4332 JFK lane,US,06-01-2010

File2:
regn_nbr,name,address1,countrycode,regn_date
2345,John,4332 JFK Boulevard,US,02-12-2011
2302,Andy,4332 JFK lane,US,06-01-2010
2911,Peter,12 candle drive,MX,01-01-2010
2348,Smith,4332 JFK road,US,07-30-2011
2347,mark,4332 Maryland Avenue,US,04-27-2015

Your suggestions, different approaches, strategies, and expertise are most welcome.

Kenneth
  • Spring Batch is definitely the wrong choice for this. It's meant for batch processing, whereas your problem requires at least sorting the data before processing it. – Kayaman Apr 08 '16 at 14:52
  • @Kayaman, thanks for your reply. Yes, it is a batch-processing application, and we have already gone through a lot of phases, like pulling multiple files over FTP and from databases and various other sources, with lots of business processing too. We have consolidated the data from the different sources and produced these files. Now we are at the comparison stage. My plan is to sort both files, insert the missing record keys into each file, and then pass them on to the processor to compare the values. Just curious: can I achieve this comparison without sorting? – Kenneth Apr 08 '16 at 15:09
  • 2
    Well look at it this way. You take the first row from the first file, and the first row from the second file. The row from the second file doesn't correspond to the row from the first file. You'd then have to go through the whole second file to see if the row exists at all, or if it's been deleted, and that doesn't really work in batch processing (nor would it be efficient). There's no way you can do it without some sort of preprocessing. I'd just stick it all into a database. – Kayaman Apr 08 '16 at 15:23
  • Same here. INSERT the lines from one file into one table and the lines from the second file into another table, then perform a `NOT EXISTS` SELECT. – Artem Bilan Apr 08 '16 at 16:45
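
For reference, the `NOT EXISTS` comparison suggested in the comments could look like the following. This is a sketch only; the table names `file1`/`file2` and the key column `regn_nbr` are assumptions based on the sample data.

```sql
-- Rows whose key is in file1 but not in file2; swap the tables
-- for the opposite direction.
SELECT f1.*
FROM   file1 f1
WHERE  NOT EXISTS (SELECT 1 FROM file2 f2
                   WHERE  f2.regn_nbr = f1.regn_nbr);

-- Keys present in both files whose remaining columns differ
-- (only the five sample columns are spelled out here).
SELECT f1.regn_nbr
FROM   file1 f1
JOIN   file2 f2 ON f2.regn_nbr = f1.regn_nbr
WHERE  f1.name        <> f2.name
   OR  f1.address1    <> f2.address1
   OR  f1.countrycode <> f2.countrycode
   OR  f1.regn_date   <> f2.regn_date;
```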

2 Answers


Are you sure you need a special program for that?

I would try it with a database: if memory really is your primary concern, all it takes is a small Java main class, some java.nio, and simple Java SQL.
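
Along those lines, here is a minimal sketch, assuming an in-memory HSQLDB instance (which is what the comment below reports using) and only the five sample columns from the question. The class name, table names, and column sizes are my own; a real 200-column file would need the full table definition and a proper CSV parser for quoted fields.

```java
import java.nio.file.*;
import java.sql.*;
import java.util.stream.Stream;

public class CsvCompare {

    public static void main(String[] args) throws Exception {
        // In-memory HSQLDB for brevity; for 2-4 GB inputs use a file-backed
        // URL (jdbc:hsqldb:file:...) and CREATE CACHED TABLE so rows live
        // on disk instead of the heap.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hsqldb:mem:csvdiff", "SA", "")) {
            load(con, "file1", Paths.get(args[0]));
            load(con, "file2", Paths.get(args[1]));

            // Keys present in file1 but missing from file2; swap the
            // tables for the opposite direction.
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT regn_nbr FROM file1 f1 WHERE NOT EXISTS " +
                     "(SELECT 1 FROM file2 f2 WHERE f2.regn_nbr = f1.regn_nbr)")) {
                while (rs.next()) {
                    System.out.println("missing in file2: " + rs.getString(1));
                }
            }
        }
    }

    // Streams the CSV with java.nio and bulk-inserts it; only the sample
    // columns are declared here, and the split is naive (no quoted commas).
    private static void load(Connection con, String table, Path csv) throws Exception {
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE " + table + " (regn_nbr VARCHAR(20) PRIMARY KEY, "
                    + "name VARCHAR(100), address1 VARCHAR(200), "
                    + "countrycode VARCHAR(5), regn_date VARCHAR(10))");
        }
        try (PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO " + table + " VALUES (?,?,?,?,?)");
             Stream<String> lines = Files.lines(csv)) {
            lines.skip(1).forEach(line -> {          // skip the header row
                String[] cols = line.split(",", -1);
                try {
                    for (int i = 0; i < 5; i++) {
                        ps.setString(i + 1, cols[i]);
                    }
                    ps.addBatch();
                } catch (SQLException e) {
                    throw new RuntimeException(e);
                }
            });
            ps.executeBatch();
        }
    }
}
```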

Michael Pralow
  • I used the in-memory HSQLDB to store and then process the files. It was very fast and efficient. Thanks Michael – Kenneth Sep 08 '16 at 04:34

I think the best way is to read the files and create two lists of a specific Java bean that represents the structure of your file. The bean can implement Comparable, and you can write a method that sorts and compares the lists with rules of your own.
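
A minimal sketch of that bean, assuming the five columns from the question (the class and field names are my own):

```java
// One row of the CSV; ordering by the record key lets two sorted lists
// be walked in a single merge-style pass.
public class Registration implements Comparable<Registration> {

    private final String regnNbr;
    private final String name;
    private final String address1;
    private final String countryCode;
    private final String regnDate;

    public Registration(String regnNbr, String name, String address1,
                        String countryCode, String regnDate) {
        this.regnNbr = regnNbr;
        this.name = name;
        this.address1 = address1;
        this.countryCode = countryCode;
        this.regnDate = regnDate;
    }

    @Override
    public int compareTo(Registration other) {
        return regnNbr.compareTo(other.regnNbr);   // sort by the record key
    }

    // Field-by-field comparison for two rows that share the same key.
    public boolean sameValues(Registration other) {
        return name.equals(other.name)
                && address1.equals(other.address1)
                && countryCode.equals(other.countryCode)
                && regnDate.equals(other.regnDate);
    }
}
```

After `Collections.sort(...)` on both lists, one merge-style pass can report keys missing on either side and call `sameValues` where the keys match. One caveat: holding two full lists on the heap works against the 2-4 GB constraint stated in the question.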

TeoVr81