
I have the following two requirements:

  1. Read a CSV file and insert its rows line by line into the database (RDBMS) without any data manipulation.
  2. Read a CSV file and load the data into the database (RDBMS), where row Z may depend on row B. This needs a staging database (in-memory or a separate staging RDBMS).

I am analyzing multiple ways to accomplish this:

  • Using core Java, reading the file in a producer-consumer fashion.
  • Using Apache Camel and BeanIO to read the CSV file.
  • Using SQL to read the file.

I wanted to know: is there an industry-preferred way of doing this kind of task?

I found a few links on Stack Overflow, but I am looking for more options.

I am using Java 6 for the implementation.

Anuj Mehra
  • I don't know what database you are using, but MySQL, for instance, has a `LOAD DATA INFILE` command which is very fast and can load a large CSV file into a table. Can you let us know which database you are using? – Tim Biegeleisen May 06 '16 at 10:15
  • DB2 has a `LOAD` command which is used to read and load well-formatted CSV files. You can write a script to do additional validation, etc. A disadvantage of these utilities is the absence of transaction management, etc. – Sabir Khan May 06 '16 at 10:19
  • RDBMS we are using is MSSQL 11.0 – Anuj Mehra May 06 '16 at 10:55
  • Also, should the logic to parse the CSV live in the Java layer or in the DB layer? Again, is there an industry-preferred way of doing it? – Anuj Mehra May 06 '16 at 10:57
  • The answer to "industry standard" is **always**, it depends. Are you loading one CSV file to a database? One CSV file a day to a database? 100,000 CSV files a second to a database? You need to provide more details before anyone can give you an "industry standard" answer. – Gilbert Le Blanc May 09 '16 at 15:19
  • We will be having 5-7 CSV files daily, each ranging between 250 MB and 2 GB. We will most probably use batch job processing to trigger the imports. We already have other batch jobs running on the server, so we have a time frame of about one hour to squeeze in these new imports. – Anuj Mehra May 12 '16 at 11:39
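Since the comments settle on SQL Server (MSSQL 11.0), one database-side option is T-SQL's `BULK INSERT`, which loads the file inside the engine. Below is a minimal sketch of driving it from Java over JDBC; the connection URL, credentials, file path, and target table `csv_rows` are assumptions for illustration only.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BulkInsertLoader {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details; the staging table csv_rows is assumed to exist
            Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=staging", "user", "password");
            Statement stmt = conn.createStatement();

            // BULK INSERT runs inside the database engine, so the file path must be
            // visible to the SQL Server machine, not to the Java process
            stmt.execute(
                    "BULK INSERT csv_rows " +
                    "FROM 'C:\\data\\input.csv' " +
                    "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2)");

            stmt.close();
            conn.close();
        }
    }

The trade-off, as noted in the comments, is that these bulk utilities give up fine-grained transaction management and row-level validation.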

3 Answers


You should use the NIO package for files in the GB range. NIO gives you channel-based, buffered I/O that is fast and reliable for large files. You can simply read the file in chunks via NIO and then insert into the DB using batch/bulk commands rather than single-row inserts. Single-row insertion takes a lot of CPU cycles and may cause OOM errors.
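A minimal sketch of that idea, reading the file in fixed-size chunks through a `FileChannel` and batching inserts over JDBC. The connection URL, the table `csv_rows`, and the three-column layout are assumptions, and the naive `split(",")` ignores quoted fields, which a real CSV parser would handle.

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.Charset;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class NioCsvLoader {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://localhost;databaseName=staging", "user", "password");
            conn.setAutoCommit(false);
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO csv_rows (col1, col2, col3) VALUES (?, ?, ?)");

            FileChannel channel = new FileInputStream("input.csv").getChannel();
            ByteBuffer buffer = ByteBuffer.allocate(1024 * 1024);   // 1 MB chunks
            Charset charset = Charset.forName("ISO-8859-1");        // single-byte charset assumed
            StringBuilder pending = new StringBuilder();            // carries a partial line between chunks
            int batched = 0;

            while (channel.read(buffer) != -1) {
                buffer.flip();
                pending.append(charset.decode(buffer));
                buffer.clear();

                int nl;
                while ((nl = pending.indexOf("\n")) >= 0) {          // process complete lines only
                    String line = pending.substring(0, nl).trim();
                    pending.delete(0, nl + 1);
                    if (line.length() == 0) {
                        continue;
                    }
                    String[] cols = line.split(",");                 // naive split, no quote handling
                    ps.setString(1, cols[0]);
                    ps.setString(2, cols[1]);
                    ps.setString(3, cols[2]);
                    ps.addBatch();
                    if (++batched % 1000 == 0) {                     // flush every 1000 rows
                        ps.executeBatch();
                        conn.commit();
                    }
                }
            }
            // a final line without a trailing newline would need the same handling here
            ps.executeBatch();                                       // flush the last partial batch
            conn.commit();

            channel.close();
            ps.close();
            conn.close();
        }
    }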

Ankur Soni

You can use `RandomAccessFile` for reading the CSV file. It gives you fast enough read speed and does not require any extra jar file. Here is the code:

    import java.io.File;
    import java.io.RandomAccessFile;

    File f = new File(System.getProperty("user.home") + "/Desktop/CSVDOC1.csv");
    RandomAccessFile ra = new RandomAccessFile(f, "r"); // read-only access is enough

    ra.seek(0); // start reading from the beginning of the file
    String d = ra.readLine();
    while (d != null) {
        // d holds one line, e.g. "col1","col2","col2","col3"

        // Separate the line values by the "," separator

        // Insert the row values into the database

        d = ra.readLine(); // read the next line; null at end of file
    }
    ra.close(); // release the file handle

We are using Apache Camel's file component (the "file:" endpoint) to read the files and process the data.
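A minimal sketch of what such a route could look like, assuming a local `data/inbox` directory and a hypothetical `CsvRowHandler` bean; the split is streamed so large files are not read into memory at once.

    import org.apache.camel.builder.RouteBuilder;

    public class CsvFileRoute extends RouteBuilder {
        @Override
        public void configure() {
            // Poll data/inbox for CSV files and move each processed file into a .done folder
            from("file:data/inbox?include=.*\\.csv&move=.done")
                // Split the file line by line; streaming avoids loading the whole file into memory
                .split(body().tokenize("\n")).streaming()
                // Hand each line to a bean that parses the columns and inserts the row
                .bean(CsvRowHandler.class, "insertRow");
        }

        // Placeholder bean: each split line arrives here as the message body
        public static class CsvRowHandler {
            public void insertRow(String line) {
                String[] cols = line.split(","); // naive split; BeanIO or camel-csv would handle quoting
                // insert cols into the database here
            }
        }
    }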

Anuj Mehra