
For years I've been parsing CSV files during my nightly batch jobs using the following logic without issue. Anyhow, I'm doing a full rewrite of the application, and I'm now wondering if there may be any performance / quality gains in using something like opencsv? I have no experience with other libraries, so I was hoping someone with experience could chime in.

while ((line = br.readLine()) != null) {
    String[] items = line.split(",");

    for (int i = 0; i < items.length; ++i) {
        // Remove extra quotes
        if (items[i].length() > 2) {
            items[i] = items[i].replaceAll("\"", "");
        }

        // Replace blank items with nulls
        if (items[i].matches("^\\s*$")) {
            items[i] = null;
        }
    }

    String item0 = items[0];
    String item1 = items[1];
}
Code Junkie

3 Answers


You won't gain any performance, but a library will help you deal with fields that have embedded commas. Microsoft's obnoxious solution of using double-quotes instead of escaping the commas is a pain to deal with by hand, and opencsv will handle all of that for you.
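
For example, here is a minimal sketch of the opencsv read loop (the file name is a placeholder, and the import path for CSVReader varies by opencsv version; older releases use the au.com.bytecode.opencsv package):

    // assumes java.io.FileReader plus the opencsv CSVReader import for your version
    CSVReader reader = new CSVReader(new FileReader("input.csv"));
    String[] items;
    while ((items = reader.readNext()) != null) {
        // a quoted field such as "Doe, John" comes back as a single element,
        // with the surrounding quotes already stripped
        String item0 = items[0];
        String item1 = items[1];
    }
    reader.close();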

chrylis -cautiouslyoptimistic-
  • Very good point, the double quotes are definitely ugly to deal with. I also noticed after reading through the API that you can pass a String array of property names into the CSV parser and then just cast the parsed object to your entity object (see the sketch after these comments). I really like that feature; I think it will help to create a single dynamic parser. – Code Junkie Oct 26 '13 at 05:11
  • Reading the CSV will be I/O bound, so not much improvement in run time is likely. But your current approach involves multiple passes through each line (and uses regular expressions); a well-written CSV parser should only require one pass through each line (and hopefully less CPU). Reducing CPU usage never does any harm. – Bruce Martin Oct 26 '13 at 06:32
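
A minimal sketch of the column-to-bean mapping mentioned in the comment above, assuming a hypothetical Person bean with firstName / lastName properties (class and column names are placeholders; the classes shown are from the opencsv 2.x API):

    // hypothetical bean with firstName / lastName properties and a no-arg constructor
    ColumnPositionMappingStrategy<Person> strategy = new ColumnPositionMappingStrategy<Person>();
    strategy.setType(Person.class);
    strategy.setColumnMapping(new String[] { "firstName", "lastName" });

    CsvToBean<Person> csvToBean = new CsvToBean<Person>();
    List<Person> people = csvToBean.parse(strategy, new CSVReader(new FileReader("input.csv")));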

The answer given by chrylis is right: you may not gain performance, but opencsv will handle all the cases for you.
But if you are really worried about performance, a little tweak in your code can help.
Looking at the code for String.split, which is

    public String[] split(String regex) {
        return split(regex, 0);
    }

    public String[] split(String regex, int limit) {
        return Pattern.compile(regex).split(this, limit);
    }

A new Pattern is compiled for each of your lines. The code for Pattern.compile is

    public static Pattern compile(String regex, int flags) {
        return new Pattern(regex, flags);
    }

The same Pattern creation is repeated again by

items[i].matches("^\\s*$")

So if your files have millions of lines, creating millions of Pattern objects can be a real overhead, so you can change your code to:

    Pattern pat = Pattern.compile(",");
    Pattern regexPattern = Pattern.compile("^\\s*$");
    while ((line = br.readLine()) != null)
    {
        String[] items = pat.split(line, 0);
        for (int i = 0; i < items.length; ++i)
        {
            // Remove extra quotes (no null check needed: split never returns null elements)
            if (items[i].length() > 2)
            {
                items[i] = items[i].replaceAll("\"", "");
            }
            // Replace blank items with nulls
            if (regexPattern.matcher(items[i]).matches()) {
                items[i] = null;
            }
        }
    }

The performance gain will not be visible with small files, but for big files, or if the same code runs over millions of files, you will see a significant performance improvement.
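
On the same theme, String.replaceAll also compiles its regex on every call, so the quote-stripping pattern can be pre-compiled as well; a minimal sketch of that variation:

    Pattern quotePattern = Pattern.compile("\"");
    // inside the loop, instead of items[i].replaceAll("\"", ""):
    items[i] = quotePattern.matcher(items[i]).replaceAll("");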

dbw
  • Thanks dbw, yes I process millions of lines, so any improvement will help. Currently I'm able to process around 1 million lines every two mins using 8 cores. I know at this point it's purely academic, but considering I'm doing a full rewrite, it's a great time to see what else is out there that may reduce overhead. – Code Junkie Oct 26 '13 at 05:35
  • @George one more thing I believe line items.length > 2 should be items[i].length() > 2 – dbw Oct 26 '13 at 05:41
  • @George We can also remove the null check from if (items[i] != null && ...) because the split result will never contain null elements. – dbw Oct 26 '13 at 06:10
  • I just fixed that code in my question, just in case anybody ever copies it. I'd hate to introduce a bug into their project. I've been playing with opencsv for the past hour, and I must say so far I really like it. Especially the header answer provided here: http://stackoverflow.com/questions/13505653/opencsv-how-to-map-selected-columns-to-java-bean-regardless-of-order I'm just concerned with its efficiency, but I do really like the fact that I'd be able to move my library of parsers to the database rather than 20 / 30 separate files. Any thoughts on performance? – Code Junkie Oct 26 '13 at 06:26
  • I just did a bench test against opencsv using near-production data, which consisted of both inserts and updates, a production file, database joins, and 1031936 records, multi-threaded over an i7 on my laptop in 104772. It looks like opencsv was approx 15 secs quicker in the same test. I believe that was nothing more than the inefficient code you pointed out. – Code Junkie Oct 26 '13 at 06:40
  • @George I don't know how you are mapping and applying inserts and updates from the production file, but still, a valid point on your side – dbw Oct 26 '13 at 06:45
  • String.split is really slow. The Guava Splitter class is faster. – tom Oct 28 '13 at 14:06
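
A minimal sketch of the Guava Splitter approach mentioned in the comment above (assumes Guava is on the classpath; com.google.common.base.Splitter):

    // a Splitter is immutable and can be reused across lines
    Splitter commaSplitter = Splitter.on(',');
    for (String item : commaSplitter.split(line)) {
        // process each field; note this does not handle quoted fields with embedded commas
    }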

To add to your options, consider the Jackson CsvMapper.

I parse 36 million rows out of around 4k files in 12 minutes using the Jackson CsvMapper on a MacBook Pro. That's using it to map directly to POJOs in some places and to read Object[] per line in others, while applying a huge amount of ancillary processing to normalise inputs.

It's also really easy to use:

as Object[]

    CsvMapper mapper = new CsvMapper();
    mapper.enable(CsvParser.Feature.WRAP_AS_ARRAY);
    File csvFile = new File("input.csv"); // or from String, URL etc
    MappingIterator<Object[]> it = mapper.reader(Object[].class).readValues(csvFile);

as POJOs

    public class CSVPerson{
      public String firstname;
      public String lastname;
      //etc
    }

    CsvMapper mapper = new CsvMapper();
    CsvSchema schema = CsvSchema.emptySchema().withHeader().withColumnSeparator(delimiter);
    MappingIterator<CSVPerson> it = mapper.reader(CSVPerson.class).with(schema).readValues(input);
    while (it.hasNext()){
      CSVPerson row = it.next();
    }

I'm always singing the praises of this library; it's great, and it's also really flexible.

tom
  • I never gave Jackson a thought, I'll have to look deeper into it. I'm assuming you were multi threading to achieve that number? I'm using the same machine. – Code Junkie Oct 28 '13 at 14:22
  • Nope, but that number doesn't include persisting to the database, which is where my (so far unresolved) bottleneck is. It's worth trying out against your other solutions imo. This answer about split() might help you a bit too: http://stackoverflow.com/questions/19356021/stringtokenizer-reading-lines-with-integers/19356088#19356088 – tom Oct 28 '13 at 14:29
  • I've just checked. I can parse rows into POJOs with CsvMapper quicker than I can parse with split() – tom Oct 28 '13 at 14:53
  • Yesterday, while using opencsv and 8 threads on a MacBook Pro, I was able to persist a little more than 10 million rows to the database in approx 20 mins. The table consisted of approx 20 columns along with several join tables requiring lookup. The only problem I faced was a data issue related to multithreading that is completely unrelated to this topic. – Code Junkie Oct 28 '13 at 15:20
  • I think my problem is I'm sticking the rows into a single instance of Mongo. Multithreading won't help there, unfortunately; for me the parsing is the quick bit and the writing to disk is the slow bit, even with an SSD. Anyway, I wrote up the tokenizer benchmarks if you're interested: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/ – tom Oct 28 '13 at 15:23
  • I'm not familiar with Mongo or any other flat file databases, but can you do batch inserts? – Code Junkie Oct 28 '13 at 17:20
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/40133/discussion-between-tom-and-george) – tom Oct 28 '13 at 17:35