
I am trying to read big CSV and TSV (tab-separated) files with about 1,000,000 rows or more. Now I tried to read a TSV containing ~2,500,000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files with ~250,000 lines. So I was wondering if there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?

For everybody who is interested in my code (I shortened it, so the try-catch is obviously invalid):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
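        // note: if replaceBackSlashes() throws, reader is still null here,
        // so this close() call itself throws a NullPointerException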
        reader.close();
    } 
}

Edit: This is the method where I construct the InputStreamReader:

private InputStreamReader replaceBackSlashes() throws Exception {
    FileInputStream fis = null;
    Scanner in = null;
    try {
        fis = new FileInputStream(this.csvFile);
        in = new Scanner(fis, this.encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        while (in.hasNext()) {
            String nextLine = in.nextLine().replace("\\", "/");
            // nextLine = nextLine.replaceAll(" ", "");
            nextLine = nextLine.replaceAll("'", "");
            out.write(nextLine.getBytes());
            out.write("\n".getBytes());
        }

        return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
    } catch (Exception e) {
        in.close();
        fis.close();
        this.logger.error("Problem at replaceBackSlashes", e);
    }
    throw new Exception();
}
Robin
  • Why don't you read it yourself with a BufferedReader? – Jean Logeart Dec 14 '12 at 13:58
  • Actually I wanted nicely crafted, commonly used code and I don't want to reinvent the wheel; that's the reason everybody is using libs, I think. But if nothing works, I will do so. – Robin Dec 14 '12 at 14:00
  • 2
    with that many rows I would look into processing the file in batches: Read n lines from the file, process with csv, read next batch etc. – opi Dec 14 '12 at 14:01
  • @opi Well this could be a solution, thanks. – Robin Dec 14 '12 at 14:04
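
A minimal sketch of the batch idea suggested by opi, assuming a plain BufferedReader; BATCH_SIZE and processBatch are hypothetical names, not part of any library mentioned here:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchedTsvRead {

    static final int BATCH_SIZE = 10000; // tune to your memory budget

    public static void main(String[] args) throws IOException {
        List<String> batch = new ArrayList<String>(BATCH_SIZE);
        BufferedReader r = new BufferedReader(new FileReader("data.tsv"));
        try {
            String line;
            while ((line = r.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                processBatch(batch); // leftover lines
            }
        } finally {
            r.close();
        }
    }

    // hypothetical consumer: split each line on tabs and do something with it
    static void processBatch(List<String> lines) {
        for (String line : lines) {
            String[] fields = line.split("\t");
            // ... handle fields ...
        }
    }
}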

4 Answers

Do not use a CSV parser to parse TSV inputs. It will break if the TSV has fields with a quote character, for example.
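
A quick illustration of the failure mode, using a CSV parser with the delimiter switched to tab (an intentional misuse for demonstration; the exact result is an assumption):

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

CsvParserSettings csvSettings = new CsvParserSettings();
csvSettings.getFormat().setDelimiter('\t');
CsvParser csvParser = new CsvParser(csvSettings);

// the quote character opens a quoted section, so the tab after it is read as
// field content rather than as a separator; this line likely yields 2 fields
// instead of the expected 3
String[] fields = csvParser.parseLine("1\t\"he said\t2");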

uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

Example to parse a TSV input:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

If your input is so big it can't be kept in memory, do this:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model": applies trim and lowercase to the values in these columns
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Jeronimo Backes

I have not tried it, but I investigated Super CSV earlier.

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

Check if it works for you with 2.5 million lines.
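
A minimal row-by-row sketch (assuming Super CSV's CsvListReader and CsvPreference API; the file name is made up). Reading one row at a time avoids the memory pressure of readAll(), and CsvPreference.TAB_PREFERENCE covers tab-separated input:

import java.io.FileReader;
import java.io.IOException;
import java.util.List;

import org.supercsv.io.CsvListReader;
import org.supercsv.io.ICsvListReader;
import org.supercsv.prefs.CsvPreference;

public class SuperCsvRead {
    public static void main(String[] args) throws IOException {
        // CsvPreference.STANDARD_PREFERENCE for commas,
        // CsvPreference.TAB_PREFERENCE for tab-separated files
        ICsvListReader reader = new CsvListReader(
                new FileReader("data.tsv"), CsvPreference.TAB_PREFERENCE);
        try {
            List<String> row;
            while ((row = reader.read()) != null) {
                // process one row at a time instead of readAll()
                System.out.println(row);
            }
        } finally {
            reader.close();
        }
    }
}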

RuntimeException
  • Thank you, I will have a look at this lib. – Robin Dec 14 '12 at 13:57
  • Thank you. `supercsv` handles `2 500 000` lines pretty nicely. – Robin Dec 14 '12 at 14:58
  • 3
    @Robin As a Super CSV developer I'm glad to hear this, though to be fair to opencsv, you're bound to run into (memory) issues if you use `reader.readAll()` instead of reading each line and doing something with it. Your `replaceBackslashes()` method could also run into issues as you're writing the whole file to memory. Was your NPE occuring when closing one of your streams/readers? – James Bassett Dec 15 '12 at 14:31
  • @HoundDog Now that I am switching from opencsv to Super CSV, I am quite happy with my decision, because Super CSV seems to be very well documented and widely used, so I think it was the right decision. What would be your recommendation for my `replaceBackSlashes()`? Yes, the NPE occurred when I tried to close the reader. – Robin Dec 16 '12 at 12:51
  • @HoundDog Thank you, I will try it, asap. – Robin Dec 17 '12 at 12:48
  • Does Super CSV support TSV files too? – andresp Jul 31 '15 at 08:09
  • CsvListReader breaks for TSV if it has ' or "; are any changes or workarounds needed? – Saurabh Mar 01 '16 at 07:10

Try switching libraries as suggested in the other answers. If that doesn't help, you have to split the whole file into tokens and process them.

Assuming that your CSV doesn't have any escaped commas:

// r is the BufferedReader pointed at your file
String line;
StringBuilder file = new StringBuilder();
// load each line and append it, separated by a comma so that the last field
// of one line doesn't merge with the first field of the next line
while ((line = r.readLine()) != null) {
    file.append(line);
    file.append(',');
}
// make them into an array
String[] tokens = file.toString().split(",");

Then you can process them. Don't forget to trim each token before using it.

Sri Harsha Chilakapati

I don't know if this question is still active, but here is the reader I use successfully. It may still need to implement more interfaces such as Stream or Iterable, however:

import java.io.Closeable;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Scanner;

/** Reader for the tab-separated values format (a basic table format without
 * escaping, where the values of a row are separated by tabulators). **/
public class TSVReader implements Closeable
{
    final Scanner in;
    String peekLine = null;

    /** Constructs a new TSVReader which produces values scanned from the specified input stream. */
    public TSVReader(InputStream stream) {in = new Scanner(stream);}

    /** Constructs a new TSVReader which produces values scanned from the specified file. */
    public TSVReader(File f) throws FileNotFoundException {in = new Scanner(f);}

    /** Returns true if there is another non-empty line; empty lines are skipped. */
    public boolean hasNextTokens()
    {
        if(peekLine != null) return true;
        if(!in.hasNextLine()) {return false;}
        String line = in.nextLine().trim();
        if(line.isEmpty()) {return hasNextTokens();}
        this.peekLine = line;
        return true;
    }

    /** Returns the values of the next row, or null if there are no more rows. */
    public String[] nextTokens()
    {
        if(!hasNextTokens()) return null;
        String[] tokens = peekLine.split("\t"); // split on tabs only, so fields may contain spaces
        peekLine = null;
        return tokens;
    }

    @Override public void close() throws IOException {in.close();}
}
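
A minimal usage sketch for this class (the file name is made up; TSVReader implements Closeable, so try-with-resources applies, and the surrounding method is assumed to throw IOException):

try (TSVReader reader = new TSVReader(new File("data.tsv"))) {
    while (reader.hasNextTokens()) {
        String[] row = reader.nextTokens(); // one row of the TSV
        System.out.println(java.util.Arrays.toString(row));
    }
}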
Konrad Höffner
  • Actually I am pretty satisfied with Super CSV. However, thanks for a natural implementation. – Robin Apr 03 '14 at 13:47