
I have noticed that using java.util.Scanner is very slow when reading large files (in my case, CSV files).

I want to change the way I am currently reading files, to improve performance. Below is what I have at the moment. Note that I am developing for Android:

InputStreamReader inputStreamReader;
try {
    inputStreamReader = new InputStreamReader(context.getAssets().open("MyFile.csv"));
    Scanner inputStream = new Scanner(inputStreamReader);
    inputStream.nextLine(); // Ignores the first line
    while (inputStream.hasNext()) {
        String data = inputStream.nextLine(); // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array

        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
    inputStream.close();
} catch (IOException e) {
    e.printStackTrace();
}

Using Traceview, I found that the main performance bottlenecks are java.util.Scanner.nextLine() and java.util.Scanner.hasNext().

I've looked at other questions (such as this one), and I've come across some CSV readers, like the Apache Commons CSV, but they don't seem to have much information on how to use them, and I'm not sure how much faster they would be.

I have also heard about using FileReader and BufferedReader in answers like this one, but again, I do not know whether the improvements will be significant.

My file is about 30,000 lines long. With the code above, it takes at least 1 minute to read values from about 600 lines down, so I have not timed how long it would take to read values from more than 2,000 lines down. Sometimes, while reading, the Android app becomes unresponsive and crashes.

Although I could simply change parts of my code and see for myself, I would like to know if there are any faster alternatives I have not mentioned, or if I should just use FileReader and BufferedReader. Would it be faster to split the huge file into smaller files, and choose which one to read depending on what information I want to retrieve? Preferably, I would also like to know why the fastest method is the fastest (i.e. what makes it fast).

Farbod Salamat-Zadeh
  • You might want to read [this](http://stackoverflow.com/questions/2231369/scanner-vs-bufferedreader) and [this](http://en.allexperts.com/q/Java-1046/2009/2/Difference-Scanner-Method-Buffered.htm) – DSlomer64 Jun 26 '15 at 20:36
  • FYI, I have a 140,000-word "dictionary" (just a list of words, actually) that processes really fast using a `Scanner`. But not on an Android device. I get that there's not a lot of performance difference among the three choices you're considering. But I'm no expert. – DSlomer64 Jun 26 '15 at 20:41
  • Try a BufferedReader – Buddy Jun 26 '15 at 20:53
  • make the reading operation a Callable and throw it in ExecutorService – Palcente Jun 26 '15 at 21:20
  • You can read millions of lines in a second or two with `BufferedReader`. 30,000 lines should be barely perceptible. There is no reason to use `Scanner` when you're only reading lines. – user207421 Jun 27 '15 at 06:48
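
The two suggestions in the comments above (read with `BufferedReader`, and run the read as a `Callable` on an `ExecutorService`) can be combined roughly as follows. This is a minimal sketch in plain Java, not Android-specific code: the class name `BackgroundCsvRead` and both method names are made up for illustration, and a `StringReader` stands in for the asset stream. It also keeps the naive comma split from the question, which only holds if no field contains a comma or a line break.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BackgroundCsvRead {

    // Skips the header line, then collects the second column of every row.
    // Naive split: assumes no field contains a comma or a line break.
    static List<String> readSecondColumn(Reader source) {
        List<String> values = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            reader.readLine(); // header
            String row;
            while ((row = reader.readLine()) != null) {
                String[] fields = row.split(",");
                if (fields.length > 1) {
                    values.add(fields[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    // Submits the read to a single worker thread and waits for the result.
    // Waiting with get() is only for demonstration: calling it on a UI thread
    // would block that thread, so an app would hand the result back through
    // a callback or handler instead.
    static List<String> readInBackground(final Reader source) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<List<String>> task = () -> readSecondColumn(source);
            Future<List<String>> future = executor.submit(task);
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while reading", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Read failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        Reader sample = new StringReader("id,name\n1,alpha\n2,beta\n");
        System.out.println(readInBackground(sample)); // prints [alpha, beta]
    }
}
```

On Android, the `Reader` would presumably be built from `context.getAssets().open(...)` as in the question; that part is left out here so the sketch runs on a desktop JVM.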

3 Answers


uniVocity-parsers has the fastest CSV parser you'll find (2x faster than OpenCSV, 3x faster than Apache Commons CSV), with many unique features.

Here's a simple example of how to use it:

// many options here, have a look at the tutorial
CsvParserSettings parserSettings = new CsvParserSettings();

CsvParser parser = new CsvParser(parserSettings);

// parses all rows in one go
List<String[]> allRows = parser.parseAll(new FileReader(new File("your/file.csv")));

To make the process faster, you can select the columns you are interested in:

parserSettings.selectFields("Column X", "Column A", "Column Y");

Normally, you should be able to parse 4 million rows in around 2 seconds. With column selection the speed will improve by roughly 30%.

It is even faster if you use a RowProcessor. There are many implementations out of the box for processing conversions to objects, POJOs, etc. The documentation explains all of the available features. It works like this:

// let's get the values of all columns using a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
parserSettings.setRowProcessor(rowProcessor);

//the parse() method will submit all rows to the row processor
parser.parse(new FileReader(new File("/examples/example.csv")));

//get the result from your row processor:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();

We also built a simple speed comparison project here.

Jeronimo Backes

Your code is fine for loading big files. However, when an operation may take longer than you expect, it is good practice to run it in a background task rather than on the UI thread, so that the app stays responsive.

The AsyncTask class helps to do that:

private class LoadFilesTask extends AsyncTask<String, Integer, Long> {
    protected Long doInBackground(String... str) {
        long lineNumber = 0;
        InputStreamReader inputStreamReader;
        try {
            inputStreamReader = new InputStreamReader(context.getAssets().open(str[0]));
            Scanner inputStream = new Scanner(inputStreamReader);
            inputStream.nextLine(); // Ignores the first line

            while (inputStream.hasNext()) {
                lineNumber++;
                String data = inputStream.nextLine(); // Gets a whole line
                String[] line = data.split(","); // Splits the line up into a string array

                if (line.length > 1) {
                    // Do stuff, e.g:
                    String value = line[1];
                }
            }
            inputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return lineNumber;
    }

    //If you need to show the progress use this method
    protected void onProgressUpdate(Integer... progress) {
        setYourCustomProgressPercent(progress[0]);
    }

    //This method is triggered at the end of the process, in your case when the loading has finished
    protected void onPostExecute(Long result) {
        showDialog("File Loaded: " + result + " lines");
    }
}

...and executing as:

new LoadFilesTask().execute("MyFile.csv");

Ciro Rizzo
  • What's the difference between a 'task' and a 'UI thread' and how do they work? Also, the `doInBackground` method seems to return a `Long` - I am new to using `AsyncTask` so I was wondering how that method works too. Finally, when you do `context.getAssets().open(str[0])`, are you opening line 0 in the file, or are you somehow referring to the file you are reading from? Since I haven't used AsyncTask before, could you explain how it works and what each part of the code does? Thanks. – Farbod Salamat-Zadeh Jun 26 '15 at 21:32
  • Also, would I still use `Scanner`, or would it be better to change to `BufferedReader` (in your example). – Farbod Salamat-Zadeh Jun 26 '15 at 21:37
  • The UI thread is the main thread your app runs on; it usually drives the user interface (views, layouts, handling of user interactions with your app). An AsyncTask is basically a thread that runs in the background, in parallel with the UI thread, with little or no interaction with the UI, so it can carry out long operations without blocking the user interface or reducing responsiveness. The snippet I wrote is just an example; you can refer to the official documentation for details: http://developer.android.com/reference/android/os/AsyncTask.html – Ciro Rizzo Jun 26 '15 at 21:59
  • Parsing CSV by hand simply using **String.split(",")** is the worst advice you can give for reading CSV reliably. It will break if any value contains the comma character. Use a CSV parser for that. Also, **readLine** will cause problems with values that contain the '\n' character. – Jeronimo Backes Jun 27 '15 at 05:36
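
To illustrate the point raised in the last comment, here is a small sketch of how `String.split(",")` miscounts fields once a quoted value contains a comma, next to a hand-written splitter that tracks quote state. The class `QuotedCsvLine` and the method `splitRecord` are hypothetical names. The splitter only handles a single record already held in one string, so it does not solve the line-break-inside-a-value problem, and it is not a substitute for a real CSV parser.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedCsvLine {

    // Splits one CSV record. Commas inside double quotes are kept as data,
    // and a doubled quote ("") inside a quoted field becomes one literal quote.
    static List<String> splitRecord(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    boolean escaped = i + 1 < record.length() && record.charAt(i + 1) == '"';
                    if (escaped) {
                        current.append('"');
                        i++; // consume the second quote of the pair
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }

    public static void main(String[] args) {
        String record = "1,\"Smith, John\",London";
        System.out.println(record.split(",").length);    // prints 4
        System.out.println(splitRecord(record).size());  // prints 3
    }
}
```

The naive split reports four fields because it cuts `"Smith, John"` in two, while the quote-aware version returns the three fields the record actually contains.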

You should use a BufferedReader instead:

BufferedReader reader = null;
try {
    reader = new BufferedReader(new InputStreamReader(context.getAssets().open("MyFile.csv")));
    reader.readLine(); // Ignores the first line
    String data;
    while ((data = reader.readLine()) != null) { // Gets a whole line
        String[] line = data.split(","); // Splits the line up into a string array
        if (line.length > 1) {
            // Do stuff, e.g:
            String value = line[1];
        }
    }
} catch (IOException e) {
    e.printStackTrace();
} finally {
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        } 
    } 
}
  • You should not use **readLine** at all as it will not handle values with line separators. Also, **String.split(",")** won't work with values that contain the comma character. Use a CSV parser for CSV instead of hand-coding a parser. – Jeronimo Backes Jun 27 '15 at 05:39
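
Since the question asks whether the switch is worth it, one way to check is to time both approaches on the same input. The sketch below is a rough, unscientific comparison: `LineReadTiming` and its methods are illustrative names, the data is generated in memory rather than read from an Android asset, and a single `System.nanoTime()` measurement ignores JIT warm-up, so the numbers are only indicative. As far as I know, `Scanner` locates line ends through its regular-expression machinery, while `BufferedReader.readLine()` scans a character buffer for line terminators, which is the usual explanation for the gap.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

public class LineReadTiming {

    // Builds an in-memory CSV with the given number of rows (assumed sample data).
    static String buildSample(int rows) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rows; i++) {
            sb.append(i).append(",value").append(i).append('\n');
        }
        return sb.toString();
    }

    // Counts lines using Scanner.hasNextLine()/nextLine().
    static int countWithScanner(String text) {
        int count = 0;
        try (Scanner scanner = new Scanner(new StringReader(text))) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
                count++;
            }
        }
        return count;
    }

    // Counts lines using BufferedReader.readLine().
    static int countWithBufferedReader(String text) {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            while (reader.readLine() != null) {
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = buildSample(30_000); // same order of size as the file in the question

        long start = System.nanoTime();
        int scannerLines = countWithScanner(sample);
        long scannerMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        int readerLines = countWithBufferedReader(sample);
        long readerMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Scanner:        " + scannerLines + " lines in " + scannerMs + " ms");
        System.out.println("BufferedReader: " + readerLines + " lines in " + readerMs + " ms");
    }
}
```

Running the same two loops against the real asset stream on a device would give figures that matter for the app; the in-memory version only shows the relative cost of the two line-reading strategies.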