3

I have a question about optimizing my code (which works, but is too slow...). I am reading input in the form

X1 Y1
X2 Y2
etc

where Xi, Yi are integers. I am using a BufferedReader to read lines and then a StringTokenizer to parse the numbers, like this:

StringTokenizer st = new StringTokenizer(line, " ");

int x = Integer.parseInt(st.nextToken());
int y = Integer.parseInt(st.nextToken());

The problem is that this approach seems time-inefficient when coping with large data sets. Could you suggest some simple improvement (I have heard that some custom integer parsing or regex can be used) that would improve the performance? Thanks for any tips.

EDIT: Perhaps I misjudged this and some improvements have to be made elsewhere in the code...

Smajl
  • Why does it feel time-inefficient? Have you profiled your code and found that this is the bottleneck? You can use `Scanner.nextInt()` to save some lines of code, but it won't be any faster (see the sketch after these comments). – Hari Menon Oct 14 '13 at 08:21
  • Perhaps the bottleneck is not in this part of the code (it was my guess), but I will try to optimize the code in other pieces as well – Smajl Oct 14 '13 at 08:22
  • How large are the datasets? – Lodewijk Bogaards Oct 14 '13 at 08:24
  • You should definitely exhaust other areas where performance can be improved first. There isn't much you can do about I/O. You could try to store the data in some different serialized format like JSON or Avro, but if you have to read from this format only, there isn't much you can do. Another option is to try to parallelize it by splitting the input into more files and having each thread process one file. – Hari Menon Oct 14 '13 at 09:35
  • 4
    The bottlneck (and core of my problem) lay somewhere else than in the piece of code I shown in my question - I wronglz classified my problem... – Smajl Oct 15 '13 at 11:37
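
As a footnote to the comments above, here is a minimal sketch of the Scanner.nextInt() approach; the file name input.txt is a placeholder, and as noted it saves lines of code rather than time:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class ScannerSketch {

    public static void main(String[] args) throws IOException {
        // Scanner tokenizes on any whitespace, so the X/Y pairs can be read
        // directly without splitting each line by hand.
        try (Scanner sc = new Scanner(new BufferedReader(new FileReader("input.txt")))) {
            while (sc.hasNextInt()) {
                int x = sc.nextInt();
                int y = sc.nextInt();
                // process x and y here
            }
        }
    }
}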

2 Answers

2

(updated answer)

I can say that whatever the problems with your program's speed, the choice of tokenizer is not one of them. After an initial run of each method to even out initialisation quirks, I can parse 1,000,000 rows of "12 34" in milliseconds. You could switch to using indexOf if you like, but I really think you need to look for the bottleneck in other bits of the code rather than in this micro-optimisation. Split was a surprise for me - it's really, really slow compared to the other methods. I've added in a Guava split test; it's faster than String.split but slightly slower than StringTokenizer.

  • Split: 371ms
  • IndexOf: 48ms
  • StringTokenizer: 92ms
  • Guava Splitter.split(): 108ms
  • CsvMapper (build a CSV doc and parse it into POJOs): 237ms (or 175ms if you build the lines into one doc!)

The difference here is pretty negligible even over millions of rows.

There's now a write up of this on my blog: http://demeranville.com/battle-of-the-tokenizers-delimited-text-parser-performance/

The code I ran was:

import java.util.StringTokenizer;
import org.junit.Test;

// Crude microbenchmark: each method parses the line "12 34" RUNS times.
public class TestSplitter {

private static final String line = "12 34";
private static final int RUNS = 1000000;

public final void testSplit() {
    long start = System.currentTimeMillis();
    for (int i=0;i<RUNS;i++){
        String[] st = line.split(" ");
        int x = Integer.parseInt(st[0]);
        int y = Integer.parseInt(st[1]);
    }
    System.out.println("Split: "+(System.currentTimeMillis() - start)+"ms");
}

public final void testIndexOf() {
    long start = System.currentTimeMillis();
    for (int i=0;i<RUNS;i++){
        int index = line.indexOf(' ');
        int x = Integer.parseInt(line.substring(0,index));
        int y = Integer.parseInt(line.substring(index+1));
    }       
    System.out.println("IndexOf: "+(System.currentTimeMillis() - start)+"ms");      
}

public final void testTokenizer() {
    long start = System.currentTimeMillis();
    for (int i=0;i<RUNS;i++){
        StringTokenizer st = new StringTokenizer(line, " ");
        int x = Integer.parseInt(st.nextToken());
        int y = Integer.parseInt(st.nextToken());
    }
    System.out.println("StringTokenizer: "+(System.currentTimeMillis() - start)+"ms");
}

@Test
public final void testAll() {
    // Run each test twice: the first pass warms up the JVM, the second
    // gives the representative timings quoted above.
    this.testSplit();
    this.testIndexOf();
    this.testTokenizer();
    this.testSplit();
    this.testIndexOf();
    this.testTokenizer();
}

}

ETA: here's the Guava code:

// Requires Guava's com.google.common.base.Splitter and java.util.Iterator.
public final void testGuavaSplit() {
    long start = System.currentTimeMillis();
    Splitter split = Splitter.on(" ");
    for (int i=0;i<RUNS;i++){
        Iterator<String> it = split.split(line).iterator();
        int x = Integer.parseInt(it.next());
        int y = Integer.parseInt(it.next());
    }
    System.out.println("GuavaSplit: "+(System.currentTimeMillis() - start)+"ms");
}

Update:

I've added in a CsvMapper test too:

// Requires the Jackson CSV module (jackson-dataformat-csv) for CsvMapper,
// CsvSchema and ColumnType, plus jackson-databind for MappingIterator.
public static class CSV{
    public int x;
    public int y;
}

public final void testJacksonSplit() throws JsonProcessingException, IOException {
    CsvMapper mapper = new CsvMapper();
    CsvSchema schema = CsvSchema.builder().addColumn("x", ColumnType.NUMBER).addColumn("y", ColumnType.NUMBER).setColumnSeparator(' ').build();

    long start = System.currentTimeMillis();
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < RUNS; i++) {
        builder.append(line);
        builder.append('\n');
    }       
    String input = builder.toString();
    MappingIterator<CSV> it = mapper.reader(CSV.class).with(schema).readValues(input);
    while (it.hasNext()){
        CSV csv = it.next();
    }
    System.out.println("CsvMapperSplit: " + (System.currentTimeMillis() - start) + "ms");
}
tom
  • line.split is highly inefficient - it is circa 1.5 times slower! – Smajl Oct 14 '13 at 08:23
  • Really? That's a real surprise, tbh. Have you run tests on it? I'm going to write a test case. – tom Oct 14 '13 at 08:24
  • split(" ") is not that great if you are dealing with really large data. Using this method results in loading all data into memory, which may cause performance issues. Of course I don't know whether it is the case here, because I haven't seen data from the profiler :). – Paperback Writer Oct 14 '13 at 08:29
  • I meant split each line :) indexOf is the fastest, I think. I'll write a test – tom Oct 14 '13 at 08:33
  • Could you provide an example of this idea? I am trying to think of anything to make it run just a bit faster... thanks! – Smajl Oct 14 '13 at 09:20
  • are they always 2 digit numbers? – tom Oct 14 '13 at 13:28
  • 1
    In the indexOf method can you move the indexOf calculation in a single statement? int index = line.indexOf(' '); – Fedy2 Oct 14 '13 at 13:58
  • 2
    One more "crazy" solution, with some limitations and not well tested: http://pastebin.com/9Cf2ZVT0 – Fedy2 Oct 14 '13 at 14:42
  • 1
    wow. Just ran that through and it came out at 15ms on the same box as the other tests! 3x faster, 50x less readable :D – tom Oct 14 '13 at 14:46
  • Could you explain the "crazy" solution? It would be really useful for me. – Carlos Vega Apr 18 '14 at 01:34
  • @CarlosVega The "crazy" solution is built specifically to parse this exact file format: two loops for the two data points. Each loop reads its number back to front and, rather than isolating the substring and calling a parse function, keeps a running total, since the power-of-ten multiplier for each character is known from its position relative to where the loop started. Clever shortcuts - breaking the loop when the sign character (+/-) is read, checking for the space only in the first loop, accumulating the total as it steps - make it about as fast as it can be. Very nice, Fedy2. (A sketch of the idea follows.) – Gerard May 19 '17 at 14:38
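
For readers without access to the pastebin, here is a minimal reconstruction of the hand-rolled parsing idea as Gerard describes it - not Fedy2's exact code - which walks each number back to front and accumulates digit * power instead of allocating substrings for Integer.parseInt (it assumes a well-formed "X Y" line):

// Parses a line of the form "X Y" without allocating substrings.
static int[] parsePair(String line) {
    int y = 0;
    int pow = 1;
    int i = line.length() - 1;
    // Second number: walk right to left until the separating space (or a sign).
    for (; i >= 0; i--) {
        char c = line.charAt(i);
        if (c == ' ') break;
        if (c == '-') { y = -y; break; }
        y += (c - '0') * pow;
        pow *= 10;
    }
    // If we stopped on a minus sign, keep walking back to the space.
    while (i >= 0 && line.charAt(i) != ' ') i--;
    i--; // step past the space
    // First number: same back-to-front accumulation.
    int x = 0;
    pow = 1;
    for (; i >= 0; i--) {
        char c = line.charAt(i);
        if (c == '-') { x = -x; break; }
        x += (c - '0') * pow;
        pow *= 10;
    }
    return new int[] { x, y };
}

For the line "12 34" this returns {12, 34}; avoiding the substring allocations of the indexOf version is presumably where the roughly 3x speedup reported above comes from.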
0

You could use a regex to check that the token is numeric before converting it to an integer:

String token = st.nextToken();   // read the token once, then validate and parse it
if (token.matches("^[0-9]+$")) {
    int x = Integer.parseInt(token);
}
Jhanvi