
Question

Suppose you have a large ASCII text file, with a random non-negative integer on each line, each in the range from 0 to 1,000,000,000. There are 100,000,000 lines in the file. What's the fastest way to read through the file and calculate the sum of all the integers?

Constraint: we've got 10MB of RAM to work with. The file is 1GB in size, so we don't want to read the whole thing in and then process it.

Here are various solutions I've tried. I found the results rather surprising.

Is there anything faster that I've missed?

Please note: all timings given below are for running the algorithm 10 times in total (run once and discard; start timer; run 10 times; stop timer). The machine is a fairly slow Core 2 Duo.

Method 1: the natural approach

The first thing to try is the obvious approach:

private long sumLineByLine() throws NumberFormatException, IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    long total = 0;
    while ((line = br.readLine()) != null) {
        int k = Integer.parseInt(line);
        total += k;
    }
    br.close();
    return total;
}

Note that the maximum possible return value is 100,000,000 × 1,000,000,000 = 10^17, which still easily fits in a long (whose maximum is about 9.2 × 10^18), so we don't have to worry about overflows.

On my machine, running this 11 times and discounting the first run takes around 92.9 seconds.

Method 2: a minor tweak

Inspired by a comment on this question, I tried not creating a new int k to store the result of parsing the line, and instead just adding the parsed value directly to total. So this:

    while ((line = br.readLine()) != null) {
        int k = Integer.parseInt(line);
        total += k;
    }

becomes this:

    while ((line = br.readLine()) != null)
        total += Integer.parseInt(line);

I was certain that this wouldn't make any difference, and thought it highly likely that the compiler would generate the same bytecode for the two versions. But, to my surprise, it did shave a little time off: we're down to 92.1 seconds.

Method 3: manually parsing the integer

One thing that bothers me about the code so far is that we turn the String into an int, and then add it on at the end. Might it not be quicker to add on as we go? What happens if we parse the String ourselves? Something like this...

private long sumLineByLineManualParse() throws NumberFormatException,
        IOException {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    long total = 0;
    while ((line = br.readLine()) != null) {
        char chs[] = line.toCharArray();
        int mul = 1;
        for (int i = chs.length - 1; i >= 0; i--) {
            char c = chs[i];
            switch (c) {
            case '0':
                break;
            case '1':
                total += mul;
                break;
            case '2':
                total += (mul << 1);
                break;
            case '4':
                total += (mul << 2);
                break;
            case '8':
                total += (mul << 3);
                break;
            default:
                total += (mul*((byte) c - (byte) ('0')));   
            }
            mul*=10;
        }
    }
    br.close();
    return total;
}

This, I thought, might save a little time, especially with some bitshift optimisations for doing the multiplication. But the overheads of converting to a character array must swamp any gains: this now takes 148.2 seconds.

Method 4: processing in binary

One last thing we can try is to process the file as binary data.

Parsing an integer from the front is awkward if you don't know the length of it. Parsing it backwards is much easier: the first digit you encounter is units, the next one is tens, and so on. So the easiest way to approach the whole thing is to read the file backwards.

If we allocate a byte[] buffer of (say) 8MB, we can fill it up with the last 8MB of the file, process it, then read the preceding 8MB, and so on. We need to be a little careful that we don't screw up a number that we're in the middle of parsing when we move to the next block, but that's the only problem.

When we encounter a digit, we add it (suitably multiplied according to its position in the numeral) to the total, and then multiply the coefficient by 10 so we're ready for the next digit. If we encounter anything that isn't a digit (a CR or LF), we just reset the coefficient.

private long sumBinary() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[8*1024*1024];
    int mul = 1;
    long total = 0;
    while (lastRead>0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead-len);
        raf.readFully(buf, 0, len);
        lastRead-=len;
        for (int i=len-1; i>=0; i--) {
            //48 is '0' and 57 is '9'
            if ((buf[i]>=48) && (buf[i]<=57)) {
                total+=mul*(buf[i]-48);
                mul*=10;
            } else
                mul=1;
        }
    }
    raf.close();
    return total;
}

This runs in 30.8 seconds! That's a speed increase by a factor of 3 over the previous best.

Follow-up questions

  1. Why is this so much faster? I was expecting it to win, but not quite so impressively. Is it mainly the overheads of converting to a String? And all the worrying behind the scenes about character sets and the like?
  2. Can we do any better than this by using a MappedByteBuffer to help? I have a feeling that the overheads of invoking methods to read from the buffer would slow things down, especially when reading backwards from the buffer.
  3. Would it be better to read the file forwards rather than backwards, but still scan the buffer backwards? The idea would be that you read the first chunk of the file, and then scan backwards, but discarding the half-number at the end. Then when you read the next chunk, you set the offset so that you read from the beginning of the number you discarded.
  4. Is there anything I haven't thought of that could make a significant difference?

Update: more surprising results

First, an observation. It should have occurred to me before, but I think the reason for the inefficiency of the String-based reading is not so much the time taken to create all the String objects but the fact that they are so short-lived: we've got 100,000,000 of them for the garbage collector to deal with. That is bound to upset it.

Now some experiments based on answers/comments people have posted.

Am I cheating with the size of the buffer?

One suggestion was that since a BufferedReader uses a default buffer of 16KB, and I've used a buffer of 8MB, I'm not comparing like with like. It's bound to be faster if you use a bigger buffer.

Here's the shock. The sumBinary() method (Method 4) ran in 30.8 seconds yesterday with an 8MB buffer. Today, code unchanged, the wind direction has changed and we're at 30.4 seconds. If I drop the buffer size down to 16KB to see how much slower it gets, it gets faster! It now runs in 23.7 seconds. Crazy. Who saw that one coming?!

A bit of experimentation suggests that 16KB is about optimal. Perhaps the Java guys did the same experiments, and that's why they went with 16KB!

Is the problem I/O-bound?

I wondered about this too. How much time is spent on disk access, and how much on number crunching? If it's almost all disk access, as suggested by a well-supported comment on one of the proposed answers, then we won't be able to make much improvement whatever we do.

This is easy to test by running the code with all the parsing and number crunching commented out, but with the reading still intact:

private long sumBinary() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int mul = 1;
    long total = 0;
    while (lastRead > 0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead - len);
        raf.readFully(buf, 0, len);
        lastRead -= len;
        /*for (int i = len - 1; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57)) {
                total += mul * (buf[i] - 48);
                mul *= 10;
            } else
                mul = 1;
        }*/
    }
    raf.close();
    return total;
}

This now runs in 3.7 seconds! This doesn't look I/O-bound to me.

Of course, some of the I/O speed will come from disk cache hits. But that isn't really the point here: we're still taking 20 seconds of CPU time (also confirmed using Linux's time command), which is plenty big enough to try to reduce it.

Scanning forwards instead of backwards

I'd maintained in my original post that there was good reason to scan the file backwards rather than forwards. I didn't explain that very well. The idea was that if you scan a number forwards, you have to accumulate the total value of the scanned number, and then add it on. If you scan backwards, you can add it to the cumulative total as you go. My subconscious was making some sort of sense to itself (on which more later), but I'd missed one key point, which was pointed out in one of the answers: to scan backwards, I was doing two multiplications per iteration, but with scanning forwards you need only one. So I coded up a forward-scanning version:

private long sumBinaryForward() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int fileLength = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int acc = 0;
    long total = 0;
    int read = 0;
    while (read < fileLength) {
        int len = Math.min(buf.length, fileLength - read);
        raf.readFully(buf, 0, len);
        read += len;
        for (int i = 0; i < len; i++) {
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                total += acc;
                acc = 0;
            }
        }
    }
    raf.close();
    return total;
}

This runs in 20.0 seconds, beating the backward-scanning version by a distance. Nice.

Multiplication cache

What I realised during the night, though, was that although I was performing two multiplications per iteration, there was the possibility of using a cache to store these multiplications, so that I could avoid having to perform them during backwards iteration. I was pleased to see when I woke up that someone had had the same idea!

The point is that there are at most 10 digits in the numbers we're scanning, and only 10 possible digits, so there are only 100 possible values for a digit's contribution to the cumulative total. We can precompute these, and then use them in the backward-scanning code. That ought to beat the forward-scanning version, because we've now got rid of the multiplications entirely. (Note that we can't do this with forward scanning, because the multiplication is of the accumulator, which could take any value up to 10^9. It's only in the backward case that both operands are limited to a few possibilities.)

private long sumBinaryCached() throws IOException {
    int mulCache[][] = new int[10][10];
    int coeff = 1;
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++)
            mulCache[i][j] = coeff * j;
        coeff *= 10;
    }

    RandomAccessFile raf = new RandomAccessFile(file, "r");
    int lastRead = (int) raf.length();
    byte buf[] = new byte[16 * 1024];
    int mul = 0;
    long total = 0;
    while (lastRead > 0) {
        int len = Math.min(buf.length, lastRead);
        raf.seek(lastRead - len);
        raf.readFully(buf, 0, len);
        lastRead -= len;
        for (int i = len - 1; i >= 0; i--) {
            if ((buf[i] >= 48) && (buf[i] <= 57))
                total += mulCache[mul++][buf[i] - 48];
            else
                mul = 0;
        }
    }
    raf.close();
    return total;
}

This runs in 26.1 seconds. Disappointing, to say the least. Reading backwards is less efficient in terms of I/O, but we've seen that I/O is not the major headache here. I had expected this to make a big positive difference. Perhaps the array lookup is just as expensive as the multiplications we've replaced. (I did try making the array 16x16, and using bitshifts to index, but it didn't help.)
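
For reference, the 16x16 variant was along these lines (a sketch; the table is flattened so that the index is a single shift-and-OR):

// Flat 16x16 table; entry (pos, digit) lives at index (pos << 4) | digit.
// (long entries just to sidestep int overflow in the high, unused slots.)
long mulCacheFlat[] = new long[16 * 16];
long coeff = 1;
for (int pos = 0; pos < 10; pos++) {
    for (int digit = 0; digit < 10; digit++)
        mulCacheFlat[(pos << 4) | digit] = coeff * digit;
    coeff *= 10;
}

// ...and the lookup in the scanning loop becomes:
//     total += mulCacheFlat[(mul++ << 4) | (buf[i] - 48)];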

Looks like forward scanning is where it's at.

Using a MappedByteBuffer

Next thing to add in is a MappedByteBuffer, to see if that's more efficient than using a raw RandomAccessFile. It doesn't need much change to the code.

private long sumBinaryForwardMap() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    byte buf[] = new byte[16 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
            fileLength);
    int acc = 0;
    long total = 0;
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        for (int i = 0; i < len; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                total += acc;
                acc = 0;
            }
    }
    ch.close();
    raf.close();
    return total;
}

This does seem to improve things a little: we're now at 19.0 seconds. We've taken another second off our personal best!

What about multi-threading?

One of the proposed answers involves using multiple cores. I'm a little ashamed that that hadn't occurred to me!

The answer came in for some stick, because of the assumption that it's an I/O-bound problem. This seems a little harsh, in light of the results about I/O! Certainly worth a try, in any case.

We'll do this using fork/join. Here's a class to represent the result of a computation on part of the file, bearing in mind that there might be a partial result to the left (if we started half way through a number), and a partial result to the right (if the buffer finished half way through a number). The class also has a method for allowing us to glue two such results together, into a combined result for two adjacent sub-tasks.

private class SumTaskResult {
    long subtotal;
    int leftPartial;
    int leftMulCount;
    int rightPartial;

    public void append(SumTaskResult rightward) {
        subtotal += rightward.subtotal + rightPartial
                * rightward.leftMulCount + rightward.leftPartial;
        rightPartial = rightward.rightPartial;
    }
}
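
For example, suppose one chunk ends mid-number with the digits 457 and the next chunk begins 23 followed by a newline. Then the left result has rightPartial = 457, the right result has leftPartial = 23 and leftMulCount = 100, and append() adds 457 * 100 + 23 = 45,723, the original number, to the combined subtotal (and then inherits the right-hand result's rightPartial).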

Now the key bit: the RecursiveTask that computes the result. For small problems (less than 64 characters), it calls computeDirectly() to calculate the result in a single thread; for larger problems, it splits into two, solves the two sub-problems in separate threads, and then combines the results.

private class SumForkTask extends RecursiveTask<SumTaskResult> {

    private byte buf[];
    // startPos inclusive, endPos exclusive
    private int startPos;
    private int endPos;

    public SumForkTask(byte buf[], int startPos, int endPos) {
        this.buf = buf;
        this.startPos = startPos;
        this.endPos = endPos;
    }

    private SumTaskResult computeDirectly() {
        SumTaskResult result = new SumTaskResult();
        int pos = startPos;

        result.leftMulCount = 1;

        while ((buf[pos] >= 48) && (buf[pos] <= 57)) {
            result.leftPartial = result.leftPartial * 10 + buf[pos] - 48;
            result.leftMulCount *= 10;
            pos++;
        }

        int acc = 0;
        for (int i = pos; i < endPos; i++)
            if ((buf[i] >= 48) && (buf[i] <= 57))
                acc = acc * 10 + buf[i] - 48;
            else {
                result.subtotal += acc;
                acc = 0;
            }

        result.rightPartial = acc;
        return result;
    }

    @Override
    protected SumTaskResult compute() {
        if (endPos - startPos < 64)
            return computeDirectly();
        int mid = (endPos + startPos) / 2;
        SumForkTask left = new SumForkTask(buf, startPos, mid);
        left.fork();
        SumForkTask right = new SumForkTask(buf, mid, endPos);
        SumTaskResult rRes = right.compute();
        SumTaskResult lRes = left.join();
        lRes.append(rRes);
        return lRes;
    }

}

Note that this is operating on a byte[], rather than the whole MappedByteBuffer. The reason for that is that we want to keep the disk access sequential. We'll take quite large chunks, fork/join, and then move to the next chunk.

Here's the method that does that. Note that we've pushed the buffer size up to 1MB (sub-optimal earlier, but more sensible here, it seems).

private long sumBinaryForwardMapForked() throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    ForkJoinPool pool = new ForkJoinPool();

    byte buf[] = new byte[1 * 1024 * 1024];
    final FileChannel ch = raf.getChannel();
    int fileLength = (int) ch.size();
    final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0,
            fileLength);
    SumTaskResult result = new SumTaskResult();
    while (mb.hasRemaining()) {
        int len = Math.min(mb.remaining(), buf.length);
        mb.get(buf, 0, len);
        SumForkTask task = new SumForkTask(buf, 0, len);
        result.append(pool.invoke(task));
    }
    ch.close();
    raf.close();
    pool.shutdown();
    return result.subtotal;
}

Now here's the soul-destroying disappointment: this nicely multi-threaded code now takes 32.2 seconds. Why so slow? I spent quite a while debugging this, assuming I'd done something terribly wrong.

Turns out there was just one small tweak needed. I'd thought the threshold of 64 between small problem and big problem was a reasonable one; turns out that was totally ridiculous.

Think about it like this. The sub-problems are exactly the same size, so they should complete in pretty much the same time. So there's really no point splitting into more pieces than there are processors available. On the machine I'm using, with only two cores, going down to a threshold of 64 is ridiculous: it just adds more overhead.

Now you don't want to limit things so that it only uses two cores even when there are more available. Perhaps the right thing to do would be to find out the number of processors at runtime, and split into that many pieces.

In any case, if I change the threshold to 512KB (half the buffer size), it now completes in 13.3 seconds. Going down to 128KB or 64KB would allow more cores to be used (up to 8 or 16 respectively), and doesn't significantly affect the runtime.
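
For reference, here's a sketch of deriving the threshold from the core count at runtime rather than hard-coding it (just a sketch, not benchmarked):

// Aim for a few pieces per core, so all processors stay busy without
// creating a silly number of tiny sub-tasks.
int processors = Runtime.getRuntime().availableProcessors();
int bufSize = 1 * 1024 * 1024;
int threshold = Math.max(64 * 1024, bufSize / (processors * 4));

// ...and then in compute():
//     if (endPos - startPos < threshold)
//         return computeDirectly();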

So multi-threading does make a big difference.

It's been quite a long journey, but we started out with something that took 92.9 seconds and we're now down to 13.3 seconds... that's seven times the speed of the original code. And that's not by improving the asymptotic (big-Oh) time complexity, which was linear (optimal) right from the start... this has all been about improving the constant factor.

A good day's work.

I suppose I should probably try using the GPU next...

Postscript: generating the file of random numbers

I generated the random numbers with the following code, which I ran and redirected to a file. Obviously I can't guarantee that you'll end up with exactly the same random numbers that I had :)

public static void genRandoms() {
    Random r = new Random();
    for (int i = 0; i < 100000000; i++)
        System.out.println(r.nextInt(1000000000));
}
chiastic-security
  • It is faster because you made the buffer 8mb, where the `BufferedReader` uses 16k. Even if reading a file backward would be much more efficient (it isn't!), you would gain much more by just increasing the buffer in the reader to the same level. You're comparing apples&oranges. – Thomas Jungblut Sep 01 '14 at 13:27
  • And it would probably also be more efficient to wrap FileInputStream into BufferedInputStream, then creating an InputStreamReader on top of it (so buffering happens on the lowest level). But decoding a character set is still overhead. – Durandal Sep 01 '14 at 13:32
  • Method 3: That `case` statement to use bit shifting in certain cases probably introduces more branching delays than perf improvements. Remove the case statement and just calculate `total += (mul*((byte) c - (byte) ('0'))); ` on each pass of the for-loop. What is the run time now? – selbie Sep 01 '14 at 13:33
  • This kind of program should be I/O bound, because reading the file is limited by mechanical speed, and parsing and adding ints involve one multiplication by 10, one subtraction, and one addition per character. Of course you could slow it down a lot by having it memory-allocate string objects if you want. – Mike Dunlavey Sep 01 '14 at 14:13
  • "Parsing an integer from the front is awkward if you don't know the length of it": quite to the contrary! Start with an accumulator set to 0, then for each next digit multiply the accumulator by 10 and add the value of that digit. At a non-digit stash away the value of the accumulator and reset it to 0. (But your question stipulates there is only one integer per line.) – Marc van Leeuwen Sep 01 '14 at 14:28
  • "the maximum possible return value is 10^17, which still easily fits in a long" no, not actually, in C long is only guaranteed to be at least 32 bits. It's 64 bits mostly in 64-bit Unix-like OSes, in 64-bit Windows long is still 32 bits – phuclv Sep 02 '14 at 04:05
  • As already said by others, this is probably I/O bound. But maybe you can get some minor speedup by using an array of integer to sum up. Each element of the array corresponds to one digit. This way you can sum up without multiplying by 10 for each digit. Don't care about digit overflows until the end (it's hard to explain with the few characters allowed in a comment) – user1781290 Sep 02 '14 at 07:34
  • Thanks, all. I've added an update to the post looking at buffer sizes, forward scanning, and I/O issues (all with rather surprising results). – chiastic-security Sep 02 '14 at 16:56
  • could you provide a link to the input file and/or the code that generates it? – jfs Sep 02 '14 at 18:07
  • @J.F.Sebastian code added. The file is 1GB in size (uncompressed), so not easy to post, but I can find a way if it's important to have exactly the same file. – chiastic-security Sep 02 '14 at 18:14
  • @LưuVĩnhPhúc In Java, which is what this question is about, a long is always 64 bits, regardless of underlying architecture. – chiastic-security Sep 02 '14 at 21:10
  • [@maaartinus provided the fastest solution](http://stackoverflow.com/questions/25606833/fastest-way-to-sum-integers-in-text-file#comment40046670_25607155) so far. It takes 2.3 seconds on my machine with warmed up file cache. It is very impressive. [The same algorithm in C](https://gist.github.com/zed/93ec2e02a22a133f5293#file-sum1gints-maaartinus-mmap-c-L28) takes **1.5** seconds. Note: if the file cache is cold then the problem *is* I/O bound: it takes around 8.6-8.9 seconds to complete the task. – jfs Sep 02 '14 at 22:03
  • @J.F.Sebastian No, the maaartinus solution is slower: note that the timings in the main post were for running the code 10 times, whereas the 2.3 seconds is for a single run. See the comment thread on maaartinus' answer. – chiastic-security Sep 03 '14 at 02:36
  • @J.F.Sebastian the solution I give at the end of the post runs in 13.3 seconds for 10 iterations, i.e., 1.3 seconds each run. The maaartinus solution runs in >2 seconds per iteration on my machine. – chiastic-security Sep 03 '14 at 02:49
  • @chiastic-security: got it. Could you run [the C version](https://gist.github.com/zed/93ec2e02a22a133f5293#file-sum1gints-maaartinus-mmap-c-L28) for reference? – jfs Sep 03 '14 at 02:56
  • @J.F.Sebastian yes, will do that tomorrow. I expect a C version of any of the solutions will run quite fast. My C is not really good enough to translate the multi-threaded version :) – chiastic-security Sep 03 '14 at 03:01
  • @chiastic-security: I wrote a [hybrid C-Python multi-threaded version](https://gist.github.com/zed/f199d5a0c453be2e9681). It is **0.9** seconds (cached file) on my machine. Python reads the file, manages threads, and computes the final result, C computes intermediate sums. – jfs Sep 03 '14 at 10:08
  • @J.F.Sebastian Nice. Presumably that timing is for a single (warmed) run? – chiastic-security Sep 03 '14 at 10:23
  • @chiastic-security: yes. The result for cold cache is **>8** seconds. – jfs Sep 03 '14 at 10:24
  • @chiastic-security Great! Also a good lesson for me on how to ask a right questions on stackoverflow! – Ivan T Sep 04 '14 at 07:46
  • People always fail to test their assumption that file processing is I/O bound. I can't tell you how many times I've heard the argument that C++ `fstream` being slow doesn't matter because the CPU is so much faster than the disk. Benchmarks always show that to be false. A lot of care is required to keep up with a modern disk streaming transfer of 120MB/s (spinning) or 500MB/s (solid state), and of course cache is faster yet. – Ben Voigt Sep 05 '14 at 15:26
  • `sumBinaryForwardMapForked()` seems incorrect as you can end up with partial reads at the end of the buffer that would be overridden. Using mapped file to copy to another buffer is usually the wrong way to do it as it kind of defeats the point of the memory mapped file. Testing on (partially) cached files also screws the results as the task won't experience any IO activity but just maps the file into kernel pages where it's cached - the latter won't even invalidate most of L2 cache. – bestsss Sep 07 '14 at 09:31
  • if the file is already in memory then you could try GPU-based solution (it is like multithreaded version but with ridiculous number of threads). (Beware, drivers that are distibuted with nvidia's cuda packages may lead to a black screen on Ubuntu). – jfs Sep 11 '14 at 06:52

7 Answers


Your main bottleneck will be file IO. Parsing and adding up the numbers should not contribute to the running time, as that can be done in a separate thread while the file I/O is waiting for the disk.

Some years ago I researched how to read from files in the fastest possible way and came across some excellent advice - which I implemented as a Scan routine as below:

// 4k buffer size.
static final int SIZE = 4 * 1024;
static byte[] buffer = new byte[SIZE];

// Fastest because a FileInputStream has an associated channel.
private static void ScanDataFile(Hunter p, FileInputStream f) throws FileNotFoundException, IOException {
    // Use a mapped and buffered stream for best speed.
    // See: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
    final FileChannel ch = f.getChannel();
    long red = 0L;
    do {
        final long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining() && p.ok()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet && p.ok(); i++) {
                p.check(buffer[i]);
                //size += 1;
            }
        }
        red += read;
    } while (red < ch.size() && p.ok());
    // Finish off.
    p.close();
    ch.close();
    f.close();
}

You may wish to adjust this technique before testing it for speed, as it makes use of an interface called Hunter to hunt for the data.

As you can see, the advice dates from 2008, and there have been many enhancements to Java since then, so this may not provide an improvement.

Added

I have not tested this but this should fit into your tests and use the same technique:

class Summer {

    long sum = 0;
    long val = 0;

    public void add(byte b) {
        if (b >= '0' && b <= '9') {
            val = (val * 10) + (b - '0');
        } else {
            sum += val;
            val = 0;
        }
    }

    public long getSum() {
        return sum + val;
    }
}

private long sumMapped() throws IOException {
    Summer sum = new Summer();
    FileInputStream f = new FileInputStream(file);
    final FileChannel ch = f.getChannel();
    long red = 0L;
    do {
        final long read = Math.min(Integer.MAX_VALUE, ch.size() - red);
        final MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, red, read);
        int nGet;
        while (mb.hasRemaining()) {
            nGet = Math.min(mb.remaining(), SIZE);
            mb.get(buffer, 0, nGet);
            for (int i = 0; i < nGet; i++) {
                sum.add(buffer[i]);
            }
        }
        red += read;
    } while (red < ch.size());
    // Finish off.
    ch.close();
    f.close();
    return sum.getSum();
}
OldCurmudgeon
  • Thanks! Surprised to discover that it's not I/O-bound at all, though. I've updated the post accordingly... – chiastic-security Sep 02 '14 at 16:57
  • @chiastic-security I couldn't help myself and implemented a [similar solution](https://dl.dropboxusercontent.com/u/4971686/published/maaartin/so/FileNumberSum.java) taking 1.9 seconds. – maaartinus Sep 02 '14 at 20:10
  • @maaartinus Ah, my timings were for running it ten times, though :) I was running once and discarding, then starting the timer, and running ten times. Doing that with your code on my machine takes 22.8 seconds. – chiastic-security Sep 02 '14 at 21:05
  • @chiastic-security I see. I achieved some speed (1.9 -> 1.4) up by simplifying the conditions. I've also tried copying to a `byte[]`, but it didn't help. – maaartinus Sep 02 '14 at 22:56

Why is this so much faster?

Creating a String is much more expensive than a little maths.

Can we do any better than this by using a MappedByteBuffer to help?

A little, yes. It's what I use. It saves a memory-to-memory copy, i.e. no byte[] is needed.

I have a feeling that the overheads of invoking methods to read from the buffer would slow things down,

The methods get inlined if they are simple.

especially when reading backwards from the buffer.

It won't be any slower, in fact parsing forwards is simpler/faster because you use one * instead of two.

Would it be better to read the file forwards rather than backwards, but still scan the buffer backwards?

I don't understand why you would need to read backward at all.

The idea would be that you read the first chunk of the file, and then scan backwards, but discarding the half-number at the end. Then when you read the next chunk, you set the offset so that you read from the beginning of the number you discarded.

That sounds unnecessarily complicated. I would read in a single pass, memory-mapping the entire file in one go. There's no need to use chunks unless the file is 2+ GB in size, and even then I would read in one pass.
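
A sketch of that approach: map the whole file in one go and parse straight out of the buffer, with no byte[] copy (illustrative, not benchmarked here):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

static long sumMappedDirect(File file) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(file, "r");
    FileChannel ch = raf.getChannel();
    MappedByteBuffer mb = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    long total = 0;
    int acc = 0;
    for (int i = 0, n = mb.limit(); i < n; i++) {
        byte b = mb.get(i);              // absolute get: no intermediate byte[]
        if (b >= '0' && b <= '9')
            acc = acc * 10 + b - '0';
        else {                           // CR or LF ends the current number
            total += acc;
            acc = 0;
        }
    }
    ch.close();
    raf.close();
    return total + acc;                  // in case there's no trailing newline
}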

Is there anything I haven't thought of that could make a significant difference?

If the data is in disk cache it will make more difference than anything else.

Peter Lawrey

You can go for a larger buffer size, and a faster charset decoding to String (to Unicode).

BufferedReader br = new BufferedReader(new InputStreamReader(
        new FileInputStream(file), StandardCharsets.US_ASCII),
        1_024_000_000);

Your method of eliminating String usage, by using a binary InputStream/RandomAccessFile, is worthwhile.

Then it might also be nice if the source files were compressed. Under Unix one would pick the gzip format, where xxx.txt.gz uncompresses to xxx.txt. That would be readable with a GZIPInputStream. It has the advantage of an overall speed-up of file transfer to and from the server's directory.
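
For example (a sketch; the gzipped file name is just an assumption):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

static long sumGzipped(String gzPath) throws IOException {
    // Decompress xxx.txt.gz on the fly and parse it line by line,
    // exactly as in Method 1 of the question.
    BufferedReader br = new BufferedReader(new InputStreamReader(
            new GZIPInputStream(new FileInputStream(gzPath)),
            StandardCharsets.US_ASCII));
    long total = 0;
    String line;
    while ((line = br.readLine()) != null)
        total += Integer.parseInt(line);
    br.close();
    return total;
}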

Joop Eggen

I think there is another way of doing this.

This is a classical multi-process programming problem. In C there is the MPI library, which solves these kinds of problems.

The idea is to chunk the list of integers into, say, 4 parts, with each part summed by a different process. When they finish, the partial sums are added together.

In Java this could be done with threads and the Java concurrency utilities.

E.g. four different threads sum four different parts of the list; at the end the partial sums are added together.
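
A minimal sketch of that split-and-combine step with plain threads, assuming the integers have already been parsed into an int[] (which, given the 10MB constraint in the question, is purely for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

static long parallelSum(final int[] numbers, int parts) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(parts);
    List<Future<Long>> futures = new ArrayList<Future<Long>>();
    int chunk = (numbers.length + parts - 1) / parts;
    for (int p = 0; p < parts; p++) {
        final int from = p * chunk;
        final int to = Math.min(numbers.length, from + chunk);
        futures.add(pool.submit(new Callable<Long>() {
            public Long call() {
                long sub = 0;            // partial sum for this slice
                for (int i = from; i < to; i++)
                    sub += numbers[i];
                return sub;
            }
        }));
    }
    long total = 0;
    for (Future<Long> f : futures)
        total += f.get();                // combine the partial sums
    pool.shutdown();
    return total;
}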

Telephone companies use grid computers that apply this kind of parallel programming technique to sum their transactions.

The only problem here (the bottleneck) is the I/O operation: reading the file will take a lot of time. You could try to make multiple threads read different parts of the file, but that is a complicated approach, and I don't think it will do much good, because the disk won't spin faster just because many threads are using it. There are other techniques for doing similar things, though. You can read more about this here: Access File through multiple threads, and here: Reading a single file with Multiple Thread: should speed up?

anchor
  • Multiprocessing will hardly help with IO bound problems. – maaartinus Sep 01 '14 at 14:19
  • On a spinning disk it even has good chances of being slower due to additional seeks. – CodesInChaos Sep 01 '14 at 15:27
  • The only way threading might help to speed things up here, COULD be to have one thread read data and the other one process it (but the reader thread would be sleeping most of the time waiting for the os getting data from disk) – user1781290 Sep 02 '14 at 11:36
  • Turns out it's not I/O-bound, and multi-threading helps quite a lot! See update to post. – chiastic-security Sep 02 '14 at 16:58
  • @J.F.Sebastian That's pretty surprising. I guess, I'll try to provide a solution taking as much time as reading the file. Or do you consider the problem solved already? – maaartinus Sep 02 '14 at 19:45
  • @J.F.Sebastian I couldn't help myself and [did it](https://dl.dropboxusercontent.com/u/4971686/published/maaartin/so/FileNumberSum.java). *Now* it *is* IO bound. – maaartinus Sep 02 '14 at 20:12
  • @maaartinus: it is a wonderful code that proves that the problem is I/O bound: the time it takes to compute the sum is significantly *larger* than the time it takes to read the file (**2.3** vs. **<1** seconds). – jfs Sep 02 '14 at 20:25
  • @J.F.Sebastian I'm afraid you're right. Simply summing up all the bytes is faster. – maaartinus Sep 02 '14 at 20:49
  • @maaartinus: yes. `result += b` is at least 2x faster. – jfs Sep 02 '14 at 20:50
  • @maaartinus: you are right [the problem is I/O bound if file caches are cold.](http://stackoverflow.com/questions/25606833/fastest-way-to-sum-integers-in-text-file#comment40049631_25606833) – jfs Sep 02 '14 at 22:06

Source: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly

For the best Java read performance, there are four things to remember:

  • Minimize I/O operations by reading an array at a time, not a byte at a time. An 8Kbyte array is a good size.
  • Minimize method calls by getting data an array at a time, not a byte at a time. Use array indexing to get at bytes in the array.
  • Minimize thread synchronization locks if you don't need thread safety. Either make fewer method calls to a thread-safe class, or use a non-thread-safe class like FileChannel and MappedByteBuffer.
  • Minimize data copying between the JVM/OS, internal buffers, and application arrays. Use FileChannel with memory mapping, or a direct or wrapped array ByteBuffer.
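
To make the first two bullets concrete, here's a minimal sketch contrasting byte-at-a-time and array-at-a-time reads (just summing raw bytes, not parsing numbers):

import java.io.FileInputStream;
import java.io.IOException;

// Byte at a time: one method call (and, unbuffered, one system call) per byte.
static long sumBytesSlow(String path) throws IOException {
    FileInputStream in = new FileInputStream(path);
    long sum = 0;
    int b;
    while ((b = in.read()) != -1)
        sum += b;
    in.close();
    return sum;
}

// Array at a time: one method call per 8K block, then plain array indexing.
static long sumBytesFast(String path) throws IOException {
    FileInputStream in = new FileInputStream(path);
    byte[] buf = new byte[8 * 1024];
    long sum = 0;
    int len;
    while ((len = in.read(buf)) != -1)
        for (int i = 0; i < len; i++)
            sum += buf[i] & 0xFF;        // mask to match the unsigned values read() returns
    in.close();
    return sum;
}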
Margus

Based on this comment: "Simply summing up all the bytes is faster", I propose a variation of the accepted answer.

The accepted answer proposes breaking the problem into chunks, calculating a sum for each chunk using multithreading, and adding them together at the end.

This idea can be used to reduce the number of multiplications to O(1) in the backward scan, without any table lookups and without threading (or combine it with threading). Simply take advantage of the way multiplication distributes over addition and add all the ones digits into one accumulator, the tens into a separate one, hundreds and thousands into their own accumulators. This requires no multiplication whatsoever.

The reduce step combining results from multiple threads can also be done using the per-place accumulators. The final step of calculating the totals will require multiplication (or take advantage of the fact that 10 has only two bits set and use bit shifts and add), but only 9 multiplications are sufficient.
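
A minimal sketch of the per-place idea on a single buffer (illustrative; a threaded version would keep one set of accumulators per thread and add them element-wise before the final combine):

// One long accumulator per decimal place: the inner loop does no multiplication at all.
static long sumWithPlaceAccumulators(byte[] buf, int len) {
    long[] placeSums = new long[10];   // placeSums[p] = sum of all digits seen at place p
    int place = 0;                     // current digit's place, counting from the right
    for (int i = len - 1; i >= 0; i--) {   // scan backwards so each digit's place is known
        byte b = buf[i];
        if (b >= '0' && b <= '9')
            placeSums[place++] += b - '0';
        else
            place = 0;                 // CR/LF: the next digit leftwards is a units digit
    }
    // Only the final combine needs multiplications: 9 of them in total.
    long total = placeSums[0];         // units contribute directly
    long coeff = 1;
    for (int p = 1; p < 10; p++) {
        coeff *= 10;
        total += placeSums[p] * coeff;
    }
    return total;
}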

Ben Voigt
  • yes, this was my first thought. `123 + 456 + 78 = (1+4)*100 + (2+5+7)*10 + (3+6+8)*1`. you'd have 10 longs, each representing the sum of all the digits with the same magnitude, then at the end you'd sum them up with something like: `total = l0 + l1*10 + ... + l9 * 1000000000` – Kip Sep 05 '14 at 18:50
  • also, keeping 10 separate long variables, as opposed to an array of 10 longs, might help with performance, as all array accesses are checked in Java. (That is something I have run into with micro-optimizations like this before.) – Kip Sep 05 '14 at 18:51

There are several issues here.

  1. Any solution based on reading lines is going to process each character twice. Compilers for example don't do this, they read one character at a time and despatch on it directly.
  2. Any solution based on readLine() is going to create Strings.
  3. You are using different buffer sizes.
  4. You are using different I/O technologies.
  5. In some cases you are using character conversion, while in others you aren't.
  6. You're over-analyzing the file. You don't really care where the white space is, or how much of it there is, as long as it separates the numbers from each other.

My solution:

    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(file), 8*1024*1024/2);
    long    total = 0;
    int i;
    while ((i = bis.read()) != -1)
    {
        byte    b = (byte)i;
        long    number = 0;
        while (b >= '0' && b <= '9')
        {
            number = number*10+b-'0';
            if ((i = bis.read()) == -1)
                break;
            b = (byte)i;
        }
        total += number;
    }
user207421