
I have a big txt file with integers in it. Each line in the file has two integers separated by whitespace. The file is 63 MB.

Pattern p = Pattern.compile("\\s");
try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] tokens = p.split(line);
        String s1 = new String(tokens[0]);
        String s2 = new String(tokens[1]);
        int startLabel = Integer.valueOf(s1) - 1;
        int endLabel = Integer.valueOf(s2) - 1;
        Vertex fromV = vertices.get(startLabel);
        Vertex toV = vertices.get(endLabel);
        Edge edge = new Edge(fromV, toV);
        fromV.addEdge(edge);
        toV.addEdge(edge);
        edges.add(edge);
        System.out.println("Edge from " + fromV.getLabel() + " to " + toV.getLabel());
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.String.substring(String.java:1913)
at java.lang.String.subSequence(String.java:1946)
at java.util.regex.Pattern.split(Pattern.java:1202)
at java.util.regex.Pattern.split(Pattern.java:1259)
at SCC.main(SCC.java:25)

Why am I getting this exception? How can I change my code to avoid it?

EDIT: I've already increased the heap size to 2048m. What is consuming it? That's what I'd like to know as well.

As far as I know, the JVM should only need to allocate memory for the list of vertices, the set of edges, the buffer for the BufferedReader, and one small string "line". I don't see where this OutOfMemoryError is coming from.

I've read about the String.split() method. I think it's causing a memory leak, but I don't know what I should do about it.

user1685095
  • I think `p.split(line);` should be `line.split("\\s+");` – Smit Jul 30 '13 at 15:50
  • @Smit No, that would try to split the string "\s+" on the pattern `\s`. – erickson Jul 30 '13 at 15:52
  • @erickson it wouldn't. Look at the String.split() method in the API. Although Pattern.split() should behave exactly the same, so really you're both wrong... – gnomed Jul 30 '13 at 15:53
  • @gnomed Ah, you are right, but `p` is a `Pattern`, and `String.split()` eventually calls this method on `Pattern`. However, it's more efficient to compile the pattern once, and reuse it, than to recompile the pattern with every call to `String.split()`. – erickson Jul 30 '13 at 15:56
  • What is the size of the file? – Divers Jul 30 '13 at 15:56
  • @Smit this is calling the `Pattern` method `split`. The OP is caching the pattern. This is good practice for heavily reused patterns as they aren't free to compile. – Boris the Spider Jul 30 '13 at 15:56
  • @BoristheSpider and @erickson: I really didn't know that. I just tried it and it really worked well. Thanks for updating my knowledge. – Smit Jul 30 '13 at 15:58

7 Answers


Easiest way: increase your heap size. Add the -Xmx512m -Xms512m arguments (or even higher values) to the JVM.

fmgp

What you should try first is reducing the file until it is small enough that everything works. That will allow you to appraise just how large a problem you have.

Second, your problem is definitely unrelated to String#split since you are using it on just one line at a time. What is consuming your heap are the Vertex and Edge instances. You'll have to redesign this towards a smaller footprint, or completely overhaul your algorithms to be able to work with only a part of the graph in memory, the rest on the disk.

P.S. Just a general Java note: don't write

String s1 = new String(tokens[0]);
String s2 = new String(tokens[1]);

you just need

String s1 = tokens[0];
String s2 = tokens[1];

or even just use tokens[0] directly instead of s1, since it's about as clear.
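In other words, the loop body from the question could be trimmed to something like this (same split, no extra String copies; as a side note, Integer.parseInt also avoids creating the Integer box that Integer.valueOf returns and is immediately unboxed):

    String[] tokens = p.split(line);
    int startLabel = Integer.parseInt(tokens[0]) - 1;
    int endLabel = Integer.parseInt(tokens[1]) - 1;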

erickson
Marko Topolnik
  • I know that. I've just read that using tokens[0] directly keeps the GC from removing the characters from memory. – user1685095 Jul 30 '13 at 16:08
  • Writing `new String(...)` can't do any good except allocate even more memory. – Marko Topolnik Jul 30 '13 at 16:11
  • Yes, in my case I think it is so. – user1685095 Jul 30 '13 at 16:25
  • Actually, calling `new String(String)` *can* sometimes help, but probably not in this case. If you have a really long string, and by whatever means you create a small substring from it, and "throw away" the rest, the character buffer of the very long string will be retained. Using this constructor allocates a new `char[]` that is only as big as the substring, removing the reference to the original character buffer and perhaps allowing it to be GC'd. But in this case, it looks like a reference to almost all of the data on each line (except the space and newline) is permanently maintained. – erickson Jul 30 '13 at 16:32
  • @erickson That was true only until OpenJDK 7 Update 6, when the behavior was changed to no `char[]` sharing. However, since OP is not retaining `tokens[i]`, but just parsing them into integers, even with the old behavior `new String` would have been just more memory allocation with no positive effect. – Marko Topolnik Jul 30 '13 at 16:34
  • Yes, that was just a desperate, unreasonable attempt :( – user1685095 Jul 30 '13 at 16:36
  • @erickson You can even see in the OP's stack trace what is going on: `Arrays.copyOfRange` makes a new array which is a copy of the specified range. That's post-Update 6 code. – Marko Topolnik Jul 30 '13 at 16:37
  • @user1685095 My best advice is to take a long and hard look at your `Vertex` and `Edge` classes and discard absolutely everything you can. Forget about any "nice OOP principles", get down to the metal. Use arrays of primitive `int`s as much as you can: you can, for example, replace an array of objects containing three `int`s each with three arrays of `int`s. It is **way** more memory-efficient (see the sketch after these comments). – Marko Topolnik Jul 30 '13 at 16:40
  • Well, if I'm getting an OutOfMemory exception now, what will happen when I run a recursion-based DFS on this graph?.. :( I'm trying to use arrays of Vertex objects instead of lists right now, but I don't know how that would work in DFS... – user1685095 Jul 30 '13 at 16:45
  • You can run everything on an incidence matrix, which can be represented by just a single array of `int`s. Using full-blown objects is very costly and for this kind of tasks it is not recommended. If you were doing this in C, you'd be using the same kind of raw structure. – Marko Topolnik Jul 30 '13 at 16:50
  • That's right. As luck would have it, I was looking at Update 5 code because we ran into some mysterious long build times with Update 6 and later versions, so I held back on my dev system. I don't suppose there's a connection... – erickson Jul 30 '13 at 18:22
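A rough sketch of the parallel-int-array idea from the comments above (this is not code from the thread): it assumes the number of edges is known up front, e.g. from a first pass over the file, and that labels are 1-based as in the question's code. readEdges and edgeCount are illustrative names.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class CompactEdgeListSketch {
    // Two parallel int arrays replace one Edge object per line; a vertex is
    // just an index, so no Vertex objects are needed while loading.
    static int[][] readEdges(String filePath, int edgeCount) throws IOException {
        int[] edgeFrom = new int[edgeCount]; // source vertex index of edge i
        int[] edgeTo = new int[edgeCount];   // target vertex index of edge i
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            int i = 0;
            while ((line = reader.readLine()) != null && i < edgeCount) {
                String[] tokens = line.trim().split("\\s+");
                edgeFrom[i] = Integer.parseInt(tokens[0]) - 1;
                edgeTo[i] = Integer.parseInt(tokens[1]) - 1;
                i++;
            }
        }
        return new int[][] { edgeFrom, edgeTo };
    }
}

For a 63 MB file of short lines (on the order of a few million edges), two int[] arrays cost a few tens of megabytes, while per-edge objects with headers, references, and adjacency lists cost several times that.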

Increase the heap memory limit, using the -Xmx JVM option.

More info here.

rcgoncalves
  • Is the OP asking for a way to increase memory? This is a comment, not an answer. – Makky Jul 30 '13 at 15:53
  • Why is it not an answer? I don't see a memory leak here. – Divers Jul 30 '13 at 15:54
  • I've read it. He has a BIG file and he's trying to load it into the heap. If it's really big, it's impossible to solve the problem without increasing the heap size. – Divers Jul 30 '13 at 15:56
  • @Divers It's not impossible to do without increasing heap. An alternative would be to use an indexed file structure on disk. The question is, is it worth it? If the biggest graphs he'll be working with are on the order of 100 Mb, it's easier to increase the heap. If he needs to scale to 100 Gb graphs, it might be worth implementing a mass-storage approach. – erickson Jul 30 '13 at 16:00
  • Also, the question explicitly asks for the reason he's getting the error, and *how to change the code* to avoid it. So this non-answer is at risk of downvotes. – erickson Jul 30 '13 at 16:01
  • So it's impossible to give an answer here, because he would have to remake the algorithm completely, not just change the code. – Divers Jul 30 '13 at 16:06

You are getting this exception because your program is storing too much data in the java heap.

Although your exception is showing up in the Pattern.split() method, the actual culprit could be any large memory user in your code, such as the graph you are building. Looking at what you provided, I suspect the graph data structure is storing much redundant data. You may want to research a more space-efficient graph structure.

If you are using the Sun JVM, try the JVM option -XX:+HeapDumpOnOutOfMemoryError to create a heap dump and analyze that for any heavy memory users, and use that analysis to optimize your code. See Using HeapDumpOnOutOfMemoryError parameter for heap dump for JBoss for more info.
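For example, the command line might look something like this (the -Xmx value and dump path are only illustrative; SCC is the main class from the stack trace above):

    java -Xmx2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/scc.hprof SCC

The resulting .hprof file can then be opened in a heap analyzer such as Eclipse MAT or VisualVM to see which objects dominate the heap.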

If that's too much work for you, as others have indicated, try increasing the JVM heap space to a point where your program no longer crashes.

lreeder

You are getting the exception because your heap space has run out. Try to increase the heap with

 java -Xms512m -Xmx2048m (for example)
Divers

Whenever you get an OOM while trying to parse stuff, it means the method you are using is not scalable. Even though increasing the heap might solve the issue temporarily, it is not a scalable fix. For example, if tomorrow your file size increases by an order of magnitude, you would be back to square one. I would recommend reading the file in pieces: cache x lines of the file, process them, clear the cache, and repeat. You can use either Ehcache or Guava cache.
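A rough sketch of that batching idea (not from the answer itself), using a plain ArrayList as the cache instead of Ehcache or Guava; BATCH_SIZE and processBatch are made-up placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class BatchedReaderSketch {
    static final int BATCH_SIZE = 100000; // illustrative batch size

    static void readInBatches(String filePath) throws IOException {
        List<String> batch = new ArrayList<>(BATCH_SIZE);
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(batch); // hypothetical: parse and flush partial results
                    batch.clear();       // let the cached lines become unreachable
                }
            }
            if (!batch.isEmpty()) {
                processBatch(batch); // handle the final partial batch
            }
        }
    }

    static void processBatch(List<String> lines) {
        // placeholder: parse the lines and update a compact or on-disk structure
    }
}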

noMAD

The way you parse the string could be changed.

try (Scanner scanner = new Scanner(new FileReader(filePath))) {
    while (scanner.hasNextInt()) {
        int startLabel = scanner.nextInt();
        int endLabel = scanner.nextInt();
        scanner.nextLine(); // discard the rest of the line.
        // use startLabel and endLabel here.
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
}

I suspect the memory consumption is actually in the data structure you build rather than how you read the data, but this should make it more obvious.

Peter Lawrey
  • Isn't this slightly fragile as it is not robust to an erroneous third integer on the same line? I'd feel more safe if I first took a line and then split it. I think it is easily doable with `Scanner` as well. – Marko Topolnik Jul 30 '13 at 16:04
  • @MarkoTopolnik I was going to handle that but forgot. Fixed now. – Peter Lawrey Jul 30 '13 at 16:06
  • Well, hasNextInt() returns false immediately... I thought about this solution too, but I thought that Scanner also uses split and patterns internally, so I assumed it wouldn't help. – user1685095 Jul 30 '13 at 16:12
  • @user1685095 Just to be sure, can you try this? It doesn't use regex the same way and is more efficient, but I suspect this isn't the cause, so it shouldn't help. (A regex-free variant is sketched below.) – Peter Lawrey Jul 30 '13 at 16:49
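For completeness (not from the thread): if regex overhead is the concern, each line can be parsed with no Pattern at all. This sketch assumes each line holds exactly two integers separated by a run of whitespace, as the question describes:

// Sketch only: regex-free parsing of one line (e.g. "123 456") into two ints.
static int[] parseLine(String line) {
    String trimmed = line.trim();
    int i = 0;
    // scan to the end of the first number
    while (i < trimmed.length() && !Character.isWhitespace(trimmed.charAt(i))) {
        i++;
    }
    int first = Integer.parseInt(trimmed.substring(0, i));
    // skip the whitespace between the two numbers
    while (i < trimmed.length() && Character.isWhitespace(trimmed.charAt(i))) {
        i++;
    }
    int second = Integer.parseInt(trimmed.substring(i));
    return new int[] { first, second };
}

As the answers above point out, though, the per-line strings are short-lived either way; the heap is most likely dominated by the graph structures themselves.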