
I am trying to preprocess a large text file (10 GB) and store it in a binary file for future use. As the code runs, it slows down and eventually fails with:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

The input file has the following structure:

200020000000008;0;2
200020000000004;0;2
200020000000002;0;2
200020000000007;1;2

This is the code I am using:

        String strLine;

        FileInputStream fstream = new FileInputStream(args[0]);
        BufferedReader br = new BufferedReader(new InputStreamReader(fstream)); 

        //Read File Line By Line
        HMbicnt map = new HMbicnt("-1");
        ObjectOutputStream outputStream = new ObjectOutputStream(new FileOutputStream(args[1]));

        int sepIndex = 15;

        int sepIndex2 = 0;
        String str_i = "";
        String bb = "";
        String bbBlock = "init";

        int cnt = 0;
        int lineCnt = 0;
        while ((strLine = br.readLine()) != null)   {
            // parse the line into id, block, and count
            str_i = strLine.substring(0, sepIndex);
            sepIndex2 = strLine.substring(sepIndex+1).indexOf(';');
            bb = strLine.substring(sepIndex+1, sepIndex+1+sepIndex2);
            cnt = Integer.parseInt(strLine.substring(sepIndex+1+sepIndex2+1));
            if(!bb.equals(bbBlock)){
                outputStream.writeObject(map);
                outputStream.flush();
                map = new HMbicnt(bb);
                map.addNew(str_i + ";" + bb, cnt);
                bbBlock = bb;
            }
            else{
                map.addNew(str_i + ";" + bb, cnt);
            }
        }
        outputStream.writeObject(map);
        outputStream.flush();

        //Close the streams
        br.close();
        outputStream.close();

Basically, it goes through the input file and stores the data in an HMbicnt object (which is a hash map). Once it encounters a new value in the second column, it should write the object to the output file, free the memory, and continue.

Thanks for any help.

– Filip
  • Either the bb blocks never repeat, causing you to keep adding to the same map, or the implementation of `HMbicnt` is doing something fishy. Can you show the implementation of `addNew()` and any methods it calls? – Roberto Attias Jul 01 '15 at 21:39

2 Answers


I think the problem is not that the 10 GB is in memory, but that you are creating too many HashMaps. Maybe you could clear the HashMap and reuse it instead of re-creating it once you no longer need its contents. There seems to have been a similar problem in java.lang.OutOfMemoryError: GC overhead limit exceeded; it is also about HashMaps.
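
The question doesn't show HMbicnt's implementation, so the following is only a sketch: it assumes HMbicnt wraps a java.util.HashMap, that `addNew` accumulates counts per key, and it adds a hypothetical `reset()` method so a single instance can be reused across blocks:

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical reusable version of HMbicnt; the real class is not
    // shown in the question.
    public class HMbicnt implements Serializable {
        private String blockId;
        private final Map<String, Integer> counts = new HashMap<>();

        public HMbicnt(String blockId) {
            this.blockId = blockId;
        }

        public void addNew(String key, int cnt) {
            // Assumed semantics: accumulate the count for this key.
            counts.merge(key, cnt, Integer::sum);
        }

        // Clear the backing map so the same instance can be reused for the
        // next block instead of allocating a new HMbicnt each time.
        public void reset(String newBlockId) {
            this.blockId = newBlockId;
            counts.clear();
        }
    }

In the read loop, `map = new HMbicnt(bb)` then becomes `map.reset(bb)`. One caveat when reusing a single instance: `ObjectOutputStream` remembers every object it has written and serializes a repeated object as a back-reference to the first write, so you would also need to call `outputStream.reset()` after each `writeObject(map)` so that the current contents are actually written out (this also releases the stream's internal reference to the map).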

– user140547
  • You're probably right. An important thing to note, of course, is that both problems can be avoided by simply reducing the total amount of memory needed by the program. – Sam Estep Jul 01 '15 at 22:14

Simply put, you're using too much memory. Since, as you said, your file is 10 GB, there is no way you're going to be able to fit it all into memory (unless, of course, you happen to have over 10 GB of RAM and have configured Java to use it).

From what I can tell from your code and your description of it, you're reading the entire file into memory, adding it to one huge in-RAM map as you go, and then writing your result to the output. This is not feasible. You'll need to redesign your code to stream the data (i.e. only keep a small portion of the file in memory at any given time), as sketched below.
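
As a rough illustration only, a streaming redesign could look something like the sketch below. It assumes the downstream consumer can read a flat sequence of (key, count) records instead of serialized HMbicnt maps; the record layout here is made up for the example:

    import java.io.BufferedOutputStream;
    import java.io.BufferedReader;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.FileReader;
    import java.io.IOException;

    public class Preprocess {
        public static void main(String[] args) throws IOException {
            try (BufferedReader br = new BufferedReader(new FileReader(args[0]));
                 DataOutputStream out = new DataOutputStream(
                         new BufferedOutputStream(new FileOutputStream(args[1])))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // Each input line has the form id;block;count.
                    String[] parts = line.split(";");
                    out.writeUTF(parts[0] + ";" + parts[1]);  // key
                    out.writeInt(Integer.parseInt(parts[2])); // count
                }
            }
        }
    }

Because each record is written to disk as soon as it is parsed, memory use stays constant no matter how large the input file is.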

– Sam Estep