
I'm getting `java.lang.OutOfMemoryError: GC overhead limit exceeded` when I try to execute the pipeline if the GATE Document I use is even slightly large.

The code works fine if the GATE Document is small.

My Java code is something like this:

TestGate Class:

    public void gateProcessor(Section section) throws Exception {
        Gate.init();
        Gate.getCreoleRegister().registerDirectories(....
        SerialAnalyserController pipeline .......
        pipeline.add(All the language analyzers)
        pipeline.add(My Jape File)
        Corpus corpus = Factory.newCorpus("Gate Corpus");
        Document doc = Factory.newDocument(section.getContent());
        corpus.add(doc);

        pipeline.setCorpus(corpus);
        pipeline.execute();
    }

The main class contains:

    StringBuilder body = new StringBuilder();
    int character;
    FileInputStream file = new FileInputStream(
            new File("filepath\\out.rtf"));  // The document in question
    while (true) {
        character = file.read();
        if (character == -1) break;
        body.append((char) character);
    }
    file.close();

    Section section = new Section(body.toString()); // Creating an object of type Section whose content field is body.toString()
    TestGate testgate = new TestGate();
    testgate.gateProcessor(section);

Interestingly, this fails in the GATE Developer tool as well: the tool basically gets stuck if the document is larger than a specific limit, say more than one page.

This suggests that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE Document?

Identity1
  • How large is your document/file (**how many MB?**, e.g. for your `out.rtf`) and what are your **java heap** settings (are you using e.g. java -Xmx1g)? – dedek Sep 21 '15 at 07:05
  • See also `OutOfMemoryError` related questions, e.g. http://stackoverflow.com/q/5839359/1857897 – dedek Sep 21 '15 at 07:09

2 Answers


You need to call

    corpus.clear();
    Factory.deleteResource(doc);

after each document, otherwise you'll eventually get an OutOfMemoryError on documents of any size if you run it enough times (although, by the way you initialize GATE in the method, it seems like you really only need to process a single document once).
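
For illustration, a minimal sketch of how the processing part of the question's `gateProcessor` could release those resources (assuming the same `pipeline`, `corpus` and `doc` variables as in the question):

    Corpus corpus = Factory.newCorpus("Gate Corpus");
    Document doc = Factory.newDocument(section.getContent());
    try {
        corpus.add(doc);
        pipeline.setCorpus(corpus);
        pipeline.execute();
    } finally {
        // Release the document so its annotations and features can be
        // garbage-collected, even if execute() throws.
        corpus.clear();
        Factory.deleteResource(doc);
    }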

Besides that, annotations and features usually take up a lot of memory. If you have an annotation-intensive pipeline, i.e. you generate lots of annotations with lots of features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially, for instance a JAPE or Groovy PR that generates n to the power of W annotations, where W is the number of words in your document. Or if you have a feature for each possible word combination in your doc, that would generate factorial of W strings.
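
If you suspect this is happening, a quick check (just a sketch, not part of the original pipeline) is to print the annotation counts right after `pipeline.execute()`, before the document is deleted:

    // Very large counts relative to the document length hint at an
    // annotation-intensive processing resource.
    System.out.println("Default set: " + doc.getAnnotations().size());
    for (String setName : doc.getAnnotationSetNames()) {
        System.out.println(setName + ": " + doc.getAnnotations(setName).size());
    }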

Yasen

Every time you run this it creates a new pipeline object, and that's why it takes huge amounts of memory. That's why, every time you use ANNIE, you should clean up:

    pipeline.cleanup();
    pipeline = null;
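
A minimal sketch of that idea, assuming a hypothetical list `sections` of Section objects and the same pipeline and corpus setup as in the question: build the pipeline once, reuse it for every document, and clean it up only when everything has been processed.

    // Gate.init(), plugin registration and pipeline construction happen once, before the loop.
    for (Section s : sections) {
        Document doc = Factory.newDocument(s.getContent());
        corpus.add(doc);
        pipeline.setCorpus(corpus);
        pipeline.execute();
        // Per-document cleanup, as in the accepted answer.
        corpus.clear();
        Factory.deleteResource(doc);
    }
    // Release the pipeline only after all documents have been processed.
    pipeline.cleanup();
    pipeline = null;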

vijay velaga