
I'm currently trying to read the contents of an XML file into a Map Int (Map Int String), and it works quite well (using HaXml). However, I'm not satisfied with the memory consumption of my program, and the problem seems to be the garbage collection.

Here's the code I'm using to read the XML file:

type TextFile = Map Int (Map Int String)

buildTextFile :: String -> IO TextFile
buildTextFile filename = do content <- readFile filename
                            let doc = xmlParse filename content
                                con = docContent (posInNewCxt filename Nothing) doc
                            return $ buildTF con

My guess is that content is held in memory even after the return, although it doesn't need to be (of course it could also be doc or con). I come to this conclusion because the memory consumption rises quickly with very large XML files, although the resulting TextFile is only a singleton map of a singleton map (using a special testing file, generally it's different, of course). So in the end, I have a Map of a Map Int String, with only one string in it, but the memory consumption is up to 19 MB.

Using strict application ($!) or using Data.Text instead of String in TextFile doesn't change anything.

So my question is: Is there some way to tell the compiler that the string content (or doc or con) isn't needed anymore and that it can be garbage collected?

And more generally: How can I find out where the problem really comes from without all the guessing?

Edit: As FUZxxl suggested I tried using deepseq and changed the second line of buildTextFile like so:

let doc = content `deepseq` xmlParse filename content

Unfortunately that didn't change anything really (or am I using it wrong?)...

bzn
  • Did you try `deepseq`? This is a common problem with lazy IO. Maybe strict IO would be better in this case. – fuz Jul 20 '11 at 15:13
  • Remember that a `String` takes 12 bytes/character so even if you make it strict it will take up a lot of space. – augustss Jul 20 '11 at 15:40
  • @augustuss, I assume those 12 bytes are for GHC on a 32bit target? – hvr Jul 20 '11 at 16:25
  • @hvr: See also [this answer](http://stackoverflow.com/questions/3254758/memory-footprint-of-haskell-data-types/3256825#3256825)--the `(:)` constructor alone would be three words, which is 12 bytes already on a 32-bit system assuming you're only using `Char` values that GHC has cached. – C. A. McCann Jul 20 '11 at 16:50
  • @augustuss: Ouch, that's a lot. I guess Data.Text would be much better in this regard, wouldn't it? – bzn Jul 20 '11 at 17:01
  • @FUZxxl: Thanks, I now tried deepseq (see edit above), but without success. – bzn Jul 20 '11 at 17:10
  • @bzn I mean, why not deepseq the final return value? Like: `let x = buildTF con; x \`deepseq\` buildTF con`. – fuz Jul 20 '11 at 17:29
  • @bzn: As a rule of thumb, lists of anything are dubious if you expect to have large sections of them in memory all at once. Unless you're actually streaming characters between a producer/consumer pair of some sort, or only using very short strings, `String` is kind of terrible. – C. A. McCann Jul 20 '11 at 17:56
  • Thanks for all the input. deepseq'ing the final return value doesn't change anything either. Neither does Data.Text + deepseq. So, I guess I'll have to live with it for now. It's not dramatic, but when I compared the memory consumption with another program (written in C#, not by me) which has to handle the same data (+ additional business logic, which I didn't implement yet) and uses only a third of the space, I was kind of disappointed... – bzn Jul 20 '11 at 20:10
  • I think you need to stop guessing and start profiling. GHC includes retainer profiling that can tell you what is being held in memory and give you some clue as to why. – Paul Johnson Jul 20 '11 at 21:22
  • Yes, that seems to be the only possibility. Until now I somehow avoided learning how to use GHC and its debugging/profiling capabilities (since cabal-install is so comfortable). – bzn Jul 21 '11 at 09:39

1 Answer

Don't Guess What Is Consuming Memory, Find Out For Sure

The first step is to determine which types are consuming the most memory. GHC's heap profiler can break the heap down by type; there are lots of heap profiling examples here on SO, and the GHC manual covers it in its profiling chapter.
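
As a rough sketch of what such a profiling run might look like: the program below is a hypothetical stand-in for the question's code (the file name `Main.hs`, the function `summarise`, and the input path are all assumptions, and `summarise` merely imitates `buildTF` by keeping one short string out of a large input). The comments list the GHC options involved; `-fprof-auto` assumes GHC 7.4 or later (older releases used `-auto-all`).

```haskell
-- Main.hs (hypothetical file name)
--
-- Build with profiling enabled:
--   ghc -O2 -prof -fprof-auto -rtsopts Main.hs
-- Heap profile broken down by type:
--   ./Main input.xml +RTS -hy -RTS
-- Heap profile broken down by retainer set:
--   ./Main input.xml +RTS -hr -RTS
-- Render the resulting Main.hp as a PostScript graph:
--   hp2ps -c Main.hp
module Main (main) where

import qualified Data.Map as Map
import System.Environment (getArgs)

type TextFile = Map.Map Int (Map.Map Int String)

-- Hypothetical stand-in for the question's buildTF:
-- keeps only the first line of the input.
summarise :: String -> TextFile
summarise content =
  Map.singleton 0 (Map.singleton 0 (takeWhile (/= '\n') content))

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      content <- readFile path
      let tf = summarise content
      print (Map.size tf)
    _ -> putStrLn "usage: Main FILE"
```

If the by-type graph is dominated by `[]` and `Char`, a `String` is probably being retained; the retainer profile then helps narrow down which cost centre is holding on to it.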

Forcing Computation

If the problem is lazy evaluation (you're building an on-heap thunk that can compute the XML document type, and that thunk keeps the string on the heap too), then use rnf (from Control.DeepSeq) and seq:

buildTextFile :: String -> IO TextFile
buildTextFile filename = do content <- readFile filename
                            let doc = xmlParse filename content
                                con = docContent (posInNewCxt filename Nothing) doc
                                res = buildTF con
                            rnf res `seq` return res  -- fully evaluate res before the action returns

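Or just use a bang pattern (let !res = buildTF con, with the BangPatterns extension enabled). Note that a bang pattern only evaluates res to weak head normal form, so values nested inside the map can remain unevaluated, whereas rnf evaluates the whole structure. Once nothing in res refers to content any longer, the GC can collect the String.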
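
For comparison, here is a minimal self-contained sketch of forcing the result inside IO with `evaluate` and `force`, which amounts to the same thing as `rnf res `seq` return res`. The names `firstLine`, `buildStrict`, and `sample.txt` are made up for illustration, and `firstLine` only imitates what `buildTF` might do; the HaXml parsing step is left out.

```haskell
module Main (main) where

import Control.DeepSeq (force)
import Control.Exception (evaluate)
import qualified Data.Map as Map

type TextFile = Map.Map Int (Map.Map Int String)

-- Hypothetical stand-in for buildTF: keeps only the first line.
firstLine :: String -> TextFile
firstLine s = Map.singleton 0 (Map.singleton 0 (takeWhile (/= '\n') s))

-- Fully evaluate the result before the IO action returns, so the
-- returned map holds no thunk that still refers to `content`.
buildStrict :: FilePath -> IO TextFile
buildStrict path = do
  content <- readFile path
  evaluate (force (firstLine content))

main :: IO ()
main = do
  writeFile "sample.txt" "hello\nworld\n"
  tf <- buildStrict "sample.txt"
  print tf
```

Whether this lowers peak memory depends on what actually retains the input, which is why profiling first is still the better starting point.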

Thomas M. DuBuisson