Haskell and mutable structures' performance

Question

I was studying the answers given in Optimizing Haskell code and noticed that using a small input would indeed result in a faster Haskell run compared to Python.

But as the dataset grew in size, Python took the lead. Using a hashmap based version had improved the performance, but it was still lagging behind.

Even worse, I tried transliterating Python's dictionaries into hashtables and observed a hard performance hit. I really want to understand what's going on as I'll need mutable structures for a future application.

Here's the slightly modified Python code :

#! /usr/bin/env python2.7
import random
import re
import cPickle

class Markov:
    def __init__(self, filenames):
        self.filenames = filenames
        self.cache = self.train(self.readfiles())
        picklefd = open("dump", "w")
        cPickle.dump(self.cache, picklefd)
    print "Built a db of length "+str(len(self.cache))
        picklefd.close()

    def train(self, text):
        splitted = text.split(' ')
        print "Total of %d splitted words" % (len(splitted))
        cache = {}
        for i in xrange(len(splitted)-2):
            pair = (splitted[i], splitted[i+1])
            followup = splitted[i+2]
            if pair in cache:
                if followup not in cache[pair]:
                    cache[pair][followup] = 1
                else:
                    cache[pair][followup] += 1
            else:
                cache[pair] = {followup: 1}
        return cache

    def readfiles(self):
        data = ""
        for filename in self.filenames:
            fd = open(filename)
            data += fd.read()
            fd.close()
        return data

Markov(["76.txt"])

Haskell, with the original response (train4), a hashmap variant thereof (trainHM2) and the hashtable transliteration (trainHT) :

{-# LANGUAGE BangPatterns,DeriveGeneric #-}

import GHC.Generics (Generic)

import Data.List (foldl')
import Data.Hashable

import qualified Data.Map as M

import qualified Data.HashMap.Strict as HM
import qualified Data.ByteString.Char8 as B

import qualified Data.HashTable.IO as HT

--Using this instead of tuples yielded a 5~10% speedup
data StringTuple = STP !B.ByteString !B.ByteString deriving(Ord,Eq,Generic)
instance Hashable StringTuple

type Database3 = M.Map StringTuple (M.Map B.ByteString Int)
type DatabaseHM = HM.HashMap StringTuple (HM.HashMap B.ByteString Int)
type DatabaseHT = HT.BasicHashTable StringTuple DatabaseInnerHT
type DatabaseInnerHT = (HT.BasicHashTable B.ByteString Int)

train4 :: [B.ByteString] -> Database3
train4 words = foldl' update M.empty (zip3 words (drop 1 words) (drop 2 words))
    where update m (x,y,z) = M.insertWith' (inc z) (STP x y) (M.singleton z 1) m
          inc k _ = M.insertWith' (+) k 1

trainHM2 :: [B.ByteString] -> DatabaseHM
trainHM2 words = trainHM2G words HM.empty
    where 
    trainHM2G (x:y:[]) !hm = hm
    trainHM2G (x:y:z:rem) !hm = trainHM2G (y:z:rem) (HM.insertWith (inc z) (STP x y) (HM.singleton z 1) hm)
            where inc k _ = HM.insertWith (+) k 1


trainHT :: [B.ByteString] -> IO (DatabaseHT)
trainHT words = do 
 hm <- HT.new 

 trainHT' words hm 
 where 
  trainHT' (x:y:[]) !hm = return hm
  trainHT' (x:y:z:rem) !hm = do
   let pair = STP x y
   inCache <- HT.lookup hm pair
   case inCache of
    Nothing -> do
     htN <- HT.new :: IO (DatabaseInnerHT)
     HT.insert htN z $! 1
     HT.insert hm pair $! htN
    Just ht -> do
     cvM <- HT.lookup ht z
     case cvM of
      Nothing -> HT.insert ht z 1
      Just cv -> HT.insert ht z $! (cv+1)
   trainHT' (y:z:rem) hm

main = do contents <- B.readFile "76.txt"
      let bcont = B.split ' ' $ contents
      print $ length bcont
      let db = train4 $ bcont
      print $ "Built a DB of " ++ show (M.size db) ++ " words" 
      --let db = trainHM2 $ bcont
      --print $ "Built a DB of " ++ show (HM.size db) ++ " words"         
      --db <- trainHT $ (bcont)
      --print $ "Built a DB"

A makeshift C++11 transliteration (requires -fpermissive to compile, feel free to correct it) :

#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <unordered_map>
#include <tuple>

/*
 Hash stuff here
 Taken from https://stackoverflow.com/a/7111460/314327
*/
size_t hash_combiner(size_t left, size_t right) //replacable
{ return left^right;}

template<int index, class...types>
struct hash_impl {
    size_t operator()(size_t a, const std::tuple<types...>& t) const {
        typedef typename std::tuple_element<index, std::tuple<types...>>::type nexttype;
        hash_impl<index-1, types...> next;
        size_t b = std::hash<nexttype>()(std::get<index>(t));
        return next(hash_combiner(a, b), t); 
    }
};
template<class...types>
struct hash_impl<0, types...> {
    size_t operator()(size_t a, const std::tuple<types...>& t) const {
        typedef typename std::tuple_element<0, std::tuple<types...>>::type nexttype;
        size_t b = std::hash<nexttype>()(std::get<0>(t));
        return hash_combiner(a, b); 
    }
};

namespace std {
    template<class...types>
    struct hash<std::tuple<types...>> {
        size_t operator()(const std::tuple<types...>& t) {
            const size_t begin = std::tuple_size<std::tuple<types...>>::value-1;
            return hash_impl<begin, types...>()(1, t); //1 should be some largervalue
        }
    };
}

/*
 Hash stuff end
*/

using namespace std;

/*
 Split, from https://stackoverflow.com/a/236803/314327
*/
vector<string> &split(const string &s, char delim, vector<string> &elems) {
 stringstream ss(s);
 string item;
 while (getline(ss, item, delim)) {
 elems.push_back(item);
 }
 return elems;
}

vector<string> split(const string &s, char delim) {
 vector<string> elems;
 split(s, delim, elems);
 return elems;
}
/*
 Split end
*/

typedef tuple<string,string> STP;

unordered_map< STP,unordered_map< string,int > > train(vector<string> &words)
{
 unordered_map< STP,unordered_map< string,int > > cache; 

 for(int i=0;i<words.size()-2;i++)
 {
  STP tup = make_tuple(words[i],words[i+1]);
  auto it = cache.find(tup);
  if(it!=cache.end())
  {
   auto it2 = it->second.find(words[i+2]);
   if(it2!=it->second.end())
   {
    it2->second += 1;
   }
   else
    it->second[words[i+2]] = 1;
  }
  else
  {    
   unordered_map< string,int > cacheInner;
   cacheInner[words[i+2]] = 1;
   cache[tup] = cacheInner;
  }
 }

 return cache;
}

int main()
{
 ifstream ifs("76.txt");
 stringstream buf;
 buf << ifs.rdbuf();
 string contents(buf.str());

 auto words = split(contents,' '); 
 cout << words.size(); 

 auto wordCache = train(words);

 cout << "\nHashtable count " << wordCache.size();

 cout << "\n";
 return 0;
}

And the results are :

C++ (GCC 4.6.3)

$ g++ -O3 -fpermissive -std=c++0x cpptest.cpp -o cpptest
$ /usr/bin/time -f "%E" ./cpptest
1255153

Hashtable count 64442
0:01.02

Python (2.7)

$ /usr/bin/time -f "%E" ./pytest.py 
Total of 1255153 splitted words
Built a db of length 64442
0:02.62

Haskell (GHC 7.4.1) - "train4"

$ ghc -fllvm -O2 -rtsopts -fforce-recomp -funbox-strict-fields hasktest.hs -o hasktest
[1 of 1] Compiling Main             ( hasktest.hs, hasktest.o )
Linking hasktest ...
$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB of 64442 words"
0:06.35

Haskell - "trainHM2"

$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB of 64442 words"
0:04.23

Haskell - "trainHT" - Using Basic variant (which is close to what Python does for dictionaries, I guess ?)

$ /usr/bin/time -f "%E" ./hasktest
1255153
"Built a DB"
0:10.42

Using Linear or Cuckoo for both tables

0:06.06
0:05.69

Using Cuckoo for the outermost table and Linear on the inside

0:04.17

Profiling had shown that there's quite a lot of GC, so, with +RTS -A256M

0:02.11

_{For the input data, I chose 76.txt as indicated in one of the answers and duplicated the whole text 12 times. It should amount to about 7 MBs.}

_{Tests were run on Ubuntu 12.04 in a VirtualBox container, using a single i5-520M core. Done more than a single run, all the results were pretty close.}

The last result is pretty fine for this microbenchmark, but is there anything else to improve in the code, considering that :

Cuckoo & Linear might be better suited for this dataset, but the "generic" Python solution is good to go without much an optimisation in this regard,
Valgrind reports that the C++ & Python versions take approximately 60MBs whilst Haskell RTS reports anywhere from 125MBs (Cuckoo&Linear) to 409MBs (Basic, larger heap) of memory for the same task. Wouldn't tweaking the garbage collector this much in a production environment be detrimental ? Is it possible to refactor the code so it has less memory usage ?

Update :

I guess "reducing garbage" is what I'm looking for. I know Haskell doesn't work the same way C++ does, but I want to know if it's possible to reduce garbage produced in imperative code, as the C++ example consumed half the memory without any space leaks. It'd hopefully be an improvement in terms of memory usage and execution time (as there'll be less GC).

Update 2 :

Computing the length during the table construction has reduced the memory footprint for sure (down to around 40MBs, actually !), which causes the GC to take longer, resulting in a slower execution time (due to discarding values that had been lazily read from the list, I presume ?).

And yes, hashtables' operations take a significant amount of time. I'll try mimicking alterations to see if it improves any further.

"is there anything else to improve in the code" is a big question. Can you be more specific? You say there's a lot of GC, but don't say much else about what you learned from profiling, or what questions came up. — jberryman, Oct 09 '13 at 03:35
Far from a full answer, however you force the entire list of words by printing the length and keep it retained in memory for the dict construction. I saved about 100M off the basic, larger heap size by not printing the length. If you need it, you can make a length value in parallel with the dictionary construction. — John L, Oct 09 '13 at 05:33

score 4 · Answer 1 · edited May 23 '17 at 10:25

This isn't really an answer, but it's too much to put in comments, so I'll drop it here until something better comes along. I don't see anything obviously wrong with your hashtable code (the only part I really looked at), other than a bit of refactoring/golfing.

First, the memory usage isn't very surprising to me. With -A256M, you're requiring that the RTS have a minimum allocation area of 256M, so that puts a floor on your memory usage. If data gets promoted or copied after a GC, the memory usage will grow. Also note that Haskell data structures tend to be a bit memory-hungry relative to other languages, see for example Memory footprint of Haskell data types. Given both of these factors, I'm not surprised by the total memory usage with a large allocation area.

Structures like the HashMap or a bytestring trie could use less memory, with the attendant downsides of using a data structure other than a hashtable.

Speaking of the allocation area, this code is a bit of a microbenchmark in that nearly all the allocated data (mostly bytestring data and internal hashtable values) are long-lived (they last until the program ends). This puts your test program in a situation where a very large allocation area is particularly beneficial, whereas if this database construction were just a part of a larger program the costs of the larger area might become dominant.

As to the optimal GC settings for a production environment, it's very hard to tell outside the context of an actual complete program. I can say that if performance really matters, it's worth spending some time tuning the GC flags. Even more so if you've enabled the threaded runtime.

Aside from memory issues, I strongly suspect that the hashtables package is working against you here. According to a profile, the 4 costliest functions are lookup/go, lookup, insert, and delete'.go. I think if it had the equivalent of Data.Map.alter, some of your operations could be coalesced for a performance gain. I would be very surprised if Python dictionaries weren't optimized for cases like cache[key] += 1 after all.

Haskell and mutable structures' performance

1 Answers1