
I am currently exploring the possibility of using basic containers to give FRP networks more structure, and thereby to make it easier to build more sophisticated event networks.

Note: I use ordrea, but I had the same problem with reactive-banana too, so I guess this problem is not specific to the chosen FRP implementation.


In this particular case I am using a simple Matrix to store Events:

newtype Matrix (w :: Nat) (h :: Nat) v a where
   Matrix :: Vector a -> Matrix w h v a

-- deriving instances: Functor, Foldable, Traversable, Applicative

Matrix is basically just a thin wrapper around Data.Vector, and most functions I'll use are essentially the same as the corresponding Vector ones. The notable exception is indexing, but that should be self-explanatory.
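
For illustration, here is roughly what I mean (a simplified sketch, not my actual module; it assumes the newtype above plus DataKinds and ScopedTypeVariables):

import Data.Vector ((!))
import GHC.TypeLits (KnownNat, natVal)
import Data.Proxy (Proxy(..))

-- plain row-major indexing; the width is recovered from the type-level Nat
index :: forall w h v a. KnownNat w => Matrix w h v a -> Int -> Int -> a
index (Matrix vec) x y = vec ! (y * width + x)
  where width = fromIntegral (natVal (Proxy :: Proxy w))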


With this I can define matrices of events like Matrix 10 10 (Event Double) and am able to define basic convolution algorithms on them:

applyStencil :: (KnownNat w, KnownNat h, KnownNat w', KnownNat h')
             => M.Matrix w' h' (a -> c)
             -> M.Matrix w h (Event a)
             -> M.Matrix w h (Event c)
applyStencil s m = M.generate stencil
  where stencil x y = fold $ M.imap (sub x y) s
        sub x0 y0 x y g = g <$> M.clampedIndex m (x0 - halfW + x) (y0 - halfH + y)
        halfW = M.width s `div` 2
        halfH = M.height s `div` 2
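
Concretely: with a 3x3 stencil, the event in output cell (x0, y0) is the fold (merge) of the nine input events around (x0, y0), each mapped through the corresponding stencil function. So a single occurrence on one input cell shows up on at most the nine output cells whose neighbourhood contains it (near the border, clamping means the same input can enter an output cell through more than one stencil position).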

Notes:

  • M.generate :: (Int -> Int -> a) -> M.Matrix w h a and

    M.imap :: (Int -> Int -> a -> b) -> M.Matrix w h a -> M.Matrix w h b

    are just wrappers around Vector.generate and Vector.imap respectively.

  • M.clampedIndex clamps indices into the bounds of the matrix (a short sketch follows below this list).
  • Event is an instance of Monoid which is why it is possible to just fold the Matrix w' h' (Event c) returned by M.imap (sub x y) s.
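
For completeness, M.clampedIndex is nothing fancy either. Roughly (again just a sketch in the style of the index function above, not the exact code):

-- clamp out-of-range coordinates to the matrix bounds before indexing
clampedIndex :: forall w h v a. (KnownNat w, KnownNat h)
             => Matrix w h v a -> Int -> Int -> a
clampedIndex m x y = index m (clamp 0 (w' - 1) x) (clamp 0 (h' - 1) y)
  where clamp lo hi = max lo . min hi
        w' = fromIntegral (natVal (Proxy :: Proxy w))
        h' = fromIntegral (natVal (Proxy :: Proxy h))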

I have a setup approximately like this:

let network = do
  -- inputs triggered from external events 
  let inputs :: M.Matrix 128 128 (Event Double)

  -- stencil used:
  let stencil :: M.Matrix 3 3 (Double -> Double)
      stencil = fmap ((*) . (/16)) $ M.fromList [1,2,1,2,4,2,1,2,1]
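      -- (each kernel entry k becomes the function (* (k/16)); the entries
      --  sum to 16, so this is the normalised 3x3 binomial kernel)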

  -- convolute matrix by applying stencil
  let convoluted = applyStencil stencil inputs

  -- collect events in order to display them later
  -- type: M.Matrix 128 128 (Behavior [Double])
  let behaviors = fmap eventToBehavior convoluted

  -- now there is a neat trick you can play because Matrix
  -- is Traversable and Behaviors are Applicative:
  -- type: Behavior (Matrix 128 128 [Double])
  return $ Data.Traversable.sequenceA behaviors
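
Spelled out, this is just sequenceA at a specialised type; nothing library specific is going on (written here in the same loose notation as above):

-- sequenceA :: (Traversable t, Applicative f) => t (f a) -> f (t a)
-- with t ~ Matrix 128 128 and f ~ Behavior this specialises to:
-- sequenceA :: Matrix 128 128 (Behavior [Double])
--           -> Behavior (Matrix 128 128 [Double])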

Using something like this I am triggering ~15k events/s with no problems and lots of headroom in that regard.

The problem is that as soon as I sample the network, I can only get about two samples per second out of it:

main :: IO ()
main = do

  -- initialize the network
  sample <- start network

  forever $ do

    -- not all of the 128*128 inputs are triggered each "frame"
    triggerInputs

    -- sample the network
    mat <- sample

    -- display the matrix somehow (actually with gloss)
    displayMatrix mat

So far I have made the following observations:

  • Profiling tells me that productivity is very low (4%-8%)
  • Most of the time is spent by the garbage collector in Gen 1 (~95%)
  • Data.Matrix.foldMap (i.e. fold) is allocating the most memory (~45%, as per -p)

  • When I was still working with reactive-banana, Heinrich Apfelmus recommended that tree-based traversals are a better fit for behaviors¹. I tried that for sequenceA, fold and traverse with no success (roughly the shape sketched after this list).

  • I suspected that the newtype wrapper was preventing Vector's fusion rules from firing². That is most likely not the culprit.
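
For reference, this is roughly the shape of the tree-based fold I tried (a sketch, not my exact code; the sequenceA and traverse variants were analogous):

import Data.Monoid ((<>))

-- split the list in half and fold both halves, so the mappends form a
-- balanced tree instead of one long right-nested chain
foldMapTree :: Monoid m => (a -> m) -> [a] -> m
foldMapTree _ []  = mempty
foldMapTree f [x] = f x
foldMapTree f xs  = foldMapTree f l <> foldMapTree f r
  where (l, r) = splitAt (length xs `div` 2) xs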

At this point I have spent the better part of a week searching for a solution to this problem. Intuitively I'd say that sampling should be much faster and that foldMap should not create so much garbage. Any ideas?

  • I have no idea what the problem is (because I know nothing about these FRP frameworks) but I don't think `traverse` and `sequenceA` and such are likely what he meant—those are actually based on list traversals. – dfeuer Nov 20 '14 at 16:09
  • @dfeuer `traverse` and `sequenceA` from `Data.Traversable`. I thought that was clear :) – fho Nov 20 '14 at 16:39
  • I don't really know reactive-banana but I wouldn't be surprised if each `mappend` and each `fmap` creates a new node in the network, and you could end up with millions of nodes. Additionally if reactive-banana performs some pruning of unobserved nodes you would see good performance if you never observe anything and bad performance if you do. I don't know if it does this in practice though. Anyway, doing numerics with reactive-banana sounds unwise. – Tom Ellis Nov 20 '14 at 22:01
  • I can't say anything about `ordrea`. But Tim is right, every matrix corresponds to `128*128 = 16384` nodes. You may be much better off using an event of matrices `Event (Matrix ...)` instead of a matrix of events `Matrix (Event ...)`. In the latter case, I think that `Vector` doesn't buy you anything at all, you could use nested lists and still get the same (abysmal) performance. – Heinrich Apfelmus Nov 20 '14 at 23:16
  • I think we are talking past each other here. I actually (physically) have 128*128 event sources that produce up to 20k events per second, but most of the time it is much less. I am hoping to use FRP techniques to structure the processing that is done on the event streams. – fho Nov 21 '14 at 08:43
  • @HeinrichApfelmus thanks for chiming in btw :) – fho Nov 21 '14 at 09:15
  • @Florian: I will believe you as soon as you use a plain list for the matrix of events, `[[Event Double]]`. :-) In general, your observation that one event occurrence should only trigger `3*3 = 9` other events is sound. However, you need to be careful about observing this. The standard `sequenceA` combinator will definitely not work, because it "touches" all behavior updates. You're trying to write an incremental algorithm here, and you have to make sure that every "incremental update" makes only small changes to the end result -- running time is proportional to the size of the change. – Heinrich Apfelmus Nov 21 '14 at 10:26
  • @HeinrichApfelmus actually I tried to replace the `Vector` with `Sequence` yesterday ... which actually gave even worse performance. Maybe my intuition about *ordrea's* `Events` is just wrong ... I treat them like `Listeners` in disguise. But I start to get the feeling that the whole convolution is done again each time I sample the network. – fho Nov 21 '14 at 11:55
  • @Florian Kind of. The thing is that the structure of the convolution is reflected in the Events -- it will be translated into a tree of `merge` calls. Essentially, the convolution is translated "automatically" into an incremental algorithm, and the performance of that algorithm depends on the structure of the one-time algorithm. Essentially, I am suggesting nested lists so that you have tight control over that structure -- `Vector` and `Sequence` will just do things you don't know. On top of that, you have the complication that you are now working with Behaviors, which adds even more unknown. – Heinrich Apfelmus Nov 21 '14 at 16:14
  • Implemented matrix based on nested lists: http://lpaste.net/5736269706772348928 ... profile is still dominated by GC. – fho Nov 25 '14 at 13:36
  • While I haven’t profiled, it looks to me like you’re building a lot of large intermediate data structures unnecessarily, particularly in `foldMapTree`, which splits a list and then does a non-tail-call operation on both lazily-evaluated lists. You might try a strict left fold or an eagerly-evaluated lazy right fold instead. – Davislor Jan 02 '18 at 01:42
