
Consider these definitions from a previous question:

type Algebra f a = f a -> a

cata :: Functor f => Algebra f b -> Fix f -> b
cata alg = alg . fmap (cata alg) . unFix

fixcata :: Functor f => Algebra f b -> Fix f -> b
fixcata alg = fix $ \f -> alg . fmap f . unFix

type CoAlgebra f a = a -> f a

ana :: Functor f => CoAlgebra f a -> a -> Fix f
ana coalg = Fix . fmap (ana coalg) . coalg

fixana :: Functor f => CoAlgebra f a -> a -> Fix f
fixana coalg = fix $ \f -> Fix . fmap f . coalg
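
For these snippets to stand alone, they also assume the usual fixed-point-of-a-functor newtype, which the question does not show. A minimal self-contained sketch (the numeral example and the name count are mine, added for illustration):

```haskell
-- Assumed context for the definitions above: the standard Fix newtype.
newtype Fix f = Fix { unFix :: f (Fix f) }

type Algebra f a = f a -> a

cata :: Functor f => Algebra f b -> Fix f -> b
cata alg = alg . fmap (cata alg) . unFix

-- Fix Maybe encodes Peano numerals; this algebra counts the Justs.
count :: Algebra Maybe Word
count Nothing  = 0
count (Just x) = succ x

main :: IO ()
main = print (cata count (Fix (Just (Fix (Just (Fix Nothing))))))  -- prints 2
```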

I ran some benchmarks and the results surprise me: criterion reports something like a tenfold speedup, specifically when -O2 is enabled. I wonder what causes such a massive improvement, and I am beginning to seriously doubt my benchmarking abilities.

This is the exact criterion code I use:

smallWord, largeWord :: Word
smallWord = 2^10
largeWord = 2^20

shortEnv, longEnv :: Fix Maybe
shortEnv = ana coAlg smallWord
longEnv = ana coAlg largeWord

benchCata = nf (cata alg)
benchFixcata = nf (fixcata alg)

benchAna = nf (ana coAlg)
benchFixana = nf (fixana coAlg)

main = defaultMain
    [ bgroup "cata"
        [ bgroup "short input"
            [ env (return shortEnv) $ \x -> bench "cata"    (benchCata x)
            , env (return shortEnv) $ \x -> bench "fixcata" (benchFixcata x)
            ]
        , bgroup "long input"
            [ env (return longEnv) $ \x -> bench "cata"    (benchCata x)
            , env (return longEnv) $ \x -> bench "fixcata" (benchFixcata x)
            ]
        ]
    , bgroup "ana"
        [ bgroup "small word"
            [ bench "ana" $ benchAna smallWord
            , bench "fixana" $ benchFixana smallWord
            ]
        , bgroup "large word"
            [ bench "ana" $ benchAna largeWord
            , bench "fixana" $ benchFixana largeWord
            ]
        ]
    ]

And some auxiliary code:

alg :: Algebra Maybe Word
alg Nothing = 0
alg (Just x) = succ x

coAlg :: CoAlgebra Maybe Word
coAlg 0 = Nothing
coAlg x = Just (pred x)

Compiled with -O0, the numbers are pretty even. With -O2, the fix-prefixed functions seem to outperform the plain ones:

benchmarking cata/short input/cata
time                 31.67 μs   (31.10 μs .. 32.26 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 31.20 μs   (31.05 μs .. 31.46 μs)
std dev              633.9 ns   (385.3 ns .. 1.029 μs)
variance introduced by outliers: 18% (moderately inflated)

benchmarking cata/short input/fixcata
time                 2.422 μs   (2.407 μs .. 2.440 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.399 μs   (2.388 μs .. 2.410 μs)
std dev              37.12 ns   (31.44 ns .. 47.06 ns)
variance introduced by outliers: 14% (moderately inflated)

I would appreciate it if someone could confirm this or spot a flaw.

(I compiled things with GHC 8.2.2 on this occasion.)


postscriptum

This post from back in 2012 elaborates on the performance of fix in quite fine detail. (Thanks to @chi for the link.)

    Note that [*recursion-schemes* defines `cata`](https://hackage.haskell.org/package/recursion-schemes-5.0.2/docs/src/Data-Functor-Foldable.html#Recursive) as `cata f = c where c = f . fmap c . project` (as opposed to `cata f = f . fmap (cata f) . project`) due to, I presume, the same issue discussed in the postscript link. Cf. also [*why pipes defines inner functions*](https://stackoverflow.com/q/31168743/2751851). – duplode Feb 09 '18 at 14:35
    Note that for the longer input, the `fix` variants only outperform their non-fix counterparts by a factor of 5 to 6. – Alec Feb 14 '18 at 19:26
    FWIW, allocations are similar for `-O0` (the non-`fix` variants have only 1.31x the allocations) but quite different for `-O2` (the non-`fix` variants have 2.75x the allocations). That seems to support the hypothesis that the non-`fix` variant might be recomputing some things instead of sharing them. – Alec Feb 14 '18 at 19:33

1 Answer


This is due to how the fixed point is computed by fix. This was pointed out by @duplode above (and by myself in a related question). Anyway, we can summarize the issue as follows.

We have that

fix f = f (fix f)

works, but makes a new call to fix f at every recursion. Instead,

fix f = go
   where go = f go

computes the same fixed point while avoiding that call. In the libraries, fix is implemented in this more efficient way.
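
As an illustrative aside (this snippet and the name ones are mine, not part of the original answer), the shared-binding fix ties a knot that is built once and reused, which is why knot-tied definitions like the following run in constant space:

```haskell
import Data.Function (fix)

-- fix (1 :) is the list ones satisfying ones = 1 : ones, a single
-- cyclic structure rather than a fresh cons cell per element.
ones :: [Int]
ones = fix (1 :)

main :: IO ()
main = print (take 5 ones)  -- prints [1,1,1,1,1]
```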

Back to the question, consider the following three implementations of cata:

cata :: Functor f => Algebra f b -> Fix f -> b
cata alg' = alg' . fmap (cata alg') . unFix

cata2 :: Functor f => Algebra f b -> Fix f -> b
cata2 alg' = go
   where
   go = alg' . fmap go . unFix

fixcata :: Functor f => Algebra f b -> Fix f -> b
fixcata alg' = fix $ \f -> alg' . fmap f . unFix

The first one makes a call to cata alg' at every recursion. The second one does not. The third one does not either, since the library fix is efficient.

And indeed, we can use Criterion to confirm this, even using the same test used by the OP:

benchmarking cata/short input/cata
time                 16.58 us   (16.54 us .. 16.62 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.62 us   (16.58 us .. 16.65 us)
std dev              111.6 ns   (89.76 ns .. 144.0 ns)

benchmarking cata/short input/cata2
time                 1.746 us   (1.742 us .. 1.749 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.741 us   (1.736 us .. 1.744 us)
std dev              12.69 ns   (10.50 ns .. 17.31 ns)

benchmarking cata/short input/fixcata
time                 2.010 us   (2.003 us .. 2.016 us)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.006 us   (2.001 us .. 2.011 us)
std dev              16.40 ns   (14.05 ns .. 19.27 ns)

Long inputs also show the improvement.

benchmarking cata/long input/cata
time                 119.3 ms   (113.4 ms .. 125.8 ms)
                     0.996 R²   (0.992 R² .. 1.000 R²)
mean                 119.8 ms   (117.7 ms .. 121.7 ms)
std dev              2.924 ms   (2.073 ms .. 4.064 ms)
variance introduced by outliers: 11% (moderately inflated)

benchmarking cata/long input/cata2
time                 17.89 ms   (17.43 ms .. 18.36 ms)
                     0.996 R²   (0.992 R² .. 0.999 R²)
mean                 18.02 ms   (17.49 ms .. 18.62 ms)
std dev              1.362 ms   (853.9 us .. 2.022 ms)
variance introduced by outliers: 33% (moderately inflated)

benchmarking cata/long input/fixcata
time                 18.03 ms   (17.56 ms .. 18.50 ms)
                     0.996 R²   (0.992 R² .. 0.999 R²)
mean                 18.17 ms   (17.57 ms .. 18.72 ms)
std dev              1.365 ms   (852.1 us .. 2.045 ms)
variance introduced by outliers: 33% (moderately inflated)

I also experimented with ana, observing that the performance of a similarly improved ana2 agrees with fixana. No surprises there either.
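
For reference, here is a self-contained sketch of what that ana2 might look like (only the name ana2 appears in the answer; the definition is my reconstruction of the same local-binding trick), checked with a round trip through cata2:

```haskell
newtype Fix f = Fix { unFix :: f (Fix f) }

type Algebra f a = f a -> a
type CoAlgebra f a = a -> f a

-- ana2: the local-binding variant of ana, by analogy with cata2
ana2 :: Functor f => CoAlgebra f a -> a -> Fix f
ana2 coalg = go
   where
   go = Fix . fmap go . coalg

cata2 :: Functor f => Algebra f b -> Fix f -> b
cata2 alg' = go
   where
   go = alg' . fmap go . unFix

-- round trip using the question's alg and coAlg
alg :: Algebra Maybe Word
alg Nothing  = 0
alg (Just x) = succ x

coAlg :: CoAlgebra Maybe Word
coAlg 0 = Nothing
coAlg x = Just (pred x)

main :: IO ()
main = print (cata2 alg (ana2 coAlg 100))  -- prints 100
```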

    Well, but there is still the recursive call to `go`. That it is not a top-level binding does not mean it is not a lambda abstraction. According to the benchmarks, it must be the case that top-level recursion causes a tenfold performance loss compared to `let`-recursion, but I will still be seeking the explanation of the underlying *computational* differences, be they on the STG level or even deeper. After all, `cata2` is no less recursive than `cata`, in that it contains recursion, "encapsulated" in `go` but nevertheless. – Ignat Insarov Feb 21 '18 at 02:59
  • In any case, it's great that we can now stop talking about particular functions and move on to researching why GHC optimizes top-level and `let`-level recursion differently. – Ignat Insarov Feb 21 '18 at 03:17
  • @Kindaro Well, `go` takes no arguments, while `cata` takes one. Recursions like `go = 1 : go` create an infinite list in constant space, while `f () = 1 : f ()` does the same in unbounded space. – chi Feb 21 '18 at 08:21
  • It kinda seems plausible, but I find it hard to wrap my mind around it. For one, I see no reason GHC could not optimize this, if it is so easily done by hand. – Ignat Insarov Feb 21 '18 at 08:51
    @Kindaro It is not always an optimization, space-wise. `let f () = ... in print (f ()) >> print somethingElse >> print (f ())` could free memory during the middle step, and reallocate it for the third step. Using `let x = ... in print x >> print somethingElse >> print x` will keep `x` in memory for the whole computation. This is useful to achieve memoization (which is what we need in the `fix` example), but that is not always desirable. Sometimes, we do want to throw away precomputed results (especially large ones) and recompute them later when needed. – chi Feb 21 '18 at 09:02
  • Well, if it gives tenfold speedup on any recursive function, I'd say it's a worthy tradeoff. – Ignat Insarov Feb 21 '18 at 11:09