Consider this much simpler case that doesn't involve circe or generic derivation at all:
```scala
package demo

import org.openjdk.jmh.annotations._

@State(Scope.Thread)
@BenchmarkMode(Array(Mode.Throughput))
class OrderingBench {
  val items: List[(Char, Int)] = List('z', 'y', 'x').zipWithIndex
  val tupleOrdering: Ordering[(Char, Int)] = implicitly

  @Benchmark
  def sortWithResolved(): List[(Char, Int)] = items.sorted

  @Benchmark
  def sortWithVal(): List[(Char, Int)] = items.sorted(tupleOrdering)
}
```
On Scala 2.11 on my desktop machine I get this:
```
Benchmark                        Mode  Cnt         Score        Error  Units
OrderingBench.sortWithResolved  thrpt   40  15940745.279 ± 102634.860  ops/s
OrderingBench.sortWithVal       thrpt   40  16420078.932 ± 102901.418  ops/s
```
And if you look at allocations the difference is a little bigger:
```
Benchmark                                            Mode  Cnt    Score  Error  Units
OrderingBench.sortWithResolved:gc.alloc.rate.norm  thrpt   20  176.000 ± 0.001   B/op
OrderingBench.sortWithVal:gc.alloc.rate.norm       thrpt   20  152.000 ± 0.001   B/op
```
You can tell what's going on by breaking out `reify`:

```scala
scala> val items: List[(Char, Int)] = List('z', 'y', 'x').zipWithIndex
items: List[(Char, Int)] = List((z,0), (y,1), (x,2))

scala> import scala.reflect.runtime.universe._
import scala.reflect.runtime.universe._

scala> showCode(reify(items.sorted).tree)
res0: String = $read.items.sorted(Ordering.Tuple2(Ordering.Char, Ordering.Int))
```
The `Ordering.Tuple2` here is a generic method that instantiates an `Ordering[(Char, Int)]`. This is exactly the same thing that happens when we define our `tupleOrdering`, but the difference is that in the `val` case it happens once, while in the case where it's resolved implicitly it happens every time `sorted` is called.
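You can observe the fresh allocation directly with a reference-equality check (a small sketch; `ResolutionDemo` is a made-up name, and this reflects Scala 2 behavior, where `Ordering.Tuple2` is an implicit `def`):

```scala
object ResolutionDemo {
  // Each implicit resolution expands to
  // Ordering.Tuple2(Ordering.Char, Ordering.Int),
  // which allocates a fresh Tuple2Ordering instance.
  def resolvedTwice: (Ordering[(Char, Int)], Ordering[(Char, Int)]) =
    (implicitly[Ordering[(Char, Int)]], implicitly[Ordering[(Char, Int)]])

  // Capturing the resolved instance in a val pays the allocation exactly once.
  val cached: Ordering[(Char, Int)] = implicitly[Ordering[(Char, Int)]]
}
```

Two separate resolutions give two distinct instances, while the `val` is always the same object.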
So the difference you're seeing is just the cost of instantiating the `Decoder` instance on every operation, as opposed to instantiating it a single time, outside the benchmarked code, before the benchmark starts. This cost is relatively tiny, and for larger benchmarks it will be harder to see.
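The same fix applies to any instance that's nontrivial to materialize: resolve it once into a `val` and pass it explicitly on the hot path. A minimal sketch under assumed names (`Point`, `sortPoints`, and `CachedInstance` are made up for illustration; in the circe case the `val` would hold the derived `Decoder` instead):

```scala
object CachedInstance {
  // Hypothetical record type standing in for a decoded case class.
  final case class Point(x: Int, y: Int)

  // Resolved and allocated once, at initialization time, rather than on
  // every call; this mirrors the sortWithVal benchmark above.
  val pointOrdering: Ordering[Point] = Ordering.by((p: Point) => (p.x, p.y))

  // Hot path: passes the cached instance explicitly.
  def sortPoints(ps: List[Point]): List[Point] = ps.sorted(pointOrdering)
}
```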