
I am learning the DropWizard Metrics library (formerly Coda Hale metrics) and I am confused as to when I should be using Meters vs Timers. According to the docs:

Meter: A meter measures the rate at which a set of events occur

and:

Timer: A timer is basically a histogram of the duration of a type of event and a meter of the rate of its occurrence

Based on these definitions, I can't discern the difference between the two. What's confusing me is that Timer is not used the way I would have expected it to be used. To me, a Timer is just that: a timer; it should measure the time difference between a start() and a stop(). But it appears that Timers also capture rates at which events occur, which feels like they are stepping on Meters' toes.

If I could see an example of what each component outputs that might help me understand when/where to use either of these.


1 Answer


You're confused in part because a DW Metrics Timer IS, among other things, a DW Metrics Meter.

A Meter is exclusively concerned with rates, measured in Hz (events per second). Each Meter results in 4 distinct rate metrics being published:

  • a mean (average) rate since Metrics was started
  • 1, 5 and 15-minute rolling mean rates

You use a Meter by marking events at different points in your code -- DW Metrics notes the wall-clock time of each call to mark() along with the number of events you passed it, and uses these to calculate the rate at which events are occurring:

Meter getRequests = registry.meter("some-operation.operations");
getRequests.mark(); // marks one event; mark() only ever adds to the count, it never resets it
int numberOfOps = doSomeNumberOfOperations(); // takes 10 seconds, returns 333
getRequests.mark(numberOfOps); // marks another 333 events in a single call

We would expect our rates to be roughly 33 Hz, as about 333 events were marked and the time between the two calls to mark() was 10 seconds.
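If you want to see exactly what a Meter exposes, you can read its values back off the object directly. A minimal, self-contained sketch (the metric name and the sleep loop are purely illustrative):

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class MeterOutputExample {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        Meter requests = registry.meter("some-operation.operations"); // illustrative name

        // Mark a few events over ~10 seconds so the rates are non-trivial
        // (the moving rates need a few seconds of runtime before they become non-zero).
        for (int i = 0; i < 100; i++) {
            requests.mark();
            Thread.sleep(100);
        }

        // A Meter exposes a count plus the 4 rate metrics listed above,
        // with all rates expressed in events per second.
        System.out.println("count          = " + requests.getCount());
        System.out.println("mean rate      = " + requests.getMeanRate());
        System.out.println("1-minute rate  = " + requests.getOneMinuteRate());
        System.out.println("5-minute rate  = " + requests.getFiveMinuteRate());
        System.out.println("15-minute rate = " + requests.getFifteenMinuteRate());
    }
}

In practice you rarely read these by hand; a reporter (console, JMX, Graphite, etc.) polls the registry and publishes them for you.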

A Timer calculates the 4 rate metrics above (considering each Timer.Context to be one event) and adds to them a number of additional metrics:

  • a count of the number of events
  • min, mean and max durations seen since the start of Metrics
  • standard deviation
  • a "histogram," recording the duration distributed at the 50th, 97th, 98th, 99th, and 99.95 percentiles

There are something like 15 total metrics reported for each Timer.

In short: Timers report a LOT of metrics, and they can be tricky to understand, but once you do they're a powerful way to spot spiky behavior.
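To get a feel for those ~15 values, you can pull them straight off a Timer and its Snapshot. A minimal sketch, assuming the metric name used later in this answer; note that the snapshot's durations are raw nanoseconds until a reporter converts them:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

public class TimerOutputExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer timer = registry.timer("costly-operation.service-time");

        // Time a handful of fake operations.
        for (int i = 0; i < 50; i++) {
            Timer.Context context = timer.time();
            busyWork(); // stand-in for the real operation
            context.stop();
        }

        // Rate metrics, exactly as a Meter would report them (events per second).
        // The moving rates warm up over a few seconds of runtime.
        System.out.println("count     = " + timer.getCount());
        System.out.println("mean rate = " + timer.getMeanRate());
        System.out.println("m1 rate   = " + timer.getOneMinuteRate());

        // Duration metrics come from the Timer's reservoir, in nanoseconds:
        Snapshot snapshot = timer.getSnapshot();
        System.out.println("min    = " + snapshot.getMin());
        System.out.println("mean   = " + snapshot.getMean());
        System.out.println("max    = " + snapshot.getMax());
        System.out.println("stddev = " + snapshot.getStdDev());
        System.out.println("p50    = " + snapshot.getMedian());
        System.out.println("p99    = " + snapshot.get99thPercentile());
        System.out.println("p999   = " + snapshot.get999thPercentile());
    }

    private static void busyWork() {
        long end = System.nanoTime() + 1_000_000; // spin for roughly 1 ms
        while (System.nanoTime() < end) { /* busy wait */ }
    }
}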


Fact is, just collecting the time spent between two points isn't a terribly useful metric. Consider: you have a block of code like this:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time();
costlyOperation(); // service time 10 ms
context.stop();

Let's assume that costlyOperation() has a constant cost, constant load, and operates on a single thread. Inside a 1 minute reporting period, we should expect to time this operation 6000 times. Obviously, we will not be reporting the actual service time over the wire 6000x -- instead, we need some way to summarize all those operations to fit our desired reporting window. DW Metrics' Timer does this for us, automatically, once a minute (our reporting period). After 5 minutes, our metrics registry would be reporting:

  • a mean rate of 100 (events per second)
  • a 1-minute mean rate of 100
  • a 5-minute mean rate of 100
  • a count of 30000 (total events seen)
  • a max of 10 (ms)
  • a min of 10
  • a mean of 10
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 10
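The one-minute reporting window above isn't something the Timer itself knows about; it comes from whatever reporter you attach to the registry. Here is a minimal sketch using the bundled ConsoleReporter (the other bundled reporters are configured in much the same way):

import com.codahale.metrics.ConsoleReporter;
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

public class ReportingExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer timer = registry.timer("costly-operation.service-time");

        // Dump every metric in the registry once a minute, with rates shown
        // per second and durations converted from nanoseconds to milliseconds.
        ConsoleReporter reporter = ConsoleReporter.forRegistry(registry)
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(1, TimeUnit.MINUTES);

        // ... the application keeps timing costlyOperation() with the timer ...
    }
}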

Now, let's consider we enter a period where occasionally our operation goes completely off the rails and blocks for an extended period:

Timer timer = registry.timer("costly-operation.service-time");
Timer.Context context = timer.time();
costlyOperation(); // usually takes 10 ms, but once every 1000 calls spikes to 1000 ms
context.stop();

Over a 1-minute collection period, we would now see fewer than 6000 executions, as every 1000th execution takes far longer; it works out to about 5505. After the first minute of this (6 minutes of total system time), we would now see:

  • a mean rate of 98 (events per second)
  • a 1-minute mean rate of 91.75
  • a 5-minute mean rate of 98.35
  • a count of 35505 (total events seen)
  • a max duration of 1000 (ms)
  • a min duration of 10
  • a mean duration of 10.13
  • a 50th percentile (p50) value of 10
  • a 99.9th percentile (p999) value of 1000

If you graph this, you'd see that most requests (the p50, p75, p99, etc.) were completing in 10 ms, but roughly one request in 1000 (captured by the p999) was taking 1 s. This would also show up as a slight reduction in the mean rate (about 2%) and a sizable reduction in the 1-minute mean rate (roughly 8%).

If you only look at the over-time mean values (either rate or duration), you'll never spot these spikes -- they get dragged into the background noise when averaged with a lot of normal operations. Similarly, just knowing the max isn't helpful, because it doesn't tell you how frequently the max occurs. This is why histograms are a powerful tool for tracking performance, and why DW Metrics' Timer publishes both a rate AND a histogram.
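To make that concrete, here is a small self-contained sketch that feeds the same 1-in-1000 spike into a Timer by hand via Timer.update() (purely for illustration; in real code the durations would come from Timer.Context as above) and then prints the mean next to the p999:

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;

import java.util.concurrent.TimeUnit;

public class SpikeExample {
    public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        Timer timer = registry.timer("costly-operation.service-time");

        // 999 "normal" 10 ms operations and a single 1000 ms spike.
        for (int i = 0; i < 999; i++) {
            timer.update(10, TimeUnit.MILLISECONDS);
        }
        timer.update(1000, TimeUnit.MILLISECONDS);

        // Snapshot durations are in nanoseconds; convert to ms for printing.
        Snapshot snapshot = timer.getSnapshot();
        System.out.printf("mean = %.2f ms%n", snapshot.getMean() / 1_000_000.0);
        System.out.printf("p50  = %.2f ms%n", snapshot.getMedian() / 1_000_000.0);
        System.out.printf("p999 = %.2f ms%n", snapshot.get999thPercentile() / 1_000_000.0);
        // The mean barely moves (about 11 ms), while the p999 surfaces the 1000 ms spike.
    }
}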

  • A nice additional piece of information is contained [here](http://metrics.dropwizard.io/3.1.0/manual/core/#histograms). Timers use an exponentially decaying reservoir by default, which means you are seeing data from roughly the last 5 minutes, with a bias towards newer data, for your histogram values. – markdsievers Feb 09 '16 at 01:22
  • Is there a way to get the execution time each time a code block is run? I need this apart from the average rates, mean rates and stuff. Is this possible? – AnOldSoul Aug 26 '16 at 15:06
  • If you simply want to meter and report service time for every execution of a block of code, that's not really what a metrics aggregation library is for. The "Reservoir" that holds metrics as they occur likely doesn't save enough context for your needs. Luckily, this is a problem you can easily solve with simple logging or a VM instrumentation tool like YourKit. – Matthew Mark Miller Sep 06 '16 at 21:46
  • So that would essentially mean that if I am annotating my method with @Timed, annotating it also with @Metered is redundant, because @Timed will do what @Metered does anyway? – Chirag Mar 22 '18 at 08:53
  • @MatthewMarkMiller But how can I get the average execution time in ms measured by the Timer for, say, the last 1/5/15 minutes? As I see it, the Timer only stores a 'Mean' value that aggregates stats over the whole application lifecycle, plus the 1/5/15-minute rates, but those show frequency (executions per second), not the average time. Let's say in the last 5 minutes I had 3 DB inserts with 1ms/2ms/3ms execution durations, so the average would be 2ms. How can I get this stat from the Timer? – MeetJoeBlack Jul 10 '18 at 21:46
  • Sadly, there isn't a built in meter for what you're asking for. The p_XX metrics (which present a rolling median, weighted towards the most recent values) are your best option. In your example, p_50 would report 2ms and p_75,95,99 etc would report 3ms, with a std_dev of 1ms. This would be sufficient to understand that most calls were under 2ms and no calls were over 3ms. – Matthew Mark Miller Jul 11 '18 at 18:26
  • I'm not sure if 1 min and 5 min gives you rolling mean rate. Their documentation states `Returns the five-minute exponentially-weighted moving average rate at which events have occurred since the meter was created.` – L. Dai Mar 25 '19 at 15:42