It's such a common problem but the answers are hard to find. I want to measure the performance of [ web server 95th percentile response time | API calls | algorithm performance | disk I/O | whatever ]. But, you know, that's a lot of data and I don't want to store it because this is used in production. Also, I don't want to spend a lot of CPU time calculating how slow my software is.

If you search for answers, you'll see many references to ancient algorithms that store a ton of data in bins or keep a large reservoir of random sample data. Common results include P-square and binmedian. Notice that it's hard to find any decent implementations: although they're commonly suggested, they're also garbage and nobody with a clue uses them.

You'll also find clever-sounding answers you can't implement because half the explanation is missing. Maybe if you were a stats major you'd understand this.

So what can I use to get cheap performance statistics? Algorithm and source code, please.

Sophit

1 Answer

Looking for algorithms means entering the realm of academia, so it helps to know the proper name of the problem. We're looking for a streaming algorithm, most likely a streaming quantile algorithm, although you may want other statistics too. Search for those phrases and you'll get more informed answers.

One easy answer is this paper, a collaboration between Amazon and academia describing the state of the art as of 2007. It provides a high-level view of the Greenwald-Khanna (GK) and Q-Digest algorithms, both of which you can actually find in libraries. This library by sengelha has C++ and JS implementations that look easy to use. The Intel Math Kernel Library implements Zhang 2007.
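To give a feel for what a GK-style summary does, here is a minimal Python sketch. It is a simplified variant written for illustration (it skips the "band" bookkeeping of the original paper, so the memory bound is weaker), and it is not taken from any of the libraries mentioned here. The class name `GKSketch`, its methods, and the default `eps` are all made up for this example. The idea: keep a sorted list of samples, each with a rank gap `g` and a rank uncertainty `delta`, and merge neighbours whenever that keeps the rank error within `eps * n`.

```python
import bisect
import math


class GKSketch:
    """Simplified Greenwald-Khanna style quantile summary (illustrative only)."""

    def __init__(self, eps=0.01):
        self.eps = eps
        self.n = 0
        self.values = []  # sampled values, kept sorted
        self.g = []       # g[i]: min-rank gap between tuple i-1 and tuple i
        self.delta = []   # delta[i]: max-rank minus min-rank of tuple i
        self.period = max(1, int(1 / (2 * eps)))  # how often to compress

    def add(self, x):
        i = bisect.bisect_right(self.values, x)
        if i == 0 or i == len(self.values):
            d = 0  # a new minimum or maximum has an exactly known rank
        else:
            # the new value inherits the rank uncertainty of its successor
            d = self.g[i] + self.delta[i] - 1
        self.values.insert(i, x)
        self.g.insert(i, 1)
        self.delta.insert(i, d)
        self.n += 1
        if self.n % self.period == 0:
            self._compress()

    def _compress(self):
        limit = math.floor(2 * self.eps * self.n)
        i = len(self.values) - 2
        while i >= 1:  # never remove the first (min) or last (max) tuple
            if self.g[i] + self.g[i + 1] + self.delta[i + 1] <= limit:
                self.g[i + 1] += self.g[i]  # fold tuple i into its successor
                del self.values[i], self.g[i], self.delta[i]
            i -= 1

    def quantile(self, q):
        if self.n == 0:
            raise ValueError("no data")
        target = math.ceil(q * self.n)
        slack = self.eps * self.n
        rmin = 0
        for i in range(len(self.values) - 1):
            rmin += self.g[i]
            # stop before the first tuple whose max rank overshoots the target
            if rmin + self.g[i + 1] + self.delta[i + 1] > target + slack:
                return self.values[i]
        return self.values[-1]
```

Usage is `sketch.add(latency)` on every request and `sketch.quantile(0.95)` whenever you want a reading. The returned value's rank should be within roughly `eps * n` of the true rank, so `eps=0.01` means the "95th percentile" is really somewhere between the 94th and 96th.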

While the sengelha library looks easy to use and is good enough for most needs, the world has moved on since 2007. A paper from this year (Amazon, Yahoo, and academia) describes the "lazy KLL" algorithm, which is implemented in the DataSketches library (C++, Java, Python).
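For intuition about how this newer family works, here is a toy Python sketch of a "compactor" hierarchy in the spirit of KLL. To be clear about what it is not: it is not the lazy KLL algorithm from the paper and not the DataSketches API. It uses one fixed capacity `k` for every level, whereas real KLL shrinks the capacity of lower levels. `CompactorSketch` and its method names are invented for this example. Each level holds items of weight `2**level`; a full level is sorted, every other item is thrown away at random, and the survivors move up with double weight. Two sketches merge by concatenating levels, which is what makes this style usable across distributed workers.

```python
import random


class CompactorSketch:
    """Toy mergeable quantile sketch built from fixed-size compactors."""

    def __init__(self, k=200, seed=None):
        self.k = k                      # capacity of each level (assumed even)
        self.levels = [[]]              # levels[h] holds items of weight 2**h
        self.rng = random.Random(seed)
        self.n = 0

    def add(self, x):
        self.levels[0].append(x)
        self.n += 1
        self._compact()

    def merge(self, other):
        while len(self.levels) < len(other.levels):
            self.levels.append([])
        for h, buf in enumerate(other.levels):
            self.levels[h].extend(buf)
        self.n += other.n
        self._compact()

    def _compact(self):
        h = 0
        while h < len(self.levels):
            buf = self.levels[h]
            if len(buf) >= self.k:
                buf.sort()
                # keep one item behind if the count is odd, so weight is conserved
                keep = [buf.pop()] if len(buf) % 2 else []
                offset = self.rng.randrange(2)  # random choice: odd or even items
                if h + 1 == len(self.levels):
                    self.levels.append([])
                self.levels[h + 1].extend(buf[offset::2])
                self.levels[h] = keep
            h += 1

    def quantile(self, q):
        if self.n == 0:
            raise ValueError("no data")
        items = sorted((v, 1 << h)
                       for h, buf in enumerate(self.levels) for v in buf)
        target = q * self.n
        seen = 0
        for v, w in items:
            seen += w                   # accumulated weight approximates rank
            if seen >= target:
                return v
        return items[-1][0]
```

Each worker would keep its own sketch and periodically ship it to an aggregator that calls `merge`. The error here is probabilistic rather than guaranteed, and it shrinks as `k` grows; for production use, the tuned library implementation is the safer choice.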

This information should be enough to let you generate quantile data from your software or even distributed software, and I hope others post even better answers.

Sophit