A fast, lock-free exponential histogram for Go that produces Prometheus native histogram output with bounded relative error and trapezoidal quantile estimation.
Schema 2, range [500ns, 60s]. Speed benchmarks on Intel Xeon @ 2.80GHz, 24 cores, x86_64, Linux. Accuracy across 12 distributions, 100k observations each.
| | goodhistogram | Prometheus native |
|---|---|---|
| Memory per histogram | 0.85 KB | 19 KB |
| Record (1 thread) | 20 ns/op | 132 ns/op |
| Mean quantile error | 0.35% | 1.00% |
Full benchmark results from a MacBook and a GCE worker are in reports/benchmark_comparison.txt.
CockroachDB maintains hundreds of histograms per node for latency distributions, operation sizes (log entries, compaction bytes), and resource utilization. These histograms sit on the hot path of every SQL statement, every KV request, and every Raft proposal.
We had been using Prometheus' classic histogram implementation to track and report this data with good results, albeit with somewhat more overhead than we would have liked. Recent work to move to Prometheus' new native histogram format surfaced these overhead concerns again and called the necessity of that overhead into question.
Out of that thinking came goodhistogram, a purpose-built replacement that:
- Matches Prometheus native bucket and export format.
- Strips memory and compute requirements to the bare minimum required.
- Introduces new recording and quantile computation improvements.
It's designed to be as lightweight, fast, and accurate as possible to meet the needs of a performance-sensitive database. Additionally, goodhistograms are mergeable, compressible, and thread-safe, making them ideal for high-throughput distributed systems.
```
go get github.com/cockroachdb/goodhistogram
```
```go
h := goodhistogram.New(goodhistogram.Params{
	Lo:         500,  // lower bound of tracked range (e.g. 500ns)
	Hi:         60e9, // upper bound (e.g. 60s in nanoseconds)
	ErrorBound: 0.10, // 10% relative error → schema 2
})

h.Record(1_500_000) // 1.5ms
h.Record(42_000)    // 42µs

snap := h.Snapshot()
p99 := snap.ValueAtQuantile(0.99)
mean := snap.Mean()
count, sum := snap.Total()

ph := snap.ToPrometheusHistogram() // *prometheusgo.Histogram
```
The returned proto includes both conventional cumulative buckets (for backward compatibility with classic Prometheus) and native histogram sparse fields (schema, spans, deltas). Because the internal bucket indices are Prometheus bucket keys by construction, export is a direct copy with no remapping.
goodhistogram takes inspiration from Prometheus native histograms, DDSketch, and base2histogram. At a high level, it borrows the exponential bucketing scheme from Prometheus and DDSketch, whose error bound acts as the input to the constructor; it uses the bucket-grouping concept found in both base2histogram and Prometheus for fast recording; and it uses the trapezoidal interpolation described by base2histogram to significantly improve quantile estimation.
What makes it different from these libraries is the following:
- It uses a dense array for optimal runtime operations while preserving the ability to emit sparse output.
- It performs constant-time lookups with a log-linear bucketing function.
- It allows configuring a precise error bound, which determines bucket density.
- It only allocates buckets within a parameterized expected range, minimizing memory.
goodhistogram uses the same exponential bucket scheme as Prometheus
native histograms. For a given schema s, each power-of-two octave is
divided into 2^s buckets with boundaries at 2^(j/2^s). The
schema is chosen as the coarsest one whose relative error
(gamma - 1)/(gamma + 1) is at or below the requested ErrorBound,
where gamma = 2^(2^(-s)).
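The schema selection described above can be sketched as follows. This is a minimal illustration, not the library's actual API; `schemaFor` is a hypothetical helper name, and the scan range assumes the schemas Prometheus native histograms support (-4 through 8):

```go
package main

import (
	"fmt"
	"math"
)

// schemaFor returns the coarsest schema whose relative error
// (gamma-1)/(gamma+1), with gamma = 2^(2^-s), is at or below errorBound.
// Schemas -4..8 are the ones Prometheus native histograms support.
func schemaFor(errorBound float64) int {
	for s := -4; s <= 8; s++ {
		gamma := math.Pow(2, math.Pow(2, float64(-s)))
		relErr := (gamma - 1) / (gamma + 1)
		if relErr <= errorBound {
			return s
		}
	}
	return 8
}

func main() {
	// For ErrorBound = 0.10: schema 1 gives ~17.2% relative error (too
	// coarse), schema 2 gives ~8.6%, so schema 2 is selected.
	fmt.Println(schemaFor(0.10)) // prints 2
}
```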
At construction, a fixed array of atomic.Uint64 counters is allocated
covering the range [Lo, Hi]. The number of buckets is determined by
the Prometheus bucket keys of Lo and Hi:
numBuckets = promBucketKey(Hi, schema) - promBucketKey(Lo, schema) + 1
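For positive values, the Prometheus bucket key is the index i such that v falls in (base^(i-1), base^i] with base = 2^(2^-s), i.e. i = ceil(log2(v) * 2^s). A simplified sketch of the bucket-count computation (the real mapping handles exact powers of two and floating-point edge cases via bit manipulation rather than math.Log2):

```go
package main

import (
	"fmt"
	"math"
)

// promBucketKey returns the Prometheus native-histogram bucket index for a
// positive v at the given schema: bucket i covers (base^(i-1), base^i] with
// base = 2^(2^-schema), so i = ceil(log2(v) * 2^schema).
func promBucketKey(v float64, schema int) int {
	return int(math.Ceil(math.Log2(v) * float64(int(1)<<schema)))
}

func main() {
	const schema = 2
	lo, hi := 500.0, 60e9 // [500ns, 60s] in nanoseconds
	numBuckets := promBucketKey(hi, schema) - promBucketKey(lo, schema) + 1
	fmt.Println(numBuckets) // prints 109
}
```

For the default range this yields 109 buckets; at 8 bytes per atomic counter that is 872 bytes, which is where the 0.85 KB figure in the table above comes from.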
Record(v) maps an int64 to a bucket index in O(1) without locks or
allocations:
- Convert to `float64` and extract the IEEE 754 bits.
- Compute the exponent from the biased exponent field: `exp = bits>>52 - 1022`.
- Look up the bucket index from a precomputed 256-entry table indexed by the top 8 bits of the mantissa.
- Combine: `key = bucket + (exp-1) * bucketsPerGroup`, then `idx = key - minKey`.
- `counts[idx].Add(1)`.
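The steps above can be sketched as follows. This is an illustrative reconstruction under the schema-2 parameters described here, not the library's code; `subBucket`, `bucketKey`, and the hard-coded `minKey` are assumptions for the example:

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

const (
	schema          = 2
	bucketsPerGroup = 1 << schema // 4 sub-buckets per power-of-two octave
)

// subBucket[i] gives the sub-bucket within one octave for a mantissa whose
// top 8 fraction bits are i, i.e. m ≈ 1 + i/256. Truncating the mantissa to
// 8 bits is what makes lookup constant time, and it is also the source of
// the ~0.45% of samples that land in a neighboring bucket.
var subBucket [256]int

func init() {
	for i := range subBucket {
		m := 1 + float64(i)/256
		subBucket[i] = int(math.Ceil(math.Log2(m) * bucketsPerGroup))
	}
}

// bucketKey maps a positive value to its (approximate) Prometheus bucket key
// in O(1): extract the IEEE 754 exponent and the top 8 mantissa bits, then
// combine the table lookup with the octave number.
func bucketKey(v float64) int {
	bits := math.Float64bits(v)
	exp := int(bits>>52) - 1022    // v is in [2^(exp-1), 2^exp)
	top8 := int(bits >> 44 & 0xff) // top 8 bits of the 52-bit mantissa
	return subBucket[top8] + (exp-1)*bucketsPerGroup
}

func main() {
	// Record is then a key translation plus a single atomic increment.
	// minKey = 36 is the key of Lo = 500ns at schema 2 (illustrative).
	minKey := 36
	counts := make([]atomic.Uint64, 109)
	counts[bucketKey(1500.0)-minKey].Add(1) // 1.5µs

	fmt.Println(bucketKey(1.5)) // prints 3: sub-bucket 3 of octave 0
}
```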
The novelty of this implementation is the precomputed bucket lookup table, which enables constant-time bucket lookup alongside Prometheus-compatible export, a combination the libraries above provide only separately.
ValueAtQuantile uses trapezoidal interpolation rather than the
uniform-density (log-linear) interpolation used by Prometheus:
- Compute average density in each bucket: count / width.
- Estimate density at each boundary by averaging adjacent buckets.
- Within the target bucket, model the density as varying linearly from left to right (a trapezoid). The CDF is the integral of this linear function, which is a quadratic.
- Solve the quadratic to find the value at the target rank.
This produces more accurate estimates when the true density is not uniform within a bucket, which is nearly always the case for real-world distributions. This approach is described well in the base2histogram link cited above.
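The steps above can be sketched as a standalone function. This is a simplified illustration of the technique, not the library's implementation; `quantileTrapezoid` is a hypothetical helper operating on explicit bucket boundaries, and the rescaling step (so the linear density still integrates to the bucket count) is one reasonable way to reconcile averaged boundary densities with the observed counts:

```go
package main

import (
	"fmt"
	"math"
)

// quantileTrapezoid estimates the q-quantile of a bucketed histogram using
// trapezoidal interpolation. bounds has len(counts)+1 entries.
func quantileTrapezoid(bounds, counts []float64, q float64) float64 {
	total := 0.0
	for _, c := range counts {
		total += c
	}
	rank := q * total

	// Average density in each bucket: count / width.
	dens := make([]float64, len(counts))
	for i, c := range counts {
		dens[i] = c / (bounds[i+1] - bounds[i])
	}

	cum := 0.0
	for i, c := range counts {
		if cum+c < rank {
			cum += c
			continue
		}
		w := bounds[i+1] - bounds[i]
		// Boundary densities: average of adjacent bucket densities.
		dL, dR := dens[i], dens[i]
		if i > 0 {
			dL = (dens[i-1] + dens[i]) / 2
		}
		if i < len(counts)-1 {
			dR = (dens[i] + dens[i+1]) / 2
		}
		// Rescale so the linear density still integrates to the count.
		if s := (dL + dR) / 2 * w; s > 0 {
			dL *= c / s
			dR *= c / s
		}
		// The CDF within the bucket is quadratic:
		//   F(x) = dL*x + (dR-dL)*x^2/(2w) = rank - cum
		rem := rank - cum
		a := (dR - dL) / (2 * w)
		if a == 0 {
			return bounds[i] + rem/dL
		}
		x := (-dL + math.Sqrt(dL*dL+4*a*rem)) / (2 * a)
		return bounds[i] + x
	}
	return bounds[len(bounds)-1]
}

func main() {
	// Density falls across the middle bucket, so the median lands slightly
	// left of the bucket midpoint, where uniform interpolation would put it.
	fmt.Println(quantileTrapezoid([]float64{0, 1, 2, 4}, []float64{10, 20, 10}, 0.5))
}
```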
As a final note on the design, goodhistogram makes two tradeoffs that
theoretically affect quantile accuracy. The first is that it does not lock on
snapshot, which means that snapshots can be inconsistent. Inconsistent in this
case means that the TotalSum seen for the histogram can exceed the sum of its
parts. In practice, the likelihood of this is low enough, and the magnitude of
the effect small enough, that no detectable difference should be seen.
Bucket lookup is imprecise as well. Because of the techniques used to make bucket lookup constant-time, ~0.45% of samples are rounded into a neighboring bucket. The resulting distortion, however, is small (0.02%) compared to the performance gains it allows. The distortions are similar to those seen in non-linear bucketing formats, which likewise optimize for lookup performance.
It should be noted that real-world performance of this structure is likely to look much closer to the single-threaded numbers than the concurrently accessed ones. The benchmarks drive throughput far higher than any single histogram would see in CockroachDB, which means the likelihood of any two requests entering the critical region simultaneously is exceedingly low, at least with the database as written today.
We should also note that these results are not the actual floor observed. The benchmarks above were run on x86, as it is today the most common chip architecture on cloud machines. Re-running the test on ARM-based machines widens the difference between goodhistogram and Prometheus:
| | goodhistogram | Prometheus native |
|---|---|---|
| Record (1 thread) | 12.9 ns/op | 103 ns/op |
This is the result of ARM's optimistic atomic instructions, which allow increments without fencing. The gap widens even further on an Apple Silicon Mac:
| | goodhistogram | Prometheus native |
|---|---|---|
| Record (1 thread) | 2.6 ns/op | 58.1 ns/op |