
Classic-only histogram: consider synchronized block instead of multi-LongAdder for observe() hot path #1915

@zeitlinger


Context

While benchmarking the Prometheus shim PoC (bridging the Prometheus client API to the OTel SDK), I found that observe() on classic-only histograms is about 30% faster through the OTel SDK than through native Prometheus (7.3 ns vs. 10.5 ns per call).

Benchmark numbers (JMH, single thread)

| Path | observe() latency |
| --- | --- |
| Native Prometheus (classic-only) | 10.5 ns |
| OTel SDK (explicit bucket histogram) | 7.3 ns |

Root cause

Native Prometheus doObserve() uses 3 separate CAS-based atomics per call:

  1. classicBuckets[i].add(1) (LongAdder)
  2. sum.add(value) (DoubleAdder)
  3. count.increment() (LongAdder)

Plus a buffer.append() CAS attempt and volatile reads for reset/scale-down state.
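
For illustration, the three updates listed above look roughly like the sketch below. This is a simplified stand-in, not the actual client_java code: the class name `AdderHistogramSketch`, the linear bucket lookup, and the accessor methods are assumptions made for the example, and the buffer and reset/scale-down handling are left out entirely.

```java
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of an adder-based classic histogram (not the real class).
final class AdderHistogramSketch {
    private final double[] upperBounds;        // sorted finite upper bounds
    private final LongAdder[] classicBuckets;  // one extra slot for +Inf
    private final DoubleAdder sum = new DoubleAdder();
    private final LongAdder count = new LongAdder();

    AdderHistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.classicBuckets = new LongAdder[upperBounds.length + 1];
        for (int i = 0; i < classicBuckets.length; i++) {
            classicBuckets[i] = new LongAdder();
        }
    }

    // Three independent atomic updates per observation.
    void observe(double value) {
        classicBuckets[bucketIndex(value)].add(1);
        sum.add(value);
        count.increment();
    }

    // Index of the first bucket whose upper bound is >= value; last slot is +Inf.
    private int bucketIndex(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length;
    }

    long count() { return count.sum(); }
    double sum() { return sum.sum(); }
    long bucketCount(int i) { return classicBuckets[i].sum(); }
}
```

Each of the three calls performs its own atomic update, so even a single-threaded observe() pays for three of them, and a reader calling `count()` and `sum()` may see values from different moments unless extra coordination is added.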

The OTel SDK uses a single synchronized block with plain +=/++ arithmetic:

synchronized (lock) {
    this.sum += value;
    this.count++;
    this.counts[bucketIndex]++;
    // min/max tracking
}

In uncontended (single-thread) benchmarks, HotSpot can take the lock on its cheap fast path (no inflation to a heavyweight monitor) and optimize the plain arithmetic inside the block freely, which beats the multi-CAS approach.

Suggestion

For classic-only histograms (where nativeInitialSchema == CLASSIC_HISTOGRAM), consider an alternative doObserve() implementation that uses a synchronized block with plain fields instead of multiple LongAdder/DoubleAdder instances. The buffer mechanism (needed for native histogram scale-down) could also be bypassed in classic-only mode.

This wouldn't affect native or hybrid histograms, which still need the current design.
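
A minimal sketch of what such a classic-only path could look like, assuming a fixed set of bucket upper bounds. The class name `SynchronizedClassicHistogramSketch`, the `lock` field, and the accessors are hypothetical and not part of client_java; min/max tracking and the reset logic are omitted.

```java
// Hypothetical sketch of a classic-only histogram guarded by one lock.
final class SynchronizedClassicHistogramSketch {
    private final Object lock = new Object();
    private final double[] upperBounds;  // sorted finite upper bounds
    private final long[] counts;         // one extra slot for +Inf
    private double sum;
    private long count;

    SynchronizedClassicHistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.counts = new long[upperBounds.length + 1];
    }

    void observe(double value) {
        // The bucket lookup only reads immutable state, so it stays outside the lock.
        int bucketIndex = bucketIndex(value);
        synchronized (lock) {
            sum += value;
            count++;
            counts[bucketIndex]++;
        }
    }

    // Index of the first bucket whose upper bound is >= value; last slot is +Inf.
    private int bucketIndex(double value) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length;
    }

    long count() { synchronized (lock) { return count; } }
    double sum() { synchronized (lock) { return sum; } }

    // Copy taken under the lock, so all bucket counts come from the same moment.
    long[] bucketCounts() { synchronized (lock) { return counts.clone(); } }
}
```

One side effect of this design is that a collector can read sum, count, and buckets under the same lock and get a consistent snapshot without a separate buffering step.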

Multi-threaded consideration

The LongAdder approach was chosen for multi-threaded scalability (striped cells reduce contention). A synchronized block would serialize threads. However:

  • Most real-world observe() calls happen on different label-value combinations (different data points), so contention on a single data point is rare
  • Even under contention, the critical section is very short (~5 ns of arithmetic), so lock hold time is minimal
  • A benchmark with 4 threads would clarify the actual tradeoff
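
As a rough starting point for that comparison, the harness below hammers a single data point from several threads using both update styles. It is only a sketch: `ContentionSketch`, `Recorder`, and the two recorder classes are made-up names, the three counters merely stand in for bucket, sum, and count, and there is no warmup or forking, so the timings are indicative at best. A proper measurement would use JMH with `@Threads(4)`.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical contention harness; not a substitute for a JMH benchmark.
final class ContentionSketch {
    interface Recorder {
        void record();
        long total();
    }

    // Three separate adders, mirroring the bucket/sum/count updates.
    static final class AdderRecorder implements Recorder {
        private final LongAdder bucket = new LongAdder();
        private final LongAdder sum = new LongAdder();
        private final LongAdder count = new LongAdder();
        public void record() { bucket.add(1); sum.add(1); count.increment(); }
        public long total() { return count.sum(); }
    }

    // Plain fields updated under one lock.
    static final class LockedRecorder implements Recorder {
        private final Object lock = new Object();
        private long bucket, sum, count;
        public void record() { synchronized (lock) { bucket++; sum++; count++; } }
        public long total() { synchronized (lock) { return count; } }
    }

    // Starts all threads together and returns the elapsed wall time in nanoseconds.
    static long run(Recorder recorder, int threads, int perThread) {
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    recorder.record();
                }
            });
            workers[t].start();
        }
        long begin = System.nanoTime();
        start.countDown();
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) {
        int threads = 4, perThread = 5_000_000;
        System.out.println("adders ns/op: " + (double) run(new AdderRecorder(), threads, perThread) / perThread);
        System.out.println("locked ns/op: " + (double) run(new LockedRecorder(), threads, perThread) / perThread);
    }
}
```

This deliberately models the worst case where every thread hits the same data point; spreading the threads over several recorders would approximate the different-label-values case from the first bullet.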

Not a high priority — 10.5 ns is already excellent. But worth considering if classic histogram performance matters.
