Lazy-allocate error latency histogram on AggregateEntry#11478
Open
dougqh wants to merge 1 commit into
Open
Conversation
Each AggregateEntry allocated two DDSketchHistograms in its constructor (ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store, roughly 60-80 bytes per histogram even when empty. Most spans aren't errors, so most entries' errorLatencies sit empty for life. Now the field starts null. recordOneDuration lazy-allocates on the first error; if no error ever lands on the entry, it stays null and ~80 bytes of empty-histogram overhead are reclaimed. Across a full 2048-entry table that's ~150 KB if 95% of entries never error -- the typical case. For the wire format, SerializingMetricWriter caches the serialized form of an empty histogram (~17 bytes) on first use and writes those cached bytes when an entry's errorLatencies is null. The cache is per-writer (not a global static) so each writer instance picks up the Histograms factory state at the time of its first report, avoiding races with test setup that registers the DDSketch factory at varying points. Trade-off: entries that DO see an error retain the histogram across clear() (just cleared, not nulled), so always-erroring entries allocate exactly once. Same total allocation as before for that case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
Contributor
🟢 Java Benchmark SLOs — All performance SLOs passed
PR vs. master resultsStartup Time
Commit: Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion. |
This was referenced May 27, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
errorLatencieshistogram allocation until the first error is recorded on an entry. Most entries never see an error in their lifetime; previously each one carried a ~60-80 byte emptyDDSketchHistogramfor life.SerializingMetricWritercaches the serialized form of an empty histogram (~17 bytes) and emits those cached bytes when an entry'serrorLatenciesisnull, so the wire format is byte-identical to before.Background
Extracted from #11389, where the same change was bundled with cardinality- and peer-tag-related work. This PR is just the lazy-errorLatencies piece; it sits between #11382 and #11387 so it can ship without depending on the cardinality machinery in #11387.
Trade-off
Entries that do see an error retain the histogram across
clear()(cleared, not nulled). An always-erroring entry allocates exactly once. Same total allocation as before for that path.Throughput benchmarks
This is a heap-footprint change, not a CPU one — the consumer's hot path is unchanged. The bench suite was re-run anyway as a sanity check to confirm no throughput regression vs the #11382 base. Same machine state and JMH config as the rest of the stack's runs (8 producer threads, 2×15s warmup + 5×15s, 1 fork, throughput mode).
AdversarialHighCardinalityResourceHighCardinalityPeer#11478 vs #11382is within the per-run error bar on every bench (0.94×–0.97×) — statistically indistinguishable. The CPU-side hot path didn't change:recordOneDurationnow callserrorLatenciesForWrite()instead of reading a final field, but that's a single-field-load-and-branch on every entry's first error and a direct field load thereafter, which the JIT inlines flat.aggregateDroppedcounts are also in line with #11382, confirming the lazy field doesn't perturb the table-cap behavior.The actual win — the ~150 KB heap reclamation at full table cap when 95% of entries never error — isn't observable in a throughput bench. It would show up in
jol-based per-entry footprint inspection (one fewer histogram per entry) or in a long-running profile of allocated-bytes-per-cycle (errorLatencies allocation amortizes from "one per unique key" to "one per unique error-emitting key").Test plan
:dd-trace-core:test— metrics tests pass🤖 Generated with Claude Code