Lazy-allocate error latency histogram on AggregateEntry by dougqh · Pull Request #11478 · DataDog/dd-trace-java

dougqh · 2026-05-27T18:34:39Z

Summary

Defer errorLatencies histogram allocation until the first error is recorded on an entry. Most entries never see an error in their lifetime; previously each one carried a ~60-80 byte empty DDSketchHistogram for life.
Across a full 2048-entry table, saves ~150 KB if 95% of entries never error (the typical case).
SerializingMetricWriter caches the serialized form of an empty histogram (~17 bytes) and emits those cached bytes when an entry's errorLatencies is null, so the wire format is byte-identical to before.
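The empty-histogram caching described above can be sketched as follows. This is a minimal illustration, not the actual SerializingMetricWriter code: `SketchHistogram`, `CachingWriterSketch`, and the `"hist:<count>"` encoding are hypothetical stand-ins, and the real serialized form is the DDSketch encoding rather than a string.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the real histogram type; only what the sketch needs.
final class SketchHistogram {
  private int count;

  void accept(long nanos) {
    count++;
  }

  // Stand-in encoding, NOT the real DDSketch wire format.
  byte[] serialize() {
    return ("hist:" + count).getBytes(StandardCharsets.UTF_8);
  }
}

// Hypothetical writer that caches the bytes of an empty histogram per instance.
final class CachingWriterSketch {
  // Filled on first use; an instance field (not static), mirroring the per-writer cache.
  private byte[] emptyHistogramBytes;
  private final ByteArrayOutputStream out = new ByteArrayOutputStream();

  void writeErrorLatencies(SketchHistogram errorLatencies) {
    byte[] bytes;
    if (errorLatencies == null) {
      // Entry never saw an error: emit the cached empty form.
      if (emptyHistogramBytes == null) {
        emptyHistogramBytes = new SketchHistogram().serialize();
      }
      bytes = emptyHistogramBytes;
    } else {
      bytes = errorLatencies.serialize();
    }
    out.write(bytes, 0, bytes.length);
  }

  byte[] toByteArray() {
    return out.toByteArray();
  }
}

final class CachingWriterDemo {
  public static void main(String[] args) {
    CachingWriterSketch lazy = new CachingWriterSketch();
    lazy.writeErrorLatencies(null); // lazy path: field was never allocated

    CachingWriterSketch eager = new CachingWriterSketch();
    eager.writeErrorLatencies(new SketchHistogram()); // old path: real empty histogram

    // Both paths produce the same bytes, so this prints "true".
    System.out.println(Arrays.equals(lazy.toByteArray(), eager.toByteArray()));
  }
}
```

Keeping the cache on the writer instance rather than in a static field means each writer serializes its own empty histogram the first time it needs one.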

Background

Extracted from #11389, where the same change was bundled with cardinality- and peer-tag-related work. This PR is just the lazy-errorLatencies piece; it sits between #11382 and #11387 so it can ship without depending on the cardinality machinery in #11387.

Trade-off

Entries that do see an error retain the histogram across clear() (cleared, not nulled). An always-erroring entry allocates exactly once. Same total allocation as before for that path.
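A minimal sketch of the allocate-once, clear-without-nulling behavior, assuming a single consumer thread touches each entry. `LatencyHistogram`, `LazyEntrySketch`, the `recordOneDuration(long, boolean)` signature, and `errorLatenciesOrNull()` are hypothetical; only the names `errorLatencies`, `errorLatenciesForWrite()`, `recordOneDuration`, and `clear()` come from this PR.

```java
// Hypothetical stand-in for DDSketchHistogram; it only counts samples.
final class LatencyHistogram {
  private long count;

  void accept(long nanos) {
    count++;
  }

  void clear() {
    count = 0;
  }

  long count() {
    return count;
  }
}

// Hypothetical entry; assumes single-threaded access, so no synchronization.
final class LazyEntrySketch {
  private final LatencyHistogram okLatencies = new LatencyHistogram();
  private LatencyHistogram errorLatencies; // starts null, allocated on first error

  private LatencyHistogram errorLatenciesForWrite() {
    LatencyHistogram h = errorLatencies;
    if (h == null) {
      h = new LatencyHistogram();
      errorLatencies = h;
    }
    return h;
  }

  // Assumed signature for illustration only.
  void recordOneDuration(long nanos, boolean error) {
    if (error) {
      errorLatenciesForWrite().accept(nanos);
    } else {
      okLatencies.accept(nanos);
    }
  }

  void clear() {
    okLatencies.clear();
    if (errorLatencies != null) {
      errorLatencies.clear(); // cleared, not nulled: the instance is reused
    }
  }

  LatencyHistogram errorLatenciesOrNull() {
    return errorLatencies;
  }
}

final class LazyEntryDemo {
  public static void main(String[] args) {
    LazyEntrySketch entry = new LazyEntrySketch();
    entry.recordOneDuration(1_000L, false);
    System.out.println(entry.errorLatenciesOrNull() == null); // true: no error yet

    entry.recordOneDuration(2_000L, true);
    LatencyHistogram first = entry.errorLatenciesOrNull();
    entry.clear();
    entry.recordOneDuration(3_000L, true);
    System.out.println(first == entry.errorLatenciesOrNull()); // true: same instance reused
  }
}
```

Because `clear()` keeps the instance, an entry that errors in every reporting cycle pays for the allocation once, which matches the eager behavior for that path.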

Throughput benchmarks

This is a heap-footprint change, not a CPU one — the consumer's hot path is unchanged. The bench suite was re-run anyway as a sanity check to confirm no throughput regression vs the #11382 base. Same machine state and JMH config as the rest of the stack's runs (8 producer threads, 2×15s warmup + 5×15s, 1 fork, throughput mode).

Bench (ops/s)	v1.62.0	master	#11382	this PR (#11478)
`Adversarial`	444,290 ± 1,616,937	14,276,351 ± 1,091,138	32,556,300 ± 4,321,490	30,609,314 ± 6,944,664
`HighCardinalityResource`	4,854,335 ± 1,214,233	8,168,005 ± 3,493,716	35,739,452 ± 2,556,684	34,552,088 ± 4,687,212
`HighCardinalityPeer`	6,902,209 ± 368,641	10,110,142 ± 3,380,594	37,638,634 ± 6,673,337	35,491,425 ± 4,970,576

#11478 vs #11382 is within the per-run error bar on every bench (0.94×–0.97×), so the two are statistically indistinguishable. The CPU-side hot path did not meaningfully change: recordOneDuration now calls errorLatenciesForWrite() instead of reading a final field, but that is a single field load and branch, with the allocation happening only on an entry's first error, which the JIT inlines flat. aggregateDropped counts are also in line with #11382, confirming the lazy field doesn't perturb the table-cap behavior.

The actual win — the ~150 KB heap reclamation at full table cap when 95% of entries never error — isn't observable in a throughput bench. It would show up in jol-based per-entry footprint inspection (one fewer histogram per entry) or in a long-running profile of allocated-bytes-per-cycle (errorLatencies allocation amortizes from "one per unique key" to "one per unique error-emitting key").
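The ~150 KB figure can be checked with back-of-the-envelope arithmetic from the numbers quoted in this PR (2048 entries, 95% never erroring, roughly 60-80 bytes per empty histogram). The class and method names below are hypothetical, and the per-histogram sizes are the PR's estimates rather than measured values.

```java
// Rough estimate only; the inputs are the figures quoted in the PR description.
final class FootprintEstimate {
  static long savedBytes(int tableEntries, int percentNeverErroring, int bytesPerEmptyHistogram) {
    // Integer arithmetic: 2048 * 95 / 100 truncates to 1945 entries.
    long neverErroring = (long) tableEntries * percentNeverErroring / 100;
    return neverErroring * bytesPerEmptyHistogram;
  }

  public static void main(String[] args) {
    System.out.println(savedBytes(2048, 95, 60)); // prints 116700
    System.out.println(savedBytes(2048, 95, 80)); // prints 155600
  }
}
```

The upper end of the range, 155,600 bytes (about 152 KiB), is what lines up with the quoted ~150 KB; at 60 bytes per histogram the estimate is closer to 114 KiB.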

Test plan

:dd-trace-core:test — metrics tests pass
No behavior change to the client-stats wire payload

🤖 Generated with Claude Code

Each AggregateEntry allocated two DDSketchHistograms in its constructor (ok + error latencies). DDSketchHistogram wraps a DDSketch + lazy store, roughly 60-80 bytes per histogram even when empty. Most spans aren't errors, so most entries' errorLatencies sit empty for life.

Now the field starts null. recordOneDuration lazy-allocates on the first error; if no error ever lands on the entry, it stays null and ~80 bytes of empty-histogram overhead are reclaimed. Across a full 2048-entry table that's ~150 KB if 95% of entries never error -- the typical case.

For the wire format, SerializingMetricWriter caches the serialized form of an empty histogram (~17 bytes) on first use and writes those cached bytes when an entry's errorLatencies is null. The cache is per-writer (not a global static) so each writer instance picks up the Histograms factory state at the time of its first report, avoiding races with test setup that registers the DDSketch factory at varying points.

Trade-off: entries that DO see an error retain the histogram across clear() (just cleared, not nulled), so always-erroring entries allocate exactly once. Same total allocation as before for that case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

datadog-datadog-prod-us1-2 · 2026-05-27T18:48:24Z

⚠️ Warnings

🚦 1 Pipeline job failed

DataDog/apm-reliability/dd-trace-java | agent_integration_tests

4 failed tests due to IllegalAccessError at MetricsIntegrationTest.groovy:44.

This comment will be updated automatically if new data arrives.

Commit SHA: f2ee559

dd-octo-sts · 2026-05-27T18:53:10Z

🟢 Java Benchmark SLOs — All performance SLOs passed

Suite	Status
Startup	🟢 pass

SLO thresholds are defined here based on automatically generated metrics. A warning is raised when results are within 5% of the threshold.

PR vs. master results

Startup Time

Scenario	This PR	master	Change
insecure-bank / iast	13,994 ms	13,967 ms	+0.2%
insecure-bank / tracing	12,866 ms	13,083 ms	-1.7%
petclinic / appsec	16,513 ms	16,176 ms	+2.1%
petclinic / iast	16,525 ms	15,798 ms	+4.6%
petclinic / profiling	15,574 ms	16,489 ms	-5.5%
petclinic / tracing	14,872 ms	15,684 ms	-5.2%

Commit: f2ee559c · CI Pipeline · Benchmarking Platform UI

Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.

This was referenced May 27, 2026

Per-component / tag cardinality limits in client-side stats #11387

Draft

Memory-efficiency pass on ClientStatsAggregator + adversarial benchmark #11389

Draft

Add span-derived primary tags (CSS v1.3.0) #11402

Draft

dougqh marked this pull request as ready for review May 27, 2026 19:25

dougqh requested a review from a team as a code owner May 27, 2026 19:25

dougqh requested a review from amarziali May 27, 2026 19:25

dd-octo-sts Bot added the `tag: ai generated` label (Largely based on code generated by an AI or LLM) May 27, 2026

dougqh wants to merge 1 commit into dougqh/optimize-metric-key from dougqh/lazy-error-latencies
