perf(memory): per-AggOp memory profile tool + worst-offender report

The Phase 12.9 CI tripwire at `crates/beava-core/tests/per_entity_size_dump.rs` enforces `size_of::<AggOp>() ≤ 80 B`, but that only measures the inline enum slot. Variants that box their state (`Box<UDDSketch>`, `Box<TrendResidualState>`, etc.) still cost 80 B inline plus unbounded heap that the tripwire never sees. The static `bytes_per_entity_p99 = 7000` placeholder in `/metrics` (`crates/beava-server/src/http_admin.rs:124`) is a guess; we don't actually know which ops dominate.
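
To make the gap concrete, here is a minimal sketch of why `size_of` on an enum says nothing about boxed payloads. `Op` and `BigSketch` are hypothetical stand-ins, not the real `AggOp` or `UDDSketch`.

```rust
use std::mem::size_of;

// Hypothetical stand-in for a heavy sketch; not the real UDDSketch layout.
#[allow(dead_code)]
struct BigSketch {
    buckets: [u64; 64], // 512 bytes of payload
}

// Hypothetical stand-in for AggOp: one inline variant, one boxed variant.
#[allow(dead_code)]
enum Op {
    Sum(f64),
    Sketch(Box<BigSketch>),
}

/// Bytes of the inline enum slot: what a `size_of` tripwire measures.
fn inline_bytes() -> usize {
    size_of::<Op>()
}

/// Bytes behind the `Box`: invisible to `size_of::<Op>()`.
fn boxed_payload_bytes() -> usize {
    size_of::<BigSketch>()
}
```

The inline slot stays pointer-sized plus a tag no matter how large `BigSketch` grows, which is the blind spot a per-op profiler would need to cover.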

Build a per-op profiler that takes a populated `AggOp` variant and returns:

- `stack_bytes` — always 80 (the enum slot)
- `heap_bytes` — recursive into `Box` / `Vec` / `HashMap` / sketch internals
- `breakdown` — structured per data-structure-type within state (so we can see "Sum spends N bytes on `Box<WindowedOp>` overhead, UDDSketch spends X bytes on bucket map, EWMA spends Y bytes on Welford triple")
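
One possible shape for that return value, plus a recursive heap-size trait, is sketched below. Everything here is an assumption for illustration: `OpProfile`, `HeapSize`, and `profile` are hypothetical names, and the `HashMap` figure is a lower bound that ignores hashbrown control bytes, key-owned heap, and allocator padding.

```rust
use std::collections::HashMap;
use std::mem::size_of;

/// Hypothetical report shape mirroring the three fields listed above.
#[derive(Debug, Default)]
pub struct OpProfile {
    pub stack_bytes: usize,
    pub heap_bytes: usize,
    pub breakdown: Vec<(&'static str, usize)>,
}

/// Hypothetical trait: bytes owned on the heap, excluding the value's own inline size.
pub trait HeapSize {
    fn heap_bytes(&self) -> usize;
}

impl HeapSize for f64 {
    fn heap_bytes(&self) -> usize { 0 }
}

impl HeapSize for u64 {
    fn heap_bytes(&self) -> usize { 0 }
}

impl<T: HeapSize> HeapSize for Box<T> {
    fn heap_bytes(&self) -> usize {
        // The boxed value itself lives on the heap, plus whatever it owns.
        size_of::<T>() + (**self).heap_bytes()
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_bytes(&self) -> usize {
        // Count reserved capacity, not just len: unused slots are still allocated.
        self.capacity() * size_of::<T>()
            + self.iter().map(|item| item.heap_bytes()).sum::<usize>()
    }
}

impl<K, V: HeapSize, S> HeapSize for HashMap<K, V, S> {
    fn heap_bytes(&self) -> usize {
        // Lower bound only: ignores control bytes and any heap owned by keys.
        self.capacity() * (size_of::<K>() + size_of::<V>())
            + self.values().map(|v| v.heap_bytes()).sum::<usize>()
    }
}

/// Build a single-entry profile for one piece of state.
pub fn profile<T: HeapSize>(label: &'static str, enum_slot: usize, state: &T) -> OpProfile {
    let heap = state.heap_bytes();
    OpProfile { stack_bytes: enum_slot, heap_bytes: heap, breakdown: vec![(label, heap)] }
}
```

A real implementation would likely add one `breakdown` entry per internal structure rather than a single label, so that per-structure costs stay visible in the report.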

Run it against the fraud-team workload (the realistic 14-op / 110-feature mix per `fraud-team.json` config) at steady state and emit a sorted table.

## Suspected offenders to verify

- `sum`, `mean`, `count` — small state, but each may pay an outsized `Box` pointer overhead relative to useful bytes. Specific concern that motivated this issue.
- Sketches (UDDSketch, HLL, count-min, bloom) — known-heavy but with workload-dependent variance.
- Windowed wrappers — every windowed op adds a `Vec` of bucket states; the overhead amortizes badly for short windows.
- `TrendResidual`, `BurstCount` — already flagged in v0.1 deferrals for borderline boxing (would drop the AggOp floor 80→64 B if all heavy variants box out).
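
The pointer-overhead and short-window concerns can be framed as a bookkeeping-to-payload ratio. The sketch below is illustrative only: both helper functions are hypothetical, and they count just the `Box` pointer or `Vec` header, not allocator metadata or the real beava state types.

```rust
use std::mem::size_of;

/// Fraction of total bytes spent on the `Box` pointer rather than the payload.
fn box_overhead_fraction<T>() -> f64 {
    let ptr = size_of::<Box<T>>() as f64;
    let payload = size_of::<T>() as f64;
    ptr / (ptr + payload)
}

/// Fraction of total bytes spent on the `Vec` header for a window of `buckets` entries.
fn windowed_overhead_fraction<T>(buckets: usize) -> f64 {
    let header = size_of::<Vec<T>>() as f64;
    let payload = (buckets * size_of::<T>()) as f64;
    header / (header + payload)
}
```

Under these assumptions a boxed `f64` spends a much larger share on the pointer than a boxed `[f64; 16]`, and a two-bucket window spends a larger share on the `Vec` header than a 64-bucket one. The profiler's job is to replace these back-of-envelope ratios with measured numbers.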

## Done when

- A new binary at `crates/beava-bench/src/bin/memprofile.rs` runs the profile against the fraud-team config + writes `memory-profile-fraud-team.md` with the sorted op-by-op table.
- The top 5 memory offenders have a concrete byte breakdown + a one-line recommendation per op (keep / box smaller / restructure).
- An assertion verifies `Σ per-op heap bytes ≈ /metrics bytes_per_entity_p99` so reality and the Prometheus value stay coherent (separate bug if they diverge).
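
A rough sketch of the report and coherence check follows. `Row`, `render_table`, and `coherent` are hypothetical names, and the relative tolerance is an assumption, since the issue does not say how close "≈" needs to be.

```rust
/// One line of the report; the field set is an assumption.
pub struct Row {
    pub op: String,
    pub stack_bytes: usize,
    pub heap_bytes: usize,
}

/// Sort rows by heap usage, largest first, and render a markdown table.
pub fn render_table(rows: &mut [Row]) -> String {
    rows.sort_by(|a, b| b.heap_bytes.cmp(&a.heap_bytes));
    let mut out = String::from("| op | stack_bytes | heap_bytes |\n|---|---:|---:|\n");
    for r in rows.iter() {
        out.push_str(&format!("| {} | {} | {} |\n", r.op, r.stack_bytes, r.heap_bytes));
    }
    out
}

/// True when the summed per-op heap bytes fall within `rel_tol` of the reported value.
pub fn coherent(rows: &[Row], reported: usize, rel_tol: f64) -> bool {
    let total: usize = rows.iter().map(|r| r.heap_bytes).sum();
    let diff = (total as f64 - reported as f64).abs();
    diff <= rel_tol * reported as f64
}
```

In the binary, `reported` would come from whatever `/metrics` exposes for `bytes_per_entity_p99`, and a failed check would be logged as the separate bug mentioned above rather than silently adjusted.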

## Sibling work this unlocks

Replacing the static `bytes_per_entity_p99` placeholder with live dynamic sampling (the Phase 12.8 D-04 deferral) becomes trivial once this profiler is built. File that separately after this lands and shows what to instrument.

~250 LOC + the generated report. Cohort Track-1 sized — meaty perf measurement work, no architectural decisions, plays directly to the pointer-overhead accounting question.

perf(memory): per-AggOp memory profile tool + worst-offender report #68
