bench(memory): durable peak-RSS benchmark + docs for the memory-scaling work by igerber · Pull Request #569 · igerber/diff-diff

igerber · 2026-06-28T15:28:48Z

Summary

Adds benchmarks/speed_review/bench_memory_scaling.py - a durable, subprocess-isolated peak-RSS benchmark (ru_maxrss, median of repeats) for the memory-scaling work, covering the CallawaySantAnna / EfficientDiD / HeterogeneousAdoptionDiD bootstraps and the TwoWayFixedEffects within-transform fit at 500k-1M units.
Adds a "Memory scaling" section to docs/performance-plan.md with the honest before/after table, and lists the bench in the speed_review README.
Closes the 3-PR memory-scaling initiative (B1 perf(callaway-santanna): chunk multiplier-bootstrap weight generation to bound peak memory #561 / B2 perf(efficient-did, had): chunk multiplier-bootstrap weight generation to bound peak memory #563 / C perf(within-transform): drop the redundant full-frame copy in within_transform #567).

Measured pre-#561 (un-optimized) vs current main, 999 bootstrap reps, DIFF_DIFF_BACKEND=python, Apple Silicon:

Config	Before	After	Reduction
CallawaySantAnna bootstrap, 500k units	12.9 GB	2.1 GB	-84%
CallawaySantAnna bootstrap, 1M units	13.5 GB	3.0 GB	-78%
EfficientDiD bootstrap, 500k units	8.3 GB	1.6 GB	-81%
HeterogeneousAdoptionDiD event-study cband, 500k units	7.7 GB	1.2 GB	-84%
TwoWayFixedEffects (hc1) fit, 500k units x 6 cov	1.0 GB	0.93 GB	-8%
TwoWayFixedEffects (hc1) fit, 1M units x 6 cov	1.7 GB	1.5 GB	-9%

Methodology references

N/A - benchmark + docs only, no estimator/methodology code change. The measured wins come from already-merged perf(callaway-santanna): chunk multiplier-bootstrap weight generation to bound peak memory #561 / perf(efficient-did, had): chunk multiplier-bootstrap weight generation to bound peak memory #563 / perf(within-transform): drop the redundant full-frame copy in within_transform #567 (all bit-identical at atol=0).

Validation

Tests added/updated: none (harness + docs). The harness is validated by its --quick smoke; the before/after numbers were measured against the pre-perf(callaway-santanna): chunk multiplier-bootstrap weight generation to bound peak memory #561 tree (throwaway worktree) vs current main, median of repeats.
Evidence: the before/after table plus the committed baseline JSONs (benchmarks/speed_review/baselines/memory_scaling_{before,after}.json).

Security / privacy

Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

github-actions · 2026-06-28T15:31:22Z

Overall Assessment

✅ Looks good — No unmitigated P0/P1 findings. This PR does not change estimator code, assumptions, weighting, variance/SE logic, or defaults. The issues below are benchmark/documentation quality items.

Executive Summary

No methodology implementation changes were introduced; affected methods are only benchmarked/documented: CallawaySantAnna, EfficientDiD, HeterogeneousAdoptionDiD, and TwoWayFixedEffects.
No inline inference, NaN/Inf inference guard, control-group, aggregation, or new-parameter propagation issues were introduced.
Deferred items referenced in the new docs are already tracked in TODO.md, so they are not blockers.
I found two P2 documentation/reproducibility issues in the benchmark writeup/harness and one P3 CLI robustness issue.
Static compile and JSON parsing passed; runtime smoke could not run in this review environment because numpy is not installed.

Methodology

Finding 1 — P2: New docs overstate bit-identical bootstrap output guarantees

Location: docs/performance-plan.md:L21-L25, source context diff_diff/bootstrap_chunking.py:L10-L15

Impact: The new performance doc says every memory-scaling change is “bit-identical” and that only peak memory changes. For the multiplier-bootstrap chunking path, the in-code source material is more precise: generated weight streams are bit-identical, but downstream weights @ influence statistics are not guaranteed bit-for-bit because BLAS reduction order can vary by block shape. This is not an estimator implementation defect, but the new doc overstates the numerical contract for CS / EfficientDiD / HAD bootstrap paths.

Concrete fix: Reword the doc to match bootstrap_chunking.py: weight generation is bit-identical; outputs are mathematically equivalent and may differ only by floating-point reassociation unless a specific test/configuration proves exact equality.

Code Quality

Finding 2 — P3: CLI accepts invalid benchmark filters silently

Location: benchmarks/speed_review/bench_memory_scaling.py:L227-L238, benchmarks/speed_review/bench_memory_scaling.py:L157-L169

Impact: --only typo produces an empty result set rather than an error, and --repeats 0 can reach statistics.median([]). This is minor, but it can create misleading empty baseline JSONs or confusing failures.

Concrete fix: Add choices=("cs", "effdid", "had", "twfe") for --only, validate --repeats >= 1, and error if filtering leaves no configs.

Performance

Finding 3 — P2: Benchmark claims Python-backend numbers but does not enforce or record backend

Location: benchmarks/speed_review/bench_memory_scaling.py:L160-L164, benchmarks/speed_review/bench_memory_scaling.py:L257-L260, docs/performance-plan.md:L21-L25

Impact: The docs state the committed table was measured with DIFF_DIFF_BACKEND=python, but the new benchmark subprocess inherits the caller environment and the output JSON records only platform. In environments where the backend defaults to auto and Rust is available, reruns can silently produce non-comparable memory baselines.

Concrete fix: Add a --backend {python,rust,auto} option defaulting to python for this benchmark, pass it through env in subprocess.run, and write both requested and resolved backend metadata into the JSON.

Maintainability

No maintainability blockers found. The benchmark is isolated and does not add shared-library surface area.

Tech Debt

Finding 4 — P3 informational: Deferred work is properly tracked

Location: docs/performance-plan.md:L42-L44, TODO.md:L66-L67

Impact: The new docs defer stratified survey-PSU tiling and _iterative_demean vectorization, and both are already tracked in TODO.md. Per project policy, this is acceptable and not a blocker.

Concrete fix: None required.

Security

No findings. The new baseline JSONs and benchmark script do not contain secrets, credentials, or PII.

Documentation/Tests

No blocking findings. The benchmark file compiles and the baseline JSON files parse successfully. I could not run --quick in this environment because numpy is missing, so runtime validation remains unverified here.

…ng work Adds benchmarks/speed_review/bench_memory_scaling.py -- a subprocess-isolated peak-RSS sweep (ru_maxrss, median of repeats) for the bootstrap-chunking and within-transform memory wins (B1 #561 / B2 #563 / C #567) -- plus a "Memory scaling" section in docs/performance-plan.md with the honest before/after table. Measured pre-#561 (un-optimized) vs current main, 999 bootstrap reps: - CallawaySantAnna bootstrap: 500k 12.9->2.1 GB (-84%), 1M 13.5->3.0 GB (-78%) - EfficientDiD bootstrap 500k: 8.3->1.6 GB (-81%) - HeterogeneousAdoptionDiD cband 500k: 7.7->1.2 GB (-84%) - TWFE within-transform fit: 500k 1.0->0.93 GB (-8%), 1M 1.7->1.5 GB (-9%) Closes the 3-PR memory-scaling initiative. The baseline is re-measured against the pre-#561 tree (not a monkeypatch of current code, which would be contaminated by the same-line optimizations). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-28T16:16:51Z

🔁 AI review rerun (requested by @igerber)

Head SHA: e49de685bec25c4586b869058bf14076b5b9ac50

Overall Assessment

✅ Looks good — No unmitigated P0/P1 findings.

Executive Summary

This PR adds benchmark/docs/baseline data only; it does not change estimator math, assumptions, weighting, variance/SE, inference, or defaults.
Affected methods are benchmarked/documented only: CallawaySantAnna, EfficientDiD, HeterogeneousAdoptionDiD, and TwoWayFixedEffects.
Prior review items are addressed: bootstrap numerical contract is now accurate, CLI filters/repeats are validated, and backend metadata is enforced/recorded.
No new inline inference, NaN/Inf guard, control-group, aggregation, or parameter-propagation issues found.
AST parse and baseline JSON parsing passed; runtime smoke could not run here because numpy is not installed.

Methodology

No findings.

The new docs now match the in-code bootstrap chunking contract: generated weight streams are bit-identical, while downstream weights @ influence results may differ by BLAS reassociation rather than bit-for-bit equality. See docs/performance-plan.md:L21-L26 and diff_diff/bootstrap_chunking.py:L10-L15.

Cross-checks used: CS bootstrap/IF requirements in docs/methodology/REGISTRY.md:L490-L502, EfficientDiD bootstrap implementation note in docs/methodology/REGISTRY.md:L1043-L1052, HAD scope/assumptions in docs/methodology/REGISTRY.md:L2828-L2849, and TWFE within-transform contract in docs/methodology/REGISTRY.md:L323-L336.

Code Quality

No findings.

Prior CLI robustness issue is resolved: --only now uses choices, --repeats < 1 errors, and empty config sets error out at benchmarks/speed_review/bench_memory_scaling.py:L244-L266.

Performance

No findings.

Prior backend reproducibility issue is resolved: the benchmark defaults to --backend python, passes DIFF_DIFF_BACKEND into worker subprocesses, captures resolved backend, and writes backend metadata to JSON at benchmarks/speed_review/bench_memory_scaling.py:L167-L183 and benchmarks/speed_review/bench_memory_scaling.py:L246-L304. Baselines also record requested/resolved backend at benchmarks/speed_review/baselines/memory_scaling_after.json:L2-L6.

Maintainability

No findings.

The benchmark is isolated under benchmarks/speed_review/ and does not add shared library surface area.

Tech Debt

P3 informational: Deferred items are tracked

Impact: The docs mention remaining stratified survey-PSU tiling and _iterative_demean vectorization as deferred work. Both are already tracked in TODO.md, so this is not a blocker under the project’s deferred-work policy.

Concrete fix: None required. Locations: docs/performance-plan.md:L47-L49, TODO.md:L65-L67.

Security

No findings.

No secret-like strings were found in the changed benchmark/docs/baseline files.

Documentation/Tests

No blocking findings.

The new README entry and performance-plan section are consistent with the benchmark and baselines. Validation performed: AST parse passed for benchmarks/speed_review/bench_memory_scaling.py; both baseline JSON files parse successfully. Runtime --quick smoke was not runnable in this review environment because numpy is missing.

igerber force-pushed the perf/memory-scaling-benchmark branch from 717b478 to b495739 Compare June 28, 2026 16:14

igerber force-pushed the perf/memory-scaling-benchmark branch from b495739 to e49de68 Compare June 28, 2026 16:14

igerber added the ready-for-ci Triggers CI test workflows label Jun 28, 2026

igerber merged commit 18131f9 into main Jun 28, 2026
11 of 12 checks passed

igerber deleted the perf/memory-scaling-benchmark branch June 28, 2026 16:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bench(memory): durable peak-RSS benchmark + docs for the memory-scaling work#569

bench(memory): durable peak-RSS benchmark + docs for the memory-scaling work#569
igerber merged 1 commit into
mainfrom
perf/memory-scaling-benchmark

igerber commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Jun 28, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

github-actions Bot commented Jun 28, 2026

Overall Assessment

Executive Summary

Methodology

Finding 1 — P2: New docs overstate bit-identical bootstrap output guarantees

Code Quality

Finding 2 — P3: CLI accepts invalid benchmark filters silently

Performance

Finding 3 — P2: Benchmark claims Python-backend numbers but does not enforce or record backend

Maintainability

Tech Debt

Finding 4 — P3 informational: Deferred work is properly tracked

Security

Documentation/Tests

Uh oh!

github-actions Bot commented Jun 28, 2026

Overall Assessment

Executive Summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

P3 informational: Deferred items are tracked

Security

Documentation/Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant