Skip to content

bench(memory): durable peak-RSS benchmark + docs for the memory-scaling work#569

Merged
igerber merged 1 commit into
mainfrom
perf/memory-scaling-benchmark
Jun 28, 2026
Merged

bench(memory): durable peak-RSS benchmark + docs for the memory-scaling work#569
igerber merged 1 commit into
mainfrom
perf/memory-scaling-benchmark

Conversation

@igerber

@igerber igerber commented Jun 28, 2026

Copy link
Copy Markdown
Owner

Summary

Measured pre-#561 (un-optimized) vs current main, 999 bootstrap reps, DIFF_DIFF_BACKEND=python, Apple Silicon:

Config Before After Reduction
CallawaySantAnna bootstrap, 500k units 12.9 GB 2.1 GB -84%
CallawaySantAnna bootstrap, 1M units 13.5 GB 3.0 GB -78%
EfficientDiD bootstrap, 500k units 8.3 GB 1.6 GB -81%
HeterogeneousAdoptionDiD event-study cband, 500k units 7.7 GB 1.2 GB -84%
TwoWayFixedEffects (hc1) fit, 500k units x 6 cov 1.0 GB 0.93 GB -8%
TwoWayFixedEffects (hc1) fit, 1M units x 6 cov 1.7 GB 1.5 GB -9%

Methodology references

Validation

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

@github-actions

Copy link
Copy Markdown

Overall Assessment

Looks good — No unmitigated P0/P1 findings. This PR does not change estimator code, assumptions, weighting, variance/SE logic, or defaults. The issues below are benchmark/documentation quality items.

Executive Summary

  • No methodology implementation changes were introduced; affected methods are only benchmarked/documented: CallawaySantAnna, EfficientDiD, HeterogeneousAdoptionDiD, and TwoWayFixedEffects.
  • No inline inference, NaN/Inf inference guard, control-group, aggregation, or new-parameter propagation issues were introduced.
  • Deferred items referenced in the new docs are already tracked in TODO.md, so they are not blockers.
  • I found two P2 documentation/reproducibility issues in the benchmark writeup/harness and one P3 CLI robustness issue.
  • Static compile and JSON parsing passed; runtime smoke could not run in this review environment because numpy is not installed.

Methodology

Finding 1 — P2: New docs overstate bit-identical bootstrap output guarantees

Location: docs/performance-plan.md:L21-L25, source context diff_diff/bootstrap_chunking.py:L10-L15

Impact: The new performance doc says every memory-scaling change is “bit-identical” and that only peak memory changes. For the multiplier-bootstrap chunking path, the in-code source material is more precise: generated weight streams are bit-identical, but downstream weights @ influence statistics are not guaranteed bit-for-bit because BLAS reduction order can vary by block shape. This is not an estimator implementation defect, but the new doc overstates the numerical contract for CS / EfficientDiD / HAD bootstrap paths.

Concrete fix: Reword the doc to match bootstrap_chunking.py: weight generation is bit-identical; outputs are mathematically equivalent and may differ only by floating-point reassociation unless a specific test/configuration proves exact equality.

Code Quality

Finding 2 — P3: CLI accepts invalid benchmark filters silently

Location: benchmarks/speed_review/bench_memory_scaling.py:L227-L238, benchmarks/speed_review/bench_memory_scaling.py:L157-L169

Impact: --only typo produces an empty result set rather than an error, and --repeats 0 can reach statistics.median([]). This is minor, but it can create misleading empty baseline JSONs or confusing failures.

Concrete fix: Add choices=("cs", "effdid", "had", "twfe") for --only, validate --repeats >= 1, and error if filtering leaves no configs.

Performance

Finding 3 — P2: Benchmark claims Python-backend numbers but does not enforce or record backend

Location: benchmarks/speed_review/bench_memory_scaling.py:L160-L164, benchmarks/speed_review/bench_memory_scaling.py:L257-L260, docs/performance-plan.md:L21-L25

Impact: The docs state the committed table was measured with DIFF_DIFF_BACKEND=python, but the new benchmark subprocess inherits the caller environment and the output JSON records only platform. In environments where the backend defaults to auto and Rust is available, reruns can silently produce non-comparable memory baselines.

Concrete fix: Add a --backend {python,rust,auto} option defaulting to python for this benchmark, pass it through env in subprocess.run, and write both requested and resolved backend metadata into the JSON.

Maintainability

No maintainability blockers found. The benchmark is isolated and does not add shared-library surface area.

Tech Debt

Finding 4 — P3 informational: Deferred work is properly tracked

Location: docs/performance-plan.md:L42-L44, TODO.md:L66-L67

Impact: The new docs defer stratified survey-PSU tiling and _iterative_demean vectorization, and both are already tracked in TODO.md. Per project policy, this is acceptable and not a blocker.

Concrete fix: None required.

Security

No findings. The new baseline JSONs and benchmark script do not contain secrets, credentials, or PII.

Documentation/Tests

No blocking findings. The benchmark file compiles and the baseline JSON files parse successfully. I could not run --quick in this environment because numpy is missing, so runtime validation remains unverified here.

@igerber igerber force-pushed the perf/memory-scaling-benchmark branch from 717b478 to b495739 Compare June 28, 2026 16:14
…ng work

Adds benchmarks/speed_review/bench_memory_scaling.py -- a subprocess-isolated
peak-RSS sweep (ru_maxrss, median of repeats) for the bootstrap-chunking and
within-transform memory wins (B1 #561 / B2 #563 / C #567) -- plus a "Memory
scaling" section in docs/performance-plan.md with the honest before/after table.

Measured pre-#561 (un-optimized) vs current main, 999 bootstrap reps:
- CallawaySantAnna bootstrap: 500k 12.9->2.1 GB (-84%), 1M 13.5->3.0 GB (-78%)
- EfficientDiD bootstrap 500k: 8.3->1.6 GB (-81%)
- HeterogeneousAdoptionDiD cband 500k: 7.7->1.2 GB (-84%)
- TWFE within-transform fit: 500k 1.0->0.93 GB (-8%), 1M 1.7->1.5 GB (-9%)

Closes the 3-PR memory-scaling initiative. The baseline is re-measured against
the pre-#561 tree (not a monkeypatch of current code, which would be contaminated
by the same-line optimizations).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the perf/memory-scaling-benchmark branch from b495739 to e49de68 Compare June 28, 2026 16:14
@github-actions

Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e49de685bec25c4586b869058bf14076b5b9ac50


Overall Assessment

Looks good — No unmitigated P0/P1 findings.

Executive Summary

  • This PR adds benchmark/docs/baseline data only; it does not change estimator math, assumptions, weighting, variance/SE, inference, or defaults.
  • Affected methods are benchmarked/documented only: CallawaySantAnna, EfficientDiD, HeterogeneousAdoptionDiD, and TwoWayFixedEffects.
  • Prior review items are addressed: bootstrap numerical contract is now accurate, CLI filters/repeats are validated, and backend metadata is enforced/recorded.
  • No new inline inference, NaN/Inf guard, control-group, aggregation, or parameter-propagation issues found.
  • AST parse and baseline JSON parsing passed; runtime smoke could not run here because numpy is not installed.

Methodology

No findings.

The new docs now match the in-code bootstrap chunking contract: generated weight streams are bit-identical, while downstream weights @ influence results may differ by BLAS reassociation rather than bit-for-bit equality. See docs/performance-plan.md:L21-L26 and diff_diff/bootstrap_chunking.py:L10-L15.

Cross-checks used: CS bootstrap/IF requirements in docs/methodology/REGISTRY.md:L490-L502, EfficientDiD bootstrap implementation note in docs/methodology/REGISTRY.md:L1043-L1052, HAD scope/assumptions in docs/methodology/REGISTRY.md:L2828-L2849, and TWFE within-transform contract in docs/methodology/REGISTRY.md:L323-L336.

Code Quality

No findings.

Prior CLI robustness issue is resolved: --only now uses choices, --repeats < 1 errors, and empty config sets error out at benchmarks/speed_review/bench_memory_scaling.py:L244-L266.

Performance

No findings.

Prior backend reproducibility issue is resolved: the benchmark defaults to --backend python, passes DIFF_DIFF_BACKEND into worker subprocesses, captures resolved backend, and writes backend metadata to JSON at benchmarks/speed_review/bench_memory_scaling.py:L167-L183 and benchmarks/speed_review/bench_memory_scaling.py:L246-L304. Baselines also record requested/resolved backend at benchmarks/speed_review/baselines/memory_scaling_after.json:L2-L6.

Maintainability

No findings.

The benchmark is isolated under benchmarks/speed_review/ and does not add shared library surface area.

Tech Debt

P3 informational: Deferred items are tracked

Impact: The docs mention remaining stratified survey-PSU tiling and _iterative_demean vectorization as deferred work. Both are already tracked in TODO.md, so this is not a blocker under the project’s deferred-work policy.

Concrete fix: None required. Locations: docs/performance-plan.md:L47-L49, TODO.md:L65-L67.

Security

No findings.

No secret-like strings were found in the changed benchmark/docs/baseline files.

Documentation/Tests

No blocking findings.

The new README entry and performance-plan section are consistent with the benchmark and baselines. Validation performed: AST parse passed for benchmarks/speed_review/bench_memory_scaling.py; both baseline JSON files parse successfully. Runtime --quick smoke was not runnable in this review environment because numpy is missing.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Jun 28, 2026
@igerber igerber merged commit 18131f9 into main Jun 28, 2026
11 of 12 checks passed
@igerber igerber deleted the perf/memory-scaling-benchmark branch June 28, 2026 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant