bench(memory): durable peak-RSS benchmark + docs for the memory-scaling work#569
Conversation
Overall Assessment✅ Looks good — No unmitigated P0/P1 findings. This PR does not change estimator code, assumptions, weighting, variance/SE logic, or defaults. The issues below are benchmark/documentation quality items. Executive Summary
MethodologyFinding 1 — P2: New docs overstate bit-identical bootstrap output guaranteesLocation: Impact: The new performance doc says every memory-scaling change is “bit-identical” and that only peak memory changes. For the multiplier-bootstrap chunking path, the in-code source material is more precise: generated weight streams are bit-identical, but downstream Concrete fix: Reword the doc to match Code QualityFinding 2 — P3: CLI accepts invalid benchmark filters silentlyLocation: Impact: Concrete fix: Add PerformanceFinding 3 — P2: Benchmark claims Python-backend numbers but does not enforce or record backendLocation: Impact: The docs state the committed table was measured with Concrete fix: Add a MaintainabilityNo maintainability blockers found. The benchmark is isolated and does not add shared-library surface area. Tech DebtFinding 4 — P3 informational: Deferred work is properly trackedLocation: Impact: The new docs defer stratified survey-PSU tiling and Concrete fix: None required. SecurityNo findings. The new baseline JSONs and benchmark script do not contain secrets, credentials, or PII. Documentation/TestsNo blocking findings. The benchmark file compiles and the baseline JSON files parse successfully. I could not run |
717b478 to
b495739
Compare
…ng work Adds benchmarks/speed_review/bench_memory_scaling.py -- a subprocess-isolated peak-RSS sweep (ru_maxrss, median of repeats) for the bootstrap-chunking and within-transform memory wins (B1 #561 / B2 #563 / C #567) -- plus a "Memory scaling" section in docs/performance-plan.md with the honest before/after table. Measured pre-#561 (un-optimized) vs current main, 999 bootstrap reps: - CallawaySantAnna bootstrap: 500k 12.9->2.1 GB (-84%), 1M 13.5->3.0 GB (-78%) - EfficientDiD bootstrap 500k: 8.3->1.6 GB (-81%) - HeterogeneousAdoptionDiD cband 500k: 7.7->1.2 GB (-84%) - TWFE within-transform fit: 500k 1.0->0.93 GB (-8%), 1M 1.7->1.5 GB (-9%) Closes the 3-PR memory-scaling initiative. The baseline is re-measured against the pre-#561 tree (not a monkeypatch of current code, which would be contaminated by the same-line optimizations). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b495739 to
e49de68
Compare
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment✅ Looks good — No unmitigated P0/P1 findings. Executive Summary
MethodologyNo findings. The new docs now match the in-code bootstrap chunking contract: generated weight streams are bit-identical, while downstream Cross-checks used: CS bootstrap/IF requirements in Code QualityNo findings. Prior CLI robustness issue is resolved: PerformanceNo findings. Prior backend reproducibility issue is resolved: the benchmark defaults to MaintainabilityNo findings. The benchmark is isolated under Tech DebtP3 informational: Deferred items are trackedImpact: The docs mention remaining stratified survey-PSU tiling and Concrete fix: None required. Locations: SecurityNo findings. No secret-like strings were found in the changed benchmark/docs/baseline files. Documentation/TestsNo blocking findings. The new README entry and performance-plan section are consistent with the benchmark and baselines. Validation performed: AST parse passed for |
Summary
benchmarks/speed_review/bench_memory_scaling.py- a durable, subprocess-isolated peak-RSS benchmark (ru_maxrss, median of repeats) for the memory-scaling work, covering the CallawaySantAnna / EfficientDiD / HeterogeneousAdoptionDiD bootstraps and the TwoWayFixedEffects within-transform fit at 500k-1M units.docs/performance-plan.mdwith the honest before/after table, and lists the bench in the speed_review README.Measured pre-#561 (un-optimized) vs current
main, 999 bootstrap reps,DIFF_DIFF_BACKEND=python, Apple Silicon:Methodology references
atol=0).Validation
--quicksmoke; the before/after numbers were measured against the pre-perf(callaway-santanna): chunk multiplier-bootstrap weight generation to bound peak memory #561 tree (throwaway worktree) vs currentmain, median of repeats.benchmarks/speed_review/baselines/memory_scaling_{before,after}.json).Security / privacy
🤖 Generated with Claude Code