igerber · igerber · Jun 27, 2026 · Jun 26, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -7,7 +7,23 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+- **`placebo_group_test` gained an optional `treatment` parameter.** When supplied, units
+  that are ever real-treated are dropped before the placebo so it runs on never-treated units
+  only (the uncontaminated design); without it, behavior is unchanged and the caller must pass
+  control-only data. Degenerate designs (all fake-treated units dropped, or no controls
+  remaining) now raise a clear `ValueError` instead of a cryptic `LinAlgError`, and a
+  fake-treated unit that is itself real-treated emits a `UserWarning`.
+- **PlaceboTests methodology validation:** `tests/test_methodology_placebo.py` (paper-anchored
+  to Bertrand-Duflo-Mullainathan 2004) plus R parity via base-R exact enumeration
+  (`benchmarks/R/generate_placebo_golden.R` → `benchmarks/data/placebo_golden.json`). The
+  `PlaceboTests` methodology-review row is promoted to **Complete**.
+
 ### Changed
+- **`run_placebo_test`'s `fake_group` path now filters ever-treated units by default.** The
+  dispatcher threads its `treatment` column into `placebo_group_test`, so the fake-group
+  placebo runs on never-treated units only (the uncontaminated design). Calling
+  `placebo_group_test` directly without `treatment` retains the previous behavior.
 - **Bumped the Rust backend's `blas-src` crate `0.10` → `0.14`.** `blas-src` is a
   linker-only crate pulled in **only by the `accelerate` (macOS) feature**; the Linux
   `openblas` path links system OpenBLAS via `build.rs` and the default/Windows builds use the
@@ -17,6 +33,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   `maturin develop --features accelerate` against the pinned `ndarray 0.17`, the Rust unit
   tests, and the full Python⇄Rust equivalence suite (`tests/test_rust_backend.py`).
 
+### Fixed
+- **`permutation_test` now reports the randomization-inference p-value
+  `(1 + count) / (B + 1)`** (Phipson & Smyth 2010), replacing `count / B` floored at
+  `1/(B+1)`. The `+1` includes the observed statistic in both numerator and denominator
+  (the floor is now intrinsic). Because assignments are sampled with replacement, this is a
+  valid but slightly conservative Monte-Carlo randomization-inference p-value (not an exact
+  finite-sample value); it converges to the exact full-enumeration value `count/total` as the
+  number of permutations grows. (Reported permutation p-values rise by at most
+  `1/(B+1)`; they never decrease.)
+
 ### Security
 - **Bumped the Rust backend's `pyo3` and `numpy` crates 0.28 → 0.29.** Resolves two RustSec
   advisories in `pyo3 < 0.29` — RUSTSEC-2026-0176 (out-of-bounds read in `PyList`/`PyTuple`

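A minimal sketch, not part of this diff, of the p-value change described in the `### Fixed` entry. It assumes the previous value was exactly `max(count / B, 1 / (B + 1))` and that `count` excludes the observed statistic; the function names are hypothetical and are not the library's API. It uses exact fractions to check that the new value never falls below the old one and exceeds it by at most `1/(B+1)`.

```python
from fractions import Fraction


def old_p_value(count: int, n_perm: int) -> Fraction:
    """Assumed previous convention: count / B, floored at 1 / (B + 1)."""
    return max(Fraction(count, n_perm), Fraction(1, n_perm + 1))


def new_p_value(count: int, n_perm: int) -> Fraction:
    """Phipson-Smyth convention: (1 + count) / (B + 1)."""
    return Fraction(1 + count, n_perm + 1)


def max_shift(n_perm: int) -> Fraction:
    """Largest value of new - old over every possible count in 0..B."""
    return max(
        new_p_value(c, n_perm) - old_p_value(c, n_perm)
        for c in range(n_perm + 1)
    )


if __name__ == "__main__":
    B = 999  # hypothetical number of sampled permutations
    for count in (0, 1, 50, 999):
        print(count, float(old_p_value(count, B)), float(new_p_value(count, B)))
    print("max shift:", float(max_shift(B)), "bound:", 1 / (B + 1))
```

For `count >= 1` the difference is `(B - count) / (B * (B + 1))`, which is non-negative and strictly below `1/(B+1)`; for `count = 0` both conventions give `1/(B+1)`.
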
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
@@ -24,7 +24,7 @@ A **Complete** entry has a documented review pass against the primary academic s
 
 The catalog grew incrementally over several quarters, so formats vary across the existing Complete entries; the consistent invariant is that someone walked through the implementation against the academic source and captured the result here. New reviews going forward should aim for the fuller structure (Verified Components + Corrections Made + Deviations + dedicated methodology test file) used by the more recent entries.
 
-**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures; others have only the REGISTRY entry and unit tests (e.g., PlaceboTests). The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
+**In Progress** entries have a REGISTRY.md section and unit-test coverage, but no formal walk-through has been captured here yet. The In Progress band is wide — some entries also have some combination of a paper review (primary or companion), a dedicated methodology test file, and R parity fixtures; others have only the REGISTRY entry and unit tests. The "Documentation in place" sub-section enumerates what each entry already has; the "Outstanding for promotion" sub-section enumerates what's still needed to flip it to Complete.
 
 **Not Started** entries have neither a tracker walk-through nor an REGISTRY.md section. This tracker no longer carries any Not Started rows; new estimators are expected to enter as In Progress when their REGISTRY entry lands.
 
@@ -82,7 +82,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
 | HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 |
 | PreTrendsPower | `pretrends.py` | `pretrends` package | **Complete** | 2026-05-19 |
 | PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **Complete** | 2026-05-31 |
-| PlaceboTests | `diagnostics.py` | Bertrand-Duflo-Mullainathan (2004) (placebo laws); no canonical R | **In Progress** | — |
+| PlaceboTests | `diagnostics.py` | Bertrand-Duflo-Mullainathan (2004); base-R exact-enumeration parity | **Complete** | 2026-06-26 |
 
 ### Cross-Cutting Inference Features
 
@@ -1314,20 +1314,43 @@ CI and extending covariate-adjusted R parity are tracked follow-ups in `TODO.md`
 | Field | Value |
 |-------|-------|
 | Module | `diagnostics.py` |
-| Primary Reference | Bertrand, Duflo & Mullainathan (2004), QJE 119(1):249-275 (placebo laws / randomization inference). Paper review on file: `docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md`. |
-| R Reference | None canonical (no R package ships a generic placebo battery) |
-| Status | **In Progress** |
-| Last Review | — |
+| Primary Reference | Bertrand, Duflo & Mullainathan (2004), QJE 119(1):249-275 (placebo laws / randomization inference). Paper review: `docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md`. |
+| R Reference | No canonical R package for the DiD placebo battery; parity via base-R `combn` exact enumeration (`benchmarks/R/generate_placebo_golden.R`), optional `ri2`/`coin` convention check. |
+| Status | **Complete** |
+| Last Review | 2026-06-26 |
 
-**Documentation in place:**
-- REGISTRY.md section: `## PlaceboTests` (NaN-inference edge cases for `permutation_test` and `leave_one_out_test`)
-- Paper review: `docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md` (BDM 2004 placebo-law / serial-correlation grounding; proposes a `## PlaceboTests` REGISTRY entry, not yet integrated)
-- Implementation: tests embedded in `tests/test_diagnostics.py`
+**Verified Components:**
+- [x] Sampled randomization-inference p-value `(1 + count)/(B + 1)` — valid but slightly conservative (with-replacement Monte Carlo; Phipson & Smyth 2010; BDM fn 11), converging to the exact `count/total` full enumeration — `tests/test_methodology_placebo.py::TestPlaceboRandomizationInference`
+- [x] R parity: Python exhaustive enumeration matches base-R `combn` exact p-value + observed ATT at `atol=1e-12`; deterministic `leave_one_out_test` / `placebo_group_test` match R at `atol≈1e-10`; sampled `permutation_test` within Monte-Carlo tolerance — `TestPlaceboParityR` (skip-guarded; golden `benchmarks/data/placebo_golden.json`)
+- [x] `placebo_timing_test` detects differential pre-trends (significant under violated trends, null under parallel) and restricts to pre-treatment data — `TestPlaceboFakeTiming`
+- [x] `placebo_group_test` never-treated `treatment` filter (drops ever-treated), degenerate-design `ValueError`, misuse `UserWarning`, backward-compatible without `treatment` — `TestPlaceboFakeGroup`
+- [x] Permutation NaN-decoupling contract (RI p-value finite when `se` degenerate) + fail-closed `RuntimeError` on all-fail — `TestPlaceboInferenceContracts`
+- [x] Functional coverage (dispatch routing, zero-SE / `<2`-LOO NaN-inference) — `tests/test_diagnostics.py`
 
-**Outstanding for promotion:**
-- Standalone-vs-absorb decision: **resolved — standalone** (`diagnostics.py` is an exported public surface distinct from per-estimator placebo/LOO)
-- Integrate the proposed `## PlaceboTests` entry into REGISTRY.md (cite BDM 2004 + scope) and flip this row to Complete
-- Dedicated `tests/test_methodology_placebo.py` with BDM-anchored Verified Components (empirical permutation p-value per fn 12; p-value floor; LOO; fake-timing/fake-group) + Deviations block (permutation path's deliberate non-`safe_inference` + percentile CI; the NaN-inference convention). R parity is N/A (no canonical R placebo battery) → self-consistency / analytic anchors
+**Test Coverage:**
+- `tests/test_methodology_placebo.py` (24 tests across 5 paper-anchored classes; 2 `@pytest.mark.slow`)
+- `tests/test_diagnostics.py` (32 functional / edge-case tests)
+
+**R Comparison Results** (base-R `combn` exact enumeration; golden committed at `benchmarks/data/placebo_golden.json`):
+
+| Quantity | Tolerance | Result |
+|----------|-----------|--------|
+| Observed DiD ATT | `atol=1e-12` | match |
+| Exact RI p-value (`count/total`) | `atol=1e-12` | match |
+| Leave-one-out mean / se / CI / per-drop ATTs | `atol=1e-10` / `1e-9` | match |
+| Fake-group ATT (never-treated filtered) | `atol=1e-10` | match |
+| Sampled permutation p-value | Monte-Carlo | converges to exact |
+
+**Corrections Made:**
+- **Permutation p-value (this PR):** replaced `count/B` floored at `1/(B+1)` with the Phipson-Smyth (2010) randomization-inference value `(1 + count)/(B + 1)` — a valid but slightly conservative estimator for with-replacement Monte-Carlo draws (floor now intrinsic); "exact" is reserved for the full enumeration (`diff_diff/diagnostics.py`).
+- **`placebo_group_test` (this PR):** added an optional `treatment` parameter that drops ever-treated units so the placebo runs on never-treated data only (uncontaminated); added degenerate-design `ValueError` guards (replacing a cryptic `LinAlgError`) and a misuse `UserWarning`; corrected the docstring to describe both modes. The `run_placebo_test` dispatcher's `fake_group` path now filters ever-treated units by default (a more-correct placebo) — documented in CHANGELOG + REGISTRY.
+- **Docstring (this PR):** reworded `permutation_test`'s docstring — dropped the "exact ... valid with any sample size" overclaim; the sampled value is the valid/conservative with-replacement RI p-value, with "exact" reserved for full enumeration (BDM fn 12).
+
+**Deviations:** (documented in REGISTRY `## PlaceboTests`)
+- Permutation inference deliberately bypasses `safe_inference`: RI p-value + null-distribution percentile interval (not an effect CI) + null-mean `placebo_effect`.
+- `leave_one_out_test` reports the dispersion of per-drop ATTs (a sensitivity spread), not a design-based jackknife SE.
+- Permutation NaN-decoupling: the count-based RI p-value stays valid when the permutation `se` is degenerate (intentional departure from the bootstrap-NaN contract).
+- BDM's serial-correlation SE corrections (parametric AR, block bootstrap, cluster VCV, aggregation) are out of scope for this diagnostic surface.
 
 ---
 
@@ -1459,17 +1482,11 @@ whereas R's `did::att_gt` would error. This is a defensive enhancement that prov
 more graceful handling of edge cases while still signaling invalid inference to users.
 ```
 
-### Priority Order (2026-05-26)
-
-Promotion priority for the **In Progress** entries, ordered by what's blocked on substantive review work (top of list = needs review next) vs. consolidation pass (bottom of list = mostly tracker walk-through):
-
-**Substantive-review-blocked (each still missing one or more of: a methodology test file, R parity, or a paper review):**
-
-1. **PlaceboTests** — standalone-vs-absorb decision resolved (standalone) and the BDM (2004) paper review is now on file (`docs/methodology/papers/bertrand-duflo-mullainathan-2004-review.md`). Remaining for promotion: the dedicated methodology test file + REGISTRY integration (R parity N/A). Methodologically lightweight.
+### Priority Order (updated 2026-06-26)
 
-**Consolidation-pass-blocked (already has paper review or methodology file or R parity; mostly Verified Components walk-through):**
+Only one **In Progress** entry remains: **Survey Data Support**. (PlaceboTests was promoted to Complete on 2026-06-26 — see its detail section above.)
 
-4. **Survey Data Support** — cross-cutting feature; promotion requires the per-estimator integration paths to be locked down first.
+- **Survey Data Support** — cross-cutting feature; consolidation-pass-blocked. Promotion requires the per-estimator integration paths to be locked down first, then a dedicated `tests/test_methodology_survey.py` (Binder-equation-numbered Verified Components), an R-parity table vs `survey::svyglm`/`svycontrast` wired into this tracker, a deviations block, and a consolidated cross-estimator `NotImplementedError`-gaps enumeration.
 
 ---
 

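A minimal sketch, not part of this diff, of the never-treated filter and guards that the CHANGELOG and tracker entries describe for `placebo_group_test`. The helper name, its signature, and the message strings are hypothetical; only the behavior (drop ever-treated units, warn on a fake-treated unit that is real-treated, raise `ValueError` on degenerate designs) comes from the text above.

```python
import warnings


def filter_never_treated(rows, fake_treated_units, unit="unit", treatment="treatment"):
    """Hypothetical helper: keep never-treated units and validate the placebo design.

    rows: list of dicts, one per unit-period observation.
    Returns (kept_rows, kept_fake_treated_units).
    """
    # A unit is ever-treated if any of its rows carries treatment == 1.
    ever_treated = {r[unit] for r in rows if r[treatment] == 1}

    # Misuse: a fake-treated unit that is itself real-treated.
    contaminated = sorted(set(fake_treated_units) & ever_treated)
    if contaminated:
        warnings.warn(
            f"fake-treated units {contaminated} are real-treated and will be dropped",
            UserWarning,
        )

    kept_rows = [r for r in rows if r[unit] not in ever_treated]
    kept_fake = [u for u in fake_treated_units if u not in ever_treated]
    kept_units = {r[unit] for r in kept_rows}

    # Degenerate designs fail early with a clear message.
    if not kept_fake:
        raise ValueError("all fake-treated units were dropped as ever-treated")
    if not (kept_units - set(kept_fake)):
        raise ValueError("no control units remain after filtering")
    return kept_rows, kept_fake
```

On an 8-unit panel where units 0, 1, 2 are real-treated, calling the helper with fake-treated units `[3, 4]` keeps units 3 through 7 and leaves 5, 6, 7 as controls.
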
diff --git a/benchmarks/R/generate_placebo_golden.R b/benchmarks/R/generate_placebo_golden.R
@@ -0,0 +1,133 @@
+#!/usr/bin/env Rscript
+# Golden generator: PlaceboTests (diff_diff/diagnostics.py) R parity.
+#
+# Bertrand, Duflo & Mullainathan (2004) placebo-law / randomization-inference
+# diagnostics. Writes a fixed, fully deterministic 2-period panel and the R
+# reference values so tests/test_methodology_placebo.py::TestPlaceboParityR can
+# pin Python output against R without requiring R at test time.
+#
+# Outputs (checked into the repo):
+#   benchmarks/data/placebo_test_panel.csv   (unit, t, y, treatment)
+#   benchmarks/data/placebo_golden.json
+#
+# Usage:
+#   Rscript benchmarks/R/generate_placebo_golden.R
+#
+# Notes:
+#   - The panel is HARDCODED (not RNG-generated) so R and Python consume bit-
+#     identical data; no cross-language RNG matching is needed.
+#   - Permutation p-value uses EXACT enumeration of all C(8, 3) = 56 treated-group
+#     assignments (the observed assignment is one of them): exact p =
+#     #{|ATT*| >= |ATT_obs|} / total (observed included; min 1/total). This is the
+#     ground truth the library's SAMPLED (1 + count)/(B + 1) value converges to.
+#   - n_treated = 3 != N/2 = 4, so no assignment's complement shares its |ATT|
+#     (avoids exact-tie pairing); the panel is chosen with a clear boundary gap
+#     so the 1e-12 exact-parity comparison is not tie-flip fragile.
+#   - leave-one-out se is the dispersion (sd, ddof=1) of the per-drop ATTs (NOT a
+#     design-based jackknife SE), with a t-distribution (df = n_valid - 1), exactly
+#     matching diff_diff.leave_one_out_test via safe_inference.
+#   - ri2 availability is only probed via requireNamespace and recorded in the golden;
+#     ri2/coin are NOT committed dependencies (base-R combn enumeration is the anchor).
+
+suppressMessages(library(jsonlite))
+
+# ---- fixed panel (8 units x 2 periods; real treated = units 0,1,2) ----
+panel <- data.frame(
+  unit = rep(0:7, each = 2),
+  t = rep(c(0, 1), times = 8),
+  y = c(
+    -1.639137, -0.623634, 0.051834, 1.622805, 0.261434, 0.82986,
+    0.337559, 1.580412, -1.055892, -1.067745, 1.062855, 1.478681,
+    0.139217, 0.8575, -1.253286, -0.560034
+  )
+)
+panel$treatment <- as.integer(panel$unit %in% c(0, 1, 2))
+real_treated <- c(0, 1, 2)
+n_treated <- 3L
+units <- 0:7
+
+# 2x2 DiD ATT = double difference of group means (post = t == 1).
+did_att <- function(df, treated) {
+  is_t <- df$unit %in% treated
+  post <- df$t == 1
+  (mean(df$y[is_t & post]) - mean(df$y[is_t & !post])) -
+    (mean(df$y[!is_t & post]) - mean(df$y[!is_t & !post]))
+}
+
+att_obs <- did_att(panel, real_treated)
+
+# ---- permutation: EXACT randomization-inference p-value ----
+combos <- combn(units, n_treated, simplify = FALSE)
+atts <- vapply(combos, function(s) did_att(panel, s), numeric(1))
+count <- sum(abs(atts) >= abs(att_obs) - 1e-12)
+total <- length(atts)
+p_exact <- count / total
+# boundary gap: nearest distinct |ATT*| to |ATT_obs| (excluding the observed)
+gap <- sort(abs(abs(atts) - abs(att_obs)))[2]
+
+# ---- leave-one-out (deterministic jackknife over treated units) ----
+loo_units <- real_treated
+loo_atts <- sapply(loo_units, function(u) {
+  remaining <- panel[panel$unit != u, ]
+  treated_rem <- setdiff(real_treated, u)
+  did_att(remaining, treated_rem)
+})
+loo_mean <- mean(loo_atts)
+loo_se <- sd(loo_atts) # ddof = 1, the dispersion of LOO ATTs (not an SE-of-mean)
+loo_df <- length(loo_atts) - 1L
+loo_t <- loo_mean / loo_se
+loo_p <- 2 * pt(-abs(loo_t), df = loo_df)
+loo_crit <- qt(0.975, df = loo_df)
+loo_ci <- c(loo_mean - loo_crit * loo_se, loo_mean + loo_crit * loo_se)
+
+# ---- fake-group (deterministic; drop ever-treated, fake-treat controls 3,4) ----
+fg_fake_treated <- c(3, 4)
+fg_panel <- panel[!(panel$unit %in% real_treated), ] # never-treated only
+fg_att <- did_att(fg_panel, fg_fake_treated)
+
+# ---- optional ri2 availability flag (NOT a committed dependency; no cross-check is run here) ----
+ri2_ok <- requireNamespace("ri2", quietly = TRUE)
+
+golden <- list(
+  description = "PlaceboTests R parity (BDM 2004): exact RI permutation p-value + deterministic LOO + fake-group, on a fixed 2-period panel.",
+  panel_csv = "benchmarks/data/placebo_test_panel.csv",
+  real_treated = real_treated,
+  n_treated = n_treated,
+  observed_att = att_obs,
+  permutation = list(
+    convention = "exact enumeration: p = #{|ATT*| >= |ATT_obs|} / total (observed included)",
+    count = count,
+    total = total,
+    p_exact = p_exact,
+    boundary_gap = gap
+  ),
+  leave_one_out = list(
+    dropped_units = loo_units,
+    per_drop_att = as.list(setNames(loo_atts, as.character(loo_units))),
+    mean = loo_mean,
+    se = loo_se,
+    df = loo_df,
+    t_stat = loo_t,
+    p_value = loo_p,
+    ci_lower = loo_ci[1],
+    ci_upper = loo_ci[2]
+  ),
+  fake_group = list(
+    fake_treated_units = fg_fake_treated,
+    note = "ever-treated units dropped (treatment filter); ATT is the double-difference",
+    att = fg_att
+  ),
+  ri2_convention_checked = ri2_ok
+)
+
+write.csv(panel, "benchmarks/data/placebo_test_panel.csv", row.names = FALSE)
+write_json(golden, "benchmarks/data/placebo_golden.json",
+  auto_unbox = TRUE, pretty = TRUE, digits = 12
+)
+
+cat(sprintf("observed ATT = %.12f\n", att_obs))
+cat(sprintf("exact RI: count=%d total=%d p_exact=%.12f gap=%.4f\n", count, total, p_exact, gap))
+cat(sprintf("LOO: mean=%.12f se=%.12f df=%d p=%.6f\n", loo_mean, loo_se, loo_df, loo_p))
+cat(sprintf("fake_group ATT = %.12f\n", fg_att))
+cat(sprintf("ri2 convention cross-check available: %s\n", ri2_ok))
+cat("Wrote benchmarks/data/placebo_test_panel.csv + placebo_golden.json\n")
diff --git a/benchmarks/data/placebo_golden.json b/benchmarks/data/placebo_golden.json
@@ -0,0 +1,35 @@
+{
+  "description": "PlaceboTests R parity (BDM 2004): exact RI permutation p-value + deterministic LOO + fake-group, on a fixed 2-period panel.",
+  "panel_csv": "benchmarks/data/placebo_test_panel.csv",
+  "real_treated": [0, 1, 2],
+  "n_treated": 3,
+  "observed_att": 0.4399611333333,
+  "permutation": {
+    "convention": "exact enumeration: p = #{|ATT*| >= |ATT_obs|} / total (observed included)",
+    "count": 15,
+    "total": 56,
+    "p_exact": 0.2678571428571,
+    "boundary_gap": 0.03574946666667
+  },
+  "leave_one_out": {
+    "dropped_units": [0, 1, 2],
+    "per_drop_att": {
+      "0": 0.4580263,
+      "1": 0.1802923,
+      "2": 0.6815648
+    },
+    "mean": 0.4399611333333,
+    "se": 0.2511240579855,
+    "df": 2,
+    "t_stat": 1.751967282079,
+    "p_value": 0.2218771530393,
+    "ci_lower": -0.6405384802636,
+    "ci_upper": 1.52046074693
+  },
+  "fake_group": {
+    "fake_treated_units": [3, 4],
+    "note": "ever-treated units dropped (treatment filter); ATT is the double-difference",
+    "att": 0.006379666666667
+  },
+  "ri2_convention_checked": false
+}
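
A minimal sketch, not part of this diff, that recomputes the `leave_one_out` and `fake_group` entries of the golden file from the hardcoded panel using only the Python standard library. It mirrors the R generator's arithmetic rather than calling `diff_diff`; the function and variable names are hypothetical. Because the panel is balanced with two periods, the double difference of group means equals the difference in mean per-unit changes, which is what `did_att` computes. The t-distribution p-value and CI are omitted to avoid a SciPy dependency.

```python
from statistics import mean, stdev

# unit -> (y at t=0, y at t=1), copied from benchmarks/data/placebo_test_panel.csv
PANEL = {
    0: (-1.639137, -0.623634), 1: (0.051834, 1.622805),
    2: (0.261434, 0.82986), 3: (0.337559, 1.580412),
    4: (-1.055892, -1.067745), 5: (1.062855, 1.478681),
    6: (0.139217, 0.8575), 7: (-1.253286, -0.560034),
}
REAL_TREATED = (0, 1, 2)


def did_att(panel, treated):
    """2x2 DiD ATT: mean change of treated units minus mean change of the rest."""
    treated = set(treated)
    t_diffs = [post - pre for u, (pre, post) in panel.items() if u in treated]
    c_diffs = [post - pre for u, (pre, post) in panel.items() if u not in treated]
    return mean(t_diffs) - mean(c_diffs)


def leave_one_out(panel, treated):
    """Per-drop ATTs, their mean, their sample sd (ddof=1), and mean / sd."""
    atts = {}
    for u in treated:
        remaining = {k: v for k, v in panel.items() if k != u}
        atts[u] = did_att(remaining, [t for t in treated if t != u])
    loo_mean = mean(atts.values())
    loo_sd = stdev(atts.values())  # dispersion of per-drop ATTs, not a jackknife SE
    return atts, loo_mean, loo_sd, loo_mean / loo_sd


def fake_group_att(panel, real_treated, fake_treated):
    """Drop ever-treated units, then treat the fake group as treated."""
    never_treated = {k: v for k, v in panel.items() if k not in set(real_treated)}
    return did_att(never_treated, fake_treated)


if __name__ == "__main__":
    atts, loo_mean, loo_sd, loo_t = leave_one_out(PANEL, REAL_TREATED)
    print("per-drop ATTs:", atts)
    print("mean, sd, t:", loo_mean, loo_sd, loo_t)
    print("fake-group ATT:", fake_group_att(PANEL, REAL_TREATED, (3, 4)))
```
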
diff --git a/benchmarks/data/placebo_test_panel.csv b/benchmarks/data/placebo_test_panel.csv
@@ -0,0 +1,17 @@
+"unit","t","y","treatment"
+0,0,-1.639137,1
+0,1,-0.623634,1
+1,0,0.051834,1
+1,1,1.622805,1
+2,0,0.261434,1
+2,1,0.82986,1
+3,0,0.337559,0
+3,1,1.580412,0
+4,0,-1.055892,0
+4,1,-1.067745,0
+5,0,1.062855,0
+5,1,1.478681,0
+6,0,0.139217,0
+6,1,0.8575,0
+7,0,-1.253286,0
+7,1,-0.560034,0
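
A minimal sketch, not part of this diff, of the exact randomization-inference enumeration that the R generator performs with `combn`, written with `itertools.combinations` on the same hardcoded panel. It is an independent recomputation under the conventions stated in the generator's notes (two-sided `|ATT*| >= |ATT_obs|` with a `1e-12` tie tolerance, observed assignment included); the names are hypothetical and it does not call `diff_diff`.

```python
from itertools import combinations
from statistics import mean

# unit -> (y at t=0, y at t=1), copied from benchmarks/data/placebo_test_panel.csv
PANEL = {
    0: (-1.639137, -0.623634), 1: (0.051834, 1.622805),
    2: (0.261434, 0.82986), 3: (0.337559, 1.580412),
    4: (-1.055892, -1.067745), 5: (1.062855, 1.478681),
    6: (0.139217, 0.8575), 7: (-1.253286, -0.560034),
}
REAL_TREATED = (0, 1, 2)


def did_att(panel, treated):
    """2x2 DiD ATT: mean change of treated units minus mean change of the rest."""
    treated = set(treated)
    t_diffs = [post - pre for u, (pre, post) in panel.items() if u in treated]
    c_diffs = [post - pre for u, (pre, post) in panel.items() if u not in treated]
    return mean(t_diffs) - mean(c_diffs)


def exact_ri_p_value(panel, real_treated, tol=1e-12):
    """Enumerate every treated-group assignment of the same size as the real one."""
    att_obs = did_att(panel, real_treated)
    atts = [did_att(panel, s) for s in combinations(sorted(panel), len(real_treated))]
    # The observed assignment is one of the enumerated ones, so count >= 1.
    count = sum(abs(a) >= abs(att_obs) - tol for a in atts)
    return att_obs, count, len(atts), count / len(atts)


if __name__ == "__main__":
    att_obs, count, total, p_exact = exact_ri_p_value(PANEL, REAL_TREATED)
    print(f"observed ATT = {att_obs:.12f}")
    print(f"count={count} total={total} p_exact={p_exact:.12f}")
```

With 8 units and 3 treated there are C(8, 3) = 56 assignments; 8 of them have a treated-group change at least as large as the observed one and 7 fall in the opposite tail, which gives the committed `count = 15` and `p_exact = 15/56`.
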