igerber · igerber · Jun 24, 2026 · Jun 24, 2026 · Jun 24, 2026 · Jun 24, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,8 +10,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 - **`generate_synthetic_control_data()` data generator + a capstone `SyntheticControl` tutorial.** New public generator (`diff_diff/prep_dgp.py`, exported from `diff_diff`) builds a **single-treated-unit** factor-model panel for synthetic-control demos and tests: one treated unit whose latent factor loadings and baseline are an exact convex combination of a few donors (so the noiseless trajectory lies in the donor convex hull and a synthetic control reproduces it closely — the observed fit is approximate under added noise), persistent AR(1) factors, predictor covariates that each proxy a distinct factor, a common calendar time effect, and a known `"ramp"` or `"constant"` treatment effect emitted as `true_effect`. Tutorial **`docs/tutorials/25_synthetic_control_policy.ipynb`** walks the whole `SyntheticControl` surface end-to-end on a policy-evaluation story (one state adopts a clean-energy standard), structured around **two inference philosophies**: cross-unit permutation (`in_space_placebo` + Firpo–Possebom `confidence_set`, with `leave_one_out` / `in_time_placebo` robustness) versus over-time conformal (CWZ `conformal_test` / `conformal_confidence_intervals` / `conformal_average_effect`), with the per-period conformal band as the climax. A `tests/test_t25_synthetic_control_policy_drift.py` drift guard re-derives every quoted number from the generator.
 - **`TwoStageDiD` methodology validation (Gardner 2022 / `did2s`).** New `tests/test_methodology_two_stage.py` with paper-equation-numbered Verified Components (§3 two-stage procedure / eqs. 4 & 6, §3.3 GMM variance, fn. 19 always-treated exclusion, Proposition 5, covariate path, `balance_e`, `vcov_type` narrowing) plus a `did2s::did2s()` cross-language parity fixture (`benchmarks/R/generate_did2s_golden.R` → `benchmarks/data/did2s_golden.json` + `did2s_test_panel.csv`), pinning overall + event-study ATT (abs 1e-6) and SE (abs 1e-7). `METHODOLOGY_REVIEW.md` `TwoStageDiD` row flipped to **Complete**.
+- **`p_val_type` parameter on `DifferenceInDifferences` (and inherited `TwoWayFixedEffects`)** for the wild cluster bootstrap, mirroring `fwildclusterboot::boottest`: `"two-tailed"` (default — test on `|t|`, two-tailed inverted CI, which may be asymmetric) or `"equal-tailed"` (each tail at `alpha/2`, equal-tailed CI). Only used when `inference="wild_bootstrap"`; inert on `MultiPeriodDiD` (which falls back to analytical inference).
 
 ### Fixed
+- **Wild cluster bootstrap (`inference="wild_bootstrap"`) now imposes the null — fixes an invalid p-value (issue #543).** `DifferenceInDifferences`/`TwoWayFixedEffects` with `inference="wild_bootstrap"` previously produced a p-value that contradicted its own confidence interval (e.g. CI `[2.30, 2.64]` excluding 0, yet `p = 0.86`). `diff_diff.utils.wild_bootstrap_se` *claimed* to run the Wild Cluster **Restricted** bootstrap but never actually imposed the null — it re-fit the full design (keeping the treatment column) to the unchanged outcome, so the "restricted" residuals equaled the unrestricted ones and the bootstrap coefficient distribution centered on the estimate instead of 0. The p-value `mean(|t*| ≥ |t₀|)` then measured noise around the estimate (≈0.5–0.86 regardless of significance) while the percentile-of-coefficients CI happened to look fine — an internal contradiction. The bootstrap now genuinely imposes H₀ (drops the coefficient's column for the restricted fit), studentizes with the analytical CR1 SE, and derives the CI by **test inversion** so the p-value and CI are exactly consistent (`0 ∈ CI ⟺ p ≥ alpha`). For Rademacher weights with few clusters the full `2**n_clusters` sign-vector set is enumerated (deterministic), matching R's `fwildclusterboot::boottest`. **Results change** for any prior `wild_bootstrap` use: the headline `p_value`/`conf_int` are corrected (a true effect is now correctly significant), and the reported `se` is now the analytical cluster-robust (CR1) SE (numerically ~unchanged in well-behaved cases). Validated against `fwildclusterboot::boottest()` (`benchmarks/R/generate_wild_cluster_boot_golden.R`; bootstrap t-distribution to ~6e-14, `se`/`t`/interior-`p` exact, CI to ~1e-4) and an independent full-refit enumeration. See `docs/methodology/REGISTRY.md` §"Wild cluster bootstrap (WCR)".
+- **Cluster-robust / HC1 standard errors no longer raise `ZeroDivisionError` on a saturated design.** `linalg.compute_robust_vcov` (NumPy path) divided by `(n_eff - k)` in the HC1/CR1 small-sample adjustment without guarding a design with no residual degrees of freedom (`n_eff == k`, e.g. a 2×2 DiD with one observation per cluster-period); it now returns a NaN vcov so inference is degenerate (NaN), consistent with the all-or-nothing NaN convention, rather than crashing. Surfaced while hardening the wild cluster bootstrap (`wild_bootstrap_se` independently routes saturated / weak-identification designs to NaN, and represents a genuinely unbounded inverted CI with `±inf` instead of mixing finite point estimates with NaN endpoints).
+- **Wild cluster bootstrap on a rank-deficient full-dummy design no longer crashes when storing the vcov.** `_run_wild_bootstrap_inference` computed the stored cluster-robust vcov via `compute_robust_vcov(X, ...)` on the full design, which inverts `X'X` directly and raises (or returns garbage) when a nuisance column is collinear (e.g. a fixed-effect dummy collinear with treatment) — even though the ATT is identified and the bootstrap itself drops such columns. It now computes the stored vcov through the rank-aware `solve_ols(..., rank_deficient_action="silent")` path, NaN-expanding the dropped column (bit-identical to the prior result on full-rank designs).
 - **`TwoStageDiD` analytical GMM standard errors are now exact (match R `did2s` to ~1e-7).** The Gardner two-stage GMM sandwich `_compute_gmm_variance` derived its residuals from the *iterative* alternating-projection first-stage fixed effects (`_iterative_fe`, which converge only to ~1e-7 on unbalanced untreated panels) while computing `gamma_hat` exactly — leaving the variance ~1% off the analytical sandwich. The variance now re-solves the Stage-1 FE **exactly** (sparse OLS, reusing the `gamma_hat` factorization), and `_build_fe_design` gained an intercept column so its column space spans the grand mean (the prior intercept-free design omitted it, and the exact residual is first-order sensitive to it). Unidentified-FE obs (rank-deficient / Proposition-5) fall back to the iterative residual, so those edge cases are unchanged; the reported `overall_att` still uses the iterative FE (point-estimate equivalence with `ImputationDiD` preserved). Mirrors the same-class fix already applied to `ImputationDiD`'s exact-sparse variance.
 - **`LinearRegression.get_se()` / `get_inference()` no longer return a `NaN` standard error from a tiny-negative variance artifact.** A high-leverage / degenerate coefficient (e.g. an absorbed-FE dummy near-collinear with the treatment, whose Bell-McCaffrey Satterthwaite DOF already hits the noise-floor guard) can have a CR2/HC variance of ~0 (≈1e-32) whose vcov diagonal lands just-below-zero under BLAS-dependent float rounding; `np.sqrt` of the negative then produced a `NaN` SE **nondeterministically** — passing single-threaded but failing under the parallel pure-Python full-suite run (`tests/test_methodology_wls_cr2.py::TestLinearRegressionFENanGuardEndToEnd::test_did_absorbed_fe_lr_inference_nan_for_guarded_coefs`). Both SE sites now clamp the vcov diagonal at 0, so the SE is finite (0 for a genuinely-zero variance), deterministic, and BLAS-independent. **No change for any positive variance** (the clamp is a no-op there); only the previously-`NaN` degenerate case is affected.
 - **`TripleDifference` power analysis now honors `n_periods > 2`.** `simulate_power`,

diff --git a/TODO.md b/TODO.md
@@ -166,6 +166,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | CR2 Bell-McCaffrey DOF uses a naive `O(n² k)` per-coefficient loop over cluster pairs. Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation that avoids the full `n × n` `M` matrix. Switch when a user hits a large-`n` cluster-robust design. | `linalg.py::_compute_cr2_bm` | Phase 1a | Low |
 | `SyntheticControl` retains a full `_SyntheticControlFitSnapshot` (pivoted outcome/predictor panels) on EVERY fit to support the opt-in `in_space_placebo()`, so callers who never run the placebo still pay O(units × periods × predictor-vars) memory (same as `SyntheticDiD`'s always-on snapshot for `in_time_placebo`). Store a compact array/index representation instead of per-variable DataFrames, or build the snapshot lazily on first placebo call (would need to retain the source data, ~same cost). | `synthetic_control.py` snapshot build, `synthetic_control_results.py::_SyntheticControlFitSnapshot` | follow-up | Low |
 | EfficientDiD DR (covariate) path rebuilds the full polynomial sieve basis `_polynomial_sieve_basis(X, K)` for every candidate `K` inside each of the three nuisance fits (outcome regression, propensity ratio, inverse propensity), per `fit()`. After the growing-sieve cap removal (PR-B), large covariate-adjusted fits at large `n` pay more avoidable basis-construction cost. Cache the basis per `(X, K)` within a `fit()` and share it across the nuisance helpers. | `diff_diff/efficient_did_covariates.py` (the three sieve helpers) | PR-B follow-up | Low |
+| Wild cluster bootstrap CI inversion calls `_t_star(r)` ~O(100) times (outward bracketing + bisection per endpoint), and each call materializes a fresh `(B × n)` `y_star` matrix plus the `(k × B)` refit and `(n × B)` residual arrays. For large panels or large `n_bootstrap` this allocation churn is noticeable. The bootstrap is for the few-cluster regime (small `B` when enumerated; `n` typically modest), so it is acceptable today; if a large-`n`/large-`B` user hits it, chunk `_t_star` over bootstrap draws or precompute the `r`-independent cluster-level pieces (the restricted residuals are linear in `r`) so each inversion evaluation avoids rebuilding the full `B × n` matrix. | `diff_diff/utils.py::wild_bootstrap_se._t_star` | #543 | Low |
 
 #### Testing/Docs
 

diff --git a/benchmarks/R/generate_wild_cluster_boot_golden.R b/benchmarks/R/generate_wild_cluster_boot_golden.R
@@ -0,0 +1,92 @@
+#!/usr/bin/env Rscript
+# Golden generator: wild_bootstrap_se vs R `fwildclusterboot::boottest`
+# (Roodman, MacKinnon, Nielsen & Webb 2019; Cameron-Gelbach-Miller 2008).
+#
+# Writes a fixed few-cluster 2x2 DiD design and the R reference so that
+# tests/test_wild_bootstrap.py::TestWildBootstrapParityR can pin Python output
+# against R without requiring R at test time.
+#
+# Outputs (checked into the repo):
+#   benchmarks/data/wild_cluster_boot_test_data.csv  (cluster, treated, post, y)
+#   benchmarks/data/wild_cluster_boot_golden.json
+#
+# Usage:
+#   Rscript benchmarks/R/generate_wild_cluster_boot_golden.R
+#
+# Notes:
+#   - boottest defaults are the WCR (restricted) bootstrap: impose_null=TRUE,
+#     type="rademacher". boottest switches to full enumeration once B reaches the
+#     number of possible draws (2^G); with G=6 and B=99999 it enumerates all
+#     2^6=64 raw sign-vectors (deterministic, no RNG), of which 2^(G-1)=32 have
+#     distinct |t|. The library's wild_bootstrap_se uses the same trigger
+#     (2^n_clusters <= n_bootstrap) and also materializes all 2^G raw vectors,
+#     reporting n_bootstrap = 2^G; both yield the same p-value and CI.
+#   - The DGP uses a deliberately weak effect + heavy noise so the bootstrap
+#     p-value is INTERIOR (not 0/1), letting the test pin the exact p-value.
+#     boottest counts strict exceedances |t*| > |t0|; the library matches this
+#     (it floors the reported p at 1/(B+1), inactive for an interior p).
+#   - se is the analytical CR1 cluster-robust SE = (G/(G-1))((N-1)/(N-k)) sandwich,
+#     which the library reports as `se` and uses to studentize the test. boottest
+#     studentizes with the same CR1 SE, so teststat == coef/se.
+#   - The CI is obtained by inverting the bootstrap test (boottest's grid search
+#     vs the library's bisection); they agree to ~1e-4 on this design.
+
+suppressMessages({
+  library(fwildclusterboot)
+  library(fixest)
+  library(jsonlite)
+})
+
+set.seed(20240624)
+G <- 6
+obs_per_cluster <- 8
+rows <- list()
+i <- 1
+for (c in 0:(G - 1)) {
+  is_treated <- as.integer(c < G / 2)
+  cluster_effect <- rnorm(1, 0, 1.5)
+  for (o in 1:obs_per_cluster) {
+    for (period in c(0, 1)) {
+      y <- 4.0 + cluster_effect + 1.0 * period
+      if (is_treated == 1 && period == 1) y <- y + 0.3  # weak effect
+      y <- y + rnorm(1, 0, 4.0)                          # heavy noise -> interior p
+      rows[[i]] <- data.frame(cluster = c, treated = is_treated, post = period, y = y)
+      i <- i + 1
+    }
+  }
+}
+df <- do.call(rbind, rows)
+df$inter <- df$treated * df$post
+
+data_path <- "benchmarks/data/wild_cluster_boot_test_data.csv"
+write.csv(df[, c("cluster", "treated", "post", "y")], data_path, row.names = FALSE)
+
+m <- feols(y ~ treated + post + inter, data = df, cluster = ~cluster)
+coef_est <- as.numeric(coef(m)["inter"])
+se_cr1 <- as.numeric(se(m)["inter"])  # CR1 clustered SE
+
+run_bt <- function(ptype) {
+  set.seed(1)
+  bt <- boottest(m, param = "inter", clustid = ~cluster, B = 99999,
+                 type = "rademacher", impose_null = TRUE,
+                 p_val_type = ptype, conf_int = TRUE, sign_level = 0.05)
+  list(p_val = as.numeric(bt$p_val),
+       t_stat = as.numeric(bt$t_stat),
+       conf_int = as.numeric(bt$conf_int),
+       n_clusters = as.integer(bt$N_G))
+}
+
+golden <- list(
+  n_clusters = G,
+  coef = coef_est,
+  se_cr1 = se_cr1,
+  two_tailed = run_bt("two-tailed"),
+  equal_tailed = run_bt("equal-tailed")
+)
+
+json_path <- "benchmarks/data/wild_cluster_boot_golden.json"
+write(toJSON(golden, auto_unbox = TRUE, digits = 12, pretty = TRUE), json_path)
+cat("Wrote", data_path, "and", json_path, "\n")
+cat(sprintf("coef=%.6f se=%.6f two-tailed p=%.6f CI=[%.6f, %.6f]\n",
+            coef_est, se_cr1, golden$two_tailed$p_val,
+            golden$two_tailed$conf_int[1], golden$two_tailed$conf_int[2]))
diff --git a/benchmarks/data/wild_cluster_boot_golden.json b/benchmarks/data/wild_cluster_boot_golden.json
@@ -0,0 +1,17 @@
+{
+  "n_clusters": 6,
+  "coef": 2.23239205609,
+  "se_cr1": 0.8998239199017,
+  "two_tailed": {
+    "p_val": 0.03125,
+    "t_stat": 2.480920996559,
+    "conf_int": [0.226920726858, 4.86871767596],
+    "n_clusters": 6
+  },
+  "equal_tailed": {
+    "p_val": 0.03125,
+    "t_stat": 2.480920996559,
+    "conf_int": [0.226920726858, 4.86871767596],
+    "n_clusters": 6
+  }
+}
diff --git a/benchmarks/data/wild_cluster_boot_test_data.csv b/benchmarks/data/wild_cluster_boot_test_data.csv
@@ -0,0 +1,97 @@
+"cluster","treated","post","y"
+0,1,0,0.467900024557135
+0,1,1,5.28330432772302
+0,1,0,6.21815961331612
+0,1,1,8.91123549413264
+0,1,0,1.6036031176057
+0,1,1,5.93987635703142
+0,1,0,5.51552308604298
+0,1,1,12.5845789538936
+0,1,0,10.043533800938
+0,1,1,-0.529572566385062
+0,1,0,11.6147463969497
+0,1,1,4.90478267058331
+0,1,0,0.0195534899449052
+0,1,1,-0.623734360455213
+0,1,0,4.62408114371322
+0,1,1,1.77824940819063
+1,1,0,1.44117212105917
+1,1,1,11.8287065860359
+1,1,0,2.48612322817247
+1,1,1,0.79272738518101
+1,1,0,0.25664105566934
+1,1,1,5.46071019138058
+1,1,0,5.45585277762759
+1,1,1,1.72888361535044
+1,1,0,1.64819641981181
+1,1,1,-2.16930041278279
+1,1,0,-8.65527307341515
+1,1,1,-0.820012214512376
+1,1,0,2.00788152825625
+1,1,1,1.35027289877691
+1,1,0,-4.22739939644098
+1,1,1,2.96495871691995
+2,1,0,12.4088103290406
+2,1,1,3.86455595416874
+2,1,0,-3.77098700413387
+2,1,1,8.74834024074986
+2,1,0,7.63806603679012
+2,1,1,5.21226192690189
+2,1,0,10.3553929570071
+2,1,1,5.02404351098283
+2,1,0,0.0773467620776902
+2,1,1,3.8454453091425
+2,1,0,4.92829011733396
+2,1,1,8.18058751908359
+2,1,0,2.32327477678745
+2,1,1,3.85791466127983
+2,1,0,3.9943310874897
+2,1,1,5.63626346312242
+3,0,0,3.98548370346126
+3,0,1,2.08705100717492
+3,0,0,11.9671637110292
+3,0,1,8.50781843122744
+3,0,0,9.45919061415933
+3,0,1,9.17729331028167
+3,0,0,7.13828833203314
+3,0,1,3.60329239073677
+3,0,0,8.91231705393201
+3,0,1,6.51778257944759
+3,0,0,11.8343546913714
+3,0,1,5.85077180326843
+3,0,0,5.99854630042171
+3,0,1,3.83363405966151
+3,0,0,0.1722050079041
+3,0,1,13.599136275248
+4,0,0,7.39942475667185
+4,0,1,6.67760744120808
+4,0,0,8.06177241503042
+4,0,1,2.99574273253755
+4,0,0,4.17388457897458
+4,0,1,8.58064165732133
+4,0,0,1.35389122978994
+4,0,1,10.4092188896989
+4,0,0,10.0502116052792
+4,0,1,-0.0952531532868699
+4,0,0,-0.95058168214624
+4,0,1,2.07486027363059
+4,0,0,8.73683838173143
+4,0,1,6.20299878058477
+4,0,0,8.71918261324184
+4,0,1,6.78513130717663
+5,0,0,5.0782224773697
+5,0,1,2.95137909543423
+5,0,0,-0.00135445066279938
+5,0,1,2.94611250473384
+5,0,0,2.51953938488589
+5,0,1,-1.52349046695107
+5,0,0,0.164141843016002
+5,0,1,1.29757869447802
+5,0,0,2.59101825926055
+5,0,1,4.68185771046451
+5,0,0,0.988240247965372
+5,0,1,-3.44776984931728
+5,0,0,7.81515777811177
+5,0,1,-4.39325130500531
+5,0,0,1.60122778317178
+5,0,1,0.151072360378801