diff --git a/CHANGELOG.md b/CHANGELOG.md index 347c8718..4b8f55f9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -99,6 +99,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 the two previously-provisional REGISTRY notes (RA influence-function variance convention; pooled fixed-composition estimand) resolve in the library's favour with no estimator change. +### Changed +- **`EfficientDiD` survey-weighted Silverman bandwidth** (covariate DR path). The auto + Silverman bandwidth for the kernel-smoothed conditional `Omega*(X)` now uses a + **survey-weighted** per-dimension dispersion (weighted mean/std over the positive-weight + support) instead of the unweighted sample dispersion, so the bandwidth reflects the + population covariate distribution the kernel targets. The rate term `n` remains the + positive-weight support count. This shifts the DR point estimate and SE only in + **overidentified (H>1) covariate cells under non-uniform survey weights**; under uniform + weights it reduces to the previous bandwidth up to floating point, and it preserves the + existing invariances to zero-weight (subpopulation / padded) rows and to weight rescaling. + Non-survey and just-identified (H=1) paths are unchanged. + ## [3.6.0] - 2026-06-29 ### Added diff --git a/TODO.md b/TODO.md index 837cde40..d8b9305a 100644 --- a/TODO.md +++ b/TODO.md @@ -33,7 +33,6 @@ The `Origin` column (Actionable tables) and the `PR` column (Deferred tables) bo | `SyntheticControl` remaining ADH-2015 §4 items: the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`) and sparse-SC subset search (`l < J`, holding `V` fixed). LOO, in-time placebo, CV `V`-selection, and inverse-variance `V` have landed; these two are the deferred tail. | `synthetic_control.py`, `synthetic_control_results.py` | ADH-2015 | Mid | Low | | `SyntheticControl` conformal (CWZ 2021) extensions: (a) one-sided / signed-`t` variants (§7); (b) covariates in the conformal proxy (`X_jt`, eqs 4/6 — current proxy is outcomes-only); (c) AR / innovation-permutation path (Lemmas 5-7) for time-series proxies. The joint test, pointwise CIs, and average-effect CI have landed. | `conformal.py`, `synthetic_control_results.py` | CWZ-2021 | Heavy | Low | | `ContinuousDiD` CGBS-2024 extensions (matches R `contdid` v0.1.0 deferral set): (a) `covariates=` kwarg; (b) discrete-treatment saturated regression (integer dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per Remark 3.1 when `P(D=0)=0`. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these `[ ]`. | `continuous_did.py` | CGBS-2024 | Heavy | Low | -| `EfficientDiD` survey-weighted Silverman bandwidth in conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std; survey-weighted statistics better reflect the population distribution (second-order refinement). | `efficient_did_covariates.py` | — | Quick | Low | | `ImputationDiD` LOO conservative-variance refinement (BJS 2024 Supp. Appendix A.9) — a finite-sample improvement to the auxiliary-model residuals reducing overfit of `tau_tilde_g` to `epsilon`. Asymptotic Theorem-3 variance is implemented and matches R `didimputation` (which also omits LOO by default). | `imputation.py` | imputation-validation | Mid | Low | | `TwoWayFixedEffects(vcov_type in {hc2, hc2_bm})` with replicate-weight designs raises `NotImplementedError` (`twfe.py:~233`). The replicate path re-demeans per replicate, which doesn't compose with the full-dummy HC2/HC2-BM build — a correct impl needs per-replicate full-dummy refit. Workaround: `hc1` for replicate-weight CR1. | `twfe.py::fit` | follow-up | Heavy | Low | | TWFE's HC2/HC2-BM inline full-dummy build (`twfe.py:280-315`) duplicates the dummy-construction logic in `DifferenceInDifferences(fixed_effects=...)` (`estimators.py:478-486`). Extract a shared helper, or delegate TWFE's HC2/HC2-BM path to DiD's `fixed_effects=` branch (with TWFE-specific cluster-default threading), to reduce drift risk on FE naming / survey behavior / result-surface conventions. Substantive refactor — touches both estimators. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit` | follow-up | Heavy | Low | diff --git a/diff_diff/efficient_did_covariates.py b/diff_diff/efficient_did_covariates.py index 69ad4109..393f98d8 100644 --- a/diff_diff/efficient_did_covariates.py +++ b/diff_diff/efficient_did_covariates.py @@ -805,22 +805,45 @@ def _silverman_bandwidth(X: np.ndarray, unit_weights: Optional[np.ndarray] = Non ``h = (4 / (d + 2))^{1/(d+4)} * median_std * n^{-1/(d+4)}`` When ``unit_weights`` is provided, the rule is evaluated on the - **positive-weight support** (rows with ``w > 0``) only. The dispersion - statistic stays unweighted within that support (a documented second-order - simplification — survey-weighted bandwidth is deferred), but dropping - zero-weight rows keeps the bandwidth — and hence the kernel-smoothed - ``Omega*(X)`` and the per-unit efficient weights it feeds — invariant to - zero-weight (survey-subpopulation / padded) rows: such rows carry no - information yet would otherwise move both ``n`` and the standard deviation - (e.g. a zero-weight row with an extreme covariate inflates ``median_std``). - Falls back to the full matrix if the support is empty. + **positive-weight support** (rows with ``w > 0``) only, and the per-dimension + dispersion is **survey-weighted**: ``median_std`` is the median across + covariate dimensions of the weighted standard deviation + ``sqrt(sum_i w_i (x_i - xbar_w)^2 / sum_i w_i)`` with the weighted mean + ``xbar_w = sum_i w_i x_i / sum_i w_i``. Survey-weighted moments reflect the + population distribution the kernel-smoothed ``Omega*(X)`` targets, rather than + the unweighted sample. The rate term ``n`` remains the positive-weight support + count (the dispersion is weighted; the sample-size term is not — a deliberately + scoped refinement, not Kish ``n_eff``). + + Invariances preserved: + + - **Weight scale** (``w -> c*w``, ``c > 0``): the weighted mean/std and the + positive-weight count are all invariant, so the bandwidth is unchanged. + - **Zero-weight (survey-subpopulation / padded) rows**: zero-weight rows drop + from the support, contribute nothing to the weighted moments, and do not + change the count, so the bandwidth — and hence ``Omega*(X)`` and the + per-unit efficient weights it feeds — is invariant to such padding (e.g. a + zero-weight row with an extreme covariate cannot inflate ``median_std``). + - **Uniform positive weights**: the weighted std reduces to the unweighted + population std, matching the pre-refinement bandwidth up to floating point. + + Falls back to the unweighted full matrix when no weights are given or the + positive-weight support is empty. """ + weights = None if unit_weights is not None: support = unit_weights > 0 if np.any(support): X = X[support] + weights = unit_weights[support] n, d = X.shape - stds = np.std(X, axis=0) + if weights is not None: + w_norm = weights / weights.sum() + weighted_mean = w_norm @ X + weighted_var = w_norm @ (X - weighted_mean) ** 2 + stds = np.sqrt(weighted_var) + else: + stds = np.std(X, axis=0) stds[stds < 1e-10] = 1.0 median_std = float(np.median(stds)) h = (4.0 / (d + 2)) ** (1.0 / (d + 4)) * median_std * n ** (-1.0 / (d + 4)) diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 827dff52..c39c1608 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1137,7 +1137,7 @@ where `q_{g,e} = pi_g / sum_{g' in G_{trt,e}} pi_{g'}`. - **Note:** Outcome regressions m_hat_{g',t,tpre}(X) use a polynomial sieve (total degree up to K) with AIC/BIC order selection — the same basis family as the propensity-ratio sieve — matching the paper's flexible nonparametric nuisance specification (Section 4). The sieve order is selected by an OLS information criterion `IC = n*ln(RSS/n) + c_n*p_K`, where `p_K = comb(K+d, d)` is the **sieve basis dimension** (the number of fitted coefficients; `d` = covariate count) and `c_n = 2` (AIC) / `ln(n)` (BIC), on the within-group (survey-weighted) residual sum of squares; the within-group **positive-weight support** count is used for both `n` and the penalty (the raw row count when unweighted) so the selected order — and hence m_hat — is invariant both to the survey-weight scale and to zero-weight (survey-subpopulation / padded) rows. Zero-weight observations are inert in the weighted RSS, weighted Gram, and weighted loss totals; keying order selection (auto-k_max, the `n_basis` admissibility cap, and the IC sample-size terms) off the positive-weight support keeps them inert for selection too — otherwise padding the panel with zero-weight rows could push `floor(n^{1/5})` to a higher candidate degree and silently change the selected K (hence the DR estimate). This applies identically to the two sieve propensity nuisances. Degree 1 reproduces a linear working model up to floating point, so AIC/BIC degrades to linear when the conditional mean is linear and grows the order only when the data support it. There is **no fixed order ceiling**: the candidate maximum degree grows as `floor(n_pos^{1/5})` with the (positive-weight) group support `n_pos`, giving a sieve dimension `p_K = comb(K+d, d)` bounded by `n_basis = p_K < n_pos` — a *growing* sieve. Assumption C.1's regularity conditions are stated in terms of the sieve **dimension** (not the polynomial degree, which differ once `d > 1`): uniform consistency (C.1(5), `||m_hat - m||_inf = o_p(1)`) and the `o_p(n^{-1/2})` product rate (C.1(6)) require `p_K -> ∞` with `p_K = o(n)`. The growing sieve satisfies this for the low-dimensional covariate settings typical of DiD — with the `floor(n^{1/5})` degree rule, `p_K = comb(floor(n^{1/5})+d, d) = o(n)` for small `d`; for high-dimensional `X` a polynomial sieve faces the curse of dimensionality (`p_K` can outpace `o(n)`), where the paper's ML-nuisance option (Remark 4.2) is preferable. Under C.1 the doubly robust covariate path attains the semiparametric efficiency bound asymptotically (Theorem 4.1); a frozen finite-order sieve (fixed `p_K`) would violate C.1(5)/(6). The DR property still ensures consistency, regardless of `d`, if either the outcome regression or the propensity ratio is correctly specified. - **Note:** If every sieve degree is rank-skipped for an outcome regression (a comparison group too small for even the linear basis, or a degenerate/constant covariate), the estimator falls back to the intercept-only within-group mean of `Y_t - Y_{tpre}` (the unconditional outcome regression) with a UserWarning — distinct from the propensity-ratio sieve's constant-ratio-1 fallback. - **Note:** EfficientDiD bootstrap with survey weights supported (Phase 6) via PSU-level multiplier weights -- **Note:** EfficientDiD covariates (DR path) with survey weights supported — WLS outcome regression, weighted sieve normal equations for propensity ratios/inverse propensities, survey-weighted Nadaraya-Watson kernel for conditional Omega*(X), and survey-weighted ATT averaging. The auto Silverman bandwidth for the conditional Omega*(X) kernel is evaluated on the **positive-weight support** (rows with `w > 0`): the dispersion statistic stays unweighted within that support (a fully survey-weighted bandwidth is deferred as a second-order refinement), but excluding zero-weight rows keeps the bandwidth — and hence Omega*(X) and the per-unit efficient weights it feeds in overidentified (H>1) cells — invariant to zero-weight (survey-subpopulation / padded) rows. Together with the positive-weight-support sieve order selection, this makes the DR point estimate exactly invariant to zero-weight padding (a zero-weight row with an extreme covariate would otherwise inflate the unweighted std and move the bandwidth). +- **Note:** EfficientDiD covariates (DR path) with survey weights supported — WLS outcome regression, weighted sieve normal equations for propensity ratios/inverse propensities, survey-weighted Nadaraya-Watson kernel for conditional Omega*(X), and survey-weighted ATT averaging. The auto Silverman bandwidth for the conditional Omega*(X) kernel is evaluated on the **positive-weight support** (rows with `w > 0`) with a **survey-weighted dispersion**: `median_std` is the median across covariate dimensions of the weighted standard deviation `sqrt(sum_i w_i (x_i - xbar_w)^2 / sum_i w_i)` (weighted mean `xbar_w = sum_i w_i x_i / sum_i w_i`), so the bandwidth reflects the population covariate distribution the kernel targets rather than the unweighted sample. The rate term `n` stays the positive-weight support count (the dispersion is weighted; the sample-size term is not — a deliberately scoped refinement, not Kish `n_eff`). Because zero-weight rows drop from the support and contribute nothing to the weighted moments or the count, and because the weighted mean/std and the count are invariant to weight rescaling `w -> c*w`, the bandwidth — and hence Omega*(X) and the per-unit efficient weights it feeds in overidentified (H>1) cells — is invariant to both zero-weight (survey-subpopulation / padded) rows and weight scale; under uniform positive weights the weighted std reduces to the unweighted population std, matching the pre-refinement bandwidth up to floating point. Together with the positive-weight-support sieve order selection, this keeps the DR point estimate exactly invariant to zero-weight padding (a zero-weight row with an extreme covariate would otherwise move the dispersion and the bandwidth). - **Note:** Cluster-robust SEs use the standard Liang-Zeger clustered sandwich estimator applied to EIF values: aggregate EIF within clusters, center, and compute variance with G/(G-1) small-sample correction. Cluster bootstrap generates multiplier weights at the cluster level (all units in a cluster share the same weight). Analytical clustered SEs are the default when `cluster` is set; cluster bootstrap is opt-in via `n_bootstrap > 0`. - **Note:** Hausman pretest operates on the post-treatment event-study vector ES(e) per Theorem A.1. Both PT-All and PT-Post fits are aggregated to ES(e) using cohort-size weights before computing the test statistic H = delta' V^{-1} delta where delta = ES_post - ES_all and V = Cov(ES_post) - Cov(ES_all). Covariance is computed from aggregated ES(e)-level EIF values. The variance-difference matrix V is inverted via Moore-Penrose pseudoinverse to handle finite-sample non-positive-definiteness. Effective rank of V (number of positive eigenvalues) is used as degrees of freedom. - **Note:** Last-cohort-as-control (`control_group="last_cohort"`) reclassifies the latest treatment cohort as pseudo-never-treated and drops time periods at `t >= last_g - anticipation`, excluding anticipation-contaminated periods from the pseudo-control's pre-treatment window. This is distinct from CallawaySantAnna's `not_yet_treated` option which dynamically selects not-yet-treated units per (g,t) pair. diff --git a/tests/test_methodology_efficient_did.py b/tests/test_methodology_efficient_did.py index d61ce757..d3650b85 100644 --- a/tests/test_methodology_efficient_did.py +++ b/tests/test_methodology_efficient_did.py @@ -59,6 +59,7 @@ from diff_diff import CallawaySantAnna, EfficientDiD from diff_diff.efficient_did import _hausman_quadratic_form from diff_diff.efficient_did_covariates import ( + _silverman_bandwidth, estimate_inverse_propensity_sieve, estimate_outcome_regression, estimate_propensity_ratio_sieve, @@ -764,6 +765,78 @@ def test_inverse_propensity_sieve_selects_order_six(self): np.testing.assert_allclose(auto, cap6, atol=1e-9) # specifically order 6 +# ============================================================================= +# Survey-weighted Silverman bandwidth (conditional Omega* kernel) +# ============================================================================= + + +class TestSurveyWeightedSilvermanBandwidth: + """`_silverman_bandwidth` uses a survey-weighted per-dimension dispersion on + the positive-weight support, with the rate term `n` kept as the positive- + weight count (REGISTRY EfficientDiD covariates Note). These lock the + refinement's behavior and its invariances. + """ + + @staticmethod + def _weighted_median_std_bandwidth(X, w): + """Reference: median-of-weighted-std Silverman bandwidth over the + positive-weight support, n = positive-weight count.""" + support = w > 0 + Xs, ws = X[support], w[support] + n, d = Xs.shape + wn = ws / ws.sum() + mean = wn @ Xs + var = wn @ (Xs - mean) ** 2 + stds = np.sqrt(var) + stds[stds < 1e-10] = 1.0 + median_std = float(np.median(stds)) + return (4.0 / (d + 2)) ** (1.0 / (d + 4)) * median_std * n ** (-1.0 / (d + 4)) + + def test_matches_weighted_reference_formula(self): + # Under non-uniform weights the bandwidth equals the hand-computed + # weighted-dispersion formula (not the unweighted one). + rng = np.random.default_rng(11) + X = rng.normal(size=(250, 3)) + w = rng.uniform(0.4, 3.0, size=250) + got = _silverman_bandwidth(X, w) + expected = self._weighted_median_std_bandwidth(X, w) + np.testing.assert_allclose(got, expected, rtol=0.0, atol=1e-12) + # And it genuinely differs from the unweighted bandwidth. + assert abs(got - _silverman_bandwidth(X)) > 1e-4 + + def test_reduces_to_unweighted_under_uniform_weights(self): + # Uniform positive weights -> weighted std == unweighted population std. + rng = np.random.default_rng(12) + X = rng.normal(size=(200, 2)) + np.testing.assert_allclose( + _silverman_bandwidth(X, np.full(200, 3.7)), + _silverman_bandwidth(X), + rtol=0.0, + atol=1e-12, + ) + + def test_invariant_to_weight_scale(self): + # w -> c*w leaves the weighted mean/std and the count unchanged. + rng = np.random.default_rng(13) + X = rng.normal(size=(180, 3)) + w = rng.uniform(0.5, 2.0, size=180) + np.testing.assert_allclose( + _silverman_bandwidth(X, w), _silverman_bandwidth(X, 137.0 * w), rtol=0.0, atol=1e-12 + ) + + def test_invariant_to_zero_weight_padding(self): + # Appending extreme zero-weight rows must not move the bandwidth: they + # leave the support, the weighted moments, and the count all unchanged. + rng = np.random.default_rng(14) + X = rng.normal(size=(150, 2)) + w = rng.uniform(0.5, 2.0, size=150) + X_pad = np.vstack([X, rng.uniform(30.0, 60.0, size=(40, 2))]) + w_pad = np.concatenate([w, np.zeros(40)]) + np.testing.assert_allclose( + _silverman_bandwidth(X, w), _silverman_bandwidth(X_pad, w_pad), rtol=0.0, atol=0.0 + ) + + # ============================================================================= # End-to-end survey invariance: zero-weight padding must not change the estimate # =============================================================================