Problem
Several us-data imputations learn from donor survey/admin variables that are only observed, or only meaningfully observed, after a policy/process selection step. Treating non-observed or selected-out donor records as ordinary zeros/negatives can make the imputation learn the current-law selection rule rather than the latent concept we need to transfer to CPS.
Recent SSI disability work exposed this for asset-limit reform: SSI disabled/blind receipt is a positive signal for latent SSA disability criteria, but nonreceipt is censored by SSI resources, income/payment amount, application/take-up, pending/denied claims, and other nonmedical screens. A receipt classifier using assets/income can therefore absorb the current-law asset test and understate newly eligible people under an asset-limit reform.
This may apply more broadly. For example, IRS PUF itemized deductions are effectively observed in a tax filing context where itemization itself is selected by current-law tax incentives and the standard deduction. Imputing itemized deduction amounts onto CPS from all records as if non-itemizers are clean zeros may be reasonable for some output variables, but it may be wrong if the estimand is latent itemizable expenses or potential deductions under a different standard deduction/itemization regime.
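The itemization case can be illustrated with a small simulation. This is a sketch under stated assumptions, not the us-data pipeline: the lognormal expense distribution, both thresholds, and every function name here are hypothetical and chosen only to show the mechanism.

```python
import random

# Illustrative thresholds only; not actual tax parameters.
CURRENT_THRESHOLD = 14_000
REFORM_THRESHOLD = 7_000


def simulate_latent_expenses(n=10_000, seed=0):
    """Draw hypothetical latent itemizable expenses for n donor filers."""
    rng = random.Random(seed)
    return [rng.lognormvariate(9.0, 0.8) for _ in range(n)]


def observed_itemized(values, threshold):
    """Deduction as recorded in the donor file: zero unless above the threshold."""
    return [v if v > threshold else 0.0 for v in values]


def mean(xs):
    return sum(xs) / len(xs)


latent = simulate_latent_expenses()

# Donor file under current law: non-itemizers show up as zeros.
observed = observed_itemized(latent, CURRENT_THRESHOLD)

# Reform with a lower threshold. Starting from the observed values, the
# filers between the two thresholds are still zeros, so they never itemize.
naive_reform = observed_itemized(observed, REFORM_THRESHOLD)
latent_reform = observed_itemized(latent, REFORM_THRESHOLD)

naive_itemizers = sum(v > 0 for v in naive_reform)
latent_itemizers = sum(v > 0 for v in latent_reform)

print(f"mean latent expenses:          {mean(latent):,.0f}")
print(f"mean observed (zeros kept):    {mean(observed):,.0f}")
print(f"reform itemizers from observed: {naive_itemizers}")
print(f"reform itemizers from latent:   {latent_itemizers}")
```

In this toy setup, anything trained on the observed column cannot recover the filers whose expenses fall between the two thresholds, which is the same mechanism as the SSI asset-test example.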
Investigate
- Inventory imputed variables whose donor values are censored or selected by current-law rules, reporting behavior, application behavior, or filing choices.
- For each, define the estimand explicitly: observed/current-law outcome vs latent underlying concept vs potential value under alternate policy.
- Identify which rows should be:
  - out of universe,
  - positive/observed labels,
  - reliable zeros/negatives,
  - unlabeled/censored but still in the prediction universe.
- Decide when to use two-stage or hurdle models, e.g. participation/itemization/application/take-up equation plus conditional amount equation.
- Add tests/diagnostics for policy-threshold artifacts, such as cliffs in latent imputed values at SSI asset limits or the standard deduction/itemization margin.
- Document the pattern so future imputations do not default to training on censored donor outcomes as if they were clean labels.
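The four row roles listed above could be made explicit in code before any model is fit. The following is a minimal sketch, assuming each donor row can be summarized by three flags; `RowRole`, `DonorRow`, `cleared_nontarget_screens`, and `split_rows` are hypothetical names, not existing us-data identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class RowRole(Enum):
    OUT_OF_UNIVERSE = "out_of_universe"
    POSITIVE = "positive"
    RELIABLE_NEGATIVE = "reliable_negative"
    CENSORED = "censored"


@dataclass
class DonorRow:
    in_universe: bool
    # True when the donor outcome is observed as positive (e.g. reported receipt).
    outcome_observed: bool
    # True only when the row is known to have passed every selection step that
    # is unrelated to the latent concept, so a negative outcome is informative.
    # None means unknown, which is treated as censored.
    cleared_nontarget_screens: Optional[bool] = None


def assign_role(row: DonorRow) -> RowRole:
    """Map a donor row to its role for training and prediction."""
    if not row.in_universe:
        return RowRole.OUT_OF_UNIVERSE
    if row.outcome_observed:
        return RowRole.POSITIVE
    if row.cleared_nontarget_screens is True:
        return RowRole.RELIABLE_NEGATIVE
    return RowRole.CENSORED


def split_rows(rows: List[DonorRow]) -> Tuple[List[DonorRow], List[DonorRow]]:
    """Return (labeled training rows, rows in the prediction universe)."""
    labeled = {RowRole.POSITIVE, RowRole.RELIABLE_NEGATIVE}
    train = [r for r in rows if assign_role(r) in labeled]
    predict = [r for r in rows if assign_role(r) is not RowRole.OUT_OF_UNIVERSE]
    return train, predict


rows = [
    DonorRow(in_universe=False, outcome_observed=False),
    DonorRow(in_universe=True, outcome_observed=True),
    DonorRow(in_universe=True, outcome_observed=False, cleared_nontarget_screens=True),
    DonorRow(in_universe=True, outcome_observed=False, cleared_nontarget_screens=False),
]
train_rows, predict_rows = split_rows(rows)
print(len(train_rows), len(predict_rows))
```

Whether censored rows are dropped from training, reweighted, or handled with a positive-unlabeled method is a design choice this sketch leaves open; it only keeps them in the prediction universe.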
Examples to review
- SSI disability criteria imputation from SIPP receipt/reason and disability evidence.
- SSI asset/resource imputations and use of imputed vs observed donor values.
- PUF-to-CPS itemized deduction imputations, especially whether non-itemizers should be modeled as true zero itemized expenses or censored/unobserved potential expenses depending on the downstream policy use.
- Any other source-impute or calibration surface where a donor variable reflects current-law participation, claiming, filing, or eligibility rather than a latent underlying value.
Desired outcome
A short design note plus targeted code/test changes where needed. The goal is not to make every model more complex, but to make the training universe and labels match the estimand for policy counterfactuals.
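One possible shape for a threshold-artifact test is sketched below. It assumes the imputed latent value can be compared in narrow bands on either side of a policy threshold; `threshold_jump`, the bandwidth, the tolerance, and the toy data are all hypothetical, and the $2,000 figure is used only as an illustrative asset limit.

```python
def threshold_jump(running, imputed, threshold, bandwidth):
    """Difference in mean imputed value just above vs just below a threshold.

    `running` is the variable the policy rule is written in (e.g. countable
    assets); `imputed` is the imputed latent value for the same records.
    """
    below = [y for x, y in zip(running, imputed)
             if threshold - bandwidth <= x < threshold]
    above = [y for x, y in zip(running, imputed)
             if threshold <= x < threshold + bandwidth]
    if not below or not above:
        raise ValueError("no records on one side of the threshold")
    return sum(above) / len(above) - sum(below) / len(below)


ASSET_LIMIT = 2_000   # illustrative threshold
BANDWIDTH = 500       # hypothetical band width
TOLERANCE = 0.02      # hypothetical acceptable jump

assets = list(range(0, 4_000, 10))

# Toy imputation that varies smoothly with assets.
smooth = [0.30 - 0.00001 * a for a in assets]
# Toy imputation that has absorbed the current-law asset test.
cliff = [0.30 if a < ASSET_LIMIT else 0.05 for a in assets]

smooth_jump = threshold_jump(assets, smooth, ASSET_LIMIT, BANDWIDTH)
cliff_jump = threshold_jump(assets, cliff, ASSET_LIMIT, BANDWIDTH)

print(f"smooth jump: {smooth_jump:+.3f}")
print(f"cliff jump:  {cliff_jump:+.3f}")
```

A plain difference in band means also picks up any genuine slope in the running variable, so a local-linear fit on each side may be a more careful version of the same check.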