Add collapsed SOCA imputation for capital gains basis and holding period

## Summary

Build a first-pass capital-gains basis and holding-period imputation in `policyengine-us-data` that supports capital-gains basis indexation analysis while staying close enough to Yale Budget Lab's published approach to benchmark against CRS/Yale estimates.

This issue covers the **collapsed tax-unit version**: one representative long-term capital-gains holding period and one representative long-term capital-gains basis per tax unit. It should improve on Yale where feasible, but remain simple enough to implement and validate before a synthetic-lot version.

Implementation PRs:

- PolicyEngine-US input variables: PolicyEngine/policyengine-us#8423
- Draft `policyengine-us-data` collapsed SOCA imputation: PolicyEngine/policyengine-us-data#1128
- Draft PolicyEngine-US indexation reform: PolicyEngine/policyengine-us#8427

## Current Status

As of May 24, 2026:

- `policyengine-us#8423` adds `long_term_capital_gains_basis` and `long_term_capital_gains_years_held` with focused YAML tests. It was rebased onto current `upstream/main` and now sets `long_term_capital_gains_basis.uprating = "calibration.gov.irs.soi.long_term_capital_gains"`, while leaving years-held non-uprated. Latest pushed commit: `827c7e161f`.
- `policyengine-us-data#1128` implements the deterministic collapsed SOCA imputation, wires it into PUF/Extended CPS generation, backfills release PUF reads when the new fields are missing, excludes basis/holding period from independent QRF imputation, and adds focused tests.
- `$cycle` review found two actionable issues: release PUF load/read backfill and a missing test proving QRF exclusion. Both were fixed in `policyengine-us-data` commit `d1595678`; the second review pass reported no actionable findings.
- `policyengine-us#8427` is stacked on `#8423`, adds the tax-unit-first indexation reform, and was rebased/pushed at `2a09ae8d5f` after the basis-uprating fix.
- Remaining blocker: merge/release `policyengine-us#8423`, then bump the `policyengine-us` pin/lock in `policyengine-us-data#1128` to the release containing those variables and mark the draft PR ready. Do not treat a bump to a release lacking the new variables as sufficient for final merge, even if it satisfies freshness CI.

## Replication And Benchmark Status

Local Yale replication harness: `/Users/maxghenis/_tmp/mark-keightley-yale-replication/replicate_yale_capital_gains.py`.

Published Yale workbook targets extracted:

- Prospective 2026-2035 revenue change: `-$168.5B`.
- Retrospective 2026-2035 revenue change: `-$952.7B`.
- Retrospective 2027 average tax change for the top 0.1 percent: `-$353,235`.

Public-code replication:

- Reimplemented the public SOCA/Budget-Lab imputation pieces from Yale's `Tax-Data` resources.
- Compared with Yale Figure 1 `Tax-Data 2026`, the public-resource reproduction has max absolute difference `0.004715` for gain-dollar holding-period shares and `0.003829` for basis-to-sales ratios.
- Full Yale revenue/distribution rerun is blocked locally because `Tax-Simulator` points to generated model interfaces not present in the public checkout, including `Tax-Data` vintage `2026030513` and `Macro-Projections` vintage `2026022522`.
- Important reconciliation detail: Yale's article says indexation is not allowed to generate losses, but the public `indexed_kg.csv` runscript scenarios inherit `index_kg_no_loss = 0`; only `indexed_kg_retro_noloss` sets the no-loss flag and it is not in that runscript. Treat no-loss on/off as an explicit benchmark axis.

PolicyEngine scratch score:

- Scratch dataset: enhanced CPS with the collapsed SOCA basis/holding-period fields added manually.
- 2026 retrospective score moved from `-$31.4B` to `-$35.1B` after basis uprating.
- 2026 scratch inputs: `long_term_capital_gains_before_response = $345.3B`, imputed basis `$1,058.9B`, indexation adjustment `$204.5B`.
- This is not a benchmark-quality Yale comparison because the scratch enhanced-CPS gains base is far below the PUF/CBO-scale target (`$1.68T` parameter value for 2026 long-term gains). Default enhanced CPS still scores zero until the new basis/holding-period fields are present in a real data artifact.
- Local PUF benchmark remains blocked: the `puf_2024.h5` release URL returns 404 without private release access, and no local `puf_2021/2023/2024.h5` artifact was found in the usual cache/worktree locations.

Local verification after the basis-uprating fix:

- `uv run ruff check ...` on touched PolicyEngine-US Python files: pass.
- `uv run pytest policyengine_us/tests/test_system_import.py -q`: 5 passed.
- `uv run python -m policyengine_core.scripts.policyengine_command test policyengine_us/tests/policy/contrib/capital_gains_indexation.yaml -c policyengine_us`: 7 passed.
- `uv run python -m policyengine_core.scripts.policyengine_command test policyengine_us/tests/variables/household/income/person/capital_gains/long_term_capital_gains_basis.yaml policyengine_us/tests/variables/household/income/person/capital_gains/long_term_capital_gains_years_held.yaml -c policyengine_us`: 4 passed.

## Motivation

CRS's May 20, 2026 report, **IF13231, "Indexing Capital Gains Taxes for Inflation: Marginal Effective Tax Rates and Revenue Estimates,"** cites Yale Budget Lab's Tax-Simulator estimate for capital-gains basis indexation. Yale's model uses two imputed fields absent from the public-use tax data:

- `kg_lt_years_held`: weighted-average holding period for net long-term capital gains.
- `kg_lt_basis`: imputed cost basis for net long-term capital gains.

PolicyEngine currently imputes many PUF-derived fields into Enhanced CPS, including pre-response long-term capital gains, but does not have a cost-basis or holding-period imputation suitable for capital-gains indexation.

## External Reference

Yale Budget Lab's appendix describes the current benchmark approach:

- Use IRS SOI Sales of Capital Assets (SOCA) data for holding-period buckets, sales price, basis, gains, and losses.
- Draw a representative holding period for each PUF record with nonzero net long-term capital gains or losses.
- Estimate basis-to-sales ratios by holding period and market-cycle proxy.
- Convert realized net gain/loss into an imputed basis.
- Apply CPI basis indexation in the tax calculator.

Relevant Yale code:

- `Budget-Lab-Yale/Tax-Data`, `src/impute_variables.R`: SOCA basis and holding-period imputation.
- `Budget-Lab-Yale/Tax-Data`, `src/project_puf.R`: cycle-aware basis projection adjustment.
- `Budget-Lab-Yale/Tax-Simulator`, `src/calc/functions/income/kg.R`: indexation tax logic.

## Proposed Scope

Add two **person-level storage variables**, aligned with the pre-response long-term gains input used by `policyengine-us`:

- `long_term_capital_gains_years_held`
- `long_term_capital_gains_basis`

Use them to support a later `policyengine-us` reform that can index long-term capital-gains basis. The collapsed imputation object should be **tax-unit-level first**, because the source and policy mechanics are tax-return/tax-unit oriented:

```text
tax_unit_long_term_capital_gains_before_response
  = sum(person.long_term_capital_gains_before_response)
```

Compute sign, holding-period bucket, basis-to-sales ratio (BSR), basis, proceeds, and the indexation adjustment at the tax-unit collapsed-transaction level. Treat person-level variables as storage and compatibility fields. `policyengine-us` should not compute independent person-level indexation adjustments for this collapsed model unless tests prove exact equivalence to the tax-unit collapsed result after aggregation.

Impute basis and holding period from baseline pre-response long-term gains and keep them aligned with `long_term_capital_gains_before_response`; behavioral changes should not trigger re-imputation of basis.

If a tax unit's long-term gains are split across people, assign the same representative holding period to each person with nonzero pre-response long-term gains in that tax unit and allocate storage basis in proportion to `abs(long_term_capital_gains_before_response)` only as a storage convention. The hard contract is tax-unit equivalence: add acceptance tests requiring exact equality between the tax-unit collapsed calculation and the person-stored representation for tax units with positive/zero, positive/positive, positive/negative, negative/negative, and one-large-gain/one-small-loss spouse patterns.
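The storage convention above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function name `allocate_basis_cents` is hypothetical, and working in integer cents is an assumption made here so that person-level amounts re-aggregate to the tax-unit basis exactly rather than within floating-point tolerance.

```python
def allocate_basis_cents(tax_unit_basis_cents: int, person_gains_cents: list[int]) -> list[int]:
    """Split a tax-unit basis across people in proportion to abs(gain).

    Hypothetical helper: amounts are integer cents so the person-level
    values sum back to the tax-unit basis with no rounding drift.
    """
    weights = [abs(g) for g in person_gains_cents]
    total = sum(weights)
    if total == 0:
        # No long-term gains or losses anywhere in the tax unit.
        return [0] * len(weights)
    # Floor each proportional share, then hand out the leftover cents.
    alloc = [tax_unit_basis_cents * w // total for w in weights]
    leftover = tax_unit_basis_cents - sum(alloc)
    # Largest absolute gain first; ties break by position for stability.
    order = sorted(range(len(weights)), key=lambda i: (-weights[i], i))
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Mixed-sign spouses: one large gain, one small loss.
person_basis = allocate_basis_cents(100_000, [90_000, -10_000])
assert sum(person_basis) == 100_000
```

A floating-point pipeline would need either a residual-assignment rule like this one or an explicit tolerance in the equivalence tests.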

The first implementation should be deterministic and record-stable, not merely reproducible under a fixed random seed. Generate pseudo-random keys from stable identifiers such as dataset vintage, tax year, tax unit/person ID, gain/loss sign, and imputation version, then sort within each quota cell by those keys. Shuffling input rows, chunking, or parallel execution should not change assigned holding periods or basis after sorting back to IDs.
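One way to build such record-stable keys is to hash the identifying fields, sketched below. The function name `stable_key`, the field separator, and the choice of SHA-256 are assumptions for illustration; the issue does not prescribe a hash.

```python
import hashlib


def stable_key(vintage: str, tax_year: int, record_id: int, sign: int, version: str) -> float:
    """Map stable identifiers to a pseudo-random key in [0, 1).

    Hypothetical helper: the key depends only on its arguments, never on
    row order, chunking, or any global random state.
    """
    text = f"{vintage}|{tax_year}|{record_id}|{sign}|{version}"
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


# The same record gets the same key regardless of processing order.
ids = [101, 102, 103, 104]
forward = {i: stable_key("v1", 2026, i, 1, "soca-0") for i in ids}
backward = {i: stable_key("v1", 2026, i, 1, "soca-0") for i in reversed(ids)}
assert forward == backward
```

Including the imputation version in the hashed text means a deliberate method change reshuffles assignments, while reruns of the same version do not.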

This issue should not attempt to model multiple transactions per tax unit. That belongs in the synthetic-lots follow-up issue.

## Proposed Method

### 1. Source Data

Use public IRS SOI Sales of Capital Assets data. There are two distinct SOCA inputs:

- The 2013-2015 SOCA release, which Yale uses for the most recent detailed holding-period and gain/loss distributions.
- The historical basis-to-sales panel used in Yale's `soca_basis_sales.csv`, covering 1985 and 1997-2015, which supports the BSR market-cycle regression.

Extract or vendor compact resource tables needed for holding-period buckets, gain/loss dollar-weighted holding-period shares, basis-to-sales ratios, and available AGI/asset-type splits for validation and possible improved conditioning.

Before implementation, build a source matrix with one row per target moment and columns for exact IRS table/source, public/private availability, year coverage, gain/loss sign coverage, AGI coverage, asset-type coverage, and coefficient-of-variation availability. Public SOCA tables may not support all AGI x asset type x sign x holding-period x BSR cells; unavailable or high-CV cells should fall back to reliable marginals rather than creating false precision.

### 2. Holding-Period Assignment

For each record, assign by the sign of pre-response long-term capital gains:

- If `long_term_capital_gains_before_response > 0`, assign from a gain-dollar-weighted SOCA holding-period distribution.
- If `long_term_capital_gains_before_response < 0`, assign from a loss-dollar-weighted SOCA holding-period distribution.
- If `long_term_capital_gains_before_response == 0`, set holding period and basis to zero in released PE datasets.

Improve on Yale by making conditioning configurable:

- MVP: unconditional holding-period distribution, matching Yale.
- Preferred: condition on AGI band when SOCA tabulations support it, then fall back to unconditional distributions for sparse cells.
- Optional: condition on filing status or broad asset type only if validation shows it materially improves fit.

Use a documented conditioning hierarchy and shrinkage strategy. Require minimum weighted denominators and maximum CV thresholds before using a conditioned cell. Use raking or empirical-Bayes shrinkage to combine sparse conditioned moments with reliable marginals.
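As a rough illustration of shrinking a sparse conditioned cell toward a reliable marginal, the sketch below blends two share vectors with a weight that grows with the cell's effective sample size. The function name `shrink_shares` and the `prior_strength` parameter are hypothetical; the actual thresholds and estimator remain to be chosen.

```python
def shrink_shares(cell_shares, marginal_shares, cell_n, prior_strength):
    """Blend conditioned holding-period shares with marginal shares.

    Hypothetical helper: `cell_n` is the cell's effective sample size and
    `prior_strength` is the sample size at which the cell and the marginal
    get equal weight. Both are illustrative tuning inputs.
    """
    weight = cell_n / (cell_n + prior_strength)
    blended = [weight * c + (1 - weight) * m for c, m in zip(cell_shares, marginal_shares)]
    total = sum(blended)
    # Renormalize so the shares sum to one after blending.
    return [b / total for b in blended]


# A cell with no observations falls back to the marginal distribution.
assert shrink_shares([0.9, 0.1], [0.5, 0.5], cell_n=0, prior_strength=50) == [0.5, 0.5]
```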

The assignment should not rely only on independent random draws. Capital gains are concentrated enough that a seeded draw can miss SOCA dollar-weighted targets. Instead, assign holding-period buckets within each conditioning cell using `weight * abs(long_term_capital_gains_before_response)` so bucket shares hit SOCA gain-dollar/loss-dollar targets within tolerance. Then draw only the continuous within-bucket holding period.
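A minimal sketch of this quota-style assignment follows, assuming each record already carries a stable pseudo-random key. The function name `assign_buckets`, the tuple layout, and the bucket labels are hypothetical. Records are sorted by key, and each record takes the bucket whose cumulative target share contains the midpoint of the record's own dollar interval.

```python
from bisect import bisect_right
from itertools import accumulate


def assign_buckets(records, target_shares):
    """Assign holding-period buckets so dollar shares track the targets.

    records: list of (record_id, weight, gain, key) tuples for one
        conditioning cell and one sign, with positive total dollars.
    target_shares: list of (bucket_label, share) pairs summing to 1.
    """
    ordered = sorted(records, key=lambda r: (r[3], r[0]))
    dollars = [w * abs(g) for _, w, g, _ in ordered]
    total = sum(dollars)
    # Cumulative upper edge of each bucket on the [0, 1] dollar axis.
    edges = list(accumulate(share for _, share in target_shares))
    assignment = {}
    running = 0.0
    for (record_id, _, _, _), d in zip(ordered, dollars):
        midpoint = (running + d / 2) / total
        idx = min(bisect_right(edges, midpoint), len(target_shares) - 1)
        assignment[record_id] = target_shares[idx][0]
        running += d
    return assignment


targets = [("1-2y", 0.5), ("2-5y", 0.25), ("5y+", 0.25)]
rows = [(1, 1.0, 100.0, 0.1), (2, 1.0, 100.0, 0.2), (3, 1.0, 100.0, 0.3), (4, 1.0, 100.0, 0.4)]
result = assign_buckets(rows, targets)
```

Because the sort uses only the key and the record ID, shuffling `rows` leaves `result` unchanged. A single very large record can still overshoot its bucket's target, which is why the tolerance check matters.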

### 3. Basis-To-Sales Ratio

Estimate a basis-to-sales ratio function by holding period. Yale models:

```text
log(BSR) = holding-period-bucket fixed effects + beta * h * log(1 + trailing market return)
```

For positive gains, compute:

```text
basis = gain * BSR / (1 - BSR)
```

For losses, use the loss-transaction BSR, which exceeds 1 because basis exceeds sale proceeds:

```text
basis = abs(loss) * loss_BSR / (loss_BSR - 1)
```
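Both formulas follow from `BSR = basis / sales` and `gain = sales - basis`. The sketch below applies them and checks a worked example; the function name `implied_basis` is hypothetical and the input values are illustrative, not SOCA estimates.

```python
import math


def implied_basis(net_gain: float, bsr: float) -> float:
    """Convert a net long-term gain or loss into an imputed basis.

    Hypothetical helper: `bsr` is the basis-to-sales ratio for the
    record's sign, below 1 for gains and above 1 for losses.
    """
    if net_gain > 0:
        if not 0 < bsr < 1:
            raise ValueError("gain BSR must lie strictly between 0 and 1")
        return net_gain * bsr / (1 - bsr)
    if net_gain < 0:
        if not bsr > 1:
            raise ValueError("loss BSR must exceed 1")
        return abs(net_gain) * bsr / (bsr - 1)
    # Zero net gain: basis is set to zero by convention.
    return 0.0


# Gain of 100 at BSR 0.6 implies basis 150 and sales 250 (150 / 250 = 0.6).
gain_basis = implied_basis(100.0, 0.6)
assert math.isclose(gain_basis / (gain_basis + 100.0), 0.6)

# Loss of 50 at BSR 1.25 implies basis 250 and sales 200 (250 / 200 = 1.25).
loss_basis = implied_basis(-50.0, 1.25)
assert math.isclose(loss_basis / (loss_basis - 50.0), 1.25)
```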

Numeric fail-safe caps should be explicit in resource metadata. Proposed defaults:

- gain BSR: `[0.001, 0.999]`
- loss BSR: `[1.001, 100]`

Emit clipping diagnostics: share of records clipped, weighted gain/loss dollars clipped, clipping by AGI group and holding-period bucket, and the maximum influence of any one record on aggregate inferred sales/basis.
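A compact sketch of clipping with the first two diagnostics is shown below. The function name `clip_with_diagnostics`, the tuple layout, and the diagnostic keys are hypothetical; breakdowns by AGI group and holding-period bucket would apply the same tallies within groups.

```python
def clip_with_diagnostics(records, lower, upper):
    """Clip BSRs to [lower, upper] and report how much was clipped.

    records: non-empty list of (weight, gain, raw_bsr) tuples.
    Returns the clipped BSRs and a dict of summary diagnostics.
    """
    clipped = []
    n_clipped = 0
    dollars_clipped = 0.0
    dollars_total = 0.0
    for weight, gain, raw_bsr in records:
        bsr = min(max(raw_bsr, lower), upper)
        dollars = weight * abs(gain)
        dollars_total += dollars
        if bsr != raw_bsr:
            n_clipped += 1
            dollars_clipped += dollars
        clipped.append(bsr)
    diagnostics = {
        "share_records_clipped": n_clipped / len(records),
        "weighted_dollars_clipped": dollars_clipped,
        "weighted_dollars_total": dollars_total,
    }
    return clipped, diagnostics


sample = [(1.0, 100.0, 0.5), (2.0, 50.0, 0.9999), (1.0, 10.0, 0.0)]
bsrs, report = clip_with_diagnostics(sample, lower=0.001, upper=0.999)
```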

### 4. Projection And Uprating

The current PR direction is: the data artifact stores base-year basis and years-held, `long_term_capital_gains_basis` has the same variable-level uprater as long-term gains, and years-held does not uprate. This avoids basis staying frozen when a single-year dataset is projected while also avoiding arbitrary holding-period growth.

For future projection improvements, explicitly decide whether the data artifact stores projected dollar basis or stores a base-year basis-to-gain / basis-to-sales ratio that formulas convert using projected capital gains. Avoid uprating basis once in the data pipeline and again through a variable uprating rule.

Yale first grows `kg_lt_basis` with capital gains, then scales basis by the ratio of predicted weighted-average BSR in projection year `y` to the predicted weighted-average BSR in the base year. PolicyEngine should implement an equivalent or improved mechanism and validate it against aggregate gain/basis relationships.
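The two-step adjustment described above reduces to a product of two factors, sketched here under the assumption that both steps are simple multiplicative scalars. The function and argument names are hypothetical and the numbers are illustrative only.

```python
import math


def project_basis(base_basis, gains_growth, bsr_base, bsr_year):
    """Project base-year basis to a later year.

    Hypothetical helper: `gains_growth` is the ratio of projected to
    base-year long-term gains; `bsr_base` and `bsr_year` are the predicted
    weighted-average BSRs in the base year and the projection year.
    """
    grown = base_basis * gains_growth      # step 1: grow with capital gains
    return grown * (bsr_year / bsr_base)   # step 2: BSR-cycle adjustment


# 10% gains growth combined with a 10% rise in predicted BSR.
projected = project_basis(1000.0, 1.10, 0.50, 0.55)
assert math.isclose(projected, 1210.0)
```

If the variable-level uprating already applies the gains-growth factor, only the BSR ratio should be applied on top of it; applying both factors in the data pipeline and again through uprating would double-scale basis.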

Projection validation should report, by year: weighted LTCG, weighted inferred basis, weighted inferred sales, aggregate BSR, and each ratio to the base year. Add a no-double-scaling acceptance test once a BSR-cycle projection factor is added.

### 5. Integration With Existing Pipeline

Expected touchpoints:

- Add variable definitions in `policyengine-us` first, so `policyengine-us-data` does not silently skip the variables.
- Add extraction/imputation in `policyengine-us-data`.
- Add the new fields to the relevant PUF financial subset and Enhanced CPS imputed-variable list.
- Add deterministic, record-stable key handling as described in Proposed Scope, rather than relying on a single global random seed.
- Add storage/resource files with provenance and generation scripts.
- Add tests for resource parsing, deterministic imputation, bounds, and aggregate validation.

## Calibration And Validation

Calibration should be explicit. This imputation should not only produce plausible individual values; it must reproduce relevant aggregate moments.

Required validation:

- Weighted total positive and negative long-term capital gains in the dataset remain consistent with SOI/CBO targets already used by the data pipeline.
- Imputed holding-period bucket shares reproduce SOCA gain-dollar and loss-dollar distributions.
- Imputed BSR by holding-period bucket reproduces SOCA BSRs.
- Aggregate implied sales and basis reproduce SOCA moments where available, including gain/loss sign and AGI class where reliable.
- SOCA/SOI-informed gross gain and gross loss dollars not represented by net LTCG are quantified by AGI class, including net-zero and small-net records where offsetting gross activity may be large.
- PolicyEngine benchmark effects are reconciled to Yale/CRS under exact comparable assumptions: prospective vs retrospective, 1-/3-/5-year minimum holding periods, CPI-U vs chained-CPI if relevant, no-loss cap on/off, mechanical vs conventional behavioral response, and fiscal-year vs calendar-year reporting.

Do **not** calibrate directly to Yale's revenue estimate unless the explicit goal is a benchmark replication. Use Yale/CRS as an external validation check.

Split benchmark validation into two parts:

- Mechanical no-behavior sanity checks, to isolate the basis and holding-period imputation.
- Conventional checks with capital-gains behavioral response and fiscal-year conversion if comparing against CRS/Yale revenue estimates.

Include a "why different from Yale/CRS" reconciliation table covering data anchor, BSR/holding-period imputation, inflation series, behavioral response, and fiscal-year conversion.

## Acceptance Criteria

- New variables exist in `policyengine-us` and are available in Enhanced CPS.
- The data build creates nonzero `long_term_capital_gains_basis` and `long_term_capital_gains_years_held` for records with nonzero long-term capital gains/losses.
- Imputation is reproducible under stable record keys; shuffled input rows produce identical assigned holding periods and basis after sorting back to IDs.
- Resource generation is documented and reproducible from public source data.
- Unit tests cover gain, loss, zero-gain, spouse split/sign combinations, boundary BSR, clipping diagnostics, and projected-year behavior.
- Unit tests enforce exact tax-unit-level indexation equivalence for every person-storage pattern, including mixed-sign spouses.
- Unit tests enforce no double-scaling of projected basis once BSR-cycle projection is implemented.
- Zero-gain records have zero basis and zero holding period in released datasets.
- Validation output shows SOCA holding-period, BSR, implied sales, implied basis, and omitted/compressed gross activity moments before/after imputation.
- Source matrix documents which SOCA targets are exact, shrunk, or fallback-only.
- A small benchmark script can compute mechanical and conventional indexation effects for 2026-2035 under retrospective/prospective regimes and 1-/3-/5-year minimum holding-period variants.

## Known Limitations

- A collapsed tax-unit imputation treats each tax unit's net long-term gain/loss as if it came from one representative transaction.
- It cannot correctly represent simultaneous sales of old and new assets in a prospective indexation regime.
- It cannot apply no-loss caps at the asset-lot level.
- It does not recover gross gains and gross losses hidden inside a net long-term capital-gains amount.

These limitations should be quantified and addressed in the synthetic-lots follow-up issue.

## Suggested Follow-Up

Implement a synthetic-lot imputation that splits each tax unit's long-term gains/losses into multiple pseudo-transactions with separate holding periods, basis, and eligibility under prospective indexation.

