Add collapsed SOCA imputation for capital gains basis and holding period #1126

@MaxGhenis

Summary

Build a first-pass capital-gains basis and holding-period imputation in policyengine-us-data that supports capital-gains basis indexation analysis while staying close enough to Yale Budget Lab's published approach to benchmark against CRS/Yale estimates.

This issue covers the collapsed tax-unit version: one representative long-term capital-gains holding period and one representative long-term capital-gains basis per tax unit. It should improve on Yale where feasible, but remain simple enough to implement and validate before a synthetic-lot version.

Implementation PRs:

Current Status

As of May 24, 2026:

  • policyengine-us#8423 adds long_term_capital_gains_basis and long_term_capital_gains_years_held with focused YAML tests. It was rebased onto current upstream/main and now sets long_term_capital_gains_basis.uprating = "calibration.gov.irs.soi.long_term_capital_gains", while leaving years-held non-uprated. Latest pushed commit: 827c7e161f.
  • policyengine-us-data#1128 implements the deterministic collapsed SOCA imputation, wires it into PUF/Extended CPS generation, backfills release PUF reads when the new fields are missing, excludes basis/holding period from independent QRF imputation, and adds focused tests.
  • $cycle review found two actionable issues: release PUF load/read backfill and a missing test proving QRF exclusion. Both were fixed in policyengine-us-data commit d1595678; the second review pass reported no actionable findings.
  • policyengine-us#8427 is stacked on #8423, adds the tax-unit-first indexation reform, and was rebased/pushed at 2a09ae8d5f after the basis-uprating fix.
  • Remaining blocker: merge/release policyengine-us#8423, then bump the policyengine-us pin/lock in policyengine-us-data#1128 to the release containing those variables and mark the draft PR ready. Do not treat a bump to a release lacking the new variables as sufficient for final merge, even if it satisfies freshness CI.

Replication And Benchmark Status

Local Yale replication harness: /Users/maxghenis/_tmp/mark-keightley-yale-replication/replicate_yale_capital_gains.py.

Published Yale workbook targets extracted:

  • Prospective 2026-2035 revenue change: -$168.5B.
  • Retrospective 2026-2035 revenue change: -$952.7B.
  • Retrospective 2027 average tax change for the top 0.1 percent: -$353,235.

Public-code replication:

  • Reimplemented the public SOCA/Budget-Lab imputation pieces from Yale's Tax-Data resources.
  • Compared with Yale Figure 1 Tax-Data 2026, the public-resource reproduction has a maximum absolute difference of 0.004715 for gain-dollar holding-period shares and 0.003829 for basis-to-sales ratios.
  • A full Yale revenue/distribution rerun is blocked locally because Tax-Simulator points to generated model interfaces not present in the public checkout, including Tax-Data vintage 2026030513 and Macro-Projections vintage 2026022522.
  • Important reconciliation detail: Yale's article says indexation is not allowed to generate losses, but the public indexed_kg.csv runscript scenarios inherit index_kg_no_loss = 0; only indexed_kg_retro_noloss sets the no-loss flag and it is not in that runscript. Treat no-loss on/off as an explicit benchmark axis.
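
To make the no-loss axis concrete, here is a minimal sketch of the indexation adjustment for one collapsed transaction. It is not Yale's kg.R logic: the function name `indexed_gain`, the `inflation_factor` argument (price index at sale divided by price index at purchase), and the reading of "no loss" as "indexation may reduce a gain to zero but may not create or deepen a loss" are all assumptions made for illustration.

```python
def indexed_gain(gain: float, basis: float, inflation_factor: float, no_loss: bool) -> float:
    """Net gain after indexing basis by inflation_factor (hypothetical helper)."""
    # Indexing raises basis from `basis` to `basis * inflation_factor`.
    adjustment = basis * (inflation_factor - 1.0)
    indexed = gain - adjustment
    if no_loss:
        # Assumed rule: a gain can be indexed down to zero but not below,
        # and an existing loss is left unchanged.
        return max(indexed, 0.0) if gain > 0 else gain
    return indexed


# Same record scored on both sides of the benchmark axis.
for flag in (False, True):
    print(flag, indexed_gain(gain=100.0, basis=400.0, inflation_factor=1.5, no_loss=flag))
```

Running every benchmark scenario with the flag both off and on keeps the difference between the article text and the public runscript visible in the reconciliation.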

PolicyEngine scratch score:

  • Scratch dataset: enhanced CPS with the collapsed SOCA basis/holding-period fields added manually.
  • 2026 retrospective score moved from -$31.4B to -$35.1B after basis uprating.
  • 2026 scratch inputs: long_term_capital_gains_before_response = $345.3B, imputed basis $1,058.9B, indexation adjustment $204.5B.
  • This is not a benchmark-quality Yale comparison because the scratch enhanced-CPS gains base is far below the PUF/CBO-scale target ($1.68T parameter value for 2026 long-term gains). Default enhanced CPS still scores zero until the new basis/holding-period fields are present in a real data artifact.
  • Local PUF benchmark remains blocked: the puf_2024.h5 release URL returns 404 without private release access, and no local puf_2021/2023/2024.h5 artifact was found in the usual cache/worktree locations.

Local verification after the basis-uprating fix:

  • uv run ruff check ... on touched PolicyEngine-US Python files: pass.
  • uv run pytest policyengine_us/tests/test_system_import.py -q: 5 passed.
  • uv run python -m policyengine_core.scripts.policyengine_command test policyengine_us/tests/policy/contrib/capital_gains_indexation.yaml -c policyengine_us: 7 passed.
  • uv run python -m policyengine_core.scripts.policyengine_command test policyengine_us/tests/variables/household/income/person/capital_gains/long_term_capital_gains_basis.yaml policyengine_us/tests/variables/household/income/person/capital_gains/long_term_capital_gains_years_held.yaml -c policyengine_us: 4 passed.

Motivation

CRS's May 20, 2026 report, IF13231, "Indexing Capital Gains Taxes for Inflation: Marginal Effective Tax Rates and Revenue Estimates," cites Yale Budget Lab's Tax-Simulator estimate for capital-gains basis indexation. Yale's model uses two imputed fields absent from the public-use tax data:

  • kg_lt_years_held: weighted-average holding period for net long-term capital gains.
  • kg_lt_basis: imputed cost basis for net long-term capital gains.

PolicyEngine currently imputes many PUF-derived fields into Enhanced CPS, including pre-response long-term capital gains, but does not have a cost-basis or holding-period imputation suitable for capital-gains indexation.

External Reference

Yale Budget Lab's appendix describes the current benchmark approach:

  • Use IRS SOI Sales of Capital Assets (SOCA) data for holding-period buckets, sales price, basis, gains, and losses.
  • Draw a representative holding period for each PUF record with nonzero net long-term capital gains or losses.
  • Estimate basis-to-sales ratios by holding period and market-cycle proxy.
  • Convert realized net gain/loss into an imputed basis.
  • Apply CPI basis indexation in the tax calculator.

Relevant Yale code:

  • Budget-Lab-Yale/Tax-Data, src/impute_variables.R: SOCA basis and holding-period imputation.
  • Budget-Lab-Yale/Tax-Data, src/project_puf.R: cycle-aware basis projection adjustment.
  • Budget-Lab-Yale/Tax-Simulator, src/calc/functions/income/kg.R: indexation tax logic.

Proposed Scope

Add two person-level storage variables, aligned with the pre-response long-term gains input used by policyengine-us:

  • long_term_capital_gains_years_held
  • long_term_capital_gains_basis

Use them to support a later policyengine-us reform that can index long-term capital-gains basis. The collapsed imputation object should be tax-unit-level first, because the source and policy mechanics are tax-return/tax-unit oriented:

tax_unit_long_term_capital_gains_before_response
  = sum(person.long_term_capital_gains_before_response)

Compute sign, holding-period bucket, basis-to-sales ratio (BSR), basis, proceeds, and the indexation adjustment at the tax-unit collapsed-transaction level. Person-level variables should be treated as storage and compatibility fields. policyengine-us should not compute independent person-level indexation adjustments for this collapsed model unless tests prove exact equivalence to the tax-unit collapsed result after aggregation.

Impute basis and holding period from baseline pre-response long-term gains and keep them aligned with long_term_capital_gains_before_response; behavioral changes should not trigger re-imputation of basis.

If a tax unit's long-term gains are split across people, assign the same representative holding period to each person with nonzero pre-response long-term gains in that tax unit and allocate storage basis in proportion to abs(long_term_capital_gains_before_response) only as a storage convention. The hard contract is tax-unit equivalence: add acceptance tests requiring exact equality between the tax-unit collapsed calculation and the person-stored representation for tax units with positive/zero, positive/positive, positive/negative, negative/negative, and one-large-gain/one-small-loss spouse patterns.
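
The storage convention and the equivalence check can be sketched as follows. The helper names `store_on_persons` and `tax_unit_basis_from_persons` are hypothetical, and the sketch compares with a tight floating-point tolerance; whether "exact" should mean bit-for-bit equality in the storage dtype is a decision the real acceptance tests would need to make.

```python
import math


def store_on_persons(person_gains, tax_unit_basis, tax_unit_years_held):
    """Spread tax-unit basis over persons in proportion to abs(gain).

    Returns one (basis, years_held) pair per person. Persons with zero
    pre-response gains get zero basis and zero holding period.
    """
    abs_gains = [abs(g) for g in person_gains]
    total_abs = math.fsum(abs_gains)
    if total_abs == 0:
        return [(0.0, 0.0) for _ in person_gains]
    return [
        (tax_unit_basis * a / total_abs, tax_unit_years_held if a > 0 else 0.0)
        for a in abs_gains
    ]


def tax_unit_basis_from_persons(stored):
    """Re-aggregate person-stored basis to the tax unit."""
    return math.fsum(basis for basis, _ in stored)


# Spouse patterns named in the acceptance tests (amounts are illustrative).
patterns = {
    "positive/zero": [5_000.0, 0.0],
    "positive/positive": [5_000.0, 2_000.0],
    "positive/negative": [5_000.0, -1_500.0],
    "negative/negative": [-3_000.0, -1_000.0],
    "large gain/small loss": [250_000.0, -10.0],
}
for name, gains in patterns.items():
    stored = store_on_persons(gains, tax_unit_basis=12_345.67, tax_unit_years_held=7.5)
    assert math.isclose(tax_unit_basis_from_persons(stored), 12_345.67, rel_tol=1e-12), name
```

Because the policy calculation only ever reads the tax-unit sum, the per-person split affects storage alone, which is why the split can stay a convention rather than a modeling claim.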

The first implementation should be deterministic and record-stable, not merely reproducible under a fixed random seed. Generate pseudo-random keys from stable identifiers such as dataset vintage, tax year, tax unit/person ID, gain/loss sign, and imputation version, then sort within each quota cell by those keys. Shuffling input rows, chunking, or parallel execution should not change assigned holding periods or basis after sorting back to IDs.
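
One way to get record-stable keys is to hash the identifying fields and map the digest to a uniform number. This is a sketch under assumptions: the function name `stable_uniform`, the field order, the `"|"` separator, the version string, and the choice of SHA-256 are illustrative, not what the implementation PR necessarily does.

```python
import hashlib

IMPUTATION_VERSION = "collapsed-soca-v1"  # hypothetical version tag


def stable_uniform(dataset_vintage, tax_year, record_id, sign, version=IMPUTATION_VERSION):
    """Map stable identifiers to a pseudo-random key in [0, 1).

    The key depends only on the identifiers, so row order, chunking,
    and parallel execution cannot change it.
    """
    payload = f"{dataset_vintage}|{tax_year}|{record_id}|{sign}|{version}"
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # Keep the top 53 bits so the result is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


# Keys are identical regardless of the order in which records are visited.
ids = list(range(5))
forward = {i: stable_uniform("example-vintage", 2026, i, "gain") for i in ids}
backward = {i: stable_uniform("example-vintage", 2026, i, "gain") for i in reversed(ids)}
assert forward == backward
```

Bumping the version tag deliberately reshuffles every key, which gives a controlled way to change assignments between imputation releases.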

This issue should not attempt to model multiple transactions per tax unit. That belongs in the synthetic-lots follow-up issue.

Proposed Method

1. Source Data

Use public IRS SOI Sales of Capital Assets data. There are two distinct SOCA inputs:

  • The 2013-2015 SOCA release, which Yale uses for the most recent detailed holding-period and gain/loss distributions.
  • The historical basis-to-sales panel used in Yale's soca_basis_sales.csv, covering 1985 and 1997-2015, which supports the BSR market-cycle regression.

Extract or vendor compact resource tables needed for holding-period buckets, gain/loss dollar-weighted holding-period shares, basis-to-sales ratios, and available AGI/asset-type splits for validation and possible improved conditioning.

Before implementation, build a source matrix with one row per target moment and columns for exact IRS table/source, public/private availability, year coverage, gain/loss sign coverage, AGI coverage, asset-type coverage, and coefficient-of-variation availability. Public SOCA tables may not support all AGI x asset type x sign x holding-period x BSR cells; unavailable or high-CV cells should fall back to reliable marginals rather than creating false precision.

2. Holding-Period Assignment

For each record, assign by the sign of pre-response long-term capital gains:

  • If long_term_capital_gains_before_response > 0, assign from a gain-dollar-weighted SOCA holding-period distribution.
  • If long_term_capital_gains_before_response < 0, assign from a loss-dollar-weighted SOCA holding-period distribution.
  • If long_term_capital_gains_before_response == 0, set holding period and basis to zero in released PE datasets.

Improve on Yale by making conditioning configurable:

  • MVP: unconditional holding-period distribution, matching Yale.
  • Preferred: condition on AGI band when SOCA tabulations support it, then fall back to unconditional distributions for sparse cells.
  • Optional: condition on filing status or broad asset type only if validation shows it materially improves fit.

Use a documented conditioning hierarchy and shrinkage strategy. Require minimum weighted denominators and maximum CV thresholds before using a conditioned cell. Use raking or empirical-Bayes shrinkage to combine sparse conditioned moments with reliable marginals.
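
A minimal sketch of the threshold-then-shrink idea follows. It uses a simple partial-pooling weight, `n / (n + k)`, as a stand-in for whichever raking or empirical-Bayes estimator is eventually chosen. The function name `shrink_shares` and every default (`prior_strength`, `min_denominator`, `max_cv`) are hypothetical placeholders, not proposed values.

```python
def shrink_shares(
    cell_shares,
    marginal_shares,
    cell_denominator,
    cell_cv,
    prior_strength=50.0,   # hypothetical: pseudo-observations given to the marginal
    min_denominator=30.0,  # hypothetical minimum weighted denominator
    max_cv=0.3,            # hypothetical maximum coefficient of variation
):
    """Blend conditioned holding-period shares with the reliable marginal."""
    # Cells that fail either reliability threshold fall back to the marginal.
    if cell_denominator < min_denominator or cell_cv > max_cv:
        return list(marginal_shares)
    w = cell_denominator / (cell_denominator + prior_strength)
    blended = [w * c + (1.0 - w) * m for c, m in zip(cell_shares, marginal_shares)]
    total = sum(blended)
    return [b / total for b in blended]  # renormalize so shares sum to one


print(shrink_shares([0.5, 0.5], [0.2, 0.8], cell_denominator=10.0, cell_cv=0.1))
print(shrink_shares([0.5, 0.5], [0.2, 0.8], cell_denominator=50.0, cell_cv=0.1))
```

Recording which branch each cell took is what would feed the "exact, shrunk, or fallback-only" column of the source matrix.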

The assignment should not rely only on independent random draws. Capital gains are concentrated enough that a seeded draw can miss SOCA dollar-weighted targets. Instead, assign holding-period buckets within each conditioning cell using weight * abs(long_term_capital_gains_before_response) so bucket shares hit SOCA gain-dollar/loss-dollar targets within tolerance. Then draw only the continuous within-bucket holding period.
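
The quota idea can be sketched as a cumulative-dollar allocation within one conditioning cell: order records by their stable key, walk the cumulative `weight * abs(gain)` mass, and place each record in the bucket that contains the midpoint of its mass interval. The function name `assign_buckets`, the tuple layout, and the midpoint rule are assumptions for illustration; a single very large record can still push a bucket off target, which is why the text asks for a tolerance rather than exact shares.

```python
from bisect import bisect_right
from itertools import accumulate


def assign_buckets(records, target_shares):
    """Assign holding-period bucket indices so dollar shares track targets.

    records: iterable of (record_id, stable_key, weight, gain) with nonzero gain.
    target_shares: SOCA dollar-weighted bucket shares, summing to one.
    Returns {record_id: bucket_index}.
    """
    ordered = sorted(records, key=lambda r: (r[1], r[0]))  # stable key, then ID
    masses = [r[2] * abs(r[3]) for r in ordered]
    total = sum(masses)
    cumulative_targets = list(accumulate(target_shares))
    last_bucket = len(target_shares) - 1
    assignment, running = {}, 0.0
    for record, mass in zip(ordered, masses):
        midpoint = (running + mass / 2.0) / total
        assignment[record[0]] = min(bisect_right(cumulative_targets, midpoint), last_bucket)
        running += mass
    return assignment


# Ten equal-dollar records against a 30/70 target; input order does not matter.
cell = [(i, i / 10, 1.0, 1_000.0) for i in range(10)]
assert assign_buckets(cell, [0.3, 0.7]) == assign_buckets(cell[::-1], [0.3, 0.7])
```

Only the continuous holding period inside the assigned bucket would then be drawn, again from a key derived from stable identifiers.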

3. Basis-To-Sales Ratio

Estimate a basis-to-sales ratio function by holding period. Yale models:

log(BSR) = holding-period-bucket fixed effects + beta * h * log(1 + trailing market return)

For positive gains, compute:

basis = gain * BSR / (1 - BSR)

For losses, use loss-transaction BSR, which is greater than 1:

basis = abs(loss) * loss_BSR / (loss_BSR - 1)
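
Both formulas follow from BSR = basis / sales. For a gain, gain = sales - basis = sales * (1 - BSR), so basis = BSR * sales = gain * BSR / (1 - BSR). For a loss, abs(loss) = basis - sales = sales * (loss_BSR - 1), which gives the second formula. A small sketch, with hypothetical function names, that also checks the round trip back to the realized amount:

```python
def basis_from_gain(gain: float, bsr: float) -> float:
    """Imputed basis for a positive net gain, given a gain-transaction BSR in (0, 1)."""
    if not 0.0 < bsr < 1.0:
        raise ValueError("gain BSR must lie strictly between 0 and 1")
    return gain * bsr / (1.0 - bsr)


def basis_from_loss(loss: float, loss_bsr: float) -> float:
    """Imputed basis for a net loss, given a loss-transaction BSR above 1."""
    if loss_bsr <= 1.0:
        raise ValueError("loss BSR must exceed 1")
    return abs(loss) * loss_bsr / (loss_bsr - 1.0)


# Round trip: sales = basis / BSR, and sales - basis recovers the realized amount.
basis = basis_from_gain(100.0, 0.75)
sales = basis / 0.75
print(basis, sales, sales - basis)

basis = basis_from_loss(-100.0, 1.25)
sales = basis / 1.25
print(basis, sales, sales - basis)
```

Both denominators go to zero as BSR approaches 1, so inferred basis grows without bound near that point.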

Numeric fail-safe caps should be explicit in resource metadata. Proposed defaults:

  • gain BSR: [0.001, 0.999]
  • loss BSR: [1.001, 100]

Emit clipping diagnostics: share of records clipped, weighted gain/loss dollars clipped, clipping by AGI group and holding-period bucket, and the maximum influence of any one record on aggregate inferred sales/basis.
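
A minimal sketch of clipping with the proposed default caps and two of the listed diagnostics (share of records clipped and weighted dollars clipped). The function name, the tuple layout, and the diagnostic key names are hypothetical. Breakdowns by AGI group and holding-period bucket and the single-record influence measure are omitted here.

```python
GAIN_BSR_BOUNDS = (0.001, 0.999)  # proposed default from this issue
LOSS_BSR_BOUNDS = (1.001, 100.0)  # proposed default from this issue


def clip_with_diagnostics(records):
    """Clip BSRs to the fail-safe caps and report how much was touched.

    records: list of (weight, amount, bsr) with nonzero amount, where
    amount > 0 is a net gain and amount < 0 is a net loss.
    """
    clipped, n_clipped = [], 0
    dollars_clipped = dollars_total = 0.0
    for weight, amount, bsr in records:
        lower, upper = GAIN_BSR_BOUNDS if amount > 0 else LOSS_BSR_BOUNDS
        new_bsr = min(max(bsr, lower), upper)
        dollars = weight * abs(amount)
        dollars_total += dollars
        if new_bsr != bsr:
            n_clipped += 1
            dollars_clipped += dollars
        clipped.append(new_bsr)
    diagnostics = {
        "share_records_clipped": n_clipped / len(records),
        "weighted_dollars_clipped": dollars_clipped,
        "share_dollars_clipped": dollars_clipped / dollars_total,
    }
    return clipped, diagnostics


sample = [(1.0, 100.0, 0.5), (2.0, 50.0, 1.2), (1.0, -100.0, 0.9), (1.0, -100.0, 1.5)]
print(clip_with_diagnostics(sample))
```

Reporting the dollar share alongside the record share matters because a small number of clipped records can carry most of the gain dollars.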

4. Projection And Uprating

The current PR direction is: the data artifact stores base-year basis and years-held, long_term_capital_gains_basis has the same variable-level uprater as long-term gains, and years-held does not uprate. This avoids basis staying frozen when a single-year dataset is projected while also avoiding arbitrary holding-period growth.

For future projection improvements, explicitly decide whether the data artifact stores projected dollar basis or stores a base-year basis-to-gain / basis-to-sales ratio that formulas convert using projected capital gains. Avoid uprating basis once in the data pipeline and again through a variable uprating rule.

Yale first grows kg_lt_basis with capital gains, then scales basis by the ratio of predicted weighted-average BSR in projection year y to the predicted weighted-average BSR in the base year. PolicyEngine should implement an equivalent or improved mechanism and validate it against aggregate gain/basis relationships.
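
The two-step projection described above can be written as basis_y = basis_base * (gains growth to year y) * (predicted BSR_y / predicted BSR_base). The sketch below is an illustration of that composition, not a port of project_puf.R; the function and argument names are hypothetical.

```python
def project_basis(
    base_basis: float,
    gains_growth_factor: float,
    predicted_bsr_base: float,
    predicted_bsr_year: float,
) -> float:
    """Grow basis with capital gains, then apply the BSR-cycle ratio."""
    if predicted_bsr_base <= 0 or predicted_bsr_year <= 0:
        raise ValueError("predicted BSRs must be positive")
    grown = base_basis * gains_growth_factor          # step 1: grow with gains
    cycle_ratio = predicted_bsr_year / predicted_bsr_base
    return grown * cycle_ratio                        # step 2: BSR-cycle scaling


# With no change in predicted BSR the cycle ratio is 1, so the result
# matches plain uprating with gains.
print(project_basis(1_000.0, 1.1, 0.5, 0.5))
print(project_basis(1_000.0, 1.1, 0.5, 0.55))
```

Keeping the gains growth and the cycle ratio as separate, named factors makes it easier to verify that each is applied exactly once.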

Projection validation should report, by year: weighted LTCG, weighted inferred basis, weighted inferred sales, aggregate BSR, and each ratio to the base year. Add a no-double-scaling acceptance test once a BSR-cycle projection factor is added.
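
A sketch of the by-year report, assuming each record is a `(weight, gain, basis)` tuple and that inferred sales equal basis plus signed gain (so a loss reduces sales below basis). The function name and the output key names are hypothetical.

```python
def projection_report(records_by_year, base_year):
    """Weighted LTCG, basis, sales, aggregate BSR, and ratios to the base year."""
    rows = {}
    for year, records in sorted(records_by_year.items()):
        ltcg = sum(w * g for w, g, _ in records)
        basis = sum(w * b for w, _, b in records)
        sales = basis + ltcg  # signed gain: losses pull sales below basis
        rows[year] = {"ltcg": ltcg, "basis": basis, "sales": sales, "bsr": basis / sales}
    base = dict(rows[base_year])  # copy so ratios use the unmodified base values
    for row in rows.values():
        for name in ("ltcg", "basis", "sales", "bsr"):
            row[f"{name}_to_base"] = row[name] / base[name]
    return rows


# Illustrative two-year input in which gains and basis both grow by 10 percent.
report = projection_report(
    {2026: [(1.0, 100.0, 300.0)], 2027: [(1.0, 110.0, 330.0)]},
    base_year=2026,
)
for year, row in report.items():
    print(year, row)
```

If basis were uprated both in the data pipeline and by the variable-level uprater, `basis_to_base` would outrun `ltcg_to_base` by the extra factor, which is the pattern a no-double-scaling test would look for.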

5. Integration With Existing Pipeline

Expected touchpoints:

  • Add variable definitions in policyengine-us first, so policyengine-us-data does not silently skip the variables.
  • Add extraction/imputation in policyengine-us-data.
  • Add the new fields to the relevant PUF financial subset and Enhanced CPS imputed-variable list.
  • Add deterministic, record-stable pseudo-random key handling (keys derived from stable identifiers, not a single global seed).
  • Add storage/resource files with provenance and generation scripts.
  • Add tests for resource parsing, deterministic imputation, bounds, and aggregate validation.

Calibration And Validation

Calibration should be explicit. This imputation should not only produce plausible individual values; it must reproduce relevant aggregate moments.

Required validation:

  • Weighted total positive and negative long-term capital gains in the dataset remain consistent with SOI/CBO targets already used by the data pipeline.
  • Imputed holding-period bucket shares reproduce SOCA gain-dollar and loss-dollar distributions.
  • Imputed BSR by holding-period bucket reproduces SOCA BSRs.
  • Aggregate implied sales and basis reproduce SOCA moments where available, including gain/loss sign and AGI class where reliable.
  • SOCA/SOI-informed gross gain and gross loss dollars not represented by net LTCG are quantified by AGI class, including net-zero and small-net records where offsetting gross activity may be large.
  • PolicyEngine benchmark effects are reconciled to Yale/CRS under exact comparable assumptions: prospective vs retrospective, 1-/3-/5-year minimum holding periods, CPI-U vs chained-CPI if relevant, no-loss cap on/off, mechanical vs conventional behavioral response, and fiscal-year vs calendar-year reporting.
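
The comparability axes in the last item can be enumerated explicitly so that no benchmark number is reported without its full assumption set. The axis names and labels below are hypothetical; the values are the ones listed above.

```python
from itertools import product

# Benchmark axes from the reconciliation requirement (labels are illustrative).
AXES = {
    "regime": ("prospective", "retrospective"),
    "min_holding_years": (1, 3, 5),
    "price_index": ("CPI-U", "chained CPI"),
    "no_loss": (False, True),
    "behavior": ("mechanical", "conventional"),
    "reporting": ("fiscal year", "calendar year"),
}

# One dict per scenario, keyed by axis name.
scenarios = [dict(zip(AXES, combo)) for combo in product(*AXES.values())]

print(len(scenarios))
print(scenarios[0])
```

Only a subset of these cells has a published Yale or CRS counterpart, so the reconciliation table would mark which scenarios are directly comparable and which are PolicyEngine-only.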

Do not calibrate directly to Yale's revenue estimate unless the explicit goal is a benchmark replication. Use Yale/CRS as an external validation check.

Split benchmark validation into two parts:

  • Mechanical no-behavior sanity checks, to isolate the basis and holding-period imputation.
  • Conventional checks with capital-gains behavioral response and fiscal-year conversion if comparing against CRS/Yale revenue estimates.

Include a "why different from Yale/CRS" reconciliation table covering data anchor, BSR/holding-period imputation, inflation series, behavioral response, and fiscal-year conversion.

Acceptance Criteria

  • New variables exist in policyengine-us and are available in Enhanced CPS.
  • The data build creates nonzero long_term_capital_gains_basis and long_term_capital_gains_years_held for records with nonzero long-term capital gains/losses.
  • Imputation is reproducible under stable record keys; shuffled input rows produce identical assigned holding periods and basis after sorting back to IDs.
  • Resource generation is documented and reproducible from public source data.
  • Unit tests cover gain, loss, zero-gain, spouse split/sign combinations, boundary BSR, clipping diagnostics, and projected-year behavior.
  • Unit tests enforce exact tax-unit-level indexation equivalence for every person-storage pattern, including mixed-sign spouses.
  • Unit tests enforce no double-scaling of projected basis once BSR-cycle projection is implemented.
  • Zero-gain records have zero basis and zero holding period in released datasets.
  • Validation output shows SOCA holding-period, BSR, implied sales, implied basis, and omitted/compressed gross activity moments before/after imputation.
  • Source matrix documents which SOCA targets are exact, shrunk, or fallback-only.
  • A small benchmark script can compute mechanical and conventional indexation effects for 2026-2035 under retrospective/prospective regimes and 1-/3-/5-year minimum holding-period variants.

Known Limitations

  • A collapsed tax-unit imputation treats each tax unit's net long-term gain/loss as if it came from one representative transaction.
  • It cannot correctly represent simultaneous sales of old and new assets in a prospective indexation regime.
  • It cannot apply no-loss caps at the asset-lot level.
  • It does not recover gross gains and gross losses hidden inside a net long-term capital-gains amount.

These limitations should be quantified and addressed in the synthetic-lots follow-up issue.

Suggested Follow-Up

Implement a synthetic-lot imputation that splits each tax unit's long-term gains/losses into multiple pseudo-transactions with separate holding periods, basis, and eligibility under prospective indexation.
