Impute below-threshold student loan holders #332

Merged
MaxGhenis merged 2 commits into main from
codex/fix-281-below-threshold-borrowers
Apr 13, 2026

@MaxGhenis
Contributor

Summary

  • add SLC liable-to-repay targets for Plan 2 and Plan 5 alongside the above-threshold targets
  • fix the SLC parser to read the Higher education total rows instead of the first matching above-threshold row
  • impute missing below-threshold England student loan holders into the FRS base dataset using the liable-to-repay shortfall against the observed repayer stock
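The parser fix in the second bullet can be sketched as follows. This is only an illustration: the row layout, the section names, the sub-group label, the values, and both helper names are hypothetical, and the actual parser in slc.py may be structured quite differently. Only the "Higher education total" label comes from the summary above.

```python
# Hypothetical flattened rows: (section, row label, value).
ROWS = [
    ("above_threshold", "Higher education (sub-group)", 100.0),
    ("above_threshold", "Higher education total", 250.0),
    ("liable_to_repay", "Higher education (sub-group)", 400.0),
    ("liable_to_repay", "Higher education total", 900.0),
]


def pick_first_prefix(rows, prefix="Higher education"):
    """Fragile variant: return the first row whose label starts with the prefix."""
    return next(value for _, label, value in rows if label.startswith(prefix))


def pick_total(rows, section, label="Higher education total"):
    """Return the value of the exact total row within one section."""
    matches = [v for s, l, v in rows if s == section and l == label]
    if len(matches) != 1:
        raise ValueError(
            f"expected one {label!r} row in {section!r}, found {len(matches)}"
        )
    return matches[0]
```

Matching on the exact label and the section, and failing loudly when the match is not unique, avoids silently picking up whichever row happens to come first.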

Details

This keeps the active-repayer stock anchored to observed PAYE deductions, then fills the missing England tertiary-education borrower stock needed for the base dataset and future plan uprating.

The checked-in SLC snapshot implies these missing borrower stocks before weighting against the live dataset:

  • Plan 2 shortfall: 4.955m in 2025, rising to 5.57m in 2028
  • Plan 5 shortfall: 10k in 2025, rising to 2.165m in 2030
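The shortfall logic described above might look roughly like the sketch below. The function names are hypothetical and the numbers in the usage note are illustrative, not SLC figures; the real imputation in student_loans.py may weight and assign borrowers differently.

```python
def below_threshold_shortfall(liable_to_repay, observed_repayers):
    """Borrowers liable to repay but not observed repaying, floored at zero."""
    return max(liable_to_repay - observed_repayers, 0.0)


def assignment_probability(shortfall, eligible_weight):
    """Share of the weighted eligible pool to flag as below-threshold holders."""
    if eligible_weight <= 0:
        return 0.0
    return min(shortfall / eligible_weight, 1.0)
```

For example, with an illustrative liable-to-repay target of 1,000 and an observed repayer stock of 600, the shortfall is 400; spread over a weighted eligible pool of 1,600, each eligible person would be flagged with probability 0.25.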

Testing

  • uvx ruff check policyengine_uk_data/targets/sources/slc.py policyengine_uk_data/datasets/imputations/student_loans.py policyengine_uk_data/targets/compute/other.py policyengine_uk_data/targets/compute/__init__.py policyengine_uk_data/targets/build_loss_matrix.py policyengine_uk_data/tests/test_student_loan_targets.py policyengine_uk_data/tests/test_student_loan_plan.py
  • uv run --python 3.13 pytest -q policyengine_uk_data/tests/test_student_loan_targets.py policyengine_uk_data/tests/test_student_loan_plan.py

Closes #281.

Collaborator

@vahid-ahmadi vahid-ahmadi left a comment

Entity-level mismatch bug in compute functions

Both compute_student_loan_plan and compute_student_loan_plan_liable in other.py mix person-level and household-level arrays:

plan = ctx.pe_person("student_loan_plan")       # person-level (n_persons)
repayments = ctx.pe_person("student_loan_repayments")  # person-level (n_persons)
on_plan = (plan == plan_value) & (ctx.country == "ENGLAND") & (repayments > 0)

ctx.country is sim.calculate("country").values, which returns a household-level array (confirmed: the country variable's entity is household). With real data this will fail at runtime because the person-level and household-level arrays have different lengths.

The old code handled this correctly by explicitly mapping region to person level:

region = ctx.sim.calculate("region", map_to="person").values

The test doesn't catch this because DummyCtx uses same-length arrays for everything and household_from_person is identity.

Fix: replace ctx.country == "ENGLAND" with something like ctx.sim.calculate("country", map_to="person").values == "ENGLAND" in both functions. Note that the imputation code in student_loans.py already does this correctly (sim.calculate("country", map_to="person").values).
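The mismatch can be reproduced outside the simulation with plain NumPy. In this sketch the arrays, the person_household index, and the on_plan_mask helper are all hypothetical; the indexing step only stands in for what map_to="person" is assumed to do.

```python
import numpy as np

# Two households, five people; person_household[i] is person i's household index.
country_hh = np.array(["ENGLAND", "WALES"])               # household-level
person_household = np.array([0, 0, 1, 1, 1])

plan = np.array([2, 0, 2, 5, 0])                          # person-level
repayments = np.array([120.0, 0.0, 80.0, 0.0, 0.0])       # person-level


def on_plan_mask(plan, repayments, country, plan_value):
    """Combine the three conditions; all inputs must share one entity level."""
    return (plan == plan_value) & (country == "ENGLAND") & (repayments > 0)


# Broadcast the household value to each member before combining.
country_person = country_hh[person_household]
mask = on_plan_mask(plan, repayments, country_person, plan_value=2)

# Mixing entity levels raises once the array lengths differ.
try:
    on_plan_mask(plan, repayments, country_hh, plan_value=2)
    mixed_levels_failed = False
except ValueError:
    mixed_levels_failed = True
```

A test fixture with more persons than households, as here, would surface the bug that same-length dummy arrays hide.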

Minor: Plan 2 below-threshold eligible pool is broader than Plan 2 cohort

plan_2_eligible uses plan_2_age_band (ages 21–55) but doesn't require the Plan 2 cohort filter (uni start >= 2012). In 2025, anyone aged ~31+ started uni pre-2012 and would be Plan 1 cohort, yet they're in the Plan 2 eligible pool. This likely doesn't affect the total count (the probabilistic assignment still targets the right number), but it distributes Plan 2 holders across a wider age range than reality. Worth considering whether to tighten the age band or add the cohort filter.
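A cohort filter of the kind suggested here could be sketched as below. The entry age of 18 is an assumption (the review does not state how university start year is estimated), and the function names are hypothetical; the age band and the 2012 cut-off come from the comment above.

```python
PLAN_2_FIRST_START_YEAR = 2012  # from the review: uni start >= 2012
ASSUMED_ENTRY_AGE = 18          # assumption: not stated in the PR


def estimated_start_year(age, year, entry_age=ASSUMED_ENTRY_AGE):
    """Estimate the year someone started university from their current age."""
    return year - age + entry_age


def plan_2_eligible(age, year, min_age=21, max_age=55):
    """Require both the Plan 2 age band and the estimated Plan 2 cohort."""
    in_band = min_age <= age <= max_age
    in_cohort = estimated_start_year(age, year) >= PLAN_2_FIRST_START_YEAR
    return in_band and in_cohort
```

Under that assumption, a 40-year-old in 2025 has an estimated start year of 2003 and drops out of the pool, while a 25-year-old (estimated start 2018) stays in.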

@MaxGhenis
Contributor Author

Addressed the review in 21b2243. The fix switches the SLC compute path back to person-level country mapping instead of the household-level ctx.country, and it tightens the Plan 2 below-threshold imputation pool to the estimated Plan 2 cohort as well as the age band.

Local validation after the change, ruff output:

E902 No such file or directory (os error 2)
--> ...:1:1

Found 1 error.

pytest output:

============================= test session starts ==============================
platform darwin -- Python 3.13.9, pytest-8.4.2, pluggy-1.6.0
rootdir: /Users/maxghenis/worktrees/policyengine-uk-data-fix-281
configfile: pyproject.toml
plugins: anyio-4.11.0
collected 11 items

policyengine_uk_data/tests/test_student_loan_targets.py ...... [ 54%]
policyengine_uk_data/tests/test_student_loan_plan.py ..... [100%]

============================== 11 passed in 9.23s ==============================

@MaxGhenis
Contributor Author

Addressed the review in 21b2243.

Changes:

  • switched the SLC compute path back to person-level country mapping instead of the household-level ctx.country
  • tightened the Plan 2 below-threshold imputation pool to the estimated Plan 2 cohort as well as the age band

Local validation after the change:

  • uvx ruff check ...
  • uv run --python 3.13 pytest -q policyengine_uk_data/tests/test_student_loan_targets.py policyengine_uk_data/tests/test_student_loan_plan.py (11 passed)

@MaxGhenis MaxGhenis merged commit fd72fca into main Apr 13, 2026
3 checks passed
@MaxGhenis MaxGhenis deleted the codex/fix-281-below-threshold-borrowers branch April 13, 2026 17:37
@MaxGhenis
Contributor Author

Follow-up: a post-merge review found that the SLC parser is still dropping literal zero target values, so the Plan 5 2025 above-threshold zero is not actually enforced in calibration. I’m fixing that in a small follow-up PR now.
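The comment does not say how the zeros are being dropped. One common cause, shown here purely as an assumption, is a truthiness check that treats 0.0 the same as a missing cell. The helper names and cell values below are hypothetical.

```python
def keep_truthy(values):
    """Fragile: `if v` discards 0.0 along with None."""
    return {year: v for year, v in values.items() if v}


def keep_present(values):
    """Keeps literal zeros; only drops genuinely missing cells."""
    return {year: v for year, v in values.items() if v is not None}


cells = {2024: None, 2025: 0.0, 2026: 1500.0}  # illustrative values only
```

With keep_present, the 2025 zero survives as a target, so calibration can still be held to it.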

Development

Successfully merging this pull request may close these issues.

Impute loan-holder-but-not-repaying status to FRS base dataset