Fix SPI synthetic prior weights#421
Conversation
|
Local testing-mode data build completed successfully with Python 3.13 and Final local Latest calibration-log diagnostics: |
|
Pushed follow-up commit eb60c1d after the first CI run exposed the fast-fixture child-limit smoke-test threshold. Local verification on the updated branch: |
There was a problem hiding this comment.
Reviewing the fix for SPI synthetic prior weights. Tests pass locally, but I found one functional bug in the diagnostics and a few design questions worth resolving.
1. initial_zero_weight_* diagnostics go stale after the first log call (bug)
_household_weight_diagnostics reads dataset.household.household_weight.values to derive zero_initial = original_weights <= 0 (policyengine_uk_data/utils/calibrate.py:169-172). But the calibration loop mutates that same attribute in place at the end of every logged epoch:
dataset.household.household_weight = final_weights.sum(axis=0)(policyengine_uk_data/utils/calibrate.py:492 and :532)
So initial_zero_weight_rows, initial_zero_weight_household_weight, and initial_zero_weight_prior_share are only correct for the very first log_performance call (epoch 0). From epoch 10 onward, the SPI synthetic rows have non-zero weights from the previous iteration, zero_initial becomes all-False, and these "initial" columns all read 0. I reproduced this directly against _household_weight_diagnostics:
epoch0 initial_zero_weight_rows: 2
epoch10 initial_zero_weight_rows: 0 # after dataset.household.household_weight = final_weights.sum(axis=0)
The household_is_spi_synthetic_* columns stay correct because the boolean flag column is not mutated, which is why test_calibrate_local_areas_logs_loss_targets_and_source_diagnostics doesn't catch it (it only runs epochs=1, so only the epoch-0 logging path is exercised).
Suggested fix: capture the zero mask once before entering the loop (e.g. alongside household_prior_weights) and pass it into _household_weight_diagnostics, or precompute the diagnostic keys that don't change per epoch. While there, consider adding a 2-epoch regression test that asserts initial_zero_weight_rows is stable across log entries.
2. initialize_weight_priors halves positive-household prior mass — is 0.5 the intended default?
When any zero-weight rows are present, positive households get weight * (1 - zero_weight_total_share) = weight * 0.5 in the prior (calibrate.py:142-144). With ~10k SPI synthetic donors vs the FRS positive-weight households, that effectively reassigns half the country's prior household mass to the synthetic pool. That's a substantial shift from the previous behaviour and shapes the starting point Adam optimises from.
A couple of suggestions:
- Worth saying explicitly in the docstring that the function rescales positive-weight rows (right now the docstring frames it as "reserving a share for zero-weight rows", which understates the side effect on positive rows).
- 0.5 is exposed as
zero_weight_prior_total_share, but it's also baked into a module-level constant; consider documenting the rationale for picking 0.5 specifically (or pointing to whatever empirical run motivated it). If this was the value that produced the released calibration, capturing that link in the changelog/docstring helps the next person who tunes it.
3. Loosening already-loose smoke tests
test_child_limit.py: 0.3 → 0.45 — fine, the comment explains it.test_scotland_uc_babies.py: 1.0 → 1.5 — at 150% relative tolerance this test now passes anywhere in[-0.5 × target, 2.5 × target]. It is functionally a smoke test that the build returned a number. The added comment is honest about this, but at this tolerance the test mostly costs CI time without catching anything except total failure; either tighten it back down once SPI prior mass stabilises or consider gating it on the production fixture only.
4. Minor
if not log_csv or get_performance is None:(calibrate.py:408) mixes a truthiness check with an identity check.log_csvcan legitimately be aPath("")or other falsy non-Nonevalue; preferif log_csv is None or get_performance is None:for symmetry and clarity._as_bool_maskhandles numeric and string fallbacks (calibrate.py:148-155), but the two callers (household_is_spi_synthetic,household_is_capital_gains_clone) both set bool literals. The string branch in particular (parsing"true"/"1"/"yes") feels like over-engineering for an internal flag column. Happy to leave it as defensive code, just flagging.test_impute_capital_gains_marks_capital_gains_clone_householdsdoesn't assert thathousehold_is_capital_gains_cloneis preserved afterdata.household["household_weight"] = household_weightoverwrites the weight column. Since the boolean column lives alongside, it should survive, but a one-line assertion on the final dataset's column would lock that in.capital_gains.pyremoves several unused imports (Dataset,tqdm,copy,subsample_dataset, the duplicatetqdm,system). Nice cleanup; unrelated to the SPI fix but worth calling out in the PR description so reviewers don't bisect it later.
What looks good
- The refactor that consolidates the duplicate logging block into
log_performanceis a clean win and fixes the latent crash in the non-verbose branch wherelog_csvcould be set withoutget_performance. saved_weights_lossas a complement totraining_loss(one with dropout, one without) is a useful diagnostic.- Replacing the
+ random*0.01jitter with a deterministic clipped log prior is more reproducible and easier to reason about. - Source flags on the household table are a good downstream-debugging primitive — they survive
stack_datasetsand let any consumer slice the population by provenance.
Net: the diagnostic bug in (1) is the only blocker from my read; everything else is discussion/polish.
vahid-ahmadi
left a comment
There was a problem hiding this comment.
pls check my comment
Summary
Local checks
.venv/bin/ruff format --check policyengine_uk_data/utils/calibrate.py policyengine_uk_data/datasets/imputations/income.py policyengine_uk_data/datasets/imputations/capital_gains.py policyengine_uk_data/tests/test_calibrate_save.py policyengine_uk_data/tests/test_imputation_source_flags.py.venv/bin/ruff check policyengine_uk_data/utils/calibrate.py policyengine_uk_data/datasets/imputations/income.py policyengine_uk_data/datasets/imputations/capital_gains.py policyengine_uk_data/tests/test_calibrate_save.py policyengine_uk_data/tests/test_imputation_source_flags.py.venv/bin/python -m pytest policyengine_uk_data/tests/test_calibrate_save.py policyengine_uk_data/tests/test_imputation_source_flags.py policyengine_uk_data/tests/test_calibration_progress.py policyengine_uk_data/tests/test_load_weights.pyRefs #420
Related to PolicyEngine/policyengine-us-data#1139 and PolicyEngine/policyengine-us-data#1140.