Problem
When creating H5 files from manipulated simulations (e.g., state-swapping households for geographic calibration), users can inadvertently save variables that corrupt calculations on reload. The current to_input_dataframe() method and sim.input_variables property don't protect against several pitfalls we've discovered:
1. Pseudo-input variables (see #417)
Variables with adds/subtracts that aggregate formula-based components appear in sim.input_variables but contain stale pre-computed values. When saved and reloaded, these override the formula calculations.
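A minimal sketch of how such pseudo-inputs might be detected. `VarSpec` is a hypothetical stand-in for a variable definition, not the real `Variable` class, and the variable names are illustrative. It also assumes that `adds`/`subtracts` list only variable names.

```python
from dataclasses import dataclass, field


@dataclass
class VarSpec:
    """Hypothetical stand-in for a variable definition (not the real Variable class)."""
    name: str
    has_formula: bool = False
    adds: list = field(default_factory=list)
    subtracts: list = field(default_factory=list)


def is_computed(name, specs):
    """True if the variable, or anything it aggregates, is formula-based."""
    spec = specs[name]
    if spec.has_formula:
        return True
    return any(is_computed(c, specs) for c in spec.adds + spec.subtracts)


def pseudo_inputs(specs):
    """Variables with no formula of their own whose adds/subtracts reach a formula."""
    return sorted(
        name
        for name, spec in specs.items()
        if not spec.has_formula
        and (spec.adds or spec.subtracts)
        and is_computed(name, specs)
    )


specs = {
    "wages": VarSpec("wages"),
    "credit_a": VarSpec("credit_a", has_formula=True),
    "credit_b": VarSpec("credit_b", has_formula=True),
    # No formula of its own, but it aggregates formula-based components.
    "total_credits": VarSpec("total_credits", adds=["credit_a", "credit_b"]),
    # Aggregates only a plain input, so it is not flagged here.
    "total_income": VarSpec("total_income", adds=["wages"]),
}
print(pseudo_inputs(specs))
```

Only `total_credits` is flagged here, because `total_income` aggregates a plain input.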
2. Stale calculated variables
If you change an input (like state_fips for geographic relocation) but don't manually clear the cache with sim.delete_arrays(), calculated variables retain old values.
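The staleness can be reproduced with a toy cache. `ToySim` below is a hypothetical illustration, not the real `Simulation` class. Its `delete_arrays` takes only a variable name, and the rates are made up.

```python
class ToySim:
    """Hypothetical toy simulation with a per-variable cache (not the real Simulation)."""

    def __init__(self, state_fips):
        self.inputs = {"state_fips": state_fips}
        self.cache = {}

    def set_input(self, name, value):
        # Deliberately does NOT invalidate dependent variables, mirroring the pitfall.
        self.inputs[name] = value

    def calculate(self, name):
        if name not in self.cache:
            # Toy formula: a made-up flat rate that depends on the state code.
            rates = {6: 0.09, 48: 0.0}
            self.cache[name] = rates[self.inputs["state_fips"]]
        return self.cache[name]

    def delete_arrays(self, name):
        # Drop the cached value so the next calculate() recomputes it.
        self.cache.pop(name, None)


sim = ToySim(state_fips=6)
before = sim.calculate("state_tax_rate")  # computed for state 6
sim.set_input("state_fips", 48)
stale = sim.calculate("state_tax_rate")   # cache hit: still the state-6 value
sim.delete_arrays("state_tax_rate")
fresh = sim.calculate("state_tax_rate")   # recomputed for state 48
print(before, stale, fresh)
```

If `stale` were exported, the file would pair the new `state_fips` with a rate computed for the old state.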
3. No built-in identification of "true inputs"
Users must reimplement logic to identify variables with formulas/adds/subtracts. The Variable.is_input_variable() method exists but isn't exposed in a way that helps with safe exports.
4. Entity ID sensitivity
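PolicyEngine's random() uses entity IDs as seeds, so any variable that depends on random draws can change if entity IDs change between export and reload. Users need to know which variables to preserve and which to regenerate.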
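To illustrate why IDs matter, here is a sketch of ID-seeded draws using Python's standard `random` module. `draw_for_entities` and its seed formula are hypothetical and are not PolicyEngine's actual random() implementation.

```python
import random


def draw_for_entities(entity_ids, salt=0):
    """Hypothetical ID-seeded draw, one value per entity."""
    return [
        random.Random(entity_id * 100 + salt).random()
        for entity_id in entity_ids
    ]


original = draw_for_entities([101, 102, 103])
same_ids = draw_for_entities([101, 102, 103])
renumbered = draw_for_entities([1, 2, 3])

print(original == same_ids)    # identical IDs reproduce the same draws
print(original == renumbered)  # renumbered IDs produce different draws
```

Under this assumption, an export that renumbers entities needs to either save the variables derived from random draws or accept that they will be regenerated with different values.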
Current Workarounds
In policyengine-us-data, we've had to implement:
- get_calculated_variables(sim) - identifies variables with formulas/adds/subtracts
- get_pseudo_input_variables(sim) - identifies pseudo-inputs that shouldn't be saved
- Manual cache invalidation after changing geographic variables
- Manual filtering of input_variables before saving
Proposed Solution
Add a safe H5 export API to Simulation that:
- Identifies true inputs: Uses Variable.is_input_variable() plus pseudo-input detection
- Warns about state changes: If geographic variables changed since load, warn that calculated variables may be stale
- Provides a "clean export" mode: Only exports variables safe to reload without corruption
- Documents the pitfalls: Clear documentation about what can go wrong when manipulating simulations before saving
This could be a new method like to_safe_h5() or improvements to the existing export functionality.
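One possible shape for the selection step behind such a method, as a sketch only. `select_safe_variables`, `VarSpec`, and `GEOGRAPHIC_VARIABLES` are hypothetical names, not part of any existing API. The sketch conservatively treats any variable with a formula, adds, or subtracts as unsafe, and it leaves the actual H5 writing out.

```python
import warnings
from dataclasses import dataclass, field

GEOGRAPHIC_VARIABLES = {"state_fips"}  # assumed list, for illustration only


@dataclass
class VarSpec:
    """Hypothetical stand-in for a variable definition (not the real Variable class)."""
    has_formula: bool = False
    adds: list = field(default_factory=list)
    subtracts: list = field(default_factory=list)


def select_safe_variables(specs, values_at_load, current_values):
    """Return only true inputs, warning if a geographic input changed since load."""
    changed = sorted(
        name
        for name in GEOGRAPHIC_VARIABLES
        if name in current_values
        and values_at_load.get(name) != current_values[name]
    )
    if changed:
        warnings.warn(
            f"Geographic inputs changed since load: {changed}. "
            "Calculated variables may be stale and are excluded."
        )
    # Conservative rule: anything with a formula, adds, or subtracts is dropped.
    return {
        name: value
        for name, value in current_values.items()
        if not (specs[name].has_formula or specs[name].adds or specs[name].subtracts)
    }


specs = {
    "state_fips": VarSpec(),
    "wages": VarSpec(),
    "state_tax": VarSpec(has_formula=True),
    "total_tax": VarSpec(adds=["state_tax"]),
}
loaded = {
    "state_fips": [6, 6],
    "wages": [100, 200],
    "state_tax": [9, 18],
    "total_tax": [9, 18],
}
current = dict(loaded, state_fips=[48, 48])  # households relocated after load

safe = select_safe_variables(specs, loaded, current)
print(sorted(safe))
```

Dropping every aggregate is stricter than pseudo-input detection requires, but it keeps the reloaded file from overriding any formula.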
Related
adds/subtracts can corrupt H5 exports