Bug Description
The _filter_us_simulation_by_place() method in simulation.py filters the dataset at the person level instead of the household level, causing incorrect results in economy comparisons.
Current Implementation (Incorrect)
def _filter_us_simulation_by_place(self, simulation, simulation_type, region, reform):
    _, place_fips_code = parse_us_place_region(region)
    df = simulation.to_input_dataframe()  # Returns person-level data
    person_place_fips = simulation.calculate("place_fips", map_to="person").values
    # Match the FIPS code whether it is stored as str or bytes
    mask = (person_place_fips == place_fips_code) | (person_place_fips == place_fips_code.encode())
    return simulation_type(dataset=df[mask], reform=reform)  # Filters PERSONS
Expected Behavior
Should filter at the household level, keeping all persons in matching households, as demonstrated in the subsample() method in policyengine-core:
# Correct pattern from subsample():
h_df = df.groupby(household_id_column).first()
chosen_household_ids = h_df[mask].index
subset_df = df[df[household_id_column].isin(chosen_household_ids)]
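The pattern above can be sketched on a toy person-level dataframe. This is a minimal illustration, not the actual fix: the helper name `filter_households_by_place` and the column names `household_id`, `person_id`, and `place_fips` are hypothetical stand-ins for whatever to_input_dataframe() actually produces.

```python
import pandas as pd


def filter_households_by_place(
    df,
    place_fips_code,
    household_id_column="household_id",  # hypothetical column name
    place_column="place_fips",  # hypothetical column name
):
    """Keep every person whose household is in the given place."""
    # Collapse to one row per household, as subsample() does.
    h_df = df.groupby(household_id_column).first()

    # Normalise str/bytes so both encodings of the FIPS code match.
    place_as_str = h_df[place_column].map(
        lambda v: v.decode() if isinstance(v, bytes) else str(v)
    )
    household_mask = place_as_str == str(place_fips_code)

    # Select household IDs first, then keep all persons in those households.
    chosen_household_ids = h_df[household_mask].index
    return df[df[household_id_column].isin(chosen_household_ids)]


# Toy data: households 10 and 30 are in place "51000", household 20 is not.
people = pd.DataFrame(
    {
        "person_id": [1, 2, 3, 4, 5],
        "household_id": [10, 10, 20, 30, 30],
        "place_fips": ["51000", "51000", "12345", b"51000", b"51000"],
    }
)

subset = filter_households_by_place(people, "51000")
```

The key design point is that the mask is built on the one-row-per-household frame, so a household is either kept whole or dropped whole.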
Impact
When running the Mamdani NYC income tax analysis:
- Expected Decile 10 average: ~-$36,149 (from Jupyter notebook)
- Actual Decile 10 average: ~-$15,889 (from app using place filtering)
- Budgetary impact matches (~$8.87B), indicating that the filtering captures the right population but the decile averages are calculated incorrectly
Root Cause
to_input_dataframe() maps all variables to person level (line 1516 in policyengine-core). When you filter this person-level dataframe directly and create a new simulation, household-level variable calculations become incorrect.
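As a general illustration of how person-level rows can distort a household-level average, consider the toy example below. It is a sketch only: the column names are hypothetical, the numbers are made up, and it is not a confirmed reproduction of the mechanism behind this bug. It shows that when a household-level value is repeated on every member's row, averaging over person rows weights each household by its size.

```python
import pandas as pd

# Hypothetical person-level frame: household 1 has four members,
# household 2 has one. The household-level value repeats per member.
people = pd.DataFrame(
    {
        "household_id": [1, 1, 1, 1, 2],
        "household_net_income_change": [-100.0, -100.0, -100.0, -100.0, -1000.0],
    }
)

# Averaging over person rows counts household 1 four times.
person_level_mean = people["household_net_income_change"].mean()

# Averaging over one row per household counts each household once.
household_level_mean = (
    people.groupby("household_id")["household_net_income_change"].first().mean()
)

print(person_level_mean)     # -280.0
print(household_level_mean)  # -550.0
```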
Note
The UK country filtering (country/ regions) has the same issue - it also uses map_to="person" and filters at person level.
Related