Persist and return the resolved model/data release bundle in policyengine-api #3393

@MaxGhenis


Problem

policyengine-api currently records or exposes package versions in a few places, but it does not persist and return the resolved immutable execution bundle for a simulation.

That creates three reproducibility problems:

  1. household calculations run against whatever country package is installed in-process, with no explicit model/data pin at execution time
  2. economy flows have partial model_version plumbing but no effective data-version pinning
  3. cache keys are based on request payloads rather than the resolved execution bundle

Relevant code paths:

  • household execution instantiates the installed country package Simulation(...) directly and only reports the installed package version in metadata: policyengine_api/country.py
  • economy setup hardcodes dataset version resolution to None, builds dataset aliases like ...@None, and strips that suffix back off during setup: policyengine_api/data/model_setup.py, policyengine_api/services/economy_service.py
  • the Modal adapter drops data_version before job submission: policyengine_api/libs/simulation_api_modal.py
  • cache keys are hashes of request bodies, not of the resolved bundle: policyengine_api/utils/cache_utils.py
  • the bump workflow only updates country package versions, so deployment automation also treats package version as the whole contract: gcp/bump_country_package.py

Desired contract

For every simulation or cached result, the API should persist and return a resolved immutable bundle, not just a country ID or API version.

At minimum that bundle should include:

  • orchestrator version if applicable (policyengine.py or equivalent)
  • country model package name/version
  • country data package name/version
  • resolved dataset artifact locator or manifest revision
  • checksum or manifest ID for verification

This should apply to both household-style in-process calculations and economy/report-style asynchronous calculations.
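As a rough illustration of the contract above, the bundle could be modeled as an immutable record with a stable fingerprint. This is only a sketch: the class name `ExecutionBundle`, all field names, and the example values are hypothetical placeholders, not existing policyengine-api code or real package releases.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ExecutionBundle:
    """Hypothetical immutable record of what a simulation actually ran against."""

    model_package: str      # country model package name
    model_version: str      # country model package version
    data_package: str       # country data package name
    data_version: str       # country data package version
    dataset_locator: str    # resolved dataset artifact locator or manifest revision
    dataset_checksum: str   # checksum or manifest ID for verification
    orchestrator_version: Optional[str] = None  # only if an orchestrator is involved

    def to_dict(self) -> dict:
        return asdict(self)

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so the hash is stable
        # regardless of field ordering.
        canonical = json.dumps(self.to_dict(), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
bundle = ExecutionBundle(
    model_package="example-country-model",
    model_version="1.2.3",
    data_package="example-country-data",
    data_version="4.5.6",
    dataset_locator="example://datasets/example_2024.h5",
    dataset_checksum="sha256:0000",
)
print(bundle.fingerprint())
```

Freezing the dataclass means a bundle cannot be mutated after it is resolved, and the fingerprint gives a single value that can be stored alongside a simulation row or reused as a bundle identity elsewhere.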

What should change

  1. Resolve and persist the execution bundle at simulation creation time.
  2. Return that bundle in simulation metadata and user-facing API responses.
  3. Stop dropping data_version or equivalent bundle identity before job submission.
  4. Make cache keys and dedupe keys include the resolved bundle identity.
  5. Update deployment/version-bump tooling so it does not treat country package version alone as the full runtime contract.
  6. Keep backward compatibility for existing clients where possible, but add new structured provenance fields rather than overloading the current api_version field.
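To make item 4 concrete, here is a minimal sketch of why payload-only keys collide across releases and how folding the resolved bundle into the key avoids that. The function names and dictionary shapes are assumptions for illustration; they are not the current contents of policyengine_api/utils/cache_utils.py.

```python
import hashlib
import json


def _canonical(obj) -> str:
    # Sorted keys and fixed separators give a deterministic serialization.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def payload_only_key(payload: dict) -> str:
    """Key derived from the request body alone (the behavior described as a problem)."""
    return hashlib.sha256(_canonical(payload).encode("utf-8")).hexdigest()


def bundle_aware_key(payload: dict, bundle: dict) -> str:
    """Hypothetical key that also covers the resolved model/data bundle."""
    combined = {"payload": payload, "bundle": bundle}
    return hashlib.sha256(_canonical(combined).encode("utf-8")).hexdigest()


# Same request, resolved against two different (made-up) data releases.
request = {"country_id": "us", "reform": {}, "time_period": "2025"}
bundle_a = {"model_version": "1.2.3", "data_version": "4.5.6"}
bundle_b = {"model_version": "1.2.3", "data_version": "4.5.7"}

# Payload-only keys cannot tell the two runs apart.
print(payload_only_key(request) == payload_only_key(request))

# Bundle-aware keys differ whenever the resolved bundle differs.
print(bundle_aware_key(request, bundle_a) == bundle_aware_key(request, bundle_b))
```

With the second scheme, a request that relies on floating defaults gets a new key as soon as the defaults resolve to a different release, so stale results are not served under the old key.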

Acceptance criteria

  • Household calculations and economy calculations both persist the resolved model/data bundle used at execution time.
  • API responses expose structured provenance fields rather than only api_version or country package version.
  • The economy pipeline no longer hardcodes dataset version resolution to None.
  • The Modal submission path preserves the resolved bundle, including data release identity.
  • Cache and dedupe keys include the resolved bundle identity so floating defaults cannot collide across releases.
  • Deployment/version bump workflows can update or verify the full runtime bundle, not just the country model package version.
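One possible shape for the structured provenance fields is sketched below. The response envelope, the `provenance` key, and the required field names are all hypothetical; the only point taken from this issue is that `api_version` stays untouched for existing clients while bundle details live in a separate structured field.

```python
from typing import Any, Dict

# Hypothetical minimum set of provenance fields, mirroring the bundle contract.
REQUIRED_PROVENANCE_FIELDS = (
    "model_package",
    "model_version",
    "data_package",
    "data_version",
    "dataset_locator",
    "dataset_checksum",
)


def build_response(
    result: Dict[str, Any], api_version: str, bundle: Dict[str, Any]
) -> Dict[str, Any]:
    """Attach the resolved bundle to a response without overloading api_version."""
    missing = [name for name in REQUIRED_PROVENANCE_FIELDS if not bundle.get(name)]
    if missing:
        # Fail loudly rather than returning a result with unresolved provenance,
        # for example a data_version that was left as None.
        raise ValueError(f"unresolved provenance fields: {missing}")
    return {
        "api_version": api_version,   # unchanged legacy field
        "result": result,
        "provenance": dict(bundle),   # new structured field
    }


# Placeholder values for illustration only.
response = build_response(
    result={"budget_impact": 0.0},
    api_version="0.0.0",
    bundle={
        "model_package": "example-country-model",
        "model_version": "1.2.3",
        "data_package": "example-country-data",
        "data_version": "4.5.6",
        "dataset_locator": "example://datasets/example_2024.h5",
        "dataset_checksum": "sha256:0000",
    },
)
print(sorted(response))
```

A check like the `missing` guard could also serve as a regression test for the acceptance criteria: any path that still resolves the data version to None would fail instead of silently returning an unpinned result.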

Upstream dependencies

This should consume the data-release contracts from:

And it should stay aligned with the orchestration work in:
