Support GCS runtime dataset materialization #381

Merged: anth-volk merged 2 commits into main from codex/gcs-runtime-datasets on May 29, 2026

@anth-volk
Contributor

Fixes #379

Summary

  • Add runtime dataset source materialization for gs:// and hf:// artifact URIs.
  • Download GCS artifacts into .datasets/ with CRC-backed disk caching and support for generation or metadata-versioned objects.
  • Materialize US/UK dataset sources before handing them to country package Microsimulation constructors, preserving canonical bundle provenance separately.
  • Add focused tests for URI parsing, materialization, and country package handoff behavior.
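As a rough illustration of the URI parsing step, here is a minimal sketch. It is not the code from this PR: `DatasetSource` and `parse_dataset_uri` are hypothetical names, and the version syntax is an assumption (a `#generation` suffix for gs://, following the gsutil convention, and an `@revision` suffix for hf://).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetSource:
    """Hypothetical parsed form of a gs:// or hf:// dataset URI."""

    scheme: str
    container: str  # bucket name (gs) or "owner/repo" (hf)
    path: str  # object or file path inside the container
    version: Optional[str]  # generation (gs) or revision (hf), if pinned


def parse_dataset_uri(uri: str) -> DatasetSource:
    """Split a remote dataset URI into its parts, or raise ValueError."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in ("gs", "hf"):
        raise ValueError(f"Unsupported dataset URI: {uri!r}")

    if scheme == "gs":
        # Assumed syntax: gs://bucket/path/to/object#generation
        rest, _, version = rest.partition("#")
        container, _, path = rest.partition("/")
    else:
        # Assumed syntax: hf://owner/repo/path/to/file@revision
        rest, _, version = rest.partition("@")
        owner, _, remainder = rest.partition("/")
        repo, _, path = remainder.partition("/")
        container = f"{owner}/{repo}" if owner and repo else ""

    if not container or not path:
        raise ValueError(f"Incomplete dataset URI: {uri!r}")
    return DatasetSource(scheme, container, path, version or None)
```

Keeping the version as an optional field lets a caller tell a pinned reference apart from an unpinned one before deciding how to cache the download.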

Root Cause

The simulation API needs to pass GCS dataset URIs for runtime execution, but the US country package intercepts string dataset references before policyengine-core can handle gs://. This caused GCS runtime dataset references to fail instead of being downloaded to a local path.
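The fix described above amounts to resolving a remote reference to a local file before the country package sees it. The sketch below illustrates that idea under stated assumptions and is not the PR's implementation: `materialize`, `fetch`, and `expected_crc` are hypothetical names, the cache file naming is invented, and `zlib.crc32` stands in for the CRC32C checksum that GCS reports in object metadata.

```python
import zlib
from pathlib import Path
from typing import Callable, Optional

REMOTE_SCHEMES = ("gs://", "hf://")


def checksum(data: bytes) -> int:
    # Stand-in only: this is CRC-32 from the standard library, not CRC32C.
    return zlib.crc32(data)


def materialize(
    source: object,
    cache_dir: str,
    fetch: Callable[[str], bytes],
    expected_crc: Optional[int] = None,
) -> object:
    """Return a local path for a remote URI; pass anything else through."""
    if not isinstance(source, str) or not source.startswith(REMOTE_SCHEMES):
        return source  # local paths and in-memory datasets are left untouched

    # Hypothetical cache layout: one flat file per URI under cache_dir.
    target = Path(cache_dir) / source.split("://", 1)[1].replace("/", "_")

    # Cache hit: the file exists and, if a checksum is known, still matches it.
    if target.exists() and (
        expected_crc is None or checksum(target.read_bytes()) == expected_crc
    ):
        return str(target)

    data = fetch(source)  # hypothetical downloader supplied by the caller
    if expected_crc is not None and checksum(data) != expected_crc:
        raise ValueError(f"Checksum mismatch for {source!r}")

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return str(target)
```

With a helper of this shape, the caller would hand the returned local path to the Microsimulation constructor and keep the original URI separately for provenance, so the country package never has to interpret a gs:// string itself.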

Validation

  • env POLICYENGINE_SKIP_COUNTRY_IMPORTS=1 uv run --extra dev --extra us --extra uk pytest --noconftest tests/test_dataset_sources.py -q
  • ruff check pyproject.toml src/policyengine/provenance/dataset_sources.py src/policyengine/utils/data src/policyengine/utils/google_cloud_bucket.py src/policyengine/tax_benefit_models/us/datasets.py src/policyengine/tax_benefit_models/us/model.py src/policyengine/tax_benefit_models/uk/datasets.py src/policyengine/tax_benefit_models/uk/model.py tests/test_dataset_sources.py tests/test_release_manifests.py

Deployment Note

The simulation API GCS normalization changes should wait for a policyengine.py release containing this support, unless the API intentionally pins a branch or wheel during testing.

@anth-volk anth-volk marked this pull request as ready for review May 29, 2026 23:58
@anth-volk anth-volk merged commit 7c3416d into main May 29, 2026
11 checks passed
@anth-volk anth-volk deleted the codex/gcs-runtime-datasets branch May 29, 2026 23:58

Development

Successfully merging this pull request may close these issues.

Support GCS runtime dataset URIs in policyengine.py
