Refactor local/national H5 publishing into testable local_h5 services #722

@anth-volk

Problem

The US local-area and national H5 publishing path is still too procedural and relies too heavily on side effects. The old path spread business logic across publish_local_area.py, modal_app/local_area.py, and modal_app/worker_script.py, with repeated source-dataset setup, implicit contracts, and weakly structured worker/coordinator communication.

That made it hard to:

  • reason about what one area build actually does
  • reuse pieces of the publishing stack
  • unit test the individual steps without spinning up large runtime surfaces
  • enforce or even observe validation/failure behavior cleanly
  • propagate exact package geography through H5 publishing

What We Want To Fix

Refactor the H5 publishing path into a set of clear, scoped, composable classes and contracts, in the same direction as the microplex-style architecture we discussed.

The target shape is:

  • typed request/result contracts
  • pure selection/reindexing/cloning steps
  • a worker-scoped source snapshot
  • explicit US-only augmentation services
  • a builder/writer pair for one-area H5 materialization
  • worker/coordinator services that communicate with structured results rather than ad hoc dict mutation
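To make the "typed request/result contracts" and "structured results" bullets concrete, here is a minimal sketch. Every name in it (AreaBuildRequest, AreaBuildResult, run_area_build, summarize, and the area-type strings) is hypothetical and chosen for illustration only; none of them is taken from the actual local_h5 package.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Assumption: illustrative area types only, not the package's real vocabulary.
SUPPORTED_AREA_TYPES = {"state", "district", "city", "national"}


@dataclass(frozen=True)
class AreaBuildRequest:
    """Hypothetical request contract: everything one area build needs."""
    area_type: str
    area_id: str
    output_path: Path


@dataclass(frozen=True)
class AreaBuildResult:
    """Hypothetical result contract a worker hands back to the coordinator."""
    request: AreaBuildRequest
    succeeded: bool
    error: Optional[str] = None


def run_area_build(request: AreaBuildRequest) -> AreaBuildResult:
    """Placeholder worker entry point that reports failure as data."""
    if request.area_type not in SUPPORTED_AREA_TYPES:
        return AreaBuildResult(
            request=request,
            succeeded=False,
            error=f"unsupported area_type: {request.area_type}",
        )
    # The real selection, reindexing, and writing steps would run here.
    return AreaBuildResult(request=request, succeeded=True)


def summarize(results: list[AreaBuildResult]) -> dict[str, list[str]]:
    """Coordinator-side aggregation over immutable results, no dict mutation."""
    return {
        "completed": [r.request.area_id for r in results if r.succeeded],
        "failed": [r.request.area_id for r in results if not r.succeeded],
    }


results = [
    run_area_build(AreaBuildRequest("state", "CA", Path("CA.h5"))),
    run_area_build(AreaBuildRequest("county", "06037", Path("06037.h5"))),
]
summary = summarize(results)
```

Because the contracts are frozen dataclasses, the coordinator can only read what a worker returned. That is the property the last bullet asks for in place of ad hoc dict mutation.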

Scope

This work should cover the production Modal H5 path for:

  • regional/state/district/city publishing
  • national H5 publishing

It should also preserve the existing public build_h5(...) facade while moving the real work under a reusable local_h5 library surface.
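One way to keep a facade stable while moving the work behind it is plain delegation, sketched below. The real build_h5(...) signature is not shown in this issue, so the sketch does not reproduce it: the keyword-only `_builder` hook, the `**options` passthrough, and the AreaBuilder, _DefaultBuilder, and RecordingBuilder names are all assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any, Protocol


class AreaBuilder(Protocol):
    """Hypothetical interface for the service that does the real work."""

    def build(self, **options: Any) -> str: ...


class _DefaultBuilder:
    """Stand-in for the library-backed builder; returns a description only."""

    def build(self, **options: Any) -> str:
        return f"built with options: {sorted(options)}"


def build_h5(*, _builder: AreaBuilder | None = None, **options: Any) -> str:
    """Facade sketch: callers keep one entry point, the work is delegated.

    Assumption: the real build_h5 takes different parameters; **options is a
    placeholder for whatever it accepts.
    """
    builder = _builder if _builder is not None else _DefaultBuilder()
    return builder.build(**options)


class RecordingBuilder:
    """Test double showing that the seam can be exercised without a real build."""

    def __init__(self) -> None:
        self.calls: list[dict[str, Any]] = []

    def build(self, **options: Any) -> str:
        self.calls.append(options)
        return "fake.h5"


fake = RecordingBuilder()
output = build_h5(_builder=fake, area_id="CA")
```

With this shape the seam test only needs to confirm that the facade forwards its arguments, while the builder itself is covered by unit tests.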

Requirements

  • Prefer exact geography loaded from the calibration package over seed-based regeneration.
  • Reduce repeated source-dataset setup within workers.
  • Make validation results explicit and testable.
  • Keep fingerprint/resume logic at the adapter boundary.
  • Make the one-area build stack unit-testable, with only a thin seam/integration layer above it.
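Two of these requirements, reducing repeated source-dataset setup and making validation results explicit, can be sketched together. SourceSnapshot, ValidationResult, validate_area, and the toy dict-of-lists dataset below are all hypothetical stand-ins, not the package's real types or data layout.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ValidationResult:
    """Hypothetical explicit validation outcome for one area."""
    area_id: str
    passed: bool
    issues: tuple[str, ...] = ()


class SourceSnapshot:
    """Hypothetical worker-scoped cache: loads the source data at most once."""

    def __init__(self, loader: Callable[[], dict[str, list[int]]]) -> None:
        self._loader = loader
        self._data: Optional[dict[str, list[int]]] = None
        self.load_count = 0  # exposed so tests can observe repeated setup

    def get(self) -> dict[str, list[int]]:
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1
        return self._data


def validate_area(area_id: str, snapshot: SourceSnapshot) -> ValidationResult:
    """Return a result object rather than logging or raising on failure."""
    households = snapshot.get().get(area_id, [])
    issues = [] if households else ["no households selected"]
    return ValidationResult(area_id, passed=not issues, issues=tuple(issues))


# Toy loader standing in for the expensive source-dataset setup.
snapshot = SourceSnapshot(lambda: {"CA": [1, 2, 3], "NY": [4]})
results = [validate_area(area, snapshot) for area in ("CA", "NY", "ZZ")]
```

The load counter and the returned ValidationResult objects make both behaviors directly assertable in a unit test, without starting a worker runtime.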

Expected Deliverables

  • policyengine_us_data/calibration/local_h5/ package with the core services and contracts
  • thin worker/coordinator adapters around that package
  • targeted unit coverage for the new components
  • a minimal real build_h5(...) integration seam test
  • updated docs explaining the new boundary with the legacy surface

Known Follow-Ups

  • Validation policy fields other than enabled are only partially implemented today and should be enforced explicitly in a later slice.
  • Standalone publish_local_area.py loop optimizations can remain separate from the production Modal path.
  • Fingerprinting can eventually move to a package-geography-centric schema version once we choose to change resume semantics intentionally.
