Closed
Commits
21 commits
3125253
local_h5: add core contracts and initial tests
anth-volk Apr 8, 2026
b955bc5
local_h5: extract work partitioning and fix scheduler crash
anth-volk Apr 8, 2026
6f7e5bc
local_h5: make worker validation explicit
anth-volk Apr 8, 2026
1ce50a2
local_h5: load exact geography from calibration package
anth-volk Apr 8, 2026
38e219c
local_h5: harden validation and fingerprinting
anth-volk Apr 8, 2026
b7e6e71
local_h5: extract weight layout and area selection
anth-volk Apr 9, 2026
b7c9b1b
local_h5: add worker-scoped source snapshots
anth-volk Apr 9, 2026
1dc1ff7
local_h5: extract pure entity reindexing
anth-volk Apr 9, 2026
ea88a35
local_h5: extract variable cloning
anth-volk Apr 9, 2026
e06fe82
local_h5: extract US-specific augmentations
anth-volk Apr 9, 2026
6c916c4
local_h5: canonicalize clone count from weights
anth-volk Apr 9, 2026
ee916f5
local_h5: add seam coverage for adapters and package io
anth-volk Apr 9, 2026
567f6a1
local_h5: introduce builder and writer facade
anth-volk Apr 9, 2026
3f7cb48
local_h5: introduce worker session and service
anth-volk Apr 9, 2026
defb216
local_h5: refactor coordinators around requests
anth-volk Apr 9, 2026
1a03776
local_h5: document landed architecture
anth-volk Apr 9, 2026
36e27b9
local_h5: tighten adapter boundaries
anth-volk Apr 9, 2026
c34a626
pipeline: test validation diagnostics writes
anth-volk Apr 9, 2026
0e9aebb
local_h5: add minimal build_h5 integration test
anth-volk Apr 9, 2026
b23779b
Tighten H5 publish fingerprint ownership
anth-volk Apr 10, 2026
6ab9cef
Document worker work-items compatibility
anth-volk Apr 11, 2026
8 changes: 8 additions & 0 deletions docs/internals/README.md
@@ -31,6 +31,14 @@ expansion.

______________________________________________________________________

## Refactor notes

| Note | Purpose |
| ---- | ------- |
| [`local_h5_refactor_status.md`](local_h5_refactor_status.md) | Records the landed `local_h5` architecture for area publishing, the thin adapter layers that still remain, and the work that was explicitly deferred in the H5 refactor PR. |

______________________________________________________________________

## Pipeline orchestration reference

The pipeline runs on [Modal](https://modal.com) via `modal_app/pipeline.py`. It chains five steps
173 changes: 173 additions & 0 deletions docs/internals/local_h5_refactor_status.md
@@ -0,0 +1,173 @@
# Local H5 Refactor Status

Date: 2026-04-09

This note records what actually landed in the `fix/target-architecture-h5`
refactor for the US local and national H5 publishing path.

It is intentionally narrower than the broader architecture planning docs. The goal here is to
describe the code that now exists, the remaining thin spots, and the work that was explicitly
deferred.

## What Landed

The H5 path now has explicit internal contracts and a request-driven architecture:

- `policyengine_us_data.calibration.local_h5.contracts`
  - request, filter, validation, and worker result contracts
  - `to_dict()` / `from_dict()` support for adapter boundaries
- `policyengine_us_data.calibration.local_h5.partitioning`
  - tested weighted work partitioning
- `policyengine_us_data.calibration.local_h5.package_geography`
  - exact calibration-package geography loading
- `policyengine_us_data.calibration.local_h5.fingerprinting`
  - typed publish fingerprint inputs and records
- `policyengine_us_data.calibration.local_h5.selection`
  - clone-weight layout and area selection
- `policyengine_us_data.calibration.local_h5.source_dataset`
  - worker-scoped source snapshot with lazy variable access
- `policyengine_us_data.calibration.local_h5.reindexing`
  - pure entity reindexing
- `policyengine_us_data.calibration.local_h5.variables`
  - variable cloning and export policy
- `policyengine_us_data.calibration.local_h5.us_augmentations`
  - US-only payload augmentation
- `policyengine_us_data.calibration.local_h5.builder`
  - `LocalAreaDatasetBuilder` as the one-area orchestration root
- `policyengine_us_data.calibration.local_h5.writer`
  - `H5Writer` as the H5 persistence boundary
- `policyengine_us_data.calibration.local_h5.worker_service`
  - `WorkerSession`
  - `LocalH5WorkerService`
  - validation context loading
  - request/result adaptation helpers
- `policyengine_us_data.calibration.local_h5.area_catalog`
  - concrete `USAreaCatalog`
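
As an illustration of the `to_dict()` / `from_dict()` pattern at an adapter boundary, here is a minimal sketch. The class name `AreaBuildRequest` and its fields are hypothetical and are not taken from `contracts.py`; only the round-trip idea comes from the list above.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class AreaBuildRequest:
    """Hypothetical request contract; the field names are illustrative only."""

    area_id: str
    area_type: str
    weight: float = 1.0

    def to_dict(self) -> dict[str, Any]:
        # Plain dict of primitives, safe to pass through JSON or CLI arguments.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "AreaBuildRequest":
        # Reject unknown keys so drift between adapter and contract fails loudly.
        unknown = set(data) - set(cls.__dataclass_fields__)
        if unknown:
            raise ValueError(f"unexpected keys: {sorted(unknown)}")
        return cls(**data)


request = AreaBuildRequest(area_id="CA-12", area_type="district", weight=3.5)
assert AreaBuildRequest.from_dict(request.to_dict()) == request
```

Keeping the serialized form to plain primitives is what lets a thin adapter, such as a worker script, pass requests across a process boundary without importing anything heavier than the contract itself.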

The public entrypoints still exist, but they are now adapters over the internal components:

- `policyengine_us_data.calibration.publish_local_area.build_h5(...)`
- `modal_app.worker_script`
- `modal_app.local_area.coordinate_publish(...)`
- `modal_app.local_area.coordinate_national_publish(...)`

## Current Shape

The current H5 publishing path is:

1. coordinator derives publish inputs and fingerprint
2. coordinator builds concrete US requests from `USAreaCatalog`
3. coordinator partitions weighted requests across workers
4. worker script loads one `WorkerSession`
5. worker service iterates requests in the chunk
6. builder creates one in-memory payload per request
7. writer persists the H5
8. validation runs per output when enabled
9. coordinator aggregates structured worker results
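
Step 3 can be pictured with a small sketch. This note does not say which algorithm `partitioning` uses, so the greedy heaviest-first heuristic below is an assumption chosen for illustration, and `partition_weighted` is a hypothetical name.

```python
from __future__ import annotations

import heapq
from typing import Sequence


def partition_weighted(
    items: Sequence[tuple[str, float]], num_workers: int
) -> list[list[str]]:
    """Give each (name, weight) item to the currently least-loaded worker."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Heap entries are (current_load, worker_index); the index breaks ties.
    heap = [(0.0, index) for index in range(num_workers)]
    chunks: list[list[str]] = [[] for _ in range(num_workers)]
    # Heaviest items first; the name is a secondary key for determinism.
    for name, weight in sorted(items, key=lambda item: (-item[1], item[0])):
        load, worker = heapq.heappop(heap)
        chunks[worker].append(name)
        heapq.heappush(heap, (load + weight, worker))
    return chunks


requests = [("national", 10.0), ("CA", 6.0), ("TX", 5.0), ("NY", 4.0), ("WY", 1.0)]
print(partition_weighted(requests, num_workers=2))
# → [['national', 'NY'], ['CA', 'TX', 'WY']]
```

With more workers than items, this sketch returns empty chunks for the idle workers instead of failing.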

In other words:

- one-area build logic now lives in `LocalAreaDatasetBuilder`
- one-worker-chunk logic now lives in `LocalH5WorkerService`
- coordinator logic is thinner and request-driven
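
The split between builder, writer, and worker service can be sketched as below. Every name here (`Builder`, `Writer`, `AreaResult`, `run_chunk`, `DictBuilder`, `InMemoryWriter`) is a hypothetical stand-in, not the API of `LocalAreaDatasetBuilder`, `H5Writer`, or `LocalH5WorkerService`. Recording per-request failures instead of aborting the chunk is also an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Protocol

Payload = dict[str, list[float]]


class Builder(Protocol):
    def build(self, area_id: str) -> Payload: ...


class Writer(Protocol):
    def write(self, area_id: str, payload: Payload) -> str: ...


@dataclass(frozen=True)
class AreaResult:
    area_id: str
    output_path: Optional[str] = None
    error: Optional[str] = None


def run_chunk(
    area_ids: Iterable[str],
    builder: Builder,
    writer: Writer,
    validate: Optional[Callable[[str], None]] = None,
) -> list[AreaResult]:
    """Build, write, and optionally validate each area in one worker chunk."""
    results: list[AreaResult] = []
    for area_id in area_ids:
        try:
            payload = builder.build(area_id)  # in-memory payload only
            path = writer.write(area_id, payload)  # the only persistence step
            if validate is not None:
                validate(path)
            results.append(AreaResult(area_id, output_path=path))
        except Exception as exc:  # record the failure, keep the chunk going
            results.append(AreaResult(area_id, error=f"{type(exc).__name__}: {exc}"))
    return results


class DictBuilder:
    """Toy builder that looks payloads up in a dict."""

    def __init__(self, source: dict[str, Payload]) -> None:
        self._source = source

    def build(self, area_id: str) -> Payload:
        return self._source[area_id]  # raises KeyError for an unknown area


class InMemoryWriter:
    """Toy writer that records payloads instead of touching disk."""

    def __init__(self) -> None:
        self.files: dict[str, Payload] = {}

    def write(self, area_id: str, payload: Payload) -> str:
        path = f"{area_id}.h5"
        self.files[path] = payload
        return path
```

Because the builder returns a payload and never writes, it can be unit-tested without a filesystem, which matches the unit-first testing approach described under Test Status.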

## What Stayed Concrete And US-Specific

This refactor deliberately did **not** try to create a fake shared cross-country core.

Still US-specific by design:

- `CloneWeightMatrix`
- `USAreaCatalog`
- `USAugmentationService`
- the current local-H5 coordinator/orchestration adapters

That is intentional. The code was only generalized where there was already a real stable seam.

## Test Status

The refactor added a cheap unit-first suite around the new seams. At the end of
the coordinator refactor, the targeted local-H5 suite was passing:

```text
81 passed
```

Coverage now exists for:

- contracts
- partitioning
- validation helpers and worker validation contract
- package geography loading
- fingerprinting
- selection
- source snapshot loading
- reindexing
- variable cloning
- US augmentations
- builder and writer seams
- worker service behavior
- US area catalog behavior
- coordinator contract behavior
- calibration package serialized geography round-trip

The deliberate gap is heavy runtime integration. The branch does **not** add a broad slow parity
suite.

The PR was designed so that most correctness lives in unit-testable
components, with only thin compatibility or seam coverage on top.

## Deferred Follow-Ups

These items were explicitly left out or only partially handled:

1. Heavy compatibility and invariant testing
   - broader `build_h5` runtime parity
   - deeper `X @ w` / area-aggregate invariants
   - full Modal-like integration coverage

2. Validator unification
   - per-area target validation is now structurally correct
   - national validation is still partly separate
   - only `ValidationPolicy.enabled` is enforced today; the finer-grained
     validation policy fields are present but not fully wired through

3. Fingerprint schema simplification
   - clone count is now canonicalized from weights
   - long-term package-backed fingerprinting should stop treating `n_clones` and `seed` as
     semantic equality inputs

4. Possible later shared-core extraction
   - nothing in this branch proves that the US abstractions are yet the right shared abstractions
     for UK or another country

5. Coordinator cleanup beyond the H5 scope
   - Modal upload/promotion/manifest logic remains adapter-heavy
   - that is outside the intended scope of this refactor
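
For item 3, one way to read "canonicalized from weights" is that the clone count is derived from the weight vector length and not trusted from a caller-supplied value. The sketch below assumes that reading. The function names, the record fields, and the hashing scheme are all hypothetical and do not describe the `fingerprinting` module.

```python
import hashlib
import json


def canonical_clone_count(num_weights: int, num_households: int) -> int:
    """Derive the clone count from the weight vector length (assumed layout)."""
    if num_households <= 0:
        raise ValueError("num_households must be positive")
    n_clones, remainder = divmod(num_weights, num_households)
    if remainder or n_clones == 0:
        raise ValueError(
            f"{num_weights} weights is not a positive multiple of "
            f"{num_households} households"
        )
    return n_clones


def publish_fingerprint(weights_sha256: str, num_weights: int, num_households: int) -> str:
    """Hash a small record with sorted keys so equal inputs give equal digests."""
    record = {
        "weights_sha256": weights_sha256,
        "n_clones": canonical_clone_count(num_weights, num_households),
    }
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()
```

Under this reading, two callers that disagree on a requested clone count but share the same weights would still produce the same fingerprint.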

## What This Documentation Does Not Claim

This branch does **not** establish a reusable cross-country core library.

It does establish a cleaner set of seams that another country pipeline could
learn from:

- request/result contracts
- builder and worker-service boundaries
- package-backed geography loading
- lazy source snapshot handling

Whether any of those should later move into a real shared abstraction should be
decided only after a second concrete implementation proves the shape.
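
Of the seams listed above, lazy source snapshot handling is the easiest to show in isolation. The sketch below is a generic load-on-first-access cache; `LazySnapshot` and its loader signature are hypothetical and are not the interface of `source_dataset`.

```python
from __future__ import annotations

from typing import Callable


class LazySnapshot:
    """Load each variable at most once, on first access."""

    def __init__(self, loader: Callable[[str], list[float]]) -> None:
        self._loader = loader
        self._cache: dict[str, list[float]] = {}

    def get(self, name: str) -> list[float]:
        # Only variables that a request actually touches are ever read.
        if name not in self._cache:
            self._cache[name] = self._loader(name)
        return self._cache[name]

    @property
    def loaded(self) -> list[str]:
        return sorted(self._cache)
```

Scoping one such object to a worker means every request in the chunk reuses the same cached arrays, so the source file is not reopened for each area.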

## Reading Order

If you need to understand the landed architecture quickly, read in this order:

1. `policyengine_us_data/calibration/local_h5/contracts.py`
2. `policyengine_us_data/calibration/local_h5/builder.py`
3. `policyengine_us_data/calibration/local_h5/worker_service.py`
4. `policyengine_us_data/calibration/local_h5/area_catalog.py`
5. `policyengine_us_data/calibration/publish_local_area.py`
6. `modal_app/worker_script.py`
7. `modal_app/local_area.py`