In-memory snapshot toolkit (git stash for timesteps) #195
Spun off from the 2026-05-11 deformable-surface design discussion as a self-contained UW3 capability. Covers motivation (backtrack-on-failure, adaptive Δt retry, RK staging, predictor-corrector probes, crash recovery, bisection, debugging captures), the two existing on-disk paths (write_checkpoint and write_timestep), the state-as-dataclass serialisation contract for solver-internal Python state, the three-backend story (in-memory + on-disk-full-state + existing write_timestep unchanged), schema versioning, the swarm population-generation counter, eight architectural work items in dependency order, scope boundaries, and open implementation questions. Baseline-of-record for the feature/in-memory-checkpoint branch. Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
First true unitary checkpoint in UW3. Captures mesh coordinates and
mesh-variable global-vector DOFs across every registered mesh into a
plain-Python token; restores back onto the same Model instance within
the same process, bit-equivalent. Distinct from the existing per-variable
write_timestep/read_timestep path, which continues to serve visualisation
and partial restart unchanged.
What lands:
- src/underworld3/checkpoint/ — new module
- backend.py: CheckpointBackend Protocol + InMemoryBackend (eager
copy on save and load; tokens hold numpy data only, never PETSc
handles, so DM-lifecycle hazards do not apply)
- snapshot.py: Snapshot dataclass + snapshot()/restore() routines.
Restore order: mesh coords via _deform_mesh() → MV gvec write +
globalToLocal sync. Within-process invalidation gate:
_mesh_version mismatch raises SnapshotInvalidatedError before any
write happens.
- src/underworld3/model.py — Model.snapshot()/restore() thin delegates
- tests/test_0007_snapshot_inmemory.py — 6 tier-A level-1 tests:
scalar/vector MV roundtrip, snapshot independence from later writes,
_mesh_version invalidation, type rejection, NotImplementedError on
path= (v1.1 scope).
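The eager-copy behaviour described for the in-memory backend can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in backend.py: `ToyInMemoryBackend`, its `save`/`load` methods, and the token layout are hypothetical names chosen for the example. Only the principle (tokens hold copied numpy data, never live handles) is taken from the commit message.

```python
import numpy as np


class ToyInMemoryBackend:
    """Hypothetical stand-in for an in-memory checkpoint backend."""

    def save(self, arrays):
        # Eager copy on save: the token owns its data outright.
        return {name: np.array(a, copy=True) for name, a in arrays.items()}

    def load(self, token):
        # Eager copy on load: callers cannot mutate the stored token.
        return {name: a.copy() for name, a in token.items()}


live = {"T": np.linspace(0.0, 1.0, 5)}
backend = ToyInMemoryBackend()
token = backend.save(live)

live["T"][:] = -1.0              # scribble on the live field after the snapshot
restored = backend.load(token)   # the token is unaffected by the scribble
live["T"][:] = restored["T"]     # write the captured values back in place
```

Because both directions copy, a later write to the live array cannot leak into the token, which is the "snapshot independence from later writes" property the tests check.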
Design open-question resolutions:
- Q1 module location: src/underworld3/checkpoint/ (top-level sibling,
room to grow into swarm + state-as-dataclass + on-disk in v1.1+)
rather than the persistence.py stub.
- Q2 PETSc API for in-memory capture: Vec.array (numpy view) +
subdm.createSubDM + globalToLocal is sufficient. No Viewer needed.
- Q3 restore order verified empirically: mesh._deform_mesh first
(rebuilds coord caches + callbacks), then per-var gvec write +
globalToLocal sync; _stale_lvec flagged so downstream caches refresh.
- Q4 memory budget: not yet measured; deferred to a later PR with a
realistic coupled-physics setup.
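The "refuse before any write happens" gate can be illustrated with a toy mesh. The sketch below assumes a hypothetical `ToyMesh` class and free functions `capture`/`restore`; the `SnapshotInvalidatedError` defined here is a local stand-in that only borrows the name used in the commit message.

```python
import numpy as np


class SnapshotInvalidatedError(RuntimeError):
    """Local stand-in for the error named in the commit message."""


class ToyMesh:
    """Hypothetical mesh wrapper with a version counter."""

    def __init__(self, coords):
        self.coords = np.asarray(coords, dtype=float)
        self._mesh_version = 0

    def deform(self, new_coords):
        # Any coordinate change bumps the version counter.
        self.coords = np.asarray(new_coords, dtype=float)
        self._mesh_version += 1


def capture(mesh):
    return {"version": mesh._mesh_version, "coords": mesh.coords.copy()}


def restore(mesh, payload):
    # Gate first: refuse before touching any live data.
    if payload["version"] != mesh._mesh_version:
        raise SnapshotInvalidatedError(
            f"mesh version {mesh._mesh_version} != snapshot {payload['version']}"
        )
    mesh.coords[...] = payload["coords"]


mesh = ToyMesh([[0.0, 0.0], [1.0, 0.0]])
snap = capture(mesh)
mesh.deform(mesh.coords + 0.5)        # version changes after capture
before = mesh.coords.copy()
try:
    restore(mesh, snap)
    refused = False
except SnapshotInvalidatedError:
    refused = True
```

Checking the version before the first write means a refused restore leaves the live state exactly as it was.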
Deviation from PR 1 plan: the plan also mentioned refactoring
Mesh.write_checkpoint() to call the new protocol. Skipped here because
that path is write-only (no read_checkpoint exists), so refactoring
without an exercising load-half is risk without value. The HDF5
backend lands with v1.1 on-disk full-state, where it is load-bearing
and the protocol shape can be validated against both backends.
Not yet covered (subsequent PRs):
- swarm coverage + _population_generation counter (PR 2)
- state-as-dataclass contract + DDt retrofit (PR 3)
- parameter mutation history + CI check (PR 4)
- on-disk full-state backend (v1.1; PR 5)
- schema versioning + migration registry (PR 6)
- cross-process restore + broader test suite (PR 7)
See docs/developer/design/in_memory_checkpoint_design.md for the full
roadmap.
Extends the unitary snapshot to capture per-rank swarm positions and
user swarm-variable data. Adds Swarm._population_generation, a counter
bumped at every particle-population mutation site, used as the
within-process invalidation gate: restoring a snapshot taken before a
populate / migrate / add_particles / remesh event raises
SnapshotInvalidatedError rather than silently corrupting a now-stale
position array.
Counter init + bump sites (src/underworld3/swarm.py):
- Swarm.__init__: initialise to 0 next to _mesh_version.
- populate(): bump once at the end (covers the 1-3 internal addNPoints
  calls in the populate body).
- Swarm.migrate() after the migration_disabled early-exit: bump
  unconditionally; conservative even when migrate is a no-op, because
  under-bumping risks silent corruption while over-bumping is safe.
- add_particles_with_coordinates() after its direct self.dm.migrate():
  this path doesn't go through Swarm.migrate so we bump explicitly.
- add_particles_with_global_coordinates() right after addNPoints:
  catches the migrate=False case too; the migrate=True path will
  double-bump via Swarm.migrate, which is fine.
- advection() remesh path after the addNPoints reinjection.
Snapshot extensions (src/underworld3/checkpoint/snapshot.py):
- New fields: swarm_keys, swarm_generations, swarm_mesh_versions,
  swarmvar_names.
- _capture_swarm: reads DMSwarmPIC_coor via dm.getField → copy →
  restoreField; iterates swarm.vars excluding DMSwarm* internals;
  records both _population_generation and _mesh_version.
- _restore_swarm: validates both counters before any write; writes back
  positions + per-var data in place. Deliberately bypasses
  populate/add_particles/migrate so the restore itself does not bump
  the counter or mutate the population we just confirmed stable.
Test coverage (tests/test_0007_snapshot_inmemory.py): 5 new tests on top
of the 6 mesh-only tests:
- swarm positions + user-variable roundtrip after scribble
- counter bumps on populate, migrate, add_particles_with_coordinates,
  add_particles_with_global_coordinates (in monotonic order)
- migrate-between-snapshot-and-restore raises SnapshotInvalidatedError
- add_particles-between-snapshot-and-restore raises likewise
- DMSwarm_* internal variables stay out of the captured key set
Not yet covered: cross-process restore (v1.1), advection remesh
invalidation test (needs a recycle-enabled swarm + a velocity field,
larger setup than belongs in the core 0007 file).
…PR 2)" This reverts commit 001f961.
The earlier design draft proposed Swarm._population_generation as an
invalidation gate: counter mismatch between capture and restore would
raise SnapshotInvalidatedError. That is wrong. The whole point of the
toolkit is to undo intervening state changes, including particle
motion, migration, and repopulation. Refusing on counter mismatch
breaks the central use cases (RK staging, backtrack-on-instability,
adaptive Δt retry, all of which migrate particles between capture and
restore).
Corrected swarm semantics:
- Restore rebuilds the swarm's local population: clear current
  particles, re-add at captured per-rank coords via
  add_particles_with_coordinates(..., migrate=False), write captured
  per-variable data back into the new particles in order.
- The _population_generation counter stays as informational metadata
  (logging, cache invalidation in other consumers, possible future
  fast-path optimisations), but it is not a restore gate.
Mesh-adapt scope boundary reframed:
- v1 keeps _mesh_version mismatch as a refusal, because the captured
  DOF arrays don't fit a different DM's section.
- v1.2 will replace the refusal with a mesh-rebuild path on the same
  rebuild-on-restore principle: destroy the post-adapt DM, rebuild the
  pre-adapt one from captured topology + section, re-bind all
  MeshVariable / Swarm / solver wrappers.
- v1 captures the topology / section info even though v1 restore
  ignores it, so the snapshot payload is forward-compatible with v1.2
  without a schema bump.
Architectural-work item 5 updated to match: snapshot captures per-rank
particle coords + per-var arrays; restore clears + re-adds + writes;
counter is informational, not a gate.
…2 redo)
Replaces the reverted counter-as-gate PR with rebuild-on-restore
semantics for swarms. Restore now succeeds across the cases the earlier
design wrongly refused (migrate, add_particles, repopulate between
snapshot and restore) — these are precisely the cases the snapshot
toolkit exists to enable (RK staging, backtrack on instability,
adaptive Δt retry).
Design changes since the reverted PR:
- Swarm._population_generation stays, but is now purely informational:
bumped at every population-mutation site for logging / debugging /
downstream caches, but NOT consulted by restore. Restore rebuilds
the local population from the snapshot regardless of intervening
mutations.
- Snapshot is keyed by stable name (mesh.name for meshes,
f"swarm_{instance_number}" for swarms), not Python id(). Forward-
compat for v1.1 cross-process restore and v1.2 mesh-rebuild after
mesh.adapt() (where the wrapper survives but its DM is destroyed).
- Restore logic moved off the snapshot module and onto wrapper
methods: Mesh.apply_snapshot_payload and Swarm.apply_snapshot_payload.
v1 implementations write back in place; v1.2's Mesh implementation
can switch to rebuild-from-payload without touching snapshot.py.
- Snapshot payloads include a reserved "topology": None slot on the
mesh side, populated in v1.2 with section/DM-topology data
sufficient to rebuild the DM. v1 leaves it None; the schema doesn't
need to bump when v1.2 lands.
Mesh restore (Mesh.apply_snapshot_payload at
discretisation_mesh.py:2570):
- Verify _mesh_version matches (v1 refusal; v1.2 will rebuild here).
- Write coords via _deform_mesh, write per-MV gvec arrays + sync to
local vec.
Swarm restore (Swarm.apply_snapshot_payload at swarm.py:4084):
- Drop every current local particle via dm.removePoint() (O(N) total,
removes-from-end is O(1) per call).
- addNPoints(n_saved), write coords directly to DMSwarmPIC_coor, set
ranks. Deliberately bypasses add_particles_with_coordinates (which
filters via points_in_domain and triggers migrate — both
unnecessary here since saved coords were local at capture and the
mesh hasn't changed).
- Invalidate _canonical_data caches so subsequent var.data accesses
re-resolve from PETSc.
- Write captured per-variable data back in particle-order. Internal
DMSwarm_* variables are filtered out at capture.
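The rebuild-on-restore steps for swarms can be sketched on a toy particle store. Everything here is hypothetical: `ToySwarm` is a plain numpy container, not the UW3 Swarm, and it only borrows the method names `snapshot_payload` / `apply_snapshot_payload` and the `_population_generation` counter from the text to show the intended semantics.

```python
import numpy as np


class ToySwarm:
    """Hypothetical per-rank particle store (not the UW3 Swarm)."""

    def __init__(self, coords):
        self.coords = np.array(coords, dtype=float)
        self.vars = {"material": np.zeros(len(self.coords))}
        self._population_generation = 0

    def add_particles(self, new_coords):
        new_coords = np.atleast_2d(new_coords)
        self.coords = np.vstack([self.coords, new_coords])
        for name, data in self.vars.items():
            self.vars[name] = np.concatenate([data, np.zeros(len(new_coords))])
        self._population_generation += 1

    def snapshot_payload(self):
        return {
            "coords": self.coords.copy(),
            "vars": {k: v.copy() for k, v in self.vars.items()},
        }

    def apply_snapshot_payload(self, payload):
        # Rebuild rather than validate: drop the current population,
        # re-create it at the captured size, then write data in order.
        self.coords = payload["coords"].copy()
        self.vars = {k: v.copy() for k, v in payload["vars"].items()}
        # Informational only: a restore is itself a population change.
        self._population_generation += 1


swarm = ToySwarm([[0.1, 0.1], [0.9, 0.9]])
swarm.vars["material"][:] = [1.0, 2.0]
payload = swarm.snapshot_payload()

swarm.add_particles([[0.5, 0.5]])       # population changes after capture
swarm.apply_snapshot_payload(payload)   # restore still succeeds
```

The counter is never consulted during restore; it only records that the population changed, which matches the "informational, not a gate" design.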
Tests (11 total):
- 6 mesh-only tests preserved.
- 5 swarm tests, including the critical positive tests
test_swarm_restore_after_migrate and test_swarm_restore_after_add_particles.
Those are the cases the reverted PR wrongly raised on; they're now
the central proof that the design works.
Regression: 35 existing core tests pass unchanged.
Introduces the state-as-dataclass serialisation contract from the
design note and applies it to the canonical DDt flavor (Symbolic).
PR 4 will mechanically extend the same pattern to Eulerian,
SemiLagrangian, Lagrangian, and Lagrangian_Swarm — they share the
same dt_history / history_initialised / n_solves_completed / dt
core; the variation is purely in how psi_star is bound.
New infrastructure:
- src/underworld3/checkpoint/state.py:
- SnapshottableState dataclass base with _schema_version field
(load-bearing for v1.1 on-disk migration; checked for strict
equality in v1 since capture and restore are same-process).
- Snapshottable runtime_checkable protocol — anything with a
.state attribute returning a SnapshottableState. Drives
discovery in Model.snapshot().
- Model._state_bearers (WeakSet) + Model._register_state_bearer():
state-bearing helpers self-register on construction without
pinning their lifetime.
DDt Symbolic retrofit (option (B)-style adapter per design note):
- DDtSymbolicState dataclass with dt_history, history_initialised,
n_solves_completed, dt, psi_star.
- Symbolic gets a `state` property (builds the dataclass from the
existing private attrs on read) and a `state.setter` (unpacks,
validates schema version + dt_history length, writes attrs back,
re-derives BDF/AM coefficient values so downstream reads see the
restored state immediately rather than waiting for the next
update_pre_solve).
- Symbolic.__init__ auto-registers with the default model.
Snapshot/restore wiring:
- Snapshot.state_bearers: list of (stable_key, state_dataclass)
with stable_key = f"{type(obj).__name__}_{obj.instance_number}".
- snapshot() iterates Model._state_bearers, deepcopies obj.state,
stores the copy. Deepcopy isolates the snapshot from later
mutation of the live state-bearer.
- restore() matches captured states to current state-bearers by
stable_key, deepcopies, writes via obj.state setter. Missing
state-bearer → SnapshotInvalidatedError.
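A compact sketch of the state-as-dataclass contract is given below. It assumes a hypothetical `ToyStepper` helper and `ToyStepperState` dataclass; `SnapshottableState` and `Snapshottable` are re-declared locally so the example is self-contained, and their shapes are inferred from the description above rather than copied from state.py.

```python
import copy
import weakref
from dataclasses import dataclass, field
from typing import List, Protocol, runtime_checkable


@dataclass
class SnapshottableState:
    _schema_version: int = 1


@runtime_checkable
class Snapshottable(Protocol):
    @property
    def state(self) -> SnapshottableState: ...


@dataclass
class ToyStepperState(SnapshottableState):
    dt_history: List[float] = field(default_factory=list)
    n_solves_completed: int = 0


class ToyStepper:
    """Hypothetical state-bearer using the adapter style."""

    _count = 0

    def __init__(self, registry):
        self.instance_number = ToyStepper._count
        ToyStepper._count += 1
        self._dt_history = [0.0, 0.0]
        self._n = 0
        registry.add(self)           # self-register without pinning lifetime

    @property
    def state(self):
        # Build the dataclass from private attributes on read.
        return ToyStepperState(dt_history=list(self._dt_history),
                               n_solves_completed=self._n)

    @state.setter
    def state(self, s):
        # Validate, then unpack back onto the private attributes.
        if s._schema_version != 1:
            raise ValueError("schema version mismatch")
        if len(s.dt_history) != len(self._dt_history):
            raise ValueError("dt_history length mismatch")
        self._dt_history = list(s.dt_history)
        self._n = s.n_solves_completed


def stable_key(obj):
    return f"{type(obj).__name__}_{obj.instance_number}"


registry = weakref.WeakSet()
stepper = ToyStepper(registry)
stepper._dt_history, stepper._n = [0.1, 0.2], 2

captured = {stable_key(o): copy.deepcopy(o.state) for o in registry}
stepper._dt_history, stepper._n = [9.9, 9.9], 7      # later mutation
for o in registry:
    o.state = copy.deepcopy(captured[stable_key(o)])
```

The deep copies on both capture and restore keep the stored state isolated from the live helper, and the `WeakSet` lets helpers be garbage-collected without explicit deregistration.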
Drive-by fix in Mesh.snapshot_payload (caught by the DDt tests):
mesh variables with _gvec=None (lazy allocation: var created but
never written to) are now skipped during capture rather than
crashing on var._gvec.array. Restore correspondingly only touches
variables present in payload["vars"], so an unallocated-at-capture
variable is left in its current state.
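The skip-on-capture, touch-only-captured-on-restore rule can be shown with plain dictionaries. This sketch uses `None` to model a never-allocated variable; the helper names `capture_vars` and `restore_vars` are hypothetical.

```python
import numpy as np

live = {"T": np.array([1.0, 2.0]), "p": None}   # "p" was never written to


def capture_vars(variables):
    # Skip variables whose backing array was never allocated.
    return {k: v.copy() for k, v in variables.items() if v is not None}


def restore_vars(variables, payload):
    # Touch only what was captured; everything else stays as it is now.
    for k, saved in payload.items():
        variables[k][...] = saved


payload = capture_vars(live)
live["T"][:] = 0.0
live["p"] = np.array([5.0])        # allocated after the snapshot
restore_vars(live, payload)
```

After the restore, `T` holds its captured values while `p`, which had no data at capture time, keeps whatever it holds now.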
Tests (6 new on top of the 11 mesh+swarm ones):
- DDt auto-registers in Model._state_bearers
- .state returns a SnapshottableState with correct schema version
- mid-trajectory snapshot+restore recovers dt_history,
history_initialised, n_solves_completed, dt
- wrong _schema_version on apply raises ValueError
- dt_history length mismatch on apply raises ValueError
- snapshot is a deep copy: scribbling live DDt internals doesn't
leak into the captured state-bearer payload
17/17 snapshot tests pass; 41 existing core + DDt tests still green.
…agrangian/Lagrangian_Swarm) (PR 4)
Mechanically extends the PR 3 Snapshottable contract from Symbolic to
the other four DDt flavors. Each gets:
- A flavor-specific State dataclass inheriting from a new _DDtCoreState
  base that carries the shared dt_history / history_initialised /
  n_solves_completed / dt fields.
- A .state property building the dataclass on read and a .state.setter
  unpacking on write, re-deriving BDF/AM coefficients so post-restore
  reads are consistent without waiting for the next update_pre_solve.
- Self-registration with the default model in __init__ (try/except for
  safety when no model is active).
State dataclasses (in src/underworld3/systems/ddt.py):
- _DDtCoreState: shared fields. Subclasses add psi_star representation
  specific to the flavor.
- DDtSymbolicState (already present, refactored to inherit base):
  psi_star is a list of sympy expressions.
- DDtEulerianState: psi_star is a list of MeshVariables; State carries
  their clean_names for restore-side verification (the actual DOF
  arrays travel via the mesh-variable snapshot path).
- DDtSemiLagrangianState: same as Eulerian plus optional
  forcing_star_var_name and with_forcing_history flag for ETD-2
  Maxwell-relaxation integration.
- DDtLagrangianState / DDtLagrangianSwarmState: psi_star is a list of
  SwarmVariables on the DDt's swarm; data travels via the
  swarm-variable path.
Notes:
- SemiLagrangian's update_pre_solve hardcodes theta=0.5 (the class
  doesn't accept a theta arg in __init__), so the state setter matches
  that rather than using self.theta, which doesn't exist.
- Lagrangian itself has a pre-existing AttributeError in __init__
  (references uw.swarm.UWSwarm which doesn't exist). The retrofit code
  is in place and follows the same pattern; consumers that construct
  Lagrangian via the higher-level solver pathways will get .state /
  .state.setter / registration automatically. The pre-existing bug is
  out of scope for this PR but worth flagging.
- ParameterRegistry retrofit deferred. The class isn't currently wired
  into Model anywhere in core code (only mentioned in a docstring
  example). Retrofitting now would be dead code; the retrofit lands
  together with the real consumer in a follow-up.
Tests (3 new on top of PR 3's 6 DDt tests, for 20 total):
- Eulerian DDt roundtrip via manual primary-state mutation
- SemiLagrangian DDt roundtrip
- Lagrangian_Swarm DDt registration + state-type check (no roundtrip:
  advection needs a velocity-field setup beyond a core unit test)
Roundtrips on Eulerian/SemiLagrangian exercise the .state property and
.state.setter directly rather than running full projection solves: the
BDF/AM coefficient re-derivation happens in the setter, so manual
primary-state mutation is sufficient to validate the retrofit logic.
20/20 snapshot tests pass; 41 existing core + DDt tests green.
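The shared-core-plus-flavor layout of the state dataclasses might look roughly like the sketch below. The class names `ToyCoreState` and `ToyEulerianState`, the field `psi_star_names`, and the helper `check_names` are hypothetical; only the four shared field names and the idea of carrying variable names for restore-side verification come from the commit message.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyCoreState:
    # Shared fields named in the commit message.
    dt_history: List[float] = field(default_factory=list)
    history_initialised: bool = False
    n_solves_completed: int = 0
    dt: float = 0.0


@dataclass
class ToyEulerianState(ToyCoreState):
    # Names only: the bulk arrays are assumed to travel by another path.
    psi_star_names: List[str] = field(default_factory=list)


def check_names(state, live_names):
    """Restore-side verification that the same variables are still bound."""
    if list(state.psi_star_names) != list(live_names):
        raise ValueError(
            f"psi_star mismatch: {state.psi_star_names} vs {list(live_names)}"
        )


saved = ToyEulerianState(dt_history=[0.1, 0.1], history_initialised=True,
                         n_solves_completed=2, dt=0.1,
                         psi_star_names=["psi_star_0", "psi_star_1"])
check_names(saved, ["psi_star_0", "psi_star_1"])   # passes silently

try:
    check_names(saved, ["something_else"])
    raised = False
except ValueError:
    raised = True
```

Keeping only names in the state object avoids duplicating bulk data that another capture path already carries.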
Companion to the snapshot toolkit design note (PR 0) and the Snapshottable / DDt retrofit implementation (PRs 3, 4). Audience is developers adding new solver-internal helper classes; the guide covers what goes in a State dataclass, when to use option (B) adapter vs option (C) authoritative-store, what NOT to capture (PETSc handles, bulk arrays already carried by mesh-var / swarm-var paths), how schema versioning is intended to work in v1.1, and a minimal roundtrip test pattern. Closes the doc gap noted in PR 3's commit message; rounds out v1 of the snapshot toolkit.
The Lagrangian DDt flavor has been unconstructible since 2025-07-07
(commit 0778b7d, "Fix uw.function.evaluate / eliminate evalf"), which
mistyped uw.swarm.Swarm(mesh) as uw.swarm.UWSwarm(mesh) while editing
nearby code. UWSwarm does not exist and never has, so every direct
construction of Lagrangian since that commit has died with
AttributeError. Higher-level solver pathways that wrap Lagrangian were
presumably also broken; consumers may have silently been using
Lagrangian_Swarm or another flavor as a workaround.
One-character fix: revert that line. No other UWSwarm references exist
in the tree.
This bug surfaced during the snapshot toolkit work (PR 4) when the
state-as-dataclass retrofit included a Lagrangian roundtrip test that
couldn't run. With the fix in place, the test now runs and passes; it
is included in the same commit so the bug fix and the proof it works
land together.
Should be cherry-picked to development; the typo is unrelated to the
snapshot feature work it surfaced from.
Every previous snapshot test is unit-style — build a thing, snapshot,
scribble, restore, check equality. None exercise the actual use case
that motivated the toolkit: detect a bad step, snap back, retry.
This commit adds one focused end-to-end test covering the canonical
adaptive-Δt CFL workflow.
The test simultaneously exercises all three captured state surfaces
in one realistic story:
- A swarm with an outward-radial velocity field carries particles
outward at known speeds.
- A material variable on the swarm carries a per-particle marker
(initial x-coord), so we can prove particle identity is recovered,
not just particle count.
- A Symbolic DDt accumulates BDF history (manually advanced past
startup), so the state-bearer / state-as-dataclass path also gets
exercised.
The flow:
1. Snapshot before the speculative step.
2. Take a candidate Δt = 0.5 → max displacement ~0.27, ~6× the
cell radius. CFL violated; the consumer's check trips.
3. model.restore(snap): particle positions, material data, and
DDt history all roll back to the snapshot point.
4. Retry at Δt = 0.05 → max displacement ~0.033, sub-cell. CFL
satisfied; state evolves cleanly.
The Δt and threshold values come from a probe run on the same mesh
(min_radius ≈ 0.044; |V| ≤ 0.71 at corners), so the CFL violation
on the candidate step is a real physical observation rather than a
parameter tweak. Smaller dt → strictly smaller displacement gives a
robust ratio-based assertion that doesn't rely on exact numbers.
This is the test pattern that consumers (RK staging, adaptive Δt,
predictor-corrector, regime-change feeling-out) will adapt. v1 of
the snapshot toolkit is genuine, not just unit-tested.
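The four-step flow above can be sketched with plain numpy standing in for the swarm. This is an illustrative sketch only: the particle set, the `velocity`/`step` helpers, and the use of an array copy as the "snapshot" are assumptions for the example. The two Δt values and the cell radius are the figures quoted in the commit message.

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.uniform(-0.5, 0.5, size=(50, 2))   # toy particle cloud
cell_radius = 0.044          # value quoted in the commit message


def velocity(x):
    return x                 # outward-radial field, |V| grows with radius


def step(x, dt):
    return x + dt * velocity(x)


def max_displacement(old, new):
    return np.linalg.norm(new - old, axis=1).max()


snap = positions.copy()                      # 1. snapshot
trial = step(positions, 0.5)                 # 2. candidate step
first_disp = max_displacement(snap, trial)
if first_disp > cell_radius:                 # CFL check trips
    positions = snap.copy()                  # 3. restore
    trial = step(positions, 0.05)            # 4. retry with smaller dt
second_disp = max_displacement(snap, trial)
positions = trial
```

The assertion style used in the test is ratio-based: the retry displacement only needs to be strictly smaller than the candidate one and below the cell radius.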
Every prior test proved *state equality after restore*. That is
necessary but not the guarantee a backtracking consumer actually
relies on. The real guarantee — "git stash for steps": a discarded
speculative step leaves zero trace after restore + continuation —
was untested. This commit adds it, asserted bit-for-bit
(np.array_equal, no tolerance).
Two tests, both with a live swarm + driven mesh variable + Symbolic
DDt so the mesh -> swarm -> state-bearer restore ordering is
exercised together:
- test_continuation_deterministic_after_restore:
snapshot S -> K steps -> A; restore(S) -> K steps -> B.
A == B bit-for-bit. Proves restore leaves no residual state that
perturbs subsequent evolution.
- test_continuation_bit_identical_across_stash_and_recover:
control: S -> K good steps -> ctrl
stash: S -> disruptive 10x-dt step -> restore(S)
-> same K good steps -> stash
ctrl == stash bit-for-bit. The regretted step leaves no trace.
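The control-versus-stash comparison can be reduced to a toy example in which a deterministic array update stands in for one timestep. The `advance` and `run` helpers are hypothetical, and an array copy plays the role of snapshot and restore.

```python
import numpy as np


def advance(state, dt):
    # Deterministic toy update standing in for one timestep.
    return state + dt * np.sin(state)


def run(state, dt, k):
    for _ in range(k):
        state = advance(state, dt)
    return state


s0 = np.linspace(0.0, 1.0, 8)
K, dt = 5, 0.01

ctrl = run(s0.copy(), dt, K)          # control: S -> K good steps

snap = s0.copy()                      # stash: snapshot S
live = advance(s0.copy(), 10 * dt)    # disruptive 10x-dt step
live = snap.copy()                    # restore(S)
stash = run(live, dt, K)              # same K good steps

identical = np.array_equal(ctrl, stash)   # exact comparison, no tolerance
```

Using `np.array_equal` rather than a tolerance-based check is what makes this a statement about residual state: any leftover trace of the discarded step would show up as a mismatch.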
This also closes the #3 concern (does Mesh.apply_snapshot_payload's
_deform_mesh call disturb a registered swarm before the swarm
restore runs?). Both tests have a swarm and a mesh variable live
through restore; the mesh restore calls _deform_mesh with unchanged
coords, then swarm restore, then DDt restore. If the mesh restore
perturbed the swarm, continuation would not be bit-identical. It is.
For v1 scope this fully covers the _deform_mesh-on-restore path,
because a *deformed* mesh (bumped _mesh_version) is refused on
restore anyway — the only path that runs _deform_mesh on restore is
the same-coords path these tests now cover.
Remaining production blockers (unchanged by this commit): parallel
(MPI) is still untested and the swarm rebuild deliberately bypasses
migration; no real-solver test; memory cost unmeasured.
24 snapshot tests, 24 regression tests, all green.
The one real production blocker was "works everywhere" — i.e. correct
under MPI. This adds a parallel ptest and confirms the design intent:
swarm restore is a per-rank reconstruction, not a redistribution, so
the global state is exactly rebuilt under cross-rank migration
provided the rank count is unchanged (the documented v1 scope).
ptest_0007_snapshot_inmemory.py (mesh + swarm + per-particle global-id
tag + material + Symbolic DDt; rotation field that circulates
particles across the strip partition). Three collective properties,
asserted on rank 0:
P1 restore recovers the exact global particle count. The
disruptive step is deliberately *allowed* to lose particles
across ranks (advect out / clip) — that is exactly the failure
a stash-and-restore exists to undo.
P2 exact reconstruction: gather (gid, x, y, material) from every
rank, sort by global id, np.array_equal pre-step vs
post-restore. Order- and rank-independent — the real proof
that per-rank reconstruction yields the correct global state.
P3 bit-identical continuation across a stash, in parallel.
Results:
-np 1 : 2052 particles, all properties pass.
-np 3 : 2013 particles; disruptive step loses 28 across ranks;
restore recovers all 2013 exactly; P2/P3 bit-for-bit.
-np 4 : 2006 particles; disruptive step loses 35 across ranks;
restore recovers all 2006 exactly; P2/P3 bit-for-bit.
The genuinely strong result: the toolkit demonstrably recovers from
real cross-rank particle loss — the exact production scenario it
exists for — with bit-identical continuation afterwards.
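Property P2 (gather, sort by global id, compare exactly) can be sketched without MPI by treating a Python list as the result of a gather to rank 0. The function name `gather_sorted` and the toy particle rows are assumptions for illustration; in a real parallel run the per-rank list would come from an MPI gather.

```python
import numpy as np


def gather_sorted(per_rank):
    """Concatenate (gid, x, y, material) rows from all ranks, sort by gid."""
    rows = np.vstack(per_rank)
    return rows[np.argsort(rows[:, 0], kind="stable")]


# Pre-step layout: particles 0-2 on "rank 0", 3-5 on "rank 1".
pre = [
    np.array([[0, 0.1, 0.1, 1.0], [1, 0.2, 0.1, 1.0], [2, 0.3, 0.1, 2.0]]),
    np.array([[3, 0.7, 0.1, 2.0], [4, 0.8, 0.1, 1.0], [5, 0.9, 0.1, 2.0]]),
]
# Post-restore layout: same particles, different local ordering.
post = [pre[0][::-1].copy(), pre[1][[1, 0, 2]].copy()]

same_count = sum(len(r) for r in pre) == sum(len(r) for r in post)   # P1
same_state = np.array_equal(gather_sorted(pre), gather_sorted(post))  # P2
```

Sorting by a per-particle global id removes any dependence on local ordering or on which rank holds a particle, so the comparison tests the global state only.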
Registered in mpi_runner.sh at -np 1 / 3 (uneven) / 4.
Production-blocker status: parallel correctness now confirmed (was
the gate). Remaining items are confidence/hardening only — a
real-solver test (SNES state is negligible: previous solution
travels via MeshVariable, already captured) and the accepted,
documented in-memory memory cost (mitigation: route through the
v1.1 on-disk backend via a flag).
…loor characterised
Closes the last confidence gap: snapshot/restore driven by an actual
PETSc solver (AdvDiffusion, which carries an internal SemiLagrangian
DDt with an auxiliary projection SNES + nodal trace-back swarm),
through the stash-and-recover loop.
Investigation findings (each verified by a standalone diagnostic):
* AdvDiffusion solve is bit-deterministic — two independent
identical runs with no snapshot are np.array_equal (max|d| = 0.0).
So any drift introduced by snapshot/restore is a real fidelity
question, not solver noise.
* restore() recovers the primary solution field T bit-exactly.
* THE core "git stash for steps" guarantee holds bit-for-bit even
through real solves:
restore -> regretted absurd-dt solve -> restore -> K solves
is np.array_equal to
restore -> K solves
The discarded step leaves zero trace (B == C, max|d| = 0.0).
* The only residual is restore's reproducibility floor against a
*never-snapshotted* control: ~7e-7 here. Mechanism: restore
resyncs fields through gvec->lvec rather than reproducing the
solver-produced lvec exactly; the implicit diffusion operator
amplifies that to solver-tolerance level over steps. This is NOT
contamination from the discarded step (proven by B == C), it is
the cost of round-tripping through the snapshot representation,
within solver tolerance, and consistent with the design intent
that auxiliary solver state is intentionally not captured.
Three tests encode exactly this (no overclaiming):
- test_realsolver_restore_recovers_solution_field (T np.array_equal)
- test_realsolver_regretted_step_leaves_no_trace (B == C, bit-exact)
- test_realsolver_continuation_within_solver_tolerance (vs never-stashed
control: < 1e-5, asserted explicitly non-bit-exact so the test
tightens itself if the floor is ever eliminated)
Honest production statement: discarding a bad step is bit-exact even
through real solvers; recovering to a never-stashed control is within
solver tolerance. Both are correct for the "git stash for steps"
use case.
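The two different strengths of guarantee (bit-exact for the discarded step, tolerance-bounded against a never-stashed control) translate into two different assertion styles. The sketch below uses synthetic arrays as stand-ins for solver output; only the 1e-5 tolerance and the ~7e-7 floor are taken from the commit message, and the variable names are hypothetical.

```python
import numpy as np

RESTORE_FLOOR_ATOL = 1e-5      # tolerance quoted in the commit message
measured_floor = 7e-7          # floor quoted in the commit message

rng = np.random.default_rng(1)
control = rng.random(100)      # never-stashed run (synthetic stand-in)
# "restore -> K solves": differs from control at the level of the floor.
b = control + measured_floor * rng.uniform(-1, 1, 100)
# "restore -> regretted step -> restore -> K solves": identical to b.
c = b.copy()

no_trace = np.array_equal(b, c)                            # bit-exact guarantee
within_tol = np.allclose(b, control, rtol=0.0, atol=RESTORE_FLOOR_ATOL)
not_bit_exact = not np.array_equal(b, control)             # floor still present
headroom = RESTORE_FLOOR_ATOL / measured_floor             # roughly 14x
```

Asserting `not_bit_exact` explicitly means the tolerance-based test fails loudly if the floor is ever eliminated, prompting a switch to the stricter comparison.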
51 tests pass (24 serial snapshot + 3 real-solver + 24 regression);
parallel ptest (np 1/3/4) unchanged.
Pull request overview
Adds an in-memory snapshot/restore toolkit (Model.snapshot() / Model.restore()) intended as a "git stash for timesteps": a unitary state capture that lets time-stepping code back out of a regretted step (RK staging, adaptive Δt retry, predictor-corrector, etc.). It introduces a new underworld3.checkpoint subpackage with a backend abstraction (currently in-memory only; HDF5 v1.1 stubbed), a Snapshottable/SnapshottableState contract, and retrofits all five DDt flavors plus Mesh and Swarm to participate via snapshot_payload/apply_snapshot_payload and .state accessors. Swarm restore uses rebuild-on-restore semantics; mesh restore refuses on _mesh_version change (v1.2 will rebuild). Also fixes the uw.swarm.UWSwarm → uw.swarm.Swarm typo in Lagrangian.__init__ (duplicated in #184).
Changes:
- New underworld3.checkpoint subpackage with Snapshot, InMemoryBackend, SnapshottableState/Snapshottable protocol, and capture/restore orchestration.
- Mesh, Swarm, and all five DDt flavors gain snapshot_payload/apply_snapshot_payload/.state (option-B dataclass adapter); swarm gets an informational _population_generation counter.
- Extensive new tests: serial unit tests, real-solver AdvDiffusion confidence test, and an MPI parallel test (np 1/3/4) plus a developer guide and design note.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/underworld3/checkpoint/__init__.py | Public API surface for the new subpackage. |
| src/underworld3/checkpoint/backend.py | CheckpointBackend protocol + InMemoryBackend. |
| src/underworld3/checkpoint/snapshot.py | Snapshot dataclass and capture/restore orchestration over meshes, swarms, and state bearers. |
| src/underworld3/checkpoint/state.py | SnapshottableState base dataclass + Snapshottable protocol. |
| src/underworld3/model.py | Adds _state_bearers WeakSet, _register_state_bearer, and Model.snapshot()/restore() wrappers. |
| src/underworld3/discretisation/discretisation_mesh.py | Mesh snapshot_payload/apply_snapshot_payload capturing deformed coords + MV gvec DOFs. |
| src/underworld3/swarm.py | Swarm snapshot_payload/apply_snapshot_payload (rebuild-on-restore) and _population_generation counter bumps. |
| src/underworld3/systems/ddt.py | Five DDt state dataclasses + .state getter/setter retrofit; fixes UWSwarm typo. |
| src/underworld3/__init__.py | Imports underworld3.checkpoint. |
| tests/test_0007_snapshot_inmemory.py | Serial unit and end-to-end snapshot/restore tests. |
| tests/test_0008_snapshot_realsolver.py | AdvDiffusion real-solver confidence test. |
| tests/parallel/ptest_0007_snapshot_inmemory.py | MPI snapshot/restore test. |
| tests/parallel/mpi_runner.sh | Adds new parallel test invocations. |
| docs/developer/guides/state-as-dataclass.md | Developer guide for the .state contract. |
| docs/developer/design/in_memory_checkpoint_design.md | Design note for the snapshot toolkit. |
    # state-bearer. Safe if no model is active.
    try:
        import underworld3 as _uw

        _uw.get_default_model()._register_state_bearer(self)
    except Exception:
        pass

    # The clear+re-add path bumped _population_generation already
    # (we don't bump on removePoint, but addNPoints isn't bumped
    # either — these are raw PETSc calls). For consistency with
    # other mutation paths, bump explicitly here.

    # Restore-vs-pristine reproducibility floor for this setup (measured).
    # The regretted-step guarantee is asserted bit-exact (np.array_equal);
    # only the never-stashed-control comparison uses this tolerance.

    # Step 3: write captured per-variable data.
    current_vars = {var.clean_name: var for var in self._vars.values()}
    for var_clean_name, saved in payload["vars"].items():
        var = current_vars.get(var_clean_name)
        if var is None:
            raise SnapshotInvalidatedError(
                f"swarm {self._snapshot_stable_name()!r}: variable "
                f"{var_clean_name!r} from snapshot is not present"
            )
        current = np.asarray(var.data)
        if current.shape != saved.shape:
            raise SnapshotInvalidatedError(
                f"swarm {self._snapshot_stable_name()!r}: variable "
                f"{var_clean_name!r} data shape mismatch — current "
                f"{current.shape} vs snapshot {saved.shape}"
            )
        current[...] = saved
Four fixes for points raised in Copilot's review of the in-memory
snapshot toolkit PR:
1. ddt.py: narrow ``except Exception`` to
   ``except (ImportError, AttributeError)`` at all five DDt
   ``_register_state_bearer`` sites. Only the genuine bootstrap cases
   (import not yet wired during underworld3 init, or older Model
   without the registry method) get swallowed; real registration bugs
   now propagate instead of silently masking the silent-state-loss
   failure mode the design note explicitly warns against.
2. swarm.py: rewrite the contradictory comment in
   ``Swarm.apply_snapshot_payload`` around the explicit
   ``_population_generation += 1`` bump. The previous wording said the
   clear+re-add path had already bumped, which is wrong: neither
   ``removePoint`` nor the raw ``addNPoints`` call here touches the
   counter; the explicit bump is what makes a restore visible to
   downstream consumers as a population change.
3. swarm.py: ``apply_snapshot_payload`` now raises
   ``SnapshotInvalidatedError`` if the live swarm has user variables
   that were not in the snapshot. Previously those "extra" vars
   survived the clear+addNPoints reallocation with uninitialised/stale
   contents, a silent incoherence after restore. Contract is now
   symmetric with the mesh-variable restore (same variable set on both
   sides).
4. test_0008_snapshot_realsolver: added comment explaining the ~14×
   headroom on ``_RESTORE_FLOOR_ATOL = 1e-5`` vs the measured ~7e-7
   floor (PETSc/BLAS/MPI variability allowance on CI).
Regression: 45 single-rank snapshot+core tests pass; parallel ptest at
np-4 still PASS with the new strict-extras check.
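The symmetric variable-set contract from fix 3 can be sketched as a standalone check. The function `check_variable_sets` and the variable names are hypothetical, and `SnapshotInvalidatedError` is a local stand-in that only borrows the name from the commit message.

```python
class SnapshotInvalidatedError(RuntimeError):
    """Local stand-in for the error named in the commit message."""


def check_variable_sets(live_names, payload_names):
    live, saved = set(live_names), set(payload_names)
    missing = sorted(saved - live)     # in the snapshot, gone from the swarm
    extra = sorted(live - saved)       # on the swarm, absent from the snapshot
    if missing or extra:
        raise SnapshotInvalidatedError(
            f"variable set mismatch: missing={missing}, extra={extra}"
        )


# Same set on both sides, order irrelevant: accepted.
check_variable_sets(["material", "gid"], ["gid", "material"])

# A variable added after the snapshot: refused instead of left stale.
try:
    check_variable_sets(["material", "gid", "strain"], ["gid", "material"])
    message = ""
except SnapshotInvalidatedError as err:
    message = str(err)
```

Running the check before any particle is removed or re-added keeps a refused restore free of side effects.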
First slice of the on-disk snapshot format (v1.1). Establishes the file structure and the inspectability bar; no PETSc bulk yet (that is phase 2). Stacked on the in-memory snapshot toolkit (#195) and the model tracker (#196) so it can serialise both later.
What lands:
- src/underworld3/checkpoint/disk_snapshot.py
  - DISK_SNAPSHOT_SCHEMA_VERSION = 1
  - write_snapshot_skeleton(model, path): writes /metadata attrs + empty stub groups /mesh /variables /swarms /python_state (the structure phases 2+ will fill in).
  - read_snapshot_metadata(path): reads /metadata back as a plain dict, decodes JSON-encoded list fields for convenience, validates schema version.
  - inspect_snapshot(path): human-readable summary suitable for print(...) at a notebook prompt.
- src/underworld3/checkpoint/__init__.py: exports.
- tests/test_0010_snapshot_disk_format.py (7, tier_a level_1):
  - top-level group structure matches the spec
  - h5py-readable /metadata attrs cover identity, schema, tracker conventions, geometry, MPI rank count, and inventories of meshes / swarms / state-bearer classes / variables (the proxy for "an external user running h5ls/h5dump sees useful info")
  - read/write roundtrip
  - rejection of non-snapshot files and wrong-schema files with clear errors (not obscure h5py noise)
  - inspect_snapshot includes the key facts
  - skeleton groups carry `filled_by` attrs so phases 2/3 readers and external inspectors can tell whether content is populated yet.
Design notes encoded:
- UW3-controlled rich-metadata wrapper around PETSc bulk; pure PETSc HDF5 dumps fail the inspectability bar so are rejected as the format.
- List-typed metadata stored as JSON strings in scalar attrs so h5py / h5ls handle them cleanly; read API exposes them as plain Python lists alongside the *_json originals.
- Swarm storage left as a phase-3 decision: the metadata wrapper is designed to support `@external_file` on /swarms/swarm_X/ when individual swarms grow too bulky for a single file. No commitment to inline vs split until phase 3 has real swarm sizes in hand.
Stacked on feature/model-tracker; PRs to development after #195 and #196 land.
Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
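The "lists as JSON strings in scalar attrs" design note can be sketched with a plain dict standing in for the HDF5 `/metadata` attrs. The helper names `encode_attrs` / `decode_attrs` and the example keys are hypothetical; only the `_json` suffix convention and the "plain list alongside the original" behaviour come from the commit message.

```python
import json


def encode_attrs(metadata):
    """Flatten metadata so every stored value is a scalar.

    List-typed values are stored as JSON strings under `<key>_json`.
    """
    attrs = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            attrs[key + "_json"] = json.dumps(list(value))
        else:
            attrs[key] = value
    return attrs


def decode_attrs(attrs):
    """Return the attrs plus a decoded plain list for each `*_json` entry."""
    decoded = dict(attrs)  # keep the *_json originals
    for key, value in attrs.items():
        if key.endswith("_json"):
            decoded[key[: -len("_json")]] = json.loads(value)
    return decoded
```

For example, `decode_attrs(encode_attrs({"meshes": ["mesh_0"]}))` yields both `meshes` (a list) and `meshes_json` (the stored string), so an external inspector and the Python reader see the same information.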
…t roundtrip
Builds on phase 1's metadata wrapper to actually carry mesh + mesh-variable state to disk and read it back. Delegates the heavy lifting to #146's `Mesh.write_checkpoint` / `MeshVariable.read_checkpoint` PETSc-DMPlex primitives; phase 2's job is layout, dispatch, and tying the wrapper to the bulk data via a simple convention.
Layout (final v1.1 shape):
/path/to/run.snap.h5 wrapper (h5py-inspectable)
/path/to/run.snap.bulk/ companion directory (one per snap)
  {mesh_safe}.mesh.00000.h5
  {mesh_safe}.{var_clean}.00000.h5
Wrapper carries /meshes/{mesh_safe}/ with @name, @mesh_file, and /meshes/{mesh_safe}/variables/{var_safe}/ with @name, @components, @degree, @continuous, @external_file. The bulk-dir path is derived from the wrapper path by convention (`.h5` → `.bulk`), so no external_file attr is needed for the standard placement. Move them together; a clear FileNotFoundError fires if bulk is missing on read.
Phase 1 layout refactor folded in:
- /mesh (singular) → /meshes (plural): supports multi-mesh natively.
- /variables removed from the top level; it now nests under each mesh as /meshes/{name}/variables/{var}, matching the in-memory snapshot's mesh→vars structure.
New API:
- `write_snapshot(model, path)`: writes wrapper + bulk; covers every registered mesh and every allocated meshvar on each mesh. Lazy-allocated vars (_gvec is None) are skipped, the same rule as the in-memory path.
- `read_snapshot(model, path)`: loads var DOFs back into already-registered meshes by name. Mesh / variable mismatch raises a clear ValueError (mesh-rebuild on read is v1.2 scope).
- `write_snapshot_skeleton` / `read_snapshot_metadata` / `inspect_snapshot` stay as phase-1 metadata-only entry points.
Branch hygiene: merged origin/development (which now has #146) into this branch so the new code can actually call read_checkpoint. The merge was clean: #146 and the snapshot toolkit only overlap at different methods in `discretisation_mesh.py`, as the earlier analysis predicted. PR target will be development once #195/#196 land; the diff stays clean because the merged dev commits are already there.
Tests (12 total, 5 new in phase 2, tier_a level_1):
- write produces wrapper + bulk-dir with the expected file pattern
- wrapper populated with the per-mesh + per-var metadata that makes inspectability self-sufficient
- bit-exact write→scribble→read roundtrip on a 2D mesh with one scalar + one vector variable (np.array_equal, zero tolerance)
- missing bulk-dir → clear FileNotFoundError
- mismatched mesh on read → clear ValueError (not an obscure h5py trace)
Regression: 64 tests pass (24 snapshot + 9 tracker + 12 disk-format + 19 core/regression).
Phase 3 next: swarms (with the @external_file freedom kept open for bulky swarms) + /python_state for DDt + ModelTracker via dataclass-to-HDF5-attrs serialisation.
Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
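The `.h5` → `.bulk` path convention from the phase-2 layout can be sketched with `pathlib`. The function names are hypothetical, and treating `00000` as a zero-padded index is an assumption made for illustration; only the file-name patterns and the FileNotFoundError behaviour come from the commit message.

```python
from pathlib import Path


def bulk_dir_for(wrapper_path):
    """Derive the companion bulk directory from the wrapper path."""
    wrapper = Path(wrapper_path)
    if wrapper.suffix != ".h5":
        raise ValueError(f"expected a .h5 wrapper, got {wrapper.name!r}")
    # run.snap.h5 -> run.snap.bulk (only the last suffix is replaced)
    return wrapper.with_suffix(".bulk")


def mesh_bulk_file(wrapper_path, mesh_safe, index=0):
    """Path of a mesh bulk file; the 5-digit index is an assumption."""
    return bulk_dir_for(wrapper_path) / f"{mesh_safe}.mesh.{index:05d}.h5"


def require_bulk_dir(wrapper_path):
    """Return the bulk directory, failing clearly if it was left behind."""
    bulk = bulk_dir_for(wrapper_path)
    if not bulk.is_dir():
        raise FileNotFoundError(
            f"bulk directory {bulk} is missing; move it together with "
            f"{Path(wrapper_path).name}"
        )
    return bulk
```

Deriving the directory rather than storing it keeps the wrapper relocatable, at the cost that the wrapper and bulk directory must always be moved as a pair.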
Per Louis's direction ("break out the swarm information into a
separate file in the first instance — bulk is a problem with swarms,
always"), swarms always go to their own h5py-direct sidecar from day
one. No inline-vs-split toggle — sidecar is the only path.
Layout:
/path/to/run.snap.h5 wrapper
/path/to/run.snap.bulk/{swarm_safe}.swarm.h5 swarm sidecar (one
per swarm)
Sidecar structure (h5py-native, no PETSc — swarms aren't DMPlex
section/vec):
@num_particles_local, @dim, @mesh_name, @population_generation
/coordinates dataset, (n_local, dim)
/variables/{var_clean_name} dataset, (n_local, num_components)
@num_components, @dtype
The sidecar's top-level @attrs and group structure mean `h5ls -v`
on the sidecar alone tells you "this holds N particles in dim D on
mesh M with these variables" — same inspectability bar as the
wrapper.
Wrapper /swarms/{swarm_safe}/ carries metadata + the @external_file
pointer to the sidecar in the bulk dir.
Restore mirrors the in-memory Swarm.apply_snapshot_payload exactly:
clear local population via dm.removePoint loop, addNPoints at saved
coords, write var data back. Same rebuild-on-restore semantics — the
disk snapshot recovers from a particle-population mutation (added
particles between snapshot and restore) just like the in-memory path
does, proven by test_swarm_restore_recovers_after_particle_count_change.
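The rebuild-on-restore sequence (clear, re-add at saved coordinates, write variable data back) can be illustrated with a toy model. `ToySwarm` is a hypothetical pure-Python stand-in: the real path goes through `dm.removePoint` and `addNPoints` as stated above, and the generation bump on restore is assumed here to carry over from the in-memory path.

```python
import copy


class ToySwarm:
    """Hypothetical list-backed swarm used only to show the restore order."""

    def __init__(self, coords, variables):
        self.coords = [tuple(c) for c in coords]
        self.variables = {name: list(vals) for name, vals in variables.items()}
        self.population_generation = 0

    def capture(self):
        # Eager copy, so later mutation cannot leak into the snapshot.
        return {
            "coords": copy.deepcopy(self.coords),
            "variables": copy.deepcopy(self.variables),
        }

    def add_particle(self, coord, values):
        self.coords.append(tuple(coord))
        for name, value in values.items():
            self.variables[name].append(value)
        self.population_generation += 1

    def restore(self, payload):
        # 1. Clear the local population.
        self.coords.clear()
        for vals in self.variables.values():
            vals.clear()
        # 2. Re-add particles at the saved coordinates.
        self.coords.extend(payload["coords"])
        # 3. Write the saved variable data back.
        for name, vals in payload["variables"].items():
            self.variables[name].extend(vals)
        # Explicit bump so consumers see the restore as a population change.
        self.population_generation += 1
```

Because the population is rebuilt rather than patched, a snapshot taken with two particles still restores to exactly two particles after a third was added.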
Tests (5 new, 21 total tier_a level_1):
- swarm sidecar lands in bulk dir with predictable name; wrapper
records external_file ref + mesh_name + var inventory
- sidecar is self-inspectable via h5py (file-level attrs +
/coordinates + /variables with per-var attrs)
- whole swarm (coords + svar data) round-trips bit-exact through
write → scribble → read
- rebuild-on-restore parity with in-memory path: snapshot, mutate
population, restore → exact local population recovered
- PETSc-internal DMSwarm_* variables filtered at capture (same rule
as in-memory)
MPI: single-rank only in this phase. The current rank-0-only sidecar
write only captures rank 0's local particles in a parallel run.
Phase 6 will either use h5py-mpi parallel HDF5 or per-rank sidecars
to match #195's parallel exact-reconstruction guarantee.
73 tests pass (24 in-memory + 9 tracker + 21 disk-format + 19
core/regression).
Phase 4 next: format detection + dispatch in MeshVariable.read_timestep
so it reads BOTH the legacy per-variable layout AND the new v1.1
sidecar format via the KDTree bridge. Closes the compatibility
commitment from the design discussion.
Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Single user-facing entry point for all snapshot use cases. Same
methods serve in-memory ephemeral stash and on-disk persistent
snapshot — the dispatch is mechanical, the user has one API to
learn:
token = model.save_state() # in-memory, returns Snapshot
model.load_state(token) # restore from token
model.save_state(file="step42.snap.h5") # on-disk, returns path
model.load_state("step42.snap.h5") # restore from disk
# (also: load_state(file=…))
load_state dispatches on argument type — Snapshot → in-memory
restore; str/PathLike → disk restore. Type-mismatched source raises
TypeError with a clear message.
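The type-based dispatch described above can be sketched as below. This is an assumption-laden illustration rather than the `Model.load_state` implementation: the `Snapshot` dataclass is a hypothetical stand-in for the real token, and the function returns a tag instead of performing a restore.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical stand-in for the in-memory snapshot token."""

    payload: dict = field(default_factory=dict)


def load_state(source):
    """Choose the restore path from the argument type."""
    if isinstance(source, Snapshot):
        return ("memory", source)
    if isinstance(source, (str, os.PathLike)):
        # os.fspath normalises str and PathLike to a plain path string.
        return ("disk", os.fspath(source))
    raise TypeError(
        "load_state expects a Snapshot or a str/os.PathLike path, "
        f"got {type(source).__name__}"
    )
```

Checking `Snapshot` first keeps the in-memory path unambiguous even if a future token type were to implement `__fspath__`.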
Renames replace the prior Model.snapshot() / Model.restore() pair
from #195. Pre-merge, no public users to migrate; getting the
user-facing API right now means there is never a disparate version
shipped. uw.checkpoint.{snapshot,restore,write_snapshot,read_snapshot,
read_snapshot_metadata,inspect_snapshot,write_snapshot_skeleton}
stay as power-user / lower-level entry points that save_state /
load_state delegate to.
Files updated (mechanical renames, except the doc rewrite):
- src/underworld3/model.py: save_state / load_state methods replace
snapshot / restore; load_state accepts positional Snapshot or
str/os.PathLike, with TypeError on anything else.
- tests/test_0007_snapshot_inmemory.py — 23 callers renamed; obsolete
test_snapshot_path_is_v1_1_scope deleted (v1.1 has landed).
- tests/test_0008_snapshot_realsolver.py — 3 tests renamed.
- tests/test_0009_model_tracker.py — 9 tests renamed.
- tests/test_0010_snapshot_disk_format.py — 21 tests: replace
uw.checkpoint.write_snapshot / read_snapshot with model.save_state
/ model.load_state at user-style call sites; keep
write_snapshot_skeleton + read_snapshot_metadata where the test is
specifically exercising the lower-level entry points.
- tests/parallel/ptest_0007_snapshot_inmemory.py — np-1/3/4 ptest.
- tests/run_snapshot_backstepping_{demo,spatial}.py — demo scripts.
- docs/advanced/snapshot-restore.md — rewritten API section to show
both modes; added "On-disk file layout" section and a "Choosing
between paths" comparison table covering write_timestep,
write_checkpoint, and save_state. Limitations section updated to
reflect that on-disk is now real (was "in-memory only").
Regression: 75 single-rank tests pass (was 76 — minus the deleted
obsolete v1.1-scope test); MPI ptest at -np 4 still PASS with the
parallel exact-reconstruction guarantee. Docs build clean with no
snapshot-related warnings; the new layout + choosing-between-paths
sections render.
Phase 4 (read_timestep format-aware dispatch for backward compat)
becomes a nice-to-have at this point — save_state / load_state is
the recommended surface, write_timestep / read_timestep keep their
existing role unchanged. Phase 6 (parallel HDF5 / per-rank sidecars
for on-disk MPI) is the remaining correctness item.
Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Summary
Adds Model.snapshot()/Model.restore(): a unitary, in-memory "hold that thought, I might need to come back" mechanism for timesteppers (backtrack-on-instability, adaptive Δt retry, RK staging, predictor-corrector probes). Distinct from the existing per-variable write_timestep path, which is unchanged.
Captured & restored: mesh coordinates, mesh-variable DOFs, swarm particle positions + swarm-variable data (rebuild-on-restore semantics), and solver-internal Python state for all five DDt flavors via a new state-as-dataclass contract.
Design highlights
- 001f961→6d1b635→9399f87) once it was clear that "particles moved" is exactly what restore exists to undo. Per-rank capture + per-rank rebuild = exact global reconstruction.
- Snapshottable contract (src/underworld3/checkpoint/state.py): option (B) derived-dataclass adapters for the retrofitted DDt flavors; documented in docs/developer/guides/state-as-dataclass.md for new code.
- topology slot in the mesh payload, _schema_version on every State dataclass. Mesh-adapt rebuild (v1.2) and on-disk backend (v1.1) can land without a schema bump.
Correctness: proven, serial + parallel + real solver
- np.array_equal, zero tolerance).
- tests/parallel/ptest_0007_snapshot_inmemory.py, np 1/3/4): a disruptive step that loses 28–35 particles across ranks is fully recovered by restore: exact global reconstruction (gather + sort by per-particle gid) and bit-identical continuation.
- tests/test_0008, AdvDiffusion with internal SemiLagrangian DDt): discarding a regretted step is bit-exact even through real PETSc solves (B == C, max|d| = 0.0). Recovering to a never-snapshotted control is within solver tolerance (~7e-7), a documented, characterised restore floor (gvec→lvec resync amplification), by design, not step contamination.
Scope (intentional)
In-memory, same-rank-count, whole-state stash/pop. Out of scope by design: on-disk durability (v1.1 — the mitigation for the accepted in-memory memory cost), selective per-field restore, cross-rank-count restore, mesh-adapt rebuild (v1.2, refused with a clear error in v1).
Notes for the reviewer
- b179183 (Lagrangian UWSwarm typo fix) duplicates already-merged "fix(ddt): Lagrangian.__init__ — uw.swarm.UWSwarm typo" #184; git resolves it cleanly on merge, so no action is needed.
Test plan
- pixi run -e amr-dev pytest tests/test_0007_snapshot_inmemory.py tests/test_0008_snapshot_realsolver.py (27 tests)
- cd tests/parallel && mpirun -np 4 python ./ptest_0007_snapshot_inmemory.py (and np 1, 3)
- pytest tests/test_0000_imports.py tests/test_0002_model.py tests/test_0003_save_load.py tests/test_1052_ddt_set_initial_history.py
Underworld development team with AI support from Claude Code