In-memory snapshot toolkit (git stash for timesteps) by lmoresi · Pull Request #195 · underworldcode/underworld3

lmoresi · 2026-05-19T05:28:34Z

Summary

Adds Model.snapshot() / Model.restore() — a unitary, in-memory "hold that thought, I might need to come back" mechanism for timesteppers (backtrack-on-instability, adaptive Δt retry, RK staging, predictor-corrector probes). Distinct from the existing per-variable write_timestep path, which is unchanged.

Captured & restored: mesh coordinates, mesh-variable DOFs, swarm particle positions + swarm-variable data (rebuild-on-restore semantics), and solver-internal Python state for all five DDt flavors via a new state-as-dataclass contract.

Design highlights

Rebuild-on-restore for swarms (not refuse-on-mutation). An earlier counter-as-gate design was reverted mid-branch (commits 001f961 → 6d1b635 → 9399f87) once it was clear that "particles moved" is exactly what restore exists to undo. Per-rank capture + per-rank rebuild = exact global reconstruction.
State-as-dataclass Snapshottable contract (src/underworld3/checkpoint/state.py) — option (B) derived-dataclass adapters for the retrofitted DDt flavors; documented in docs/developer/guides/state-as-dataclass.md for new code.
v1.2-forward-compat baked in: name-keyed payloads, reserved topology slot in the mesh payload, _schema_version on every State dataclass. Mesh-adapt rebuild (v1.2) and on-disk backend (v1.1) can land without a schema bump.
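
The state-as-dataclass contract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in src/underworld3/checkpoint/state.py: `SnapshottableState`, `Snapshottable` and `_schema_version` are names taken from this PR, while `ToyStepper`, `ToyStepperState` and their fields are hypothetical stand-ins for a DDt flavor.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, runtime_checkable


@dataclass
class SnapshottableState:
    """Base class for state payloads; carries a schema version."""
    _schema_version: int = 1


@dataclass
class ToyStepperState(SnapshottableState):
    # Hypothetical fields, loosely modelled on the DDt core fields.
    dt_history: List[float] = field(default_factory=list)
    n_solves_completed: int = 0


@runtime_checkable
class Snapshottable(Protocol):
    """Anything exposing a `.state` attribute can be discovered."""

    @property
    def state(self) -> SnapshottableState: ...


class ToyStepper:
    """Hypothetical helper; private attributes remain the source of truth."""

    def __init__(self, order: int = 2) -> None:
        self._order = order
        self._dt_history = [0.0] * order
        self._n_solves = 0

    def step(self, dt: float) -> None:
        # Newest dt goes first; the oldest entry falls off the end.
        self._dt_history = [dt] + self._dt_history[:-1]
        self._n_solves += 1

    @property
    def state(self) -> ToyStepperState:
        # Adapter style: build the dataclass from private attributes on read.
        return ToyStepperState(
            dt_history=list(self._dt_history),
            n_solves_completed=self._n_solves,
        )

    @state.setter
    def state(self, new: ToyStepperState) -> None:
        # Validate before writing anything back.
        if new._schema_version != ToyStepperState()._schema_version:
            raise ValueError("schema version mismatch")
        if len(new.dt_history) != self._order:
            raise ValueError("dt_history length mismatch")
        self._dt_history = list(new.dt_history)
        self._n_solves = new.n_solves_completed
```

Reading `.state` returns a fresh dataclass each time, so a captured value is not affected by later steps; assigning to `.state` validates first and only then overwrites the private attributes.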

Correctness — proven, serial + parallel + real solver

Serial: 24 tests, exact roundtrip; bit-identical continuation (np.array_equal, zero tolerance).
Parallel (tests/parallel/ptest_0007_snapshot_inmemory.py, np 1/3/4): a disruptive step that loses 28–35 particles across ranks is fully recovered by restore — exact global reconstruction (gather + sort by per-particle gid) and bit-identical continuation.
Real solver (tests/test_0008, AdvDiffusion with internal SemiLagrangian DDt): discarding a regretted step is bit-exact even through real PETSc solves (B == C, max|d| = 0.0). Recovering to a never-snapshotted control is within solver tolerance (~7e-7) — a documented, characterised restore floor (gvec→lvec resync amplification), by design, not step contamination.
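
The "gather + sort by per-particle gid" comparison used in the parallel test can be illustrated without MPI. The sketch below is an assumption-laden toy: ranks are simulated as plain Python lists, the particle records are made-up data, and `global_view` / `same_global_state` are hypothetical helper names, not functions from the test file.

```python
from typing import List, Tuple

# (gid, x, y, material): one record per particle.
Particle = Tuple[int, float, float, int]


def global_view(per_rank: List[List[Particle]]) -> List[Particle]:
    """Concatenate every rank's particles and sort by global id."""
    flat = [p for rank_particles in per_rank for p in rank_particles]
    return sorted(flat, key=lambda p: p[0])


def same_global_state(a: List[List[Particle]], b: List[List[Particle]]) -> bool:
    """Exact comparison that ignores rank ownership and local ordering."""
    return global_view(a) == global_view(b)


# Made-up data: two simulated ranks, three particles.
pre_step = [[(0, 0.1, 0.2, 1), (1, 0.3, 0.4, 2)], [(2, 0.7, 0.6, 1)]]
# A disruptive step moves particles and loses particle 1.
after_bad_step = [[(0, 0.15, 0.2, 1)], [(2, 0.9, 0.6, 1)]]
# After restore the local ordering may differ, but the global set is the same.
post_restore = [[(1, 0.3, 0.4, 2), (0, 0.1, 0.2, 1)], [(2, 0.7, 0.6, 1)]]
```

Sorting by a per-particle identifier is what makes the check independent of how particles are ordered within a rank, which is why the comparison can be exact rather than tolerance-based.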

Scope (intentional)

In-memory, same-rank-count, whole-state stash/pop. Out of scope by design: on-disk durability (v1.1 — the mitigation for the accepted in-memory memory cost), selective per-field restore, cross-rank-count restore, mesh-adapt rebuild (v1.2, refused with a clear error in v1).

Notes for the reviewer

Commit b179183 (Lagrangian UWSwarm typo fix) duplicates already-merged fix(ddt): Lagrangian.__init__ — uw.swarm.UWSwarm typo #184; git resolves it cleanly on merge — no action needed.
History intentionally retains the reverted counter-as-gate pair + design-correction commit; it documents the design journey and the revert commit explains why.

Test plan

pixi run -e amr-dev pytest tests/test_0007_snapshot_inmemory.py tests/test_0008_snapshot_realsolver.py (27 tests)
cd tests/parallel && mpirun -np 4 python ./ptest_0007_snapshot_inmemory.py (and np 1, 3)
Regression: pytest tests/test_0000_imports.py tests/test_0002_model.py tests/test_0003_save_load.py tests/test_1052_ddt_set_initial_history.py

Underworld development team with AI support from Claude Code

Spun off from the 2026-05-11 deformable-surface design discussion as a self-contained UW3 capability. Covers motivation (backtrack-on-failure, adaptive Δt retry, RK staging, predictor-corrector probes, crash recovery, bisection, debugging captures), the two existing on-disk paths (write_checkpoint and write_timestep), the state-as-dataclass serialisation contract for solver-internal Python state, the three-backend story (in-memory + on-disk-full-state + existing write_timestep unchanged), schema versioning, the swarm population-generation counter, eight architectural work items in dependency order, scope boundaries, and open implementation questions. Baseline-of-record for the feature/in-memory-checkpoint branch. Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

First true unitary checkpoint in UW3. Captures mesh coordinates and mesh-variable global-vector DOFs across every registered mesh into a plain-Python token; restores back onto the same Model instance within the same process, bit-equivalent. Distinct from the existing per-variable write_timestep/read_timestep path, which continues to serve visualisation and partial restart unchanged.

What lands:
- src/underworld3/checkpoint/: new module
  - backend.py: CheckpointBackend Protocol + InMemoryBackend (eager copy on save and load; tokens hold numpy data only, never PETSc handles, so DM-lifecycle hazards do not apply)
  - snapshot.py: Snapshot dataclass + snapshot()/restore() routines. Restore order: mesh coords via _deform_mesh() → MV gvec write + globalToLocal sync. Within-process invalidation gate: _mesh_version mismatch raises SnapshotInvalidatedError before any write happens.
- src/underworld3/model.py: Model.snapshot()/restore() thin delegates
- tests/test_0007_snapshot_inmemory.py: 6 tier-A level-1 tests: scalar/vector MV roundtrip, snapshot independence from later writes, _mesh_version invalidation, type rejection, NotImplementedError on path= (v1.1 scope).

Design open-question resolutions:
- Q1 module location: src/underworld3/checkpoint/ (top-level sibling, room to grow into swarm + state-as-dataclass + on-disk in v1.1+) rather than the persistence.py stub.
- Q2 PETSc API for in-memory capture: Vec.array (numpy view) + subdm.createSubDM + globalToLocal is sufficient. No Viewer needed.
- Q3 restore order verified empirically: mesh._deform_mesh first (rebuilds coord caches + callbacks), then per-var gvec write + globalToLocal sync; _stale_lvec flagged so downstream caches refresh.
- Q4 memory budget: not yet measured; deferred to a later PR with a realistic coupled-physics setup.

Deviation from PR 1 plan: the plan also mentioned refactoring Mesh.write_checkpoint() to call the new protocol. Skipped here because that path is write-only (no read_checkpoint exists), so refactoring without an exercising load-half is risk without value. The HDF5 backend lands with v1.1 on-disk full-state, where it is load-bearing and the protocol shape can be validated against both backends.

Not yet covered (subsequent PRs):
- swarm coverage + _population_generation counter (PR 2)
- state-as-dataclass contract + DDt retrofit (PR 3)
- parameter mutation history + CI check (PR 4)
- on-disk full-state backend (v1.1; PR 5)
- schema versioning + migration registry (PR 6)
- cross-process restore + broader test suite (PR 7)

See docs/developer/design/in_memory_checkpoint_design.md for the full roadmap.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
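
The "eager copy on save and load" behaviour of the in-memory backend can be sketched as below. This is only an illustration: the method names `save` and `load` and their signatures are assumptions (the commit message names the classes but not their methods), and plain lists stand in for the numpy arrays the real tokens hold.

```python
import copy
from typing import Any, Dict, Protocol


class CheckpointBackend(Protocol):
    """Assumed shape of the protocol; method names are hypothetical."""

    def save(self, payload: Dict[str, Any]) -> Any: ...

    def load(self, token: Any) -> Dict[str, Any]: ...


class InMemoryBackend:
    """Toy backend: the token is simply a private deep copy of the payload."""

    def save(self, payload: Dict[str, Any]) -> Any:
        # Eager copy on save: later writes to the live data cannot leak in.
        return copy.deepcopy(payload)

    def load(self, token: Any) -> Dict[str, Any]:
        # Eager copy on load: the caller cannot corrupt the stored token.
        return copy.deepcopy(token)


backend = InMemoryBackend()
live = {"coords": [[0.0, 0.0], [1.0, 0.0]], "T": [1.0, 2.0]}
token = backend.save(live)
live["T"][0] = 99.0          # scribble on the live data after saving
restored = backend.load(token)
```

Copying in both directions is what gives the "snapshot independence from later writes" property listed among the tests, and it lets the same token be restored more than once.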

Extends the unitary snapshot to capture per-rank swarm positions and user swarm-variable data. Adds Swarm._population_generation, a counter bumped at every particle-population mutation site, used as the within-process invalidation gate: restoring a snapshot taken before a populate / migrate / add_particles / remesh event raises SnapshotInvalidatedError rather than silently corrupting a now-stale position array.

Counter init + bump sites (src/underworld3/swarm.py):
- Swarm.__init__: initialise to 0 next to _mesh_version.
- populate(): bump once at the end (covers the 1-3 internal addNPoints calls in the populate body).
- Swarm.migrate() after the migration_disabled early-exit: bump unconditionally; conservative even when migrate is a no-op, because under-bumping risks silent corruption while over-bumping is safe.
- add_particles_with_coordinates() after its direct self.dm.migrate(): this path doesn't go through Swarm.migrate so we bump explicitly.
- add_particles_with_global_coordinates() right after addNPoints: catches the migrate=False case too; the migrate=True path will double-bump via Swarm.migrate, which is fine.
- advection() remesh path after the addNPoints reinjection.

Snapshot extensions (src/underworld3/checkpoint/snapshot.py):
- New fields: swarm_keys, swarm_generations, swarm_mesh_versions, swarmvar_names.
- _capture_swarm: reads DMSwarmPIC_coor via dm.getField → copy → restoreField; iterates swarm.vars excluding DMSwarm* internals; records both _population_generation and _mesh_version.
- _restore_swarm: validates both counters before any write; writes back positions + per-var data in place. Deliberately bypasses populate/add_particles/migrate so the restore itself does not bump the counter or mutate the population we just confirmed stable.

Test coverage (tests/test_0007_snapshot_inmemory.py): 5 new tests on top of the 6 mesh-only tests:
- swarm positions + user-variable roundtrip after scribble
- counter bumps on populate, migrate, add_particles_with_coordinates, add_particles_with_global_coordinates (in monotonic order)
- migrate-between-snapshot-and-restore raises SnapshotInvalidatedError
- add_particles-between-snapshot-and-restore raises likewise
- DMSwarm_* internal variables stay out of the captured key set

Not yet covered: cross-process restore (v1.1), advection remesh invalidation test (needs a recycle-enabled swarm + a velocity field, larger setup than belongs in the core 0007 file).

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

…PR 2)" This reverts commit 001f961.

The earlier design draft proposed Swarm._population_generation as an invalidation gate: counter mismatch between capture and restore would raise SnapshotInvalidatedError. That is wrong. The whole point of the toolkit is to undo intervening state changes, including particle motion, migration, and repopulation. Refusing on counter mismatch breaks the central use cases (RK staging, backtrack-on-instability, adaptive Δt retry, all of which migrate particles between capture and restore).

Corrected swarm semantics:
- Restore rebuilds the swarm's local population: clear current particles, re-add at captured per-rank coords via add_particles_with_coordinates(..., migrate=False), write captured per-variable data back into the new particles in order.
- The _population_generation counter stays as informational metadata (logging, cache invalidation in other consumers, possible future fast-path optimisations), but it is not a restore gate.

Mesh-adapt scope boundary reframed:
- v1 keeps _mesh_version mismatch as a refusal, because the captured DOF arrays don't fit a different DM's section.
- v1.2 will replace the refusal with a mesh-rebuild path on the same rebuild-on-restore principle: destroy the post-adapt DM, rebuild the pre-adapt one from captured topology + section, re-bind all MeshVariable / Swarm / solver wrappers.
- v1 captures the topology / section info even though v1 restore ignores it, so the snapshot payload is forward-compatible with v1.2 without a schema bump.

Architectural-work item 5 updated to match: snapshot captures per-rank particle coords + per-var arrays; restore clears + re-adds + writes; counter is informational, not a gate.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

…2 redo)

Replaces the reverted counter-as-gate PR with rebuild-on-restore semantics for swarms. Restore now succeeds across the cases the earlier design wrongly refused (migrate, add_particles, repopulate between snapshot and restore); these are precisely the cases the snapshot toolkit exists to enable (RK staging, backtrack on instability, adaptive Δt retry).

Design changes since the reverted PR:
- Swarm._population_generation stays, but is now purely informational: bumped at every population-mutation site for logging / debugging / downstream caches, but NOT consulted by restore. Restore rebuilds the local population from the snapshot regardless of intervening mutations.
- Snapshot is keyed by stable name (mesh.name for meshes, f"swarm_{instance_number}" for swarms), not Python id(). Forward-compat for v1.1 cross-process restore and v1.2 mesh-rebuild after mesh.adapt() (where the wrapper survives but its DM is destroyed).
- Restore logic moved off the snapshot module and onto wrapper methods: Mesh.apply_snapshot_payload and Swarm.apply_snapshot_payload. v1 implementations write back in place; v1.2's Mesh implementation can switch to rebuild-from-payload without touching snapshot.py.
- Snapshot payloads include a reserved "topology": None slot on the mesh side, populated in v1.2 with section/DM-topology data sufficient to rebuild the DM. v1 leaves it None; the schema doesn't need to bump when v1.2 lands.

Mesh restore (Mesh.apply_snapshot_payload at discretisation_mesh.py:2570):
- Verify _mesh_version matches (v1 refusal; v1.2 will rebuild here).
- Write coords via _deform_mesh, write per-MV gvec arrays + sync to local vec.

Swarm restore (Swarm.apply_snapshot_payload at swarm.py:4084):
- Drop every current local particle via dm.removePoint() (O(N) total, removes-from-end is O(1) per call).
- addNPoints(n_saved), write coords directly to DMSwarmPIC_coor, set ranks. Deliberately bypasses add_particles_with_coordinates (which filters via points_in_domain and triggers migrate; both unnecessary here since saved coords were local at capture and the mesh hasn't changed).
- Invalidate _canonical_data caches so subsequent var.data accesses re-resolve from PETSc.
- Write captured per-variable data back in particle-order. Internal DMSwarm_* variables are filtered out at capture.

Tests (11 total):
- 6 mesh-only tests preserved.
- 5 swarm tests, including the critical positive tests test_swarm_restore_after_migrate and test_swarm_restore_after_add_particles. Those are the cases the reverted PR wrongly raised on; they're now the central proof that the design works.

Regression: 35 existing core tests pass unchanged.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
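
The rebuild-on-restore sequence described above (drop current particles, re-add at the captured coordinates, write variable data back in particle order) can be sketched with a toy, PETSc-free swarm. `ToySwarm`, `add_particle` and `remove_last` are hypothetical; only the method names `snapshot_payload` / `apply_snapshot_payload` and the informational `_population_generation` counter mirror names used in this PR.

```python
import copy
from typing import Dict, List, Tuple


class ToySwarm:
    """Hypothetical stand-in for one rank's local particle population."""

    def __init__(self) -> None:
        self.coords: List[Tuple[float, float]] = []
        self.vars: Dict[str, List[float]] = {"material": []}
        self._population_generation = 0  # informational only, never a gate

    def add_particle(self, xy: Tuple[float, float], material: float) -> None:
        self.coords.append(xy)
        self.vars["material"].append(material)
        self._population_generation += 1

    def remove_last(self) -> None:
        self.coords.pop()
        for values in self.vars.values():
            values.pop()
        self._population_generation += 1

    def snapshot_payload(self) -> dict:
        # Capture copies, so later mutation cannot reach the payload.
        return {
            "coords": copy.deepcopy(self.coords),
            "vars": copy.deepcopy(self.vars),
        }

    def apply_snapshot_payload(self, payload: dict) -> None:
        # Validate variable names before mutating anything.
        missing = set(payload["vars"]) - set(self.vars)
        if missing:
            raise KeyError(f"variables missing on swarm: {sorted(missing)}")
        # Step 1: drop every current local particle.
        self.coords.clear()
        for values in self.vars.values():
            values.clear()
        # Step 2: re-add particles at the captured coordinates.
        self.coords.extend(payload["coords"])
        # Step 3: write captured per-variable data in particle order.
        for name, saved in payload["vars"].items():
            self.vars[name].extend(saved)
```

Note that the restore never consults `_population_generation`: the population may have grown or shrunk arbitrarily since capture, and the rebuild still reproduces the captured local state.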

Introduces the state-as-dataclass serialisation contract from the design note and applies it to the canonical DDt flavor (Symbolic). PR 4 will mechanically extend the same pattern to Eulerian, SemiLagrangian, Lagrangian, and Lagrangian_Swarm: they share the same dt_history / history_initialised / n_solves_completed / dt core; the variation is purely in how psi_star is bound.

New infrastructure:
- src/underworld3/checkpoint/state.py:
  - SnapshottableState dataclass base with _schema_version field (load-bearing for v1.1 on-disk migration; checked for strict equality in v1 since capture and restore are same-process).
  - Snapshottable runtime_checkable protocol: anything with a .state attribute returning a SnapshottableState. Drives discovery in Model.snapshot().
- Model._state_bearers (WeakSet) + Model._register_state_bearer(): state-bearing helpers self-register on construction without pinning their lifetime.

DDt Symbolic retrofit (option (B)-style adapter per design note):
- DDtSymbolicState dataclass with dt_history, history_initialised, n_solves_completed, dt, psi_star.
- Symbolic gets a `state` property (builds the dataclass from the existing private attrs on read) and a `state.setter` (unpacks, validates schema version + dt_history length, writes attrs back, re-derives BDF/AM coefficient values so downstream reads see the restored state immediately rather than waiting for the next update_pre_solve).
- Symbolic.__init__ auto-registers with the default model.

Snapshot/restore wiring:
- Snapshot.state_bearers: list of (stable_key, state_dataclass) with stable_key = f"{type(obj).__name__}_{obj.instance_number}".
- snapshot() iterates Model._state_bearers, deepcopies obj.state, stores the copy. Deepcopy isolates the snapshot from later mutation of the live state-bearer.
- restore() matches captured states to current state-bearers by stable_key, deepcopies, writes via obj.state setter. Missing state-bearer → SnapshotInvalidatedError.

Drive-by fix in Mesh.snapshot_payload (caught by the DDt tests): mesh variables with _gvec=None (lazy allocation: var created but never written to) are now skipped during capture rather than crashing on var._gvec.array. Restore correspondingly only touches variables present in payload["vars"], so an unallocated-at-capture variable is left in its current state.

Tests (6 new on top of the 11 mesh+swarm ones):
- DDt auto-registers in Model._state_bearers
- .state returns a SnapshottableState with correct schema version
- mid-trajectory snapshot+restore recovers dt_history, history_initialised, n_solves_completed, dt
- wrong _schema_version on apply raises ValueError
- dt_history length mismatch on apply raises ValueError
- snapshot is a deep copy: scribbling live DDt internals doesn't leak into the captured state-bearer payload

17/17 snapshot tests pass; 41 existing core + DDt tests still green.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
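
The snapshot/restore wiring for state bearers (weak registration, deep copy on capture, matching by stable key on restore) might look roughly like the sketch below. It is a toy under stated assumptions: `ToyModel`, `Counter`, `CounterState` and `stable_key` are hypothetical, and `SnapshotInvalidatedError` is redefined locally as a stand-in for the error type named in this PR.

```python
import copy
import itertools
import weakref
from dataclasses import dataclass


class SnapshotInvalidatedError(RuntimeError):
    """Local stand-in for the error type named in the PR."""


@dataclass
class CounterState:
    value: int = 0


def stable_key(obj) -> str:
    # Same key format as quoted in the commit message.
    return f"{type(obj).__name__}_{obj.instance_number}"


class ToyModel:
    def __init__(self) -> None:
        # Weak references: registering does not pin a bearer's lifetime.
        self._state_bearers = weakref.WeakSet()

    def _register_state_bearer(self, obj) -> None:
        self._state_bearers.add(obj)

    def snapshot(self) -> list:
        # Deep copy isolates the snapshot from later mutation.
        return [(stable_key(o), copy.deepcopy(o.state))
                for o in self._state_bearers]

    def restore(self, snap: list) -> None:
        live = {stable_key(o): o for o in self._state_bearers}
        # Check everything first so a failed restore writes nothing.
        for key, _ in snap:
            if key not in live:
                raise SnapshotInvalidatedError(f"no live state bearer {key!r}")
        for key, state in snap:
            live[key].state = copy.deepcopy(state)


class Counter:
    """Hypothetical state bearer that self-registers on construction."""

    _ids = itertools.count()

    def __init__(self, model: ToyModel) -> None:
        self.instance_number = next(Counter._ids)
        self.value = 0
        model._register_state_bearer(self)

    @property
    def state(self) -> CounterState:
        return CounterState(value=self.value)

    @state.setter
    def state(self, new: CounterState) -> None:
        self.value = new.value
```

Keying by a stable name rather than by object identity is what allows a captured state to be matched to a live object at restore time; a snapshot applied to a model that lacks the named bearer fails before any state is written.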

…agrangian/Lagrangian_Swarm) (PR 4)

Mechanically extends the PR 3 Snapshottable contract from Symbolic to the other four DDt flavors. Each gets:
- A flavor-specific State dataclass inheriting from a new _DDtCoreState base that carries the shared dt_history / history_initialised / n_solves_completed / dt fields.
- A .state property building the dataclass on read and a .state.setter unpacking on write, re-deriving BDF/AM coefficients so post-restore reads are consistent without waiting for the next update_pre_solve.
- Self-registration with the default model in __init__ (try/except for safety when no model is active).

State dataclasses (in src/underworld3/systems/ddt.py):
- _DDtCoreState: shared fields. Subclasses add psi_star representation specific to the flavor.
- DDtSymbolicState (already present, refactored to inherit base): psi_star is a list of sympy expressions.
- DDtEulerianState: psi_star is a list of MeshVariables; State carries their clean_names for restore-side verification (the actual DOF arrays travel via the mesh-variable snapshot path).
- DDtSemiLagrangianState: same as Eulerian plus optional forcing_star_var_name and with_forcing_history flag for ETD-2 Maxwell-relaxation integration.
- DDtLagrangianState / DDtLagrangianSwarmState: psi_star is a list of SwarmVariables on the DDt's swarm; data travels via the swarm-variable path.

Notes:
- SemiLagrangian's update_pre_solve hardcodes theta=0.5 (the class doesn't accept a theta arg in __init__), so the state setter matches that, not self.theta, which doesn't exist.
- Lagrangian itself has a pre-existing AttributeError in __init__ (references uw.swarm.UWSwarm which doesn't exist). The retrofit code is in place and follows the same pattern; consumers that construct Lagrangian via the higher-level solver pathways will get .state / .state.setter / registration automatically. The pre-existing bug is out of scope for this PR but worth flagging.
- ParameterRegistry retrofit deferred. The class isn't currently wired into Model anywhere in core code (only mentioned in a docstring example). Retrofitting now would be dead code; the retrofit lands together with the real consumer in a follow-up.

Tests (3 new on top of PR 3's 6 DDt tests, for 20 total):
- Eulerian DDt roundtrip via manual primary-state mutation
- SemiLagrangian DDt roundtrip
- Lagrangian_Swarm DDt registration + state-type check (no roundtrip: advection needs a velocity-field setup beyond a core unit test)

Roundtrips on Eulerian/SemiLagrangian exercise the .state property and .state.setter directly rather than running full projection solves: the BDF/AM coefficient re-derivation happens in the setter, so manual primary-state mutation is sufficient to validate the retrofit logic.

20/20 snapshot tests pass; 41 existing core + DDt tests green.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

Companion to the snapshot toolkit design note (PR 0) and the Snapshottable / DDt retrofit implementation (PRs 3, 4). Audience is developers adding new solver-internal helper classes; the guide covers what goes in a State dataclass, when to use option (B) adapter vs option (C) authoritative-store, what NOT to capture (PETSc handles, bulk arrays already carried by mesh-var / swarm-var paths), how schema versioning is intended to work in v1.1, and a minimal roundtrip test pattern. Closes the doc gap noted in PR 3's commit message; rounds out v1 of the snapshot toolkit. Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

The Lagrangian DDt flavor has been unconstructible since 2025-07-07 (commit 0778b7d, "Fix uw.function.evaluate / eliminate evalf"), which typo'd uw.swarm.Swarm(mesh) as uw.swarm.UWSwarm(mesh) while editing nearby code. UWSwarm does not exist — never has — so every direct construction of Lagrangian since that commit has died with AttributeError. Higher-level solver pathways that wrap Lagrangian were presumably also broken; consumers may have silently been using Lagrangian_Swarm or another flavor as a workaround. One-character fix: revert that line. No other UWSwarm references exist in the tree. This bug surfaced during the snapshot toolkit work (PR 4) when the state-as-dataclass retrofit included a Lagrangian roundtrip test that couldn't run. With the fix in place, the test now runs and passes — included in the same commit so the bug fix and the proof it works land together. Should be cherry-picked to development; the typo is unrelated to the snapshot feature work it surfaced from. Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

Every previous snapshot test is unit-style: build a thing, snapshot, scribble, restore, check equality. None exercise the actual use case that motivated the toolkit: detect a bad step, snap back, retry. This commit adds one focused end-to-end test covering the canonical adaptive-Δt CFL workflow.

The test simultaneously exercises all three captured state surfaces in one realistic story:
- A swarm with an outward-radial velocity field carries particles outward at known speeds.
- A material variable on the swarm carries a per-particle marker (initial x-coord), so we can prove particle identity is recovered, not just particle count.
- A Symbolic DDt accumulates BDF history (manually advanced past startup), so the state-bearer / state-as-dataclass path also gets exercised.

The flow:
1. Snapshot before the speculative step.
2. Take a candidate Δt = 0.5 → max displacement ~0.27, ~6× the cell radius. CFL violated; the consumer's check trips.
3. model.restore(snap): particle positions, material data, and DDt history all roll back to the snapshot point.
4. Retry at Δt = 0.05 → max displacement ~0.033, sub-cell. CFL satisfied; state evolves cleanly.

The Δt and threshold values come from a probe run on the same mesh (min_radius ≈ 0.044; |V| ≤ 0.71 at corners), so the CFL violation on the candidate step is a real physical observation rather than a parameter tweak. Smaller dt → strictly smaller displacement gives a robust ratio-based assertion that doesn't rely on exact numbers.

This is the test pattern that consumers (RK staging, adaptive Δt, predictor-corrector, regime-change feeling-out) will adapt. v1 of the snapshot toolkit is genuine, not just unit-tested.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
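
The four-step flow above is the pattern a consumer would wrap in a retry loop. The sketch below is a self-contained toy, not the test in tests/test_0007_snapshot_inmemory.py: `ToyModel`, `advect` and `step_with_retry` are hypothetical, the velocity field is a 1-D `v = x`, and only the snapshot()/restore() call pattern and the Δt and radius values echo the commit message.

```python
import copy
from typing import List


class ToyModel:
    """Hypothetical 1-D model: particles move outward with velocity v = x."""

    def __init__(self, positions: List[float]) -> None:
        self.positions = list(positions)
        self.time = 0.0

    def snapshot(self) -> dict:
        return copy.deepcopy(self.__dict__)

    def restore(self, snap: dict) -> None:
        self.__dict__.update(copy.deepcopy(snap))

    def advect(self, dt: float) -> float:
        """Move the particles and return the largest displacement."""
        moves = [x * dt for x in self.positions]
        self.positions = [x + d for x, d in zip(self.positions, moves)]
        self.time += dt
        return max(abs(d) for d in moves)


def step_with_retry(model: ToyModel, dt: float, max_displacement: float,
                    shrink: float = 0.1, max_tries: int = 5) -> float:
    """Take one step, backing out and shrinking dt while the CFL-style check fails."""
    for _ in range(max_tries):
        snap = model.snapshot()             # 1. stash before the speculative step
        if model.advect(dt) <= max_displacement:
            return dt                       # 4. accepted: keep the new state
        model.restore(snap)                 # 3. regretted: roll everything back
        dt *= shrink                        # retry with a smaller step
    raise RuntimeError("no acceptable dt found")


model = ToyModel([0.1, 0.5])
accepted_dt = step_with_retry(model, dt=0.5, max_displacement=0.044)
```

In this toy the first attempt at dt = 0.5 moves the outer particle by 0.25, which exceeds the 0.044 limit, so the step is discarded and the retry at a tenth of the step size is accepted.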

Every prior test proved *state equality after restore*. That is necessary but not the guarantee a backtracking consumer actually relies on. The real guarantee ("git stash for steps": a discarded speculative step leaves zero trace after restore + continuation) was untested. This commit adds it, asserted bit-for-bit (np.array_equal, no tolerance).

Two tests, both with a live swarm + driven mesh variable + Symbolic DDt so the mesh -> swarm -> state-bearer restore ordering is exercised together:
- test_continuation_deterministic_after_restore:
    snapshot S -> K steps -> A; restore(S) -> K steps -> B.
    A == B bit-for-bit. Proves restore leaves no residual state that perturbs subsequent evolution.
- test_continuation_bit_identical_across_stash_and_recover:
    control: S -> K good steps -> ctrl
    stash: S -> disruptive 10x-dt step -> restore(S) -> same K good steps -> stash
    ctrl == stash bit-for-bit. The regretted step leaves no trace.

This also closes the #3 concern (does Mesh.apply_snapshot_payload's _deform_mesh call disturb a registered swarm before the swarm restore runs?). Both tests have a swarm and a mesh variable live through restore; the mesh restore calls _deform_mesh with unchanged coords, then swarm restore, then DDt restore. If the mesh restore perturbed the swarm, continuation would not be bit-identical. It is. For v1 scope this fully covers the _deform_mesh-on-restore path, because a *deformed* mesh (bumped _mesh_version) is refused on restore anyway; the only path that runs _deform_mesh on restore is the same-coords path these tests now cover.

Remaining production blockers (unchanged by this commit): parallel (MPI) is still untested and the swarm rebuild deliberately bypasses migration; no real-solver test; memory cost unmeasured.

24 snapshot tests, 24 regression tests, all green.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
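
The control-versus-stash comparison above can be reproduced on a toy model to show the shape of the assertion. This is a sketch under stated assumptions: `ToyModel` and `run_steps` are hypothetical, the update rule is an arbitrary explicit-Euler logistic step, and list equality stands in for np.array_equal.

```python
import copy
from typing import List


class ToyModel:
    """Hypothetical deterministic model with a nonlinear update."""

    def __init__(self, field: List[float]) -> None:
        self.field = list(field)
        self.n_steps = 0

    def snapshot(self) -> dict:
        return copy.deepcopy(self.__dict__)

    def restore(self, snap: dict) -> None:
        self.__dict__.update(copy.deepcopy(snap))

    def step(self, dt: float) -> None:
        # Explicit Euler on du/dt = u (1 - u); any deterministic rule would do.
        self.field = [u + dt * u * (1.0 - u) for u in self.field]
        self.n_steps += 1


def run_steps(model: ToyModel, dts: List[float]) -> List[float]:
    for dt in dts:
        model.step(dt)
    return list(model.field)


good_steps = [0.01] * 5
model = ToyModel([0.1, 0.4, 0.8])
S = model.snapshot()

# control: S -> K good steps -> ctrl
ctrl = run_steps(model, good_steps)

# stash: S -> disruptive 10x-dt step -> restore(S) -> same K good steps -> stash
model.restore(S)
model.step(0.1)
model.restore(S)
stash = run_steps(model, good_steps)

# Exact equality, no tolerance: the regretted step must leave no trace.
assert ctrl == stash
```

The comparison is deliberately exact. A tolerance would hide any small residue the discarded step left behind, which is precisely what this kind of test is meant to catch.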

The one real production blocker was "works everywhere", i.e. correct under MPI. This adds a parallel ptest and confirms the design intent: swarm restore is a per-rank reconstruction, not a redistribution, so the global state is exactly rebuilt under cross-rank migration provided the rank count is unchanged (the documented v1 scope).

ptest_0007_snapshot_inmemory.py (mesh + swarm + per-particle global-id tag + material + Symbolic DDt; rotation field that circulates particles across the strip partition). Three collective properties, asserted on rank 0:
- P1: restore recovers the exact global particle count. The disruptive step is deliberately *allowed* to lose particles across ranks (advect out / clip); that is exactly the failure a stash-and-restore exists to undo.
- P2: exact reconstruction: gather (gid, x, y, material) from every rank, sort by global id, np.array_equal pre-step vs post-restore. Order- and rank-independent: the real proof that per-rank reconstruction yields the correct global state.
- P3: bit-identical continuation across a stash, in parallel.

Results:
- -np 1 : 2052 particles, all properties pass.
- -np 3 : 2013 particles; disruptive step loses 28 across ranks; restore recovers all 2013 exactly; P2/P3 bit-for-bit.
- -np 4 : 2006 particles; disruptive step loses 35 across ranks; restore recovers all 2006 exactly; P2/P3 bit-for-bit.

The genuinely strong result: the toolkit demonstrably recovers from real cross-rank particle loss (the exact production scenario it exists for) with bit-identical continuation afterwards.

Registered in mpi_runner.sh at -np 1 / 3 (uneven) / 4.

Production-blocker status: parallel correctness now confirmed (was the gate). Remaining items are confidence/hardening only: a real-solver test (SNES state is negligible: previous solution travels via MeshVariable, already captured) and the accepted, documented in-memory memory cost (mitigation: route through the v1.1 on-disk backend via a flag).
Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

…loor characterised

Closes the last confidence gap: snapshot/restore driven by an actual PETSc solver (AdvDiffusion, which carries an internal SemiLagrangian DDt with an auxiliary projection SNES + nodal trace-back swarm), through the stash-and-recover loop.

Investigation findings (each verified by a standalone diagnostic):
* AdvDiffusion solve is bit-deterministic: two independent identical runs with no snapshot are np.array_equal (max|d| = 0.0). So any drift introduced by snapshot/restore is a real fidelity question, not solver noise.
* restore() recovers the primary solution field T bit-exactly.
* THE core "git stash for steps" guarantee holds bit-for-bit even through real solves:
    restore -> regretted absurd-dt solve -> restore -> K solves
  is np.array_equal to
    restore -> K solves
  The discarded step leaves zero trace (B == C, max|d| = 0.0).
* The only residual is restore's reproducibility floor against a *never-snapshotted* control: ~7e-7 here. Mechanism: restore resyncs fields through gvec->lvec rather than reproducing the solver-produced lvec exactly; the implicit diffusion operator amplifies that to solver-tolerance level over steps. This is NOT contamination from the discarded step (proven by B == C), it is the cost of round-tripping through the snapshot representation, within solver tolerance, and consistent with the design intent that auxiliary solver state is intentionally not captured.

Three tests encode exactly this (no overclaiming):
- test_realsolver_restore_recovers_solution_field (T np.array_equal)
- test_realsolver_regretted_step_leaves_no_trace (B == C, bit-exact)
- test_realsolver_continuation_within_solver_tolerance (vs never-stashed control: < 1e-5, asserted explicitly non-bit-exact so the test tightens itself if the floor is ever eliminated)

Honest production statement: discarding a bad step is bit-exact even through real solvers; recovering to a never-stashed control is within solver tolerance. Both are correct for the "git stash for steps" use case.

51 tests pass (24 serial snapshot + 3 real-solver + 24 regression); parallel ptest (np 1/3/4) unchanged.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)

Copilot

Pull request overview

Adds an in-memory snapshot/restore toolkit (Model.snapshot() / Model.restore()) intended as a "git stash for timesteps": a unitary state capture that lets time-stepping code back out of a regretted step (RK staging, adaptive Δt retry, predictor-corrector, etc.). It introduces a new underworld3.checkpoint subpackage with a backend abstraction (currently in-memory only; HDF5 v1.1 stubbed), a Snapshottable/SnapshottableState contract, and retrofits all five DDt flavors plus Mesh and Swarm to participate via snapshot_payload/apply_snapshot_payload and .state accessors. Swarm restore uses rebuild-on-restore semantics; mesh restore refuses on _mesh_version change (v1.2 will rebuild). Also fixes the uw.swarm.UWSwarm → uw.swarm.Swarm typo in Lagrangian.__init__ (duplicated in #184).

Changes:

New underworld3.checkpoint subpackage with Snapshot, InMemoryBackend, SnapshottableState/Snapshottable protocol, and capture/restore orchestration.
Mesh, Swarm, and all five DDt flavors gain snapshot_payload / apply_snapshot_payload / .state (option-B dataclass adapter); swarm gets an informational _population_generation counter.
Extensive new tests: serial unit tests, real-solver AdvDiffusion confidence test, and an MPI parallel test (np 1/3/4) plus a developer guide and design note.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.


File	Description
src/underworld3/checkpoint/__init__.py	Public API surface for the new subpackage.
src/underworld3/checkpoint/backend.py	`CheckpointBackend` protocol + `InMemoryBackend`.
src/underworld3/checkpoint/snapshot.py	`Snapshot` dataclass and capture/restore orchestration over meshes, swarms, and state bearers.
src/underworld3/checkpoint/state.py	`SnapshottableState` base dataclass + `Snapshottable` protocol.
src/underworld3/model.py	Adds `_state_bearers` WeakSet, `_register_state_bearer`, and `Model.snapshot()`/`restore()` wrappers.
src/underworld3/discretisation/discretisation_mesh.py	Mesh `snapshot_payload`/`apply_snapshot_payload` capturing deformed coords + MV gvec DOFs.
src/underworld3/swarm.py	Swarm `snapshot_payload`/`apply_snapshot_payload` (rebuild-on-restore) and `_population_generation` counter bumps.
src/underworld3/systems/ddt.py	Five DDt state dataclasses + `.state` getter/setter retrofit; fixes `UWSwarm` typo.
src/underworld3/__init__.py	Imports `underworld3.checkpoint`.
tests/test_0007_snapshot_inmemory.py	Serial unit and end-to-end snapshot/restore tests.
tests/test_0008_snapshot_realsolver.py	AdvDiffusion real-solver confidence test.
tests/parallel/ptest_0007_snapshot_inmemory.py	MPI snapshot/restore test.
tests/parallel/mpi_runner.sh	Adds new parallel test invocations.
docs/developer/guides/state-as-dataclass.md	Developer guide for the `.state` contract.
docs/developer/design/in_memory_checkpoint_design.md	Design note for the snapshot toolkit.


+        # state-bearer. Safe if no model is active.
+        try:
+            import underworld3 as _uw
+
+            _uw.get_default_model()._register_state_bearer(self)
+        except Exception:
+            pass


+        # The clear+re-add path bumped _population_generation already
+        # (we don't bump on removePoint, but addNPoints isn't bumped
+        # either — these are raw PETSc calls). For consistency with
+        # other mutation paths, bump explicitly here.


+# Restore-vs-pristine reproducibility floor for this setup (measured).
+# The regretted-step guarantee is asserted bit-exact (np.array_equal);
+# only the never-stashed-control comparison uses this tolerance.


+        # Step 3: write captured per-variable data.
+        current_vars = {var.clean_name: var for var in self._vars.values()}
+        for var_clean_name, saved in payload["vars"].items():
+            var = current_vars.get(var_clean_name)
+            if var is None:
+                raise SnapshotInvalidatedError(
+                    f"swarm {self._snapshot_stable_name()!r}: variable "
+                    f"{var_clean_name!r} from snapshot is not present"
+                )
+            current = np.asarray(var.data)
+            if current.shape != saved.shape:
+                raise SnapshotInvalidatedError(
+                    f"swarm {self._snapshot_stable_name()!r}: variable "
+                    f"{var_clean_name!r} data shape mismatch — current "
+                    f"{current.shape} vs snapshot {saved.shape}"
+                )
+            current[...] = saved
+
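The restore-time validation in the snippet above (missing variable, then shape mismatch) can be sketched in isolation. This is a simplified illustration under stated assumptions: `check_payload_matches` is a hypothetical helper, the error class here is a local stand-in whose base class is assumed, and shapes are plain tuples rather than array attributes. It also includes the "extra live variable" rejection that the follow-up commit describes.

```python
class SnapshotInvalidatedError(RuntimeError):
    """Local stand-in for the PR's error type (base class is an assumption)."""


def check_payload_matches(live_shapes, saved_shapes, owner="swarm"):
    """Raise unless both sides hold the same variables with the same shapes.

    Both arguments map a variable's clean name to its data shape (a tuple).
    """
    missing = sorted(set(saved_shapes) - set(live_shapes))
    if missing:
        raise SnapshotInvalidatedError(
            f"{owner}: variables {missing} from snapshot are not present"
        )
    extra = sorted(set(live_shapes) - set(saved_shapes))
    if extra:
        raise SnapshotInvalidatedError(
            f"{owner}: live variables {extra} were not in the snapshot"
        )
    for name, saved in saved_shapes.items():
        if live_shapes[name] != saved:
            raise SnapshotInvalidatedError(
                f"{owner}: variable {name!r} data shape mismatch: "
                f"current {live_shapes[name]} vs snapshot {saved}"
            )


# Matching variable sets and shapes pass silently.
check_payload_matches({"T": (100, 1)}, {"T": (100, 1)})
```

Running all checks before any data is written keeps a failed restore from leaving the swarm half-updated; whether the PR orders its checks this way is not shown in the excerpt.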


Four fixes for points raised in Copilot's review of the in-memory snapshot toolkit PR:

1. ddt.py: narrow ``except Exception`` to ``except (ImportError, AttributeError)`` at all five DDt ``_register_state_bearer`` sites. Only the genuine bootstrap cases (import not yet wired during underworld3 init, or older Model without the registry method) get swallowed; real registration bugs now propagate instead of being swallowed into the silent-state-loss failure mode the design note explicitly warns against.
2. swarm.py: rewrite the contradictory comment in ``Swarm.apply_snapshot_payload`` around the explicit ``_population_generation += 1`` bump. The previous wording said the clear+re-add path had already bumped, which is wrong: neither ``removePoint`` nor the raw ``addNPoints`` call here touches the counter; the explicit bump is what makes a restore visible to downstream consumers as a population change.
3. swarm.py: ``apply_snapshot_payload`` now raises ``SnapshotInvalidatedError`` if the live swarm has user variables that were not in the snapshot. Previously those "extra" vars survived the clear+addNPoints reallocation with uninitialised/stale contents, which meant silent incoherence after restore. The contract is now symmetric with the mesh-variable restore (same variable set on both sides).
4. test_0008_snapshot_realsolver: added a comment explaining the ~14× headroom on ``_RESTORE_FLOOR_ATOL = 1e-5`` vs the measured ~7e-7 floor (PETSc/BLAS/MPI variability allowance on CI).

Regression: 45 single-rank snapshot+core tests pass; parallel ptest at np-4 still PASS with the new strict-extras check.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
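The narrowed registration guard from the first fix can be sketched as follows. This is an illustrative reduction, not the code in ddt.py: `Model`, `LegacyModel`, and `register` are hypothetical stand-ins, and only the `_register_state_bearer` method name and the `(ImportError, AttributeError)` tuple come from the commit message.

```python
import weakref


class Model:
    """Hypothetical model holding a weak registry of state bearers."""

    def __init__(self):
        self._state_bearers = weakref.WeakSet()

    def _register_state_bearer(self, bearer):
        self._state_bearers.add(bearer)


class LegacyModel:
    """Hypothetical older model without the registry method."""


def register(bearer, get_model):
    """Register bearer with the active model; tolerate only bootstrap gaps."""
    try:
        get_model()._register_state_bearer(bearer)
    except (ImportError, AttributeError):
        # Bootstrap cases only: package not fully imported yet, or a model
        # that predates the registry. Anything else propagates to the caller.
        return False
    return True
```

One caveat of this pattern: an `AttributeError` raised inside `_register_state_bearer` itself would still be swallowed, so the guard is narrower than `except Exception` but not perfectly precise.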

First slice of the on-disk snapshot format (v1.1). Establishes the file structure and the inspectability bar; no PETSc bulk yet (that is phase 2). Stacked on the in-memory snapshot toolkit (#195) and the model tracker (#196) so it can serialise both later.

What lands:

- src/underworld3/checkpoint/disk_snapshot.py
  - DISK_SNAPSHOT_SCHEMA_VERSION = 1
  - write_snapshot_skeleton(model, path): writes /metadata attrs + empty stub groups /mesh /variables /swarms /python_state (the structure phases 2+ will fill in).
  - read_snapshot_metadata(path): reads /metadata back as a plain dict, decodes JSON-encoded list fields for convenience, validates schema version.
  - inspect_snapshot(path): human-readable summary suitable for print(...) at a notebook prompt.
- src/underworld3/checkpoint/__init__.py: exports.
- tests/test_0010_snapshot_disk_format.py (7, tier_a level_1):
  - top-level group structure matches the spec
  - h5py-readable /metadata attrs cover identity, schema, tracker conventions, geometry, MPI rank count, and inventories of meshes / swarms / state-bearer classes / variables (the proxy for "an external user running h5ls/h5dump sees useful info")
  - read/write roundtrip
  - rejection of non-snapshot files and wrong-schema files with clear errors (not obscure h5py noise)
  - inspect_snapshot includes the key facts
  - skeleton groups carry `filled_by` attrs so phases 2/3 readers and external inspectors can tell whether content is populated yet.

Design notes encoded:

- UW3-controlled rich-metadata wrapper around PETSc bulk; pure PETSc HDF5 dumps fail the inspectability bar so are rejected as the format.
- List-typed metadata stored as JSON strings in scalar attrs so h5py / h5ls handle them cleanly; read API exposes them as plain Python lists alongside the *_json originals.
- Swarm storage left as a phase-3 decision: the metadata wrapper is designed to support `@external_file` on /swarms/swarm_X/ when individual swarms grow too bulky for a single file. No commitment to inline vs split until phase 3 has real swarm sizes in hand.

Stacked on feature/model-tracker; PRs to development after #195 and #196 land.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
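The JSON-in-scalar-attrs convention from the design notes can be sketched with a plain dict standing in for an HDF5 attribute set. This is a hedged illustration: `encode_attrs` and `decode_attrs` are hypothetical helpers, the key names in the usage line are invented for the example, and the real module may name and order things differently. Only the `*_json` suffix and the "decoded lists alongside the originals" behaviour come from the commit message.

```python
import json


def encode_attrs(metadata):
    """Return a flat dict where list values become JSON strings under '<key>_json'."""
    attrs = {}
    for key, value in metadata.items():
        if isinstance(value, (list, tuple)):
            attrs[f"{key}_json"] = json.dumps(list(value))
        else:
            attrs[key] = value
    return attrs


def decode_attrs(attrs):
    """Keep every stored attr and add a decoded list next to each '*_json' entry."""
    decoded = dict(attrs)
    for key, value in attrs.items():
        if key.endswith("_json"):
            decoded[key[: -len("_json")]] = json.loads(value)
    return decoded


# Hypothetical metadata: one scalar field and one list-typed inventory.
stored = encode_attrs({"schema_version": 1, "meshes": ["mesh0", "mesh1"]})
loaded = decode_attrs(stored)
```

Storing the list as a single string keeps each attribute scalar, which is the property the commit message relies on for clean `h5ls` output.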


…t roundtrip

Builds on phase 1's metadata wrapper to actually carry mesh + mesh-variable state to disk and read it back. Delegates the heavy lifting to #146's `Mesh.write_checkpoint` / `MeshVariable.read_checkpoint` PETSc-DMPlex primitives; phase 2's job is layout, dispatch, and tying the wrapper to the bulk data via a simple convention.

Layout (final v1.1 shape):

    /path/to/run.snap.h5          wrapper (h5py-inspectable)
    /path/to/run.snap.bulk/       companion directory (one per snap)
        {mesh_safe}.mesh.00000.h5
        {mesh_safe}.{var_clean}.00000.h5

Wrapper carries /meshes/{mesh_safe}/ with @name, @mesh_file, and /meshes/{mesh_safe}/variables/{var_safe}/ with @name, @components, @degree, @continuous, @external_file. The bulk-dir path is derived from the wrapper path by convention (`.h5` → `.bulk`), so no external_file attr is needed for the standard placement. Move them together; a clear FileNotFoundError fires if bulk is missing on read.

Phase 1 layout refactor folded in:

- /mesh (singular) → /meshes (plural): supports multi-mesh natively.
- /variables removed from the top level; it now nests under each mesh as /meshes/{name}/variables/{var}, matching the in-memory snapshot's mesh→vars structure.

New API:

- `write_snapshot(model, path)`: writes wrapper + bulk; covers every registered mesh and every allocated meshvar on each mesh. Lazy-allocated vars (_gvec is None) are skipped, the same rule as the in-memory path.
- `read_snapshot(model, path)`: loads var DOFs back into already-registered meshes by name. Mesh / variable mismatch raises a clear ValueError (mesh-rebuild on read is v1.2 scope).
- `write_snapshot_skeleton` / `read_snapshot_metadata` / `inspect_snapshot` stay as phase-1 metadata-only entry points.

Branch hygiene: merged origin/development (which now has #146) into this branch so the new code can actually call read_checkpoint. The merge was clean: #146 and the snapshot toolkit only overlap at different methods in `discretisation_mesh.py`, as the earlier analysis predicted. PR target will be development once #195/#196 land; the diff stays clean because the merged dev commits are already there.

Tests (12 total, 5 new in phase 2, tier_a level_1):

- write produces wrapper + bulk-dir with the expected file pattern
- wrapper populated with the per-mesh + per-var metadata that makes inspectability self-sufficient
- bit-exact write→scribble→read roundtrip on a 2D mesh with one scalar + one vector variable (np.array_equal, zero tolerance)
- missing bulk-dir → clear FileNotFoundError
- mismatched mesh on read → clear ValueError (not an obscure h5py trace)

Regression: 64 tests pass (24 snapshot + 9 tracker + 12 disk-format + 19 core/regression).

Phase 3 next: swarms (with the @external_file freedom kept open for bulky swarms) + /python_state for DDt + ModelTracker via dataclass-to-HDF5-attrs serialisation.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
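The wrapper-to-bulk naming convention described in this commit can be sketched with `pathlib`. The helper names below are hypothetical, and treating `00000` as a zero-padded index is an assumption read off the file pattern rather than something the commit message states.

```python
from pathlib import Path


def bulk_dir_for(wrapper_path):
    """Derive the companion bulk directory: run.snap.h5 -> run.snap.bulk."""
    wrapper = Path(wrapper_path)
    if wrapper.suffix != ".h5":
        raise ValueError(f"expected a .h5 wrapper path, got {wrapper.name!r}")
    return wrapper.with_suffix(".bulk")


def mesh_file(bulk_dir, mesh_safe, index=0):
    """File name pattern {mesh_safe}.mesh.00000.h5 (index meaning assumed)."""
    return Path(bulk_dir) / f"{mesh_safe}.mesh.{index:05d}.h5"


def var_file(bulk_dir, mesh_safe, var_clean, index=0):
    """File name pattern {mesh_safe}.{var_clean}.00000.h5 (index meaning assumed)."""
    return Path(bulk_dir) / f"{mesh_safe}.{var_clean}.{index:05d}.h5"


def require_bulk_dir(wrapper_path):
    """Return the bulk directory, or fail clearly when it was not moved along."""
    bulk = bulk_dir_for(wrapper_path)
    if not bulk.is_dir():
        raise FileNotFoundError(
            f"bulk directory {bulk} is missing; move it together with the wrapper"
        )
    return bulk
```

Because the directory name is computed rather than stored, renaming the wrapper without renaming the directory breaks the pair, which is why the commit message says to move them together.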


Per Louis's direction ("break out the swarm information into a separate file in the first instance — bulk is a problem with swarms, always"), swarms always go to their own h5py-direct sidecar from day one. No inline-vs-split toggle: sidecar is the only path.

Layout:

    /path/to/run.snap.h5                          wrapper
    /path/to/run.snap.bulk/{swarm_safe}.swarm.h5  swarm sidecar (one per swarm)

Sidecar structure (h5py-native, no PETSc, since swarms aren't DMPlex section/vec):

    @num_particles_local, @dim, @mesh_name, @population_generation
    /coordinates                   dataset, (n_local, dim)
    /variables/{var_clean_name}    dataset, (n_local, num_components)
        @num_components, @dtype

The sidecar's top-level @attrs and group structure mean `h5ls -v` on the sidecar alone tells you "this holds N particles in dim D on mesh M with these variables", the same inspectability bar as the wrapper.

Wrapper /swarms/{swarm_safe}/ carries metadata + the @external_file pointer to the sidecar in the bulk dir.

Restore mirrors the in-memory Swarm.apply_snapshot_payload exactly: clear local population via dm.removePoint loop, addNPoints at saved coords, write var data back. Same rebuild-on-restore semantics: the disk snapshot recovers from a particle-population mutation (added particles between snapshot and restore) just like the in-memory path does, proven by test_swarm_restore_recovers_after_particle_count_change.

Tests (5 new, 21 total tier_a level_1):

- swarm sidecar lands in bulk dir with predictable name; wrapper records external_file ref + mesh_name + var inventory
- sidecar is self-inspectable via h5py (file-level attrs + /coordinates + /variables with per-var attrs)
- whole swarm (coords + svar data) round-trips bit-exact through write → scribble → read
- rebuild-on-restore parity with in-memory path: snapshot, mutate population, restore → exact local population recovered
- PETSc-internal DMSwarm_* variables filtered at capture (same rule as in-memory)

MPI: single-rank only in this phase. The current rank-0-only sidecar write only captures rank 0's local particles in a parallel run. Phase 6 will either use h5py-mpi parallel HDF5 or per-rank sidecars to match #195's parallel exact-reconstruction guarantee.

73 tests pass (24 in-memory + 9 tracker + 21 disk-format + 19 core/regression).

Phase 4 next: format detection + dispatch in MeshVariable.read_timestep so it reads BOTH the legacy per-variable layout AND the new v1.1 sidecar format via the KDTree bridge. Closes the compatibility commitment from the design discussion.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
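The rebuild-on-restore sequence (clear the local population, re-add particles at the saved coordinates, write variable data back) can be sketched with plain Python lists. `ToySwarm` is a hypothetical stand-in with no PETSc behind it; only the method names `snapshot_payload` / `apply_snapshot_payload`, the payload's `"vars"` key, and the explicit generation bump are taken from this PR.

```python
class ToySwarm:
    """Hypothetical single-rank swarm: one coordinate tuple per particle."""

    def __init__(self, coords, temperature):
        self.coords = list(coords)
        self.vars = {"temperature": list(temperature)}
        self.population_generation = 0

    def add_particle(self, coord, temperature):
        self.coords.append(coord)
        self.vars["temperature"].append(temperature)
        self.population_generation += 1

    def snapshot_payload(self):
        # Copy everything so later mutation cannot leak into the payload.
        return {
            "coords": list(self.coords),
            "vars": {name: list(data) for name, data in self.vars.items()},
        }

    def apply_snapshot_payload(self, payload):
        # Step 1: clear the current local population.
        self.coords.clear()
        for data in self.vars.values():
            data.clear()
        # Step 2: re-add particles at the saved coordinates.
        self.coords.extend(payload["coords"])
        # Step 3: write the captured per-variable data back.
        for name, saved in payload["vars"].items():
            self.vars[name].extend(saved)
        # Make the restore visible to consumers as a population change.
        self.population_generation += 1


swarm = ToySwarm([(0.0, 0.0), (1.0, 0.5)], [300.0, 310.0])
payload = swarm.snapshot_payload()
swarm.add_particle((2.0, 2.0), 999.0)   # population mutates after the snapshot
swarm.apply_snapshot_payload(payload)   # restore rebuilds the saved population
```

Because the restore never asks whether the population changed, the same code path handles both an untouched swarm and one that gained or lost particles.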

Single user-facing entry point for all snapshot use cases. The same methods serve the in-memory ephemeral stash and the on-disk persistent snapshot; the dispatch is mechanical, and the user has one API to learn:

    token = model.save_state()                 # in-memory, returns Snapshot
    model.load_state(token)                    # restore from token
    model.save_state(file="step42.snap.h5")    # on-disk, returns path
    model.load_state("step42.snap.h5")         # restore from disk
                                               # (also: load_state(file=…))

load_state dispatches on argument type: Snapshot → in-memory restore; str/PathLike → disk restore. Type-mismatched source raises TypeError with a clear message.

Renames replace the prior Model.snapshot() / Model.restore() pair from #195. Pre-merge, no public users to migrate; getting the user-facing API right now means there is never a disparate version shipped.

uw.checkpoint.{snapshot,restore,write_snapshot,read_snapshot,read_snapshot_metadata,inspect_snapshot,write_snapshot_skeleton} stay as power-user / lower-level entry points that save_state / load_state delegate to.

Files updated (mechanical renames, except the doc rewrite):

- src/underworld3/model.py: save_state / load_state methods replace snapshot / restore; load_state accepts positional Snapshot or str/os.PathLike, with TypeError on anything else.
- tests/test_0007_snapshot_inmemory.py: 23 callers renamed; obsolete test_snapshot_path_is_v1_1_scope deleted (v1.1 has landed).
- tests/test_0008_snapshot_realsolver.py: 3 tests renamed.
- tests/test_0009_model_tracker.py: 9 tests renamed.
- tests/test_0010_snapshot_disk_format.py: 21 tests: replace uw.checkpoint.write_snapshot / read_snapshot with model.save_state / model.load_state at user-style call sites; keep write_snapshot_skeleton + read_snapshot_metadata where the test is specifically exercising the lower-level entry points.
- tests/parallel/ptest_0007_snapshot_inmemory.py: np-1/3/4 ptest.
- tests/run_snapshot_backstepping_{demo,spatial}.py: demo scripts.
- docs/advanced/snapshot-restore.md: rewritten API section to show both modes; added "On-disk file layout" section and a "Choosing between paths" comparison table covering write_timestep, write_checkpoint, and save_state. Limitations section updated to reflect that on-disk is now real (was "in-memory only").

Regression: 75 single-rank tests pass (was 76, minus the deleted obsolete v1.1-scope test); MPI ptest at -np 4 still PASS with the parallel exact-reconstruction guarantee. Docs build clean with no snapshot-related warnings; the new layout + choosing-between-paths sections render.

Phase 4 (read_timestep format-aware dispatch for backward compat) becomes a nice-to-have at this point: save_state / load_state is the recommended surface, and write_timestep / read_timestep keep their existing role unchanged. Phase 6 (parallel HDF5 / per-rank sidecars for on-disk MPI) is the remaining correctness item.

Underworld development team with AI support from Claude Code (https://claude.com/claude-code)
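The argument-type dispatch described for `load_state` can be sketched as a small classifier. This is an assumption-laden illustration: `Snapshot` here is an empty stand-in class, `classify_source` is a hypothetical helper, and the real method performs the restore rather than returning a label.

```python
import os


class Snapshot:
    """Hypothetical stand-in for the in-memory snapshot token type."""


def classify_source(source):
    """Decide which restore path a load_state-style call would take."""
    if isinstance(source, Snapshot):
        return ("memory", source)
    if isinstance(source, (str, os.PathLike)):
        # os.fspath normalises pathlib.Path and other path-like objects.
        return ("disk", os.fspath(source))
    raise TypeError(
        "load_state expects a Snapshot or a str/os.PathLike path, "
        f"got {type(source).__name__}"
    )
```

Checking for the token type first keeps the in-memory path independent of how paths are represented, and the explicit `TypeError` avoids falling through to a confusing file-system error for unsupported inputs.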

lmoresi added 14 commits May 11, 2026 21:04

Revert "checkpoint: swarm coverage + _population_generation counter (…

6d1b635

…PR 2)" This reverts commit 001f961.

Copilot AI review requested due to automatic review settings May 19, 2026 05:28

Copilot started reviewing on behalf of lmoresi May 19, 2026 05:28

Copilot AI reviewed May 19, 2026


This was referenced May 19, 2026

Model.tracker — snapshot-managed run state (stacked on #195) #196

Merged

Add PETSc DMPlex checkpoint reload for mesh variables #146

Merged

On-disk snapshot toolkit v1.1 (stacked on #195, #196) #198

Merged

lmoresi merged commit cd0f252 into development May 20, 2026
1 check passed

lmoresi mentioned this pull request May 20, 2026

docs: snapshot toolkit — CHANGES entry + current API + toctree #199

Open

1 task

lmoresi merged 15 commits into development from feature/in-memory-checkpoint
