In-memory snapshot toolkit (git stash for timesteps)#195

Merged
lmoresi merged 15 commits into development from feature/in-memory-checkpoint on May 20, 2026
lmoresi (Member) commented May 19, 2026

Summary

Adds Model.snapshot() / Model.restore() — a unitary, in-memory "hold that thought, I might need to come back" mechanism for timesteppers (backtrack-on-instability, adaptive Δt retry, RK staging, predictor-corrector probes). Distinct from the existing per-variable write_timestep path, which is unchanged.

Captured & restored: mesh coordinates, mesh-variable DOFs, swarm particle positions + swarm-variable data (rebuild-on-restore semantics), and solver-internal Python state for all five DDt flavors via a new state-as-dataclass contract.

Design highlights

  • Rebuild-on-restore for swarms (not refuse-on-mutation). An earlier counter-as-gate design was reverted mid-branch (commits 001f961, 6d1b635, 9399f87) once it was clear that "particles moved" is exactly what restore exists to undo. Per-rank capture + per-rank rebuild = exact global reconstruction.
  • State-as-dataclass Snapshottable contract (src/underworld3/checkpoint/state.py) — option (B) derived-dataclass adapters for the retrofitted DDt flavors; documented in docs/developer/guides/state-as-dataclass.md for new code.
  • v1.2-forward-compat baked in: name-keyed payloads, reserved topology slot in the mesh payload, _schema_version on every State dataclass. Mesh-adapt rebuild (v1.2) and on-disk backend (v1.1) can land without a schema bump.

Correctness — proven, serial + parallel + real solver

  • Serial: 24 tests, exact roundtrip; bit-identical continuation (np.array_equal, zero tolerance).
  • Parallel (tests/parallel/ptest_0007_snapshot_inmemory.py, np 1/3/4): a disruptive step that loses 28–35 particles across ranks is fully recovered by restore — exact global reconstruction (gather + sort by per-particle gid) and bit-identical continuation.
  • Real solver (tests/test_0008, AdvDiffusion with internal SemiLagrangian DDt): discarding a regretted step is bit-exact even through real PETSc solves (B == C, max|d| = 0.0). Recovering to a never-snapshotted control is within solver tolerance (~7e-7) — a documented, characterised restore floor (gvec→lvec resync amplification), by design, not step contamination.

Scope (intentional)

In-memory, same-rank-count, whole-state stash/pop. Out of scope by design: on-disk durability (v1.1 — the mitigation for the accepted in-memory memory cost), selective per-field restore, cross-rank-count restore, mesh-adapt rebuild (v1.2, refused with a clear error in v1).

Notes for the reviewer

  • Commit b179183 (Lagrangian UWSwarm typo fix) duplicates already-merged fix(ddt): Lagrangian.__init__ — uw.swarm.UWSwarm typo #184; git resolves it cleanly on merge — no action needed.
  • History intentionally retains the reverted counter-as-gate pair + design-correction commit; it documents the design journey and the revert commit explains why.

Test plan

  • pixi run -e amr-dev pytest tests/test_0007_snapshot_inmemory.py tests/test_0008_snapshot_realsolver.py (27 tests)
  • cd tests/parallel && mpirun -np 4 python ./ptest_0007_snapshot_inmemory.py (and np 1, 3)
  • Regression: pytest tests/test_0000_imports.py tests/test_0002_model.py tests/test_0003_save_load.py tests/test_1052_ddt_set_initial_history.py

Underworld development team with AI support from Claude Code

lmoresi added 14 commits May 11, 2026 21:04
Spun off from the 2026-05-11 deformable-surface design discussion as a
self-contained UW3 capability. Covers motivation (backtrack-on-failure,
adaptive Δt retry, RK staging, predictor-corrector probes, crash recovery,
bisection, debugging captures), the two existing on-disk paths
(write_checkpoint and write_timestep), the state-as-dataclass
serialisation contract for solver-internal Python state, the three-backend
story (in-memory + on-disk-full-state + existing write_timestep unchanged),
schema versioning, the swarm population-generation counter, eight
architectural work items in dependency order, scope boundaries, and open
implementation questions.

Baseline-of-record for the feature/in-memory-checkpoint branch.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
First true unitary checkpoint in UW3. Captures mesh coordinates and
mesh-variable global-vector DOFs across every registered mesh into a
plain-Python token; restores back onto the same Model instance within
the same process, bit-equivalent. Distinct from the existing per-variable
write_timestep/read_timestep path, which continues to serve visualisation
and partial restart unchanged.

What lands:
- src/underworld3/checkpoint/ — new module
  - backend.py: CheckpointBackend Protocol + InMemoryBackend (eager
    copy on save and load; tokens hold numpy data only, never PETSc
    handles, so DM-lifecycle hazards do not apply)
  - snapshot.py: Snapshot dataclass + snapshot()/restore() routines.
    Restore order: mesh coords via _deform_mesh() → MV gvec write +
    globalToLocal sync. Within-process invalidation gate:
    _mesh_version mismatch raises SnapshotInvalidatedError before any
    write happens.
- src/underworld3/model.py — Model.snapshot()/restore() thin delegates
- tests/test_0007_snapshot_inmemory.py — 6 tier-A level-1 tests:
  scalar/vector MV roundtrip, snapshot independence from later writes,
  _mesh_version invalidation, type rejection, NotImplementedError on
  path= (v1.1 scope).
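
The eager-copy behaviour described for InMemoryBackend can be illustrated with a toy stand-in. This is a sketch only, not the code in backend.py: the class name ToyInMemoryBackend, the save/load method names, the integer token, and the list-based payload are all assumptions made for illustration (the commit message says the real tokens hold numpy data).

```python
import copy


class ToyInMemoryBackend:
    """Hypothetical stand-in: copies eagerly on both save and load."""

    def __init__(self):
        self._store = {}
        self._next_token = 0

    def save(self, payload):
        # Copy on save: later writes to the live data cannot reach the stash.
        token = self._next_token
        self._next_token += 1
        self._store[token] = copy.deepcopy(payload)
        return token

    def load(self, token):
        # Copy on load: the caller cannot mutate the stash through the result.
        return copy.deepcopy(self._store[token])


live = {"T": [1.0, 2.0, 3.0]}
backend = ToyInMemoryBackend()
token = backend.save(live)

live["T"][0] = 99.0            # scribble on the live field after the snapshot
restored = backend.load(token)
restored["T"][1] = -1.0        # scribble on the loaded copy
again = backend.load(token)    # a second load still sees the original values
```

Copying on both sides is what makes the "snapshot independence from later writes" test meaningful: neither the live data nor a previously loaded copy shares storage with the stash.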

Design open-question resolutions:
- Q1 module location: src/underworld3/checkpoint/ (top-level sibling,
  room to grow into swarm + state-as-dataclass + on-disk in v1.1+)
  rather than the persistence.py stub.
- Q2 PETSc API for in-memory capture: Vec.array (numpy view) +
  subdm.createSubDM + globalToLocal is sufficient. No Viewer needed.
- Q3 restore order verified empirically: mesh._deform_mesh first
  (rebuilds coord caches + callbacks), then per-var gvec write +
  globalToLocal sync; _stale_lvec flagged so downstream caches refresh.
- Q4 memory budget: not yet measured; deferred to a later PR with a
  realistic coupled-physics setup.

Deviation from PR 1 plan: the plan also mentioned refactoring
Mesh.write_checkpoint() to call the new protocol. Skipped here because
that path is write-only (no read_checkpoint exists), so refactoring
without an exercising load-half is risk without value. The HDF5
backend lands with v1.1 on-disk full-state, where it is load-bearing
and the protocol shape can be validated against both backends.

Not yet covered (subsequent PRs):
- swarm coverage + _population_generation counter (PR 2)
- state-as-dataclass contract + DDt retrofit (PR 3)
- parameter mutation history + CI check (PR 4)
- on-disk full-state backend (v1.1; PR 5)
- schema versioning + migration registry (PR 6)
- cross-process restore + broader test suite (PR 7)

See docs/developer/design/in_memory_checkpoint_design.md for the full
roadmap.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Extends the unitary snapshot to capture per-rank swarm positions and
user swarm-variable data. Adds Swarm._population_generation, a counter
bumped at every particle-population mutation site, used as the
within-process invalidation gate: restoring a snapshot taken before a
populate / migrate / add_particles / remesh event raises
SnapshotInvalidatedError rather than silently corrupting a now-stale
position array.

Counter init + bump sites (src/underworld3/swarm.py):
- Swarm.__init__: initialise to 0 next to _mesh_version.
- populate(): bump once at the end (covers the 1-3 internal addNPoints
  calls in the populate body).
- Swarm.migrate() after the migration_disabled early-exit: bump
  unconditionally; conservative even when migrate is a no-op, because
  under-bumping risks silent corruption while over-bumping is safe.
- add_particles_with_coordinates() after its direct self.dm.migrate():
  this path doesn't go through Swarm.migrate so we bump explicitly.
- add_particles_with_global_coordinates() right after addNPoints:
  catches the migrate=False case too; the migrate=True path will
  double-bump via Swarm.migrate, which is fine.
- advection() remesh path after the addNPoints reinjection.

Snapshot extensions (src/underworld3/checkpoint/snapshot.py):
- New fields: swarm_keys, swarm_generations, swarm_mesh_versions,
  swarmvar_names.
- _capture_swarm: reads DMSwarmPIC_coor via dm.getField → copy →
  restoreField; iterates swarm.vars excluding DMSwarm* internals;
  records both _population_generation and _mesh_version.
- _restore_swarm: validates both counters before any write; writes
  back positions + per-var data in place. Deliberately bypasses
  populate/add_particles/migrate so the restore itself does not bump
  the counter or mutate the population we just confirmed stable.

Test coverage (tests/test_0007_snapshot_inmemory.py): 5 new tests on
top of the 6 mesh-only tests:
- swarm positions + user-variable roundtrip after scribble
- counter bumps on populate, migrate, add_particles_with_coordinates,
  add_particles_with_global_coordinates (in monotonic order)
- migrate-between-snapshot-and-restore raises SnapshotInvalidatedError
- add_particles-between-snapshot-and-restore raises likewise
- DMSwarm_* internal variables stay out of the captured key set

Not yet covered: cross-process restore (v1.1), advection remesh
invalidation test (needs a recycle-enabled swarm + a velocity field,
larger setup than belongs in the core 0007 file).

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
The earlier design draft proposed Swarm._population_generation as an
invalidation gate: counter mismatch between capture and restore would
raise SnapshotInvalidatedError. That is wrong. The whole point of the
toolkit is to undo intervening state changes — including particle
motion, migration, and repopulation. Refusing on counter mismatch
breaks the central use cases (RK staging, backtrack-on-instability,
adaptive Δt retry, all of which migrate particles between capture and
restore).

Corrected swarm semantics:
- Restore rebuilds the swarm's local population: clear current
  particles, re-add at captured per-rank coords via
  add_particles_with_coordinates(..., migrate=False), write captured
  per-variable data back into the new particles in order.
- The _population_generation counter stays as informational metadata
  (logging, cache invalidation in other consumers, possible future
  fast-path optimisations), but it is not a restore gate.

Mesh-adapt scope boundary reframed:
- v1 keeps _mesh_version mismatch as a refusal, because the captured
  DOF arrays don't fit a different DM's section.
- v1.2 will replace the refusal with a mesh-rebuild path on the same
  rebuild-on-restore principle: destroy the post-adapt DM, rebuild the
  pre-adapt one from captured topology + section, re-bind all
  MeshVariable / Swarm / solver wrappers.
- v1 captures the topology / section info even though v1 restore
  ignores it, so the snapshot payload is forward-compatible with v1.2
  without a schema bump.

Architectural-work item 5 updated to match: snapshot captures per-rank
particle coords + per-var arrays; restore clears + re-adds + writes;
counter is informational, not a gate.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
…2 redo)

Replaces the reverted counter-as-gate PR with rebuild-on-restore
semantics for swarms. Restore now succeeds across the cases the earlier
design wrongly refused (migrate, add_particles, repopulate between
snapshot and restore) — these are precisely the cases the snapshot
toolkit exists to enable (RK staging, backtrack on instability,
adaptive Δt retry).

Design changes since the reverted PR:
- Swarm._population_generation stays, but is now purely informational:
  bumped at every population-mutation site for logging / debugging /
  downstream caches, but NOT consulted by restore. Restore rebuilds
  the local population from the snapshot regardless of intervening
  mutations.
- Snapshot is keyed by stable name (mesh.name for meshes,
  f"swarm_{instance_number}" for swarms), not Python id(). Forward-
  compat for v1.1 cross-process restore and v1.2 mesh-rebuild after
  mesh.adapt() (where the wrapper survives but its DM is destroyed).
- Restore logic moved off the snapshot module and onto wrapper
  methods: Mesh.apply_snapshot_payload and Swarm.apply_snapshot_payload.
  v1 implementations write back in place; v1.2's Mesh implementation
  can switch to rebuild-from-payload without touching snapshot.py.
- Snapshot payloads include a reserved "topology": None slot on the
  mesh side, populated in v1.2 with section/DM-topology data
  sufficient to rebuild the DM. v1 leaves it None; the schema doesn't
  need to bump when v1.2 lands.

Mesh restore (Mesh.apply_snapshot_payload at
discretisation_mesh.py:2570):
- Verify _mesh_version matches (v1 refusal; v1.2 will rebuild here).
- Write coords via _deform_mesh, write per-MV gvec arrays + sync to
  local vec.

Swarm restore (Swarm.apply_snapshot_payload at swarm.py:4084):
- Drop every current local particle via dm.removePoint() (O(N) total,
  removes-from-end is O(1) per call).
- addNPoints(n_saved), write coords directly to DMSwarmPIC_coor, set
  ranks. Deliberately bypasses add_particles_with_coordinates (which
  filters via points_in_domain and triggers migrate — both
  unnecessary here since saved coords were local at capture and the
  mesh hasn't changed).
- Invalidate _canonical_data caches so subsequent var.data accesses
  re-resolve from PETSc.
- Write captured per-variable data back in particle-order. Internal
  DMSwarm_* variables are filtered out at capture.
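
The clear, re-add, write-back sequence above can be sketched on a toy per-rank population. The method names snapshot_payload and apply_snapshot_payload come from this PR, but everything else here (ToySwarm, the list-based storage, the "material" variable, the payload layout) is a hypothetical stand-in with no PETSc involved.

```python
class ToySwarm:
    """Hypothetical stand-in for one rank's local particle population."""

    def __init__(self, coords, material):
        self.coords = [tuple(c) for c in coords]
        self.material = list(material)

    def snapshot_payload(self):
        # Capture copies of local coordinates and per-variable data.
        return {"coords": list(self.coords),
                "vars": {"material": list(self.material)}}

    def apply_snapshot_payload(self, payload):
        # 1. Drop every current local particle (remove-from-end).
        while self.coords:
            self.coords.pop()
            self.material.pop()
        # 2. Re-add at the captured coordinates, in captured order.
        self.coords.extend(payload["coords"])
        # 3. Write captured variable data back in particle order.
        self.material.extend(payload["vars"]["material"])


swarm = ToySwarm([(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)], [10, 20, 30])
payload = swarm.snapshot_payload()

# A disruptive step: one particle leaves, two are added, the rest move.
swarm.coords = [(0.2, 0.3), (0.7, 0.7), (0.4, 0.4), (0.6, 0.1)]
swarm.material = [10, 30, 40, 50]

swarm.apply_snapshot_payload(payload)
```

Because restore never inspects what the population looked like before the rebuild, it succeeds regardless of how many particles moved, left, or arrived in between.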

Tests (11 total):
- 6 mesh-only tests preserved.
- 5 swarm tests, including the critical positive tests
  test_swarm_restore_after_migrate and test_swarm_restore_after_add_particles.
  Those are the cases the reverted PR wrongly raised on; they're now
  the central proof that the design works.

Regression: 35 existing core tests pass unchanged.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Introduces the state-as-dataclass serialisation contract from the
design note and applies it to the canonical DDt flavor (Symbolic).
PR 4 will mechanically extend the same pattern to Eulerian,
SemiLagrangian, Lagrangian, and Lagrangian_Swarm — they share the
same dt_history / history_initialised / n_solves_completed / dt
core; the variation is purely in how psi_star is bound.

New infrastructure:
- src/underworld3/checkpoint/state.py:
  - SnapshottableState dataclass base with _schema_version field
    (load-bearing for v1.1 on-disk migration; checked for strict
    equality in v1 since capture and restore are same-process).
  - Snapshottable runtime_checkable protocol — anything with a
    .state attribute returning a SnapshottableState. Drives
    discovery in Model.snapshot().
- Model._state_bearers (WeakSet) + Model._register_state_bearer():
  state-bearing helpers self-register on construction without
  pinning their lifetime.

DDt Symbolic retrofit (option (B)-style adapter per design note):
- DDtSymbolicState dataclass with dt_history, history_initialised,
  n_solves_completed, dt, psi_star.
- Symbolic gets a `state` property (builds the dataclass from the
  existing private attrs on read) and a `state.setter` (unpacks,
  validates schema version + dt_history length, writes attrs back,
  re-derives BDF/AM coefficient values so downstream reads see the
  restored state immediately rather than waiting for the next
  update_pre_solve).
- Symbolic.__init__ auto-registers with the default model.

Snapshot/restore wiring:
- Snapshot.state_bearers: list of (stable_key, state_dataclass)
  with stable_key = f"{type(obj).__name__}_{obj.instance_number}".
- snapshot() iterates Model._state_bearers, deepcopies obj.state,
  stores the copy. Deepcopy isolates the snapshot from later
  mutation of the live state-bearer.
- restore() matches captured states to current state-bearers by
  stable_key, deepcopies, writes via obj.state setter. Missing
  state-bearer → SnapshotInvalidatedError.
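
A minimal sketch of the contract and wiring described above, using only the standard library. SnapshottableState, Snapshottable, _schema_version, and the stable-key format are named in this commit message; ToyDDt, ToyDDtState, the registry variable, and the reduced field set are assumptions for illustration and do not reproduce the code in state.py or ddt.py.

```python
import copy
import weakref
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class SnapshottableState:
    _schema_version: int = 1


@dataclass
class ToyDDtState(SnapshottableState):
    dt_history: list = field(default_factory=list)
    n_solves_completed: int = 0


@runtime_checkable
class Snapshottable(Protocol):
    @property
    def state(self) -> SnapshottableState: ...


class ToyDDt:
    """Option-(B)-style adapter: the private attrs stay authoritative."""
    _count = 0

    def __init__(self, registry):
        self.instance_number = ToyDDt._count
        ToyDDt._count += 1
        self._dt_history = [0.0, 0.0]
        self._n_solves = 0
        registry.add(self)  # self-register; a WeakSet does not pin lifetime

    @property
    def state(self):
        # Build the dataclass from the private attrs on every read.
        return ToyDDtState(dt_history=list(self._dt_history),
                           n_solves_completed=self._n_solves)

    @state.setter
    def state(self, s):
        # Validate before writing anything back.
        if s._schema_version != 1:
            raise ValueError("schema version mismatch")
        if len(s.dt_history) != len(self._dt_history):
            raise ValueError("dt_history length mismatch")
        self._dt_history = list(s.dt_history)
        self._n_solves = s.n_solves_completed


def stable_key(obj):
    return f"{type(obj).__name__}_{obj.instance_number}"


registry = weakref.WeakSet()
ddt = ToyDDt(registry)

# Capture: deep-copy each bearer's state under its stable key.
captured = [(stable_key(o), copy.deepcopy(o.state)) for o in registry]

ddt._dt_history = [0.1, 0.2]   # live state moves on after the snapshot
ddt._n_solves = 2

# Restore: match by stable key and write through the setter.
by_key = {stable_key(o): o for o in registry}
for key, state in captured:
    by_key[key].state = copy.deepcopy(state)
```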

Drive-by fix in Mesh.snapshot_payload (caught by the DDt tests):
mesh variables with _gvec=None (lazy allocation: var created but
never written to) are now skipped during capture rather than
crashing on var._gvec.array. Restore correspondingly only touches
variables present in payload["vars"], so an unallocated-at-capture
variable is left in its current state.

Tests (6 new on top of the 11 mesh+swarm ones):
- DDt auto-registers in Model._state_bearers
- .state returns a SnapshottableState with correct schema version
- mid-trajectory snapshot+restore recovers dt_history,
  history_initialised, n_solves_completed, dt
- wrong _schema_version on apply raises ValueError
- dt_history length mismatch on apply raises ValueError
- snapshot is a deep copy: scribbling live DDt internals doesn't
  leak into the captured state-bearer payload

17/17 snapshot tests pass; 41 existing core + DDt tests still green.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
…agrangian/Lagrangian_Swarm) (PR 4)

Mechanically extends the PR 3 Snapshottable contract from Symbolic to
the other four DDt flavors. Each gets:

- A flavor-specific State dataclass inheriting from a new
  _DDtCoreState base that carries the shared dt_history /
  history_initialised / n_solves_completed / dt fields.
- A .state property building the dataclass on read and a .state.setter
  unpacking on write, re-deriving BDF/AM coefficients so post-restore
  reads are consistent without waiting for the next update_pre_solve.
- Self-registration with the default model in __init__ (try/except
  for safety when no model is active).

State dataclasses (in src/underworld3/systems/ddt.py):
- _DDtCoreState: shared fields. Subclasses add psi_star
  representation specific to the flavor.
- DDtSymbolicState (already present, refactored to inherit base):
  psi_star is a list of sympy expressions.
- DDtEulerianState: psi_star is a list of MeshVariables; State
  carries their clean_names for restore-side verification (the
  actual DOF arrays travel via the mesh-variable snapshot path).
- DDtSemiLagrangianState: same as Eulerian plus optional
  forcing_star_var_name and with_forcing_history flag for ETD-2
  Maxwell-relaxation integration.
- DDtLagrangianState / DDtLagrangianSwarmState: psi_star is a list
  of SwarmVariables on the DDt's swarm; data travels via the
  swarm-variable path.
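
The shared-base layout can be sketched with plain dataclass inheritance. The core field names and the two SemiLagrangian extras are taken from the text above; the Toy class names and the psi_star_names field are hypothetical, and the real classes in ddt.py may differ in detail.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ToyCoreState:
    # Shared core named in the commit message.
    dt_history: List[float] = field(default_factory=list)
    history_initialised: bool = False
    n_solves_completed: int = 0
    dt: Optional[float] = None
    _schema_version: int = 1


@dataclass
class ToyEulerianState(ToyCoreState):
    # Names only; the DOF arrays travel via the mesh-variable snapshot path.
    psi_star_names: List[str] = field(default_factory=list)


@dataclass
class ToySemiLagrangianState(ToyEulerianState):
    forcing_star_var_name: Optional[str] = None
    with_forcing_history: bool = False


state = ToySemiLagrangianState(dt_history=[0.1, 0.1], dt=0.1,
                               psi_star_names=["psi_star_0", "psi_star_1"])
core_names = {f.name for f in fields(ToyCoreState)}
state_names = {f.name for f in fields(state)}
```

Keeping bulk data out of the State and carrying only names keeps the dataclass small and lets the restore side verify that the same variables are still bound.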

Notes:
- SemiLagrangian's update_pre_solve hardcodes theta=0.5 (the class
  doesn't accept a theta arg in __init__), so the state setter
  matches that — not self.theta which doesn't exist.
- Lagrangian itself has a pre-existing AttributeError in __init__
  (references uw.swarm.UWSwarm which doesn't exist). The retrofit
  code is in place and follows the same pattern; consumers that
  construct Lagrangian via the higher-level solver pathways will get
  .state / .state.setter / registration automatically. The
  pre-existing bug is out of scope for this PR but worth flagging.
- ParameterRegistry retrofit deferred. The class isn't currently
  wired into Model anywhere in core code (only mentioned in a
  docstring example). Retrofitting now would be dead code; the
  retrofit lands together with the real consumer in a follow-up.

Tests (3 new on top of PR 3's 6 DDt tests, for 20 total):
- Eulerian DDt roundtrip via manual primary-state mutation
- SemiLagrangian DDt roundtrip
- Lagrangian_Swarm DDt registration + state-type check (no
  roundtrip — advection needs a velocity-field setup beyond a core
  unit test)

Roundtrips on Eulerian/SemiLagrangian exercise the .state property
and .state.setter directly rather than running full projection
solves: the BDF/AM coefficient re-derivation happens in the setter,
so manual primary-state mutation is sufficient to validate the
retrofit logic.

20/20 snapshot tests pass; 41 existing core + DDt tests green.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Companion to the snapshot toolkit design note (PR 0) and the
Snapshottable / DDt retrofit implementation (PRs 3, 4). Audience is
developers adding new solver-internal helper classes; the guide
covers what goes in a State dataclass, when to use option (B)
adapter vs option (C) authoritative-store, what NOT to capture
(PETSc handles, bulk arrays already carried by mesh-var / swarm-var
paths), how schema versioning is intended to work in v1.1, and a
minimal roundtrip test pattern.

Closes the doc gap noted in PR 3's commit message; rounds out v1 of
the snapshot toolkit.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
The Lagrangian DDt flavor has been unconstructible since 2025-07-07
(commit 0778b7d, "Fix uw.function.evaluate / eliminate evalf"), which
typo'd uw.swarm.Swarm(mesh) as uw.swarm.UWSwarm(mesh) while editing
nearby code. UWSwarm does not exist — never has — so every direct
construction of Lagrangian since that commit has died with
AttributeError. Higher-level solver pathways that wrap Lagrangian
were presumably also broken; consumers may have silently been using
Lagrangian_Swarm or another flavor as a workaround.

One-line fix: revert that line to uw.swarm.Swarm(mesh). No other UWSwarm
references exist in the tree.

This bug surfaced during the snapshot toolkit work (PR 4) when the
state-as-dataclass retrofit included a Lagrangian roundtrip test
that couldn't run. With the fix in place, the test now runs and
passes — included in the same commit so the bug fix and the proof
it works land together.

Should be cherry-picked to development; the typo is unrelated to
the snapshot feature work it surfaced from.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Every previous snapshot test is unit-style — build a thing, snapshot,
scribble, restore, check equality. None exercise the actual use case
that motivated the toolkit: detect a bad step, snap back, retry.
This commit adds one focused end-to-end test covering the canonical
adaptive-Δt CFL workflow.

The test simultaneously exercises all three captured state surfaces
in one realistic story:

- A swarm with an outward-radial velocity field carries particles
  outward at known speeds.
- A material variable on the swarm carries a per-particle marker
  (initial x-coord), so we can prove particle identity is recovered,
  not just particle count.
- A Symbolic DDt accumulates BDF history (manually advanced past
  startup), so the state-bearer / state-as-dataclass path also gets
  exercised.

The flow:
  1. Snapshot before the speculative step.
  2. Take a candidate Δt = 0.5 → max displacement ~0.27, ~6× the
     cell radius. CFL violated; the consumer's check trips.
  3. model.restore(snap): particle positions, material data, and
     DDt history all roll back to the snapshot point.
  4. Retry at Δt = 0.05 → max displacement ~0.033, sub-cell. CFL
     satisfied; state evolves cleanly.

The Δt and threshold values come from a probe run on the same mesh
(min_radius ≈ 0.044; |V| ≤ 0.71 at corners), so the CFL violation
on the candidate step is a real physical observation rather than a
parameter tweak. Smaller dt → strictly smaller displacement gives a
robust ratio-based assertion that doesn't rely on exact numbers.
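
The snapshot, check, restore, retry flow above reduces to a short loop. The sketch below uses a toy one-dimensional model in place of Model; ToyModel, advance_with_retry, the constant speed of 0.5, and the shrink factor are all assumptions for illustration, and only the 0.044 cell radius and the 0.5 to 0.05 step sizes echo the numbers in this commit message.

```python
import copy


class ToyModel:
    """Hypothetical stand-in for a model with snapshot()/restore()."""

    def __init__(self, positions):
        self.positions = list(positions)
        self.time = 0.0

    def snapshot(self):
        return copy.deepcopy((self.positions, self.time))

    def restore(self, snap):
        self.positions, self.time = copy.deepcopy(snap)

    def step(self, dt, speed=0.5):
        # Advect every particle and report the largest displacement.
        self.positions = [x + speed * dt for x in self.positions]
        self.time += dt
        return abs(speed * dt)


def advance_with_retry(model, dt, cell_radius, shrink=0.1, max_tries=5):
    """Take one accepted step, backing off dt while the CFL check trips."""
    for _ in range(max_tries):
        snap = model.snapshot()          # stash before the speculative step
        displacement = model.step(dt)
        if displacement <= cell_radius:  # sub-cell motion: accept
            return dt
        model.restore(snap)              # regretted step: roll back
        dt *= shrink
    raise RuntimeError("no acceptable dt found")


model = ToyModel([0.0, 0.25, 0.5])
accepted_dt = advance_with_retry(model, dt=0.5, cell_radius=0.044)
```

The first attempt moves particles 0.25, well past the cell radius, so it is rolled back; the retry at a tenth of the step moves them 0.025 and is accepted, leaving the model's clock at the accepted step only.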

This is the test pattern that consumers (RK staging, adaptive Δt,
predictor-corrector, regime-change feeling-out) will adapt. v1 of
the snapshot toolkit is genuine, not just unit-tested.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Every prior test proved *state equality after restore*. That is
necessary but not the guarantee a backtracking consumer actually
relies on. The real guarantee — "git stash for steps": a discarded
speculative step leaves zero trace after restore + continuation —
was untested. This commit adds it, asserted bit-for-bit
(np.array_equal, no tolerance).

Two tests, both with a live swarm + driven mesh variable + Symbolic
DDt so the mesh -> swarm -> state-bearer restore ordering is
exercised together:

- test_continuation_deterministic_after_restore:
    snapshot S -> K steps -> A;  restore(S) -> K steps -> B.
    A == B bit-for-bit. Proves restore leaves no residual state that
    perturbs subsequent evolution.

- test_continuation_bit_identical_across_stash_and_recover:
    control: S -> K good steps                          -> ctrl
    stash:   S -> disruptive 10x-dt step -> restore(S)
               -> same K good steps                     -> stash
    ctrl == stash bit-for-bit. The regretted step leaves no trace.
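
The control-versus-stash comparison can be sketched with a deterministic toy update. The step function, the dictionary state, and the values of K and dt below are hypothetical; the point is only the shape of the assertion, which uses exact equality rather than a tolerance.

```python
import copy


def step(state, dt):
    # Deterministic toy update; any residue in `state` would change the result.
    state["T"] = [t + dt * (1.0 - t) for t in state["T"]]
    state["n_steps"] += 1


def run(state, n, dt):
    for _ in range(n):
        step(state, dt)
    return state


initial = {"T": [0.0, 0.3, 0.9], "n_steps": 0}
K, dt = 5, 0.01

# control: S -> K good steps
ctrl = run(copy.deepcopy(initial), K, dt)

# stash: S -> disruptive 10x-dt step -> restore(S) -> same K good steps
live = copy.deepcopy(initial)
snap = copy.deepcopy(live)        # snapshot S
step(live, 10 * dt)               # the regretted step
live = copy.deepcopy(snap)        # restore(S)
stash = run(live, K, dt)
```

Including the step counter in the compared state matters: a restore that recovered the field but left the counter advanced would pass a field-only check and still fail this one.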

This also closes the #3 concern (does Mesh.apply_snapshot_payload's
_deform_mesh call disturb a registered swarm before the swarm
restore runs?). Both tests have a swarm and a mesh variable live
through restore; the mesh restore calls _deform_mesh with unchanged
coords, then swarm restore, then DDt restore. If the mesh restore
perturbed the swarm, continuation would not be bit-identical. It is.
For v1 scope this fully covers the _deform_mesh-on-restore path,
because a *deformed* mesh (bumped _mesh_version) is refused on
restore anyway — the only path that runs _deform_mesh on restore is
the same-coords path these tests now cover.

Remaining production blockers (unchanged by this commit): parallel
(MPI) is still untested and the swarm rebuild deliberately bypasses
migration; no real-solver test; memory cost unmeasured.

24 snapshot tests, 24 regression tests, all green.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
The one real production blocker was "works everywhere" — i.e. correct
under MPI. This adds a parallel ptest and confirms the design intent:
swarm restore is a per-rank reconstruction, not a redistribution, so
the global state is exactly rebuilt under cross-rank migration
provided the rank count is unchanged (the documented v1 scope).

ptest_0007_snapshot_inmemory.py (mesh + swarm + per-particle global-id
tag + material + Symbolic DDt; rotation field that circulates
particles across the strip partition). Three collective properties,
asserted on rank 0:

  P1  restore recovers the exact global particle count. The
      disruptive step is deliberately *allowed* to lose particles
      across ranks (advect out / clip) — that is exactly the failure
      a stash-and-restore exists to undo.
  P2  exact reconstruction: gather (gid, x, y, material) from every
      rank, sort by global id, np.array_equal pre-step vs
      post-restore. Order- and rank-independent — the real proof
      that per-rank reconstruction yields the correct global state.
  P3  bit-identical continuation across a stash, in parallel.
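
The P2 check (gather, sort by global id, compare exactly) can be sketched without MPI by letting nested lists stand in for ranks. The gather_sorted helper and all particle values are hypothetical, and the in-rank reordering in after_restore is contrived purely to show why sorting by gid makes the comparison order-independent; the real ptest gathers across ranks.

```python
def gather_sorted(per_rank):
    """Concatenate every rank's (gid, x, y, material) rows, sorted by gid."""
    rows = [row for rank_rows in per_rank for row in rank_rows]
    return sorted(rows, key=lambda row: row[0])


# Before the step: three "ranks", each owning some particles.
before = [
    [(0, 0.1, 0.2, 1.0), (3, 0.2, 0.8, 2.0)],
    [(1, 0.5, 0.5, 1.0)],
    [(2, 0.9, 0.1, 3.0), (4, 0.8, 0.7, 2.0)],
]

# After restore: same particles, with a different local order on two ranks.
after_restore = [
    [(3, 0.2, 0.8, 2.0), (0, 0.1, 0.2, 1.0)],
    [(1, 0.5, 0.5, 1.0)],
    [(4, 0.8, 0.7, 2.0), (2, 0.9, 0.1, 3.0)],
]

# After the disruptive step, without restore: particle 4 was lost and
# particle 3 migrated to another rank.
after_bad_step = [
    [(0, 0.3, 0.2, 1.0)],
    [(1, 0.6, 0.4, 1.0), (3, 0.4, 0.7, 2.0)],
    [(2, 0.8, 0.3, 3.0)],
]

recovered = gather_sorted(before) == gather_sorted(after_restore)
```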

Results:
  -np 1 : 2052 particles, all properties pass.
  -np 3 : 2013 particles; disruptive step loses 28 across ranks;
          restore recovers all 2013 exactly; P2/P3 bit-for-bit.
  -np 4 : 2006 particles; disruptive step loses 35 across ranks;
          restore recovers all 2006 exactly; P2/P3 bit-for-bit.

The genuinely strong result: the toolkit demonstrably recovers from
real cross-rank particle loss — the exact production scenario it
exists for — with bit-identical continuation afterwards.

Registered in mpi_runner.sh at -np 1 / 3 (uneven) / 4.

Production-blocker status: parallel correctness now confirmed (was
the gate). Remaining items are confidence/hardening only — a
real-solver test (SNES state is negligible: previous solution
travels via MeshVariable, already captured) and the accepted,
documented in-memory memory cost (mitigation: route through the
v1.1 on-disk backend via a flag).

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
…loor characterised

Closes the last confidence gap: snapshot/restore driven by an actual
PETSc solver (AdvDiffusion, which carries an internal SemiLagrangian
DDt with an auxiliary projection SNES + nodal trace-back swarm),
through the stash-and-recover loop.

Investigation findings (each verified by a standalone diagnostic):

  * AdvDiffusion solve is bit-deterministic — two independent
    identical runs with no snapshot are np.array_equal (max|d| = 0.0).
    So any drift introduced by snapshot/restore is a real fidelity
    question, not solver noise.

  * restore() recovers the primary solution field T bit-exactly.

  * THE core "git stash for steps" guarantee holds bit-for-bit even
    through real solves:
        restore -> regretted absurd-dt solve -> restore -> K solves
    is np.array_equal to
        restore -> K solves
    The discarded step leaves zero trace (B == C, max|d| = 0.0).

  * The only residual is restore's reproducibility floor against a
    *never-snapshotted* control: ~7e-7 here. Mechanism: restore
    resyncs fields through gvec->lvec rather than reproducing the
    solver-produced lvec exactly; the implicit diffusion operator
    amplifies that to solver-tolerance level over steps. This is NOT
    contamination from the discarded step (proven by B == C), it is
    the cost of round-tripping through the snapshot representation,
    within solver tolerance, and consistent with the design intent
    that auxiliary solver state is intentionally not captured.

Three tests encode exactly this (no overclaiming):
  - test_realsolver_restore_recovers_solution_field      (T np.array_equal)
  - test_realsolver_regretted_step_leaves_no_trace       (B == C, bit-exact)
  - test_realsolver_continuation_within_solver_tolerance (vs never-stashed
       control: < 1e-5, asserted explicitly non-bit-exact so the test
       tightens itself if the floor is ever eliminated)
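
One reading of "asserted explicitly non-bit-exact" is a two-sided check: the drift must stay below the tolerance and must also be non-zero, so that the test fails and asks to be upgraded if the floor ever disappears. The sketch below follows that reading; the helper names and the sample values are hypothetical, and the actual test may express this differently.

```python
def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def check_within_floor(result, control, tol=1e-5):
    d = max_abs_diff(result, control)
    # Upper bound: the documented restore floor must stay within tolerance.
    assert d < tol, f"drift {d:g} exceeds tolerance {tol:g}"
    # Lower bound: if the floor ever vanishes this fails, prompting an
    # upgrade of the test to an exact comparison.
    assert d > 0.0, "now bit-exact: tighten this test to exact equality"
    return d


control = [1.0, 2.0, 3.0]
result = [1.0, 2.0 + 7e-7, 3.0]     # illustrative values near the ~7e-7 floor
drift = check_within_floor(result, control)
```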

Honest production statement: discarding a bad step is bit-exact even
through real solvers; recovering to a never-stashed control is within
solver tolerance. Both are correct for the "git stash for steps"
use case.

51 tests pass (24 serial snapshot + 3 real-solver + 24 regression);
parallel ptest (np 1/3/4) unchanged.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
Copilot AI review requested due to automatic review settings May 19, 2026 05:28
Contributor

Copilot AI left a comment

Pull request overview

Adds an in-memory snapshot/restore toolkit (Model.snapshot() / Model.restore()) intended as a "git stash for timesteps": a unitary state capture that lets time-stepping code back out of a regretted step (RK staging, adaptive Δt retry, predictor-corrector, etc.). It introduces a new underworld3.checkpoint subpackage with a backend abstraction (currently in-memory only; HDF5 v1.1 stubbed), a Snapshottable/SnapshottableState contract, and retrofits all five DDt flavors plus Mesh and Swarm to participate via snapshot_payload/apply_snapshot_payload and .state accessors. Swarm restore uses rebuild-on-restore semantics; mesh restore refuses on _mesh_version change (v1.2 will rebuild). Also fixes the uw.swarm.UWSwarm → uw.swarm.Swarm typo in Lagrangian.__init__ (duplicated in #184).

Changes:

  • New underworld3.checkpoint subpackage with Snapshot, InMemoryBackend, SnapshottableState/Snapshottable protocol, and capture/restore orchestration.
  • Mesh, Swarm, and all five DDt flavors gain snapshot_payload / apply_snapshot_payload / .state (option-B dataclass adapter); swarm gets an informational _population_generation counter.
  • Extensive new tests: serial unit tests, real-solver AdvDiffusion confidence test, and an MPI parallel test (np 1/3/4) plus a developer guide and design note.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/underworld3/checkpoint/__init__.py` | Public API surface for the new subpackage. |
| `src/underworld3/checkpoint/backend.py` | CheckpointBackend protocol + InMemoryBackend. |
| `src/underworld3/checkpoint/snapshot.py` | Snapshot dataclass and capture/restore orchestration over meshes, swarms, and state bearers. |
| `src/underworld3/checkpoint/state.py` | SnapshottableState base dataclass + Snapshottable protocol. |
| `src/underworld3/model.py` | Adds `_state_bearers` WeakSet, `_register_state_bearer`, and Model.snapshot()/restore() wrappers. |
| `src/underworld3/discretisation/discretisation_mesh.py` | Mesh snapshot_payload/apply_snapshot_payload capturing deformed coords + MV gvec DOFs. |
| `src/underworld3/swarm.py` | Swarm snapshot_payload/apply_snapshot_payload (rebuild-on-restore) and `_population_generation` counter bumps. |
| `src/underworld3/systems/ddt.py` | Five DDt state dataclasses + .state getter/setter retrofit; fixes UWSwarm typo. |
| `src/underworld3/__init__.py` | Imports underworld3.checkpoint. |
| `tests/test_0007_snapshot_inmemory.py` | Serial unit and end-to-end snapshot/restore tests. |
| `tests/test_0008_snapshot_realsolver.py` | AdvDiffusion real-solver confidence test. |
| `tests/parallel/ptest_0007_snapshot_inmemory.py` | MPI snapshot/restore test. |
| `tests/parallel/mpi_runner.sh` | Adds new parallel test invocations. |
| `docs/developer/guides/state-as-dataclass.md` | Developer guide for the .state contract. |
| `docs/developer/design/in_memory_checkpoint_design.md` | Design note for the snapshot toolkit. |

Comment on lines +574 to +580
# state-bearer. Safe if no model is active.
try:
    import underworld3 as _uw

    _uw.get_default_model()._register_state_bearer(self)
except Exception:
    pass
Comment thread src/underworld3/swarm.py Outdated
Comment on lines +4190 to +4193
# The clear+re-add path bumped _population_generation already
# (we don't bump on removePoint, but addNPoints isn't bumped
# either — these are raw PETSc calls). For consistency with
# other mutation paths, bump explicitly here.
Comment thread tests/test_0008_snapshot_realsolver.py Outdated
Comment on lines +43 to +45
# Restore-vs-pristine reproducibility floor for this setup (measured).
# The regretted-step guarantee is asserted bit-exact (np.array_equal);
# only the never-stashed-control comparison uses this tolerance.
Comment thread src/underworld3/swarm.py Outdated
Comment on lines +4196 to +4213
# Step 3: write captured per-variable data.
current_vars = {var.clean_name: var for var in self._vars.values()}
for var_clean_name, saved in payload["vars"].items():
    var = current_vars.get(var_clean_name)
    if var is None:
        raise SnapshotInvalidatedError(
            f"swarm {self._snapshot_stable_name()!r}: variable "
            f"{var_clean_name!r} from snapshot is not present"
        )
    current = np.asarray(var.data)
    if current.shape != saved.shape:
        raise SnapshotInvalidatedError(
            f"swarm {self._snapshot_stable_name()!r}: variable "
            f"{var_clean_name!r} data shape mismatch — current "
            f"{current.shape} vs snapshot {saved.shape}"
        )
    current[...] = saved

Four fixes for points raised in Copilot's review of the in-memory
snapshot toolkit PR:

1. ddt.py — narrow ``except Exception`` to
   ``except (ImportError, AttributeError)`` at all five DDt
   ``_register_state_bearer`` sites. Only the genuine bootstrap
   cases (import not yet wired during underworld3 init, or older
   Model without the registry method) get swallowed; real
   registration bugs now propagate instead of being swallowed,
   which would hide the silent-state-loss failure mode the design
   note explicitly warns against.

2. swarm.py — rewrite the contradictory comment in
   ``Swarm.apply_snapshot_payload`` around the explicit
   ``_population_generation += 1`` bump. The previous wording said
   the clear+re-add path had already bumped, which is wrong —
   neither ``removePoint`` nor the raw ``addNPoints`` call here
   touches the counter; the explicit bump is what makes a restore
   visible to downstream consumers as a population change.

3. swarm.py — ``apply_snapshot_payload`` now raises
   ``SnapshotInvalidatedError`` if the live swarm has user
   variables that were not in the snapshot. Previously those
   "extra" vars survived the clear+addNPoints reallocation with
   uninitialised/stale contents — silent incoherence after
   restore. Contract is now symmetric with the mesh-variable
   restore (same variable set on both sides).

4. test_0008_snapshot_realsolver — added comment explaining the
   ~14× headroom on ``_RESTORE_FLOOR_ATOL = 1e-5`` vs the measured
   ~7e-7 floor (PETSc/BLAS/MPI variability allowance on CI).
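
The symmetric variable-set contract from fix 3 can be sketched in isolation as below. This is an illustration under assumptions, not the code in `swarm.py`: `check_variable_sets` is a hypothetical helper, and `SnapshotInvalidatedError` is redefined locally so the snippet runs on its own.

```python
class SnapshotInvalidatedError(RuntimeError):
    """Local stand-in for the error type named in the commit message."""


def check_variable_sets(live_names, snapshot_names, owner="swarm"):
    """Raise unless the live object and the snapshot name the same variables."""
    live, saved = set(live_names), set(snapshot_names)
    missing = sorted(saved - live)  # in the snapshot, absent from the live object
    extra = sorted(live - saved)    # in the live object, absent from the snapshot
    if missing or extra:
        raise SnapshotInvalidatedError(
            f"{owner}: variable set differs from snapshot "
            f"(missing={missing}, extra={extra})"
        )


# Same names on both sides: order does not matter, no error is raised.
check_variable_sets(["material", "strain"], ["strain", "material"])

# A variable added after the snapshot was taken is rejected, not left stale.
try:
    check_variable_sets(["material", "strain", "age"], ["material", "strain"])
except SnapshotInvalidatedError as err:
    message = str(err)
```

Checking both directions up front means a restore either covers every variable or fails before any data is overwritten.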

Regression: 45 single-rank snapshot+core tests pass; parallel
ptest at np-4 still PASS with the new strict-extras check.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
lmoresi merged commit cd0f252 into development May 20, 2026
1 check passed
lmoresi added a commit that referenced this pull request May 20, 2026
First slice of the on-disk snapshot format (v1.1). Establishes the
file structure and the inspectability bar; no PETSc bulk yet (that
is phase 2). Stacked on the in-memory snapshot toolkit (#195) and
the model tracker (#196) so it can serialise both later.

What lands:
- src/underworld3/checkpoint/disk_snapshot.py
  - DISK_SNAPSHOT_SCHEMA_VERSION = 1
  - write_snapshot_skeleton(model, path): writes /metadata attrs +
    empty stub groups /mesh /variables /swarms /python_state (the
    structure phases 2+ will fill in).
  - read_snapshot_metadata(path): reads /metadata back as a plain
    dict, decodes JSON-encoded list fields for convenience, validates
    schema version.
  - inspect_snapshot(path): human-readable summary suitable for
    print(...) at a notebook prompt.
- src/underworld3/checkpoint/__init__.py: exports.
- tests/test_0010_snapshot_disk_format.py (7, tier_a level_1):
  - top-level group structure matches the spec
  - h5py-readable /metadata attrs cover identity, schema, tracker
    conventions, geometry, MPI rank count, and inventories of meshes /
    swarms / state-bearer classes / variables — the proxy for "an
    external user running h5ls/h5dump sees useful info"
  - read/write roundtrip
  - rejection of non-snapshot files and wrong-schema files with
    clear errors (not obscure h5py noise)
  - inspect_snapshot includes the key facts
  - skeleton groups carry `filled_by` attrs so phases 2/3 readers and
    external inspectors can tell whether content is populated yet.

Design notes encoded:
- UW3-controlled rich-metadata wrapper around PETSc bulk; pure PETSc
  HDF5 dumps fail the inspectability bar so are rejected as the
  format.
- List-typed metadata stored as JSON strings in scalar attrs so
  h5py / h5ls handle them cleanly; read API exposes them as plain
  Python lists alongside the *_json originals.
- Swarm storage left as a phase-3 decision: the metadata wrapper is
  designed to support `@external_file` on /swarms/swarm_X/ when
  individual swarms grow too bulky for a single file. No commitment
  to inline vs split until phase 3 has real swarm sizes in hand.
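
A rough sketch of the "lists as JSON strings in scalar attrs" convention from the design notes, using a plain dict in place of an HDF5 attribute set so it runs without h5py. The function names and the metadata keys (`mesh_names`, `swarm_names`) are assumptions for illustration; only the `*_json` suffix and the "decoded list alongside the original" behaviour come from the commit message.

```python
import json


def encode_metadata(meta):
    """Turn list-typed fields into JSON strings stored under '<key>_json'."""
    attrs = {}
    for key, value in meta.items():
        if isinstance(value, (list, tuple)):
            attrs[f"{key}_json"] = json.dumps(list(value))
        else:
            attrs[key] = value
    return attrs


def decode_metadata(attrs):
    """Return the attrs plus a decoded Python list for every '*_json' field."""
    out = dict(attrs)  # keep the *_json originals
    for key, value in attrs.items():
        if key.endswith("_json"):
            out[key[: -len("_json")]] = json.loads(value)
    return out


meta = {"schema_version": 1, "mesh_names": ["mesh0"], "swarm_names": []}
attrs = encode_metadata(meta)      # every value is now a scalar or a string
decoded = decode_metadata(attrs)   # lists are back, originals still present
```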

Stacked on feature/model-tracker; PRs to development after #195 and
#196 land.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
lmoresi added a commit that referenced this pull request May 20, 2026
…t roundtrip

Builds on phase 1's metadata wrapper to actually carry mesh + mesh-
variable state to disk and read it back. Delegates the heavy lifting
to #146's `Mesh.write_checkpoint` / `MeshVariable.read_checkpoint`
PETSc-DMPlex primitives — phase 2's job is layout, dispatch, and
tying the wrapper to the bulk data via a simple convention.

Layout (final v1.1 shape):

    /path/to/run.snap.h5          wrapper (h5py-inspectable)
    /path/to/run.snap.bulk/       companion directory (one per snap)
        {mesh_safe}.mesh.00000.h5
        {mesh_safe}.{var_clean}.00000.h5

Wrapper carries /meshes/{mesh_safe}/ with @name, @mesh_file, and
/meshes/{mesh_safe}/variables/{var_safe}/ with @name, @components,
@degree, @continuous, @external_file. The bulk-dir path is derived
from the wrapper path by convention (`.h5` → `.bulk`), so no
external_file attr is needed for the standard placement. Move them
together; a clear FileNotFoundError fires if bulk is missing on read.
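
The path convention above can be sketched with `pathlib`. `bulk_dir_for` and `require_bulk_dir` are hypothetical helper names, and the error wording is illustrative rather than what `read_snapshot` actually prints.

```python
from pathlib import Path


def bulk_dir_for(wrapper_path):
    """Derive the companion bulk directory: run.snap.h5 -> run.snap.bulk."""
    wrapper = Path(wrapper_path)
    if wrapper.suffix != ".h5":
        raise ValueError(f"expected a .h5 wrapper path, got {wrapper}")
    # with_suffix replaces only the last suffix, so '.snap' is preserved.
    return wrapper.with_suffix(".bulk")


def require_bulk_dir(wrapper_path):
    """Return the bulk directory, or fail clearly if it was left behind."""
    bulk = bulk_dir_for(wrapper_path)
    if not bulk.is_dir():
        raise FileNotFoundError(
            f"bulk directory {bulk} not found; it must sit next to {wrapper_path}"
        )
    return bulk


bulk = bulk_dir_for("/path/to/run.snap.h5")
```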

Phase 1 layout refactor folded in:
- /mesh (singular) → /meshes (plural) — supports multi-mesh natively.
- /variables removed from the top level — now nests under each mesh
  as /meshes/{name}/variables/{var}, matching the in-memory
  snapshot's mesh→vars structure.

New API:
- `write_snapshot(model, path)` — writes wrapper + bulk; covers
  every registered mesh and every allocated meshvar on each mesh.
  Lazy-allocated vars (_gvec is None) are skipped — same rule as the
  in-memory path.
- `read_snapshot(model, path)` — loads var DOFs back into already-
  registered meshes by name. Mesh / variable mismatch raises a
  clear ValueError (mesh-rebuild on read is v1.2 scope).
- `write_snapshot_skeleton` / `read_snapshot_metadata` /
  `inspect_snapshot` stay as phase-1 metadata-only entry points.

Branch hygiene: merged origin/development (which now has #146) into
this branch so the new code can actually call read_checkpoint. The
merge was clean — #146 and the snapshot toolkit only overlap at
different methods in `discretisation_mesh.py`, as the earlier
analysis predicted. PR target will be development once #195/#196
land; the diff stays clean because the merged dev commits are
already there.

Tests (12 total, 5 new in phase 2, tier_a level_1):
- write produces wrapper + bulk-dir with the expected file pattern
- wrapper populated with the per-mesh + per-var metadata that makes
  inspectability self-sufficient
- bit-exact write→scribble→read roundtrip on a 2D mesh with one
  scalar + one vector variable (np.array_equal, zero tolerance)
- missing bulk-dir → clear FileNotFoundError
- mismatched mesh on read → clear ValueError (not an obscure h5py
  trace)

Regression: 64 tests pass (24 snapshot + 9 tracker + 12 disk-format
+ 19 core/regression).

Phase 3 next: swarms (with the @external_file freedom kept open for
bulky swarms) + /python_state for DDt + ModelTracker via dataclass-
to-HDF5-attrs serialisation.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
lmoresi added a commit that referenced this pull request May 20, 2026
Per Louis's direction ("break out the swarm information into a
separate file in the first instance — bulk is a problem with swarms,
always"), swarms always go to their own h5py-direct sidecar from day
one. No inline-vs-split toggle — sidecar is the only path.

Layout:
    /path/to/run.snap.h5                       wrapper
    /path/to/run.snap.bulk/{swarm_safe}.swarm.h5  swarm sidecar (one
                                                   per swarm)

Sidecar structure (h5py-native, no PETSc — swarms aren't DMPlex
section/vec):
    @num_particles_local, @dim, @mesh_name, @population_generation
    /coordinates                  dataset, (n_local, dim)
    /variables/{var_clean_name}   dataset, (n_local, num_components)
        @num_components, @dtype

The sidecar's top-level @attrs and group structure mean `h5ls -v`
on the sidecar alone tells you "this holds N particles in dim D on
mesh M with these variables" — same inspectability bar as the
wrapper.

Wrapper /swarms/{swarm_safe}/ carries metadata + the @external_file
pointer to the sidecar in the bulk dir.

Restore mirrors the in-memory Swarm.apply_snapshot_payload exactly:
clear local population via dm.removePoint loop, addNPoints at saved
coords, write var data back. Same rebuild-on-restore semantics — the
disk snapshot recovers from a particle-population mutation (added
particles between snapshot and restore) just like the in-memory path
does, proven by test_swarm_restore_recovers_after_particle_count_change.
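
The three restore steps (clear, re-add at saved coordinates, write variable data) can be illustrated with a list-backed toy. `ToySwarm` is hypothetical and only borrows the `snapshot_payload` / `apply_snapshot_payload` method names; the real implementation works on a PETSc DMSwarm, not Python lists.

```python
import copy


class ToySwarm:
    """Hypothetical list-backed stand-in for a particle swarm."""

    def __init__(self, coords, temperature):
        self.coords = [tuple(c) for c in coords]
        self.vars = {"temperature": list(temperature)}

    def add_particle(self, coord, temperature):
        self.coords.append(tuple(coord))
        self.vars["temperature"].append(temperature)

    def snapshot_payload(self):
        # Capture positions and per-variable data together.
        return copy.deepcopy({"coords": self.coords, "vars": self.vars})

    def apply_snapshot_payload(self, payload):
        # Step 1: clear the current population, whatever its size.
        self.coords.clear()
        for data in self.vars.values():
            data.clear()
        # Step 2: re-add particles at the saved coordinates.
        self.coords.extend(payload["coords"])
        # Step 3: write the saved per-variable data back.
        for name, saved in payload["vars"].items():
            self.vars[name].extend(saved)


swarm = ToySwarm([(0.0, 0.0), (1.0, 0.5)], [300.0, 400.0])
token = swarm.snapshot_payload()
swarm.add_particle((2.0, 2.0), 999.0)  # population changes after the snapshot
swarm.vars["temperature"][0] = -1.0    # and existing data is scribbled on
swarm.apply_snapshot_payload(token)
```

Because the population is rebuilt rather than compared, the particle-count change between snapshot and restore is not an error; it is simply undone.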

Tests (5 new, 21 total tier_a level_1):
- swarm sidecar lands in bulk dir with predictable name; wrapper
  records external_file ref + mesh_name + var inventory
- sidecar is self-inspectable via h5py (file-level attrs +
  /coordinates + /variables with per-var attrs)
- whole swarm (coords + svar data) round-trips bit-exact through
  write → scribble → read
- rebuild-on-restore parity with in-memory path: snapshot, mutate
  population, restore → exact local population recovered
- PETSc-internal DMSwarm_* variables filtered at capture (same rule
  as in-memory)

MPI: single-rank only in this phase. The current rank-0-only sidecar
write only captures rank 0's local particles in a parallel run.
Phase 6 will either use h5py-mpi parallel HDF5 or per-rank sidecars
to match #195's parallel exact-reconstruction guarantee.

73 tests pass (24 in-memory + 9 tracker + 21 disk-format + 19
core/regression).

Phase 4 next: format detection + dispatch in MeshVariable.read_timestep
so it reads BOTH the legacy per-variable layout AND the new v1.1
sidecar format via the KDTree bridge. Closes the compatibility
commitment from the design discussion.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)
lmoresi added a commit that referenced this pull request May 20, 2026
Single user-facing entry point for all snapshot use cases. Same
methods serve in-memory ephemeral stash and on-disk persistent
snapshot — the dispatch is mechanical, the user has one API to
learn:

    token = model.save_state()                  # in-memory, returns Snapshot
    model.load_state(token)                     # restore from token

    model.save_state(file="step42.snap.h5")     # on-disk, returns path
    model.load_state("step42.snap.h5")          # restore from disk
                                                # (also: load_state(file=…))

load_state dispatches on argument type — Snapshot → in-memory
restore; str/PathLike → disk restore. Type-mismatched source raises
TypeError with a clear message.
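
A minimal sketch of the type-based dispatch described above. The `Snapshot` class here is a local stand-in and the returned tuples are placeholders for the real in-memory and on-disk restore paths, which this snippet does not attempt to reproduce.

```python
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Snapshot:
    """Hypothetical stand-in for the in-memory token type."""

    payload: dict


def load_state(source):
    """Dispatch on argument type: token -> memory, path -> disk."""
    if isinstance(source, Snapshot):
        return ("memory", source.payload)
    if isinstance(source, (str, os.PathLike)):
        return ("disk", os.fspath(source))
    raise TypeError(
        "load_state expects a Snapshot token or a str/os.PathLike path, "
        f"got {type(source).__name__}"
    )


from_token = load_state(Snapshot({"step": 42}))
from_str = load_state("step42.snap.h5")
from_path = load_state(Path("step42.snap.h5"))
```

Checking for `Snapshot` first keeps the two branches unambiguous even if a token type were ever made path-like.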

Renames replace the prior Model.snapshot() / Model.restore() pair
from #195. Pre-merge, no public users to migrate; getting the
user-facing API right now means there is never a disparate version
shipped. uw.checkpoint.{snapshot,restore,write_snapshot,read_snapshot,
read_snapshot_metadata,inspect_snapshot,write_snapshot_skeleton}
stay as power-user / lower-level entry points that save_state /
load_state delegate to.

Files updated (mechanical renames, except the doc rewrite):
- src/underworld3/model.py: save_state / load_state methods replace
  snapshot / restore; load_state accepts positional Snapshot or
  str/os.PathLike, with TypeError on anything else.
- tests/test_0007_snapshot_inmemory.py — 23 callers renamed; obsolete
  test_snapshot_path_is_v1_1_scope deleted (v1.1 has landed).
- tests/test_0008_snapshot_realsolver.py — 3 tests renamed.
- tests/test_0009_model_tracker.py — 9 tests renamed.
- tests/test_0010_snapshot_disk_format.py — 21 tests: replace
  uw.checkpoint.write_snapshot / read_snapshot with model.save_state
  / model.load_state at user-style call sites; keep
  write_snapshot_skeleton + read_snapshot_metadata where the test is
  specifically exercising the lower-level entry points.
- tests/parallel/ptest_0007_snapshot_inmemory.py — np-1/3/4 ptest.
- tests/run_snapshot_backstepping_{demo,spatial}.py — demo scripts.
- docs/advanced/snapshot-restore.md — rewritten API section to show
  both modes; added "On-disk file layout" section and a "Choosing
  between paths" comparison table covering write_timestep,
  write_checkpoint, and save_state. Limitations section updated to
  reflect that on-disk is now real (was "in-memory only").

Regression: 75 single-rank tests pass (was 76 — minus the deleted
obsolete v1.1-scope test); MPI ptest at -np 4 still PASS with the
parallel exact-reconstruction guarantee. Docs build clean with no
snapshot-related warnings; the new layout + choosing-between-paths
sections render.

Phase 4 (read_timestep format-aware dispatch for backward compat)
becomes a nice-to-have at this point — save_state / load_state is
the recommended surface, write_timestep / read_timestep keep their
existing role unchanged. Phase 6 (parallel HDF5 / per-rank sidecars
for on-disk MPI) is the remaining correctness item.

Underworld development team with AI support from Claude Code
(https://claude.com/claude-code)