Skip to content

kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels#895

Open
bootjp wants to merge 1 commit into
mainfrom
feat/composed1-m3-verify-gate
Open

kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels#895
bootjp wants to merge 1 commit into
mainfrom
feat/composed1-m3-verify-gate

Conversation

@bootjp
Copy link
Copy Markdown
Owner

@bootjp bootjp commented May 31, 2026

Summary

Third milestone (M3) of the Composed-1 cross-group commit-time guard per docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md §M3.

Wires the safety property at apply time using the M2 versioned-snapshot ring and the M1 ObservedRouteVersion field.

Stacked on PR #894 (M2 plumbing).

What's enforced

  • (a) Observed-version owner — the spec-level Composed-1 check from tla/composed/Composed.tla. Every write key of a Composed-1-pinned txn must be owned by THIS Raft group at the txn's observed catalog version. Refusal returns ErrComposed1Violation.
  • (b) Current-version cross-version-read fence — the §4.4 / §3 codex P1 trace. Even when (a) passes, a route shift between BeginTxn and Commit can leave the write landing on the OLD owner while readers at the new version route to the NEW owner and miss the write. The current-version fence refuses that case. Refusal returns ErrComposed1Violation with a different diagnostic prefix (current-version vs observed-version) so the M4 retry path can distinguish.
  • (c) Retention-miss fail-closed — when the txn's observed version has been evicted from the M2 ring (long-running txn or high catalog churn), return ErrComposed1VersionGCd. Per design doc §4.3, the not-found semantics is a hard retryable error, NOT a soft pass, because soft-pass would let the gate be bypassed exactly in the cases where the cross-version-read hazard is most likely.

Short-circuits

Three legacy / not-applicable cases that the gate explicitly bypasses:

  • FSM constructed without WithRouteHistory (legacy / test seam): routes == nil, gate returns nil.
  • Request carries ObservedRouteVersion == 0 (unpinned — pre-M1 caller, or 2PC ABORT request that doesn't carry the version): gate returns nil. This is what keeps M3 from regressing every caller that has not migrated to set OperationGroup.ObservedRouteVersion.
  • Engine.Current returns (zero, false) — engine has no history (bare-struct test seam): the (b) fence falls through.

Changes

  • distribution/engine.go — adds Engine.Current() returning the snapshot at the current catalogVersion. Adds SetHistoryDepthForTest as the cross-package test seam that lets kv-side tests trigger eviction without bypassing the package boundary (claude review on PR kv(composed1): M2 — versioned-snapshot ring + kvFSM RouteHistory wiring #894 — fragile-but-test-local lock contract documented inline).
  • kv/fsm.go — defines ErrComposed1Violation and ErrComposed1VersionGCd sentinels with the wrapped-error diagnostic shape M4 retry will use. Adds verifyComposed1 and verifyOwnerFromSnapshot. Wires verifyComposed1 into handleTxnRequest at the top so PREPARE / COMMIT / ABORT / NONE all pass through the gate (ABORT carries ObservedRouteVersion=0 and falls through naturally).
  • kv/route_history.goRouteHistory interface gains Current(). Adapter forwards to distribution.Engine.Current().

Tests (kv/fsm_composed1_test.go)

Test Maps to design doc M3 "Done when"
TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails (i) observed-version snapshot resolves the key to a different group than this FSM
TestVerifyComposed1_ObservedVersionOlderThanRingFails (ii) observed version is outside the ring → ErrComposed1VersionGCd
TestVerifyComposed1_ObservedPassesButCurrentDiffersFails (iii) / §3 codex P1 trace: observed-version check passes but current-version differs → ErrComposed1Violation (current-version diagnostic)
TestVerifyComposed1_ObservedVersionZeroSkipsGate Legacy caller bypasses the gate
TestVerifyComposed1_NilRouteHistorySkipsGate Unwired FSM bypasses the gate

Verification

  • go build ./... — clean
  • go vet ./... — clean
  • go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass
  • make lint — 0 issues

Self-review (5 lenses)

  1. Data loss — the gate refuses commits but cannot lose them. Every rejection emits a sentinel the M4 coordinator path will convert into a successful commit on the correct owner.
  2. Concurrency — the gate runs under the FSM's apply loop (Raft applies are serial), so no new lock ordering. RouteHistory reads take the engine's read lock; SnapshotAt and Current each acquire and release in a single call.
  3. Performance — two map lookups + a per-mutation linear scan of the snapshot's sorted routes per Composed-1-pinned commit. OwnerOf short-circuits via the M2 round-1 break optimisation, so the per-mutation cost is bounded by the first non-covering gap rather than the full routes slice. Legacy callers (ObservedRouteVersion=0) pay one branch per commit and exit.
  4. Data consistency — closes the Composed-1 (observed) and Composed-1a (current) gaps end-to-end. Spec correspondence: tla/composed/Composed.tla's Commit precondition is now enforced at apply time; the §4.4 fence matches Composed1a_CommitToCurrentOwner from PR tla: Composed-1a current-version fence + currentfence_gap (#870 follow-up) #878.
  5. Test coverage — 5 new unit tests cover the three design doc criteria + the two legacy short-circuit paths. Existing handleTxnRequest tests pass unchanged (their requests carry ObservedRouteVersion=0 from M1's behaviour-neutral default).

Test plan

Resolves

The M3 row in the Composed-1 design doc.

Summary by CodeRabbit

  • Bug Fixes

    • Introduced validation to prevent cross-group transaction commits when route ownership has shifted between observed and current catalog versions.
    • Added specific error handling for stale observed versions and unavailable route history scenarios.
  • Tests

    • Added comprehensive test coverage for transaction validation behavior across version and ownership scenarios.

Third milestone of the Composed-1 cross-group commit-time guard per
docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md
§M3.  Wires the safety property at apply time using the M2
versioned-snapshot ring and the M1 ObservedRouteVersion field.

Stacked on PR #894 (M2 plumbing).

What's enforced:

  (a) Observed-version owner — the spec-level Composed-1 check from
      tla/composed/Composed.tla.  Every write key of a Composed-1-
      pinned txn must be owned by THIS Raft group at the txn's
      observed catalog version (the version it read its read-set
      at, set in M1 via OperationGroup.ObservedRouteVersion).
      Refusal returns ErrComposed1Violation.

  (b) Current-version cross-version-read fence — the §4.4 / §3
      codex P1 trace.  Even when (a) passes, a route shift between
      BeginTxn and Commit can leave the write landing on the OLD
      owner while readers at the new version route to the NEW
      owner and miss the write.  The current-version fence refuses
      that case so M4 retry can re-route.  Refusal returns
      ErrComposed1Violation with a different diagnostic prefix
      ("current-version" vs "observed-version") so the retry path
      can distinguish.

  (c) Retention-miss fail-closed — when the txn's observed version
      has been evicted from the M2 ring (long-running txn or high
      catalog churn), return ErrComposed1VersionGCd.  Per design
      doc §4.3, the not-found semantics is a hard retryable error,
      NOT a soft pass, because soft-pass would let the gate be
      bypassed exactly in the cases where the cross-version-read
      hazard is most likely.

Short-circuits cleanly in three legacy / not-applicable cases:

  * FSM constructed without WithRouteHistory (legacy / test seam):
    routes == nil, gate returns nil.
  * Request carries ObservedRouteVersion == 0 (unpinned — pre-M1
    caller, or 2PC ABORT request that doesn't carry the version):
    gate returns nil.  This is what keeps M3 from regressing every
    caller that has not migrated to set OperationGroup.
    ObservedRouteVersion.
  * Engine.Current returns (zero, false) — engine has no history
    (bare-struct test seam): the (b) fence falls through.

Changes:

  * distribution/engine.go — adds Engine.Current() returning the
    snapshot at the current catalogVersion (used by the M3 fence).
    Adds SetHistoryDepthForTest as the cross-package test seam
    that lets kv-side tests trigger eviction without bypassing
    the package boundary (claude review on PR #894 — fragile-but-
    test-local lock contract documented inline).

  * kv/fsm.go — defines ErrComposed1Violation and
    ErrComposed1VersionGCd sentinels with the wrapped-error
    diagnostic shape M4 retry will use.  Adds verifyComposed1 and
    verifyOwnerFromSnapshot.  Wires verifyComposed1 into
    handleTxnRequest at the top so PREPARE / COMMIT / ABORT /
    NONE all pass through the gate (ABORT carries
    ObservedRouteVersion=0 and falls through naturally).

  * kv/route_history.go — RouteHistory interface gains Current().
    Adapter forwards to distribution.Engine.Current().

Tests (kv/fsm_composed1_test.go):

  * TestVerifyComposed1_StaleObservedVersionWithMovedKeyFails —
    design doc M3 "done when" criterion (i): observed-version
    snapshot resolves the key to a different group than this FSM.
  * TestVerifyComposed1_ObservedVersionOlderThanRingFails —
    criterion (ii): observed version is outside the ring →
    ErrComposed1VersionGCd.  Uses depth=2 via the cross-package
    SetHistoryDepthForTest seam.
  * TestVerifyComposed1_ObservedPassesButCurrentDiffersFails —
    criterion (iii) / the §3 codex P1 trace: observed-version
    check passes (routes[1][k1]=g1) but the current snapshot at
    v=2 has moved k1 to g2 → ErrComposed1Violation with the
    "current-version" diagnostic prefix.
  * TestVerifyComposed1_ObservedVersionZeroSkipsGate — legacy
    caller (ObservedRouteVersion=0) bypasses the gate.
  * TestVerifyComposed1_NilRouteHistorySkipsGate — unwired FSM
    (no WithRouteHistory option) bypasses the gate; matches the
    pre-feature posture.

Verification:
  * go build ./...                            — clean
  * go vet ./...                              — clean
  * go test -race -count=1 ./kv ./distribution — 10.2 s + 1.0 s, pass

Self-review (5 lenses):

  1. Data loss — the gate refuses commits but cannot lose them.
     Every rejection emits a sentinel the M4 coordinator path will
     convert into a successful commit on the correct owner.
  2. Concurrency — the gate runs under the FSM's apply loop (Raft
     applies are serial), so no new lock ordering.  RouteHistory
     reads take the engine's read lock; SnapshotAt and Current
     each acquire and release in a single call.
  3. Performance — two map lookups + a per-mutation linear scan
     of the snapshot's sorted routes per Composed-1-pinned commit.
     OwnerOf short-circuits via the M2 round-1 break optimisation,
     so the per-mutation cost is bounded by the first non-covering
     gap rather than the full routes slice.  Legacy callers
     (ObservedRouteVersion=0) pay one branch per commit and exit.
  4. Data consistency — closes the Composed-1 (observed) and
     Composed-1a (current) gaps end-to-end.  Spec correspondence:
     tla/composed/Composed.tla's Commit precondition is now
     enforced at apply time; the §4.4 fence matches
     Composed1a_CommitToCurrentOwner from PR #878.
  5. Test coverage — 5 new unit tests cover the three design doc
     criteria + the two legacy short-circuit paths.  Existing
     handleTxnRequest tests pass unchanged (their requests carry
     ObservedRouteVersion=0 from M1's behaviour-neutral default).

Next milestone (separate PR per design doc §6): M4 — coordinator
retry path.  When either Composed-1 sentinel returns, the
coordinator re-reads the route cache, re-routes the txn against
the new owning group, and re-issues it once.
@bootjp
Copy link
Copy Markdown
Owner Author

bootjp commented May 31, 2026

@claude review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 31, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces a Composed-1 gate that enforces transaction correctness by validating ownership at both historical and current catalog versions. It extends the route-history mechanism with Current() snapshots and implements dual ownership checks for transaction mutations before apply, rejecting stale commits and preventing cross-group consistency violations.

Changes

Composed-1 Gate for Transaction Consistency

Layer / File(s) Summary
Engine history snapshot APIs
distribution/engine.go, kv/route_history.go
Engine.Current() fetches the current catalogVersion route snapshot with (RouteHistorySnapshot, bool) return, Engine.SetHistoryDepthForTest() controls FIFO ring eviction, and distributionEngineAdapter.Current() adapts the engine method to the RouteHistory interface.
RouteHistory interface contract
kv/fsm.go
RouteHistory interface is documented to expose Current() alongside SnapshotAt(), signaling availability of current snapshots for ownership validation.
Composed-1 gate logic and errors
kv/fsm.go
ErrComposed1Violation and ErrComposed1VersionGCd errors distinguish ownership drift from evicted history, verifyComposed1() validates each non-internal mutation is owned by the FSM's shard group at both the observed (via SnapshotAt) and current (via Current()) versions, and handleTxnRequest invokes the gate before phase dispatch.
Composed-1 gate test suite
kv/fsm_composed1_test.go
Test helpers (applyComposed1Snapshot, newComposed1FSM, commitTxnRequest) wire a kvFSM with route history and apply catalog snapshots to an in-memory engine, followed by 5 tests covering stale observed + moved key, evicted observed version, current-version fence override, legacy zero-version bypass, and nil-routes bypass.

Sequence Diagram

sequenceDiagram
  participant Client
  participant FSM as kvFSM<br/>(handleTxnRequest)
  participant Gate as verifyComposed1()
  participant Observed as RouteHistory<br/>(SnapshotAt)
  participant Current as RouteHistory<br/>(Current)

  Client->>FSM: Txn with ObservedRouteVersion
  FSM->>Gate: verifyComposed1(request)
  Gate->>Observed: SnapshotAt(ObservedVersion)
  alt Observed version evicted
    Observed-->>Gate: nil, false
    Gate-->>FSM: ErrComposed1VersionGCd
  else Observed version found
    Observed-->>Gate: RouteSnapshot
    Gate->>Gate: Check key owners vs shard group
    alt Owner mismatch at observed
      Gate-->>FSM: ErrComposed1Violation(observed-version)
    else Owner match at observed
      Gate->>Current: Current()
      Current-->>Gate: RouteSnapshot
      Gate->>Gate: Check key owners vs shard group
      alt Owner mismatch at current
        Gate-->>FSM: ErrComposed1Violation(current-version)
      else All checks pass
        Gate-->>FSM: nil
        FSM->>FSM: Proceed to phase dispatch
      end
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • bootjp/elastickv#881: Propagates ObservedRouteVersion into pb.Request across shards during PREPARE/COMMIT, which the Composed-1 gate consumes to validate historical ownership.
  • bootjp/elastickv#894: Introduces versioned route-history ring (history, SnapshotAt(), RouteHistorySnapshot), which this PR extends with Current() and gates for ownership validation.

Poem

🐰 A gate guards the shard's precious keys,
Checking owners at observed and now
No stale commits slip through the breeze,
Each mutation speaks its vow.
Cross-group chaos bows!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the PR's primary objective—implementing the M3 verifyComposed1 gate with retry sentinels—and is specific, concise, and clearly related to all major changes across distribution/engine.go, kv/fsm.go, and supporting files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/composed1-m3-verify-gate

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

TLA+ spec divergence review (auto-triggered)

This PR touches files that the TLA+ safety spec has an anchor on (per
docs/design/2026_05_28_partial_tla_safety_spec.md §3),
so an AI review is requested below to verify the implementation has not drifted
from the model.

Anchored files changed in this PR head (c94e0d1):

  • distribution/engine.go
  • kv/fsm.go

What to check, by subsystem:

  • kv/hlc*.goNext() must respect the HLC-4 preconditions (i)/(ii)/(iii) from the design doc: bounded skew, logical-counter handoff on leader change (strategy (c) Observe(MaxAppliedHLC)), and the commit-time ceiling fence (fail-closed when wall_now >= physicalCeiling). Any change to the bit layout (48/16), the CAS loop, or the ceiling getter/setter is in scope.
  • kv/coordinator.go, kv/sharded_coordinator.goRunHLCLeaseRenewal, hlcRenewalInterval, hlcPhysicalWindowMs constants, and the new-term detection that calls Observe(fsm.MaxAppliedHLC()) (strategy (c)). Any change to renewal cadence, group selection, or fail-closed behaviour is in scope.
  • kv/transaction.go, kv/lock_resolver.go — OCC commit-ts assignment, lock-map encoding (key, lock_ts) -> start_ts, and the LockResolver action OCC-3 depends on. (M2 spec will land OCC-1..OCC-5; until then the spec doc §5.2 is the contract.)
  • kv/fsm.go — FSM apply of HLC lease entries (SetPhysicalCeiling), and any future MaxAppliedHLC() accessor that strategy (c) needs.
  • store/mvcc_store.go — version visibility, snapshot install, and the MVCC-1..MVCC-4 invariants (M3 scope).
  • distribution/** — route catalog versioning, SplitRange atomicity, and CatalogWatcher async fan-out (M4 scope).

If the change is correct but requires a spec update, edit tla/hlc/HLC.tla (or the corresponding M2..M5 module once landed) and the design doc in the same PR. The tla-check workflow runs the TLC model check on the same paths.


@claude review please verify TLA+ spec divergence per the checklist above.

@codex review please verify TLA+ spec divergence per the checklist above.

@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@claude
Copy link
Copy Markdown

claude Bot commented May 31, 2026

Claude finished @bootjp's task in 3m 24s —— View job


Code Review — kv(composed1): M3 — verifyComposed1 apply-time gate + retry sentinels

  • Gather context and understand the request
  • Read changed files
  • Review design doc for M3 spec correspondence
  • Review lens 1: Data loss
  • Review lens 2: Concurrency / distributed failures
  • Review lens 3: Performance
  • Review lens 4: Data consistency
  • Review lens 5: Test coverage

Overall the implementation is solid. The gate logic matches the TLA+ spec (§4.2(a)) and the cross-version fence (§4.4) is correctly wired. No data-loss paths, no lock-ordering issues, no safety bugs found. A few issues worth addressing are called out below in order of severity.


Issues

1. Misleading comment in SetHistoryDepthForTest — contradicts its own implementation

distribution/engine.go:217–228

// SetHistoryDepthForTest overrides the FIFO ring depth from outside
// the package.  Test-only — callers MUST set the depth before the
// Engine is shared with any concurrent reader (no internal
// synchronisation here …
func (e *Engine) SetHistoryDepthForTest(depth int) {
    e.mu.Lock()       // ← lock IS acquired
    defer e.mu.Unlock()
    e.historyDepth = depth
}

The comment says "no internal synchronisation here" but the function acquires e.mu.Lock(). A future maintainer reading the comment will think concurrent calls are unsafe when they are actually protected. The comment was likely copied from the in-package direct field write (TestEngineSnapshotAt_FIFOEviction) and not updated.

Suggested fix: change "no internal synchronisation here""caller should set the depth before the Engine receives concurrent Snapshots to avoid test-interleaving surprises, but the write itself is lock-protected." Fix this →


2. Missing "happy path" test — gate can silently block valid txns undetected

kv/fsm_composed1_test.go

All 5 tests assert that the gate fires (violation, GC'd) or short-circuits (nil routes, zero version). There is no test that asserts the gate passes cleanly when the txn's observed version correctly identifies this group as the owner at both the observed and current versions. Without a passing-gate test, an over-aggressive check (e.g., an off-by-one in OwnerOf or an inadvertent negation of owner != f.shardGroupID) could incorrectly reject every valid commit and the test suite would not catch it.

Suggested addition — a test along these lines:

func TestVerifyComposed1_ValidOwnershipPassesGate(t *testing.T) {
    t.Parallel()
    e := distribution.NewEngine()
    // v=1 and v=2 both assign k to g1 — no route shift.
    applyComposed1Snapshot(t, e, 1, []distribution.RouteDescriptor{
        {RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
    })
    applyComposed1Snapshot(t, e, 2, []distribution.RouteDescriptor{
        {RouteID: 100, Start: []byte(""), End: nil, GroupID: 1, State: distribution.RouteStateActive},
    })
    fsm := newComposed1FSM(t, e, 1)

    err := fsm.verifyComposed1(commitTxnRequest(1, "k"))
    require.NoError(t, err, "gate must pass when this group owns the key at both observed and current version")
}

Fix this →


3. Minor: ErrComposed1Violation doc comment overstates the scope of the check

kv/fsm.go:215

// ErrComposed1Violation is returned … because the
// txn's read-set or write-set keys are not owned by this group …

verifyOwnerFromSnapshot iterates r.GetMutations() (the write-set only). r.ReadKeys is not checked — consistent with the TLA+ spec where Composed1_CommitToOwningGroup quantifies over txnWriteSet. The doc comment's "read-set or write-set" is misleading.

Fix this →


4. Minor: Design doc Q1 not addressed at M3

Design doc §9 Q1 says: "allow unpinned at M1, log a warning at M3, error at M5."

The M3 implementation short-circuits ObservedRouteVersion==0 with return nil and no log. Keeping the gate silent is probably the right call right now (to avoid log spam from all legacy callers), but it drifts from the design doc's stated M3 plan. Either update the doc to defer the warning to M4/M5, or add a single slog.WarnContext at the observedVer == 0 short-circuit. This is a documentation/tracking issue, not a correctness issue.


5. Nit: Self-correction prose in test comment

kv/fsm_composed1_test.go:73–77

// Wait — re-read the design doc: criterion (i) is …

The "Wait — re-read the design doc" reads like a debugging note. The clarification is useful but could be written as a direct statement. This is purely cosmetic.


Minor optimisation opportunity (non-blocking)

When observedVer == currentSnap.Version() (the catalog hasn't advanced since txn-begin), verifyOwnerFromSnapshot is called twice against the same snapshot. Adding an early exit after the observed check passes avoids the redundant linear scan:

// (b) Current-version fence.
currentSnap, ok := f.routes.Current()
if !ok {
    return nil
}
// Fast path: if the current version is the same as the observed version,
// the observed-version check already validated current ownership.
if currentSnap.Version() == observedVer {
    return nil
}
return f.verifyOwnerFromSnapshot(r.GetMutations(), currentSnap, currentSnap.Version(), "current")

This is O(|write-set|) savings and entirely optional, but it's a clean win for the common steady-state case.


Lens summary

Lens Result
Data loss Clean — gate can only reject, never silently drop committed writes
Concurrency Clean — gate runs under the serial Raft apply loop; SnapshotAt/Current acquire and release e.mu.RLock atomically
Performance Two map lookups + O(
Data consistency Spec correspondence correct; write-set only check matches Composed1_CommitToOwningGroup; fail-closed on ring miss is correct per §4.3
Test coverage 5 failure/short-circuit tests present; missing happy-path test (see issue 2 above); no multi-key test

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the apply-time Composed-1 gate to prevent cross-version-read anomalies during route shifts. It introduces the Current() method to retrieve the latest route catalog snapshot, defines ErrComposed1Violation and ErrComposed1VersionGCd errors, and adds verification logic in kvFSM.handleTxnRequest to validate key ownership against both the observed and current catalog versions. Comprehensive unit tests are also added. The reviewer identified two critical issues: first, ABORT requests must bypass the Composed-1 gate to avoid leaking MVCC locks, and a short-circuit is needed when shardGroupID is unset (0); second, reducing history depth via SetHistoryDepthForTest on an already populated engine can cause a runtime panic unless excess history entries are immediately truncated.

Comment thread kv/fsm.go
Comment on lines +582 to +585
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
if f.routes == nil {
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are two critical issues here:

  1. ABORT requests must not be blocked by the Composed-1 gate: If an ABORT request is rejected due to a route shift (ErrComposed1Violation) or history eviction (ErrComposed1VersionGCd), the transaction's locks and intents will be permanently leaked in the MVCC store, blocking future writes. Aborts should always bypass the gate to ensure successful cleanup.
  2. Missing short-circuit for unset shardGroupID: The comments state that a shardGroupID of 0 represents an unset/legacy state and should short-circuit the gate. However, the code does not check for f.shardGroupID == 0, which will cause spurious ErrComposed1Violation errors on unset groups.
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
	if r.Phase == pb.Phase_ABORT {
		return nil
	}
	if f.routes == nil || f.shardGroupID == 0 {
		return nil
	}

Comment thread distribution/engine.go
Comment on lines +225 to +229
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If SetHistoryDepthForTest is called to reduce the history depth after the history ring has already been populated with more entries than the new depth, a subsequent call to recordHistorySnapshotLocked will trigger a runtime panic. This is because make([]uint64, len(e.historyOrder)-1, e.historyDepth) will be invoked with a length greater than its capacity. To prevent this, truncate the excess history entries immediately when updating the depth.

Suggested change
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
}
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
if len(e.historyOrder) > depth {
excess := len(e.historyOrder) - depth
for i := 0; i < excess; i++ {
delete(e.history, e.historyOrder[i])
}
retained := make([]uint64, depth)
copy(retained, e.historyOrder[excess:])
e.historyOrder = retained
}
}

Comment thread kv/fsm.go
// docs/design/2026_05_29_proposed_composed1_cross_group_commit_guard.md
// §4.2(a) + §4.4. Runs two checks before the txn's writes land:
//
// (a) Observed-version owner — the txn's read-set was captured
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [golangci] reported by reviewdog 🐶
File is not properly formatted (gci)

Comment thread kv/fsm_composed1_test.go
// in main.go's buildShardGroups; this helper short-circuits to the
// test-only fixture without spinning up a real Raft group.
//
//nolint:unparam // shardGroupID is currently always 1 in tests but
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [golangci] reported by reviewdog 🐶
File is not properly formatted (gci)

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
kv/fsm_composed1_test.go (1)

32-35: ⚡ Quick win

Drop the new //nolint:unparam and refactor the helper instead.

This suppression is avoidable in new test code: either bake in the single currently used group ID, or make the group ID vary through actual table cases when you need that coverage.

As per coding guidelines, "Avoid //nolint annotations — refactor instead."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/fsm_composed1_test.go` around lines 32 - 35, Remove the //nolint:unparam
annotation and refactor newComposed1FSM to stop accepting the unused
shardGroupID parameter: either bake the single used group ID (1) directly into
newComposed1FSM (replace the shardGroupID parameter with the constant 1 inside
the function) and update all call sites to call newComposed1FSM(t, e)
accordingly, or convert the tests that need varying group IDs into table-driven
cases that construct the FSM with different explicit group IDs (so the parameter
becomes used). Ensure all references to function newComposed1FSM are updated and
delete the //nolint comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@distribution/engine.go`:
- Around line 225-229: SetHistoryDepthForTest should validate the incoming depth
and reject non-positive values to fail fast; inside
Engine.SetHistoryDepthForTest (protected by e.mu), add a check like if depth <=
0 { panic("SetHistoryDepthForTest: depth must be > 0") } (or return an
error/report) before assigning e.historyDepth so recordHistorySnapshotLocked and
the eviction path never encounter an empty historyOrder; update any tests that
call SetHistoryDepthForTest to provide positive depths.

In `@kv/fsm.go`:
- Around line 582-589: verifyComposed1 currently only short-circuits when
f.routes == nil but the documented behavior says shardGroupID == 0 should also
bypass; update kvFSM.verifyComposed1 to return nil immediately when
f.shardGroupID == 0 (before any route/version checks) so callers that install
routes before setting shardGroupID preserve the legacy skip-for-group-0
behavior; keep the existing observedVer handling unchanged.

---

Nitpick comments:
In `@kv/fsm_composed1_test.go`:
- Around line 32-35: Remove the //nolint:unparam annotation and refactor
newComposed1FSM to stop accepting the unused shardGroupID parameter: either bake
the single used group ID (1) directly into newComposed1FSM (replace the
shardGroupID parameter with the constant 1 inside the function) and update all
call sites to call newComposed1FSM(t, e) accordingly, or convert the tests that
need varying group IDs into table-driven cases that construct the FSM with
different explicit group IDs (so the parameter becomes used). Ensure all
references to function newComposed1FSM are updated and delete the //nolint
comment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af79a132-4d95-4f0c-b28e-0e8008f5f03d

📥 Commits

Reviewing files that changed from the base of the PR and between 2b118a4 and c94e0d1.

📒 Files selected for processing (4)
  • distribution/engine.go
  • kv/fsm.go
  • kv/fsm_composed1_test.go
  • kv/route_history.go

Comment thread distribution/engine.go
Comment on lines +225 to +229
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-positive history depths here.

Passing depth <= 0 makes the next recordHistorySnapshotLocked() hit the eviction path with an empty historyOrder, so the helper panics later with an index-out-of-range instead of failing fast at the call site.

Suggested fix
 func (e *Engine) SetHistoryDepthForTest(depth int) {
 	e.mu.Lock()
 	defer e.mu.Unlock()
+	if depth <= 0 {
+		panic("SetHistoryDepthForTest: depth must be > 0")
+	}
 	e.historyDepth = depth
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
e.historyDepth = depth
}
func (e *Engine) SetHistoryDepthForTest(depth int) {
e.mu.Lock()
defer e.mu.Unlock()
if depth <= 0 {
panic("SetHistoryDepthForTest: depth must be > 0")
}
e.historyDepth = depth
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@distribution/engine.go` around lines 225 - 229, SetHistoryDepthForTest should
validate the incoming depth and reject non-positive values to fail fast; inside
Engine.SetHistoryDepthForTest (protected by e.mu), add a check like if depth <=
0 { panic("SetHistoryDepthForTest: depth must be > 0") } (or return an
error/report) before assigning e.historyDepth so recordHistorySnapshotLocked and
the eviction path never encounter an empty historyOrder; update any tests that
call SetHistoryDepthForTest to provide positive depths.

Comment thread kv/fsm.go
Comment on lines +582 to +589
func (f *kvFSM) verifyComposed1(r *pb.Request) error {
if f.routes == nil {
return nil
}
observedVer := r.GetObservedRouteVersion()
if observedVer == 0 {
return nil
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor the documented shardGroupID == 0 bypass.

kvFSM.shardGroupID is documented as “unset => short-circuit”, but verifyComposed1() only checks routes == nil. A partially wired caller that installs route history before the group ID will start rejecting every pinned txn against group 0 instead of preserving the legacy skip behavior.

Suggested fix
 func (f *kvFSM) verifyComposed1(r *pb.Request) error {
-	if f.routes == nil {
+	if f.routes == nil || f.shardGroupID == 0 {
 		return nil
 	}
 	observedVer := r.GetObservedRouteVersion()
 	if observedVer == 0 {
 		return nil
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@kv/fsm.go` around lines 582 - 589, verifyComposed1 currently only
short-circuits when f.routes == nil but the documented behavior says
shardGroupID == 0 should also bypass; update kvFSM.verifyComposed1 to return nil
immediately when f.shardGroupID == 0 (before any route/version checks) so
callers that install routes before setting shardGroupID preserve the legacy
skip-for-group-0 behavior; keep the existing observedVer handling unchanged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant