TQ: Shrink the ack requirements for partial commit #9852
andrewjstone wants to merge 1 commit into main from
Conversation
Currently we wait for N-Z acked commits to allow a new configuration to be installed in the `Committing` state. For a 32 node system this means that 29 nodes have to ack. If more than 3 nodes are offline, then a customer cannot add or remove any sleds. This is an unnecessarily harsh restriction.

This PR changes it so that only K+Z nodes must ack commits to go from `Preparing` to `CommittedPartially`, similar to the prepare phase of the protocol that takes us from `Preparing` to `Committing`. We bump Z a little bit for larger clusters to balance things out. Now, only 21 nodes must ack commits for new configurations to be accepted. This allows up to 11 nodes to be offline and eliminates the need for an escape hatch for Oxide support if 4 or 5 nodes rather than 3 are offline. If 11 nodes are offline, something major is going on and Oxide support really should be involved. We will not allow automatic triage via omdb in this situation.

This PR also makes acking commits idempotent in the datastore layer, which fixes a potential bug in the case of multiple Nexus instances trying to commit the same nodes simultaneously.

Fixes #9826
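To make the rule change concrete, here is a minimal sketch of the two threshold rules described above. The function names and the particular K and Z values are illustrative assumptions chosen only to reproduce the ack counts quoted in this description (29 and 21 for a 32-node system); they are not the constants from the actual implementation.

```rust
/// Old rule: wait for N - Z commit acks before installing the new
/// configuration in the `Committing` state.
fn old_commit_threshold(n: usize, z: usize) -> usize {
    n - z
}

/// New rule: only K + Z commit acks are needed to move from
/// `Preparing` to `CommittedPartially`, mirroring the prepare phase.
fn new_commit_threshold(k: usize, z: usize) -> usize {
    k + z
}

fn main() {
    // Illustrative 32-node system; the K and Z used here are assumptions
    // that reproduce the numbers quoted in the PR description.
    let n = 32;
    assert_eq!(old_commit_threshold(n, 3), 29); // tolerates 3 offline nodes
    assert_eq!(new_commit_threshold(17, 4), 21); // tolerates 11 offline nodes
    println!("offline tolerance: {} -> {}", n - 29, n - 21);
}
```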
```rust
9..=16 => 2,
_ => 3,
4..=7 => 1,
8..=15 => 2,
```
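For context, the hunk above shows match arms mapping cluster size to Z. Below is a minimal sketch of how such a lookup might read in full; only the `4..=7 => 1` and `8..=15 => 2` arms come from the diff, and the remaining arms are assumptions, not the values chosen in this PR.

```rust
// Sketch only: maps cluster size `n` to the redundancy margin Z.
// The `4..=7` and `8..=15` arms are taken from the hunk above; the
// other arms are illustrative assumptions.
fn z_for_cluster_size(n: usize) -> usize {
    match n {
        0..=3 => 0,
        4..=7 => 1,
        8..=15 => 2,
        _ => 3,
    }
}

fn main() {
    // Only the arms shown in the diff are exercised here.
    assert_eq!(z_for_cluster_size(5), 1);
    assert_eq!(z_for_cluster_size(8), 2);
}
```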
This might be too harsh. It would only allow 1 node to be offline for an 8-node cluster. Then again, we should never have 8-node clusters in production.
This actually isn't strictly a bug, except in the contrived test here. It can make commit occur more slowly, though. In the case of competing BG tasks, one can ack, say, nodes 1, 2, 3 and a second can ack 2, 3, 4, which previously would have failed. The next time around, a BG task will only ask for node 4 and record that, since it will know 2 and 3 have already committed. But this requires an extra loop through the BG task, which in our case is up to 1 minute. In any event, there was no reason for the removed filter, and so it is gone.
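To illustrate the idempotent acking this comment and the PR description refer to, here is a toy sketch of the semantics. The type and method names are hypothetical; this is not the actual datastore schema or API, only a model of why overlapping acks from competing background tasks no longer fail.

```rust
use std::collections::BTreeSet;

/// Toy stand-in for the per-configuration commit-ack state kept in the
/// datastore layer. Sketch only; the real implementation lives in Nexus.
#[derive(Default)]
struct CommitAcks {
    acked: BTreeSet<u32>,
}

impl CommitAcks {
    /// Record acks for `nodes`. Re-acking an already-acked node is a
    /// no-op rather than an error, so two background tasks racing with
    /// overlapping node sets both succeed.
    fn ack_commits(&mut self, nodes: &[u32]) {
        self.acked.extend(nodes.iter().copied());
    }
}

fn main() {
    let mut acks = CommitAcks::default();
    // First background task acks nodes 1, 2, 3.
    acks.ack_commits(&[1, 2, 3]);
    // A competing task acks 2, 3, 4; with idempotent acking the overlap
    // is harmless and only node 4 is newly recorded.
    acks.ack_commits(&[2, 3, 4]);
    assert_eq!(acks.acked, BTreeSet::from([1, 2, 3, 4]));
}
```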