TQ: Support sled expunge via trust quorum pathway #9765
I tested this out by first trying to abort and watching it fail because
there was no trust quorum configuration. Then I issued an LRTQ upgrade,
which failed because I hadn't restarted the sled-agents to pick up the
LRTQ shares. Then I aborted that configuration, which was stuck in
prepare. Lastly, I successfully issued a new LRTQ upgrade after
restarting the sled-agents and watched it commit.
Here are the external API calls:
```
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
error; status code: 404 Not Found
{
"error_code": "Not Found",
"message": "No trust quorum configuration exists for this rack",
"request_id": "819eb6ab-3f04-401c-af5f-663bb15fb029"
}
error
➜ oxide.rs git:(main) ✗
➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery api '/v1/system/hardware/racks/ea7f612b-38ad-43b9-973c-5ce63ef0ddf6/membership/abort' --method POST
{
"members": [
{
"part_number": "913-0000019",
"serial_number": "20000000"
},
{
"part_number": "913-0000019",
"serial_number": "20000001"
},
{
"part_number": "913-0000019",
"serial_number": "20000003"
}
],
"rack_id": "ea7f612b-38ad-43b9-973c-5ce63ef0ddf6",
"state": "aborted",
"time_aborted": "2026-01-29T01:54:02.590683Z",
"time_committed": null,
"time_created": "2026-01-29T01:37:07.476451Z",
"unacknowledged_members": [
{
"part_number": "913-0000019",
"serial_number": "20000000"
},
{
"part_number": "913-0000019",
"serial_number": "20000001"
},
{
"part_number": "913-0000019",
"serial_number": "20000003"
}
],
"version": 2
}
```
Here are the omdb calls:
```
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Error: lrtq upgrade
Caused by:
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "8503cd68-7ff4-4bf1-b358-0e70279c6347", "content-length": "124", "date": "Thu, 29 Jan 2026 01:37:09 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "8503cd68-7ff4-4bf1-b358-0e70279c6347" }
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
2,
),
last_committed_epoch: None,
state: PreparingLrtqUpgrade,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T01:37:07.476451Z,
time_committing: None,
time_committed: None,
time_aborted: None,
abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
2,
),
last_committed_epoch: None,
state: Aborted,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T01:37:07.476451Z,
time_committing: None,
time_committed: None,
time_aborted: Some(
2026-01-29T01:54:02.590683Z,
),
abort_reason: Some(
"Aborted via API request",
),
}
root@oxz_switch:~# omdb nexus trust-quorum lrtq-upgrade -w
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
Started LRTQ upgrade at epoch 3
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
3,
),
last_committed_epoch: None,
state: PreparingLrtqUpgrade,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: None,
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Unacked,
share_digest: None,
time_prepared: None,
time_committed: None,
},
},
time_created: 2026-01-29T02:20:03.848507Z,
time_committing: None,
time_committed: None,
time_aborted: None,
abort_reason: None,
}
root@oxz_switch:~# omdb nexus trust-quorum get-config ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 latest
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:17:1:d01::6]:12232
TrustQuorumConfig {
rack_id: ea7f612b-38ad-43b9-973c-5ce63ef0ddf6 (rack),
epoch: Epoch(
3,
),
last_committed_epoch: None,
state: Committed,
threshold: Threshold(
2,
),
commit_crash_tolerance: 0,
coordinator: BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
},
encrypted_rack_secrets: Some(
EncryptedRackSecrets {
salt: Salt(
[
143,
198,
3,
63,
136,
48,
212,
180,
101,
106,
50,
2,
251,
84,
234,
25,
46,
39,
139,
46,
29,
99,
252,
166,
76,
146,
78,
238,
28,
146,
191,
126,
],
),
data: [
167,
223,
29,
18,
50,
230,
103,
71,
159,
77,
118,
39,
173,
97,
16,
92,
27,
237,
125,
173,
53,
51,
96,
242,
203,
70,
36,
188,
200,
59,
251,
53,
126,
48,
182,
141,
216,
162,
240,
5,
4,
255,
145,
106,
97,
62,
91,
161,
51,
110,
220,
16,
132,
29,
147,
60,
],
},
),
members: {
BaseboardId {
part_number: "913-0000019",
serial_number: "20000000",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: 13c0a6113e55963ed35b275e49df4c3f0b3221143ea674bb1bd5188f4dac84,
),
time_prepared: Some(
2026-01-29T02:20:46.792674Z,
),
time_committed: Some(
2026-01-29T02:21:49.503179Z,
),
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000001",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: 8557d74f678fa4e8278714d917f14befd88ed1411f27c57d641d4bf6c77f3b,
),
time_prepared: Some(
2026-01-29T02:20:47.236089Z,
),
time_committed: Some(
2026-01-29T02:21:49.503179Z,
),
},
BaseboardId {
part_number: "913-0000019",
serial_number: "20000003",
}: TrustQuorumMemberData {
state: Committed,
share_digest: Some(
sha3 digest: d61888c42a1b5e83adcb5ebe29d8c6c66dc586d451652e4e1a92befe41719cd,
),
time_prepared: Some(
2026-01-29T02:20:46.809779Z,
),
time_committed: Some(
2026-01-29T02:21:52.248351Z,
),
},
},
time_created: 2026-01-29T02:20:03.848507Z,
time_committing: Some(
2026-01-29T02:20:47.597276Z,
),
time_committed: Some(
2026-01-29T02:21:52.263198Z,
),
time_aborted: None,
abort_reason: None,
}
```
After chatting with @davepacheco, I changed the authz checks in the datastore to do lookups with Rack scope. This fixed the test bug, but is only a shortcut: trust quorum should have its own authz object, and I'm going to open an issue for that. Additionally, for methods that already took an authorized connection, I removed the unnecessary authz checks and the opctx parameter.
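For context, here is a minimal sketch of what a Rack-scoped, lookup-based authz check looks like. The method and identifier names (`tq_config_get_latest`, `tq_fetch_latest_for_rack`) are illustrative stand-ins modeled on the Nexus `LookupPath` convention, not the PR's actual code:
```rust
// Hypothetical sketch: resolve the rack through the standard lookup
// machinery so the authz check (Read on the Rack) happens as part of
// the lookup itself, rather than as a separate ad-hoc check.
async fn tq_config_get_latest(
    &self,
    opctx: &OpContext,
    rack_id: Uuid,
) -> Result<TrustQuorumConfig, Error> {
    // The lookup fails with 404/403 as appropriate before any trust
    // quorum data is touched.
    let (.., authz_rack) = LookupPath::new(opctx, self)
        .rack_id(rack_id)
        .lookup_for(authz::Action::Read)
        .await?;

    // Placeholder for the real query that fetches the latest trust
    // quorum configuration for the now-authorized rack.
    self.tq_fetch_latest_for_rack(&authz_rack).await
}
```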
This commit adds a three-phase mechanism for sled expungement.
The first phase is to remove the sled from the latest trust quorum configuration via omdb. The second phase is to reboot the sled after polling for the commit of the configuration with the trust quorum removal. The third phase is to issue the existing omdb expunge command, which changes the sled policy as before.
The first and second phases remove the need to physically remove the sled before expungement. They act as a software mechanism that prevents the sled-agent from restarting on the sled and doing work when it should be treated as "absent". We've discussed this numerous times in the update huddle, and it is finally arriving!
The third phase is what informs reconfigurator that the sled is gone. It remains the same as before, except for an extra sanity check that the last committed trust quorum configuration does not contain the sled that is to be expunged (sketched below).
The removed sled may be added back to this rack or another after being clean-slated. I tested this by deleting the files in the internal "cluster" and "config" directories and rebooting the removed sled in a4x2, and it worked.
This PR is marked draft because it changes the current sled-expunge pathway to depend on real trust quorum. We cannot safely merge it until the key-rotation work from #9737 is merged.
This also builds on #9741 and should merge after that PR.
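To make the phase-3 check concrete, here is a small, self-contained sketch of the sanity check described above. All type and function names here are illustrative stand-ins, not the PR's actual identifiers:
```rust
use std::collections::BTreeSet;

// Illustrative stand-ins for the real omicron types.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct BaseboardId {
    part_number: String,
    serial_number: String,
}

struct TrustQuorumConfig {
    epoch: u64,
    members: BTreeSet<BaseboardId>,
}

#[derive(Debug)]
enum ExpungeError {
    // The sled still holds a key share for the committed epoch, so
    // expunging it now would let a rebooting sled-agent rejoin and do
    // work while the control plane treats it as absent.
    StillInTrustQuorum { sled: BaseboardId, epoch: u64 },
}

/// Phase-3 sanity check: refuse to expunge a sled that is still a
/// member of the last committed trust quorum configuration.
fn check_sled_removed_from_tq(
    last_committed: &TrustQuorumConfig,
    sled: &BaseboardId,
) -> Result<(), ExpungeError> {
    if last_committed.members.contains(sled) {
        return Err(ExpungeError::StillInTrustQuorum {
            sled: sled.clone(),
            epoch: last_committed.epoch,
        });
    }
    Ok(())
}
```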
Damn, looks like the changes to expunge are breaking existing tests. I'll either need to update those tests, update the expunge function, or move the check inside the expunge function into omdb.
Fixed by properly inserting a fake trust quorum configuration during RSS handoff.
We actually can't merge this until R18 is out the door, since it relies on having trust quorum configurations in order to perform expunge. We'll have to merge main into this branch once #9737 merges so we can do more hardware testing on racklettes.