
fm: persist support bundle requests and data selections #10200

Open
mergeconflict wants to merge 3 commits into main from
mergeconflict/support-bundle-request-tables-models-queries

Conversation


@mergeconflict (Contributor) commented Mar 31, 2026

Add database tables, models, and datastore methods for persisting BundleDataSelection — the set of data categories a support bundle should collect — in two contexts:

  1. FM sitrep support bundle requests: an fm_support_bundle_request table (new) links a bundle request to a case and sitrep, and five per-variant child tables (fm_support_bundle_request_data_selection_{reconfigurator, host_info, sled_cubby_info, sp_dumps, ereports}) record which data categories to gather and with what filters.

  2. Support bundle data selection: five analogous per-variant child tables (support_bundle_data_selection_{reconfigurator, host_info, sled_cubby_info, sp_dumps, ereports}) record the data selection for each support bundle row. Existing bundles are backfilled with BundleDataSelection::all() in the migration.

Schema changes (version 247):

  • fm_support_bundle_request table with indexes.
  • fm_support_bundle_request_data_selection_* child tables.
  • support_bundle.fm_case_id: nullable column linking a bundle to the originating FM case.
  • support_bundle_data_selection_* child tables.
  • Backfill migration populates data selection rows for all pre-existing support bundles.

Model changes:

  • SupportBundleRequest and child row models.
  • Child row models for SupportBundle.

Datastore changes:

  • fm_sitrep_read: refactor alert-request loading into fm_sitrep_alert_requests_read; add a symmetric fm_sitrep_support_bundle_requests_read that loads request rows and reassembles BundleDataSelection from the child tables (see the sketch after this list).
  • fm_sitrep_insert: decompose BundleDataSelection into per-variant rows on write.
  • Sitrep GC deletes per-variant rows before request rows.
  • support_bundle_data_selection_get: look up a bundle's data selection.
  • Support bundle creation inserts BundleDataSelection::all() rows.
  • Support bundle deletion cleans up data selection rows in a transaction.
  • New test test_sitrep_support_bundle_requests_roundtrip; extended coverage in existing sitrep roundtrip and bundle lifecycle tests.
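
The read-side reassembly, in a heavily simplified and hypothetical sketch (stand-in row and selection types, two variants instead of five; the real code uses the nexus-db-model types):

use std::collections::HashMap;
use uuid::Uuid;

// Stand-ins for two of the five per-variant child-table row types.
struct ReconfiguratorRow { request_id: Uuid }
struct HostInfoRow { request_id: Uuid, all_sleds: bool, sled_ids: Vec<Uuid> }

// Simplified stand-in for BundleData; the real type carries filters too.
enum BundleData {
    Reconfigurator,
    HostInfo { all_sleds: bool, sled_ids: Vec<Uuid> },
}
type BundleDataSelection = Vec<BundleData>;

// Group each variant's rows by request ID to rebuild one selection per
// support bundle request.
fn reassemble(
    reconfigurator: Vec<ReconfiguratorRow>,
    host_info: Vec<HostInfoRow>,
) -> HashMap<Uuid, BundleDataSelection> {
    let mut selections: HashMap<Uuid, BundleDataSelection> = HashMap::new();
    for row in reconfigurator {
        selections
            .entry(row.request_id)
            .or_default()
            .push(BundleData::Reconfigurator);
    }
    for row in host_info {
        selections.entry(row.request_id).or_default().push(
            BundleData::HostInfo {
                all_sleds: row.all_sleds,
                sled_ids: row.sled_ids,
            },
        );
    }
    selections
}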

Also fix BundleDataSelection::all() to use now_db_precision() for microsecond-precision timestamps that survive CRDB roundtrips.
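
For reference, a stand-in sketch of that truncation (illustrative only, not the actual omicron helper):

use chrono::{DateTime, Timelike, Utc};

/// Stand-in for now_db_precision(): truncate to whole microseconds.
/// CockroachDB stores TIMESTAMPTZ at microsecond resolution, so a
/// nanosecond-precision timestamp would not compare equal to itself
/// after a write/read roundtrip.
fn now_db_precision() -> DateTime<Utc> {
    let now = Utc::now();
    let sub_micro_nanos = now.nanosecond() % 1_000;
    now.with_nanosecond(now.nanosecond() - sub_micro_nanos).unwrap_or(now)
}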

Context: #10062.

@mergeconflict force-pushed the mergeconflict/support-bundle-request-tables-models-queries branch from 751277e to 97a6eea on March 31, 2026 at 20:32
@mergeconflict added the fault-management label on Mar 31, 2026
@mergeconflict self-assigned this on Mar 31, 2026
Comment on lines -515 to -517
dsl::state
.eq(SupportBundleState::Destroying)
.or(dsl::state.eq(SupportBundleState::Failed)),
Collaborator

So, the old version of this code filtered for Destroying / Failed to avoid side-effects in cases when the bundle was not ready to be deleted.

I don't think these are possible - because of the precondition hoisted onto the caller, that all backing storage for a bundle (e.g. on a sled agent) should be deleted before calling this API - but it was added here out of an abundance of caution.

However, now that this function is issuing multiple writes in a transaction, it's possible that:

  1. We delete the "data selection" rows (unconditionally)
  2. We try to delete the bundle, but touch zero rows (it's not in a state that can be deleted, somehow)

And we return "Ok", with a partially deleted bundle.

This seems like it would be bad (granted, with the current callers, I think it would be impossible).

Could we instead (rough sketch below):

  • Check the state of the bundle
  • If it doesn't exist: return Ok (idempotency)
  • If it exists and is in a bad state, return an error
  • Otherwise: delete everything, in whatever order

Contributor Author

Thanks, this is a good catch. I think I've got this right in dc9583f, can you take a look?

Collaborator

Looks good to me!

Member

I think the change looks right, though it might be nice to have a test for that?

use omicron_uuid_kinds::{GenericUuid, SledUuid};

/// Shared payload for host-info data selection rows.
pub(crate) struct HostInfoData {
Collaborator

I'm kinda confused; what's the point of this struct? It looks like it's only constructed to be immediately converted into a BundleData struct. Why not just supply the arguments to BundleData directly?

(or, if this is complex, why not make this part of the BundleData constructor?)

Contributor Author

There are two pairs of nearly-identical HostInfo and Ereport model types -- one for the fm support bundle request child tables, and another for the actual support bundle child tables. They both need to convert into BundleData, and I wanted to deduplicate the not-totally-trivial copy-paste logic for those conversions. This seemed to be the least bad way to do that.

Alternatives I considered:

  • Just make the model types share their common substructure using #[diesel(embed)], but that wasn't possible.
  • Abstract over the model types with a trait with accessor methods for each of the common fields, but they would have to copy or clone if the fields are extracted one-by-one, which seemed ugly.
  • Just accept the code duplication. I could still be persuaded to do that if you don't like what I've done here.

Contributor Author

Deleted in dc9583f. I'm just duplicating the impl From code on both sides now, and it's fine.

Collaborator

I still think that if we can avoid duplication, that's worthwhile!

My critique was that you had two callsites using this struct:

From nexus/db-model/src/support_bundle.rs:

        BundleData::try_from(crate::bundle_data_selection::EreportFilterData {
            start_time: row.start_time,
            end_time: row.end_time,
            only_serials: row.only_serials,
            only_classes: row.only_classes,
        })

From nexus/db-model/src/fm/support_bundle_request.rs:

        BundleData::try_from(crate::bundle_data_selection::EreportFilterData {
            start_time: row.start_time,
            end_time: row.end_time,
            only_serials: row.only_serials,
            only_classes: row.only_classes,
        })

Both of these callsites pick apart different rows to create this EreportFilterData struct, and then construct the same BundleData type.

With your more recent commit, the PR now duplicates the conversion logic in both:

nexus/db-model/src/fm/support_bundle_request.rs:

impl TryFrom<nexus_db_model::fm::support_bundle_request::Ereports> for BundleData

nexus/db-model/src/support_bundle.rs:

impl TryFrom<nexus_db_model::support_bundle::Ereports> for BundleData

When I said: "why not make this part of the BundleData constructor?" I meant, why not add a function to BundleData, or to nexus_types::fm::ereport::EreportFilters directly, to take "start, end, serials, classes", and handle this logic?

Looking at the nexus/types crate, it looks like you already added a builder API for EreportFilters in e7d260a - why not use that, rather than inserting that same conversion logic once (or, as this PR currently does twice) more in the codebase?
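
To make that concrete, a purely hypothetical sketch of the single conversion point (the names here are made up for illustration, not the actual nexus_types API):

use chrono::{DateTime, Utc};

struct EreportFilters {
    start_time: Option<DateTime<Utc>>,
    end_time: Option<DateTime<Utc>>,
    only_serials: Vec<String>,
    only_classes: Vec<String>,
}

impl EreportFilters {
    /// One home for the row-to-filters logic, so both model types call
    /// this instead of each carrying its own TryFrom impl.
    fn from_row_parts(
        start_time: Option<DateTime<Utc>>,
        end_time: Option<DateTime<Utc>>,
        only_serials: Vec<String>,
        only_classes: Vec<String>,
    ) -> Result<Self, String> {
        // Validation that would otherwise be copy-pasted at each callsite.
        if let (Some(start), Some(end)) = (start_time, end_time) {
            if start > end {
                return Err("start_time must not exceed end_time".into());
            }
        }
        Ok(Self { start_time, end_time, only_serials, only_classes })
    }
}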

}
}

if !data_selection_reconfigurator_rows.is_empty() {
Collaborator

We don't need to block the PR on this, but:

There are many spots where we appear to operate on each of these sub-tables "effectively independently", but they actually execute serially, each one blocking the next.

These are all async futures. I wonder if it would be worth spawning tasks for all of these and joining them?

Contributor Author

I was wondering about this actually... I meant to ask and forgot. We use the async_bb8_diesel stuff to make all of our DB requests asynchronous, but from what I could tell, it seems like we always just immediately await? on the futures we get back rather than trying to parallelize.

What's your general feeling about this? Should we point Claude at the db-queries codebase and tell it to go parallelize some queries? You mentioned spawning tasks; if we wanted to issue queries in parallel, would we need a task for each, or just something like try_join!/join_all?

Member

Should we point Claude at the db-queries codebase and tell it to go parallelize some queries?

No, I think we'll want to be a lot more cautious about performing database operations in parallel --- for a lot of stuff where there are small queries, the overhead of spawning two tasks to both try and read one bit in parallel, both having to separately wait to check out connections from the pool first, and so on, is probably not gonna be worth it. But I think there are places where we are bulk reading a whole lot of data where we can do some of it in parallel.
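
For those bulk reads, the shape would be roughly this (stand-in query functions, not the actual datastore methods):

use futures::try_join;

// Stand-in queries; in the real code each would check out its own pool
// connection, which is part of the overhead being weighed above.
async fn load_host_info_rows() -> Result<Vec<String>, String> { Ok(vec![]) }
async fn load_ereport_rows() -> Result<Vec<String>, String> { Ok(vec![]) }

async fn load_all() -> Result<(Vec<String>, Vec<String>), String> {
    // try_join! polls both futures concurrently on the current task and
    // fails fast on the first error; spawned tasks are only needed if we
    // want true parallelism across executor threads.
    try_join!(load_host_info_rows(), load_ereport_rows())
}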

}
}

Ok(selections)
Collaborator

There are a lot of spots where we basically enumerate "all the selection types" for:

  • Reading
  • Writing
  • Deleting
  • Etc

Suppose someone adds a new selection type tomorrow, and misses one of these cases. How would we catch it?

(I think there are a lot of different ways to resolve this - generate code from a list, check for these values in tests, etc - but "missing one of these uses" seems like it would be a bad bug that currently won't throw compilation errors or fail existing tests.)

Contributor Author

Yeah this is a valid concern. We do exhaustive matching where we can (e.g. on the conversions from the enum to the DB model types), but on the read-from-the-DB paths that's not possible.

I think the simplest way to provide coverage for the production read path in practice is to make sure that whatever we've written to the DB round-trips correctly when we read it back out. I have that sort of half-done in this PR. I'll be able to provide better coverage specifically for the support_bundle data selection round-trip when I'm no longer hardcoding it to always write BundleDataSelection::all().

What do you think?

Member

Yeah, I think this is kind of a downside of BundleDataSelection being a hashmap...we don't have a struct initializer that fails if you forgot to load some row from the DB.

One solution for that might be some proptests or something that try to build an arbitrary BundleDataSelection and round-trip it through the DB, and assert that the thing that comes out is the same as the one that went in?
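
Hypothetically, something shaped like this (with a trivial in-memory stand-in for the DB write/read; the real test would go through the async datastore harness instead):

use std::collections::HashMap;

use proptest::prelude::*;

// Collapse BundleData down to a string payload for the sketch.
type BundleDataSelection = HashMap<String, String>;

fn arb_selection() -> impl Strategy<Value = BundleDataSelection> {
    proptest::collection::hash_map(".*", ".*", 0..5)
}

// Stand-in for "insert the selection, then load it back".
fn roundtrip_through_db(sel: &BundleDataSelection) -> BundleDataSelection {
    sel.clone()
}

proptest! {
    #[test]
    fn selection_roundtrips(sel in arb_selection()) {
        prop_assert_eq!(roundtrip_through_db(&sel), sel);
    }
}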

Contributor Author

One solution for that might be some proptests or something that try to build an arbitrary BundleDataSelection and round-trip it through the DB

Right, that's what I'd like to do. It's a bit inconvenient to do in this PR for the support_bundle side because we currently hardcode the write path to always write BundleDataSelection::all(). I could juggle the ordering of my commits around a bit to make that change (taking a BundleDataSelection) up front, if needed, to make it possible to write that test.

Collaborator

I think the simplest way to provide coverage for the production read path in practice is to make sure that whatever we've written to the DB round-trips correctly when we read it back out.

I think this helps! also depends on the test coverage being sufficiently updated. If we add a new "Data selection" type that never gets inserted in a test, we certainly wouldn't catch anything wrong on the read/delete pathway.

What do you think?

I went through a similar pattern in #10143 (I wanted to find all child tables of a sitrep). I ended up creating a macro (sitrep_child_tables!) that takes "here are the tables I care about" as input and generates a compile-time slice of all the variants I care about.

I think that's one way to handle this. Then, if a test has an "example bundle selection" of each type, it can also add an assertion to check against the auto-generated "slice of all bundle selections" to make sure it has sufficient coverage.
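
A hypothetical sketch of that pattern (the real sitrep_child_tables! macro looks different):

// One macro invocation is the single source of truth for the selection
// types; it generates both the enum and a compile-time slice of every
// variant that tests can check their coverage against.
macro_rules! selection_types {
    ($($name:ident),+ $(,)?) => {
        #[derive(Debug, PartialEq, Eq, Clone, Copy)]
        pub enum SelectionType { $($name),+ }

        pub const ALL_SELECTION_TYPES: &[SelectionType] =
            &[$(SelectionType::$name),+];
    };
}

selection_types! {
    Reconfigurator,
    HostInfo,
    SledCubbyInfo,
    SpDumps,
    Ereports,
}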

I do still think it's a great idea to have the round-trip tests as you suggested, as long as we have the coverage needed.

It might be worth extending test_sitrep_support_bundle_requests_roundtrip to check that we can actually delete the cases here. This would be doable by (1) making a new target sitrep, removing the old one from history, (2) triggering the GC function (you could just call fm_sitrep_gc_orphans, like the background task does). Otherwise, I think we're missing real coverage for the deletion pathway here.

@mergeconflict requested review from hawkw and smklein on April 1, 2026 at 21:06
@mergeconflict force-pushed the mergeconflict/support-bundle-request-tables-models-queries branch from dc9583f to 49ba2ee on April 2, 2026 at 19:52
@mergeconflict force-pushed the mergeconflict/support-bundle-request-tables-models-queries branch from f2eac4b to 58bf134 on April 2, 2026 at 21:01
Member

@hawkw left a comment

At this point, the rest of my comments are mostly nitpicks, but I think a nicer solution to #10200 (comment) is worth trying to figure out...

Comment on lines +53 to +56
/// Flags table row — tracks which payload-less data categories are selected.
/// Always inserted alongside the parent bundle request.
#[derive(Queryable, Insertable, Clone, Debug, Selectable)]
#[diesel(table_name = fm_support_bundle_request_data_selection_reconfigurator)]
pub struct Reconfigurator {
#[diesel(table_name = fm_support_bundle_request_data_selection_flags)]
Member

If this is always inserted alongside the parent support bundle request row...what is the benefit to splitting this across multiple tables? AFAICT, separating this into two things basically only has downsides (that they are inserted non-atomically and one must do a txn if you want to ensure that both are present).

Contributor Author

I could merge it into fm_support_bundle_request. Same with support_bundle_data_selection_flags, I could merge its three boolean columns into the support_bundle table. From the DB perspective that would totally make sense (you don't need to join them or use a txn or whatever).

But I did it this way because it felt cleaner to me not to mix them together. Like, in the domain of Rust types, a BundleDataSelection is its own thing. If I want to load the data selection for a bundle, I don't really want to have to read a SupportBundle(Request) and then drop the bits I don't care about. Likewise, I don't want the SupportBundle and BundleDataSelection models to have three random boolean fields that are unusable on their own.

Part of this is probably my fear and loathing of Diesel... If you can suggest a cleaner way that doesn't have the model-side disadvantages I described, I'm all ears.

Comment on lines +119 to +122
SupportBundleRequestDataSelectionFlags => { table: "fm_support_bundle_request_data_selection_flags" },
SupportBundleRequestDataSelectionHostInfo => { table: "fm_support_bundle_request_data_selection_host_info" },
SupportBundleRequestDataSelectionEreports => { table: "fm_support_bundle_request_data_selection_ereports" },
SupportBundleRequest => { table: "fm_support_bundle_request" },
Member

hm, not a blocker for this PR or anything, just a teensy side thing that @smklein can go do if he's super bored later: it kinda seems like it would be nice to be able to communicate to the code that constructs the background task status object that some of these are children of each other, so that the status can say "deleted N support bundle requests...which in turn required deleting X, Y, and Z of these other things...".

Collaborator

"deleted N support bundle requests...which in turn required deleting X, Y, and Z of these other things..."

I may need to read this PR more closely, but I don't think the current GC is implemented this way.

You're right that there's an implied relationship between "bundle requests" and "bundle request data selection" tables, but from the GC code point-of-view, they're both "tables with a defunct sitrep UUID".

The only table which "Triggers deletion of other tables" is the highest "sitrep table". After that, everything is considered an orphan child, because it holds a defunct sitrep_id reference.

(Unless you're saying we should do this?)


Comment on lines +788 to +809
if !support_bundles_requested.is_empty() {
diesel::insert_into(
support_bundle_req_dsl::fm_support_bundle_request,
)
.values(support_bundles_requested)
.execute_async(&*conn)
.await
.map_err(|e| {
public_error_from_diesel(e, ErrorHandler::Server)
.internal_context(
"failed to insert support bundle requests",
)
})?;
}

// Insert child data selection rows.
Self::data_selections_insert_on_conn(
&conn,
sitrep_id,
bundle_data_selections_requested,
)
.await?;
Member

nit, take it or leave it: it kinda seems to me that inserting a support bundle request and the bundle data selection rows should be a single datastore method?

Comment on lines +849 to +853
async fn data_selections_insert_on_conn(
conn: &async_bb8_diesel::Connection<DbConnection>,
sitrep_id: SitrepUuid,
data_selections: Vec<(SupportBundleUuid, BundleDataSelection)>,
) -> Result<(), InsertSitrepError> {
Member

Hm, I think it's probably nicer for DB performance to insert all the bundle data selection rows for every requested bundle in one big operation, but I do wonder: do we expect that every support bundle request will have a single flags row with false for every flag? Because if so, there's nothing really enforcing that at this level...

Contributor Author

We do expect that every fm_support_bundle_request row will have a corresponding fm_support_bundle_request_data_selection_flags row. The values of the columns will depend on the data selection (see flags_rows.push(DataSelectionFlags::from_sitrep(...)) below). Do I have a bug I'm not seeing?

///
/// Both callers run this within a transaction alongside the bundle row
/// delete, since the data selection tables have no ON DELETE CASCADE.
async fn support_bundle_data_selection_delete(
Member

Suggested change
-    async fn support_bundle_data_selection_delete(
+    async fn support_bundle_data_selection_delete_on_conn(


INSERT INTO omicron.public.support_bundle_data_selection_ereports (bundle_id, start_time, end_time, only_serials, only_classes)
SELECT id, NULL, NULL, ARRAY[]::TEXT[], ARRAY[]::TEXT[]
FROM omicron.public.support_bundle
ON CONFLICT DO NOTHING;
Member

I hadn't realized it was okay for a single file in a migration to contain multiple INSERTs, but according to the README I guess it is, that's good to know!

Contributor Author

It's true, for some reason!

Collaborator

It's because of this line from the cockroachdb docs:

CockroachDB only supports DDL changes within implicit transactions (individual statements). If a schema management tool uses transactions on your behalf, it should only execute one schema change operation per transaction.

source

All the upXX.sql files are implicitly put in a transaction - hence the restriction of "one DDL statement per file" - but "DDL" means "Data Definition Language" (i.e., creating indexes, tables, types, etc).

"DML" - "Data Manipulation Language" - in contrast, refers to all the normal data-using operations, like SELECT/INSERT/UPDATE/DELETE. These don't change the schema, they just act on data within the schema.

Anyway, TL;DR: You are limited to "one DDL statement", but you can write as much DML as you want within a transaction. "Many DML statements in a transaction" is basically just normal transaction usage!

Comment on lines +3182 to +3209
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_data_selection_flags (
bundle_id UUID NOT NULL,
include_reconfigurator BOOL NOT NULL,
include_sled_cubby_info BOOL NOT NULL,
include_sp_dumps BOOL NOT NULL,

PRIMARY KEY (bundle_id)
);

CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_data_selection_host_info (
bundle_id UUID NOT NULL,
all_sleds BOOL NOT NULL,
sled_ids UUID[] NOT NULL DEFAULT ARRAY[],

PRIMARY KEY (bundle_id),
CONSTRAINT all_sleds_and_specific_sleds_are_mutually_exclusive CHECK (NOT (all_sleds AND cardinality(sled_ids) > 0))
);

CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_data_selection_ereports (
bundle_id UUID NOT NULL,
start_time TIMESTAMPTZ,
end_time TIMESTAMPTZ,
only_serials TEXT[] NOT NULL DEFAULT ARRAY[],
only_classes TEXT[] NOT NULL DEFAULT ARRAY[],

PRIMARY KEY (bundle_id),
CHECK (start_time IS NULL OR end_time IS NULL OR start_time <= end_time)
);
Member

not to be too annoying, but I would like it if these were wrapped at the 80th column (we have no rustfmt for SQL, sadly)

Comment on lines +7549 to +7578
CREATE TABLE IF NOT EXISTS omicron.public.fm_support_bundle_request_data_selection_flags (
sitrep_id UUID NOT NULL,
request_id UUID NOT NULL,
include_reconfigurator BOOL NOT NULL,
include_sled_cubby_info BOOL NOT NULL,
include_sp_dumps BOOL NOT NULL,

PRIMARY KEY (sitrep_id, request_id)
);

CREATE TABLE IF NOT EXISTS omicron.public.fm_support_bundle_request_data_selection_host_info (
sitrep_id UUID NOT NULL,
request_id UUID NOT NULL,
all_sleds BOOL NOT NULL,
sled_ids UUID[] NOT NULL DEFAULT ARRAY[],

PRIMARY KEY (sitrep_id, request_id),
CONSTRAINT all_sleds_and_specific_sleds_are_mutually_exclusive CHECK (NOT (all_sleds AND cardinality(sled_ids) > 0))
);

CREATE TABLE IF NOT EXISTS omicron.public.fm_support_bundle_request_data_selection_ereports (
sitrep_id UUID NOT NULL,
request_id UUID NOT NULL,
start_time TIMESTAMPTZ,
end_time TIMESTAMPTZ,
only_serials TEXT[] NOT NULL DEFAULT ARRAY[],
only_classes TEXT[] NOT NULL DEFAULT ARRAY[],

PRIMARY KEY (sitrep_id, request_id),
CHECK (start_time IS NULL OR end_time IS NULL OR start_time <= end_time)
Member

similarly would be happier with line wrapping
