fm: persist support bundle requests and data selections #10200
mergeconflict wants to merge 3 commits into main from
Conversation
751277e to 97a6eea (Compare)
dsl::state
    .eq(SupportBundleState::Destroying)
    .or(dsl::state.eq(SupportBundleState::Failed)),
So, the old version of this code filtered for Destroying / Failed to avoid side-effects in cases when the bundle was not ready to be deleted.
I don't think these are possible - because of the precondition hoisted onto the caller, that all backing storage for a bundle (e.g. on a sled agent) should be deleted before calling this API - but it was added here out of an abundance of caution.
However, now that this function is issuing multiple writes in a transaction, it's possible that:
- We delete the "data selection table" (unconditionally)
- We try to delete the bundle, but touch zero rows (it's not in a state that can be deleted, somehow)
And we return "Ok", with a partially deleted bundle.
This seems like it would be bad (granted, with the current callers, I think it would be impossible).
Could we instead:
- Check the state of the bundle
- If it doesn't exist: return Ok (idempotency)
- If it exists and is in a bad state, return an error
- Otherwise: delete everything, in whatever order
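The proposed flow can be sketched with stand-in types; the state enum, the maps standing in for tables, and `delete_bundle` below are illustrative only, not the real Diesel datastore code:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum BundleState {
    Active,
    Destroying,
    Failed,
}

// Validate-then-delete: no partial deletion is observable, because we
// check the bundle's state before touching any "rows".
fn delete_bundle(
    bundles: &mut HashMap<u32, BundleState>,
    selections: &mut HashMap<u32, Vec<String>>,
    id: u32,
) -> Result<(), String> {
    match bundles.get(&id) {
        // Doesn't exist: return Ok, for idempotency.
        None => Ok(()),
        // Exists but in a bad state: error out before deleting anything.
        Some(BundleState::Active) => {
            Err(format!("bundle {id} is still active"))
        }
        // Otherwise: delete everything, in whatever order.
        Some(BundleState::Destroying) | Some(BundleState::Failed) => {
            selections.remove(&id);
            bundles.remove(&id);
            Ok(())
        }
    }
}
```

The key property is that the unconditional child-table delete only happens after the state check has passed, so "Ok with a partially deleted bundle" cannot occur.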
Thanks, this is a good catch. I think I've got this right in dc9583f, can you take a look?
I think the change looks right, though it might be nice to have a test for that?
use omicron_uuid_kinds::{GenericUuid, SledUuid};

/// Shared payload for host-info data selection rows.
pub(crate) struct HostInfoData {
I'm kinda confused; what's the point of this struct? looks like it's only being constructed to just immediately be converted into a BundleData struct? Why not just supply the arguments to BundleData directly?
(or, if this is complex, why not make this part of the BundleData constructor?)
There are two pairs of nearly-identical HostInfo and Ereport model types -- one for the fm support bundle request child tables, and another for the actual support bundle child tables. They both need to convert into BundleData, and I wanted to deduplicate the not-totally-trivial copy-paste logic for those conversions. This seemed to be the least bad way to do that.
Alternatives I considered:
- Just make the model types share their common substructure using #[diesel(embed)], but that wasn't possible.
- Abstract over the model types with a trait with accessor methods for each of the common fields, but they would have to copy or clone if the fields are extracted one-by-one, which seemed ugly.
- Just accept the code duplication. I could still be persuaded to do that if you don't like what I've done here.
Deleted in dc9583f. I'm just duplicating the impl From code on both sides now, and it's fine.
I still think that if we can avoid duplication, that's worthwhile!
My critique was that you had two callsites using this struct:
From nexus/db-model/src/support_bundle.rs:

BundleData::try_from(crate::bundle_data_selection::EreportFilterData {
    start_time: row.start_time,
    end_time: row.end_time,
    only_serials: row.only_serials,
    only_classes: row.only_classes,
})

From nexus/db-model/src/fm/support_bundle_request.rs:

BundleData::try_from(crate::bundle_data_selection::EreportFilterData {
    start_time: row.start_time,
    end_time: row.end_time,
    only_serials: row.only_serials,
    only_classes: row.only_classes,
})

Both of these callsites pick apart different rows to create this EreportFilterData struct, and then construct the same BundleData type.
With your more recent commit, your PR right now is duplicating the logic of the conversion in both:
nexus/db-model/src/fm/support_bundle_request.rs:

impl TryFrom<nexus_db_model::fm::support_bundle_request::Ereports> for BundleData

nexus/db-model/src/support_bundle.rs:

impl TryFrom<nexus_db_model::support_bundle::Ereports> for BundleData

When I said "why not make this part of the BundleData constructor?", I meant: why not add a function to BundleData, or to nexus_types::fm::ereport::EreportFilters directly, that takes "start, end, serials, classes" and handles this logic?
Looking at the nexus/types crate, it looks like you already added a builder API for EreportFilters in e7d260a - why not use that, rather than inserting that same conversion logic once (or, as this PR currently does twice) more in the codebase?
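A sketch of that suggestion with stand-in types: the real BundleData and EreportFilters live in the nexus crates, and the field types, error handling, and constructor name here are simplified illustrations, not the actual API.

```rust
// Stand-in for nexus_types::fm::ereport::EreportFilters.
#[derive(Debug, PartialEq)]
struct EreportFilters {
    start_time: Option<i64>,
    end_time: Option<i64>,
    only_serials: Vec<String>,
    only_classes: Vec<String>,
}

// Stand-in for the BundleData variant being constructed.
#[derive(Debug, PartialEq)]
enum BundleData {
    Ereports(EreportFilters),
}

impl BundleData {
    // The conversion logic lives in exactly one place; both row types
    // (the fm request row and the support bundle row) would call this
    // instead of each carrying their own TryFrom impl.
    fn ereports(
        start_time: Option<i64>,
        end_time: Option<i64>,
        only_serials: Vec<String>,
        only_classes: Vec<String>,
    ) -> Result<Self, String> {
        if let (Some(s), Some(e)) = (start_time, end_time) {
            if s > e {
                return Err("start_time must not exceed end_time".into());
            }
        }
        Ok(BundleData::Ereports(EreportFilters {
            start_time,
            end_time,
            only_serials,
            only_classes,
        }))
    }
}
```

Each callsite then shrinks to picking the four fields off its own row type and passing them in, with no duplicated validation or construction logic.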
    }
}

if !data_selection_reconfigurator_rows.is_empty() {
We don't need to block the PR on this, but:
There are many spots where we appear to be operating on each of these sub-tables "effectively independently" but they're functionally blocked in a linear order.
These are all async futures. I wonder if it would be worth spawning tasks for all of these and joining them?
I was wondering about this actually... I meant to ask and forgot. We use the async_bb8_diesel stuff to make all of our DB requests asynchronous, but from what I could tell, it seems like we always just immediately .await? the futures we get back rather than trying to parallelize.
What's your general feeling about this? Should we point Claude at the db-queries codebase and tell it to go parallelize some queries? You mentioned spawning tasks; if we wanted to issue queries in parallel, would we need a task for each, or just something like try_join!/join_all?
Should we point Claude at the db-queries codebase and tell it to go parallelize some queries?
No, I think we'll want to be a lot more cautious about performing database operations in parallel --- for a lot of stuff where there are small queries, the overhead of spawning two tasks to both try and read one bit in parallel, both having to separately wait to check out connections from the pool first, and so on, is probably not gonna be worth it. But I think there are places where we are bulk reading a whole lot of data where we can do some of it in parallel.
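A rough illustration of that tradeoff, using scoped OS threads in place of async tasks (the real code would join futures with something like try_join! rather than spawning threads; the read functions here are pure simulations):

```rust
use std::thread;

// Cheap read: not worth the overhead of a separate task/connection.
fn small_read(flag: bool) -> bool {
    flag
}

// Expensive bulk read: worth fanning out.
fn bulk_read(n: usize) -> Vec<u64> {
    (0..n as u64).collect()
}

fn load_everything() -> (bool, Vec<u64>, Vec<u64>) {
    // Small query stays sequential.
    let flag = small_read(true);
    // The two large bulk reads run concurrently and are joined at the end.
    let (ereports, host_info) = thread::scope(|s| {
        let a = s.spawn(|| bulk_read(1000));
        let b = s.spawn(|| bulk_read(2000));
        (a.join().unwrap(), b.join().unwrap())
    });
    (flag, ereports, host_info)
}
```

The shape matches the comment above: parallelism is applied selectively to the reads that move a lot of data, not blanket-applied to every one-bit query.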
    }
}

Ok(selections)
There are a lot of spots where we basically enumerate "all the selection types" for:
- Reading
- Writing
- Deleting
- Etc
Suppose someone adds a new selection type tomorrow, and misses one of these cases. How would we catch it?
(I think there are a lot of different ways to resolve this - generate code from a list, check for these values in tests, etc - but "missing one of these uses" seems like it would be a bad bug, that currently won't throw compilation errors / fail existing tests)
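One way to get that compile-time safety, sketched with an illustrative enum (the variant and table names here are made up for the example): route each read/write/delete path through an exhaustive match, and keep a const slice of all variants for tests to enumerate.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum SelectionKind {
    Reconfigurator,
    HostInfo,
    Ereports,
}

impl SelectionKind {
    // Adding a new variant breaks compilation here (and in every other
    // exhaustive match) until the new case is handled.
    fn table_name(self) -> &'static str {
        match self {
            SelectionKind::Reconfigurator => "data_selection_reconfigurator",
            SelectionKind::HostInfo => "data_selection_host_info",
            SelectionKind::Ereports => "data_selection_ereports",
        }
    }

    // Tests iterate this to assert every variant has coverage; forgetting
    // to extend it is caught by a length assertion against the enum.
    const ALL: [SelectionKind; 3] = [
        SelectionKind::Reconfigurator,
        SelectionKind::HostInfo,
        SelectionKind::Ereports,
    ];
}
```

This doesn't help the read-from-the-DB paths directly, but it gives tests a single authoritative list of selection types to check against.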
Yeah this is a valid concern. We do exhaustive matching where we can (e.g. on the conversions from the enum to the DB model types), but on the read-from-the-DB paths that's not possible.
I think the simplest way to provide coverage for the production read path in practice is to make sure that whatever we've written to the DB round-trips correctly when we read it back out. I have that sort of half-done in this PR. I'll be able to provide better coverage specifically for the support_bundle data selection round-trip when I'm no longer hardcoding it to always write BundleDataSelection::all().
What do you think?
Yeah, I think this is kind of a downside of BundleDataSelection being a hashmap... we don't have a struct initializer that fails if you forgot to load some row from the DB.
One solution for that might be some proptests or something that try to build an arbitrary BundleDataSelection and round-trip it through the DB, and assert that the thing that comes out is the same as the one that went in?
One solution for that might be some proptests or something that try to build an arbitrary BundleDataSelection and round-trip it through the DB
Right, that's what I'd like to do. It's a bit inconvenient to do in this PR for the support_bundle side because we currently hardcode the write path to always write BundleDataSelection::all(). I could juggle the ordering of my commits around a bit to make that change (taking a BundleDataSelection) up front, if needed, to make it possible to write that test.
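The shape of such a round-trip test, with an in-memory stand-in for the DB (a real version would use proptest-generated BundleDataSelection values and the actual datastore; the types here are simplified placeholders):

```rust
use std::collections::HashMap;

// Stand-in for BundleDataSelection: variant name -> enabled flag.
type Selection = HashMap<String, bool>;

// Stand-in for decomposing a selection into per-variant child-table rows.
fn write_rows(sel: &Selection) -> Vec<(String, bool)> {
    sel.iter().map(|(k, v)| (k.clone(), *v)).collect()
}

// Stand-in for reassembling a selection from the rows read back.
fn read_rows(rows: &[(String, bool)]) -> Selection {
    rows.iter().cloned().collect()
}
```

The test then asserts that whatever went in comes back out unchanged, which is exactly the property that catches a forgotten row on the read path.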
I think the simplest way to provide coverage for the production read path in practice is to make sure that whatever we've written to the DB round-trips correctly when we read it back out.
I think this helps! also depends on the test coverage being sufficiently updated. If we add a new "Data selection" type that never gets inserted in a test, we certainly wouldn't catch anything wrong on the read/delete pathway.
What do you think?
I went through a similar pattern in #10143 (I wanted to find all child tables of a sitrep). I ended up creating a macro here (sitrep_child_tables!) which helped me input "here are the tables I care about", which generated a compile-time slice of all the variants I cared about.
I think that's one way to handle this. Then, if a test has an "example bundle selection" of each type, it can also add an assertion to check against the auto-generated "slice of all bundle selections" to make sure it has sufficient coverage.
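A minimal sketch of that macro pattern, modeled loosely on the sitrep_child_tables! idea (the macro itself is hypothetical; the table names are taken from this PR): each table is listed exactly once, and both a compile-time slice and per-table helpers are generated from that single list.

```rust
// List each child table once; generate a const slice of all of them plus
// a helper function per table. Forgetting to list a table here is the
// only way to miss it everywhere.
macro_rules! child_tables {
    ($($name:ident => $table:literal),+ $(,)?) => {
        const ALL_CHILD_TABLES: &[&str] = &[$($table),+];
        $(
            fn $name() -> &'static str { $table }
        )+
    };
}

child_tables! {
    flags => "fm_support_bundle_request_data_selection_flags",
    host_info => "fm_support_bundle_request_data_selection_host_info",
    ereports => "fm_support_bundle_request_data_selection_ereports",
}
```

A test with an "example bundle selection" per type can then assert its coverage against ALL_CHILD_TABLES, so adding a table without adding a test case fails loudly.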
I do still think it's a great idea to have the round-trip tests as you suggested, as long as we have the coverage needed.
It might be worth extending test_sitrep_support_bundle_requests_roundtrip to check that we can actually delete the cases here. This would be doable by (1) making a new target sitrep, removing the old one from history, (2) triggering the GC function (you could just call fm_sitrep_gc_orphans, like the background task does). Otherwise, I think we're missing real coverage for the deletion pathway here.
dc9583f to 49ba2ee (Compare)
f2eac4b to 58bf134 (Compare)
hawkw left a comment
At this point, the rest of my comments are mostly nitpicks, but I think a nicer solution to #10200 (comment) is worth trying to figure out...
/// Flags table row — tracks which payload-less data categories are selected.
/// Always inserted alongside the parent bundle request.
#[derive(Queryable, Insertable, Clone, Debug, Selectable)]
#[diesel(table_name = fm_support_bundle_request_data_selection_reconfigurator)]
pub struct Reconfigurator {
#[diesel(table_name = fm_support_bundle_request_data_selection_flags)]
If this is always inserted alongside the parent support bundle request row...what is the benefit to splitting this across multiple tables? AFAICT, separating this into two things basically only has downsides (that they are inserted non-atomically and one must do a txn if you want to ensure that both are present).
I could merge it into fm_support_bundle_request. Same with support_bundle_data_selection_flags, I could merge its three boolean columns into the support_bundle table. From the DB perspective that would totally make sense (you don't need to join them or use a txn or whatever).
But I did it this way because it felt cleaner to me not to mix them together. Like, in the domain of rust types, a BundleDataSelection is its own thing. If I want to load the data selection for a bundle, I don't really want to have to read a SupportBundle(Request) and then drop the bits I don't care about. Likewise, I don't want the SupportBundle and BundleDataSelection models to have three random boolean fields that are unusable on their own.
Part of this is probably my fear and loathing of Diesel... If you can suggest a cleaner way that doesn't have the model-side disadvantages I described, I'm all ears.
SupportBundleRequestDataSelectionFlags => { table: "fm_support_bundle_request_data_selection_flags" },
SupportBundleRequestDataSelectionHostInfo => { table: "fm_support_bundle_request_data_selection_host_info" },
SupportBundleRequestDataSelectionEreports => { table: "fm_support_bundle_request_data_selection_ereports" },
SupportBundleRequest => { table: "fm_support_bundle_request" },
hm, not a blocker for this PR or anything, but more of a teensy side thing that @smklein can go do if he's super bored later, but: it kinda seems like it would be nice to be able to communicate to the code that constructs the background task status object that some of these are children of each other, so that the status can say "deleted N support bundle requests...which in turn required deleting X, Y, and Z of these other things...".
"deleted N support bundle requests...which in turn required deleting X, Y, and Z of these other things..."
I may need to read this PR more closely, but I don't think the current GC is implemented this way.
You're right that there's an implied relationship between "bundle requests" and "bundle request data selection" tables, but from the GC code point-of-view, they're both "tables with a defunct sitrep UUID".
The only table which "Triggers deletion of other tables" is the highest "sitrep table". After that, everything is considered an orphan child, because it holds a defunct sitrep_id reference.
(Unless you're saying we should do this?)
if !support_bundles_requested.is_empty() {
    diesel::insert_into(
        support_bundle_req_dsl::fm_support_bundle_request,
    )
    .values(support_bundles_requested)
    .execute_async(&*conn)
    .await
    .map_err(|e| {
        public_error_from_diesel(e, ErrorHandler::Server)
            .internal_context(
                "failed to insert support bundle requests",
            )
    })?;
}

// Insert child data selection rows.
Self::data_selections_insert_on_conn(
    &conn,
    sitrep_id,
    bundle_data_selections_requested,
)
.await?;
nit, take it or leave it: it kinda seems to me that inserting a support bundle request and the bundle data selection rows should be a single datastore method?
async fn data_selections_insert_on_conn(
    conn: &async_bb8_diesel::Connection<DbConnection>,
    sitrep_id: SitrepUuid,
    data_selections: Vec<(SupportBundleUuid, BundleDataSelection)>,
) -> Result<(), InsertSitrepError> {
Hm, I think it's probably nicer for DB performance to insert all the bundle data selection rows for every requested bundle in one big operation, but I do wonder: do we expect that every support bundle request will have a single flags row with false for every flag? Because if so, there's nothing really enforcing that at this level...
We do expect that every fm_support_bundle_request row will have a corresponding fm_support_bundle_request_data_selection_flags row. The values of the columns will depend on the data selection (see flags_rows.push(DataSelectionFlags::from_sitrep(...)) below). Do I have a bug I'm not seeing?
///
/// Both callers run this within a transaction alongside the bundle row
/// delete, since the data selection tables have no ON DELETE CASCADE.
async fn support_bundle_data_selection_delete(
Suggested change:
- async fn support_bundle_data_selection_delete(
+ async fn support_bundle_data_selection_delete_on_conn(
INSERT INTO omicron.public.support_bundle_data_selection_ereports (bundle_id, start_time, end_time, only_serials, only_classes)
SELECT id, NULL, NULL, ARRAY[]::TEXT[], ARRAY[]::TEXT[]
FROM omicron.public.support_bundle
ON CONFLICT DO NOTHING;
I hadn't realized it was okay for a single file in a migration to contain multiple INSERTs, but according to the README I guess it is, that's good to know!
It's true, for some reason!
It's because of this line from the cockroachdb docs:
CockroachDB only supports DDL changes within implicit transactions (individual statements). If a schema management tool uses transactions on your behalf, it should only execute one schema change operation per transaction.
All the upXX.sql files are implicitly put in transactions - hence the restriction of "one DDL statement per file" - but "DDL" means "Data Definition Language" (i.e., creating indexes, tables, types, etc.).
"DML" - "Data Manipulation Language" - in contrast, refers to all the normal data-using operations, like SELECT/INSERT/UPDATE/DELETE. These don't change the schema, they just act on data within the schema.
Anyway, TL;DR: You are limited to "one DDL statement", but you can write as much DML as you want within a transaction. "Many DML statements in a transaction" is basically just normal transaction usage!
CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_data_selection_flags (
    bundle_id UUID NOT NULL,
    include_reconfigurator BOOL NOT NULL,
    include_sled_cubby_info BOOL NOT NULL,
    include_sp_dumps BOOL NOT NULL,

    PRIMARY KEY (bundle_id)
);

CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_data_selection_host_info (
    bundle_id UUID NOT NULL,
    all_sleds BOOL NOT NULL,
    sled_ids UUID[] NOT NULL DEFAULT ARRAY[],

    PRIMARY KEY (bundle_id),
    CONSTRAINT all_sleds_and_specific_sleds_are_mutually_exclusive CHECK (NOT (all_sleds AND cardinality(sled_ids) > 0))
);

CREATE TABLE IF NOT EXISTS omicron.public.support_bundle_data_selection_ereports (
    bundle_id UUID NOT NULL,
    start_time TIMESTAMPTZ,
    end_time TIMESTAMPTZ,
    only_serials TEXT[] NOT NULL DEFAULT ARRAY[],
    only_classes TEXT[] NOT NULL DEFAULT ARRAY[],

    PRIMARY KEY (bundle_id),
    CHECK (start_time IS NULL OR end_time IS NULL OR start_time <= end_time)
);
not to be too annoying, but i would like it if these were wrapped at the 80th column (we have no rustfmt for SQL sadly)
CREATE TABLE IF NOT EXISTS omicron.public.fm_support_bundle_request_data_selection_flags (
    sitrep_id UUID NOT NULL,
    request_id UUID NOT NULL,
    include_reconfigurator BOOL NOT NULL,
    include_sled_cubby_info BOOL NOT NULL,
    include_sp_dumps BOOL NOT NULL,

    PRIMARY KEY (sitrep_id, request_id)
);

CREATE TABLE IF NOT EXISTS omicron.public.fm_support_bundle_request_data_selection_host_info (
    sitrep_id UUID NOT NULL,
    request_id UUID NOT NULL,
    all_sleds BOOL NOT NULL,
    sled_ids UUID[] NOT NULL DEFAULT ARRAY[],

    PRIMARY KEY (sitrep_id, request_id),
    CONSTRAINT all_sleds_and_specific_sleds_are_mutually_exclusive CHECK (NOT (all_sleds AND cardinality(sled_ids) > 0))
);

CREATE TABLE IF NOT EXISTS omicron.public.fm_support_bundle_request_data_selection_ereports (
    sitrep_id UUID NOT NULL,
    request_id UUID NOT NULL,
    start_time TIMESTAMPTZ,
    end_time TIMESTAMPTZ,
    only_serials TEXT[] NOT NULL DEFAULT ARRAY[],
    only_classes TEXT[] NOT NULL DEFAULT ARRAY[],

    PRIMARY KEY (sitrep_id, request_id),
    CHECK (start_time IS NULL OR end_time IS NULL OR start_time <= end_time)
similarly would be happier with line wrapping
Add database tables, models, and datastore methods for persisting BundleDataSelection (the set of data categories a support bundle should collect) in two contexts:

FM sitrep support bundle requests: an fm_support_bundle_request table (new) links a bundle request to a case and sitrep, and five per-variant child tables (fm_support_bundle_request_data_selection_{reconfigurator, host_info, sled_cubby_info, sp_dumps, ereports}) record which data categories to gather and with what filters.

Support bundle data selection: five analogous per-variant child tables (support_bundle_data_selection_{reconfigurator, host_info, sled_cubby_info, sp_dumps, ereports}) record the data selection for each support bundle row. Existing bundles are backfilled with BundleDataSelection::all() in the migration.

Schema changes (version 247):
- fm_support_bundle_request table with indexes.
- fm_support_bundle_request_data_selection_* child tables.
- support_bundle.fm_case_id: nullable column linking a bundle to the originating FM case.
- support_bundle_data_selection_* child tables.

Model changes:
- SupportBundleRequest and child row models.
- SupportBundle.

Datastore changes:
- fm_sitrep_read: refactor alert-request loading into fm_sitrep_alert_requests_read; add symmetric fm_sitrep_support_bundle_requests_read that loads request rows and reassembles BundleDataSelection from child tables.
- fm_sitrep_insert: decompose BundleDataSelection into per-variant rows on write.
- support_bundle_data_selection_get: look up a bundle's data selection.
- test_sitrep_support_bundle_requests_roundtrip; extended coverage in existing sitrep roundtrip and bundle lifecycle tests.

Also fix BundleDataSelection::all() to use now_db_precision() for microsecond-precision timestamps that survive CRDB roundtrips.

Context: #10062.