Skip to content

callers of sled_add_zone_nexus() should have to provide Nexus generation#9022

Merged
davepacheco merged 3 commits intomainfrom
dap/nexus-gen-required
Sep 16, 2025
Merged

callers of sled_add_zone_nexus() should have to provide Nexus generation#9022
davepacheco merged 3 commits intomainfrom
dap/nexus-gen-required

Conversation

@davepacheco
Copy link
Copy Markdown
Collaborator

In order to write a live test for Nexus handoff, I need to be able to use the blueprint builder to provision new Nexus zones at a specific generation:

// We're finally ready to start the test!
//
// For each zone in the current generation, create a replacement at the next
// generation.
let next_generation = blueprint1.nexus_generation.next();
let planner_config = datastore
.reconfigurator_config_get_latest(opctx)
.await
.expect("obtained latest reconfigurator config")
.map_or_else(PlannerConfig::default, |cs| cs.config.planner_config);
let planning_input =
PlanningInputFromDb::assemble(opctx, datastore, planner_config)
.await
.expect("planning input");
let (_blueprint1, blueprint2) = blueprint_edit_current_target(
log,
&planning_input,
&collection,
&nexus,
&|builder: &mut BlueprintBuilder| {
for current_nexus in current_nexus_zones.values() {
builder
.sled_add_zone_nexus(
current_nexus.sled_id,
current_nexus.image_source.clone(),
next_generation,
)
.context("adding Nexus zone")?;
}
Ok(())
},
)
.await
.expect("editing blueprint to add zones");
info!(
log,
"wrote new target blueprint with new Nexus zones";
"blueprint_id" => %blueprint2.id
);

This is basically what I was suggesting in these threads:
#8863 (comment)
#8936 (comment)

We could add yet a third function for this, but I think generally the builder should avoid making non-trivial policy choices (and this is a big policy choice).

This PR overlaps with #8936. If we think it makes more sense to put this directly into 8936, that's okay. But I'm hoping to decouple the work to add a live test for Nexus handoff from 8936 (because 8936 isn't logically necessary to test handoff).

&mut self,
sled_id: SledUuid,
image_source: BlueprintZoneImageSource,
nexus_generation: Generation,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My hesitation around this API was that:

  • A caller can have a nexus generation which uses multiple distinct image sources (this would be a blippy failure, as of (7/N) Use nexus_generation, update it #8936)
  • A caller can provision a nexus generation totally independently of the "current active generation"

Can I ask: what are we trying to do with this live test? The nexus generation we want to use should be inferrable from the blueprint entirely during the planning phase, which is why I didn't supply it:

  • If any existing Nexus zone has the same image source, reuse its generation
  • Otherwise, use the highest existing generation + 1

So: What live tests situations are you trying to poke at that don't follow this logic?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of the live test is to be able to test the Nexus handoff mechanism (which, as implemented today, has nothing to do with images at all) independent of what images are currently being run. By "the Nexus handoff mechanism", I'm talking about:

  • the blueprint properties that refer to Nexus generation (at the top level and in each Nexus zone's config)
  • all the interactions around the db_metadata_nexus records
  • the whole quiesce process

There's a lot of complexity there that has nothing to do with the images and it's useful to be able to test that. It's not just the live test. A developer working on handoff in the future may want to test their changes and there's no intrinsic reason they should have to build a new Nexus image for each test they want to run.

The relationship between images and generations (i.e., that new images generally should have a new generation) is definitely true about production systems and the planner should always ensure that, but I think it's useful that tests and reconfigurator-cli can let you do different things.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a higher level - several months ago we had some conversations about the planner/builder split and how much the builder should enforce correctness via its API. I'm not finding where we wrote down bits of that discussion, but from recollection:

  • The planner/builder split has historically been a little muddy, and that's particularly painful when we make changes to the builder that assume it's only being used by the planner. reconfigurator-cli has been left out in the cold a few times, and we've gone back and made changes to the builder to allow it to do somewhat weirder things specifically for tests.
  • If there are things that are egregiously wrong and we'd have to go out of our way to make the builder allow it, we decided it's fine for the builder to enforce correctness, and a caller that wants to do that kind of testing would have to manually edit a blueprint instead of going through the builder. (An example of this was: the builder keeps track of changes to sled configs and automatically handles bumping their generation if a change is made. If a test wants to exercise "what happens if you make a change but don't bump the generation", they have to manually edit the resulting blueprint to decrement the generation back.)
  • Otherwise, it's okay for the builder to expose an API that can be used to produce questionable/incorrect blueprints, and expect that the planner has its own tests to ensure it doesn't produce such a thing (e.g., via blippy), and it's expected that reconfigurator-cli and other tests may intentionally use these APIs "incorrectly" to exercise various paths.
  • For any given API, it's really a judgment call which case it falls into.

I had no objection to sled_add_zone_nexus() inferring the correct nexus generation (like the builder does for sled config generations), but also have no objection to this change given we now have a use case for a test that wants to do something more explicit. If we didn't do this, how painful would it be to have the live test edit the blueprint manually to tweak the generation(s) of the Nexus(es) it adds?

@davepacheco davepacheco merged commit 3c10adb into main Sep 16, 2025
16 checks passed
@davepacheco davepacheco deleted the dap/nexus-gen-required branch September 16, 2025 16:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants