callers of sled_add_zone_nexus() should have to provide Nexus generation by davepacheco · Pull Request #9022 · oxidecomputer/omicron

davepacheco · 2025-09-15T23:02:34Z

In order to write a live test for Nexus handoff, I need to be able to use the blueprint builder to provision new Nexus zones at a specific generation:

omicron/live-tests/tests/test_nexus_handoff.rs

Lines 144 to 182 in f7c92b3

    
           // We're finally ready to start the test! 
        
           // 
        
           // For each zone in the current generation, create a replacement at the next 
        
           // generation. 
        
           let next_generation = blueprint1.nexus_generation.next(); 
        
           let planner_config = datastore 
        
               .reconfigurator_config_get_latest(opctx) 
        
               .await 
        
               .expect("obtained latest reconfigurator config") 
        
               .map_or_else(PlannerConfig::default, |cs| cs.config.planner_config); 
        
           let planning_input = 
        
               PlanningInputFromDb::assemble(opctx, datastore, planner_config) 
        
                   .await 
        
                   .expect("planning input"); 
        
           let (_blueprint1, blueprint2) = blueprint_edit_current_target( 
        
               log, 
        
               &planning_input, 
        
               &collection, 
        
               &nexus, 
        
               &|builder: &mut BlueprintBuilder| { 
        
                   for current_nexus in current_nexus_zones.values() { 
        
                       builder 
        
                           .sled_add_zone_nexus( 
        
                               current_nexus.sled_id, 
        
                               current_nexus.image_source.clone(), 
        
                               next_generation, 
        
                           ) 
        
                           .context("adding Nexus zone")?; 
        
                   } 
        
                   Ok(()) 
        
               }, 
        
           ) 
        
           .await 
        
           .expect("editing blueprint to add zones"); 
        
           info!( 
        
               log, 
        
               "wrote new target blueprint with new Nexus zones"; 
        
               "blueprint_id" => %blueprint2.id 
        
           );

This is basically what I was suggesting in these threads:
#8863 (comment)
#8936 (comment)

We could add yet a third function for this, but I think generally the builder should avoid making non-trivial policy choices (and this is a big policy choice).

This PR overlaps with #8936. If we think it makes more sense to put this directly into 8936, that's okay. But I'm hoping to decouple the work to add a live test for Nexus handoff from 8936 (because 8936 isn't logically necessary to test handoff).

smklein · 2025-09-15T23:18:02Z

nexus/reconfigurator/planning/src/blueprint_builder/builder.rs

        &mut self,
        sled_id: SledUuid,
        image_source: BlueprintZoneImageSource,
+        nexus_generation: Generation,


My hesitation around this API was that:

A caller can have a nexus generation which uses multiple distinct image sources (this would be a blippy failure, as of (7/N) Use nexus_generation, update it #8936)

A caller can provision a nexus generation totally independently of the "current active generation"

Can I ask: what are we trying to do with this live test? The nexus generation we want to use should be inferrable from the blueprint entirely during the planning phase, which is why I didn't supply it:

If any existing Nexus zone has the same image source, reuse its generation

Otherwise, use the highest existing generation + 1

So: What live tests situations are you trying to poke at that don't follow this logic?

The goal of the live test is to be able to test the Nexus handoff mechanism (which, as implemented today, has nothing to do with images at all) independent of what images are currently being run. By "the Nexus handoff mechanism", I'm talking about:

the blueprint properties that refer to Nexus generation (at the top level and in each Nexus zone's config)

all the interactions around the db_metadata_nexus records

the whole quiesce process

There's a lot of complexity there that has nothing to do with the images and it's useful to be able to test that. It's not just the live test. A developer working on handoff in the future may want to test their changes and there's no intrinsic reason they should have to build a new Nexus image for each test they want to run.

The relationship between images and generations (i.e., that new images generally should have a new generation) is definitely true about production systems and the planner should always ensure that, but I think it's useful that tests and reconfigurator-cli can let you do different things.

At a higher level - several months ago we had some conversations about the planner/builder split and how much the builder should enforce correctness via its API. I'm not finding where we wrote down bits of that discussion, but from recollection:

The planner/builder split has historically been a little muddy, and that's particularly painful when we make changes to the builder that assume it's only being used by the planner. reconfigurator-cli has been left out in the cold a few times, and we've gone back and made changes to the builder to allow it to do somewhat weirder things specifically for tests.

If there are things that are egregiously wrong and we'd have to go out of our way to make the builder allow it, we decided it's fine for the builder to enforce correctness, and a caller that wants to do that kind of testing would have to manually edit a blueprint instead of going through the builder. (An example of this was: the builder keeps track of changes to sled configs and automatically handles bumping their generation if a change is made. If a test wants to exercise "what happens if you make a change but don't bump the generation", they have to manually edit the resulting blueprint to decrement the generation back.)

Otherwise, it's okay for the builder to expose an API that can be used to produce questionable/incorrect blueprints, and expect that the planner has its own tests to ensure it doesn't produce such a thing (e.g., via blippy), and it's expected that reconfigurator-cli and other tests may intentionally use these APIs "incorrectly" to exercise various paths.

For any given API, it's really a judgment call which case it falls into.

I had no objection to sled_add_zone_nexus() inferring the correct nexus generation (like the builder does for sled config generations), but also have no objection to this change given we now have a use case for a test that wants to do something more explicit. If we didn't do this, how painful would it be to have the live test edit the blueprint manually to tweak the generation(s) of the Nexus(es) it adds?

…ion (#9022)

callers of sled_add_zone_nexus() should have to provide Nexus generation

cb98214

davepacheco requested review from jgallagher and smklein September 15, 2025 23:02

smklein reviewed Sep 15, 2025

View reviewed changes

fix live test

537fb45

smklein approved these changes Sep 15, 2025

View reviewed changes

fix clippy

454aef5

davepacheco mentioned this pull request Sep 16, 2025

add live test for Nexus handoff #9024

Merged

1 task

jgallagher approved these changes Sep 16, 2025

View reviewed changes

davepacheco merged commit 3c10adb into main Sep 16, 2025
16 checks passed

davepacheco deleted the dap/nexus-gen-required branch September 16, 2025 16:25

charliepark pushed a commit that referenced this pull request Sep 19, 2025

callers of sled_add_zone_nexus() should have to provide Nexus generat…

c20363e

…ion (#9022)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

callers of sled_add_zone_nexus() should have to provide Nexus generation#9022

callers of sled_add_zone_nexus() should have to provide Nexus generation#9022
davepacheco merged 3 commits intomainfrom
dap/nexus-gen-required

davepacheco commented Sep 15, 2025

Uh oh!

smklein Sep 15, 2025

Uh oh!

davepacheco Sep 15, 2025

Uh oh!

jgallagher Sep 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	// We're finally ready to start the test!
	//
	// For each zone in the current generation, create a replacement at the next
	// generation.
	let next_generation = blueprint1.nexus_generation.next();
	let planner_config = datastore
	.reconfigurator_config_get_latest(opctx)
	.await
	.expect("obtained latest reconfigurator config")
	.map_or_else(PlannerConfig::default, \|cs\| cs.config.planner_config);
	let planning_input =
	PlanningInputFromDb::assemble(opctx, datastore, planner_config)
	.await
	.expect("planning input");
	let (_blueprint1, blueprint2) = blueprint_edit_current_target(
	log,
	&planning_input,
	&collection,
	&nexus,
	&\|builder: &mut BlueprintBuilder\| {
	for current_nexus in current_nexus_zones.values() {
	builder
	.sled_add_zone_nexus(
	current_nexus.sled_id,
	current_nexus.image_source.clone(),
	next_generation,
	)
	.context("adding Nexus zone")?;
	}
	Ok(())
	},
	)
	.await
	.expect("editing blueprint to add zones");
	info!(
	log,
	"wrote new target blueprint with new Nexus zones";
	"blueprint_id" => %blueprint2.id
	);

Conversation

davepacheco commented Sep 15, 2025

Uh oh!

smklein Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

davepacheco Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

jgallagher Sep 16, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants