callers of sled_add_zone_nexus() should have to provide Nexus generation#9022
callers of sled_add_zone_nexus() should have to provide Nexus generation#9022davepacheco merged 3 commits intomainfrom
Conversation
| &mut self, | ||
| sled_id: SledUuid, | ||
| image_source: BlueprintZoneImageSource, | ||
| nexus_generation: Generation, |
There was a problem hiding this comment.
My hesitation around this API was that:
- A caller can have a nexus generation which uses multiple distinct image sources (this would be a blippy failure, as of (7/N) Use nexus_generation, update it #8936)
- A caller can provision a nexus generation totally independently of the "current active generation"
Can I ask: what are we trying to do with this live test? The nexus generation we want to use should be inferrable from the blueprint entirely during the planning phase, which is why I didn't supply it:
- If any existing Nexus zone has the same image source, reuse its generation
- Otherwise, use the highest existing generation + 1
So: What live tests situations are you trying to poke at that don't follow this logic?
There was a problem hiding this comment.
The goal of the live test is to be able to test the Nexus handoff mechanism (which, as implemented today, has nothing to do with images at all) independent of what images are currently being run. By "the Nexus handoff mechanism", I'm talking about:
- the blueprint properties that refer to Nexus generation (at the top level and in each Nexus zone's config)
- all the interactions around the
db_metadata_nexusrecords - the whole quiesce process
There's a lot of complexity there that has nothing to do with the images and it's useful to be able to test that. It's not just the live test. A developer working on handoff in the future may want to test their changes and there's no intrinsic reason they should have to build a new Nexus image for each test they want to run.
The relationship between images and generations (i.e., that new images generally should have a new generation) is definitely true about production systems and the planner should always ensure that, but I think it's useful that tests and reconfigurator-cli can let you do different things.
There was a problem hiding this comment.
At a higher level - several months ago we had some conversations about the planner/builder split and how much the builder should enforce correctness via its API. I'm not finding where we wrote down bits of that discussion, but from recollection:
- The planner/builder split has historically been a little muddy, and that's particularly painful when we make changes to the builder that assume it's only being used by the planner.
reconfigurator-clihas been left out in the cold a few times, and we've gone back and made changes to the builder to allow it to do somewhat weirder things specifically for tests. - If there are things that are egregiously wrong and we'd have to go out of our way to make the builder allow it, we decided it's fine for the builder to enforce correctness, and a caller that wants to do that kind of testing would have to manually edit a blueprint instead of going through the builder. (An example of this was: the builder keeps track of changes to sled configs and automatically handles bumping their generation if a change is made. If a test wants to exercise "what happens if you make a change but don't bump the generation", they have to manually edit the resulting blueprint to decrement the generation back.)
- Otherwise, it's okay for the builder to expose an API that can be used to produce questionable/incorrect blueprints, and expect that the planner has its own tests to ensure it doesn't produce such a thing (e.g., via blippy), and it's expected that
reconfigurator-cliand other tests may intentionally use these APIs "incorrectly" to exercise various paths. - For any given API, it's really a judgment call which case it falls into.
I had no objection to sled_add_zone_nexus() inferring the correct nexus generation (like the builder does for sled config generations), but also have no objection to this change given we now have a use case for a test that wants to do something more explicit. If we didn't do this, how painful would it be to have the live test edit the blueprint manually to tweak the generation(s) of the Nexus(es) it adds?
In order to write a live test for Nexus handoff, I need to be able to use the blueprint builder to provision new Nexus zones at a specific generation:
omicron/live-tests/tests/test_nexus_handoff.rs
Lines 144 to 182 in f7c92b3
This is basically what I was suggesting in these threads:
#8863 (comment)
#8936 (comment)
We could add yet a third function for this, but I think generally the builder should avoid making non-trivial policy choices (and this is a big policy choice).
This PR overlaps with #8936. If we think it makes more sense to put this directly into 8936, that's okay. But I'm hoping to decouple the work to add a live test for Nexus handoff from 8936 (because 8936 isn't logically necessary to test handoff).