[2/n] [reconfigurator] test that Reconfigurator updates are in server-side topological order#10099
Conversation
Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1 [skip ci]
Created using spr 1.3.6-beta.1
Previously, the DNS server depended on its own client (dns-service-client) for two things: * `TransientServer` for a transient in-memory server. * The `dnsadm` binary. This isn't a real dependency in the sense that the DNS server doesn't use the client to call other instances of itself in normal use, so there was an exclusion for this in `api-manifest.toml`. This exclusion was causing issues for #10099. Move both of these out into their own crates, and drop the exclusion from `api-manifest.toml`. There is a small, benign change to the output of `ls-apis apis` (captured by `dev-tools/ls-apis/tests/api_dependencies.out`). With `--show-deps`: ``` DNS Server (client: dns-service-client) [...] consumed by: omicron-sled-agent (omicron/sled-agent) via 2 paths via path 1: path+file:///home/rain/dev/oxide/omicron/sled-agent#omicron-sled-agent@0.1.0 via path 2: path+file:///home/rain/dev/oxide/omicron/dns-server/transient#transient-dns-server@0.1.0 via path 2: path+file:///home/rain/dev/oxide/omicron/sled-agent#omicron-sled-agent@0.1.0 ```
Created using spr 1.3.6-beta.1
|
The NTP issue is covered by #8769, which I believe will be resolved with this change. |
davepacheco
left a comment
There was a problem hiding this comment.
Nice! I've reviewed everything except the new test itself here.
| // The NTP Admin API is client-side-versioned and currently frozen. We | ||
| // allow trivial changes to go through. If we did not, we would need to | ||
| // unfreeze the API and bump the version number for trivial changes. | ||
| .allow_trivial_changes_for_latest(), |
There was a problem hiding this comment.
Does this mean that:
- we allow the user to update the OpenAPI document for a blessed version as long as it's trivial, or
- we allow the generated OpenAPI document to diverge from the corresponding blessed one as long as the changes are trivial?
The first one seems better for the reasons we've previously discussed not wanting to accumulate delta (it makes it harder for future people to see what their changes were) but I guess either is okay right now.
| // Do not create new versions of this client-side versioned API. | ||
| // https://github.com/oxidecomputer/omicron/issues/9290 |
There was a problem hiding this comment.
I do feel like we discussed this comment in the context of some other API and settled on the text you've got here. Is that right?
I don't personally love it because it's not quite true that you can't create new versions. If we need a new endpoint for example we do have options like adding a new version and not using it until the next release or tolerate the endpoint failing until the next release, etc.
| } | ||
|
|
||
| /// Record metrics from a CockroachDB node | ||
| /// Record metrics from a CockroachDB node. |
There was a problem hiding this comment.
This is not important but this change seems superfluous. It makes it differ from the others here in terms of style. (At some point, I picked up the pattern of not using punctuation on fragments used as the one-line summary. I don't know what we do more often in Omicron.)
I wouldn't even mention it but it's the only change to this file.
| //! Types shared between `omicron-ls-apis` and consumers of its output | ||
| //! (e.g. the reconfigurator planner tests). |
There was a problem hiding this comment.
What about giving this a more meaningful name, like omicron-deployment-graph?
| .context("serializing DAG edges as TOML")?; | ||
| print!( | ||
| "# BEGIN @generated server-side deployment unit DAG edges.\n\ | ||
| # To regenerate, run `EXPECTORATE=overwrite cargo nextest run -p omicron-ls-apis`.\n\ |
There was a problem hiding this comment.
What do you think of:
| # To regenerate, run `EXPECTORATE=overwrite cargo nextest run -p omicron-ls-apis`.\n\ | |
| # To regenerate, run `EXPECTORATE=overwrite cargo nextest run -p omicron-ls-apis`.\n\ | |
| # YOU SHOULD STILL REVIEW CHANGES TO THIS FILE FOR CORRECTNESS. |
| Sled Agent only uses the DNS service client during RSS, which we don't care | ||
| about for the purpose of upgrade. | ||
| """ | ||
|
|
There was a problem hiding this comment.
Should the comment about Non-DAG dependencies have gotten preserved here?
| [[dependency_filter_rules]] | ||
| ancestor = "omicron-sled-agent" | ||
| client = "cockroach-admin-client" | ||
| evaluation = "rss-only" | ||
| note = """ | ||
| Sled Agent only uses the Cockroach Admin client during RSS, which we don't care | ||
| about for the purpose of upgrade. | ||
| """ | ||
|
|
||
| [[dependency_filter_rules]] | ||
| ancestor = "omicron-sled-agent" | ||
| client = "dns-service-client" | ||
| evaluation = "rss-only" | ||
| note = """ | ||
| Sled Agent only uses the DNS service client during RSS, which we don't care | ||
| about for the purpose of upgrade. | ||
| """ |
There was a problem hiding this comment.
We also discussed putting RSS into its own package so that we could eliminate the risk that some other part of sled agent grows these dependencies. Do you have any idea how much work that is? Seems at least worth filing an issue.
| # The host OS includes Sled Agent, Propolis, and all the components that get | ||
| # bundled into the switch zone. | ||
| [[deployment_units]] | ||
| id = "host_os" |
There was a problem hiding this comment.
I think adding a second field here makes it pretty likely people will copy/paste and forget to change either the id or label.
Out of curiosity, why not use the label for this (or use this as the label)?
If we want to preserve both as distinct things:
- Should we make
deployment_unitsa map instead of an array? (I wonder if we could make it aniddqdmap withidas the key?) - Should we add a check that the ids and labels are both unique?
| /// the update DAG. Unlike `NonDag`, | ||
| /// this does not require that the target API uses client-side versioning. |
There was a problem hiding this comment.
spurious line break here
| } | ||
|
|
||
| /// Maps a `ZoneKind` to its omicron-ls-apis deployment unit ID. | ||
| fn zone_kind_to_deployment_unit(kind: ZoneKind) -> DeploymentUnitId { |
There was a problem hiding this comment.
Hmm. This feels brittle but I'm not sure what to do about it.
I guess we could make an enum of deployment unit ids in ls-apis-shared and expect that the API manifest uses one of those?
Add a test which makes sure that when Reconfigurator runs an update, it is in topological order as defined by omicron-ls-apis (only considering server-side versioned APIs): for a particular producer/consumer pair, all instances of producer must be updated before the first instance of consumer.
The point of coordination between the two is a new file,
dev-tools/ls-apis/tests/output/deployment-unit-dag.toml.api-manifest.tomlwith just the edges (and there's an expectorate test for it).This exercise found three violations:
rss-onlydependency filter rule. (rss-onlyis similar tonon-dag, except it doesn't check thatversioned_howisclient; there is a larger discussion to be had about how we may wish to define client-side versioning as applying to edges rather than nodes, but that is outside the scope of this work.)Depends on: