Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727
Conversation
…outes_to_edgezero
When edgezero_enabled=true, reads edgezero_rollout_pct (0-100) from the trusted_server_config store. Routes each request to EdgeZero if fnv1a_bucket(client_ip) < rollout_pct, giving the ops team sticky per-user canary control without a deploy. Absent key defaults to 100 (full rollout, backward compatible with edgezero_enabled=true deployments that predate this PR). Invalid values and read errors default to 0 (all legacy, fail-safe).
Set to "0" so local dev and CI stay on the legacy path by default. Ops changes this in the production config store — no re-deploy required.
Documents the two config store keys, canary progression stages (1% → 10% → 50% → 100%), hold-point durations, pass/fail thresholds, and instant rollback procedure.
…ro-pr19-cutover-canary
| // Fires per-request when key is absent and edgezero_enabled=true. | ||
| // At production scale this creates one warn per request until the key is set. | ||
| // Resolution: set edgezero_rollout_pct = "0" before setting edgezero_enabled = "true". | ||
| log::warn!( |
There was a problem hiding this comment.
🔧 wrench — Per-request warn log on absent edgezero_rollout_pct.
This emits a log::warn! for every request when the key is absent. The code comment acknowledges "At production scale this creates one warn per request until the key is set" but does not fix it. Combined with the absent-key default = 100% rollout, this is a real operational hazard: a single ops ordering mistake (setting edgezero_enabled = "true" before edgezero_rollout_pct) results in both an immediate full cutover and a sustained warn flood. The warn doesn't tell anyone anything is broken (traffic flows fine), so it's likely to be silenced/ignored — not acted on.
Fix (pick one):
- Demote to
log::debug!— the runbook + setup procedure is the real safety net, not this log line:Ok(None) => { log::debug!( "edgezero_rollout_pct key absent, defaulting to 100 (full rollout — backward compat)" ); 100 }
- Better: change the absent-key default to
0(fail-safe) — see the design question in the review body — and drop the warn entirely. An absent key would then surface as zero canary traffic, observable from real-time stats, rather than as a log flood.
| return; | ||
| } | ||
|
|
||
| let rollout_pct = read_rollout_pct(&edgezero_config_store); |
There was a problem hiding this comment.
🤔 thinking — Two config-store reads per request.
Each request now hits config_store.get(...) twice (is_edgezero_enabled then read_rollout_pct). Fastly Config Store reads are local-edge and fast, but at full QPS that's a measurable doubling. Consider consolidating into one helper that returns (enabled, rollout_pct) after a single read sequence, or wrapping both behind a cached snapshot if the SDK supports it. Not urgent — flag only.
| /// | Key present, valid 0–100 | parsed value | Partial or full rollout | | ||
| /// | Key present, invalid | `0` | All legacy (safe default) | | ||
| /// | Key read error | `0` | All legacy (safe default) | | ||
| fn read_rollout_pct(config_store: &ConfigStoreHandle) -> u8 { |
There was a problem hiding this comment.
♻️ refactor — Add unit tests for read_rollout_pct using an in-memory ConfigStore.
The plan asserts ConfigStoreHandle cannot be constructed in unit tests; this is not accurate at the pinned edgezero-core rev. The ConfigStore trait is public, and ConfigStoreHandle::new(Arc::new(...)) is public — edgezero-core's own tests use exactly this pattern.
The four documented branches are safety-critical; only the first is indirectly covered (via parse_rollout_pct tests).
Suggested addition in the existing #[cfg(test)] mod tests:
use edgezero_core::config_store::{ConfigStore, ConfigStoreError};
struct TestConfigStore {
response: Result<Option<String>, ConfigStoreError>,
}
impl ConfigStore for TestConfigStore {
fn get(&self, _key: &str) -> Result<Option<String>, ConfigStoreError> {
self.response.clone()
}
}
fn handle(response: Result<Option<String>, ConfigStoreError>) -> ConfigStoreHandle {
ConfigStoreHandle::new(Arc::new(TestConfigStore { response }))
}
#[test]
fn read_rollout_pct_absent_defaults_to_full_rollout() {
assert_eq!(read_rollout_pct(&handle(Ok(None))), 100);
}
#[test]
fn read_rollout_pct_invalid_value_defaults_to_zero() {
assert_eq!(read_rollout_pct(&handle(Ok(Some("abc".into())))), 0);
}
#[test]
fn read_rollout_pct_out_of_range_defaults_to_zero() {
assert_eq!(read_rollout_pct(&handle(Ok(Some("101".into())))), 0);
}
#[test]
fn read_rollout_pct_read_error_defaults_to_zero() {
assert_eq!(
read_rollout_pct(&handle(Err(ConfigStoreError::unavailable("boom")))),
0,
);
}These tests pin the safety-critical defaults under refactoring — the part of the design that matters in incidents.
| String::new() | ||
| } | ||
| }; | ||
| let bucket = fnv1a_bucket(&routing_key); |
There was a problem hiding this comment.
🌱 seedling — Skip the FNV hash + IpAddr.to_string() allocation when rollout_pct == 0 || == 100.
routing_key = ip.to_string() allocates a String per request, and the FNV computation runs even when the decision is degenerate (0 → always legacy, 100 → always EdgeZero). For steady-state rollback (0) and post-cutover (100) — which together cover most of the canary's lifetime — this work is wasted.
let rollout_pct = read_rollout_pct(&edgezero_config_store);
let route_to_edgezero = match rollout_pct {
0 => false,
100 => true,
pct => {
let routing_key = req.get_client_ip_addr()
.map(|ip| ip.to_string())
.unwrap_or_default();
let bucket = fnv1a_bucket(&routing_key);
canary_routes_to_edgezero(bucket, pct)
}
};If you want to skip the to_string() allocation in the canary path too, FNV-1a can be fed IpAddr octets directly:
fn fnv1a_bucket_octets(bytes: &[u8]) -> u8 { /* same algorithm */ }
let bucket = match req.get_client_ip_addr() {
Some(IpAddr::V4(v)) => fnv1a_bucket_octets(&v.octets()),
Some(IpAddr::V6(v)) => fnv1a_bucket_octets(&v.octets()),
None => fnv1a_bucket_octets(&[]),
};Not urgent.
| /// `bucket` must be in `0..100`; `rollout_pct` in `0..=100`. | ||
| /// When `rollout_pct = 0` no bucket ever routes to `EdgeZero` (instant rollback). | ||
| /// When `rollout_pct = 100` every bucket routes to `EdgeZero` (full cutover). | ||
| fn canary_routes_to_edgezero(bucket: u8, rollout_pct: u8) -> bool { |
There was a problem hiding this comment.
⛏ nitpick — canary_routes_to_edgezero is just bucket < rollout_pct; "canary" is misleading at the call site.
The function works at any rollout_pct (0, 50, 100), not only canary stages. Reads awkwardly at the call site: if canary_routes_to_edgezero(bucket, rollout_pct).
Suggest routes_to_edgezero(bucket, rollout_pct) — shorter and accurate at any rollout value.
ChristianPavilonis
left a comment
There was a problem hiding this comment.
Summary
One low-severity docs/ops note: the canary runbook references routing log lines that are emitted at debug level, so production verification may require a debug/log-level deployment or equivalent observability setup.
| 1. Confirm `edgezero_rollout_pct = "0"` is already set in the production config store | ||
| (set it now if not — the pre-condition above explains why this must come first). | ||
| 2. Set `edgezero_enabled = "true"` in the production config store. | ||
| 3. Verify via log tailing that `routing request through legacy path (bucket=N, rollout_pct=0)` |
There was a problem hiding this comment.
P3: Clarify that these verification logs require debug-level logging
This is useful operational guidance, but the referenced routing request through ... messages are emitted with log::debug! in the Fastly adapter while the normal Fastly logger is capped at Info. If operators are expected to verify this via log tailing, it may be worth noting that this requires a debug/log-level deployment or another equivalent canary observability signal.
Summary
edgezero_rollout_pct(0–100) to thetrusted_server_configstore so EdgeZero traffic can be shifted incrementally without a deployclient_ipfor deterministic, sticky per-user bucket assignment (0–99); routing predicate isbucket < rollout_pctrollout_pct = 0is the immediate no-deploy rollbackChanges
crates/trusted-server-adapter-fastly/src/main.rsparse_rollout_pct,fnv1a_bucket,canary_routes_to_edgezero,read_rollout_pct; updatedmain()with canary dispatch; 11 unit tests (pinned FNV-1a vectors, distribution, boundary routing)fastly.tomledgezero_rollout_pct = "0"to local Viceroy config store with ops commentdocs/internal/EDGEZERO_MIGRATION.mdCloses
Closes #500
Test plan
cargo test-fastly— 65 + 956 + 21 + 3 tests greencargo test-axum— 31 tests greencargo clippy-fastly && cargo clippy-axum— no warningscargo fmt --all -- --check— cleancd crates/js/lib && npx vitest run— 291 tests greencd crates/js/lib && npm run format— cleancd docs && npm run format— cleancargo build --package trusted-server-adapter-fastly --release --target wasm32-wasip1fastly compute servewithedgezero_rollout_pct = "1"and"0"(rollback)Checklist
unwrap()in production code — useexpect("should ...")logmacros (notprintln!)