Skip to content

Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727

Open
prk-Jr wants to merge 9 commits into
feature/edgezero-pr19-spin-adapterfrom
feature/edgezero-pr19-cutover-canary
Open

Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19)#727
prk-Jr wants to merge 9 commits into
feature/edgezero-pr19-spin-adapterfrom
feature/edgezero-pr19-cutover-canary

Conversation

@prk-Jr
Copy link
Copy Markdown
Collaborator

@prk-Jr prk-Jr commented May 21, 2026

Summary

  • Adds edgezero_rollout_pct (0–100) to the trusted_server_config store so EdgeZero traffic can be shifted incrementally without a deploy
  • Uses FNV-1a 32-bit hash of client_ip for deterministic, sticky per-user bucket assignment (0–99); routing predicate is bucket < rollout_pct
  • Fail-safe defaults: read error or invalid value → 0 (all legacy); absent key (backward compat) → 100 (all EdgeZero); rollout_pct = 0 is the immediate no-deploy rollback

Changes

File Change
crates/trusted-server-adapter-fastly/src/main.rs Added parse_rollout_pct, fnv1a_bucket, canary_routes_to_edgezero, read_rollout_pct; updated main() with canary dispatch; 11 unit tests (pinned FNV-1a vectors, distribution, boundary routing)
fastly.toml Added edgezero_rollout_pct = "0" to local Viceroy config store with ops comment
docs/internal/EDGEZERO_MIGRATION.md New ops runbook: config store keys, failure modes, pre-flight activation, 4 canary stages, pass/fail thresholds, rollback procedure, monitoring

Closes

Closes #500

Test plan

  • cargo test-fastly — 65 + 956 + 21 + 3 tests green
  • cargo test-axum — 31 tests green
  • cargo clippy-fastly && cargo clippy-axum — no warnings
  • cargo fmt --all -- --check — clean
  • JS tests: cd crates/js/lib && npx vitest run — 291 tests green
  • JS format: cd crates/js/lib && npm run format — clean
  • Docs format: cd docs && npm run format — clean
  • WASM build: cargo build --package trusted-server-adapter-fastly --release --target wasm32-wasip1
  • Manual testing via fastly compute serve with edgezero_rollout_pct = "1" and "0" (rollback)

Checklist

  • Changes follow CLAUDE.md conventions
  • No unwrap() in production code — use expect("should ...")
  • Uses log macros (not println!)
  • New code has tests
  • No secrets or credentials committed

prk-Jr added 8 commits May 21, 2026 14:26
When edgezero_enabled=true, reads edgezero_rollout_pct (0-100) from the
trusted_server_config store. Routes each request to EdgeZero if
fnv1a_bucket(client_ip) < rollout_pct, giving the ops team sticky
per-user canary control without a deploy.

Absent key defaults to 100 (full rollout, backward compatible with
edgezero_enabled=true deployments that predate this PR). Invalid values
and read errors default to 0 (all legacy, fail-safe).
Set to "0" so local dev and CI stay on the legacy path by default.
Ops changes this in the production config store — no re-deploy required.
Documents the two config store keys, canary progression stages (1% →
10% → 50% → 100%), hold-point durations, pass/fail thresholds, and
instant rollback procedure.
@prk-Jr prk-Jr self-assigned this May 21, 2026
@prk-Jr prk-Jr changed the title Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) Add EdgeZero canary routing (rollout_pct flag + FNV-1a bucket assignment) (PR 19) May 21, 2026
@aram356 aram356 requested review from ChristianPavilonis and aram356 and removed request for aram356 May 21, 2026 15:14
@prk-Jr prk-Jr changed the base branch from feature/edgezero-pr18-phase5-verification to feature/edgezero-pr19-spin-adapter May 27, 2026 13:05
// Fires per-request when key is absent and edgezero_enabled=true.
// At production scale this creates one warn per request until the key is set.
// Resolution: set edgezero_rollout_pct = "0" before setting edgezero_enabled = "true".
log::warn!(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔧 wrench — Per-request warn log on absent edgezero_rollout_pct.

This emits a log::warn! for every request when the key is absent. The code comment acknowledges "At production scale this creates one warn per request until the key is set" but does not fix it. Combined with the absent-key default = 100% rollout, this is a real operational hazard: a single ops ordering mistake (setting edgezero_enabled = "true" before edgezero_rollout_pct) results in both an immediate full cutover and a sustained warn flood. The warn doesn't tell anyone anything is broken (traffic flows fine), so it's likely to be silenced/ignored — not acted on.

Fix (pick one):

  • Demote to log::debug! — the runbook + setup procedure is the real safety net, not this log line:
    Ok(None) => {
        log::debug!(
            "edgezero_rollout_pct key absent, defaulting to 100 (full rollout — backward compat)"
        );
        100
    }
  • Better: change the absent-key default to 0 (fail-safe) — see the design question in the review body — and drop the warn entirely. An absent key would then surface as zero canary traffic, observable from real-time stats, rather than as a log flood.

return;
}

let rollout_pct = read_rollout_pct(&edgezero_config_store);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 thinking — Two config-store reads per request.

Each request now hits config_store.get(...) twice (is_edgezero_enabled then read_rollout_pct). Fastly Config Store reads are local-edge and fast, but at full QPS that's a measurable doubling. Consider consolidating into one helper that returns (enabled, rollout_pct) after a single read sequence, or wrapping both behind a cached snapshot if the SDK supports it. Not urgent — flag only.

/// | Key present, valid 0–100 | parsed value | Partial or full rollout |
/// | Key present, invalid | `0` | All legacy (safe default) |
/// | Key read error | `0` | All legacy (safe default) |
fn read_rollout_pct(config_store: &ConfigStoreHandle) -> u8 {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ refactor — Add unit tests for read_rollout_pct using an in-memory ConfigStore.

The plan asserts ConfigStoreHandle cannot be constructed in unit tests; this is not accurate at the pinned edgezero-core rev. The ConfigStore trait is public, and ConfigStoreHandle::new(Arc::new(...)) is public — edgezero-core's own tests use exactly this pattern.

The four documented branches are safety-critical; only the first is indirectly covered (via parse_rollout_pct tests).

Suggested addition in the existing #[cfg(test)] mod tests:

use edgezero_core::config_store::{ConfigStore, ConfigStoreError};

struct TestConfigStore {
    response: Result<Option<String>, ConfigStoreError>,
}

impl ConfigStore for TestConfigStore {
    fn get(&self, _key: &str) -> Result<Option<String>, ConfigStoreError> {
        self.response.clone()
    }
}

fn handle(response: Result<Option<String>, ConfigStoreError>) -> ConfigStoreHandle {
    ConfigStoreHandle::new(Arc::new(TestConfigStore { response }))
}

#[test]
fn read_rollout_pct_absent_defaults_to_full_rollout() {
    assert_eq!(read_rollout_pct(&handle(Ok(None))), 100);
}

#[test]
fn read_rollout_pct_invalid_value_defaults_to_zero() {
    assert_eq!(read_rollout_pct(&handle(Ok(Some("abc".into())))), 0);
}

#[test]
fn read_rollout_pct_out_of_range_defaults_to_zero() {
    assert_eq!(read_rollout_pct(&handle(Ok(Some("101".into())))), 0);
}

#[test]
fn read_rollout_pct_read_error_defaults_to_zero() {
    assert_eq!(
        read_rollout_pct(&handle(Err(ConfigStoreError::unavailable("boom")))),
        0,
    );
}

These tests pin the safety-critical defaults under refactoring — the part of the design that matters in incidents.

String::new()
}
};
let bucket = fnv1a_bucket(&routing_key);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🌱 seedling — Skip the FNV hash + IpAddr.to_string() allocation when rollout_pct == 0 || == 100.

routing_key = ip.to_string() allocates a String per request, and the FNV computation runs even when the decision is degenerate (0 → always legacy, 100 → always EdgeZero). For steady-state rollback (0) and post-cutover (100) — which together cover most of the canary's lifetime — this work is wasted.

let rollout_pct = read_rollout_pct(&edgezero_config_store);

let route_to_edgezero = match rollout_pct {
    0 => false,
    100 => true,
    pct => {
        let routing_key = req.get_client_ip_addr()
            .map(|ip| ip.to_string())
            .unwrap_or_default();
        let bucket = fnv1a_bucket(&routing_key);
        canary_routes_to_edgezero(bucket, pct)
    }
};

If you want to skip the to_string() allocation in the canary path too, FNV-1a can be fed IpAddr octets directly:

fn fnv1a_bucket_octets(bytes: &[u8]) -> u8 { /* same algorithm */ }
let bucket = match req.get_client_ip_addr() {
    Some(IpAddr::V4(v)) => fnv1a_bucket_octets(&v.octets()),
    Some(IpAddr::V6(v)) => fnv1a_bucket_octets(&v.octets()),
    None => fnv1a_bucket_octets(&[]),
};

Not urgent.

/// `bucket` must be in `0..100`; `rollout_pct` in `0..=100`.
/// When `rollout_pct = 0` no bucket ever routes to `EdgeZero` (instant rollback).
/// When `rollout_pct = 100` every bucket routes to `EdgeZero` (full cutover).
fn canary_routes_to_edgezero(bucket: u8, rollout_pct: u8) -> bool {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpickcanary_routes_to_edgezero is just bucket < rollout_pct; "canary" is misleading at the call site.

The function works at any rollout_pct (0, 50, 100), not only canary stages. Reads awkwardly at the call site: if canary_routes_to_edgezero(bucket, rollout_pct).

Suggest routes_to_edgezero(bucket, rollout_pct) — shorter and accurate at any rollout value.

Copy link
Copy Markdown
Collaborator

@ChristianPavilonis ChristianPavilonis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

One low-severity docs/ops note: the canary runbook references routing log lines that are emitted at debug level, so production verification may require a debug/log-level deployment or equivalent observability setup.

1. Confirm `edgezero_rollout_pct = "0"` is already set in the production config store
(set it now if not — the pre-condition above explains why this must come first).
2. Set `edgezero_enabled = "true"` in the production config store.
3. Verify via log tailing that `routing request through legacy path (bucket=N, rollout_pct=0)`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Clarify that these verification logs require debug-level logging

This is useful operational guidance, but the referenced routing request through ... messages are emitted with log::debug! in the Fastly adapter while the normal Fastly logger is capped at Info. If operators are expected to verify this via log tailing, it may be worth noting that this requires a debug/log-level deployment or another equivalent canary observability signal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants