From a36da31042781e941ad4a5e3e47f7e6d82819413 Mon Sep 17 00:00:00 2001 From: Shahzaib Date: Sat, 9 May 2026 20:09:37 -0700 Subject: [PATCH 1/6] Add oncall telemetry weekly report skill --- .../oncall-weekly-telemetry-report/SKILL.md | 395 ++++ .../assets/bucket-trends.js | 93 + .../assets/code-attribution-template.md | 147 ++ .../assets/kusto-cheatsheet.md | 194 ++ .../assets/report-template.html | 1779 +++++++++++++++++ .../assets/summarize-attribution.js | 101 + 6 files changed, 2709 insertions(+) create mode 100644 .github/skills/oncall-weekly-telemetry-report/SKILL.md create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/report-template.html create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js diff --git a/.github/skills/oncall-weekly-telemetry-report/SKILL.md b/.github/skills/oncall-weekly-telemetry-report/SKILL.md new file mode 100644 index 00000000..c9110e50 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/SKILL.md @@ -0,0 +1,395 @@ +--- +name: oncall-weekly-telemetry-report +description: Generate the weekly Android Broker on-call (OCE) WoW + 60-day trend telemetry report as a polished self-contained HTML file. Use this skill for the weekly OCE rotation when asked to "produce the OCE report", "weekly on-call report", "WoW telemetry report", "weekly broker health report", or "generate this week's on-call summary". Pulls from `android_spans` materialized views, attributes regressions/improvements to PRs in `broker/` and `common/`, and writes to `oncall-wow-report-YYYY-MM-DD.html` under `$env:USERPROFILE\android-oce-reports\` (outside the workspace).
+--- + +# OCE Weekly Report + +Produce the weekly Android Broker on-call (OCE) telemetry report as a self-contained HTML file at `$env:USERPROFILE\android-oce-reports\oncall-wow-report-YYYY-MM-DD.html` (i.e. `~/android-oce-reports/`, outside the workspace so reports never accidentally get committed). + +The output mirrors the structure of the canonical template at [`assets/report-template.html`](assets/report-template.html) — copy it to that path and edit the copy in place. Do **not** redesign the layout each week. + +**Before writing any KQL, read [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md).** It captures the canonical view names, helper functions, the HLL device-count gotcha, week-alignment rules, and ready-to-paste query templates — distilled from the production Android Broker Dashboard. + +Reusable helpers in [`assets/`](assets/): + +| File | Purpose | +|---|---| +| [`report-template.html`](assets/report-template.html) | Canonical layout — copy and replace data only, never restructure CSS | +| [`kusto-cheatsheet.md`](assets/kusto-cheatsheet.md) | Schemas, helper funcs, gotchas, ready-to-paste KQL templates | +| [`code-attribution-template.md`](assets/code-attribution-template.md) | Per-card checklist for the deep code-attribution block (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) | +| [`bucket-trends.js`](assets/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs` | +| [`summarize-attribution.js`](assets/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards | + +--- + +## Inputs to confirm with the user + +1. **Reporting week** — defaults to the most recent complete week (Sun → Sat ending yesterday or today).
**Confirm explicit dates with the user.** Note that Kusto's `startofweek()` is **Sunday-aligned**, so a user-spoken "week of May 3 → May 9" maps to the bucket `startofweek == 2026-05-03`. Off-by-one-week is the #1 silent error — verify by printing the distinct `startofweek` buckets from your first query and confirming the label matches the user's intent. +2. **Comparison baseline** — defaults to the prior complete week. +3. **60-day window** — last 8 complete weeks (drop the partial start week when computing trend deltas). +4. **Output filename** — `$env:USERPROFILE\android-oce-reports\oncall-wow-report-YYYY-MM-DD.html`, where `YYYY-MM-DD` is the **Sunday `startofweek` bucket** of the reporting week (e.g. the report for the week of May 3 → May 9, 2026 is `oncall-wow-report-2026-05-03.html`). User-scoped, outside the workspace; the date matches the Kusto bucket label used throughout the report. + +If any of these are unstated, ask once, then proceed. + +--- + +## Required sections (in order) + +1. **Top-line health KPIs** — total requests, total devices, silent-auth reliability %, interactive reliability %, p95 latency on the hot spans. WoW delta on each. Inline SVG sparklines. +2. **Things that need attention this week** — five callouts: + - **Denominator caveat** — explain any large total-spans device-count shift caused by span-emission changes (e.g. `goAsync()` refactors). Always state which denominator the report uses (auth-only: `SilentAuthStats` ∪ `InteractiveAuthStats`). + - **Real WoW regressions** worth investigation, with PR links. + - **Slow-burn 60-day regressions** (rising on 60d even when WoW looks flat). Link to the 60-Day Trend section. + - **Real wins this week**, with PR links. + - **Traffic shape** — flat / surge / collapse summary. +3. **📈 60-Day Trend Analysis** — built from the `ErrorStatsMetrics` materialized view over the last 8 complete weeks.
**Run the bucketing pipeline FOUR times — the cross-product of `{error_code, error_type} × {devs, reqs}`** — and union the regression sets. An entry (code OR type) is flagged if it regresses on either metric. + + - **% of devices** affected (`devsHit / authActiveDevs`) — catches errors hitting more users. + - **% of requests** affected (`errReqs / authTotalReqs`) — catches per-device retry storms (fewer users, more traffic per user). The previous report would have missed `kdfv2_key_derivation_error` (262 → 5,374 reqs on ~57 devices) without this dim. + + Categories: True 60d regression / Ephemeral 60d spike (peak-then-recover) / True 60d improvement / Flat. Every rising entry — whether `error_code` or `error_type` — gets the same Spike Attribution + Code Attribution treatment (Step 4 / Step 5). + + Always apply `MergeUiRequiredExceptions(error_type)` before bucketing on type; otherwise the 6+ string variants of `UiRequiredException` will each be tracked separately and skew the buckets. +4. **🔎 Spike Attribution** — one card per WoW regression AND per 60-day regression, **for both `error_code` and `error_type` regressions**. Each card slices on **all 7 dimensions** (broker version, span, active broker pkg, calling app, account type AAD/MSA, shared-device mode, OS version). Each card ends with a **deep Code Attribution block** (see Step 4 for the required fields) and a Traffic Attribution verdict. +5. **🚚 Traffic Attribution** — top-level section listing every error whose spike is fully or partly explained by traffic volume from a specific calling app, rather than a code regression. If none qualify this week, render the section with an explicit "None this week" note. +6. **Error codes — WoW with stable denominator** — full table with `Δ reqs %` and `Δ devs %` columns and the 60d sparkline. +7. **Error types — WoW with stable denominator** — full table, **same columns and rigor as the error-codes table** (`Δ reqs %`, `Δ devs %`, 60d sparkline, status pill).
Any regressing type also gets a spike-attribution card in Section 4. For composite types (e.g. `ClientException` is the umbrella for many sub-codes), include a **decomposition card** that breaks the WoW Δ down into the top 3 contributing sub-codes — so a `ClientException` −5 pp drop is explicitly attributed to e.g. `−8.5 pp timed_out_execution` + `−3.4 pp unknown_authority` + `−0.15 pp illegal_argument_exception`. +8. **📊 Traffic analysis** — total requests/devices (WoW + 60d), top calling apps, top spans, **requests-per-device ratio** per error and overall (a rising ratio = retry storm; a falling ratio = caching gain), sampling-rate change indicator. +9. **Latency** — p50/p95/p99 by hot span. +10. **Broker version adoption** — week-over-week version share. +11. **Appendix** — query list and methodology. + +--- + +## Step-by-step workflow + +### Step 1 — Bootstrap the new report file from the template + +This skill ships with a canonical template at [`assets/report-template.html`](assets/report-template.html) (a real prior week's report kept as the reference layout). **Always start from this template** — never assume a prior week's report exists on the file system. + +```pwsh +# Reports live OUTSIDE the workspace, in the user's home folder, so they never +# accidentally get committed and don't pollute the repo root. +$reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' +New-Item -ItemType Directory -Force $reportDir | Out-Null + +# Filename uses the Sunday startofweek bucket of the reporting week (matches the +# Kusto bucket label used throughout the report). For "week of May 3 -> May 9, 2026" +# this evaluates to 2026-05-03. +$reportingSunday = '2026-05-03' # <-- replace with the confirmed reporting-week Sunday +$next = Join-Path $reportDir "oncall-wow-report-$reportingSunday.html" + +if (Test-Path $next) { + Write-Warning "$next already exists — confirm with the user before overwriting." 
+} + +Copy-Item c:\Users\shjameel\Repos\android-complete\.github\skills\oncall-weekly-telemetry-report\assets\report-template.html $next -Force +Write-Host "Bootstrapped $next from skill template." +``` + +Edit `$next` only. The template defines the layout, CSS, sparkline structure, attribution-card markup, and section ordering — **do not redesign these per week**. Replace the data inside each section with the current week's content; keep the structure verbatim. + +If the template ever needs structural improvements (new section, new card style, etc.), update `assets/report-template.html` in the skill folder and commit it so future weeks inherit the change. + +### Step 2 — Pull WoW reliability data + +Use the Kusto MCP tool against: +- **Cluster:** `https://idsharedeus2.kusto.windows.net` +- **Database:** `ad-accounts-android-otel` + +**Always prefer the canonical `materialized_view('XxxMetrics' or 'XxxUpdated')` variants** — these are what the production dashboard uses, are pre-aggregated and HLL-bucketed, and avoid the 240 s MCP timeout that plain `android_spans` queries hit. Full schema, gotchas, and query templates: [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). 
+ +| Need | View | +|------|------| +| Per-error-code / per-error-type / per-span counts | `materialized_view('ErrorStatsMetrics')` | +| Total broker reqs / devices | `materialized_view('BrokerAdoptionStatsUpdated')` | +| Silent auth reliability | `SilentAuthStatsAllRequestsMetrics` + `SilentAuthStatsRequestsWithoutExpectedErrorMetrics` | +| Interactive auth reliability | `InteractiveAuthStatsAllRequestsMetrics` + `InteractiveAuthStatsRequestsWithoutExpectedErrorMetrics` | +| Latency (p50/p95/p99) | `materialized_view('PerfStatsUpdated')` — use `percentile_tdigest(tdigest_merge(responseTimeTDigest), N, typeof(long))` | +| Broker version share | `BrokerAdoptionStatsUpdated` | +| Calling app share | `AppStatsUpdated` | +| SKU share | `SkuStatsUpdated` | +| Spike-by-flight slicing | `Operations_ByFlight`, `ErrorCodeBySpan_ByFlight`, `ErrorType_ByFlight` | + +Time filter: always use `EventInfo_Time` on materialized views. Use `PipelineInfo_IngestionTime` only on raw `android_spans`. + +**Three rules that will silently corrupt your data if violated** (full detail in the cheatsheet): + +1. **Distinct devices are HLL-encoded.** Use `dcount_hll(hll_merge(countDevicesHll))`, never `sum(countDevices)`. Summing double-counts every device that appears in more than one row. +2. **Apply the dashboard helper functions** so this report agrees with the dashboard: `MergeAccountType(account_type)`, `MergeIsSharedDevice(is_shared_device)`, `MergeUiRequiredExceptions(error_type)`. +3. **Auth-only denominator for reliability %s:** sum `countRequests` from `SilentAuthStatsAllRequestsMetrics` ∪ `InteractiveAuthStatsAllRequestsMetrics` — not total broker spans. Total span counts are sensitive to `goAsync()` / receiver refactors and will give false WoW reliability swings. + +### Step 3 — Pull 60-day trend + +Don't pre-filter to a hand-picked top-N list — small-but-rising errors (e.g. `null_pointer_error` at ~67K devices) will fall off and never show up in the trend section. 
Instead pull every error code **and every error type** with a meaningful baseline across the window, then bucket each. + +#### 3a. Per-error-code trend + +```kql +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time > ago(70d) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| order by error_code asc, week asc +``` + +#### 3b. Per-error-type trend (same rigor) + +```kql +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| where EventInfo_Time > ago(70d) +| where isnotempty(unified_error_type) +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), unified_error_type +| order by unified_error_type asc, week asc +``` + +`MergeUiRequiredExceptions` is mandatory — without it the 6+ string variants of `UiRequiredException` (raw, fully-qualified, com.microsoft.identity.common.exception.*) each show as separate rows and skew the buckets. + +#### 3c. Run the bucketer 4 times (cross-product of `{code, type} × {devs, reqs}`) + +```pwsh +# Error codes — by devices, then by requests +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --metric=reqs + +# Error types — by devices, then by requests +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --metric=reqs +``` + +Take the **union** of all four regression sets. Both `error_code` and `error_type` regressions get a spike-attribution card in Step 5. + +It will print regression / spike / improvement / flat buckets, sorted by peak. 
The thresholds (in case you need to tune): + +- **True 60d regression:** `delta > +15%` and trajectory is monotonic-ish (no single-week spike dominating). +- **Ephemeral 60d spike:** peak week is ≥3× the mean of the surrounding weeks (peak-then-recover shape). +- **True 60d improvement:** `delta < −15%`. +- **Flat:** otherwise. +- Codes/types with peak weekly devs `< 10K` (or peak weekly reqs `< 100K` when `--metric=reqs`) are filtered out (`--peak-floor=N` to override). + +**Why both axes matter:** +- *codes × reqs:* in v5, `kdfv2_key_derivation_error` spiked +1,951% on requests across only ~57 devices — a per-device retry storm device-only bucketing would have missed. +- *types × either:* `error_type` is the umbrella (e.g. `ClientException`, `ServiceException`, `UiRequiredException`) — a moving type that doesn't map cleanly to one moving code is a strong signal of a *new* sub-code being introduced or an existing one being reclassified (the v5 `ClientException` −10% drop was driven by `timed_out_execution` reclassification under PR #141, which would have been invisible from the codes table alone). + +**Always present side-by-side WoW tables for BOTH error_code AND error_type** with `Δ reqs %` and `Δ devs %` columns; flag any row where either crosses threshold. + +### Step 4 — Code attribution (deep PR correlation) + +For every regression card, the Code Attribution block **must** populate the following fields. Shallow PR-citation only is not acceptable. Use [`assets/code-attribution-template.md`](assets/code-attribution-template.md) as the per-card checklist. + +| Field | What goes in it | How to find it | +|---|---|---| +| **Originator** | Where the error physically originates: broker code / common / Android system (WebView / Conscrypt / Keystore) / 3rd-party lib (Nimbus JWT, okhttp) / eSTS server / environmental (enterprise TLS interception). Use the colour-coded `origin-tag` spans (`origin-broker`, `origin-android`, `origin-thirdparty`, `origin-env`). 
| Grep the error string across `broker/`, `common/`, `msal/`. If no match, it's not our code — search the Android SDK or call out as eSTS-returned. | +| **Top throw site** | Fully-qualified file:line where the exception is constructed, plus the % of cases that throw from this single site. | Pull `error_location` / stack-prefix from `android_spans` for the spiking error code (one targeted query, narrow time window). Cite the dominant site. | +| **Wrapper** | Broker/common code that catches the originator's exception and re-throws it as the user-visible error code. Often `IDToken.parseJWT()`, `ServiceException(...)`, `ExceptionAdapter.exceptionFromAuthorizationResult()`. | Walk up the stack from the throw site — check for `try { ... } catch (X e) { throw new Y(...); }` patterns in broker/common. | +| **Caller hot-spots** | Top 1–3 callers of the wrapper, with device counts. Helps identify the specific code path the regression flows through. | `android_spans` slice by `error_location` (or `error.stack_trace` first frame inside our code). | +| **Underlying cause** | The proximate cause one level deeper (e.g. "99% `CertificateException` from `TrustManagerImpl.verifyChain`", "84% `no_such_algorithm` from `ProviderFactory.getMessageDigest`"). | `android_spans` slice by `error.cause` or `error_message` first 80 chars. | +| **Top error_messages** | Top 3–5 distinct `error_message` strings with counts. Often reveals the 3rd-party library or environmental signal (e.g. `net::ERR_SSL_PROTOCOL_ERROR`, Zscaler-issued cert names). | `summarize count() by tostring(error_message)` on raw `android_spans` filtered to the spike. | +| **Likely PRs** | 1–3 PRs with confidence rating (high / medium / low / none), full GitHub URL, commit SHA, author, AB#, and a 1-sentence **why-it's-the-suspect** justification (not just the title). Use the `pr-card` markup. | See PR-grep below. **Cite confidence honestly** — "none" is a valid verdict for environmental errors. 
| +| **Next step** | Concrete action with a named owner: who runs the next slice, who files the bug, what flight to flip, what correlation IDs to pull. | Pulled from PR authors / CODEOWNERS for the affected file. | + +#### PR-grep workflow + +```pwsh +cd c:\Users\shjameel\Repos\android-complete\broker +git log --since='' --until='' --oneline ` + --grep='||' -i + +cd ..\common +git log --since='' --until='' --oneline ` + --grep='||' -i +``` + +When the error name doesn't directly grep (e.g. `timed_out_execution`), grep for related concepts: `timeout`, `coroutine`, `executor`, `cancellation`, `thread pool`, `cache`, `authority`, etc. Then for each candidate PR, **read the diff at the throw site** to confirm it actually touches the failing code path — don't cite a PR just because it grep-matched. + +#### Repo URL patterns for citations + +| Repo | URL pattern | +|------|-------------| +| `common/` | `https://github.com/AzureAD/microsoft-authentication-library-common-for-android/pull/` | +| `broker/` | `https://github.com/identity-authnz-teams/ad-accounts-for-android/pull/` | +| `msal/` | `https://github.com/AzureAD/microsoft-authentication-library-for-android/pull/` | +| `adal/` | `https://github.com/AzureAD/azure-activedirectory-library-for-android/pull/` | + +#### Non-broker errors + +For errors with no broker code in the stack (Android system errors like `Code:-10`/`Code:-11`, OEM-specific keystore failures, eSTS-returned codes, environmental TLS interception), explicitly cite **"⚪ None — not in scope"** with confidence `none`, and explain *why* in the why-it's-the-suspect line. Do not invent broker PRs to fill the slot. Tag these errors as `environmental` or `non-broker` so they're tracked but don't page. + +### Step 5 — Spike attribution dimensions + +**Coverage rule: every `error_code` AND every `error_type` that lands in either the WoW regression list OR the 60-day regression list MUST get a spike-attribution card.** No silent skips. 
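
The coverage rule above lends itself to a mechanical check before the report is written. What follows is a minimal sketch, assuming the WoW regression list, the 60-day regression list, and the already-drafted cards are each available as arrays of `{ kind, name }` objects. That shape and the `missingCards` helper are illustrative assumptions; neither is part of the shipped `assets/` scripts.

```javascript
// Hypothetical helper (not one of the shipped assets/ scripts): list every
// regression that still lacks a spike-attribution card.
// Assumed input shape: arrays of { kind: 'error_code' | 'error_type', name: string }.
function missingCards(wowRegressions, trendRegressions, cards) {
  const key = (e) => `${e.kind}:${e.name}`;
  const carded = new Set(cards.map(key));
  const seen = new Set();
  const missing = [];
  // Union of both regression lists, de-duplicated, in first-seen order.
  for (const entry of [...wowRegressions, ...trendRegressions]) {
    const k = key(entry);
    if (seen.has(k)) continue;
    seen.add(k);
    if (!carded.has(k)) missing.push(k);
  }
  return missing;
}
```

An empty result means every flagged `error_code` and `error_type` has a card; anything returned is a silent skip that still needs one.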
+ +**`ErrorStatsMetrics` already carries `account_type` and `is_shared_device`** (use the `MergeAccountType` / `MergeIsSharedDevice` helpers to normalize) — so you do **not** need a fallback to raw `android_spans` for these dims. Earlier versions of this skill claimed otherwise; that was wrong. The only dim that requires `android_spans` is `DeviceInfo_OsVersion` (OEM/version slicing). + +Slice on **all 7 dimensions** for each spike. Run **one query per dimension** (multi-dim cartesians from MCP can return >500 KB of JSON and risk truncation). For `error_type` cards, swap `error_code in (codes)` for `unified_error_type in (types)` and aggregate by the `MergeUiRequiredExceptions(error_type)` extension — otherwise everything else is identical. + +| # | Dimension | Source | Cross-check | +|---|-----------|--------|-------------| +| 1 | Broker version | `ErrorStatsMetrics` group by `broker_version` | Cross-reference `BrokerAdoptionStatsUpdated` to see if the version's request share *also* moved that week — if yes, the spike is rollout-driven, not code-driven | +| 2 | Span name | `ErrorStatsMetrics` group by `span_name` | A single span hosting >60% of the error → strong code-path signal | +| 3 | Active broker package | `ErrorStatsMetrics` group by `active_broker_package_name` | E.g. 
CompanyPortal vs Authenticator vs LTW | +| 4 | Calling package | `ErrorStatsMetrics` group by `calling_package_name` | If 1–2 callers dominate, this is likely a traffic-attribution case (see Step 6) | +| 5 | Account type (AAD vs MSA) | `ErrorStatsMetrics`, `extend t = MergeAccountType(account_type)` group by `t` | If the split deviates significantly from fleet (~85% AAD / 15% MSA), call it out | +| 6 | Shared device mode | `ErrorStatsMetrics`, `extend s = MergeIsSharedDevice(is_shared_device)` group by `s` | Shared-device fleets have very different error profiles | +| 7 | OS version | `android_spans` filtered by `error_code in (codes)` (or `error_type in (types)`) and a tight time window, group by `DeviceInfo_OsVersion` | OEM-specific Android quirks, especially for `io_error`, `unknown_crypto_error`, `null_pointer_error` | + +#### Type cards have one extra required dimension: sub-code decomposition + +Because `error_type` is an umbrella over many `error_code` values, every `error_type` regression card MUST also include an **8th dimension: sub-code breakdown** showing the top 3–5 `error_code`s rolled up under that type, with their device counts and Δ vs prior week. This lets the reader see whether the type-level move is driven by one sub-code or many — and routes the deep Code Attribution work to the right sub-code. + +```kql +let target_types = dynamic(['ClientException', 'ServiceException']); +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| where EventInfo_Time > ago(14d) +| where unified_error_type in (target_types) +| extend wk = startofweek(EventInfo_Time) +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by wk, unified_error_type, error_code +| order by unified_error_type asc, wk asc, devs desc +``` + +Cite the dominant sub-codes inline in the type card's verdict (e.g. 
*"`ClientException` −10.2% drop is dominated by −8.5 pp `timed_out_execution` + −3.4 pp `unknown_authority`"*) and link to those sub-codes' own attribution cards. The deep Code Attribution block (Step 4) for the type card itself focuses on the **wrapper / catch-and-rethrow** path that defines the type (e.g. `BaseException.java`, `ServiceException.java` constructors), not on each sub-code. + +Feed the seven JSON outputs into the helper to roll up dim shares per (error_code, week): + +```pwsh +node .github\skills\oncall-weekly-telemetry-report\assets\summarize-attribution.js ` + --label=span span.json ` + --label=calling_app app.json ` + --label=active_broker ab.json ` + --label=broker_version ver.json ` + --label=account_type acct.json ` + --label=shared_device shared.json ` + --label=os_version os.json +``` + +Ready-to-paste KQL for the per-dimension query is in [`assets/kusto-cheatsheet.md` § 8c](assets/kusto-cheatsheet.md). + +**Concentration thresholds** (paint the dim bar red): +- > 80% in a single value → strong attribution (one root cause) +- 60–80% → medium attribution +- < 60% → broad / cross-cutting → say so explicitly, don't fabricate a single cause + +### Step 6 — Traffic analysis + traffic attribution + +Do this section in three parts. Traffic changes (up *or* down) need the same level of root-cause reasoning as error spikes — a uniform "−9% requests across all top apps with flat devices" is **not** a satisfactory verdict on its own; explain *why*. + +**6a. 
Top-line traffic shape.** Compare WoW *and* 60d for both totals and per-segment: + +```kql +materialized_view('BrokerAdoptionStatsUpdated') +| where EventInfo_Time > ago(70d) +| summarize totalReq = sum(countRequests), + totalDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time) +| order by week asc +``` + +For each of the following, report direction + magnitude: +- Total requests (WoW %, 60d %) +- Total devices (WoW %, 60d %) +- Requests-per-device ratio (a drop often means a benign caching improvement; a spike often means a retry storm) +- Top 10 calling apps (`AppStatsUpdated`) — which apps drove the change? +- Top spans by request volume — did one span explode or collapse? +- Sampling-rate change indicator: if total spans moved >20% but auth-only device count moved <5%, suspect a sampling/instrumentation change. + +**6b. Reasoning for material traffic shifts (>10% on any segment).** For every span/app/active-broker that moved meaningfully WoW *or* 60d, run this slicing-and-correlation pass: + +| # | Question | How to check | +|---|---|---| +| 1 | **Is the move concentrated in one span?** | Slice top-10 spans by `Δreq` absolute and `Δreq %`. A >50% move on a single span almost always points to a code change (span added / removed / sampled / `goAsync()`-ed). | +| 2 | **Is the move concentrated in one calling app?** | Slice `AppStatsUpdated` WoW. A single app moving >20% in requests with flat devices = client-side caching/retry change in that app — escalate to that app's owners, not broker. | +| 3 | **Is the move concentrated in one active broker pkg?** | Slice `BrokerAdoptionStatsUpdated` by `active_broker_package_name`. AppManager (LTW) vs Authenticator vs Intune CP often diverge during a rollout. | +| 4 | **Is the move concentrated in one broker version?** | Cross-check against rollout share. If a span dropped −80% on `16.0.1` but is flat on `15.1.0`, the cause is in the 16.0.1 diff. 
| +| 5 | **Did anything else co-move?** | A span dropping while `OnUpgradeReceiver`-style downstream spans also drop (`SecretKeyWrapping`, `WrappedKeyAlgorithmIdentifier` in v5) confirms a single upstream change. | + +For every meaningful shift, **search for a causal PR** in the repos likely to affect telemetry shape: + +```pwsh +# Broker (span add/remove, goAsync, scope changes, sampling/exporter config) +cd c:\Users\shjameel\Repos\android-complete\broker +git log --since='' --oneline -i ` + --grep='span|goAsync|receiver|telemetr|otel|trace|metric|sampl|exporter' + +# Common (instrumentation surfaces) +cd ..\common +git log --since='' --oneline -i ` + --grep='span|telemetr|otel|trace|sampl|instrument' +``` + +**Causal PR categories that meaningfully shift traffic counts** (flag any of these): + +- **Span removed / renamed / scope-narrowed** → drops the span's count to zero or partial +- **`goAsync()` / `BroadcastReceiver` refactor** → broadcast may complete before async work flushes the span (this is the v5 PR #88 / `OnUpgradeReceiver` story — call it out as a precedent) +- **Sampling-rate change** in broker `Otel*` / `Telemetry*` exporter config or `common/` instrumentation → uniformly scales counts up or down across many spans +- **New span added** in a hot path → request counts for that span jump from ~0 to material +- **Caller-side SDK change** (MSAL/MSAL_CPP/OneAuth release) that batches or caches requests → uniform per-app request drop with flat devices +- **Flight rollout** (ECS) that gates a code path on/off → bursty changes in a specific span on specific dates + +Cite the suspect PR(s) with the same confidence ratings used in Code Attribution (high / medium / low / none) and the same `pr-card` markup. If you can't pin one down, say so explicitly — *"uniform 5–22% per-app request drop with flat devices, no telemetry-platform PR identified, suspect caller-side SDK change in MSAL release X.Y"* is acceptable; "traffic is flat" without checking is not. 
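
The ">10% on any segment" screen from 6b can be applied uniformly to spans, calling apps, active brokers, and broker versions. A minimal sketch, assuming each row carries prior-week and current-week request counts; the `materialShifts` helper and its field names are hypothetical, and the default threshold simply mirrors the 10% stated above.

```javascript
// Hypothetical helper: flag segments whose WoW request count moved by more
// than thresholdPct in either direction.
// Assumed row shape: { segment: string, prior: number, current: number }.
function materialShifts(rows, thresholdPct = 10) {
  const flagged = [];
  for (const { segment, prior, current } of rows) {
    if (prior === 0) {
      // No baseline: report as new (e.g. a span just added) instead of a % delta.
      if (current > 0) flagged.push({ segment, deltaPct: null, isNew: true });
      continue;
    }
    const deltaPct = ((current - prior) / prior) * 100;
    if (Math.abs(deltaPct) > thresholdPct) {
      flagged.push({ segment, deltaPct, isNew: false });
    }
  }
  return flagged;
}
```

Each flagged segment still needs the reasoning pass described above; the helper only decides which segments require one.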
+ +**6c. Per-error traffic attribution (is the *error* spike traffic-driven?).** For every error code flagged in Step 5 as a regression, additionally check whether the spike is *traffic-driven* rather than *failure-rate-driven*: + +```kql +let target_code = ""; +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time > ago(14d) and error_code == target_code +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), calling_package_name +| order by week asc, devs desc +``` + +If the spike is concentrated in a single calling app whose **overall** request volume also rose that week (cross-check `AppStatsUpdated`), and the **per-request failure rate is essentially flat**, classify the spike as a **traffic-attribution case** rather than a code regression: + +> Example: "`no_account_found` +60% devices this week is fully explained by Outlook's request volume rising 65% — the per-Outlook-request failure rate is unchanged. No broker code change is implicated." + +Add a top-level **🚚 Traffic Attribution** section that lists every error matched to a traffic-driven origin, mirroring the Code Attribution section. **Each card must include**: the dominant calling app(s) with their WoW request-volume delta, the per-app per-request failure rate (now vs prior — show it's flat), and the recommended owner to route to (typically the calling app's team, not broker). If no errors qualify in a given week, render the section with an explicit "None this week" note rather than omitting it. + +### Step 7 — Validate & write + +- Run `get_errors` on the HTML file (no errors expected — pure HTML/CSS). +- Verify no stale phrases from prior weeks remain (`Select-String` for retracted hypotheses, prior week's PR numbers). +- Verify every PR link in the new file is reachable (the file paths just before the link should match what `git log` returned). 
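
The stale-text check above can also be scripted as a companion to `Select-String`. A minimal sketch, assuming the list of stale phrases (retracted hypotheses, prior-week PR numbers) is maintained by hand; `findStalePhrases` is a hypothetical helper, not one of the shipped assets.

```javascript
// Hypothetical helper: report each stale phrase that still appears in the
// report HTML, with the 1-based line numbers where it occurs.
// Matching is case-insensitive and literal (no regex interpretation).
function findStalePhrases(html, phrases) {
  const lines = html.split(/\r?\n/);
  const hits = [];
  for (const phrase of phrases) {
    const needle = phrase.toLowerCase();
    const lineNumbers = [];
    lines.forEach((line, i) => {
      if (line.toLowerCase().includes(needle)) lineNumbers.push(i + 1);
    });
    if (lineNumbers.length > 0) hits.push({ phrase, lineNumbers });
  }
  return hits;
}
```

In practice the HTML string would come from reading the report file (for example with Node's `fs.readFileSync(path, 'utf8')`), and a non-empty result would block the write until the leftover text is removed.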
+ +--- + +## Hard rules + +- **Never `sum(countDevices)`.** Always `dcount_hll(hll_merge(countDevicesHll))`. Summing the per-row distinct count double-counts. +- **Always wrap view names in `materialized_view('Xxx')`** and use the canonical `Metrics`/`Updated` variants (see cheatsheet § 2). +- **Never sum percentiles.** Latency is a TDigest sketch — `percentile_tdigest(tdigest_merge(responseTimeTDigest), N, typeof(long))` only. +- **Always apply `MergeAccountType` / `MergeIsSharedDevice` / `MergeUiRequiredExceptions`** so this report agrees with the dashboard. +- **Confirm the week bucket label matches the user's intent** before writing the rest of the queries (Sunday-aligned). +- **Never claim "auxiliary spans" or denominator artifacts** without verifying the diff between broker versions in the actual commits. +- **Never report WoW-only verdicts** for errors that are flat-or-down WoW but rising on 60d — always cross-check both windows. +- **Never page** based on a regression that turns out to be a downstream of a denominator shift; always include the auth-only-denominator number alongside the all-spans number. +- **Always cite PRs** with full GitHub URLs (the repo URL patterns above), not bare commit SHAs. +- **Do not create a separate Markdown summary** of the report — the HTML *is* the deliverable. +- **Do not commit** the report file. It lives in `$env:USERPROFILE\android-oce-reports\` (outside the workspace) precisely so it can't be staged accidentally. + +--- + +## Output checklist + +- [ ] New `oncall-wow-report-YYYY-MM-DD.html` (where `YYYY-MM-DD` is the reporting-week Sunday) exists at `$env:USERPROFILE\android-oce-reports\` (NOT at repo root). +- [ ] All sections present and populated (incl. 🚚 Traffic Attribution — even if “None this week”) +- [ ] **60-day trend bucketing run on the full cross-product** — `{error_code, error_type} × {devs, reqs}` = 4 runs — union of regressions reported. Per-request retry storms (e.g. 
small device pool, exploding request count) are flagged on both axes. +- [ ] **Both error-codes AND error-types WoW tables have `Δ reqs %` and `Δ devs %` columns**, the 60d sparkline, and a status pill. Any row crossing threshold on either metric is in the regression list. +- [ ] Every WoW regression AND every 60d regression — **for both `error_code` and `error_type`** — has its own spike-attribution card with all 7 dimensions sliced. +- [ ] **Every `error_type` regression card includes the 8th-dimension sub-code decomposition** showing the top 3–5 contributing `error_code`s with their Δ vs prior week, and links to those sub-codes' own attribution cards. +- [ ] **Every regression card's Code Attribution block populates Originator + Top throw site + Wrapper + Caller hot-spots + Underlying cause + Top error_messages + Likely PRs (with confidence/why-it's-the-suspect) + Next step (with named owner)** — per [`assets/code-attribution-template.md`](assets/code-attribution-template.md). For type cards, the wrapper field focuses on the type's catch-and-rethrow site (e.g. `BaseException`, `ServiceException` constructor). Shallow PR-only attribution is not acceptable. +- [ ] Non-broker errors are explicitly tagged `environmental` / `non-broker` with confidence `none` — not invented broker PRs. +- [ ] Traffic analysis covers totals, per-app, per-span, requests-per-device ratio (per error AND overall), and a sampling-change check. +- [ ] **Every material traffic shift (>10% on any segment, up or down) has a reasoning paragraph** that names the dominant span/app/active-broker/broker-version, and either cites a causal PR (with confidence) — span removed/added, `goAsync()` refactor, sampling change, caller-side SDK release, ECS flight ramp — or explicitly says "no PR identified, suspect X" rather than leaving it unexplained. +- [ ] Auth-only denominator used for all reliability %s, denominator caveat called out at top. +- [ ] No stale text from previous weeks. 
+- [ ] `get_errors` clean on the HTML file. diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js b/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js new file mode 100644 index 00000000..36d2a4f5 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js @@ -0,0 +1,93 @@ +#!/usr/bin/env node +/** + * bucket-trends.js — Bucket every error code into 60-day trend categories. + * + * Input: a Kusto MCP JSON result file from a query of the form: + * + * materialized_view('ErrorStatsMetrics') + * | where EventInfo_Time > ago(70d) + * | where isnotempty(error_code) and error_code != 'success' + * | summarize errs=sum(countOverall), + * devs=dcount_hll(hll_merge(countDevicesHll)) + * by week=startofweek(EventInfo_Time), error_code + * | order by error_code asc, week asc + * + * (Use dcount_hll on countDevicesHll, NOT sum(countDevices) — see kusto-cheatsheet.md.) + * + * Usage: + * node bucket-trends.js [--start=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs] + * + * --metric=devs (default) buckets on weekly device counts (catches errors hitting more users) + * --metric=reqs buckets on weekly request counts (catches per-device retry storms) + * + * Run BOTH metrics and union the regression sets. Reporting on devices alone misses + * retry-storm spikes (e.g. kdfv2_key_derivation_error: 262 -> 5,374 reqs on ~57 devices). 
+ * + * Buckets (computed across the kept weeks, defaulting to all-but-the-first): + * regression: delta > +15% (and not a single-week spike) + * spike: peak >= 3 x mean(other weeks) and peak > 1.5 x max(first,last) + * improvement: delta < -15% + * flat: otherwise + */ +const fs = require('fs'); + +const args = process.argv.slice(2); +const file = args.find(a => !a.startsWith('--')); +const startArg = (args.find(a => a.startsWith('--start=')) || '').split('=')[1]; +const metric = ((args.find(a => a.startsWith('--metric=')) || '').split('=')[1] || 'devs').toLowerCase(); +if (!['devs', 'reqs'].includes(metric)) { + console.error(`--metric must be 'devs' or 'reqs', got '${metric}'`); + process.exit(1); +} +const defaultFloor = metric === 'reqs' ? 100000 : 10000; +const peakFloor = +((args.find(a => a.startsWith('--peak-floor=')) || '').split('=')[1] || defaultFloor); +const metricIdx = metric === 'reqs' ? 0 : 1; // [errs, devs] tuple + +if (!file) { + console.error('Usage: node bucket-trends.js [--start=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs]'); + process.exit(1); +} + +const d = JSON.parse(fs.readFileSync(file, 'utf8')); +const items = d.results.items.slice(1); // first row is the schema +const series = {}; +for (const [w, code, errs, devs] of items) { + if (!series[code]) series[code] = {}; + series[code][w] = [errs, devs]; +} +const weeks = [...new Set(items.map(r => r[0]))].sort(); +const startISO = startArg ? 
`${startArg}T00:00:00Z` : weeks[1]; // drop partial start week by default +const keep = weeks.filter(w => w >= startISO); +console.log('All weeks: ', weeks); +console.log('Trend weeks: ', keep, `(${keep.length} complete)`); +console.log('Metric: ', metric, `(peak floor=${peakFloor.toLocaleString()})`); + +const buckets = { regression: [], spike: [], improvement: [], flat: [] }; +for (const [code, wd] of Object.entries(series)) { + const vals = keep.map(w => (wd[w] || [0, 0])[metricIdx]); + const peak = Math.max(...vals); + if (peak < peakFloor) continue; + const first = vals[0] || 1, last = vals[vals.length - 1]; + const f = first || 1; + const delta = (last - f) / f; + const sumOthers = vals.reduce((s, x) => s + x, 0) - peak; + const meanOthers = sumOthers / Math.max(1, vals.length - 1); + const isSpike = peak >= 3 * meanOthers && peak > Math.max(first, last) * 1.5; + let cat; + if (isSpike) cat = 'spike'; + else if (delta > 0.15) cat = 'regression'; + else if (delta < -0.15) cat = 'improvement'; + else cat = 'flat'; + buckets[cat].push({ code, first, last, peak, delta: +(delta * 100).toFixed(1), series: vals }); +} + +for (const k of ['regression', 'improvement', 'spike', 'flat']) { + console.log(`\n=== ${k.toUpperCase()} (${buckets[k].length}) ===`); + buckets[k] + .sort((a, b) => b.peak - a.peak) + .forEach(r => { + console.log( + ` ${r.code.padEnd(60)} first=${String(r.first).padStart(11)} last=${String(r.last).padStart(11)} peak=${String(r.peak).padStart(11)} d=${r.delta >= 0 ? 
'+' : ''}${r.delta}% series=${JSON.stringify(r.series)}` + ); + }); +} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md b/.github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md new file mode 100644 index 00000000..90aed84f --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md @@ -0,0 +1,147 @@ +# Code Attribution Card — Per-Spike Checklist + +Use this template for **every** spike-attribution card in the report. The HTML markup matches the `code-attr` / `pr-card` / `origin-tag` styles already in [`report-template.html`](report-template.html). + +A card without a populated **Originator + Top throw site + Likely PRs + Next step** is not acceptable. "Caller hot-spots", "Underlying cause", and "Top error_messages" are required for any error where the originator is *not* obvious from the error name alone (Android system errors, 3rd-party library wrappers, environmental). + +--- + +## Required fields + +### 1. Originator + +One of: + +- 🟥 `broker` — error originates in our `broker/` or `broker4j/` code +- 🟥 `common` — originates in `common/` or `common4j/` +- 🟧 `Android system` — Android SDK (WebView, Conscrypt, Keystore, okhttp, KeyStore HAL) +- 🟦 `3rd-party lib` — Nimbus JOSE+JWT, Gson, etc. +- 🟦 `eSTS` — server-returned OAuth error (`invalid_grant`, `invalid_resource`, `unauthorized_client`, etc.) +- ⬜ `environmental` — enterprise TLS interception (Zscaler), OEM keystore quirks, network-policy + +### 2. Top throw site + +Fully-qualified `Class.method:line` plus % of cases that throw from this single site. Example: + +> `com.nimbusds.jwt.SignedJWT.getJWTClaimsSet:28`   **97% of cases**   thrown as `ParseException` + +How to find: query raw `android_spans` filtered to the spiking error code over a tight time window, group by `error_location` (or first frame of `error.stack_trace`), order desc. + +### 3. 
Wrapper + +The broker/common method that catches the originator's exception and re-throws it as the user-visible error code. Often `IDToken.parseJWT()`, `ServiceException(...)`, `ExceptionAdapter.exceptionFromAuthorizationResult()`, `ClientException("Code:" + err, ...)`. + +How to find: walk up the stack from the throw site; look for `try { ... } catch (X e) { throw new Y(...); }` patterns in `broker/` and `common/`. + +### 4. Caller hot-spots + +Top 1–3 callers of the wrapper, with device counts. Helps pin the regression to a specific code path. Example: + +> `GetRegistrationStateV0LegacyExecutor.execute:90` (84 dev) · `AndroidDeviceRegistrationClientController.execute:234` (47 dev) + +### 5. Underlying cause + +The proximate cause one level deeper than "the error fired". Example: + +> 99% `CertificateException` from `TrustManagerImpl.verifyChain` · cert-chain rejection at TLS layer + +How to find: slice on `error.cause` or first 80 chars of `error_message`. + +### 6. Top error_messages + +Top 3–5 distinct `error_message` strings with counts. Often the strongest signal for environmental errors (e.g. `net::ERR_SSL_PROTOCOL_ERROR`, Zscaler-issued cert names, OEM keystore exception text). + +```kql +android_spans +| where EventInfo_Time between (ago(7d) .. now()) +| where error_code == "" +| summarize count() by tostring(error_message) +| top 10 by count_ +``` + +### 7. Likely PRs + +1–3 PRs (or explicit "None"), each rendered as a `pr-card` with: + +- **Confidence**: `high` / `medium` / `low` / `none` (use the matching `pr-conf-*` CSS class) +- **GitHub URL** (full link, not bare SHA) +- **Commit SHA** (short) +- **Author** (`@username`) +- **AB#** if available +- **Why-it's-the-suspect** — 1 sentence explaining the *causal* link, not just the title. Bad: "touches MicrosoftStsAccountCredentialAdapter". Good: "touches the IDToken parse path on MSA interactive flows; matches the Apr 30 climb date." 
+ +| Confidence | Use when | +|---|---| +| 🔴 **high** | Trajectory + flight rollout date both line up; PR diff touches the exact throw site | +| 🟡 **medium** | Code path matches but no flight gate evidence, or matches one of two suspects | +| 🟢 **low** | Candidate from grep, plausible but unverified | +| ⚪ **none** | No broker PR identified — explicitly say *why* (Android system error, eSTS-returned, OEM-specific, environmental) | + +### 8. Next step + +Concrete action with a **named owner** and a **measurable outcome**. Examples: + +- "Disable `ENABLE_OPENID_VC_HANDLING_IN_WEBVIEW_REDIRECT` flight for the affected slice (Outlook + msapps + 16.0.1) and verify spike subsides. Owner: **@somalaya**." +- "Pull 5–10 correlation IDs from Outlook devices hitting this and check eSTS logs for the actual rejected resource ID. Owner: **Outlook + eSTS teams**." +- "Slice by `bound_service_status` vs `content_provider_status` attributes to identify which IPC strategy is failing. Owner: **@pedroro**." + +--- + +## HTML skeleton (copy-paste, then fill in) + +```html +
+
Code attribution
+ +
+
Originator
+
broker short description
+
+ +
+
Top throw site
+
fully.qualified.Class.method:line   NN% of cases
+
+ +
+
Wrapper
+
wrapping.method wraps it as NewException(...)
+
+ +
+
Caller hot-spots
+
caller.A:NN (X dev)  ·  caller.B:NN (Y dev)
+
+ +
+
Underlying cause
+
NN% RootCauseException from root.method
+
+ +
+
Top error_messages
+
message 1  ·  N× message 2  ·  N× message 3
+
+ +
+
Likely PRs
+
+
+
+ 🔴 High +
+ repo#NN · PR title +
commit shortsha · 2026-MM-DD · author @user · AB#NNNNNNN
+
One-sentence causal explanation.
+
+
+
+
+
+ +
+
Next step
+
Concrete action. Owner: @name / team.
+
+
+``` diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md b/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md new file mode 100644 index 00000000..7f6a6013 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md @@ -0,0 +1,194 @@ +# Kusto Cheatsheet for the OCE Weekly Report + +Distilled from the **production Android Broker Dashboard** (374 queries) plus lessons learned running the skill end-to-end. **Read this before writing any KQL for this report** — it will save you from the most common silent-data-quality bugs. + +--- + +## 1. Connection + +| | | +|---|---| +| **Cluster** | `https://idsharedeus2.kusto.windows.net` | +| **Database** | `ad-accounts-android-otel` | +| **MCP tool** | `mcp_azure-mcp-ser_kusto` (command `query`) | +| **MCP timeout** | ~240 s — raw `android_spans` queries usually exceed this; **always prefer materialized views** | + +--- + +## 2. Use the canonical *materialized views*, not the bare names + +The dashboard never queries `ErrorStats` directly. It uses the `Metrics` / `Updated` variants, which are pre-aggregated and HLL-bucketed. 
Use these: + +| Use case | Canonical view | +|---|---| +| Per-error-code counts (devs, reqs) | `materialized_view('ErrorStatsMetrics')` | +| Total broker requests / devices | `materialized_view('BrokerAdoptionStatsUpdated')` | +| Silent auth — all requests | `materialized_view('SilentAuthStatsAllRequestsMetrics')` | +| Silent auth — successes (without expected error) | `materialized_view('SilentAuthStatsRequestsWithoutExpectedErrorMetrics')` | +| Interactive auth — all / success | `materialized_view('InteractiveAuthStatsAllRequestsMetrics')` / `…WithoutExpectedErrorMetrics` | +| FIDO requests | `materialized_view('FidoAllRequestsMetrics')` | +| Calling-app share | `materialized_view('AppStatsUpdated')` | +| SKU share | `materialized_view('SkuStatsUpdated')` | +| Latency (TDigest) | `materialized_view('PerfStatsUpdated')` | +| Per-flight slicing | `Operations_ByFlight`, `ErrorCodeBySpan_ByFlight`, `ErrorType_ByFlight` | + +Always wrap in `materialized_view(...)` — referencing the table name directly may pick up the raw, much slower base table. + +Time filter on materialized views is always **`EventInfo_Time`**. Use `PipelineInfo_IngestionTime` only when querying raw `android_spans`. + +--- + +## 3. THE distinct-device-count gotcha (most important rule) + +`countDevices` on `ErrorStats*` is a **per-row distinct count, not additive**. If you sum it across multiple rows you will double-count any device that appeared in more than one slice. **The dashboard never does this.** Every dashboard query computes devices via: + +```kql +| summarize countDevices = dcount_hll(hll_merge(countDevicesHll)) +``` + +`countDevicesHll` is the **HLL sketch** stored alongside the row. Merging HLLs across rows and then `dcount_hll`-ing gives the correct distinct count. + +**Symptom of the bug:** device counts that sum to more than the fleet size; WoW deltas that look enormous when the underlying user impact is small. 
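To see why summing per-row distinct counts overstates devices, here is a toy sketch in Node. Exact `Set`s stand in for the HLL sketches in `countDevicesHll`; the device IDs and the two slices are invented for illustration and do not come from any view.

```javascript
// Toy illustration only: exact Sets stand in for the HLL sketches stored per row.
const rows = [
  { span_name: 'AcquireTokenSilent', devices: new Set(['d1', 'd2', 'd3']) },
  { span_name: 'GetAccounts',        devices: new Set(['d2', 'd3', 'd4']) },
];

// Analogue of sum(countDevices): adds each row's own distinct count,
// so d2 and d3 are counted once per slice they appear in.
const summed = rows.reduce((n, r) => n + r.devices.size, 0);

// Analogue of dcount_hll(hll_merge(countDevicesHll)): merge first, then count.
const merged = new Set();
for (const r of rows) {
  for (const d of r.devices) merged.add(d);
}
const distinct = merged.size;

console.log({ summed, distinct });
```

Here the summed figure is 6 while only 4 distinct devices exist. A real HLL merge is approximate rather than exact, but it removes the same overlap.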
+ +For request counts, `sum(countRequests)` and `sum(countOverall)` are correct (they're additive). + +--- + +## 4. Helper functions used by the dashboard + +Reuse these so this report agrees with the dashboard: + +| Function | Purpose | Used on | +|---|---|---| +| `MergeAccountType(account_type)` | Collapse AAD variants together and MSA variants together | every error/perf query | +| `MergeIsSharedDevice(is_shared_device)` | Normalize null → "personal", true → "shared", false → "personal" | every error/perf query | +| `MergeUiRequiredExceptions(error_type)` | Collapse the 6+ string variants of `UiRequiredException` into one | error-type aggregation | +| `prettyFormatNumber(n)` | "1.2 M" / "856 k" formatting in tile output | display-only tiles | + +The 7-dimension attribution slicing is **fully achievable from `ErrorStatsMetrics`** — it has `account_type`, `is_shared_device`, `broker_version`, `active_broker_package_name`, `AppInfo_Version`, `client_sku`, `calling_package_name`, `span_name`. **You do NOT need a fallback to raw `android_spans` for these dimensions** (this skill previously claimed you did — that was wrong). + +--- + +## 5. Latency — never sum percentiles + +Latency is stored as a TDigest sketch. **Percentiles are not additive** — averaging p95 across rows is meaningless. Always merge first: + +```kql +materialized_view('PerfStatsUpdated') +| where EventInfo_Time between ((_startTime) .. 
(_endTime)) +| where span_name in ('AcquireTokenSilent','GetAccounts','RemoveAccount','ProcessWebsiteRequest') +| where span_status == 'OK' +| summarize p50 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 50, typeof(long)), + p95 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 95, typeof(long)), + p99 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 99, typeof(long)) + by week=startofweek(EventInfo_Time), span_name +``` + +**Note:** there is also a `PerfStatsMetrics` view, but it does **not** expose per-percentile columns directly — it has the merged TDigest. Use `PerfStatsUpdated` (preferred by the dashboard) and `percentile_tdigest(tdigest_merge(...), N, typeof(long))`. + +--- + +## 6. Column-name reference (so you don't burn a query on a typo) + +| View | Has column | Doesn't have | +|---|---|---| +| `ErrorStatsMetrics` | `error_code`, `error_type`, `span_name`, `broker_version`, `active_broker_package_name`, `AppInfo_Version`, `client_sku`, `calling_package_name`, `account_type`, `is_shared_device`, `EventInfo_Time`, `countOverall`, `countDevicesHll` | `calling_package` (no — it's `calling_package_name`), `countDevices` (no — use the HLL) | +| `BrokerAdoptionStatsUpdated` | `broker_version`, `EventInfo_Time`, `countRequests`, `countDevicesHll` | per-error breakdown (use ErrorStatsMetrics) | +| `PerfStatsUpdated` | `span_name`, `span_status`, `broker_version`, `active_broker_package_name`, `account_type`, `is_shared_device`, `client_sku`, `calling_package_name`, `responseTimeTDigest`, `countRequests` | `p50_ms` / `p95_ms` (no — use `percentile_tdigest`) | +| `AppStatsUpdated` | `calling_package_name`, `EventInfo_Time`, `countRequests`, `countDevicesHll` | error breakdown | + +--- + +## 7. Week alignment — Kusto `startofweek()` is **Sunday-aligned** + +If a user says "the week of May 2 → May 9", Kusto buckets it as `startofweek('2026-05-09') == 2026-05-03T00:00:00Z`. 
**Always confirm**: print the distinct `startofweek(EventInfo_Time)` values from your first query and verify the bucket label matches the user's intent. Off-by-one-week is the #1 silent error. + +For an 8-complete-week 60-day window ending Sat May 9, the buckets are: +`2026-03-08, 03-15, 03-22, 03-29, 04-05, 04-12, 04-19, 04-26, 05-03` — that's 9 buckets, one of which (the first) was a partial start. Drop the first; keep 8 complete weeks. + +--- + +## 8. Canonical query templates + +### 8a. Reliability (auth-only denominator) + +```kql +let all = materialized_view('SilentAuthStatsAllRequestsMetrics') + | where EventInfo_Time > ago(70d) + | summarize allReq = sum(countRequests), + allDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time); +let ok = materialized_view('SilentAuthStatsRequestsWithoutExpectedErrorMetrics') + | where EventInfo_Time > ago(70d) + | summarize okReq = sum(countRequests), + okDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time); +all | join kind=inner ok on week + | project week, + reqRel = round(100.0 * okReq / allReq, 3), + devRel = round(100.0 * okDev / allDev, 3) + | order by week asc +``` + +### 8b. 60-day error trend (feeds `bucket-trends.js`) + +```kql +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time > ago(70d) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| order by error_code asc, week asc +``` + +### 8c. Spike attribution — one slicing dim at a time + +The MCP tool can return ~50–700 KB of JSON; multi-dim cartesians blow this out. 
**Slice one dimension per query**, then post-process with `summarize-attribution.js`: + +```kql +let codes = dynamic(['no_tokens_found','unauthorized_client','Code:-6', + 'unknown_crypto_error','null_pointer_error','timed_out_execution']); +materialized_view('ErrorStatsMetrics') +| extend unified_account_type = MergeAccountType(account_type) +| extend unified_is_shared_device = MergeIsSharedDevice(is_shared_device) +| where EventInfo_Time > ago(14d) +| where error_code in (codes) +| extend wk = startofweek(EventInfo_Time) +| summarize devs = dcount_hll(hll_merge(countDevicesHll)) + by wk, error_code, span_name // <-- swap this dim per query +| order by error_code asc, wk asc, devs desc +``` + +Run once each with the trailing dim set to: `span_name`, `calling_package_name`, `active_broker_package_name`, `broker_version`, `unified_account_type`, `unified_is_shared_device`, `client_sku`. That's the full 7. + +### 8d. Latency — see Section 5 above. + +### 8e. Broker version share + +```kql +materialized_view('BrokerAdoptionStatsUpdated') +| where EventInfo_Time > ago(21d) +| summarize req = sum(countRequests), + dev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), broker_version +| order by week asc, req desc +``` + +--- + +## 9. MCP output handling + +- Most queries with multi-week × per-error-code grain return **>50 KB** and are written to a side file by the tool. Read the side file with the `read_file` tool, or pipe through `bucket-trends.js` / `summarize-attribution.js`. +- The first row of `results.items` is the **schema object**, not data. The helper scripts know this. +- If a query times out or returns `BadRequest`, check **column name typos first** (the error message names the missing column). + +--- + +## 10. 
Helper scripts + +| Script | Purpose | +|---|---| +| [`bucket-trends.js`](bucket-trends.js) | Bucket every error code into regression / spike / improvement / flat across an N-week window | +| [`summarize-attribution.js`](summarize-attribution.js) | Roll up 7-dim attribution slices per (error_code, week) — feeds the spike-attribution cards | +| [`report-template.html`](report-template.html) | Canonical layout. Copy to `oncall-wow-report-v{N+1}.html` and replace data only — never restructure CSS | diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html new file mode 100644 index 00000000..9ff9f624 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html @@ -0,0 +1,1779 @@ + + + + +Android Broker · On-Call Weekly Report + + + +
+ +
+
+

Android Broker · Weekly On-Call Report

+
+ Last 7 days vs prior 7 days  ·  + Source: AllAndroidSpans  ·  + Generated 2026-05-07 +
+
+ Live data +
+ + + + +

📊 Top-line health — last 13 days

+
+
+
Silent auth requests
+
10.37 B
+
−0.6% WoW
+
+
+
+
Silent auth devices
+
190.1 M
+
−0.7% WoW
+
+
+
+
Interactive auth requests
+
9.84 M
+
−1.0% WoW
+
+
+
+
Interactive auth devices
+
6.34 M
+
−1.8% WoW
+
+
+
+
Latest broker
+
16.0.1
+
📈 43.1% req share (was 21.2%)
+
+
+
+ + +

🚨 Things that need attention this week

+ +
+
ℹ️ Important caveat about denominators (read this first)
+

The all-spans device count dropped 38% WoW (572 M → 353 M) due to broker PR #88 moving OnUpgradeReceiver work to goAsync() — a fix for an OPPO GPU-overload issue. After 16.0.x rollout, that span no longer fires reliably (broadcast receivers can be killed before async work completes), removing ~509 M span events/week from the denominator. Auth-only device count is flat (190.1 M → 188.6 M) — users are unaffected.

+

All reliability metrics in this report use the auth-only denominator (SilentAuthStats / InteractiveAuthStats) so they reflect real user impact, not telemetry artifacts. The dashboard's default "Device Reliability" tile already does this.

+
+ +
+
🔴 Real regressions worth investigation
+
    +
  • invalid_resource — +20% device share, +57% raw devices (eSTS-side). Affects Outlook + Teams. Investigate eSTS error trend.
  • +
  • Failed to parse JWT — +367% device share (small absolute, 2,700 devices). LTW + 16.0.1 + MSA only.
  • +
  • Code:-10 (WebView UNSUPPORTED_SCHEME) — +161% devices. Tied to common #3013 openid-vc URL handling.
  • +
  • DeviceRegistrationException — +20% devices, broker code regression. Tied to broker #87.
  • +
  • kdfv2_key_derivation_error — bursty +1,951% requests. Server-side ECS flight ramp on AAD KDFv2.
  • +
+

See → 🔎 Spike Attribution for code-level root cause on each.

+
+ +
+
📈 Slow-burn regressions only visible on 60-day trend (WoW would miss these)
+
    +
  • timed_out_execution — devices 18M → 80.6M over 8 weeks (+348%, peaked at 143M Apr 19). WoW shows a −40% pullback this week, but the multi-week trajectory is sharply up. Likely tied to broker #141 (HTTP cancellation on ATS timeout).
  • +
  • no_tokens_found — devices 14.1M → 22.9M (+62%); as % of fleet 0.71% → 1.50% (~2.1×). Matches the dashboard chart you flagged. Suspect: common #3074 token-cache remove optimization.
  • +
  • null_pointer_error — devices 48K → 67K (+39%) over 8 weeks. Crash-bucket by error_location next.
  • +
  • unknown_crypto_error — devices +20%; correlate with OEM/Android-version next.
  • +
+

Full breakdown → 📈 60-Day Trend Analysis. The earlier "auxiliary spans in 16.0.0" hypothesis for io_error / no_account_found / invalid_grant has been retracted — see updated verdict in the Spike Attribution section.

+
+ +
+
Real wins this week
+
    +
unknown_authority: −87% devices, −82% requests. Direct fix from common #3027 Bleu cloud support: now falls back to hardcoded authority list when discovery fails.
  • +
timed_out_execution: −40% devices, −44% requests. Likely tied to broker #91 skip-account-aggregation flight.
  • +
  • ClientException (error_type) — −10% devices. Mostly the timed_out_execution drop above.
  • +
illegal_argument_exception / ArgumentException: −67% devices. Mostly removed by PR #88 (OnUpgradeReceiver no longer fires synchronously, removing 100k IAE/wk from that path).
  • +
  • 429 (eSTS throttle) — −98% devices. Throttling cleared (Teams IP-Phone fleet).
  • +
  • Latency p99: RefreshPrt −50%, AcquireAtUsingPrt −49%, BrokerOperationRequestDispatcher −20%
  • +
+
+ +
+
📊 Traffic is flat (no surge, no collapse)
+

Silent auth: 10.37 B requests, 190.1 M devices (−0.6% / −0.7% WoW). Interactive: 9.84 M requests, 6.34 M devices (−1.0% / −1.8% WoW). Every top calling app is down 5–22% in requests with stable device counts — "fewer requests per device," likely a benign cache-efficiency improvement. See → Traffic Analysis.

+
+ + +

📈 60-Day Trend Analysis — rising errors that WoW alone misses

+ +
+ Why this section exists: Some errors don't move much week-over-week but have been climbing steadily for weeks or months. WoW deltas hide these slow-burn regressions. This section tracks weekly device counts (from the ErrorStats materialized view) over the last 8 complete weeks (Mar 8 → Apr 26; the partial weeks Mar 1 and May 3 are excluded). An error is flagged as a trend regression if devices grew >15% across this window even when WoW looks flat. +
+ +
+
⚠️ True 60-day regressions (rising even though WoW looked flat)
+
    +
  • timed_out_execution — devices 18.0M → 80.6M (+348%) over 8 weeks; peaked at 143M on Apr 19. As % of fleet: 0.91% → 5.29%. Massive slow-burn regression. Likely tied to broker PR #141 (5c64e1ebd — "Add flight-gated HTTP cancellation on ATS command-level timeout to eliminate zombie worker threads", AB#3542516) which actively converts long-running ATS calls into timed_out_execution errors instead of silent thread leaks. This may be a deliberate visibility increase, but the magnitude warrants confirming the flight rollout schedule and whether downstream callers retry cleanly.
  • +
  • no_tokens_found — devices 14.1M → 22.9M (+62%); as % of fleet 0.71% → 1.50% (~2.1×). Matches the dashboard chart you shared (no_tokens_found % requests climbing 2.5% → 3.7%, % devices 1.25% → 1.8%). Candidate PRs: common #3074 (4f869773a — "Optimize token cache remove path and add filter-first-clone flight for filtered retrieval", AB#3570409) and common #3081 (85f1948e8 — "Fix WPJ's BrokerDiscovery cache crash due to shared predefined encryption key with MSAL", AB#3577391). The cache-remove optimization in #3074 is the prime suspect — an over-aggressive remove or a flight enabling for more apps would directly elevate no_tokens_found.
  • +
  • null_pointer_error — devices 48.4K → 67.3K (+39%) over 8 weeks (peak 71.6K on Apr 19). As % of fleet it's flat (~0.0044%), but absolute device count is steadily climbing. Worth a focused crash-bucketing query on error_location / stack-trace fields to identify the specific call site before it grows further.
  • +
  • unknown_crypto_error — devices 63.8K → 76.3K (+20%); mild but consistent. Likely keystore / TEE-related; correlate with device OEM and Android OS version next pass.
  • +
  • unauthorized_client — devices 2.74M → 3.17M (+15%); mild and may reflect new app onboarding rather than a regression. Bucket by calling_package_name to confirm.
  • +
+
+ +
+
Ephemeral 60-day spike (already self-resolving)
+

unknown_authority — baseline ~1K devices/week through end-March, then exploded: 944K (Apr 5) → 20.5M (Apr 12) → 34.1M (Apr 19, peak) → 9.0M (Apr 26) → 1.3M (May 3, recovering). Strong candidate root cause: common PR #3082 (b53d87e34 — "Fix ABBA deadlock between AzureActiveDirectory and AzureActiveDirectoryAuthority class monitors", AB#3578299), which lands in the authority-validation code path. The mitigation appears to have already taken effect, but an excursion of more than four orders of magnitude deserves a post-mortem and a guardrail alert at >1M devs/week for this code.

+
+ +
+
True 60-day improvements (sustained, not just WoW noise)
+
    +
  • timed_out — devices 37.5M → 5.5M (−85%) over 8 weeks. Likely a downstream effect of the same ATS timeout refactor (PR #141) — generic timed_out is being reclassified into timed_out_execution. Net traffic is roughly conserved between these two codes.
  • +
invalid_scope: 2.00M → 0.38M (−81%). Genuine improvement.
  • +
timed_out_thread_pool_saturated: 1.71M → 0.68M (−60%). Consistent with the zombie-worker-thread fix in PR #141.
  • +
null_object: 8.17M → 5.39M (−34%). Steady improvement.
  • +
+
+ +
+
Flat on 60d (no trend regression, no improvement)
+

io_error, no_account_found, invalid_grant, interaction_required, device_network_not_available_doze_mode, authorization_pending, expired_token, illegal_argument_exception, User cancelled, auth_cancelled_by_sdk — all within ±10% across the 8-week window. This directly contradicts the WoW finding that io_error/no_account_found/invalid_grant regressed +58–66% on a per-device basis — reinforcing the denominator-effect hypothesis in the Spike Attribution card below.

+
+ + +

🔎 Spike Attribution — root-cause breakdown for each spike

+ +
+ What this section answers for each spike: + Is it tied to a broker version rollout? a specific span? active broker? calling app? account type (AAD vs MSA)? shared device mode? +  Each pill in the header summarizes the dominant attribution. Bars show device-share within the dimension. Red bars indicate >80% concentration in a single value (a strong signal). +
+ + +
+ +
+
+
+
Failed to parse JWT
+
Devices: 343 → 2,868  (+736%)
+
+
+ 3rd-party: Nimbus JWT + ⚡ broker 16.0.1 + ⚡ Link to Windows + ⚡ MSA only +
+
+
+
+ Verdict — Strong attribution. 91% of devices are on broker 16.0.1, 100% are com.microsoft.appmanager (Link to Windows) using OneAuth/MSAL_CPP, on AcquireTokenInteractive with MSA accounts. The spike began climbing on Apr 30, matching the 16.0.1 LTW rollout window. Action: file bug against LTW + OneAuth team for JWT parsing path on MSA interactive flows. +
+
+
+
Broker version
+
16.0.1 91%
+
+
16.0.0 3%
+
+
other (8 versions) 6%
+
+
+
+
Active broker
+
com.microsoft.appmanager 90%
+
+
com.azure.authenticator 10%
+
+
+
+
Calling app
+
com.microsoft.appmanager 100%
+
+
+
+
Span
+
AcquireTokenInteractive 100%
+
+
+
+
Client SKU
+
MSAL_CPP (OneAuth) 100%
+
+
+
+
Account type
+
+
+
+
+
+
+
+ MSA 99.97% + AAD 0.03% +
+
+
+
Shared device mode
+
+
+
+
Personal 100%
+
+
+ +
+
Code attribution
+
+
Originator
+
3rd-party lib Nimbus JOSE+JWT — wrapped by broker code
+
+
+
Top throw site
+
com.nimbusds.jwt.SignedJWT.getJWTClaimsSet:28  97% of cases  ·  thrown as ParseException
+
+
+
Wrapper
+
com.microsoft.identity.common.java.providers.oauth2.IDToken.parseJWT:38 wraps it as ServiceException("Failed to parse JWT", INVALID_JWT, e)
+
+
+
Likely PRs
+
+
+
+ 🟡 Medium +
+ broker #71 · Add Android integration layer for Browser SSO +
commit 92d660dd7 · 2026-03 · authors @melissaahn / Browser SSO team
+
New token-build path through Browser SSO → broker → OneAuth response. MSA-specific paths likely under-tested. Matches Apr 30 climb date.
+
+
+
+ 🟢 Low +
+ common #3006 + broker #76 · Edge TB: PoP support for WebApps +
commit d774c923b · 2026-03-17
+
Touches MicrosoftStsAccountCredentialAdapter near IDToken handling but only adds a new auth scheme branch — doesn't change the parse path itself.
+
+
+
+
+
+
+
Next step
+
Capture 5-10 correlation IDs from this spike, fetch the broker → OneAuth response payload, inspect actual idToken bytes to confirm whether it's empty, truncated, or base64-malformed.
+
+
+
+
+ + +
+
+
+
kdfv2_key_derivation_error
+
Requests: 262 → 5,374  (+1,951%) · 57 devices
+
+
+ Android system: Keystore + ⚡ ECS flight ramp + ⚡ AAD only + bursty (May 1, May 2) +
+
+
+
+ Verdict: Per-device retry storm. 99% of requests come from AAD accounts on a tiny pool (~57 devices). Two big bursts, 1,019 requests on May 1 and 3,026 requests on May 2, were followed by a drop back to baseline. This looks like a small set of devices retrying KDFv2 derivation in a loop. Likely related to the broker 16.0.1 crypto path. Action: check broker logs for those device IDs; a server-side flight to disable KDFv2 for these devices may be needed. +
+
+
+
Account type
+
+
+
+
+
+
+
+ AAD 99% + UNKNOWN 1% +
+
+
+
Shared device
+
+
+
+
Personal 100%
+
+
+
Daily request count (last 13 days)
+
+
Spikes on Apr 30, May 1, May 4 — bursty, not sustained.
+
+
+ +
+
Code attribution
+
+
Originator
+
Android system Keystore / SHA-256 provider on certain devices — wrapped by broker
+
+
+
Top throw site
+
com.microsoft.identity.broker4j.broker.prt.SessionKeyJwtRequestSigner.getSignedJwt:118
+
+
+
Underlying cause
+
84% no_such_algorithm from ProviderFactory.getMessageDigest:123  ·  16% invalid_key from SP800108KeyGen$1.perform:112
+
+
+
Likely PRs
+
+
+
+ 🔴 High +
+ Server-side ECS rollout · UseKdfVersion2 flight ramp +
Not a code PR — telemetry pattern matches flight ramp (bursts on May 1: 1,019 reqs, May 2: 3,026 reqs)
+
Broker code shipped July 2025 (PR #3144). What changed this week is the flight ramp, not the code.
+
+
+
+ 🟢 Low +
+ broker #152 · Enable KDFv2 by default +
commit 0fe27f7ab · 2026-04-17 · ships in v16.1.0 (NOT yet rolled out)
+
Code change exists but isn't in production yet. Server-side flight is the active driver.
+
+
+
+
+
+
+
Next step
+
Check ECS dashboard for UseKdfVersion2 ramp on Apr 30 / May 1. Add try/catch fallback in SessionKeyJwtRequestSigner.getSignedJwt():117-122 to retry with KDFv1 on no_such_algorithm. Block-list affected device models from the flight.
+
+
+
+
+ + +
+
+
+
SSLHandshakeException
+
Requests: 298k → 555k  (+97%) · only 233 devices
+
+
+ Android system: Conscrypt + NOT new broker + legacy broker 13.3.2 + Teams IP-Phone DCF +
+
+
+
+ Verdict — Same legacy device pool retrying more. 99% of requests come from broker 13.3.2 (legacy), all from com.microsoft.skype.teams.ipphone calling app, on the AcquireTokenDcfAuthRequest span (Device Code Flow). Same ~150 device pool — they're just retrying more. NOT caused by 16.0.1 rollout. Action: escalate to Teams IP-Phone team — they're on a 2+ year old broker that needs upgrading; their TLS path is failing. +
+
+
+
Broker version
+
13.3.2 (legacy) 99%
+
+
13.9.1 1%
+
+
+
+
Active broker
+
com.azure.authenticator 100%
+
+
+
+
Calling app
+
com.microsoft.skype.teams.ipphone 99%
+
+
+
+
Span
+
AcquireTokenDcfAuthRequest 99%
+
+
+
+
Client SKU
+
MSAL (Android) 99%
+
+
+
+
Account type
+
+
+
+
UNKNOWN 100% (DCF pre-auth)
+
+
+ +
+
Code attribution
+
+
Originator
+
Android system Conscrypt TLS implementation — broker is a passive consumer
+
+
+
Top throw site
+
com.android.org.conscrypt.SSLUtils.toSSLHandshakeException:363 (125k requests)  ·  ConscryptFileDescriptorSocket.startHandshake:231 (45k)
+
+
+
Underlying cause
+
99%+ CertificateException from TrustManagerImpl.verifyChain  ·  cert-chain rejection at TLS layer
+
+
+
Likely PRs
+
+
+
+ ⚪ None +
+ No PR in scope +
Broker code is not in the call stack at all
+
Broker version 13.3.2 (legacy, from 2024) is dominantly affected — far outside the 15.1.0 → 16.0.1 window. The growth reflects an existing fleet's environmental TLS issues, not a code regression.
+
+
+
+
+
+
+
Next step
+
Tag as environmental — track but do not page. Already known: escalate to Teams IP-Phone team to upgrade their fleet off broker 13.3.2.
+
+
+
+
+ + +
+
+
+
SSLPeerUnverifiedException
+
Requests: 104 → 3,346  (+3,117%) · only 24 devices
+
+
+ Android system: okhttp + same root cause as SSLHandshake + legacy 13.3.2 + 13.9.1 +
+
+
+
+ Verdict — Same root cause as SSLHandshakeException. 95% of requests on broker 13.3.2 + 13.9.1 (legacy), 95% from com.microsoft.skype.teams.ipphone, all on AcquireTokenDcfAuthRequest. Probably the same TLS chain validation issue as above on the Teams IP-Phone fleet. Treat together with SSLHandshakeException. +
+
+
+
Broker version
+
13.3.2 62%
+
+
13.9.1 33%
+
+
+
+
Calling app
+
com.microsoft.skype.teams.ipphone 95%
+
+
+
+
Span
+
AcquireTokenDcfAuthRequest 95%
+
+
+
+
Account type
+
+
+
+
UNKNOWN 100%
+
+
+ +
+
Code attribution
+
+
Originator
+
Android system Bundled okhttp legacy stack — broker is a passive consumer
+
+
+
Top throw site
+
com.android.okhttp.internal.io.RealConnection.connectTls:205  88% of cases  ·  TLS hostname verification failure
+
+
+
Likely PRs
+
+
+
+ ⚪ None +
+ No PR in scope +
Same root cause class as SSLHandshakeException — Android system TLS
+
Treat together with SSLHandshakeException. Same legacy fleet (13.3.2 + 13.9.1).
+
+
+
+
+
+
+
+
+ + +
+
+
+
DeviceRegistrationException
+
Devices: 204 → 245  (+20%) · DRS-adjacent
+
+
+ broker code: PR #87 + ⚡ broker 16.0.1 + ⚡ Authenticator + DeviceRegistrationIpc +
+
+
+
+ Verdict — Likely tied to Authenticator 16.0.1 device registration path. 78% on broker 16.0.1 in com.azure.authenticator, 78% on the new DeviceRegistrationIpc span. Action: investigate Authenticator 16.0.1 device-registration IPC failures; may indicate regression in the new DRS protocol. +
+
+
+
Broker version
+
16.0.1 78%
+
+
15.1.0 13%
+
+
others 9%
+
+
+
+
Active broker
+
com.azure.authenticator 78%
+
+
com.microsoft.intune 12%
+
+
+
+
Calling app
+
com.azure.authenticator 78%
+
+
com.microsoft.intune 15%
+
+
+
+
Span
+
DeviceRegistrationIpc 78%
+
+
DeviceRegistrationApi 22%
+
+
+
+
Account type
+
+
+
+
UNKNOWN 100% (pre-auth DRS)
+
+
+ +
+
Code attribution
+
+
Originator
+
Broker code New IPC path added in 16.0.x release window
+
+
+
Top throw site
+
com.microsoft.workaccount.workplacejoin.protocol.AndroidDeviceRegistrationProtocolPacker.throwIfBundleContainsDeviceRegistrationException:226 (207 of 215 cases)
+
+
+
Caller hot-spots
+
GetRegistrationStateV0LegacyExecutor.execute:90 (84 dev)  ·  AndroidDeviceRegistrationClientController.execute:234 (47 dev)
+
+
+
Likely PRs
+
+
+
+ 🔴 High +
+ broker #87 · Update OpenTelemetry integration for Device Registration IPC in client +
commit 9db76c2fe · 2026-03-16 · author @pedroro
+
107-line diff to AndroidDeviceRegistrationClientController.execute(); line 234 (where 47 device errors land) was touched in this range. Also added the DeviceRegistrationIpc span where 78% of telemetry now lands.
+
+
+
+ 🟡 Medium +
+ broker #81 + common #2926 · Add BoundServiceStrategy as a DR API IPC fallback +
commit 74b33b4b9 · 2026-03-16 · author @pedroro
+
Added a new IPC strategy that may fail and bubble up. Slice by bound_service_status/content_provider_status attributes (added in #87) to isolate which strategy is failing.
+
+
+
+
+
+
+
Next step
+
Owner: @pedroro. Slice by bound_service_status vs content_provider_status attributes to identify which IPC strategy is failing.
+
+
+
+
+ + +
+
+
+
Code:-11
+
Devices: 952 → 2,404  (+153%)
+
+
+ Android WebView: ERROR_FAILED_SSL_HANDSHAKE (-11) + environmental — no PR + Outlook + Teams IP-Phone + AAD-dominant +
+
+
+
+ Verdict — Mixed legacy + new fleet, no single root cause. Top broker is legacy 13.3.2 (17 devices, 3.5k requests = retry storm) but the device count growth comes from 15.1.0 (914 dev) and 16.0.1 (372 dev). 99% AAD. Spread across many calling apps (Outlook leads with 60% of devices). Not a clean version regression — looks like two unrelated populations: legacy IP-phone retry storm + a slowly-growing baseline across newer brokers. Action: low priority; track WoW for stability. +
+
+
+
Broker version (by devices)
+
15.1.0 38%
+
+
16.0.1 15%
+
+
15.0.0 11%
+
+
other (40+ versions) 36%
+
+
+
+
Calling app (by devices)
+
com.microsoft.office.outlook 60%
+
+
com.microsoft.teams 16%
+
+
com.axisbank.siddhi.v3 5%
+
+
+
+
Span
+
AcquireTokenInteractive 96%
+
+
+
+
Account type
+
+
+
+
+
+
+
+ AAD 96% + MSA 4% +
+
+
+
Shared device
+
+
+
+
+
+
+
+ Personal 99% + SDM 1% +
+
+
+ +
+
Code attribution
+
+
Originator
+
Android WebView ERROR_FAILED_SSL_HANDSHAKE = -11  ·  environmental enterprise TLS interception
+
+
+
Top throw site
+
com.microsoft.identity.common.internal.ui.webview.OAuth2WebViewClient.sendErrorToCallback wraps as new ClientException("Code:" + errorCode, ...)
+
+
+
Top error_messages
+
+ 5,298× net::ERR_SSL_PROTOCOL_ERROR  ·  + 2,689× Zscaler-issued cert for login.live.com  ·  + many Zscaler/proxy certs for aadcdn.msftauth.net, aadgatewaymsit.msidentity.com, etc.  ·  + net::ERR_BAD_SSL_CLIENT_AUTH_CERT  ·  net::ERR_TUNNEL_CONNECTION_FAILED +
+
+
+
Likely PRs
+
+
+
+ ⚪ None +
+ No PR in scope +
No PR in 15.1.0 → 16.0.1 touches WebView TLS validation
+
Customer enterprise networks (Zscaler, Bombardier, Société Générale, Bank Gospodarstwa Krajowego, AXIS Bank, etc.) are doing TLS interception. Their proxy presents a cert WebView's validator rejects. Device-count growth (+153%) reflects the growing 16.0.1 fleet entering enterprise environments.
+
+
+
+
+
+
+
Next step
+
Tag as environmental — track but do not page. Long-term: detect Zscaler-style proxy and surface a clearer user-facing error, OR ship a flight to honor user-installed CA store (security trade-off).
+
+
+
+
+ + +
+
+
+
Code:-10
+
Devices: 62 → 162  (+161%)
+
+
+ Android WebView: ERROR_UNSUPPORTED_SCHEME (-10) + ⚡ PR #3013 openid-vc + Outlook + msapps + 100% AAD · 100% OneAuth +
+
+
+
+ Verdict — Tied to newer brokers + Outlook/msapps. 50% of devices on broker 16.0.1, 32% on 15.1.0. 63% calling app = com.microsoft.office.outlook, 25% = com.microsoft.msapps. 100% MSAL_CPP/OneAuth, 100% AAD. Action: investigate AcquireTokenInteractive failure path on Outlook + OneAuth on the latest brokers. +
+
+
+
Broker version
+
16.0.1 50%
+
+
15.1.0 32%
+
+
14.0.2 10%
+
+
+
+
Calling app
+
com.microsoft.office.outlook 63%
+
+
com.microsoft.msapps 25%
+
+
+
+
Span
+
AcquireTokenInteractive 94%
+
+
+
+
Client SKU
+
MSAL_CPP (OneAuth) 84%
+
+
ADAL 7%
+
+
+
+
Account type
+
+
+
+
+
+
+
+ AAD 93% + UNKNOWN 7% +
+
+
+ +
+
Code attribution
+
+
Originator
+
Android WebView ERROR_UNSUPPORTED_SCHEME = -10  ·  WebView received a custom-scheme redirect URL it can't handle
+
+
+
Top throw site
+
ExceptionAdapter.exceptionFromAuthorizationResult:146 (250 of 286 cases) wraps WebView's ERROR_UNSUPPORTED_SCHEME as ClientException("Code:-10", ...)
+
+
+
Top error_message
+
100% net::ERR_UNKNOWN_URL_SCHEME (286/286)
+
+
+
Likely PRs
+
+
+
+ 🔴 High +
+ common #3013 · Handle openid-vc urls in webview +
commit 5d30739ca · 2026-03-13 · author @somalaya
+
Introduces handling for new openid-vc:// redirect scheme. If the device's Authenticator/wallet doesn't claim the scheme, WebView throws ERROR_UNSUPPORTED_SCHEME. Author even added a kill-switch flight in #3037 "in case something goes wrong" — exactly this scenario.
+
+
+
+ 🟡 Medium +
+ common #3037 · Set default value for openid-vc flight in webview redirect +
commit 6a4258589 · 2026-03-19 · author @somalaya
+
Companion to #3013 — set the default-on value. If default-on, this is what triggers the spike. Disabling this flight should mitigate.
+
+
+
+
+
+
+
Next step
+
Disable ENABLE_OPENID_VC_HANDLING_IN_WEBVIEW_REDIRECT flight for the affected slice (Outlook + msapps + 16.0.1) and verify spike subsides. Owner: @somalaya / Sowmya Malayanur.
+
+
+
+
+ + +
+
+
+
Top 3 device-share regressions: io_error · no_account_found · invalid_grant
+
Combined attribution view — flat raw counts but +58–66% device-share growth
+
+
+ broker 16.0.1 dominant + all account types + non-shared devices +
+
+
+
+ Verdict: Likely a denominator effect, not a true reliability regression. Raw weekly request counts for all three errors are essentially flat over the last 60 days (see the 60-day trend section above, where io_error, no_account_found, and invalid_grant all sit on a flat-to-slightly-down trajectory). Yet the per-device % jumped +58–66% week-over-week. The most plausible explanation is a shift in the active-device denominator coinciding with the 16.0.0 → 16.0.1 rollout (a cohort change in which devices report, not new spans being emitted). An earlier draft incorrectly stated that 16.0.0 emitted "auxiliary spans"; verification of the relevant 16.0.0 commits did not substantiate that, and the claim is retracted. Action: before paging, rerun with a stabilized denominator (e.g. devices that made ≥1 ATS/ATI request that week) and compare against the 60-day trend. +
+
+
+
invalid_grant — broker (by req)
+
16.0.1 67%
+
+
14.2.0 15%
+
+
15.1.0 7%
+
+
+
+
invalid_grant — account type (by req)
+
+
+
+
+
+
+
+
+ AAD 72% + MSA 25% + UNK 3% +
+
+
+
io_error — account type (by req)
+
+
+
+
+
+
+
+
+ AAD 72% + UNK 19% + MSA 9% +
+
+
+
no_account_found — account type
+
+
+
+
+
+
+
+
+ MSA 74% + AAD 24% + UNK 2% +
+
+
+
All three — shared device mode
+
+
+
+
+ Personal 99.9% + SDM 0.1% +
+
Errors are essentially absent from SDM — this is a personal-device pattern.
+
+
+
+
+ +
+ + + +

Error codes — WoW with stable denominator — %dev = devices-hit / auth-active devices

+ +
+ Methodology change: denominator is now SilentAuthStats ∪ InteractiveAuthStats device count (190 M, flat WoW), + not BrokerAdoptionStats (572 M → 353 M, contaminated by PR #88). + Errors below ranked by real change in device share. + Δpp = absolute change (percentage points). + Δrel% = relative change. +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Error code | Status | Devices now | Devices prev | %dev now | %dev prev | Δpp dev | Δrel% dev | Where defined |
|---|---|---|---|---|---|---|---|---|
| timed_out_execution | ▼ Real win | 24.3 M | 41.1 M | 12.89% | 21.43% | −8.54 | −39.8% | broker CommandDispatcher.java:358 |
| unknown_authority | ▼ Real win | 948 k | 7.39 M | 0.50% | 3.85% | −3.35 | −87.0% | broker Authority.java:420 |
| io_error | ⚪ Flat | 73.7 M | 73.6 M | 39.03% | 38.42% | +0.61 | +1.6% | broker ConnectionError |
| no_account_found | ⚪ Flat | 59.9 M | 61.3 M | 31.74% | 31.99% | −0.25 | −0.8% | broker account cache lookup |
| invalid_grant | ⚪ Flat | 35.0 M | 35.2 M | 18.58% | 18.36% | +0.23 | +1.2% | eSTS server-returned |
| no_tokens_found | ⚪ Flat | 4.31 M | 4.29 M | 2.28% | 2.24% | +0.04 | +1.9% | broker token cache |
| null_object | ⚪ Flat | 3.78 M | 4.13 M | 2.00% | 2.16% | −0.15 | −7.1% | broker nullable utils |
| illegal_argument_exception | ▼ Real win | 140 k | 428 k | 0.07% | 0.22% | −0.15 | −66.7% | JDK via OnUpgradeReceiver |
| timed_out | ▼ Mild win | 2.14 M | 2.44 M | 1.13% | 1.27% | −0.14 | −11.0% | broker dispatcher |
| 429 (eSTS throttle) | ▼ Real win | 2,527 | 142 k | ~0% | 0.07% | −0.07 | −98.2% | eSTS throttle response |
| invalid_resource | ▲ Real regression | 494 k | 417 k | 0.26% | 0.22% | +0.04 | +20.3% | eSTS server-returned |
| device_network_not_available_doze_mode | ⚪ Flat | 1.99 M | 2.06 M | 1.05% | 1.07% | −0.02 | −1.9% | Android doze mode |
| User cancelled | ⚪ Flat | 1.74 M | 1.85 M | 0.92% | 0.96% | −0.04 | −4.2% | broker UI cancel |
| interaction_required | ⚪ Flat | 1.73 M | 1.83 M | 0.92% | 0.95% | −0.03 | −3.6% | eSTS |
| unauthorized_client | ⚪ Flat | 1.16 M | 1.18 M | 0.61% | 0.62% | ~0 | −0.5% | eSTS |
| null_pointer_error | ⚪ Flat | 57.9 k | 62.2 k | 0.03% | 0.03% | ~0 | −5.2% | broker NPE wrapper |
| Failed to parse JWT | ▲ Spike (small) | 2,709 | 633 | 0.0014% | 0.0003% | +0.001 | +367% | Nimbus JWT via IDToken |
| Code:-11 | ▲ Spike (small) | 2,329 | 905 | 0.0012% | 0.0005% | +0.001 | +140% | Android WebView |
+
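The effect of the methodology change can be checked with back-of-the-envelope arithmetic. The sketch below recomputes the io_error device share under both denominators, using figures quoted in this report (73.6 M → 73.7 M devices, BrokerAdoptionStats 572 M → 353 M, auth-active devices about 190 M). Treating the auth-active denominator as exactly flat at 190 M is a simplifying assumption, and `shareChange` is a hypothetical helper, not part of the report tooling.

```javascript
// Hypothetical helper: change in device share between two weeks.
function shareChange(devPrev, devNow, denomPrev, denomNow) {
  const sharePrev = devPrev / denomPrev;
  const shareNow = devNow / denomNow;
  return {
    sharePrev,
    shareNow,
    deltaPp: (shareNow - sharePrev) * 100,         // absolute change, percentage points
    deltaRelPct: (shareNow / sharePrev - 1) * 100, // relative change, percent
  };
}

// io_error device counts are nearly identical week over week (values in millions).
const shrinking = shareChange(73.6, 73.7, 572, 353); // contaminated all-spans denominator
const stable = shareChange(73.6, 73.7, 190, 190);    // assumed flat auth-active denominator

console.log(`shrinking denominator: ${shrinking.deltaRelPct.toFixed(0)}% relative change`);
console.log(`stable denominator:    ${stable.deltaRelPct.toFixed(1)}% relative change`);
```

With the shrinking denominator the same flat device count shows up as a large apparent regression, while the stable denominator leaves it close to zero, which is the pattern the table is meant to expose.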
+ +

Code attribution — real movers only

+ +
+ +
+
+
+
▼ unknown_authority
+
Devices: 7.39 M → 948 k  (−87%)
+
+
+ broker code + all flows +
+
+
+
+ Verdict — Direct fix from Bleu cloud PR. Discovery-failure path now falls back to hardcoded authority list instead of immediately returning unknown_authority. Sovereign clouds (Bleu/Delos/SovSG) also pre-seeded into cache, eliminating most discovery roundtrips. +
+
+
+
Throw site
+
com.microsoft.identity.common.java.authorities.Authority.getKnownAuthorityResult():420
+
+
+
Likely PR
+
+
+
+ 🔴 High +
+ common #3027 · [Common] Bleu cloud support +
commit 69f5e5abf · 2026-03-20 · author Mohit Chandwani
+
PR description literally says: "Authority recognition: getKnownAuthorityResult() now wraps discovery in try-catch — if discovery fails, it still checks hardcoded metadata and developer configuration instead of immediately returning 'unknown authority'." Source code at line 420 confirms this exact behavior. Trend is monotonic decline starting around 16.0.x rollout dates.
+
+
+
+
+
+
+
+
+ +
+
+
+
▼ timed_out_execution
+
Devices: 41.1 M → 24.3 M  (−40%)
+
+
+ broker code + silent auth dispatcher +
+
+
+
+ Verdict — Likely tied to skip-account-aggregation flight. Thrown when CommandDispatcher's silent thread-pool task exceeds the timeout. Per-broker-version slice shows the drop concentrated on 16.0.x — matches the broker's SkipAccountAggregation flight which removes the largest source of slow paths. +
+
+
+
Throw site
+
com.microsoft.identity.common.java.controllers.CommandDispatcher:358
+
+
+
Likely PRs
+
+
+
+ 🔴 High +
+ broker #91 · Skip getCachedRecordToReturn execution when skip_account_aggregation flight is enabled +
commit ddcc073d1 · 2026-03 · removes a major slow path inside ATS
+
Eliminates a redundant cached-record retrieval that often timed out under contention. Together with the broker dispatcher latency wins (p99 −20%), this directly reduces timed_out_execution.
+
+
+
+ 🟡 Medium +
+ common #2910 · Remove Lru cache + few optimizations +
commit 68f001df6
+
Removed lock contention on a shared LRU cache that was a known timeout culprit.
+
+
+
+
+
+
+
+
+ +
+
+
+
▼ illegal_argument_exception / ArgumentException
+
Devices: 428 k → 140 k  (−67%)
+
+
+ side-effect of PR #88 + OnUpgradeReceiver +
+
+
+
+ Verdict — Side-effect of PR #88 (OnUpgradeReceiver async). Per-span breakdown: 97,730 of 140,200 devices (70%) hit this on OnUpgradeReceiver span. That span is no longer firing reliably on 16.0.x devices, so the IAE thrown inside it (likely a Keystore parameter validation in the keystore creation path the PR was trying to defer) is also no longer being captured. Real user impact unchanged. +
+
+
+
Top span affected
+
OnUpgradeReceiver (97 k of 140 k devices = 70%)  ·  AcquireTokenSilent (42 k = 30%)
+
+
+
Likely PR
+
+
+
+ 🔴 High +
+ broker #88 · Make OnUpgradeReceiver operations asynchronous +
commit 14905a3ed · 2026-03-16 · OPPO GPU-overload fix
+
Wraps OnUpgradeReceiver work in goAsync() + CoroutineScope(Dispatchers.IO).launch. The receiver now completes before the async block, and the block itself can be killed mid-execution by the OS — so its IAEs (and the OnUpgradeReceiver span itself) stop being emitted. This is a telemetry side-effect, not a real fix for the IAE.
+
+
+
+
+
+
+
+
+ +
+
+
+
▲ invalid_resource
+
Devices: 417 k → 494 k  (+20%)
+
+
+ eSTS server-side + Outlook + Teams concentrated + broker 16.0.1 dominant +
+
+
+
+ Verdict — Server-side error, not a broker code change. The string invalid_resource is not defined in our broker/common code (no constant, no emit site). It's an eSTS error response passed straight through ServiceException. Concentration: 69% Outlook devices, 19% Teams; 70% on broker 16.0.1, 17% on 15.1.0. Possible explanations: (a) eSTS rejected a resource ID Outlook started sending after a server config change, (b) tdbr claim routing change in common #2679 sending requests to wrong region, (c) Outlook client started requesting a not-yet-deployed resource. +
+
+
+
Originator
+
eSTS server Returned to broker as OAuth error_code in token response, wrapped as ServiceException("invalid_resource", ...)
+
+
+
Top calling apps
+
com.microsoft.office.outlook (340 k devices, 69%) · com.microsoft.teams (95 k, 19%) · com.microsoft.emmx (18 k, 4%)
+
+
+
Top broker version
+
16.0.1 (348 k devices, 70%) · 15.1.0 (85 k, 17%) · 15.0.0 (17 k, 3%)
+
+
+
Likely PRs
+
+
+
+ 🟡 Medium +
+ common #2679 + broker #94 · Use tdbr claim to route telemetry traffic to EU region +
commit cc81b43e2 · 2026-03
+
Despite the title saying "telemetry traffic," this PR set introduces tdbr-based routing logic. If a request is routed to the wrong eSTS regional endpoint, that endpoint may not recognize the resource → invalid_resource. Worth checking the routing decision logs.
+
+
+
+ 🟢 Low +
+ eSTS-side change · Server config or Outlook client API change +
No broker PR — escalate to eSTS / Outlook team
+
If broker routing is correct, this is either an eSTS-side issue or an Outlook client that started requesting a resource ID eSTS doesn't know about.
+
+
+
+
+
+
+
Next step
+
Pull 5-10 correlation IDs from Outlook devices hitting this and check eSTS logs for the actual rejected resource ID. Owner: Outlook + eSTS teams.
+
+
+
+
+ +
+ + +

Error types — WoW with stable denominator

+ +
+ + + + + + + + + + + + + + + + + + + + + + + +
| Error type | Status | Devices now | Devices prev | %dev now | %dev prev | Δpp dev | Δrel% dev |
|---|---|---|---|---|---|---|---|
| ClientException | ▼ Real win | 83.9 M | 95.0 M | 44.47% | 49.55% | −5.08 | −10.2% |
| ArgumentException | ▼ Real win | 140 k | 428 k | 0.07% | 0.22% | −0.15 | −66.7% |
| UiRequiredException | ⚪ Flat | 93.9 M | 95.4 M | 49.73% | 49.75% | ~0 | 0.0% |
| ServiceException | ▼ Mild win | 1.59 M | 1.73 M | 0.84% | 0.90% | −0.06 | −6.6% |
| UserCancelException | ⚪ Flat | 1.74 M | 1.85 M | 0.92% | 0.96% | −0.04 | −4.2% |
| IntuneAppProtectionPolicyRequiredException | ⚪ Flat | 1.12 M | 1.14 M | 0.59% | 0.59% | ~0 | +0.1% |
| CreateCredentialCancellationException | ▼ Mild win | 122 k | 140 k | 0.06% | 0.07% | −0.01 | −11.2% |
| SSLHandshakeException | ▲ Spike (small) | 216 | 303 | ~0% | ~0% | ~0 | request volume +97% |
+
+ +
+
+
+
▼ ClientException — root cause of the WoW improvement
+
Devices: 95.0 M → 83.9 M  (−10.2%)
+
+
+
+
+ Verdict: Composite improvement, dominated by two wins. ClientException is the umbrella type for non-server errors thrown by the broker. Three sub-codes drive most of the −5.08 pp drop, the first two far more than the third: +
    +
  • timed_out_execution: −8.5 pp alone (the dominant component), tied to broker #91 (skip account aggregation)
  • +
  • unknown_authority: −3.4 pp, tied to common #3027 (Bleu cloud)
  • +
  • illegal_argument_exception: −0.15 pp, a side-effect of broker #88 (OnUpgradeReceiver async)
  • +
+ Other sub-codes are flat. This is a real, user-visible reliability improvement. +
+
+
+ + +

📊 Traffic analysis

+ +
+ Per-flow request and device counts — what's actually moving in user-visible traffic. +
+ +
+
+
Silent requests
+
10.37 B
+
−0.6% WoW (flat)
+
+
+
+
Silent unique devices
+
190.1 M
+
−0.7% WoW (flat)
+
+
+
+
Interactive requests
+
9.84 M
+
−1.0% WoW (flat)
+
+
+
+
Interactive unique devices
+
6.34 M
+
−1.8% WoW (flat)
+
+
+
+ +

Top calling apps — every app slightly down in requests, devices stable

+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + +
| Calling app | Requests now | Requests prev | Δreq | Δreq % | Devices now | Devices prev | Δdev % | Note |
|---|---|---|---|---|---|---|---|---|
| com.microsoft.office.outlook | 2.88 B | 3.18 B | −301 M | −9.5% | 88.9 M | 90.2 M | −1.4% | fewer requests/device → cache efficiency |
| com.microsoft.appmanager | 2.36 B | 2.53 B | −168 M | −6.6% | 52.9 M | 53.5 M | −1.0% | same |
| com.microsoft.teams | 1.57 B | 1.73 B | −160 M | −9.3% | 46.4 M | 47.4 M | −2.1% | same |
| com.microsoft.skype.teams.ipphone | 536 M | 605 M | −69 M | −11.4% | 1.63 M | 1.72 M | −4.9% | IP-Phone fleet declining |
| com.microsoft.skydrive | 621 M | 686 M | −65 M | −9.5% | 62.8 M | 65.6 M | −4.3% | same |
| com.samsung.android.email.provider | 142 M | 181 M | −39 M | −21.5% | 738 k | 746 k | −1.0% | biggest req drop, devs flat |
| com.microsoft.office.word | 375 M | 397 M | −23 M | −5.7% | 15.3 M | 15.6 M | −1.9% | same |
| com.microsoft.emmx | 142 M | 159 M | −17 M | −10.6% | 5.69 M | 6.09 M | −6.6% | same |
| com.microsoft.office.officehubrow | 220 M | 233 M | −13 M | −5.6% | 18.4 M | 19.9 M | −7.7% | same |
| com.microsoft.office.excel | 249 M | 262 M | −13 M | −5.0% | 10.8 M | 10.9 M | −1.4% | same |
+
+ +

What's moving inside the broker (top spans by absolute drop)

+ +
+ + + + + + + + + + + + + + + + + + + + +
| Span | Count now | Count prev | Δabsolute | Δrel% | Note |
|---|---|---|---|---|---|
| OnUpgradeReceiver | 142 M | 651 M | −509 M | −78% | broker #88: goAsync() makes broadcast complete before async work; OS may kill before span flushes |
| WrappedKeyAlgorithmIdentifier | 84 M | 176 M | −91 M | −52% | Downstream of fewer keystore ops in OnUpgradeReceiver path |
| SecretKeyWrapping | 201 M | 260 M | −59 M | −23% | Same downstream cause |
| DeviceRegistrationApi | 570 M | 591 M | −22 M | −4% | Flat (within noise) |
| AcquireTokenSilent | 10.13 B | 10.31 B | −176 M | −1.7% | Flat (real auth) |
| BrokerOperationRequestDispatcher | 337 M | 345 M | −9 M | −2.5% | Flat |
| AcquireTokenDcfAuthRequest | 4.7 M | 7.7 M | −3 M | −38% | Tied to Teams IP-Phone fleet decline |
+
+ +
+
📌 Traffic-attribution verdict
+

No real traffic surge or collapse. The headline "38% drop in all-spans devices" is entirely explained by broker PR #88 (~509 M lost OnUpgradeReceiver events/wk). The uniform 5–22% per-app request decline with stable device counts is consistent with caching/efficiency gains rather than traffic loss; recommended next step is to check is_serviced_from_cache rate WoW to confirm.

+
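One way to sanity-check the "fewer requests per device" reading is to compute the ratio directly from the calling-app table. The sketch below does this for Outlook using the rounded figures shown in the table; `requestsPerDevice` is a hypothetical helper. A lower ratio is compatible with better cache hit rates but does not prove it, so the `is_serviced_from_cache` check is still needed.

```javascript
// Hypothetical helper: average requests per device for one app in one week.
function requestsPerDevice(requests, devices) {
  return requests / devices;
}

// Outlook figures from the calling-app table (rounded as displayed).
const now = requestsPerDevice(2.88e9, 88.9e6);
const prev = requestsPerDevice(3.18e9, 90.2e6);
const changePct = (now / prev - 1) * 100;

console.log(`requests/device now:  ${now.toFixed(1)}`);
console.log(`requests/device prev: ${prev.toFixed(1)}`);
console.log(`change: ${changePct.toFixed(1)}%`);
```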
+ + +

Latency — ms, p50/p95/p99 by span

+
+ + + + + + + + + + + + + + + + + + + +
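| Span | p50 now | p50 prev | p95 now | p95 prev | p99 now | p99 prev | Δrel% p99 | p99 trend (13 days) |
|---|---|---|---|---|---|---|---|---|
| RefreshPrt | 345 | 347 | 902 | 942 | 2,680 | 5,344 | −50% | (sparkline) |
| AcquireAtUsingPrt | 606 | 615 | 1,885 | 2,013 | 12,654 | 24,627 | −49% | (sparkline) |
| BrokerOperationRequestDispatcher | 38 | 39 | 1,919 | 1,959 | 6,700 | 8,397 | −20% | (sparkline) |
| AcquireTokenSilent | 480 | 482 | 4,830 | 5,768 | 30,149 | 30,467 | −1% (p95: −16%) | (sparkline) |
| DeviceRegistrationApi | 188 | 191 | 1,462 | 1,436 | 3,501 | 3,442 | +2% | (sparkline) |
| GetAccounts | 459 | 445 | 4,602 | 4,418 | 12,354 | 11,838 | +4% | (sparkline) |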
+
+ + +

Broker version adoption — device share, last 13 days

+
+
+
com.microsoft.appmanager (Link to Windows)
+
16.0.0 → 16.0.1 rollover in progress
+ +
+
16.0.0 (deprecated)
+
16.0.1 (current)
+
15.0.0 (legacy)
+
+
+
+
com.azure.authenticator
+
Authenticator broker version migration
+ +
+
16.0.1
+
15.1.0
+
16.0.0
+
15.0.0
+
+
+
+ + +

Appendix

+ +
+ Long tail — error codes with no material movement +
+ + + + + + + + + + + + + + + + + +
| Error code | Reqs now | Reqs prev | Δrel% reqs | Devs now | Devs prev | Δrel% devs |
|---|---|---|---|---|---|---|
| authorization_pending | 360 M | 371 M | +3% | 88.8k | 92.3k | +54% |
| timed_out | 28.8 M | 28.6 M | +7% | 2.29M | 2.50M | +47% |
| device_network_not_available_doze_mode | 108 M | 119 M | −4% | 2.04M | 2.10M | +56% |
| User cancelled | 3.65 M | 3.72 M | +4% | 1.89M | 1.90M | +59% |
| interaction_required | 33.4 M | 34.2 M | +3% | 1.80M | 1.86M | +55% |
| unauthorized_client | 24.8 M | 25.0 M | +5% | 1.20M | 1.20M | +60% |
| auth_cancelled_by_sdk | 1.58 M | 1.67 M | +1% | 906k | 946k | +53% |
| timed_out_thread_pool_saturated | 2.60 M | 3.12 M | −12% | 380k | 484k | +25% |
| invalid_scope | 846k | 937k | −4% | 241k | 280k | +38% |
| unknown_error | 2.28 M | 2.70 M | −11% | 113k | 118k | +53% |
| device_network_not_available | 13.1 M | 20.4 M | −32% | 147k | 156k | +51% |
| unknown_crypto_error | 5.01 M | 4.56 M | +17% | 29.6k | 31.5k | +50% |
| expired_token | 2.32 M | 2.40 M | +3% | 35.1k | 37.0k | +51% |
+
+
+ +
+ Methodology & caveats +
+
    +
  • Source: AllAndroidSpans + materialized views ErrorStats, SilentAuthStatsAllRequests, InteractiveAuthStatsAllRequests, BrokerAdoptionStats, PerfStats.
  • +
  • Window: last 7 days vs prior 7 days, ending 2026-05-07. Sparklines show 13-day window.
  • +
  • Attribution data: for each spike, joined ErrorStats (broker/span/active_broker/calling_app/sku) with android_spans (account_type, is_shared_device) over the last 7 days.
  • +
  • account_type unification: applied MergeAccountType() (collapses AAD variants, MSA variants).
  • +
  • error_type unification: applied MergeUiRequiredExceptions().
  • +
  • UNKNOWN account_type: typically appears in pre-authentication flows (DRS, DCF, broker discovery) where no account is yet selected.
  • +
  • Important caveat: the device-share inflation across many errors is most likely a denominator-shift artifact from broker 16.0.0 → 16.0.1 rollout. Per-broker-version slicing is needed to confirm whether errors actually grew or whether the denominator just shrank.
  • +
+
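The comparison window in the methodology list above can be sketched as two back-to-back 7-day ranges. The helper below is hypothetical and assumes the end date is inclusive and that days are counted in UTC; the week-alignment rules in the Kusto cheatsheet take precedence if they differ.

```javascript
// Hypothetical helper: two consecutive 7-day windows ending on `endDate`
// (assumed inclusive, UTC). Returns ISO dates (YYYY-MM-DD).
function wowWindows(endDate) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  const end = Date.parse(`${endDate}T00:00:00Z`);
  const iso = (ms) => new Date(ms).toISOString().slice(0, 10);
  return {
    current: { from: iso(end - 6 * DAY_MS), to: iso(end) },
    previous: { from: iso(end - 13 * DAY_MS), to: iso(end - 7 * DAY_MS) },
  };
}

const w = wowWindows('2026-05-07');
console.log(`current:  ${w.current.from} .. ${w.current.to}`);
console.log(`previous: ${w.previous.from} .. ${w.previous.to}`);
```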
+
+ +
+ + + + +
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js
new file mode 100644
index 00000000..5fdce733
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js
@@ -0,0 +1,101 @@
+#!/usr/bin/env node
+/**
+ * summarize-attribution.js - Roll up WoW attribution slices for spike-attribution cards.
+ *
+ * Reads N Kusto MCP JSON output files, each with a `--label=...` tag describing what
+ * dimension it slices, and prints a per-(error_code, week, dimension) breakdown.
+ *
+ * Each input is the JSON file produced by the Kusto MCP tool. The first row of
+ * `results.items` is the schema; the remaining rows are positional arrays.
+ *
+ * The script auto-detects schema by looking at the column names of row[0]:
+ *   - It expects exactly one column named `error_code`.
+ *   - It expects exactly one column named `wk` or `week` (datetime).
+ *   - It expects exactly one numeric column named `devs` or `countDevices`.
+ *   - The remaining 1–2 string columns are treated as the slicing dimension.
+ *
+ * Usage:
+ *   node summarize-attribution.js \
+ *     --label=span \
+ *     --label=calling_app \
+ *     --label=active_broker \
+ *     --label=broker_version
+ *
+ * Output: per error_code, per week, the top-5 values of each dimension by devs and
+ * their share-of-total. Use this to fill in attr-card dim rows.
+ *
+ * IMPORTANT: when you build the source query, ALWAYS use
+ *   dcount_hll(hll_merge(countDevicesHll))
+ * for distinct device counts (HLL merging). `sum(countDevices)` double-counts!
+ */
+const fs = require('fs');
+
+const inputs = []; // { label, file }
+let pendingLabel = null;
+for (const a of process.argv.slice(2)) {
+  if (a.startsWith('--label=')) { pendingLabel = a.split('=')[1]; continue; }
+  inputs.push({ label: pendingLabel || 'unknown', file: a });
+  pendingLabel = null;
+}
+
+if (inputs.length === 0) {
+  console.error('Usage: node summarize-attribution.js --label= file1.json --label= file2.json ...');
+  process.exit(1);
+}
+
+function loadSlice({ label, file }) {
+  const d = JSON.parse(fs.readFileSync(file, 'utf8'));
+  const rows = d.results.items;
+  const schema = rows[0]; // object: { col: type, ... }
+  const cols = Object.keys(schema);
+  const idxCode = cols.indexOf('error_code');
+  let idxWeek = cols.indexOf('wk'); if (idxWeek < 0) idxWeek = cols.indexOf('week');
+  let idxDevs = cols.indexOf('devs'); if (idxDevs < 0) idxDevs = cols.indexOf('countDevices');
+  if (idxCode < 0 || idxWeek < 0 || idxDevs < 0) {
+    throw new Error(`${file}: schema must include error_code, wk|week, devs|countDevices. Got [${cols.join(', ')}]`);
+  }
+  // The "dimension" column is the first string col that isn't error_code/week
+  const idxDim = cols.findIndex((c, i) => i !== idxCode && i !== idxWeek && i !== idxDevs && schema[c] === 'string');
+  if (idxDim < 0) throw new Error(`${file}: no string dimension column found`);
+
+  const map = {}; // code -> wk -> dim -> devs
+  for (const r of rows.slice(1)) {
+    const code = r[idxCode], wk = r[idxWeek], dim = r[idxDim] || '(blank)', devs = r[idxDevs] || 0;
+    ((map[code] ||= {})[wk] ||= {})[dim] = (map[code][wk][dim] || 0) + devs;
+  }
+  return { label, dimColumn: cols[idxDim], map };
+}
+
+const slices = inputs.map(loadSlice);
+
+// Collect (code, week) universe
+const universe = {};
+for (const s of slices) {
+  for (const [code, wks] of Object.entries(s.map)) {
+    for (const wk of Object.keys(wks)) {
+      ((universe[code] ||= {})[wk] = true);
+    }
+  }
+}
+
+const codes = Object.keys(universe).sort();
+for (const code of codes) {
+  console.log(`\n========== ${code} ==========`);
+  const wks = Object.keys(universe[code]).sort();
+  for (const wk of wks) {
+    console.log(`\n --- week ${wk.slice(0, 10)} ---`);
+    for (const s of slices) {
+      const dim = s.map[code]?.[wk] || {};
+      const total = Object.values(dim).reduce((x, y) => x + y, 0);
+      if (total === 0) continue;
+      console.log(` [${s.label}] total=${total.toLocaleString()}`);
+      Object.entries(dim)
+        .sort((a, b) => b[1] - a[1])
+        .slice(0, 5)
+        .forEach(([k, v]) => {
+          const pct = (v / total * 100).toFixed(1);
+          console.log(` ${pct.padStart(5)}% ${k} (${v.toLocaleString()})`);
+        });
+    }
+  }
+}
From c00e474c5f2d4b06570510ca137a6bfbbda9b381 Mon Sep 17 00:00:00 2001
From: Shahzaib
Date: Sun, 10 May 2026 16:16:50 -0700
Subject: [PATCH 2/6] Updates
---
 .../oncall-weekly-telemetry-report/SKILL.md | 183 +-
 .../assets/agg.js | 126 ++
 .../assets/bucket-trends.js | 49 +-
 .../assets/find-suspect-prs.ps1 | 102 +
 .../assets/kusto-cheatsheet.md | 75 +-
 .../assets/queries/60d-trend-codes.kql | 15 +
 .../assets/queries/60d-trend-types.kql | 10 +
 .../assets/queries/README.md | 33 +
 .../assets/queries/app-share.kql | 11 +
 .../assets/queries/attr-codes-by-dim.kql | 17 +
 .../assets/queries/attr-types-by-dim.kql | 15 +
 .../assets/queries/attr-union-by-dim.kql | 59 +
 .../assets/queries/broker-version-share.kql | 10 +
 .../queries/error-message-and-location.kql | 39 +
 .../assets/queries/latency.kql | 13 +
 .../assets/queries/os-version-slice.kql | 11 +
 .../assets/queries/reliability-auth-only.kql | 14 +
 .../queries/type-subcode-decomposition.kql | 13 +
 .../assets/queries/wow-movers.kql | 46 +
 .../assets/report-template.html | 1894 ++++++-----------
 .../assets/summarize-attribution.js | 189 +-
 .../assets/template-readme.md | 98 +
 .../assets/templates/README.md | 19 +
 .../assets/templates/sparkline-footer.html | 42 +
 .../assets/templates/spike-card.html | 129 ++
 .../assets/templates/traffic-attr-card.html | 46 +
 .../assets/validate-report.ps1 | 157 ++
 27 files changed, 2049 insertions(+), 1366 deletions(-)
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/agg.js
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/find-suspect-prs.ps1
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/README.md
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/template-readme.md
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/templates/README.md
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html
 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1
diff --git a/.github/skills/oncall-weekly-telemetry-report/SKILL.md b/.github/skills/oncall-weekly-telemetry-report/SKILL.md
index c9110e50..623ecdf9 100644
--- a/.github/skills/oncall-weekly-telemetry-report/SKILL.md
+++ b/.github/skills/oncall-weekly-telemetry-report/SKILL.md
@@ -15,17 +15,27 @@ Reusable helpers in [`assets/`](assets/):
 
 | File | Purpose |
 |---|---|
-| [`report-template.html`](assets/report-template.html) | Canonical layout — copy and replace data only, never restructure CSS |
-| [`kusto-cheatsheet.md`](assets/kusto-cheatsheet.md) | Schemas, helper funcs, gotchas, ready-to-paste KQL templates |
+| [`report-template.html`](assets/report-template.html) | Canonical layout — a real prior-week report kept verbatim. 
**Edit in place** (replace dates / values / verdicts / PR links); do not restyle. See [`template-readme.md`](assets/template-readme.md) for what to change vs leave alone. | +| [`template-readme.md`](assets/template-readme.md) | Author guide for `report-template.html` — what to change per week, color palette, CSS class quick-reference | +| [`kusto-cheatsheet.md`](assets/kusto-cheatsheet.md) | Schemas, helper funcs, gotchas, ready-to-paste KQL templates, AADSTS reference | | [`code-attribution-template.md`](assets/code-attribution-template.md) | Per-card checklist for the deep code-attribution block (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) | -| [`bucket-trends.js`](assets/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs` | -| [`summarize-attribution.js`](assets/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards | +| [`queries/`](assets/queries/) | Canonical KQL templates, one file per query — see [`queries/README.md`](assets/queries/README.md). Highlights: [`attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql) (NEW — all 7 dims in one round-trip), [`error-message-and-location.kql`](assets/queries/error-message-and-location.kql) (now accepts BOTH `` and `` in one call) | +| [`templates/`](assets/templates/) | Copy-paste HTML snippets (`spike-card.html`, `traffic-attr-card.html`, `sparkline-footer.html`) | +| [`bucket-trends.js`](assets/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs`. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop the partial in-progress bucket. | +| [`agg.js`](assets/agg.js) | Per-error per-dim top-N rollup with WoW deltas. Workhorse for filling spike-attribution dim blocks. 
| +| [`summarize-attribution.js`](assets/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards. Supports BOTH `--union ` (preferred for 2-week WoW; pairs with `attr-union-by-dim.kql`) AND legacy `--label= file.json` per-dim mode. | +| [`find-suspect-prs.ps1`](assets/find-suspect-prs.ps1) | Parallel `git log -S` + `--grep` across broker/ + common/ for a class/method symbol, with PR numbers + URLs. Run after the Originator pre-check identifies the throw-site class. | +| [`validate-report.ps1`](assets/validate-report.ps1) | Pre-publish validator. Catches stale tokens, devs/reqs leaks, mojibake (U+FFFD), and unbalanced `
` depth in Section 2 (the nested-callout bug). Run as part of Step 7. | --- ## Inputs to confirm with the user -1. **Reporting week** — defaults to the most recent complete week (Sun → Sat ending yesterday or today). **Confirm explicit dates with the user.** Note that Kusto's `startofweek()` is **Sunday-aligned**, so a user-spoken "week of May 3 → May 9" maps to the bucket `startofweek == 2026-05-03`. Off-by-one-week is the #1 silent error — verify by printing the distinct `startofweek` buckets from your first query and confirming the label matches the user's intent. +1. **Reporting week** — defaults to the **most recent complete Sun→Sat week**. If today is itself a Saturday or Sunday, the user often actually wants the **current in-progress week** instead — ASK explicitly. If they pick the in-progress week: + - Add the badge text *"Live data — current bucket may still be filling"* to the report header. + - The `bucket-trends.js` `--end` flag + the `| where week < datetime()` source filter both still apply (use the Sunday AFTER the reporting week as ``); they will drop the partial-end-bucket warning. + + Note that Kusto's `startofweek()` is **Sunday-aligned**, so a user-spoken "week of May 3 → May 9" maps to the bucket `startofweek == 2026-05-03`. Off-by-one-week is the #1 silent error — verify by printing the distinct `startofweek` buckets from your first query and confirming the label matches the user's intent. 2. **Comparison baseline** — defaults to the prior complete week. 3. **60-day window** — last 8 complete weeks (drop the partial start week when computing trend deltas). 4. **Output filename** — `$env:USERPROFILE\android-oce-reports\oncall-wow-report-YYYY-MM-DD.html`, where `YYYY-MM-DD` is the **Sunday `startofweek` bucket** of the reporting week (e.g. the report for the week of May 3 → May 9, 2026 is `oncall-wow-report-2026-05-03.html`). User-scoped, outside the workspace; the date matches the Kusto bucket label used throughout the report. 
@@ -37,24 +47,24 @@ If any of these are unstated, ask once, then proceed. ## Required sections (in order) 1. **Top-line health KPIs** — total requests, total devices, silent-auth reliability %, interactive reliability %, p95 latency on the hot spans. WoW delta on each. Inline SVG sparklines. -2. **Things that need attention this week** — three callouts: +2. **Things that need attention this week** — callouts: - **Denominator caveat** — explain any large total-spans device-count shift caused by span-emission changes (e.g. `goAsync()` refactors). Always state which denominator the report uses (auth-only: `SilentAuthStats` ∪ `InteractiveAuthStats`). - - **Real WoW regressions** worth investigation, with PR links. - - **Slow-burn 60-day regressions** (rising on 60d even when WoW looks flat). Link to the 60-Day Trend section. + - **🔴 WoW regressions (last 7 days)** — *one* callout listing every code/type that moved sharply WoW, **sorted by current-week device count descending**. Built from the union of (a) the standard WoW table and (b) [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) so small-but-recent spikes appear in the same list as the high-volume ones. Each row uses the `.item` flat-row pattern (see `assets/template-readme.md` § "Section 2 callouts"): name + inline metric chips + tags pushed right + one-line body + optional foot with `Attribution card →` link. **Section 2 rows are at-a-glance only** — do not duplicate the dim slicing / PR analysis / detailed verdict here; that belongs in the Section 4 spike-attribution card. Each row carries tags: `NEW` (first appeared this week or last), `60d↑` (also rising on 60d), and an originator chip (`broker` / `eSTS` / `Android` / `env`). Reader's eye prioritizes naturally by row order and tag combination — broker-tagged rows at the top demand the most attention. + - **Slow-burn 60-day regressions** — codes/types climbing on the 60d window that are flat WoW. 
Anything that *also* moved WoW belongs in the red callout above (with `60d↑`), not here. Link to the 60-Day Trend section. - **Real wins this week**, with PR links. - **Traffic shape** — flat / surge / collapse summary. -3. **📈 60-Day Trend Analysis** — built from the `ErrorStatsMetrics` materialized view over the last 8 complete weeks. **Run the bucketing pipeline FOUR times — the cross-product of `{error_code, error_type} × {devs, reqs}`** — and union the regression sets. An entry (code OR type) is flagged if it regresses on either metric. +3. **📈 60-Day Trend Analysis** — built from the `ErrorStatsMetrics` materialized view over the last 8 complete weeks. **Run the bucketing pipeline FOUR times — the cross-product of `{error_code, error_type} × {devices, requests}`** — and union the regression sets. An entry (code OR type) is flagged if it regresses on either metric. - - **% of devices** affected (`devsHit / authActiveDevs`) — catches errors hitting more users. - - **% of requests** affected (`errReqs / authTotalReqs`) — catches per-device retry storms (fewer users, more traffic per user). The previous report would have missed `kdfv2_key_derivation_error` (262 → 5,374 reqs on ~57 devices) without this dim. + - **% of devices** affected (`devicesHit / authActiveDevices`) — catches errors hitting more users. + - **% of requests** affected (`errRequests / authTotalRequests`) — catches per-device retry storms (fewer users, more traffic per user). The previous report would have missed `kdfv2_key_derivation_error` (262 → 5,374 requests on ~57 devices) without this dim. Categories: True 60d regression / Ephemeral 60d spike (peak-then-recover) / True 60d improvement / Flat. Every rising entry — whether `error_code` or `error_type` — gets the same Spike Attribution + Code Attribution treatment (Step 4 / Step 5). 
Always apply `MergeUiRequiredExceptions(error_type)` before bucketing on type; otherwise the 6+ string variants of `UiRequiredException` will each be tracked separately and skew the buckets. 4. **🔎 Spike Attribution** — one card per WoW regression AND per 60-day regression, **for both `error_code` and `error_type` regressions**. Each card slices on **all 7 dimensions** (broker version, span, active broker pkg, calling app, account type AAD/MSA, shared-device mode, client SKU). Each card ends with a **deep Code Attribution block** (see Step 4 for the required fields) and a Traffic Attribution verdict. 5. **🚚 Traffic Attribution** — top-level section listing every error whose spike is fully or partly explained by traffic volume from a specific calling app, rather than a code regression. If none qualify this week, render the section with an explicit "None this week" note. -6. **Error codes — WoW with stable denominator** — full table with `Δ reqs %` and `Δ devs %` columns and the 60d sparkline. -7. **Error types — WoW with stable denominator** — full table, **same columns and rigor as the error-codes table** (`Δ reqs %`, `Δ devs %`, 60d sparkline, status pill). Any regressing type also gets a spike-attribution card in Section 4. For composite types (e.g. `ClientException` is the umbrella for many sub-codes), include a **decomposition card** that breaks the WoW Δ down into the top 3 contributing sub-codes — so a `ClientException` −5 pp drop is explicitly attributed to e.g. `−8.5 pp timed_out_execution` + `−3.4 pp unknown_authority` + `−0.15 pp illegal_argument_exception`. +6. **Error codes — WoW with stable denominator** — full table with `Δ requests %` and `Δ devices %` columns and the 60d sparkline. +7. **Error types — WoW with stable denominator** — full table, **same columns and rigor as the error-codes table** (`Δ requests %`, `Δ devices %`, 60d sparkline, status pill). Any regressing type also gets a spike-attribution card in Section 4. For composite types (e.g. 
`ClientException` is the umbrella for many sub-codes), include a **decomposition card** that breaks the WoW Δ down into the top 3 contributing sub-codes — so a `ClientException` −5 pp drop is explicitly attributed to e.g. `−8.5 pp timed_out_execution` + `−3.4 pp unknown_authority` + `−0.15 pp illegal_argument_exception`. 8. **📊 Traffic analysis** — total requests/devices (WoW + 60d), top calling apps, top spans, **requests-per-device ratio** per error and overall (a rising ratio = retry storm; a falling ratio = caching gain), sampling-rate change indicator. 9. **Latency** — p50/p95/p99 by hot span. 10. **Broker version adoption** — week-over-week version share. @@ -81,14 +91,19 @@ $reportingSunday = '2026-05-03' # <-- replace with the confirmed reporting-wee $next = Join-Path $reportDir "oncall-wow-report-$reportingSunday.html" if (Test-Path $next) { - Write-Warning "$next already exists — confirm with the user before overwriting." + # Filename collision rule (per Hard Rules): do NOT silently overwrite. Open + # the existing report, identify its top-3 findings, and explicitly state in + # chat what changed in the new data before regenerating. + Write-Warning "$next already exists. Read it first, list its top-3 findings, and confirm a delta exists before regenerating." } Copy-Item c:\Users\shjameel\Repos\android-complete\.github\skills\oncall-weekly-telemetry-report\assets\report-template.html $next -Force Write-Host "Bootstrapped $next from skill template." ``` -Edit `$next` only. The template defines the layout, CSS, sparkline structure, attribution-card markup, and section ordering — **do not redesign these per week**. Replace the data inside each section with the current week's content; keep the structure verbatim. +Edit `$next` in place — the template ships as a real prior-week report (not a tokenized skeleton). 
**Walk top-to-bottom and replace every prior-week date / KPI value / table row / verdict / PR citation with current-week data.** The CSS, sparkline JS, section ordering, and attribution-card markup are canonical — do not redesign them. See [`assets/template-readme.md`](assets/template-readme.md) for the full guide on what to change vs leave alone, the sparkline color palette, and the CSS class reference. + +Mark any unfinished card or table cell with the literal sentinel `EXAMPLE CONTENT BELOW` inside an HTML comment — the final-pass validator (Step 7) greps for it. If the template ever needs structural improvements (new section, new card style, etc.), update `assets/report-template.html` in the skill folder and commit it so future weeks inherit the change. @@ -103,7 +118,7 @@ Use the Kusto MCP tool against: | Need | View | |------|------| | Per-error-code / per-error-type / per-span counts | `materialized_view('ErrorStatsMetrics')` | -| Total broker reqs / devices | `materialized_view('BrokerAdoptionStatsUpdated')` | +| Total broker requests / devices | `materialized_view('BrokerAdoptionStatsUpdated')` | | Silent auth reliability | `SilentAuthStatsAllRequestsMetrics` + `SilentAuthStatsRequestsWithoutExpectedErrorMetrics` | | Interactive auth reliability | `InteractiveAuthStatsAllRequestsMetrics` + `InteractiveAuthStatsRequestsWithoutExpectedErrorMetrics` | | Latency (p50/p95/p99) | `materialized_view('PerfStatsUpdated')` — use `percentile_tdigest(tdigest_merge(responseTimeTDigest), N, typeof(long))` | @@ -126,43 +141,51 @@ Don't pre-filter to a hand-picked top-N list — small-but-rising errors (e.g. ` #### 3a. Per-error-code trend +Use [`assets/queries/60d-trend-codes.kql`](assets/queries/60d-trend-codes.kql) (template; replace `` and `` tokens — `` is **exclusive** and equals the Sunday AFTER the reporting week, e.g. 
for a 2026-05-03 report use `2026-05-10`): + ```kql materialized_view('ErrorStatsMetrics') -| where EventInfo_Time > ago(70d) +| where EventInfo_Time between (datetime() .. datetime()) | where isnotempty(error_code) and error_code != 'success' | summarize errs = sum(countOverall), devs = dcount_hll(hll_merge(countDevicesHll)) by week = startofweek(EventInfo_Time), error_code +| where week < datetime() // drop partial in-progress week at the source | order by error_code asc, week asc ``` +**The `| where week < datetime()` line is mandatory.** Without it, if Kusto has crossed midnight UTC into the next Sunday, a tiny partial bucket lands as `last` and turns every code into a fake −99% improvement. `bucket-trends.js` will also auto-detect and warn about this, but filtering at the source is preferred. + #### 3b. Per-error-type trend (same rigor) ```kql materialized_view('ErrorStatsMetrics') | extend unified_error_type = MergeUiRequiredExceptions(error_type) -| where EventInfo_Time > ago(70d) +| where EventInfo_Time between (datetime() .. datetime()) | where isnotempty(unified_error_type) | summarize errs = sum(countOverall), devs = dcount_hll(hll_merge(countDevicesHll)) by week = startofweek(EventInfo_Time), unified_error_type +| where week < datetime() | order by unified_error_type asc, week asc ``` `MergeUiRequiredExceptions` is mandatory — without it the 6+ string variants of `UiRequiredException` (raw, fully-qualified, com.microsoft.identity.common.exception.*) each show as separate rows and skew the buckets. -#### 3c. Run the bucketer 4 times (cross-product of `{code, type} × {devs, reqs}`) +#### 3c. 
Run the bucketer 4 times (cross-product of `{code, type} × {devices, requests}`) ```pwsh # Error codes — by devices, then by requests -node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 -node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --metric=reqs +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --end=2026-05-10 +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --end=2026-05-10 --metric=reqs # Error types — by devices, then by requests -node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 -node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --metric=reqs +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --end=2026-05-10 +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js --start=2026-03-08 --end=2026-05-10 --metric=reqs ``` +`--end` is the Sunday AFTER the reporting week (exclusive). The script also auto-detects partial end-buckets and warns, but passing `--end` explicitly is safer. + Take the **union** of all four regression sets. Both `error_code` and `error_type` regressions get a spike-attribution card in Step 5. It will print regression / spike / improvement / flat buckets, sorted by peak. The thresholds (in case you need to tune): @@ -171,16 +194,38 @@ It will print regression / spike / improvement / flat buckets, sorted by peak. T - **Ephemeral 60d spike:** peak week is ≥3× the mean of the surrounding weeks (peak-then-recover shape). - **True 60d improvement:** `delta < −15%`. - **Flat:** otherwise. -- Codes/types with peak weekly devs `< 10K` (or peak weekly reqs `< 100K` when `--metric=reqs`) are filtered out (`--peak-floor=N` to override). 
+- Codes/types with peak weekly devices `< 10K` (or peak weekly requests `< 100K` when `--metric=reqs`) are filtered out (`--peak-floor=N` to override). **Why both axes matter:** -- *codes × reqs:* in v5, `kdfv2_key_derivation_error` spiked +1,951% on requests across only ~57 devices — a per-device retry storm device-only bucketing would have missed. +- *codes × requests:* in v5, `kdfv2_key_derivation_error` spiked +1,951% on requests across only ~57 devices — a per-device retry storm device-only bucketing would have missed. - *types × either:* `error_type` is the umbrella (e.g. `ClientException`, `ServiceException`, `UiRequiredException`) — a moving type that doesn't map cleanly to one moving code is a strong signal of a *new* sub-code being introduced or an existing one being reclassified (the v5 `ClientException` −10% drop was driven by `timed_out_execution` reclassification under PR #141, which would have been invisible from the codes table alone). -**Always present side-by-side WoW tables for BOTH error_code AND error_type** with `Δ reqs %` and `Δ devs %` columns; flag any row where either crosses threshold. +**Always present side-by-side WoW tables for BOTH error_code AND error_type** with `Δ requests %` and `Δ devices %` columns; flag any row where either crosses threshold. + +#### 3d. WoW movers query — MANDATORY pass to catch small-base movers + +The 60d bucketer's `--peak-floor=10000` exists for good reason (otherwise the 60d regression list would be 200+ tiny noise codes), but it **silently drops every code whose absolute weekly volume stays under 10K** — even if that code is brand-new or just spiked 5× WoW. Real examples this skill has missed in the past: + +- `Failed to parse JWT` — went `7 → 32 → 54 → 46 → 55 → 892 → 3,461` over 7 weeks (2-week-old NEW spike, real broker code in `IDToken.parseJWT:38`). Never crossed the 10K floor. +- `Code:-11` — sat at ~1,030 devs/wk for 7 weeks then jumped to 2,433 (+165% WoW). Sub-floor. 
+- `SSLHandshakeException` — devices flat at 260 but requests +186% WoW (per-device retry storm). The bucketer's reqs-axis floor (100K) just barely captures it but the device floor doesn't. + +To catch these, **always** run [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) **as a separate pass after the 60d bucketing**: + +```kql +// inputs: = reporting-week Sunday, = next Sunday (excl), +// = baseline-week Sunday +// floor: cDev>=500 OR cReq>=5000 move: |Δd|>=25% OR |Δr|>=50% OR new-this-week +``` + +Run it **twice — once for `error_code`, once for `error_type`**. **Merge its output rows into the same 🔴 WoW regressions callout as the standard WoW table** (sorted by current-week device count descending). Tag rows that came in via this pass with `NEW` if they were absent or near-zero in the prior week. Do *not* render this as a separate "emerging" callout — the size split is implementation detail; readers prioritize naturally by absolute device count + originator chip. + +For each WoW mover (regardless of size), you still owe the full Code Attribution treatment (Step 4). The dim-slicing pass (Step 5) is allowed to be deferred for sub-1K-device spikes if the throw-site + dominant message already pin the originator unambiguously — but say so explicitly in the card ("dims not yet sliced — file the bug first; pull dims if it persists"). ### Step 4 — Code attribution (deep PR correlation) +> ⚠️ **HARD RULE — Originator pre-check.** Before claiming `Originator: Broker` on any card, you MUST run [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) for that error code (or type) and read **(a) the throw-site stack and (b) the top 3 `error_message` strings**. Most broker error codes flow through `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult, clientExceptionFromException}` — which intentionally bridge eSTS responses into broker exceptions. 
**If the throw site is in any of those three methods AND the error_message starts with `AADSTS`, the originator is eSTS, not broker.** See the AADSTS reference table in [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). Cards that skip this step must be marked low-confidence, not high. + For every regression card, the Code Attribution block **must** populate the following fields. Shallow PR-citation only is not acceptable. Use [`assets/code-attribution-template.md`](assets/code-attribution-template.md) as the per-card checklist. | Field | What goes in it | How to find it | @@ -196,17 +241,26 @@ For every regression card, the Code Attribution block **must** populate the foll #### PR-grep workflow +**Read the full PR window first, then reason — don't `--grep` blind.** The 4-week window across `broker/` and `common/` typically returns <30 PRs total, small enough to read end-to-end. Targeted `--grep` matches will miss PRs whose titles don't mention the error string (most of them). + ```pwsh cd c:\Users\shjameel\Repos\android-complete\broker -git log --since='' --until='' --oneline ` - --grep='||' -i +git log --since='' --until='' --pretty=format:'%h | %ai | %an | %s' cd ..\common -git log --since='' --until='' --oneline ` - --grep='||' -i +git log --since='' --until='' --pretty=format:'%h | %ai | %an | %s' ``` -When the error name doesn't directly grep (e.g. `timed_out_execution`), grep for related concepts: `timeout`, `coroutine`, `executor`, `cancellation`, `thread pool`, `cache`, `authority`, etc. Then for each candidate PR, **read the diff at the throw site** to confirm it actually touches the failing code path — don't cite a PR just because it grep-matched. +For each candidate PR, **read the diff** to confirm it touches the throw site / wrapper class identified in the Originator pre-check. Don't cite a PR just because the title mentions a related concept. 
+ +For focused follow-up by class/method name, use the helper: + +```pwsh +# Searches both repos in parallel via `git log -S` (pickaxe on diff) AND `--grep` (subject). +# Returns a unified table: repo | date | author | sha | PR# | URL | subject. +.\.github\skills\oncall-weekly-telemetry-report\assets\find-suspect-prs.ps1 ` + -Symbol 'ExceptionAdapter' -Since 2026-04-01 -Until 2026-05-09 +``` #### Repo URL patterns for citations @@ -227,7 +281,9 @@ For errors with no broker code in the stack (Android system errors like `Code:-1 **`ErrorStatsMetrics` already carries `account_type` and `is_shared_device`** (use the `MergeAccountType` / `MergeIsSharedDevice` helpers to normalize) — so you do **not** need a fallback to raw `android_spans` for these dims. Earlier versions of this skill claimed otherwise; that was wrong. The only dim that requires `android_spans` is `DeviceInfo_OsVersion` (OEM/version slicing). -Slice on **all 7 dimensions** for each spike. Run **one query per dimension** (multi-dim cartesians from MCP can return >500 KB of JSON and risk truncation). For `error_type` cards, swap `error_code in (codes)` for `unified_error_type in (types)` and aggregate by the `MergeUiRequiredExceptions(error_type)` extension — otherwise everything else is identical. +Slice on **all 7 dimensions** for each spike. **Preferred for 2-week WoW attribution: one union query that covers all 7 dims for all regressions in a single round-trip** — see [`assets/queries/attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql). Typical payload for 8 codes × 2 weeks × 7 dims is ~800 KB, well under the MCP limit. Pipe the result into `summarize-attribution.js --union ` (which prints per-dim top-N share + Δ devices + Δ requests for every code). Fall back to the per-dim form ([`attr-codes-by-dim.kql`](assets/queries/attr-codes-by-dim.kql)) only when (a) you need a wider time window, or (b) the union response exceeds payload size. 
+ +For `error_type` cards, swap `error_code in (codes)` for `unified_error_type in (types)` and aggregate by the `MergeUiRequiredExceptions(error_type)` extension — otherwise everything else is identical. | # | Dimension | Source | Cross-check | |---|-----------|--------|-------------| @@ -237,7 +293,7 @@ Slice on **all 7 dimensions** for each spike. Run **one query per dimension** (m | 4 | Calling package | `ErrorStatsMetrics` group by `calling_package_name` | If 1–2 callers dominate, this is likely a traffic-attribution case (see Step 6) | | 5 | Account type (AAD vs MSA) | `ErrorStatsMetrics`, `extend t = MergeAccountType(account_type)` group by `t` | If the split deviates significantly from fleet (~85% AAD / 15% MSA), call it out | | 6 | Shared device mode | `ErrorStatsMetrics`, `extend s = MergeIsSharedDevice(is_shared_device)` group by `s` | Shared-device fleets have very different error profiles | -| 7 | OS version | `android_spans` filtered by `error_code in (codes)` (or `error_type in (types)`) and a tight time window, group by `DeviceInfo_OsVersion` | OEM-specific Android quirks, especially for `io_error`, `unknown_crypto_error`, `null_pointer_error` | +| 7 | OS version | [`assets/queries/os-version-slice.kql`](assets/queries/os-version-slice.kql) — raw `android_spans`, group by `DeviceInfo_OsVersion` | **On-demand only** — slice OS-version when EITHER (a) the wrapper class is in `ExceptionAdapter.clientExceptionFromException` (catch-all wrapping a system exception, where the OEM/version often is the cause), OR (b) the error code is one of `Code:-6`, `Code:-10`, `Code:-11`, `unknown_crypto_error`, `io_error`, `null_pointer_error`. Otherwise mark the dim row as "not sliced this week — no OEM concentration suspected" and move on. Slicing OS-version on every card wastes a raw-spans query without changing the verdict. 
| #### Type cards have one extra required dimension: sub-code decomposition @@ -258,7 +314,16 @@ materialized_view('ErrorStatsMetrics') Cite the dominant sub-codes inline in the type card's verdict (e.g. *"`ClientException` −10.2% drop is dominated by −8.5 pp `timed_out_execution` + −3.4 pp `unknown_authority`"*) and link to those sub-codes' own attribution cards. The deep Code Attribution block (Step 4) for the type card itself focuses on the **wrapper / catch-and-rethrow** path that defines the type (e.g. `BaseException.java`, `ServiceException.java` constructors), not on each sub-code. -Feed the seven JSON outputs into the helper to roll up dim shares per (error_code, week): +Feed the union JSON output into the summarizer (one round-trip): + +```pwsh +# Union mode (preferred). attr-union.json comes from attr-union-by-dim.kql. +node .github\skills\oncall-weekly-telemetry-report\assets\summarize-attribution.js ` + --union attr-union.json --top=5 +# For type cards, add --key=unified_error_type +``` + +Legacy per-dim mode (one JSON per dimension) is still supported for the rare wider-time-window case: ```pwsh node .github\skills\oncall-weekly-telemetry-report\assets\summarize-attribution.js ` @@ -266,12 +331,12 @@ node .github\skills\oncall-weekly-telemetry-report\assets\summarize-attribution. --label=calling_app app.json ` --label=active_broker ab.json ` --label=broker_version ver.json ` - --label=account_type acct.json ` - --label=shared_device shared.json ` - --label=os_version os.json + --label=acct_type acct.json ` + --label=shared_dev shared.json ` + --label=client_sku sku.json ``` -Ready-to-paste KQL for the per-dimension query is in [`assets/kusto-cheatsheet.md` § 8c](assets/kusto-cheatsheet.md). +Ready-to-paste KQL for both forms: union → [`assets/queries/attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql); per-dim → [`assets/kusto-cheatsheet.md` § 8c](assets/kusto-cheatsheet.md). 
**Concentration thresholds** (paint the dim bar red): - > 80% in a single value → strong attribution (one root cause) @@ -356,6 +421,22 @@ Add a top-level **🚚 Traffic Attribution** section that lists every error matc ### Step 7 — Validate & write +Run the bundled validator FIRST — it covers all the silent-failure cases this skill has tripped on in the past: + +```pwsh +.\.github\skills\oncall-weekly-telemetry-report\assets\validate-report.ps1 +# defaults to most-recent oncall-wow-report-*.html under ~/android-oce-reports/ +# pass -Path explicitly to validate a specific file +``` + +The validator hard-fails on: +1. Stale `{{...}}` tokens or `EXAMPLE CONTENT BELOW` / `EXAMPLE_*` sentinels.
+2. `devs` / `reqs` in user-facing text (KQL inside `<pre>` is exempted).
+3. `U+FFFD` replacement characters (catches mojibake from emoji edits).
+4. Unbalanced `<div>` depth in the Section 2 attention block (catches the inception-style nested-callout bug from past runs).
+5. A second callout opening before the previous one closes (nested-callout sanity check). + +Then: - Run `get_errors` on the HTML file (no errors expected — pure HTML/CSS). - Verify no stale phrases from prior weeks remain (`Select-String` for retracted hypotheses, prior week's PR numbers). - Verify every PR link in the new file is reachable (the file paths just before the link should match what `git log` returned). @@ -369,10 +450,18 @@ Add a top-level **🚚 Traffic Attribution** section that lists every error matc - **Never sum percentiles.** Latency is a TDigest sketch — `percentile_tdigest(tdigest_merge(responseTimeTDigest), N, typeof(long))` only. - **Always apply `MergeAccountType` / `MergeIsSharedDevice` / `MergeUiRequiredExceptions`** so this report agrees with the dashboard. - **Confirm the week bucket label matches the user's intent** before writing the rest of the queries (Sunday-aligned). -- **Never claim "auxiliary spans" or denominator artifacts** without verifying the diff between broker versions in the actual commits. +- **Always filter the partial in-progress week at the source** with `| where week < datetime()` where the `datetime()` argument is the Sunday immediately after the reporting week. Otherwise `bucket-trends.js` will show every error as a fake −99% improvement once UTC has crossed midnight Sunday. +- **Originator pre-check is mandatory.** A card cannot claim `Originator: Broker` without first running [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) and reading the throw site + top 3 `error_message` strings. If the throw site is in `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}` AND the message starts with `AADSTS`, the originator is **eSTS, not broker** — see the AADSTS reference in [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). 
+- **WoW-movers pass is mandatory.** The 60d bucketer's `--peak-floor` silently drops sub-10K-device codes, so [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) MUST be run as a separate pass for both `error_code` and `error_type` (per Step 3d). Its output is **merged into the single 🔴 WoW regressions callout**, sorted by current-week device count descending, with rows tagged `NEW` / `60d↑` / originator chip. Do not render a separate "emerging" callout. Skipping the pass is how the Apr 26 `Failed to parse JWT` spike (7 → 3,461 devs over 7 weeks) hid for two reports running. +- **Section 2 callouts are at-a-glance, Section 4 is the deep dive.** WoW / Slow-burn / Wins items in Section 2 use the `.item` flat-row pattern (no nested cards, no per-item left bars — the parent `.callout` border is the only severity affordance). Each row is a single line of metric chips + a one-line body + an `Attribution card →` link to the corresponding `.attr-card` in Section 4. Do NOT duplicate the dim slicing, PR analysis, or detailed verdict between the two sections — Section 4 is where that lives. See [`assets/template-readme.md`](assets/template-readme.md) for the CSS class reference and the example `.item` markup.
+- **Never use bash/PowerShell regex to bulk-edit balanced HTML.** This skill has been burned twice by regex strip scripts that ate matched-pair `</div>` closes, producing inception-style nested-callout bugs that take a depth-tracking script to find. If you need a structural change to the HTML, make a targeted, single-occurrence string replacement (with explicit before/after context) or rewrite the affected block end-to-end. Never run a `-replace` across the whole file expecting it to leave balance intact. +- **Denominator caveat must cite evidence, not hand-wave.** If you flag a large all-spans device-count shift, run [`assets/queries/broker-version-share.kql`](assets/queries/broker-version-share.kql) and name the version cohort the shift moved with. Do not write "recurring telemetry-shape artifact" without backing data; if you don't have it, drop the callout. +- **"Recovery" still merits a PR citation.** When an error pins to a single old broker version and recovers as that version retires, look for the **fix PR in the version that replaced it** before calling it a "natural rolloff." Often the fix is real and just under-credited. - **Never report WoW-only verdicts** for errors that are flat-or-down WoW but rising on 60d — always cross-check both windows. - **Never page** based on a regression that turns out to be downstream of a denominator shift; always include the auth-only-denominator number alongside the all-spans number. - **Always cite PRs** with full GitHub URLs (the repo URL patterns above), not bare commit SHAs. +- **Filename collision rule.** If a report file already exists for the same Sunday bucket, do not silently overwrite. Open the existing report, list its top-3 findings, and explicitly state in chat what changed in the new data before regenerating. A second run on the same week without a delta is wasted work. +- **No `devs` / `reqs` in user-facing strings.** All UI text — callouts, table headers, KPI labels, verdicts, badges — must say `devices` and `requests`. Internal variable / column / file names in scripts and JSON can stay short. 
- **Do not create a separate Markdown summary** of the report — the HTML *is* the deliverable. - **Do not commit** the report file. It lives in `$env:USERPROFILE\android-oce-reports\` (outside the workspace) precisely so it can't be staged accidentally. @@ -380,16 +469,20 @@ Add a top-level **🚚 Traffic Attribution** section that lists every error matc ## Output checklist -- [ ] New `oncall-wow-report-YYYY-MM-DD.html` (where `YYYY-MM-DD` is the reporting-week Sunday) exists at `$env:USERPROFILE\android-oce-reports\` (NOT at repo root). +- [ ] New `oncall-wow-report-YYYY-MM-DD.html` (where `YYYY-MM-DD` is the reporting-week Sunday) exists at `$env:USERPROFILE\android-oce-reports\` (NOT at repo root). If a file for this Sunday already existed, the chat session explicitly stated what changed before regenerating. - [ ] All sections present and populated (incl. 🚚 Traffic Attribution — even if “None this week”) -- [ ] **60-day trend bucketing run on the full cross-product** — `{error_code, error_type} × {devs, reqs}` = 4 runs — union of regressions reported. Per-request retry storms (e.g. small device pool, exploding request count) are flagged on both axes. -- [ ] **Both error-codes AND error-types WoW tables have `Δ reqs %` and `Δ devs %` columns**, the 60d sparkline, and a status pill. Any row crossing threshold on either metric is in the regression list. -- [ ] Every WoW regression AND every 60d regression — **for both `error_code` and `error_type`** — has its own spike-attribution card with all 7 dimensions sliced. +- [ ] **60-day trend bucketing run on the full cross-product** — `{error_code, error_type} × {devices, requests}` = 4 runs — union of regressions reported. Per-request retry storms (e.g. small device pool, exploding request count) are flagged on both axes. Source KQL filtered the partial in-progress week with `| where week < datetime()`. 
+- [ ] **WoW-movers pass run** ([`wow-movers.kql`](assets/queries/wow-movers.kql)) for BOTH `error_code` and `error_type`. Its output rows are **merged into the single 🔴 WoW regressions callout in Section 2** (sorted by curr-week devices descending), each row tagged `NEW` / `60d↑` / originator chip. No separate "emerging" callout. Every row carries throw-site, dominant message, originator, and a next step. If the WoW callout is empty (rare), render "None this week" rather than omit. +- [ ] **Both error-codes AND error-types WoW tables have `Δ requests %` and `Δ devices %` columns**, the 60d sparkline, and a status pill. Any row crossing threshold on either metric is in the regression list. +- [ ] Every WoW regression AND every 60d regression — **for both `error_code` and `error_type`** — has its own spike-attribution card with all 7 dimensions sliced. Cards are built from [`assets/templates/spike-card.html`](assets/templates/spike-card.html). - [ ] **Every `error_type` regression card includes the 8th-dimension sub-code decomposition** showing the top 3–5 contributing `error_code`s with their Δ vs prior week, and links to those sub-codes' own attribution cards. -- [ ] **Every regression card's Code Attribution block populates Originator + Top throw site + Wrapper + Caller hot-spots + Underlying cause + Top error_messages + Likely PRs (with confidence/why-it's-the-suspect) + Next step (with named owner)** — per [`assets/code-attribution-template.md`](assets/code-attribution-template.md). For type cards, the wrapper field focuses on the type's catch-and-rethrow site (e.g. `BaseException`, `ServiceException` constructor). Shallow PR-only attribution is not acceptable. +- [ ] **Originator pre-check has been run for every broker-tagged card** ([`error-message-and-location.kql`](assets/queries/error-message-and-location.kql)). Throw site and top 3 `error_message` strings are populated from real data, not from the code map. 
AADSTS-prefixed messages are tagged `eSTS`, not `Broker`. +- [ ] **Every regression card's Code Attribution block populates Originator + Top throw site + Wrapper + Caller hot-spots + Underlying cause + Top error_messages + Likely PRs (with confidence/why-it's-the-suspect) + Next step (with named owner)**. For type cards, the wrapper field focuses on the type's catch-and-rethrow site (e.g. `BaseException`, `ServiceException` constructor). Shallow PR-only attribution is not acceptable. - [ ] Non-broker errors are explicitly tagged `environmental` / `non-broker` with confidence `none` — not invented broker PRs. - [ ] Traffic analysis covers totals, per-app, per-span, requests-per-device ratio (per error AND overall), and a sampling-change check. - [ ] **Every material traffic shift (>10% on any segment, up or down) has a reasoning paragraph** that names the dominant span/app/active-broker/broker-version, and either cites a causal PR (with confidence) — span removed/added, `goAsync()` refactor, sampling change, caller-side SDK release, ECS flight ramp — or explicitly says "no PR identified, suspect X" rather than leaving it unexplained. +- [ ] Denominator caveat (if used) is backed by [`broker-version-share.kql`](assets/queries/broker-version-share.kql) evidence naming the responsible version cohort. No hand-waving. - [ ] Auth-only denominator used for all reliability %s, denominator caveat called out at top. -- [ ] No stale text from previous weeks. +- [ ] No `\bdevs\b` or `\breqs\b` in user-facing text. (`Select-String -Pattern '\bdevs\b|\breqs\b' -CaseSensitive:$false` returns 0.) +- [ ] No stale text from previous weeks. (`Select-String -Pattern 'EXAMPLE CONTENT BELOW'` returns 0 — that's the unfinished-section sentinel. The template no longer ships `{{TOKEN}}` placeholders since v2; if the file still contains any `{{`, that's also a leftover.) - [ ] `get_errors` clean on the HTML file. 
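
Reviewer note (not part of the patch): the checklist above relies on an exclusive end bound, the Sunday immediately after the reporting week, to drop the partial in-progress bucket. The following is a minimal sketch of that date arithmetic in Node, assuming Sunday-aligned UTC weeks as the skill describes. The helper names `startOfWeekUtc` and `exclusiveEndSunday` are hypothetical and do not exist in the skill's assets.

```javascript
// Sketch only: hypothetical helpers, not shipped with the skill.
const DAY_MS = 24 * 60 * 60 * 1000;

// Parse a YYYY-MM-DD string as midnight UTC; reject anything else.
function parseUtcDate(isoDate) {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(isoDate)) {
    throw new Error(`Expected YYYY-MM-DD, got '${isoDate}'`);
  }
  const d = new Date(`${isoDate}T00:00:00Z`);
  if (Number.isNaN(d.getTime())) {
    throw new Error(`Invalid date '${isoDate}'`);
  }
  return d;
}

// Format a Date back to YYYY-MM-DD (UTC).
function toIsoDate(d) {
  return d.toISOString().slice(0, 10);
}

// Sunday-aligned start of the week containing isoDate (getUTCDay() is 0 on Sunday).
function startOfWeekUtc(isoDate) {
  const d = parseUtcDate(isoDate);
  return toIsoDate(new Date(d.getTime() - d.getUTCDay() * DAY_MS));
}

// Exclusive upper bound for the reporting week: the following Sunday.
function exclusiveEndSunday(reportSunday) {
  const d = parseUtcDate(reportSunday);
  if (d.getUTCDay() !== 0) {
    throw new Error(`'${reportSunday}' is not a Sunday; align the reporting week first`);
  }
  return toIsoDate(new Date(d.getTime() + 7 * DAY_MS));
}

console.log(exclusiveEndSunday('2026-05-03')); // 2026-05-10
```

The result matches the example in `assets/queries/README.md` (a 2026-05-03 report uses 2026-05-10). Rejecting a non-Sunday input, rather than silently rounding it, is a deliberate choice in this sketch so that a mislabeled week bucket fails loudly before any query is written.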
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/agg.js b/.github/skills/oncall-weekly-telemetry-report/assets/agg.js new file mode 100644 index 00000000..349e8d2d --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/agg.js @@ -0,0 +1,126 @@ +#!/usr/bin/env node +/** + * agg.js — Per-error per-dimension top-N rollup with WoW deltas. + * + * Companion to bucket-trends.js / summarize-attribution.js. Whereas + * summarize-attribution.js is for the cross-dimension cartesian roll-up + * across many dims, this script is the daily workhorse: take one + * "per-week × per-error × per-(one dim)" Kusto JSON file, print a + * human-readable per-error breakdown of the top-N values of that dim + * with previous-week vs current-week counts and a Δ%. + * + * Designed for the Spike Attribution cards. Run once per dim per error + * cluster (span_name, calling_package_name, broker_version, etc.), + * paste the output into the card. + * + * Input shape: a Kusto MCP JSON file produced by: + * + * let codes = dynamic([...]); + * materialized_view('ErrorStatsMetrics') + * | where EventInfo_Time between (datetime() .. datetime()) + * | where error_code in (codes) // or unified_error_type in (types) + * | extend wk = startofweek(EventInfo_Time) + * | where wk < datetime() // drop partial end! + * | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + * errs = sum(countOverall) + * by wk, error_code, + * | order by error_code asc, wk asc, devs desc + * + * Usage: + * node agg.js [ ...] [--top=N] [--metric=devs|reqs] + * + * error_key: "error_code" or "ut" (when extended from MergeUiRequiredExceptions) + * dim_col: the column you grouped by (e.g. 
span_name, calling_package_name) + * if you pass multiple, they are joined with " | " into a composite key + * --top=5 (default) top-N rows per error + * --metric=devs (default) | reqs + */ +const fs = require('fs'); + +const args = process.argv.slice(2); +const positional = args.filter(a => !a.startsWith('--')); +const file = positional[0]; +const errKey = positional[1] || 'error_code'; +const dimCols = positional.slice(2); +const topN = +((args.find(a => a.startsWith('--top=')) || '').split('=')[1] || 5); +const metric = ((args.find(a => a.startsWith('--metric=')) || '').split('=')[1] || 'devs').toLowerCase(); + +if (!file || dimCols.length === 0) { + console.error('Usage: node agg.js [ ...] [--top=N] [--metric=devs|reqs]'); + process.exit(1); +} +if (!['devs', 'reqs'].includes(metric)) { + console.error("--metric must be 'devs' or 'reqs'"); + process.exit(1); +} + +function load(file) { + const j = JSON.parse(fs.readFileSync(file, 'utf8')); + const items = j.results.items.slice(1); + const schema = Object.keys(j.results.items[0]); + return { items, schema }; +} + +function pct(a, b) { + if (!b) return a ? '+inf' : '0'; + return ((a - b) / b * 100).toFixed(1) + '%'; +} + +const { items, schema } = load(file); +const wkIdx = schema.indexOf('wk'); +const errIdx = schema.indexOf(errKey); +const valIdx = schema.indexOf(metric === 'devs' ? 'devs' : 'errs'); +const dimIdxs = dimCols.map(c => { + const i = schema.indexOf(c); + if (i < 0) { + console.error(`Column '${c}' not found in schema: ${schema.join(', ')}`); + process.exit(2); + } + return i; +}); +if (wkIdx < 0 || errIdx < 0 || valIdx < 0) { + console.error(`Required columns missing. schema=${schema.join(', ')} need wk, ${errKey}, ${metric === 'devs' ? 
'devs' : 'errs'}`); + process.exit(2); +} + +// group: err -> dimkey -> wk -> value +const m = {}; +const wks = new Set(); +for (const r of items) { + const wk = r[wkIdx], err = r[errIdx], val = r[valIdx]; + const dimKey = dimIdxs.map(i => (r[i] === null || r[i] === undefined || r[i] === '') ? '(blank)' : r[i]).join(' | '); + wks.add(wk); + m[err] = m[err] || {}; + m[err][dimKey] = m[err][dimKey] || {}; + m[err][dimKey][wk] = (m[err][dimKey][wk] || 0) + val; +} +const sortedWks = [...wks].sort(); +if (sortedWks.length < 2) { + console.warn(`[agg] WARN: only ${sortedWks.length} week bucket(s) in input — need >= 2 for WoW deltas.`); +} +const prevWk = sortedWks[0], curWk = sortedWks[sortedWks.length - 1]; + +console.log(`# ${file} (dim: ${dimCols.join(' + ')}, metric: ${metric})`); +console.log(`# WoW: ${prevWk.slice(0, 10)} -> ${curWk.slice(0, 10)}\n`); + +for (const err of Object.keys(m).sort()) { + const rows = Object.entries(m[err]).map(([k, v]) => ({ + key: k, + prev: v[prevWk] || 0, + cur: v[curWk] || 0, + })); + const total = rows.reduce((s, r) => s + r.cur, 0); + rows.sort((a, b) => b.cur - a.cur); + console.log(`## ${err} (cur-week ${metric}=${total.toLocaleString()})`); + for (const r of rows.slice(0, topN)) { + const share = total ? 
(r.cur / total * 100).toFixed(1) : '0'; + console.log( + ' ' + share.padStart(5) + '%' + + ' Δ ' + pct(r.cur, r.prev).padStart(8) + + ' prev=' + String(r.prev).padStart(11) + + ' cur=' + String(r.cur).padStart(11) + + ' ' + r.key + ); + } + console.log(''); +} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js b/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js index 36d2a4f5..5e4091f2 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js +++ b/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js @@ -5,17 +5,25 @@ * Input: a Kusto MCP JSON result file from a query of the form: * * materialized_view('ErrorStatsMetrics') - * | where EventInfo_Time > ago(70d) + * | where EventInfo_Time between (datetime() .. datetime()) * | where isnotempty(error_code) and error_code != 'success' * | summarize errs=sum(countOverall), * devs=dcount_hll(hll_merge(countDevicesHll)) * by week=startofweek(EventInfo_Time), error_code + * | where week < datetime() // drop partial end-week! * | order by error_code asc, week asc * * (Use dcount_hll on countDevicesHll, NOT sum(countDevices) — see kusto-cheatsheet.md.) * * Usage: - * node bucket-trends.js [--start=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs] + * node bucket-trends.js + * [--start=YYYY-MM-DD] [--end=YYYY-MM-DD] # inclusive start, EXCLUSIVE end (week-bucket) + * [--peak-floor=N] [--metric=devs|reqs] + * + * --start defaults to the second-earliest week in the data (drops partial start week). + * --end defaults to the most recent week, but the script will WARN-AND-DROP the most + * recent bucket when its total is < 30% of the median of the prior 3 buckets, because + * that is likely a partial end-week and would turn every error into a fake -99% improvement. 
* * --metric=devs (default) buckets on weekly device counts (catches errors hitting more users) * --metric=reqs buckets on weekly request counts (catches per-device retry storms) @@ -34,6 +42,7 @@ const fs = require('fs'); const args = process.argv.slice(2); const file = args.find(a => !a.startsWith('--')); const startArg = (args.find(a => a.startsWith('--start=')) || '').split('=')[1]; +const endArg = (args.find(a => a.startsWith('--end=')) || '').split('=')[1]; const metric = ((args.find(a => a.startsWith('--metric=')) || '').split('=')[1] || 'devs').toLowerCase(); if (!['devs', 'reqs'].includes(metric)) { console.error(`--metric must be 'devs' or 'reqs', got '${metric}'`); @@ -44,7 +53,7 @@ const peakFloor = +((args.find(a => a.startsWith('--peak-floor=')) || '').split( const metricIdx = metric === 'reqs' ? 0 : 1; // [errs, devs] tuple if (!file) { - console.error('Usage: node bucket-trends.js [--start=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs]'); + console.error('Usage: node bucket-trends.js [--start=YYYY-MM-DD] [--end=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs]'); process.exit(1); } @@ -57,10 +66,42 @@ for (const [w, code, errs, devs] of items) { } const weeks = [...new Set(items.map(r => r[0]))].sort(); const startISO = startArg ? `${startArg}T00:00:00Z` : weeks[1]; // drop partial start week by default -const keep = weeks.filter(w => w >= startISO); +const endISO = endArg ? `${endArg}T00:00:00Z` : null; // exclusive cutoff + +// --- Partial end-week detection --------------------------------------------- +// Compute the total devices/requests per bucket as a proxy for completeness. +// If the most recent bucket is < 30% of the median of the prior 3 buckets, it's +// almost certainly partial — drop it and warn. This catches the common case of +// running the report at 09:00 UTC Sunday and getting 9 hours of data in the +// "last week" bucket. (Caveat: real fleet collapses also look like this; warn, +// don't crash.) 
+function bucketTotal(w) { + let t = 0; + for (const wd of Object.values(series)) { + const v = wd[w]; + if (v) t += v[metricIdx]; + } + return t; +} +const totals = weeks.map(w => ({ w, t: bucketTotal(w) })); +const medianOf = arr => { const s = [...arr].sort((a,b)=>a-b); return s[Math.floor(s.length/2)] || 0; }; +let droppedPartial = null; +if (!endArg && weeks.length >= 4) { + const last = totals[totals.length - 1]; + const prevMedian = medianOf(totals.slice(-4, -1).map(x => x.t)); + if (prevMedian > 0 && last.t < prevMedian * 0.3) { + droppedPartial = last.w; + console.warn(`[bucket-trends] WARN: dropping likely-partial end bucket ${last.w} (total=${last.t.toLocaleString()} vs median-of-prior-3=${prevMedian.toLocaleString()}). Pass --end=YYYY-MM-DD to override or filter in KQL.`); + } +} + +const keep = weeks.filter(w => w >= startISO && (endISO ? w < endISO : true) && w !== droppedPartial); console.log('All weeks: ', weeks); console.log('Trend weeks: ', keep, `(${keep.length} complete)`); console.log('Metric: ', metric, `(peak floor=${peakFloor.toLocaleString()})`); +if (keep.length < 4) { + console.warn(`[bucket-trends] WARN: only ${keep.length} kept weeks — trend buckets will be unstable. Need >= 4 for meaningful regression/improvement classification.`); +} const buckets = { regression: [], spike: [], improvement: [], flat: [] }; for (const [code, wd] of Object.entries(series)) { diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/find-suspect-prs.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/find-suspect-prs.ps1 new file mode 100644 index 00000000..10c2daef --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/find-suspect-prs.ps1 @@ -0,0 +1,102 @@ +<# +.SYNOPSIS + Find candidate PRs touching a class / file / method, across broker/ and common/ in parallel. + +.DESCRIPTION + Speeds up the PR-grep workflow in SKILL.md Step 4. 
Given a class name (or + arbitrary regex), runs `git log -S` (pickaxe) AND `git log --grep` against + both broker/ and common/ over the supplied window, then prints a unified + table sorted by date. + + Use this AFTER you have identified the throw-site / wrapper class from the + Originator pre-check (assets/queries/error-message-and-location.kql). + +.PARAMETER Symbol + String to search for in commit diffs (passed to `git log -S`). Typically + the class name or method that hosts the throw site, e.g. + 'ExceptionAdapter', 'clientExceptionFromException', 'getKnownAuthorityResult'. + +.PARAMETER GrepRegex + Optional regex for `git log --grep` (commit message). Defaults to $Symbol. + +.PARAMETER Since + Inclusive start date (yyyy-MM-dd). Defaults to 28 days ago. + +.PARAMETER Until + Inclusive end date. Defaults to today. + +.PARAMETER RepoRoot + Defaults to C:\Users\\Repos\android-complete. Overrides via -RepoRoot. + +.EXAMPLE + .\find-suspect-prs.ps1 -Symbol ExceptionAdapter -Since 2026-04-01 + +.EXAMPLE + .\find-suspect-prs.ps1 -Symbol clientExceptionFromException -Since 2026-04-01 -Until 2026-05-09 + +.NOTES + Cites repos with the URL pattern in SKILL.md (broker -> ad-accounts-for-android, + common -> microsoft-authentication-library-common-for-android). 
+#> +[CmdletBinding()] +param( + [Parameter(Mandatory=$true)][string]$Symbol, + [string]$GrepRegex, + [string]$Since = (Get-Date).AddDays(-28).ToString('yyyy-MM-dd'), + [string]$Until = (Get-Date).ToString('yyyy-MM-dd'), + [string]$RepoRoot = (Join-Path $env:USERPROFILE 'Repos\android-complete') +) + +if (-not $GrepRegex) { $GrepRegex = [regex]::Escape($Symbol) } + +$repos = @( + @{ Name='broker'; Path=(Join-Path $RepoRoot 'broker'); UrlBase='https://github.com/identity-authnz-teams/ad-accounts-for-android/pull/' } + @{ Name='common'; Path=(Join-Path $RepoRoot 'common'); UrlBase='https://github.com/AzureAD/microsoft-authentication-library-common-for-android/pull/' } +) + +$results = @() +foreach ($r in $repos) { + if (-not (Test-Path $r.Path)) { Write-Warning "Repo path not found: $($r.Path)"; continue } + Push-Location $r.Path + try { + # Pickaxe: PRs whose diff added or removed the symbol + $pickaxeRaw = git log --since=$Since --until=$Until -S $Symbol --pretty=format:'%h|%ai|%an|%s' 2>$null + # Grep: PRs whose subject mentions the regex (case-insensitive) + $grepRaw = git log --since=$Since --until=$Until --pretty=format:'%h|%ai|%an|%s' --grep=$GrepRegex -i 2>$null + + $seen = @{} + foreach ($line in @($pickaxeRaw, $grepRaw | Where-Object { $_ })) { + foreach ($l in @($line)) { + if (-not $l) { continue } + $parts = $l -split '\|', 4 + if ($parts.Count -lt 4) { continue } + $sha = $parts[0] + if ($seen.ContainsKey($sha)) { continue } + $seen[$sha] = $true + # Try to pull the PR number out of the subject (#NNN at end of MS PR convention) + $prNum = $null + if ($parts[3] -match '#(\d{2,5})\b') { $prNum = $Matches[1] } + $results += [pscustomobject]@{ + Repo = $r.Name + Date = $parts[1].Substring(0, 10) + Author = $parts[2] + Sha = $sha + PR = if ($prNum) { '#' + $prNum } else { '' } + Url = if ($prNum) { $r.UrlBase + $prNum } else { '' } + Subject = $parts[3] + } + } + } + } finally { Pop-Location } +} + +if ($results.Count -eq 0) { + Write-Host "No PRs match in 
window $Since .. $Until for symbol '$Symbol'." + Write-Host " Tip: try a shorter symbol (just the class name), or widen -Since." + exit 0 +} + +$results | Sort-Object Date -Descending | Format-Table Repo, Date, Author, Sha, PR, @{n='Subject';e={$_.Subject.Substring(0, [Math]::Min(80, $_.Subject.Length))}} -AutoSize +Write-Host "" +Write-Host "PR URLs for citation in attribution cards:" +$results | Where-Object Url | Sort-Object Date -Descending | ForEach-Object { Write-Host " $($_.Repo) #$($_.PR.TrimStart('#')): $($_.Url)" } diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md b/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md index 7f6a6013..c81a2907 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md @@ -177,18 +177,73 @@ materialized_view('BrokerAdoptionStatsUpdated') --- -## 9. MCP output handling - -- Most queries with multi-week × per-error-code grain return **>50 KB** and are written to a side file by the tool. Read the side file with the `read_file` tool, or pipe through `bucket-trends.js` / `summarize-attribution.js`. -- The first row of `results.items` is the **schema object**, not data. The helper scripts know this. -- If a query times out or returns `BadRequest`, check **column name typos first** (the error message names the missing column). - ---- - ## 10. Helper scripts | Script | Purpose | |---|---| -| [`bucket-trends.js`](bucket-trends.js) | Bucket every error code into regression / spike / improvement / flat across an N-week window | +| [`bucket-trends.js`](bucket-trends.js) | Bucket every error code into regression / spike / improvement / flat across an N-week window. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop partial in-progress buckets. | +| [`agg.js`](agg.js) | Per-error per-dim top-N rollup with WoW deltas. Feeds spike-attribution dim blocks. 
| | [`summarize-attribution.js`](summarize-attribution.js) | Roll up 7-dim attribution slices per (error_code, week) — feeds the spike-attribution cards | -| [`report-template.html`](report-template.html) | Canonical layout. Copy to `oncall-wow-report-v{N+1}.html` and replace data only — never restructure CSS | +| [`queries/`](queries/) | Canonical KQL templates, one per query — see [`queries/README.md`](queries/README.md) | +| [`templates/`](templates/) | Copy-paste HTML snippets for cards / footer JS | +| [`report-template.html`](report-template.html) | Canonical layout. Copy to `~/android-oce-reports/oncall-wow-report-.html` and replace `{{TOKENS}}` only — never restructure CSS | + +--- + +## 11. The `error_location` JSON shape (read this before slicing stack-traces) + +`error_location` on `android_spans` is a **serialized JSON string**, not a dynamic object. Naively writing `error_location.MethodName` returns null in KQL. Use `tostring()` to project it raw, then `parse_json()` if you need to drill in: + +```kql +android_spans +| where error_code == 'null_pointer_error' +| extend loc = tostring(error_location) // {"ClassName":"...","MethodName":"...","LineNumber":N} +| extend method = tostring(parse_json(loc).MethodName) +| extend lineNo = toint(parse_json(loc).LineNumber) +| summarize devices = dcount(DeviceInfo_Id) by method, lineNo +| top 20 by devices desc +``` + +For the report's **mandatory Originator pre-check** (Step 4 of SKILL.md), use [`queries/error-message-and-location.kql`](queries/error-message-and-location.kql) — it returns the raw `loc` blob alongside the first 100 chars of `error_message`, which is enough to identify the throw site (file + method + line) and the dominant message string. + +The single most informative attribution query for a regressing code: + +```kql +android_spans +| where PipelineInfo_IngestionTime between (datetime() .. 
datetime())
+| where error_code in ()
+| extend loc = tostring(error_location),
+         msg = substring(tostring(error_message), 0, 100)
+| summarize cnt = count(),
+            devices = dcount(DeviceInfo_Id)
+    by error_code, loc, msg
+| top 60 by devices desc
+```
+
+---
+
+## 12. AADSTS reference — common eSTS responses bridged into broker errors
+
+When `error_message` starts with `AADSTS`, the originator is **eSTS, not broker**, regardless of which broker exception class was constructed. Broker (specifically `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}`) translates the AAD response into a broker exception code as a courtesy — it is not the cause.
+
+| AADSTS code | Meaning | Broker exception code (typical) | Originator | Owner |
+|---|---|---|---|---|
+| `AADSTS500011` | Resource principal not found in tenant | `invalid_resource` | eSTS / tenant config | Resource owner team |
+| `AADSTS500014` | Service principal disabled in tenant | `invalid_resource` | eSTS / tenant config | Resource owner team |
+| `AADSTS50158` | External claims challenge / CA enforcement | `interaction_required` | eSTS / Conditional Access | Identity CA team |
+| `AADSTS50173` | Fresh token needed (CA / FR) | `interaction_required` / `invalid_grant` | eSTS / CA | Identity CA team |
+| `AADSTS65001` | User / admin has not consented | `unauthorized_client` | eSTS / app registration | App owner team |
+| `AADSTS70008` | Authorization code expired | `invalid_grant` | eSTS (timing) | Investigate caller latency |
+| `AADSTS70011` | Invalid scope | `invalid_scope` | eSTS / app registration | App owner team |
+| `AADSTS90072` | User account from external tenant doesn't exist locally | `unauthorized_client` | eSTS / B2B config | Tenant admin |
+| `AADSTS900971` | No reply address | `invalid_request` | eSTS / app registration | App owner team |
+
+**Rule of thumb:** if the throw site is an `ExceptionAdapter.*` method AND the message begins with `AADSTS`, tag the card `eSTS` and route to the resource / app owner team. Do not invent a broker PR to "fix" it.
+
+---
+
+## 13. MCP output handling
+
+- Most queries with multi-week × per-error-code grain return **>50 KB** and are written to a side file by the tool. Read the side file with the `read_file` tool, or pipe through `bucket-trends.js` / `summarize-attribution.js`.
+- The first row of `results.items` is the **schema object**, not data. The helper scripts know this.
+- If a query times out or returns `BadRequest`, check **column name typos first** (the error message names the missing column).
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql
new file mode 100644
index 00000000..18d3fb20
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql
@@ -0,0 +1,15 @@
+// 60-day per-error-code trend.
+// Inputs (replace before pasting):
+// = first Sunday of the 60d window (e.g. 2026-03-08)
+// = end of the reporting week, EXCLUSIVE = next Sunday after the
+// reporting week's Sunday (e.g. for a 2026-05-03 report, use 2026-05-10)
+// Output: feed to assets/bucket-trends.js with --start= (no --end needed
+// because we filter the partial bucket out at the source — preferred).
+materialized_view('ErrorStatsMetrics')
+| where EventInfo_Time between (datetime() .. datetime())
+| where isnotempty(error_code) and error_code != 'success'
+| summarize errs = sum(countOverall),
+            devs = dcount_hll(hll_merge(countDevicesHll))
+    by week = startofweek(EventInfo_Time), error_code
+| where week < datetime() // drop partial end-week
+| order by error_code asc, week asc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql
new file mode 100644
index 00000000..951e840f
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql
@@ -0,0 +1,10 @@
+// 60-day per-error-type trend (with MergeUiRequiredExceptions to collapse variants).
+materialized_view('ErrorStatsMetrics')
+| extend unified_error_type = MergeUiRequiredExceptions(error_type)
+| where EventInfo_Time between (datetime() .. datetime())
+| where isnotempty(unified_error_type)
+| summarize errs = sum(countOverall),
+            devs = dcount_hll(hll_merge(countDevicesHll))
+    by week = startofweek(EventInfo_Time), unified_error_type
+| where week < datetime()
+| order by unified_error_type asc, week asc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md b/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md
new file mode 100644
index 00000000..6d3abf2c
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md
@@ -0,0 +1,33 @@
+# `assets/queries/` — canonical KQL templates
+
+Each `.kql` here is a paste-and-replace template for one of the queries the OCE
+weekly report needs. Token convention:
+
+| Token | Meaning |
+|---|---|
+| `` | Sunday of the earliest week in the window, ISO date e.g. `2026-03-08` |
+| `` | Sunday immediately AFTER the reporting week (EXCLUSIVE upper bound). For a 2026-05-03 report use `2026-05-10`. |
+| `` | Sunday of the prior week (the WoW baseline). |
+| `` | Comma-separated KQL string list, e.g. `'invalid_resource', 'null_pointer_error'` |
+| `` | Same shape but for `unified_error_type`. |
+| `` | A single column name, replace per dimension run. |
+
+**The `` filter is mandatory.** Always include `| where week < datetime()` after the `summarize` so the partial in-progress week is dropped at the source. Otherwise `bucket-trends.js` will see a fake −99% improvement on every code (the partial bucket will look like a fleet-wide collapse).
+
+## File index
+
+| File | Purpose | Section it feeds |
+|---|---|---|
+| [`reliability-auth-only.kql`](reliability-auth-only.kql) | Per-week auth-only requests/devices | Top-line health, denominator caveat |
+| [`broker-version-share.kql`](broker-version-share.kql) | Per-week per-version share — **evidence for denominator caveat** | Denominator caveat callout, broker adoption |
+| [`60d-trend-codes.kql`](60d-trend-codes.kql) | Feeds `bucket-trends.js` for codes | 60-day trend analysis |
+| [`60d-trend-types.kql`](60d-trend-types.kql) | Feeds `bucket-trends.js` for types | 60-day trend analysis |
+| [`wow-movers.kql`](wow-movers.kql) | **MANDATORY second pass** — catches small-base codes that spiked sharply this week (below the 60d bucketer's reporting threshold). Run for both `error_code` and `error_type`. **Merge its output rows into the single 🔴 WoW regressions callout** alongside the standard WoW table; tag rows that were absent or near-zero last week with `NEW`. Do not render a separate "emerging" callout. | 🔴 WoW regressions callout (Section 2) |
+| [`attr-union-by-dim.kql`](attr-union-by-dim.kql) | **PREFERRED for 2-week WoW.** All 7 dims for N codes (or types) in ONE round-trip; pipe through `summarize-attribution.js --union`. | Spike attribution cards |
+| [`attr-codes-by-dim.kql`](attr-codes-by-dim.kql) | Per-dim form (run 7 times). Fall back to this only when the union exceeds payload size or the time window is wider than 2 weeks. | Spike attribution cards |
+| [`attr-types-by-dim.kql`](attr-types-by-dim.kql) | Per-dim form for type regressions | Spike attribution cards |
+| [`type-subcode-decomposition.kql`](type-subcode-decomposition.kql) | 8th dim for type cards | Type spike-attribution cards |
+| [`error-message-and-location.kql`](error-message-and-location.kql) | **MANDATORY** for every broker-tagged regression. Now accepts BOTH `` and `` so codes + types can be sliced in one round-trip. | Code attribution block |
+| [`os-version-slice.kql`](os-version-slice.kql) | OS / OEM concentration (raw `android_spans`). **On-demand only** per Step 5 — don't slice every card. | OS-version dim in attribution cards (when applicable) |
+| [`latency.kql`](latency.kql) | p50/p95/p99 by hot span | Latency section |
+| [`app-share.kql`](app-share.kql) | Top calling apps by week | Traffic analysis |
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql
new file mode 100644
index 00000000..4e138833
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql
@@ -0,0 +1,11 @@
+// Top calling apps share for last N weeks (typically 3).
+materialized_view('AppStatsUpdated')
+| where EventInfo_Time between (datetime() .. datetime())
+| summarize req = sum(countRequests),
+            dev = dcount_hll(hll_merge(countDevicesHll))
+    by week = startofweek(EventInfo_Time), calling_package_name
+| where week < datetime()
+| order by week asc, req desc
+| summarize topApps = make_list(pack('app', calling_package_name, 'req', req, 'dev', dev), 25)
+    by week
+| order by week asc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql
new file mode 100644
index 00000000..fa31ff56
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql
@@ -0,0 +1,17 @@
+// Spike attribution: codes by ONE dimension at a time.
+// Run 7 times with set to each of:
+// span_name | calling_package_name | active_broker_package_name |
+// broker_version | unified_account_type | unified_is_shared_device | client_sku
+// (Plus android_spans-based for OS version — see os-version-slice.kql.)
+let codes = dynamic([]);
+materialized_view('ErrorStatsMetrics')
+| extend unified_account_type = MergeAccountType(account_type)
+| extend unified_is_shared_device = MergeIsSharedDevice(is_shared_device)
+| where EventInfo_Time between (datetime() .. datetime())
+| where error_code in (codes)
+| extend wk = startofweek(EventInfo_Time)
+| where wk < datetime()
+| summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+            errs = sum(countOverall)
+    by wk, error_code,
+| order by error_code asc, wk asc, devs desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql
new file mode 100644
index 00000000..d9b06736
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql
@@ -0,0 +1,15 @@
+// Spike attribution: types by ONE dimension at a time.
+// Same usage as attr-codes-by-dim.kql but for error_type regressions.
+let types = dynamic([]);
+materialized_view('ErrorStatsMetrics')
+| extend unified_error_type = MergeUiRequiredExceptions(error_type)
+| extend unified_account_type = MergeAccountType(account_type)
+| extend unified_is_shared_device = MergeIsSharedDevice(is_shared_device)
+| where EventInfo_Time between (datetime() .. datetime())
+| where unified_error_type in (types)
+| extend wk = startofweek(EventInfo_Time)
+| where wk < datetime()
+| summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+            errs = sum(countOverall)
+    by wk, unified_error_type,
+| order by unified_error_type asc, wk asc, devs desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql
new file mode 100644
index 00000000..0efb0a2d
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql
@@ -0,0 +1,59 @@
+// Spike-attribution union — all 7 dims for N codes (or N types) in ONE query.
+//
+// Recommended for the standard 2-week WoW attribution pass (Step 5 of SKILL.md).
+// 1 round-trip vs 7. ~800 KB payload for 8 codes; well under the MCP limit.
+// Falls back to per-dim files (assets/queries/attr-codes-by-dim.kql) if you
+// need a wider time window or you exceed payload size.
+//
+// Inputs:
+// e.g. dynamic(['no_tokens_found','timed_out_execution', ...])
+// inclusive (e.g. datetime(2026-04-26))
+// EXCLUSIVE Sunday after the reporting week (e.g. datetime(2026-05-10))
+// either `error_code` or `unified_error_type` (the latter for type cards)
+//
+// Output schema (consumed by `summarize-attribution.js --union`):
+// dim string short label per dimension
+// wk datetime reporting week
+// string error_code or unified_error_type
+// val_string string dim value (for string-typed dims)
+// val_bool bool dim value (for shared-device only)
+// devs long dcount_hll merged device count
+// errs long sum of countOverall (request count)
+//
+// For type cards, swap the first line and key:
+// let base = materialized_view('ErrorStatsMetrics')
+// | extend unified_error_type = MergeUiRequiredExceptions(error_type)
+// | where EventInfo_Time between (datetime() .. datetime())
+// | where unified_error_type in ()
+// | extend wk = startofweek(EventInfo_Time);
+
+let codes = dynamic([]);
+let base = materialized_view('ErrorStatsMetrics')
+    | where EventInfo_Time between (datetime() .. datetime())
+    | where error_code in (codes)
+    | extend wk = startofweek(EventInfo_Time);
+(base | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                  errs = sum(countOverall)
+    by dim='span', wk, error_code, val_string=span_name, val_bool=bool(null))
+| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                          errs = sum(countOverall)
+    by dim='calling_app', wk, error_code, val_string=calling_package_name, val_bool=bool(null))
+| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                          errs = sum(countOverall)
+    by dim='active_broker', wk, error_code, val_string=active_broker_package_name, val_bool=bool(null))
+| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                          errs = sum(countOverall)
+    by dim='broker_ver', wk, error_code, val_string=broker_version, val_bool=bool(null))
+| union (base | extend t = MergeAccountType(account_type)
+    | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                errs = sum(countOverall)
+    by dim='acct_type', wk, error_code, val_string=t, val_bool=bool(null))
+| union (base | extend s = MergeIsSharedDevice(is_shared_device)
+    | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                errs = sum(countOverall)
+    by dim='shared_dev', wk, error_code, val_string=s, val_bool=bool(null))
+| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+                          errs = sum(countOverall)
+    by dim='client_sku', wk, error_code, val_string=client_sku, val_bool=bool(null))
+| where wk < datetime()
+| order by error_code asc, dim asc, wk asc, devs desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql
new file mode 100644
index 00000000..bdfc5a41
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql
@@ -0,0 +1,10 @@
+// Per-broker-version request and device share — the canonical evidence for
+// the "denominator caveat" callout. If the all-spans device count moved >20%
+// WoW, this query tells you WHICH version cohort drove it.
+materialized_view('BrokerAdoptionStatsUpdated')
+| where EventInfo_Time between (datetime() .. datetime())
+| summarize req = sum(countRequests),
+            dev = dcount_hll(hll_merge(countDevicesHll))
+    by week = startofweek(EventInfo_Time), broker_version
+| where week < datetime()
+| order by week asc, req desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql
new file mode 100644
index 00000000..d719b85e
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql
@@ -0,0 +1,39 @@
+// Stack-trace + error_message slice for code attribution. MANDATORY for every
+// broker-tagged regression card before claiming "Originator: Broker".
+//
+// Rationale: most broker exception codes flow through
+// common/ExceptionAdapter.{getExceptionFromTokenErrorResponse,
+// exceptionFromAuthorizationResult, clientExceptionFromException}. Without
+// reading the throw site + the dominant error_message string, you cannot tell
+// whether the code originated in broker code or was bridged from an eSTS
+// AADSTS response. (See kusto-cheatsheet.md "AADSTS reference table".)
+//
+// THIS TEMPLATE COVERS BOTH error_code AND error_type IN ONE ROUND-TRIP.
+// Pass an empty list for the side you don't want to slice.
+//
+// Inputs:
+// e.g. 'invalid_resource', 'null_pointer_error' (or empty)
+// e.g. 'IntuneAppProtectionPolicyRequiredException' (or empty)
+// datetime of reporting-week PipelineInfo_IngestionTime start
+// datetime of next Sunday (exclusive)
+//
+// Output column 'loc' is a JSON blob {"ClassName":"...","MethodName":"...","LineNumber":N}
+// — this is normal. Read it as text. To project the method name only, use
+// parse_json(loc).MethodName
+//
+// HARD RULE (per SKILL.md Step 4): if the throw site is in
+// ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}
+// AND the message starts with "AADSTS", the originator is eSTS, not broker.
+
+let codes = dynamic([]);
+let types = dynamic([]);
+android_spans
+| where PipelineInfo_IngestionTime between (datetime() .. datetime())
+| where (array_length(codes) > 0 and error_code in (codes))
+     or (array_length(types) > 0 and error_type in (types))
+| extend loc = tostring(error_location),
+         msg = substring(tostring(error_message), 0, 120)
+| summarize cnt = count(),
+            devices = dcount(DeviceInfo_Id)
+    by error_code, error_type, loc, msg
+| top 80 by devices desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql
new file mode 100644
index 00000000..471fd2ca
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql
@@ -0,0 +1,13 @@
+// p50 / p95 / p99 latency on the hot spans. Always merge TDigest before percentile.
+materialized_view('PerfStatsUpdated')
+| where EventInfo_Time between (datetime() .. datetime())
+| where span_name in ('AcquireTokenSilent','AcquireTokenInteractive',
+                      'GetAccounts','RemoveAccount','ProcessWebsiteRequest')
+| where span_status == 'OK'
+| summarize p50 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 50, typeof(long)),
+            p95 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 95, typeof(long)),
+            p99 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 99, typeof(long)),
+            reqs = sum(countRequests)
+    by week = startofweek(EventInfo_Time), span_name
+| where week < datetime()
+| order by span_name asc, week asc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql
new file mode 100644
index 00000000..dcb93df6
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql
@@ -0,0 +1,11 @@
+// OS version slice (8th attribution dim). Requires raw android_spans because
+// ErrorStatsMetrics doesn't carry DeviceInfo_OsVersion. Keep the time window
+// tight (<= 7 days) to stay under the MCP 240s timeout.
+android_spans
+| where PipelineInfo_IngestionTime between (datetime() .. datetime())
+| where error_code in ()
+| summarize devs = dcount(DeviceInfo_Id),
+            cnt = count()
+    by error_code, DeviceInfo_OsVersion
+| where devs >= 100
+| top 30 by devs desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql
new file mode 100644
index 00000000..e9758b21
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql
@@ -0,0 +1,14 @@
+// Auth-only denominator and reliability per week.
+let s = materialized_view('SilentAuthStatsAllRequestsMetrics')
+    | where EventInfo_Time between (datetime() .. datetime())
+    | summarize req = sum(countRequests), devHll = hll_merge(countDevicesHll)
+        by week = startofweek(EventInfo_Time);
+let i = materialized_view('InteractiveAuthStatsAllRequestsMetrics')
+    | where EventInfo_Time between (datetime() .. datetime())
+    | summarize req = sum(countRequests), devHll = hll_merge(countDevicesHll)
+        by week = startofweek(EventInfo_Time);
+union s, i
+| summarize authReq = sum(req), authDev = dcount_hll(hll_merge(devHll))
+    by week
+| where week < datetime()
+| order by week asc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql
new file mode 100644
index 00000000..4454c79f
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql
@@ -0,0 +1,13 @@
+// Sub-code decomposition for an error_type regression card (the "8th dim").
+// Shows top error_codes that roll up under each unified_error_type, with WoW devices.
+let types = dynamic([]);
+materialized_view('ErrorStatsMetrics')
+| extend unified_error_type = MergeUiRequiredExceptions(error_type)
+| where EventInfo_Time between (datetime() .. datetime())
+| where unified_error_type in (types)
+| extend wk = startofweek(EventInfo_Time)
+| where wk < datetime()
+| summarize devs = dcount_hll(hll_merge(countDevicesHll)),
+            errs = sum(countOverall)
+    by wk, unified_error_type, error_code
+| order by unified_error_type asc, wk asc, devs desc
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql
new file mode 100644
index 00000000..1a53a8f9
--- /dev/null
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql
@@ -0,0 +1,46 @@
+// WoW movers — codes (or types) that moved sharply this week regardless of 60d shape.
+//
+// MANDATORY pass alongside `60d-trend-codes.kql` / `60d-trend-types.kql` (per
+// SKILL.md Step 3b). The 60d bucketer's --peak-floor=10000 EXCLUDES errors
+// whose absolute weekly volume is small, but those small-volume codes can
+// still spike sharply WoW (e.g. `Failed to parse JWT` 7 -> 3,461 devs over 7
+// weeks, or `Code:-11` 937 -> 2,490 devs WoW). Without this pass those spikes
+// are silently dropped from the report.
+//
+// Inputs:
+// Sunday of the reporting week (e.g. 2026-05-03)
+// Sunday after (exclusive, e.g. 2026-05-10)
+// Sunday of the baseline week (e.g. 2026-04-26)
+//
+// To run for error_type instead of error_code, copy this query and replace:
+// - error_code -> MergeUiRequiredExceptions(error_type) (alias as `t`)
+// - drop the `error_code != 'success'` filter
+//
+// Thresholds (tuneable):
+// floor: cDev>=500 OR cReq>=5000 (small enough to catch sub-bucketer-floor codes)
+// move: |dDev%|>=25 OR |dReq%|>=50 (real spike, not noise)
+// new-this-wk: pDev==0 OR pReq==0 (never seen before this week)
+
+let curr = materialized_view('ErrorStatsMetrics')
+    | where EventInfo_Time between (datetime() .. datetime())
+    | where isnotempty(error_code) and error_code != 'success'
+    | summarize cDev = dcount_hll(hll_merge(countDevicesHll)),
+                cReq = sum(countOverall)
+        by error_code;
+let prior = materialized_view('ErrorStatsMetrics')
+    | where EventInfo_Time between (datetime() .. datetime())
+    | where isnotempty(error_code) and error_code != 'success'
+    | summarize pDev = dcount_hll(hll_merge(countDevicesHll)),
+                pReq = sum(countOverall)
+        by error_code;
+curr | join kind=fullouter prior on error_code
+| extend ec = coalesce(error_code, error_code1)
+| extend cDev = coalesce(cDev, long(0)), cReq = coalesce(cReq, long(0)),
+         pDev = coalesce(pDev, long(0)), pReq = coalesce(pReq, long(0))
+| extend dDevPct = iff(pDev == 0, real(null), 100.0 * (cDev - pDev) / pDev)
+| extend dReqPct = iff(pReq == 0, real(null), 100.0 * (cReq - pReq) / pReq)
+| where (cDev >= 500 or cReq >= 5000)
+| where (abs(dDevPct) >= 25 or abs(dReqPct) >= 50 or pDev == 0 or pReq == 0)
+| project ec, pDev, cDev, dDevPct = round(dDevPct, 1),
+          pReq, cReq, dReqPct = round(dReqPct, 1)
+| order by abs(dDevPct) desc nulls first
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html
index 9ff9f624..5375121c 100644
--- a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html
+++ b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html
@@ -1,8 +1,14 @@
+
-Android Broker · On-Call Weekly Report
+Android Broker · On-Call Weekly Report — Week of May 3, 2026
@@ -280,21 +346,24 @@

Android Broker · Weekly On-Call Report

- Last 7 days vs prior 7 days  ·  - Source: AllAndroidSpans  ·  - Generated 2026-05-07 + Sun May 3 → Sat May 9, 2026  vs  Apr 26 → May 2  ·  + 60-day window: Mar 8 → May 3 (8 complete weeks)  ·  + Source: android_spans materialized views  ·  + Generated 2026-05-09
- Live data + v6 · Live data
- -

📊 Top-line health — last 13 days

+ +

📊 Top-line health — auth-only denominator

-
Silent auth requests
-
10.37 B
-
−0.6% WoW
-
+
Silent auth requests (week)
+
10.58 B
+
+2.3% WoW
+
+
+
+
Silent auth reliability (req)
+
73.34%
+
+0.68 pp WoW (improving)
+
-
Silent auth devices
-
190.1 M
-
−0.7% WoW
-
+
Silent auth reliability (dev)
+
82.52%
+
+1.34 pp WoW (improving)
Interactive auth requests
-
9.84 M
-
−1.0% WoW
-
+
10.29 M
+
+7.5% WoW
+
+
+
+
Interactive reliability (dev)
+
58.43%
+
+0.74 pp WoW (improving)
-
Interactive auth devices
-
6.34 M
-
−1.8% WoW
-
+
Interactive devices
+
8.17 M
+
+5.6% WoW
+
-
Latest broker
-
16.0.1
-
📈 43.1% req share (was 21.2%)
-
+
Latest broker (16.0.1)
+
70.8%
+
+15.4 pp share WoW (rollout complete)
+
+
+
p95 AcquireTokenSilent
+
5,916 ms
+
−45 ms (−0.8%) WoW
- +

🚨 Things that need attention this week

-
ℹ️ Important caveat about denominators (read this first)
-

The all-spans device count dropped 38% WoW (572 M → 353 M) due to broker PR #88 moving OnUpgradeReceiver work to goAsync() — a fix for an OPPO GPU-overload issue. After 16.0.x rollout, that span no longer fires reliably (broadcast receivers can be killed before async work completes), removing ~509 M span events/week from the denominator. Auth-only device count is flat (190.1 M → 188.6 M) — users are unaffected.

-

All reliability metrics in this report use the auth-only denominator (SilentAuthStatsInteractiveAuthStats) so they reflect real user impact, not telemetry artifacts. The dashboard's default "Device Reliability" tile already does this.

-
- -
-
🔴 Real regressions worth investigation
+
ℹ️ Denominator caveat — read this first
+

The headline BrokerAdoptionStats device count dropped −18.6% WoW (1.52 B → 1.24 B), but this is not a real fleet shrink. The drop is fully explained by three low-value spans deflating as the 16.0.1 rollout completes:

    -
  • invalid_resource — +20% device share, +57% raw devices (eSTS-side). Affects Outlook + Teams. Investigate eSTS error trend.
  • -
  • Failed to parse JWT — +367% device share (small absolute, 2,700 devices). LTW + 16.0.1 + MSA only.
  • -
  • Code:-10 (WebView UNSUPPORTED_SCHEME) — +161% devices. Tied to common #3013 openid-vc URL handling.
  • -
  • DeviceRegistrationException — +20% devices, broker code regression. Tied to broker #87.
  • -
  • kdfv2_key_derivation_error — bursty +1,951% requests. Server-side ECS flight ramp on AAD KDFv2.
  • +
  • OnUpgradeReceiver: 438 M → 151 M events (−65.5%) — fires once per app upgrade; tapers naturally as 16.0.1 finishes deploying. May also be impacted by historical goAsync() refactors that allow the OS to kill the receiver before the span flushes.
  • +
  • SecretKeyWrapping: 329 M → 251 M (−23.7%) — downstream of fewer keystore ops in the OnUpgradeReceiver path.
  • +
  • WrappedKeyAlgorithmIdentifier: 135 M → 87 M (−35.3%) — same downstream cause.
-

See → 🔎 Spike Attribution for code-level root cause on each.

+

The auth-only denominator (Silent ∪ Interactive) is up: silent countRequests +2.3%, interactive +7.5%, silent device count flat (1.55 B → 1.53 B). All reliability % and per-app figures in this report use the auth-only denominator. Real users are unaffected by the device-count drop.

-
📈 Slow-burn regressions only visible on 60-day trend (WoW would miss these)
-
    -
  • timed_out_execution — devices 18M → 80.6M over 8 weeks (+348%, peaked at 143M Apr 19). WoW shows a −40% pullback this week, but the multi-week trajectory is sharply up. Likely tied to broker #141 (HTTP cancellation on ATS timeout).
  • -
  • no_tokens_found — devices 14.1M → 22.9M (+62%); as % of fleet 0.71% → 1.50% (~2.1×). Matches the dashboard chart you flagged. Suspect: common #3074 token-cache remove optimization.
  • -
  • null_pointer_error — devices 48K → 67K (+39%) over 8 weeks. Crash-bucket by error_location next.
  • -
  • unknown_crypto_error — devices +20%; correlate with OEM/Android-version next.
  • -
-

Full breakdown → 📈 60-Day Trend Analysis. The earlier "auxiliary spans in 16.0.0" hypothesis for io_error / no_account_found / invalid_grant has been retracted — see updated verdict in the Spike Attribution section.

+
🔴 WoW regressions (last 7 days vs prior 7) — sorted by current-week devices, descending
+

Tags: NEW = first appeared this week or last; 60d↑ = also rising over the 60-day window; broker / eSTS / Android / env = originator. Rows are the union of the standard WoW table and the wow-movers.kql output, so small-but-recent spikes appear alongside the high-volume movers.

+
+ + +
+
+ EXAMPLE_error_code + devicesEXAMPLE 65 K + Δ WoWEXAMPLE +6.1% + on 16.0.1EXAMPLE 73% + + broker + 60d↑ +52% + +
+
EXAMPLE one-line narrative: throw site common/SomeClass.someMethod:NN, dominant message, and the verdict. Keep this short — the deep dive is in the attribution card below.
+
Owner: EXAMPLE teamAttribution card →
+
+ +
+
+ +
+
🟡 Slow-burn 60-day regressions — rising on 60d window but flat WoW; codes that also moved WoW are in the red callout above with a 60d↑ tag
+
+ +
+
+ EXAMPLE_slow_burn_code + devicesEXAMPLE 4.5 M + Δ 60dEXAMPLE +56% + Δ requests 60dEXAMPLE +40% + on 16.0.1EXAMPLE 78% + + broker + +
+
EXAMPLE: WoW only +X%. Tracks 16.0.1 rollout share; one-line hypothesis or owner pointer.
+
+ +
+

See the 60-day trend section for the full ranked list.

-
Real wins this week
-
    -
  • unknown_authority−87% devices, −82% requests. Direct fix from common #3027 Bleu cloud support: now falls back to hardcoded authority list when discovery fails.
  • -
  • timed_out_execution−40% devices, −44% requests. Likely tied to broker #91 skip-account-aggregation flight.
  • -
  • ClientException (error_type) — −10% devices. Mostly the timed_out_execution drop above.
  • -
  • illegal_argument_exception / ArgumentException−67% devices. Mostly removed by PR #88 (OnUpgradeReceiver no longer fires synchronously, removing 100k IAE/wk from that path).
  • -
  • 429 (eSTS throttle) — −98% devices. Throttling cleared (Teams IP-Phone fleet).
  • -
  • Latency p99: RefreshPrt −50%, AcquireAtUsingPrt −49%, BrokerOperationRequestDispatcher −20%
  • -
+
🟢 Real wins this week
+
+ +
+
+ EXAMPLE_recovered_code + devicesEXAMPLE 834 K + Δ WoWEXAMPLE −86% + Δ requestsEXAMPLE −78% +
+
EXAMPLE: 100% pinned to broker 16.0.0; recovery is natural rolloff. Likely fix PR: common #EXAMPLE.
+
Watch: EXAMPLE residual cohort.
+
+ +
-
📊 Traffic is flat (no surge, no collapse)
-

Silent auth: 10.37 B requests, 190.1 M devices (−0.6% / −0.7% WoW). Interactive: 9.84 M requests, 6.34 M devices (−1.0% / −1.8% WoW). Every top calling app is down 5–22% in requests with stable device counts — "fewer requests per device," likely a benign cache-efficiency improvement. See → Traffic Analysis.

+
📊 Traffic shape — flat with mild interactive uptick
+

Auth volume is essentially flat in silent (+2.3% requests, −1.2% devices) and slightly up on interactive (+7.5% requests, +5.6% devices). Top calling apps all moved within ±5% on requests. No surge, no collapse, no sampling-rate change suspected. See → 📊 Traffic Analysis.

- -

📈 60-Day Trend Analysis — rising errors that WoW alone misses

+ +

📈 60-Day Trend Analysis — bucketed across 8 complete weeks

- Why this section exists: Some errors don't move much week-over-week but have been climbing steadily for weeks or months. WoW deltas hide these slow-burn regressions. This section tracks weekly device counts (from the ErrorStats materialized view) over the last 8 complete weeks (Mar 8 → Apr 26; the partial weeks Mar 1 and May 3 are excluded). An error is flagged as a trend regression if devices grew >15% across this window even when WoW looks flat. + Methodology: Pulled all error codes from the ErrorStats view over the last 9 weeks. Dropped the partial start week (Mar 1). Kept all codes whose peak weekly device count ≥ 10 K. Bucketed each 8-week series by delta = (last − first) / first: + regression if delta > +15% and trajectory is monotonic-ish; ephemeral spike if peak ≥ 3× mean of surrounding weeks; improvement if delta < −15%; flat otherwise. Every code in the regression list gets a spike-attribution card below.
-
⚠️ True 60-day regressions (rising even though WoW looked flat)
-
    -
  • timed_out_execution — devices 18.0M → 80.6M (+348%) over 8 weeks; peaked at 143M on Apr 19. As % of fleet: 0.91% → 5.29%. Massive slow-burn regression. Likely tied to broker PR #141 (5c64e1ebd — "Add flight-gated HTTP cancellation on ATS command-level timeout to eliminate zombie worker threads", AB#3542516) which actively converts long-running ATS calls into timed_out_execution errors instead of silent thread leaks. This may be a deliberate visibility increase, but the magnitude warrants confirming the flight rollout schedule and whether downstream callers retry cleanly.
  • -
  • no_tokens_found — devices 14.1M → 22.9M (+62%); as % of fleet 0.71% → 1.50% (~2.1×). Matches the dashboard chart you shared (no_tokens_found % requests climbing 2.5% → 3.7%, % devices 1.25% → 1.8%). Candidate PRs: common #3074 (4f869773a — "Optimize token cache remove path and add filter-first-clone flight for filtered retrieval", AB#3570409) and common #3081 (85f1948e8 — "Fix WPJ's BrokerDiscovery cache crash due to shared predefined encryption key with MSAL", AB#3577391). The cache-remove optimization in #3074 is the prime suspect — an over-aggressive remove or a flight enabling for more apps would directly elevate no_tokens_found.
  • -
  • null_pointer_error — devices 48.4K → 67.3K (+39%) over 8 weeks (peak 71.6K on Apr 19). As % of fleet it's flat (~0.0044%), but absolute device count is steadily climbing. Worth a focused crash-bucketing query on error_location / stack-trace fields to identify the specific call site before it grows further.
  • -
  • unknown_crypto_error — devices 63.8K → 76.3K (+20%); mild but consistent. Likely keystore / TEE-related; correlate with device OEM and Android OS version next pass.
  • -
  • unauthorized_client — devices 2.74M → 3.17M (+15%); mild and may reflect new app onboarding rather than a regression. Bucket by calling_package_name to confirm.
  • -
+
⚠️ True 60-day regressions — 5 codes
+ + + + + + + + + +
Error code | Wk 1 devices | Wk 8 devices | Δ over 8w | 60d sparkline | Trajectory
no_tokens_found | 13.9 M | 23.7 M | +70.6% | (sparkline) | monotonic up
unauthorized_client | 2.72 M | 3.37 M | +23.6% | (sparkline) | monotonic up
Code:-6 | 31.8 K | 86.4 K | +171.5% | (sparkline) | step-up at wk 6
unknown_crypto_error | 59.3 K | 78.4 K | +32.4% | (sparkline) | U-shaped, climbing
null_pointer_error | 48.5 K | 70.7 K | +45.9% | (sparkline) | monotonic up
-
Ephemeral 60-day spike (already self-resolving)
-

unknown_authority — baseline ~1K devices/week through end-March, then exploded: 944K (Apr 5) → 20.5M (Apr 12) → 34.1M (Apr 19, peak) → 9.0M (Apr 26) → 1.3M (May 3, recovering). Strong candidate root cause: common PR #3082 (b53d87e34 — "Fix ABBA deadlock between AzureActiveDirectory and AzureActiveDirectoryAuthority class monitors", AB#3578299) which lands in the authority-validation code path. The mitigation appears to have already taken effect — but a 5-order-of-magnitude excursion deserves a post-mortem and a guardrail alert at >1M devs/week for this code.

+
Ephemeral 60-day spikes (peaked then recovered)
+ + + + + + + +
Error code | Baseline | Peak | Now | 60d sparkline
timed_out_execution | 17.9 M | 142.9 M (wk Apr 12) | 53.4 M | (sparkline)
unknown_authority | ~1 K | 34.1 M (wk Apr 12) | 1.45 M | (sparkline)
429 (eSTS rate-limit) | ~10 | 218 K (wk Mar 22) | 2.5 K | (sparkline)
+

Both unknown_authority (common #3082 ABBA deadlock fix) and timed_out_execution (broker #141 flight gating) are recovering. Recommendation: add Aria guardrail at >1M devices/week for unknown_authority to detect any future excursion early.

-
True 60-day improvements (sustained, not just WoW noise)
-
    -
  • timed_out — devices 37.5M → 5.5M (−85%) over 8 weeks. Likely a downstream effect of the same ATS timeout refactor (PR #141) — generic timed_out is being reclassified into timed_out_execution. Net traffic is roughly conserved between these two codes.
  • -
  • invalid_scope2.00M → 0.38M (−81%). Genuine improvement.
  • -
  • timed_out_thread_pool_saturated1.71M → 0.68M (−60%). Consistent with the zombie-worker-thread fix in PR #141.
  • -
  • null_object8.17M → 5.39M (−34%). Steady improvement.
  • -
+
True 60-day improvements
+ + + + + + + + + +
Error code | Wk 1 | Wk 8 | Δ | Sparkline
timed_out | 36.1 M | 5.1 M | −85.9% | (sparkline)
invalid_scope | 1.92 M | 0.36 M | −81.3% | (sparkline)
timed_out_thread_pool_saturated | 1.64 M | 0.62 M | −62.1% | (sparkline)
illegal_argument_exception | 0.21 M | 0.19 M | −7.5% (peak −62%) | (sparkline)
null_object, device_network_not_available, access_denied, ONLY_SUPPORTS_ACCOUNT_MANAGER_ERROR_CODE, invalid_key | all −17% to −78% over 8 wks (see appendix)
+

Note: the timed_out drop and timed_out_execution climb are partly the same event — broker #141 reclassifies legacy timed_out into the more specific timed_out_execution. The reclassification is net-neutral but the new code is louder; treat the timed_out "win" with caution.
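The reclassification note can be sanity-checked with simple arithmetic. The sketch below is illustrative only: `netShift` is a hypothetical helper, not part of the skill's assets, and the inputs are the rounded device counts (in millions) from this report's tables (timed_out 36.1 → 5.1, timed_out_execution 17.9 → 53.4). Because one device can hit both codes, the sum is a rough indicator rather than a true device count.

```javascript
// Hypothetical helper: compares the drop in a legacy error code with the
// rise in the more specific code it may have been reclassified into.
// Inputs are device counts in millions, taken from the report tables.
function netShift(legacy, specific) {
  const legacyDelta = legacy.now - legacy.baseline;       // negative when the legacy code shrinks
  const specificDelta = specific.now - specific.baseline; // positive when the new code grows
  return { legacyDelta, specificDelta, net: legacyDelta + specificDelta };
}

const shift = netShift(
  { baseline: 36.1, now: 5.1 },   // timed_out, wk 1 -> wk 8
  { baseline: 17.9, now: 53.4 }   // timed_out_execution, baseline -> now
);

console.log(shift);
```

With these inputs the legacy code falls by about 31.0 M and the specific code rises by about 35.5 M, so the pair nets out to roughly +4.5 M: close to conserved, but not exactly neutral.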

-
Flat on 60d (no trend regression, no improvement)
-

io_error, no_account_found, invalid_grant, interaction_required, device_network_not_available_doze_mode, authorization_pending, expired_token, illegal_argument_exception, User cancelled, auth_cancelled_by_sdk — all within ±10% across the 8-week window. This directly contradicts the WoW finding that io_error/no_account_found/invalid_grant regressed +58–66% on a per-device basis — reinforcing the denominator-effect hypothesis in the Spike Attribution card below.

+
Flat on 60d (within ±10%)
+

io_error, no_account_found, invalid_grant, interaction_required, device_network_not_available_doze_mode, authorization_pending, expired_token, User cancelled, auth_cancelled_by_sdk, invalid_resource, invalid_request, device_registration_needed, Code:-1, Code:-2, Code:-8, operation_interrupted, ipc_return_null_cursor, device_needs_to_be_managed, Redirect url scheme not SSL protected, ipc_operation_not_supported_on_server_side, invalid_client, ipc_connection_error, unknown_error.

- -

🔎 Spike Attribution — root-cause breakdown for each spike

+ +

🔎 Spike Attribution — one card per regression

- What this section answers for each spike: - Is it tied to a broker version rollout? a specific span? active broker? calling app? account type (AAD vs MSA)? shared device mode? -  Each pill in the header summarizes the dominant attribution. Bars show device-share within the dimension. Red bars indicate >80% concentration in a single value (a strong signal). + Each card slices on broker version, span, active broker package, calling app, and sub-dimensions where data is available. Concentration thresholds: > 80% in a single value = strong attribution (red bar); 60–80% = medium; < 60% = broad/cross-cutting. Account-type and shared-device-mode dimensions are sourced from raw android_spans and shown when material.
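The concentration thresholds can be written down as a small helper. This is a minimal sketch in JavaScript (the language of the skill's other helpers); `classifyConcentration` and `dominantSlice` are hypothetical names rather than functions from `summarize-attribution.js`, and the treatment of the exact 60% and 80% boundaries is an assumption. The sample shares are the no_tokens_found span and calling-app figures from this report.

```javascript
// Hypothetical helper: maps the top value's device share (in percent)
// to the attribution strength used on the cards.
// Assumption: exactly 80% and exactly 60% both count as "medium".
function classifyConcentration(topSharePct) {
  if (topSharePct > 80) return "strong";   // rendered as a red bar
  if (topSharePct >= 60) return "medium";
  return "broad";                          // broad / cross-cutting
}

// Picks the largest slice of one dimension and classifies it.
function dominantSlice(slices) {
  const top = slices.reduce((best, s) => (s.sharePct > best.sharePct ? s : best));
  return { value: top.value, sharePct: top.sharePct, strength: classifyConcentration(top.sharePct) };
}

// Sample data: no_tokens_found card, span and calling-app dimensions.
const spanSlices = [
  { value: "AcquireTokenSilent", sharePct: 98.6 },
  { value: "ATISilently", sharePct: 1.3 },
  { value: "MSAL_PerformIpcStrategy", sharePct: 0.1 },
];
const callerSlices = [
  { value: "com.microsoft.office.outlook", sharePct: 35.5 },
  { value: "com.microsoft.teams", sharePct: 20.1 },
  { value: "com.microsoft.skydrive", sharePct: 10.6 },
];

const spanResult = dominantSlice(spanSlices);
const callerResult = dominantSlice(callerSlices);
console.log(spanResult, callerResult);
```

Under these assumptions the span dimension classifies as strong and the calling-app dimension as broad, which matches how the no_tokens_found card reads.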
-
- -
+
-
Failed to parse JWT
-
Devices: 343 → 2,868  (+1,183%)
-
-
- 3rd-party: Nimbus JWT - ⚡ broker 16.0.1 - ⚡ Link to Windows - ⚡ MSA only +
no_tokens_found
+
Devices: 14.1 M → 23.7 M over 8wks  (+68.2%); WoW 22.9 M → 23.7 M (+3.6%)
+
60d regression · silent path · broad calling-app spread
-
- Verdict — Strong attribution. 91% of devices are on broker 16.0.1, 100% are com.microsoft.appmanager (Link to Windows) using OneAuth/MSAL_CPP, on AcquireTokenInteractive with MSA accounts. The spike began climbing on Apr 30, matching the 16.0.1 LTW rollout window. Action: file bug against LTW + OneAuth team for JWT parsing path on MSA interactive flows. -
-
-
-
Broker version
-
16.0.191%
-
-
16.0.03%
-
-
other (8 versions)6%
-
-
-
-
Active broker
-
com.microsoft.appmanager90%
-
-
com.azure.authenticator10%
-
-
-
-
Calling app
-
com.microsoft.appmanager100%
-
-
-
-
Span
-
AcquireTokenInteractive100%
-
-
-
-
Client SKU
-
MSAL_CPP (OneAuth)100%
-
-
-
-
Account type
-
-
-
-
-
-
-
- MSA 99.97% - AAD 0.03% -
-
-
-
Shared device mode
-
-
-
-
Personal 100%
-
-
- +
Verdict: Slow-burn 60-day regression with no single dominant dimension. It is concentrated on the AcquireTokenSilent span (98.6%) but spread across all top callers (Outlook 36%, Teams 20%, SkyDrive 11%, Word 7%, AppManager 7%). Active broker is split ~46% Authenticator / 44% AppManager / 10% Intune CP, mirroring fleet share, so this is not a broker-app-specific issue. Strongest code-attribution candidate: common #3074 (token-cache remove path optimization, AB#3570409). The +9.6 M devices added since wk of Mar 8 closely tracks the rollout window of that PR. Action: bisect by enabling/disabling the filter-first-clone flight on a small ring to confirm causation.
+
Span
AcquireTokenSilent: 98.6%
+
+
ATISilently: 1.3%
+
+
MSAL_PerformIpcStrategy: 0.1%
+
Calling app
com.microsoft.office.outlook: 35.5%
+
+
com.microsoft.teams: 20.1%
+
+
com.microsoft.skydrive: 10.6%
+
+
com.microsoft.office.word: 7.3%
+
+
com.microsoft.appmanager: 7.0%
+
Active broker
com.azure.authenticator: 46.0%
+
+
com.microsoft.appmanager: 44.5%
+
+
com.microsoft.windowsintune.companyportal: 9.5%
+
Broker version
16.0.1: 71.1%
+
+
15.1.0: 10.0%
+
+
14.2.0: 9.0%
+
+
other: 9.9%
+
Code attribution
-
-
Originator
-
3rd-party lib Nimbus JOSE+JWT — wrapped by broker code
-
-
-
Top throw site
-
com.nimbusds.jwt.SignedJWT.getJWTClaimsSet:28  97% of cases  ·  thrown as ParseException
-
-
-
Wrapper
-
com.microsoft.identity.common.java.providers.oauth2.IDToken.parseJWT:38 wraps it as ServiceException("Failed to parse JWT", INVALID_JWT, e)
-
-
-
Likely PRs
-
-
-
- 🟡 Medium -
- broker #71 · Add Android integration layer for Browser SSO -
commit 92d660dd7 · 2026-03 · authors @melissaahn / Browser SSO team
-
New token-build path through Browser SSO → broker → OneAuth response. MSA-specific paths likely under-tested. Matches Apr 30 climb date.
-
-
-
- 🟢 Low -
- common #3006 + broker #76 · Edge TB: PoP support for WebApps -
commit d774c923b · 2026-03-17
-
Touches MicrosoftStsAccountCredentialAdapter near IDToken handling but only adds a new auth scheme branch — doesn't change the parse path itself.
-
-
-
-
+
+
medium
+
common#3074 Token cache filter-first-clone optimization +
Touches the cache-remove path; an over-eager remove or filter mismatch would directly raise no_tokens_found on AcquireTokenSilent.
-
-
Next step
-
Capture 5-10 correlation IDs from this spike, fetch the broker → OneAuth response payload, inspect actual idToken bytes to confirm whether it's empty, truncated, or base64-malformed.
+
+
low
+
common#3081 BrokerDiscovery cache crash fix (shared encryption key with MSAL) +
Same WPJ/encryption surface; less likely root cause but worth ruling out.
+
+
+
+
🚚 Traffic attribution
+
Spread across all top callers in proportion to their request volume — no single calling-app traffic surge is responsible. Per-Outlook-request rate of no_tokens_found has risen consistently with the trend, ruling out traffic attribution.
- - -
+
-
kdfv2_key_derivation_error
-
Requests: 262 → 5,374  (+1,951%) · 57 devices
-
-
- Android system: Keystore - ⚡ ECS flight ramp - ⚡ AAD only - bursty (May 1, May 2) +
timed_out_execution
+
Devices: 18.0 M → 53.4 M over 8wks (peaked 143 M wk of Apr 12); WoW 80.6 M → 53.4 M (−33.7%) — recovering
+
60d regression · peak-then-recover · AppManager-heavy
-
- Verdict — Per-device retry storm. 99% of requests come from AAD accounts on a tiny pool (~57 devices). Two big bursts: 1,019 requests on May 1 and 3,026 requests on May 2, then dropped back to baseline. Looks like a small set of devices retrying KDFv2 derivation in a loop. Likely related to broker 16.0.1 crypto path. Action: check broker logs for those device IDs, may need a server-side flight to disable KDFv2 for these devices. -
-
-
-
Account type
-
-
-
-
-
-
-
- AAD 99% - UNKNOWN 1% -
-
-
-
Shared device
-
-
-
-
Personal 100%
-
-
-
Daily request count (last 13 days)
-
-
Spikes on Apr 30, May 1, May 4 — bursty, not sustained.
-
-
- +
Verdict: 60d regression with WoW recovery underway. Almost entirely on AcquireTokenSilent (99.9%). The peak at 143 M devices (wk of Apr 12) and the subsequent drop to 53.4 M are consistent with broker #141 (HTTP cancellation on ATS command-level timeout) being flight-rolled out and then partially gated back. AppManager (Link to Windows) is the dominant active broker (53% this week, down from 69% the prior week), and the most-affected calling apps are Outlook 32% / AppManager 25% / Teams 24%. Action: confirm the flight rollout schedule for #141 and check whether the timeout threshold needs tuning before re-enabling broadly. Watch for downstream client retry storms: the change converts a silent thread leak into an explicit error, so callers must retry cleanly.
+
Span
AcquireTokenSilent: 99.9%
+
+
ATISilently: 0.1%
+
+
AcquireTokenInteractive: 0.1%
+
Calling app
com.microsoft.office.outlook: 32.5%
+
+
com.microsoft.appmanager: 25.0%
+
+
com.microsoft.teams: 24.0%
+
+
com.microsoft.skydrive: 5.4%
+
Active broker
com.microsoft.appmanager: 52.9%
+
+
com.azure.authenticator: 35.4%
+
+
com.microsoft.windowsintune.companyportal: 11.7%
+
Broker version
16.0.1: 70.8%
+
+
15.1.0: 10.0%
+
+
14.2.0: 8.8%
+
+
other: 10.4%
+
Code attribution
-
-
Originator
-
Android system Keystore / SHA-256 provider on certain devices — wrapped by broker
+
+
high
+
broker#141 Add flight-gated HTTP cancellation on ATS command-level timeout to eliminate zombie worker threads (AB#3542516) +
This PR explicitly converts long-running ATS calls into timed_out_execution. The 60d trajectory is consistent with a flight rollout followed by a partial pull-back, pending confirmation of the rollout schedule. The reciprocal drop in legacy timed_out (−86%) supports the reclassification.
-
-
Top throw site
-
com.microsoft.identity.broker4j.broker.prt.SessionKeyJwtRequestSigner.getSignedJwt:118
-
-
-
Underlying cause
-
84% no_such_algorithm from ProviderFactory.getMessageDigest:123  ·  16% invalid_key from SP800108KeyGen$1.perform:112
-
-
-
Likely PRs
-
-
-
- 🔴 High -
- Server-side ECS rollout · UseKdfVersion2 flight ramp -
Not a code PR — telemetry pattern matches flight ramp (bursts on May 1: 1,019 reqs, May 2: 3,026 reqs)
-
Broker code shipped July 2025 (PR #3144). What changed this week is the flight ramp, not the code.
-
-
-
- 🟢 Low -
- broker #152 · Enable KDFv2 by default -
commit 0fe27f7ab · 2026-04-17 · ships in v16.1.0 (NOT yet rolled out)
-
Code change exists but isn't in production yet. Server-side flight is the active driver.
-
-
-
-
-
-
-
Next step
-
Check ECS dashboard for UseKdfVersion2 ramp on Apr 30 / May 1. Add try/catch fallback in SessionKeyJwtRequestSigner.getSignedJwt():117-122 to retry with KDFv1 on no_such_algorithm. Block-list affected device models from the flight.
-
-
-
-
- - -
-
-
-
SSLHandshakeException
-
Requests: 298k → 555k  (+97%) · only 233 devices
+
-
- Android system: Conscrypt - NOT new broker - legacy broker 13.3.2 - Teams IP-Phone DCF -
-
-
-
- Verdict — Same legacy device pool retrying more. 99% of requests come from broker 13.3.2 (legacy), all from com.microsoft.skype.teams.ipphone calling app, on the AcquireTokenDcfAuthRequest span (Device Code Flow). Same ~150 device pool — they're just retrying more. NOT caused by 16.0.1 rollout. Action: escalate to Teams IP-Phone team — they're on a 2+ year old broker that needs upgrading; their TLS path is failing. -
-
-
-
Broker version
-
13.3.2 (legacy)99%
-
-
13.9.11%
-
-
-
-
Active broker
-
com.azure.authenticator100%
-
-
-
-
Calling app
-
com.microsoft.skype.teams.ipphone99%
-
-
-
-
Span
-
AcquireTokenDcfAuthRequest99%
-
-
-
-
Client SKU
-
MSAL (Android)99%
-
-
-
-
Account type
-
-
-
-
UNKNOWN 100% (DCF pre-auth)
-
-
- -
-
Code attribution
-
-
Originator
-
Android system Conscrypt TLS implementation — broker is a passive consumer
-
-
-
Top throw site
-
com.android.org.conscrypt.SSLUtils.toSSLHandshakeException:363 (125k requests)  ·  ConscryptFileDescriptorSocket.startHandshake:231 (45k)
-
-
-
Underlying cause
-
99%+ CertificateException from TrustManagerImpl.verifyChain  ·  cert-chain rejection at TLS layer
-
-
-
Likely PRs
-
-
-
- ⚪ None -
- No PR in scope -
Broker code is not in the call stack at all
-
Broker version 13.3.2 (legacy, from 2024) is dominantly affected — far outside the 15.1.0 → 16.0.1 window. The growth reflects an existing fleet's environmental TLS issues, not a code regression.
-
-
-
-
-
-
-
Next step
-
Tag as environmental — track but do not page. Already known: escalate to Teams IP-Phone team to upgrade their fleet off broker 13.3.2.
-
+
+
🚚 Traffic attribution
+
AppManager (LTW) dropped from 55.5 M to 28.2 M devices (-49%) WoW while AppManager total request volume rose 2.9% — so this is NOT traffic-driven; the per-AppManager-request rate is what fell, consistent with a flight pull-back.
- - -
-
+
+
-
SSLPeerUnverifiedException
-
Requests: 104 → 3,346  (+3,117%) · only 24 devices
-
-
- Android system: okhttp - same root cause as SSLHandshake - legacy 13.3.2 + 13.9.1 +
unauthorized_client
+
Devices: 2.74 M → 3.37 M over 8wks (+22.8%); WoW 3.17 M → 3.37 M (+6.3%)
+
60d regression · Outlook+Teams concentrated · silent path
-
- Verdict — Same root cause as SSLHandshakeException. 95% of requests on broker 13.3.2 + 13.9.1 (legacy), 95% from com.microsoft.skype.teams.ipphone, all on AcquireTokenDcfAuthRequest. Probably the same TLS chain validation issue as above on the Teams IP-Phone fleet. Treat together with SSLHandshakeException. -
-
-
-
Broker version
-
13.3.262%
-
-
13.9.133%
-
-
-
-
Calling app
-
com.microsoft.skype.teams.ipphone95%
-
-
-
-
Span
-
AcquireTokenDcfAuthRequest95%
-
-
-
-
Account type
-
-
-
-
UNKNOWN 100%
-
-
- +
Verdict: Mild but consistent 60d climb, very likely traffic-attributed (not a broker bug). Calling-app concentration: 67% in Outlook+Teams alone, with the next callers all being Office apps (Excel 8%, Word 7%, SCMx 4%). Span: AcquireTokenSilent 90% / AcquireTokenInteractive 6%. Active broker shares mirror fleet share. The growth follows request-volume growth in Outlook/Teams (+2.1% and +3.9% WoW respectively, +12% over 60d). The most likely explanation is that some Outlook/Teams app registrations are gradually being marked unauthorized for specific resources/scopes by their first-party app owners, which is not a broker code issue. Action: sample 10 unauthorized_client correlation IDs from this week's Outlook traffic and check the eSTS error sub-code; route to Outlook + first-party app team if confirmed.
+
Span
AcquireTokenSilent: 90.3%
+
+
AcquireTokenInteractive: 5.8%
+
+
ATISilently: 3.6%
+
Calling app
com.microsoft.office.outlook: 34.7%
+
+
com.microsoft.teams: 32.6%
+
+
com.microsoft.office.excel: 8.0%
+
+
com.microsoft.office.word: 7.4%
+
Active broker
com.azure.authenticator: 43.3%
+
+
com.microsoft.windowsintune.companyportal: 36.9%
+
+
com.microsoft.appmanager: 19.8%
+
Broker version
16.0.1: 70.5%
+
+
15.1.0: 10.2%
+
+
14.2.0: 8.8%
+
+
other: 10.5%
+
Code attribution
-
-
Originator
-
Android system Bundled okhttp legacy stack — broker is a passive consumer
-
-
-
Top throw site
-
com.android.okhttp.internal.io.RealConnection.connectTls:205  88% of cases  ·  TLS hostname verification failure
-
-
-
Likely PRs
-
-
-
- ⚪ None -
- No PR in scope -
Same root cause class as SSLHandshakeException — Android system TLS
-
Treat together with SSLHandshakeException. Same legacy fleet (13.3.2 + 13.9.1).
-
-
-
-
+
+
none
+
(no PR) No broker code regression identified +
Mirrors fleet broker-version share; no version concentration. Most likely an app-registration / first-party-config drift on the eSTS side.
+
+
+
+
🚚 Traffic attribution
+
Strong traffic-attribution signal: 67% concentration in Outlook+Teams, both growing in request volume. See → 🚚 Traffic Attribution section for full analysis.
- - -
+
-
DeviceRegistrationException
-
Devices: 204 → 245  (+20%) · DRS-adjacent
-
-
- broker code: PR #87 - ⚡ broker 16.0.1 - ⚡ Authenticator - DeviceRegistrationIpc +
Code:-6
+
Devices: 33 K → 86 K over 8wks (+162%, peak 92 K wk of Apr 26); WoW 92.6 K → 86.4 K (−6.7%) — first WoW pullback
+
60d regression · interactive only · Intune-CP active broker
-
- Verdict — Likely tied to Authenticator 16.0.1 device registration path. 78% on broker 16.0.1 in com.azure.authenticator, 78% on the new DeviceRegistrationIpc span. Action: investigate Authenticator 16.0.1 device-registration IPC failures; may indicate regression in the new DRS protocol. -
-
-
-
Broker version
-
16.0.178%
-
-
15.1.013%
-
-
others9%
-
-
-
-
Active broker
-
com.azure.authenticator78%
-
-
com.microsoft.intune12%
-
-
-
-
Calling app
-
com.azure.authenticator78%
-
-
com.microsoft.intune15%
-
-
-
-
Span
-
DeviceRegistrationIpc78%
-
-
DeviceRegistrationApi22%
-
-
-
-
Account type
-
-
-
-
UNKNOWN 100% (pre-auth DRS)
-
-
- +
Verdict: Code:-6 (interactive auth canceled by user via system UI) jumped 2.7× starting wk of Apr 19. Span: AcquireTokenInteractive 76% / ATIInteractively 24%. Active broker concentration: 57% Intune Company Portal (vs ~38% fleet share). Calling-app spread: Outlook 30% / Teams 21% / Axis Bank Siddhi 21%. A 3rd-party banking app appearing as the #3 caller for an interactive-cancellation error suggests an MAM/Intune pop-up issue. Broker-version split: 38% on 15.1.0 (over-represented vs 10% fleet share) and 36% on 16.0.1. Action: investigate whether 15.1.0 introduced an interactive-cancellation path bug, or whether Intune CP is showing a new system dialog that users dismiss. Check broker/AADAuthenticator for changes to WebViewClient / interactive consent flows since wk of Apr 12.
+
Span
AcquireTokenInteractive: 75.8%
+
+
ATIInteractively: 24.0%
+
+
CertBasedAuth: 0.1%
+
Calling app
com.microsoft.office.outlook: 29.6%
+
+
com.microsoft.teams: 21.4%
+
+
com.axisbank.siddhi.v3: 21.3%
+
+
com.microsoft.windowsintune.companyportal: 4.7%
+
Active broker
com.microsoft.windowsintune.companyportal: 56.7%
+
+
com.azure.authenticator: 24.3%
+
+
com.microsoft.appmanager: 19.1%
+
Broker version
15.1.0: 37.8%
+
+
16.0.1: 35.5%
+
+
15.0.0: 13.0%
+
+
14.2.0: 4.7%
+
Code attribution
-
-
Originator
-
Broker code New IPC path added in 16.0.x release window
-
-
-
Top throw site
-
com.microsoft.workaccount.workplacejoin.protocol.AndroidDeviceRegistrationProtocolPacker.throwIfBundleContainsDeviceRegistrationException:226 (207 of 215 cases)
-
-
-
Caller hot-spots
-
GetRegistrationStateV0LegacyExecutor.execute:90 (84 dev)  ·  AndroidDeviceRegistrationClientController.execute:234 (47 dev)
-
-
-
Likely PRs
-
-
-
- 🔴 High -
- broker #87 · Update OpenTelemetry integration for Device Registration IPC in client -
commit 9db76c2fe · 2026-03-16 · author @pedroro
-
107-line diff to AndroidDeviceRegistrationClientController.execute(); line 234 (where 47 device errors land) was touched in this range. Also added the DeviceRegistrationIpc span where 78% of telemetry now lands.
-
-
-
- 🟡 Medium -
- broker #81 + common #2926 · Add BoundServiceStrategy as a DR API IPC fallback -
commit 74b33b4b9 · 2026-03-16 · author @pedroro
-
Added a new IPC strategy that may fail and bubble up. Slice by bound_service_status/content_provider_status attributes (added in #87) to isolate which strategy is failing.
-
-
-
-
-
-
-
Next step
-
Owner: @pedroro. Slice by bound_service_status vs content_provider_status attributes to identify which IPC strategy is failing.
+
+
low
+
(no PR) Investigate broker/AADAuthenticator WebViewClient changes between 15.1.0 and 16.0.1 +
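15.1.0 is over-represented vs fleet share (38% vs ~10%); 16.0.1 contributes 36%, which is below the ~70% it holds in the other cards' broker-version splits. Could not pinpoint a single PR via grep; this needs a targeted diff between the 15.1.0 and 16.0.1 release branches focused on interactive consent flows.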
+
+
+
+
🚚 Traffic attribution
+
Axis Bank Siddhi alone is 21% — a single 3rd-party app being a top contributor to a cancellation-style error is unusual. Check whether their app introduced an interactive AcquireToken call recently and whether their UX is causing user dismissal.
- - -
-
+
+
-
Code:-11
-
Devices: 952 → 2,404  (+153%)
-
-
- Android WebView: ERROR_FAILED_SSL_HANDSHAKE (-11) - environmental — no PR - Outlook + Teams IP-Phone - AAD-dominant +
null_pointer_error
+
Devices: 48 K → 71 K over 8wks (+46%); WoW 67 K → 71 K (+5.1%)
+
60d regression · silent path · LTW + Authenticator
-
- Verdict — Mixed legacy + new fleet, no single root cause. Top broker is legacy 13.3.2 (17 devices, 3.5k requests = retry storm) but the device count growth comes from 15.1.0 (914 dev) and 16.0.1 (372 dev). 99% AAD. Spread across many calling apps (Outlook leads with 60% of devices). Not a clean version regression — looks like two unrelated populations: legacy IP-phone retry storm + a slowly-growing baseline across newer brokers. Action: low priority; track WoW for stability. -
-
-
-
Broker version (by devices)
-
15.1.038%
-
-
16.0.115%
-
-
15.0.011%
-
-
other (40+ versions)36%
-
-
-
-
Calling app (by devices)
-
com.microsoft.office.outlook60%
-
-
com.microsoft.teams16%
-
-
com.axisbank.siddhi.v35%
-
-
-
-
Span
-
AcquireTokenInteractive96%
-
-
-
-
Account type
-
-
-
-
-
-
-
- AAD 96% - MSA 4% -
-
-
-
Shared device
-
-
-
-
-
-
-
- Personal 99% - SDM 1% -
-
-
- +
Verdict: Steady 60d climb in NPE crashes; small absolute volume, but the trajectory is worth flagging. Span: AcquireTokenSilent 98%. Active broker: AppManager (LTW) 53% / Authenticator 26% / Intune CP 20%. Calling app: AppManager 35% / Teams 28% / Outlook 21%. The over-representation of LTW (53% vs ~5–10% fleet share for that active broker) is a strong signal that there is a null path specific to the LTW broker process. Action: bucket by error_location / stack-trace prefix and route to LTW team. Broker #141 is a possible secondary contributor (timeout cancellation can interact with deferred work that holds a null reference).
+
Span
AcquireTokenSilent: 98.5%
+
+
DeviceRegistrationApi: 1.0%
+
+
AcquireTokenInteractive: 0.3%
+
Calling app
com.microsoft.appmanager: 35.4%
+
+
com.microsoft.teams: 28.5%
+
+
com.microsoft.office.outlook: 21.2%
+
+
com.microsoft.office.word: 2.8%
+
Active broker
com.microsoft.appmanager: 53.3%
+
+
com.azure.authenticator: 26.4%
+
+
com.microsoft.windowsintune.companyportal: 20.3%
+
Broker version
16.0.1: 70.0%
+
+
15.1.0: 10.5%
+
+
14.2.0: 9.0%
+
+
other: 10.5%
+
Code attribution
-
-
Originator
-
Android WebView ERROR_FAILED_SSL_HANDSHAKE = -11  ·  environmental enterprise TLS interception
-
-
-
Top throw site
-
com.microsoft.identity.common.internal.ui.webview.OAuth2WebViewClient.sendErrorToCallback wraps as new ClientException("Code:" + errorCode, ...)
-
-
-
Top error_messages
-
- 5,298× net::ERR_SSL_PROTOCOL_ERROR  ·  - 2,689× Zscaler-issued cert for login.live.com  ·  - many Zscaler/proxy certs for aadcdn.msftauth.net, aadgatewaymsit.msidentity.com, etc.  ·  - net::ERR_BAD_SSL_CLIENT_AUTH_CERT  ·  net::ERR_TUNNEL_CONNECTION_FAILED -
+
+
low
+
broker#141 HTTP cancellation on ATS timeout (LTW broker process) +
Active broker is heavily LTW (53% vs ~7% fleet share). The same ATS timeout cancellation path may be racing with a null-checked reference in a deferred callback. Needs stack-trace bucketing to confirm.
-
-
Likely PRs
-
-
-
- ⚪ None -
- No PR in scope -
No PR in 15.1.0 → 16.0.1 touches WebView TLS validation
-
Customer enterprise networks (Zscaler, Bombardier, Société Générale, Bank Gospodarstwa Krajowego, AXIS Bank, etc.) are doing TLS interception. Their proxy presents a cert WebView's validator rejects. Device-count growth (+153%) reflects the growing 16.0.1 fleet entering enterprise environments.
-
-
-
-
-
-
-
Next step
-
Tag as environmental — track but do not page. Long-term: detect Zscaler-style proxy and surface a clearer user-facing error, OR ship a flight to honor user-installed CA store (security trade-off).
+
+
none
+
(no PR) Awaiting crash bucket by error_location +
Cannot pinpoint specific PR without stacktrace breakdown.
+
+
+
+
🚚 Traffic attribution
+
AppManager (LTW) request volume rose only +2.9% WoW while NPE devices from LTW grew +12% — per-LTW-request NPE rate is rising. NOT traffic-driven.
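The traffic-attribution reasoning on these cards compares how fast an error grew with how fast the caller's request volume grew. A minimal sketch of that comparison follows; `trafficAttribution` and its 5-percentage-point tolerance are assumptions made for illustration, not values defined by the skill. The inputs are WoW figures quoted on the cards: NPE devices from LTW +12% against +2.9% request volume, and timed_out_execution devices on AppManager −49% against the same +2.9%.

```javascript
// Hypothetical check: an error movement is treated as "traffic-driven" when
// its growth stays within a tolerance of the caller's request-volume growth.
// Assumption: the 5 pp default tolerance is illustrative, not a tuned threshold.
function trafficAttribution(errorGrowthPct, requestGrowthPct, tolerancePp = 5) {
  const excessPp = errorGrowthPct - requestGrowthPct; // growth not explained by traffic
  return { excessPp, trafficDriven: Math.abs(excessPp) <= tolerancePp };
}

// null_pointer_error on LTW: devices +12% WoW, request volume +2.9% WoW.
const npeOnLtw = trafficAttribution(12, 2.9);

// timed_out_execution on AppManager: devices -49% WoW, request volume +2.9% WoW.
const timedOutOnLtw = trafficAttribution(-49, 2.9);

console.log(npeOnLtw, timedOutOnLtw);
```

Both cases fall outside the tolerance, which agrees with the cards' conclusion that neither movement is traffic-driven: the per-request rate changed, not the volume.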
- - -
-
+
+
-
Code:-10
-
Devices: 62 → 162  (+161%)
-
-
- Android WebView: ERROR_UNSUPPORTED_SCHEME (-10) - ⚡ PR #3013 openid-vc - Outlook + msapps - 100% AAD · 100% OneAuth +
unknown_crypto_error
+
Devices: 64 K → 78 K over 8wks (+23%); WoW 76.3 K → 78.4 K (+2.7%)
+
60d regression · keystore / TEE · pre-auth flow
-
- Verdict — Tied to newer brokers + Outlook/msapps. 50% of devices on broker 16.0.1, 32% on 15.1.0. 63% calling app = com.microsoft.office.outlook, 25% = com.microsoft.msapps. 100% MSAL_CPP/OneAuth, 100% AAD. Action: investigate AcquireTokenInteractive failure path on Outlook + OneAuth on the latest brokers. -
-
-
-
Broker version
-
16.0.150%
-
-
15.1.032%
-
-
14.0.210%
-
-
-
-
Calling app
-
com.microsoft.office.outlook63%
+
Verdict — Slow-burn keystore failure, dominated by device-registration / WPJ paths. Span: KeyPairGeneration 55% / SecretKeyGeneration 45% — both keystore-bound, indicating TEE / hardware keystore issues at first-key-generation time. Active broker: 63% Authenticator / 31% AppManager / 6% Intune CP. Calling app is blank for ~100% of these (consistent with pre-authentication flows like DRS/WPJ where no caller is yet attached). Action: slice by DeviceInfo_OsVersion and OEM (Samsung/Pixel/Xiaomi/Huawei) on raw android_spans; this kind of growth typically maps to a specific OEM/Android-version combo (StrongBox-backed keystore quirks).
+
Span
KeyPairGeneration: 54.5%
+
+
SecretKeyGeneration: 45.3%
+
+
SecretKeyWrapping: 0.1%
+
+
SecretKeyRetrieval: 0.0%
+
Calling app
(blank, pre-auth): 100.0%
+
Active broker
com.azure.authenticator: 63.0%
-
com.microsoft.msapps25%
-
-
-
-
Span
-
AcquireTokenInteractive94%
-
-
-
-
Client SKU
-
MSAL_CPP (OneAuth)84%
-
-
ADAL7%
-
-
-
-
Account type
-
-
-
-
-
-
-
- AAD 93% - UNKNOWN 7% -
-
-
- +
com.microsoft.appmanager: 31.0%
+
+
com.microsoft.windowsintune.companyportal: 5.9%
+
Broker version
16.0.1: 71.0%
+
+
15.1.0: 10.0%
+
+
14.2.0: 9.0%
+
+
other: 10.0%
+
Code attribution
-
-
Originator
-
Android WebView ERROR_UNSUPPORTED_SCHEME = -10  ·  WebView received a custom-scheme redirect URL it can't handle
-
-
-
Top throw site
-
ExceptionAdapter.exceptionFromAuthorizationResult:146 (250 of 286 cases) wraps WebView's ERROR_UNSUPPORTED_SCHEME as ClientException("Code:-10", ...)
-
-
-
Top error_message
-
100% net::ERR_UNKNOWN_URL_SCHEME (286/286)
-
-
-
Likely PRs
-
-
-
- 🔴 High -
- common #3013 · Handle openid-vc urls in webview -
commit 5d30739ca · 2026-03-13 · author @somalaya
-
Introduces handling for new openid-vc:// redirect scheme. If the device's Authenticator/wallet doesn't claim the scheme, WebView throws ERROR_UNSUPPORTED_SCHEME. Author even added a kill-switch flight in #3037 "in case something goes wrong" — exactly this scenario.
-
-
-
- 🟡 Medium -
- common #3037 · Set default value for openid-vc flight in webview redirect -
commit 6a4258589 · 2026-03-19 · author @somalaya
-
Companion to #3013 — set the default-on value. If default-on, this is what triggers the spike. Disabling this flight should mitigate.
-
-
-
-
-
-
-
Next step
-
Disable ENABLE_OPENID_VC_HANDLING_IN_WEBVIEW_REDIRECT flight for the affected slice (Outlook + msapps + 16.0.1) and verify spike subsides. Owner: @somalaya / Sowmya Malayanur.
-
-
-
-
- - -
-
-
-
Top 3 device-share regressions: io_error · no_account_found · invalid_grant
-
Combined attribution view — flat raw counts but +58–66% device-share growth
-
-
- broker 16.0.1 dominant - all account types - non-shared devices -
-
-
-
- Verdict — Likely a denominator effect, not a true reliability regression. Raw weekly request counts for all three errors are essentially flat over the last 60 days (see 60-day trend section below — io_error, no_account_found, and invalid_grant all sit on a flat-to-slightly-down trajectory). Yet the per-device % jumped +58–66% week-over-week. The most plausible explanation is a shift in the active-device denominator coinciding with the 16.0.0 → 16.0.1 rollout (cohort change in which devices report, not new spans being emitted). Earlier draft incorrectly stated 16.0.0 emitted "auxiliary spans" — verification of the relevant 16.0.0 commits did not substantiate that; that claim is retracted. Action: before paging, rerun with a stabilized denominator (e.g. devices that made ≥1 ATS/ATI request that week) and compare to the 60-day trend in the next section. -
-
-
-
invalid_grant — broker (by req)
-
16.0.167%
-
-
14.2.015%
-
-
15.1.07%
-
-
-
-
invalid_grant — account type (by req)
-
-
-
-
-
-
-
-
- AAD 72% - MSA 25% - UNK 3% -
+
+
none
+
(no PR) No broker PR identified — likely OEM/Android-version-specific keystore behavior +
KeyPairGeneration + SecretKeyGeneration concentration points to TEE/StrongBox keystore. Common Android quirks: Samsung Knox vault provisioning, Xiaomi/Huawei custom keystore HALs.
-
-
io_error — account type (by req)
-
-
-
-
-
-
-
-
- AAD 72% - UNK 19% - MSA 9% -
-
-
-
no_account_found — account type
-
-
-
-
-
-
-
-
- MSA 74% - AAD 24% - UNK 2% -
-
-
-
All three — shared device mode
-
-
-
-
- Personal 99.9% - SDM 0.1% -
-
Errors are essentially absent from SDM — this is a personal-device pattern.
-
-
-
-
- -
- - - -

Error codes — WoW with stable denominator — %dev = devices-hit / auth-active devices

- -
- Methodology change: denominator is now SilentAuthStats ∪ InteractiveAuthStats device count (190 M, flat WoW), - not BrokerAdoptionStats (572 M → 353 M, contaminated by PR #88). - Errors below ranked by real change in device share. - Δpp = absolute change (percentage points). - Δrel% = relative change. -
- -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Error code | Status | Devices now | Devices prev | %dev now | %dev prev | Δpp dev | Δrel% dev | Where defined
timed_out_execution | ▼ Real win | 24.3 M | 41.1 M | 12.89% | 21.43% | −8.54 | −39.8% | broker CommandDispatcher.java:358
unknown_authority | ▼ Real win | 948 k | 7.39 M | 0.50% | 3.85% | −3.35 | −87.0% | broker Authority.java:420
io_error | ⚪ Flat | 73.7 M | 73.6 M | 39.03% | 38.42% | +0.61 | +1.6% | broker ConnectionError
no_account_found | ⚪ Flat | 59.9 M | 61.3 M | 31.74% | 31.99% | −0.25 | −0.8% | broker account cache lookup
invalid_grant | ⚪ Flat | 35.0 M | 35.2 M | 18.58% | 18.36% | +0.23 | +1.2% | eSTS server-returned
no_tokens_found | ⚪ Flat | 4.31 M | 4.29 M | 2.28% | 2.24% | +0.04 | +1.9% | broker token cache
null_object | ⚪ Flat | 3.78 M | 4.13 M | 2.00% | 2.16% | −0.15 | −7.1% | broker nullable utils
illegal_argument_exception | ▼ Real win | 140 k | 428 k | 0.07% | 0.22% | −0.15 | −66.7% | JDK via OnUpgradeReceiver
timed_out | ▼ Mild win | 2.14 M | 2.44 M | 1.13% | 1.27% | −0.14 | −11.0% | broker dispatcher
429 (eSTS throttle) | ▼ Real win | 2,527 | 142 k | ~0% | 0.07% | −0.07 | −98.2% | eSTS throttle response
invalid_resource | ▲ Real regression | 494 k | 417 k | 0.26% | 0.22% | +0.04 | +20.3% | eSTS server-returned
device_network_not_available_doze_mode | ⚪ Flat | 1.99 M | 2.06 M | 1.05% | 1.07% | −0.02 | −1.9% | Android doze mode
User cancelled | ⚪ Flat | 1.74 M | 1.85 M | 0.92% | 0.96% | −0.04 | −4.2% | broker UI cancel
interaction_required | ⚪ Flat | 1.73 M | 1.83 M | 0.92% | 0.95% | −0.03 | −3.6% | eSTS
unauthorized_client | ⚪ Flat | 1.16 M | 1.18 M | 0.61% | 0.62% | ~0 | −0.5% | eSTS
null_pointer_error | ⚪ Flat | 57.9 k | 62.2 k | 0.03% | 0.03% | ~0 | −5.2% | broker NPE wrapper
Failed to parse JWT | ▲ Spike (small) | 2,709 | 633 | 0.0014% | 0.0003% | +0.001 | +367% | Nimbus JWT via IDToken
Code:-11 | ▲ Spike (small) | 2,329 | 905 | 0.0012% | 0.0005% | +0.001 | +140% | Android WebView
-
- -

Code attribution — real movers only

- -
- -
-
-
-
▼ unknown_authority
-
Devices: 7.39 M → 948 k  (−87%)
-
-
- broker code - all flows -
-
-
-
- Verdict — Direct fix from Bleu cloud PR. Discovery-failure path now falls back to hardcoded authority list instead of immediately returning unknown_authority. Sovereign clouds (Bleu/Delos/SovSG) also pre-seeded into cache, eliminating most discovery roundtrips. +
-
-
-
Throw site
-
com.microsoft.identity.common.java.authorities.Authority.getKnownAuthorityResult():420
-
-
-
Likely PR
-
-
-
- 🔴 High -
- common #3027 · [Common] Bleu cloud support -
commit 69f5e5abf · 2026-03-20 · author Mohit Chandwani
-
PR description literally says: "Authority recognition: getKnownAuthorityResult() now wraps discovery in try-catch — if discovery fails, it still checks hardcoded metadata and developer configuration instead of immediately returning 'unknown authority'." Source code at line 420 confirms this exact behavior. Trend is monotonic decline starting around 16.0.x rollout dates.
-
-
-
-
-
+
+
🚚 Traffic attribution
+
Pre-authentication flow with no calling app — traffic attribution does not apply.
- -
-
-
-
▼ timed_out_execution
-
Devices: 41.1 M → 24.3 M  (−40%)
-
-
- broker code - silent auth dispatcher -
-
-
-
- Verdict — Likely tied to skip-account-aggregation flight. Thrown when CommandDispatcher's silent thread-pool task exceeds the timeout. Per-broker-version slice shows the drop concentrated on 16.0.x — matches the broker's SkipAccountAggregation flight which removes the largest source of slow paths. -
-
-
-
Throw site
-
com.microsoft.identity.common.java.controllers.CommandDispatcher:358
-
-
-
Likely PRs
-
-
-
- 🔴 High -
- broker #91 · Skip getCachedRecordToReturn execution when skip_account_aggregation flight is enabled -
commit ddcc073d1 · 2026-03 · removes a major slow path inside ATS
-
Eliminates a redundant cached-record retrieval that often timed out under contention. Together with the broker dispatcher latency wins (p99 −20%), this directly reduces timed_out_execution.
-
-
-
- 🟡 Medium -
- common #2910 · Remove Lru cache + few optimizations -
commit 68f001df6
-
Removed lock contention on a shared LRU cache that was a known timeout culprit.
-
-
-
-
-
-
-
-
-
-
-
▼ illegal_argument_exception / ArgumentException
-
Devices: 428 k → 140 k  (−67%)
-
-
- side-effect of PR #88 - OnUpgradeReceiver -
-
-
-
- Verdict — Side-effect of PR #88 (OnUpgradeReceiver async). Per-span breakdown: 97,730 of 140,200 devices (70%) hit this on OnUpgradeReceiver span. That span is no longer firing reliably on 16.0.x devices, so the IAE thrown inside it (likely a Keystore parameter validation in the keystore creation path the PR was trying to defer) is also no longer being captured. Real user impact unchanged. -
-
-
-
Top span affected
-
OnUpgradeReceiver (97 k of 140 k devices = 70%)  ·  AcquireTokenSilent (42 k = 30%)
-
-
-
Likely PR
-
-
-
- 🔴 High -
- broker #88 · Make OnUpgradeReceiver operations asynchronous -
commit 14905a3ed · 2026-03-16 · OPPO GPU-overload fix
-
Wraps OnUpgradeReceiver work in goAsync() + CoroutineScope(Dispatchers.IO).launch. The receiver now completes before the async block, and the block itself can be killed mid-execution by the OS — so its IAEs (and the OnUpgradeReceiver span itself) stop being emitted. This is a telemetry side-effect, not a real fix for the IAE.
-
-
-
-
-
-
-
-
+ +

🚚 Traffic Attribution — spikes explained by calling-app traffic, not code

-
-
-
▲ invalid_resource
-
Devices: 417 k → 494 k  (+20%)
-
-
- eSTS server-side - Outlook + Teams concentrated - broker 16.0.1 dominant -
-
-
-
- Verdict — Server-side error, not a broker code change. The string invalid_resource is not defined in our broker/common code (no constant, no emit site). It's an eSTS error response passed straight through ServiceException. Concentration: 69% Outlook devices, 19% Teams; 70% on broker 16.0.1, 17% on 15.1.0. Possible explanations: (a) eSTS rejected a resource ID Outlook started sending after a server config change, (b) tdbr claim routing change in common #2679 sending requests to wrong region, (c) Outlook client started requesting a not-yet-deployed resource. -
-
-
-
Originator
-
eSTS server Returned to broker as OAuth error_code in token response, wrapped as ServiceException("invalid_resource", ...)
-
-
-
Top calling apps
-
com.microsoft.office.outlook (340 k devices, 69%) · com.microsoft.teams (95 k, 19%) · com.microsoft.emmx (18 k, 4%)
-
-
-
Top broker version
-
16.0.1 (348 k devices, 70%) · 15.1.0 (85 k, 17%) · 15.0.0 (17 k, 3%)
-
-
-
Likely PRs
-
-
-
- 🟡 Medium -
- common #2679 + broker #94 · Use tdbr claim to route telemetry traffic to EU region -
commit cc81b43e2 · 2026-03
-
Despite the title saying "telemetry traffic," this PR set introduces tdbr-based routing logic. If a request is routed to the wrong eSTS regional endpoint, that endpoint may not recognize the resource → invalid_resource. Worth checking the routing decision logs.
-
-
-
- 🟢 Low -
- eSTS-side change · Server config or Outlook client API change -
No broker PR — escalate to eSTS / Outlook team
-
If broker routing is correct, this is either an eSTS-side issue or an Outlook client that has started requesting a resource ID eSTS doesn't recognize.
-
-
-
-
-
-
-
Next step
-
Pull 5-10 correlation IDs from Outlook devices hitting this and check eSTS logs for the actual rejected resource ID. Owner: Outlook + eSTS teams.
-
-
-
-
- -
- - -

Error types — WoW with stable denominator

- -
- - - - - - - - - - - - - - - - - - - - - - - -
Error type | Status | Devices now | Devices prev | %dev now | %dev prev | Δpp dev | Δrel% dev
ClientException | ▼ Real win | 83.9 M | 95.0 M | 44.47% | 49.55% | −5.08 | −10.2%
ArgumentException | ▼ Real win | 140 k | 428 k | 0.07% | 0.22% | −0.15 | −66.7%
UiRequiredException | ⚪ Flat | 93.9 M | 95.4 M | 49.73% | 49.75% | ~0 | 0.0%
ServiceException | ▼ Mild win | 1.59 M | 1.73 M | 0.84% | 0.90% | −0.06 | −6.6%
UserCancelException | ⚪ Flat | 1.74 M | 1.85 M | 0.92% | 0.96% | −0.04 | −4.2%
IntuneAppProtectionPolicyRequiredException | ⚪ Flat | 1.12 M | 1.14 M | 0.59% | 0.59% | ~0 | +0.1%
CreateCredentialCancellationException | ▼ Mild win | 122 k | 140 k | 0.06% | 0.07% | −0.01 | −11.2%
SSLHandshakeException | ▲ Spike (small) | 216 | 303 | ~0% | ~0% | ~0 | request volume +97%
-
- -
-
+
-
▼ ClientException — root cause of the WoW improvement
-
Devices: 95.0 M → 83.9 M  (−10.2%)
+
unauthorized_client
+
Classification: traffic-attributed — not a broker code regression
+
🚚 traffic-driven
-
- Verdict — Composite improvement, dominated by two wins. ClientException is the umbrella type for non-server errors thrown by broker. Three sub-codes drive most of the −5.08 pp drop: -
    -
  • timed_out_execution−8.5 pp alone (the dominant component) — tied to broker #91 (skip account aggregation)
  • -
  • unknown_authority−3.4 pp — tied to common #3027 (Bleu cloud)
  • -
  • illegal_argument_exception−0.15 pp — side-effect of broker #88 (OnUpgradeReceiver async)
  • -
- Other sub-codes are flat. This is a real, user-visible reliability improvement. +
+ unauthorized_client +6.3% devices WoW (and +22.8% over 60d) is concentrated in Outlook (35%) + Teams (33%) — combined 67% — both of which grew in request volume +2.1% and +3.9% WoW respectively, and ~12% over 60d. Per-Outlook-request and per-Teams-request unauthorized_client rates are essentially flat. This means the spike is being driven by these apps issuing more requests (some of which were always going to fail with this error code), not by broker code regressing. No broker code change is implicated. Route to Outlook + Teams app-registration owners on the eSTS side.
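The reasoning above amounts to comparing per-request error rates rather than raw counts. A rough sketch of that check follows; the function name, the input shape, and the 5% tolerance are illustrative assumptions, not values taken from the skill or the dashboard, and the sample numbers are made up.

```javascript
// Hypothetical classifier: is an error-count increase explained by the
// calling app simply issuing more requests?
// `tolerancePct` is an assumed noise band, not a documented threshold.
function classifySpike({ errsNow, errsPrior, reqsNow, reqsPrior }, tolerancePct = 5) {
  if (!reqsNow || !reqsPrior || !errsPrior) {
    return { rateChangePct: null, label: 'insufficient data' };
  }
  const rateNow = errsNow / reqsNow;       // errors per request, this week
  const ratePrior = errsPrior / reqsPrior; // errors per request, prior week
  const rateChangePct = ((rateNow - ratePrior) / ratePrior) * 100;
  let label;
  if (Math.abs(rateChangePct) <= tolerancePct) {
    label = 'traffic-driven';              // rate is flat, so volume explains the increase
  } else {
    label = rateChangePct > 0 ? 'possible regression' : 'improvement';
  }
  return { rateChangePct, label };
}

// Made-up inputs: errors and requests both up 3%, then errors up 50% on flat traffic.
const flat = classifySpike({ errsNow: 1030, errsPrior: 1000, reqsNow: 103000, reqsPrior: 100000 });
const worse = classifySpike({ errsNow: 1500, errsPrior: 1000, reqsNow: 100000, reqsPrior: 100000 });
console.log(flat.label, worse.label); // prints "traffic-driven possible regression"
```

Running the same comparison per calling app is what separates "the app sent more requests" from "the broker started failing more often".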
- + +

Error codes — WoW with stable (auth-only) denominator

+
+ + + + + + + + + + + + + + + + + + + + +
Error code | Status | Devices now | Devices prior | Δ devices | 60d sparkline
no_tokens_found | ▲ 60d regression | 23.73 M | 22.91 M | +3.6% |
unauthorized_client | ▲ 60d regression | 3.37 M | 3.17 M | +6.3% |
unknown_crypto_error | ▲ 60d regression | 78.4 k | 76.3 k | +2.8% |
null_pointer_error | ▲ 60d regression | 70.7 k | 67.3 k | +5.1% |
Code:-6 | ⚪ Flat | 86.4 k | 92.6 k | -6.6% |
timed_out_execution | ▼ Win | 53.40 M | 80.63 M | -33.8% |
unknown_authority | ▼ Win | 1.45 M | 9.00 M | -83.9% |
429 | ▼ Win | 2.5 k | 143.9 k | -98.3% |
io_error | ⚪ Flat | 458.49 M | 439.68 M | +4.3% |
no_account_found | ⚪ Flat | 305.05 M | 311.77 M | -2.2% |
invalid_grant | ⚪ Flat | 144.07 M | 140.76 M | +2.4% |
device_network_not_available_doze_mode | ⚪ Flat | 6.31 M | 6.28 M | +0.5% |
interaction_required | ⚪ Flat | 5.23 M | 5.12 M | +2.1% |
User cancelled | ⚪ Flat | 3.19 M | 3.08 M | +3.5% |
auth_cancelled_by_sdk | ⚪ Flat | 1.46 M | 1.42 M | +2.9% |
invalid_resource | ⚠ Watch | 1.31 M | 1.05 M | +24.4% |
invalid_request | ⚪ Flat | 1.28 M | 1.25 M | +2.3% |
authorization_pending | ⚪ Flat | 175.2 k | 170.0 k | +3.0% |
expired_token | ⚪ Flat | 111.0 k | 111.9 k | -0.7% |
Failed to parse JWT | ⚠ Watch | 3.5 k | 895 | +288.0% |
+ + +

Error types — WoW with stable denominator

+
+ + + + + + + + + + + + + + + + + + +
Error type | Devices now | Devices prior | Δ devices %
ClientException | 536.81 M | 552.79 M | -2.9%
UiRequiredException | 474.82 M | 477.28 M | -0.5%
ServiceException | 3.88 M | 3.73 M | +4.0%
IntuneAppProtectionPolicyRequiredException | 3.22 M | 3.03 M | +6.2%
UserCancelException | 3.20 M | 3.09 M | +3.5%
ArgumentException | 190.4 k | 464.2 k | -59.0%
CreateCredentialCancellationException | 136.9 k | 141.1 k | -2.9%
GetCredentialCancellationException | 116.1 k | 115.5 k | +0.5%
BrokerCommunicationException | 73.0 k | 70.4 k | +3.7%
DeviceRegistrationRequiredException | 64.5 k | 59.8 k | +8.0%
CreatePublicKeyCredentialDomException | 38.1 k | 42.4 k | -10.3%
JobCancellationException | 32.8 k | 31.9 k | +2.8%
UnknownHostException | 19.6 k | 20.4 k | -4.0%
NullPointerException | 12.2 k | 11.9 k | +2.3%
JsonSyntaxException | 6.1 k | 7.0 k | -12.5%
SocketException | 1.5 k | 1.6 k | -7.8%
CreateCredentialUnknownException | 969 | 1.2 k | -19.0%
TimeoutCancellationException | 916 | 1.5 k | -37.2%
+ +

📊 Traffic analysis

- Per-flow request and device counts — what's actually moving in user-visible traffic. + Total auth requests/devices, top calling apps, top spans, requests-per-device, sampling-change check.
-
-
Silent requests
-
10.37 B
-
−0.6% WoW (flat)
-
-
-
-
Silent unique devices
-
190.1 M
-
−0.7% WoW (flat)
-
-
-
-
Interactive requests
-
9.84 M
-
−1.0% WoW (flat)
-
-
-
-
Interactive unique devices
-
6.34 M
-
−1.8% WoW (flat)
-
-
-
- -

Top calling apps — every app slightly down in requests, devices stable

- -
- - - - - - - - - - - - - - - - - - - - - - - - - - -
Calling app | Requests now | Requests prev | Δreq | Δreq % | Devices now | Devices prev | Δdev % | Note
com.microsoft.office.outlook | 2.88 B | 3.18 B | −301 M | −9.5% | 88.9 M | 90.2 M | −1.4% | fewer requests/device → cache efficiency
com.microsoft.appmanager | 2.36 B | 2.53 B | −168 M | −6.6% | 52.9 M | 53.5 M | −1.0% | same
com.microsoft.teams | 1.57 B | 1.73 B | −160 M | −9.3% | 46.4 M | 47.4 M | −2.1% | same
com.microsoft.skype.teams.ipphone | 536 M | 605 M | −69 M | −11.4% | 1.63 M | 1.72 M | −4.9% | IP-Phone fleet declining
com.microsoft.skydrive | 621 M | 686 M | −65 M | −9.5% | 62.8 M | 65.6 M | −4.3% | same
com.samsung.android.email.provider | 142 M | 181 M | −39 M | −21.5% | 738 k | 746 k | −1.0% | biggest req drop, devs flat
com.microsoft.office.word | 375 M | 397 M | −23 M | −5.7% | 15.3 M | 15.6 M | −1.9% | same
com.microsoft.emmx | 142 M | 159 M | −17 M | −10.6% | 5.69 M | 6.09 M | −6.6% | same
com.microsoft.office.officehubrow | 220 M | 233 M | −13 M | −5.6% | 18.4 M | 19.9 M | −7.7% | same
com.microsoft.office.excel | 249 M | 262 M | −13 M | −5.0% | 10.8 M | 10.9 M | −1.4% | same
-
- -

What's moving inside the broker (top spans by absolute drop)

- -
- - - - - - - - - - - - - - - - - - - - -
Span | Count now | Count prev | Δabsolute | Δrel% | Note
OnUpgradeReceiver | 142 M | 651 M | −509 M | −78% | broker #88: goAsync() makes broadcast complete before async work; OS may kill before span flushes
WrappedKeyAlgorithmIdentifier | 84 M | 176 M | −91 M | −52% | Downstream of fewer keystore ops in OnUpgradeReceiver path
SecretKeyWrapping | 201 M | 260 M | −59 M | −23% | Same downstream cause
DeviceRegistrationApi | 570 M | 591 M | −22 M | −4% | Flat (within noise)
AcquireTokenSilent | 10.13 B | 10.31 B | −176 M | −1.7% | Flat (real auth)
BrokerOperationRequestDispatcher | 337 M | 345 M | −9 M | −2.5% | Flat
AcquireTokenDcfAuthRequest | 4.7 M | 7.7 M | −3 M | −38% | Tied to Teams IP-Phone fleet decline
-
- -
-
📌 Traffic-attribution verdict
-

No real traffic surge or collapse. The headline "38% drop in all-spans devices" is entirely explained by broker PR #88 (~509 M lost OnUpgradeReceiver events/wk). The uniform 5–22% per-app request decline with stable device counts is consistent with caching/efficiency gains rather than traffic loss; recommended next step is to check is_serviced_from_cache rate WoW to confirm.

+
Total broker requests (BrokerAdoptionStats)
12.79 B
−1.1% WoW · 60d −24%
+
Total broker devices (BrokerAdoptionStats)
1.24 B
−18.6% WoW · ⚠ denominator artifact (see top callout)
+
Auth-only requests (Silent + Interactive)
10.59 B
+2.4% WoW · 60d +1.0%
+
Auth-only devices
1.54 B
−1.2% WoW (real fleet flat)
+
Requests / device (silent)
6.92
+3.6% WoW (more requests per dev)
+
Sampling change indicator
⚪ Stable
All-spans dropped >20%, but auth-only <5% — confirms OnUpgradeReceiver taper, not sampling change
- -

Latency — ms, p50/p95/p99 by span

-
- - - - - - - - - - - - - - - - - - - -
Span | p50 now | p50 prev | p95 now | p95 prev | p99 now | p99 prev | Δrel% p99 | p99 trend (13 days)
RefreshPrt | 345 | 347 | 902 | 942 | 2,680 | 5,344 | −50% |
AcquireAtUsingPrt | 606 | 615 | 1,885 | 2,013 | 12,654 | 24,627 | −49% |
BrokerOperationRequestDispatcher | 38 | 39 | 1,919 | 1,959 | 6,700 | 8,397 | −20% |
AcquireTokenSilent | 480 | 482 | 4,830 | 5,768 | 30,149 | 30,467 | −1% (p95: −16%) |
DeviceRegistrationApi | 188 | 191 | 1,462 | 1,436 | 3,501 | 3,442 | +2% |
GetAccounts | 459 | 445 | 4,602 | 4,418 | 12,354 | 11,838 | +4% |
-
- - -

Broker version adoption — device share, last 13 days

-
-
-
com.microsoft.appmanager (Link to Windows)
-
16.0.0 → 16.0.1 rollover in progress
- -
-
16.0.0 (deprecated)
-
16.0.1 (current)
-
15.0.0 (legacy)
-
-
-
-
com.azure.authenticator
-
Authenticator broker version migration
- -
-
16.0.1
-
15.1.0
-
16.0.0
-
15.0.0
-
-
-
- - +

Top calling apps

+
+ + + + + + + + + + + + + + +
Calling app | Requests now | Requests prior | Δ requests % | Devices now | Devices prior | Δ devices %
com.microsoft.office.outlook | 3.24 B | 3.18 B | +2.1% | 458.26 M | 454.87 M | +0.7%
com.microsoft.appmanager | 2.64 B | 2.56 B | +2.9% | 282.92 M | 284.00 M | -0.4%
com.microsoft.teams | 1.79 B | 1.72 B | +3.9% | 236.82 M | 234.55 M | +1.0%
com.microsoft.skydrive | 691.12 M | 691.85 M | -0.1% | 200.50 M | 199.29 M | +0.6%
com.microsoft.skype.teams.ipphone | 599.97 M | 597.66 M | +0.4% | 9.03 M | 9.28 M | -2.7%
com.microsoft.office.word | 419.64 M | 400.69 M | +4.7% | 63.26 M | 61.29 M | +3.2%
com.microsoft.office.excel | 279.21 M | 264.73 M | +5.5% | 44.64 M | 42.86 M | +4.2%
com.microsoft.office.officehubrow | 247.83 M | 233.35 M | +6.2% | 38.71 M | 37.31 M | +3.8%
com.microsoft.emmx | 158.68 M | 158.54 M | +0.1% | 16.83 M | 17.10 M | -1.6%
com.samsung.android.email.provider | 151.42 M | 178.53 M | -15.2% | 4.20 M | 4.16 M | +0.9%
com.microsoft.scmx | 93.59 M | 94.15 M | -0.6% | 9.68 M | 9.74 M | -0.7%
com.microsoft.office.powerpoint | 84.03 M | 79.37 M | +5.9% | 14.91 M | 14.17 M | +5.2%
com.microsoft.windowsintune.companyportal | 65.18 M | 63.85 M | +2.1% | 35.32 M | 34.80 M | +1.5%
com.microsoft.sharepoint | 12.86 M | 12.75 M | +0.9% | 5.40 M | 5.39 M | +0.1%
+ +

Top spans by request volume

+
+ + + + + + + + + + + + + + + +
Span | Count now | Count prior | Δ % | Note
AcquireTokenSilent | 10.59 B | 10.35 B | +2.3% |
DeviceRegistrationApi | 598.70 M | 578.30 M | +3.5% |
AcquireTokenDcfFetchToken | 365.00 M | 364.00 M | +0.3% |
BrokerOperationRequestDispatcher | 349.10 M | 344.30 M | +1.4% |
SecretKeyWrapping | 251.00 M | 329.10 M | -23.7% | Downstream of OnUpgradeReceiver drop
OnUpgradeReceiver | 151.20 M | 438.20 M | -65.5% | Denominator culprit: natural taper as 16.0.1 rollout completes; may also be amplified by goAsync() effects
SecretKeyRetrieval | 113.40 M | 111.50 M | +1.7% |
WrappedKeyAlgorithmIdentifier | 87.10 M | 134.70 M | -35.3% | Downstream of OnUpgradeReceiver drop
RefreshTransferToken | 78.80 M | 74.80 M | +5.3% |
EcsFlightsFetchConfigs | 47.30 M | 47.70 M | -0.8% |
AcquireAtUsingPrt | 38.80 M | 37.10 M | +4.6% |
Passthrough | 23.90 M | 24.20 M | -1.2% |
RefreshPrt | 20.70 M | 19.40 M | +6.7% |
AccountStorageWithBackup | 11.40 M | 11.60 M | -1.7% |
AcquireTokenInteractive | 10.30 M | 9.60 M | +7.3% | Up; matches +7.5% interactive auth growth
+ + +

Latency — ms, p50/p95/p99 by hot span

+
+ + + + +
Span | p50 now | p50 prior | p95 now | p95 prior | p99 now | p99 prior | Δ p99 %
AcquireTokenSilent | 1185 | 1204 | 5916 | 5961 | 13758 | 13712 | +0.3%
GetAccounts | 452 | 448 | 4490 | 4401 | 11801 | 11784 | +0.1%
ProcessWebsiteRequest | 16 | 16 | 76 | 77 | 199 | 201 | -1.0%
RemoveAccount | 17 | 19 | 259 | 285 | 1674 | 1945 | -13.9%
+

Source: PerfStats (TDigest-merged). All hot spans flat or slightly improving except GetAccounts p95 (+2%, within noise).

+ + +

Broker version adoption — request share by version

+
+ + + + + + + + + + +
Broker version | Req share now | Req share prior | Δ share pp | Δ rel %
16.0.1 | 70.77% | 55.34% | +15.43 | +26.5%
15.1.0 | 10.12% | 10.78% | -0.66 | -7.1%
14.2.0 | 8.80% | 9.42% | -0.62 | -7.6%
15.0.0 | 2.27% | 4.03% | -1.75 | -44.1%
16.0.0 | 1.39% | 13.35% | -11.96 | -89.7%
14.1.1 | 1.06% | 1.24% | -0.18 | -15.6%
14.0.2 | 0.99% | 1.09% | -0.10 | -9.9%
13.3.2 | 0.63% | 0.67% | -0.04 | -6.9%
13.9.1 | 0.42% | 0.45% | -0.03 | -6.9%
13.20.0 | 0.40% | 0.41% | -0.01 | -3.8%
+

16.0.1 rollout effectively complete. Reached 70.8% req share (from 55.3%); 16.0.0 down to 1.4% from 13.4%. Older 15.x and 14.x versions mostly decline 7-16% in relative share as natural attrition (15.0.0 falls faster, -44%). No version regressed in error rate during this rollout; see Spike Attribution cards for per-error broker_version concentration.

+ +

Appendix

- Long tail — error codes with no material movement + Queries used (Kusto KQL)
- - - - - - - - - - - - - - - - - -
Error code | Reqs now | Reqs prev | Δrel% reqs | Devs now | Devs prev | Δrel% devs
authorization_pending | 360 M | 371 M | +3% | 88.8k | 92.3k | +54%
timed_out | 28.8 M | 28.6 M | +7% | 2.29M | 2.50M | +47%
device_network_not_available_doze_mode | 108 M | 119 M | −4% | 2.04M | 2.10M | +56%
User cancelled | 3.65 M | 3.72 M | +4% | 1.89M | 1.90M | +59%
interaction_required | 33.4 M | 34.2 M | +3% | 1.80M | 1.86M | +55%
unauthorized_client | 24.8 M | 25.0 M | +5% | 1.20M | 1.20M | +60%
auth_cancelled_by_sdk | 1.58 M | 1.67 M | +1% | 906k | 946k | +53%
timed_out_thread_pool_saturated | 2.60 M | 3.12 M | −12% | 380k | 484k | +25%
invalid_scope | 846k | 937k | −4% | 241k | 280k | +38%
unknown_error | 2.28 M | 2.70 M | −11% | 113k | 118k | +53%
device_network_not_available | 13.1 M | 20.4 M | −32% | 147k | 156k | +51%
unknown_crypto_error | 5.01 M | 4.56 M | +17% | 29.6k | 31.5k | +50%
expired_token | 2.32 M | 2.40 M | +3% | 35.1k | 37.0k | +51%
+

Cluster: https://idsharedeus2.kusto.windows.net · Database: ad-accounts-android-otel

+

1. Reliability:

+
let all = SilentAuthStatsAllRequests | where EventInfo_Time > ago(70d)
+  | summarize allReq=sum(countRequests), allDev=sum(countDevices) by week=startofweek(EventInfo_Time);
+let ok = SilentAuthStatsRequestsWithoutExpectedError | where EventInfo_Time > ago(70d)
+  | summarize okReq=sum(countRequests), okDev=sum(countDevices) by week=startofweek(EventInfo_Time);
+all | join kind=inner ok on week
+  | project week, reqRel=round(100.0*okReq/allReq,3), devRel=round(100.0*okDev/allDev,3)
+  | order by week asc
+

2. 60-day error trend (bucketed in post-processing):

+
ErrorStats | where EventInfo_Time > ago(70d)
+  | where isnotempty(error_code) and error_code != 'success'
+  | summarize errs=sum(countOverall), devs=sum(countDevices)
+       by week=startofweek(EventInfo_Time), error_code
+  | order by error_code asc, week asc
+

3. Spike attribution (per error, per dimension):

+
let codes = dynamic(['no_tokens_found','unauthorized_client','Code:-6',
+                     'unknown_crypto_error','null_pointer_error','timed_out_execution']);
+ErrorStats | where EventInfo_Time > ago(14d) | where error_code in (codes)
+  | extend wk=startofweek(EventInfo_Time)
+  | summarize devs=sum(countDevices) by wk, error_code,
+       calling_package_name, active_broker_package_name, broker_version, span_name
+  | order by error_code asc, wk asc, devs desc
+

4. Latency (TDigest-merged):

+
PerfStats | where EventInfo_Time > ago(21d)
+  | where span_name in ('AcquireTokenSilent','GetAccounts','ProcessWebsiteRequest','RemoveAccount')
+  | where span_status == 'OK'
+  | summarize merged=tdigest_merge(responseTimeTDigest), reqs=sum(countRequests)
+       by week=startofweek(EventInfo_Time), span_name
+  | extend p50=percentile_tdigest(merged,50),
+           p95=percentile_tdigest(merged,95),
+           p99=percentile_tdigest(merged,99)
@@ -1691,13 +1069,12 @@

Appendix

Methodology & caveats
    -
  • Source: AllAndroidSpans + materialized views ErrorStats, SilentAuthStatsAllRequests, InteractiveAuthStatsAllRequests, BrokerAdoptionStats, PerfStats.
  • -
  • Window: last 7 days vs prior 7 days, ending 2026-05-07. Sparklines show 13-day window.
  • -
  • Attribution data: for each spike, joined ErrorStats (broker/span/active_broker/calling_app/sku) with android_spans (account_type, is_shared_device) over the last 7 days.
  • -
  • account_type unification: applied MergeAccountType() (collapses AAD variants, MSA variants).
  • -
  • error_type unification: applied MergeUiRequiredExceptions().
  • -
  • UNKNOWN account_type: typically appears in pre-authentication flows (DRS, DCF, broker discovery) where no account is yet selected.
  • -
  • Important caveat: the device-share inflation across many errors is most likely a denominator-shift artifact from broker 16.0.0 → 16.0.1 rollout. Per-broker-version slicing is needed to confirm whether errors actually grew or whether the denominator just shrank.
  • +
  • Reporting window: Sun May 3 → Sat May 9, 2026 (Kusto startofweek('2026-05-03')). Baseline: prior week of Apr 26 → May 2.
  • +
  • 60-day window: 8 complete weeks Mar 8 → May 3 (the partial Mar 1 start week is excluded for trend deltas).
  • +
  • Auth-only denominator: all reliability percentages use countRequests from SilentAuthStatsAllRequests + InteractiveAuthStatsAllRequests. The all-spans denominator from BrokerAdoptionStats is sensitive to receiver/goAsync() taper effects.
  • +
  • Concentration thresholds for attribution cards: >80% = strong (red bar); 60-80% = medium; <60% = broad/cross-cutting.
  • +
  • PR confidence rating: high = trajectory + flight rollout date both line up; medium = code path matches but no flight gate evidence; low = candidate from grep, needs verification; none = no broker PR identified.
  • +
  • Account type / shared-device-mode dimensions are not yet sliced this week: ErrorStats doesn't carry them, so slicing requires a targeted android_spans query that we'll add in the next pass.
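The concentration thresholds listed above map directly onto a small lookup. The sketch below is illustrative only: `concentrationLevel` is a hypothetical name, and treating exactly 60% and exactly 80% as "medium" is an assumption about how the boundaries are meant to be read.

```javascript
// Hypothetical helper mirroring the stated thresholds:
// >80% = strong, 60-80% = medium, <60% = broad/cross-cutting.
function concentrationLevel(sharePct) {
  if (sharePct > 80) return 'strong';   // rendered with the red bar
  if (sharePct >= 60) return 'medium';  // boundary values assumed to be medium
  return 'broad';                       // broad / cross-cutting
}

console.log(concentrationLevel(85), concentrationLevel(69), concentrationLevel(35));
// prints "strong medium broad"
```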
@@ -1705,6 +1082,13 @@

Appendix

- + \ No newline at end of file diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js index 5fdce733..617d0a16 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js +++ b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js @@ -2,27 +2,42 @@ /** * summarize-attribution.js — Roll up WoW attribution slices for spike-attribution cards. * - * Reads N Kusto MCP JSON output files, each with a `--label=...` tag describing what - * dimension it slices, and prints a per-(error_code, week, dimension) breakdown. + * TWO INPUT MODES: * - * Each input is the JSON file produced by the Kusto MCP tool. The first row of - * `results.items` is the schema; the remaining rows are positional arrays. + * 1) Per-dim files (legacy mode): one Kusto JSON per dimension, each tagged with + * --label=. Use this when you ran 7 separate per-dim queries. * - * The script auto-detects schema by looking at the column names of row[0]: - * - It expects exactly one column named `error_code`. - * - It expects exactly one column named `wk` or `week` (datetime). - * - It expects exactly one numeric column named `devs` or `countDevices`. - * - The remaining 1–2 string columns are treated as the slicing dimension. + * node summarize-attribution.js \ + * --label=span \ + * --label=calling_app \ + * --label=active_broker \ + * --label=broker_version \ + * --label=acct_type \ + * --label=shared_dev \ + * --label=client_sku * - * Usage: - * node summarize-attribution.js \ - * --label=span \ - * --label=calling_app \ - * --label=active_broker \ - * --label=broker_version + * Per-file schema: row[0] must include `error_code`, `wk`/`week`, `devs`/`countDevices`, + * and exactly one trailing string column (the dimension value). * - * Output: per error_code, per week, the top-5 values of each dimension by devs and - * their share-of-total. 
Use this to fill in attr-card dim rows. + * 2) Union mode (NEW, recommended for 2-week WoW attribution — one query covers all dims): + * + * node summarize-attribution.js --union + * + * Expected schema (any column order): + * dim string -- short label e.g. 'span', 'calling_app', 'broker_ver' + * wk | week datetime + * error_code string (or `error_type` — use --key=error_type to switch) + * val_string string } EITHER `val_string`+`val_bool` (Kusto union of + * val_bool bool } mixed-type slice columns) ... + * val string } ... OR a single `val` column + * devs long (use `dcount_hll(hll_merge(countDevicesHll))` upstream) + * errs long (optional — request count, used for retry-storm detection) + * + * The union form is what Step 5 of SKILL.md now recommends — 1 round-trip vs 7. + * See assets/queries/attr-union-by-dim.kql. + * + * Output: per error_code, per dimension, the top-5 values for each week (prior + curr), + * concentration % of curr-week total, and Δd / Δr vs prior week. * * IMPORTANT: when you build the source query, ALWAYS use * dcount_hll(hll_merge(countDevicesHll)) @@ -30,72 +45,142 @@ */ const fs = require('fs'); -const inputs = []; // { label, file } +// --- arg parsing --------------------------------------------------------- +const argv = process.argv.slice(2); +const inputs = []; // per-dim mode: { label, file } let pendingLabel = null; -for (const a of process.argv.slice(2)) { +let unionFile = null; +let keyCol = 'error_code'; // override with --key=error_type for type cards +let topN = 5; +for (const a of argv) { + if (a === '--union') { /* next non-flag arg is the file */ pendingLabel = '__UNION__'; continue; } + if (a.startsWith('--union=')) { unionFile = a.split('=')[1]; pendingLabel = null; continue; } + if (a.startsWith('--key=')) { keyCol = a.split('=')[1]; continue; } + if (a.startsWith('--top=')) { topN = parseInt(a.split('=')[1], 10) || 5; continue; } if (a.startsWith('--label=')) { pendingLabel = a.split('=')[1]; continue; } + if 
(pendingLabel === '__UNION__') { unionFile = a; pendingLabel = null; continue; } inputs.push({ label: pendingLabel || 'unknown', file: a }); pendingLabel = null; } -if (inputs.length === 0) { - console.error('Usage: node summarize-attribution.js --label= file1.json --label= file2.json ...'); +if (!unionFile && inputs.length === 0) { + console.error('Usage:\n node summarize-attribution.js --union [--key=error_code|error_type] [--top=N]\n node summarize-attribution.js --label= file1.json --label= file2.json ...'); process.exit(1); } -function loadSlice({ label, file }) { +// --- helpers -------------------------------------------------------------- +function fmt(n) { + if (n == null) return '–'; + if (Math.abs(n) >= 1e9) return (n / 1e9).toFixed(2) + 'B'; + if (Math.abs(n) >= 1e6) return (n / 1e6).toFixed(2) + 'M'; + if (Math.abs(n) >= 1e3) return (n / 1e3).toFixed(1) + 'k'; + return String(n); +} +function pct(num, den) { return den ? (100 * num / den).toFixed(1) : '0.0'; } +function delta(curr, prior) { + if (prior == null || prior === 0) return curr ? 'NEW' : '–'; + return ((curr - prior) / prior * 100).toFixed(1) + '%'; +} + +// --- per-dim file loader (legacy mode) ------------------------------------ +function loadSliceFile({ label, file }) { const d = JSON.parse(fs.readFileSync(file, 'utf8')); const rows = d.results.items; - const schema = rows[0]; // object: { col: type, ... } + const schema = rows[0]; const cols = Object.keys(schema); - const idxCode = cols.indexOf('error_code'); + const idxCode = cols.indexOf(keyCol); let idxWeek = cols.indexOf('wk'); if (idxWeek < 0) idxWeek = cols.indexOf('week'); let idxDevs = cols.indexOf('devs'); if (idxDevs < 0) idxDevs = cols.indexOf('countDevices'); + let idxErrs = cols.indexOf('errs'); if (idxErrs < 0) idxErrs = cols.indexOf('countOverall'); if (idxCode < 0 || idxWeek < 0 || idxDevs < 0) { - throw new Error(`${file}: schema must include error_code, wk|week, devs|countDevices. 
Got [${cols.join(', ')}]`); + throw new Error(`${file}: schema must include ${keyCol}, wk|week, devs|countDevices. Got [${cols.join(', ')}]`); } - // The "dimension" column is the first string col that isn't error_code/week - const idxDim = cols.findIndex((c, i) => i !== idxCode && i !== idxWeek && i !== idxDevs && schema[c] === 'string'); + const idxDim = cols.findIndex((c, i) => + i !== idxCode && i !== idxWeek && i !== idxDevs && i !== idxErrs && schema[c] === 'string'); if (idxDim < 0) throw new Error(`${file}: no string dimension column found`); - const map = {}; // code -> wk -> dim -> devs + const map = {}; + for (const r of rows.slice(1)) { + const code = r[idxCode], wk = r[idxWeek]; + const dim = (r[idxDim] === null || r[idxDim] === '') ? '(blank)' : r[idxDim]; + const devs = r[idxDevs] || 0; + const errs = idxErrs >= 0 ? (r[idxErrs] || 0) : 0; + const slot = ((map[code] ||= {})[wk] ||= {})[dim] ||= { devs: 0, errs: 0 }; + slot.devs += devs; slot.errs += errs; + } + return { label, map }; +} + +// --- union-mode loader (NEW) --------------------------------------------- +function loadUnion(file) { + const d = JSON.parse(fs.readFileSync(file, 'utf8')); + const rows = d.results.items; + const schema = rows[0]; + const cols = Object.keys(schema); + const idx = name => cols.indexOf(name); + const idxDim = idx('dim'); + const idxCode = idx(keyCol); + let idxWeek = idx('wk'); if (idxWeek < 0) idxWeek = idx('week'); + let idxDevs = idx('devs'); if (idxDevs < 0) idxDevs = idx('countDevices'); + let idxErrs = idx('errs'); if (idxErrs < 0) idxErrs = idx('countOverall'); + const idxValS = idx('val_string') >= 0 ? idx('val_string') : idx('val'); + const idxValB = idx('val_bool'); + if (idxDim < 0 || idxCode < 0 || idxWeek < 0 || idxDevs < 0 || idxValS < 0) { + throw new Error(`Union file ${file}: schema must include dim, ${keyCol}, wk|week, devs|countDevices, val_string|val (and optionally val_bool). 
Got [${cols.join(', ')}]`); + } + // perDim[label].map[code][wk][dimVal] = { devs, errs } + const byDim = {}; for (const r of rows.slice(1)) { - const code = r[idxCode], wk = r[idxWeek], dim = r[idxDim] || '(blank)', devs = r[idxDevs] || 0; - ((map[code] ||= {})[wk] ||= {})[dim] = (map[code][wk][dim] || 0) + devs; + const label = r[idxDim]; + const code = r[idxCode]; + const wk = r[idxWeek]; + const valS = r[idxValS]; + const valB = idxValB >= 0 ? r[idxValB] : null; + let v; + if (valS !== null && valS !== undefined && valS !== '') v = valS; + else if (valB !== null && valB !== undefined) v = String(valB); + else v = '(blank)'; + const devs = r[idxDevs] || 0; + const errs = idxErrs >= 0 ? (r[idxErrs] || 0) : 0; + const target = byDim[label] ||= { label, map: {} }; + const slot = ((target.map[code] ||= {})[wk] ||= {})[v] ||= { devs: 0, errs: 0 }; + slot.devs += devs; slot.errs += errs; } - return { label, dimColumn: cols[idxDim], map }; + return Object.values(byDim); } -const slices = inputs.map(loadSlice); +const slices = unionFile ? 
loadUnion(unionFile) : inputs.map(loadSliceFile); -// Collect (code, week) universe +// --- output -------------------------------------------------------------- const universe = {}; for (const s of slices) { for (const [code, wks] of Object.entries(s.map)) { - for (const wk of Object.keys(wks)) { - ((universe[code] ||= {})[wk] = true); - } + for (const wk of Object.keys(wks)) ((universe[code] ||= {})[wk] = true); } } - const codes = Object.keys(universe).sort(); + for (const code of codes) { - console.log(`\n========== ${code} ==========`); const wks = Object.keys(universe[code]).sort(); - for (const wk of wks) { - console.log(`\n --- week ${wk.slice(0, 10)} ---`); - for (const s of slices) { - const dim = s.map[code]?.[wk] || {}; - const total = Object.values(dim).reduce((x, y) => x + y, 0); - if (total === 0) continue; - console.log(` [${s.label}] total=${total.toLocaleString()}`); - Object.entries(dim) - .sort((a, b) => b[1] - a[1]) - .slice(0, 5) - .forEach(([k, v]) => { - const pct = (v / total * 100).toFixed(1); - console.log(` ${pct.padStart(5)}% ${k} (${v.toLocaleString()})`); - }); + const prior = wks[0], curr = wks[wks.length - 1]; + console.log(`\n========== ${code} (prior=${prior?.slice(0,10)} curr=${curr?.slice(0,10)}) ==========`); + for (const s of slices) { + const priorMap = s.map[code]?.[prior] || {}; + const currMap = s.map[code]?.[curr] || {}; + const allVals = new Set([...Object.keys(priorMap), ...Object.keys(currMap)]); + if (allVals.size === 0) continue; + const totC = Object.values(currMap).reduce((a, b) => a + b.devs, 0); + const rows = [...allVals].map(v => ({ + v, + pDev: priorMap[v]?.devs || 0, + cDev: currMap[v]?.devs || 0, + pErr: priorMap[v]?.errs || 0, + cErr: currMap[v]?.errs || 0, + })).sort((a, b) => b.cDev - a.cDev).slice(0, topN); + console.log(`\n -- ${s.label} (curr-total devices=${fmt(totC)})`); + for (const r of rows) { + const share = pct(r.cDev, totC); + console.log(` ${share.padStart(5)}% ${fmt(r.cDev).padStart(8)}d 
d_dev ${delta(r.cDev, r.pDev).padStart(8)} d_req ${delta(r.cErr, r.pErr).padStart(8)} ${r.v}`); } } } diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md new file mode 100644 index 00000000..54af933a --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md @@ -0,0 +1,98 @@ +# report-template.html — author guide + +`assets/report-template.html` is the canonical layout for the OCE weekly report. +It is **a real prior week's report kept verbatim as a structural reference** — +not a tokenized skeleton. The right mental model is: + +> *"Open the template, save it under a new filename, then walk top-to-bottom and +> replace every prior-week date / number / verdict / PR citation with current-week +> data. Don't redesign the layout. Don't restyle the CSS."* + +## What you change per week + +| Region | What to update | +|---|---| +| `` and `<h1>` block | Reporting window dates + "Generated …" date | +| KPI tiles (`.kpi-grid`) | Value, delta, `data-spark` array (8–9 numbers) per tile | +| 🚨 Needs-attention callouts (`.callout.urgent` / `.watch` / `.win`) | Replace bullet list with current-week findings; keep the 4 callout categories | +| 📈 60-day trend tables | Rows + `.trend` sparkline arrays, generated by `bucket-trends.js` (4 runs, union of regressions) | +| 🔎 Spike-attribution cards (`.attr-card`) | One card per regression. **Use [`templates/spike-card.html`](templates/spike-card.html) as the per-card skeleton.** Replace dim percentages, throw-site, PR list, etc. 
| +| 🚚 Traffic-attribution cards | Same as spike cards; render an explicit "None this week" if no errors qualify | +| Error-codes / error-types tables | One row per non-trivial code/type with Δ devices % + Δ requests % + 60d sparkline | +| Traffic / latency / adoption tables | Update numbers; structure stays | +| Appendix PR window list | Run `find-suspect-prs.ps1` (or `git log`) for broker/ + common/ over the 4-week window | + +## What you NEVER change + +- The `<style>` block at the top — the CSS is canonical +- The `<script>` block at the bottom — the sparkline JS is canonical (uses string concatenation, not template literals, on purpose — see comment in the script) +- Section ordering and `id="..."` anchors — the table-of-contents links rely on these + +If the layout itself ever needs to change (new section, new card style), edit +`assets/report-template.html` here in the skill folder and commit so future +weeks inherit the change. + +## Validator pass before saving + +Two `Select-String` greps must come back clean (no hits outside KQL blocks for the first, no hits at all for the second): + +```pwsh +Select-String -Path <output.html> -Pattern '\bdevs\b|\breqs\b' -CaseSensitive:$false # user-facing terminology +Select-String -Path <output.html> -Pattern 'EXAMPLE CONTENT BELOW' # unfinished-section sentinel +``` + +Authors mark unfinished sections with the literal text `EXAMPLE CONTENT BELOW` +inside an HTML comment. The grep catches anything still in flight. + +`devs` / `reqs` are allowed inside `<pre><code>…</code></pre>` KQL blocks +(legitimate Kusto column / variable names). All other occurrences are +forbidden — use `devices` / `requests` in user-facing prose, headers, badges, +and verdicts. 
+ +## Sparkline color palette + +Used by both `.spark` (KPI tiles) and `.trend` (table cells): + +| Color hex | Semantic | When to use | +|---|---|---| +| `#cf222e` red | bad / regression | data-trend on a row in the regressions table | +| `#1a7f37` green | good / improvement / win | data-spark on a reliability KPI; data-trend on a recovery | +| `#0969da` blue | neutral / informational | data-spark on traffic-volume KPIs | +| `#0550ae` darker blue | latency | data-spark on p95 KPIs | +| `#9a6700` amber | watch / spike | data-trend on ephemeral spikes (peak-then-recover) | +| `#656d76` grey | flat / no-movement | data-trend on flat rows in long-tail tables | + +## CSS class quick reference + +(Defined in `<style>`; do not redefine inline.) + +### Section 2 callouts (at-a-glance, flat rows — the "Things that need attention" block) + +| Class | Use | +|---|---| +| `.callout` (`.urgent` / `.watch` / `.win`) | The outer card with the colored left rail and pastel background. The rail color IS the severity affordance — do not add per-item left bars inside the callout (they will visually clash). | +| `.item-list` | Container for the flat row list inside a callout body. | +| `.item` | Single divider-separated row. NO chrome — no border, no background, no left bar. The `.item:first-child` selector removes the top divider. | +| `.item-head` | Flex row: name + inline metric chips + tags pushed right. Use `flex-wrap` so it works on narrow viewports. | +| `.item-name` | Monospace bold name (the `error_code` or `error_type`). Append a `<span class="kind">type</span>` pill if it's an error_type, not an error_code. | +| `.metric` (`.up` / `.down`) | Inline metric chip: `<label> <value>`. `.up` = red (regression), `.down` = green (improvement). Use multiple per row for `devices`, `Δ WoW`, `Δ requests`, `on 16.0.1`, etc. | +| `.item-tags` | Right-pushed tag rail. 
Put the originator chip (`origin-broker` / `origin-thirdparty` / `origin-android` / `origin-env`) here, plus optional `NEW` / `60d↑` tags. | +| `.item-body` | One short narrative line (throw site + dominant message + verdict). Keep it short — the deep dive belongs in the spike-attribution card. | +| `.item-foot` | Optional footer with owner / next step + right-aligned `Attribution card →` link via `.arrow-link`. | + +**HARD RULE:** Section 2 items are at-a-glance — they MUST link to a deep-dive `.attr-card` in Section 4 via `<a class="arrow-link" href="#card-XXX">Attribution card →</a>` rather than duplicating the dim slicing or PR analysis inline. The split between Section 2 (skim) and Section 4 (deep dive) is the whole point of the report layout. + +### Section 4 attribution cards (deep-dive — the "🔎 Spike Attribution" block) + +| Class | Use | +|---|---| +| `.attr-card` | Per-error attribution container. Each WoW regression AND each 60d regression gets one. | +| `.attr-header` (`.urgent` / `.watch`) | Header strip with name + tag chips. | +| `.attr-name`, `.attr-tags`, `.attr-verdict` (`.bad`) | Header content + top verdict paragraph. | +| `.attr-dims` | 7-tile grid for the 7 mandatory dim slices. | +| `.dim` / `.dim-label` / `.dim-row` / `.dim-bar-track` / `.dim-bar-fill` (`.dominant` / `.split`) | Single dim tile with concentration bars. | +| `.code-attr` / `.code-attr-title` | Labeled-grid block under the dims. | +| `.origin-row` | One row in the code-attr grid (label + value). | +| `.stack` | Chip for a `file:line` throw-site reference. | +| `.pr-card` / `.pr-conf` (`-high` / `-medium` / `-low` / `-none`) / `.pr-body` | PR citation with confidence pill. | +| `.origin-tag` (`.origin-broker` / `.origin-android` / `.origin-thirdparty` / `.origin-env`) | Colored chips for the Originator field. 
| diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md b/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md new file mode 100644 index 00000000..93a60229 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md @@ -0,0 +1,19 @@ +# `assets/templates/` — copy-paste HTML snippets + +These are raw HTML fragments designed to be copied verbatim into the working +report file and then have `{{TOKENS}}` replaced. The CSS classes they reference +are defined in [`../report-template.html`](../report-template.html) — do not +restyle per week. + +| File | When to use | +|---|---| +| [`spike-card.html`](spike-card.html) | One per regressing `error_code` or `error_type`. The 7 dim blocks + 8th-for-types and the Code Attribution block are MANDATORY (per SKILL.md). | +| [`traffic-attr-card.html`](traffic-attr-card.html) | One per error whose spike is traffic-driven (per-app volume up, per-request failure rate flat). | +| [`sparkline-footer.html`](sparkline-footer.html) | Paste once, immediately before `</body>`. Uses string concatenation (no JS template literals) so it survives PowerShell here-string composition. | + +Final-pass sanity check before saving the report: + +```pwsh +Select-String -Path <report-file> -Pattern '\{\{|EXAMPLE CONTENT BELOW' +# expected: 0 matches +``` diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html new file mode 100644 index 00000000..b2c4b512 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html @@ -0,0 +1,42 @@ +<!-- Sparkline footer JS — paste verbatim immediately before </body>. + + This file uses string concatenation (no JS template literals) so it + survives PowerShell here-string composition. 
PowerShell treats backticks + as escape characters and would silently mangle ${...} expressions. + + COLOR PALETTE for `data-color` attributes (matches the rest of the report + CSS — keep these in sync with assets/template-readme.md): + + #cf222e red — bad / regression (data-trend in regression rows) + #1a7f37 green — good / improvement / win (reliability KPIs, recovery sparklines) + #0969da blue — neutral / informational (traffic-volume KPIs) + #0550ae darker blue — latency (p95 KPI sparklines) + #9a6700 amber — watch / spike (ephemeral peak-then-recover trends) + #656d76 grey — flat / no-movement (long-tail rows) +--> +<script> +function makeSparkline(values, opts) { + opts = opts || {}; + var w = opts.w || 100, h = opts.h || 24, color = opts.color || "#0969da", pad = 2; + if (!values || values.length === 0) return ""; + var min = Math.min.apply(null, values), max = Math.max.apply(null, values); + var range = max - min || 1; + var step = values.length > 1 ? (w - pad * 2) / (values.length - 1) : 0; // one-point series: avoid division by zero (NaN coordinates) + var points = values.map(function (v, i) { return [pad + i * step, h - pad - ((v - min) / range) * (h - pad * 2)]; }); + var linePath = "M" + points.map(function (p) { return p.join(","); }).join(" L"); + var lastPt = points[points.length - 1]; + var fillPath = linePath + " L" + lastPt[0] + "," + h + " L" + points[0][0] + "," + h + " Z"; + return '<svg class="sparkline" width="' + w + '" height="' + h + '" viewBox="0 0 ' + w + ' ' + h + '" xmlns="http://www.w3.org/2000/svg">' + + '<path d="' + fillPath + '" fill="' + color + '" opacity="0.12"/>' + + '<path d="' + linePath + '" fill="none" stroke="' + color + '" stroke-width="1.4" stroke-linejoin="round" stroke-linecap="round"/>' + + '<circle cx="' + lastPt[0] + '" cy="' + lastPt[1] + '" r="2" fill="' + color + '"/></svg>'; +} +document.querySelectorAll(".spark[data-spark]").forEach(function (el) { + el.innerHTML = makeSparkline(JSON.parse(el.dataset.spark), { w: 180, h: 28, color: el.dataset.color }); +}); 
+document.querySelectorAll(".trend[data-trend]").forEach(function (el) { + var w = parseInt(el.dataset.w) || 120; + var h = parseInt(el.dataset.h) || 24; + el.innerHTML = makeSparkline(JSON.parse(el.dataset.trend), { w: w, h: h, color: el.dataset.color }); +}); +</script> diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html new file mode 100644 index 00000000..04579cd9 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html @@ -0,0 +1,129 @@ +<!-- + Spike Attribution card template — copy verbatim per regressing error_code (or + error_type) and replace every {{TOKEN}}. The 7 dim blocks are MANDATORY (per + SKILL.md). For type cards, also include the 8th-dim sub-code decomposition. + + CSS classes used here are defined in assets/report-template.html. Do not + restyle per week. + + Severity: + .attr-header.urgent -> red, use for confirmed broker WoW regressions + .attr-header.watch -> yellow, use for slow-burn / 60d-only / environmental + (no class) -> neutral, use for recovering or wins + + Origin tag (apply on Originator row + in attr-tags): + .origin-broker red — code lives in broker/ or common/ + .origin-android orange — Android system / Credential Manager / OEM + .origin-thirdparty blue — eSTS / Nimbus / okhttp / 3p library + .origin-env gray — tenant policy, network, environmental + + PR confidence (REQUIRED — pick one honestly): + .pr-conf-high red evidence directly ties this PR to the regression + .pr-conf-medium yellow PR touches the affected code path; timing fits + .pr-conf-low blue possible but no direct evidence + .pr-conf-none gray no broker PR can fix this (eSTS/Android/env) +--> + +<div class="attr-card" id="card-{{ERROR_ID}}"> + <div class="attr-header {{HEADER_CLASS}}"> + <div class="attr-name">{{ERROR_NAME}}   + <span class="tag tag-bad">{{WOW_BADGE}}</span> + <span class="tag 
tag-bad">{{D60_BADGE}}</span> + </div> + <div class="attr-tags"> + <span class="tag tag-warn">{{TAG_1}}</span> + <span class="tag tag-info">{{TAG_2}}</span> + </div> + </div> + <div class="attr-body"> + <div class="attr-verdict bad"> + <strong>Verdict:</strong> {{VERDICT_PARAGRAPH}} + </div> + + <!-- 7 mandatory dim blocks — fill from agg.js output for each dim query --> + <div class="attr-dims"> + <div class="dim"><div class="dim-label">Span</div> + {{SPAN_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">Calling app</div> + {{APP_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">Active broker pkg</div> + {{ACTIVE_BROKER_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">Broker version</div> + {{BROKER_VERSION_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">Account type</div> + {{ACCOUNT_TYPE_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">Shared device</div> + {{SHARED_DEVICE_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">Client SKU</div> + {{CLIENT_SKU_DIM_ROWS}} + </div> + <div class="dim"><div class="dim-label">OS version</div> + {{OS_VERSION_DIM_ROWS}} + </div> + </div> + + <!-- Code attribution — every field below is REQUIRED. 
--> + <div class="code-attr"> + <div class="code-attr-title">Code attribution</div> + <div class="origin-row"><div class="origin-label">Originator</div> + <div class="origin-value"><span class="origin-tag {{ORIGIN_CLASS}}">{{ORIGIN_LABEL}}</span> {{ORIGIN_DESCRIPTION}}</div> + </div> + <div class="origin-row"><div class="origin-label">Top throw site</div> + <div class="origin-value"><span class="stack">{{THROW_SITE_FILE_LINE}}</span> {{THROW_SITE_NOTES}}</div> + </div> + <div class="origin-row"><div class="origin-label">Wrapper</div> + <div class="origin-value">{{WRAPPER_CLASS_AND_METHOD}}</div> + </div> + <div class="origin-row"><div class="origin-label">Caller hot-spots</div> + <div class="origin-value">{{CALLER_BREAKDOWN}}</div> + </div> + <div class="origin-row"><div class="origin-label">Underlying cause</div> + <div class="origin-value">{{ROOT_CAUSE}}</div> + </div> + <div class="origin-row"><div class="origin-label">Top error_messages</div> + <div class="origin-value"> + <ol style="margin:0;padding-left:18px;font-size:11.5px;"> + <li><code>{{MSG_1}}</code> — {{MSG_1_DEVICES}}</li> + <li><code>{{MSG_2}}</code> — {{MSG_2_DEVICES}}</li> + <li><code>{{MSG_3}}</code> — {{MSG_3_DEVICES}}</li> + </ol> + </div> + </div> + <div class="origin-row"><div class="origin-label">Likely PRs</div> + <div class="origin-value"><div class="pr-list"> + <div class="pr-card"> + <div class="pr-conf {{PR_1_CONF_CLASS}}">{{PR_1_CONF_LABEL}}</div> + <div class="pr-body"> + <a class="pr-id" href="{{PR_1_URL}}" target="_blank" rel="noopener">{{PR_1_ID}}</a>   + <span class="pr-title">{{PR_1_TITLE}}</span> + <div class="pr-meta">{{PR_1_DATE}} · {{PR_1_AUTHOR}} · sha {{PR_1_SHA}}</div> + <div class="pr-why">{{PR_1_WHY}}</div> + </div> + </div> + <!-- Repeat .pr-card for PR 2 / PR 3. For environmental errors use: + <div class="pr-card"> + <div class="pr-conf pr-conf-none">⚪ None — not in scope</div> + <div class="pr-body"> + <span class="pr-title">No broker/common PR matches. 
Tagged <strong>{{ENV_TAG}}</strong>.</span> + </div> + </div> + --> + </div></div> + </div> + <div class="origin-row"><div class="origin-label">Next step</div> + <div class="origin-value">📝 <strong>{{OWNER_TEAM}}:</strong> {{NEXT_ACTION}}</div> + </div> + </div> + + <!-- Traffic Attribution check — REQUIRED on every card. --> + <div class="attr-verdict" style="border-left-color:#1a7f37; background:#dafbe1;"> + <strong>Traffic Attribution check:</strong> {{TRAFFIC_VERDICT}} + </div> + </div> +</div> diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html new file mode 100644 index 00000000..e16b4071 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html @@ -0,0 +1,46 @@ +<!-- + Traffic Attribution card — for errors whose spike is fully or partly explained + by per-app request volume rising rather than per-request failure rate. Routed + to the calling-app team, not to broker. + + Use under the "🚚 Traffic Attribution" section. If no errors qualify in a + given week, emit the "None this week" callout instead (see SKILL.md Step 6c). 
+--> +<div class="attr-card" id="traffic-card-{{ERROR_ID}}"> + <div class="attr-header" style="background:linear-gradient(180deg,#fff8c5 0%,#fffbe8 100%);border-bottom-color:#d4a72c;"> + <div class="attr-name">{{ERROR_NAME}}   + <span class="tag tag-warn">{{WOW_BADGE}}</span> + <span class="tag tag-info">traffic-driven, not failure-rate</span> + </div> + <div class="attr-tags"> + <span class="tag tag-info">dominant caller: {{DOMINANT_APP}}</span> + </div> + </div> + <div class="attr-body"> + <div class="attr-verdict"> + <strong>Verdict:</strong> {{VERDICT_PARAGRAPH}} + </div> + + <table style="width:100%; font-size:12px;"> + <thead><tr> + <th>Calling app</th> + <th class="num">Δ overall requests WoW</th> + <th class="num">Per-request failure rate (prev → cur)</th> + <th class="num">Δ failure rate</th> + </tr></thead> + <tbody> + <tr> + <td><code>{{APP_1}}</code></td> + <td class="num">{{APP_1_DELTA_REQ}}</td> + <td class="num">{{APP_1_PREV_RATE}} → {{APP_1_CUR_RATE}}</td> + <td class="num {{APP_1_RATE_CLASS}}">{{APP_1_DELTA_RATE}}</td> + </tr> + <!-- repeat per affected app --> + </tbody> + </table> + + <div class="attr-verdict" style="border-left-color:#9a6700; background:#fff8c5; margin-top:12px;"> + <strong>Routing:</strong> 📝 <strong>{{CALLER_OWNER_TEAM}}</strong> (not broker). Per-request failure rate is essentially flat, so a code regression in the broker is not implicated. The error spike is a function of {{DOMINANT_APP}} sending {{APP_1_DELTA_REQ}} more requests this week. + </div> + </div> +</div> diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 new file mode 100644 index 00000000..be1a0a07 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 @@ -0,0 +1,157 @@ +<# +.SYNOPSIS + Validate a generated OCE weekly report HTML before publishing. 
+ +.DESCRIPTION + Runs all required pre-publish checks per SKILL.md "Output checklist": + 1. No stale-template tokens ({{...}} placeholders or "EXAMPLE CONTENT BELOW" sentinel). + 2. No `devs` / `reqs` in user-facing text (only allowed inside <pre><code> KQL blocks). + 3. No U+FFFD (Unicode replacement character) — catches mojibake from emoji edits. + 4. Section 2 callouts are siblings, NOT nested. Tracks <div> open/close depth + from #attention to #trend60d; the net depth must be 0 at the end of the block. + 5. (Informational) Reports HTML size and number of <div class="callout"> openings. + + Exits with non-zero status if any HARD check fails (stale tokens, devs/reqs leak, + U+FFFD, unbalanced div depth, or a nested callout in the attention block). + +.PARAMETER Path + Path to the report file. Defaults to the most recently modified + oncall-wow-report-*.html under $env:USERPROFILE\android-oce-reports\. + +.EXAMPLE + .\validate-report.ps1 + .\validate-report.ps1 -Path C:\path\to\oncall-wow-report-2026-05-03.html +#> +[CmdletBinding()] +param( + [string]$Path +) + +# Default: pick the most-recent oncall-wow-report-*.html in the user's reports folder +if (-not $Path) { + $reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' + $latest = Get-ChildItem $reportDir -Filter 'oncall-wow-report-*.html' -ErrorAction SilentlyContinue | + Sort-Object LastWriteTime -Descending | Select-Object -First 1 + if (-not $latest) { + Write-Error "No oncall-wow-report-*.html found in $reportDir. Pass -Path explicitly." 
+ exit 2 + } + $Path = $latest.FullName +} + +if (-not (Test-Path $Path)) { + Write-Error "Report file not found: $Path" + exit 2 +} + +$failures = @() +$warnings = @() + +function Add-Fail($msg) { $script:failures += $msg; Write-Host " [FAIL] $msg" -ForegroundColor Red } +function Add-Warn($msg) { $script:warnings += $msg; Write-Host " [WARN] $msg" -ForegroundColor Yellow } +function Pass($msg) { Write-Host " [OK] $msg" -ForegroundColor Green } + +Write-Host "" +Write-Host "Validating: $Path" +Write-Host ("Size: {0:N0} bytes" -f (Get-Item $Path).Length) +Write-Host "" + +# ---- 1. Stale tokens / EXAMPLE sentinel ---- +$stale = Select-String -Path $Path -Pattern '\{\{|EXAMPLE CONTENT BELOW|EXAMPLE_' +if ($stale.Count -gt 0) { + Add-Fail "Stale template tokens found ($($stale.Count)). First few:" + $stale | Select-Object -First 5 | ForEach-Object { Write-Host " L$($_.LineNumber): $($_.Line.Trim().Substring(0, [Math]::Min(110, $_.Line.Trim().Length)))" } +} else { + Pass "No stale {{...}} tokens or EXAMPLE sentinel" +} + +# ---- 2. devs / reqs in user-facing text ---- +# Allowed: occurrences inside <pre><code>...</code></pre> KQL blocks. +$content = [System.IO.File]::ReadAllText($Path) +$contentNoCode = [regex]::Replace($content, '(?s)<pre[^>]*>.*?</pre>', '') +$contentNoCode = [regex]::Replace($contentNoCode, '(?s)<code[^>]*>.*?</code>', '') +$drMatches = [regex]::Matches($contentNoCode, '\b(devs|reqs)\b', 'IgnoreCase') +if ($drMatches.Count -gt 0) { + Add-Fail "Found $($drMatches.Count) devs/reqs occurrence(s) in user-facing text (use 'devices' / 'requests'). First few contexts:" + $drMatches | Select-Object -First 5 | ForEach-Object { + $ctxStart = [Math]::Max(0, $_.Index - 40) + $ctxLen = [Math]::Min(100, $contentNoCode.Length - $ctxStart) + $ctx = $contentNoCode.Substring($ctxStart, $ctxLen) -replace '\s+', ' ' + Write-Host " ...$ctx..." + } +} else { + Pass "No devs/reqs in user-facing text" +} + +# ---- 3. 
U+FFFD (mojibake from emoji edits) ---- +$bytes = [System.IO.File]::ReadAllBytes($Path) +$text = [System.Text.Encoding]::UTF8.GetString($bytes) +$ufffd = ($text.ToCharArray() | Where-Object { $_ -eq [char]0xFFFD }).Count +if ($ufffd -gt 0) { + Add-Fail "$ufffd U+FFFD replacement character(s) found (mojibake). First context:" + $i = $text.IndexOf([char]0xFFFD) + $start = [Math]::Max(0, $i - 30); $end = [Math]::Min($text.Length, $i + 30) + Write-Host " ...$($text.Substring($start, $end - $start) -replace "`r?`n", ' ')..." +} else { + Pass "No U+FFFD (no mojibake)" +} + +# ---- 4. Section 2 div balance ---- +$lines = Get-Content $Path +$startIdx = -1; $endIdx = -1 +for ($i = 0; $i -lt $lines.Count; $i++) { + if ($lines[$i] -match 'id="attention"') { $startIdx = $i } + if ($lines[$i] -match 'id="trend60d"') { $endIdx = $i; break } +} +if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { + $depth = 0 + for ($i = $startIdx; $i -le $endIdx; $i++) { + if ($null -eq $lines[$i]) { continue } + $depth += ([regex]::Matches($lines[$i], '<div\b')).Count + $depth -= ([regex]::Matches($lines[$i], '</div>')).Count + } + if ($depth -ne 0) { + Add-Fail "Section 2 (attention block) has unbalanced <div>s; net depth at end = $depth (expected 0). Likely cause: a callout is missing its closing </div>, which makes the next callout nest inside it." + } else { + Pass "Section 2 div balance OK (depth returns to 0)" + } +} else { + Add-Warn "Could not locate the attention block (#attention / #trend60d anchors). Skipping div-balance check." +} + +# ---- 5. Informational: callout count + nested-callout sanity ---- +$calloutOpens = ([regex]::Matches($content, '<div class="callout(?:\s|")')).Count +Write-Host "" +Write-Host "Info: $calloutOpens callout container(s) in the document." + +# Cheap nested-callout heuristic: scan the attention block for any callout that +# opens before the previous callout closes. We approximate by tracking depth. 
+if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { + $depthOuter = 0; $nestedAt = @() + for ($i = $startIdx; $i -le $endIdx; $i++) { + if ($null -eq $lines[$i]) { continue } + # Match the callout container itself, not callout-title. The class can be + # `callout`, `callout urgent`, `callout watch`, `callout win`, etc. — but + # never `callout-title`. Require a space or end-of-class-attr after. + if ($lines[$i] -match '<div class="callout(?:\s|")' -and $depthOuter -gt 0) { + $nestedAt += $i + 1 + } + $depthOuter += ([regex]::Matches($lines[$i], '<div\b')).Count + $depthOuter -= ([regex]::Matches($lines[$i], '</div>')).Count + } + if ($nestedAt.Count -gt 0) { + Add-Fail "Nested callout detected at line(s): $($nestedAt -join ', '). Each callout in Section 2 must be a SIBLING, not nested inside another callout." + } else { + Pass "No nested callouts in Section 2" + } +} + +Write-Host "" +if ($failures.Count -eq 0) { + Write-Host "All hard checks passed." -ForegroundColor Green + if ($warnings.Count -gt 0) { Write-Host "$($warnings.Count) warning(s) — review above." -ForegroundColor Yellow } + exit 0 +} else { + Write-Host "$($failures.Count) hard check(s) failed. Fix before publishing." 
-ForegroundColor Red + exit 1 +} From 047e3cb710d739839f75bf3907ed2b59a35002d9 Mon Sep 17 00:00:00 2001 From: Shahzaib <shahzaib.jameel@microsoft.com> Date: Mon, 11 May 2026 11:46:34 -0700 Subject: [PATCH 3/6] Updates to telemetry OCE report skill --- .../oncall-weekly-telemetry-report/SKILL.md | 24 ++++-- .../assets/kusto-cheatsheet.md | 16 ++++ .../assets/queries/README.md | 1 + .../assets/queries/attr-union-by-dim.kql | 23 +++-- .../queries/broker-version-share-wow.kql | 34 ++++++++ .../queries/error-message-and-location.kql | 8 +- .../assets/summarize-attribution.js | 17 +++- .../assets/template-readme.md | 86 +++++++++++++++++++ .../assets/validate-report.ps1 | 63 ++++++++++++++ 9 files changed, 252 insertions(+), 20 deletions(-) create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql diff --git a/.github/skills/oncall-weekly-telemetry-report/SKILL.md b/.github/skills/oncall-weekly-telemetry-report/SKILL.md index 623ecdf9..32e4dea5 100644 --- a/.github/skills/oncall-weekly-telemetry-report/SKILL.md +++ b/.github/skills/oncall-weekly-telemetry-report/SKILL.md @@ -31,7 +31,11 @@ Reusable helpers in [`assets/`](assets/): ## Inputs to confirm with the user -1. **Reporting week** — defaults to the **most recent complete Sun→Sat week**. If today is itself a Saturday or Sunday, the user often actually wants the **current in-progress week** instead — ASK explicitly. If they pick the in-progress week: +1. **Reporting week** — **first compute the most recent complete Sun→Sat week** (Sunday bucket = the most recent Sunday strictly before today, or today itself if today is a Sunday and the week's data is at least 6h old). Default to that and proceed without asking *unless*: + - today is itself a Sat or Sun **and** the user phrasing suggests they want "this week" (e.g. "current report", "latest data"). Then ASK explicitly between the in-progress and most-recent-complete options. 
+ - today is a Mon–Fri — just default to the most recent complete week and proceed; do not ask. + + If the user picks the in-progress week: - Add the badge text *"Live data — current bucket may still be filling"* to the report header. - The `bucket-trends.js` `--end` flag + the `| where week < datetime(<END>)` source filter both still apply (use the Sunday AFTER the reporting week as `<END>`); they will drop the partial-end-bucket warning. @@ -174,14 +178,16 @@ materialized_view('ErrorStatsMetrics') #### 3c. Run the bucketer 4 times (cross-product of `{code, type} × {devices, requests}`) +`bucket-trends.js` defaults to grouping by `error_code`. For the type runs you MUST pass `--key=unified_error_type` so it picks up the right column from the type-trend JSON. + ```pwsh # Error codes — by devices, then by requests node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js <codes.json> --start=2026-03-08 --end=2026-05-10 node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js <codes.json> --start=2026-03-08 --end=2026-05-10 --metric=reqs -# Error types — by devices, then by requests -node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js <types.json> --start=2026-03-08 --end=2026-05-10 -node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js <types.json> --start=2026-03-08 --end=2026-05-10 --metric=reqs +# Error types — by devices, then by requests (note --key) +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js <types.json> --start=2026-03-08 --end=2026-05-10 --key=unified_error_type +node .github\skills\oncall-weekly-telemetry-report\assets\bucket-trends.js <types.json> --start=2026-03-08 --end=2026-05-10 --key=unified_error_type --metric=reqs ``` `--end` is the Sunday AFTER the reporting week (exclusive). The script also auto-detects partial end-buckets and warns, but passing `--end` explicitly is safer. 
@@ -225,6 +231,8 @@ For each WoW mover (regardless of size), you still owe the full Code Attribution ### Step 4 — Code attribution (deep PR correlation) > ⚠️ **HARD RULE — Originator pre-check.** Before claiming `Originator: Broker` on any card, you MUST run [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) for that error code (or type) and read **(a) the throw-site stack and (b) the top 3 `error_message` strings**. Most broker error codes flow through `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult, clientExceptionFromException}` — which intentionally bridge eSTS responses into broker exceptions. **If the throw site is in any of those three methods AND the error_message starts with `AADSTS`, the originator is eSTS, not broker.** See the AADSTS reference table in [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). Cards that skip this step must be marked low-confidence, not high. +> +> **Window:** use the FULL 7-day reporting window (`<CURR_START>` → `<CURR_END>`) on `PipelineInfo_IngestionTime`, NOT a narrower 3–5 day slice — low-volume types (e.g. `SSLHandshakeException`, `IntuneAppProtectionPolicyRequiredException`) routinely return zero rows in a sub-week window. If a code/type still returns nothing, fall back to the prior 14 days before declaring "no data". For every regression card, the Code Attribution block **must** populate the following fields. Shallow PR-citation only is not acceptable. Use [`assets/code-attribution-template.md`](assets/code-attribution-template.md) as the per-card checklist. @@ -435,6 +443,8 @@ The validator hard-fails on: 3. `U+FFFD` replacement characters (catches mojibake from emoji edits). 4. Unbalanced `<div>` depth in the Section 2 attention block (catches the inception-style nested-callout bug from past runs). 5. A second callout opening before the previous one closes (nested-callout sanity check). +6. 
**Chartless KPI grid** — if more than half the `.kpi` tiles lack a `data-spark` element (catches the v7 regression where the body was rebuilt without sparklines). Also warns when total chart count (sparks + trends + inline svgs) is < 15. +7. **Code-attribution depth** — each `.attr-card`'s "Code attribution" block must contain an `Originator` row (proxy for the full 8-field structure: Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). Catches the v7-third-pass regression where cards shipped with a `pr-list`-only stub. Then: - Run `get_errors` on the HTML file (no errors expected — pure HTML/CSS). @@ -455,7 +465,7 @@ Then: - **WoW-movers pass is mandatory.** The 60d bucketer's `--peak-floor` silently drops sub-10K-device codes, so [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) MUST be run as a separate pass for both `error_code` and `error_type` (per Step 3d). Its output is **merged into the single 🔴 WoW regressions callout**, sorted by current-week device count descending, with rows tagged `NEW` / `60d↑` / originator chip. Do not render a separate "emerging" callout. Skipping the pass is how the Apr 26 `Failed to parse JWT` spike (7 → 3,461 devs over 7 weeks) hid for two reports running. - **Section 2 callouts are at-a-glance, Section 4 is the deep dive.** WoW / Slow-burn / Wins items in Section 2 use the `.item` flat-row pattern (no nested cards, no per-item left bars — the parent `.callout` border is the only severity affordance). Each row is a single line of metric chips + a one-line body + an `Attribution card →` link to the corresponding `.attr-card` in Section 4. Do NOT duplicate the dim slicing, PR analysis, or detailed verdict between the two sections — Section 4 is where that lives. See [`assets/template-readme.md`](assets/template-readme.md) for the CSS class reference and the example `.item` markup. 
- **Never use bash/PowerShell regex to bulk-edit balanced HTML.** This skill has burned twice on regex strip scripts that ate matched-pair `</div>` closes, producing inception-style nested-callout bugs that take a depth-tracking script to find. If you need a structural change to the HTML, make a targeted, single-occurrence string replacement (with explicit before/after context) or rewrite the affected block end-to-end. Never run a `-replace` across the whole file expecting it to leave balance intact. -- **Denominator caveat must cite evidence, not hand-wave.** If you flag a large all-spans device-count shift, run [`assets/queries/broker-version-share.kql`](assets/queries/broker-version-share.kql) and name the version cohort the shift moved with. Do not write "recurring telemetry-shape artifact" without backing data; if you don't have it, drop the callout. +- **Denominator caveat must cite evidence, not hand-wave.** If you flag a large all-spans device-count shift, run [`assets/queries/broker-version-share-wow.kql`](assets/queries/broker-version-share-wow.kql) (single WoW snapshot) or [`assets/queries/broker-version-share.kql`](assets/queries/broker-version-share.kql) (time-series) and name the version cohort the shift moved with. Do not write "recurring telemetry-shape artifact" without backing data; if you don't have it, drop the callout. - **"Recovery" still merits a PR citation.** When an error pins to a single old broker version and recovers as that version retires, look for the **fix PR in the version that replaced it** before calling it a "natural rolloff." Often the fix is real and just under-credited. - **Never report WoW-only verdicts** for errors that are flat-or-down WoW but rising on 60d — always cross-check both windows. - **Never page** based on a regression that turns out to be a downstream of a denominator shift; always include the auth-only-denominator number alongside the all-spans number. 
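The cross-check rule above (an error that is flat or down WoW can still be rising over 60 days) can be sketched in JavaScript. This is an illustrative sketch only, not the logic in `bucket-trends.js`: the function names, the 10% threshold, and the verdict labels are assumptions, and `weekly` is assumed to hold at least two weekly device counts, oldest first.

```javascript
// Hypothetical helper: percent change, with a zero baseline treated as "new".
function pctChange(curr, prior) {
  if (!prior) return curr > 0 ? Infinity : null; // no baseline to divide by
  return ((curr - prior) / prior) * 100;
}

// Hypothetical classifier: compares the WoW window with the full 60-day window.
// `threshold` is an assumed cut-off in percent, not a value from the skill.
function crossCheckWindows(weekly, threshold = 10) {
  const curr = weekly[weekly.length - 1];
  const wow = pctChange(curr, weekly[weekly.length - 2]);
  const sixtyDay = pctChange(curr, weekly[0]);
  const rising = (v) => v !== null && v > threshold;

  if (rising(sixtyDay) && !rising(wow)) return 'slow-burn';  // hidden by a WoW-only view
  if (rising(wow) && !rising(sixtyDay)) return 'wow-spike';  // sharp this week, no long trend
  if (rising(wow) && rising(sixtyDay)) return 'regression';  // rising in both windows
  return 'flat-or-improving';
}
```

A series that dips slightly this week but has climbed steadily since the first week comes back as `slow-burn`, which is the case a WoW-only verdict would miss.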
@@ -481,8 +491,10 @@ Then: - [ ] Non-broker errors are explicitly tagged `environmental` / `non-broker` with confidence `none` — not invented broker PRs. - [ ] Traffic analysis covers totals, per-app, per-span, requests-per-device ratio (per error AND overall), and a sampling-change check. - [ ] **Every material traffic shift (>10% on any segment, up or down) has a reasoning paragraph** that names the dominant span/app/active-broker/broker-version, and either cites a causal PR (with confidence) — span removed/added, `goAsync()` refactor, sampling change, caller-side SDK release, ECS flight ramp — or explicitly says "no PR identified, suspect X" rather than leaving it unexplained. -- [ ] Denominator caveat (if used) is backed by [`broker-version-share.kql`](assets/queries/broker-version-share.kql) evidence naming the responsible version cohort. No hand-waving. +- [ ] Denominator caveat (if used) is backed by [`broker-version-share-wow.kql`](assets/queries/broker-version-share-wow.kql) or [`broker-version-share.kql`](assets/queries/broker-version-share.kql) evidence naming the responsible version cohort. No hand-waving. - [ ] Auth-only denominator used for all reliability %s, denominator caveat called out at top. - [ ] No `\bdevs\b` or `\breqs\b` in user-facing text. (`Select-String -Pattern '\bdevs\b|\breqs\b' -CaseSensitive:$false` returns 0.) +- [ ] **Sparklines rendered.** Every `.kpi` tile in the Top-line health section has a `data-spark` array with 8–9 weekly values. Every row in the 60-day trend tables and both WoW tables (codes + types) has a `data-trend` mini-spark. The validator's chart-coverage check passes (KPI coverage ≥1/2 of tiles, total elements ≥15). Past failure mode: the v7 body rebuild dropped all sparklines silently — see `template-readme.md` § "Sparklines are MANDATORY". 
+- [ ] **Code-attribution depth.** Every `.attr-card`'s Code attribution block uses the full 8-field `<div class="origin-row">` structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) per [`assets/code-attribution-template.md`](assets/code-attribution-template.md). A `pr-list`-only stub is **not acceptable** — the validator hard-fails this. Past failure mode (v7 third pass): all 10 cards shipped with PR-only stubs and lost the throw-site / wrapper / underlying-cause analysis. - [ ] No stale text from previous weeks. (`Select-String -Pattern 'EXAMPLE CONTENT BELOW'` returns 0 — that's the unfinished-section sentinel. The template has not shipped `{{TOKEN}}` placeholders since v2; if the file still contains any `{{`, that's also a leftover.) - [ ] `get_errors` clean on the HTML file. diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md b/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md index c81a2907..b5302510 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md @@ -130,6 +130,22 @@ all | join kind=inner ok on week | order by week asc ``` +**Auth-only device union** (Silent ∪ Interactive — what the report uses for the "real fleet" KPI). `hll_merge_array`, the function you might naturally reach for to combine two pre-merged HLL sketches, **does not exist in Kusto** (`SEM0260: Unknown function`). Instead, project the raw `countDevicesHll` rows from both views, `union` them, and `hll_merge` once at the end: + +```kql +let s = materialized_view('SilentAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime(<START>) .. datetime(<END>)) + | project EventInfo_Time, countDevicesHll; +let i = materialized_view('InteractiveAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime(<START>) .. 
datetime(<END>)) + | project EventInfo_Time, countDevicesHll; +union s, i +| summarize authDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time) +| where week < datetime(<END>) +| order by week asc +``` + ### 8b. 60-day error trend (feeds `bucket-trends.js`) ```kql diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md b/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md index 6d3abf2c..12351dae 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md @@ -20,6 +20,7 @@ weekly report needs. Token convention: |---|---|---| | [`reliability-auth-only.kql`](reliability-auth-only.kql) | Per-week auth-only requests/devices | Top-line health, denominator caveat | | [`broker-version-share.kql`](broker-version-share.kql) | Per-week per-version share — **evidence for denominator caveat** | Denominator caveat callout, broker adoption | +| [`broker-version-share-wow.kql`](broker-version-share-wow.kql) | Single WoW snapshot of version share — fastest evidence for cohort transitions | Denominator caveat callout | | [`60d-trend-codes.kql`](60d-trend-codes.kql) | Feeds `bucket-trends.js` for codes | 60-day trend analysis | | [`60d-trend-types.kql`](60d-trend-types.kql) | Feeds `bucket-trends.js` for types | 60-day trend analysis | | [`wow-movers.kql`](wow-movers.kql) | **MANDATORY second pass** — catches small-base codes that spiked sharply this week (below the 60d bucketer's reporting threshold). Run for both `error_code` and `error_type`. **Merge its output rows into the single 🔴 WoW regressions callout** alongside the standard WoW table; tag rows that were absent or near-zero last week with `NEW`. Do not render a separate "emerging" callout. 
| 🔴 WoW regressions callout (Section 2) | diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql index 0efb0a2d..00afd33f 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql @@ -15,8 +15,7 @@ // dim string short label per dimension // wk datetime reporting week // <KEY> string error_code or unified_error_type -// val_string string dim value (for string-typed dims) -// val_bool bool dim value (for shared-device only) +// val_string string dim value (cast via tostring() in every union leg) // devs long dcount_hll merged device count // errs long sum of countOverall (request count) // @@ -27,6 +26,12 @@ // | where unified_error_type in (<TYPES>) // | extend wk = startofweek(EventInfo_Time); +// IMPORTANT — column-aliasing gotcha: every union branch MUST emit `val_string` +// as a real `string` (never `bool(null)`), or Kusto will rename the columns +// `val_string_string` and `val_string_bool` in the result schema, which then +// breaks `summarize-attribution.js` (it now accepts both names as a fallback, +// but emitting one consistent `string` column is cleaner). Use `tostring()` on +// non-string dims (e.g. shared_dev) so every leg has a string-typed column. let codes = dynamic([<CODES>]); let base = materialized_view('ErrorStatsMetrics') | where EventInfo_Time between (datetime(<START>) .. 
datetime(<END>)) @@ -34,26 +39,26 @@ let base = materialized_view('ErrorStatsMetrics') | extend wk = startofweek(EventInfo_Time); (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='span', wk, error_code, val_string=span_name, val_bool=bool(null)) + by dim='span', wk, error_code, val_string=tostring(span_name)) | union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='calling_app', wk, error_code, val_string=calling_package_name, val_bool=bool(null)) + by dim='calling_app', wk, error_code, val_string=tostring(calling_package_name)) | union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='active_broker', wk, error_code, val_string=active_broker_package_name, val_bool=bool(null)) + by dim='active_broker', wk, error_code, val_string=tostring(active_broker_package_name)) | union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='broker_ver', wk, error_code, val_string=broker_version, val_bool=bool(null)) + by dim='broker_ver', wk, error_code, val_string=tostring(broker_version)) | union (base | extend t = MergeAccountType(account_type) | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='acct_type', wk, error_code, val_string=t, val_bool=bool(null)) + by dim='acct_type', wk, error_code, val_string=tostring(t)) | union (base | extend s = MergeIsSharedDevice(is_shared_device) | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='shared_dev', wk, error_code, val_string=s, val_bool=bool(null)) + by dim='shared_dev', wk, error_code, val_string=tostring(s)) | union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), errs = sum(countOverall) - by dim='client_sku', wk, error_code, val_string=client_sku, val_bool=bool(null)) + by dim='client_sku', wk, error_code, 
val_string=tostring(client_sku)) | where wk < datetime(<END>) | order by error_code asc, dim asc, wk asc, devs desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql new file mode 100644 index 00000000..0fe05c58 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql @@ -0,0 +1,34 @@ +// WoW broker-version share comparison: the canonical evidence for the +// "denominator caveat" callout when an entire version cohort retires. Use this +// instead of the time-series `broker-version-share.kql` when you need a single +// WoW snapshot showing which versions gained/lost share. Modeled on +// `wow-movers.kql`. +// +// Inputs: +// <CURR_START> Sunday of the reporting week (e.g. 2026-05-03) +// <CURR_END> Sunday after (exclusive, e.g. 2026-05-10) +// <PRIOR_START> Sunday of the baseline week (e.g. 2026-04-26) +// +// Floor: only versions with >100M reqs in either week (filters long-tail). +// Output sorted by current-week req count descending. + +let curr = materialized_view('BrokerAdoptionStatsUpdated') + | where EventInfo_Time between (datetime(<CURR_START>) .. datetime(<CURR_END>)) + | summarize cReq = sum(countRequests), + cDev = dcount_hll(hll_merge(countDevicesHll)) + by broker_version; +let prior = materialized_view('BrokerAdoptionStatsUpdated') + | where EventInfo_Time between (datetime(<PRIOR_START>) .. 
datetime(<CURR_START>)) + | summarize pReq = sum(countRequests), + pDev = dcount_hll(hll_merge(countDevicesHll)) + by broker_version; +curr | join kind=fullouter prior on broker_version +| extend bv = coalesce(broker_version, broker_version1) +| extend cReq = coalesce(cReq, long(0)), cDev = coalesce(cDev, long(0)), + pReq = coalesce(pReq, long(0)), pDev = coalesce(pDev, long(0)) +| project bv, pReq, cReq, + dReqPct = iff(pReq == 0, real(null), round(100.0 * (cReq - pReq) / pReq, 1)), + pDev, cDev, + dDevPct = iff(pDev == 0, real(null), round(100.0 * (cDev - pDev) / pDev, 1)) +| where cReq > 100000000 or pReq > 100000000 +| order by cReq desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql index d719b85e..34c03364 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql @@ -14,9 +14,15 @@ // Inputs: // <CODES_LIST> e.g. 'invalid_resource', 'null_pointer_error' (or empty) // <TYPES_LIST> e.g. 'IntuneAppProtectionPolicyRequiredException' (or empty) -// <START> datetime of reporting-week PipelineInfo_IngestionTime start +// <START> datetime — should be the reporting-week Sunday (e.g. 2026-05-03). +// Use the FULL 7-day reporting window, NOT a narrower 3-5 day slice +// (low-volume types like SSLHandshakeException / Intune* may return +// zero rows in a sub-week window). // <END> datetime of next Sunday (exclusive) // +// Tip: if the reporting window returns no rows for a low-volume code/type, fall +// back to the prior 14-day window (`<START> - 7d .. <END>`) before giving up. +// // Output column 'loc' is a JSON blob {"ClassName":"...","MethodName":"...","LineNumber":N} // — this is normal. Read it as text. 
To project the method name only, use // parse_json(loc).MethodName diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js index 617d0a16..69972740 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js +++ b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js @@ -78,7 +78,7 @@ function fmt(n) { } function pct(num, den) { return den ? (100 * num / den).toFixed(1) : '0.0'; } function delta(curr, prior) { - if (prior == null || prior === 0) return curr ? 'NEW' : '–'; + if (prior == null || prior === 0) return curr ? `NEW(+${fmt(curr)})` : '–'; return ((curr - prior) / prior * 100).toFixed(1) + '%'; } @@ -123,10 +123,19 @@ function loadUnion(file) { let idxWeek = idx('wk'); if (idxWeek < 0) idxWeek = idx('week'); let idxDevs = idx('devs'); if (idxDevs < 0) idxDevs = idx('countDevices'); let idxErrs = idx('errs'); if (idxErrs < 0) idxErrs = idx('countOverall'); - const idxValS = idx('val_string') >= 0 ? idx('val_string') : idx('val'); - const idxValB = idx('val_bool'); + // Kusto auto-renames duplicate column names from union branches: a column + // declared `val_string` in two `union` legs (one typed string, one typed + // bool(null)) becomes `val_string_string` and `val_string_bool`. Accept + // those as synonyms so the union KQL doesn't need a per-leg cast. + const idxValS = + idx('val_string') >= 0 ? idx('val_string') : + idx('val_string_string') >= 0 ? idx('val_string_string') : + idx('val'); + const idxValB = + idx('val_bool') >= 0 ? idx('val_bool') : + idx('val_string_bool'); if (idxDim < 0 || idxCode < 0 || idxWeek < 0 || idxDevs < 0 || idxValS < 0) { - throw new Error(`Union file ${file}: schema must include dim, ${keyCol}, wk|week, devs|countDevices, val_string|val (and optionally val_bool). 
Got [${cols.join(', ')}]`); + throw new Error(`Union file ${file}: schema must include dim, ${keyCol}, wk|week, devs|countDevices, val_string|val|val_string_string (and optionally val_bool|val_string_bool). Got [${cols.join(', ')}]`); } // perDim[label].map[code][wk][dimVal] = { devs, errs } const byDim = {}; diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md index 54af933a..6568ad80 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md @@ -32,6 +32,50 @@ If the layout itself ever needs to change (new section, new card style), edit `assets/report-template.html` here in the skill folder and commit so future weeks inherit the change. +## Editing strategy: in-place vs head+body+footer rebuild + +Pick by overlap with the prior week: + +- **In-place edit (default)** — when ≤3 attribution cards change AND the section structure is unchanged. Use `replace_string_in_file` with surrounding context per card / table row. Fast and low-risk. +- **Head+body+footer rebuild (fallback)** — when ≥4 attribution cards change, or several callouts get re-categorized, or the regression set has near-zero overlap with the template. Trying to in-place edit at that scale invites the inception-style nested-`</div>` bugs the validator was written to catch. 
+ + Boundary lines in the canonical template (verify with `grep` before splitting — they drift as the template evolves): + + | Region | Lines (approx) | Last/first line content | + |---|---|---| + | **head** | 1 → ~342 | ends `<body>` then `<div class="container">` (open) | + | **body** (replace) | ~343 → ~1081 | starts `<div class="header">`, ends `</div>` that closes `.container` | + | **footer** | ~1082 → end | starts `<script>`, ends `</body></html>` | + + Rebuild recipe (PowerShell, single line — multi-line here-strings can mangle JS template literals in the footer; see user-memory `oce-report-lessons.md`): + + ```pwsh + $work="$env:USERPROFILE\android-oce-reports\_data"; $f='<output>.html'; $head=[IO.File]::ReadAllText("$work\head.html"); $body=[IO.File]::ReadAllText("$work\body.html"); $footerRaw=[IO.File]::ReadAllText("$work\footer.html"); $footer=$footerRaw -replace '^</div>\s*',''; [IO.File]::WriteAllText($f, $head + "`n" + $body + "`n" + $footer) + ``` + + The `-replace '^</div>\s*',''` strips the original body's closing `</div>` from the footer so the new body's own closing `</div>` doesn't double up. Always run `validate-report.ps1` after. + + **Critical for the rebuild path:** the rebuilt body must include `data-spark` on every KPI tile and `data-trend` on every relevant table row — the in-place template has these, but a fresh-authored body won't unless you add them explicitly. 
Reference markup: + + ```html + <!-- KPI tile with sparkline --> + <div class="kpi"> + <div class="label">Silent auth requests (week)</div> + <div class="value">10.59 B</div> + <div class="delta delta-up">+2.4% WoW</div> + <div class="spark" data-spark='[9.97e9,9.61e9,...,1.06e10]' data-color="#0969da"></div> + </div> + + <!-- 60-day trend table row with mini sparkline in the trajectory cell --> + <tr> + <td><code>no_tokens_found</code></td> + <td class="num">2.90 M</td><td class="num">4.52 M</td><td class="num bad">+55.7%</td> + <td><span class="trend" data-trend='[2902878,...,4519309]' data-color="#cf222e" data-w="160"></span></td> + </tr> + ``` + + The footer JS auto-renders both — no per-tile JS calls needed. The validator (Step 7) hard-fails if > half the KPI tiles lack `data-spark`. + ## Validator pass before saving Two literal-string greps must return zero results: @@ -49,6 +93,48 @@ inside an HTML comment. The grep catches anything still in flight. forbidden — use `devices` / `requests` in user-facing prose, headers, badges, and verdicts. +## Sparklines are MANDATORY (don't drop them) + +The footer JS auto-renders any element with `data-spark` or `data-trend` attributes — but only if you actually emit those attributes. **Past mistake (v7 run):** body was rebuilt without `data-spark` on KPI tiles and without `.trend` cells in tables → the report shipped with zero charts. The validator hard-fails only when more than half the KPI tiles lack `data-spark` and merely warns on low overall coverage, so table-row sparklines remain your responsibility. + +Required spark/trend coverage in every report: + +| Where | Attribute | Length | Color (see palette below) | +|---|---|---|---| +| Every KPI tile in `.kpi-grid` (Top-line health) | `<div class="spark" data-spark='[...]' data-color="..."></div>` inside the tile | 8–9 weekly values | blue/green/dark-blue per metric semantic | +| **Every** row in the 60-day trend tables — true regressions, **ephemeral spikes**, and **true improvements** (all three callout tables) | `<span class="trend" data-trend='[...]' data-color="..." 
data-w="160"></span>` in the trajectory cell | 8–9 weekly values | red regression / amber spike / green improvement / grey flat | +| Every row in the error-codes WoW table and error-types WoW table | `<span class="trend" data-trend='[...]' data-color="..."></span>` in the 60d-trend column | 8 weekly values | same palette | + +**Past failure modes:** +- v7 first pass: the body rebuild emitted *zero* `data-spark` / `data-trend` (validator now hard-fails this). +- v7 second pass: only the *true regressions* table got sparklines; the **ephemeral spikes** and **true improvements** tables were left text-only. All three tables in the 60-day trend section need the trajectory column with a sparkline — the validator's overall-coverage warn (≥15) catches this approximately, but the rule of thumb is: **if a row reports an 8-week delta, it gets a sparkline.** + +## Traffic-shape callout styling + +The Section 2 "Traffic shape" callout uses the **neutral grey-bordered `<div class="callout">`** (no `urgent` / `watch` / `win` modifier) and a **🚦** icon — it's an informational summary, not an alert. Don't promote it to `watch` (yellow) just because there's been some movement; reserve `watch` for things that need follow-up. + +## Traffic-attribution sub-block on each attribution card (tri-state) + +Each `.attr-card` in Section 4 ends with a small "Traffic attribution" sub-block. **Pick one of three colors based on the verdict — don't paint everything yellow.** Yellow loses meaning when it's the default. 
+ +| Verdict | Color | Title prefix | Inline `style` on the wrapper | +|---|---|---|---| +| Per-request rate clearly moved; traffic ruled out | 🟢 green | `✓ Traffic attribution — ruled out` | `background:#dafbe1;border-color:#1a7f37;` + title `color:#1a7f37;` | +| Mixed signal — traffic + rate both contributing | 🟡 yellow | `⚠ Traffic attribution — partly contributing` | `background:linear-gradient(180deg,#fff8c5 0%,#fff1a8 100%);border-color:#d4a72c;` + title `color:#9a6700;` | +| Traffic IS the dominant driver | 🔴 red | `🚚 Traffic attribution — primary driver (see § 5)` | `background:#ffeef0;border-color:#cf222e;` + title `color:#cf222e;` | + +A red sub-block here means the error **also** belongs in the top-level § 5 "🚚 Traffic Attribution" section. Don't surface a red sub-block without a matching § 5 entry, and don't render § 5 as "None this week" if any attribution card has a red sub-block. + +Past failure mode (v7 second pass): all 10 cards painted yellow regardless of verdict, making the color meaningless. The actual breakdown that week was 6 green + 4 yellow + 0 red. + +**Minimum verification step before publishing** (add to your final-pass checklist): + +```pwsh +Select-String -Path <output.html> -Pattern 'data-spark|data-trend' | Measure-Object | Select-Object Count +``` + +Should return **at least ~30** matches (8 KPI tiles + ~10 60d-trend rows + ~12 WoW-table rows). If the count is zero or near-zero, the report is missing all charts — go back and add them. 
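The verification step above can also be sketched in JavaScript, for cases where the check runs alongside the other `assets/*.js` helpers. This is a minimal sketch under stated assumptions: `countChartAttributes` and `chartCoverageVerdict` are hypothetical names (not part of the skill), the default minimum of 30 is taken from the guidance above, and the sketch counts attribute occurrences rather than matching lines, so its totals can differ slightly from the `Select-String` count.

```javascript
// Hypothetical helper: counts sparkline attributes in a report's HTML string.
function countChartAttributes(html) {
  const count = (re) => (html.match(re) || []).length;
  const spark = count(/data-spark=/g); // KPI-tile sparklines
  const trend = count(/data-trend=/g); // table-row mini sparklines
  return { spark, trend, total: spark + trend };
}

// Hypothetical verdict: 'missing' when no charts exist at all, 'low' when the
// total is under the assumed minimum, 'ok' otherwise.
function chartCoverageVerdict(html, minimum = 30) {
  const { total } = countChartAttributes(html);
  if (total === 0) return 'missing';
  return total < minimum ? 'low' : 'ok';
}
```

Reading the report with `fs.readFileSync(path, 'utf8')` and passing the string to `chartCoverageVerdict` would give a quick pre-publish signal; the validator script remains the authoritative check.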
+ ## Sparkline color palette Used by both `.spark` (KPI tiles) and `.trend` (table cells): diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 index be1a0a07..866e8390 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 +++ b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 @@ -124,6 +124,69 @@ $calloutOpens = ([regex]::Matches($content, '<div class="callout(?:\s|")')).Coun Write-Host "" Write-Host "Info: $calloutOpens callout container(s) in the document." +# ---- 6. Sparkline / trend chart coverage ---- +# The footer JS auto-renders any element with data-spark or data-trend. If the +# count is near-zero, the body was likely rebuilt without sparklines (v7 +# regression — chartless report). +# +# Two checks: +# 6a. STRUCTURAL (HARD FAIL): if the report has KPI tiles but >half lack +# data-spark, the rebuild dropped them — fail the build. +# 6b. OVERALL (WARN): total chart elements should be ~30+ (8 KPI sparks + +# ~10 trend rows + ~12 WoW-table rows). Warn if under 15. +$sparkCount = ([regex]::Matches($content, 'data-spark=')).Count +$trendCount = ([regex]::Matches($content, 'data-trend=')).Count +$inlineSvg = ([regex]::Matches($content, '<svg[^>]*class="?sparkline')).Count +$kpiTiles = ([regex]::Matches($content, '<div class="kpi"')).Count +$totalCharts = $sparkCount + $trendCount + $inlineSvg +Write-Host "" +Write-Host "Info: $sparkCount data-spark, $trendCount data-trend, $inlineSvg inline sparkline svg(s), $kpiTiles KPI tile(s)." + +if ($kpiTiles -ge 4 -and $sparkCount -lt [Math]::Ceiling($kpiTiles / 2)) { + Add-Fail "Only $sparkCount data-spark element(s) for $kpiTiles KPI tile(s) — over half the KPI tiles are chartless. The body was likely rebuilt without sparklines. See template-readme.md `"Sparklines are MANDATORY`"." 
+} else { + Pass "KPI tiles have data-spark coverage ($sparkCount/$kpiTiles)" +} +if ($totalCharts -lt 15) { + Add-Warn "Only $totalCharts chart elements found. Expected ~30+ (KPI sparks + 60d-trend rows + WoW-table rows). Did you forget to add data-trend attributes to the WoW / trend tables?" +} else { + Pass "Overall chart coverage looks reasonable ($totalCharts elements)" +} + +# ---- 7. Traffic-attribution sub-block color diversity (tri-state convention) ---- +# Per template-readme.md: each .attr-card's traffic sub-block should be green +# (ruled out), yellow (partly contributing), or red (primary driver). If every +# sub-block is the same color, the author defaulted to one and didn't actually +# classify per card (v7 second-pass regression: 10/10 yellow). +$taGreen = ([regex]::Matches($content, '\u2713 Traffic attribution \u2014 ruled out')).Count +$taYellow = ([regex]::Matches($content, '\u26a0 Traffic attribution \u2014 partly contributing')).Count +$taRed = ([regex]::Matches($content, '\ud83d\ude9a Traffic attribution \u2014 primary driver')).Count +$taTotal = $taGreen + $taYellow + $taRed +if ($taTotal -ge 4) { + $distinctColors = @($taGreen, $taYellow, $taRed | Where-Object { $_ -gt 0 }).Count + if ($distinctColors -le 1) { + Add-Warn "All $taTotal traffic-attribution sub-blocks share one color (g=$taGreen y=$taYellow r=$taRed). The tri-state convention exists so color carries meaning; verify each card's verdict and recolor accordingly. See template-readme.md `"Traffic-attribution sub-block on each attribution card (tri-state)`"." + } else { + Pass "Traffic-attribution color mix: $taGreen green / $taYellow yellow / $taRed red" + } +} + +# ---- 8. Code-attribution depth (8-field structure) ---- +# SKILL.md §4 mandates that each .attr-card's "Code attribution" block populates +# Originator + Top throw site + Wrapper + Caller hot-spots + Underlying cause + +# Top error_messages + Likely PRs + Next step. 
A pr-list-only block is the v7-third- +# pass regression. Heuristic: each `<div class="code-attr-title">Code attribution</div>` +# must be followed (within the same card) by an `origin-label` row. +$codeAttrBlocks = ([regex]::Matches($content, '<div class="code-attr-title">Code attribution</div>')).Count +$originLabels = ([regex]::Matches($content, 'class="origin-label">Originator')).Count +if ($codeAttrBlocks -ge 1) { + if ($originLabels -lt $codeAttrBlocks) { + Add-Fail "$codeAttrBlocks Code-attribution block(s) but only $originLabels have an Originator row. Each card needs the full 8-field structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). See assets/code-attribution-template.md." + } else { + Pass "All $codeAttrBlocks code-attribution block(s) have full 8-field structure" + } +} + # Cheap nested-callout heuristic: scan the attention block for any callout that # opens before the previous callout closes. We approximate by tracking depth. 
if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { From 427d8e4102817c38c8803ceb120fe44fdfbf5719 Mon Sep 17 00:00:00 2001 From: Shahzaib <shahzaib.jameel@microsoft.com> Date: Tue, 9 Jun 2026 17:48:10 -0700 Subject: [PATCH 4/6] Template fixes --- .../assets/report-template.html | 65 +++++++++++++++---- .../assets/template-readme.md | 54 +++++++++++++++ .../assets/templates/spike-card.html | 27 +++++++- .../assets/validate-report.ps1 | 54 ++++++++++++++- 4 files changed, 184 insertions(+), 16 deletions(-) diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html index 5375121c..49953f30 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html +++ b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html @@ -132,7 +132,12 @@ .attr-card { background: #fff; border: 1px solid #d0d7de; border-radius: 12px; box-shadow: 0 1px 3px rgba(31,35,40,0.06); overflow: hidden; + /* Spacing when cards are rendered as siblings (no .attr-grid wrapper) — + e.g. a head+body+footer rebuild that emits .attr-card directly under <h2>. + Without this, consecutive cards visually touch. Discovered in v8 rebuild. */ + margin-bottom: 16px; } + .attr-card + .attr-card { margin-top: 16px; } .attr-header { padding: 14px 18px; background: linear-gradient(180deg, #fafbfc 0%, #f6f8fa 100%); border-bottom: 1px solid #d0d7de; display: flex; align-items: center; @@ -162,37 +167,69 @@ .attr-verdict.bad { border-left-color: #cf222e; background: #fff8f8; } .attr-verdict strong { color: #0d1117; } - .attr-dims { - display: grid; grid-template-columns: repeat(auto-fit, minmax(180px, 1fr)); + /* .attr-dims grid + .dim cards — long calling-app / version names MUST truncate, + never wrap or bleed out. Two flexbox traps to be aware of (v8 regression): + 1. text-overflow:ellipsis is silently IGNORED on display:inline elements. 
+ The name <span> must be display:block (or inline-block) for ellipsis. + 2. Flex children won't shrink below their content size unless min-width:0 + is set explicitly on the child AND every flex ancestor. + Both fixes are baked into the rules below. Authors who hand-write .dim-row + markup may use either classed (.dim-name + .dim-pct) or unclassed (<span> + + <span>) — both are covered. */ + .attr-dims { min-width: 0; display: grid; + grid-template-columns: repeat(auto-fit, minmax(180px, 1fr)); gap: 12px; } .dim { background: #fafbfc; border: 1px solid #eaeef2; border-radius: 8px; padding: 10px 12px; + min-width: 0; overflow: hidden; } .dim-label { font-size: 10px; font-weight: 700; text-transform: uppercase; color: #656d76; letter-spacing: 0.4px; margin-bottom: 6px; } .dim-row { - display: flex; align-items: center; justify-content: space-between; + display: flex; align-items: center; + flex-wrap: nowrap; gap: 8px; padding: 3px 0; font-size: 12px; + min-width: 0; width: 100%; } - .dim-row .dim-name { - font-family: "SF Mono", Consolas, monospace; font-size: 11.5px; - color: #0d1117; flex: 1; overflow: hidden; text-overflow: ellipsis; - white-space: nowrap; + /* Bar-track fixed width — must not flex */ + .dim-row > .dim-bar-track { + flex: 0 0 56px; width: 56px; min-width: 56px; + height: 4px; background: #eaeef2; border-radius: 2px; + margin-top: 0; overflow: hidden; } - .dim-row .dim-pct { + /* Name column — classed (.dim-name) OR first unclassed <span>. + KEY: display:block + flex:1 1 0 + min-width:0 to make ellipsis engage. 
*/ + .dim-row .dim-name, + .dim-row > span:first-of-type { + display: block; + flex: 1 1 0; min-width: 0; max-width: 100%; + overflow: hidden; text-overflow: ellipsis; white-space: nowrap; + font-family: "SF Mono", Consolas, monospace; font-size: 11.5px; color: #0d1117; + } + /* Percent column — classed (.dim-pct) OR last unclassed <span> */ + .dim-row .dim-pct, + .dim-row > span:last-of-type { + flex: 0 0 auto; font-variant-numeric: tabular-nums; color: #656d76; font-size: 11px; - min-width: 38px; text-align: right; + min-width: 38px; text-align: right; white-space: nowrap; } - .dim-row.dominant .dim-name { font-weight: 700; } - .dim-row.dominant .dim-pct { color: #cf222e; font-weight: 700; } - .dim-bar-track { - height: 4px; background: #eaeef2; border-radius: 2px; - margin-top: 2px; overflow: hidden; + /* Single-span placeholder rows (e.g. "Not sliced — …") */ + .dim-row > span:only-child { + display: block; flex: 1 1 0; min-width: 0; max-width: 100%; + overflow: hidden; text-overflow: ellipsis; white-space: nowrap; } + /* Dominant row styling — works whether .dominant is on the row OR derived + from the bar-fill class (the body-generated markup only marks the fill). 
*/ + .dim-row.dominant .dim-name, + .dim-row.dominant > span:first-of-type, + .dim-row:has(> .dim-bar-track > .dim-bar-fill.dominant) > span:first-of-type { font-weight: 700; } + .dim-row.dominant .dim-pct, + .dim-row.dominant > span:last-of-type, + .dim-row:has(> .dim-bar-track > .dim-bar-fill.dominant) > span:last-of-type { color: #cf222e; font-weight: 700; } .dim-bar-fill { height: 100%; background: #0969da; border-radius: 2px; } .dim-bar-fill.dominant { background: #cf222e; } .dim-bar-fill.split { background: #9a6700; } diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md index 6568ad80..3c430bcf 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md @@ -135,6 +135,60 @@ Select-String -Path <output.html> -Pattern 'data-spark|data-trend' | Measure-Obj Should return **at least ~30** matches (8 KPI tiles + ~10 60d-trend rows + ~12 WoW-table rows). If the count is zero or near-zero, the report is missing all charts — go back and add them. +## Attribution-card layout — the two v8 traps + +The CSS in `report-template.html` now guards both, and `validate-report.ps1` § 9 +hard-fails when the rules are missing. Two failure modes to know about: + +### 1. Cards touching (no spacing between consecutive `.attr-card`s) + +The template originally relied on an outer `<div class="attr-grid">` wrapper to +provide `gap: 16px` between cards. A head+body+footer rebuild that emits +`.attr-card` elements directly under `<h2>` produces visually touching cards. + +**Fix in template CSS:** `.attr-card { margin-bottom: 16px }` + `.attr-card + +.attr-card { margin-top: 16px }`. If you ever rewrite the head, make sure both +rules survive. + +### 2. 
Text bleeding out of `.dim` boxes (long calling-app / version names) + +Two flexbox traps stack here: + +- **`text-overflow: ellipsis` is silently ignored on `display: inline` elements.** + A `<span>` defaults to inline. The name span must be `display: block` (or + `inline-block`) for ellipsis to render. +- **Flex children don't shrink below their content size by default.** Both the + flex child AND every flex ancestor need `min-width: 0` explicitly. + +**Two valid `.dim-row` markup variants — pick one per card:** + +```html +<!-- Variant A: classed spans (original template, recommended) --> +<div class="dim-row"> + <div class="dim-bar-track"><div class="dim-bar-fill dominant" style="width:99.0%"></div></div> + <span class="dim-name">AcquireTokenSilent</span> + <span class="dim-pct">99.0%</span> +</div> + +<!-- Variant B: unclassed spans (terser; CSS covers both forms via :first-of-type / :last-of-type) --> +<div class="dim-row"> + <div class="dim-bar-track"><div class="dim-bar-fill" style="width:36.6%"></div></div> + <span>com.microsoft.windowsintune.companyportal</span> + <span>36.6%</span> +</div> + +<!-- Placeholder rows ("Not sliced — …") — one span only, still truncate --> +<div class="dim-row"> + <span style="color:#656d76;font-size:11.5px;">Not sliced — OEM not suspected.</span> +</div> +``` + +The CSS rules `text-overflow: ellipsis` + `display: block` + `flex: 1 1 0` + +`min-width: 0` + `max-width: 100%` are baked into the template name-column +selector for both classed and unclassed variants. Do not bypass them by setting +inline `white-space: normal` or removing `min-width: 0` from `.dim` / +`.attr-dims` — that's how the bug regresses. 
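
Both guards can also be asserted mechanically before publish. The following is a minimal Node sketch, not part of the skill: `checkLayoutGuards` is a hypothetical helper name, and the patterns are a port of the regexes that `validate-report.ps1` § 9 applies to the report content. The `i` flag is added on the assumption that the PowerShell `-match` operator is case-insensitive by default.

```javascript
// Hypothetical helper (illustration only): mirrors the CSS-presence checks
// from validate-report.ps1 section 9 against the full report HTML string.
function checkLayoutGuards(html) {
  // 9a: cards-touching guard, either explicit margin rule is accepted.
  const hasCardMargin =
    /\.attr-card\s*\{[^}]*margin-bottom\s*:\s*16px/i.test(html) ||
    /\.attr-card\s*\+\s*\.attr-card\s*\{[^}]*margin-top/i.test(html);

  // 9b: name-overflow guard, ellipsis on the classed or unclassed name span.
  const hasEllipsis =
    /\.dim-row\s*>\s*span:first-of-type[^}]*text-overflow\s*:\s*ellipsis/i.test(html) ||
    /\.dim-row\s+\.dim-name[^}]*text-overflow\s*:\s*ellipsis/i.test(html);

  // 9b: min-width:0 on .dim or .dim-row so flex children are allowed to shrink.
  const hasMinWidth =
    /\.dim\s*\{[^}]*min-width\s*:\s*0/i.test(html) ||
    /\.dim-row\s*\{[^}]*min-width\s*:\s*0/i.test(html);

  return { hasCardMargin, hasEllipsis, hasMinWidth };
}
```

Like the validator, this only proves the rules are present in the `<style>` block; it cannot tell whether an inline style on a particular row overrides them.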
+ ## Sparkline color palette Used by both `.spark` (KPI tiles) and `.trend` (table cells): diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html index 04579cd9..f879c612 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html @@ -40,7 +40,32 @@ <strong>Verdict:</strong> {{VERDICT_PARAGRAPH}} </div> - <!-- 7 mandatory dim blocks — fill from agg.js output for each dim query --> + <!-- 7 mandatory dim blocks — fill from agg.js output for each dim query. + Canonical .dim-row markup (use EXACTLY this shape so CSS targets it). + Two variants are supported; pick one and stick with it within a card: + + Variant A — classed spans (matches the original template, recommended): + <div class="dim-row"> + <div class="dim-bar-track"><div class="dim-bar-fill dominant" style="width:99.0%"></div></div> + <span class="dim-name">AcquireTokenSilent</span> + <span class="dim-pct">99.0%</span> + </div> + + Variant B — unclassed spans (terser; CSS selectors cover both forms): + <div class="dim-row"> + <div class="dim-bar-track"><div class="dim-bar-fill" style="width:36.6%"></div></div> + <span>com.microsoft.windowsintune.companyportal</span> + <span>36.6%</span> + </div> + + "Not sliced / N/A" placeholder rows — one span only, will truncate too: + <div class="dim-row"> + <span style="color:#656d76;font-size:11.5px;">Not sliced — OEM not suspected.</span> + </div> + + Long names (calling apps, broker versions with annotations) MUST truncate + to an ellipsis — never wrap, never bleed out of the .dim card. The CSS + in assets/report-template.html handles this for both variants. 
--> <div class="attr-dims"> <div class="dim"><div class="dim-label">Span</div> {{SPAN_DIM_ROWS}} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 index 866e8390..19abe1d2 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 +++ b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 @@ -10,9 +10,20 @@ 4. Section 2 callouts are siblings, NOT nested. Tracks <div> open/close depth from #attention to #trend60d; the depth must return to 0 between callouts. 5. (Informational) Reports HTML size and number of <div class="callout"> openings. + 6. KPI tiles have data-spark coverage (>= half) + overall chart coverage (>=15). + 7. Traffic-attribution sub-block color diversity (tri-state convention). + 8. Code-attribution depth — each .attr-card has the full 8-field Originator block. + 9. Attribution-card layout sanity (v8 regression): + 9a. .attr-card cards-touching guard — CSS must define explicit margin + on .attr-card so successive cards don't visually run together when + the body emits them without an .attr-grid wrapper. + 9b. .dim-row name-overflow guard — CSS must define text-overflow:ellipsis + on .dim-name / .dim-row > span:first-of-type AND min-width:0 on + .dim / .dim-row so long calling-app / version names truncate inside + their dim card rather than bleeding out. Exits with non-zero status if any HARD check fails (stale tokens, devs/reqs leak, - U+FFFD, or unbalanced div depth in the attention block). + U+FFFD, unbalanced div depth, missing layout-guard CSS). .PARAMETER Path Absolute path to the report file. Defaults to the current week's report under @@ -209,6 +220,47 @@ if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { } } +# ---- 9. Attribution-card layout sanity (v8 regression — cards touching + dim-row bleed) ---- +# Two layout bugs hit the v8 rebuild and forced manual CSS patches mid-publish. 
+# Both have CSS fixes baked into report-template.html now, but the validator +# catches the markup-side preconditions so a future hand-rolled body that +# diverges from the template is flagged before publish. +# +# 9a. Cards-touching guard: if the report has .attr-card outside any .attr-grid +# wrapper AND the CSS in <style> is missing the explicit margin rule, warn. +# (Belt + suspenders — the canonical CSS now ships the margin, but a stale +# copy/paste of an older head could regress.) +$hasAttrCard = ([regex]::Matches($content, '<div class="attr-card')).Count -gt 0 +if ($hasAttrCard) { + $cssHasCardMargin = $content -match '\.attr-card\s*\{[^}]*margin-bottom\s*:\s*16px' ` + -or $content -match '\.attr-card\s*\+\s*\.attr-card\s*\{[^}]*margin-top' + if (-not $cssHasCardMargin) { + Add-Fail "Report has .attr-card elements but the CSS is missing the cards-touching guard (.attr-card { margin-bottom:16px } and/or .attr-card + .attr-card { margin-top:16px }). The v8 head rebuild dropped this — re-extract <head> from the current assets/report-template.html." + } else { + Pass "Attribution cards have spacing CSS" + } +} + +# 9b. Dim-row overflow guard: every .dim-row that wraps a name + percent must +# have the CSS rules that make text-overflow:ellipsis engage. The trap: +# text-overflow:ellipsis is silently ignored on inline <span> elements; +# the spans must be display:block (or inline-block) AND flex children +# with min-width:0. We can't measure actual rendering, but we CAN assert +# the CSS rules exist verbatim. 
+if ($hasAttrCard) { + $cssHasEllipsis = $content -match '\.dim-row\s*>\s*span:first-of-type[^}]*text-overflow\s*:\s*ellipsis' ` + -or $content -match '\.dim-row\s+\.dim-name[^}]*text-overflow\s*:\s*ellipsis' + $cssHasMinWidth = $content -match '\.dim\s*\{[^}]*min-width\s*:\s*0' ` + -or $content -match '\.dim-row\s*\{[^}]*min-width\s*:\s*0' + if (-not $cssHasEllipsis) { + Add-Fail "CSS is missing the .dim-row name-overflow guard (text-overflow:ellipsis on .dim-name and/or .dim-row > span:first-of-type). Long calling-app / version names will bleed out of the dim cards. Re-extract <head> from the current assets/report-template.html." + } elseif (-not $cssHasMinWidth) { + Add-Warn "CSS has text-overflow rules but is missing min-width:0 on .dim / .dim-row. Without it, flex children won't shrink below content size and ellipsis won't trigger inside narrow dim cards." + } else { + Pass "Dim-row name-overflow guard CSS present (ellipsis + min-width:0)" + } +} + Write-Host "" if ($failures.Count -eq 0) { Write-Host "All hard checks passed." 
-ForegroundColor Green From a216b2caa71af528e431644fd79781a0fea9cbe7 Mon Sep 17 00:00:00 2001 From: Shahzaib <shahzaib.jameel@microsoft.com> Date: Tue, 9 Jun 2026 22:38:25 -0700 Subject: [PATCH 5/6] More updates --- .../oncall-weekly-telemetry-report/SKILL.md | 93 ++++++--- .../assets/bucket-trends.js | 82 +++++++- .../queries/wow-table-sparkline-series.kql | 34 ++++ .../assets/report-template.html | 10 +- .../assets/scripts/bootstrap-report.ps1 | 171 +++++++++++++++++ .../assets/scripts/run-kql.ps1 | 103 ++++++++++ .../assets/scripts/visual-smoke.ps1 | 177 ++++++++++++++++++ .../assets/summarize-attribution.js | 38 +++- .../assets/template-readme.md | 32 ++++ .../assets/validate-report.ps1 | 46 ++++- 10 files changed, 735 insertions(+), 51 deletions(-) create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 create mode 100644 .github/skills/oncall-weekly-telemetry-report/assets/scripts/visual-smoke.ps1 diff --git a/.github/skills/oncall-weekly-telemetry-report/SKILL.md b/.github/skills/oncall-weekly-telemetry-report/SKILL.md index 32e4dea5..33cda751 100644 --- a/.github/skills/oncall-weekly-telemetry-report/SKILL.md +++ b/.github/skills/oncall-weekly-telemetry-report/SKILL.md @@ -21,11 +21,14 @@ Reusable helpers in [`assets/`](assets/): | [`code-attribution-template.md`](assets/code-attribution-template.md) | Per-card checklist for the deep code-attribution block (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) | | [`queries/`](assets/queries/) | Canonical KQL templates, one file per query — see [`queries/README.md`](assets/queries/README.md). 
Highlights: [`attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql) (NEW — all 7 dims in one round-trip), [`error-message-and-location.kql`](assets/queries/error-message-and-location.kql) (now accepts BOTH `<CODES_LIST>` and `<TYPES_LIST>` in one call) | | [`templates/`](assets/templates/) | Copy-paste HTML snippets (`spike-card.html`, `traffic-attr-card.html`, `sparkline-footer.html`) | -| [`bucket-trends.js`](assets/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs`. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop the partial in-progress bucket. | +| [`bucket-trends.js`](assets/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs`. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop the partial in-progress bucket. **`--summary` suppresses the verbose header; `--json=<path>` emits a structured sidecar for programmatic consumption.** | | [`agg.js`](assets/agg.js) | Per-error per-dim top-N rollup with WoW deltas. Workhorse for filling spike-attribution dim blocks. | -| [`summarize-attribution.js`](assets/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards. Supports BOTH `--union <file.json>` (preferred for 2-week WoW; pairs with `attr-union-by-dim.kql`) AND legacy `--label=<dim> file.json` per-dim mode. | -| [`find-suspect-prs.ps1`](assets/find-suspect-prs.ps1) | Parallel `git log -S` + `--grep` across broker/ + common/ for a class/method symbol, with PR numbers + URLs. Run after the Originator pre-check identifies the throw-site class. | -| [`validate-report.ps1`](assets/validate-report.ps1) | Pre-publish validator. Catches stale tokens, devs/reqs leaks, mojibake (U+FFFD), and unbalanced `<div>` depth in Section 2 (the nested-callout bug). Run as part of Step 7. 
| +| [`summarize-attribution.js`](assets/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards. Supports BOTH `--union <file.json>` (preferred for 2-week WoW; pairs with `attr-union-by-dim.kql`) AND legacy `--label=<dim> file.json` per-dim mode. **Auto-detects the array-form schema produced by `assets/scripts/run-kql.ps1` — no schema-transformer step needed.** | +| [`find-suspect-prs.ps1`](assets/find-suspect-prs.ps1) | Parallel `git log -S` + `--grep` across broker/ + common/ for a class/method symbol, with PR numbers + URLs. Run *only after* the Originator pre-check has identified a specific throw-site class — the unscoped 4-week PR window is small enough (<30 PRs) to scan with plain `git log` first. | +| [`validate-report.ps1`](assets/validate-report.ps1) | Pre-publish validator. Catches stale tokens, devs/reqs leaks, mojibake (U+FFFD), unbalanced `<div>` depth in Section 2 (the nested-callout bug), KPI/trend sparkline coverage, code-attribution depth, layout-guard CSS presence, and suspicious low-peak fabricated `data-trend` arrays. Run as part of Step 7. | +| [`scripts/run-kql.ps1`](assets/scripts/run-kql.ps1) | **Direct-REST Kusto helper — drop-in fallback for the Azure Kusto MCP server when the MCP times out** (frequent on per-error-code queries). Acquires a token via `az`, POSTs to `/v2/rest/query`, writes a JSON file the JS helpers can consume directly. | +| [`scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1) | Bootstrap a new week's report from the canonical template. Auto-computes the reporting Sunday, creates `_data/<sunday>/`, prunes `_data` folders older than 60 days, and detects "unfilled template stub" vs "real prior report" collisions using a multi-marker fingerprint (title + meta date + first KPI value + size ratio). | +| [`scripts/visual-smoke.ps1`](assets/scripts/visual-smoke.ps1) | Optional Playwright-based layout smoke test. 
Renders the report at 1400 px viewport, captures a full-page screenshot under `~/android-oce-reports/_visual/`, and runs DOM-based overflow + adjacent-card-gap detection. Catches the rendered-layout bugs (text bleed, cards touching) that pure HTML/CSS validation can't see. | --- @@ -80,32 +83,24 @@ If any of these are unstated, ask once, then proceed. ### Step 1 — Bootstrap the new report file from the template -This skill ships with a canonical template at [`assets/report-template.html`](assets/report-template.html) (a real prior week's report kept as the reference layout). **Always start from this template** — never assume a prior week's report exists on the file system. +This skill ships with a canonical template at [`assets/report-template.html`](assets/report-template.html) (a real prior week's report kept as the reference layout). **Use [`assets/scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1)** to handle all the boilerplate (Sunday-date computation, `_data/<sunday>/` directory, retention-pruning, collision detection): ```pwsh -# Reports live OUTSIDE the workspace, in the user's home folder, so they never -# accidentally get committed and don't pollute the repo root. -$reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' -New-Item -ItemType Directory -Force $reportDir | Out-Null - -# Filename uses the Sunday startofweek bucket of the reporting week (matches the -# Kusto bucket label used throughout the report). For "week of May 3 -> May 9, 2026" -# this evaluates to 2026-05-03. -$reportingSunday = '2026-05-03' # <-- replace with the confirmed reporting-week Sunday -$next = Join-Path $reportDir "oncall-wow-report-$reportingSunday.html" - -if (Test-Path $next) { - # Filename collision rule (per Hard Rules): do NOT silently overwrite. Open - # the existing report, identify its top-3 findings, and explicitly state in - # chat what changed in the new data before regenerating. - Write-Warning "$next already exists. 
Read it first, list its top-3 findings, and confirm a delta exists before regenerating." -} - -Copy-Item c:\Users\shjameel\Repos\android-complete\.github\skills\oncall-weekly-telemetry-report\assets\report-template.html $next -Force -Write-Host "Bootstrapped $next from skill template." +.\.github\skills\oncall-weekly-telemetry-report\assets\scripts\bootstrap-report.ps1 +# Optional: explicit reporting Sunday + force overwrite +# .\bootstrap-report.ps1 -ReportingSunday 2026-05-31 -Force ``` -Edit `$next` in place — the template ships as a real prior-week report (not a tokenized skeleton). **Walk top-to-bottom and replace every prior-week date / KPI value / table row / verdict / PR citation with current-week data.** The CSS, sparkline JS, section ordering, and attribution-card markup are canonical — do not redesign them. See [`assets/template-readme.md`](assets/template-readme.md) for the full guide on what to change vs leave alone, the sparkline color palette, and the CSS class reference. +What it does: +* Computes the reporting-Sunday from the system clock (most recent complete Sun-Sat week). +* Creates `~/android-oce-reports/oncall-wow-report-<sunday>.html` from the canonical template. +* Creates `~/android-oce-reports/_data/<sunday>/` for raw KQL JSON payloads. +* Prunes `_data/<old-sunday>/` folders older than 60 days so the cache doesn't accumulate. +* **Collision detection (the v8-hardened version):** uses a multi-marker fingerprint (title + meta-line dates + first-KPI value + size ratio) to distinguish an "unfilled template stub" (silently re-bootstrap) from a "real populated report" (HARD HALT, exit 2, require `-Force` to overwrite). The earlier single-marker (title only) version misclassified populated reports as stubs and overwrote real work. + +Edit the bootstrapped file in place — the template ships as a real prior-week report (not a tokenized skeleton). 
**Walk top-to-bottom and replace every prior-week date / KPI value / table row / verdict / PR citation with current-week data.** The CSS, sparkline JS, section ordering, and attribution-card markup are canonical — do not redesign them. See [`assets/template-readme.md`](assets/template-readme.md) for the full guide on what to change vs leave alone, the sparkline color palette, the CSS class reference, and the two v8 layout traps. + +> **⚠️ UTF-8 trap — DO NOT use PowerShell `@'...'@` heredocs to compose HTML content containing emojis, em-dashes, arrows, or middle dots.** PowerShell silently strips multi-byte UTF-8 characters when piping heredocs to `Set-Content` / `Out-File`. Use Node.js (`fs.writeFileSync`), `[IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))`, or explicit Unicode-pair literals (`[char]0xD83D + [char]0xDCCA` for 📊) instead. This trap cost ~30 min in v8 and required a full emoji-restoration pass — every callout icon, every section header emoji, every arrow link had to be re-injected. The validator's `U+FFFD` check catches the worst case (mojibake replacement char) but cannot detect characters that were silently stripped to nothing. Mark any unfinished card or table cell with the literal sentinel `EXAMPLE CONTENT BELOW` inside an HTML comment — the final-pass validator (Step 7) greps for it. @@ -119,6 +114,20 @@ Use the Kusto MCP tool against: **Always prefer the canonical `materialized_view('XxxMetrics' or 'XxxUpdated')` variants** — these are what the production dashboard uses, are pre-aggregated and HLL-bucketed, and avoid the 240 s MCP timeout that plain `android_spans` queries hit. Full schema, gotchas, and query templates: [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). +> **Fallback when the Kusto MCP times out:** use [`assets/scripts/run-kql.ps1`](assets/scripts/run-kql.ps1). 
It acquires a token via `az account get-access-token`, POSTs directly to `/v2/rest/query`, and writes the result as a JSON file the JS helpers (`bucket-trends.js`, `summarize-attribution.js`) can consume directly. The skill's MCP-vs-REST switch is roughly: try the MCP once; if it returns `McpError -32001 (timeout)`, switch to the REST helper for the rest of the run. Run multiple queries in parallel via PowerShell `Start-Job`: +> +> ```pwsh +> $queries = @{ 'reliability.json' = $reliabilityKql; '60d-codes.json' = $codesKql; ... } +> $jobs = @() +> foreach ($f in $queries.Keys) { +> $q = $queries[$f] +> $jobs += Start-Job -ScriptBlock { +> param($Q, $O) & "$using:skillRoot\assets\scripts\run-kql.ps1" -Query $Q -Out $O +> } -ArgumentList $q, $f +> } +> $jobs | Wait-Job | Receive-Job; $jobs | Remove-Job +> ``` + | Need | View | |------|------| | Per-error-code / per-error-type / per-span counts | `materialized_view('ErrorStatsMetrics')` | @@ -249,21 +258,28 @@ For every regression card, the Code Attribution block **must** populate the foll #### PR-grep workflow -**Read the full PR window first, then reason — don't `--grep` blind.** The 4-week window across `broker/` and `common/` typically returns <30 PRs total, small enough to read end-to-end. Targeted `--grep` matches will miss PRs whose titles don't mention the error string (most of them). +**Read the full PR window first, then reason — don't `--grep` blind.** The 4-week window across `broker/` and `common/` typically returns <30 PRs total, small enough to read end-to-end. Targeted `--grep` matches will miss PRs whose titles don't mention the error string (most of them). **The recommended order is:** + +1. **Run plain `git log` on both repos** for the 4-week window. Read the resulting list end-to-end before any greps. +2. **Cross-reference titles + dates** against the Originator pre-check throw-site class. +3. **Only when you have a specific symbol** to chase (e.g. 
the throw-site class identified in step 2), reach for `find-suspect-prs.ps1` to do the symbol-targeted parallel pickaxe + grep. + +The historical mistake (pre-v8) was to jump straight to `find-suspect-prs.ps1` without reading the window first, which silently dropped PRs whose titles didn't mention the symbol. ```pwsh +# Step 1: read the full 4-week window cd c:\Users\shjameel\Repos\android-complete\broker -git log --since='<windowStart>' --until='<windowEnd>' --pretty=format:'%h | %ai | %an | %s' +git --no-pager log --since='<windowStart>' --until='<windowEnd>' --pretty=format:'%h | %ai | %an | %s' --no-merges cd ..\common -git log --since='<windowStart>' --until='<windowEnd>' --pretty=format:'%h | %ai | %an | %s' +git --no-pager log --since='<windowStart>' --until='<windowEnd>' --pretty=format:'%h | %ai | %an | %s' --no-merges ``` For each candidate PR, **read the diff** to confirm it touches the throw site / wrapper class identified in the Originator pre-check. Don't cite a PR just because the title mentions a related concept. -For focused follow-up by class/method name, use the helper: - ```pwsh +# Step 3 (optional): symbol-targeted focused follow-up. Use ONLY after step 1 gave +# you a specific class/method name to chase from the Originator pre-check. # Searches both repos in parallel via `git log -S` (pickaxe on diff) AND `--grep` (subject). # Returns a unified table: repo | date | author | sha | PR# | URL | subject. .\.github\skills\oncall-weekly-telemetry-report\assets\find-suspect-prs.ps1 ` @@ -293,6 +309,8 @@ Slice on **all 7 dimensions** for each spike. **Preferred for 2-week WoW attribu For `error_type` cards, swap `error_code in (codes)` for `unified_error_type in (types)` and aggregate by the `MergeUiRequiredExceptions(error_type)` extension — otherwise everything else is identical. 
+> **Low-volume fallback (extends Step 4's pre-check fallback to the 7-dim union):** when a code/type returns sparse dim rows in the 7-day reporting window — typical for sub-1k-device entries like `TimeoutCancellationException`, `JsonSyntaxException`, `kdfv2_key_derivation_error` — widen the union query to **14 days** (`<START>` = baseline-week Sunday − 7d) before declaring "broad — needs targeted slice". The added week of context usually surfaces enough rows to compute concentration percentages. If a code STILL has no concentration after 14 days, mark every dim cell as "not sliced — sub-week volume; file the bug first, slice on persistence" — do NOT fabricate "Broad" verdicts. + | # | Dimension | Source | Cross-check | |---|-----------|--------|-------------| | 1 | Broker version | `ErrorStatsMetrics` group by `broker_version` | Cross-reference `BrokerAdoptionStatsUpdated` to see if the version's request share *also* moved that week — if yes, the spike is rollout-driven, not code-driven | @@ -445,8 +463,21 @@ The validator hard-fails on: 5. A second callout opening before the previous one closes (nested-callout sanity check). 6. **Chartless KPI grid** — if more than half the `.kpi` tiles lack a `data-spark` element (catches the v7 regression where the body was rebuilt without sparklines). Also warns when total chart count (sparks + trends + inline svgs) is < 15. 7. **Code-attribution depth** — each `.attr-card`'s "Code attribution" block must contain an `Originator` row (proxy for the full 8-field structure: Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). Catches the v7-third-pass regression where cards shipped with a `pr-list`-only stub. +8. **Attribution-card layout guards (v8)** — the CSS must define `.attr-card { margin-bottom: 16px }` AND `.dim-row` overflow rules (`text-overflow: ellipsis` + `min-width: 0`). 
Catches the "cards touching" and "text bleeding out of dim boxes" regressions from a stale `<head>` block. +9. **Fabricated-sparkline heuristic (v8)** — warns when a `data-trend` array's peak value is < 100 (almost certainly hand-rolled rather than sourced from real data). See [`assets/queries/wow-table-sparkline-series.kql`](assets/queries/wow-table-sparkline-series.kql) for the canonical KQL that pulls real 8-week series for every code in the WoW tables. Then: +- **Run the visual smoke test (recommended)** — catches rendered-layout bugs that pure HTML/CSS validation can't see: + + ```pwsh + .\.github\skills\oncall-weekly-telemetry-report\assets\scripts\visual-smoke.ps1 + # Opens the report at 1400px in headless Chromium via Playwright, captures a + # full-page screenshot to ~/android-oce-reports/_visual/, and runs DOM-based + # checks for: + # - element overflow inside .dim / .attr-card (catches "text bleeding out") + # - adjacent .attr-card pairs with gap < 8px (catches "cards touching") + # First run auto-installs Playwright + Chromium into %LOCALAPPDATA%\oce-skill-playwright + ``` - Run `get_errors` on the HTML file (no errors expected — pure HTML/CSS). - Verify no stale phrases from prior weeks remain (`Select-String` for retracted hypotheses, prior week's PR numbers). - Verify every PR link in the new file is reachable (the file paths just before the link should match what `git log` returned). 
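
The fabricated-sparkline heuristic from check 9 can be approximated in a few lines of Node. This is a sketch under stated assumptions rather than the validator's actual implementation: `findLowPeakTrends` is a hypothetical name, and it assumes each series is stored as a comma-separated list of numbers in a double-quoted `data-trend="..."` attribute. Only the peak threshold of 100 comes from the check description above.

```javascript
// Hypothetical helper (illustration only): flags data-trend series whose peak
// is below the threshold, which suggests hand-rolled rather than real data.
// ASSUMPTION: series are encoded as data-trend="n,n,n" (comma-separated).
function findLowPeakTrends(html, minPeak = 100) {
  const suspects = [];
  const re = /data-trend="([^"]*)"/g;
  let match;
  while ((match = re.exec(html)) !== null) {
    const values = match[1]
      .split(',')
      .map(s => s.trim())
      .filter(s => s !== '')          // ignore empty slots such as a trailing comma
      .map(Number)
      .filter(Number.isFinite);       // drop anything that is not a number
    if (values.length === 0) continue;
    const peak = Math.max(...values);
    if (peak < minPeak) suspects.push({ raw: match[1], peak });
  }
  return suspects;
}
```

A non-empty result is a prompt to re-pull the series with `wow-table-sparkline-series.kql`, not proof of fabrication: a genuinely low-volume code can legitimately peak below 100.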
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js b/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js index 5e4091f2..892f0784 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js +++ b/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js @@ -36,6 +36,24 @@ * spike: peak >= 3 x mean(other weeks) and peak > 1.5 x max(first,last) * improvement: delta < -15% * flat: otherwise + * + * Output flags (NEW v8): + * --summary Suppress the verbose header (week list, partial-bucket + * detection). Print only the bucket counts + the per-bucket + * rows. Recommended for the standard skill workflow. + * --json=<path> Also write a structured JSON sidecar with the bucketed + * result for programmatic consumption (e.g. by a future + * sparkline-data-generator script). The sidecar shape is: + * { + * "metric": "devs" | "reqs", + * "weeks": [iso, iso, ...], + * "buckets": { + * "regression": [ { code, first, last, peak, delta, series: [N,N,...] }, ... ], + * "spike": [...], + * "improvement": [...], + * "flat": [...] + * } + * } */ const fs = require('fs'); @@ -44,6 +62,8 @@ const file = args.find(a => !a.startsWith('--')); const startArg = (args.find(a => a.startsWith('--start=')) || '').split('=')[1]; const endArg = (args.find(a => a.startsWith('--end=')) || '').split('=')[1]; const metric = ((args.find(a => a.startsWith('--metric=')) || '').split('=')[1] || 'devs').toLowerCase(); +const summary = args.includes('--summary'); +const jsonArg = (args.find(a => a.startsWith('--json=')) || '').split('=')[1]; if (!['devs', 'reqs'].includes(metric)) { console.error(`--metric must be 'devs' or 'reqs', got '${metric}'`); process.exit(1); @@ -51,20 +71,42 @@ if (!['devs', 'reqs'].includes(metric)) { const defaultFloor = metric === 'reqs' ? 
100000 : 10000; const peakFloor = +((args.find(a => a.startsWith('--peak-floor=')) || '').split('=')[1] || defaultFloor); const metricIdx = metric === 'reqs' ? 0 : 1; // [errs, devs] tuple +const keyCol = ((args.find(a => a.startsWith('--key=')) || '').split('=')[1] || 'error_code'); if (!file) { - console.error('Usage: node bucket-trends.js <mcp-output.json> [--start=YYYY-MM-DD] [--end=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs]'); + console.error('Usage: node bucket-trends.js <mcp-output.json> [--start=YYYY-MM-DD] [--end=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs] [--key=error_code|unified_error_type] [--summary] [--json=path]'); process.exit(1); } const d = JSON.parse(fs.readFileSync(file, 'utf8')); -const items = d.results.items.slice(1); // first row is the schema +// Schema row can be either an object {col: type} (MCP) or a string array [col, col, ...] +// (from assets/scripts/run-kql.ps1). Detect and locate the key column index so we +// don't assume positional order. +const schemaRow = d.results.items[0]; +let colNames; +if (Array.isArray(schemaRow)) { + colNames = schemaRow.map(String); +} else if (schemaRow && typeof schemaRow === 'object') { + colNames = Object.keys(schemaRow); +} else { + throw new Error('First row of results.items must be the schema row'); +} +const iWeek = colNames.indexOf('week') >= 0 ? colNames.indexOf('week') : colNames.indexOf('wk'); +const iCode = colNames.indexOf(keyCol); +const iErrs = colNames.indexOf('errs'); +const iDevs = colNames.indexOf('devs'); +if (iWeek < 0 || iCode < 0 || iErrs < 0 || iDevs < 0) { + throw new Error(`Schema must include week|wk, ${keyCol}, errs, devs. 
Got [${colNames.join(', ')}]`); +} + +const items = d.results.items.slice(1); const series = {}; -for (const [w, code, errs, devs] of items) { +for (const r of items) { + const w = r[iWeek], code = r[iCode], errs = r[iErrs], devs = r[iDevs]; if (!series[code]) series[code] = {}; series[code][w] = [errs, devs]; } -const weeks = [...new Set(items.map(r => r[0]))].sort(); +const weeks = [...new Set(items.map(r => r[iWeek]))].sort(); const startISO = startArg ? `${startArg}T00:00:00Z` : weeks[1]; // drop partial start week by default const endISO = endArg ? `${endArg}T00:00:00Z` : null; // exclusive cutoff @@ -96,9 +138,11 @@ if (!endArg && weeks.length >= 4) { } const keep = weeks.filter(w => w >= startISO && (endISO ? w < endISO : true) && w !== droppedPartial); -console.log('All weeks: ', weeks); -console.log('Trend weeks: ', keep, `(${keep.length} complete)`); -console.log('Metric: ', metric, `(peak floor=${peakFloor.toLocaleString()})`); +if (!summary) { + console.log('All weeks: ', weeks); + console.log('Trend weeks: ', keep, `(${keep.length} complete)`); + console.log('Metric: ', metric, `(peak floor=${peakFloor.toLocaleString()})`); +} if (keep.length < 4) { console.warn(`[bucket-trends] WARN: only ${keep.length} kept weeks — trend buckets will be unstable. 
Need >= 4 for meaningful regression/improvement classification.`); } @@ -122,6 +166,11 @@ for (const [code, wd] of Object.entries(series)) { buckets[cat].push({ code, first, last, peak, delta: +(delta * 100).toFixed(1), series: vals }); } +// Compact bucket-count line (always emitted, summary or verbose) +const countLine = ['regression','spike','improvement','flat'] + .map(k => `${k}=${buckets[k].length}`).join(' '); +console.log(`\nBucket counts (metric=${metric}, key=${keyCol}, peak-floor=${peakFloor.toLocaleString()}): ${countLine}`); + for (const k of ['regression', 'improvement', 'spike', 'flat']) { console.log(`\n=== ${k.toUpperCase()} (${buckets[k].length}) ===`); buckets[k] @@ -132,3 +181,22 @@ for (const k of ['regression', 'improvement', 'spike', 'flat']) { ); }); } + +// Optional structured JSON sidecar +if (jsonArg) { + const sidecar = { + metric, + key: keyCol, + peakFloor, + weeks: keep, + droppedPartial, + buckets: Object.fromEntries( + Object.entries(buckets).map(([k, arr]) => [ + k, + arr.sort((a, b) => b.peak - a.peak) + ]) + ) + }; + fs.writeFileSync(jsonArg, JSON.stringify(sidecar, null, 2)); + console.log(`\nWrote JSON sidecar -> ${jsonArg}`); +} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql new file mode 100644 index 00000000..7ef5ce46 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql @@ -0,0 +1,34 @@ +// 8-week per-error sparkline series for the WoW tables (data-trend arrays). +// +// MANDATORY (per SKILL.md Output checklist, v8): the `data-trend` arrays in +// the Section 6 (error_code) and Section 7 (error_type) WoW tables must come +// from real data — not be fabricated from a "roughly increasing" pattern. 
+// Past failure mode: small-volume codes (Broker request cancelled, +// kdfv2_key_derivation_error, TimeoutCancellationException) were filtered out +// by the 60d bucketer's peak-floor, then their sparklines were invented inline +// in the WoW table HTML. That's data dishonesty even when the array looks plausible. +// +// This query returns 8 weekly buckets for every code/type that appears in +// either the WoW movers list OR the 60d trend output. Run it twice — once with +// the codes filter, once with the types filter — and feed the result into the +// WoW-table generator so every row has a real-data trend. +// +// Inputs: +// <START> Sunday of week-0 (e.g. 2026-04-12 for an 8-week window ending 2026-06-06) +// <END> Sunday after the reporting week, EXCLUSIVE (e.g. 2026-06-07) +// <CODES> Dynamic list of error_code values whose sparklines we need. +// Build this from the union of: +// * wow-movers-codes.json results +// * 60d-codes regression/spike/improvement bucket members +// For the type variant, swap to `unified_error_type in (<TYPES>)` +// and the MergeUiRequiredExceptions extension. + +let codes = dynamic([<CODES>]); +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time between (datetime(<START>) .. 
datetime(<END>)) +| where error_code in (codes) +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by week = startofweek(EventInfo_Time), error_code +| where week < datetime(<END>) +| order by error_code asc, week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html index 49953f30..45ef575f 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html +++ b/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html @@ -217,10 +217,14 @@ font-variant-numeric: tabular-nums; color: #656d76; font-size: 11px; min-width: 38px; text-align: right; white-space: nowrap; } - /* Single-span placeholder rows (e.g. "Not sliced — …") */ - .dim-row > span:only-child { + /* Single-span placeholder rows (e.g. "Not sliced — …") — allow wrap, not truncate. + Long inline <code> blocks should wrap inside the cell rather than expanding + it horizontally. Truncation/ellipsis is for the name+bar+pct rows only. */ + .dim-row > span:only-child, + .dim-row:not(:has(> .dim-bar-track)) > span { display: block; flex: 1 1 0; min-width: 0; max-width: 100%; - overflow: hidden; text-overflow: ellipsis; white-space: nowrap; + overflow: hidden; text-overflow: clip; white-space: normal; + overflow-wrap: anywhere; word-break: break-word; } /* Dominant row styling — works whether .dominant is on the row OR derived from the bar-fill class (the body-generated markup only marks the fill). */ diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 new file mode 100644 index 00000000..ccca2638 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 @@ -0,0 +1,171 @@ +<# +.SYNOPSIS + Bootstrap a new OCE weekly report file from the canonical template. 
+ +.DESCRIPTION + Implements SKILL.md Step 1 as a script so the workflow doesn't drift across + runs: + 1. Computes the reporting-week Sunday from the current date (most recent + complete Sun-Sat week unless -ReportingSunday is passed explicitly). + 2. Creates ~/android-oce-reports/_data/<sunday>/ for raw query payloads. + 3. Decides what to do if the target report file already exists: + - If the existing file is an UNFILLED template stub (header dates + still match the canonical template's reference week), silently + re-bootstrap from the template — there's nothing to preserve. + - If the existing file contains real per-week content (the dates + inside differ from the template's reference week), HALT and + require the caller to explicitly delete or rename the file first. + This is the "filename collision rule" from SKILL.md. + 4. Prunes _data/<sunday>/ folders older than -DataRetentionDays (default 60) + so the directory doesn't accumulate stale payloads indefinitely. + +.PARAMETER ReportingSunday + Sunday of the reporting week (yyyy-MM-dd). If omitted, defaults to the most + recent complete Sun-Sat week relative to the system clock. + +.PARAMETER Force + Skip the collision check and overwrite any existing file. + +.PARAMETER DataRetentionDays + How many days of _data/<sunday>/ folders to keep before pruning. Default 60. + +.PARAMETER SkillRoot + Path to the skill folder. Defaults to the location of this script's parent. + +.EXAMPLE + .\bootstrap-report.ps1 + # Default: latest complete week, halt on collision + +.EXAMPLE + .\bootstrap-report.ps1 -ReportingSunday 2026-05-31 -Force + +.OUTPUTS + Prints the absolute path of the newly created report file. 
+#> +[CmdletBinding()] +param( + [string]$ReportingSunday, + [switch]$Force, + [int]$DataRetentionDays = 60, + [string]$SkillRoot +) +$ErrorActionPreference = 'Stop' + +# Locate the skill folder + canonical template +if (-not $SkillRoot) { + $SkillRoot = Split-Path -Parent (Split-Path -Parent $PSCommandPath) +} +$template = Join-Path $SkillRoot 'report-template.html' +if (-not (Test-Path $template)) { + throw "Canonical template not found at $template. Pass -SkillRoot if running outside the skill folder." +} + +# Compute the reporting Sunday +if (-not $ReportingSunday) { + $today = [datetime]::Today + # Most recent Sunday strictly before today, OR today if today is Sunday + $offset = ($today.DayOfWeek.value__ + 7) % 7 # 0..6 days back to the previous Sunday + $sunday = $today.AddDays(-$offset) + # If today is Sunday but it's still early in the day, prefer the prior complete week + if ($today.DayOfWeek -eq [DayOfWeek]::Sunday -and (Get-Date).Hour -lt 6) { + $sunday = $sunday.AddDays(-7) + } + $ReportingSunday = $sunday.ToString('yyyy-MM-dd') +} +[void][datetime]::Parse($ReportingSunday) # validate format + +# Paths +$reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' +$dataDir = Join-Path $reportDir "_data\$ReportingSunday" +$out = Join-Path $reportDir "oncall-wow-report-$ReportingSunday.html" +New-Item -ItemType Directory -Force $reportDir | Out-Null +New-Item -ItemType Directory -Force $dataDir | Out-Null + +# Read the template's reference dates so we can detect "unfilled stub" collisions. +# A reliable signal of "this file is the template stub": MULTIPLE markers all +# still match the template. We check title, the meta-line dates, AND the first +# KPI value — any divergence means real content has been written. +$templateText = [IO.File]::ReadAllText($template) + +function Get-FingerprintMarkers([string]$text) { + $m = @{} + if ($text -match '<title>([^<]+?)') { $m['title'] = $Matches[1].Trim() } + if ($text -match '
\s*([^<]+)') { $m['metaDate'] = $Matches[1].Trim() } + if ($text -match 'Generated\s+([^<]+?)') { $m['generated'] = $Matches[1].Trim() } + # First KPI tile's value (e.g. "10.58 B"). Differs week-to-week. + if ($text -match '
\s*
[^<]+
\s*
([^<]+?)
') { $m['firstKpi'] = $Matches[1].Trim() } + return $m +} + +$templateMarkers = Get-FingerprintMarkers $templateText + +# Collision check +if ((Test-Path $out) -and -not $Force) { + $existingText = [IO.File]::ReadAllText($out) + $existingMarkers = Get-FingerprintMarkers $existingText + + # "Unfilled stub" requires ALL markers to match the template AND the file size + # to be within 5% of the template's. ANY divergence (a single value updated, + # a single KPI populated, sections added) means real content exists. + $allMatch = $true + foreach ($k in $templateMarkers.Keys) { + if ($existingMarkers[$k] -ne $templateMarkers[$k]) { $allMatch = $false; break } + } + $sizeRatio = (Get-Item $out).Length / [Math]::Max(1, (Get-Item $template).Length) + $sizeClose = ($sizeRatio -ge 0.95) -and ($sizeRatio -le 1.05) + + $isUnfilledStub = $allMatch -and $sizeClose + + if ($isUnfilledStub) { + Write-Warning "Existing $out is an unfilled template stub (all template fingerprints match, size within 5%). Re-bootstrapping silently." + } else { + $divergence = @() + foreach ($k in $templateMarkers.Keys) { + if ($existingMarkers[$k] -ne $templateMarkers[$k]) { + $divergence += " $k`: template='$($templateMarkers[$k])' existing='$($existingMarkers[$k])'" + } + } + if (-not $sizeClose) { + $divergence += " size: template=$((Get-Item $template).Length) bytes existing=$((Get-Item $out).Length) bytes ratio=$([Math]::Round($sizeRatio,2))x" + } + Write-Error -ErrorAction Continue @" +A populated report already exists for the same Sunday bucket: + $out + +Divergence from the template (which is why this is NOT classified as an unfilled stub): +$($divergence -join "`n") + +Per the SKILL.md filename-collision rule, do NOT silently overwrite. Either: + 1. Open the existing report, list its top-3 findings, and confirm what changed + in the new data before regenerating. Then re-run with -Force. + 2. Rename / delete the existing file and re-run. 
+"@ + exit 2 + } +} + +# Bootstrap +Copy-Item $template $out -Force +Write-Host "Bootstrapped $out from $template" +Write-Host "Data folder: $dataDir" + +# Prune old _data folders +$dataRoot = Join-Path $reportDir '_data' +if (Test-Path $dataRoot) { + $cutoff = (Get-Date).AddDays(-$DataRetentionDays) + $oldFolders = Get-ChildItem $dataRoot -Directory | Where-Object { + # Folder name should look like a date; skip the current run's folder + $_.FullName -ne $dataDir -and + $_.LastWriteTime -lt $cutoff + } + if ($oldFolders) { + Write-Host "Pruning $($oldFolders.Count) _data folder(s) older than $DataRetentionDays days:" + $oldFolders | ForEach-Object { + Write-Host " removing $($_.FullName) (last write $($_.LastWriteTime.ToString('yyyy-MM-dd')))" + Remove-Item -Recurse -Force $_.FullName + } + } +} + +# Print the path so callers can capture it +Write-Output $out diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 new file mode 100644 index 00000000..3686259f --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 @@ -0,0 +1,103 @@ +<# +.SYNOPSIS + Direct-REST Kusto query helper. Drop-in fallback for the Azure Kusto MCP server + when the MCP times out (the MCP has a 240 s budget and frequently exceeds it on + the per-error-code queries this skill needs). + +.DESCRIPTION + Acquires an Entra token via the local `az` CLI for the Kusto cluster, POSTs the + query to /v2/rest/query, and writes a JSON file whose schema matches what the + other helpers in this skill (bucket-trends.js, summarize-attribution.js) expect: + + { "results": { "items": [ + [colName0, colName1, ...], // first row = column-name list + [row0col0, row0col1, ...], + [row1col0, row1col1, ...], + ... + ] } } + + The `summarize-attribution.js --union` loader will auto-detect this array-form + schema (since the v8 update) — no transformer step needed. 
+ +.PARAMETER Query + KQL query text. Pass via single-quoted PowerShell here-string for safety. + +.PARAMETER Out + Output JSON file path. + +.PARAMETER Cluster + Kusto cluster URI (default: idsharedeus2 — the production Android Broker cluster). + +.PARAMETER Database + Database name (default: ad-accounts-android-otel). + +.PARAMETER TimeoutSec + HTTP timeout (default 300 s — Kusto itself has a 5-minute server-side query budget). + +.EXAMPLE + # Sanity check + .\run-kql.ps1 -Query 'print x=1' -Out test.json + +.EXAMPLE + # Pull the 60-day per-error-code trend + $q = @' +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time between (datetime(2026-04-12) .. datetime(2026-06-07)) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| where week < datetime(2026-06-07) +| order by error_code asc, week asc +'@ + .\run-kql.ps1 -Query $q -Out 60d-codes.json + +.NOTES + * Requires `az login` to have been run beforehand and the caller to have read + access to the cluster (Android Auth Client SDK security group). + * Can be invoked in parallel from PowerShell jobs — see SKILL.md Step 2 for the + "5-queries-in-parallel" pattern. + * If your query payload is large (>50 KB returned), the JSON file may itself + be large — pass it to bucket-trends.js / summarize-attribution.js directly + rather than viewing in-band. +#> +[CmdletBinding()] +param( + [Parameter(Mandatory=$true)][string]$Query, + [Parameter(Mandatory=$true)][string]$Out, + [string]$Cluster = 'https://idsharedeus2.kusto.windows.net', + [string]$Database = 'ad-accounts-android-otel', + [int]$TimeoutSec = 300 +) +$ErrorActionPreference = 'Stop' + +# Acquire token via az CLI (works for users + managed identity) +$tok = az account get-access-token --resource $Cluster --query accessToken -o tsv 2>$null +if (-not $tok) { + throw "Failed to acquire token for $Cluster. 
Run 'az login' first and verify membership in the Android Auth Client SDK security group." +} + +$body = @{ csl = $Query; db = $Database } | ConvertTo-Json -Compress +$resp = Invoke-RestMethod -Uri "$Cluster/v2/rest/query" -Method Post ` + -Headers @{ Authorization = "Bearer $tok"; 'Content-Type' = 'application/json' } ` + -Body $body -TimeoutSec $TimeoutSec + +# Find the PrimaryResult table (Kusto returns multiple frame types; we want the data) +$primary = $resp | Where-Object { $_.FrameType -eq 'DataTable' -and $_.TableKind -eq 'PrimaryResult' } | Select-Object -First 1 +if (-not $primary) { + # Surface any error frames so the caller can see what went wrong + $err = $resp | Where-Object { $_.FrameType -eq 'DataSetCompletion' -and $_.HasErrors } | Select-Object -First 1 + if ($err) { throw "Kusto query failed with errors. Full response:`n$($resp | ConvertTo-Json -Depth 6)" } + throw 'No PrimaryResult table in response' +} + +# Convert to the canonical schema the JS helpers expect +$colNames = @($primary.Columns | ForEach-Object { $_.ColumnName }) +$items = New-Object System.Collections.ArrayList +[void]$items.Add($colNames) +foreach ($r in $primary.Rows) { [void]$items.Add($r) } + +$obj = @{ results = @{ items = $items } } +# UTF-8 without BOM — keeps emoji/diacritic data clean for downstream consumption +[IO.File]::WriteAllText($Out, ($obj | ConvertTo-Json -Depth 12 -Compress), [System.Text.UTF8Encoding]::new($false)) +Write-Host ("Saved {0} rows -> {1}" -f ($primary.Rows.Count), $Out) diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/visual-smoke.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/visual-smoke.ps1 new file mode 100644 index 00000000..c4ca9272 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/visual-smoke.ps1 @@ -0,0 +1,177 @@ +<# +.SYNOPSIS + Visual / layout smoke test for the OCE weekly report. 
Optional sibling of + validate-report.ps1 — catches rendered-layout bugs that pure HTML/CSS + validation can't see. + +.DESCRIPTION + Uses Playwright (headless Chromium) to: + 1. Open the report at 1400 px viewport width (the target media size). + 2. Wait for the footer JS to render all sparklines. + 3. Run two DOM-based layout checks: + a. Element overflow — for every .dim and .attr-card, check that no + descendant element's bounding box extends beyond the container's + client width. Catches the "long calling-app name bleeds out of + the dim card" regression. + b. Card adjacency — check that consecutive .attr-card siblings have + at least 8 px of vertical gap. Catches the "cards touching" regression. + 4. Capture a full-page screenshot to ~/android-oce-reports/_visual/ + for manual review. + + Installation note: requires Node.js + Playwright. The script auto-installs + Playwright + Chromium on first run via `npm install --no-save`. + +.PARAMETER Path + Absolute path to the report HTML. Defaults to the most recent + oncall-wow-report-*.html under ~/android-oce-reports/. + +.PARAMETER ScreenshotOnly + Skip the layout checks; just capture the screenshot. + +.EXAMPLE + .\visual-smoke.ps1 + # checks the latest report + writes ~/android-oce-reports/_visual/oncall-wow-report-.png + +.EXAMPLE + .\visual-smoke.ps1 -Path C:\path\to\report.html + +.NOTES + Treat warnings as advisory. The script returns 0 on success, 1 on hard + layout violations (overflow > 4 px, adjacent cards with gap < 8 px). +#> +[CmdletBinding()] +param( + [string]$Path, + [switch]$ScreenshotOnly +) +$ErrorActionPreference = 'Stop' + +if (-not $Path) { + $reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' + $latest = Get-ChildItem $reportDir -Filter 'oncall-wow-report-*.html' -ErrorAction SilentlyContinue | + Sort-Object LastWriteTime -Descending | Select-Object -First 1 + if (-not $latest) { throw "No report found in $reportDir. Pass -Path explicitly." 
} + $Path = $latest.FullName +} +if (-not (Test-Path $Path)) { throw "Report not found: $Path" } + +$screenshotDir = Join-Path $env:USERPROFILE 'android-oce-reports\_visual' +New-Item -ItemType Directory -Force $screenshotDir | Out-Null +$reportBase = [IO.Path]::GetFileNameWithoutExtension($Path) +$screenshot = Join-Path $screenshotDir "$reportBase.png" + +# Locate or install Playwright in a per-skill node_modules cache +$cacheDir = Join-Path $env:LOCALAPPDATA 'oce-skill-playwright' +New-Item -ItemType Directory -Force $cacheDir | Out-Null +if (-not (Test-Path (Join-Path $cacheDir 'node_modules\playwright'))) { + Write-Host "Installing Playwright + Chromium (one-time, into $cacheDir)..." + Push-Location $cacheDir + try { + if (-not (Test-Path 'package.json')) { '{"name":"oce-visual-smoke","version":"0.0.0","private":true}' | Set-Content 'package.json' } + npm install --no-save playwright | Out-Null + npx playwright install chromium | Out-Null + } finally { Pop-Location } +} + +# Build the JS test inline so the .ps1 is self-contained +$jsScript = @' +const { chromium } = require(require('path').join(process.env.OCE_PWCACHE, 'node_modules', 'playwright')); +const fs = require('fs'); +(async () => { + const file = process.argv[2]; + const screenshotPath = process.argv[3]; + const screenshotOnly = process.argv[4] === 'true'; + const browser = await chromium.launch(); + const page = await browser.newPage({ viewport: { width: 1400, height: 900 } }); + await page.goto('file://' + file); + await page.waitForLoadState('networkidle'); + await page.waitForTimeout(500); // give sparkline JS a beat + await page.screenshot({ path: screenshotPath, fullPage: true }); + console.log('SCREENSHOT ' + screenshotPath); + + if (screenshotOnly) { await browser.close(); return; } + + const issues = await page.evaluate(() => { + const out = { overflow: [], adjacent: [] }; + // 1. 
Overflow check: every .dim / .attr-card must contain its descendants + for (const sel of ['.dim', '.attr-card']) { + document.querySelectorAll(sel).forEach((el, idx) => { + const elRect = el.getBoundingClientRect(); + el.querySelectorAll('*').forEach(child => { + const r = child.getBoundingClientRect(); + if (r.width === 0 || r.height === 0) return; + // Allow 4 px tolerance for sub-pixel rendering + const overflowRight = r.right - elRect.right; + if (overflowRight > 4) { + // Identify offending element + const ident = (child.tagName + (child.className ? '.' + String(child.className).split(' ').join('.') : '')).slice(0, 80); + const txt = (child.textContent || '').trim().slice(0, 60); + out.overflow.push({ sel, idx, overflowRight: Math.round(overflowRight), tag: ident, text: txt }); + } + }); + }); + } + // 2. Adjacent .attr-card check + const cards = Array.from(document.querySelectorAll('.attr-card')); + for (let i = 1; i < cards.length; i++) { + const prevR = cards[i - 1].getBoundingClientRect(); + const currR = cards[i].getBoundingClientRect(); + const gap = currR.top - prevR.bottom; + if (gap < 8) { + out.adjacent.push({ prevIdx: i - 1, currIdx: i, gap: Math.round(gap) }); + } + } + return out; + }); + + console.log('ISSUES ' + JSON.stringify(issues)); + await browser.close(); +})(); +'@ + +$jsFile = Join-Path $env:TEMP 'oce-visual-smoke.js' +$jsScript | Set-Content $jsFile -Encoding utf8 + +$env:OCE_PWCACHE = $cacheDir +$absPath = (Resolve-Path $Path).Path.Replace('\', '/') +$absShot = (Resolve-Path $screenshotDir).Path.Replace('\', '/') + '/' + [IO.Path]::GetFileName($screenshot) + +$result = node $jsFile $absPath $absShot $ScreenshotOnly.IsPresent.ToString().ToLower() 2>&1 +$result | ForEach-Object { Write-Host $_ } +Remove-Item $jsFile -Force -ErrorAction SilentlyContinue + +if ($ScreenshotOnly) { Write-Host "Screenshot saved: $screenshot"; exit 0 } + +$issuesLine = $result | Where-Object { $_ -match '^ISSUES ' } +$issues = ($issuesLine -replace '^ISSUES ', 
'') | ConvertFrom-Json +$overflowCount = if ($issues.overflow) { @($issues.overflow).Count } else { 0 } +$adjCount = if ($issues.adjacent) { @($issues.adjacent).Count } else { 0 } + +Write-Host "" +Write-Host "Visual smoke summary:" +Write-Host " Screenshot: $screenshot" +Write-Host " Overflow issues: $overflowCount" +Write-Host " Adjacent gaps <8px: $adjCount" + +if ($overflowCount -gt 0) { + Write-Host "" + Write-Host "Overflow details (showing first 10):" -ForegroundColor Yellow + $issues.overflow | Select-Object -First 10 | ForEach-Object { + Write-Host (" [{0} #{1}] +{2}px overflow: <{3}> text='{4}'" -f $_.sel, $_.idx, $_.overflowRight, $_.tag, $_.text) -ForegroundColor Yellow + } +} +if ($adjCount -gt 0) { + Write-Host "" + Write-Host "Adjacent cards with insufficient gap (showing first 5):" -ForegroundColor Yellow + $issues.adjacent | Select-Object -First 5 | ForEach-Object { + Write-Host (" cards #{0} -> #{1}: gap={2}px (need >=8)" -f $_.prevIdx, $_.currIdx, $_.gap) -ForegroundColor Yellow + } +} + +if ($overflowCount -gt 0 -or $adjCount -gt 0) { + Write-Host "" + Write-Host "Hard layout issues detected. Open $screenshot to inspect." -ForegroundColor Red + exit 1 +} +Write-Host "No hard layout issues." -ForegroundColor Green +exit 0 diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js index 69972740..9d86a75c 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js +++ b/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js @@ -86,7 +86,17 @@ function delta(curr, prior) { function loadSliceFile({ label, file }) { const d = JSON.parse(fs.readFileSync(file, 'utf8')); const rows = d.results.items; - const schema = rows[0]; + const schemaRaw = rows[0]; + // Support both schema forms: object (MCP) and array (assets/scripts/run-kql.ps1). 
+ let schema; + if (Array.isArray(schemaRaw)) { + schema = {}; + for (let i = 0; i < schemaRaw.length; i++) schema[String(schemaRaw[i])] = 'string'; + } else if (schemaRaw && typeof schemaRaw === 'object') { + schema = schemaRaw; + } else { + throw new Error(`${file}: first row of results.items must be the schema (column-name array or {col: type} object). Got: ${JSON.stringify(schemaRaw)}`); + } const cols = Object.keys(schema); const idxCode = cols.indexOf(keyCol); let idxWeek = cols.indexOf('wk'); if (idxWeek < 0) idxWeek = cols.indexOf('week'); @@ -95,9 +105,16 @@ function loadSliceFile({ label, file }) { if (idxCode < 0 || idxWeek < 0 || idxDevs < 0) { throw new Error(`${file}: schema must include ${keyCol}, wk|week, devs|countDevices. Got [${cols.join(', ')}]`); } - const idxDim = cols.findIndex((c, i) => + // Find the dim column. When schema was provided as an array (run-kql.ps1) we + // don't have type info, so fall back to "any remaining column" (typically the + // last one in the SELECT). + let idxDim = cols.findIndex((c, i) => i !== idxCode && i !== idxWeek && i !== idxDevs && i !== idxErrs && schema[c] === 'string'); - if (idxDim < 0) throw new Error(`${file}: no string dimension column found`); + if (idxDim < 0) { + idxDim = cols.findIndex((c, i) => + i !== idxCode && i !== idxWeek && i !== idxDevs && i !== idxErrs); + } + if (idxDim < 0) throw new Error(`${file}: no dimension column found`); const map = {}; for (const r of rows.slice(1)) { @@ -115,7 +132,20 @@ function loadSliceFile({ label, file }) { function loadUnion(file) { const d = JSON.parse(fs.readFileSync(file, 'utf8')); const rows = d.results.items; - const schema = rows[0]; + const schemaRaw = rows[0]; + // Two schema shapes are supported: + // (a) Object form (MCP tool): { dim: 0, wk: 1, ... } — keys are column names + // (b) Array form (REST helper assets/scripts/run-kql.ps1): ['dim', 'wk', ...] + // Detect and normalize to an object map { colName -> index }. 
+ let schema; + if (Array.isArray(schemaRaw)) { + schema = {}; + for (let i = 0; i < schemaRaw.length; i++) schema[String(schemaRaw[i])] = i; + } else if (schemaRaw && typeof schemaRaw === 'object') { + schema = schemaRaw; + } else { + throw new Error(`Union file ${file}: first row of results.items must be the schema (column-name array or {col: index} object). Got: ${JSON.stringify(schemaRaw)}`); + } const cols = Object.keys(schema); const idx = name => cols.indexOf(name); const idxDim = idx('dim'); diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md index 3c430bcf..9eb14992 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/template-readme.md @@ -39,6 +39,13 @@ Pick by overlap with the prior week: - **In-place edit (default)** — when ≤3 attribution cards change AND the section structure is unchanged. Use `replace_string_in_file` with surrounding context per card / table row. Fast and low-risk. - **Head+body+footer rebuild (fallback)** — when ≥4 attribution cards change, or several callouts get re-categorized, or the regression set has near-zero overlap with the template. Trying to in-place edit at that scale invites the inception-style nested-`
` bugs the validator was written to catch. +> **⚠️ UTF-8 trap in PowerShell composition.** When composing HTML body sections via `@'...'@` heredocs piped to `Set-Content` / `Out-File` (or even `Add-Content`), PowerShell silently strips multi-byte UTF-8 characters — emojis (📊 🚨 🔴 🟡), em-dashes (—), arrows (→), middle-dots (·). The file remains valid UTF-8; the characters just become empty strings. The validator's `U+FFFD` check catches mojibake but NOT silent strips. Two safe approaches: +> +> 1. **Use `[IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))`** for the final write — this preserves Unicode literals from the script source. +> 2. **Write a Node.js generator** (`gen-body.js`) that takes a JSON spec and emits the HTML body. Node handles UTF-8 natively. If creating the script becomes painful (the `create` tool occasionally fails on `file_text` in this codebase), fall back to approach 1 with explicit `[char]0xD83D + [char]0xDCCA` for 📊, `[char]0x2192` for →, etc. +> +> The cost when this trap fires: a full restoration pass against every emoji + em-dash + arrow in the report (~30 minutes in v8). + Boundary lines in the canonical template (verify with `grep` before splitting — they drift as the template evolves): | Region | Lines (approx) | Last/first line content | @@ -236,3 +243,28 @@ Used by both `.spark` (KPI tiles) and `.trend` (table cells): | `.stack` | Chip for a `file:line` throw-site reference. | | `.pr-card` / `.pr-conf` (`-high` / `-medium` / `-low` / `-none`) / `.pr-body` | PR citation with confidence pill. | | `.origin-tag` (`.origin-broker` / `.origin-android` / `.origin-thirdparty` / `.origin-env`) | Colored chips for the Originator field. | + +### Section 6/7 WoW table row pills + +Status pills in the `error_codes` and `error_types` WoW tables. 
The 5-color +palette is meaningful — pick the one that matches the row's state: + +| Class | Color | Emoji | When to use | +|---|---|---|---| +| `.pill-bad` | red (#ffeef0 bg / #cf222e text) | 🔴 | Row crossed regression threshold this week — `WoW`, `NEW`, `spike`, or `retry storm` modifier. | +| `.pill-watch` | amber (#fff8c5 bg / #9a6700 text) | 🟡 | Row is flat WoW but rising on the 60d window (use the `60d↑` modifier). | +| `.pill-good` | green (#dafbe1 bg / #1a7f37 text) | 🟢 | Row is improving — recovery, `improving`, `60d↓`, or `requests↓` modifier. | +| `.pill-flat` | grey (#f0f3f6 bg / #656d76 text) | ⚪ | Row is within ±10% on both 60d and WoW; explicitly stable. | +| `.pill-info` | blue (#ddf4ff bg / #0550ae text) | ℹ️ | Informational rows (e.g. policy-driven, fleet-growth-driven). | + +Render pattern: +```html +🔴 WoW +🟡 60d↑ +🟢 improving +``` + +If your table has zero `.pill-bad` rows the week was unusually quiet — +double-check the WoW-movers and 60d bucketing passes ran. If every row is +`.pill-bad` you've mis-categorized. + diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 index 19abe1d2..63e0a794 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 +++ b/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 @@ -232,8 +232,9 @@ if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { # copy/paste of an older head could regress.) $hasAttrCard = ([regex]::Matches($content, '
from the current assets/report-template.html."
 } else {
@@ -248,10 +249,10 @@ if ($hasAttrCard) {
 # with min-width:0. We can't measure actual rendering, but we CAN assert
 # the CSS rules exist verbatim.
 if ($hasAttrCard) {
-    $cssHasEllipsis = $content -match '\.dim-row\s*>\s*span:first-of-type[^}]*text-overflow\s*:\s*ellipsis' `
-        -or $content -match '\.dim-row\s+\.dim-name[^}]*text-overflow\s*:\s*ellipsis'
-    $cssHasMinWidth = $content -match '\.dim\s*\{[^}]*min-width\s*:\s*0' `
-        -or $content -match '\.dim-row\s*\{[^}]*min-width\s*:\s*0'
+    $cssHasEllipsis = $content -match '(?s)\.dim-row\s*>\s*span:first-of-type[^}]*text-overflow\s*:\s*ellipsis' `
+        -or $content -match '(?s)\.dim-row\s+\.dim-name[^}]*text-overflow\s*:\s*ellipsis'
+    $cssHasMinWidth = $content -match '(?s)\.dim\s*\{[^}]*min-width\s*:\s*0' `
+        -or $content -match '(?s)\.dim-row\s*\{[^}]*min-width\s*:\s*0'
     if (-not $cssHasEllipsis) {
         Add-Fail "CSS is missing the .dim-row name-overflow guard (text-overflow:ellipsis on .dim-name and/or .dim-row > span:first-of-type). Long calling-app / version names will bleed out of the dim cards. Re-extract from the current assets/report-template.html."
     } elseif (-not $cssHasMinWidth) {
@@ -261,6 +262,39 @@ if ($hasAttrCard) {
     }
 }

+# ---- 10. Fabricated-sparkline heuristic (v8 regression — hand-rolled data-trend arrays) ----
+# Past failure mode: when 60d bucketer dropped a sub-floor code, the report author
+# fabricated a "roughly monotonic" 8-week series inline in the WoW table HTML.
+# Cannot 100% detect fabricated data, but we can flag the telltale fingerprints:
+#   - All values < 1000 (the bucketer's peak-floor is 10000; real data above floor)
+#   - Suspiciously round / arithmetic-progression numbers (e.g. [388,401,394,425,415,432,414,455]
+#     where consecutive deltas are all ~10-30)
+# Authors should source these from assets/queries/wow-table-sparkline-series.kql
+# instead and validate against the pulled JSON.
+$trendMatches = [regex]::Matches($content, "data-trend=['""]?\[([0-9.,e\s+\-]+)\]")
+$suspectCount = 0
+$suspectFirst = $null
+foreach ($m in $trendMatches) {
+    $arrStr = $m.Groups[1].Value
+    $vals = $arrStr.Split(',') | ForEach-Object { try { [double]$_.Trim() } catch { 0 } }
+    if ($vals.Count -lt 6) { continue }
+    # Filter 1: trend with all values < 100 is suspicious (real codes don't sit at 30-50 devices/wk for 8 weeks)
+    $maxVal = ($vals | Measure-Object -Maximum).Maximum
+    if ($maxVal -lt 100) {
+        $suspectCount++
+        if (-not $suspectFirst) { $suspectFirst = $arrStr }
+        continue
+    }
+    # Filter 2: zero-padded series like [0,0,0,0,0,0,0,N] is fine (legitimate NEW); skip
+    # Filter 3: implausibly regular - if every consecutive delta has the same sign AND is < 5% of the value, that's a fake.
+    #           Skip this; too easy to false-positive on genuinely monotonic real series like no_tokens_found.
+}
+if ($suspectCount -gt 0) {
+    Add-Warn "$suspectCount data-trend array(s) have peak value < 100 (suspicious — real WoW-table series usually peak >= 100 devices/wk). Likely fabricated. First: [$suspectFirst]. Source from assets/queries/wow-table-sparkline-series.kql instead."
+} else {
+    Pass "No suspicious low-peak data-trend arrays detected"
+}
+
 Write-Host ""
 if ($failures.Count -eq 0) {
     Write-Host "All hard checks passed." -ForegroundColor Green

From 7c9c378359a26194d1ce26874d90694ed38fbd26 Mon Sep 17 00:00:00 2001
From: Shahzaib
Date: Tue, 9 Jun 2026 22:58:40 -0700
Subject: [PATCH 6/6] Reorganize skill assets: docs/, scripts/, templates/ subfolders

Move 5 scripts (agg.js, bucket-trends.js, summarize-attribution.js, find-suspect-prs.ps1, validate-report.ps1) from assets/ root into assets/scripts/ where bootstrap-report.ps1, run-kql.ps1, and visual-smoke.ps1 already live.

Move report-template.html and template-readme.md from assets/ root into assets/templates/ alongside the spike-card / traffic-attr-card / sparkline-footer snippets they're conceptually grouped with.

Move kusto-cheatsheet.md and code-attribution-template.md from assets/ root into a new assets/docs/ folder.

Update all cross-references in SKILL.md, the query templates, the script self-documentation, and the bootstrap script's template-lookup path.

Validator + bootstrap still pass end-to-end after the move.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../oncall-weekly-telemetry-report/SKILL.md | 46 +++++++++----------
 .../{ => docs}/code-attribution-template.md | 2 +-
 .../assets/{ => docs}/kusto-cheatsheet.md | 2 +-
 .../assets/queries/60d-trend-codes.kql | 2 +-
 .../queries/error-message-and-location.kql | 2 +-
 .../assets/{ => scripts}/agg.js | 0
 .../assets/scripts/bootstrap-report.ps1 | 4 +-
 .../assets/{ => scripts}/bucket-trends.js | 2 +-
 .../assets/{ => scripts}/find-suspect-prs.ps1 | 0
 .../{ => scripts}/summarize-attribution.js | 0
 .../assets/{ => scripts}/validate-report.ps1 | 14 +++---
 .../assets/templates/README.md | 2 +-
 .../{ => templates}/report-template.html | 6 +--
 .../assets/templates/sparkline-footer.html | 2 +-
 .../assets/templates/spike-card.html | 4 +-
 .../assets/{ => templates}/template-readme.md | 4 +-
 16 files changed, 47 insertions(+), 45 deletions(-)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => docs}/code-attribution-template.md (98%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => docs}/kusto-cheatsheet.md (98%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => scripts}/agg.js (100%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => scripts}/bucket-trends.js (98%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => scripts}/find-suspect-prs.ps1 (100%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => scripts}/summarize-attribution.js (100%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => scripts}/validate-report.ps1 (96%)
 rename .github/skills/oncall-weekly-telemetry-report/assets/{ => 
templates}/report-template.html (99%) rename .github/skills/oncall-weekly-telemetry-report/assets/{ => templates}/template-readme.md (99%) diff --git a/.github/skills/oncall-weekly-telemetry-report/SKILL.md b/.github/skills/oncall-weekly-telemetry-report/SKILL.md index 33cda751..4638ee01 100644 --- a/.github/skills/oncall-weekly-telemetry-report/SKILL.md +++ b/.github/skills/oncall-weekly-telemetry-report/SKILL.md @@ -7,25 +7,25 @@ description: Generate the weekly Android Broker on-call (OCE) WoW + 60-day trend Produce the weekly Android Broker on-call (OCE) telemetry report as a self-contained HTML file at `$env:USERPROFILE\android-oce-reports\oncall-wow-report-v{N+1}.html` (i.e. `~/android-oce-reports/`, outside the workspace so reports never accidentally get committed). -The output mirrors the structure of the canonical template at [`assets/report-template.html`](assets/report-template.html) — copy it to `oncall-wow-report-v{N+1}.html` at repo root and edit in place. Do **not** redesign the layout each week. +The output mirrors the structure of the canonical template at [`assets/templates/report-template.html`](assets/templates/report-template.html) — copy it to `oncall-wow-report-v{N+1}.html` at repo root and edit in place. Do **not** redesign the layout each week. -**Before writing any KQL, read [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md).** It captures the canonical view names, helper functions, the HLL device-count gotcha, week-alignment rules, and ready-to-paste query templates — distilled from the production Android Broker Dashboard. +**Before writing any KQL, read [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md).** It captures the canonical view names, helper functions, the HLL device-count gotcha, week-alignment rules, and ready-to-paste query templates — distilled from the production Android Broker Dashboard. 
Reusable helpers in [`assets/`](assets/): | File | Purpose | |---|---| -| [`report-template.html`](assets/report-template.html) | Canonical layout — a real prior-week report kept verbatim. **Edit in place** (replace dates / values / verdicts / PR links); do not restyle. See [`template-readme.md`](assets/template-readme.md) for what to change vs leave alone. | -| [`template-readme.md`](assets/template-readme.md) | Author guide for `report-template.html` — what to change per week, color palette, CSS class quick-reference | -| [`kusto-cheatsheet.md`](assets/kusto-cheatsheet.md) | Schemas, helper funcs, gotchas, ready-to-paste KQL templates, AADSTS reference | -| [`code-attribution-template.md`](assets/code-attribution-template.md) | Per-card checklist for the deep code-attribution block (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) | +| [`report-template.html`](assets/templates/report-template.html) | Canonical layout — a real prior-week report kept verbatim. **Edit in place** (replace dates / values / verdicts / PR links); do not restyle. See [`template-readme.md`](assets/templates/template-readme.md) for what to change vs leave alone. | +| [`template-readme.md`](assets/templates/template-readme.md) | Author guide for `report-template.html` — what to change per week, color palette, CSS class quick-reference | +| [`kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md) | Schemas, helper funcs, gotchas, ready-to-paste KQL templates, AADSTS reference | +| [`code-attribution-template.md`](assets/docs/code-attribution-template.md) | Per-card checklist for the deep code-attribution block (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) | | [`queries/`](assets/queries/) | Canonical KQL templates, one file per query — see [`queries/README.md`](assets/queries/README.md). 
Highlights: [`attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql) (NEW — all 7 dims in one round-trip), [`error-message-and-location.kql`](assets/queries/error-message-and-location.kql) (now accepts BOTH `` and `` in one call) | | [`templates/`](assets/templates/) | Copy-paste HTML snippets (`spike-card.html`, `traffic-attr-card.html`, `sparkline-footer.html`) | -| [`bucket-trends.js`](assets/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs`. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop the partial in-progress bucket. **`--summary` suppresses the verbose header; `--json=` emits a structured sidecar for programmatic consumption.** | -| [`agg.js`](assets/agg.js) | Per-error per-dim top-N rollup with WoW deltas. Workhorse for filling spike-attribution dim blocks. | -| [`summarize-attribution.js`](assets/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards. Supports BOTH `--union ` (preferred for 2-week WoW; pairs with `attr-union-by-dim.kql`) AND legacy `--label= file.json` per-dim mode. **Auto-detects the array-form schema produced by `assets/scripts/run-kql.ps1` — no schema-transformer step needed.** | -| [`find-suspect-prs.ps1`](assets/find-suspect-prs.ps1) | Parallel `git log -S` + `--grep` across broker/ + common/ for a class/method symbol, with PR numbers + URLs. Run *only after* the Originator pre-check has identified a specific throw-site class — the unscoped 4-week PR window is small enough (<30 PRs) to scan with plain `git log` first. | -| [`validate-report.ps1`](assets/validate-report.ps1) | Pre-publish validator. Catches stale tokens, devs/reqs leaks, mojibake (U+FFFD), unbalanced `
` depth in Section 2 (the nested-callout bug), KPI/trend sparkline coverage, code-attribution depth, layout-guard CSS presence, and suspicious low-peak fabricated `data-trend` arrays. Run as part of Step 7. | +| [`bucket-trends.js`](assets/scripts/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs`. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop the partial in-progress bucket. **`--summary` suppresses the verbose header; `--json=` emits a structured sidecar for programmatic consumption.** | +| [`agg.js`](assets/scripts/agg.js) | Per-error per-dim top-N rollup with WoW deltas. Workhorse for filling spike-attribution dim blocks. | +| [`summarize-attribution.js`](assets/scripts/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards. Supports BOTH `--union ` (preferred for 2-week WoW; pairs with `attr-union-by-dim.kql`) AND legacy `--label= file.json` per-dim mode. **Auto-detects the array-form schema produced by `assets/scripts/run-kql.ps1` — no schema-transformer step needed.** | +| [`find-suspect-prs.ps1`](assets/scripts/find-suspect-prs.ps1) | Parallel `git log -S` + `--grep` across broker/ + common/ for a class/method symbol, with PR numbers + URLs. Run *only after* the Originator pre-check has identified a specific throw-site class — the unscoped 4-week PR window is small enough (<30 PRs) to scan with plain `git log` first. | +| [`validate-report.ps1`](assets/scripts/validate-report.ps1) | Pre-publish validator. Catches stale tokens, devs/reqs leaks, mojibake (U+FFFD), unbalanced `
` depth in Section 2 (the nested-callout bug), KPI/trend sparkline coverage, code-attribution depth, layout-guard CSS presence, and suspicious low-peak fabricated `data-trend` arrays. Run as part of Step 7. | | [`scripts/run-kql.ps1`](assets/scripts/run-kql.ps1) | **Direct-REST Kusto helper — drop-in fallback for the Azure Kusto MCP server when the MCP times out** (frequent on per-error-code queries). Acquires a token via `az`, POSTs to `/v2/rest/query`, writes a JSON file the JS helpers can consume directly. | | [`scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1) | Bootstrap a new week's report from the canonical template. Auto-computes the reporting Sunday, creates `_data//`, prunes `_data` folders older than 60 days, and detects "unfilled template stub" vs "real prior report" collisions using a multi-marker fingerprint (title + meta date + first KPI value + size ratio). | | [`scripts/visual-smoke.ps1`](assets/scripts/visual-smoke.ps1) | Optional Playwright-based layout smoke test. Renders the report at 1400 px viewport, captures a full-page screenshot under `~/android-oce-reports/_visual/`, and runs DOM-based overflow + adjacent-card-gap detection. Catches the rendered-layout bugs (text bleed, cards touching) that pure HTML/CSS validation can't see. | @@ -56,7 +56,7 @@ If any of these are unstated, ask once, then proceed. 1. **Top-line health KPIs** — total requests, total devices, silent-auth reliability %, interactive reliability %, p95 latency on the hot spans. WoW delta on each. Inline SVG sparklines. 2. **Things that need attention this week** — callouts: - **Denominator caveat** — explain any large total-spans device-count shift caused by span-emission changes (e.g. `goAsync()` refactors). Always state which denominator the report uses (auth-only: `SilentAuthStats` ∪ `InteractiveAuthStats`). 
- - **🔴 WoW regressions (last 7 days)** — *one* callout listing every code/type that moved sharply WoW, **sorted by current-week device count descending**. Built from the union of (a) the standard WoW table and (b) [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) so small-but-recent spikes appear in the same list as the high-volume ones. Each row uses the `.item` flat-row pattern (see `assets/template-readme.md` § "Section 2 callouts"): name + inline metric chips + tags pushed right + one-line body + optional foot with `Attribution card →` link. **Section 2 rows are at-a-glance only** — do not duplicate the dim slicing / PR analysis / detailed verdict here; that belongs in the Section 4 spike-attribution card. Each row carries tags: `NEW` (first appeared this week or last), `60d↑` (also rising on 60d), and an originator chip (`broker` / `eSTS` / `Android` / `env`). Reader's eye prioritizes naturally by row order and tag combination — broker-tagged rows at the top demand the most attention. + - **🔴 WoW regressions (last 7 days)** — *one* callout listing every code/type that moved sharply WoW, **sorted by current-week device count descending**. Built from the union of (a) the standard WoW table and (b) [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) so small-but-recent spikes appear in the same list as the high-volume ones. Each row uses the `.item` flat-row pattern (see `assets/templates/template-readme.md` § "Section 2 callouts"): name + inline metric chips + tags pushed right + one-line body + optional foot with `Attribution card →` link. **Section 2 rows are at-a-glance only** — do not duplicate the dim slicing / PR analysis / detailed verdict here; that belongs in the Section 4 spike-attribution card. Each row carries tags: `NEW` (first appeared this week or last), `60d↑` (also rising on 60d), and an originator chip (`broker` / `eSTS` / `Android` / `env`). 
Reader's eye prioritizes naturally by row order and tag combination — broker-tagged rows at the top demand the most attention. - **Slow-burn 60-day regressions** — codes/types climbing on the 60d window that are flat WoW. Anything that *also* moved WoW belongs in the red callout above (with `60d↑`), not here. Link to the 60-Day Trend section. - **Real wins this week**, with PR links. - **Traffic shape** — flat / surge / collapse summary. @@ -83,7 +83,7 @@ If any of these are unstated, ask once, then proceed. ### Step 1 — Bootstrap the new report file from the template -This skill ships with a canonical template at [`assets/report-template.html`](assets/report-template.html) (a real prior week's report kept as the reference layout). **Use [`assets/scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1)** to handle all the boilerplate (Sunday-date computation, `_data//` directory, retention-pruning, collision detection): +This skill ships with a canonical template at [`assets/templates/report-template.html`](assets/templates/report-template.html) (a real prior week's report kept as the reference layout). **Use [`assets/scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1)** to handle all the boilerplate (Sunday-date computation, `_data//` directory, retention-pruning, collision detection): ```pwsh .\.github\skills\oncall-weekly-telemetry-report\assets\scripts\bootstrap-report.ps1 @@ -98,13 +98,13 @@ What it does: * Prunes `_data//` folders older than 60 days so the cache doesn't accumulate. * **Collision detection (the v8-hardened version):** uses a multi-marker fingerprint (title + meta-line dates + first-KPI value + size ratio) to distinguish an "unfilled template stub" (silently re-bootstrap) from a "real populated report" (HARD HALT, exit 2, require `-Force` to overwrite). The earlier single-marker (title only) version misclassified populated reports as stubs and overwrote real work. 
-Edit the bootstrapped file in place — the template ships as a real prior-week report (not a tokenized skeleton). **Walk top-to-bottom and replace every prior-week date / KPI value / table row / verdict / PR citation with current-week data.** The CSS, sparkline JS, section ordering, and attribution-card markup are canonical — do not redesign them. See [`assets/template-readme.md`](assets/template-readme.md) for the full guide on what to change vs leave alone, the sparkline color palette, the CSS class reference, and the two v8 layout traps. +Edit the bootstrapped file in place — the template ships as a real prior-week report (not a tokenized skeleton). **Walk top-to-bottom and replace every prior-week date / KPI value / table row / verdict / PR citation with current-week data.** The CSS, sparkline JS, section ordering, and attribution-card markup are canonical — do not redesign them. See [`assets/templates/template-readme.md`](assets/templates/template-readme.md) for the full guide on what to change vs leave alone, the sparkline color palette, the CSS class reference, and the two v8 layout traps. > **⚠️ UTF-8 trap — DO NOT use PowerShell `@'...'@` heredocs to compose HTML content containing emojis, em-dashes, arrows, or middle dots.** PowerShell silently strips multi-byte UTF-8 characters when piping heredocs to `Set-Content` / `Out-File`. Use Node.js (`fs.writeFileSync`), `[IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))`, or explicit Unicode-pair literals (`[char]0xD83D + [char]0xDCCA` for 📊) instead. This trap cost ~30 min in v8 and required a full emoji-restoration pass — every callout icon, every section header emoji, every arrow link had to be re-injected. The validator's `U+FFFD` check catches the worst case (mojibake replacement char) but cannot detect characters that were silently stripped to nothing. 
Mark any unfinished card or table cell with the literal sentinel `EXAMPLE CONTENT BELOW` inside an HTML comment — the final-pass validator (Step 7) greps for it. -If the template ever needs structural improvements (new section, new card style, etc.), update `assets/report-template.html` in the skill folder and commit it so future weeks inherit the change. +If the template ever needs structural improvements (new section, new card style, etc.), update `assets/templates/report-template.html` in the skill folder and commit it so future weeks inherit the change. ### Step 2 — Pull WoW reliability data @@ -112,7 +112,7 @@ Use the Kusto MCP tool against: - **Cluster:** `https://idsharedeus2.kusto.windows.net` - **Database:** `ad-accounts-android-otel` -**Always prefer the canonical `materialized_view('XxxMetrics' or 'XxxUpdated')` variants** — these are what the production dashboard uses, are pre-aggregated and HLL-bucketed, and avoid the 240 s MCP timeout that plain `android_spans` queries hit. Full schema, gotchas, and query templates: [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). +**Always prefer the canonical `materialized_view('XxxMetrics' or 'XxxUpdated')` variants** — these are what the production dashboard uses, are pre-aggregated and HLL-bucketed, and avoid the 240 s MCP timeout that plain `android_spans` queries hit. Full schema, gotchas, and query templates: [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md). > **Fallback when the Kusto MCP times out:** use [`assets/scripts/run-kql.ps1`](assets/scripts/run-kql.ps1). It acquires a token via `az account get-access-token`, POSTs directly to `/v2/rest/query`, and writes the result as a JSON file the JS helpers (`bucket-trends.js`, `summarize-attribution.js`) can consume directly. The skill's MCP-vs-REST switch is roughly: try the MCP once; if it returns `McpError -32001 (timeout)`, switch to the REST helper for the rest of the run. 
Run multiple queries in parallel via PowerShell `Start-Job`: > @@ -239,11 +239,11 @@ For each WoW mover (regardless of size), you still owe the full Code Attribution ### Step 4 — Code attribution (deep PR correlation) -> ⚠️ **HARD RULE — Originator pre-check.** Before claiming `Originator: Broker` on any card, you MUST run [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) for that error code (or type) and read **(a) the throw-site stack and (b) the top 3 `error_message` strings**. Most broker error codes flow through `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult, clientExceptionFromException}` — which intentionally bridge eSTS responses into broker exceptions. **If the throw site is in any of those three methods AND the error_message starts with `AADSTS`, the originator is eSTS, not broker.** See the AADSTS reference table in [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). Cards that skip this step must be marked low-confidence, not high. +> ⚠️ **HARD RULE — Originator pre-check.** Before claiming `Originator: Broker` on any card, you MUST run [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) for that error code (or type) and read **(a) the throw-site stack and (b) the top 3 `error_message` strings**. Most broker error codes flow through `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult, clientExceptionFromException}` — which intentionally bridge eSTS responses into broker exceptions. **If the throw site is in any of those three methods AND the error_message starts with `AADSTS`, the originator is eSTS, not broker.** See the AADSTS reference table in [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md). Cards that skip this step must be marked low-confidence, not high. 
> > **Window:** use the FULL 7-day reporting window (`` → ``) on `PipelineInfo_IngestionTime`, NOT a narrower 3–5 day slice — low-volume types (e.g. `SSLHandshakeException`, `IntuneAppProtectionPolicyRequiredException`) routinely return zero rows in a sub-week window. If a code/type still returns nothing, fall back to the prior 14 days before declaring "no data". -For every regression card, the Code Attribution block **must** populate the following fields. Shallow PR-citation only is not acceptable. Use [`assets/code-attribution-template.md`](assets/code-attribution-template.md) as the per-card checklist. +For every regression card, the Code Attribution block **must** populate the following fields. Shallow PR-citation only is not acceptable. Use [`assets/docs/code-attribution-template.md`](assets/docs/code-attribution-template.md) as the per-card checklist. | Field | What goes in it | How to find it | |---|---|---| @@ -362,7 +362,7 @@ node .github\skills\oncall-weekly-telemetry-report\assets\summarize-attribution. --label=client_sku sku.json ``` -Ready-to-paste KQL for both forms: union → [`assets/queries/attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql); per-dim → [`assets/kusto-cheatsheet.md` § 8c](assets/kusto-cheatsheet.md). +Ready-to-paste KQL for both forms: union → [`assets/queries/attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql); per-dim → [`assets/docs/kusto-cheatsheet.md` § 8c](assets/docs/kusto-cheatsheet.md). 
**Concentration thresholds** (paint the dim bar red): - > 80% in a single value → strong attribution (one root cause) @@ -450,7 +450,7 @@ Add a top-level **🚚 Traffic Attribution** section that lists every error matc Run the bundled validator FIRST — it covers all the silent-failure cases this skill has tripped on in the past: ```pwsh -.\.github\skills\oncall-weekly-telemetry-report\assets\validate-report.ps1 +.\.github\skills\oncall-weekly-telemetry-report\assets\scripts\validate-report.ps1 # defaults to most-recent oncall-wow-report-*.html under ~/android-oce-reports/ # pass -Path explicitly to validate a specific file ``` @@ -492,9 +492,9 @@ Then: - **Always apply `MergeAccountType` / `MergeIsSharedDevice` / `MergeUiRequiredExceptions`** so this report agrees with the dashboard. - **Confirm the week bucket label matches the user's intent** before writing the rest of the queries (Sunday-aligned). - **Always filter the partial in-progress week at the source** with `| where week < datetime()` where `` is the Sunday immediately after the reporting week. Otherwise `bucket-trends.js` will show every error as a fake −99% improvement once UTC has crossed midnight Sunday. -- **Originator pre-check is mandatory.** A card cannot claim `Originator: Broker` without first running [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) and reading the throw site + top 3 `error_message` strings. If the throw site is in `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}` AND the message starts with `AADSTS`, the originator is **eSTS, not broker** — see the AADSTS reference in [`assets/kusto-cheatsheet.md`](assets/kusto-cheatsheet.md). +- **Originator pre-check is mandatory.** A card cannot claim `Originator: Broker` without first running [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) and reading the throw site + top 3 `error_message` strings. 
If the throw site is in `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}` AND the message starts with `AADSTS`, the originator is **eSTS, not broker** — see the AADSTS reference in [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md). - **WoW-movers pass is mandatory.** The 60d bucketer's `--peak-floor` silently drops sub-10K-device codes, so [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) MUST be run as a separate pass for both `error_code` and `error_type` (per Step 3d). Its output is **merged into the single 🔴 WoW regressions callout**, sorted by current-week device count descending, with rows tagged `NEW` / `60d↑` / originator chip. Do not render a separate "emerging" callout. Skipping the pass is how the Apr 26 `Failed to parse JWT` spike (7 → 3,461 devs over 7 weeks) hid for two reports running. -- **Section 2 callouts are at-a-glance, Section 4 is the deep dive.** WoW / Slow-burn / Wins items in Section 2 use the `.item` flat-row pattern (no nested cards, no per-item left bars — the parent `.callout` border is the only severity affordance). Each row is a single line of metric chips + a one-line body + an `Attribution card →` link to the corresponding `.attr-card` in Section 4. Do NOT duplicate the dim slicing, PR analysis, or detailed verdict between the two sections — Section 4 is where that lives. See [`assets/template-readme.md`](assets/template-readme.md) for the CSS class reference and the example `.item` markup. +- **Section 2 callouts are at-a-glance, Section 4 is the deep dive.** WoW / Slow-burn / Wins items in Section 2 use the `.item` flat-row pattern (no nested cards, no per-item left bars — the parent `.callout` border is the only severity affordance). Each row is a single line of metric chips + a one-line body + an `Attribution card →` link to the corresponding `.attr-card` in Section 4. 
Do NOT duplicate the dim slicing, PR analysis, or detailed verdict between the two sections — Section 4 is where that lives. See [`assets/templates/template-readme.md`](assets/templates/template-readme.md) for the CSS class reference and the example `.item` markup. - **Never use bash/PowerShell regex to bulk-edit balanced HTML.** This skill has burned twice on regex strip scripts that ate matched-pair `
` closes, producing inception-style nested-callout bugs that take a depth-tracking script to find. If you need a structural change to the HTML, make a targeted, single-occurrence string replacement (with explicit before/after context) or rewrite the affected block end-to-end. Never run a `-replace` across the whole file expecting it to leave balance intact. - **Denominator caveat must cite evidence, not hand-wave.** If you flag a large all-spans device-count shift, run [`assets/queries/broker-version-share-wow.kql`](assets/queries/broker-version-share-wow.kql) (single WoW snapshot) or [`assets/queries/broker-version-share.kql`](assets/queries/broker-version-share.kql) (time-series) and name the version cohort the shift moved with. Do not write "recurring telemetry-shape artifact" without backing data; if you don't have it, drop the callout. - **"Recovery" still merits a PR citation.** When an error pins to a single old broker version and recovers as that version retires, look for the **fix PR in the version that replaced it** before calling it a "natural rolloff." Often the fix is real and just under-credited. @@ -526,6 +526,6 @@ Then: - [ ] Auth-only denominator used for all reliability %s, denominator caveat called out at top. - [ ] No `\bdevs\b` or `\breqs\b` in user-facing text. (`Select-String -Pattern '\bdevs\b|\breqs\b' -CaseSensitive:$false` returns 0.) - [ ] **Sparklines rendered.** Every `.kpi` tile in the Top-line health section has a `data-spark` array with 8–9 weekly values. Every row in the 60-day trend tables and both WoW tables (codes + types) has a `data-trend` mini-spark. The validator's chart-coverage check passes (KPI coverage ≥1/2 of tiles, total elements ≥15). Past failure mode: the v7 body rebuild dropped all sparklines silently — see `template-readme.md` § "Sparklines are MANDATORY". -- [ ] **Code-attribution depth.** Every `.attr-card`'s Code attribution block uses the full 8-field `
` structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) per [`assets/code-attribution-template.md`](assets/code-attribution-template.md). A `pr-list`-only stub is **not acceptable** — the validator hard-fails this. Past failure mode (v7 third pass): all 10 cards shipped with PR-only stubs and lost the throw-site / wrapper / underlying-cause analysis. +- [ ] **Code-attribution depth.** Every `.attr-card`'s Code attribution block uses the full 8-field `
` structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) per [`assets/docs/code-attribution-template.md`](assets/docs/code-attribution-template.md). A `pr-list`-only stub is **not acceptable** — the validator hard-fails this. Past failure mode (v7 third pass): all 10 cards shipped with PR-only stubs and lost the throw-site / wrapper / underlying-cause analysis. - [ ] No stale text from previous weeks. (`Select-String -Pattern 'EXAMPLE CONTENT BELOW'` returns 0 — that's the unfinished-section sentinel. The template no longer ships `{{TOKEN}}` placeholders since v2; if the file still contains any `{{`, that's also a leftover.) - [ ] `get_errors` clean on the HTML file. diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md b/.github/skills/oncall-weekly-telemetry-report/assets/docs/code-attribution-template.md similarity index 98% rename from .github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md rename to .github/skills/oncall-weekly-telemetry-report/assets/docs/code-attribution-template.md index 90aed84f..cd5dfe57 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/code-attribution-template.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/docs/code-attribution-template.md @@ -1,6 +1,6 @@ # Code Attribution Card — Per-Spike Checklist -Use this template for **every** spike-attribution card in the report. The HTML markup matches the `code-attr` / `pr-card` / `origin-tag` styles already in [`report-template.html`](report-template.html). +Use this template for **every** spike-attribution card in the report. The HTML markup matches the `code-attr` / `pr-card` / `origin-tag` styles already in [`report-template.html`](../templates/report-template.html). A card without a populated **Originator + Top throw site + Likely PRs + Next step** is not acceptable. 
"Caller hot-spots", "Underlying cause", and "Top error_messages" are required for any error where the originator is *not* obvious from the error name alone (Android system errors, 3rd-party library wrappers, environmental). diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md b/.github/skills/oncall-weekly-telemetry-report/assets/docs/kusto-cheatsheet.md similarity index 98% rename from .github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md rename to .github/skills/oncall-weekly-telemetry-report/assets/docs/kusto-cheatsheet.md index b5302510..5d52ed31 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/kusto-cheatsheet.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/docs/kusto-cheatsheet.md @@ -202,7 +202,7 @@ materialized_view('BrokerAdoptionStatsUpdated') | [`summarize-attribution.js`](summarize-attribution.js) | Roll up 7-dim attribution slices per (error_code, week) — feeds the spike-attribution cards | | [`queries/`](queries/) | Canonical KQL templates, one per query — see [`queries/README.md`](queries/README.md) | | [`templates/`](templates/) | Copy-paste HTML snippets for cards / footer JS | -| [`report-template.html`](report-template.html) | Canonical layout. Copy to `~/android-oce-reports/oncall-wow-report-.html` and replace `{{TOKENS}}` only — never restructure CSS | +| [`report-template.html`](../templates/report-template.html) | Canonical layout. 
Copy to `~/android-oce-reports/oncall-wow-report-.html` and replace `{{TOKENS}}` only — never restructure CSS | --- diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql index 18d3fb20..a9e03506 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql @@ -3,7 +3,7 @@ // = first Sunday of the 60d window (e.g. 2026-03-08) // = end of the reporting week, EXCLUSIVE = next Sunday after the // reporting week's Sunday (e.g. for a 2026-05-03 report, use 2026-05-10) -// Output: feed to assets/bucket-trends.js with --start= (no --end needed +// Output: feed to assets/scripts/bucket-trends.js with --start= (no --end needed // because we filter the partial bucket out at the source — preferred). materialized_view('ErrorStatsMetrics') | where EventInfo_Time between (datetime() .. datetime()) diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql index 34c03364..0272daaa 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql @@ -6,7 +6,7 @@ // exceptionFromAuthorizationResult, clientExceptionFromException}. Without // reading the throw site + the dominant error_message string, you cannot tell // whether the code originated in broker code or was bridged from an eSTS -// AADSTS response. (See kusto-cheatsheet.md "AADSTS reference table".) +// AADSTS response. (See ../docs/kusto-cheatsheet.md "AADSTS reference table".) // // THIS TEMPLATE COVERS BOTH error_code AND error_type IN ONE ROUND-TRIP. // Pass an empty list for the side you don't want to slice. 
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/agg.js b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/agg.js similarity index 100% rename from .github/skills/oncall-weekly-telemetry-report/assets/agg.js rename to .github/skills/oncall-weekly-telemetry-report/assets/scripts/agg.js diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 index ccca2638..3b2c45b1 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 @@ -53,9 +53,11 @@ $ErrorActionPreference = 'Stop' # Locate the skill folder + canonical template if (-not $SkillRoot) { + # This script lives at /assets/scripts/bootstrap-report.ps1, so go up 2 levels + # to reach /assets/. Templates live at /assets/templates/. $SkillRoot = Split-Path -Parent (Split-Path -Parent $PSCommandPath) } -$template = Join-Path $SkillRoot 'report-template.html' +$template = Join-Path $SkillRoot 'templates\report-template.html' if (-not (Test-Path $template)) { throw "Canonical template not found at $template. Pass -SkillRoot if running outside the skill folder." } diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bucket-trends.js similarity index 98% rename from .github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js rename to .github/skills/oncall-weekly-telemetry-report/assets/scripts/bucket-trends.js index 892f0784..c602fba3 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/bucket-trends.js +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bucket-trends.js @@ -13,7 +13,7 @@ * | where week < datetime() // drop partial end-week! 
* | order by error_code asc, week asc * - * (Use dcount_hll on countDevicesHll, NOT sum(countDevices) — see kusto-cheatsheet.md.) + * (Use dcount_hll on countDevicesHll, NOT sum(countDevices) — see ../docs/kusto-cheatsheet.md.) * * Usage: * node bucket-trends.js diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/find-suspect-prs.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/find-suspect-prs.ps1 similarity index 100% rename from .github/skills/oncall-weekly-telemetry-report/assets/find-suspect-prs.ps1 rename to .github/skills/oncall-weekly-telemetry-report/assets/scripts/find-suspect-prs.ps1 diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/summarize-attribution.js similarity index 100% rename from .github/skills/oncall-weekly-telemetry-report/assets/summarize-attribution.js rename to .github/skills/oncall-weekly-telemetry-report/assets/scripts/summarize-attribution.js diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/validate-report.ps1 similarity index 96% rename from .github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 rename to .github/skills/oncall-weekly-telemetry-report/assets/scripts/validate-report.ps1 index 63e0a794..148871f5 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/validate-report.ps1 +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/validate-report.ps1 @@ -154,7 +154,7 @@ Write-Host "" Write-Host "Info: $sparkCount data-spark, $trendCount data-trend, $inlineSvg inline sparkline svg(s), $kpiTiles KPI tile(s)." if ($kpiTiles -ge 4 -and $sparkCount -lt [Math]::Ceiling($kpiTiles / 2)) { - Add-Fail "Only $sparkCount data-spark element(s) for $kpiTiles KPI tile(s) — over half the KPI tiles are chartless. The body was likely rebuilt without sparklines.
See template-readme.md \"Sparklines are MANDATORY\"." + Add-Fail "Only $sparkCount data-spark element(s) for $kpiTiles KPI tile(s) — over half the KPI tiles are chartless. The body was likely rebuilt without sparklines. See assets/templates/template-readme.md `"Sparklines are MANDATORY`"." } else { Pass "KPI tiles have data-spark coverage ($sparkCount/$kpiTiles)" } @@ -165,7 +165,7 @@ if ($totalCharts -lt 15) { } # ---- 7. Traffic-attribution sub-block color diversity (tri-state convention) ---- -# Per template-readme.md: each .attr-card's traffic sub-block should be green +# Per assets/templates/template-readme.md: each .attr-card's traffic sub-block should be green # (ruled out), yellow (partly contributing), or red (primary driver). If every # sub-block is the same color, the author defaulted to one and didn't actually # classify per card (v7 second-pass regression: 10/10 yellow). @@ -176,7 +176,7 @@ $taTotal = $taGreen + $taYellow + $taRed if ($taTotal -ge 4) { $distinctColors = @($taGreen, $taYellow, $taRed | Where-Object { $_ -gt 0 }).Count if ($distinctColors -le 1) { - Add-Warn "All $taTotal traffic-attribution sub-blocks share one color (g=$taGreen y=$taYellow r=$taRed). The tri-state convention exists so color carries meaning \u2014 verify each card's verdict and recolor accordingly. See template-readme.md \"Traffic-attribution sub-block on each attribution card (tri-state)\"." + Add-Warn "All $taTotal traffic-attribution sub-blocks share one color (g=$taGreen y=$taYellow r=$taRed). The tri-state convention exists so color carries meaning $([char]0x2014) verify each card's verdict and recolor accordingly. See assets/templates/template-readme.md `"Traffic-attribution sub-block on each attribution card (tri-state)`"." } else { Pass "Traffic-attribution color mix: $taGreen green / $taYellow yellow / $taRed red" } @@ -192,7 +192,7 @@ $codeAttrBlocks = ([regex]::Matches($content, '
Code $originLabels = ([regex]::Matches($content, 'class="origin-label">Originator')).Count if ($codeAttrBlocks -ge 1) { if ($originLabels -lt $codeAttrBlocks) { - Add-Fail "$codeAttrBlocks Code-attribution block(s) but only $originLabels have an Originator row. Each card needs the full 8-field structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). See assets/code-attribution-template.md." + Add-Fail "$codeAttrBlocks Code-attribution block(s) but only $originLabels have an Originator row. Each card needs the full 8-field structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). See assets/docs/code-attribution-template.md." } else { Pass "All $codeAttrBlocks code-attribution block(s) have full 8-field structure" } @@ -222,7 +222,7 @@ if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { # ---- 9. Attribution-card layout sanity (v8 regression — cards touching + dim-row bleed) ---- # Two layout bugs hit the v8 rebuild and forced manual CSS patches mid-publish. -# Both have CSS fixes baked into report-template.html now, but the validator +# Both have CSS fixes baked into assets/templates/report-template.html now, but the validator # catches the markup-side preconditions so a future hand-rolled body that # diverges from the template is flagged before publish. # @@ -236,7 +236,7 @@ if ($hasAttrCard) { $cssHasCardMargin = $content -match '(?s)\.attr-card\s*\{[^}]*margin-bottom\s*:\s*16px' ` -or $content -match '(?s)\.attr-card\s*\+\s*\.attr-card\s*\{[^}]*margin-top' if (-not $cssHasCardMargin) { - Add-Fail "Report has .attr-card elements but the CSS is missing the cards-touching guard (.attr-card { margin-bottom:16px } and/or .attr-card + .attr-card { margin-top:16px }). The v8 head rebuild dropped this — re-extract from the current assets/report-template.html." 
+ Add-Fail "Report has .attr-card elements but the CSS is missing the cards-touching guard (.attr-card { margin-bottom:16px } and/or .attr-card + .attr-card { margin-top:16px }). The v8 head rebuild dropped this — re-extract from the current assets/templates/report-template.html." } else { Pass "Attribution cards have spacing CSS" } @@ -254,7 +254,7 @@ if ($hasAttrCard) { $cssHasMinWidth = $content -match '(?s)\.dim\s*\{[^}]*min-width\s*:\s*0' ` -or $content -match '(?s)\.dim-row\s*\{[^}]*min-width\s*:\s*0' if (-not $cssHasEllipsis) { - Add-Fail "CSS is missing the .dim-row name-overflow guard (text-overflow:ellipsis on .dim-name and/or .dim-row > span:first-of-type). Long calling-app / version names will bleed out of the dim cards. Re-extract from the current assets/report-template.html." + Add-Fail "CSS is missing the .dim-row name-overflow guard (text-overflow:ellipsis on .dim-name and/or .dim-row > span:first-of-type). Long calling-app / version names will bleed out of the dim cards. Re-extract from the current assets/templates/report-template.html." } elseif (-not $cssHasMinWidth) { Add-Warn "CSS has text-overflow rules but is missing min-width:0 on .dim / .dim-row. Without it, flex children won't shrink below content size and ellipsis won't trigger inside narrow dim cards." } else { diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md b/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md index 93a60229..a91ee82a 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/README.md @@ -2,7 +2,7 @@ These are raw HTML fragments designed to be copied verbatim into the working report file and then have `{{TOKENS}}` replaced. 
The CSS classes they reference -are defined in [`../report-template.html`](../report-template.html) — do not +are defined in [`report-template.html`](report-template.html) — do not restyle per week. | File | When to use | diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/report-template.html similarity index 99% rename from .github/skills/oncall-weekly-telemetry-report/assets/report-template.html rename to .github/skills/oncall-weekly-telemetry-report/assets/templates/report-template.html index 45ef575f..513d12ea 100644 --- a/.github/skills/oncall-weekly-telemetry-report/assets/report-template.html +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/report-template.html @@ -1,10 +1,10 @@ - + not restyle per week. Authors mark unfinished sections with the literal sentinel string the validator greps for (see assets/templates/template-readme.md). --> @@ -1123,7 +1123,7 @@

Appendix