diff --git a/.github/skills/oncall-weekly-telemetry-report/SKILL.md b/.github/skills/oncall-weekly-telemetry-report/SKILL.md new file mode 100644 index 00000000..4638ee01 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/SKILL.md @@ -0,0 +1,531 @@ +--- +name: oncall-weekly-telemetry-report +description: Generate the weekly Android Broker on-call (OCE) WoW + 60-day trend telemetry report as a polished self-contained HTML file. Use this skill for the weekly OCE rotation when asked to "produce the OCE report", "weekly on-call report", "WoW telemetry report", "weekly broker health report", or "generate this week's on-call summary". Pulls from `android_spans` materialized views, attributes regressions/improvements to PRs in `broker/` and `common/`, and writes to `oncall-wow-report-YYYY-MM-DD.html` under `~/android-oce-reports/`. +--- + +# OCE Weekly Report + +Produce the weekly Android Broker on-call (OCE) telemetry report as a self-contained HTML file at `$env:USERPROFILE\android-oce-reports\oncall-wow-report-YYYY-MM-DD.html` (i.e. `~/android-oce-reports/`, outside the workspace so reports never accidentally get committed). + +The output mirrors the structure of the canonical template at [`assets/templates/report-template.html`](assets/templates/report-template.html): bootstrap a copy of it (Step 1) and edit that copy in place. Do **not** redesign the layout each week. + +**Before writing any KQL, read [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md).** It captures the canonical view names, helper functions, the HLL device-count gotcha, week-alignment rules, and ready-to-paste query templates — distilled from the production Android Broker Dashboard. + +Reusable helpers in [`assets/`](assets/): + +| File | Purpose | +|---|---| +| [`report-template.html`](assets/templates/report-template.html) | Canonical layout — a real prior-week report kept verbatim. **Edit in place** (replace dates / values / verdicts / PR links); do not restyle. 
See [`template-readme.md`](assets/templates/template-readme.md) for what to change vs leave alone. | +| [`template-readme.md`](assets/templates/template-readme.md) | Author guide for `report-template.html` — what to change per week, color palette, CSS class quick-reference | +| [`kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md) | Schemas, helper funcs, gotchas, ready-to-paste KQL templates, AADSTS reference | +| [`code-attribution-template.md`](assets/docs/code-attribution-template.md) | Per-card checklist for the deep code-attribution block (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) | +| [`queries/`](assets/queries/) | Canonical KQL templates, one file per query — see [`queries/README.md`](assets/queries/README.md). Highlights: [`attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql) (NEW — all 7 dims in one round-trip), [`error-message-and-location.kql`](assets/queries/error-message-and-location.kql) (now accepts BOTH `` and `` in one call) | +| [`templates/`](assets/templates/) | Copy-paste HTML snippets (`spike-card.html`, `traffic-attr-card.html`, `sparkline-footer.html`) | +| [`bucket-trends.js`](assets/scripts/bucket-trends.js) | Bucket all error codes into 60-day regression / spike / improvement / flat. Run with `--metric=devs` AND `--metric=reqs`. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop the partial in-progress bucket. **`--summary` suppresses the verbose header; `--json=` emits a structured sidecar for programmatic consumption.** | +| [`agg.js`](assets/scripts/agg.js) | Per-error per-dim top-N rollup with WoW deltas. Workhorse for filling spike-attribution dim blocks. | +| [`summarize-attribution.js`](assets/scripts/summarize-attribution.js) | Roll up 7-dim attribution slices for spike-attribution cards. 
Supports BOTH `--union ` (preferred for 2-week WoW; pairs with `attr-union-by-dim.kql`) AND legacy `--label= file.json` per-dim mode. **Auto-detects the array-form schema produced by `assets/scripts/run-kql.ps1` — no schema-transformer step needed.** | +| [`find-suspect-prs.ps1`](assets/scripts/find-suspect-prs.ps1) | Parallel `git log -S` + `--grep` across broker/ + common/ for a class/method symbol, with PR numbers + URLs. Run *only after* the Originator pre-check has identified a specific throw-site class — the unscoped 4-week PR window is small enough (<30 PRs) to scan with plain `git log` first. | +| [`validate-report.ps1`](assets/scripts/validate-report.ps1) | Pre-publish validator. Catches stale tokens, devs/reqs leaks, mojibake (U+FFFD), unbalanced `
` depth in Section 2 (the nested-callout bug), KPI/trend sparkline coverage, code-attribution depth, layout-guard CSS presence, and suspicious low-peak fabricated `data-trend` arrays. Run as part of Step 7. | +| [`scripts/run-kql.ps1`](assets/scripts/run-kql.ps1) | **Direct-REST Kusto helper — drop-in fallback for the Azure Kusto MCP server when the MCP times out** (frequent on per-error-code queries). Acquires a token via `az`, POSTs to `/v2/rest/query`, writes a JSON file the JS helpers can consume directly. | +| [`scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1) | Bootstrap a new week's report from the canonical template. Auto-computes the reporting Sunday, creates `_data//`, prunes `_data` folders older than 60 days, and detects "unfilled template stub" vs "real prior report" collisions using a multi-marker fingerprint (title + meta date + first KPI value + size ratio). | +| [`scripts/visual-smoke.ps1`](assets/scripts/visual-smoke.ps1) | Optional Playwright-based layout smoke test. Renders the report at 1400 px viewport, captures a full-page screenshot under `~/android-oce-reports/_visual/`, and runs DOM-based overflow + adjacent-card-gap detection. Catches the rendered-layout bugs (text bleed, cards touching) that pure HTML/CSS validation can't see. | + +--- + +## Inputs to confirm with the user + +1. **Reporting week** — **first compute the most recent complete Sun→Sat week** (Sunday bucket = the most recent Sunday strictly before today, or today itself if today is a Sunday and the week's data is at least 6h old). Default to that and proceed without asking *unless*: + - today is itself a Sat or Sun **and** the user phrasing suggests they want "this week" (e.g. "current report", "latest data"). Then ASK explicitly between the in-progress and most-recent-complete options. + - today is a Mon–Fri — just default to the most recent complete week and proceed; do not ask. 
+ + If the user picks the in-progress week: + - Add the badge text *"Live data — current bucket may still be filling"* to the report header. + - The `bucket-trends.js` `--end` flag + the `| where week < datetime()` source filter both still apply (use the Sunday AFTER the reporting week as ``); they will drop the partial-end-bucket warning. + + Note that Kusto's `startofweek()` is **Sunday-aligned**, so a user-spoken "week of May 3 → May 9" maps to the bucket `startofweek == 2026-05-03`. Off-by-one-week is the #1 silent error — verify by printing the distinct `startofweek` buckets from your first query and confirming the label matches the user's intent. +2. **Comparison baseline** — defaults to the prior complete week. +3. **60-day window** — last 8 complete weeks (drop the partial start week when computing trend deltas). +4. **Output filename** — `$env:USERPROFILE\android-oce-reports\oncall-wow-report-YYYY-MM-DD.html`, where `YYYY-MM-DD` is the **Sunday `startofweek` bucket** of the reporting week (e.g. the report for the week of May 3 → May 9, 2026 is `oncall-wow-report-2026-05-03.html`). User-scoped, outside the workspace; the date matches the Kusto bucket label used throughout the report. + +If any of these are unstated, ask once, then proceed. + +--- + +## Required sections (in order) + +1. **Top-line health KPIs** — total requests, total devices, silent-auth reliability %, interactive reliability %, p95 latency on the hot spans. WoW delta on each. Inline SVG sparklines. +2. **Things that need attention this week** — callouts: + - **Denominator caveat** — explain any large total-spans device-count shift caused by span-emission changes (e.g. `goAsync()` refactors). Always state which denominator the report uses (auth-only: `SilentAuthStats` ∪ `InteractiveAuthStats`). + - **🔴 WoW regressions (last 7 days)** — *one* callout listing every code/type that moved sharply WoW, **sorted by current-week device count descending**. 
Built from the union of (a) the standard WoW table and (b) [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) so small-but-recent spikes appear in the same list as the high-volume ones. Each row uses the `.item` flat-row pattern (see `assets/templates/template-readme.md` § "Section 2 callouts"): name + inline metric chips + tags pushed right + one-line body + optional foot with `Attribution card →` link. **Section 2 rows are at-a-glance only** — do not duplicate the dim slicing / PR analysis / detailed verdict here; that belongs in the Section 4 spike-attribution card. Each row carries tags: `NEW` (first appeared this week or last), `60d↑` (also rising on 60d), and an originator chip (`broker` / `eSTS` / `Android` / `env`). Reader's eye prioritizes naturally by row order and tag combination — broker-tagged rows at the top demand the most attention. + - **Slow-burn 60-day regressions** — codes/types climbing on the 60d window that are flat WoW. Anything that *also* moved WoW belongs in the red callout above (with `60d↑`), not here. Link to the 60-Day Trend section. + - **Real wins this week**, with PR links. + - **Traffic shape** — flat / surge / collapse summary. +3. **📈 60-Day Trend Analysis** — built from the `ErrorStatsMetrics` materialized view over the last 8 complete weeks. **Run the bucketing pipeline FOUR times — the cross-product of `{error_code, error_type} × {devices, requests}`** — and union the regression sets. An entry (code OR type) is flagged if it regresses on either metric. + + - **% of devices** affected (`devicesHit / authActiveDevices`) — catches errors hitting more users. + - **% of requests** affected (`errRequests / authTotalRequests`) — catches per-device retry storms (fewer users, more traffic per user). The previous report would have missed `kdfv2_key_derivation_error` (262 → 5,374 requests on ~57 devices) without this dim. 
+ + Categories: True 60d regression / Ephemeral 60d spike (peak-then-recover) / True 60d improvement / Flat. Every rising entry — whether `error_code` or `error_type` — gets the same Spike Attribution + Code Attribution treatment (Step 4 / Step 5). + + Always apply `MergeUiRequiredExceptions(error_type)` before bucketing on type; otherwise the 6+ string variants of `UiRequiredException` will each be tracked separately and skew the buckets. +4. **🔎 Spike Attribution** — one card per WoW regression AND per 60-day regression, **for both `error_code` and `error_type` regressions**. Each card slices on **all 7 dimensions** (broker version, span, active broker pkg, calling app, account type AAD/MSA, shared-device mode, client SKU). Each card ends with a **deep Code Attribution block** (see Step 4 for the required fields) and a Traffic Attribution verdict. +5. **🚚 Traffic Attribution** — top-level section listing every error whose spike is fully or partly explained by traffic volume from a specific calling app, rather than a code regression. If none qualify this week, render the section with an explicit "None this week" note. +6. **Error codes — WoW with stable denominator** — full table with `Δ requests %` and `Δ devices %` columns and the 60d sparkline. +7. **Error types — WoW with stable denominator** — full table, **same columns and rigor as the error-codes table** (`Δ requests %`, `Δ devices %`, 60d sparkline, status pill). Any regressing type also gets a spike-attribution card in Section 4. For composite types (e.g. `ClientException` is the umbrella for many sub-codes), include a **decomposition card** that breaks the WoW Δ down into the top 3 contributing sub-codes — so a `ClientException` −5 pp drop is explicitly attributed to e.g. `−8.5 pp timed_out_execution` + `−3.4 pp unknown_authority` + `−0.15 pp illegal_argument_exception`. +8. 
**📊 Traffic analysis** — total requests/devices (WoW + 60d), top calling apps, top spans, **requests-per-device ratio** per error and overall (a rising ratio = retry storm; a falling ratio = caching gain), sampling-rate change indicator. +9. **Latency** — p50/p95/p99 by hot span. +10. **Broker version adoption** — week-over-week version share. +11. **Appendix** — query list and methodology. + +--- + +## Step-by-step workflow + +### Step 1 — Bootstrap the new report file from the template + +This skill ships with a canonical template at [`assets/templates/report-template.html`](assets/templates/report-template.html) (a real prior week's report kept as the reference layout). **Use [`assets/scripts/bootstrap-report.ps1`](assets/scripts/bootstrap-report.ps1)** to handle all the boilerplate (Sunday-date computation, `_data//` directory, retention-pruning, collision detection): + +```pwsh +.\.github\skills\oncall-weekly-telemetry-report\assets\scripts\bootstrap-report.ps1 +# Optional: explicit reporting Sunday + force overwrite +# .\bootstrap-report.ps1 -ReportingSunday 2026-05-31 -Force +``` + +What it does: +* Computes the reporting-Sunday from the system clock (most recent complete Sun-Sat week). +* Creates `~/android-oce-reports/oncall-wow-report-.html` from the canonical template. +* Creates `~/android-oce-reports/_data//` for raw KQL JSON payloads. +* Prunes `_data//` folders older than 60 days so the cache doesn't accumulate. +* **Collision detection (the v8-hardened version):** uses a multi-marker fingerprint (title + meta-line dates + first-KPI value + size ratio) to distinguish an "unfilled template stub" (silently re-bootstrap) from a "real populated report" (HARD HALT, exit 2, require `-Force` to overwrite). The earlier single-marker (title only) version misclassified populated reports as stubs and overwrote real work. + +Edit the bootstrapped file in place — the template ships as a real prior-week report (not a tokenized skeleton). 
**Walk top-to-bottom and replace every prior-week date / KPI value / table row / verdict / PR citation with current-week data.** The CSS, sparkline JS, section ordering, and attribution-card markup are canonical — do not redesign them. See [`assets/templates/template-readme.md`](assets/templates/template-readme.md) for the full guide on what to change vs leave alone, the sparkline color palette, the CSS class reference, and the two v8 layout traps. + +> **⚠️ UTF-8 trap — DO NOT use PowerShell `@'...'@` heredocs to compose HTML content containing emojis, em-dashes, arrows, or middle dots.** PowerShell silently strips multi-byte UTF-8 characters when piping heredocs to `Set-Content` / `Out-File`. Use Node.js (`fs.writeFileSync`), `[IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))`, or explicit Unicode-pair literals (`[char]0xD83D + [char]0xDCCA` for 📊) instead. This trap cost ~30 min in v8 and required a full emoji-restoration pass — every callout icon, every section header emoji, every arrow link had to be re-injected. The validator's `U+FFFD` check catches the worst case (mojibake replacement char) but cannot detect characters that were silently stripped to nothing. + +Mark any unfinished card or table cell with the literal sentinel `EXAMPLE CONTENT BELOW` inside an HTML comment — the final-pass validator (Step 7) greps for it. + +If the template ever needs structural improvements (new section, new card style, etc.), update `assets/templates/report-template.html` in the skill folder and commit it so future weeks inherit the change. 
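
The reporting-week arithmetic that the bootstrap step automates can be sketched in Node.js, the runtime the `assets/scripts/*.js` helpers already use. This is a minimal sketch and not the actual logic of `bootstrap-report.ps1`: the function name `reportingWeek` is hypothetical, it takes "most recent complete Sun→Sat week" literally, it works in UTC to line up with Kusto's Sunday-aligned `startofweek()`, and it ignores the 6-hour data-age condition.

```javascript
// Hypothetical helper (not part of the skill): derive the most recent
// complete Sun→Sat week from a given instant, working purely in UTC.
const DAY_MS = 24 * 60 * 60 * 1000;

function reportingWeek(now = new Date()) {
  // Midnight UTC of the current day.
  const todayUtc = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate());
  // getUTCDay(): 0 = Sunday ... 6 = Saturday. Stepping back that many days
  // lands on the Sunday that opens the current (possibly partial) week.
  const endExclusive = todayUtc - now.getUTCDay() * DAY_MS;
  // The seven days before that Sunday form the most recent complete week.
  const start = endExclusive - 7 * DAY_MS;
  const iso = (ms) => new Date(ms).toISOString().slice(0, 10);
  return {
    start: iso(start),               // Sunday bucket label of the reporting week
    endExclusive: iso(endExclusive), // Sunday AFTER the reporting week
    fileName: `oncall-wow-report-${iso(start)}.html`,
  };
}

console.log(reportingWeek(new Date('2026-05-13T12:00:00Z')));
```

For Wednesday 2026-05-13 this yields `start` 2026-05-03 and `endExclusive` 2026-05-10, the same pair used in the examples elsewhere in this skill. Printing both values before running any query is a cheap guard against the off-by-one-week error.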
+ +### Step 2 — Pull WoW reliability data + +Use the Kusto MCP tool against: +- **Cluster:** `https://idsharedeus2.kusto.windows.net` +- **Database:** `ad-accounts-android-otel` + +**Always prefer the canonical `materialized_view('XxxMetrics' or 'XxxUpdated')` variants** — these are what the production dashboard uses, are pre-aggregated and HLL-bucketed, and avoid the 240 s MCP timeout that plain `android_spans` queries hit. Full schema, gotchas, and query templates: [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md). + +> **Fallback when the Kusto MCP times out:** use [`assets/scripts/run-kql.ps1`](assets/scripts/run-kql.ps1). It acquires a token via `az account get-access-token`, POSTs directly to `/v2/rest/query`, and writes the result as a JSON file the JS helpers (`bucket-trends.js`, `summarize-attribution.js`) can consume directly. The skill's MCP-vs-REST switch is roughly: try the MCP once; if it returns `McpError -32001 (timeout)`, switch to the REST helper for the rest of the run. Run multiple queries in parallel via PowerShell `Start-Job`: +> +> ```pwsh +> $queries = @{ 'reliability.json' = $reliabilityKql; '60d-codes.json' = $codesKql; ... 
} +> $jobs = @() +> foreach ($f in $queries.Keys) { +> $q = $queries[$f] +> $jobs += Start-Job -ScriptBlock { +> param($Q, $O) & "$using:skillRoot\assets\scripts\run-kql.ps1" -Query $Q -Out $O +> } -ArgumentList $q, $f +> } +> $jobs | Wait-Job | Receive-Job; $jobs | Remove-Job +> ``` + +| Need | View | +|------|------| +| Per-error-code / per-error-type / per-span counts | `materialized_view('ErrorStatsMetrics')` | +| Total broker requests / devices | `materialized_view('BrokerAdoptionStatsUpdated')` | +| Silent auth reliability | `SilentAuthStatsAllRequestsMetrics` + `SilentAuthStatsRequestsWithoutExpectedErrorMetrics` | +| Interactive auth reliability | `InteractiveAuthStatsAllRequestsMetrics` + `InteractiveAuthStatsRequestsWithoutExpectedErrorMetrics` | +| Latency (p50/p95/p99) | `materialized_view('PerfStatsUpdated')` — use `percentile_tdigest(tdigest_merge(responseTimeTDigest), N, typeof(long))` | +| Broker version share | `BrokerAdoptionStatsUpdated` | +| Calling app share | `AppStatsUpdated` | +| SKU share | `SkuStatsUpdated` | +| Spike-by-flight slicing | `Operations_ByFlight`, `ErrorCodeBySpan_ByFlight`, `ErrorType_ByFlight` | + +Time filter: always use `EventInfo_Time` on materialized views. Use `PipelineInfo_IngestionTime` only on raw `android_spans`. + +**Three rules that will silently corrupt your data if violated** (full detail in the cheatsheet): + +1. **Distinct devices are HLL-encoded.** Use `dcount_hll(hll_merge(countDevicesHll))`, never `sum(countDevices)`. Summing double-counts every device that appears in more than one row. +2. **Apply the dashboard helper functions** so this report agrees with the dashboard: `MergeAccountType(account_type)`, `MergeIsSharedDevice(is_shared_device)`, `MergeUiRequiredExceptions(error_type)`. +3. **Auth-only denominator for reliability %s:** sum `countRequests` from `SilentAuthStatsAllRequestsMetrics` ∪ `InteractiveAuthStatsAllRequestsMetrics` — not total broker spans. 
Total span counts are sensitive to `goAsync()` / receiver refactors and will give false WoW reliability swings. + +### Step 3 — Pull 60-day trend + +Don't pre-filter to a hand-picked top-N list — small-but-rising errors (e.g. `null_pointer_error` at ~67K devices) will fall off and never show up in the trend section. Instead pull every error code **and every error type** with a meaningful baseline across the window, then bucket each. + +#### 3a. Per-error-code trend + +Use [`assets/queries/60d-trend-codes.kql`](assets/queries/60d-trend-codes.kql) (template; replace `` and `` tokens — `` is **exclusive** and equals the Sunday AFTER the reporting week, e.g. for a 2026-05-03 report use `2026-05-10`): + +```kql +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time between (datetime() .. datetime()) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| where week < datetime() // drop partial in-progress week at the source +| order by error_code asc, week asc +``` + +**The `| where week < datetime()` line is mandatory.** Without it, if Kusto has crossed midnight UTC into the next Sunday, a tiny partial bucket lands as `last` and turns every code into a fake −99% improvement. `bucket-trends.js` will also auto-detect and warn about this, but filtering at the source is preferred. + +#### 3b. Per-error-type trend (same rigor) + +```kql +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| where EventInfo_Time between (datetime() .. 
datetime()) +| where isnotempty(unified_error_type) +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), unified_error_type +| where week < datetime() +| order by unified_error_type asc, week asc +``` + +`MergeUiRequiredExceptions` is mandatory — without it the 6+ string variants of `UiRequiredException` (raw, fully-qualified, com.microsoft.identity.common.exception.*) each show as separate rows and skew the buckets. + +#### 3c. Run the bucketer 4 times (cross-product of `{code, type} × {devices, requests}`) + +`bucket-trends.js` defaults to grouping by `error_code`. For the type runs you MUST pass `--key=unified_error_type` so it picks up the right column from the type-trend JSON. + +```pwsh +# Error codes — by devices, then by requests +node .github\skills\oncall-weekly-telemetry-report\assets\scripts\bucket-trends.js --start=2026-03-08 --end=2026-05-10 +node .github\skills\oncall-weekly-telemetry-report\assets\scripts\bucket-trends.js --start=2026-03-08 --end=2026-05-10 --metric=reqs + +# Error types — by devices, then by requests (note --key) +node .github\skills\oncall-weekly-telemetry-report\assets\scripts\bucket-trends.js --start=2026-03-08 --end=2026-05-10 --key=unified_error_type +node .github\skills\oncall-weekly-telemetry-report\assets\scripts\bucket-trends.js --start=2026-03-08 --end=2026-05-10 --key=unified_error_type --metric=reqs +``` + +`--end` is the Sunday AFTER the reporting week (exclusive). The script also auto-detects partial end-buckets and warns, but passing `--end` explicitly is safer. + +Take the **union** of all four regression sets. Both `error_code` and `error_type` regressions get a spike-attribution card in Step 5. + +The script prints regression / spike / improvement / flat buckets, sorted by peak. The thresholds, in case you need to tune them: + +- **True 60d regression:** `delta > +15%` and trajectory is monotonic-ish (no single-week spike dominating). 
+- **Ephemeral 60d spike:** peak week is ≥3× the mean of the surrounding weeks (peak-then-recover shape). +- **True 60d improvement:** `delta < −15%`. +- **Flat:** otherwise. +- Codes/types with peak weekly devices `< 10K` (or peak weekly requests `< 100K` when `--metric=reqs`) are filtered out (`--peak-floor=N` to override). + +**Why both axes matter:** +- *codes × requests:* in v5, `kdfv2_key_derivation_error` spiked +1,951% on requests across only ~57 devices — a per-device retry storm device-only bucketing would have missed. +- *types × either:* `error_type` is the umbrella (e.g. `ClientException`, `ServiceException`, `UiRequiredException`) — a moving type that doesn't map cleanly to one moving code is a strong signal of a *new* sub-code being introduced or an existing one being reclassified (the v5 `ClientException` −10% drop was driven by `timed_out_execution` reclassification under PR #141, which would have been invisible from the codes table alone). + +**Always present side-by-side WoW tables for BOTH error_code AND error_type** with `Δ requests %` and `Δ devices %` columns; flag any row where either crosses threshold. + +#### 3d. WoW movers query — MANDATORY pass to catch small-base movers + +The 60d bucketer's `--peak-floor=10000` exists for good reason (otherwise the 60d regression list would be 200+ tiny noise codes), but it **silently drops every code whose absolute weekly volume stays under 10K** — even if that code is brand-new or just spiked 5× WoW. Real examples this skill has missed in the past: + +- `Failed to parse JWT` — went `7 → 32 → 54 → 46 → 55 → 892 → 3,461` over 7 weeks (2-week-old NEW spike, real broker code in `IDToken.parseJWT:38`). Never crossed the 10K floor. +- `Code:-11` — sat at ~1,030 devs/wk for 7 weeks then jumped to 2,433 (+165% WoW). Sub-floor. +- `SSLHandshakeException` — devices flat at 260 but requests +186% WoW (per-device retry storm). 
The bucketer's reqs-axis floor (100K) just barely captures it but the device floor doesn't. + +To catch these, **always** run [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) **as a separate pass after the 60d bucketing**: + +```kql +// inputs: = reporting-week Sunday, = next Sunday (excl), +// = baseline-week Sunday +// floor: cDev>=500 OR cReq>=5000 move: |Δd|>=25% OR |Δr|>=50% OR new-this-week +``` + +Run it **twice — once for `error_code`, once for `error_type`**. **Merge its output rows into the same 🔴 WoW regressions callout as the standard WoW table** (sorted by current-week device count descending). Tag rows that came in via this pass with `NEW` if they were absent or near-zero in the prior week. Do *not* render this as a separate "emerging" callout — the size split is implementation detail; readers prioritize naturally by absolute device count + originator chip. + +For each WoW mover (regardless of size), you still owe the full Code Attribution treatment (Step 4). The dim-slicing pass (Step 5) is allowed to be deferred for sub-1K-device spikes if the throw-site + dominant message already pin the originator unambiguously — but say so explicitly in the card ("dims not yet sliced — file the bug first; pull dims if it persists"). + +### Step 4 — Code attribution (deep PR correlation) + +> ⚠️ **HARD RULE — Originator pre-check.** Before claiming `Originator: Broker` on any card, you MUST run [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) for that error code (or type) and read **(a) the throw-site stack and (b) the top 3 `error_message` strings**. Most broker error codes flow through `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult, clientExceptionFromException}` — which intentionally bridge eSTS responses into broker exceptions. 
**If the throw site is in any of those three methods AND the error_message starts with `AADSTS`, the originator is eSTS, not broker.** See the AADSTS reference table in [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md). Cards that skip this step must be marked low-confidence, not high. +> +> **Window:** use the FULL 7-day reporting window (`` → ``) on `PipelineInfo_IngestionTime`, NOT a narrower 3–5 day slice — low-volume types (e.g. `SSLHandshakeException`, `IntuneAppProtectionPolicyRequiredException`) routinely return zero rows in a sub-week window. If a code/type still returns nothing, fall back to the prior 14 days before declaring "no data". + +For every regression card, the Code Attribution block **must** populate the following fields. Shallow PR-citation only is not acceptable. Use [`assets/docs/code-attribution-template.md`](assets/docs/code-attribution-template.md) as the per-card checklist. + +| Field | What goes in it | How to find it | +|---|---|---| +| **Originator** | Where the error physically originates: broker code / common / Android system (WebView / Conscrypt / Keystore) / 3rd-party lib (Nimbus JWT, okhttp) / eSTS server / environmental (enterprise TLS interception). Use the colour-coded `origin-tag` spans (`origin-broker`, `origin-android`, `origin-thirdparty`, `origin-env`). | Grep the error string across `broker/`, `common/`, `msal/`. If no match, it's not our code — search the Android SDK or call out as eSTS-returned. | +| **Top throw site** | Fully-qualified file:line where the exception is constructed, plus the % of cases that throw from this single site. | Pull `error_location` / stack-prefix from `android_spans` for the spiking error code (one targeted query, narrow time window). Cite the dominant site. | +| **Wrapper** | Broker/common code that catches the originator's exception and re-throws it as the user-visible error code. 
Often `IDToken.parseJWT()`, `ServiceException(...)`, `ExceptionAdapter.exceptionFromAuthorizationResult()`. | Walk up the stack from the throw site — check for `try { ... } catch (X e) { throw new Y(...); }` patterns in broker/common. | +| **Caller hot-spots** | Top 1–3 callers of the wrapper, with device counts. Helps identify the specific code path the regression flows through. | `android_spans` slice by `error_location` (or `error.stack_trace` first frame inside our code). | +| **Underlying cause** | The proximate cause one level deeper (e.g. "99% `CertificateException` from `TrustManagerImpl.verifyChain`", "84% `no_such_algorithm` from `ProviderFactory.getMessageDigest`"). | `android_spans` slice by `error.cause` or `error_message` first 80 chars. | +| **Top error_messages** | Top 3–5 distinct `error_message` strings with counts. Often reveals the 3rd-party library or environmental signal (e.g. `net::ERR_SSL_PROTOCOL_ERROR`, Zscaler-issued cert names). | `summarize count() by tostring(error_message)` on raw `android_spans` filtered to the spike. | +| **Likely PRs** | 1–3 PRs with confidence rating (high / medium / low / none), full GitHub URL, commit SHA, author, AB#, and a 1-sentence **why-it's-the-suspect** justification (not just the title). Use the `pr-card` markup. | See PR-grep below. **Cite confidence honestly** — "none" is a valid verdict for environmental errors. | +| **Next step** | Concrete action with a named owner: who runs the next slice, who files the bug, what flight to flip, what correlation IDs to pull. | Pulled from PR authors / CODEOWNERS for the affected file. | + +#### PR-grep workflow + +**Read the full PR window first, then reason — don't `--grep` blind.** The 4-week window across `broker/` and `common/` typically returns <30 PRs total, small enough to read end-to-end. Targeted `--grep` matches will miss PRs whose titles don't mention the error string (most of them). **The recommended order is:** + +1. 
**Run plain `git log` on both repos** for the 4-week window. Read the resulting list end-to-end before any greps. +2. **Cross-reference titles + dates** against the Originator pre-check throw-site class. +3. **Only when you have a specific symbol** to chase (e.g. the throw-site class identified in step 2), reach for `find-suspect-prs.ps1` to do the symbol-targeted parallel pickaxe + grep. + +The historical mistake (pre-v8) was to jump straight to `find-suspect-prs.ps1` without reading the window first, which silently dropped PRs whose titles didn't mention the symbol. + +```pwsh +# Step 1: read the full 4-week window +cd c:\Users\shjameel\Repos\android-complete\broker +git --no-pager log --since='' --until='' --pretty=format:'%h | %ai | %an | %s' --no-merges + +cd ..\common +git --no-pager log --since='' --until='' --pretty=format:'%h | %ai | %an | %s' --no-merges +``` + +For each candidate PR, **read the diff** to confirm it touches the throw site / wrapper class identified in the Originator pre-check. Don't cite a PR just because the title mentions a related concept. + +```pwsh +# Step 3 (optional): symbol-targeted focused follow-up. Use ONLY after step 1 gave +# you a specific class/method name to chase from the Originator pre-check. +# Searches both repos in parallel via `git log -S` (pickaxe on diff) AND `--grep` (subject). +# Returns a unified table: repo | date | author | sha | PR# | URL | subject. 
+.\.github\skills\oncall-weekly-telemetry-report\assets\scripts\find-suspect-prs.ps1 ` + -Symbol 'ExceptionAdapter' -Since 2026-04-01 -Until 2026-05-09 +``` + +#### Repo URL patterns for citations + +| Repo | URL pattern | +|------|-------------| +| `common/` | `https://github.com/AzureAD/microsoft-authentication-library-common-for-android/pull/` | +| `broker/` | `https://github.com/identity-authnz-teams/ad-accounts-for-android/pull/` | +| `msal/` | `https://github.com/AzureAD/microsoft-authentication-library-for-android/pull/` | +| `adal/` | `https://github.com/AzureAD/azure-activedirectory-library-for-android/pull/` | + +#### Non-broker errors + +For errors with no broker code in the stack (Android system errors like `Code:-10`/`Code:-11`, OEM-specific keystore failures, eSTS-returned codes, environmental TLS interception), explicitly cite **"⚪ None — not in scope"** with confidence `none`, and explain *why* in the why-it's-the-suspect line. Do not invent broker PRs to fill the slot. Tag these errors as `environmental` or `non-broker` so they're tracked but don't page. + +### Step 5 — Spike attribution dimensions + +**Coverage rule: every `error_code` AND every `error_type` that lands in either the WoW regression list OR the 60-day regression list MUST get a spike-attribution card.** No silent skips. + +**`ErrorStatsMetrics` already carries `account_type` and `is_shared_device`** (use the `MergeAccountType` / `MergeIsSharedDevice` helpers to normalize) — so you do **not** need a fallback to raw `android_spans` for these dims. Earlier versions of this skill claimed otherwise; that was wrong. The only dim that requires `android_spans` is `DeviceInfo_OsVersion` (OEM/version slicing). + +Slice on **all 7 dimensions** for each spike. **Preferred for 2-week WoW attribution: one union query that covers all 7 dims for all regressions in a single round-trip** — see [`assets/queries/attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql). 
Typical payload for 8 codes × 2 weeks × 7 dims is ~800 KB, well under the MCP limit. Pipe the result into `summarize-attribution.js --union ` (which prints per-dim top-N share + Δ devices + Δ requests for every code). Fall back to the per-dim form ([`attr-codes-by-dim.kql`](assets/queries/attr-codes-by-dim.kql)) only when (a) you need a wider time window, or (b) the union response exceeds payload size. + +For `error_type` cards, swap `error_code in (codes)` for `unified_error_type in (types)` and aggregate by the `MergeUiRequiredExceptions(error_type)` extension — otherwise everything else is identical. + +> **Low-volume fallback (extends Step 4's pre-check fallback to the 7-dim union):** when a code/type returns sparse dim rows in the 7-day reporting window — typical for sub-1k-device entries like `TimeoutCancellationException`, `JsonSyntaxException`, `kdfv2_key_derivation_error` — widen the union query to **14 days** (`` = baseline-week Sunday − 7d) before declaring "broad — needs targeted slice". The added week of context usually surfaces enough rows to compute concentration percentages. If a code STILL has no concentration after 14 days, mark every dim cell as "not sliced — sub-week volume; file the bug first, slice on persistence" — do NOT fabricate "Broad" verdicts. + +| # | Dimension | Source | Cross-check | +|---|-----------|--------|-------------| +| 1 | Broker version | `ErrorStatsMetrics` group by `broker_version` | Cross-reference `BrokerAdoptionStatsUpdated` to see if the version's request share *also* moved that week — if yes, the spike is rollout-driven, not code-driven | +| 2 | Span name | `ErrorStatsMetrics` group by `span_name` | A single span hosting >60% of the error → strong code-path signal | +| 3 | Active broker package | `ErrorStatsMetrics` group by `active_broker_package_name` | E.g. 
CompanyPortal vs Authenticator vs LTW | +| 4 | Calling package | `ErrorStatsMetrics` group by `calling_package_name` | If 1–2 callers dominate, this is likely a traffic-attribution case (see Step 6) | +| 5 | Account type (AAD vs MSA) | `ErrorStatsMetrics`, `extend t = MergeAccountType(account_type)` group by `t` | If the split deviates significantly from fleet (~85% AAD / 15% MSA), call it out | +| 6 | Shared device mode | `ErrorStatsMetrics`, `extend s = MergeIsSharedDevice(is_shared_device)` group by `s` | Shared-device fleets have very different error profiles | +| 7 | OS version | [`assets/queries/os-version-slice.kql`](assets/queries/os-version-slice.kql) — raw `android_spans`, group by `DeviceInfo_OsVersion` | **On-demand only** — slice OS-version when EITHER (a) the wrapper class is in `ExceptionAdapter.clientExceptionFromException` (catch-all wrapping a system exception, where the OEM/version often is the cause), OR (b) the error code is one of `Code:-6`, `Code:-10`, `Code:-11`, `unknown_crypto_error`, `io_error`, `null_pointer_error`. Otherwise mark the dim row as "not sliced this week — no OEM concentration suspected" and move on. Slicing OS-version on every card wastes a raw-spans query without changing the verdict. | + +#### Type cards have one extra required dimension: sub-code decomposition + +Because `error_type` is an umbrella over many `error_code` values, every `error_type` regression card MUST also include an **8th dimension: sub-code breakdown** showing the top 3–5 `error_code`s rolled up under that type, with their device counts and Δ vs prior week. This lets the reader see whether the type-level move is driven by one sub-code or many — and routes the deep Code Attribution work to the right sub-code. 
+ +```kql +let target_types = dynamic(['ClientException', 'ServiceException']); +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| where EventInfo_Time > ago(14d) +| where unified_error_type in (target_types) +| extend wk = startofweek(EventInfo_Time) +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by wk, unified_error_type, error_code +| order by unified_error_type asc, wk asc, devs desc +``` + +Cite the dominant sub-codes inline in the type card's verdict (e.g. *"`ClientException` −10.2% drop is dominated by −8.5 pp `timed_out_execution` + −3.4 pp `unknown_authority`"*) and link to those sub-codes' own attribution cards. The deep Code Attribution block (Step 4) for the type card itself focuses on the **wrapper / catch-and-rethrow** path that defines the type (e.g. `BaseException.java`, `ServiceException.java` constructors), not on each sub-code. + +Feed the union JSON output into the summarizer (one round-trip): + +```pwsh +# Union mode (preferred). attr-union.json comes from attr-union-by-dim.kql. +node .github\skills\oncall-weekly-telemetry-report\assets\scripts\summarize-attribution.js ` + --union attr-union.json --top=5 +# For type cards, add --key=unified_error_type +``` + +Legacy per-dim mode (one JSON per dimension) is still supported for the rare wider-time-window case: + +```pwsh +node .github\skills\oncall-weekly-telemetry-report\assets\scripts\summarize-attribution.js ` + --label=span span.json ` + --label=calling_app app.json ` + --label=active_broker ab.json ` + --label=broker_version ver.json ` + --label=acct_type acct.json ` + --label=shared_dev shared.json ` + --label=client_sku sku.json +``` + +Ready-to-paste KQL for both forms: union → [`assets/queries/attr-union-by-dim.kql`](assets/queries/attr-union-by-dim.kql); per-dim → [`assets/docs/kusto-cheatsheet.md` § 8c](assets/docs/kusto-cheatsheet.md). 
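
The percentage-point figures cited in a type card's verdict can be derived from the sub-code query output. The sketch below is one way to do it, not part of the bundled scripts: the function name `subCodeContributions` and the row shape `{ wk, error_code, errs }` are hypothetical, and it assumes the KQL rows have already been exported to a JavaScript array. It works on request counts (`errs`) because those are additive. Device counts come from HLL merges and do not sum across sub-codes, so a device-based pp figure should be treated as approximate.

```javascript
// Hypothetical helper (not one of the bundled scripts): split a type-level
// week-over-week change into per-sub-code percentage-point contributions.
// Each contribution is (curr - prev) / prevTotal * 100, so the contributions
// of all sub-codes add up to the type-level % change in requests.
function subCodeContributions(rows, prevWeek, currWeek, top = 5) {
  const byCode = new Map();
  let prevTotal = 0;

  for (const { wk, error_code, errs } of rows) {
    if (wk !== prevWeek && wk !== currWeek) continue; // ignore other weeks
    const entry = byCode.get(error_code) ?? { prev: 0, curr: 0 };
    if (wk === prevWeek) {
      entry.prev += errs;
      prevTotal += errs;
    } else {
      entry.curr += errs;
    }
    byCode.set(error_code, entry);
  }

  // No baseline volume means no meaningful percentage; report nothing.
  if (prevTotal === 0) return [];

  return [...byCode.entries()]
    .map(([error_code, { prev, curr }]) => ({
      error_code,
      prev,
      curr,
      pp: ((curr - prev) / prevTotal) * 100,
    }))
    .sort((a, b) => Math.abs(b.pp) - Math.abs(a.pp)) // biggest movers first
    .slice(0, top);
}
```

Sorting by absolute contribution keeps both the sub-codes driving a regression and the ones offsetting it near the top, which is what the verdict line needs.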
+ +**Concentration thresholds** (paint the dim bar red): +- > 80% in a single value → strong attribution (one root cause) +- 60–80% → medium attribution +- < 60% → broad / cross-cutting → say so explicitly, don't fabricate a single cause + +### Step 6 — Traffic analysis + traffic attribution + +Do this section in three parts. Traffic changes (up *or* down) need the same level of root-cause reasoning as error spikes — a uniform "−9% requests across all top apps with flat devices" is **not** a satisfactory verdict on its own; explain *why*. + +**6a. Top-line traffic shape.** Compare WoW *and* 60d for both totals and per-segment: + +```kql +materialized_view('BrokerAdoptionStatsUpdated') +| where EventInfo_Time > ago(70d) +| summarize totalReq = sum(countRequests), + totalDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time) +| order by week asc +``` + +For each of the following, report direction + magnitude: +- Total requests (WoW %, 60d %) +- Total devices (WoW %, 60d %) +- Requests-per-device ratio (a drop often means a benign caching improvement; a spike often means a retry storm) +- Top 10 calling apps (`AppStatsUpdated`) — which apps drove the change? +- Top spans by request volume — did one span explode or collapse? +- Sampling-rate change indicator: if total spans moved >20% but auth-only device count moved <5%, suspect a sampling/instrumentation change. + +**6b. Reasoning for material traffic shifts (>10% on any segment).** For every span/app/active-broker that moved meaningfully WoW *or* 60d, run this slicing-and-correlation pass: + +| # | Question | How to check | +|---|---|---| +| 1 | **Is the move concentrated in one span?** | Slice top-10 spans by `Δreq` absolute and `Δreq %`. A >50% move on a single span almost always points to a code change (span added / removed / sampled / `goAsync()`-ed). | +| 2 | **Is the move concentrated in one calling app?** | Slice `AppStatsUpdated` WoW. 
A single app moving >20% in requests with flat devices = client-side caching/retry change in that app — escalate to that app's owners, not broker. | +| 3 | **Is the move concentrated in one active broker pkg?** | Slice `BrokerAdoptionStatsUpdated` by `active_broker_package_name`. AppManager (LTW) vs Authenticator vs Intune CP often diverge during a rollout. | +| 4 | **Is the move concentrated in one broker version?** | Cross-check against rollout share. If a span dropped −80% on `16.0.1` but is flat on `15.1.0`, the cause is in the 16.0.1 diff. | +| 5 | **Did anything else co-move?** | A span dropping while `OnUpgradeReceiver`-style downstream spans also drop (`SecretKeyWrapping`, `WrappedKeyAlgorithmIdentifier` in v5) confirms a single upstream change. | + +For every meaningful shift, **search for a causal PR** in the repos likely to affect telemetry shape: + +```pwsh +# Broker (span add/remove, goAsync, scope changes, sampling/exporter config) +cd c:\Users\shjameel\Repos\android-complete\broker +git log --since='' --oneline -i ` + --grep='span|goAsync|receiver|telemetr|otel|trace|metric|sampl|exporter' + +# Common (instrumentation surfaces) +cd ..\common +git log --since='' --oneline -i ` + --grep='span|telemetr|otel|trace|sampl|instrument' +``` + +**Causal PR categories that meaningfully shift traffic counts** (flag any of these): + +- **Span removed / renamed / scope-narrowed** → drops the span's count to zero or partial +- **`goAsync()` / `BroadcastReceiver` refactor** → broadcast may complete before async work flushes the span (this is the v5 PR #88 / `OnUpgradeReceiver` story — call it out as a precedent) +- **Sampling-rate change** in broker `Otel*` / `Telemetry*` exporter config or `common/` instrumentation → uniformly scales counts up or down across many spans +- **New span added** in a hot path → request counts for that span jump from ~0 to material +- **Caller-side SDK change** (MSAL/MSAL_CPP/OneAuth release) that batches or caches requests → uniform 
per-app request drop with flat devices +- **Flight rollout** (ECS) that gates a code path on/off → bursty changes in a specific span on specific dates + +Cite the suspect PR(s) with the same confidence ratings used in Code Attribution (high / medium / low / none) and the same `pr-card` markup. If you can't pin one down, say so explicitly — *"uniform 5–22% per-app request drop with flat devices, no telemetry-platform PR identified, suspect caller-side SDK change in MSAL release X.Y"* is acceptable; "traffic is flat" without checking is not. + +**6c. Per-error traffic attribution (is the *error* spike traffic-driven?).** For every error code flagged in Step 5 as a regression, additionally check whether the spike is *traffic-driven* rather than *failure-rate-driven*: + +```kql +let target_code = ""; +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time > ago(14d) and error_code == target_code +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), calling_package_name +| order by week asc, devs desc +``` + +If the spike is concentrated in a single calling app whose **overall** request volume also rose that week (cross-check `AppStatsUpdated`), and the **per-request failure rate is essentially flat**, classify the spike as a **traffic-attribution case** rather than a code regression: + +> Example: "`no_account_found` +60% devices this week is fully explained by Outlook's request volume rising 65% — the per-Outlook-request failure rate is unchanged. No broker code change is implicated." + +Add a top-level **🚚 Traffic Attribution** section that lists every error matched to a traffic-driven origin, mirroring the Code Attribution section. **Each card must include**: the dominant calling app(s) with their WoW request-volume delta, the per-app per-request failure rate (now vs prior — show it's flat), and the recommended owner to route to (typically the calling app's team, not broker). 
If no errors qualify in a given week, render the section with an explicit "None this week" note rather than omitting it. + +### Step 7 — Validate & write + +Run the bundled validator FIRST — it covers all the silent-failure cases this skill has tripped on in the past: + +```pwsh +.\.github\skills\oncall-weekly-telemetry-report\assets\scripts\validate-report.ps1 +# defaults to most-recent oncall-wow-report-*.html under ~/android-oce-reports/ +# pass -Path explicitly to validate a specific file +``` + +The validator hard-fails on: +1. Stale `{{...}}` tokens or `EXAMPLE CONTENT BELOW` / `EXAMPLE_*` sentinels. +2. `devs` / `reqs` in user-facing text (KQL inside `
` is exempted).
+3. `U+FFFD` replacement characters (catches mojibake from emoji edits).
+4. Unbalanced `
` depth in the Section 2 attention block (catches the inception-style nested-callout bug from past runs). +5. A second callout opening before the previous one closes (nested-callout sanity check). +6. **Chartless KPI grid** — if more than half the `.kpi` tiles lack a `data-spark` element (catches the v7 regression where the body was rebuilt without sparklines). Also warns when total chart count (sparks + trends + inline svgs) is < 15. +7. **Code-attribution depth** — each `.attr-card`'s "Code attribution" block must contain an `Originator` row (proxy for the full 8-field structure: Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). Catches the v7-third-pass regression where cards shipped with a `pr-list`-only stub. +8. **Attribution-card layout guards (v8)** — the CSS must define `.attr-card { margin-bottom: 16px }` AND `.dim-row` overflow rules (`text-overflow: ellipsis` + `min-width: 0`). Catches the "cards touching" and "text bleeding out of dim boxes" regressions from a stale `` block. +9. **Fabricated-sparkline heuristic (v8)** — warns when a `data-trend` array's peak value is < 100 (almost certainly hand-rolled rather than sourced from real data). See [`assets/queries/wow-table-sparkline-series.kql`](assets/queries/wow-table-sparkline-series.kql) for the canonical KQL that pulls real 8-week series for every code in the WoW tables. 
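
To make check 9 concrete, here is a rough sketch of a peak-value heuristic over `data-trend` attributes. It is an illustration only: the real logic lives in `validate-report.ps1`, the function name `findSuspectTrends` is hypothetical, and the attribute format (a comma-separated list, optionally wrapped in square brackets) is an assumption. It only reads attributes and never edits the HTML.

```javascript
// Hypothetical illustration of the fabricated-sparkline heuristic.
// Assumes attributes look like data-trend="12,40,95" or data-trend="[12,40,95]".
function findSuspectTrends(html, peakFloor = 100) {
  const suspects = [];
  const attr = /data-trend="([^"]*)"/g;

  for (const match of html.matchAll(attr)) {
    const values = match[1]
      .replace(/[\[\]]/g, '')   // tolerate JSON-style brackets
      .split(',')
      .map((v) => v.trim())
      .filter((v) => v !== '')  // skip empty entries
      .map(Number)
      .filter(Number.isFinite); // skip anything non-numeric

    if (values.length === 0) continue;

    const peak = Math.max(...values);
    // A series that never reaches the floor is more likely hand-rolled
    // than pulled from real telemetry, so flag it for review.
    if (peak < peakFloor) suspects.push({ raw: match[1], peak });
  }
  return suspects;
}
```

Any flagged series should be re-pulled from the canonical sparkline query rather than adjusted by hand.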
+ +Then: +- **Run the visual smoke test (recommended)** — catches rendered-layout bugs that pure HTML/CSS validation can't see: + + ```pwsh + .\.github\skills\oncall-weekly-telemetry-report\assets\scripts\visual-smoke.ps1 + # Opens the report at 1400px in headless Chromium via Playwright, captures a + # full-page screenshot to ~/android-oce-reports/_visual/, and runs DOM-based + # checks for: + # - element overflow inside .dim / .attr-card (catches "text bleeding out") + # - adjacent .attr-card pairs with gap < 8px (catches "cards touching") + # First run auto-installs Playwright + Chromium into %LOCALAPPDATA%\oce-skill-playwright + ``` +- Run `get_errors` on the HTML file (no errors expected — pure HTML/CSS). +- Verify no stale phrases from prior weeks remain (`Select-String` for retracted hypotheses, prior week's PR numbers). +- Verify every PR link in the new file is reachable (the file paths just before the link should match what `git log` returned). + +--- + +## Hard rules + +- **Never `sum(countDevices)`.** Always `dcount_hll(hll_merge(countDevicesHll))`. Summing the per-row distinct count double-counts. +- **Always wrap view names in `materialized_view('Xxx')`** and use the canonical `Metrics`/`Updated` variants (see cheatsheet § 2). +- **Never sum percentiles.** Latency is a TDigest sketch — `percentile_tdigest(tdigest_merge(responseTimeTDigest), N, typeof(long))` only. +- **Always apply `MergeAccountType` / `MergeIsSharedDevice` / `MergeUiRequiredExceptions`** so this report agrees with the dashboard. +- **Confirm the week bucket label matches the user's intent** before writing the rest of the queries (Sunday-aligned). +- **Always filter the partial in-progress week at the source** with `| where week < datetime()` where `` is the Sunday immediately after the reporting week. Otherwise `bucket-trends.js` will show every error as a fake −99% improvement once UTC has crossed midnight Sunday. 
+- **Originator pre-check is mandatory.** A card cannot claim `Originator: Broker` without first running [`assets/queries/error-message-and-location.kql`](assets/queries/error-message-and-location.kql) and reading the throw site + top 3 `error_message` strings. If the throw site is in `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}` AND the message starts with `AADSTS`, the originator is **eSTS, not broker** — see the AADSTS reference in [`assets/docs/kusto-cheatsheet.md`](assets/docs/kusto-cheatsheet.md). +- **WoW-movers pass is mandatory.** The 60d bucketer's `--peak-floor` silently drops sub-10K-device codes, so [`assets/queries/wow-movers.kql`](assets/queries/wow-movers.kql) MUST be run as a separate pass for both `error_code` and `error_type` (per Step 3d). Its output is **merged into the single 🔴 WoW regressions callout**, sorted by current-week device count descending, with rows tagged `NEW` / `60d↑` / originator chip. Do not render a separate "emerging" callout. Skipping the pass is how the Apr 26 `Failed to parse JWT` spike (7 → 3,461 devs over 7 weeks) hid for two reports running. +- **Section 2 callouts are at-a-glance, Section 4 is the deep dive.** WoW / Slow-burn / Wins items in Section 2 use the `.item` flat-row pattern (no nested cards, no per-item left bars — the parent `.callout` border is the only severity affordance). Each row is a single line of metric chips + a one-line body + an `Attribution card →` link to the corresponding `.attr-card` in Section 4. Do NOT duplicate the dim slicing, PR analysis, or detailed verdict between the two sections — Section 4 is where that lives. See [`assets/templates/template-readme.md`](assets/templates/template-readme.md) for the CSS class reference and the example `.item` markup. +- **Never use bash/PowerShell regex to bulk-edit balanced HTML.** This skill has burned twice on regex strip scripts that ate matched-pair `
` closes, producing inception-style nested-callout bugs that take a depth-tracking script to find. If you need a structural change to the HTML, make a targeted, single-occurrence string replacement (with explicit before/after context) or rewrite the affected block end-to-end. Never run a `-replace` across the whole file expecting it to leave balance intact. +- **Denominator caveat must cite evidence, not hand-wave.** If you flag a large all-spans device-count shift, run [`assets/queries/broker-version-share-wow.kql`](assets/queries/broker-version-share-wow.kql) (single WoW snapshot) or [`assets/queries/broker-version-share.kql`](assets/queries/broker-version-share.kql) (time-series) and name the version cohort the shift moved with. Do not write "recurring telemetry-shape artifact" without backing data; if you don't have it, drop the callout. +- **"Recovery" still merits a PR citation.** When an error pins to a single old broker version and recovers as that version retires, look for the **fix PR in the version that replaced it** before calling it a "natural rolloff." Often the fix is real and just under-credited. +- **Never report WoW-only verdicts** for errors that are flat-or-down WoW but rising on 60d — always cross-check both windows. +- **Never page** based on a regression that turns out to be a downstream of a denominator shift; always include the auth-only-denominator number alongside the all-spans number. +- **Always cite PRs** with full GitHub URLs (the repo URL patterns above), not bare commit SHAs. +- **Filename collision rule.** If a report file already exists for the same Sunday bucket, do not silently overwrite. Open the existing report, list its top-3 findings, and explicitly state in chat what changed in the new data before regenerating. A second run on the same week without a delta is wasted work. +- **No `devs` / `reqs` in user-facing strings.** All UI text — callouts, table headers, KPI labels, verdicts, badges — must say `devices` and `requests`. 
Internal variable / column / file names in scripts and JSON can stay short. +- **Do not create a separate Markdown summary** of the report — the HTML *is* the deliverable. +- **Do not commit** the report file. It lives in `$env:USERPROFILE\android-oce-reports\` (outside the workspace) precisely so it can't be staged accidentally. + +--- + +## Output checklist + +- [ ] New `oncall-wow-report-YYYY-MM-DD.html` (where `YYYY-MM-DD` is the reporting-week Sunday) exists at `$env:USERPROFILE\android-oce-reports\` (NOT at repo root). If a file for this Sunday already existed, the chat session explicitly stated what changed before regenerating. +- [ ] All sections present and populated (incl. 🚚 Traffic Attribution — even if “None this week”) +- [ ] **60-day trend bucketing run on the full cross-product** — `{error_code, error_type} × {devices, requests}` = 4 runs — union of regressions reported. Per-request retry storms (e.g. small device pool, exploding request count) are flagged on both axes. Source KQL filtered the partial in-progress week with `| where week < datetime()`. +- [ ] **WoW-movers pass run** ([`wow-movers.kql`](assets/queries/wow-movers.kql)) for BOTH `error_code` and `error_type`. Its output rows are **merged into the single 🔴 WoW regressions callout in Section 2** (sorted by curr-week devices descending), each row tagged `NEW` / `60d↑` / originator chip. No separate "emerging" callout. Every row carries throw-site, dominant message, originator, and a next step. If the WoW callout is empty (rare), render "None this week" rather than omit. +- [ ] **Both error-codes AND error-types WoW tables have `Δ requests %` and `Δ devices %` columns**, the 60d sparkline, and a status pill. Any row crossing threshold on either metric is in the regression list. +- [ ] Every WoW regression AND every 60d regression — **for both `error_code` and `error_type`** — has its own spike-attribution card with all 7 dimensions sliced. 
Cards are built from [`assets/templates/spike-card.html`](assets/templates/spike-card.html). +- [ ] **Every `error_type` regression card includes the 8th-dimension sub-code decomposition** showing the top 3–5 contributing `error_code`s with their Δ vs prior week, and links to those sub-codes' own attribution cards. +- [ ] **Originator pre-check has been run for every broker-tagged card** ([`error-message-and-location.kql`](assets/queries/error-message-and-location.kql)). Throw site and top 3 `error_message` strings are populated from real data, not from the code map. AADSTS-prefixed messages are tagged `eSTS`, not `Broker`. +- [ ] **Every regression card's Code Attribution block populates Originator + Top throw site + Wrapper + Caller hot-spots + Underlying cause + Top error_messages + Likely PRs (with confidence/why-it's-the-suspect) + Next step (with named owner)**. For type cards, the wrapper field focuses on the type's catch-and-rethrow site (e.g. `BaseException`, `ServiceException` constructor). Shallow PR-only attribution is not acceptable. +- [ ] Non-broker errors are explicitly tagged `environmental` / `non-broker` with confidence `none` — not invented broker PRs. +- [ ] Traffic analysis covers totals, per-app, per-span, requests-per-device ratio (per error AND overall), and a sampling-change check. +- [ ] **Every material traffic shift (>10% on any segment, up or down) has a reasoning paragraph** that names the dominant span/app/active-broker/broker-version, and either cites a causal PR (with confidence) — span removed/added, `goAsync()` refactor, sampling change, caller-side SDK release, ECS flight ramp — or explicitly says "no PR identified, suspect X" rather than leaving it unexplained. +- [ ] Denominator caveat (if used) is backed by [`broker-version-share-wow.kql`](assets/queries/broker-version-share-wow.kql) or [`broker-version-share.kql`](assets/queries/broker-version-share.kql) evidence naming the responsible version cohort. No hand-waving. 
+- [ ] Auth-only denominator used for all reliability %s, denominator caveat called out at top. +- [ ] No `\bdevs\b` or `\breqs\b` in user-facing text. (`Select-String -Pattern '\bdevs\b|\breqs\b' -CaseSensitive:$false` returns 0.) +- [ ] **Sparklines rendered.** Every `.kpi` tile in the Top-line health section has a `data-spark` array with 8–9 weekly values. Every row in the 60-day trend tables and both WoW tables (codes + types) has a `data-trend` mini-spark. The validator's chart-coverage check passes (KPI coverage ≥1/2 of tiles, total elements ≥15). Past failure mode: the v7 body rebuild dropped all sparklines silently — see `template-readme.md` § "Sparklines are MANDATORY". +- [ ] **Code-attribution depth.** Every `.attr-card`'s Code attribution block uses the full 8-field `
` structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step) per [`assets/docs/code-attribution-template.md`](assets/docs/code-attribution-template.md). A `pr-list`-only stub is **not acceptable** — the validator hard-fails this. Past failure mode (v7 third pass): all 10 cards shipped with PR-only stubs and lost the throw-site / wrapper / underlying-cause analysis. +- [ ] No stale text from previous weeks. (`Select-String -Pattern 'EXAMPLE CONTENT BELOW'` returns 0 — that's the unfinished-section sentinel. The template no longer ships `{{TOKEN}}` placeholders since v2; if the file still contains any `{{`, that's also a leftover.) +- [ ] `get_errors` clean on the HTML file. diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/docs/code-attribution-template.md b/.github/skills/oncall-weekly-telemetry-report/assets/docs/code-attribution-template.md new file mode 100644 index 00000000..cd5dfe57 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/docs/code-attribution-template.md @@ -0,0 +1,147 @@ +# Code Attribution Card — Per-Spike Checklist + +Use this template for **every** spike-attribution card in the report. The HTML markup matches the `code-attr` / `pr-card` / `origin-tag` styles already in [`report-template.html`](../templates/report-template.html). + +A card without a populated **Originator + Top throw site + Likely PRs + Next step** is not acceptable. "Caller hot-spots", "Underlying cause", and "Top error_messages" are required for any error where the originator is *not* obvious from the error name alone (Android system errors, 3rd-party library wrappers, environmental). + +--- + +## Required fields + +### 1. 
Originator + +One of: + +- 🟥 `broker` — error originates in our `broker/` or `broker4j/` code +- 🟥 `common` — originates in `common/` or `common4j/` +- 🟧 `Android system` — Android SDK (WebView, Conscrypt, Keystore, okhttp, KeyStore HAL) +- 🟦 `3rd-party lib` — Nimbus JOSE+JWT, Gson, etc. +- 🟦 `eSTS` — server-returned OAuth error (`invalid_grant`, `invalid_resource`, `unauthorized_client`, etc.) +- ⬜ `environmental` — enterprise TLS interception (Zscaler), OEM keystore quirks, network-policy + +### 2. Top throw site + +Fully-qualified `Class.method:line` plus % of cases that throw from this single site. Example: + +> `com.nimbusds.jwt.SignedJWT.getJWTClaimsSet:28`   **97% of cases**   thrown as `ParseException` + +How to find: query raw `android_spans` filtered to the spiking error code over a tight time window, group by `error_location` (or first frame of `error.stack_trace`), order desc. + +### 3. Wrapper + +The broker/common method that catches the originator's exception and re-throws it as the user-visible error code. Often `IDToken.parseJWT()`, `ServiceException(...)`, `ExceptionAdapter.exceptionFromAuthorizationResult()`, `ClientException("Code:" + err, ...)`. + +How to find: walk up the stack from the throw site; look for `try { ... } catch (X e) { throw new Y(...); }` patterns in `broker/` and `common/`. + +### 4. Caller hot-spots + +Top 1–3 callers of the wrapper, with device counts. Helps pin the regression to a specific code path. Example: + +> `GetRegistrationStateV0LegacyExecutor.execute:90` (84 dev) · `AndroidDeviceRegistrationClientController.execute:234` (47 dev) + +### 5. Underlying cause + +The proximate cause one level deeper than "the error fired". Example: + +> 99% `CertificateException` from `TrustManagerImpl.verifyChain` · cert-chain rejection at TLS layer + +How to find: slice on `error.cause` or first 80 chars of `error_message`. + +### 6. Top error_messages + +Top 3–5 distinct `error_message` strings with counts. 
Often the strongest signal for environmental errors (e.g. `net::ERR_SSL_PROTOCOL_ERROR`, Zscaler-issued cert names, OEM keystore exception text). + +```kql +android_spans +| where EventInfo_Time between (ago(7d) .. now()) +| where error_code == "" +| summarize count() by tostring(error_message) +| top 10 by count_ +``` + +### 7. Likely PRs + +1–3 PRs (or explicit "None"), each rendered as a `pr-card` with: + +- **Confidence**: `high` / `medium` / `low` / `none` (use the matching `pr-conf-*` CSS class) +- **GitHub URL** (full link, not bare SHA) +- **Commit SHA** (short) +- **Author** (`@username`) +- **AB#** if available +- **Why-it's-the-suspect** — 1 sentence explaining the *causal* link, not just the title. Bad: "touches MicrosoftStsAccountCredentialAdapter". Good: "touches the IDToken parse path on MSA interactive flows; matches the Apr 30 climb date." + +| Confidence | Use when | +|---|---| +| 🔴 **high** | Trajectory + flight rollout date both line up; PR diff touches the exact throw site | +| 🟡 **medium** | Code path matches but no flight gate evidence, or matches one of two suspects | +| 🟢 **low** | Candidate from grep, plausible but unverified | +| ⚪ **none** | No broker PR identified — explicitly say *why* (Android system error, eSTS-returned, OEM-specific, environmental) | + +### 8. Next step + +Concrete action with a **named owner** and a **measurable outcome**. Examples: + +- "Disable `ENABLE_OPENID_VC_HANDLING_IN_WEBVIEW_REDIRECT` flight for the affected slice (Outlook + msapps + 16.0.1) and verify spike subsides. Owner: **@somalaya**." +- "Pull 5–10 correlation IDs from Outlook devices hitting this and check eSTS logs for the actual rejected resource ID. Owner: **Outlook + eSTS teams**." +- "Slice by `bound_service_status` vs `content_provider_status` attributes to identify which IPC strategy is failing. Owner: **@pedroro**." + +--- + +## HTML skeleton (copy-paste, then fill in) + +```html +
+
Code attribution
+ +
+
Originator
+
broker short description
+
+ +
+
Top throw site
+
fully.qualified.Class.method:line   NN% of cases
+
+ +
+
Wrapper
+
wrapping.method wraps it as NewException(...)
+
+ +
+
Caller hot-spots
+
caller.A:NN (X dev)  ·  caller.B:NN (Y dev)
+
+ +
+
Underlying cause
+
NN% RootCauseException from root.method
+
+ +
+
Top error_messages
+
message 1  ·  N× message 2  ·  N× message 3
+
+ +
+
Likely PRs
+
+
+
+ 🔴 High +
+ repo#NN · PR title +
commit shortsha · 2026-MM-DD · author @user · AB#NNNNNNN
+
One-sentence causal explanation.
+
+
+
+
+
+ +
+
Next step
+
Concrete action. Owner: @name / team.
+
+
+``` diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/docs/kusto-cheatsheet.md b/.github/skills/oncall-weekly-telemetry-report/assets/docs/kusto-cheatsheet.md new file mode 100644 index 00000000..5d52ed31 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/docs/kusto-cheatsheet.md @@ -0,0 +1,265 @@ +# Kusto Cheatsheet for the OCE Weekly Report + +Distilled from the **production Android Broker Dashboard** (374 queries) plus lessons learned running the skill end-to-end. **Read this before writing any KQL for this report** — it will save you from the most common silent-data-quality bugs. + +--- + +## 1. Connection + +| | | +|---|---| +| **Cluster** | `https://idsharedeus2.kusto.windows.net` | +| **Database** | `ad-accounts-android-otel` | +| **MCP tool** | `mcp_azure-mcp-ser_kusto` (command `query`) | +| **MCP timeout** | ~240 s — raw `android_spans` queries usually exceed this; **always prefer materialized views** | + +--- + +## 2. Use the canonical *materialized views*, not the bare names + +The dashboard never queries `ErrorStats` directly. It uses the `Metrics` / `Updated` variants, which are pre-aggregated and HLL-bucketed. 
Use these: + +| Use case | Canonical view | +|---|---| +| Per-error-code counts (devs, reqs) | `materialized_view('ErrorStatsMetrics')` | +| Total broker requests / devices | `materialized_view('BrokerAdoptionStatsUpdated')` | +| Silent auth — all requests | `materialized_view('SilentAuthStatsAllRequestsMetrics')` | +| Silent auth — successes (without expected error) | `materialized_view('SilentAuthStatsRequestsWithoutExpectedErrorMetrics')` | +| Interactive auth — all / success | `materialized_view('InteractiveAuthStatsAllRequestsMetrics')` / `…WithoutExpectedErrorMetrics` | +| FIDO requests | `materialized_view('FidoAllRequestsMetrics')` | +| Calling-app share | `materialized_view('AppStatsUpdated')` | +| SKU share | `materialized_view('SkuStatsUpdated')` | +| Latency (TDigest) | `materialized_view('PerfStatsUpdated')` | +| Per-flight slicing | `Operations_ByFlight`, `ErrorCodeBySpan_ByFlight`, `ErrorType_ByFlight` | + +Always wrap in `materialized_view(...)` — referencing the table name directly may pick up the raw, much slower base table. + +Time filter on materialized views is always **`EventInfo_Time`**. Use `PipelineInfo_IngestionTime` only when querying raw `android_spans`. + +--- + +## 3. THE distinct-device-count gotcha (most important rule) + +`countDevices` on `ErrorStats*` is a **per-row distinct count, not additive**. If you sum it across multiple rows you will double-count any device that appeared in more than one slice. **The dashboard never does this.** Every dashboard query computes devices via: + +```kql +| summarize countDevices = dcount_hll(hll_merge(countDevicesHll)) +``` + +`countDevicesHll` is the **HLL sketch** stored alongside the row. Merging HLLs across rows and then `dcount_hll`-ing gives the correct distinct count. + +**Symptom of the bug:** device counts that sum to more than the fleet size; WoW deltas that look enormous when the underlying user impact is small. 
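
The double-counting effect is easy to reproduce outside Kusto. The sketch below uses exact JavaScript `Set`s as a stand-in for HLL sketches (an HLL is an approximate, mergeable version of the same idea); the rows and device IDs are made up for illustration.

```javascript
// Each row stands for one slice (here, one span) with the exact set of
// device IDs seen in that slice. Real rows carry an HLL sketch instead.
const rows = [
  { span: 'AcquireTokenSilent', devices: new Set(['d1', 'd2', 'd3']) },
  { span: 'GetAccounts', devices: new Set(['d2', 'd3', 'd4']) },
];

// Wrong: analogous to sum(countDevices). Devices d2 and d3 appear in both
// slices and are counted twice.
function summedDistinct(slices) {
  return slices.reduce((total, slice) => total + slice.devices.size, 0);
}

// Right: analogous to dcount_hll(hll_merge(countDevicesHll)). Merge the
// per-slice sets first, then count the distinct members once.
function mergedDistinct(slices) {
  const merged = new Set();
  for (const slice of slices) {
    for (const id of slice.devices) merged.add(id);
  }
  return merged.size;
}

console.log(summedDistinct(rows)); // 6 (overcounted)
console.log(mergedDistinct(rows)); // 4 (true distinct devices)
```

The gap between the two numbers grows with the overlap between slices, which is why the inflated totals tend to show up on high-traffic spans first.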
+ +For request counts, `sum(countRequests)` and `sum(countOverall)` are correct (they're additive). + +--- + +## 4. Helper functions used by the dashboard + +Reuse these so this report agrees with the dashboard: + +| Function | Purpose | Used on | +|---|---|---| +| `MergeAccountType(account_type)` | Collapse AAD variants together and MSA variants together | every error/perf query | +| `MergeIsSharedDevice(is_shared_device)` | Normalize null → "personal", true → "shared", false → "personal" | every error/perf query | +| `MergeUiRequiredExceptions(error_type)` | Collapse the 6+ string variants of `UiRequiredException` into one | error-type aggregation | +| `prettyFormatNumber(n)` | "1.2 M" / "856 k" formatting in tile output | display-only tiles | + +The 7-dimension attribution slicing is **fully achievable from `ErrorStatsMetrics`** — it has `account_type`, `is_shared_device`, `broker_version`, `active_broker_package_name`, `AppInfo_Version`, `client_sku`, `calling_package_name`, `span_name`. **You do NOT need a fallback to raw `android_spans` for these dimensions** (this skill previously claimed you did — that was wrong). + +--- + +## 5. Latency — never sum percentiles + +Latency is stored as a TDigest sketch. **Percentiles are not additive** — averaging p95 across rows is meaningless. Always merge first: + +```kql +materialized_view('PerfStatsUpdated') +| where EventInfo_Time between ((_startTime) .. 
(_endTime)) +| where span_name in ('AcquireTokenSilent','GetAccounts','RemoveAccount','ProcessWebsiteRequest') +| where span_status == 'OK' +| summarize p50 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 50, typeof(long)), + p95 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 95, typeof(long)), + p99 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 99, typeof(long)) + by week=startofweek(EventInfo_Time), span_name +``` + +**Note:** there is also a `PerfStatsMetrics` view, but it does **not** expose per-percentile columns directly — it has the merged TDigest. Use `PerfStatsUpdated` (preferred by the dashboard) and `percentile_tdigest(tdigest_merge(...), N, typeof(long))`. + +--- + +## 6. Column-name reference (so you don't burn a query on a typo) + +| View | Has column | Doesn't have | +|---|---|---| +| `ErrorStatsMetrics` | `error_code`, `error_type`, `span_name`, `broker_version`, `active_broker_package_name`, `AppInfo_Version`, `client_sku`, `calling_package_name`, `account_type`, `is_shared_device`, `EventInfo_Time`, `countOverall`, `countDevicesHll` | `calling_package` (no — it's `calling_package_name`), `countDevices` (no — use the HLL) | +| `BrokerAdoptionStatsUpdated` | `broker_version`, `EventInfo_Time`, `countRequests`, `countDevicesHll` | per-error breakdown (use ErrorStatsMetrics) | +| `PerfStatsUpdated` | `span_name`, `span_status`, `broker_version`, `active_broker_package_name`, `account_type`, `is_shared_device`, `client_sku`, `calling_package_name`, `responseTimeTDigest`, `countRequests` | `p50_ms` / `p95_ms` (no — use `percentile_tdigest`) | +| `AppStatsUpdated` | `calling_package_name`, `EventInfo_Time`, `countRequests`, `countDevicesHll` | error breakdown | + +--- + +## 7. Week alignment — Kusto `startofweek()` is **Sunday-aligned** + +If a user says "the week of May 2 → May 9", Kusto buckets it as `startofweek('2026-05-09') == 2026-05-03T00:00:00Z`. 
**Always confirm**: print the distinct `startofweek(EventInfo_Time)` values from your first query and verify the bucket label matches the user's intent. Off-by-one-week is the #1 silent error. + +For an 8-complete-week 60-day window ending Sat May 9, the buckets are: +`2026-03-08, 03-15, 03-22, 03-29, 04-05, 04-12, 04-19, 04-26, 05-03` — that's 9 buckets, one of which (the first) was a partial start. Drop the first; keep 8 complete weeks. + +--- + +## 8. Canonical query templates + +### 8a. Reliability (auth-only denominator) + +```kql +let all = materialized_view('SilentAuthStatsAllRequestsMetrics') + | where EventInfo_Time > ago(70d) + | summarize allReq = sum(countRequests), + allDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time); +let ok = materialized_view('SilentAuthStatsRequestsWithoutExpectedErrorMetrics') + | where EventInfo_Time > ago(70d) + | summarize okReq = sum(countRequests), + okDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time); +all | join kind=inner ok on week + | project week, + reqRel = round(100.0 * okReq / allReq, 3), + devRel = round(100.0 * okDev / allDev, 3) + | order by week asc +``` + +**Auth-only device union** (Silent ∪ Interactive — what the report uses for the "real fleet" KPI). The natural reach for `hll_merge_array` to combine two pre-merged HLL sketches **does not exist in Kusto** (`SEM0260: Unknown function`). Instead, project the raw `countDevicesHll` rows from both views, `union` them, and `hll_merge` once at the end: + +```kql +let s = materialized_view('SilentAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime() .. datetime()) + | project EventInfo_Time, countDevicesHll; +let i = materialized_view('InteractiveAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime() .. 
datetime()) + | project EventInfo_Time, countDevicesHll; +union s, i +| summarize authDev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time) +| where week < datetime() +| order by week asc +``` + +### 8b. 60-day error trend (feeds `bucket-trends.js`) + +```kql +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time > ago(70d) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| order by error_code asc, week asc +``` + +### 8c. Spike attribution — one slicing dim at a time + +The MCP tool can return ~50–700 KB of JSON; multi-dim cartesians blow this out. **Slice one dimension per query**, then post-process with `summarize-attribution.js`: + +```kql +let codes = dynamic(['no_tokens_found','unauthorized_client','Code:-6', + 'unknown_crypto_error','null_pointer_error','timed_out_execution']); +materialized_view('ErrorStatsMetrics') +| extend unified_account_type = MergeAccountType(account_type) +| extend unified_is_shared_device = MergeIsSharedDevice(is_shared_device) +| where EventInfo_Time > ago(14d) +| where error_code in (codes) +| extend wk = startofweek(EventInfo_Time) +| summarize devs = dcount_hll(hll_merge(countDevicesHll)) + by wk, error_code, span_name // <-- swap this dim per query +| order by error_code asc, wk asc, devs desc +``` + +Run once each with the trailing dim set to: `span_name`, `calling_package_name`, `active_broker_package_name`, `broker_version`, `unified_account_type`, `unified_is_shared_device`, `client_sku`. That's the full 7. + +### 8d. Latency — see Section 5 above. + +### 8e. 
Broker version share + +```kql +materialized_view('BrokerAdoptionStatsUpdated') +| where EventInfo_Time > ago(21d) +| summarize req = sum(countRequests), + dev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), broker_version +| order by week asc, req desc +``` + +--- + +## 10. Helper scripts + +| Script | Purpose | +|---|---| +| [`bucket-trends.js`](../scripts/bucket-trends.js) | Bucket every error code into regression / spike / improvement / flat across an N-week window. Pass `--end=YYYY-MM-DD` (Sunday after the reporting week, exclusive) to drop partial in-progress buckets. | +| [`agg.js`](../scripts/agg.js) | Per-error per-dim top-N rollup with WoW deltas. Feeds spike-attribution dim blocks. | +| [`summarize-attribution.js`](../scripts/summarize-attribution.js) | Roll up 7-dim attribution slices per (error_code, week) — feeds the spike-attribution cards | +| [`queries/`](../queries/) | Canonical KQL templates, one per query — see [`queries/README.md`](../queries/README.md) | +| [`templates/`](../templates/) | Copy-paste HTML snippets for cards / footer JS | +| [`report-template.html`](../templates/report-template.html) | Canonical layout. Copy to `~/android-oce-reports/oncall-wow-report-.html` and replace `{{TOKENS}}` only — never restructure CSS | + +--- + +## 11. The `error_location` JSON shape (read this before slicing stack-traces) + +`error_location` on `android_spans` is a **serialized JSON string**, not a dynamic object. Naively writing `error_location.MethodName` returns null in KQL. 
Use `tostring()` to project it raw, then `parse_json()` if you need to drill in: + +```kql +android_spans +| where PipelineInfo_IngestionTime > ago(7d) // raw table: keep the window tight to stay under the MCP timeout +| where error_code == 'null_pointer_error' +| extend loc = tostring(error_location) // {"ClassName":"...","MethodName":"...","LineNumber":N} +| extend parsed = parse_json(loc) // parse once, then read fields off the dynamic value +| extend method = tostring(parsed.MethodName) +| extend lineNo = toint(parsed.LineNumber) +| summarize devices = dcount(DeviceInfo_Id) by method, lineNo +| top 20 by devices desc +``` + +For the report's **mandatory Originator pre-check** (Step 4 of SKILL.md), use [`queries/error-message-and-location.kql`](../queries/error-message-and-location.kql) — it returns the raw `loc` blob alongside the first 120 chars of `error_message`, which is enough to identify the throw site (class + method + line) and the dominant message string. + +The single most informative attribution query for a regressing code: + +```kql +android_spans +| where PipelineInfo_IngestionTime between (datetime() .. datetime()) +| where error_code in () +| extend loc = tostring(error_location), + msg = substring(tostring(error_message), 0, 100) +| summarize cnt = count(), + devices = dcount(DeviceInfo_Id) + by error_code, loc, msg +| top 60 by devices desc +``` + +--- + +## 12. AADSTS reference — common eSTS responses bridged into broker errors + +When `error_message` starts with `AADSTS`, the originator is **eSTS, not broker**, regardless of which broker exception class was constructed. Broker (specifically `common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult}`) translates the AAD response into a broker exception code as a courtesy — it is not the cause. 
+ +| AADSTS code | Meaning | Broker exception code (typical) | Originator | Owner | +|---|---|---|---|---| +| `AADSTS500011` | Resource principal not found in tenant | `invalid_resource` | eSTS / tenant config | Resource owner team | +| `AADSTS500014` | Service principal disabled in tenant | `invalid_resource` | eSTS / tenant config | Resource owner team | +| `AADSTS50158` | External claims challenge / CA enforcement | `interaction_required` | eSTS / Conditional Access | Identity CA team | +| `AADSTS50173` | Fresh token needed (CA / FR) | `interaction_required` / `invalid_grant` | eSTS / CA | Identity CA team | +| `AADSTS65001` | User / admin has not consented | `unauthorized_client` | eSTS / app registration | App owner team | +| `AADSTS70008` | Authorization code expired | `invalid_grant` | eSTS (timing) | Investigate caller latency | +| `AADSTS70011` | Invalid scope | `invalid_scope` | eSTS / app registration | App owner team | +| `AADSTS90072` | User account from external tenant doesn't exist locally | `unauthorized_client` | eSTS / B2B config | Tenant admin | +| `AADSTS900971` | No reply address | `invalid_request` | eSTS / app registration | App owner team | + +**Rule of thumb:** if the throw site is an `ExceptionAdapter.*` method AND the message begins with `AADSTS`, tag the card `eSTS` and route to the resource / app owner team. Do not invent a broker PR to "fix" it. + +--- + +## 13. MCP output handling + +- Most queries with multi-week × per-error-code grain return **>50 KB** and are written to a side file by the tool. Read the side file with the `read_file` tool, or pipe through `bucket-trends.js` / `summarize-attribution.js`. +- The first row of `results.items` is the **schema object**, not data. The helper scripts know this. +- If a query times out or returns `BadRequest`, check **column name typos first** (the error message names the missing column). 
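The schema-row rule above is easy to trip over when reading a side file by hand. A minimal sketch of skipping it follows; the exact payload shape (`{ results: { items: [...] } }`), the sample rows, and the function name `dataRows` are assumptions made for illustration, not the helper scripts' actual code.

```javascript
// Assumed payload shape (only the first-row rule comes from the cheatsheet):
// { results: { items: [ <schema object>, <row>, <row>, ... ] } }
function dataRows(payload) {
  const items = payload?.results?.items;
  if (!Array.isArray(items)) {
    throw new TypeError('expected payload.results.items to be an array');
  }
  return items.slice(1); // drop the leading schema object, keep the data rows
}

// Hypothetical sample payload, for illustration only.
const sample = {
  results: {
    items: [
      { week: 'datetime', error_code: 'string', devs: 'long' }, // schema object
      { week: '2026-04-26', error_code: 'no_tokens_found', devs: 120 },
      { week: '2026-05-03', error_code: 'no_tokens_found', devs: 180 },
    ],
  },
};

console.log(dataRows(sample).length); // 2
```

If a row count looks off by one after manual parsing, the schema object being treated as data is a likely cause worth checking first.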
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql new file mode 100644 index 00000000..a9e03506 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-codes.kql @@ -0,0 +1,15 @@ +// 60-day per-error-code trend. +// Inputs (replace before pasting): +// = first Sunday of the 60d window (e.g. 2026-03-08) +// = end of the reporting week, EXCLUSIVE = next Sunday after the +// reporting week's Sunday (e.g. for a 2026-05-03 report, use 2026-05-10) +// Output: feed to assets/scripts/bucket-trends.js with --start= (no --end needed +// because we filter the partial bucket out at the source — preferred). +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time between (datetime() .. datetime()) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| where week < datetime() // drop partial end-week +| order by error_code asc, week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql new file mode 100644 index 00000000..951e840f --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/60d-trend-types.kql @@ -0,0 +1,10 @@ +// 60-day per-error-type trend (with MergeUiRequiredExceptions to collapse variants). +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| where EventInfo_Time between (datetime() .. 
datetime()) +| where isnotempty(unified_error_type) +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), unified_error_type +| where week < datetime() +| order by unified_error_type asc, week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md b/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md new file mode 100644 index 00000000..12351dae --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/README.md @@ -0,0 +1,34 @@ +# `assets/queries/` — canonical KQL templates + +Each `.kql` here is a paste-and-replace template for one of the queries the OCE +weekly report needs. Token convention: + +| Token | Meaning | +|---|---| +| `` | Sunday of the earliest week in the window, ISO date e.g. `2026-03-08` | +| `` | Sunday immediately AFTER the reporting week (EXCLUSIVE upper bound). For a 2026-05-03 report use `2026-05-10`. | +| `` | Sunday of the prior week (the WoW baseline). | +| `` | Comma-separated KQL string list, e.g. `'invalid_resource', 'null_pointer_error'` | +| `` | Same shape but for `unified_error_type`. | +| `` | A single column name, replace per dimension run. | + +**The `` filter is mandatory.** Always include `| where week < datetime()` after the `summarize` so the partial in-progress week is dropped at the source. Otherwise `bucket-trends.js` will see a fake −99% improvement on every code (the partial bucket will look like a fleet-wide collapse). 
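The date arithmetic behind the exclusive upper bound can be sketched as follows. This is a hedged illustration, not part of the skill's scripts: the function names (`startOfWeekUtc`, `exclusiveEnd`, `dropPartialWeeks`) are hypothetical, and the code assumes ISO `YYYY-MM-DD` strings in UTC with Sunday-aligned weeks, matching how `startofweek()` is described in the cheatsheet.

```javascript
// Sketch only: function names are hypothetical, not part of the skill's scripts.
const DAY_MS = 24 * 60 * 60 * 1000;

// Sunday-aligned start of the UTC week containing `isoDate` (YYYY-MM-DD).
function startOfWeekUtc(isoDate) {
  const d = new Date(`${isoDate}T00:00:00Z`);
  const sunday = new Date(d.getTime() - d.getUTCDay() * DAY_MS);
  return sunday.toISOString().slice(0, 10);
}

// Exclusive upper bound: the Sunday immediately after the reporting week.
function exclusiveEnd(reportingSunday) {
  const d = new Date(`${reportingSunday}T00:00:00Z`);
  return new Date(d.getTime() + 7 * DAY_MS).toISOString().slice(0, 10);
}

// Keep only rows whose week bucket starts before the exclusive end.
// ISO date strings sort lexicographically, so a string compare is enough.
function dropPartialWeeks(rows, endExclusive) {
  return rows.filter((row) => row.week < endExclusive);
}

console.log(startOfWeekUtc('2026-05-09')); // 2026-05-03
console.log(exclusiveEnd('2026-05-03'));   // 2026-05-10
```

Filtering in the query itself remains the preferred approach; a client-side filter like `dropPartialWeeks` is only a safety net for result sets that were exported without the `week <` clause.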
+ +## File index + +| File | Purpose | Section it feeds | +|---|---|---| +| [`reliability-auth-only.kql`](reliability-auth-only.kql) | Per-week auth-only requests/devices | Top-line health, denominator caveat | +| [`broker-version-share.kql`](broker-version-share.kql) | Per-week per-version share — **evidence for denominator caveat** | Denominator caveat callout, broker adoption | +| [`broker-version-share-wow.kql`](broker-version-share-wow.kql) | Single WoW snapshot of version share — fastest evidence for cohort transitions | Denominator caveat callout | +| [`60d-trend-codes.kql`](60d-trend-codes.kql) | Feeds `bucket-trends.js` for codes | 60-day trend analysis | +| [`60d-trend-types.kql`](60d-trend-types.kql) | Feeds `bucket-trends.js` for types | 60-day trend analysis | +| [`wow-movers.kql`](wow-movers.kql) | **MANDATORY second pass** — catches small-base codes that spiked sharply this week (below the 60d bucketer's reporting threshold). Run for both `error_code` and `error_type`. **Merge its output rows into the single 🔴 WoW regressions callout** alongside the standard WoW table; tag rows that were absent or near-zero last week with `NEW`. Do not render a separate "emerging" callout. | 🔴 WoW regressions callout (Section 2) | +| [`attr-union-by-dim.kql`](attr-union-by-dim.kql) | **PREFERRED for 2-week WoW.** All 7 dims for N codes (or types) in ONE round-trip; pipe through `summarize-attribution.js --union`. | Spike attribution cards | +| [`attr-codes-by-dim.kql`](attr-codes-by-dim.kql) | Per-dim form (run 7 times). Fall back to this only when the union exceeds payload size or the time window is wider than 2 weeks. 
| Spike attribution cards | +| [`attr-types-by-dim.kql`](attr-types-by-dim.kql) | Per-dim form for type regressions | Spike attribution cards | +| [`type-subcode-decomposition.kql`](type-subcode-decomposition.kql) | 8th dim for type cards | Type spike-attribution cards | +| [`error-message-and-location.kql`](error-message-and-location.kql) | **MANDATORY** for every broker-tagged regression. Now accepts BOTH `` and `` so codes + types can be sliced in one round-trip. | Code attribution block | +| [`os-version-slice.kql`](os-version-slice.kql) | OS / OEM concentration (raw `android_spans`). **On-demand only** per Step 5 — don't slice every card. | OS-version dim in attribution cards (when applicable) | +| [`latency.kql`](latency.kql) | p50/p95/p99 by hot span | Latency section | +| [`app-share.kql`](app-share.kql) | Top calling apps by week | Traffic analysis | diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql new file mode 100644 index 00000000..4e138833 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/app-share.kql @@ -0,0 +1,11 @@ +// Top calling apps share for last N weeks (typically 3). +materialized_view('AppStatsUpdated') +| where EventInfo_Time between (datetime() .. 
datetime()) +| summarize req = sum(countRequests), + dev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), calling_package_name +| where week < datetime() +| order by week asc, req desc +| summarize topApps = make_list(pack('app', calling_package_name, 'req', req, 'dev', dev), 25) + by week +| order by week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql new file mode 100644 index 00000000..fa31ff56 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-codes-by-dim.kql @@ -0,0 +1,17 @@ +// Spike attribution: codes by ONE dimension at a time. +// Run 7 times with set to each of: +// span_name | calling_package_name | active_broker_package_name | +// broker_version | unified_account_type | unified_is_shared_device | client_sku +// (Plus android_spans-based for OS version — see os-version-slice.kql.) +let codes = dynamic([]); +materialized_view('ErrorStatsMetrics') +| extend unified_account_type = MergeAccountType(account_type) +| extend unified_is_shared_device = MergeIsSharedDevice(is_shared_device) +| where EventInfo_Time between (datetime() .. datetime()) +| where error_code in (codes) +| extend wk = startofweek(EventInfo_Time) +| where wk < datetime() +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by wk, error_code, +| order by error_code asc, wk asc, devs desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql new file mode 100644 index 00000000..d9b06736 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-types-by-dim.kql @@ -0,0 +1,15 @@ +// Spike attribution: types by ONE dimension at a time. 
+// Same usage as attr-codes-by-dim.kql but for error_type regressions. +let types = dynamic([]); +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| extend unified_account_type = MergeAccountType(account_type) +| extend unified_is_shared_device = MergeIsSharedDevice(is_shared_device) +| where EventInfo_Time between (datetime() .. datetime()) +| where unified_error_type in (types) +| extend wk = startofweek(EventInfo_Time) +| where wk < datetime() +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by wk, unified_error_type, +| order by unified_error_type asc, wk asc, devs desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql new file mode 100644 index 00000000..00afd33f --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/attr-union-by-dim.kql @@ -0,0 +1,64 @@ +// Spike-attribution union — all 7 dims for N codes (or N types) in ONE query. +// +// Recommended for the standard 2-week WoW attribution pass (Step 5 of SKILL.md). +// 1 round-trip vs 7. ~800 KB payload for 8 codes; well under the MCP limit. +// Falls back to per-dim files (assets/queries/attr-codes-by-dim.kql) if you +// need a wider time window or you exceed payload size. +// +// Inputs: +// e.g. dynamic(['no_tokens_found','timed_out_execution', ...]) +// inclusive (e.g. datetime(2026-04-26)) +// EXCLUSIVE Sunday after the reporting week (e.g. 
datetime(2026-05-10)) +// either `error_code` or `unified_error_type` (the latter for type cards) +// +// Output schema (consumed by `summarize-attribution.js --union`): +// dim string short label per dimension +// wk datetime reporting week +// string error_code or unified_error_type +// val_string string dim value (cast via tostring() in every union leg) +// devs long dcount_hll merged device count +// errs long sum of countOverall (request count) +// +// For type cards, swap the first line and key: +// let base = materialized_view('ErrorStatsMetrics') +// | extend unified_error_type = MergeUiRequiredExceptions(error_type) +// | where EventInfo_Time between (datetime() .. datetime()) +// | where unified_error_type in () +// | extend wk = startofweek(EventInfo_Time); + +// IMPORTANT — column-aliasing gotcha: every union branch MUST emit `val_string` +// as a real `string` (never `bool(null)`), or Kusto will rename the columns +// `val_string_string` and `val_string_bool` in the result schema, which then +// breaks `summarize-attribution.js` (it now accepts both names as a fallback, +// but emitting one consistent `string` column is cleaner). Use `tostring()` on +// non-string dims (e.g. shared_dev) so every leg has a string-typed column. +let codes = dynamic([]); +let base = materialized_view('ErrorStatsMetrics') + | where EventInfo_Time between (datetime() .. 
datetime()) + | where error_code in (codes) + | extend wk = startofweek(EventInfo_Time); +(base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='span', wk, error_code, val_string=tostring(span_name)) +| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='calling_app', wk, error_code, val_string=tostring(calling_package_name)) +| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='active_broker', wk, error_code, val_string=tostring(active_broker_package_name)) +| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='broker_ver', wk, error_code, val_string=tostring(broker_version)) +| union (base | extend t = MergeAccountType(account_type) + | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='acct_type', wk, error_code, val_string=tostring(t)) +| union (base | extend s = MergeIsSharedDevice(is_shared_device) + | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='shared_dev', wk, error_code, val_string=tostring(s)) +| union (base | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by dim='client_sku', wk, error_code, val_string=tostring(client_sku)) +| where wk < datetime() +| order by error_code asc, dim asc, wk asc, devs desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql new file mode 100644 index 00000000..0fe05c58 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share-wow.kql @@ -0,0 +1,34 @@ +// WoW broker-version share comparison: the canonical evidence for the +// "denominator caveat" callout when an entire version cohort retires. 
Use this +// instead of the time-series `broker-version-share.kql` when you need a single +// WoW snapshot showing which versions gained/lost share. Modeled on +// `wow-movers.kql`. +// +// Inputs: +// Sunday of the reporting week (e.g. 2026-05-03) +// Sunday after (exclusive, e.g. 2026-05-10) +// Sunday of the baseline week (e.g. 2026-04-26) +// +// Floor: only versions with >100M reqs in either week (filters long-tail). +// Output sorted by current-week req count descending. + +let curr = materialized_view('BrokerAdoptionStatsUpdated') + | where EventInfo_Time between (datetime() .. datetime()) + | summarize cReq = sum(countRequests), + cDev = dcount_hll(hll_merge(countDevicesHll)) + by broker_version; +let prior = materialized_view('BrokerAdoptionStatsUpdated') + | where EventInfo_Time between (datetime() .. datetime()) + | summarize pReq = sum(countRequests), + pDev = dcount_hll(hll_merge(countDevicesHll)) + by broker_version; +curr | join kind=fullouter prior on broker_version +| extend bv = coalesce(broker_version, broker_version1) +| extend cReq = coalesce(cReq, long(0)), cDev = coalesce(cDev, long(0)), + pReq = coalesce(pReq, long(0)), pDev = coalesce(pDev, long(0)) +| project bv, pReq, cReq, + dReqPct = iff(pReq == 0, real(null), round(100.0 * (cReq - pReq) / pReq, 1)), + pDev, cDev, + dDevPct = iff(pDev == 0, real(null), round(100.0 * (cDev - pDev) / pDev, 1)) +| where cReq > 100000000 or pReq > 100000000 +| order by cReq desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql new file mode 100644 index 00000000..bdfc5a41 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/broker-version-share.kql @@ -0,0 +1,10 @@ +// Per-broker-version request and device share — the canonical evidence for +// the "denominator caveat" callout. 
If the all-spans device count moved >20% +// WoW, this query tells you WHICH version cohort drove it. +materialized_view('BrokerAdoptionStatsUpdated') +| where EventInfo_Time between (datetime() .. datetime()) +| summarize req = sum(countRequests), + dev = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), broker_version +| where week < datetime() +| order by week asc, req desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql new file mode 100644 index 00000000..0272daaa --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/error-message-and-location.kql @@ -0,0 +1,45 @@ +// Stack-trace + error_message slice for code attribution. MANDATORY for every +// broker-tagged regression card before claiming "Originator: Broker". +// +// Rationale: most broker exception codes flow through +// common/ExceptionAdapter.{getExceptionFromTokenErrorResponse, +// exceptionFromAuthorizationResult, clientExceptionFromException}. Without +// reading the throw site + the dominant error_message string, you cannot tell +// whether the code originated in broker code or was bridged from an eSTS +// AADSTS response. (See ../docs/kusto-cheatsheet.md "AADSTS reference table".) +// +// THIS TEMPLATE COVERS BOTH error_code AND error_type IN ONE ROUND-TRIP. +// Pass an empty list for the side you don't want to slice. +// +// Inputs: +// e.g. 'invalid_resource', 'null_pointer_error' (or empty) +// e.g. 'IntuneAppProtectionPolicyRequiredException' (or empty) +// datetime — should be the reporting-week Sunday (e.g. 2026-05-03). +// Use the FULL 7-day reporting window, NOT a narrower 3-5 day slice +// (low-volume types like SSLHandshakeException / Intune* may return +// zero rows in a sub-week window). 
+// datetime of next Sunday (exclusive) +// +// Tip: if the reporting window returns no rows for a low-volume code/type, fall +// back to the prior 14-day window (` - 7d .. `) before giving up. +// +// Output column 'loc' is a JSON blob {"ClassName":"...","MethodName":"...","LineNumber":N} +// — this is normal. Read it as text. To project the method name only, use +// parse_json(loc).MethodName +// +// HARD RULE (per SKILL.md Step 4): if the throw site is in +// ExceptionAdapter.{getExceptionFromTokenErrorResponse, exceptionFromAuthorizationResult} +// AND the message starts with "AADSTS", the originator is eSTS, not broker. + +let codes = dynamic([]); +let types = dynamic([]); +android_spans +| where PipelineInfo_IngestionTime between (datetime() .. datetime()) +| where (array_length(codes) > 0 and error_code in (codes)) + or (array_length(types) > 0 and error_type in (types)) +| extend loc = tostring(error_location), + msg = substring(tostring(error_message), 0, 120) +| summarize cnt = count(), + devices = dcount(DeviceInfo_Id) + by error_code, error_type, loc, msg +| top 80 by devices desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql new file mode 100644 index 00000000..471fd2ca --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/latency.kql @@ -0,0 +1,13 @@ +// p50 / p95 / p99 latency on the hot spans. Always merge TDigest before percentile. +materialized_view('PerfStatsUpdated') +| where EventInfo_Time between (datetime() .. 
datetime()) +| where span_name in ('AcquireTokenSilent','AcquireTokenInteractive', + 'GetAccounts','RemoveAccount','ProcessWebsiteRequest') +| where span_status == 'OK' +| summarize p50 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 50, typeof(long)), + p95 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 95, typeof(long)), + p99 = percentile_tdigest(tdigest_merge(responseTimeTDigest), 99, typeof(long)), + reqs = sum(countRequests) + by week = startofweek(EventInfo_Time), span_name +| where week < datetime() +| order by span_name asc, week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql new file mode 100644 index 00000000..dcb93df6 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/os-version-slice.kql @@ -0,0 +1,11 @@ +// OS version slice (8th attribution dim). Requires raw android_spans because +// ErrorStatsMetrics doesn't carry DeviceInfo_OsVersion. Keep the time window +// tight (<= 7 days) to stay under the MCP 240s timeout. +android_spans +| where PipelineInfo_IngestionTime between (datetime() .. datetime()) +| where error_code in () +| summarize devs = dcount(DeviceInfo_Id), + cnt = count() + by error_code, DeviceInfo_OsVersion +| where devs >= 100 +| top 30 by devs desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql new file mode 100644 index 00000000..e9758b21 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/reliability-auth-only.kql @@ -0,0 +1,14 @@ +// Auth-only denominator and reliability per week. +let s = materialized_view('SilentAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime() .. 
datetime()) + | summarize req = sum(countRequests), devHll = hll_merge(countDevicesHll) + by week = startofweek(EventInfo_Time); +let i = materialized_view('InteractiveAuthStatsAllRequestsMetrics') + | where EventInfo_Time between (datetime() .. datetime()) + | summarize req = sum(countRequests), devHll = hll_merge(countDevicesHll) + by week = startofweek(EventInfo_Time); +union s, i +| summarize authReq = sum(req), authDev = dcount_hll(hll_merge(devHll)) + by week +| where week < datetime() +| order by week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql new file mode 100644 index 00000000..4454c79f --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/type-subcode-decomposition.kql @@ -0,0 +1,13 @@ +// Sub-code decomposition for an error_type regression card (the "8th dim"). +// Shows top error_codes that roll up under each unified_error_type, with WoW devices. +let types = dynamic([]); +materialized_view('ErrorStatsMetrics') +| extend unified_error_type = MergeUiRequiredExceptions(error_type) +| where EventInfo_Time between (datetime() .. datetime()) +| where unified_error_type in (types) +| extend wk = startofweek(EventInfo_Time) +| where wk < datetime() +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by wk, unified_error_type, error_code +| order by unified_error_type asc, wk asc, devs desc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql new file mode 100644 index 00000000..1a53a8f9 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-movers.kql @@ -0,0 +1,46 @@ +// WoW movers — codes (or types) that moved sharply this week regardless of 60d shape. 
+// +// MANDATORY pass alongside `60d-trend-codes.kql` / `60d-trend-types.kql` (per +// SKILL.md Step 3b). The 60d bucketer's --peak-floor=10000 EXCLUDES errors +// whose absolute weekly volume is small, but those small-volume codes can +// still spike sharply WoW (e.g. `Failed to parse JWT` 7 -> 3,461 devs over 7 +// weeks, or `Code:-11` 937 -> 2,490 devs WoW). Without this pass those spikes +// are silently dropped from the report. +// +// Inputs: +// Sunday of the reporting week (e.g. 2026-05-03) +// Sunday after (exclusive, e.g. 2026-05-10) +// Sunday of the baseline week (e.g. 2026-04-26) +// +// To run for error_type instead of error_code, copy this query and replace: +// - error_code -> MergeUiRequiredExceptions(error_type) (alias as `t`) +// - drop the `error_code != 'success'` filter +// +// Thresholds (tuneable): +// floor: cDev>=500 OR cReq>=5000 (small enough to catch sub-bucketer-floor codes) +// move: |dDev%|>=25 OR |dReq%|>=50 (real spike, not noise) +// new-this-wk: pDev==0 OR pReq==0 (never seen before this week) + +let curr = materialized_view('ErrorStatsMetrics') + | where EventInfo_Time between (datetime() .. datetime()) + | where isnotempty(error_code) and error_code != 'success' + | summarize cDev = dcount_hll(hll_merge(countDevicesHll)), + cReq = sum(countOverall) + by error_code; +let prior = materialized_view('ErrorStatsMetrics') + | where EventInfo_Time between (datetime() .. 
datetime()) + | where isnotempty(error_code) and error_code != 'success' + | summarize pDev = dcount_hll(hll_merge(countDevicesHll)), + pReq = sum(countOverall) + by error_code; +curr | join kind=fullouter prior on error_code +| extend ec = coalesce(error_code, error_code1) +| extend cDev = coalesce(cDev, long(0)), cReq = coalesce(cReq, long(0)), + pDev = coalesce(pDev, long(0)), pReq = coalesce(pReq, long(0)) +| extend dDevPct = iff(pDev == 0, real(null), 100.0 * (cDev - pDev) / pDev) +| extend dReqPct = iff(pReq == 0, real(null), 100.0 * (cReq - pReq) / pReq) +| where (cDev >= 500 or cReq >= 5000) +| where (abs(dDevPct) >= 25 or abs(dReqPct) >= 50 or pDev == 0 or pReq == 0) +| project ec, pDev, cDev, dDevPct = round(dDevPct, 1), + pReq, cReq, dReqPct = round(dReqPct, 1) +| order by abs(dDevPct) desc nulls first diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql new file mode 100644 index 00000000..7ef5ce46 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/queries/wow-table-sparkline-series.kql @@ -0,0 +1,34 @@ +// 8-week per-error sparkline series for the WoW tables (data-trend arrays). +// +// MANDATORY (per SKILL.md Output checklist, v8): the `data-trend` arrays in +// the Section 6 (error_code) and Section 7 (error_type) WoW tables must come +// from real data — not be fabricated from a "roughly increasing" pattern. +// Past failure mode: small-volume codes (Broker request cancelled, +// kdfv2_key_derivation_error, TimeoutCancellationException) were filtered out +// by the 60d bucketer's peak-floor, then their sparklines were invented inline +// in the WoW table HTML. That's data dishonesty even when the array looks plausible. +// +// This query returns 8 weekly buckets for every code/type that appears in +// either the WoW movers list OR the 60d trend output. 
Run it twice — once with +// the codes filter, once with the types filter — and feed the result into the +// WoW-table generator so every row has a real-data trend. +// +// Inputs: +// Sunday of week-0 (e.g. 2026-04-12 for an 8-week window ending 2026-06-06) +// Sunday after the reporting week, EXCLUSIVE (e.g. 2026-06-07) +// Dynamic list of error_code values whose sparklines we need. +// Build this from the union of: +// * wow-movers-codes.json results +// * 60d-codes regression/spike/improvement bucket members +// For the type variant, swap to `unified_error_type in ()` +// and the MergeUiRequiredExceptions extension. + +let codes = dynamic([]); +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time between (datetime() .. datetime()) +| where error_code in (codes) +| summarize devs = dcount_hll(hll_merge(countDevicesHll)), + errs = sum(countOverall) + by week = startofweek(EventInfo_Time), error_code +| where week < datetime() +| order by error_code asc, week asc diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/agg.js b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/agg.js new file mode 100644 index 00000000..349e8d2d --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/agg.js @@ -0,0 +1,126 @@ +#!/usr/bin/env node +/** + * agg.js — Per-error per-dimension top-N rollup with WoW deltas. + * + * Companion to bucket-trends.js / summarize-attribution.js. Whereas + * summarize-attribution.js is for the cross-dimension cartesian roll-up + * across many dims, this script is the daily workhorse: take one + * "per-week × per-error × per-(one dim)" Kusto JSON file, print a + * human-readable per-error breakdown of the top-N values of that dim + * with previous-week vs current-week counts and a Δ%. + * + * Designed for the Spike Attribution cards. Run once per dim per error + * cluster (span_name, calling_package_name, broker_version, etc.), + * paste the output into the card. 
+ * + * Input shape: a Kusto MCP JSON file (or run-kql.ps1 output) produced by: + * + * let codes = dynamic([...]); + * materialized_view('ErrorStatsMetrics') + * | where EventInfo_Time between (datetime() .. datetime()) + * | where error_code in (codes) // or unified_error_type in (types) + * | extend wk = startofweek(EventInfo_Time) + * | where wk < datetime() // drop partial end! + * | summarize devs = dcount_hll(hll_merge(countDevicesHll)), + * errs = sum(countOverall) + * by wk, error_code, + * | order by error_code asc, wk asc, devs desc + * + * Usage: + * node agg.js [ ...] [--top=N] [--metric=devs|reqs] + * + * error_key: "error_code" or "ut" (when extended from MergeUiRequiredExceptions) + * dim_col: the column you grouped by (e.g. span_name, calling_package_name) + * if you pass multiple, they are joined with " | " into a composite key + * --top=5 (default) top-N rows per error + * --metric=devs (default) | reqs + */ +const fs = require('fs'); + +const args = process.argv.slice(2); +const positional = args.filter(a => !a.startsWith('--')); +const file = positional[0]; +const errKey = positional[1] || 'error_code'; +const dimCols = positional.slice(2); +const topN = +((args.find(a => a.startsWith('--top=')) || '').split('=')[1] || 5); +const metric = ((args.find(a => a.startsWith('--metric=')) || '').split('=')[1] || 'devs').toLowerCase(); + +if (!file || dimCols.length === 0) { + console.error('Usage: node agg.js [ ...] [--top=N] [--metric=devs|reqs]'); + process.exit(1); +} +if (!['devs', 'reqs'].includes(metric)) { + console.error("--metric must be 'devs' or 'reqs'"); + process.exit(1); +} + +function load(file) { + const j = JSON.parse(fs.readFileSync(file, 'utf8')); + // First item is the schema row: an object {col: type} (MCP) or an array [col, ...] (run-kql.ps1). + const [head, ...items] = j.results.items; + const schema = Array.isArray(head) ? head.map(String) : Object.keys(head); + return { items, schema }; +} + +function pct(a, b) { + if (!b) return a ? 
'+inf' : '0'; + return ((a - b) / b * 100).toFixed(1) + '%'; +} + +const { items, schema } = load(file); +const wkIdx = schema.indexOf('wk'); +const errIdx = schema.indexOf(errKey); +const valIdx = schema.indexOf(metric === 'devs' ? 'devs' : 'errs'); +const dimIdxs = dimCols.map(c => { + const i = schema.indexOf(c); + if (i < 0) { + console.error(`Column '${c}' not found in schema: ${schema.join(', ')}`); + process.exit(2); + } + return i; +}); +if (wkIdx < 0 || errIdx < 0 || valIdx < 0) { + console.error(`Required columns missing. schema=${schema.join(', ')} need wk, ${errKey}, ${metric === 'devs' ? 'devs' : 'errs'}`); + process.exit(2); +} + +// group: err -> dimkey -> wk -> value +const m = {}; +const wks = new Set(); +for (const r of items) { + const wk = r[wkIdx], err = r[errIdx], val = r[valIdx]; + const dimKey = dimIdxs.map(i => (r[i] === null || r[i] === undefined || r[i] === '') ? '(blank)' : r[i]).join(' | '); + wks.add(wk); + m[err] = m[err] || {}; + m[err][dimKey] = m[err][dimKey] || {}; + m[err][dimKey][wk] = (m[err][dimKey][wk] || 0) + val; +} +const sortedWks = [...wks].sort(); +if (sortedWks.length < 2) { + console.warn(`[agg] WARN: only ${sortedWks.length} week bucket(s) in input — need >= 2 for WoW deltas.`); +} +// WoW compares the last two buckets: the previous week is the one immediately before the +// current week, not the first week of the window. With a single bucket, prev falls back to cur. +const curWk = sortedWks[sortedWks.length - 1]; +const prevWk = sortedWks[sortedWks.length - 2] || curWk; + +console.log(`# ${file} (dim: ${dimCols.join(' + ')}, metric: ${metric})`); +console.log(`# WoW: ${prevWk.slice(0, 10)} -> ${curWk.slice(0, 10)}\n`); + +for (const err of Object.keys(m).sort()) { + const rows = Object.entries(m[err]).map(([k, v]) => ({ + key: k, + prev: v[prevWk] || 0, + cur: v[curWk] || 0, + })); + const total = rows.reduce((s, r) => s + r.cur, 0); + rows.sort((a, b) => b.cur - a.cur); + console.log(`## ${err} (cur-week ${metric}=${total.toLocaleString()})`); + for (const r of rows.slice(0, topN)) { + const share = total ? 
(r.cur / total * 100).toFixed(1) : '0'; + console.log( + ' ' + share.padStart(5) + '%' + + ' Δ ' + pct(r.cur, r.prev).padStart(8) + + ' prev=' + String(r.prev).padStart(11) + + ' cur=' + String(r.cur).padStart(11) + + ' ' + r.key + ); + } + console.log(''); +} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 new file mode 100644 index 00000000..3b2c45b1 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bootstrap-report.ps1 @@ -0,0 +1,173 @@ +<# +.SYNOPSIS + Bootstrap a new OCE weekly report file from the canonical template. + +.DESCRIPTION + Implements SKILL.md Step 1 as a script so the workflow doesn't drift across + runs: + 1. Computes the reporting-week Sunday from the current date (most recent + complete Sun-Sat week unless -ReportingSunday is passed explicitly). + 2. Creates ~/android-oce-reports/_data// for raw query payloads. + 3. Decides what to do if the target report file already exists: + - If the existing file is an UNFILLED template stub (header dates + still match the canonical template's reference week), silently + re-bootstrap from the template — there's nothing to preserve. + - If the existing file contains real per-week content (the dates + inside differ from the template's reference week), HALT and + require the caller to explicitly delete or rename the file first. + This is the "filename collision rule" from SKILL.md. + 4. Prunes _data// folders older than -DataRetentionDays (default 60) + so the directory doesn't accumulate stale payloads indefinitely. + +.PARAMETER ReportingSunday + Sunday of the reporting week (yyyy-MM-dd). If omitted, defaults to the most + recent complete Sun-Sat week relative to the system clock. + +.PARAMETER Force + Skip the collision check and overwrite any existing file. + +.PARAMETER DataRetentionDays + How many days of _data// folders to keep before pruning. 
Default 60. + +.PARAMETER SkillRoot + Path to the skill's assets folder (the folder that contains templates\). + Defaults to two levels above this script. + +.EXAMPLE + .\bootstrap-report.ps1 + # Default: latest complete week, halt on collision + +.EXAMPLE + .\bootstrap-report.ps1 -ReportingSunday 2026-05-31 -Force + +.OUTPUTS + Prints the absolute path of the newly created report file. +#> +[CmdletBinding()] +param( + [string]$ReportingSunday, + [switch]$Force, + [int]$DataRetentionDays = 60, + [string]$SkillRoot +) +$ErrorActionPreference = 'Stop' + +# Locate the assets folder + canonical template +if (-not $SkillRoot) { + # This script lives at /assets/scripts/bootstrap-report.ps1, so go up 2 levels + # to reach /assets/. Templates live at /assets/templates/. + $SkillRoot = Split-Path -Parent (Split-Path -Parent $PSCommandPath) +} +$template = Join-Path $SkillRoot 'templates\report-template.html' +if (-not (Test-Path $template)) { + throw "Canonical template not found at $template. Pass -SkillRoot (the assets folder) if running outside the skill folder." 
+} + +# Compute the reporting Sunday +if (-not $ReportingSunday) { + $today = [datetime]::Today + # Most recent Sunday strictly before today, OR today if today is Sunday + $offset = ($today.DayOfWeek.value__ + 7) % 7 # 0..6 days back to the previous Sunday + $sunday = $today.AddDays(-$offset) + # If today is Sunday but it's still early in the day, prefer the prior complete week + if ($today.DayOfWeek -eq [DayOfWeek]::Sunday -and (Get-Date).Hour -lt 6) { + $sunday = $sunday.AddDays(-7) + } + $ReportingSunday = $sunday.ToString('yyyy-MM-dd') +} +# Validate the format strictly: the value is embedded in file paths, and [datetime]::Parse is culture-dependent and lenient. +[void][datetime]::ParseExact($ReportingSunday, 'yyyy-MM-dd', [System.Globalization.CultureInfo]::InvariantCulture) + +# Paths +$reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' +$dataDir = Join-Path $reportDir "_data\$ReportingSunday" +$out = Join-Path $reportDir "oncall-wow-report-$ReportingSunday.html" +New-Item -ItemType Directory -Force $reportDir | Out-Null +New-Item -ItemType Directory -Force $dataDir | Out-Null + +# Read the template's reference dates so we can detect "unfilled stub" collisions. +# A reliable signal of "this file is the template stub": MULTIPLE markers all +# still match the template. We check title, the meta-line dates, AND the first +# KPI value — any divergence means real content has been written. +$templateText = [IO.File]::ReadAllText($template) + +function Get-FingerprintMarkers([string]$text) { + $m = @{} + if ($text -match '([^<]+?)') { $m['title'] = $Matches[1].Trim() } + if ($text -match '
\s*([^<]+)') { $m['metaDate'] = $Matches[1].Trim() } + if ($text -match 'Generated\s+([^<]+?)') { $m['generated'] = $Matches[1].Trim() } + # First KPI tile's value (e.g. "10.58 B"). Differs week-to-week. + if ($text -match '
\s*
[^<]+
\s*
([^<]+?)
') { $m['firstKpi'] = $Matches[1].Trim() } + return $m +} + +$templateMarkers = Get-FingerprintMarkers $templateText + +# Collision check +if ((Test-Path $out) -and -not $Force) { + $existingText = [IO.File]::ReadAllText($out) + $existingMarkers = Get-FingerprintMarkers $existingText + + # "Unfilled stub" requires ALL markers to match the template AND the file size + # to be within 5% of the template's. ANY divergence (a single value updated, + # a single KPI populated, sections added) means real content exists. + $allMatch = $true + foreach ($k in $templateMarkers.Keys) { + if ($existingMarkers[$k] -ne $templateMarkers[$k]) { $allMatch = $false; break } + } + $sizeRatio = (Get-Item $out).Length / [Math]::Max(1, (Get-Item $template).Length) + $sizeClose = ($sizeRatio -ge 0.95) -and ($sizeRatio -le 1.05) + + $isUnfilledStub = $allMatch -and $sizeClose + + if ($isUnfilledStub) { + Write-Warning "Existing $out is an unfilled template stub (all template fingerprints match, size within 5%). Re-bootstrapping from the template." + } else { + $divergence = @() + foreach ($k in $templateMarkers.Keys) { + if ($existingMarkers[$k] -ne $templateMarkers[$k]) { + $divergence += " $k`: template='$($templateMarkers[$k])' existing='$($existingMarkers[$k])'" + } + } + if (-not $sizeClose) { + $divergence += " size: template=$((Get-Item $template).Length) bytes existing=$((Get-Item $out).Length) bytes ratio=$([Math]::Round($sizeRatio,2))x" + } + # -ErrorAction Continue: with $ErrorActionPreference = 'Stop', a bare Write-Error + # terminates the script before 'exit 2' below can run. + Write-Error -ErrorAction Continue -Message @" +A populated report already exists for the same Sunday bucket: + $out + +Divergence from the template (which is why this is NOT classified as an unfilled stub): +$($divergence -join "`n") + +Per the SKILL.md filename-collision rule, do NOT silently overwrite. Either: + 1. Open the existing report, list its top-3 findings, and confirm what changed + in the new data before regenerating. Then re-run with -Force. + 2. Rename / delete the existing file and re-run. 
+"@ + exit 2 + } +} + +# Bootstrap +Copy-Item $template $out -Force +Write-Host "Bootstrapped $out from $template" +Write-Host "Data folder: $dataDir" + +# Prune old _data folders +$dataRoot = Join-Path $reportDir '_data' +if (Test-Path $dataRoot) { + $cutoff = (Get-Date).AddDays(-$DataRetentionDays) + $oldFolders = Get-ChildItem $dataRoot -Directory | Where-Object { + # Only prune date-named folders (yyyy-MM-dd); skip the current run's folder + $_.Name -match '^\d{4}-\d{2}-\d{2}$' -and + $_.FullName -ne $dataDir -and + $_.LastWriteTime -lt $cutoff + } + if ($oldFolders) { + Write-Host "Pruning $($oldFolders.Count) _data folder(s) older than $DataRetentionDays days:" + $oldFolders | ForEach-Object { + Write-Host " removing $($_.FullName) (last write $($_.LastWriteTime.ToString('yyyy-MM-dd')))" + Remove-Item -Recurse -Force $_.FullName + } + } +} + +# Print the path so callers can capture it +Write-Output $out diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bucket-trends.js b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bucket-trends.js new file mode 100644 index 00000000..c602fba3 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/bucket-trends.js @@ -0,0 +1,202 @@ +#!/usr/bin/env node +/** + * bucket-trends.js — Bucket every error code into 60-day trend categories. + * + * Input: a Kusto MCP JSON result file from a query of the form: + * + * materialized_view('ErrorStatsMetrics') + * | where EventInfo_Time between (datetime() .. datetime()) + * | where isnotempty(error_code) and error_code != 'success' + * | summarize errs=sum(countOverall), + * devs=dcount_hll(hll_merge(countDevicesHll)) + * by week=startofweek(EventInfo_Time), error_code + * | where week < datetime() // drop partial end-week! + * | order by error_code asc, week asc + * + * (Use dcount_hll on countDevicesHll, NOT sum(countDevices) — see ../docs/kusto-cheatsheet.md.) 
+ * + * Usage: + * node bucket-trends.js + * [--start=YYYY-MM-DD] [--end=YYYY-MM-DD] # inclusive start, EXCLUSIVE end (week-bucket) + * [--peak-floor=N] [--metric=devs|reqs] + * [--key=error_code|unified_error_type] [--summary] [--json=path] + * + * --start defaults to the second-earliest week in the data (drops partial start week). + * --end has no default. When it is omitted and the data has >= 4 weeks, the script + * WARNS AND DROPS the most recent week if its total (for the chosen metric) is below + * 30% of the median of the prior 3 weeks, because a partial end-week will turn every + * error into a fake -99% improvement. A real fleet collapse looks the same, so pass + * --end explicitly when in doubt. + * + * --metric=devs (default) buckets on weekly device counts (catches errors hitting more users) + * --metric=reqs buckets on weekly request counts (catches per-device retry storms) + * --peak-floor defaults to 10000 for devs and 100000 for reqs; keys whose peak week is + * below the floor are skipped. + * --key names the grouping column in the input (default error_code). + * + * Run BOTH metrics and union the regression sets. Reporting on devices alone misses + * retry-storm spikes (e.g. kdfv2_key_derivation_error: 262 -> 5,374 reqs on ~57 devices). + * + * Buckets (computed across the kept weeks, defaulting to all-but-the-first): + * regression: delta > +15% (and not a single-week spike) + * spike: peak >= 3 x mean(other weeks) and peak > 1.5 x max(first,last) + * improvement: delta < -15% + * flat: otherwise + * + * Output flags (NEW v8): + * --summary Suppress the verbose header (all-weeks, trend-weeks, and metric + * lines). Warnings are still printed. Print only the bucket + * counts + the per-bucket rows. Recommended for the standard + * skill workflow. + * --json=path Also write a structured JSON sidecar with the bucketed + * result for programmatic consumption (e.g. by a future + * sparkline-data-generator script). The sidecar shape is: + * { + * "metric": "devs" | "reqs", + * "key": "error_code" | "unified_error_type", + * "peakFloor": N, + * "weeks": [iso, iso, ...], + * "droppedPartial": iso | null, + * "buckets": { + * "regression": [ { code, first, last, peak, delta, series: [N,N,...] }, ... ], + * "spike": [...], + * "improvement": [...], + * "flat": [...] 
+ * } + * } + */ +const fs = require('fs'); + +const args = process.argv.slice(2); +const file = args.find(a => !a.startsWith('--')); +const startArg = (args.find(a => a.startsWith('--start=')) || '').split('=')[1]; +const endArg = (args.find(a => a.startsWith('--end=')) || '').split('=')[1]; +const metric = ((args.find(a => a.startsWith('--metric=')) || '').split('=')[1] || 'devs').toLowerCase(); +const summary = args.includes('--summary'); +const jsonArg = (args.find(a => a.startsWith('--json=')) || '').split('=')[1]; +if (!['devs', 'reqs'].includes(metric)) { + console.error(`--metric must be 'devs' or 'reqs', got '${metric}'`); + process.exit(1); +} +const defaultFloor = metric === 'reqs' ? 100000 : 10000; +const peakFloor = +((args.find(a => a.startsWith('--peak-floor=')) || '').split('=')[1] || defaultFloor); +const metricIdx = metric === 'reqs' ? 0 : 1; // [errs, devs] tuple +const keyCol = ((args.find(a => a.startsWith('--key=')) || '').split('=')[1] || 'error_code'); + +if (!file) { + console.error('Usage: node bucket-trends.js [--start=YYYY-MM-DD] [--end=YYYY-MM-DD] [--peak-floor=N] [--metric=devs|reqs] [--key=error_code|unified_error_type] [--summary] [--json=path]'); + process.exit(1); +} + +const d = JSON.parse(fs.readFileSync(file, 'utf8')); +// Schema row can be either an object {col: type} (MCP) or a string array [col, col, ...] +// (from assets/scripts/run-kql.ps1). Detect and locate the key column index so we +// don't assume positional order. +const schemaRow = d.results.items[0]; +let colNames; +if (Array.isArray(schemaRow)) { + colNames = schemaRow.map(String); +} else if (schemaRow && typeof schemaRow === 'object') { + colNames = Object.keys(schemaRow); +} else { + throw new Error('First row of results.items must be the schema row'); +} +const iWeek = colNames.indexOf('week') >= 0 ? 
colNames.indexOf('week') : colNames.indexOf('wk'); +const iCode = colNames.indexOf(keyCol); +const iErrs = colNames.indexOf('errs'); +const iDevs = colNames.indexOf('devs'); +if (iWeek < 0 || iCode < 0 || iErrs < 0 || iDevs < 0) { + throw new Error(`Schema must include week|wk, ${keyCol}, errs, devs. Got [${colNames.join(', ')}]`); +} + +const items = d.results.items.slice(1); +const series = {}; +for (const r of items) { + const w = r[iWeek], code = r[iCode], errs = r[iErrs], devs = r[iDevs]; + if (!series[code]) series[code] = {}; + series[code][w] = [errs, devs]; +} +const weeks = [...new Set(items.map(r => r[iWeek]))].sort(); +const startISO = startArg ? `${startArg}T00:00:00Z` : weeks[1]; // drop partial start week by default +const endISO = endArg ? `${endArg}T00:00:00Z` : null; // exclusive cutoff + +// --- Partial end-week detection --------------------------------------------- +// Compute the total devices/requests per bucket as a proxy for completeness. +// If the most recent bucket is < 30% of the median of the prior 3 buckets, it's +// almost certainly partial — drop it and warn. This catches the common case of +// running the report at 09:00 UTC Sunday and getting 9 hours of data in the +// "last week" bucket. (Caveat: real fleet collapses also look like this; warn, +// don't crash.) 
+function bucketTotal(w) { + let t = 0; + for (const wd of Object.values(series)) { + const v = wd[w]; + if (v) t += v[metricIdx]; + } + return t; +} +const totals = weeks.map(w => ({ w, t: bucketTotal(w) })); +const medianOf = arr => { const s = [...arr].sort((a,b)=>a-b); return s[Math.floor(s.length/2)] || 0; }; +let droppedPartial = null; +if (!endArg && weeks.length >= 4) { + const last = totals[totals.length - 1]; + const prevMedian = medianOf(totals.slice(-4, -1).map(x => x.t)); + if (prevMedian > 0 && last.t < prevMedian * 0.3) { + droppedPartial = last.w; + console.warn(`[bucket-trends] WARN: dropping likely-partial end bucket ${last.w} (total=${last.t.toLocaleString()} vs median-of-prior-3=${prevMedian.toLocaleString()}). Pass --end=YYYY-MM-DD to override or filter in KQL.`); + } +} + +const keep = weeks.filter(w => w >= startISO && (endISO ? w < endISO : true) && w !== droppedPartial); +if (!summary) { + console.log('All weeks: ', weeks); + console.log('Trend weeks: ', keep, `(${keep.length} complete)`); + console.log('Metric: ', metric, `(peak floor=${peakFloor.toLocaleString()})`); +} +if (keep.length < 4) { + console.warn(`[bucket-trends] WARN: only ${keep.length} kept weeks — trend buckets will be unstable. 
Need >= 4 for meaningful regression/improvement classification.`); +} + +const buckets = { regression: [], spike: [], improvement: [], flat: [] }; +for (const [code, wd] of Object.entries(series)) { + const vals = keep.map(w => (wd[w] || [0, 0])[metricIdx]); + const peak = Math.max(...vals); + if (peak < peakFloor) continue; + const first = vals[0], last = vals[vals.length - 1]; + const f = first || 1; // divisor only: avoids divide-by-zero while `first` keeps the real value + const delta = (last - f) / f; + const sumOthers = vals.reduce((s, x) => s + x, 0) - peak; + const meanOthers = sumOthers / Math.max(1, vals.length - 1); + const isSpike = peak >= 3 * meanOthers && peak > Math.max(first, last) * 1.5; + let cat; + if (isSpike) cat = 'spike'; + else if (delta > 0.15) cat = 'regression'; + else if (delta < -0.15) cat = 'improvement'; + else cat = 'flat'; + buckets[cat].push({ code, first, last, peak, delta: +(delta * 100).toFixed(1), series: vals }); +} + +// Compact bucket-count line (always emitted, summary or verbose) +const countLine = ['regression','spike','improvement','flat'] + .map(k => `${k}=${buckets[k].length}`).join(' '); +console.log(`\nBucket counts (metric=${metric}, key=${keyCol}, peak-floor=${peakFloor.toLocaleString()}): ${countLine}`); + +for (const k of ['regression', 'improvement', 'spike', 'flat']) { + console.log(`\n=== ${k.toUpperCase()} (${buckets[k].length}) ===`); + buckets[k] + .sort((a, b) => b.peak - a.peak) + .forEach(r => { + console.log( + ` ${r.code.padEnd(60)} first=${String(r.first).padStart(11)} last=${String(r.last).padStart(11)} peak=${String(r.peak).padStart(11)} d=${r.delta >= 0 ? 
'+' : ''}${r.delta}% series=${JSON.stringify(r.series)}` + ); + }); +} + +// Optional structured JSON sidecar +if (jsonArg) { + const sidecar = { + metric, + key: keyCol, + peakFloor, + weeks: keep, + droppedPartial, + buckets: Object.fromEntries( + Object.entries(buckets).map(([k, arr]) => [ + k, + arr.sort((a, b) => b.peak - a.peak) + ]) + ) + }; + fs.writeFileSync(jsonArg, JSON.stringify(sidecar, null, 2)); + console.log(`\nWrote JSON sidecar -> ${jsonArg}`); +} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/find-suspect-prs.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/find-suspect-prs.ps1 new file mode 100644 index 00000000..10c2daef --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/find-suspect-prs.ps1 @@ -0,0 +1,102 @@ +<# +.SYNOPSIS + Find candidate PRs touching a class / file / method, across both broker/ and common/ in one run. + +.DESCRIPTION + Speeds up the PR-grep workflow in SKILL.md Step 4. Given a class name (or + arbitrary regex), runs `git log -S` (pickaxe) AND `git log --grep` against + broker/ and then common/ over the supplied window, de-duplicates by commit, + then prints a unified table sorted by date (newest first). + + Use this AFTER you have identified the throw-site / wrapper class from the + Originator pre-check (assets/queries/error-message-and-location.kql). + +.PARAMETER Symbol + String to search for in commit diffs (passed to `git log -S`). Typically + the class name or method that hosts the throw site, e.g. + 'ExceptionAdapter', 'clientExceptionFromException', 'getKnownAuthorityResult'. + +.PARAMETER GrepRegex + Optional regex for `git log --grep` (commit message, case-insensitive). + Defaults to the regex-escaped $Symbol. + +.PARAMETER Since + Inclusive start date (yyyy-MM-dd). Defaults to 28 days ago. + +.PARAMETER Until + Inclusive end date. Defaults to today. + +.PARAMETER RepoRoot + Root of the checkout that contains broker\ and common\. + Defaults to $env:USERPROFILE\Repos\android-complete. Override with -RepoRoot. 
+ +.EXAMPLE + .\find-suspect-prs.ps1 -Symbol ExceptionAdapter -Since 2026-04-01 + +.EXAMPLE + .\find-suspect-prs.ps1 -Symbol clientExceptionFromException -Since 2026-04-01 -Until 2026-05-09 + +.NOTES + Cites repos with the URL pattern in SKILL.md (broker -> ad-accounts-for-android, + common -> microsoft-authentication-library-common-for-android). +#> +[CmdletBinding()] +param( + [Parameter(Mandatory=$true)][string]$Symbol, + [string]$GrepRegex, + [string]$Since = (Get-Date).AddDays(-28).ToString('yyyy-MM-dd'), + [string]$Until = (Get-Date).ToString('yyyy-MM-dd'), + [string]$RepoRoot = (Join-Path $env:USERPROFILE 'Repos\android-complete') +) + +if (-not $GrepRegex) { $GrepRegex = [regex]::Escape($Symbol) } + +$repos = @( + @{ Name='broker'; Path=(Join-Path $RepoRoot 'broker'); UrlBase='https://github.com/identity-authnz-teams/ad-accounts-for-android/pull/' } + @{ Name='common'; Path=(Join-Path $RepoRoot 'common'); UrlBase='https://github.com/AzureAD/microsoft-authentication-library-common-for-android/pull/' } +) + +$results = @() +foreach ($r in $repos) { + if (-not (Test-Path $r.Path)) { Write-Warning "Repo path not found: $($r.Path)"; continue } + Push-Location $r.Path + try { + # Pickaxe: PRs whose diff added or removed the symbol + $pickaxeRaw = git log --since=$Since --until=$Until -S $Symbol --pretty=format:'%h|%ai|%an|%s' 2>$null + # Grep: PRs whose subject mentions the regex (case-insensitive) + $grepRaw = git log --since=$Since --until=$Until --pretty=format:'%h|%ai|%an|%s' --grep=$GrepRegex -i 2>$null + + $seen = @{} + foreach ($line in @($pickaxeRaw, $grepRaw | Where-Object { $_ })) { + foreach ($l in @($line)) { + if (-not $l) { continue } + $parts = $l -split '\|', 4 + if ($parts.Count -lt 4) { continue } + $sha = $parts[0] + if ($seen.ContainsKey($sha)) { continue } + $seen[$sha] = $true + # Try to pull the PR number out of the subject (#NNN at end of MS PR convention) + $prNum = $null + if ($parts[3] -match '#(\d{2,5})\b') { $prNum = $Matches[1] } + 
$results += [pscustomobject]@{ + Repo = $r.Name + Date = $parts[1].Substring(0, 10) + Author = $parts[2] + Sha = $sha + PR = if ($prNum) { '#' + $prNum } else { '' } + Url = if ($prNum) { $r.UrlBase + $prNum } else { '' } + Subject = $parts[3] + } + } + } + } finally { Pop-Location } +} + +if ($results.Count -eq 0) { + Write-Host "No PRs match in window $Since .. $Until for symbol '$Symbol'." + Write-Host " Tip: try a shorter symbol (just the class name), or widen -Since." + exit 0 +} + +$results | Sort-Object Date -Descending | Format-Table Repo, Date, Author, Sha, PR, @{n='Subject';e={$_.Subject.Substring(0, [Math]::Min(80, $_.Subject.Length))}} -AutoSize +Write-Host "" +Write-Host "PR URLs for citation in attribution cards:" +$results | Where-Object Url | Sort-Object Date -Descending | ForEach-Object { Write-Host " $($_.Repo) #$($_.PR.TrimStart('#')): $($_.Url)" } diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 new file mode 100644 index 00000000..3686259f --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/run-kql.ps1 @@ -0,0 +1,103 @@ +<# +.SYNOPSIS + Direct-REST Kusto query helper. Drop-in fallback for the Azure Kusto MCP server + when the MCP times out (the MCP has a 240 s budget and frequently exceeds it on + the per-error-code queries this skill needs). + +.DESCRIPTION + Acquires an Entra token via the local `az` CLI for the Kusto cluster, POSTs the + query to /v2/rest/query, and writes a JSON file whose schema matches what the + other helpers in this skill (bucket-trends.js, summarize-attribution.js) expect: + + { "results": { "items": [ + [colName0, colName1, ...], // first row = column-name list + [row0col0, row0col1, ...], + [row1col0, row1col1, ...], + ... + ] } } + + The `summarize-attribution.js --union` loader will auto-detect this array-form + schema (since the v8 update) — no transformer step needed. 
+ +.PARAMETER Query + KQL query text. Pass via single-quoted PowerShell here-string for safety. + +.PARAMETER Out + Output JSON file path. + +.PARAMETER Cluster + Kusto cluster URI (default: idsharedeus2 — the production Android Broker cluster). + +.PARAMETER Database + Database name (default: ad-accounts-android-otel). + +.PARAMETER TimeoutSec + HTTP timeout (default 300 s — Kusto itself has a 5-minute server-side query budget). + +.EXAMPLE + # Sanity check + .\run-kql.ps1 -Query 'print x=1' -Out test.json + +.EXAMPLE + # Pull the 60-day per-error-code trend + $q = @" +materialized_view('ErrorStatsMetrics') +| where EventInfo_Time between (datetime(2026-04-12) .. datetime(2026-06-07)) +| where isnotempty(error_code) and error_code != 'success' +| summarize errs = sum(countOverall), + devs = dcount_hll(hll_merge(countDevicesHll)) + by week = startofweek(EventInfo_Time), error_code +| where week < datetime(2026-06-07) +| order by error_code asc, week asc +"@ + .\run-kql.ps1 -Query $q -Out 60d-codes.json + +.NOTES + * Requires `az login` to have been run beforehand and the caller to have read + access to the cluster (Android Auth Client SDK security group). + * Runs queries in parallel from PowerShell jobs — see SKILL.md Step 2 for the + "5-queries-in-parallel" pattern. + * If your query payload is large (>50 KB returned), the JSON file may itself + be large — pipe to bucket-trends.js / summarize-attribution.js directly + rather than viewing in-band. +#> +[CmdletBinding()] +param( + [Parameter(Mandatory=$true)][string]$Query, + [Parameter(Mandatory=$true)][string]$Out, + [string]$Cluster = 'https://idsharedeus2.kusto.windows.net', + [string]$Database = 'ad-accounts-android-otel', + [int]$TimeoutSec = 300 +) +$ErrorActionPreference = 'Stop' + +# Acquire token via az CLI (works for users + managed identity) +$tok = az account get-access-token --resource $Cluster --query accessToken -o tsv 2>$null +if (-not $tok) { + throw "Failed to acquire token for $Cluster. 
Run 'az login' first and verify membership in the Android Auth Client SDK security group." +} + +$body = @{ csl = $Query; db = $Database } | ConvertTo-Json -Compress +$resp = Invoke-RestMethod -Uri "$Cluster/v2/rest/query" -Method Post ` + -Headers @{ Authorization = "Bearer $tok"; 'Content-Type' = 'application/json' } ` + -Body $body -TimeoutSec $TimeoutSec + +# Find the PrimaryResult table (Kusto returns multiple frame types; we want the data) +$primary = $resp | Where-Object { $_.FrameType -eq 'DataTable' -and $_.TableKind -eq 'PrimaryResult' } | Select-Object -First 1 +if (-not $primary) { + # Surface any error frames so the caller can see what went wrong + $err = $resp | Where-Object { $_.FrameType -eq 'DataSetCompletion' -and $_.HasErrors } | Select-Object -First 1 + if ($err) { throw "Kusto query failed with errors. Full response:`n$($resp | ConvertTo-Json -Depth 6)" } + throw 'No PrimaryResult table in response' +} + +# Convert to the canonical schema the JS helpers expect +$colNames = @($primary.Columns | ForEach-Object { $_.ColumnName }) +$items = New-Object System.Collections.ArrayList +[void]$items.Add($colNames) +foreach ($r in $primary.Rows) { [void]$items.Add($r) } + +$obj = @{ results = @{ items = $items } } +# UTF-8 without BOM — keeps emoji/diacritic data clean for downstream consumption +[IO.File]::WriteAllText($Out, ($obj | ConvertTo-Json -Depth 12 -Compress), [System.Text.UTF8Encoding]::new($false)) +Write-Host ("Saved {0} rows -> {1}" -f ($primary.Rows.Count), $Out) diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/summarize-attribution.js b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/summarize-attribution.js new file mode 100644 index 00000000..9d86a75c --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/summarize-attribution.js @@ -0,0 +1,225 @@ +#!/usr/bin/env node +/** + * summarize-attribution.js — Roll up WoW attribution slices for spike-attribution cards. 
+ * + * TWO INPUT MODES: + * + * 1) Per-dim files (legacy mode): one Kusto JSON per dimension, each tagged with + * --label=. Use this when you ran 7 separate per-dim queries. + * + * node summarize-attribution.js \ + * --label=span \ + * --label=calling_app \ + * --label=active_broker \ + * --label=broker_version \ + * --label=acct_type \ + * --label=shared_dev \ + * --label=client_sku + * + * Per-file schema: row[0] must include `error_code`, `wk`/`week`, `devs`/`countDevices`, + * and exactly one trailing string column (the dimension value). + * + * 2) Union mode (NEW, recommended for 2-week WoW attribution — one query covers all dims): + * + * node summarize-attribution.js --union + * + * Expected schema (any column order): + * dim string -- short label e.g. 'span', 'calling_app', 'broker_ver' + * wk | week datetime + * error_code string (or `error_type` — use --key=error_type to switch) + * val_string string } EITHER `val_string`+`val_bool` (Kusto union of + * val_bool bool } mixed-type slice columns) ... + * val string } ... OR a single `val` column + * devs long (use `dcount_hll(hll_merge(countDevicesHll))` upstream) + * errs long (optional — request count, used for retry-storm detection) + * + * The union form is what Step 5 of SKILL.md now recommends — 1 round-trip vs 7. + * See assets/queries/attr-union-by-dim.kql. + * + * Output: per error_code, per dimension, the top-5 values for each week (prior + curr), + * concentration % of curr-week total, and Δd / Δr vs prior week. + * + * IMPORTANT: when you build the source query, ALWAYS use + * dcount_hll(hll_merge(countDevicesHll)) + * for distinct device counts (HLL merging). `sum(countDevices)` double-counts! 
+ */ +const fs = require('fs'); + +// --- arg parsing --------------------------------------------------------- +const argv = process.argv.slice(2); +const inputs = []; // per-dim mode: { label, file } +let pendingLabel = null; +let unionFile = null; +let keyCol = 'error_code'; // override with --key=error_type for type cards +let topN = 5; +for (const a of argv) { + if (a === '--union') { /* next non-flag arg is the file */ pendingLabel = '__UNION__'; continue; } + if (a.startsWith('--union=')) { unionFile = a.split('=')[1]; pendingLabel = null; continue; } + if (a.startsWith('--key=')) { keyCol = a.split('=')[1]; continue; } + if (a.startsWith('--top=')) { topN = parseInt(a.split('=')[1], 10) || 5; continue; } + if (a.startsWith('--label=')) { pendingLabel = a.split('=')[1]; continue; } + if (pendingLabel === '__UNION__') { unionFile = a; pendingLabel = null; continue; } + inputs.push({ label: pendingLabel || 'unknown', file: a }); + pendingLabel = null; +} + +if (!unionFile && inputs.length === 0) { + console.error('Usage:\n node summarize-attribution.js --union [--key=error_code|error_type] [--top=N]\n node summarize-attribution.js --label= file1.json --label= file2.json ...'); + process.exit(1); +} + +// --- helpers -------------------------------------------------------------- +function fmt(n) { + if (n == null) return '–'; + if (Math.abs(n) >= 1e9) return (n / 1e9).toFixed(2) + 'B'; + if (Math.abs(n) >= 1e6) return (n / 1e6).toFixed(2) + 'M'; + if (Math.abs(n) >= 1e3) return (n / 1e3).toFixed(1) + 'k'; + return String(n); +} +function pct(num, den) { return den ? (100 * num / den).toFixed(1) : '0.0'; } +function delta(curr, prior) { + if (prior == null || prior === 0) return curr ? 
`NEW(+${fmt(curr)})` : '–'; + return ((curr - prior) / prior * 100).toFixed(1) + '%'; +} + +// --- per-dim file loader (legacy mode) ------------------------------------ +function loadSliceFile({ label, file }) { + const d = JSON.parse(fs.readFileSync(file, 'utf8')); + const rows = d.results.items; + const schemaRaw = rows[0]; + // Support both schema forms: object (MCP) and array (assets/scripts/run-kql.ps1). + let schema; + if (Array.isArray(schemaRaw)) { + schema = {}; + for (let i = 0; i < schemaRaw.length; i++) schema[String(schemaRaw[i])] = 'string'; + } else if (schemaRaw && typeof schemaRaw === 'object') { + schema = schemaRaw; + } else { + throw new Error(`${file}: first row of results.items must be the schema (column-name array or {col: type} object). Got: ${JSON.stringify(schemaRaw)}`); + } + const cols = Object.keys(schema); + const idxCode = cols.indexOf(keyCol); + let idxWeek = cols.indexOf('wk'); if (idxWeek < 0) idxWeek = cols.indexOf('week'); + let idxDevs = cols.indexOf('devs'); if (idxDevs < 0) idxDevs = cols.indexOf('countDevices'); + let idxErrs = cols.indexOf('errs'); if (idxErrs < 0) idxErrs = cols.indexOf('countOverall'); + if (idxCode < 0 || idxWeek < 0 || idxDevs < 0) { + throw new Error(`${file}: schema must include ${keyCol}, wk|week, devs|countDevices. Got [${cols.join(', ')}]`); + } + // Find the dim column. When schema was provided as an array (run-kql.ps1) we + // don't have type info, so fall back to "any remaining column" (typically the + // last one in the SELECT). 
+ let idxDim = cols.findIndex((c, i) => + i !== idxCode && i !== idxWeek && i !== idxDevs && i !== idxErrs && schema[c] === 'string'); + if (idxDim < 0) { + idxDim = cols.findIndex((c, i) => + i !== idxCode && i !== idxWeek && i !== idxDevs && i !== idxErrs); + } + if (idxDim < 0) throw new Error(`${file}: no dimension column found`); + + const map = {}; + for (const r of rows.slice(1)) { + const code = r[idxCode], wk = r[idxWeek]; + const dim = (r[idxDim] === null || r[idxDim] === '') ? '(blank)' : r[idxDim]; + const devs = r[idxDevs] || 0; + const errs = idxErrs >= 0 ? (r[idxErrs] || 0) : 0; + const slot = ((map[code] ||= {})[wk] ||= {})[dim] ||= { devs: 0, errs: 0 }; + slot.devs += devs; slot.errs += errs; + } + return { label, map }; +} + +// --- union-mode loader (NEW) --------------------------------------------- +function loadUnion(file) { + const d = JSON.parse(fs.readFileSync(file, 'utf8')); + const rows = d.results.items; + const schemaRaw = rows[0]; + // Two schema shapes are supported: + // (a) Object form (MCP tool): { dim: 0, wk: 1, ... } — keys are column names + // (b) Array form (REST helper assets/scripts/run-kql.ps1): ['dim', 'wk', ...] + // Detect and normalize to an object map { colName -> index }. + let schema; + if (Array.isArray(schemaRaw)) { + schema = {}; + for (let i = 0; i < schemaRaw.length; i++) schema[String(schemaRaw[i])] = i; + } else if (schemaRaw && typeof schemaRaw === 'object') { + schema = schemaRaw; + } else { + throw new Error(`Union file ${file}: first row of results.items must be the schema (column-name array or {col: index} object). 
Got: ${JSON.stringify(schemaRaw)}`); + } + const cols = Object.keys(schema); + const idx = name => cols.indexOf(name); + const idxDim = idx('dim'); + const idxCode = idx(keyCol); + let idxWeek = idx('wk'); if (idxWeek < 0) idxWeek = idx('week'); + let idxDevs = idx('devs'); if (idxDevs < 0) idxDevs = idx('countDevices'); + let idxErrs = idx('errs'); if (idxErrs < 0) idxErrs = idx('countOverall'); + // Kusto auto-renames duplicate column names from union branches: a column + // declared `val_string` in two `union` legs (one typed string, one typed + // bool(null)) becomes `val_string_string` and `val_string_bool`. Accept + // those as synonyms so the union KQL doesn't need a per-leg cast. + const idxValS = + idx('val_string') >= 0 ? idx('val_string') : + idx('val_string_string') >= 0 ? idx('val_string_string') : + idx('val'); + const idxValB = + idx('val_bool') >= 0 ? idx('val_bool') : + idx('val_string_bool'); + if (idxDim < 0 || idxCode < 0 || idxWeek < 0 || idxDevs < 0 || idxValS < 0) { + throw new Error(`Union file ${file}: schema must include dim, ${keyCol}, wk|week, devs|countDevices, val_string|val|val_string_string (and optionally val_bool|val_string_bool). Got [${cols.join(', ')}]`); + } + // perDim[label].map[code][wk][dimVal] = { devs, errs } + const byDim = {}; + for (const r of rows.slice(1)) { + const label = r[idxDim]; + const code = r[idxCode]; + const wk = r[idxWeek]; + const valS = r[idxValS]; + const valB = idxValB >= 0 ? r[idxValB] : null; + let v; + if (valS !== null && valS !== undefined && valS !== '') v = valS; + else if (valB !== null && valB !== undefined) v = String(valB); + else v = '(blank)'; + const devs = r[idxDevs] || 0; + const errs = idxErrs >= 0 ? (r[idxErrs] || 0) : 0; + const target = byDim[label] ||= { label, map: {} }; + const slot = ((target.map[code] ||= {})[wk] ||= {})[v] ||= { devs: 0, errs: 0 }; + slot.devs += devs; slot.errs += errs; + } + return Object.values(byDim); +} + +const slices = unionFile ? 
loadUnion(unionFile) : inputs.map(loadSliceFile); + +// --- output -------------------------------------------------------------- +const universe = {}; +for (const s of slices) { + for (const [code, wks] of Object.entries(s.map)) { + for (const wk of Object.keys(wks)) ((universe[code] ||= {})[wk] = true); + } +} +const codes = Object.keys(universe).sort(); + +for (const code of codes) { + const wks = Object.keys(universe[code]).sort(); + const prior = wks[0], curr = wks[wks.length - 1]; + console.log(`\n========== ${code} (prior=${prior?.slice(0,10)} curr=${curr?.slice(0,10)}) ==========`); + for (const s of slices) { + const priorMap = s.map[code]?.[prior] || {}; + const currMap = s.map[code]?.[curr] || {}; + const allVals = new Set([...Object.keys(priorMap), ...Object.keys(currMap)]); + if (allVals.size === 0) continue; + const totC = Object.values(currMap).reduce((a, b) => a + b.devs, 0); + const rows = [...allVals].map(v => ({ + v, + pDev: priorMap[v]?.devs || 0, + cDev: currMap[v]?.devs || 0, + pErr: priorMap[v]?.errs || 0, + cErr: currMap[v]?.errs || 0, + })).sort((a, b) => b.cDev - a.cDev).slice(0, topN); + console.log(`\n -- ${s.label} (curr-total devices=${fmt(totC)})`); + for (const r of rows) { + const share = pct(r.cDev, totC); + console.log(` ${share.padStart(5)}% ${fmt(r.cDev).padStart(8)}d d_dev ${delta(r.cDev, r.pDev).padStart(8)} d_req ${delta(r.cErr, r.pErr).padStart(8)} ${r.v}`); + } + } +} diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/scripts/validate-report.ps1 b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/validate-report.ps1 new file mode 100644 index 00000000..148871f5 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/scripts/validate-report.ps1 @@ -0,0 +1,306 @@ +<# +.SYNOPSIS + Validate a generated OCE weekly report HTML before publishing. + +.DESCRIPTION + Runs all required pre-publish checks per SKILL.md "Output checklist": + 1. 
No stale-template tokens ({{...}} placeholders or "EXAMPLE CONTENT BELOW" sentinel). + 2. No `devs` / `reqs` in user-facing text (only allowed inside
 KQL blocks).
+      3. No U+FFFD (Unicode replacement character) — catches mojibake from emoji edits.
+      4. Section 2 callouts are siblings, NOT nested. Tracks 
open/close depth + from #attention to #trend60d; the depth must return to 0 between callouts. + 5. (Informational) Reports HTML size and number of
openings. + 6. KPI tiles have data-spark coverage (>= half) + overall chart coverage (>=15). + 7. Traffic-attribution sub-block color diversity (tri-state convention). + 8. Code-attribution depth — each .attr-card has the full 8-field Originator block. + 9. Attribution-card layout sanity (v8 regression): + 9a. .attr-card cards-touching guard — CSS must define explicit margin + on .attr-card so successive cards don't visually run together when + the body emits them without an .attr-grid wrapper. + 9b. .dim-row name-overflow guard — CSS must define text-overflow:ellipsis + on .dim-name / .dim-row > span:first-of-type AND min-width:0 on + .dim / .dim-row so long calling-app / version names truncate inside + their dim card rather than bleeding out. + + Exits with non-zero status if any HARD check fails (stale tokens, devs/reqs leak, + U+FFFD, unbalanced div depth, missing layout-guard CSS). + +.PARAMETER Path + Absolute path to the report file. Defaults to the current week's report under + $env:USERPROFILE\android-oce-reports\. + +.EXAMPLE + .\validate-report.ps1 + .\validate-report.ps1 -Path C:\path\to\oncall-wow-report-2026-05-03.html +#> +[CmdletBinding()] +param( + [string]$Path +) + +# Default: pick the most-recent oncall-wow-report-*.html in the user's reports folder +if (-not $Path) { + $reportDir = Join-Path $env:USERPROFILE 'android-oce-reports' + $latest = Get-ChildItem $reportDir -Filter 'oncall-wow-report-*.html' -ErrorAction SilentlyContinue | + Sort-Object LastWriteTime -Descending | Select-Object -First 1 + if (-not $latest) { + Write-Error "No oncall-wow-report-*.html found in $reportDir. Pass -Path explicitly." 
+ exit 2 + } + $Path = $latest.FullName +} + +if (-not (Test-Path $Path)) { + Write-Error "Report file not found: $Path" + exit 2 +} + +$failures = @() +$warnings = @() + +function Add-Fail($msg) { $script:failures += $msg; Write-Host " [FAIL] $msg" -ForegroundColor Red } +function Add-Warn($msg) { $script:warnings += $msg; Write-Host " [WARN] $msg" -ForegroundColor Yellow } +function Pass($msg) { Write-Host " [OK] $msg" -ForegroundColor Green } + +Write-Host "" +Write-Host "Validating: $Path" +Write-Host ("Size: {0:N0} bytes" -f (Get-Item $Path).Length) +Write-Host "" + +# ---- 1. Stale tokens / EXAMPLE sentinel ---- +$stale = Select-String -Path $Path -Pattern '\{\{|EXAMPLE CONTENT BELOW|EXAMPLE_' +if ($stale.Count -gt 0) { + Add-Fail "Stale template tokens found ($($stale.Count)). First few:" + $stale | Select-Object -First 5 | ForEach-Object { Write-Host " L$($_.LineNumber): $($_.Line.Trim().Substring(0, [Math]::Min(110, $_.Line.Trim().Length)))" } +} else { + Pass "No stale {{...}} tokens or EXAMPLE sentinel" +} + +# ---- 2. devs / reqs in user-facing text ---- +# Allowed: occurrences inside
...
KQL blocks. +$content = [System.IO.File]::ReadAllText($Path) +$contentNoCode = [regex]::Replace($content, '(?s)]*>.*?
', '') +$contentNoCode = [regex]::Replace($contentNoCode, '(?s)]*>.*?', '') +$drMatches = [regex]::Matches($contentNoCode, '\b(devs|reqs)\b', 'IgnoreCase') +if ($drMatches.Count -gt 0) { + Add-Fail "Found $($drMatches.Count) devs/reqs occurrence(s) in user-facing text (use 'devices' / 'requests'). First few contexts:" + $drMatches | Select-Object -First 5 | ForEach-Object { + $ctxStart = [Math]::Max(0, $_.Index - 40) + $ctxLen = [Math]::Min(100, $contentNoCode.Length - $ctxStart) + $ctx = $contentNoCode.Substring($ctxStart, $ctxLen) -replace '\s+', ' ' + Write-Host " ...$ctx..." + } +} else { + Pass "No devs/reqs in user-facing text" +} + +# ---- 3. U+FFFD (mojibake from emoji edits) ---- +$bytes = [System.IO.File]::ReadAllBytes($Path) +$text = [System.Text.Encoding]::UTF8.GetString($bytes) +$ufffd = ($text.ToCharArray() | Where-Object { $_ -eq [char]0xFFFD }).Count +if ($ufffd -gt 0) { + Add-Fail "$ufffd U+FFFD replacement character(s) found (mojibake). First context:" + $i = $text.IndexOf([char]0xFFFD) + $start = [Math]::Max(0, $i - 30); $end = [Math]::Min($text.Length, $i + 30) + Write-Host " ...$($text.Substring($start, $end - $start) -replace "`r?`n", ' ')..." +} else { + Pass "No U+FFFD (no mojibake)" +} + +# ---- 4. Section 2 div balance ---- +$lines = Get-Content $Path +$startIdx = -1; $endIdx = -1 +for ($i = 0; $i -lt $lines.Count; $i++) { + if ($lines[$i] -match 'id="attention"') { $startIdx = $i } + if ($lines[$i] -match 'id="trend60d"') { $endIdx = $i; break } +} +if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { + $depth = 0 + for ($i = $startIdx; $i -le $endIdx; $i++) { + if ($null -eq $lines[$i]) { continue } + $depth += ([regex]::Matches($lines[$i], '')).Count + } + if ($depth -ne 0) { + Add-Fail "Section 2 (attention block) has unbalanced
s; net depth at end = $depth (expected 0). Likely cause: a callout is missing its closing
, which makes the next callout nest inside it." + } else { + Pass "Section 2 div balance OK (depth returns to 0)" + } +} else { + Add-Warn "Could not locate the attention block (#attention / #trend60d anchors). Skipping div-balance check." +} + +# ---- 5. Informational: callout count + nested-callout sanity ---- +$calloutOpens = ([regex]::Matches($content, '
half lack +# data-spark, the rebuild dropped them — fail the build. +# 6b. OVERALL (WARN): total chart elements should be ~30+ (8 KPI sparks + +# ~10 trend rows + ~12 WoW-table rows). Warn if under 15. +$sparkCount = ([regex]::Matches($content, 'data-spark=')).Count +$trendCount = ([regex]::Matches($content, 'data-trend=')).Count +$inlineSvg = ([regex]::Matches($content, ']*class="?sparkline')).Count +$kpiTiles = ([regex]::Matches($content, '
Code attribution
` +# must be followed (within the same card) by an `origin-label` row. +$codeAttrBlocks = ([regex]::Matches($content, '
Code attribution
')).Count +$originLabels = ([regex]::Matches($content, 'class="origin-label">Originator')).Count +if ($codeAttrBlocks -ge 1) { + if ($originLabels -lt $codeAttrBlocks) { + Add-Fail "$codeAttrBlocks Code-attribution block(s) but only $originLabels have an Originator row. Each card needs the full 8-field structure (Originator / Top throw site / Wrapper / Caller hot-spots / Underlying cause / Top error_messages / Likely PRs / Next step). See assets/docs/code-attribution-template.md." + } else { + Pass "All $codeAttrBlocks code-attribution block(s) have full 8-field structure" + } +} + +# Cheap nested-callout heuristic: scan the attention block for any callout that +# opens before the previous callout closes. We approximate by tracking depth. +if ($startIdx -ge 0 -and $endIdx -gt $startIdx) { + $depthOuter = 0; $nestedAt = @() + for ($i = $startIdx; $i -le $endIdx; $i++) { + if ($null -eq $lines[$i]) { continue } + # Match the callout container itself, not callout-title. The class can be + # `callout`, `callout urgent`, `callout watch`, `callout win`, etc. — but + # never `callout-title`. Require a space or end-of-class-attr after. + if ($lines[$i] -match '
')).Count + } + if ($nestedAt.Count -gt 0) { + Add-Fail "Nested callout detected at line(s): $($nestedAt -join ', '). Each callout in Section 2 must be a SIBLING, not nested inside another callout." + } else { + Pass "No nested callouts in Section 2" + } +} + +# ---- 9. Attribution-card layout sanity (v8 regression — cards touching + dim-row bleed) ---- +# Two layout bugs hit the v8 rebuild and forced manual CSS patches mid-publish. +# Both have CSS fixes baked into assets/templates/report-template.html now, but the validator +# catches the markup-side preconditions so a future hand-rolled body that +# diverges from the template is flagged before publish. +# +# 9a. Cards-touching guard: if the report has .attr-card outside any .attr-grid +# wrapper AND the CSS in + + +
+ +
+
+

Android Broker · Weekly On-Call Report

+
+ Sun May 3 → Sat May 9, 2026  vs  Apr 26 → May 2  ·  + 60-day window: Mar 8 → May 3 (8 complete weeks)  ·  + Source: android_spans materialized views  ·  + Generated 2026-05-09 +
+
+ v6 · Live data +
+ + + + +

📊 Top-line health — auth-only denominator

+
+
+
Silent auth requests (week)
+
10.58 B
+
+2.3% WoW
+
+
+
+
Silent auth reliability (req)
+
73.34%
+
+0.68 pp WoW (improving)
+
+
+
+
Silent auth reliability (dev)
+
82.52%
+
+1.34 pp WoW (improving)
+
+
+
Interactive auth requests
+
10.29 M
+
+7.5% WoW
+
+
+
+
Interactive reliability (dev)
+
58.43%
+
+0.74 pp WoW (improving)
+
+
+
Interactive devices
+
8.17 M
+
+5.6% WoW
+
+
+
+
Latest broker (16.0.1)
+
70.8%
+
+15.4 pp share WoW (rollout complete)
+
+
+
p95 AcquireTokenSilent
+
5,916 ms
+
−45 ms (−0.8%) WoW
+
+
+ + +

🚨 Things that need attention this week

+ +
+
ℹ️ Denominator caveat — read this first
+

The headline BrokerAdoptionStats device count fell 18.6% WoW (1.52 B → 1.24 B), but this is not a real fleet shrink. The drop is fully explained by three low-value spans deflating as the 16.0.1 rollout completes:

+
    +
  • OnUpgradeReceiver: 438 M → 151 M events (−65.5%) — fires once per app upgrade; tapers naturally as 16.0.1 finishes deploying. May also be impacted by historical goAsync() refactors that allow the OS to kill the receiver before the span flushes.
  • +
  • SecretKeyWrapping: 329 M → 251 M (−23.7%) — downstream of fewer keystore ops in the OnUpgradeReceiver path.
  • +
  • WrappedKeyAlgorithmIdentifier: 135 M → 87 M (−35.3%) — same downstream cause.
  • +
+

The auth-only denominator (Silent ∪ Interactive) is up: silent countRequests +2.3%, interactive +7.5%, silent device count flat (1.55 B → 1.53 B). All reliability % and per-app figures in this report use the auth-only denominator. Real users are unaffected by the device-count drop.

+
+ +
+
🔴 WoW regressions (last 7 days vs prior 7) — sorted by current-week devices, descending
+

Tags: NEW first appeared this week or last; 60d↑ also rising on the 60-day window; broker / eSTS / Android / env = originator. Built from the standard WoW table union with wow-movers.kql so small-but-recent spikes appear alongside the high-volume movers.

+
+ + +
+
+ EXAMPLE_error_code + devicesEXAMPLE 65 K + Δ WoWEXAMPLE +6.1% + on 16.0.1EXAMPLE 73% + + broker + 60d↑ +52% + +
+
EXAMPLE one-line narrative: throw site common/SomeClass.someMethod:NN, dominant message, and the verdict. Keep this short — the deep dive is in the attribution card below.
+
Owner: EXAMPLE teamAttribution card →
+
+ +
+
+ +
+
🟡 Slow-burn 60-day regressions — rising on 60d window but flat WoW; codes that also moved WoW are in the red callout above with a 60d↑ tag
+
+ +
+
+ EXAMPLE_slow_burn_code + devicesEXAMPLE 4.5 M + Δ 60dEXAMPLE +56% + Δ requests 60dEXAMPLE +40% + on 16.0.1EXAMPLE 78% + + broker + +
+
EXAMPLE: WoW only +X%. Tracks 16.0.1 rollout share; one-line hypothesis or owner pointer.
+
+ +
+

See the 60-day trend section for the full ranked list.

+
+ +
+
🟢 Real wins this week
+
+ +
+
+ EXAMPLE_recovered_code + devicesEXAMPLE 834 K + Δ WoWEXAMPLE −86% + Δ requestsEXAMPLE −78% +
+
EXAMPLE: 100% pinned to broker 16.0.0; recovery is natural rolloff. Likely fix PR: common #EXAMPLE.
+
Watch: EXAMPLE residual cohort.
+
+ +
+
+ +
+
📊 Traffic shape — flat with mild interactive uptick
+

Auth volume is essentially flat in silent (+2.3% requests, −1.2% devices) and slightly up on interactive (+7.5% requests, +5.6% devices). Top calling apps all moved within ±5% on requests. No surge, no collapse, no sampling-rate change suspected. See → 📊 Traffic Analysis.

+
+ + +

📈 60-Day Trend Analysis — bucketed across 8 complete weeks

+ +
+ Methodology: Pulled all error codes from the ErrorStats view over the last 9 weeks. Dropped the partial start week (Mar 1). Kept all codes whose peak weekly device count ≥ 10 K. Bucketed each 8-week series by delta = (last − first) / first: + regression if delta > +15% and trajectory is monotonic-ish; ephemeral spike if peak ≥ 3× mean of surrounding weeks; improvement if delta < −15%; flat otherwise. Every code in the regression list gets a spike-attribution card below. +
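The bucketing rule in the methodology note can be sketched in a few lines of Node. This is an illustration only, not the logic in `bucket-trends.js`: the 15% and 3× thresholds come from the note, while the function name `classifyTrend`, the `maxDips` option, the reading of "monotonic-ish" as "at most one week-over-week dip", and the reading of "surrounding weeks" as "all weeks except an interior peak" are assumptions made for the example.

```javascript
// Illustrative sketch of the 60-day bucketing rule (assumptions noted inline).

/**
 * @param {number[]} series weekly device counts, oldest week first
 * @param {{threshold?: number, spikeFactor?: number, maxDips?: number}} opts
 * @returns {{bucket: string, delta: number}}
 */
function classifyTrend(series, opts = {}) {
  const { threshold = 0.15, spikeFactor = 3, maxDips = 1 } = opts;
  if (!Array.isArray(series) || series.length < 3) {
    throw new Error('classifyTrend needs at least 3 weekly values');
  }
  const first = series[0];
  const last = series[series.length - 1];
  // delta = (last - first) / first; a zero baseline is treated as unbounded growth.
  const delta = first > 0 ? (last - first) / first : (last > 0 ? Infinity : 0);

  // Assumption: "monotonic-ish" means at most maxDips week-over-week decreases.
  let dips = 0;
  for (let i = 1; i < series.length; i++) {
    if (series[i] < series[i - 1]) dips++;
  }

  // Assumption: "surrounding weeks" means every week except the peak, and the
  // peak must be an interior week (the series rose, then recovered).
  const peak = Math.max(...series);
  const peakIdx = series.indexOf(peak);
  const others = series.filter((_, i) => i !== peakIdx);
  const meanOthers = others.reduce((a, b) => a + b, 0) / others.length;
  const interior = peakIdx > 0 && peakIdx < series.length - 1;
  const isSpike = interior && meanOthers > 0 && peak >= spikeFactor * meanOthers;

  let bucket = 'flat';
  if (delta > threshold && dips <= maxDips) bucket = 'regression';
  else if (isSpike) bucket = 'spike';
  else if (delta < -threshold) bucket = 'improvement';
  return { bucket, delta };
}
```

Checking the regression rule before the spike rule means a series that steps up and stays up is reported as a regression, while one that peaks and falls back is reported as a spike. Whether the production script orders the checks the same way is not stated in the note.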
+ +
+
⚠️ True 60-day regressions — 5 codes
+ + + + + + + + + +
Error codeWk 1 devicesWk 8 devicesΔ over 8w60d sparklineTrajectory
no_tokens_found13.9 M23.7 M+70.6%monotonic up
unauthorized_client2.72 M3.37 M+23.6%monotonic up
Code:-631.8 K86.4 K+171.5%step-up at wk 6
unknown_crypto_error59.3 K78.4 K+32.4%U-shaped, climbing
null_pointer_error48.5 K70.7 K+45.9%monotonic up
+
+ +
+
Ephemeral 60-day spikes (peaked then recovered)
+ + + + + + + +
Error codeBaselinePeakNow60d sparkline
timed_out_execution17.9 M142.9 M (wk Apr 12)53.4 M
unknown_authority~1 K34.1 M (wk Apr 12)1.45 M
429 (eSTS rate-limit)~10218 K (wk Mar 22)2.5 K
+

Both unknown_authority (common #3082 ABBA deadlock fix) and timed_out_execution (broker #141 flight gating) are recovering. Recommendation: add an Aria guardrail at > 1 M devices/week for unknown_authority to detect any future excursion early.

+
+ +
+
True 60-day improvements
+ + + + + + + + + +
Error codeWk 1Wk 8ΔSparkline
timed_out36.1 M5.1 M−85.9%
invalid_scope1.92 M0.36 M−81.3%
timed_out_thread_pool_saturated1.64 M0.62 M−62.1%
illegal_argument_exception0.21 M0.19 M−7.5% (peak −62%)
null_object, device_network_not_available, access_denied, ONLY_SUPPORTS_ACCOUNT_MANAGER_ERROR_CODE, invalid_keyall −17% to −78% over 8 wks (see appendix)
+

Note: the timed_out drop and timed_out_execution climb are partly the same event — broker #141 reclassifies legacy timed_out into the more specific timed_out_execution. The reclassification is net-neutral but the new code is louder; treat the timed_out "win" with caution.

+
+ +
+
Flat on 60d (within ±15%)
+

io_error, no_account_found, invalid_grant, interaction_required, device_network_not_available_doze_mode, authorization_pending, expired_token, User cancelled, auth_cancelled_by_sdk, invalid_resource, invalid_request, device_registration_needed, Code:-1, Code:-2, Code:-8, operation_interrupted, ipc_return_null_cursor, device_needs_to_be_managed, Redirect url scheme not SSL protected, ipc_operation_not_supported_on_server_side, invalid_client, ipc_connection_error, unknown_error.

+
+ + +

🔎 Spike Attribution — one card per regression

+ +
+ Each card slices on broker version, span, active broker package, calling app, and sub-dimensions where data is available. Concentration thresholds: > 80% in a single value = strong attribution (red bar); 60–80% = medium; < 60% = broad/cross-cutting. Account-type and shared-device-mode dimensions are sourced from raw android_spans and shown when material. +
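The concentration thresholds above map onto a small helper. The sketch below is a hedged illustration: the tier boundaries (> 80%, 60–80%, < 60%) are taken from the note, but the names `concentrationTier` and `topConcentration` and the input shape (a plain object of device counts per dimension value) are hypothetical. As in the roll-up printed by `summarize-attribution.js`, the share is computed against the sum of per-value device counts, so it can overstate the total when a device appears under more than one value.

```javascript
// Sketch: classify how concentrated an error is within one dimension.

/** Map a share (0-100) to the tier names used in the section note. */
function concentrationTier(sharePct) {
  if (sharePct > 80) return 'strong';   // > 80% in a single value
  if (sharePct >= 60) return 'medium';  // 60-80%
  return 'broad';                       // < 60%, cross-cutting
}

/**
 * @param {Object<string, number>} devicesByValue e.g. { 'com.microsoft.teams': 1200 }
 * @returns {{value: string, sharePct: number, tier: string} | null}
 */
function topConcentration(devicesByValue) {
  const entries = Object.entries(devicesByValue);
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  if (total <= 0) return null; // nothing to attribute
  // Pick the dimension value with the largest device count.
  const [value, devices] = entries.reduce((best, e) => (e[1] > best[1] ? e : best));
  const sharePct = (100 * devices) / total;
  return { value, sharePct, tier: concentrationTier(sharePct) };
}
```

For example, passing illustrative counts of 986, 13, and 1 for three spans yields the first span as the top value in the `strong` tier, while a calling-app split whose largest value is near 35% lands in `broad`.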
+ +
+
+
+
+
no_tokens_found
+
Devices: 14.1 M → 23.7 M over 8wks  (+68.2%); WoW 22.9 M → 23.7 M (+3.6%)
+
+
60d regressionsilent pathbroad calling-app spread
+
+
+
Verdict — Slow-burn 60-day regression, no single dominant dimension. Spans AcquireTokenSilent (98%) but spread across all top callers (Outlook 36%, Teams 20%, SkyDrive 11%, AppManager 7%). Active broker is split ~46% Authenticator / 44% AppManager / 10% Intune CP, mirroring fleet-share — so this is not a broker-app-specific issue. Strongest code-attribution candidate: common #3074 (token-cache remove path optimization, AB#3570409). The +9.6 M devices added since wk of Mar 8 closely tracks the rollout window of that PR. Action: bisect by enabling/disabling the filter-first-clone flight on a small ring to confirm causation.
+
Span
AcquireTokenSilent98.6%
+
+
ATISilently1.3%
+
+
MSAL_PerformIpcStrategy0.1%
+
Calling app
com.microsoft.office.outlook35.5%
+
+
com.microsoft.teams20.1%
+
+
com.microsoft.skydrive10.6%
+
+
com.microsoft.office.word7.3%
+
+
com.microsoft.appmanager7.0%
+
Active broker
com.azure.authenticator46.0%
+
+
com.microsoft.appmanager44.5%
+
+
com.microsoft.windowsintune.companyportal9.5%
+
Broker version
16.0.171.1%
+
+
15.1.010.0%
+
+
14.2.09.0%
+
+
other9.9%
+
+
+
Code attribution
+
+
medium
+
common#3074 Token cache filter-first-clone optimization +
Touches the cache-remove path; an over-eager remove or filter mismatch would directly raise no_tokens_found on AcquireTokenSilent.
+
+
+
low
+
common#3081 BrokerDiscovery cache crash fix (shared encryption key with MSAL) +
Same WPJ/encryption surface; less likely root cause but worth ruling out.
+
+
+
+
+
🚚 Traffic attribution
+
Spread across all top callers in proportion to their request volume — no single calling-app traffic surge is responsible. Per-Outlook-request rate of no_tokens_found has risen consistently with the trend, ruling out traffic attribution.
+
+
+
+
+
+
+
timed_out_execution
+
Devices: 18.0 M → 53.4 M over 8wks (peaked 143 M wk of Apr 12); WoW 80.6 M → 53.4 M (−33.7%) — recovering
+
+
60d regressionpeak-then-recoverAppManager-heavy
+
+
+
Verdict — 60d regression with WoW recovery underway. Almost entirely on AcquireTokenSilent (99.9%). The peak at 143 M devices (wk of Apr 12) and subsequent drop to 53.4 M is consistent with broker #141 (HTTP cancellation on ATS command-level timeout) being flight-rolled out and then partially gated back. AppManager (Link to Windows) is the dominant active broker (53% this week, was 69% prior), and most-affected calling app: Outlook 32% / AppManager 25% / Teams 24%. Action: confirm the flight rollout schedule for #141 and check whether the timeout threshold needs tuning before re-enabling broadly. Watch for downstream client retry storms (it converts silent thread-leak into an explicit error → callers must retry cleanly).
+
Span
AcquireTokenSilent 99.9%
+
+
ATISilently 0.1%
+
+
AcquireTokenInteractive 0.1%
+
Calling app
com.microsoft.office.outlook 32.5%
+
+
com.microsoft.appmanager 25.0%
+
+
com.microsoft.teams 24.0%
+
+
com.microsoft.skydrive 5.4%
+
Active broker
com.microsoft.appmanager 52.9%
+
+
com.azure.authenticator 35.4%
+
+
com.microsoft.windowsintune.companyportal 11.7%
+
Broker version
16.0.1 70.8%
+
+
15.1.0 10.0%
+
+
14.2.0 8.8%
+
+
other 10.4%
+
+
+
Code attribution
+
+
high
+
broker#141 Add flight-gated HTTP cancellation on ATS command-level timeout to eliminate zombie worker threads (AB#3542516) +
This PR explicitly converts long-running ATS calls into timed_out_execution. The 60d trajectory matches the flight rollout perfectly. The reciprocal drop in legacy timed_out (-86%) confirms the reclassification.
+
+
+
+
+
🚚 Traffic attribution
+
AppManager (LTW) dropped from 55.5 M to 28.2 M devices (-49%) WoW while AppManager total request volume rose 2.9% — so this is NOT traffic-driven; the per-AppManager-request rate is what fell, consistent with a flight pull-back.
+
+
+
+
+
+
+
unauthorized_client
+
Devices: 2.74 M → 3.37 M over 8wks (+22.8%); WoW 3.17 M → 3.37 M (+6.3%)
+
+
60d regression · Outlook+Teams concentrated · silent path
+
+
+
Verdict — Mild but consistent 60d climb, very likely traffic-attributed (not a broker bug). Calling-app concentration: 67% in Outlook+Teams alone, with the next 5 callers all being Office apps (Excel 8%, Word 7%, SCMx 4%). Span: AcquireTokenSilent 90% / AcquireTokenInteractive 6%. Active broker shares mirror fleet-share. The growth tracks request-volume growth in Outlook/Teams (+2.1%/+3.9% WoW each, +12% over 60d) closely. The most likely explanation is that some Outlook/Teams app registrations are gradually being marked unauthorized for specific resources/scopes by their first-party app owners — not a broker code issue. Action: sample 10 unauthorized_client correlation IDs from this week's Outlook traffic and check the eSTS error sub-code; route to Outlook + first-party app team if confirmed.
+
Span
AcquireTokenSilent 90.3%
+
+
AcquireTokenInteractive 5.8%
+
+
ATISilently 3.6%
+
Calling app
com.microsoft.office.outlook 34.7%
+
+
com.microsoft.teams 32.6%
+
+
com.microsoft.office.excel 8.0%
+
+
com.microsoft.office.word 7.4%
+
Active broker
com.azure.authenticator 43.3%
+
+
com.microsoft.windowsintune.companyportal 36.9%
+
+
com.microsoft.appmanager 19.8%
+
Broker version
16.0.1 70.5%
+
+
15.1.0 10.2%
+
+
14.2.0 8.8%
+
+
other 10.5%
+
+
+
Code attribution
+
+
none
+
(no PR) No broker code regression identified +
Mirrors fleet broker-version share; no version concentration. Most likely an app-registration / first-party-config drift on the eSTS side.
+
+
+
+
+
🚚 Traffic attribution
+
Strong traffic-attribution signal: 67% concentration in Outlook+Teams, both growing in request volume. See → 🚚 Traffic Attribution section for full analysis.
+
+
+
+
+
+
+
Code:-6
+
Devices: 33 K → 86 K over 8wks (+162%, peak 92 K wk of Apr 26); WoW 92.6 K → 86.4 K (−6.7%) — first WoW pullback
+
+
60d regression · interactive only · Intune-CP active broker
+
+
+
Verdict — Code:-6 (interactive auth canceled by user via system UI) jumped 2.7× starting wk of Apr 19. Span: AcquireTokenInteractive 76% / ATIInteractively 24%. Active broker concentration: 57% Intune Company Portal (vs ~38% fleet share). Calling-app spread: Outlook 30% / Teams 21% / Axis Bank Siddhi 21% (notable: a 3rd-party banking app appearing as the #3 caller for an interactive-cancellation error suggests an MAM/Intune pop-up issue). Broker-version split: 38% on 15.1.0 (over-represented vs 10% fleet share) and 36% on 16.0.1 (under-represented vs ~71% fleet share). Action: investigate whether 15.1.0 introduced an interactive-cancellation path bug, or whether Intune CP is showing a new system dialog that users dismiss. Check broker/AADAuthenticator for changes to WebViewClient / interactive consent flows since wk of Apr 12.
+
Span
AcquireTokenInteractive 75.8%
+
+
ATIInteractively 24.0%
+
+
CertBasedAuth 0.1%
+
Calling app
com.microsoft.office.outlook 29.6%
+
+
com.microsoft.teams 21.4%
+
+
com.axisbank.siddhi.v3 21.3%
+
+
com.microsoft.windowsintune.companyportal 4.7%
+
Active broker
com.microsoft.windowsintune.companyportal 56.7%
+
+
com.azure.authenticator 24.3%
+
+
com.microsoft.appmanager 19.1%
+
Broker version
15.1.0 37.8%
+
+
16.0.1 35.5%
+
+
15.0.0 13.0%
+
+
14.2.0 4.7%
+
+
+
Code attribution
+
+
low
+
(no PR) Investigate broker/AADAuthenticator WebViewClient changes between 15.1.0 and 16.0.1 +
15.1.0 is over-represented vs fleet share (37.8% vs ~10%), while 16.0.1 is under-represented (35.5% vs ~71%). Could not pinpoint a single PR via grep; this needs a targeted diff between the 15.1.0 and 16.0.1 release branches focused on interactive consent flows.
+
+
+
+
+
🚚 Traffic attribution
+
Axis Bank Siddhi alone is 21% — a single 3rd-party app being a top contributor to a cancellation-style error is unusual. Check whether their app introduced an interactive AcquireToken call recently and whether their UX is causing user dismissal.
+
+
+
+
+
+
+
null_pointer_error
+
Devices: 48 K → 71 K over 8wks (+46%); WoW 67 K → 71 K (+5.1%)
+
+
60d regression · silent path · LTW + Authenticator
+
+
+
Verdict — Steady 60d climb in NPE crashes; small absolute volume but trajectory worth flagging. Span: AcquireTokenSilent 98%. Active broker: AppManager (LTW) 53% / Authenticator 26% / Intune CP 20%. Calling app: AppManager 35% / Teams 28% / Outlook 21%. The over-representation of LTW (53% vs ~5-10% fleet share for that active broker) is a strong signal — there's likely a null path specific to the LTW broker process. Action: bucket by error_location / stack-trace prefix and route to LTW team. Cite broker #141 as a possible secondary contributor (timeout cancellation can interact with deferred work that holds a null reference).
+
Span
AcquireTokenSilent 98.5%
+
+
DeviceRegistrationApi 1.0%
+
+
AcquireTokenInteractive 0.3%
+
Calling app
com.microsoft.appmanager 35.4%
+
+
com.microsoft.teams 28.5%
+
+
com.microsoft.office.outlook 21.2%
+
+
com.microsoft.office.word 2.8%
+
Active broker
com.microsoft.appmanager 53.3%
+
+
com.azure.authenticator 26.4%
+
+
com.microsoft.windowsintune.companyportal 20.3%
+
Broker version
16.0.1 70.0%
+
+
15.1.0 10.5%
+
+
14.2.0 9.0%
+
+
other 10.5%
+
+
+
Code attribution
+
+
low
+
broker#141 HTTP cancellation on ATS timeout (LTW broker process) +
Active broker is heavily LTW (53% vs ~7% fleet share). The same ATS timeout cancellation path may be racing with a null-checked reference in a deferred callback. Needs stack-trace bucketing to confirm.
+
+
+
none
+
(no PR) Awaiting crash bucket by error_location +
Cannot pinpoint specific PR without stacktrace breakdown.
+
+
+
+
+
🚚 Traffic attribution
+
AppManager (LTW) request volume rose only +2.9% WoW while NPE devices from LTW grew +12% — per-LTW-request NPE rate is rising. NOT traffic-driven.
+
+
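The traffic-attribution check above compares how fast the error grew against how fast the caller's request volume grew. A minimal sketch of that arithmetic follows, assuming both growth figures are week-over-week percentages. `perRequestRateChangePct` is a hypothetical helper, not one of the skill's scripts, and because the report pairs error *devices* with caller *requests*, the result is only an approximation of the true per-request rate.

```javascript
// Hypothetical helper (not part of the skill's scripts).
// Approximates how much the per-request error rate moved, given the
// WoW growth of the error metric and of the caller's request volume.
// Both inputs are percentages, e.g. 12 for +12%.
function perRequestRateChangePct(errorGrowthPct, requestGrowthPct) {
  const errorRatio = 1 + errorGrowthPct / 100;
  const requestRatio = 1 + requestGrowthPct / 100;
  if (requestRatio <= 0) {
    return null; // no meaningful baseline when request volume fell to zero
  }
  return (errorRatio / requestRatio - 1) * 100;
}

// null_pointer_error on LTW: error devices +12%, AppManager requests +2.9%.
const npeRateChange = perRequestRateChangePct(12, 2.9);
console.log(npeRateChange.toFixed(1) + "% change in approximate per-request rate");
```

A result well above zero suggests the rate itself rose, so traffic alone does not explain the growth; a result near zero suggests the error is simply tracking request volume. Where "near zero" ends is a judgment call that this sketch does not encode.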
+
+
+
+
+
unknown_crypto_error
+
Devices: 64 K → 78 K over 8wks (+23%); WoW 76.3 K → 78.4 K (+2.7%)
+
+
60d regression · keystore / TEE · pre-auth flow
+
+
+
Verdict — Slow-burn keystore failure, dominated by device-registration / WPJ paths. Span: KeyPairGeneration 55% / SecretKeyGeneration 45% — both keystore-bound, indicating TEE / hardware keystore issues at first-key-generation time. Active broker: 63% Authenticator / 31% AppManager / 6% Intune CP. Calling app is blank for ~100% of these (consistent with pre-authentication flows like DRS/WPJ where no caller is yet attached). Action: slice by DeviceInfo_OsVersion and OEM (Samsung/Pixel/Xiaomi/Huawei) on raw android_spans; this kind of growth typically maps to a specific OEM/Android-version combo (StrongBox-backed keystore quirks).
+
Span
KeyPairGeneration 54.5%
+
+
SecretKeyGeneration 45.3%
+
+
SecretKeyWrapping 0.1%
+
+
SecretKeyRetrieval 0.0%
+
Calling app
(blank — pre-auth) 100.0%
+
Active broker
com.azure.authenticator 63.0%
+
+
com.microsoft.appmanager 31.0%
+
+
com.microsoft.windowsintune.companyportal 5.9%
+
Broker version
16.0.1 71.0%
+
+
15.1.0 10.0%
+
+
14.2.0 9.0%
+
+
other 10.0%
+
+
+
Code attribution
+
+
none
+
(no PR) No broker PR identified — likely OEM/Android-version-specific keystore behavior +
KeyPairGeneration + SecretKeyGeneration concentration points to TEE/StrongBox keystore. Common Android quirks: Samsung Knox vault provisioning, Xiaomi/Huawei custom keystore HALs.
+
+
+
+
+
🚚 Traffic attribution
+
Pre-authentication flow with no calling app — traffic attribution does not apply.
+
+
+
+
+ + +

🚚 Traffic Attribution — spikes explained by calling-app traffic, not code

+ +
+
+
+
unauthorized_client
+
Classification: traffic-attributed — not a broker code regression
+
+
🚚 traffic-driven
+
+
+
+ unauthorized_client +6.3% devices WoW (and +22.8% over 60d) is concentrated in Outlook (35%) + Teams (33%) — combined 67% — both of which grew in request volume +2.1% and +3.9% WoW respectively, and ~12% over 60d. Per-Outlook-request and per-Teams-request unauthorized_client rates are essentially flat. This means the spike is being driven by these apps issuing more requests (some of which were always going to fail with this error code), not by broker code regressing. No broker code change is implicated. Route to Outlook + Teams app-registration owners on the eSTS side. +
+
+
+ + +

Error codes — WoW with stable (auth-only) denominator

+
+ + + + + + + + + + + + + + + + + + + + +
| Error code | Status | Devices now | Devices prior | Δ devices | 60d sparkline |
|---|---|---|---|---|---|
| no_tokens_found | ▲ 60d regression | 23.73 M | 22.91 M | +3.6% | |
| unauthorized_client | ▲ 60d regression | 3.37 M | 3.17 M | +6.3% | |
| unknown_crypto_error | ▲ 60d regression | 78.4 k | 76.3 k | +2.8% | |
| null_pointer_error | ▲ 60d regression | 70.7 k | 67.3 k | +5.1% | |
| Code:-6 | ⚪ Flat | 86.4 k | 92.6 k | -6.6% | |
| timed_out_execution | ▼ Win | 53.40 M | 80.63 M | -33.8% | |
| unknown_authority | ▼ Win | 1.45 M | 9.00 M | -83.9% | |
| 429 | ▼ Win | 2.5 k | 143.9 k | -98.3% | |
| io_error | ⚪ Flat | 458.49 M | 439.68 M | +4.3% | |
| no_account_found | ⚪ Flat | 305.05 M | 311.77 M | -2.2% | |
| invalid_grant | ⚪ Flat | 144.07 M | 140.76 M | +2.4% | |
| device_network_not_available_doze_mode | ⚪ Flat | 6.31 M | 6.28 M | +0.5% | |
| interaction_required | ⚪ Flat | 5.23 M | 5.12 M | +2.1% | |
| User cancelled | ⚪ Flat | 3.19 M | 3.08 M | +3.5% | |
| auth_cancelled_by_sdk | ⚪ Flat | 1.46 M | 1.42 M | +2.9% | |
| invalid_resource | ⚠ Watch | 1.31 M | 1.05 M | +24.4% | |
| invalid_request | ⚪ Flat | 1.28 M | 1.25 M | +2.3% | |
| authorization_pending | ⚪ Flat | 175.2 k | 170.0 k | +3.0% | |
| expired_token | ⚪ Flat | 111.0 k | 111.9 k | -0.7% | |
| Failed to parse JWT | ⚠ Watch | 3.5 k | 895 | +288.0% | |
+ + +
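The Δ devices column above is a plain week-over-week percentage change. As a rough sketch of that calculation (the helper names `wowDeltaPct` and `formatDelta` are hypothetical and not taken from the skill's scripts), it could be computed and formatted like this. The inputs here are the rounded values shown in the table, so a result can differ from the published figure by 0.1 when the report was computed from unrounded counts.

```javascript
// Hypothetical helpers (not part of the skill's scripts).

// Week-over-week change in percent, or null when there is no baseline.
function wowDeltaPct(now, prior) {
  if (!Number.isFinite(now) || !Number.isFinite(prior) || prior === 0) {
    return null;
  }
  return ((now - prior) / prior) * 100;
}

// Format as it appears in the table: explicit sign, one decimal, percent.
function formatDelta(pct) {
  if (pct === null) {
    return "n/a";
  }
  const sign = pct > 0 ? "+" : "";
  return sign + pct.toFixed(1) + "%";
}

// no_tokens_found: 22.91 M devices prior week, 23.73 M this week.
console.log(formatDelta(wowDeltaPct(23.73, 22.91))); // prints "+3.6%"
// no_account_found: 311.77 M devices prior week, 305.05 M this week.
console.log(formatDelta(wowDeltaPct(305.05, 311.77))); // prints "-2.2%"
```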

Error types — WoW with stable denominator

+
+ + + + + + + + + + + + + + + + + + +
| Error type | Devices now | Devices prior | Δ devices % |
|---|---|---|---|
| ClientException | 536.81 M | 552.79 M | -2.9% |
| UiRequiredException | 474.82 M | 477.28 M | -0.5% |
| ServiceException | 3.88 M | 3.73 M | +4.0% |
| IntuneAppProtectionPolicyRequiredException | 3.22 M | 3.03 M | +6.2% |
| UserCancelException | 3.20 M | 3.09 M | +3.5% |
| ArgumentException | 190.4 k | 464.2 k | -59.0% |
| CreateCredentialCancellationException | 136.9 k | 141.1 k | -2.9% |
| GetCredentialCancellationException | 116.1 k | 115.5 k | +0.5% |
| BrokerCommunicationException | 73.0 k | 70.4 k | +3.7% |
| DeviceRegistrationRequiredException | 64.5 k | 59.8 k | +8.0% |
| CreatePublicKeyCredentialDomException | 38.1 k | 42.4 k | -10.3% |
| JobCancellationException | 32.8 k | 31.9 k | +2.8% |
| UnknownHostException | 19.6 k | 20.4 k | -4.0% |
| NullPointerException | 12.2 k | 11.9 k | +2.3% |
| JsonSyntaxException | 6.1 k | 7.0 k | -12.5% |
| SocketException | 1.5 k | 1.6 k | -7.8% |
| CreateCredentialUnknownException | 969 | 1.2 k | -19.0% |
| TimeoutCancellationException | 916 | 1.5 k | -37.2% |
+ + +

📊 Traffic analysis

+ +
+ Total auth requests/devices, top calling apps, top spans, requests-per-device, sampling-change check. +
+ +
+
Total broker requests (BrokerAdoptionStats)
12.79 B
−1.1% WoW · 60d −24%
+
Total broker devices (BrokerAdoptionStats)
1.24 B
−18.6% WoW · ⚠ denominator artifact (see top callout)
+
Auth-only requests (Silent + Interactive)
10.59 B
+2.4% WoW · 60d +1.0%
+
Auth-only devices
1.54 B
−1.2% WoW (real fleet flat)
+
Requests / device (silent)
6.92
+3.6% WoW (more requests per dev)
+
Sampling change indicator
⚪ Stable
All-spans dropped >20%, but auth-only <5% — confirms OnUpgradeReceiver taper, not sampling change
+
+ +

Top calling apps

+
+ + + + + + + + + + + + + + +
| Calling app | Requests now | Requests prior | Δ requests % | Devices now | Devices prior | Δ devices % |
|---|---|---|---|---|---|---|
| com.microsoft.office.outlook | 3.24 B | 3.18 B | +2.1% | 458.26 M | 454.87 M | +0.7% |
| com.microsoft.appmanager | 2.64 B | 2.56 B | +2.9% | 282.92 M | 284.00 M | -0.4% |
| com.microsoft.teams | 1.79 B | 1.72 B | +3.9% | 236.82 M | 234.55 M | +1.0% |
| com.microsoft.skydrive | 691.12 M | 691.85 M | -0.1% | 200.50 M | 199.29 M | +0.6% |
| com.microsoft.skype.teams.ipphone | 599.97 M | 597.66 M | +0.4% | 9.03 M | 9.28 M | -2.7% |
| com.microsoft.office.word | 419.64 M | 400.69 M | +4.7% | 63.26 M | 61.29 M | +3.2% |
| com.microsoft.office.excel | 279.21 M | 264.73 M | +5.5% | 44.64 M | 42.86 M | +4.2% |
| com.microsoft.office.officehubrow | 247.83 M | 233.35 M | +6.2% | 38.71 M | 37.31 M | +3.8% |
| com.microsoft.emmx | 158.68 M | 158.54 M | +0.1% | 16.83 M | 17.10 M | -1.6% |
| com.samsung.android.email.provider | 151.42 M | 178.53 M | -15.2% | 4.20 M | 4.16 M | +0.9% |
| com.microsoft.scmx | 93.59 M | 94.15 M | -0.6% | 9.68 M | 9.74 M | -0.7% |
| com.microsoft.office.powerpoint | 84.03 M | 79.37 M | +5.9% | 14.91 M | 14.17 M | +5.2% |
| com.microsoft.windowsintune.companyportal | 65.18 M | 63.85 M | +2.1% | 35.32 M | 34.80 M | +1.5% |
| com.microsoft.sharepoint | 12.86 M | 12.75 M | +0.9% | 5.40 M | 5.39 M | +0.1% |
+ +

Top spans by request volume

+
+ + + + + + + + + + + + + + + +
| Span | Count now | Count prior | Δ % | Note |
|---|---|---|---|---|
| AcquireTokenSilent | 10.59 B | 10.35 B | +2.3% | |
| DeviceRegistrationApi | 598.70 M | 578.30 M | +3.5% | |
| AcquireTokenDcfFetchToken | 365.00 M | 364.00 M | +0.3% | |
| BrokerOperationRequestDispatcher | 349.10 M | 344.30 M | +1.4% | |
| SecretKeyWrapping | 251.00 M | 329.10 M | -23.7% | Downstream of OnUpgradeReceiver drop |
| OnUpgradeReceiver | 151.20 M | 438.20 M | -65.5% | Denominator culprit — natural taper as 16.0.1 rollout completes; may also be amplified by goAsync() effects |
| SecretKeyRetrieval | 113.40 M | 111.50 M | +1.7% | |
| WrappedKeyAlgorithmIdentifier | 87.10 M | 134.70 M | -35.3% | Downstream of OnUpgradeReceiver drop |
| RefreshTransferToken | 78.80 M | 74.80 M | +5.3% | |
| EcsFlightsFetchConfigs | 47.30 M | 47.70 M | -0.8% | |
| AcquireAtUsingPrt | 38.80 M | 37.10 M | +4.6% | |
| Passthrough | 23.90 M | 24.20 M | -1.2% | |
| RefreshPrt | 20.70 M | 19.40 M | +6.7% | |
| AccountStorageWithBackup | 11.40 M | 11.60 M | -1.7% | |
| AcquireTokenInteractive | 10.30 M | 9.60 M | +7.3% | Up — matches +7.5% interactive auth growth |
+ + +

Latency — ms, p50/p95/p99 by hot span

+
+ + + + +
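| Span | p50 now | p50 prior | p95 now | p95 prior | p99 now | p99 prior | Δ p99 % |
|---|---|---|---|---|---|---|---|
| AcquireTokenSilent | 1185 | 1204 | 5916 | 5961 | 13758 | 13712 | +0.3% |
| GetAccounts | 452 | 448 | 4490 | 4401 | 11801 | 11784 | +0.1% |
| ProcessWebsiteRequest | 16 | 16 | 76 | 77 | 199 | 201 | -1.0% |
| RemoveAccount | 17 | 19 | 259 | 285 | 1674 | 1945 | -13.9% |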
+

Source: PerfStats (TDigest-merged). All hot spans flat or slightly improving except GetAccounts p95 (+2%, within noise).

+ + +

Broker version adoption — request share by version

+
+ + + + + + + + + + +
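| Broker version | Req share now | Req share prior | Δ share pp | Δ rel % |
|---|---|---|---|---|
| 16.0.1 | 70.77% | 55.34% | +15.43 | +26.5% |
| 15.1.0 | 10.12% | 10.78% | -0.66 | -7.1% |
| 14.2.0 | 8.80% | 9.42% | -0.62 | -7.6% |
| 15.0.0 | 2.27% | 4.03% | -1.75 | -44.1% |
| 16.0.0 | 1.39% | 13.35% | -11.96 | -89.7% |
| 14.1.1 | 1.06% | 1.24% | -0.18 | -15.6% |
| 14.0.2 | 0.99% | 1.09% | -0.10 | -9.9% |
| 13.3.2 | 0.63% | 0.67% | -0.04 | -6.9% |
| 13.9.1 | 0.42% | 0.45% | -0.03 | -6.9% |
| 13.20.0 | 0.40% | 0.41% | -0.01 | -3.8% |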
+

16.0.1 rollout effectively complete. It reached 70.8% req share (from 55.3%); 16.0.0 is down to 1.4% from 13.4%. Older 15.x, 14.x, and 13.x versions decline roughly 4-16% in relative share as natural attrition, except 15.0.0 (-44.1%). No version regressed in error rate during this rollout; see the Spike Attribution cards for per-error broker_version concentration.

+ + +

Appendix

+ +
+ Queries used (Kusto KQL) +
+

Cluster: https://idsharedeus2.kusto.windows.net · Database: ad-accounts-android-otel

+

1. Reliability:

+
let all = SilentAuthStatsAllRequests | where EventInfo_Time > ago(70d)
+  | summarize allReq=sum(countRequests), allDev=sum(countDevices) by week=startofweek(EventInfo_Time);
+let ok = SilentAuthStatsRequestsWithoutExpectedError | where EventInfo_Time > ago(70d)
+  | summarize okReq=sum(countRequests), okDev=sum(countDevices) by week=startofweek(EventInfo_Time);
+all | join kind=inner ok on week
+  | project week, reqRel=round(100.0*okReq/allReq,3), devRel=round(100.0*okDev/allDev,3)
+  | order by week asc
+

2. 60-day error trend (bucketed in post-processing):

+
ErrorStats | where EventInfo_Time > ago(70d)
+  | where isnotempty(error_code) and error_code != 'success'
+  | summarize errs=sum(countOverall), devs=sum(countDevices)
+       by week=startofweek(EventInfo_Time), error_code
+  | order by error_code asc, week asc
+

3. Spike attribution (per error, per dimension):

+
let codes = dynamic(['no_tokens_found','unauthorized_client','Code:-6',
+                     'unknown_crypto_error','null_pointer_error','timed_out_execution']);
+ErrorStats | where EventInfo_Time > ago(14d) | where error_code in (codes)
+  | extend wk=startofweek(EventInfo_Time)
+  | summarize devs=sum(countDevices) by wk, error_code,
+       calling_package_name, active_broker_package_name, broker_version, span_name
+  | order by error_code asc, wk asc, devs desc
+

4. Latency (TDigest-merged):

+
PerfStats | where EventInfo_Time > ago(21d)
+  | where span_name in ('AcquireTokenSilent','GetAccounts','ProcessWebsiteRequest','RemoveAccount')
+  | where span_status == 'OK'
+  | summarize merged=tdigest_merge(responseTimeTDigest), reqs=sum(countRequests)
+       by week=startofweek(EventInfo_Time), span_name
+  | extend p50=percentile_tdigest(merged,50),
+           p95=percentile_tdigest(merged,95),
+           p99=percentile_tdigest(merged,99)
+
+
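The queries above bucket by `startofweek(EventInfo_Time)`, which aligns weeks to Sunday. When preparing script arguments outside Kusto, the same alignment can be reproduced with a small helper. The sketch below is illustrative only: `startOfWeekUtc` and `reportingWindow` are hypothetical names, and it assumes dates are handled in UTC.

```javascript
// Hypothetical helpers (not part of the skill's scripts).
const DAY_MS = 24 * 60 * 60 * 1000;

// Sunday 00:00 UTC of the week containing the given YYYY-MM-DD date.
function startOfWeekUtc(isoDate) {
  const d = new Date(isoDate + "T00:00:00Z");
  return new Date(d.getTime() - d.getUTCDay() * DAY_MS);
}

function toIsoDate(d) {
  return d.toISOString().slice(0, 10);
}

// Reporting week start, its exclusive end (the following Sunday),
// and the start of the baseline (prior) week.
function reportingWindow(isoDate) {
  const start = startOfWeekUtc(isoDate);
  return {
    start: toIsoDate(start),
    endExclusive: toIsoDate(new Date(start.getTime() + 7 * DAY_MS)),
    baselineStart: toIsoDate(new Date(start.getTime() - 7 * DAY_MS)),
  };
}

console.log(reportingWindow("2026-05-06"));
// { start: '2026-05-03', endExclusive: '2026-05-10', baselineStart: '2026-04-26' }
```

The `endExclusive` value is the kind of date that `bucket-trends.js --end=YYYY-MM-DD` expects (the Sunday after the reporting week), under the assumption that the script interprets it in UTC.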
+ +
+ Methodology & caveats +
+
    +
  • Reporting window: Sun May 3 → Sat May 9, 2026 (Kusto startofweek(datetime(2026-05-03))). Baseline: prior week of Apr 26 → May 2.
+
  • 60-day window: 8 complete weeks Mar 8 → May 3 (the partial Mar 1 start week is excluded for trend deltas).
+
  • Auth-only denominator: all reliability % use countRequests from SilentAuthStatsAllRequests + InteractiveAuthStatsAllRequests. The all-spans denominator from BrokerAdoptionStats is sensitive to receiver/goAsync() taper effects.
+
  • Concentration thresholds for attribution cards: >80% = strong (red bar); 60-80% = medium; <60% = broad/cross-cutting.
+
  • PR confidence rating: high = trajectory + flight rollout date both line up; medium = code path matches but no flight gate evidence; low = candidate from grep, needs verification; none = no broker PR identified.
+
  • Account type / shared-device-mode dimensions are not yet sliced this week: ErrorStats doesn't carry them, so slicing requires a targeted android_spans query that we'll add next pass.
+
+
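The concentration thresholds in the methodology list map a dimension's top share to one of three labels. A minimal sketch of that mapping follows; `classifyConcentration` is a hypothetical name, and the treatment of the exact boundaries (80% and 60% both counted as medium) is an assumption read from the ">80%" and "<60%" wording.

```javascript
// Hypothetical helper (not part of the skill's scripts).
// sharePct is the top value's share within one dimension, 0-100.
function classifyConcentration(sharePct) {
  if (sharePct > 80) {
    return "strong"; // rendered with the red bar
  }
  if (sharePct >= 60) {
    return "medium";
  }
  return "broad"; // broad / cross-cutting
}

// timed_out_execution by span: AcquireTokenSilent at 99.9%.
console.log(classifyConcentration(99.9)); // prints "strong"
// unauthorized_client by calling app: Outlook + Teams combined at 67%.
console.log(classifyConcentration(67)); // prints "medium"
// Code:-6 by calling app: Outlook at 29.6%.
console.log(classifyConcentration(29.6)); // prints "broad"
```

This label only describes raw share. It does not capture over-representation relative to fleet share, which the verdicts also weigh.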
+
+ +
+ + + + + \ No newline at end of file diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html new file mode 100644 index 00000000..209e7d70 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/sparkline-footer.html @@ -0,0 +1,42 @@ + + diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html new file mode 100644 index 00000000..70f4a63c --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/spike-card.html @@ -0,0 +1,154 @@ + + +
+
+
{{ERROR_NAME}}   + {{WOW_BADGE}} + {{D60_BADGE}} +
+
+ {{TAG_1}} + {{TAG_2}} +
+
+
+
+ Verdict: {{VERDICT_PARAGRAPH}} +
+ + +
+
Span
+ {{SPAN_DIM_ROWS}} +
+
Calling app
+ {{APP_DIM_ROWS}} +
+
Active broker pkg
+ {{ACTIVE_BROKER_DIM_ROWS}} +
+
Broker version
+ {{BROKER_VERSION_DIM_ROWS}} +
+
Account type
+ {{ACCOUNT_TYPE_DIM_ROWS}} +
+
Shared device
+ {{SHARED_DEVICE_DIM_ROWS}} +
+
Client SKU
+ {{CLIENT_SKU_DIM_ROWS}} +
+
OS version
+ {{OS_VERSION_DIM_ROWS}} +
+
+ + +
+
Code attribution
+
Originator
+
{{ORIGIN_LABEL}} {{ORIGIN_DESCRIPTION}}
+
+
Top throw site
+
{{THROW_SITE_FILE_LINE}} {{THROW_SITE_NOTES}}
+
+
Wrapper
+
{{WRAPPER_CLASS_AND_METHOD}}
+
+
Caller hot-spots
+
{{CALLER_BREAKDOWN}}
+
+
Underlying cause
+
{{ROOT_CAUSE}}
+
+
Top error_messages
+
+
    +
  1. {{MSG_1}} — {{MSG_1_DEVICES}}
+
  2. {{MSG_2}} — {{MSG_2_DEVICES}}
+
  3. {{MSG_3}} — {{MSG_3_DEVICES}}
+
+
+
+
Likely PRs
+
+
+
{{PR_1_CONF_LABEL}}
+
+ {{PR_1_ID}}   + {{PR_1_TITLE}} +
{{PR_1_DATE}} · {{PR_1_AUTHOR}} · sha {{PR_1_SHA}}
+
{{PR_1_WHY}}
+
+
+ +
+
+
Next step
+
📝 {{OWNER_TEAM}}: {{NEXT_ACTION}}
+
+
+ + +
+ Traffic Attribution check: {{TRAFFIC_VERDICT}} +
+
+
diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/template-readme.md b/.github/skills/oncall-weekly-telemetry-report/assets/templates/template-readme.md new file mode 100644 index 00000000..09f27703 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/template-readme.md @@ -0,0 +1,270 @@ +# report-template.html — author guide + +`assets/templates/report-template.html` is the canonical layout for the OCE weekly report. +It is **a real prior week's report kept verbatim as a structural reference** — +not a tokenized skeleton. The right mental model is: + +> *"Open the template, save it under a new filename, then walk top-to-bottom and +> replace every prior-week date / number / verdict / PR citation with current-week +> data. Don't redesign the layout. Don't restyle the CSS."* + +## What you change per week + +| Region | What to update | +|---|---| +| `` and `<h1>` block | Reporting window dates + "Generated …" date | +| KPI tiles (`.kpi-grid`) | Value, delta, `data-spark` array (8–9 numbers) per tile | +| 🚨 Needs-attention callouts (`.callout.urgent` / `.watch` / `.win`) | Replace bullet list with current-week findings; keep the 4 callout categories | +| 📈 60-day trend tables | Rows + `.trend` sparkline arrays, generated by `bucket-trends.js` (4 runs, union of regressions) | +| 🔎 Spike-attribution cards (`.attr-card`) | One card per regression. **Use [`templates/spike-card.html`](templates/spike-card.html) as the per-card skeleton.** Replace dim percentages, throw-site, PR list, etc. 
| +| 🚚 Traffic-attribution cards | Same as spike cards; render an explicit "None this week" if no errors qualify | +| Error-codes / error-types tables | One row per non-trivial code/type with Δ devices % + Δ requests % + 60d sparkline | +| Traffic / latency / adoption tables | Update numbers; structure stays | +| Appendix PR window list | Run `find-suspect-prs.ps1` (or `git log`) for broker/ + common/ over the 4-week window | + +## What you NEVER change + +- The `<style>` block at the top — the CSS is canonical +- The `<script>` block at the bottom — the sparkline JS is canonical (uses string concatenation, not template literals, on purpose — see comment in the script) +- Section ordering and `id="..."` anchors — the table-of-contents links rely on these + +If the layout itself ever needs to change (new section, new card style), edit +`assets/templates/report-template.html` here in the skill folder and commit so future +weeks inherit the change. + +## Editing strategy: in-place vs head+body+footer rebuild + +Pick by overlap with the prior week: + +- **In-place edit (default)** — when ≤3 attribution cards change AND the section structure is unchanged. Use `replace_string_in_file` with surrounding context per card / table row. Fast and low-risk. +- **Head+body+footer rebuild (fallback)** — when ≥4 attribution cards change, or several callouts get re-categorized, or the regression set has near-zero overlap with the template. Trying to in-place edit at that scale invites the inception-style nested-`</div>` bugs the validator was written to catch. + +> **⚠️ UTF-8 trap in PowerShell composition.** When composing HTML body sections via `@'...'@` heredocs piped to `Set-Content` / `Out-File` (or even `Add-Content`), PowerShell silently strips multi-byte UTF-8 characters — emojis (📊 🚨 🔴 🟡), em-dashes (—), arrows (→), middle-dots (·). The file remains valid UTF-8; the characters just become empty strings. The validator's `U+FFFD` check catches mojibake but NOT silent strips. 
Two safe approaches: +> +> 1. **Use `[IO.File]::WriteAllText($path, $text, [System.Text.UTF8Encoding]::new($false))`** for the final write — this preserves Unicode literals from the script source. +> 2. **Write a Node.js generator** (`gen-body.js`) that takes a JSON spec and emits the HTML body. Node handles UTF-8 natively. If creating the script becomes painful (the `create` tool occasionally fails on `file_text` in this codebase), fall back to approach 1 with explicit `[char]0xD83D + [char]0xDCCA` for 📊, `[char]0x2192` for →, etc. +> +> The cost when this trap fires: a full restoration pass against every emoji + em-dash + arrow in the report (~30 minutes in v8). + + Boundary lines in the canonical template (verify with `grep` before splitting — they drift as the template evolves): + + | Region | Lines (approx) | Last/first line content | + |---|---|---| + | **head** | 1 → ~342 | ends `<body>` then `<div class="container">` (open) | + | **body** (replace) | ~343 → ~1081 | starts `<div class="header">`, ends `</div>` that closes `.container` | + | **footer** | ~1082 → end | starts `<script>`, ends `</body></html>` | + + Rebuild recipe (PowerShell, single line — multi-line here-strings can mangle JS template literals in the footer; see user-memory `oce-report-lessons.md`): + + ```pwsh + $work="$env:USERPROFILE\android-oce-reports\_data"; $f='<output>.html'; $head=[IO.File]::ReadAllText("$work\head.html"); $body=[IO.File]::ReadAllText("$work\body.html"); $footerRaw=[IO.File]::ReadAllText("$work\footer.html"); $footer=$footerRaw -replace '^</div>\s*',''; [IO.File]::WriteAllText($f, $head + "`n" + $body + "`n" + $footer) + ``` + + The `-replace '^</div>\s*',''` strips the original body's closing `</div>` from the footer so the new body's own closing `</div>` doesn't double up. Always run `validate-report.ps1` after. 
+ + **Critical for the rebuild path:** the rebuilt body must include `data-spark` on every KPI tile and `data-trend` on every relevant table row — the in-place template has these, but a fresh-authored body won't unless you add them explicitly. Reference markup: + + ```html + <!-- KPI tile with sparkline --> + <div class="kpi"> + <div class="label">Silent auth requests (week)</div> + <div class="value">10.59 B</div> + <div class="delta delta-up">+2.4% WoW</div> + <div class="spark" data-spark='[9.97e9,9.61e9,...,1.06e10]' data-color="#0969da"></div> + </div> + + <!-- 60-day trend table row with mini sparkline in the trajectory cell --> + <tr> + <td><code>no_tokens_found</code></td> + <td class="num">2.90 M</td><td class="num">4.52 M</td><td class="num bad">+55.7%</td> + <td><span class="trend" data-trend='[2902878,...,4519309]' data-color="#cf222e" data-w="160"></span></td> + </tr> + ``` + + The footer JS auto-renders both — no per-tile JS calls needed. The validator (Step 7) hard-fails if > half the KPI tiles lack `data-spark`. + +## Validator pass before saving + +Two literal-string greps must return zero results: + +```pwsh +Select-String -Path <output.html> -Pattern '\bdevs\b|\breqs\b' -CaseSensitive:$false # user-facing terminology +Select-String -Path <output.html> -Pattern 'EXAMPLE CONTENT BELOW' # unfinished-section sentinel +``` + +Authors mark unfinished sections with the literal text `EXAMPLE CONTENT BELOW` +inside an HTML comment. The grep catches anything still in flight. + +`devs` / `reqs` are allowed inside `<pre><code>…</code></pre>` KQL blocks +(legitimate Kusto column / variable names). All other occurrences are +forbidden — use `devices` / `requests` in user-facing prose, headers, badges, +and verdicts. + +## Sparklines are MANDATORY (don't drop them) + +The footer JS auto-renders any element with `data-spark` or `data-trend` attributes — but only if you actually emit those attributes. 
**Past mistake (v7 run):** body was rebuilt without `data-spark` on KPI tiles and without `.trend` cells in tables → the report shipped with zero charts. The validator does not catch this, so it is your responsibility. + +Required spark/trend coverage in every report: + +| Where | Attribute | Length | Color (see palette below) | +|---|---|---|---| +| Every KPI tile in `.kpi-grid` (Top-line health) | `<div class="spark" data-spark='[...]' data-color="..."></div>` inside the tile | 8–9 weekly values | blue/green/dark-blue per metric semantic | +| **Every** row in the 60-day trend tables — true regressions, **ephemeral spikes**, and **true improvements** (all three callout tables) | `<span class="trend" data-trend='[...]' data-color="..." data-w="160"></span>` in the trajectory cell | 8–9 weekly values | red regression / amber spike / green improvement / grey flat | +| Every row in the error-codes WoW table and error-types WoW table | `<span class="trend" data-trend='[...]' data-color="..."></span>` in the 60d-trend column | 8 weekly values | same palette | + +**Past failure modes:** +- v7 first pass: the body rebuild emitted *zero* `data-spark` / `data-trend` (validator now hard-fails this). +- v7 second pass: only the *true regressions* table got sparklines; the **ephemeral spikes** and **true improvements** tables were left text-only. All three tables in the 60-day trend section need the trajectory column with a sparkline — the validator's overall-coverage warn (≥15) catches this approximately, but the rule of thumb is: **if a row reports an 8-week delta, it gets a sparkline.** + +## Traffic-shape callout styling + +The Section 2 "Traffic shape" callout uses the **neutral grey-bordered `<div class="callout">`** (no `urgent` / `watch` / `win` modifier) and a **🚦** icon — it's an informational summary, not an alert. Don't promote it to `watch` (yellow) just because there's been some movement; reserve `watch` for things that need follow-up. 
+ +## Traffic-attribution sub-block on each attribution card (tri-state) + +Each `.attr-card` in Section 4 ends with a small "Traffic attribution" sub-block. **Pick one of three colors based on the verdict — don't paint everything yellow.** Yellow loses meaning when it's the default. + +| Verdict | Color | Title prefix | Inline `style` on the wrapper | +|---|---|---|---| +| Per-request rate clearly moved; traffic ruled out | 🟢 green | `✓ Traffic attribution — ruled out` | `background:#dafbe1;border-color:#1a7f37;` + title `color:#1a7f37;` | +| Mixed signal — traffic + rate both contributing | 🟡 yellow | `⚠ Traffic attribution — partly contributing` | `background:linear-gradient(180deg,#fff8c5 0%,#fff1a8 100%);border-color:#d4a72c;` + title `color:#9a6700;` | +| Traffic IS the dominant driver | 🔴 red | `🚚 Traffic attribution — primary driver (see § 5)` | `background:#ffeef0;border-color:#cf222e;` + title `color:#cf222e;` | + +A red sub-block here means the error **also** belongs in the top-level § 5 "🚚 Traffic Attribution" section. Don't surface a red sub-block without a matching § 5 entry, and don't render § 5 as "None this week" if any attribution card has a red sub-block. + +Past failure mode (v7 second pass): all 10 cards painted yellow regardless of verdict, making the color meaningless. The actual breakdown that week was 6 green + 4 yellow + 0 red. + +**Minimum verification step before publishing** (add to your final-pass checklist): + +```pwsh +Select-String -Path <output.html> -Pattern 'data-spark|data-trend' | Measure-Object | Select-Object Count +``` + +Should return **at least ~30** matches (8 KPI tiles + ~10 60d-trend rows + ~12 WoW-table rows). If the count is zero or near-zero, the report is missing all charts — go back and add them. + +## Attribution-card layout — the two v8 traps + +The CSS in `report-template.html` now guards both, and `validate-report.ps1` § 9 +hard-fails when the rules are missing. Two failure modes to know about: + +### 1. 
Cards touching (no spacing between consecutive `.attr-card`s) + +The template originally relied on an outer `<div class="attr-grid">` wrapper to +provide `gap: 16px` between cards. A head+body+footer rebuild that emits +`.attr-card` elements directly under `<h2>` produces visually touching cards. + +**Fix in template CSS:** `.attr-card { margin-bottom: 16px }` + `.attr-card + +.attr-card { margin-top: 16px }`. If you ever rewrite the head, make sure both +rules survive. + +### 2. Text bleeding out of `.dim` boxes (long calling-app / version names) + +Two flexbox traps stack here: + +- **`text-overflow: ellipsis` is silently ignored on `display: inline` elements.** + A `<span>` defaults to inline. The name span must be `display: block` (or + `inline-block`) for ellipsis to render. +- **Flex children don't shrink below their content size by default.** Both the + flex child AND every flex ancestor need `min-width: 0` explicitly. + +**Two valid `.dim-row` markup variants — pick one per card:** + +```html +<!-- Variant A: classed spans (original template, recommended) --> +<div class="dim-row"> + <div class="dim-bar-track"><div class="dim-bar-fill dominant" style="width:99.0%"></div></div> + <span class="dim-name">AcquireTokenSilent</span> + <span class="dim-pct">99.0%</span> +</div> + +<!-- Variant B: unclassed spans (terser; CSS covers both forms via :first-of-type / :last-of-type) --> +<div class="dim-row"> + <div class="dim-bar-track"><div class="dim-bar-fill" style="width:36.6%"></div></div> + <span>com.microsoft.windowsintune.companyportal</span> + <span>36.6%</span> +</div> + +<!-- Placeholder rows ("Not sliced — …") — one span only, still truncate --> +<div class="dim-row"> + <span style="color:#656d76;font-size:11.5px;">Not sliced — OEM not suspected.</span> +</div> +``` + +The CSS rules `text-overflow: ellipsis` + `display: block` + `flex: 1 1 0` + +`min-width: 0` + `max-width: 100%` are baked into the template name-column +selector for both classed and 
unclassed variants. Do not bypass them by setting +inline `white-space: normal` or removing `min-width: 0` from `.dim` / +`.attr-dims` — that's how the bug regresses. + +## Sparkline color palette + +Used by both `.spark` (KPI tiles) and `.trend` (table cells): + +| Color hex | Semantic | When to use | +|---|---|---| +| `#cf222e` red | bad / regression | data-trend on a row in the regressions table | +| `#1a7f37` green | good / improvement / win | data-spark on a reliability KPI; data-trend on a recovery | +| `#0969da` blue | neutral / informational | data-spark on traffic-volume KPIs | +| `#0550ae` darker blue | latency | data-spark on p95 KPIs | +| `#9a6700` amber | watch / spike | data-trend on ephemeral spikes (peak-then-recover) | +| `#656d76` grey | flat / no-movement | data-trend on flat rows in long-tail tables | + +## CSS class quick reference + +(Defined in `<style>`; do not redefine inline.) + +### Section 2 callouts (at-a-glance, flat rows — the "Things that need attention" block) + +| Class | Use | +|---|---| +| `.callout` (`.urgent` / `.watch` / `.win`) | The outer card with the colored left rail and pastel background. The rail color IS the severity affordance — do not add per-item left bars inside the callout (they will visually clash). | +| `.item-list` | Container for the flat row list inside a callout body. | +| `.item` | Single divider-separated row. NO chrome — no border, no background, no left bar. The `.item:first-child` selector removes the top divider. | +| `.item-head` | Flex row: name + inline metric chips + tags pushed right. Use `flex-wrap` so it works on narrow viewports. | +| `.item-name` | Monospace bold name (the `error_code` or `error_type`). Append a `<span class="kind">type</span>` pill if it's an error_type, not an error_code. | +| `.metric` (`.up` / `.down`) | Inline metric chip: `<label> <value>`. `.up` = red (regression), `.down` = green (improvement). 
Use multiple per row for `devices`, `Δ WoW`, `Δ requests`, `on 16.0.1`, etc. | +| `.item-tags` | Right-pushed tag rail. Put the originator chip (`origin-broker` / `origin-thirdparty` / `origin-android` / `origin-env`) here, plus optional `NEW` / `60d↑` tags. | +| `.item-body` | One short narrative line (throw site + dominant message + verdict). Keep it short — the deep dive belongs in the spike-attribution card. | +| `.item-foot` | Optional footer with owner / next step + right-aligned `Attribution card →` link via `.arrow-link`. | + +**HARD RULE:** Section 2 items are at-a-glance — they MUST link to a deep-dive `.attr-card` in Section 4 via `<a class="arrow-link" href="#card-XXX">Attribution card →</a>` rather than duplicating the dim slicing or PR analysis inline. The split between Section 2 (skim) and Section 4 (deep dive) is the whole point of the report layout. + +### Section 4 attribution cards (deep-dive — the "🔎 Spike Attribution" block) + +| Class | Use | +|---|---| +| `.attr-card` | Per-error attribution container. Each WoW regression AND each 60d regression gets one. | +| `.attr-header` (`.urgent` / `.watch`) | Header strip with name + tag chips. | +| `.attr-name`, `.attr-tags`, `.attr-verdict` (`.bad`) | Header content + top verdict paragraph. | +| `.attr-dims` | 7-tile grid for the 7 mandatory dim slices. | +| `.dim` / `.dim-label` / `.dim-row` / `.dim-bar-track` / `.dim-bar-fill` (`.dominant` / `.split`) | Single dim tile with concentration bars. | +| `.code-attr` / `.code-attr-title` | Labeled-grid block under the dims. | +| `.origin-row` | One row in the code-attr grid (label + value). | +| `.stack` | Chip for a `file:line` throw-site reference. | +| `.pr-card` / `.pr-conf` (`-high` / `-medium` / `-low` / `-none`) / `.pr-body` | PR citation with confidence pill. | +| `.origin-tag` (`.origin-broker` / `.origin-android` / `.origin-thirdparty` / `.origin-env`) | Colored chips for the Originator field. 
| + +### Section 6/7 WoW table row pills + +Status pills in the `error_codes` and `error_types` WoW tables. The 5-color +palette is meaningful — pick the one that matches the row's state: + +| Class | Color | Emoji | When to use | +|---|---|---|---| +| `.pill-bad` | red (#ffeef0 bg / #cf222e text) | 🔴 | Row crossed regression threshold this week — `WoW`, `NEW`, `spike`, or `retry storm` modifier. | +| `.pill-watch` | amber (#fff8c5 bg / #9a6700 text) | 🟡 | Row is flat WoW but rising on the 60d window (use the `60d↑` modifier). | +| `.pill-good` | green (#dafbe1 bg / #1a7f37 text) | 🟢 | Row is improving — recovery, `improving`, `60d↓`, or `requests↓` modifier. | +| `.pill-flat` | grey (#f0f3f6 bg / #656d76 text) | ⚪ | Row is within ±10% on both 60d and WoW; explicitly stable. | +| `.pill-info` | blue (#ddf4ff bg / #0550ae text) | ℹ️ | Informational rows (e.g. policy-driven, fleet-growth-driven). | + +Render pattern: +```html +<span class="pill pill-bad">🔴 WoW</span> +<span class="pill pill-watch">🟡 60d↑</span> +<span class="pill pill-good">🟢 improving</span> +``` + +If your table has zero `.pill-bad` rows the week was unusually quiet — +double-check the WoW-movers and 60d bucketing passes ran. If every row is +`.pill-bad` you've mis-categorized. + diff --git a/.github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html b/.github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html new file mode 100644 index 00000000..e16b4071 --- /dev/null +++ b/.github/skills/oncall-weekly-telemetry-report/assets/templates/traffic-attr-card.html @@ -0,0 +1,46 @@ +<!-- + Traffic Attribution card — for errors whose spike is fully or partly explained + by per-app request volume rising rather than per-request failure rate. Routed + to the calling-app team, not to broker. + + Use under the "🚚 Traffic Attribution" section. If no errors qualify in a + given week, emit the "None this week" callout instead (see SKILL.md Step 6c). 
+--> +<div class="attr-card" id="traffic-card-{{ERROR_ID}}"> + <div class="attr-header" style="background:linear-gradient(180deg,#fff8c5 0%,#fffbe8 100%);border-bottom-color:#d4a72c;"> + <div class="attr-name">{{ERROR_NAME}}   + <span class="tag tag-warn">{{WOW_BADGE}}</span> + <span class="tag tag-info">traffic-driven, not failure-rate</span> + </div> + <div class="attr-tags"> + <span class="tag tag-info">dominant caller: {{DOMINANT_APP}}</span> + </div> + </div> + <div class="attr-body"> + <div class="attr-verdict"> + <strong>Verdict:</strong> {{VERDICT_PARAGRAPH}} + </div> + + <table style="width:100%; font-size:12px;"> + <thead><tr> + <th>Calling app</th> + <th class="num">Δ overall requests WoW</th> + <th class="num">Per-request failure rate (prev → cur)</th> + <th class="num">Δ failure rate</th> + </tr></thead> + <tbody> + <tr> + <td><code>{{APP_1}}</code></td> + <td class="num">{{APP_1_DELTA_REQ}}</td> + <td class="num">{{APP_1_PREV_RATE}} → {{APP_1_CUR_RATE}}</td> + <td class="num {{APP_1_RATE_CLASS}}">{{APP_1_DELTA_RATE}}</td> + </tr> + <!-- repeat per affected app --> + </tbody> + </table> + + <div class="attr-verdict" style="border-left-color:#9a6700; background:#fff8c5; margin-top:12px;"> + <strong>Routing:</strong> 📝 <strong>{{CALLER_OWNER_TEAM}}</strong> (not broker). Per-request failure rate is essentially flat, so a code regression in the broker is not implicated. The error spike is a function of {{DOMINANT_APP}} sending {{APP_1_DELTA_REQ}} more requests this week. + </div> + </div> +</div>