From 23292996ce8d0292d5eb3a235dc1c0df78e5dcf5 Mon Sep 17 00:00:00 2001
From: Antriksh Jain <antrikshjain@microsoft.com>
Date: Wed, 6 May 2026 03:12:26 +0530
Subject: [PATCH 1/5] docs(ai-agents): design doc for context-aware next-step
 guidance

Initial draft of the design for issue #7975  context-aware Next: guidance
across azd ai agent commands plus a new `azd ai agent doctor` diagnostic.

Scoped entirely to cli/azd/extensions/azure.ai.agents/. References core
files for context but proposes no edits outside the extension.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 cli/azd/docs/design/azd-ai-agent-nextsteps.md | 393 ++++++++++++++++++
 1 file changed, 393 insertions(+)
 create mode 100644 cli/azd/docs/design/azd-ai-agent-nextsteps.md

diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
new file mode 100644
index 00000000000..37914caac85
--- /dev/null
+++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
@@ -0,0 +1,393 @@
+# Design: Context-Aware Next-Step Guidance for `azd ai agent`
+
+## Status
+
+**Draft — for design review.**
+Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975) — "Context-aware next-step guidance and diagnostics".
+Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — same path other extensions use today.
+
+## Goal
+
+After every successful `azd ai agent` command — and on demand via a new `azd ai agent doctor` — print a short, state-aware **Next:** block telling the developer the single best command to run next. Replace the current static, often-misleading hints with output derived from the project's real state (`azure.yaml`, the azd environment, optionally a running agent's OpenAPI endpoint).
+
+Non-goals:
+- Reworking error-path guidance — `internal/exterrors` already does this well.
+- Any modification to files outside `cli/azd/extensions/azure.ai.agents/`. State is read via the existing gRPC `azdext` API.
+- Sample-author hooks, README markers, or YAML-declared hints (see [Alternatives](#alternatives-considered)).
+
+## Background
+
+### What's wrong today
+
+| Surface | Today | Problem |
+|---|---|---|
+| `init` | Hardcoded: "run `azd deploy` or `azd up`" | Skips local-dev path entirely. Suggests deploy even when deps aren't provisioned. |
+| `run` | Hardcoded: `invoke --local "Hello!"` | Wrong payload for invocations-protocol agents. No OpenAPI awareness. |
+| `invoke --local` (success) | Nothing | Dead-end. No nudge toward `azd deploy`. |
+| `invoke` (remote, success) | Nothing | No nudge to `show` / `monitor`. |
+| `show` (success) | Nothing | No nudge to invoke or check logs. |
+| `deploy` | Endpoint URLs only | No invoke command, no link to README. |
+| (none) | — | No way to ask "where was I?" or "what's broken across the project?" |
+
+### What's already in place
+
+These are the building blocks the design reuses:
+
+- **`azdext` gRPC client** — extensions read `azure.yaml` services and azd env vars without re-implementing parsing.
+- **`fetchOpenAPISpec`** in `helpers.go` — already probes `GET /invocations/docs/openapi.json`, caches to disk, fails silently. Drives the rich-payload story.
+- **`exterrors`** factories — typed errors with `Suggestion` fields. We follow the same "structured advice" shape on success paths.
+- **`azdext.Artifact` return value** from `Deploy()` — the extension SDK's normal mechanism for an extension to surface lines below its progress bullet. Already used by the existing `service_target_agent.go` to print endpoint URLs. We populate the same field with our Next: block. Indentation rules are documented in [Output Discipline](#output-discipline) and pinned by an extension-side regression test.
+
+## Architecture
+
+### Component layout
+
+```
+cli/azd/extensions/azure.ai.agents/internal/cmd/
+├── nextstep/                     ← new package (this design)
+│   ├── types.go                  ← Suggestion, State, ServiceState
+│   ├── state.go                  ← AssembleState(ctx, azdClient) → State
+│   ├── resolver.go               ← decision tree per command
+│   ├── format.go                 ← PrintNext / FormatNextForNote
+│   └── openapi.go                ← OpenAPI-derived invoke examples
+├── doctor.go                     ← new command — calls nextstep + extra checks
+├── init.go        ─┐
+├── run.go         ─┼─ each calls nextstep.ResolveAfterX → nextstep.PrintNext
+├── invoke.go      ─┤
+├── show.go        ─┘
+└── ...
+internal/project/
+└── service_target_agent.go       ← Deploy() embeds Next: in its returned artifact
+```
+
+One package owns *all* policy. Each command is a thin caller — it knows when it succeeded and which resolver to invoke; it does not assemble state, format output, or decide commands.
+
+### End-to-end flow
+
+```
+┌────────────────────────────────────────────────────────────────────┐
+│ azd ai agent <cmd> succeeds                                        │
+│                                                                    │
+│   1. nextstep.AssembleState(ctx, azdClient)                        │
+│      ├── azdext.Project().Get      → azure.yaml services           │
+│      ├── azdext.Environment().Get  → azd env vars                  │
+│      └── (run only) fetchOpenAPISpec → invoke payload examples     │
+│                                                                    │
+│   2. nextstep.ResolveAfter<Cmd>(state, args) → []Suggestion        │
+│      ├── walks command-specific decision tree                      │
+│      └── chooses 1 primary + ≤1 secondary                          │
+│                                                                    │
+│   3a. nextstep.PrintNext(w, suggestions)        ← stdout commands  │
+│   3b. nextstep.FormatNextForNote(suggestions)   ← deploy artifact  │
+└────────────────────────────────────────────────────────────────────┘
+```
+
+`deploy` is the special case: an extension's `Deploy()` returns `[]*azdext.Artifact` and that is the extension's only output channel after the call returns. We populate the artifact's note field with the formatted Next: block — exactly the same mechanism `service_target_agent.go` already uses today to print endpoint URLs. Everywhere else, the extension writes directly to stdout (or stderr when piped — see [Output Discipline](#output-discipline)).
+
+## State Model
+
+```go
+// types.go
+type State struct {
+    HasProjectEndpoint     bool
+    HasUnresolvedInfraVars bool   // ${...} → Bicep outputs not in azd env
+    HasUnresolvedManualVars []string // ${...} → user-supplied vars not set
+    Services               []ServiceState
+    AgentStatus            string  // optional (Foundry API), empty if unknown
+    HasOpenAPI             bool    // run-time only
+    OpenAPIPayload         string  // run-time only, ""=no spec
+}
+
+type ServiceState struct {
+    Name         string
+    Host         string  // azure.ai.agent / azure.ai.toolbox / ...
+    Protocol     string  // responses / invocations / ""
+    RelativePath string  // svc.RelativePath from azure.yaml
+    IsDeployed   bool    // azd env: deployment metadata present
+}
+
+type Suggestion struct {
+    Command     string  // "azd ai agent run"
+    Description string  // "start the agent locally"
+    Priority    int     // lower = earlier
+}
+```
+
+State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The cost is one gRPC round-trip per resolver invocation, which is negligible compared to anything the user perceives.
+
+### Layered sources
+
+The resolver works with whatever's available — partial state never silences guidance:
+
+| Layer | Source | Authority |
+|---|---|---|
+| 1 | Explicit CLI flags (`--project-endpoint`, `--agent`) | Highest |
+| 2 | Live runtime probes (OpenAPI, Foundry status) | Override env vars |
+| 3 | `azure.yaml` services, protocols, `config.env` refs | Structural truth |
+| 4 | azd env vars | Resolution truth |
+
+If the azd environment is missing or stale (a real scenario for brownfield users), the resolver should still produce useful output from layers 1–3.
+
+## Decision Tree (per command)
+
+The full tree lives in `resolver.go`. This section captures the policy in tabular form for review.
+
+### `init`
+
+| Condition | Primary | Secondary |
+|---|---|---|
+| `HasUnresolvedInfraVars` | `azd provision` | — |
+| `HasUnresolvedManualVars` (non-empty) | `azd env set <KEY> <value>` (one per missing var, up to 3) | — |
+| Otherwise | `azd ai agent run` | — (the `run` resolver will print the protocol-correct `invoke --local` when the agent starts) |
+| Always (third line) | "When ready to deploy to Azure, run `azd deploy`." | — |
+
+### `run` (on listening callback)
+
+| Condition | Primary |
+|---|---|
+| `HasOpenAPI` && payload extracted | `azd ai agent invoke --local '<extracted-payload>'` |
+| `Protocol == "invocations"` | `azd ai agent invoke --local '{"message": "Hello!"}'` |
+| `Protocol == "responses"` or unknown | `azd ai agent invoke --local "Hello!"` |
+
+If OpenAPI fetch failed (404, timeout), append a Tip line:
+```
+Tip:  curl http://localhost:<port>/invocations/docs/openapi.json
+      to verify the exact payload format your agent expects.
+```
+
+### `invoke --local` (success)
+
+Single agent: `azd deploy`, secondary `azd ai agent monitor --follow`.
+Multi-agent: `azd deploy` only (one line per agent for follow-up `invoke <name>`).
+
+### `invoke` (remote)
+
+Success: `azd ai agent show <agent>`, secondary `azd ai agent monitor --follow`.
+Failure: `azd ai agent monitor --follow`.
+
+### `show` (success)
+
+| `AgentStatus` | Primary |
+|---|---|
+| `active` / `idle` | `azd ai agent invoke <agent> 'Hello!'` |
+| `failed` / `""` | `azd ai agent monitor --follow` |
+| transitional | `azd ai agent show <agent>` (re-check) |
+
+### `deploy` (post-deploy hook, embedded in artifact note)
+
+Success, single agent:
+```
+azd ai agent show <agent>              -- verify it's running
+azd ai agent invoke '<payload>'        -- test the deployment
+See <relpath>/README.md for a sample payload appropriate for this agent.
+```
+
+Notes:
+- The `invoke` line **omits the agent name** for single-agent projects (matches sample README convention).
+- README hint is emitted **only if** `os.Stat` succeeds on `<root>/<svc.RelativePath>/{README.md,readme.md,README.MD}`. No hint for projects without a README.
+
+Multi-agent: one `show <name>` and one `invoke <name>` line per agent.
+
+### `doctor`
+
+Special: runs every check unconditionally, then calls `ResolveAfterInit` (or a deploy-aware variant when `IsDeployed`) for the trailing Next: block. Same code path as success-path resolvers.
+
+## Output Format
+
+```
+Next:  <primary command>          -- <description>
+       <secondary command>        -- <description>
+```
+
+Ground rules:
+- One primary, ≤1 secondary. More than two lines drowns out command output.
+- Two-space gap between command and `--`.
+- Multi-agent: repeat per agent on separate lines, no menus.
+- Leading blank line separates from preceding command output.
+- README/docs pointers go on their own line below the commands.
+- Always present `Next:` prefix so the block is visually distinct.
+
+### Output discipline
+
+| Scenario | Destination |
+|---|---|
+| Interactive TTY | `stdout` |
+| Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output |
+| `--output json` / structured output flag | Suppressed entirely |
+| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs) |
+
+### Multi-line note pre-indentation
+
+### Multi-line note pre-indentation
+
+The `azdext.Artifact` SDK field used here renders only the **first line** at the natural indent level; subsequent lines need to be pre-indented by the producer to align under the artifact bullet. `FormatNextForNote` pre-indents lines 2+ with 4 spaces and trims leading/trailing newlines. This is purely a string-formatting concern handled inside the extension.
+
+## Hook Points
+
+| Command | Trigger | Mechanism |
+|---|---|---|
+| `init` | end of `runE` after success message | direct `nextstep.PrintNext(w, …)` |
+| `run` | "agent is listening" callback | direct print before `Press Ctrl+C` line |
+| `invoke --local` / `invoke` | end of `runE` on success | direct print |
+| `show` | end of table render | direct print |
+| `doctor` | after rendering check report | direct print |
+| `deploy` | inside `Deploy()` return path | attach Next: to the returned `azdext.Artifact` |
+
+The deploy hook returns its Next: block through the same `azdext.Artifact` SDK field that `service_target_agent.go` already uses for endpoint URLs. No new APIs, no edits outside the extension. Everywhere else, the extension owns its terminal directly.
+
+### Why not a `listen.go` event handler for deploy?
+
+The listen daemon's stdout is consumed by core's gRPC channel — `fmt.Printf` from a `postdeployHandler` never reaches the user terminal. Confirmed during a previous spike. The artifact-note path is the only reliable surface.
+
+## OpenAPI Discovery
+
+Reuses `fetchOpenAPISpec` from `helpers.go`. Adds one helper in `nextstep/openapi.go`:
+
+```go
+// ExtractInvokeExample returns a JSON-quoted payload string from the spec's
+// /invocations endpoint (preferring `example`, falling back to a minimal object
+// derived from `schema.required`). Returns "" if the spec is unusable.
+func ExtractInvokeExample(spec []byte) string
+```
+
+Resolution order:
+1. `paths./invocations.post.requestBody.content.application/json.example` — exact author intent.
+2. `…schema.example` — schema-level example.
+3. Generate from `schema.required` + `schema.properties[*].example` — minimum viable JSON.
+4. Return `""` → fall back to protocol-generic payload + Tip line.
+
+All failures are silent. The fetch itself is best-effort and already cached — we never block `run`'s startup notice on it.
+
+## `doctor` Command
+
+`doctor` is the persistence layer for state-aware guidance: when the user has lost terminal context or hit a confusing error, they run one command and get the full picture plus the specific fix for each broken thing.
+
+### Checks (run in order, top to bottom)
+
+| # | Check | Pass | Fail → Fix |
+|---|---|---|---|
+| 1 | `azd` reachable + extension running | "extension running, gRPC channel established" | install / start daemon |
+| 2 | `azure.yaml` valid | "1 service: <names>" | `azd ai agent init` |
+| 3 | azd environment selected | "<env-name>" | `azd env new` / `azd env select` |
+| 4 | Agent service in `azure.yaml` | service count + names | `azd ai agent init` |
+| 5 | `AZURE_AI_PROJECT_ENDPOINT` set | URL value | `azd provision` |
+| 6 | `agent.yaml` per service is valid | path | fix YAML |
+| 7 (post-MVP) | Authentication (`az account show`) | signed-in user | `azd auth login` |
+| 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall |
+| 9 (post-MVP) | Model deployments exist | name + version | `azd provision` |
+| 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` |
+| 11 (post-MVP) | Agent status (if deployed) | `active (vN)` | `azd ai agent monitor --follow` |
+
+Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass.
+
+### Doctor output shape
+
+```
+azd ai agent doctor
+  ✓ PASS  <check name>
+          <one-line detail>
+  ✗ FAIL  <check name>
+          <one-line detail>
+          fix:  <command or instruction>
+
+Next:  <resolved next-step block, same format as elsewhere>
+```
+
+When all checks pass, the trailing Next: block is the resolver's `ResolveAfterInit` (if not deployed) or a deploy-aware variant (if deployed) — exactly what the user would have seen at the end of the last successful command.
+
+## Testing Strategy
+
+| Layer | What | How |
+|---|---|---|
+| `nextstep/format` | indentation, blank-line rules, secondary suppression, empty-suggestions early return | table-driven unit tests |
+| `nextstep/resolver` | every branch of every per-command tree | table-driven, fake `State` |
+| `nextstep/openapi` | each extraction path + corruption / empty fallback | golden specs in `testdata/` |
+| `nextstep/state` | layered source priority, partial state | mocked `azdClient` (testify/mock) |
+| `service_target_agent` | with/without README on disk → with/without hint line | `t.TempDir` + `t.Chdir` |
+| `doctor` | each check pass/fail rendering | mocked `azdClient` |
+
+All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure calls.
+
+## Risks & Mitigations
+
+| Risk | Mitigation |
+|---|---|
+| Resolver guidance contradicts what the command actually did (false advice) | Each resolver receives the *post-execution* state. Tests cover deployed / not-deployed / partial-env permutations. |
+| OpenAPI fetch slows down `run` startup | Already capped + cached + silent on failure. We do not block the listening notice on the fetch. |
+| Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on piped stderr; user can ignore. |
+| Decision tree drifts from sample reality | All trees are table-driven and unit-tested per branch. Sample contract changes go through the same review. |
+| Artifact-note rendering behavior could shift in a future SDK update | Pin the indentation contract with an extension-side regression test that constructs an artifact and asserts the formatted string. If the SDK changes shape, the test breaks loudly and we adapt formatting — still no edits required outside the extension. |
+| Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` cap: 5 agents inline; beyond that, suggest `azd ai agent show` (no args, lists all). |
+
+## Implementation Phases
+
+1. **Foundation** — `nextstep` package (types, format, state assembly, OpenAPI helper). No callers.
+2. **Wire success paths** — `init`, `run`, `invoke`, `show`. Resolver per command.
+3. **Deploy hook** — attach Next: block to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs in `service_target_agent.go`). README-on-disk verification.
+4. **`doctor` command** — checks 1–6 (local-only). Wire trailing Next: through resolver.
+5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). Separate design.
+
+Each phase is independently shippable and reviewable. Phases 1–4 are the MVP for #7975.
+
+## Appendix A: Example Outputs (full block, end-to-end)
+
+### After `init`, fresh project, deps not provisioned
+
+```
+✓ Agent project initialized.
+
+Next:  azd provision  -- set up your Foundry project, models, and connections
+
+This creates your Foundry project and any model deployments, toolboxes,
+or connections your agent needs for local development.
+
+Once that finishes, run 'azd ai agent run' to start locally.
+When ready to deploy to Azure, run 'azd deploy'.
+```
+
+### After `run`, invocations protocol, with OpenAPI
+
+```
+Agent is running at http://localhost:8088
+
+Next:  azd ai agent invoke --local '{"message": "What hotels are in Seattle?"}'
+
+Press Ctrl+C to stop.
+```
+
+### After `azd deploy` (rendered by core from artifact note)
+
+```
+  (✓) Done: Deploying service ag-ui-invocations
+  - Agent playground (portal):     https://ai.azure.com/...
+  - Agent endpoint (invocations):  https://....services.ai.azure.com/...
+    Next:  azd ai agent show ag-ui-invocations    -- verify it's running
+           azd ai agent invoke '<payload>'        -- test the deployment
+    See src/ag-ui-invocations/README.md for a sample payload appropriate for this agent.
+```
+
+### After `azd ai agent doctor`, project deployed and healthy
+
+```
+azd ai agent doctor
+  ✓ PASS  azd CLI is installed and reachable
+          extension running, gRPC channel established
+  ✓ PASS  Project loaded from azure.yaml
+  ✓ PASS  Current azd environment selected
+  ✓ PASS  Agent service detected in azure.yaml
+  ✓ PASS  AZURE_AI_PROJECT_ENDPOINT is set
+  ✓ PASS  agent.yaml for service "echo-agent" is valid
+
+Next:  azd ai agent show echo-agent              -- verify it's running
+       azd ai agent invoke '<payload>'           -- test the deployment
+       See src/echo-agent/README.md for a sample payload appropriate for this agent.
+```
+
+## Appendix B: References
+
+- Issue [#7975](https://github.com/Azure/azure-dev/issues/7975)
+- Existing OpenAPI fetch: [`helpers.go#L296`](https://github.com/Azure/azure-dev/blob/main/cli/azd/extensions/azure.ai.agents/internal/cmd/helpers.go#L296)
+- Existing init suggestion: [`init.go#L1718-L1733`](https://github.com/Azure/azure-dev/blob/main/cli/azd/extensions/azure.ai.agents/internal/cmd/init.go#L1718-L1733)
+- Existing endpoint artifact emission (same SDK field reused for the Next: block): [`service_target_agent.go`](../../extensions/azure.ai.agents/internal/project/service_target_agent.go)
+- `exterrors` package (parallel pattern for error paths): [`internal/exterrors`](../../extensions/azure.ai.agents/internal/exterrors)
+- Reference Foundry sample (OpenAPI opt-in): [`microsoft-foundry/foundry-samples` human-in-the-loop](https://github.com/microsoft-foundry/foundry-samples/blob/fd4ecab832fa491b9ffb4abca862c073777b5e53/samples/python/hosted-agents/bring-your-own/invocations/human-in-the-loop/main.py#L119)

From 18bc6589ee38da9ffc0258a212fd4194a8876a02 Mon Sep 17 00:00:00 2001
From: Antriksh Jain <antrikshjain@microsoft.com>
Date: Wed, 6 May 2026 03:21:58 +0530
Subject: [PATCH 2/5] docs(ai-agents): add separate design for doctor remote
 checks (7-11)

Companion to azd-ai-agent-nextsteps.md. Covers the five live-Azure
doctor checks deferred from the MVP: auth, Foundry reachability,
model deployments, RBAC, and agent runtime status.

Key decisions:
- Opt-in via --remote flag (default stays local-only, fast, no auth)
- Strict dependency matrix; failed prerequisites become explicit Skip results
- Per-check timeouts (~10s); ~30s worst-case walltime for --remote
- Read-only  RBAC fix is surfaced as az CLI command, never auto-applied
- Phased rollout: runner split, auth+reach, models+status, RBAC last

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../azd-ai-agent-doctor-remote-checks.md      | 260 ++++++++++++++++++
 cli/azd/docs/design/azd-ai-agent-nextsteps.md |   4 +-
 2 files changed, 262 insertions(+), 2 deletions(-)
 create mode 100644 cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md

diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
new file mode 100644
index 00000000000..1b3fb166a9b
--- /dev/null
+++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
@@ -0,0 +1,260 @@
+# Design: `azd ai agent doctor` — Remote Checks (7–11)
+
+## Status
+
+**Draft — for design review.**
+Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975).
+Companion to [`azd-ai-agent-nextsteps.md`](./azd-ai-agent-nextsteps.md), which covers the doctor command's plumbing and local checks (1–6). **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.**
+
+## Goal
+
+Extend `azd ai agent doctor` past its local-only MVP so it can answer the question every developer eventually asks: *"the project is wired up correctly, so why doesn't this work end-to-end?"*
+
+Concretely: add five checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, RBAC, agent runtime status — each with a precise "what to do next" fix.
+
+Non-goals:
+- Fixing remote problems automatically (`doctor --fix`). Surfacing the right command is enough for v1.
+- Re-implementing what `az` / Foundry portal already do well — we just point the user there when relevant.
+- Any modification to files outside `cli/azd/extensions/azure.ai.agents/`.
+
+## Background
+
+The MVP doctor checks (1–6) are pure local reads — `azure.yaml` exists, env vars are set, YAML parses. They run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions.
+
+The remaining ~30% — and arguably the most frustrating cases — sit in remote land:
+- Token expired silently → every command 401s with no obvious "you're logged out" hint.
+- Foundry endpoint typo'd in env → DNS/TLS error buried in a stack trace.
+- Model deployment got deleted out from under the project → first invoke fails with a cryptic 404.
+- Azure-AI Developer role missing → vague 403.
+- Agent deployed but unhealthy → user thinks `azd deploy` succeeded so it must be fine.
+
+These all require live calls and credentials, which is why they're carved out into this separate design.
+
+## Scope of this doc
+
+| Check | What it answers |
+|---|---|
+| 7 — Authentication | Am I logged in? Is the token still valid? |
+| 8 — Foundry reachability | Can I reach `AZURE_AI_PROJECT_ENDPOINT`? |
+| 9 — Model deployments | Do the model(s) my agent references exist on the project? |
+| 10 — RBAC | Does my principal have the roles needed to deploy / invoke? |
+| 11 — Agent status | Does the deployed agent actually exist and is it active? |
+
+## Architecture
+
+### Component layout (additive only)
+
+```
+cli/azd/extensions/azure.ai.agents/internal/cmd/
+└── doctor/
+    ├── checks_local.go       ← (MVP) checks 1–6
+    ├── checks_remote.go      ← NEW: checks 7–11
+    ├── runner.go             ← orchestrates ordered execution
+    └── types.go              ← Check, Result, Severity
+```
+
+Existing MVP files unchanged. Only `runner.go` learns to dispatch the new check set, gated by a flag (see [Execution Model](#execution-model)).
+
+### `Check` contract
+
+Same shape as MVP — keeps the renderer, sort order, and `Next:` resolver code path identical.
+
+```go
+type Check interface {
+    ID() string                   // stable, e.g. "auth.token"
+    Title() string                // human label
+    Run(ctx context.Context) Result
+}
+
+type Result struct {
+    Status   Status               // Pass | Warn | Fail | Skip
+    Detail   string               // one-line description
+    Fix      string               // single command or instruction; empty for Pass
+    Links    []string             // doc links (optional)
+    Duration time.Duration        // captured by runner, not the check
+}
+```
+
+A remote check that can't run (e.g., not authenticated when running check 9) returns `Status: Skip` with `Detail: "skipped: authentication required (see check 7)"`. Skips never block subsequent checks — the user sees the full picture in one pass.
+
+## Execution Model
+
+### Default: local-only
+
+`azd ai agent doctor` runs checks 1–6 only. Same as MVP. No network, no token, no surprise prompts. This is the daily-driver mode.
+
+### Opt-in: `--remote`
+
+`azd ai agent doctor --remote` runs **1–11** in order. Local checks always run first; remote checks run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)).
+
+Rationale for opt-in (vs. always-on or auto-detect):
+- Remote checks need credentials and add seconds to a command users may run frequently.
+- Hitting Foundry on every `doctor` invocation is bad citizenship and surprising in CI.
+- A flag is discoverable, reversible, and self-documenting.
+
+We considered `--quick` / `--full` naming but `--remote` describes the actual side effect — network calls under your credentials — which is what reviewers and CI authors care about.
+
+### Dependency matrix
+
+| Check | Depends on | Action if dependency fails |
+|---|---|---|
+| 7 — Auth | check 3 (env selected) | Skip with "select an env first" |
+| 8 — Reachability | 5 (`AZURE_AI_PROJECT_ENDPOINT` set), 7 (auth) | Skip |
+| 9 — Models | 6 (`agent.yaml` valid), 8 (reachability) | Skip |
+| 10 — RBAC | 7 (auth), 8 (reachability) | Skip |
+| 11 — Agent status | 8, 9, 10 | Skip |
+
+Skips are explicit in the rendered report. We never quietly drop a check.
+
+## Check Specifications
+
+### Check 7 — Authentication
+
+**What it does:** acquires a token via `azidentity.NewAzureDeveloperCLICredential` (the same credential `agent_context.go` already uses), then introspects expiry.
+
+**Pass:** "<UPN> · token valid for <N> minutes"
+**Warn:** token expires in <5 min (suggest re-login proactively)
+**Fail:** token acquisition error → `azd auth login`
+
+**Why a separate fail vs warn:** users hit this all the time when a CLI session outlives a token; the warn variant tells them to re-login *before* a long-running deploy 401s.
+
+### Check 8 — Foundry project reachability
+
+**What it does:** issues a single `GET <AZURE_AI_PROJECT_ENDPOINT>/agents?api-version=…&$top=1` with the credential from check 7.
+
+**Pass:** "endpoint reachable (HTTP 200)"
+**Fail:** maps the HTTP status to one of:
+- `403` → check 10 (RBAC) will explain. Fix: see check 10 output.
+- `404` → endpoint is wrong or project is gone. Fix: `azd provision` or fix env var.
+- network/DNS/TLS → "verify VPN / firewall / typo in `AZURE_AI_PROJECT_ENDPOINT`".
+
+The single-shot probe avoids paging and works on tiny test projects. Timeout: 10s, no retries — doctor is diagnostic, not resilient.
+
+### Check 9 — Model deployments
+
+**What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally.
+
+**Pass:** "all <N> referenced models present"
+**Fail:** "missing: <model-name> (referenced by <service>)" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link).
+
+**Edge cases:**
+- Multi-version models (e.g., `gpt-4o:2024-08-06`) — we match on deployment name, not version. Version mismatch surfaces at runtime; not in scope here.
+- Agents with no model reference (orchestration-only) — pass with "no model references".
+
+### Check 10 — RBAC
+
+**What it does:** for the current principal, lists role assignments on the Foundry project scope and checks for the minimum role set:
+- `Azure AI Developer` (or stronger) on the project — required for deploy + invoke.
+- `Cognitive Services User` on the AI account — required for model usage.
+
+**Pass:** "<role-1>, <role-2>"
+**Warn:** has invoke role but not deploy role → user can run agents but not `azd deploy`. Surfaces as "you can invoke but not deploy" with the exact `az role assignment create` command for the missing role.
+**Fail:** missing both → command + portal link.
+
+The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example:
+```
+fix:  az role assignment create \
+        --role "Azure AI Developer" \
+        --assignee <principal-id> \
+        --scope <project-resource-id>
+```
+
+**Why we don't auto-create the assignment:** doctor is read-only. RBAC changes are the kind of action that should be explicit and audited.
+
+### Check 11 — Agent status
+
+**What it does:** for each service in `azure.yaml` with `host: ai-agent`, calls the Foundry agents-list endpoint and finds the matching agent by name.
+
+**Pass:** "<agent-name>: active (v<N>)"
+**Warn:** "<agent-name>: deploying (v<N>)" — running `monitor --follow` is suggested.
+**Fail:**
+- agent not found → `azd deploy`
+- agent in failed state → `azd ai agent monitor --follow` (look at logs, not redeploy blindly)
+
+This is the check that closes the loop on "I deployed and it succeeded — why doesn't `invoke` work?" Often the answer is the agent is still rolling out; check 11 says so explicitly.
+
+## Output Format
+
+Same renderer as MVP. Skipped checks render with a distinct glyph so they're visible:
+
+```
+azd ai agent doctor --remote
+  ✓ PASS  azd reachable
+  ✓ PASS  azure.yaml valid
+  ✓ PASS  azd environment selected
+  ✓ PASS  agent service in azure.yaml
+  ✓ PASS  AZURE_AI_PROJECT_ENDPOINT set
+  ✓ PASS  agent.yaml valid
+  ✓ PASS  authentication
+          alice@contoso.com · token valid for 47 minutes
+  ✗ FAIL  Foundry project reachability
+          GET /agents → 403
+          fix:  see RBAC check below
+  -  SKIP  model deployments
+          skipped: reachability check failed
+  ✗ FAIL  RBAC
+          principal alice@contoso.com has no role on project scope
+          fix:  az role assignment create --role "Azure AI Developer" \
+                  --assignee 0000... --scope /subscriptions/.../projects/...
+  -  SKIP  agent status
+          skipped: reachability check failed
+
+Next:  az role assignment create ...     -- grant yourself the required role
+       azd ai agent doctor --remote      -- re-run to verify
+```
+
+The trailing `Next:` block is resolved from the **first** failed remote check that has a single user-actionable fix — same logic as MVP, just extended to know about remote-check fix commands.
+
+## Performance & Side Effects
+
+Per [Performance Budget](#performance-budget) below — defaults remain local-only so users can keep running `doctor` reflexively. `--remote` adds a couple of seconds and explicit credential use.
+
+| Check | Network calls | Worst-case time |
+|---|---|---|
+| 7 — Auth | 0 (cached token) – 1 (refresh) | 2s |
+| 8 — Reach | 1 GET | 10s (timeout) |
+| 9 — Models | 1 list | 5s |
+| 10 — RBAC | 1 list-role-assignments | 5s |
+| 11 — Status | 1 list-agents | 5s |
+
+Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode.
+
+## Testing Strategy
+
+- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8).
+- **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test.
+- **Functional test** (recorded cassette via `mage record`) covering one happy-path full run.
+- No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step.
+
+## Risks & Mitigations
+
+| Risk | Mitigation |
+|---|---|
+| Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. |
+| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. |
+| Foundry API drift breaks the reachability probe | Pin the api-version explicitly. Failure renders as "API not understood" with a doc link rather than a stack trace. |
+| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts, parallel fan-out for independent checks (8 + 10 can run concurrently). |
+| User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. |
+
+## Performance Budget
+
+Local-only `doctor` (today): <100ms. Don't regress.
+`--remote`: ≤ 30s walltime in the worst case (sick endpoint). Typical: <5s.
+We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot, opt-in diagnostic.
+
+## Implementation Phases
+
+1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--remote` flag, no behavior change at this step.
+2. **Auth + reachability (7, 8)** — the two checks every other remote check depends on. Useful even on their own.
+3. **Models + agent status (9, 11)** — leverage existing `agentClient` calls. Closes the deploy-loop questions.
+4. **RBAC (10)** — last because the role-list API is the trickiest to stub and the most likely to need iteration on the fix-command wording.
+
+Each phase is independently shippable and behind the `--remote` flag, so partial rollout is safe. We can ship phases 1–2 and gather feedback before completing 3–4.
+
+## Appendix: Why not fold these into MVP?
+
+Three reasons, in order of weight:
+
+1. **Auth + network coupling.** MVP doctor runs in any project state, including before `azd auth login`. Remote checks fundamentally cannot. The flag separation makes the contract explicit.
+2. **Review surface.** The MVP design (the companion doc) is already large. Adding five live-Azure checks with their own error mapping, timeout, and skip semantics doubles it.
+3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks behind a flag so we can iterate the wording, the role-set, and the api-version pinning without churning the always-on path.
diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
index 37914caac85..dff83c41c77 100644
--- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md
+++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
@@ -278,7 +278,7 @@ All failures are silent. The fetch itself is best-effort and already cached —
 | 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` |
 | 11 (post-MVP) | Agent status (if deployed) | `active (vN)` | `azd ai agent monitor --follow` |
 
-Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass.
+Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass — see [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md).
 
 ### Doctor output shape
 
@@ -325,7 +325,7 @@ All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure
 2. **Wire success paths** — `init`, `run`, `invoke`, `show`. Resolver per command.
 3. **Deploy hook** — attach Next: block to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs in `service_target_agent.go`). README-on-disk verification.
 4. **`doctor` command** — checks 1–6 (local-only). Wire trailing Next: through resolver.
-5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). Separate design.
+5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). See [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md).
 
 Each phase is independently shippable and reviewable. Phases 1–4 are the MVP for #7975.
 

From 4358e68497b0beb9909e9b69141d246e9241d05b Mon Sep 17 00:00:00 2001
From: Antriksh Jain <antrikshjain@microsoft.com>
Date: Wed, 6 May 2026 13:12:19 +0530
Subject: [PATCH 3/5] docs(agent-nextsteps): address PR review feedback
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Synthesized across 4 model critiques (claude-opus-4.7, claude-sonnet-4.6,
gpt-5.5, gpt-5.4). Doc-only changes; no implementation churn.

azd-ai-agent-nextsteps.md
- Status: anchor artifact mechanism on `Metadata["note"]` of an
  `ArtifactKindEndpoint` artifact rendered by `pkg/project/artifact.go`
  (no core edits).
- Drop broken `[Alternatives](#alternatives-considered)` link; rephrase
  Non-goals to keep the "no core edits" point explicit.
- Add Assumptions section: design assumes unified `azure.yaml` baseline
  per #7975; `agent.yaml` references are placeholders for the equivalent
  unified service-stanza fields. Pin auth to
  `azidentity.NewAzureDeveloperCLICredential` (matches `agent_context.go`).
- Component layout: align `doctor.go` to `doctor/` subpackage matching
  the remote-checks doc.
- State Model: rename `HasUnresolvedInfraVars bool` to `MissingInfraVars
  []string`; mirror `MissingManualVars`. Add tri-state `IsAuthenticated`
  (Unknown/Authed/Unauthed) populated only via `WithAuthProbe` opt-in;
  default `Unknown` means "skip auth-conditional advice", not "tell user
  to log in". Note `HasOpenAPI`/`OpenAPIPayload` only populated by
  `run`/`doctor --remote` paths.
- Decision Tree: rename `init` condition to `len(MissingInfraVars) > 0`;
  drop `monitor --follow` secondary from `invoke --local` success
  (monitor is hosted-only — verified in `monitor.go`); deploy resolver
  uses cached spec only, no live probe at deploy time (agent may not be
  reachable yet).
- Output discipline: name `isTerminal` source of truth (`helpers.go`
  wrapper around `golang.org/x/term.IsTerminal`). Reconcile JSON-mode
  contradiction: `--output json` on success-path commands suppresses the
  human `Next:` block; `doctor --output json` is a separate structured
  emit with explicit `redacted` field. Remove duplicate "Multi-line note
  pre-indentation" heading.
- Hook Points: clarify deploy-row mechanism cites the existing
  endpoint-kind artifact emitted by `service_target_agent.go` and the
  renderer at `pkg/project/artifact.go` (`ToString`); no new artifact
  kind.
- OpenAPI Discovery: note silent-mode requirement for `fetchOpenAPISpec`
  when called by the resolver path (currently prints "OpenAPI spec saved
  to..." at `helpers.go:373`). Mark `$ref` resolution as out of scope —
  fall back to `<payload>` literal.
- Doctor checks: rename check 1 to "extension gRPC channel healthy";
  point check 7 at `azidentity.NewAzureDeveloperCLICredential`; clarify
  check 6 placeholder for unified-schema service stanza.
- Add `Exit codes & JSON output` subsection: precedence rules
  (any fail -> 1, skip-only -> 2, otherwise 0). JSON envelope sketch with
  `schemaVersion`, `remote`, `redacted`, `checks[]`.
- Testing: TTY/pipe/JSON snapshot routing; non-interactive auth-skip;
  `$ref` -> `""` extractor; cached-spec vs absent fallback; explicit
  skip-cascade test over the dependency matrix; exit-code precedence
  + JSON envelope shape.
- Add Backward Compatibility section.
- Add Security & PII / Telemetry section: redaction rules; `isTerminal`
  source-of-truth; logging policy; brief deferred-telemetry sketch.
- Add Deferred follow-ups: personas/KPIs/full-telemetry-plan rationale.
- Risks: replace cap-of-5 mitigation with a wording that doesn't pin
  a magic number — fall back to a single `azd ai agent show` line if
  multi-agent walls become a real problem.

azd-ai-agent-doctor-remote-checks.md
- Dependency matrix: drop check 8 from check 10 deps (RBAC reads ARM,
  not Foundry data plane); drop check 10 from check 11 deps
  (agents-list is Reader-level).
- Check 8: pin api-version to `DefaultAgentAPIVersion` constant in
  `internal/cmd/agent_context.go` (currently `2025-11-15-preview`); add
  401 case (token expired); broaden 403 wording to "wrong tenant or
  insufficient RBAC".
- Check 10: add redaction rule for non-interactive output; flag in
  JSON envelope.
- Risks: reconcile parallel-fan-out claim post matrix simplification;
  add identifier-leakage row pointing at the redaction rule.
- Testing: add explicit skip-cascade test over the dependency matrix;
  redaction snapshot tests for interactive vs. non-interactive RBAC.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../azd-ai-agent-doctor-remote-checks.md      |  24 ++-
 cli/azd/docs/design/azd-ai-agent-nextsteps.md | 164 ++++++++++++++----
 2 files changed, 150 insertions(+), 38 deletions(-)

diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
index 1b3fb166a9b..bee787fc2bf 100644
--- a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
+++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
@@ -100,9 +100,9 @@ We considered `--quick` / `--full` naming but `--remote` describes the actual si
 |---|---|---|
 | 7 — Auth | check 3 (env selected) | Skip with "select an env first" |
 | 8 — Reachability | 5 (`AZURE_AI_PROJECT_ENDPOINT` set), 7 (auth) | Skip |
-| 9 — Models | 6 (`agent.yaml` valid), 8 (reachability) | Skip |
-| 10 — RBAC | 7 (auth), 8 (reachability) | Skip |
-| 11 — Agent status | 8, 9, 10 | Skip |
+| 9 — Models | 6 (service stanza valid), 8 (reachability) | Skip |
+| 10 — RBAC | 7 (auth) | Skip — RBAC reads ARM, not Foundry data plane, so it does **not** depend on check 8 |
+| 11 — Agent status | 7 (auth), 8 (reachability) | Skip — agents-list is a Reader-level Foundry call and does not require check 10's deploy/invoke roles |
 
 Skips are explicit in the rendered report. We never quietly drop a check.
 
@@ -120,11 +120,12 @@ Skips are explicit in the rendered report. We never quietly drop a check.
 
 ### Check 8 — Foundry project reachability
 
-**What it does:** issues a single `GET <AZURE_AI_PROJECT_ENDPOINT>/agents?api-version=…&$top=1` with the credential from check 7.
+**What it does:** issues a single `GET <AZURE_AI_PROJECT_ENDPOINT>/agents?api-version=<DefaultAgentAPIVersion>&$top=1` with the credential from check 7. The api-version is read from the `DefaultAgentAPIVersion` constant in `internal/cmd/agent_context.go` (currently `2025-11-15-preview`) — never a doc-side literal — so a single source of truth governs which Foundry surface we probe.
 
 **Pass:** "endpoint reachable (HTTP 200)"
 **Fail:** maps the HTTP status to one of:
-- `403` → check 10 (RBAC) will explain. Fix: see check 10 output.
+- `401` → token expired or scope mismatch. Fix: `azd auth login`; if the issue persists, see check 7.
+- `403` → wrong tenant **or** insufficient RBAC. Fix: confirm the active subscription/tenant matches the Foundry project; if it does, see check 10's role-assignment fix.
 - `404` → endpoint is wrong or project is gone. Fix: `azd provision` or fix env var.
 - network/DNS/TLS → "verify VPN / firewall / typo in `AZURE_AI_PROJECT_ENDPOINT`".
 
@@ -151,7 +152,7 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10
 **Warn:** has invoke role but not deploy role → user can run agents but not `azd deploy`. Surfaces as "you can invoke but not deploy" with the exact `az role assignment create` command for the missing role.
 **Fail:** missing both → command + portal link.
 
-The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example:
+The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example (interactive TTY):
 ```
 fix:  az role assignment create \
         --role "Azure AI Developer" \
@@ -159,6 +160,8 @@ fix:  az role assignment create \
         --scope <project-resource-id>
 ```
 
+**Redaction in non-interactive output.** When stdout is piped or `--output json` is set, principal IDs, scope ARNs, and full UPNs in both `detail` and `fix` strings are replaced with `<redacted>`. The JSON envelope sets `"redacted": true` so callers know they're getting the safe variant (see [`Exit codes & JSON output`](./azd-ai-agent-nextsteps.md#exit-codes--json-output) in the companion doc). An interactive-only `--unredacted` flag exists for users who need the raw values in a JSON pipeline.
+
 **Why we don't auto-create the assignment:** doctor is read-only. RBAC changes are the kind of action that should be explicit and audited.
 
 ### Check 11 — Agent status
@@ -221,7 +224,9 @@ Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its ow
 
 ## Testing Strategy
 
-- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8).
+- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8 — including the "wrong tenant or insufficient RBAC" wording for 403).
+- **Skip-cascade test.** For each dependency edge in the matrix above, assert the dependent check returns `Status: Skip` with the expected detail string when its dependency returns `Status: Fail`. Run as a table-driven test over the matrix so adding/removing edges automatically widens coverage.
+- **Redaction test.** Snapshot the rendered RBAC fix string in interactive (full values) and non-interactive (`<redacted>`) modes, plus the JSON envelope's `redacted` field.
 - **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test.
 - **Functional test** (recorded cassette via `mage record`) covering one happy-path full run.
 - No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step.
@@ -232,9 +237,10 @@ Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its ow
 |---|---|
 | Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. |
 | RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. |
-| Foundry API drift breaks the reachability probe | Pin the api-version explicitly. Failure renders as "API not understood" with a doc link rather than a stack trace. |
-| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts, parallel fan-out for independent checks (8 + 10 can run concurrently). |
+| Foundry API drift breaks the reachability probe | Pin the api-version to `DefaultAgentAPIVersion` (`internal/cmd/agent_context.go`), not a doc literal. Failure renders as "API not understood" with a doc link rather than a stack trace. |
+| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts. After the dependency-matrix simplification above, **independent checks 8 and 10 can run concurrently** once their shared dependency (check 7) passes. Check 9 still waits on 8; check 11 waits on 8. |
 | User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. |
+| Sensitive identifiers (principal IDs, scope ARNs) leak into shared CI logs | See [Output Format](#output-format) and the redaction rule in check 10: non-interactive output replaces values with `<redacted>` and the JSON envelope flags `"redacted": true`. |
 
 ## Performance Budget
 
diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
index dff83c41c77..60fbdc78ce9 100644
--- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md
+++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
@@ -4,7 +4,7 @@
 
 **Draft — for design review.**
 Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975) — "Context-aware next-step guidance and diagnostics".
-Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — same path other extensions use today.
+Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — specifically the `Metadata["note"]` key on an `ArtifactKindEndpoint` artifact, which `pkg/project/artifact.go` already renders as a continuation line under the artifact bullet (gated on `Kind == ArtifactKindEndpoint`). The extension already emits endpoint-kind artifacts at `internal/project/service_target_agent.go`; we populate the `note` metadata key on the same artifacts.
 
 ## Goal
 
@@ -13,7 +13,14 @@ After every successful `azd ai agent` command — and on demand via a new `azd a
 Non-goals:
 - Reworking error-path guidance — `internal/exterrors` already does this well.
 - Any modification to files outside `cli/azd/extensions/azure.ai.agents/`. State is read via the existing gRPC `azdext` API.
-- Sample-author hooks, README markers, or YAML-declared hints (see [Alternatives](#alternatives-considered)).
+- Sample-author hooks, README markers, or YAML-declared hints — out of scope for this design.
+- Authoring or editing core azd packages (`cli/azd/pkg/...`, `cli/azd/cmd/...`). The design only *consumes* the existing `Metadata["note"]` rendering contract.
+
+## Assumptions
+
+This design assumes the unified-`azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) has landed: a single `azure.yaml` is the canonical config; `agent.yaml` and `agent.manifest.yaml` no longer exist as separate files. References to `agent.yaml` in the doctor checks and decision trees below are placeholders for the equivalent fields in the unified `services.<name>` stanza. If the extension ships before unification, each `agent.yaml` reference should be read as the corresponding fields under `services.<name>` in `azure.yaml`. The `AssembleState` function abstracts the source split, so resolvers are insulated from the migration — only the state-assembly code needs to change.
+
+The extension's authentication remains `azidentity.NewAzureDeveloperCLICredential` (see [`agent_context.go`](../../extensions/azure.ai.agents/internal/cmd/agent_context.go)). User-facing fix commands therefore say `azd auth login`, not `az login`.
 
 ## Background
 
@@ -36,7 +43,7 @@ These are the building blocks the design reuses:
 - **`azdext` gRPC client** — extensions read `azure.yaml` services and azd env vars without re-implementing parsing.
 - **`fetchOpenAPISpec`** in `helpers.go` — already probes `GET /invocations/docs/openapi.json`, caches to disk, fails silently. Drives the rich-payload story.
 - **`exterrors`** factories — typed errors with `Suggestion` fields. We follow the same "structured advice" shape on success paths.
-- **`azdext.Artifact` return value** from `Deploy()` — the extension SDK's normal mechanism for an extension to surface lines below its progress bullet. Already used by the existing `service_target_agent.go` to print endpoint URLs. We populate the same field with our Next: block. Indentation rules are documented in [Output Discipline](#output-discipline) and pinned by an extension-side regression test.
+- **`azdext.Artifact` return value** from `Deploy()` — the extension SDK's normal mechanism for an extension to surface lines below its progress bullet. The renderer is `pkg/project/artifact.go` (specifically the `Metadata["note"]` branch at `ToString`, ~line 128), which is gated on `Kind == ArtifactKindEndpoint`. The extension already emits endpoint-kind artifacts in `internal/project/service_target_agent.go` (~line 720) for endpoint URLs. We attach our Next: block to the same artifacts via the same `Metadata["note"]` key. No changes to the renderer or any core file. Indentation rules are documented in [Output Discipline](#output-discipline) and pinned by an extension-side regression test that asserts the formatted string against a stub artifact.
 
 ## Architecture
 
@@ -45,12 +52,15 @@ These are the building blocks the design reuses:
 ```
 cli/azd/extensions/azure.ai.agents/internal/cmd/
 ├── nextstep/                     ← new package (this design)
-│   ├── types.go                  ← Suggestion, State, ServiceState
-│   ├── state.go                  ← AssembleState(ctx, azdClient) → State
+│   ├── types.go                  ← Suggestion, State, ServiceState, AuthState
+│   ├── state.go                  ← AssembleState(ctx, azdClient, opts) → State
 │   ├── resolver.go               ← decision tree per command
 │   ├── format.go                 ← PrintNext / FormatNextForNote
 │   └── openapi.go                ← OpenAPI-derived invoke examples
-├── doctor.go                     ← new command — calls nextstep + extra checks
+├── doctor/                       ← new package (matches structure in remote-checks doc)
+│   ├── types.go                  ← Check, Result, Status
+│   ├── runner.go                 ← orchestrates ordered execution + Next: tail
+│   └── checks_local.go           ← MVP local checks (1–6)
 ├── init.go        ─┐
 ├── run.go         ─┼─ each calls nextstep.ResolveAfterX → nextstep.PrintNext
 ├── invoke.go      ─┤
@@ -88,14 +98,23 @@ One package owns *all* policy. Each command is a thin caller — it knows when i
 
 ```go
 // types.go
+type AuthState int
+
+const (
+    AuthUnknown   AuthState = iota // probe was not run; suggestions must skip auth-conditional advice
+    AuthAuthed                     // doctor confirmed a usable token
+    AuthUnauthed                   // doctor confirmed login is needed
+)
+
 type State struct {
-    HasProjectEndpoint     bool
-    HasUnresolvedInfraVars bool   // ${...} → Bicep outputs not in azd env
-    HasUnresolvedManualVars []string // ${...} → user-supplied vars not set
-    Services               []ServiceState
-    AgentStatus            string  // optional (Foundry API), empty if unknown
-    HasOpenAPI             bool    // run-time only
-    OpenAPIPayload         string  // run-time only, ""=no spec
+    HasProjectEndpoint bool
+    MissingInfraVars   []string  // ${...} → Bicep outputs not in azd env (named for actionable advice)
+    MissingManualVars  []string  // ${...} → user-supplied vars not set
+    Services           []ServiceState
+    AgentStatus        string    // optional (Foundry API), empty if unknown
+    HasOpenAPI         bool      // populated only when AssembleState is called from `run` or `doctor --remote`
+    OpenAPIPayload     string    // populated only when AssembleState is called from `run` or `doctor --remote`
+    IsAuthenticated    AuthState // populated only when AssembleState is called from `doctor --remote`
 }
 
 type ServiceState struct {
@@ -113,7 +132,9 @@ type Suggestion struct {
 }
 ```
 
-State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The cost is one gRPC round-trip per resolver invocation, which is negligible compared to anything the user perceives.
+State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The base cost is one gRPC round-trip per resolver invocation, which is negligible.
+
+The `IsAuthenticated` probe is **not** part of the base assembly: it requires a token-introspection call which is network-bound and would regress the perf claim above for every command. `AssembleState` accepts an option (`WithAuthProbe`) that defaults to false; the `doctor --remote` path is the only caller that sets it. Every other resolver receives `IsAuthenticated == AuthUnknown` and treats auth-conditional suggestions as "skip advice that needs a confirmed login state" rather than "tell user to log in" — i.e., we never produce login-prompt noise in success paths.
 
 ### Layered sources
 
@@ -136,8 +157,8 @@ The full tree lives in `resolver.go`. This section captures the policy in tabula
 
 | Condition | Primary | Secondary |
 |---|---|---|
-| `HasUnresolvedInfraVars` | `azd provision` | — |
-| `HasUnresolvedManualVars` (non-empty) | `azd env set <KEY> <value>` (one per missing var, up to 3) | — |
+| `len(MissingInfraVars) > 0` | `azd provision` | — |
+| `len(MissingManualVars) > 0` | `azd env set <KEY> <value>` (one per missing var, up to 3) | — |
 | Otherwise | `azd ai agent run` | — (the `run` resolver will print the protocol-correct `invoke --local` when the agent starts) |
 | Always (third line) | "When ready to deploy to Azure, run `azd deploy`." | — |
 
@@ -157,7 +178,7 @@ Tip:  curl http://localhost:<port>/invocations/docs/openapi.json
 
 ### `invoke --local` (success)
 
-Single agent: `azd deploy`, secondary `azd ai agent monitor --follow`.
+Single agent: `azd deploy` only. (The local invoke just succeeded — `monitor --follow` is hosted-only and not useful here; the natural next step is to ship.)
 Multi-agent: `azd deploy` only (one line per agent for follow-up `invoke <name>`).
 
 ### `invoke` (remote)
@@ -185,6 +206,7 @@ See <relpath>/README.md for a sample payload appropriate for this agent.
 Notes:
 - The `invoke` line **omits the agent name** for single-agent projects (matches sample README convention).
 - README hint is emitted **only if** `os.Stat` succeeds on `<root>/<svc.RelativePath>/{README.md,readme.md,README.MD}`. No hint for projects without a README.
+- **No live OpenAPI probe at deploy time.** The just-deployed agent may not be reachable yet (this is exactly what doctor check 11 observes). `ResolveAfterDeploy` therefore uses any cached spec produced by a prior local `run` (same disk cache as `fetchOpenAPISpec`), falling back to the README pointer, then to the protocol-generic `<payload>` literal. The deployed endpoint's payload can be verified later via `azd ai agent show` or the next-step block from a successful remote `invoke`.
 
 Multi-agent: one `show <name>` and one `invoke <name>` line per agent.
 
@@ -213,10 +235,11 @@ Ground rules:
 |---|---|
 | Interactive TTY | `stdout` |
 | Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output |
-| `--output json` / structured output flag | Suppressed entirely |
-| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs) |
+| `--output json` (success-path commands) | `Next:` block suppressed entirely |
+| `azd ai agent doctor --output json` | **Not** the suppression rule — emits the structured check-result array (see [Exit codes & JSON output](#exit-codes--json-output)). Human-readable `Next:` lines are still suppressed; redacted/non-redacted RBAC details appear there with an explicit `redacted` flag |
+| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (`Metadata["note"]` on an `ArtifactKindEndpoint` artifact — same path used today for endpoint URLs in `service_target_agent.go`) |
 
-### Multi-line note pre-indentation
+The `isTerminal` check uses the existing helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). All other extension code that needs to detect interactive vs. non-interactive output should call the same helper to keep the source of truth singular.
 
 ### Multi-line note pre-indentation
 
@@ -231,7 +254,7 @@ The `azdext.Artifact` SDK field used here renders only the **first line** at the
 | `invoke --local` / `invoke` | end of `runE` on success | direct print |
 | `show` | end of table render | direct print |
 | `doctor` | after rendering check report | direct print |
-| `deploy` | inside `Deploy()` return path | attach Next: to the returned `azdext.Artifact` |
+| `deploy` | inside `Deploy()` return path | populate `Metadata["note"]` on the existing `ArtifactKindEndpoint` artifact returned by `service_target_agent.go` (~line 720). `pkg/project/artifact.go` (`ToString`, ~line 128) renders the value as a continuation line under the artifact bullet, gated on `Kind == ArtifactKindEndpoint`. We do **not** introduce a new artifact kind. |
 
 The deploy hook returns its Next: block through the same `azdext.Artifact` SDK field that `service_target_agent.go` already uses for endpoint URLs. No new APIs, no edits outside the extension. Everywhere else, the extension owns its terminal directly.
 
@@ -258,6 +281,10 @@ Resolution order:
 
 All failures are silent. The fetch itself is best-effort and already cached — we never block `run`'s startup notice on it.
 
+**Silent-mode requirement.** Today `fetchOpenAPISpec` prints `OpenAPI spec saved to <path>` to stdout (see `helpers.go:373`). When invoked from the resolver path, that print is unwanted noise. The extraction path will either gate the print behind a `verbose` flag on the helper, or wrap the call in a `silent: true` variant. This is an extension-internal change to existing extension code — still no edits outside the extension.
+
+**Out of scope: OpenAPI `$ref` resolution.** The extractor walks the spec object tree literally. If `requestBody` or `schema` uses `$ref: "#/components/..."`, the extractor returns `""` and the resolver falls back to the protocol-generic payload and Tip line. Cross-document `$ref` resolution adds a JSON-pointer dependency we don't want in the success-path. Authors who hit this can either inline the example or rely on the README-pointer line.
+
 ## `doctor` Command
 
 `doctor` is the persistence layer for state-aware guidance: when the user has lost terminal context or hit a confusing error, they run one command and get the full picture plus the specific fix for each broken thing.
@@ -266,13 +293,13 @@ All failures are silent. The fetch itself is best-effort and already cached —
 
 | # | Check | Pass | Fail → Fix |
 |---|---|---|---|
-| 1 | `azd` reachable + extension running | "extension running, gRPC channel established" | install / start daemon |
+| 1 | `azd` reachable + extension gRPC channel healthy | "extension running, gRPC channel established" | install / start daemon |
 | 2 | `azure.yaml` valid | "1 service: <names>" | `azd ai agent init` |
 | 3 | azd environment selected | "<env-name>" | `azd env new` / `azd env select` |
 | 4 | Agent service in `azure.yaml` | service count + names | `azd ai agent init` |
 | 5 | `AZURE_AI_PROJECT_ENDPOINT` set | URL value | `azd provision` |
-| 6 | `agent.yaml` per service is valid | path | fix YAML |
-| 7 (post-MVP) | Authentication (`az account show`) | signed-in user | `azd auth login` |
+| 6 | `agent.yaml` per service is valid (placeholder for unified-schema service stanza) | path | fix YAML |
+| 7 (post-MVP) | Authentication via `azidentity.NewAzureDeveloperCLICredential` (the credential `agent_context.go` already uses) | signed-in user + token validity | `azd auth login` |
 | 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall |
 | 9 (post-MVP) | Model deployments exist | name + version | `azd provision` |
 | 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` |
@@ -295,19 +322,98 @@ Next:  <resolved next-step block, same format as elsewhere>
 
 When all checks pass, the trailing Next: block is the resolver's `ResolveAfterInit` (if not deployed) or a deploy-aware variant (if deployed) — exactly what the user would have seen at the end of the last successful command.
 
+### Exit codes & JSON output
+
+| Outcome | Exit code |
+|---|---|
+| All checks pass (or pass+skip with no fail) — and at least one check ran | 0 |
+| Any check returns `Status: Fail` | 1 |
+| All checks ran were skipped (e.g., `--remote` against a project missing prerequisites) | 2 |
+
+Precedence: any `Fail` always wins (exit 1). Skip-only without any pass is the dependency-cascade case (exit 2). All-pass or pass+skip with at least one pass is exit 0.
+
+`azd ai agent doctor --output json` emits a structured array, one entry per check, in execution order:
+
+```json
+{
+  "schemaVersion": "1.0",
+  "remote": false,
+  "redacted": true,
+  "checks": [
+    {
+      "id": "local.azure-yaml",
+      "title": "Project loaded from azure.yaml",
+      "status": "pass",
+      "detail": "1 service: echo-agent",
+      "fix": "",
+      "links": [],
+      "durationMs": 4
+    },
+    {
+      "id": "remote.rbac",
+      "title": "RBAC",
+      "status": "fail",
+      "detail": "principal <redacted> has no role on project scope",
+      "fix": "az role assignment create --role \"Azure AI Developer\" --assignee <redacted> --scope <redacted>",
+      "links": ["https://learn.microsoft.com/azure/ai-studio/concepts/rbac"],
+      "durationMs": 412
+    }
+  ]
+}
+```
+
+The `redacted` field is `true` when `--output json` is combined with non-interactive output (the default for piped stdout). Setting `--unredacted` (interactive only) populates principal IDs, scope ARNs, and full UPNs verbatim. The human-readable `Next:` block is **not** included in the JSON envelope — that is the success-path output discipline; doctor's structured emit is its own contract.
+
 ## Testing Strategy
 
 | Layer | What | How |
 |---|---|---|
 | `nextstep/format` | indentation, blank-line rules, secondary suppression, empty-suggestions early return | table-driven unit tests |
+| `nextstep/format` | TTY vs piped routing (stdout vs stderr); `--output json` suppression of the human `Next:` block | snapshot tests with stub `isTerminal` and a fake `OutputFormat` flag |
 | `nextstep/resolver` | every branch of every per-command tree | table-driven, fake `State` |
-| `nextstep/openapi` | each extraction path + corruption / empty fallback | golden specs in `testdata/` |
-| `nextstep/state` | layered source priority, partial state | mocked `azdClient` (testify/mock) |
-| `service_target_agent` | with/without README on disk → with/without hint line | `t.TempDir` + `t.Chdir` |
-| `doctor` | each check pass/fail rendering | mocked `azdClient` |
+| `nextstep/resolver` | non-interactive auth handling — when `IsAuthenticated == AuthUnknown`, no auth-conditional advice (no "azd auth login" appears in `Next:`) | dedicated table cases |
+| `nextstep/openapi` | each extraction path + corruption / empty fallback / unresolved `$ref` → `""` | golden specs in `testdata/` |
+| `nextstep/state` | layered source priority, partial state, `WithAuthProbe` opt-in | mocked `azdClient` (testify/mock) |
+| `service_target_agent` | with/without README on disk → with/without hint line; cached spec present vs absent → payload vs `<payload>` literal | `t.TempDir` + `t.Chdir` |
+| `doctor/runner` | each check pass/fail rendering | mocked `azdClient` |
+| `doctor/runner` | **skip-cascade**: for each dependency edge, assert the dependent check returns `Status: Skip` with the expected detail when its dependency returns `Status: Fail` | table-driven over the dependency matrix |
+| `doctor/runner` | exit-code precedence (fail wins; all-skip = 2; otherwise 0) and JSON envelope shape (schema version, `redacted` flag, per-check fields) | table-driven + snapshot |
 
 All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure calls.
 
+## Backward Compatibility
+
+The Next: block is purely additive; no existing exit code, error type, or stdout format changes. Specifically:
+
+- Success-path JSON output (`--output json` on `init`/`run`/`invoke`/`show`) is unchanged — the human `Next:` block is suppressed in that mode (see [Output discipline](#output-discipline)).
+- The deploy hook adds a `Metadata["note"]` value to artifacts that already exist; the artifact `Kind` stays `ArtifactKindEndpoint`. Tools parsing artifacts by kind see no change.
+- `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--remote` and `--output json` flags are opt-in.
+- `nextstep` and `doctor` are new internal packages under the extension. No public Go API surface is added.
+- Existing extension-side env vars and behaviors (e.g., `AZD_AGENT_NO_NEXTSTEPS`, if shipped — see Open Questions) are documented in `cli/azd/docs/environment-variables.md` per repo convention.
+
+If any future change to the `Metadata["note"]` rendering behavior in `pkg/project/artifact.go` ships, the regression test pinned in this design's [Output Discipline](#output-discipline) section will break loudly. The fix is extension-side reformatting; no core edits.
+
+## Security & PII / Telemetry
+
+**Output redaction.** Doctor's RBAC check (10) and any future check that surfaces principal IDs or resource ARNs follows two rules:
+
+- **Interactive TTY:** full values are shown — they are necessary for the user to construct the fix command (`az role assignment create ...`).
+- **Non-interactive (piped, CI, `--output json`):** principal IDs, scope ARNs, and full UPNs are replaced with `<redacted>` in `detail` and `fix` strings. The structured JSON envelope sets `"redacted": true` so callers can see they're getting the safe variant. An interactive-only `--unredacted` flag exists for the rare case a user needs the raw values in a JSON pipeline.
+
+**Source of truth for `isTerminal`.** All redaction and stdout/stderr routing decisions are gated by the existing `isTerminal` helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). New code must call the same helper, never re-implement detection — this keeps the test surface singular and the redaction contract auditable.
+
+**Logging.** The `nextstep` package logs only via the standard `log` package, which the extension already silences in production via `setupDebugLogging` (see `internal/cmd/debug.go`). Sensitive values (principal IDs, tokens, full ARNs) are logged at `debug` level only; the resolver's user-facing output never logs.
+
+**Telemetry (deferred).** A full instrumentation plan is out of scope for this design. Initial implementation will emit a single counter event per resolver invocation (`nextstep.resolved` with `command` + `branch_id` attributes) and per-doctor-check (`doctor.check.completed` with `id` + `status` + `duration_ms`). Neither event includes user input, principal IDs, or environment values. A follow-up design will cover the full schema once we have implementation experience.
+
+## Deferred follow-ups
+
+The reviewer surfaced three additional concerns that are out of scope for this design pass and will be handled separately:
+
+- **User personas / target audiences.** Not standard in design docs in this repo (see `cli/azd/docs/design/`); we treat "developers building with the agents extension" as the implicit audience.
+- **Success metrics / KPIs.** Same rationale; metrics will be wired up alongside telemetry in the follow-up design once we have something concrete to measure.
+- **Full telemetry / observability plan.** See the brief sketch above; the complete attribute schema, sampling policy, and dashboards are deferred.
+
 ## Risks & Mitigations
 
 | Risk | Mitigation |
@@ -317,7 +423,7 @@ All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure
 | Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on piped stderr; user can ignore. |
 | Decision tree drifts from sample reality | All trees are table-driven and unit-tested per branch. Sample contract changes go through the same review. |
 | Artifact-note rendering behavior could shift in a future SDK update | Pin the indentation contract with an extension-side regression test that constructs an artifact and asserts the formatted string. If the SDK changes shape, the test breaks loudly and we adapt formatting — still no edits required outside the extension. |
-| Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` cap: 5 agents inline; beyond that, suggest `azd ai agent show` (no args, lists all). |
+| Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` produces one `show <name>` and one `invoke <name>` line per agent. Multi-agent walls remain readable for the project sizes we expect (≤ ~10). If the wall becomes a real problem in practice, fall back to a single `azd ai agent show` (no args, lists all) line; this can be tuned in a follow-up without changing the resolver contract. |
 
 ## Implementation Phases
 

From 575a450792e57afd7939be74b4658976a2d4ecaa Mon Sep 17 00:00:00 2001
From: Antriksh Jain <antrikshjain@microsoft.com>
Date: Wed, 6 May 2026 19:05:56 +0530
Subject: [PATCH 4/5] docs(azure.ai.agents): address therealjohn doctor doc
 review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Flip flag model from `--remote` opt-in to `--local-only` opt-out. Default
  `azd ai agent doctor` now runs all checks (1–12); `--local-only` is the CI
  / offline opt-out. Update Goal, Execution Model, Output Format example,
  Performance Budget, Implementation Phases, and the Appendix accordingly.
- Add Terminology section defining MVP explicitly as "local checks 1–6 ship
  first" — chronological, not a permanent capability cap.
- Soften the local-only framing in Background: clarify that remote checks
  intentionally make network calls under user credentials, which is the
  whole point of this doc.
- Add Check 12: Agent identity role assignments. Lists role assignments on
  the agent's managed identity at project / account / resource-group scopes,
  with a new INFO glyph for the informational case. Distinct from check 10
  (user RBAC) — different principal, different fix command, different
  preconditions. Rename check 10 to "User RBAC" for symmetry.
- Add a tracking note on Check 9 referencing #7962 (`agent.yaml` →
  `azure.yaml` unification, syncing with @trangevi).
- Rewrite the first Risks row: replace the "users in CI miss `--remote`"
  scenario (no longer applicable) with a precise non-interactive auth
  failure path that surfaces a clear "re-run with --local-only" message.
- Update companion `azd-ai-agent-nextsteps.md` for cross-reference
  consistency: doctor table extended to checks 7–12; flag references
  flipped (`--local-only` is the new opt-in flag, full sweep is default);
  comments on `WithAuthProbe` updated to match.

Resolves comments #1–7 from @therealjohn on PR #8057. Comment #8 (scope:
move to `azure.ai.projects`?) is deferred for @rajeshkamal5050.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../azd-ai-agent-doctor-remote-checks.md      | 122 ++++++++++++------
 cli/azd/docs/design/azd-ai-agent-nextsteps.md |  17 +--
 2 files changed, 94 insertions(+), 45 deletions(-)

diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
index bee787fc2bf..68ad7df228a 100644
--- a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
+++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
@@ -1,4 +1,4 @@
-# Design: `azd ai agent doctor` — Remote Checks (7–11)
+# Design: `azd ai agent doctor` — Remote Checks (7–12)
 
 ## Status
 
@@ -6,11 +6,19 @@
 Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975).
 Companion to [`azd-ai-agent-nextsteps.md`](./azd-ai-agent-nextsteps.md), which covers the doctor command's plumbing and local checks (1–6). **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.**
 
+## Terminology
+
+- **MVP doctor** — the first shippable slice of the command: local checks (1–6) only. Defined in the [companion doc](./azd-ai-agent-nextsteps.md). MVP is a chronological term ("what lands first"), not a permanent capability boundary.
+- **Local checks** — read-only inspection of `azure.yaml`, `agent.yaml`, env vars, etc. No network, no token.
+- **Remote checks** — anything that issues a network call under user credentials. Specified in this doc.
+
 ## Goal
 
 Extend `azd ai agent doctor` past its local-only MVP so it can answer the question every developer eventually asks: *"the project is wired up correctly, so why doesn't this work end-to-end?"*
 
-Concretely: add five checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, RBAC, agent runtime status — each with a precise "what to do next" fix.
+The end-state default is to run **all** checks (local then remote) on every `doctor` invocation. CI and air-gapped users opt out of the remote ones via `--local-only` (see [Execution Model](#execution-model)).
+
+Concretely: add six checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, user RBAC, agent runtime status, agent-identity role assignments — each with a precise "what to do next" fix.
 
 Non-goals:
 - Fixing remote problems automatically (`doctor --fix`). Surfacing the right command is enough for v1.
@@ -19,9 +27,9 @@ Non-goals:
 
 ## Background
 
-The MVP doctor checks (1–6) are pure local reads — `azure.yaml` exists, env vars are set, YAML parses. They run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions.
+Local checks (1–6) — `azure.yaml` exists, env vars are set, YAML parses — run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions and are what MVP doctor ships first.
 
-The remaining ~30% — and arguably the most frustrating cases — sit in remote land:
+The remaining ~30% — and arguably the most frustrating cases — *intentionally* require network calls under the user's credentials:
 - Token expired silently → every command 401s with no obvious "you're logged out" hint.
 - Foundry endpoint typo'd in env → DNS/TLS error buried in a stack trace.
 - Model deployment got deleted out from under the project → first invoke fails with a cryptic 404.
@@ -37,8 +45,9 @@ These all require live calls and credentials, which is why they're carved out in
 | 7 — Authentication | Am I logged in? Is the token still valid? |
 | 8 — Foundry reachability | Can I reach `AZURE_AI_PROJECT_ENDPOINT`? |
 | 9 — Model deployments | Do the model(s) my agent references exist on the project? |
-| 10 — RBAC | Does my principal have the roles needed to deploy / invoke? |
+| 10 — User RBAC | Does my principal have the roles needed to deploy / invoke? |
 | 11 — Agent status | Does the deployed agent actually exist and is it active? |
+| 12 — Agent identity roles | What roles does the agent's managed identity have, and on which scopes? |
 
 ## Architecture
 
@@ -79,20 +88,20 @@ A remote check that can't run (e.g., not authenticated when running check 9) ret
 
 ## Execution Model
 
-### Default: local-only
+### Default: all checks (1–12)
 
-`azd ai agent doctor` runs checks 1–6 only. Same as MVP. No network, no token, no surprise prompts. This is the daily-driver mode.
+`azd ai agent doctor` runs the full check set in order. Local checks (1–6) run first; remote checks (7–12) run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)). The default is "tell me everything that's wrong" — a half-answer that skips the most frustrating cases isn't worth shipping.
 
-### Opt-in: `--remote`
+### Opt-out: `--local-only`
 
-`azd ai agent doctor --remote` runs **1–11** in order. Local checks always run first; remote checks run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)).
+`azd ai agent doctor --local-only` skips checks 7–12. Use this when:
+- Running in CI / non-interactive shells where an interactive auth prompt would hang.
+- Working offline / behind a restrictive firewall.
+- Investigating a config-only issue and you don't want network latency.
 
-Rationale for opt-in (vs. always-on or auto-detect):
-- Remote checks need credentials and add seconds to a command users may run frequently.
-- Hitting Foundry on every `doctor` invocation is bad citizenship and surprising in CI.
-- A flag is discoverable, reversible, and self-documenting.
+Non-interactive detection (`!isTerminal`) is **not** automatic — explicit is better than magic. If a user forgets `--local-only` in CI, the auth check fails fast with a clear message and the rest of the remote checks Skip cleanly (see [Risks](#risks--mitigations)).
 
-We considered `--quick` / `--full` naming but `--remote` describes the actual side effect — network calls under your credentials — which is what reviewers and CI authors care about.
+We considered `--remote` opt-in but rejected it: doctor's value is highest when it answers the live-Azure questions by default, and every "why doesn't this work" question we've seen in practice falls in that bucket. CI is the niche, not the default.
 
 ### Dependency matrix
 
@@ -103,6 +112,7 @@ We considered `--quick` / `--full` naming but `--remote` describes the actual si
 | 9 — Models | 6 (service stanza valid), 8 (reachability) | Skip |
 | 10 — RBAC | 7 (auth) | Skip — RBAC reads ARM, not Foundry data plane, so it does **not** depend on check 8 |
 | 11 — Agent status | 7 (auth), 8 (reachability) | Skip — agents-list is a Reader-level Foundry call and does not require check 10's deploy/invoke roles |
+| 12 — Agent identity roles | 11 (agent status pass — we need the agent's MI principal ID) | Skip — without an active agent there is no MI to introspect |
 
 Skips are explicit in the rendered report. We never quietly drop a check.
 
@@ -135,6 +145,8 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10
 
 **What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally.
 
+> **Tracking note ([#7962](https://github.com/Azure/azure-dev/issues/7962)).** Once the `agent.yaml` → `azure.yaml` unification lands, this check (and any other reference to `agent.yaml` in this doc) reads model refs from `azure.yaml` instead. Syncing with @trangevi on timing.
+
 **Pass:** "all <N> referenced models present"
 **Fail:** "missing: <model-name> (referenced by <service>)" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link).
 
@@ -142,7 +154,7 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10
 - Multi-version models (e.g., `gpt-4o:2024-08-06`) — we match on deployment name, not version. Version mismatch surfaces at runtime; not in scope here.
 - Agents with no model reference (orchestration-only) — pass with "no model references".
 
-### Check 10 — RBAC
+### Check 10 — User RBAC
 
 **What it does:** for the current principal, lists role assignments on the Foundry project scope and checks for the minimum role set:
 - `Azure AI Developer` (or stronger) on the project — required for deploy + invoke.
@@ -176,12 +188,44 @@ fix:  az role assignment create \
 
 This is the check that closes the loop on "I deployed and it succeeded — why doesn't `invoke` work?" Often the answer is the agent is still rolling out; check 11 says so explicitly.
 
+### Check 12 — Agent identity role assignments
+
+**What it does:** for each active agent found in check 11, resolves the agent's managed-identity principal ID, then lists role assignments on the **agent identity** at three scope levels:
+
+1. The Foundry project (scope of the agent itself).
+2. The parent AI account.
+3. The resource group containing the project (proxy for "any resource the agent might touch").
+
+We don't try to predict *which specific resource* the agent will need at invocation time — that's data-plane and varies per agent code. Instead we surface the visible role-assignment landscape so the user can answer "does my agent identity have any roles where I'd expect it to?"
+
+**Pass:** agent MI has at least one assignment at the project scope, plus a non-empty list at account or resource-group scope.
+**Warn:** agent MI has assignments only at the agent's own scope — it almost certainly cannot read project / account resources at runtime. Render the list with a "this looks under-privileged for typical agents" hint.
+**Fail:** agent MI has zero role assignments anywhere reachable. This is the smoking-gun for "my agent runs but every tool call 403s."
+
+**Output shape:**
+```
+ⓘ INFO  agent identity roles  (agent: research-bot)
+        principal: <mi-principal-id>
+        project scope:
+          - Cognitive Services User
+        account scope:
+          - (none)
+        resource-group scope:
+          - Storage Blob Data Reader
+```
+
+We render this as `INFO` (a fourth glyph alongside pass/warn/fail/skip) when there's nothing actionable to flag, since the value is mostly informational — the user inspects the list to confirm it matches their mental model. Warn/Fail variants surface only the actionable cases above.
+
+**Why a separate check from 10:** check 10 is about the *user's* permissions to deploy/invoke; check 12 is about the *agent's* permissions at runtime. They fail differently, fix differently (`az role assignment create --assignee <mi-id>` vs `<user-id>`), and the agent MI doesn't exist until check 11 has confirmed an active agent — so they can't be merged without losing precision.
+
+**Redaction.** Same rules as check 10: principal IDs and scope ARNs are replaced with `<redacted>` in non-interactive output, and the JSON envelope sets `"redacted": true`.
+
 ## Output Format
 
 Same renderer as MVP. Skipped checks render with a distinct glyph so they're visible:
 
 ```
-azd ai agent doctor --remote
+azd ai agent doctor
   ✓ PASS  azd reachable
   ✓ PASS  azure.yaml valid
   ✓ PASS  azd environment selected
@@ -195,32 +239,35 @@ azd ai agent doctor --remote
           fix:  see RBAC check below
   -  SKIP  model deployments
           skipped: reachability check failed
-  ✗ FAIL  RBAC
+  ✗ FAIL  user RBAC
           principal alice@contoso.com has no role on project scope
           fix:  az role assignment create --role "Azure AI Developer" \
                   --assignee 0000... --scope /subscriptions/.../projects/...
   -  SKIP  agent status
           skipped: reachability check failed
+  -  SKIP  agent identity roles
+          skipped: agent status check did not pass
 
 Next:  az role assignment create ...     -- grant yourself the required role
-       azd ai agent doctor --remote      -- re-run to verify
+       azd ai agent doctor               -- re-run to verify
 ```
 
 The trailing `Next:` block is resolved from the **first** failed remote check that has a single user-actionable fix — same logic as MVP, just extended to know about remote-check fix commands.
 
 ## Performance & Side Effects
 
-Per [Performance Budget](#performance-budget) below — defaults remain local-only so users can keep running `doctor` reflexively. `--remote` adds a couple of seconds and explicit credential use.
+Per [Performance Budget](#performance-budget) below — `--local-only` keeps doctor under 100ms for users who explicitly opt out. Default mode adds a few seconds and explicit credential use.
 
 | Check | Network calls | Worst-case time |
 |---|---|---|
 | 7 — Auth | 0 (cached token) – 1 (refresh) | 2s |
 | 8 — Reach | 1 GET | 10s (timeout) |
 | 9 — Models | 1 list | 5s |
-| 10 — RBAC | 1 list-role-assignments | 5s |
+| 10 — User RBAC | 1 list-role-assignments | 5s |
 | 11 — Status | 1 list-agents | 5s |
+| 12 — Agent MI roles | 3 list-role-assignments (project / account / RG) | 6s |
 
-Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode.
+Worst-case full-sweep walltime: ~35s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode.
 
 ## Testing Strategy
 
@@ -229,38 +276,39 @@ Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its ow
 - **Redaction test.** Snapshot the rendered RBAC fix string in interactive (full values) and non-interactive (`<redacted>`) modes, plus the JSON envelope's `redacted` field.
 - **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test.
 - **Functional test** (recorded cassette via `mage record`) covering one happy-path full run.
-- No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step.
+- No live tests in CI for the remote-check path — replay-only. Cassette refresh is a manual maintainer step.
 
 ## Risks & Mitigations
 
 | Risk | Mitigation |
 |---|---|
-| Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. |
-| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. |
+| User in CI forgets `--local-only`; remote checks 401 or hang on auth refresh | Detect non-interactive (`!isTerminal`) inside check 7; if a token refresh would require interaction, fail fast with `Status: Fail, detail: "non-interactive shell — re-run with --local-only"`. Checks 8–12 then Skip via the dependency matrix. No hang, no silent success. |
+| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s in checks 10 or 12, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. |
 | Foundry API drift breaks the reachability probe | Pin the api-version to `DefaultAgentAPIVersion` (`internal/cmd/agent_context.go`), not a doc literal. Failure renders as "API not understood" with a doc link rather than a stack trace. |
-| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts. After the dependency-matrix simplification above, **independent checks 8 and 10 can run concurrently** once their shared dependency (check 7) passes. Check 9 still waits on 8; check 11 waits on 8. |
-| User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. |
-| Sensitive identifiers (principal IDs, scope ARNs) leak into shared CI logs | See [Output Format](#output-format) and the redaction rule in check 10: non-interactive output replaces values with `<redacted>` and the JSON envelope flags `"redacted": true`. |
+| Slow tenant or VPN turns full sweep into a long wait | Strict per-check timeouts. After the dependency-matrix simplification above, **independent checks 8 and 10 can run concurrently** once their shared dependency (check 7) passes. Check 9 still waits on 8; check 11 waits on 8; check 12 waits on 11. |
+| User runs `doctor` against a fresh project before `azd provision` | Checks 8–12 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. |
+| Sensitive identifiers (principal IDs, scope ARNs) leak into shared CI logs | See [Output Format](#output-format) and the redaction rule in checks 10 and 12: non-interactive output replaces values with `<redacted>` and the JSON envelope flags `"redacted": true`. |
 
 ## Performance Budget
 
-Local-only `doctor` (today): <100ms. Don't regress.
-`--remote`: ≤ 30s walltime in the worst case (sick endpoint). Typical: <5s.
-We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot, opt-in diagnostic.
+`--local-only` doctor: <100ms. Don't regress.
+Default (full sweep): ≤ 35s walltime in the worst case (sick endpoint). Typical: <6s.
+We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot diagnostic.
 
 ## Implementation Phases
 
-1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--remote` flag, no behavior change at this step.
+1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--local-only` flag (default false), no behavior change at this step beyond exposing the flag.
 2. **Auth + reachability (7, 8)** — the two checks every other remote check depends on. Useful even on their own.
 3. **Models + agent status (9, 11)** — leverage existing `agentClient` calls. Closes the deploy-loop questions.
-4. **RBAC (10)** — last because the role-list API is the trickiest to stub and the most likely to need iteration on the fix-command wording.
+4. **User RBAC (10)** — role-list API for the current principal, with the fix-command wording.
+5. **Agent identity roles (12)** — last because it depends on a passing check 11 and adds a new INFO glyph to the renderer.
 
-Each phase is independently shippable and behind the `--remote` flag, so partial rollout is safe. We can ship phases 1–2 and gather feedback before completing 3–4.
+Each phase is independently shippable. While 7–8 are landing, the renderer can already display "remote checks not yet implemented" for 9–12 without breaking existing users (who will be on `--local-only` behavior until they explicitly opt in via the flag-default flip in phase 1).
 
-## Appendix: Why not fold these into MVP?
+## Appendix: Why is this a separate doc from MVP?
 
 Three reasons, in order of weight:
 
-1. **Auth + network coupling.** MVP doctor runs in any project state, including before `azd auth login`. Remote checks fundamentally cannot. The flag separation makes the contract explicit.
-2. **Review surface.** The MVP design (the companion doc) is already large. Adding five live-Azure checks with their own error mapping, timeout, and skip semantics doubles it.
-3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks behind a flag so we can iterate the wording, the role-set, and the api-version pinning without churning the always-on path.
+1. **Auth + network coupling.** Local checks (1–6) run in any project state, including before `azd auth login`. Remote checks fundamentally cannot. Splitting the design makes the contract explicit and lets us iterate on the remote surface without churning the local one.
+2. **Review surface.** The MVP design (the companion doc) is already large. Adding six live-Azure checks with their own error mapping, timeout, redaction, and skip semantics doubles it.
+3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks in a phased rollout (see [Implementation Phases](#implementation-phases)) so we can iterate the wording, the role-set, and the api-version pinning without churning everything at once.
diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
index 60fbdc78ce9..8936b985b2f 100644
--- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md
+++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
@@ -112,9 +112,9 @@ type State struct {
     MissingManualVars  []string  // ${...} → user-supplied vars not set
     Services           []ServiceState
     AgentStatus        string    // optional (Foundry API), empty if unknown
-    HasOpenAPI         bool      // populated only when AssembleState is called from `run` or `doctor --remote`
-    OpenAPIPayload     string    // populated only when AssembleState is called from `run` or `doctor --remote`
-    IsAuthenticated    AuthState // populated only when AssembleState is called from `doctor --remote`
+    HasOpenAPI         bool      // populated only when AssembleState is called from `run` or `doctor` (any mode that contacts the agent)
+    OpenAPIPayload     string    // populated only when AssembleState is called from `run` or `doctor` (any mode that contacts the agent)
+    IsAuthenticated    AuthState // populated only when AssembleState is called from `doctor` (full sweep, i.e. not `--local-only`)
 }
 
 type ServiceState struct {
@@ -134,7 +134,7 @@ type Suggestion struct {
 
 State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The base cost is one gRPC round-trip per resolver invocation, which is negligible.
 
-The `IsAuthenticated` probe is **not** part of the base assembly: it requires a token-introspection call which is network-bound and would regress the perf claim above for every command. `AssembleState` accepts an option (`WithAuthProbe`) that defaults to false; the `doctor --remote` path is the only caller that sets it. Every other resolver receives `IsAuthenticated == AuthUnknown` and treats auth-conditional suggestions as "skip advice that needs a confirmed login state" rather than "tell user to log in" — i.e., we never produce login-prompt noise in success paths.
+The `IsAuthenticated` probe is **not** part of the base assembly: it requires a token-introspection call which is network-bound and would regress the perf claim above for every command. `AssembleState` accepts an option (`WithAuthProbe`) that defaults to false; the full-sweep `doctor` path is the only caller that sets it (`--local-only` runs leave it false). Every other resolver receives `IsAuthenticated == AuthUnknown` and treats auth-conditional suggestions as "skip advice that needs a confirmed login state" rather than "tell user to log in" — i.e., we never produce login-prompt noise in success paths.
 
 ### Layered sources
 
@@ -302,10 +302,11 @@ All failures are silent. The fetch itself is best-effort and already cached —
 | 7 (post-MVP) | Authentication via `azidentity.NewAzureDeveloperCLICredential` (the credential `agent_context.go` already uses) | signed-in user + token validity | `azd auth login` |
 | 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall |
 | 9 (post-MVP) | Model deployments exist | name + version | `azd provision` |
-| 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` |
+| 10 (post-MVP) | User RBAC sufficient | role list | `az role assignment create …` |
 | 11 (post-MVP) | Agent status (if deployed) | `active (vN)` | `azd ai agent monitor --follow` |
+| 12 (post-MVP) | Agent identity role assignments | roles on project / account / RG | `az role assignment create --assignee <mi-id> …` |
 
-Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass — see [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md).
+Checks 7–12 are listed in the issue (and surfaced in review feedback) but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–12 require Foundry control-plane calls and RBAC introspection that need their own design pass — see [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md).
 
 ### Doctor output shape
 
@@ -328,7 +329,7 @@ When all checks pass, the trailing Next: block is the resolver's `ResolveAfterIn
 |---|---|
 | All checks pass (or pass+skip with no fail) — and at least one check ran | 0 |
 | Any check returns `Status: Fail` | 1 |
-| All checks ran were skipped (e.g., `--remote` against a project missing prerequisites) | 2 |
+| All checks ran were skipped (e.g., full sweep against a project missing prerequisites, or `--local-only` with all local prereqs short-circuited) | 2 |
 
 Precedence: any `Fail` always wins (exit 1). Skip-only without any pass is the dependency-cascade case (exit 2). All-pass or pass+skip with at least one pass is exit 0.
 
@@ -387,7 +388,7 @@ The Next: block is purely additive; no existing exit code, error type, or stdout
 
 - Success-path JSON output (`--output json` on `init`/`run`/`invoke`/`show`) is unchanged — the human `Next:` block is suppressed in that mode (see [Output discipline](#output-discipline)).
 - The deploy hook adds a `Metadata["note"]` value to artifacts that already exist; the artifact `Kind` stays `ArtifactKindEndpoint`. Tools parsing artifacts by kind see no change.
-- `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--remote` and `--output json` flags are opt-in.
+- `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--local-only` and `--output json` flags are opt-in.
 - `nextstep` and `doctor` are new internal packages under the extension. No public Go API surface is added.
 - Existing extension-side env vars and behaviors (e.g., `AZD_AGENT_NO_NEXTSTEPS`, if shipped — see Open Questions) are documented in `cli/azd/docs/environment-variables.md` per repo convention.
 

From 9595e0699e604fde3aca4d146410fb3e23261d17 Mon Sep 17 00:00:00 2001
From: Antriksh Jain <antrikshjain@microsoft.com>
Date: Mon, 11 May 2026 11:43:13 +0530
Subject: [PATCH 5/5] docs(azure.ai.agents): address trangevi review and
 tighten spec voice

- Configuration sources: drop unified-azure.yaml assumption; commit to two-file split (azure.yaml + agent.yaml) and AssembleState abstraction.

- State Layering: row 3 now correctly attributes protocol/config.env refs to agent.yaml.

- Stale state definition: spell out Missing / Empty / Drift rules.

- Output discipline: piped/CI is suppressed (not stderr); add structured nextStep JSON field.

- Hook Points: end-of-runE is the rule; run/deploy are the documented exceptions.

- No opt-out env var: Next: block is always on when output discipline allows.

- Sweep stale stderr/placeholder/checks-7-11/reviewer-voice references for spec consistency.

- Add inline cspell:ignore directives so cspell-lint passes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../azd-ai-agent-doctor-remote-checks.md      |  6 +-
 cli/azd/docs/design/azd-ai-agent-nextsteps.md | 59 ++++++++++++-------
 2 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
index 68ad7df228a..50a7419bdbd 100644
--- a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
+++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md
@@ -1,3 +1,5 @@
+<!-- cspell:ignore nextsteps unredacted walltime -->
+
 # Design: `azd ai agent doctor` — Remote Checks (7–12)
 
 ## Status
@@ -57,7 +59,7 @@ These all require live calls and credentials, which is why they're carved out in
 cli/azd/extensions/azure.ai.agents/internal/cmd/
 └── doctor/
     ├── checks_local.go       ← (MVP) checks 1–6
-    ├── checks_remote.go      ← NEW: checks 7–11
+    ├── checks_remote.go      ← NEW: checks 7–12
     ├── runner.go             ← orchestrates ordered execution
     └── types.go              ← Check, Result, Severity
 ```
@@ -145,7 +147,7 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10
 
 **What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally.
 
-> **Tracking note ([#7962](https://github.com/Azure/azure-dev/issues/7962)).** Once the `agent.yaml` → `azure.yaml` unification lands, this check (and any other reference to `agent.yaml` in this doc) reads model refs from `azure.yaml` instead. Syncing with @trangevi on timing.
+> **Tracking note ([#7962](https://github.com/Azure/azure-dev/issues/7962)).** Once the `agent.yaml` → `azure.yaml` unification lands, this check (and any other reference to `agent.yaml` in this doc) reads model refs from `azure.yaml` instead.
 
 **Pass:** "all <N> referenced models present"
 **Fail:** "missing: <model-name> (referenced by <service>)" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link).
diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
index 8936b985b2f..ead99018c15 100644
--- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md
+++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md
@@ -1,3 +1,5 @@
+<!-- cspell:ignore exterrors nextstep relpath unredacted Chdir NEXTSTEPS prereqs -->
+
 # Design: Context-Aware Next-Step Guidance for `azd ai agent`
 
 ## Status
@@ -16,9 +18,14 @@ Non-goals:
 - Sample-author hooks, README markers, or YAML-declared hints — out of scope for this design.
 - Authoring or editing core azd packages (`cli/azd/pkg/...`, `cli/azd/cmd/...`). The design only *consumes* the existing `Metadata["note"]` rendering contract.
 
-## Assumptions
+## Configuration sources
+
+This design ships against today's two-file split; the unified `azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) is not a prerequisite:
 
-This design assumes the unified-`azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) has landed: a single `azure.yaml` is the canonical config; `agent.yaml` and `agent.manifest.yaml` no longer exist as separate files. References to `agent.yaml` in the doctor checks and decision trees below are placeholders for the equivalent fields in the unified `services.<name>` stanza. If the extension ships before unification, each `agent.yaml` reference should be read as the corresponding fields under `services.<name>` in `azure.yaml`. The `AssembleState` function abstracts the source split, so resolvers are insulated from the migration — only the state-assembly code needs to change.
+- `azure.yaml` — service registration (`services.<name>.host: azure.ai.agent`), language, source path.
+- `agent.yaml` — agent definition: protocol, models, tools, `config.env` references.
+
+`AssembleState` reads from both: structural fields (host, language, source) come from `azure.yaml`; agent-runtime fields (protocol, env refs) come from `agent.yaml`. When unification eventually lands, only `AssembleState` changes — every resolver above it sees the same `State` shape. References to `agent.yaml` in the decision trees and doctor checks below are literal, not placeholders.
 
 The extension's authentication remains `azidentity.NewAzureDeveloperCLICredential` (see [`agent_context.go`](../../extensions/azure.ai.agents/internal/cmd/agent_context.go)). User-facing fix commands therefore say `azd auth login`, not `az login`.
 
@@ -92,7 +99,7 @@ One package owns *all* policy. Each command is a thin caller — it knows when i
 └────────────────────────────────────────────────────────────────────┘
 ```
 
-`deploy` is the special case: an extension's `Deploy()` returns `[]*azdext.Artifact` and that is the extension's only output channel after the call returns. We populate the artifact's note field with the formatted Next: block — exactly the same mechanism `service_target_agent.go` already uses today to print endpoint URLs. Everywhere else, the extension writes directly to stdout (or stderr when piped — see [Output Discipline](#output-discipline)).
+`deploy` is the special case: an extension's `Deploy()` returns `[]*azdext.Artifact` and that is the extension's only output channel after the call returns. We populate the artifact's note field with the formatted Next: block — exactly the same mechanism `service_target_agent.go` already uses today to print endpoint URLs. Everywhere else, the extension writes directly to stdout, and only on a TTY — non-TTY runs suppress the block entirely (see [Output Discipline](#output-discipline)).
 
 ## State Model
 
@@ -144,10 +151,20 @@ The resolver works with whatever's available — partial state never silences gu
 |---|---|---|
 | 1 | Explicit CLI flags (`--project-endpoint`, `--agent`) | Highest |
 | 2 | Live runtime probes (OpenAPI, Foundry status) | Override env vars |
-| 3 | `azure.yaml` services, protocols, `config.env` refs | Structural truth |
+| 3 | `azure.yaml` (services, language) + `agent.yaml` (protocol, `config.env` refs) | Structural truth |
 | 4 | azd env vars | Resolution truth |
 
-If the azd environment is missing or stale (a real scenario for brownfield users), the resolver should still produce useful output from layers 1–3.
+If the azd environment is missing or stale, the resolver should still produce useful output from layers 1–3.
+
+**"Stale" definition.** An env var is stale when any of the following holds:
+
+1. **Missing** — the key is not present in the current `.azure/<env>/.env` file but layer 3 expects it (e.g., `agent.yaml` declares `config.env: AGENT_FOO_ENDPOINT` and the env file has no entry).
+2. **Empty** — the key is present but its value is the empty string.
+3. **Drift** — the key is present and non-empty, but a higher-priority layer disagrees with it. The resolver detects two concrete drift cases:
+   - The corresponding `agent.yaml` field changed (e.g., agent renamed in `azure.yaml` while the env still holds the old `AGENT_<OLD>_ENDPOINT`).
+   - The OpenAPI/Foundry probe (layer 2) returned a different endpoint than the env value.
+
+Drift is *informational* — the resolver still emits a Next: line ("env appears stale; re-run `azd provision` or `azd env refresh`") rather than blocking. The resolver does not call ARM to confirm whether the recorded endpoint still resolves; that's doctor's responsibility.
 
 ## Decision Tree (per command)
 
@@ -234,12 +251,12 @@ Ground rules:
 | Scenario | Destination |
 |---|---|
 | Interactive TTY | `stdout` |
-| Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output |
-| `--output json` (success-path commands) | `Next:` block suppressed entirely |
-| `azd ai agent doctor --output json` | **Not** the suppression rule — emits the structured check-result array (see [Exit codes & JSON output](#exit-codes--json-output)). Human-readable `Next:` lines are still suppressed; redacted/non-redacted RBAC details appear there with an explicit `redacted` flag |
-| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (`Metadata["note"]` on an `ArtifactKindEndpoint` artifact — same path used today for endpoint URLs in `service_target_agent.go`) |
+| Piped output / CI (detected via `isTerminal`) | **Suppressed.** The human `Next:` block is a TTY-only affordance. Writing it to stdout would pollute parsable command output; writing it to stderr would be misread as a warning or failure by callers grepping stderr. Machine consumers opt into structured guidance via `--output json`. |
+| `--output json` (success-path commands) | Human `Next:` block suppressed; the same guidance is exposed as an optional `nextStep: { primary, secondary?, note? }` field on the existing JSON envelope. |
+| `azd ai agent doctor --output json` | Structured check-result array (see [Exit codes & JSON output](#exit-codes--json-output)). Human-readable `Next:` lines are suppressed; redacted/non-redacted RBAC details appear there with an explicit `redacted` flag. |
+| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (`Metadata["note"]` on an `ArtifactKindEndpoint` artifact — same path used today for endpoint URLs in `service_target_agent.go`). |
 
-The `isTerminal` check uses the existing helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). All other extension code that needs to detect interactive vs. non-interactive output should call the same helper to keep the source of truth singular.
+The `isTerminal` check uses the existing helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). All other extension code that needs to detect interactive vs. non-interactive output calls the same helper to keep the source of truth singular.
 
 ### Multi-line note pre-indentation
 
@@ -247,14 +264,16 @@ The `azdext.Artifact` SDK field used here renders only the **first line** at the
 
 ## Hook Points
 
+Wherever possible, hooks fire at the **end of `runE`** — the same call site every command uses for cleanup — so all next-step emission flows through one helper, `nextstep.PrintNext`. Two commands cannot use that hook; both deviations are documented in the table below.
+
 | Command | Trigger | Mechanism |
 |---|---|---|
 | `init` | end of `runE` after success message | direct `nextstep.PrintNext(w, …)` |
-| `run` | "agent is listening" callback | direct print before `Press Ctrl+C` line |
 | `invoke --local` / `invoke` | end of `runE` on success | direct print |
-| `show` | end of table render | direct print |
-| `doctor` | after rendering check report | direct print |
-| `deploy` | inside `Deploy()` return path | populate `Metadata["note"]` on the existing `ArtifactKindEndpoint` artifact returned by `service_target_agent.go` (~line 720). `pkg/project/artifact.go` (`ToString`, ~line 128) renders the value as a continuation line under the artifact bullet, gated on `Kind == ArtifactKindEndpoint`. We do **not** introduce a new artifact kind. |
+| `show` | end of `runE` (after table render returns) | direct print |
+| `doctor` | end of `runE` (after report renders, before `os.Exit`) | direct print |
+| `run` | "agent is listening" callback (**not** end of `runE`) | direct print before `Press Ctrl+C` line. `run` blocks until SIGINT, so end-of-`runE` would print on shutdown — too late to be useful. |
+| `deploy` | inside `Deploy()` return path (**not** the deploy command's `runE`, which lives in core azd) | populate `Metadata["note"]` on the existing `ArtifactKindEndpoint` artifact returned by `service_target_agent.go` (~line 720). `pkg/project/artifact.go` (`ToString`, ~line 128) renders the value as a continuation line under the artifact bullet, gated on `Kind == ArtifactKindEndpoint`. We do **not** introduce a new artifact kind. |
 
 The deploy hook returns its Next: block through the same `azdext.Artifact` SDK field that `service_target_agent.go` already uses for endpoint URLs. No new APIs, no edits outside the extension. Everywhere else, the extension owns its terminal directly.
 
@@ -298,7 +317,7 @@ All failures are silent. The fetch itself is best-effort and already cached —
 | 3 | azd environment selected | "<env-name>" | `azd env new` / `azd env select` |
 | 4 | Agent service in `azure.yaml` | service count + names | `azd ai agent init` |
 | 5 | `AZURE_AI_PROJECT_ENDPOINT` set | URL value | `azd provision` |
-| 6 | `agent.yaml` per service is valid (placeholder for unified-schema service stanza) | path | fix YAML |
+| 6 | `agent.yaml` per service is valid | path | fix YAML |
 | 7 (post-MVP) | Authentication via `azidentity.NewAzureDeveloperCLICredential` (the credential `agent_context.go` already uses) | signed-in user + token validity | `azd auth login` |
 | 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall |
 | 9 (post-MVP) | Model deployments exist | name + version | `azd provision` |
@@ -370,7 +389,7 @@ The `redacted` field is `true` when `--output json` is combined with non-interac
 | Layer | What | How |
 |---|---|---|
 | `nextstep/format` | indentation, blank-line rules, secondary suppression, empty-suggestions early return | table-driven unit tests |
-| `nextstep/format` | TTY vs piped routing (stdout vs stderr); `--output json` suppression of the human `Next:` block | snapshot tests with stub `isTerminal` and a fake `OutputFormat` flag |
+| `nextstep/format` | TTY vs non-TTY routing (stdout vs suppressed); `--output json` suppression of the human `Next:` block | snapshot tests with stub `isTerminal` and a fake `OutputFormat` flag |
 | `nextstep/resolver` | every branch of every per-command tree | table-driven, fake `State` |
 | `nextstep/resolver` | non-interactive auth handling — when `IsAuthenticated == AuthUnknown`, no auth-conditional advice (no "azd auth login" appears in `Next:`) | dedicated table cases |
 | `nextstep/openapi` | each extraction path + corruption / empty fallback / unresolved `$ref` → `""` | golden specs in `testdata/` |
@@ -390,7 +409,7 @@ The Next: block is purely additive; no existing exit code, error type, or stdout
 - The deploy hook adds a `Metadata["note"]` value to artifacts that already exist; the artifact `Kind` stays `ArtifactKindEndpoint`. Tools parsing artifacts by kind see no change.
 - `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--local-only` and `--output json` flags are opt-in.
 - `nextstep` and `doctor` are new internal packages under the extension. No public Go API surface is added.
-- Existing extension-side env vars and behaviors (e.g., `AZD_AGENT_NO_NEXTSTEPS`, if shipped — see Open Questions) are documented in `cli/azd/docs/environment-variables.md` per repo convention.
+- No new environment variables are introduced. The Next: block is always rendered when output discipline allows it; there is no opt-out env var.
 
 If any future change to the `Metadata["note"]` rendering behavior in `pkg/project/artifact.go` ships, the regression test pinned in this design's [Output Discipline](#output-discipline) section will break loudly. The fix is extension-side reformatting; no core edits.
 
@@ -409,7 +428,7 @@ If any future change to the `Metadata["note"]` rendering behavior in `pkg/projec
 
 ## Deferred follow-ups
 
-The reviewer surfaced three additional concerns that are out of scope for this design pass and will be handled separately:
+Three additional concerns are out of scope for this design pass and will be handled separately:
 
 - **User personas / target audiences.** Not standard in design docs in this repo (see `cli/azd/docs/design/`); we treat "developers building with the agents extension" as the implicit audience.
 - **Success metrics / KPIs.** Same rationale; metrics will be wired up alongside telemetry in the follow-up design once we have something concrete to measure.
@@ -421,7 +440,7 @@ The reviewer surfaced three additional concerns that are out of scope for this d
 |---|---|
 | Resolver guidance contradicts what the command actually did (false advice) | Each resolver receives the *post-execution* state. Tests cover deployed / not-deployed / partial-env permutations. |
 | OpenAPI fetch slows down `run` startup | Already capped + cached + silent on failure. We do not block the listening notice on the fetch. |
-| Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on piped stderr; user can ignore. |
+| Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on non-TTY output; user can ignore. |
 | Decision tree drifts from sample reality | All trees are table-driven and unit-tested per branch. Sample contract changes go through the same review. |
 | Artifact-note rendering behavior could shift in a future SDK update | Pin the indentation contract with an extension-side regression test that constructs an artifact and asserts the formatted string. If the SDK changes shape, the test breaks loudly and we adapt formatting — still no edits required outside the extension. |
 | Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` produces one `show <name>` and one `invoke <name>` line per agent. Multi-agent walls remain readable for the project sizes we expect (≤ ~10). If the wall becomes a real problem in practice, fall back to a single `azd ai agent show` (no args, lists all) line; this can be tuned in a follow-up without changing the resolver contract. |
@@ -432,7 +451,7 @@ The reviewer surfaced three additional concerns that are out of scope for this d
 2. **Wire success paths** — `init`, `run`, `invoke`, `show`. Resolver per command.
 3. **Deploy hook** — attach Next: block to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs in `service_target_agent.go`). README-on-disk verification.
 4. **`doctor` command** — checks 1–6 (local-only). Wire trailing Next: through resolver.
-5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). See [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md).
+5. **(Follow-up)** Doctor checks 7–12 (auth, reachability, model deployments, RBAC, agent status, end-to-end probe). See [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md).
 
 Each phase is independently shippable and reviewable. Phases 1–4 are the MVP for #7975.