From 23292996ce8d0292d5eb3a235dc1c0df78e5dcf5 Mon Sep 17 00:00:00 2001 From: Antriksh Jain Date: Wed, 6 May 2026 03:12:26 +0530 Subject: [PATCH 1/5] docs(ai-agents): design doc for context-aware next-step guidance Initial draft of the design for issue #7975 context-aware Next: guidance across azd ai agent commands plus a new `azd ai agent doctor` diagnostic. Scoped entirely to cli/azd/extensions/azure.ai.agents/. References core files for context but proposes no edits outside the extension. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- cli/azd/docs/design/azd-ai-agent-nextsteps.md | 393 ++++++++++++++++++ 1 file changed, 393 insertions(+) create mode 100644 cli/azd/docs/design/azd-ai-agent-nextsteps.md diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md new file mode 100644 index 00000000000..37914caac85 --- /dev/null +++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md @@ -0,0 +1,393 @@ +# Design: Context-Aware Next-Step Guidance for `azd ai agent` + +## Status + +**Draft — for design review.** +Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975) — "Context-aware next-step guidance and diagnostics". +Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — same path other extensions use today. + +## Goal + +After every successful `azd ai agent` command — and on demand via a new `azd ai agent doctor` — print a short, state-aware **Next:** block telling the developer the single best command to run next. Replace the current static, often-misleading hints with output derived from the project's real state (`azure.yaml`, the azd environment, optionally a running agent's OpenAPI endpoint). + +Non-goals: +- Reworking error-path guidance — `internal/exterrors` already does this well. +- Any modification to files outside `cli/azd/extensions/azure.ai.agents/`. State is read via the existing gRPC `azdext` API. +- Sample-author hooks, README markers, or YAML-declared hints (see [Alternatives](#alternatives-considered)). + +## Background + +### What's wrong today + +| Surface | Today | Problem | +|---|---|---| +| `init` | Hardcoded: "run `azd deploy` or `azd up`" | Skips local-dev path entirely. Suggests deploy even when deps aren't provisioned. | +| `run` | Hardcoded: `invoke --local "Hello!"` | Wrong payload for invocations-protocol agents. No OpenAPI awareness. | +| `invoke --local` (success) | Nothing | Dead-end. No nudge toward `azd deploy`. | +| `invoke` (remote, success) | Nothing | No nudge to `show` / `monitor`. | +| `show` (success) | Nothing | No nudge to invoke or check logs. | +| `deploy` | Endpoint URLs only | No invoke command, no link to README. | +| (none) | — | No way to ask "where was I?" or "what's broken across the project?" | + +### What's already in place + +These are the building blocks the design reuses: + +- **`azdext` gRPC client** — extensions read `azure.yaml` services and azd env vars without re-implementing parsing. +- **`fetchOpenAPISpec`** in `helpers.go` — already probes `GET /invocations/docs/openapi.json`, caches to disk, fails silently. Drives the rich-payload story. +- **`exterrors`** factories — typed errors with `Suggestion` fields. We follow the same "structured advice" shape on success paths. +- **`azdext.Artifact` return value** from `Deploy()` — the extension SDK's normal mechanism for an extension to surface lines below its progress bullet. Already used by the existing `service_target_agent.go` to print endpoint URLs. We populate the same field with our Next: block. Indentation rules are documented in [Output Discipline](#output-discipline) and pinned by an extension-side regression test. + +## Architecture + +### Component layout + +``` +cli/azd/extensions/azure.ai.agents/internal/cmd/ +├── nextstep/ ← new package (this design) +│ ├── types.go ← Suggestion, State, ServiceState +│ ├── state.go ← AssembleState(ctx, azdClient) → State +│ ├── resolver.go ← decision tree per command +│ ├── format.go ← PrintNext / FormatNextForNote +│ └── openapi.go ← OpenAPI-derived invoke examples +├── doctor.go ← new command — calls nextstep + extra checks +├── init.go ─┐ +├── run.go ─┼─ each calls nextstep.ResolveAfterX → nextstep.PrintNext +├── invoke.go ─┤ +├── show.go ─┘ +└── ... +internal/project/ +└── service_target_agent.go ← Deploy() embeds Next: in its returned artifact +``` + +One package owns *all* policy. Each command is a thin caller — it knows when it succeeded and which resolver to invoke; it does not assemble state, format output, or decide commands. + +### End-to-end flow + +``` +┌────────────────────────────────────────────────────────────────────┐ +│ azd ai agent succeeds │ +│ │ +│ 1. nextstep.AssembleState(ctx, azdClient) │ +│ ├── azdext.Project().Get → azure.yaml services │ +│ ├── azdext.Environment().Get → azd env vars │ +│ └── (run only) fetchOpenAPISpec → invoke payload examples │ +│ │ +│ 2. nextstep.ResolveAfter(state, args) → []Suggestion │ +│ ├── walks command-specific decision tree │ +│ └── chooses 1 primary + ≤1 secondary │ +│ │ +│ 3a. nextstep.PrintNext(w, suggestions) ← stdout commands │ +│ 3b. nextstep.FormatNextForNote(suggestions) ← deploy artifact │ +└────────────────────────────────────────────────────────────────────┘ +``` + +`deploy` is the special case: an extension's `Deploy()` returns `[]*azdext.Artifact` and that is the extension's only output channel after the call returns. We populate the artifact's note field with the formatted Next: block — exactly the same mechanism `service_target_agent.go` already uses today to print endpoint URLs. Everywhere else, the extension writes directly to stdout (or stderr when piped — see [Output Discipline](#output-discipline)). + +## State Model + +```go +// types.go +type State struct { + HasProjectEndpoint bool + HasUnresolvedInfraVars bool // ${...} → Bicep outputs not in azd env + HasUnresolvedManualVars []string // ${...} → user-supplied vars not set + Services []ServiceState + AgentStatus string // optional (Foundry API), empty if unknown + HasOpenAPI bool // run-time only + OpenAPIPayload string // run-time only, ""=no spec +} + +type ServiceState struct { + Name string + Host string // azure.ai.agent / azure.ai.toolbox / ... + Protocol string // responses / invocations / "" + RelativePath string // svc.RelativePath from azure.yaml + IsDeployed bool // azd env: deployment metadata present +} + +type Suggestion struct { + Command string // "azd ai agent run" + Description string // "start the agent locally" + Priority int // lower = earlier +} +``` + +State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The cost is one gRPC round-trip per resolver invocation, which is negligible compared to anything the user perceives. + +### Layered sources + +The resolver works with whatever's available — partial state never silences guidance: + +| Layer | Source | Authority | +|---|---|---| +| 1 | Explicit CLI flags (`--project-endpoint`, `--agent`) | Highest | +| 2 | Live runtime probes (OpenAPI, Foundry status) | Override env vars | +| 3 | `azure.yaml` services, protocols, `config.env` refs | Structural truth | +| 4 | azd env vars | Resolution truth | + +If the azd environment is missing or stale (a real scenario for brownfield users), the resolver should still produce useful output from layers 1–3. + +## Decision Tree (per command) + +The full tree lives in `resolver.go`. This section captures the policy in tabular form for review. + +### `init` + +| Condition | Primary | Secondary | +|---|---|---| +| `HasUnresolvedInfraVars` | `azd provision` | — | +| `HasUnresolvedManualVars` (non-empty) | `azd env set ` (one per missing var, up to 3) | — | +| Otherwise | `azd ai agent run` | — (the `run` resolver will print the protocol-correct `invoke --local` when the agent starts) | +| Always (third line) | "When ready to deploy to Azure, run `azd deploy`." | — | + +### `run` (on listening callback) + +| Condition | Primary | +|---|---| +| `HasOpenAPI` && payload extracted | `azd ai agent invoke --local ''` | +| `Protocol == "invocations"` | `azd ai agent invoke --local '{"message": "Hello!"}'` | +| `Protocol == "responses"` or unknown | `azd ai agent invoke --local "Hello!"` | + +If OpenAPI fetch failed (404, timeout), append a Tip line: +``` +Tip: curl http://localhost:/invocations/docs/openapi.json + to verify the exact payload format your agent expects. +``` + +### `invoke --local` (success) + +Single agent: `azd deploy`, secondary `azd ai agent monitor --follow`. +Multi-agent: `azd deploy` only (one line per agent for follow-up `invoke `). + +### `invoke` (remote) + +Success: `azd ai agent show `, secondary `azd ai agent monitor --follow`. +Failure: `azd ai agent monitor --follow`. + +### `show` (success) + +| `AgentStatus` | Primary | +|---|---| +| `active` / `idle` | `azd ai agent invoke 'Hello!'` | +| `failed` / `""` | `azd ai agent monitor --follow` | +| transitional | `azd ai agent show ` (re-check) | + +### `deploy` (post-deploy hook, embedded in artifact note) + +Success, single agent: +``` +azd ai agent show -- verify it's running +azd ai agent invoke '' -- test the deployment +See /README.md for a sample payload appropriate for this agent. +``` + +Notes: +- The `invoke` line **omits the agent name** for single-agent projects (matches sample README convention). +- README hint is emitted **only if** `os.Stat` succeeds on `//{README.md,readme.md,README.MD}`. No hint for projects without a README. + +Multi-agent: one `show ` and one `invoke ` line per agent. + +### `doctor` + +Special: runs every check unconditionally, then calls `ResolveAfterInit` (or a deploy-aware variant when `IsDeployed`) for the trailing Next: block. Same code path as success-path resolvers. + +## Output Format + +``` +Next: -- + -- +``` + +Ground rules: +- One primary, ≤1 secondary. More than two lines drowns out command output. +- Two-space gap between command and `--`. +- Multi-agent: repeat per agent on separate lines, no menus. +- Leading blank line separates from preceding command output. +- README/docs pointers go on their own line below the commands. +- Always present `Next:` prefix so the block is visually distinct. + +### Output discipline + +| Scenario | Destination | +|---|---| +| Interactive TTY | `stdout` | +| Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output | +| `--output json` / structured output flag | Suppressed entirely | +| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs) | + +### Multi-line note pre-indentation + +### Multi-line note pre-indentation + +The `azdext.Artifact` SDK field used here renders only the **first line** at the natural indent level; subsequent lines need to be pre-indented by the producer to align under the artifact bullet. `FormatNextForNote` pre-indents lines 2+ with 4 spaces and trims leading/trailing newlines. This is purely a string-formatting concern handled inside the extension. + +## Hook Points + +| Command | Trigger | Mechanism | +|---|---|---| +| `init` | end of `runE` after success message | direct `nextstep.PrintNext(w, …)` | +| `run` | "agent is listening" callback | direct print before `Press Ctrl+C` line | +| `invoke --local` / `invoke` | end of `runE` on success | direct print | +| `show` | end of table render | direct print | +| `doctor` | after rendering check report | direct print | +| `deploy` | inside `Deploy()` return path | attach Next: to the returned `azdext.Artifact` | + +The deploy hook returns its Next: block through the same `azdext.Artifact` SDK field that `service_target_agent.go` already uses for endpoint URLs. No new APIs, no edits outside the extension. Everywhere else, the extension owns its terminal directly. + +### Why not a `listen.go` event handler for deploy? + +The listen daemon's stdout is consumed by core's gRPC channel — `fmt.Printf` from a `postdeployHandler` never reaches the user terminal. Confirmed during a previous spike. The artifact-note path is the only reliable surface. + +## OpenAPI Discovery + +Reuses `fetchOpenAPISpec` from `helpers.go`. Adds one helper in `nextstep/openapi.go`: + +```go +// ExtractInvokeExample returns a JSON-quoted payload string from the spec's +// /invocations endpoint (preferring `example`, falling back to a minimal object +// derived from `schema.required`). Returns "" if the spec is unusable. +func ExtractInvokeExample(spec []byte) string +``` + +Resolution order: +1. `paths./invocations.post.requestBody.content.application/json.example` — exact author intent. +2. `…schema.example` — schema-level example. +3. Generate from `schema.required` + `schema.properties[*].example` — minimum viable JSON. +4. Return `""` → fall back to protocol-generic payload + Tip line. + +All failures are silent. The fetch itself is best-effort and already cached — we never block `run`'s startup notice on it. + +## `doctor` Command + +`doctor` is the persistence layer for state-aware guidance: when the user has lost terminal context or hit a confusing error, they run one command and get the full picture plus the specific fix for each broken thing. + +### Checks (run in order, top to bottom) + +| # | Check | Pass | Fail → Fix | +|---|---|---|---| +| 1 | `azd` reachable + extension running | "extension running, gRPC channel established" | install / start daemon | +| 2 | `azure.yaml` valid | "1 service: " | `azd ai agent init` | +| 3 | azd environment selected | "" | `azd env new` / `azd env select` | +| 4 | Agent service in `azure.yaml` | service count + names | `azd ai agent init` | +| 5 | `AZURE_AI_PROJECT_ENDPOINT` set | URL value | `azd provision` | +| 6 | `agent.yaml` per service is valid | path | fix YAML | +| 7 (post-MVP) | Authentication (`az account show`) | signed-in user | `azd auth login` | +| 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall | +| 9 (post-MVP) | Model deployments exist | name + version | `azd provision` | +| 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` | +| 11 (post-MVP) | Agent status (if deployed) | `active (vN)` | `azd ai agent monitor --follow` | + +Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass. + +### Doctor output shape + +``` +azd ai agent doctor + ✓ PASS + + ✗ FAIL + + fix: + +Next: +``` + +When all checks pass, the trailing Next: block is the resolver's `ResolveAfterInit` (if not deployed) or a deploy-aware variant (if deployed) — exactly what the user would have seen at the end of the last successful command. + +## Testing Strategy + +| Layer | What | How | +|---|---|---| +| `nextstep/format` | indentation, blank-line rules, secondary suppression, empty-suggestions early return | table-driven unit tests | +| `nextstep/resolver` | every branch of every per-command tree | table-driven, fake `State` | +| `nextstep/openapi` | each extraction path + corruption / empty fallback | golden specs in `testdata/` | +| `nextstep/state` | layered source priority, partial state | mocked `azdClient` (testify/mock) | +| `service_target_agent` | with/without README on disk → with/without hint line | `t.TempDir` + `t.Chdir` | +| `doctor` | each check pass/fail rendering | mocked `azdClient` | + +All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure calls. + +## Risks & Mitigations + +| Risk | Mitigation | +|---|---| +| Resolver guidance contradicts what the command actually did (false advice) | Each resolver receives the *post-execution* state. Tests cover deployed / not-deployed / partial-env permutations. | +| OpenAPI fetch slows down `run` startup | Already capped + cached + silent on failure. We do not block the listening notice on the fetch. | +| Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on piped stderr; user can ignore. | +| Decision tree drifts from sample reality | All trees are table-driven and unit-tested per branch. Sample contract changes go through the same review. | +| Artifact-note rendering behavior could shift in a future SDK update | Pin the indentation contract with an extension-side regression test that constructs an artifact and asserts the formatted string. If the SDK changes shape, the test breaks loudly and we adapt formatting — still no edits required outside the extension. | +| Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` cap: 5 agents inline; beyond that, suggest `azd ai agent show` (no args, lists all). | + +## Implementation Phases + +1. **Foundation** — `nextstep` package (types, format, state assembly, OpenAPI helper). No callers. +2. **Wire success paths** — `init`, `run`, `invoke`, `show`. Resolver per command. +3. **Deploy hook** — attach Next: block to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs in `service_target_agent.go`). README-on-disk verification. +4. **`doctor` command** — checks 1–6 (local-only). Wire trailing Next: through resolver. +5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). Separate design. + +Each phase is independently shippable and reviewable. Phases 1–4 are the MVP for #7975. + +## Appendix A: Example Outputs (full block, end-to-end) + +### After `init`, fresh project, deps not provisioned + +``` +✓ Agent project initialized. + +Next: azd provision -- set up your Foundry project, models, and connections + +This creates your Foundry project and any model deployments, toolboxes, +or connections your agent needs for local development. + +Once that finishes, run 'azd ai agent run' to start locally. +When ready to deploy to Azure, run 'azd deploy'. +``` + +### After `run`, invocations protocol, with OpenAPI + +``` +Agent is running at http://localhost:8088 + +Next: azd ai agent invoke --local '{"message": "What hotels are in Seattle?"}' + +Press Ctrl+C to stop. +``` + +### After `azd deploy` (rendered by core from artifact note) + +``` + (✓) Done: Deploying service ag-ui-invocations + - Agent playground (portal): https://ai.azure.com/... + - Agent endpoint (invocations): https://....services.ai.azure.com/... + Next: azd ai agent show ag-ui-invocations -- verify it's running + azd ai agent invoke '' -- test the deployment + See src/ag-ui-invocations/README.md for a sample payload appropriate for this agent. +``` + +### After `azd ai agent doctor`, project deployed and healthy + +``` +azd ai agent doctor + ✓ PASS azd CLI is installed and reachable + extension running, gRPC channel established + ✓ PASS Project loaded from azure.yaml + ✓ PASS Current azd environment selected + ✓ PASS Agent service detected in azure.yaml + ✓ PASS AZURE_AI_PROJECT_ENDPOINT is set + ✓ PASS agent.yaml for service "echo-agent" is valid + +Next: azd ai agent show echo-agent -- verify it's running + azd ai agent invoke '' -- test the deployment + See src/echo-agent/README.md for a sample payload appropriate for this agent. +``` + +## Appendix B: References + +- Issue [#7975](https://github.com/Azure/azure-dev/issues/7975) +- Existing OpenAPI fetch: [`helpers.go#L296`](https://github.com/Azure/azure-dev/blob/main/cli/azd/extensions/azure.ai.agents/internal/cmd/helpers.go#L296) +- Existing init suggestion: [`init.go#L1718-L1733`](https://github.com/Azure/azure-dev/blob/main/cli/azd/extensions/azure.ai.agents/internal/cmd/init.go#L1718-L1733) +- Existing endpoint artifact emission (same SDK field reused for the Next: block): [`service_target_agent.go`](../../extensions/azure.ai.agents/internal/project/service_target_agent.go) +- `exterrors` package (parallel pattern for error paths): [`internal/exterrors`](../../extensions/azure.ai.agents/internal/exterrors) +- Reference Foundry sample (OpenAPI opt-in): [`microsoft-foundry/foundry-samples` human-in-the-loop](https://github.com/microsoft-foundry/foundry-samples/blob/fd4ecab832fa491b9ffb4abca862c073777b5e53/samples/python/hosted-agents/bring-your-own/invocations/human-in-the-loop/main.py#L119) From 18bc6589ee38da9ffc0258a212fd4194a8876a02 Mon Sep 17 00:00:00 2001 From: Antriksh Jain Date: Wed, 6 May 2026 03:21:58 +0530 Subject: [PATCH 2/5] docs(ai-agents): add separate design for doctor remote checks (7-11) Companion to azd-ai-agent-nextsteps.md. Covers the five live-Azure doctor checks deferred from the MVP: auth, Foundry reachability, model deployments, RBAC, and agent runtime status. Key decisions: - Opt-in via --remote flag (default stays local-only, fast, no auth) - Strict dependency matrix; failed prerequisites become explicit Skip results - Per-check timeouts (~10s); ~30s worst-case walltime for --remote - Read-only RBAC fix is surfaced as az CLI command, never auto-applied - Phased rollout: runner split, auth+reach, models+status, RBAC last Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azd-ai-agent-doctor-remote-checks.md | 260 ++++++++++++++++++ cli/azd/docs/design/azd-ai-agent-nextsteps.md | 4 +- 2 files changed, 262 insertions(+), 2 deletions(-) create mode 100644 cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md new file mode 100644 index 00000000000..1b3fb166a9b --- /dev/null +++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md @@ -0,0 +1,260 @@ +# Design: `azd ai agent doctor` — Remote Checks (7–11) + +## Status + +**Draft — for design review.** +Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975). +Companion to [`azd-ai-agent-nextsteps.md`](./azd-ai-agent-nextsteps.md), which covers the doctor command's plumbing and local checks (1–6). **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** + +## Goal + +Extend `azd ai agent doctor` past its local-only MVP so it can answer the question every developer eventually asks: *"the project is wired up correctly, so why doesn't this work end-to-end?"* + +Concretely: add five checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, RBAC, agent runtime status — each with a precise "what to do next" fix. + +Non-goals: +- Fixing remote problems automatically (`doctor --fix`). Surfacing the right command is enough for v1. +- Re-implementing what `az` / Foundry portal already do well — we just point the user there when relevant. +- Any modification to files outside `cli/azd/extensions/azure.ai.agents/`. + +## Background + +The MVP doctor checks (1–6) are pure local reads — `azure.yaml` exists, env vars are set, YAML parses. They run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions. + +The remaining ~30% — and arguably the most frustrating cases — sit in remote land: +- Token expired silently → every command 401s with no obvious "you're logged out" hint. +- Foundry endpoint typo'd in env → DNS/TLS error buried in a stack trace. +- Model deployment got deleted out from under the project → first invoke fails with a cryptic 404. +- Azure-AI Developer role missing → vague 403. +- Agent deployed but unhealthy → user thinks `azd deploy` succeeded so it must be fine. + +These all require live calls and credentials, which is why they're carved out into this separate design. + +## Scope of this doc + +| Check | What it answers | +|---|---| +| 7 — Authentication | Am I logged in? Is the token still valid? | +| 8 — Foundry reachability | Can I reach `AZURE_AI_PROJECT_ENDPOINT`? | +| 9 — Model deployments | Do the model(s) my agent references exist on the project? | +| 10 — RBAC | Does my principal have the roles needed to deploy / invoke? | +| 11 — Agent status | Does the deployed agent actually exist and is it active? | + +## Architecture + +### Component layout (additive only) + +``` +cli/azd/extensions/azure.ai.agents/internal/cmd/ +└── doctor/ + ├── checks_local.go ← (MVP) checks 1–6 + ├── checks_remote.go ← NEW: checks 7–11 + ├── runner.go ← orchestrates ordered execution + └── types.go ← Check, Result, Severity +``` + +Existing MVP files unchanged. Only `runner.go` learns to dispatch the new check set, gated by a flag (see [Execution Model](#execution-model)). + +### `Check` contract + +Same shape as MVP — keeps the renderer, sort order, and `Next:` resolver code path identical. + +```go +type Check interface { + ID() string // stable, e.g. "auth.token" + Title() string // human label + Run(ctx context.Context) Result +} + +type Result struct { + Status Status // Pass | Warn | Fail | Skip + Detail string // one-line description + Fix string // single command or instruction; empty for Pass + Links []string // doc links (optional) + Duration time.Duration // captured by runner, not the check +} +``` + +A remote check that can't run (e.g., not authenticated when running check 9) returns `Status: Skip` with `Detail: "skipped: authentication required (see check 7)"`. Skips never block subsequent checks — the user sees the full picture in one pass. + +## Execution Model + +### Default: local-only + +`azd ai agent doctor` runs checks 1–6 only. Same as MVP. No network, no token, no surprise prompts. This is the daily-driver mode. + +### Opt-in: `--remote` + +`azd ai agent doctor --remote` runs **1–11** in order. Local checks always run first; remote checks run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)). + +Rationale for opt-in (vs. always-on or auto-detect): +- Remote checks need credentials and add seconds to a command users may run frequently. +- Hitting Foundry on every `doctor` invocation is bad citizenship and surprising in CI. +- A flag is discoverable, reversible, and self-documenting. + +We considered `--quick` / `--full` naming but `--remote` describes the actual side effect — network calls under your credentials — which is what reviewers and CI authors care about. + +### Dependency matrix + +| Check | Depends on | Action if dependency fails | +|---|---|---| +| 7 — Auth | check 3 (env selected) | Skip with "select an env first" | +| 8 — Reachability | 5 (`AZURE_AI_PROJECT_ENDPOINT` set), 7 (auth) | Skip | +| 9 — Models | 6 (`agent.yaml` valid), 8 (reachability) | Skip | +| 10 — RBAC | 7 (auth), 8 (reachability) | Skip | +| 11 — Agent status | 8, 9, 10 | Skip | + +Skips are explicit in the rendered report. We never quietly drop a check. + +## Check Specifications + +### Check 7 — Authentication + +**What it does:** acquires a token via `azidentity.NewAzureDeveloperCLICredential` (the same credential `agent_context.go` already uses), then introspects expiry. + +**Pass:** " · token valid for minutes" +**Warn:** token expires in <5 min (suggest re-login proactively) +**Fail:** token acquisition error → `azd auth login` + +**Why a separate fail vs warn:** users hit this all the time when a CLI session outlives a token; the warn variant tells them to re-login *before* a long-running deploy 401s. + +### Check 8 — Foundry project reachability + +**What it does:** issues a single `GET /agents?api-version=…&$top=1` with the credential from check 7. + +**Pass:** "endpoint reachable (HTTP 200)" +**Fail:** maps the HTTP status to one of: +- `403` → check 10 (RBAC) will explain. Fix: see check 10 output. +- `404` → endpoint is wrong or project is gone. Fix: `azd provision` or fix env var. +- network/DNS/TLS → "verify VPN / firewall / typo in `AZURE_AI_PROJECT_ENDPOINT`". + +The single-shot probe avoids paging and works on tiny test projects. Timeout: 10s, no retries — doctor is diagnostic, not resilient. + +### Check 9 — Model deployments + +**What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally. + +**Pass:** "all referenced models present" +**Fail:** "missing: (referenced by )" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link). + +**Edge cases:** +- Multi-version models (e.g., `gpt-4o:2024-08-06`) — we match on deployment name, not version. Version mismatch surfaces at runtime; not in scope here. +- Agents with no model reference (orchestration-only) — pass with "no model references". + +### Check 10 — RBAC + +**What it does:** for the current principal, lists role assignments on the Foundry project scope and checks for the minimum role set: +- `Azure AI Developer` (or stronger) on the project — required for deploy + invoke. +- `Cognitive Services User` on the AI account — required for model usage. + +**Pass:** ", " +**Warn:** has invoke role but not deploy role → user can run agents but not `azd deploy`. Surfaces as "you can invoke but not deploy" with the exact `az role assignment create` command for the missing role. +**Fail:** missing both → command + portal link. + +The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example: +``` +fix: az role assignment create \ + --role "Azure AI Developer" \ + --assignee \ + --scope +``` + +**Why we don't auto-create the assignment:** doctor is read-only. RBAC changes are the kind of action that should be explicit and audited. + +### Check 11 — Agent status + +**What it does:** for each service in `azure.yaml` with `host: ai-agent`, calls the Foundry agents-list endpoint and finds the matching agent by name. + +**Pass:** ": active (v)" +**Warn:** ": deploying (v)" — running `monitor --follow` is suggested. +**Fail:** +- agent not found → `azd deploy` +- agent in failed state → `azd ai agent monitor --follow` (look at logs, not redeploy blindly) + +This is the check that closes the loop on "I deployed and it succeeded — why doesn't `invoke` work?" Often the answer is the agent is still rolling out; check 11 says so explicitly. + +## Output Format + +Same renderer as MVP. Skipped checks render with a distinct glyph so they're visible: + +``` +azd ai agent doctor --remote + ✓ PASS azd reachable + ✓ PASS azure.yaml valid + ✓ PASS azd environment selected + ✓ PASS agent service in azure.yaml + ✓ PASS AZURE_AI_PROJECT_ENDPOINT set + ✓ PASS agent.yaml valid + ✓ PASS authentication + alice@contoso.com · token valid for 47 minutes + ✗ FAIL Foundry project reachability + GET /agents → 403 + fix: see RBAC check below + - SKIP model deployments + skipped: reachability check failed + ✗ FAIL RBAC + principal alice@contoso.com has no role on project scope + fix: az role assignment create --role "Azure AI Developer" \ + --assignee 0000... --scope /subscriptions/.../projects/... + - SKIP agent status + skipped: reachability check failed + +Next: az role assignment create ... -- grant yourself the required role + azd ai agent doctor --remote -- re-run to verify +``` + +The trailing `Next:` block is resolved from the **first** failed remote check that has a single user-actionable fix — same logic as MVP, just extended to know about remote-check fix commands. + +## Performance & Side Effects + +Per [Performance Budget](#performance-budget) below — defaults remain local-only so users can keep running `doctor` reflexively. `--remote` adds a couple of seconds and explicit credential use. + +| Check | Network calls | Worst-case time | +|---|---|---| +| 7 — Auth | 0 (cached token) – 1 (refresh) | 2s | +| 8 — Reach | 1 GET | 10s (timeout) | +| 9 — Models | 1 list | 5s | +| 10 — RBAC | 1 list-role-assignments | 5s | +| 11 — Status | 1 list-agents | 5s | + +Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode. + +## Testing Strategy + +- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8). +- **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test. +- **Functional test** (recorded cassette via `mage record`) covering one happy-path full run. +- No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step. + +## Risks & Mitigations + +| Risk | Mitigation | +|---|---| +| Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. | +| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. | +| Foundry API drift breaks the reachability probe | Pin the api-version explicitly. Failure renders as "API not understood" with a doc link rather than a stack trace. | +| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts, parallel fan-out for independent checks (8 + 10 can run concurrently). | +| User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. | + +## Performance Budget + +Local-only `doctor` (today): <100ms. Don't regress. +`--remote`: ≤ 30s walltime in the worst case (sick endpoint). Typical: <5s. +We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot, opt-in diagnostic. + +## Implementation Phases + +1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--remote` flag, no behavior change at this step. +2. **Auth + reachability (7, 8)** — the two checks every other remote check depends on. Useful even on their own. +3. **Models + agent status (9, 11)** — leverage existing `agentClient` calls. Closes the deploy-loop questions. +4. **RBAC (10)** — last because the role-list API is the trickiest to stub and the most likely to need iteration on the fix-command wording. + +Each phase is independently shippable and behind the `--remote` flag, so partial rollout is safe. We can ship phases 1–2 and gather feedback before completing 3–4. + +## Appendix: Why not fold these into MVP? + +Three reasons, in order of weight: + +1. **Auth + network coupling.** MVP doctor runs in any project state, including before `azd auth login`. Remote checks fundamentally cannot. The flag separation makes the contract explicit. +2. **Review surface.** The MVP design (the companion doc) is already large. Adding five live-Azure checks with their own error mapping, timeout, and skip semantics doubles it. +3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks behind a flag so we can iterate the wording, the role-set, and the api-version pinning without churning the always-on path. diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md index 37914caac85..dff83c41c77 100644 --- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md +++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md @@ -278,7 +278,7 @@ All failures are silent. The fetch itself is best-effort and already cached — | 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` | | 11 (post-MVP) | Agent status (if deployed) | `active (vN)` | `azd ai agent monitor --follow` | -Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass. +Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass — see [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md). ### Doctor output shape @@ -325,7 +325,7 @@ All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure 2. **Wire success paths** — `init`, `run`, `invoke`, `show`. Resolver per command. 3. **Deploy hook** — attach Next: block to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs in `service_target_agent.go`). README-on-disk verification. 4. **`doctor` command** — checks 1–6 (local-only). Wire trailing Next: through resolver. -5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). Separate design. +5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). See [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md). Each phase is independently shippable and reviewable. Phases 1–4 are the MVP for #7975. From 4358e68497b0beb9909e9b69141d246e9241d05b Mon Sep 17 00:00:00 2001 From: Antriksh Jain Date: Wed, 6 May 2026 13:12:19 +0530 Subject: [PATCH 3/5] docs(agent-nextsteps): address PR review feedback MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Synthesized across 4 model critiques (claude-opus-4.7, claude-sonnet-4.6, gpt-5.5, gpt-5.4). Doc-only changes; no implementation churn. azd-ai-agent-nextsteps.md - Status: anchor artifact mechanism on `Metadata["note"]` of an `ArtifactKindEndpoint` artifact rendered by `pkg/project/artifact.go` (no core edits). - Drop broken `[Alternatives](#alternatives-considered)` link; rephrase Non-goals to keep the "no core edits" point explicit. - Add Assumptions section: design assumes unified `azure.yaml` baseline per #7975; `agent.yaml` references are placeholders for the equivalent unified service-stanza fields. Pin auth to `azidentity.NewAzureDeveloperCLICredential` (matches `agent_context.go`). - Component layout: align `doctor.go` to `doctor/` subpackage matching the remote-checks doc. - State Model: rename `HasUnresolvedInfraVars bool` to `MissingInfraVars []string`; mirror `MissingManualVars`. Add tri-state `IsAuthenticated` (Unknown/Authed/Unauthed) populated only via `WithAuthProbe` opt-in; default `Unknown` means "skip auth-conditional advice", not "tell user to log in". Note `HasOpenAPI`/`OpenAPIPayload` only populated by `run`/`doctor --remote` paths. - Decision Tree: rename `init` condition to `len(MissingInfraVars) > 0`; drop `monitor --follow` secondary from `invoke --local` success (monitor is hosted-only — verified in `monitor.go`); deploy resolver uses cached spec only, no live probe at deploy time (agent may not be reachable yet). - Output discipline: name `isTerminal` source of truth (`helpers.go` wrapper around `golang.org/x/term.IsTerminal`). Reconcile JSON-mode contradiction: `--output json` on success-path commands suppresses the human `Next:` block; `doctor --output json` is a separate structured emit with explicit `redacted` field. Remove duplicate "Multi-line note pre-indentation" heading. - Hook Points: clarify deploy-row mechanism cites the existing endpoint-kind artifact emitted by `service_target_agent.go` and the renderer at `pkg/project/artifact.go` (`ToString`); no new artifact kind. - OpenAPI Discovery: note silent-mode requirement for `fetchOpenAPISpec` when called by the resolver path (currently prints "OpenAPI spec saved to..." at `helpers.go:373`). Mark `$ref` resolution as out of scope — fall back to `` literal. - Doctor checks: rename check 1 to "extension gRPC channel healthy"; point check 7 at `azidentity.NewAzureDeveloperCLICredential`; clarify check 6 placeholder for unified-schema service stanza. - Add `Exit codes & JSON output` subsection: precedence rules (any fail -> 1, skip-only -> 2, otherwise 0). JSON envelope sketch with `schemaVersion`, `remote`, `redacted`, `checks[]`. - Testing: TTY/pipe/JSON snapshot routing; non-interactive auth-skip; `$ref` -> `""` extractor; cached-spec vs absent fallback; explicit skip-cascade test over the dependency matrix; exit-code precedence + JSON envelope shape. - Add Backward Compatibility section. - Add Security & PII / Telemetry section: redaction rules; `isTerminal` source-of-truth; logging policy; brief deferred-telemetry sketch. - Add Deferred follow-ups: personas/KPIs/full-telemetry-plan rationale. - Risks: replace cap-of-5 mitigation with a wording that doesn't pin a magic number — fall back to a single `azd ai agent show` line if multi-agent walls become a real problem. azd-ai-agent-doctor-remote-checks.md - Dependency matrix: drop check 8 from check 10 deps (RBAC reads ARM, not Foundry data plane); drop check 10 from check 11 deps (agents-list is Reader-level). - Check 8: pin api-version to `DefaultAgentAPIVersion` constant in `internal/cmd/agent_context.go` (currently `2025-11-15-preview`); add 401 case (token expired); broaden 403 wording to "wrong tenant or insufficient RBAC". - Check 10: add redaction rule for non-interactive output; flag in JSON envelope. - Risks: reconcile parallel-fan-out claim post matrix simplification; add identifier-leakage row pointing at the redaction rule. - Testing: add explicit skip-cascade test over the dependency matrix; redaction snapshot tests for interactive vs. non-interactive RBAC. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azd-ai-agent-doctor-remote-checks.md | 24 ++- cli/azd/docs/design/azd-ai-agent-nextsteps.md | 164 ++++++++++++++---- 2 files changed, 150 insertions(+), 38 deletions(-) diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md index 1b3fb166a9b..bee787fc2bf 100644 --- a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md +++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md @@ -100,9 +100,9 @@ We considered `--quick` / `--full` naming but `--remote` describes the actual si |---|---|---| | 7 — Auth | check 3 (env selected) | Skip with "select an env first" | | 8 — Reachability | 5 (`AZURE_AI_PROJECT_ENDPOINT` set), 7 (auth) | Skip | -| 9 — Models | 6 (`agent.yaml` valid), 8 (reachability) | Skip | -| 10 — RBAC | 7 (auth), 8 (reachability) | Skip | -| 11 — Agent status | 8, 9, 10 | Skip | +| 9 — Models | 6 (service stanza valid), 8 (reachability) | Skip | +| 10 — RBAC | 7 (auth) | Skip — RBAC reads ARM, not Foundry data plane, so it does **not** depend on check 8 | +| 11 — Agent status | 7 (auth), 8 (reachability) | Skip — agents-list is a Reader-level Foundry call and does not require check 10's deploy/invoke roles | Skips are explicit in the rendered report. We never quietly drop a check. @@ -120,11 +120,12 @@ Skips are explicit in the rendered report. We never quietly drop a check. ### Check 8 — Foundry project reachability -**What it does:** issues a single `GET /agents?api-version=…&$top=1` with the credential from check 7. +**What it does:** issues a single `GET /agents?api-version=&$top=1` with the credential from check 7. The api-version is read from the `DefaultAgentAPIVersion` constant in `internal/cmd/agent_context.go` (currently `2025-11-15-preview`) — never a doc-side literal — so a single source of truth governs which Foundry surface we probe. **Pass:** "endpoint reachable (HTTP 200)" **Fail:** maps the HTTP status to one of: -- `403` → check 10 (RBAC) will explain. Fix: see check 10 output. +- `401` → token expired or scope mismatch. Fix: `azd auth login`; if the issue persists, see check 7. +- `403` → wrong tenant **or** insufficient RBAC. Fix: confirm the active subscription/tenant matches the Foundry project; if it does, see check 10's role-assignment fix. - `404` → endpoint is wrong or project is gone. Fix: `azd provision` or fix env var. - network/DNS/TLS → "verify VPN / firewall / typo in `AZURE_AI_PROJECT_ENDPOINT`". @@ -151,7 +152,7 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10 **Warn:** has invoke role but not deploy role → user can run agents but not `azd deploy`. Surfaces as "you can invoke but not deploy" with the exact `az role assignment create` command for the missing role. **Fail:** missing both → command + portal link. -The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example: +The fix command is templated with the actual principal ID and scope so the user can paste it directly. Example (interactive TTY): ``` fix: az role assignment create \ --role "Azure AI Developer" \ @@ -159,6 +160,8 @@ fix: az role assignment create \ --scope ``` +**Redaction in non-interactive output.** When stdout is piped or `--output json` is set, principal IDs, scope ARNs, and full UPNs in both `detail` and `fix` strings are replaced with ``. The JSON envelope sets `"redacted": true` so callers know they're getting the safe variant (see [`Exit codes & JSON output`](./azd-ai-agent-nextsteps.md#exit-codes--json-output) in the companion doc). An interactive-only `--unredacted` flag exists for users who need the raw values in a JSON pipeline. + **Why we don't auto-create the assignment:** doctor is read-only. RBAC changes are the kind of action that should be explicit and audited. ### Check 11 — Agent status @@ -221,7 +224,9 @@ Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its ow ## Testing Strategy -- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8). +- **Unit tests per check** with a fake `agentClient` interface. Cover pass / warn / fail / skip paths and HTTP status mapping (especially 401/403/404 disambiguation in check 8 — including the "wrong tenant or insufficient RBAC" wording for 403). +- **Skip-cascade test.** For each dependency edge in the matrix above, assert the dependent check returns `Status: Skip` with the expected detail string when its dependency returns `Status: Fail`. Run as a table-driven test over the matrix so adding/removing edges automatically widens coverage. +- **Redaction test.** Snapshot the rendered RBAC fix string in interactive (full values) and non-interactive (``) modes, plus the JSON envelope's `redacted` field. - **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test. - **Functional test** (recorded cassette via `mage record`) covering one happy-path full run. - No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step. @@ -232,9 +237,10 @@ Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its ow |---|---| | Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. | | RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. | -| Foundry API drift breaks the reachability probe | Pin the api-version explicitly. Failure renders as "API not understood" with a doc link rather than a stack trace. | -| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts, parallel fan-out for independent checks (8 + 10 can run concurrently). | +| Foundry API drift breaks the reachability probe | Pin the api-version to `DefaultAgentAPIVersion` (`internal/cmd/agent_context.go`), not a doc literal. Failure renders as "API not understood" with a doc link rather than a stack trace. | +| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts. After the dependency-matrix simplification above, **independent checks 8 and 10 can run concurrently** once their shared dependency (check 7) passes. Check 9 still waits on 8; check 11 waits on 8. | | User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. | +| Sensitive identifiers (principal IDs, scope ARNs) leak into shared CI logs | See [Output Format](#output-format) and the redaction rule in check 10: non-interactive output replaces values with `` and the JSON envelope flags `"redacted": true`. | ## Performance Budget diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md index dff83c41c77..60fbdc78ce9 100644 --- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md +++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md @@ -4,7 +4,7 @@ **Draft — for design review.** Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975) — "Context-aware next-step guidance and diagnostics". -Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — same path other extensions use today. +Scope: the `azure.ai.agents` extension only. **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** The deploy hook returns its Next: block via the standard extension SDK return value (`*azdext.Artifact`) — specifically the `Metadata["note"]` key on an `ArtifactKindEndpoint` artifact, which `pkg/project/artifact.go` already renders as a continuation line under the artifact bullet (gated on `Kind == ArtifactKindEndpoint`). The extension already emits endpoint-kind artifacts at `internal/project/service_target_agent.go`; we populate the `note` metadata key on the same artifacts. ## Goal @@ -13,7 +13,14 @@ After every successful `azd ai agent` command — and on demand via a new `azd a Non-goals: - Reworking error-path guidance — `internal/exterrors` already does this well. - Any modification to files outside `cli/azd/extensions/azure.ai.agents/`. State is read via the existing gRPC `azdext` API. -- Sample-author hooks, README markers, or YAML-declared hints (see [Alternatives](#alternatives-considered)). +- Sample-author hooks, README markers, or YAML-declared hints — out of scope for this design. +- Authoring or editing core azd packages (`cli/azd/pkg/...`, `cli/azd/cmd/...`). The design only *consumes* the existing `Metadata["note"]` rendering contract. + +## Assumptions + +This design assumes the unified-`azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) has landed: a single `azure.yaml` is the canonical config; `agent.yaml` and `agent.manifest.yaml` no longer exist as separate files. References to `agent.yaml` in the doctor checks and decision trees below are placeholders for the equivalent fields in the unified `services.` stanza. If the extension ships before unification, each `agent.yaml` reference should be read as the corresponding fields under `services.` in `azure.yaml`. The `AssembleState` function abstracts the source split, so resolvers are insulated from the migration — only the state-assembly code needs to change. + +The extension's authentication remains `azidentity.NewAzureDeveloperCLICredential` (see [`agent_context.go`](../../extensions/azure.ai.agents/internal/cmd/agent_context.go)). User-facing fix commands therefore say `azd auth login`, not `az login`. ## Background @@ -36,7 +43,7 @@ These are the building blocks the design reuses: - **`azdext` gRPC client** — extensions read `azure.yaml` services and azd env vars without re-implementing parsing. - **`fetchOpenAPISpec`** in `helpers.go` — already probes `GET /invocations/docs/openapi.json`, caches to disk, fails silently. Drives the rich-payload story. - **`exterrors`** factories — typed errors with `Suggestion` fields. We follow the same "structured advice" shape on success paths. -- **`azdext.Artifact` return value** from `Deploy()` — the extension SDK's normal mechanism for an extension to surface lines below its progress bullet. Already used by the existing `service_target_agent.go` to print endpoint URLs. We populate the same field with our Next: block. Indentation rules are documented in [Output Discipline](#output-discipline) and pinned by an extension-side regression test. +- **`azdext.Artifact` return value** from `Deploy()` — the extension SDK's normal mechanism for an extension to surface lines below its progress bullet. The renderer is `pkg/project/artifact.go` (specifically the `Metadata["note"]` branch at `ToString`, ~line 128), which is gated on `Kind == ArtifactKindEndpoint`. The extension already emits endpoint-kind artifacts in `internal/project/service_target_agent.go` (~line 720) for endpoint URLs. We attach our Next: block to the same artifacts via the same `Metadata["note"]` key. No changes to the renderer or any core file. Indentation rules are documented in [Output Discipline](#output-discipline) and pinned by an extension-side regression test that asserts the formatted string against a stub artifact. ## Architecture @@ -45,12 +52,15 @@ These are the building blocks the design reuses: ``` cli/azd/extensions/azure.ai.agents/internal/cmd/ ├── nextstep/ ← new package (this design) -│ ├── types.go ← Suggestion, State, ServiceState -│ ├── state.go ← AssembleState(ctx, azdClient) → State +│ ├── types.go ← Suggestion, State, ServiceState, AuthState +│ ├── state.go ← AssembleState(ctx, azdClient, opts) → State │ ├── resolver.go ← decision tree per command │ ├── format.go ← PrintNext / FormatNextForNote │ └── openapi.go ← OpenAPI-derived invoke examples -├── doctor.go ← new command — calls nextstep + extra checks +├── doctor/ ← new package (matches structure in remote-checks doc) +│ ├── types.go ← Check, Result, Status +│ ├── runner.go ← orchestrates ordered execution + Next: tail +│ └── checks_local.go ← MVP local checks (1–6) ├── init.go ─┐ ├── run.go ─┼─ each calls nextstep.ResolveAfterX → nextstep.PrintNext ├── invoke.go ─┤ @@ -88,14 +98,23 @@ One package owns *all* policy. Each command is a thin caller — it knows when i ```go // types.go +type AuthState int + +const ( + AuthUnknown AuthState = iota // probe was not run; suggestions must skip auth-conditional advice + AuthAuthed // doctor confirmed a usable token + AuthUnauthed // doctor confirmed login is needed +) + type State struct { - HasProjectEndpoint bool - HasUnresolvedInfraVars bool // ${...} → Bicep outputs not in azd env - HasUnresolvedManualVars []string // ${...} → user-supplied vars not set - Services []ServiceState - AgentStatus string // optional (Foundry API), empty if unknown - HasOpenAPI bool // run-time only - OpenAPIPayload string // run-time only, ""=no spec + HasProjectEndpoint bool + MissingInfraVars []string // ${...} → Bicep outputs not in azd env (named for actionable advice) + MissingManualVars []string // ${...} → user-supplied vars not set + Services []ServiceState + AgentStatus string // optional (Foundry API), empty if unknown + HasOpenAPI bool // populated only when AssembleState is called from `run` or `doctor --remote` + OpenAPIPayload string // populated only when AssembleState is called from `run` or `doctor --remote` + IsAuthenticated AuthState // populated only when AssembleState is called from `doctor --remote` } type ServiceState struct { @@ -113,7 +132,9 @@ type Suggestion struct { } ``` -State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The cost is one gRPC round-trip per resolver invocation, which is negligible compared to anything the user perceives. +State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The base cost is one gRPC round-trip per resolver invocation, which is negligible. + +The `IsAuthenticated` probe is **not** part of the base assembly: it requires a token-introspection call which is network-bound and would regress the perf claim above for every command. `AssembleState` accepts an option (`WithAuthProbe`) that defaults to false; the `doctor --remote` path is the only caller that sets it. Every other resolver receives `IsAuthenticated == AuthUnknown` and treats auth-conditional suggestions as "skip advice that needs a confirmed login state" rather than "tell user to log in" — i.e., we never produce login-prompt noise in success paths. ### Layered sources @@ -136,8 +157,8 @@ The full tree lives in `resolver.go`. This section captures the policy in tabula | Condition | Primary | Secondary | |---|---|---| -| `HasUnresolvedInfraVars` | `azd provision` | — | -| `HasUnresolvedManualVars` (non-empty) | `azd env set ` (one per missing var, up to 3) | — | +| `len(MissingInfraVars) > 0` | `azd provision` | — | +| `len(MissingManualVars) > 0` | `azd env set ` (one per missing var, up to 3) | — | | Otherwise | `azd ai agent run` | — (the `run` resolver will print the protocol-correct `invoke --local` when the agent starts) | | Always (third line) | "When ready to deploy to Azure, run `azd deploy`." | — | @@ -157,7 +178,7 @@ Tip: curl http://localhost:/invocations/docs/openapi.json ### `invoke --local` (success) -Single agent: `azd deploy`, secondary `azd ai agent monitor --follow`. +Single agent: `azd deploy` only. (The local invoke just succeeded — `monitor --follow` is hosted-only and not useful here; the natural next step is to ship.) Multi-agent: `azd deploy` only (one line per agent for follow-up `invoke `). ### `invoke` (remote) @@ -185,6 +206,7 @@ See /README.md for a sample payload appropriate for this agent. Notes: - The `invoke` line **omits the agent name** for single-agent projects (matches sample README convention). - README hint is emitted **only if** `os.Stat` succeeds on `//{README.md,readme.md,README.MD}`. No hint for projects without a README. +- **No live OpenAPI probe at deploy time.** The just-deployed agent may not be reachable yet (this is exactly what doctor check 11 observes). `ResolveAfterDeploy` therefore uses any cached spec produced by a prior local `run` (same disk cache as `fetchOpenAPISpec`), falling back to the README pointer, then to the protocol-generic `` literal. The deployed endpoint's payload can be verified later via `azd ai agent show` or the next-step block from a successful remote `invoke`. Multi-agent: one `show ` and one `invoke ` line per agent. @@ -213,10 +235,11 @@ Ground rules: |---|---| | Interactive TTY | `stdout` | | Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output | -| `--output json` / structured output flag | Suppressed entirely | -| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs) | +| `--output json` (success-path commands) | `Next:` block suppressed entirely | +| `azd ai agent doctor --output json` | **Not** the suppression rule — emits the structured check-result array (see [Exit codes & JSON output](#exit-codes--json-output)). Human-readable `Next:` lines are still suppressed; redacted/non-redacted RBAC details appear there with an explicit `redacted` flag | +| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (`Metadata["note"]` on an `ArtifactKindEndpoint` artifact — same path used today for endpoint URLs in `service_target_agent.go`) | -### Multi-line note pre-indentation +The `isTerminal` check uses the existing helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). All other extension code that needs to detect interactive vs. non-interactive output should call the same helper to keep the source of truth singular. ### Multi-line note pre-indentation @@ -231,7 +254,7 @@ The `azdext.Artifact` SDK field used here renders only the **first line** at the | `invoke --local` / `invoke` | end of `runE` on success | direct print | | `show` | end of table render | direct print | | `doctor` | after rendering check report | direct print | -| `deploy` | inside `Deploy()` return path | attach Next: to the returned `azdext.Artifact` | +| `deploy` | inside `Deploy()` return path | populate `Metadata["note"]` on the existing `ArtifactKindEndpoint` artifact returned by `service_target_agent.go` (~line 720). `pkg/project/artifact.go` (`ToString`, ~line 128) renders the value as a continuation line under the artifact bullet, gated on `Kind == ArtifactKindEndpoint`. We do **not** introduce a new artifact kind. | The deploy hook returns its Next: block through the same `azdext.Artifact` SDK field that `service_target_agent.go` already uses for endpoint URLs. No new APIs, no edits outside the extension. Everywhere else, the extension owns its terminal directly. @@ -258,6 +281,10 @@ Resolution order: All failures are silent. The fetch itself is best-effort and already cached — we never block `run`'s startup notice on it. +**Silent-mode requirement.** Today `fetchOpenAPISpec` prints `OpenAPI spec saved to ` to stdout (see `helpers.go:373`). When invoked from the resolver path, that print is unwanted noise. The extraction path will either gate the print behind a `verbose` flag on the helper, or wrap the call in a `silent: true` variant. This is an extension-internal change to existing extension code — still no edits outside the extension. + +**Out of scope: OpenAPI `$ref` resolution.** The extractor walks the spec object tree literally. If `requestBody` or `schema` uses `$ref: "#/components/..."`, the extractor returns `""` and the resolver falls back to the protocol-generic payload and Tip line. Cross-document `$ref` resolution adds a JSON-pointer dependency we don't want in the success-path. Authors who hit this can either inline the example or rely on the README-pointer line. + ## `doctor` Command `doctor` is the persistence layer for state-aware guidance: when the user has lost terminal context or hit a confusing error, they run one command and get the full picture plus the specific fix for each broken thing. @@ -266,13 +293,13 @@ All failures are silent. The fetch itself is best-effort and already cached — | # | Check | Pass | Fail → Fix | |---|---|---|---| -| 1 | `azd` reachable + extension running | "extension running, gRPC channel established" | install / start daemon | +| 1 | `azd` reachable + extension gRPC channel healthy | "extension running, gRPC channel established" | install / start daemon | | 2 | `azure.yaml` valid | "1 service: " | `azd ai agent init` | | 3 | azd environment selected | "" | `azd env new` / `azd env select` | | 4 | Agent service in `azure.yaml` | service count + names | `azd ai agent init` | | 5 | `AZURE_AI_PROJECT_ENDPOINT` set | URL value | `azd provision` | -| 6 | `agent.yaml` per service is valid | path | fix YAML | -| 7 (post-MVP) | Authentication (`az account show`) | signed-in user | `azd auth login` | +| 6 | `agent.yaml` per service is valid (placeholder for unified-schema service stanza) | path | fix YAML | +| 7 (post-MVP) | Authentication via `azidentity.NewAzureDeveloperCLICredential` (the credential `agent_context.go` already uses) | signed-in user + token validity | `azd auth login` | | 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall | | 9 (post-MVP) | Model deployments exist | name + version | `azd provision` | | 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` | @@ -295,19 +322,98 @@ Next: When all checks pass, the trailing Next: block is the resolver's `ResolveAfterInit` (if not deployed) or a deploy-aware variant (if deployed) — exactly what the user would have seen at the end of the last successful command. +### Exit codes & JSON output + +| Outcome | Exit code | +|---|---| +| All checks pass (or pass+skip with no fail) — and at least one check ran | 0 | +| Any check returns `Status: Fail` | 1 | +| All checks ran were skipped (e.g., `--remote` against a project missing prerequisites) | 2 | + +Precedence: any `Fail` always wins (exit 1). Skip-only without any pass is the dependency-cascade case (exit 2). All-pass or pass+skip with at least one pass is exit 0. + +`azd ai agent doctor --output json` emits a structured array, one entry per check, in execution order: + +```json +{ + "schemaVersion": "1.0", + "remote": false, + "redacted": true, + "checks": [ + { + "id": "local.azure-yaml", + "title": "Project loaded from azure.yaml", + "status": "pass", + "detail": "1 service: echo-agent", + "fix": "", + "links": [], + "durationMs": 4 + }, + { + "id": "remote.rbac", + "title": "RBAC", + "status": "fail", + "detail": "principal has no role on project scope", + "fix": "az role assignment create --role \"Azure AI Developer\" --assignee --scope ", + "links": ["https://learn.microsoft.com/azure/ai-studio/concepts/rbac"], + "durationMs": 412 + } + ] +} +``` + +The `redacted` field is `true` when `--output json` is combined with non-interactive output (the default for piped stdout). Setting `--unredacted` (interactive only) populates principal IDs, scope ARNs, and full UPNs verbatim. The human-readable `Next:` block is **not** included in the JSON envelope — that is the success-path output discipline; doctor's structured emit is its own contract. + ## Testing Strategy | Layer | What | How | |---|---|---| | `nextstep/format` | indentation, blank-line rules, secondary suppression, empty-suggestions early return | table-driven unit tests | +| `nextstep/format` | TTY vs piped routing (stdout vs stderr); `--output json` suppression of the human `Next:` block | snapshot tests with stub `isTerminal` and a fake `OutputFormat` flag | | `nextstep/resolver` | every branch of every per-command tree | table-driven, fake `State` | -| `nextstep/openapi` | each extraction path + corruption / empty fallback | golden specs in `testdata/` | -| `nextstep/state` | layered source priority, partial state | mocked `azdClient` (testify/mock) | -| `service_target_agent` | with/without README on disk → with/without hint line | `t.TempDir` + `t.Chdir` | -| `doctor` | each check pass/fail rendering | mocked `azdClient` | +| `nextstep/resolver` | non-interactive auth handling — when `IsAuthenticated == AuthUnknown`, no auth-conditional advice (no "azd auth login" appears in `Next:`) | dedicated table cases | +| `nextstep/openapi` | each extraction path + corruption / empty fallback / unresolved `$ref` → `""` | golden specs in `testdata/` | +| `nextstep/state` | layered source priority, partial state, `WithAuthProbe` opt-in | mocked `azdClient` (testify/mock) | +| `service_target_agent` | with/without README on disk → with/without hint line; cached spec present vs absent → payload vs `` literal | `t.TempDir` + `t.Chdir` | +| `doctor/runner` | each check pass/fail rendering | mocked `azdClient` | +| `doctor/runner` | **skip-cascade**: for each dependency edge, assert the dependent check returns `Status: Skip` with the expected detail when its dependency returns `Status: Fail` | table-driven over the dependency matrix | +| `doctor/runner` | exit-code precedence (fail wins; all-skip = 2; otherwise 0) and JSON envelope shape (schema version, `redacted` flag, per-check fields) | table-driven + snapshot | All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure calls. +## Backward Compatibility + +The Next: block is purely additive; no existing exit code, error type, or stdout format changes. Specifically: + +- Success-path JSON output (`--output json` on `init`/`run`/`invoke`/`show`) is unchanged — the human `Next:` block is suppressed in that mode (see [Output discipline](#output-discipline)). +- The deploy hook adds a `Metadata["note"]` value to artifacts that already exist; the artifact `Kind` stays `ArtifactKindEndpoint`. Tools parsing artifacts by kind see no change. +- `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--remote` and `--output json` flags are opt-in. +- `nextstep` and `doctor` are new internal packages under the extension. No public Go API surface is added. +- Existing extension-side env vars and behaviors (e.g., `AZD_AGENT_NO_NEXTSTEPS`, if shipped — see Open Questions) are documented in `cli/azd/docs/environment-variables.md` per repo convention. + +If any future change to the `Metadata["note"]` rendering behavior in `pkg/project/artifact.go` ships, the regression test pinned in this design's [Output Discipline](#output-discipline) section will break loudly. The fix is extension-side reformatting; no core edits. + +## Security & PII / Telemetry + +**Output redaction.** Doctor's RBAC check (10) and any future check that surfaces principal IDs or resource ARNs follows two rules: + +- **Interactive TTY:** full values are shown — they are necessary for the user to construct the fix command (`az role assignment create ...`). +- **Non-interactive (piped, CI, `--output json`):** principal IDs, scope ARNs, and full UPNs are replaced with `` in `detail` and `fix` strings. The structured JSON envelope sets `"redacted": true` so callers can see they're getting the safe variant. An interactive-only `--unredacted` flag exists for the rare case a user needs the raw values in a JSON pipeline. + +**Source of truth for `isTerminal`.** All redaction and stdout/stderr routing decisions are gated by the existing `isTerminal` helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). New code must call the same helper, never re-implement detection — this keeps the test surface singular and the redaction contract auditable. + +**Logging.** The `nextstep` package logs only via the standard `log` package, which the extension already silences in production via `setupDebugLogging` (see `internal/cmd/debug.go`). Sensitive values (principal IDs, tokens, full ARNs) are logged at `debug` level only; the resolver's user-facing output never logs. + +**Telemetry (deferred).** A full instrumentation plan is out of scope for this design. Initial implementation will emit a single counter event per resolver invocation (`nextstep.resolved` with `command` + `branch_id` attributes) and per-doctor-check (`doctor.check.completed` with `id` + `status` + `duration_ms`). Neither event includes user input, principal IDs, or environment values. A follow-up design will cover the full schema once we have implementation experience. + +## Deferred follow-ups + +The reviewer surfaced three additional concerns that are out of scope for this design pass and will be handled separately: + +- **User personas / target audiences.** Not standard in design docs in this repo (see `cli/azd/docs/design/`); we treat "developers building with the agents extension" as the implicit audience. +- **Success metrics / KPIs.** Same rationale; metrics will be wired up alongside telemetry in the follow-up design once we have something concrete to measure. +- **Full telemetry / observability plan.** See the brief sketch above; the complete attribute schema, sampling policy, and dashboards are deferred. + ## Risks & Mitigations | Risk | Mitigation | @@ -317,7 +423,7 @@ All tests run under `go test -short -timeout 180s ./internal/...`. No live Azure | Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on piped stderr; user can ignore. | | Decision tree drifts from sample reality | All trees are table-driven and unit-tested per branch. Sample contract changes go through the same review. | | Artifact-note rendering behavior could shift in a future SDK update | Pin the indentation contract with an extension-side regression test that constructs an artifact and asserts the formatted string. If the SDK changes shape, the test breaks loudly and we adapt formatting — still no edits required outside the extension. | -| Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` cap: 5 agents inline; beyond that, suggest `azd ai agent show` (no args, lists all). | +| Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` produces one `show ` and one `invoke ` line per agent. Multi-agent walls remain readable for the project sizes we expect (≤ ~10). If the wall becomes a real problem in practice, fall back to a single `azd ai agent show` (no args, lists all) line; this can be tuned in a follow-up without changing the resolver contract. | ## Implementation Phases From 575a450792e57afd7939be74b4658976a2d4ecaa Mon Sep 17 00:00:00 2001 From: Antriksh Jain Date: Wed, 6 May 2026 19:05:56 +0530 Subject: [PATCH 4/5] docs(azure.ai.agents): address therealjohn doctor doc review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Flip flag model from `--remote` opt-in to `--local-only` opt-out. Default `azd ai agent doctor` now runs all checks (1–12); `--local-only` is the CI / offline opt-out. Update Goal, Execution Model, Output Format example, Performance Budget, Implementation Phases, and the Appendix accordingly. - Add Terminology section defining MVP explicitly as "local checks 1–6 ship first" — chronological, not a permanent capability cap. - Soften the local-only framing in Background: clarify that remote checks intentionally make network calls under user credentials, which is the whole point of this doc. - Add Check 12: Agent identity role assignments. Lists role assignments on the agent's managed identity at project / account / resource-group scopes, with a new INFO glyph for the informational case. Distinct from check 10 (user RBAC) — different principal, different fix command, different preconditions. Rename check 10 to "User RBAC" for symmetry. - Add a tracking note on Check 9 referencing #7962 (`agent.yaml` → `azure.yaml` unification, syncing with @trangevi). - Rewrite the first Risks row: replace the "users in CI miss `--remote`" scenario (no longer applicable) with a precise non-interactive auth failure path that surfaces a clear "re-run with --local-only" message. - Update companion `azd-ai-agent-nextsteps.md` for cross-reference consistency: doctor table extended to checks 7–12; flag references flipped (`--local-only` is the new opt-in flag, full sweep is default); comments on `WithAuthProbe` updated to match. Resolves comments #1–7 from @therealjohn on PR #8057. Comment #8 (scope: move to `azure.ai.projects`?) is deferred for @rajeshkamal5050. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azd-ai-agent-doctor-remote-checks.md | 122 ++++++++++++------ cli/azd/docs/design/azd-ai-agent-nextsteps.md | 17 +-- 2 files changed, 94 insertions(+), 45 deletions(-) diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md index bee787fc2bf..68ad7df228a 100644 --- a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md +++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md @@ -1,4 +1,4 @@ -# Design: `azd ai agent doctor` — Remote Checks (7–11) +# Design: `azd ai agent doctor` — Remote Checks (7–12) ## Status @@ -6,11 +6,19 @@ Tracking issue: [#7975](https://github.com/Azure/azure-dev/issues/7975). Companion to [`azd-ai-agent-nextsteps.md`](./azd-ai-agent-nextsteps.md), which covers the doctor command's plumbing and local checks (1–6). **All code lives under `cli/azd/extensions/azure.ai.agents/`. No files outside the extension are modified.** +## Terminology + +- **MVP doctor** — the first shippable slice of the command: local checks (1–6) only. Defined in the [companion doc](./azd-ai-agent-nextsteps.md). MVP is a chronological term ("what lands first"), not a permanent capability boundary. +- **Local checks** — read-only inspection of `azure.yaml`, `agent.yaml`, env vars, etc. No network, no token. +- **Remote checks** — anything that issues a network call under user credentials. Specified in this doc. + ## Goal Extend `azd ai agent doctor` past its local-only MVP so it can answer the question every developer eventually asks: *"the project is wired up correctly, so why doesn't this work end-to-end?"* -Concretely: add five checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, RBAC, agent runtime status — each with a precise "what to do next" fix. +The end-state default is to run **all** checks (local then remote) on every `doctor` invocation. CI and air-gapped users opt out of the remote ones via `--local-only` (see [Execution Model](#execution-model)). + +Concretely: add six checks that exercise the live Azure surface — auth, Foundry reachability, model deployments, user RBAC, agent runtime status, agent-identity role assignments — each with a precise "what to do next" fix. Non-goals: - Fixing remote problems automatically (`doctor --fix`). Surfacing the right command is enough for v1. @@ -19,9 +27,9 @@ Non-goals: ## Background -The MVP doctor checks (1–6) are pure local reads — `azure.yaml` exists, env vars are set, YAML parses. They run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions. +Local checks (1–6) — `azure.yaml` exists, env vars are set, YAML parses — run in tens of milliseconds and never need network or auth. They catch ~70% of "why is my project broken" questions and are what MVP doctor ships first. -The remaining ~30% — and arguably the most frustrating cases — sit in remote land: +The remaining ~30% — and arguably the most frustrating cases — *intentionally* require network calls under the user's credentials: - Token expired silently → every command 401s with no obvious "you're logged out" hint. - Foundry endpoint typo'd in env → DNS/TLS error buried in a stack trace. - Model deployment got deleted out from under the project → first invoke fails with a cryptic 404. @@ -37,8 +45,9 @@ These all require live calls and credentials, which is why they're carved out in | 7 — Authentication | Am I logged in? Is the token still valid? | | 8 — Foundry reachability | Can I reach `AZURE_AI_PROJECT_ENDPOINT`? | | 9 — Model deployments | Do the model(s) my agent references exist on the project? | -| 10 — RBAC | Does my principal have the roles needed to deploy / invoke? | +| 10 — User RBAC | Does my principal have the roles needed to deploy / invoke? | | 11 — Agent status | Does the deployed agent actually exist and is it active? | +| 12 — Agent identity roles | What roles does the agent's managed identity have, and on which scopes? | ## Architecture @@ -79,20 +88,20 @@ A remote check that can't run (e.g., not authenticated when running check 9) ret ## Execution Model -### Default: local-only +### Default: all checks (1–12) -`azd ai agent doctor` runs checks 1–6 only. Same as MVP. No network, no token, no surprise prompts. This is the daily-driver mode. +`azd ai agent doctor` runs the full check set in order. Local checks (1–6) run first; remote checks (7–12) run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)). The default is "tell me everything that's wrong" — a half-answer that skips the most frustrating cases isn't worth shipping. -### Opt-in: `--remote` +### Opt-out: `--local-only` -`azd ai agent doctor --remote` runs **1–11** in order. Local checks always run first; remote checks run only if their preconditions pass (see [Dependency Matrix](#dependency-matrix)). +`azd ai agent doctor --local-only` skips checks 7–12. Use this when: +- Running in CI / non-interactive shells where an interactive auth prompt would hang. +- Working offline / behind a restrictive firewall. +- Investigating a config-only issue and you don't want network latency. -Rationale for opt-in (vs. always-on or auto-detect): -- Remote checks need credentials and add seconds to a command users may run frequently. -- Hitting Foundry on every `doctor` invocation is bad citizenship and surprising in CI. -- A flag is discoverable, reversible, and self-documenting. +Non-interactive detection (`!isTerminal`) is **not** automatic — explicit is better than magic. If a user forgets `--local-only` in CI, the auth check fails fast with a clear message and the rest of the remote checks Skip cleanly (see [Risks](#risks--mitigations)). -We considered `--quick` / `--full` naming but `--remote` describes the actual side effect — network calls under your credentials — which is what reviewers and CI authors care about. +We considered `--remote` opt-in but rejected it: doctor's value is highest when it answers the live-Azure questions by default, and every "why doesn't this work" question we've seen in practice falls in that bucket. CI is the niche, not the default. ### Dependency matrix @@ -103,6 +112,7 @@ We considered `--quick` / `--full` naming but `--remote` describes the actual si | 9 — Models | 6 (service stanza valid), 8 (reachability) | Skip | | 10 — RBAC | 7 (auth) | Skip — RBAC reads ARM, not Foundry data plane, so it does **not** depend on check 8 | | 11 — Agent status | 7 (auth), 8 (reachability) | Skip — agents-list is a Reader-level Foundry call and does not require check 10's deploy/invoke roles | +| 12 — Agent identity roles | 11 (agent status pass — we need the agent's MI principal ID) | Skip — without an active agent there is no MI to introspect | Skips are explicit in the rendered report. We never quietly drop a check. @@ -135,6 +145,8 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10 **What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally. +> **Tracking note ([#7962](https://github.com/Azure/azure-dev/issues/7962)).** Once the `agent.yaml` → `azure.yaml` unification lands, this check (and any other reference to `agent.yaml` in this doc) reads model refs from `azure.yaml` instead. Syncing with @trangevi on timing. + **Pass:** "all referenced models present" **Fail:** "missing: (referenced by )" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link). @@ -142,7 +154,7 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10 - Multi-version models (e.g., `gpt-4o:2024-08-06`) — we match on deployment name, not version. Version mismatch surfaces at runtime; not in scope here. - Agents with no model reference (orchestration-only) — pass with "no model references". -### Check 10 — RBAC +### Check 10 — User RBAC **What it does:** for the current principal, lists role assignments on the Foundry project scope and checks for the minimum role set: - `Azure AI Developer` (or stronger) on the project — required for deploy + invoke. @@ -176,12 +188,44 @@ fix: az role assignment create \ This is the check that closes the loop on "I deployed and it succeeded — why doesn't `invoke` work?" Often the answer is the agent is still rolling out; check 11 says so explicitly. +### Check 12 — Agent identity role assignments + +**What it does:** for each active agent found in check 11, resolves the agent's managed-identity principal ID, then lists role assignments on the **agent identity** at three scope levels: + +1. The Foundry project (scope of the agent itself). +2. The parent AI account. +3. The resource group containing the project (proxy for "any resource the agent might touch"). + +We don't try to predict *which specific resource* the agent will need at invocation time — that's data-plane and varies per agent code. Instead we surface the visible role-assignment landscape so the user can answer "does my agent identity have any roles where I'd expect it to?" + +**Pass:** agent MI has at least one assignment at the project scope, plus a non-empty list at account or resource-group scope. +**Warn:** agent MI has assignments only at the agent's own scope — it almost certainly cannot read project / account resources at runtime. Render the list with a "this looks under-privileged for typical agents" hint. +**Fail:** agent MI has zero role assignments anywhere reachable. This is the smoking-gun for "my agent runs but every tool call 403s." + +**Output shape:** +``` +ⓘ INFO agent identity roles (agent: research-bot) + principal: + project scope: + - Cognitive Services User + account scope: + - (none) + resource-group scope: + - Storage Blob Data Reader +``` + +We render this as `INFO` (a fourth glyph alongside pass/warn/fail/skip) when there's nothing actionable to flag, since the value is mostly informational — the user inspects the list to confirm it matches their mental model. Warn/Fail variants surface only the actionable cases above. + +**Why a separate check from 10:** check 10 is about the *user's* permissions to deploy/invoke; check 12 is about the *agent's* permissions at runtime. They fail differently, fix differently (`az role assignment create --assignee ` vs ``), and the agent MI doesn't exist until check 11 has confirmed an active agent — so they can't be merged without losing precision. + +**Redaction.** Same rules as check 10: principal IDs and scope ARNs are replaced with `` in non-interactive output, and the JSON envelope sets `"redacted": true`. + ## Output Format Same renderer as MVP. Skipped checks render with a distinct glyph so they're visible: ``` -azd ai agent doctor --remote +azd ai agent doctor ✓ PASS azd reachable ✓ PASS azure.yaml valid ✓ PASS azd environment selected @@ -195,32 +239,35 @@ azd ai agent doctor --remote fix: see RBAC check below - SKIP model deployments skipped: reachability check failed - ✗ FAIL RBAC + ✗ FAIL user RBAC principal alice@contoso.com has no role on project scope fix: az role assignment create --role "Azure AI Developer" \ --assignee 0000... --scope /subscriptions/.../projects/... - SKIP agent status skipped: reachability check failed + - SKIP agent identity roles + skipped: agent status check did not pass Next: az role assignment create ... -- grant yourself the required role - azd ai agent doctor --remote -- re-run to verify + azd ai agent doctor -- re-run to verify ``` The trailing `Next:` block is resolved from the **first** failed remote check that has a single user-actionable fix — same logic as MVP, just extended to know about remote-check fix commands. ## Performance & Side Effects -Per [Performance Budget](#performance-budget) below — defaults remain local-only so users can keep running `doctor` reflexively. `--remote` adds a couple of seconds and explicit credential use. +Per [Performance Budget](#performance-budget) below — `--local-only` keeps doctor under 100ms for users who explicitly opt out. Default mode adds a few seconds and explicit credential use. | Check | Network calls | Worst-case time | |---|---|---| | 7 — Auth | 0 (cached token) – 1 (refresh) | 2s | | 8 — Reach | 1 GET | 10s (timeout) | | 9 — Models | 1 list | 5s | -| 10 — RBAC | 1 list-role-assignments | 5s | +| 10 — User RBAC | 1 list-role-assignments | 5s | | 11 — Status | 1 list-agents | 5s | +| 12 — Agent MI roles | 3 list-role-assignments (project / account / RG) | 6s | -Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode. +Worst-case full-sweep walltime: ~35s with a sick endpoint. Each check has its own timeout, applied via `context.WithTimeout`. The runner reports per-check duration in `--verbose` mode. ## Testing Strategy @@ -229,38 +276,39 @@ Worst-case `--remote` walltime: ~30s with a sick endpoint. Each check has its ow - **Redaction test.** Snapshot the rendered RBAC fix string in interactive (full values) and non-interactive (``) modes, plus the JSON envelope's `redacted` field. - **Snapshot test** for the rendered report in mixed pass/fail/skip configurations — same harness as MVP doctor's local-checks snapshot test. - **Functional test** (recorded cassette via `mage record`) covering one happy-path full run. -- No live tests in CI for `--remote` — replay-only. Cassette refresh is a manual maintainer step. +- No live tests in CI for the remote-check path — replay-only. Cassette refresh is a manual maintainer step. ## Risks & Mitigations | Risk | Mitigation | |---|---| -| Auth prompt during `--remote` surprises users in CI | Detect non-interactive (`!isTerminal`) and refuse the auth refresh; emit a clear "skipped: non-interactive" instead of hanging. | -| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. | +| User in CI forgets `--local-only`; remote checks 401 or hang on auth refresh | Detect non-interactive (`!isTerminal`) inside check 7; if a token refresh would require interaction, fail fast with `Status: Fail, detail: "non-interactive shell — re-run with --local-only"`. Checks 8–12 then Skip via the dependency matrix. No hang, no silent success. | +| RBAC list call requires a permission users may also lack | If the role-assignments call itself 403s in checks 10 or 12, render the check as Warn, not Fail — give the user the role-assignment command anyway and let them try. | | Foundry API drift breaks the reachability probe | Pin the api-version to `DefaultAgentAPIVersion` (`internal/cmd/agent_context.go`), not a doc literal. Failure renders as "API not understood" with a doc link rather than a stack trace. | -| Slow tenant or VPN turns `--remote` into a 30s wait | Strict per-check timeouts. After the dependency-matrix simplification above, **independent checks 8 and 10 can run concurrently** once their shared dependency (check 7) passes. Check 9 still waits on 8; check 11 waits on 8. | -| User runs `--remote` against a fresh project before `azd provision` | Checks 8–11 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. | -| Sensitive identifiers (principal IDs, scope ARNs) leak into shared CI logs | See [Output Format](#output-format) and the redaction rule in check 10: non-interactive output replaces values with `` and the JSON envelope flags `"redacted": true`. | +| Slow tenant or VPN turns full sweep into a long wait | Strict per-check timeouts. After the dependency-matrix simplification above, **independent checks 8 and 10 can run concurrently** once their shared dependency (check 7) passes. Check 9 still waits on 8; check 11 waits on 8; check 12 waits on 11. | +| User runs `doctor` against a fresh project before `azd provision` | Checks 8–12 short-circuit to Skip via the dependency matrix; check 5 (env var missing) tells them to provision. | +| Sensitive identifiers (principal IDs, scope ARNs) leak into shared CI logs | See [Output Format](#output-format) and the redaction rule in checks 10 and 12: non-interactive output replaces values with `` and the JSON envelope flags `"redacted": true`. | ## Performance Budget -Local-only `doctor` (today): <100ms. Don't regress. -`--remote`: ≤ 30s walltime in the worst case (sick endpoint). Typical: <5s. -We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot, opt-in diagnostic. +`--local-only` doctor: <100ms. Don't regress. +Default (full sweep): ≤ 35s walltime in the worst case (sick endpoint). Typical: <6s. +We do **not** add any background polling, prefetching, or daemon work for these checks. Doctor remains a one-shot diagnostic. ## Implementation Phases -1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--remote` flag, no behavior change at this step. +1. **Runner refactor** — split `checks_local.go` / `checks_remote.go`, add `--local-only` flag (default false), no behavior change at this step beyond exposing the flag. 2. **Auth + reachability (7, 8)** — the two checks every other remote check depends on. Useful even on their own. 3. **Models + agent status (9, 11)** — leverage existing `agentClient` calls. Closes the deploy-loop questions. -4. **RBAC (10)** — last because the role-list API is the trickiest to stub and the most likely to need iteration on the fix-command wording. +4. **User RBAC (10)** — role-list API for the current principal, with the fix-command wording. +5. **Agent identity roles (12)** — last because it depends on a passing check 11 and adds a new INFO glyph to the renderer. -Each phase is independently shippable and behind the `--remote` flag, so partial rollout is safe. We can ship phases 1–2 and gather feedback before completing 3–4. +Each phase is independently shippable. While 7–8 are landing, the renderer can already display "remote checks not yet implemented" for 9–12 without breaking existing users (who will be on `--local-only` behavior until they explicitly opt in via the flag-default flip in phase 1). -## Appendix: Why not fold these into MVP? +## Appendix: Why is this a separate doc from MVP? Three reasons, in order of weight: -1. **Auth + network coupling.** MVP doctor runs in any project state, including before `azd auth login`. Remote checks fundamentally cannot. The flag separation makes the contract explicit. -2. **Review surface.** The MVP design (the companion doc) is already large. Adding five live-Azure checks with their own error mapping, timeout, and skip semantics doubles it. -3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks behind a flag so we can iterate the wording, the role-set, and the api-version pinning without churning the always-on path. +1. **Auth + network coupling.** Local checks (1–6) run in any project state, including before `azd auth login`. Remote checks fundamentally cannot. Splitting the design makes the contract explicit and lets us iterate on the remote surface without churning the local one. +2. **Review surface.** The MVP design (the companion doc) is already large. Adding six live-Azure checks with their own error mapping, timeout, redaction, and skip semantics doubles it. +3. **Iteration cost on Foundry surfaces.** The Foundry agents API is still evolving; we want to land remote checks in a phased rollout (see [Implementation Phases](#implementation-phases)) so we can iterate the wording, the role-set, and the api-version pinning without churning everything at once. diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md index 60fbdc78ce9..8936b985b2f 100644 --- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md +++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md @@ -112,9 +112,9 @@ type State struct { MissingManualVars []string // ${...} → user-supplied vars not set Services []ServiceState AgentStatus string // optional (Foundry API), empty if unknown - HasOpenAPI bool // populated only when AssembleState is called from `run` or `doctor --remote` - OpenAPIPayload string // populated only when AssembleState is called from `run` or `doctor --remote` - IsAuthenticated AuthState // populated only when AssembleState is called from `doctor --remote` + HasOpenAPI bool // populated only when AssembleState is called from `run` or `doctor` (any mode that contacts the agent) + OpenAPIPayload string // populated only when AssembleState is called from `run` or `doctor` (any mode that contacts the agent) + IsAuthenticated AuthState // populated only when AssembleState is called from `doctor` (full sweep, i.e. not `--local-only`) } type ServiceState struct { @@ -134,7 +134,7 @@ type Suggestion struct { State is **assembled fresh on each call**. No singleton, no cache across commands. `azure.yaml` parsing inside one call is cached by `azdClient` already. The base cost is one gRPC round-trip per resolver invocation, which is negligible. -The `IsAuthenticated` probe is **not** part of the base assembly: it requires a token-introspection call which is network-bound and would regress the perf claim above for every command. `AssembleState` accepts an option (`WithAuthProbe`) that defaults to false; the `doctor --remote` path is the only caller that sets it. Every other resolver receives `IsAuthenticated == AuthUnknown` and treats auth-conditional suggestions as "skip advice that needs a confirmed login state" rather than "tell user to log in" — i.e., we never produce login-prompt noise in success paths. +The `IsAuthenticated` probe is **not** part of the base assembly: it requires a token-introspection call which is network-bound and would regress the perf claim above for every command. `AssembleState` accepts an option (`WithAuthProbe`) that defaults to false; the full-sweep `doctor` path is the only caller that sets it (`--local-only` runs leave it false). Every other resolver receives `IsAuthenticated == AuthUnknown` and treats auth-conditional suggestions as "skip advice that needs a confirmed login state" rather than "tell user to log in" — i.e., we never produce login-prompt noise in success paths. ### Layered sources @@ -302,10 +302,11 @@ All failures are silent. The fetch itself is best-effort and already cached — | 7 (post-MVP) | Authentication via `azidentity.NewAzureDeveloperCLICredential` (the credential `agent_context.go` already uses) | signed-in user + token validity | `azd auth login` | | 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall | | 9 (post-MVP) | Model deployments exist | name + version | `azd provision` | -| 10 (post-MVP) | RBAC sufficient | role list | `az role assignment create …` | +| 10 (post-MVP) | User RBAC sufficient | role list | `az role assignment create …` | | 11 (post-MVP) | Agent status (if deployed) | `active (vN)` | `azd ai agent monitor --follow` | +| 12 (post-MVP) | Agent identity role assignments | roles on project / account / RG | `az role assignment create --assignee …` | -Checks 7–11 are listed in the issue but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–11 require Foundry control-plane calls and RBAC introspection that need their own design pass — see [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md). +Checks 7–12 are listed in the issue (and surfaced in review feedback) but pulled into a follow-up to keep the MVP shippable. Checks 1–6 are pure local reads; 7–12 require Foundry control-plane calls and RBAC introspection that need their own design pass — see [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md). ### Doctor output shape @@ -328,7 +329,7 @@ When all checks pass, the trailing Next: block is the resolver's `ResolveAfterIn |---|---| | All checks pass (or pass+skip with no fail) — and at least one check ran | 0 | | Any check returns `Status: Fail` | 1 | -| All checks ran were skipped (e.g., `--remote` against a project missing prerequisites) | 2 | +| All checks ran were skipped (e.g., full sweep against a project missing prerequisites, or `--local-only` with all local prereqs short-circuited) | 2 | Precedence: any `Fail` always wins (exit 1). Skip-only without any pass is the dependency-cascade case (exit 2). All-pass or pass+skip with at least one pass is exit 0. @@ -387,7 +388,7 @@ The Next: block is purely additive; no existing exit code, error type, or stdout - Success-path JSON output (`--output json` on `init`/`run`/`invoke`/`show`) is unchanged — the human `Next:` block is suppressed in that mode (see [Output discipline](#output-discipline)). - The deploy hook adds a `Metadata["note"]` value to artifacts that already exist; the artifact `Kind` stays `ArtifactKindEndpoint`. Tools parsing artifacts by kind see no change. -- `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--remote` and `--output json` flags are opt-in. +- `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--local-only` and `--output json` flags are opt-in. - `nextstep` and `doctor` are new internal packages under the extension. No public Go API surface is added. - Existing extension-side env vars and behaviors (e.g., `AZD_AGENT_NO_NEXTSTEPS`, if shipped — see Open Questions) are documented in `cli/azd/docs/environment-variables.md` per repo convention. From 9595e0699e604fde3aca4d146410fb3e23261d17 Mon Sep 17 00:00:00 2001 From: Antriksh Jain Date: Mon, 11 May 2026 11:43:13 +0530 Subject: [PATCH 5/5] docs(azure.ai.agents): address trangevi review and tighten spec voice - Configuration sources: drop unified-azure.yaml assumption; commit to two-file split (azure.yaml + agent.yaml) and AssembleState abstraction. - State Layering: row 3 now correctly attributes protocol/config.env refs to agent.yaml. - Stale state definition: spell out Missing / Empty / Drift rules. - Output discipline: piped/CI is suppressed (not stderr); add structured nextStep JSON field. - Hook Points: end-of-runE is the rule; run/deploy are the documented exceptions. - No opt-out env var: Next: block is always on when output discipline allows. - Sweep stale stderr/placeholder/checks-7-11/reviewer-voice references for spec consistency. - Add inline cspell:ignore directives so cspell-lint passes. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../azd-ai-agent-doctor-remote-checks.md | 6 +- cli/azd/docs/design/azd-ai-agent-nextsteps.md | 59 ++++++++++++------- 2 files changed, 43 insertions(+), 22 deletions(-) diff --git a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md index 68ad7df228a..50a7419bdbd 100644 --- a/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md +++ b/cli/azd/docs/design/azd-ai-agent-doctor-remote-checks.md @@ -1,3 +1,5 @@ + + # Design: `azd ai agent doctor` — Remote Checks (7–12) ## Status @@ -57,7 +59,7 @@ These all require live calls and credentials, which is why they're carved out in cli/azd/extensions/azure.ai.agents/internal/cmd/ └── doctor/ ├── checks_local.go ← (MVP) checks 1–6 - ├── checks_remote.go ← NEW: checks 7–11 + ├── checks_remote.go ← NEW: checks 7–12 ├── runner.go ← orchestrates ordered execution └── types.go ← Check, Result, Severity ``` @@ -145,7 +147,7 @@ The single-shot probe avoids paging and works on tiny test projects. Timeout: 10 **What it does:** parses each service's `agent.yaml` for model references, then queries the Foundry project's deployments list once and matches names locally. -> **Tracking note ([#7962](https://github.com/Azure/azure-dev/issues/7962)).** Once the `agent.yaml` → `azure.yaml` unification lands, this check (and any other reference to `agent.yaml` in this doc) reads model refs from `azure.yaml` instead. Syncing with @trangevi on timing. +> **Tracking note ([#7962](https://github.com/Azure/azure-dev/issues/7962)).** Once the `agent.yaml` → `azure.yaml` unification lands, this check (and any other reference to `agent.yaml` in this doc) reads model refs from `azure.yaml` instead. **Pass:** "all referenced models present" **Fail:** "missing: (referenced by )" → `azd provision` (or in the rare case the user deleted a deployment manually, "redeploy via Foundry portal" with link). diff --git a/cli/azd/docs/design/azd-ai-agent-nextsteps.md b/cli/azd/docs/design/azd-ai-agent-nextsteps.md index 8936b985b2f..ead99018c15 100644 --- a/cli/azd/docs/design/azd-ai-agent-nextsteps.md +++ b/cli/azd/docs/design/azd-ai-agent-nextsteps.md @@ -1,3 +1,5 @@ + + # Design: Context-Aware Next-Step Guidance for `azd ai agent` ## Status @@ -16,9 +18,14 @@ Non-goals: - Sample-author hooks, README markers, or YAML-declared hints — out of scope for this design. - Authoring or editing core azd packages (`cli/azd/pkg/...`, `cli/azd/cmd/...`). The design only *consumes* the existing `Metadata["note"]` rendering contract. -## Assumptions +## Configuration sources + +This design ships against today's two-file split; the unified `azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) is not a prerequisite: -This design assumes the unified-`azure.yaml` proposal in [#7975](https://github.com/Azure/azure-dev/issues/7975) has landed: a single `azure.yaml` is the canonical config; `agent.yaml` and `agent.manifest.yaml` no longer exist as separate files. References to `agent.yaml` in the doctor checks and decision trees below are placeholders for the equivalent fields in the unified `services.` stanza. If the extension ships before unification, each `agent.yaml` reference should be read as the corresponding fields under `services.` in `azure.yaml`. The `AssembleState` function abstracts the source split, so resolvers are insulated from the migration — only the state-assembly code needs to change. +- `azure.yaml` — service registration (`services..host: azure.ai.agent`), language, source path. +- `agent.yaml` — agent definition: protocol, models, tools, `config.env` references. + +`AssembleState` reads from both: structural fields (host, language, source) come from `azure.yaml`; agent-runtime fields (protocol, env refs) come from `agent.yaml`. When unification eventually lands, only `AssembleState` changes — every resolver above it sees the same `State` shape. References to `agent.yaml` in the decision trees and doctor checks below are literal, not placeholders. The extension's authentication remains `azidentity.NewAzureDeveloperCLICredential` (see [`agent_context.go`](../../extensions/azure.ai.agents/internal/cmd/agent_context.go)). User-facing fix commands therefore say `azd auth login`, not `az login`. @@ -92,7 +99,7 @@ One package owns *all* policy. Each command is a thin caller — it knows when i └────────────────────────────────────────────────────────────────────┘ ``` -`deploy` is the special case: an extension's `Deploy()` returns `[]*azdext.Artifact` and that is the extension's only output channel after the call returns. We populate the artifact's note field with the formatted Next: block — exactly the same mechanism `service_target_agent.go` already uses today to print endpoint URLs. Everywhere else, the extension writes directly to stdout (or stderr when piped — see [Output Discipline](#output-discipline)). +`deploy` is the special case: an extension's `Deploy()` returns `[]*azdext.Artifact` and that is the extension's only output channel after the call returns. We populate the artifact's note field with the formatted Next: block — exactly the same mechanism `service_target_agent.go` already uses today to print endpoint URLs. Everywhere else, the extension writes directly to stdout, and only on a TTY — non-TTY runs suppress the block entirely (see [Output Discipline](#output-discipline)). ## State Model @@ -144,10 +151,20 @@ The resolver works with whatever's available — partial state never silences gu |---|---|---| | 1 | Explicit CLI flags (`--project-endpoint`, `--agent`) | Highest | | 2 | Live runtime probes (OpenAPI, Foundry status) | Override env vars | -| 3 | `azure.yaml` services, protocols, `config.env` refs | Structural truth | +| 3 | `azure.yaml` (services, language) + `agent.yaml` (protocol, `config.env` refs) | Structural truth | | 4 | azd env vars | Resolution truth | -If the azd environment is missing or stale (a real scenario for brownfield users), the resolver should still produce useful output from layers 1–3. +If the azd environment is missing or stale, the resolver should still produce useful output from layers 1–3. + +**"Stale" definition.** An env var is stale when any of the following holds: + +1. **Missing** — the key is not present in the current `.azure//.env` file but layer 3 expects it (e.g., `agent.yaml` declares `config.env: AGENT_FOO_ENDPOINT` and the env file has no entry). +2. **Empty** — the key is present but its value is the empty string. +3. **Drift** — the key is present and non-empty, but a higher-priority layer disagrees with it. The resolver detects two concrete drift cases: + - The corresponding `agent.yaml` field changed (e.g., agent renamed in `azure.yaml` while the env still holds the old `AGENT__ENDPOINT`). + - The OpenAPI/Foundry probe (layer 2) returned a different endpoint than the env value. + +Drift is *informational* — the resolver still emits a Next: line ("env appears stale; re-run `azd provision` or `azd env refresh`") rather than blocking. The resolver does not call ARM to confirm whether the recorded endpoint still resolves; that's doctor's responsibility. ## Decision Tree (per command) @@ -234,12 +251,12 @@ Ground rules: | Scenario | Destination | |---|---| | Interactive TTY | `stdout` | -| Piped output / CI (detected via `isTerminal`) | `stderr` so it doesn't pollute parsable output | -| `--output json` (success-path commands) | `Next:` block suppressed entirely | -| `azd ai agent doctor --output json` | **Not** the suppression rule — emits the structured check-result array (see [Exit codes & JSON output](#exit-codes--json-output)). Human-readable `Next:` lines are still suppressed; redacted/non-redacted RBAC details appear there with an explicit `redacted` flag | -| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (`Metadata["note"]` on an `ArtifactKindEndpoint` artifact — same path used today for endpoint URLs in `service_target_agent.go`) | +| Piped output / CI (detected via `isTerminal`) | **Suppressed.** The human `Next:` block is a TTY-only affordance. Writing it to stdout would pollute parsable command output; writing it to stderr would be misread as a warning or failure by callers grepping stderr. Machine consumers opt into structured guidance via `--output json`. | +| `--output json` (success-path commands) | Human `Next:` block suppressed; the same guidance is exposed as an optional `nextStep: { primary, secondary?, note? }` field on the existing JSON envelope. | +| `azd ai agent doctor --output json` | Structured check-result array (see [Exit codes & JSON output](#exit-codes--json-output)). Human-readable `Next:` lines are suppressed; redacted/non-redacted RBAC details appear there with an explicit `redacted` flag. | +| Inside `azd deploy` | Attached to the returned `azdext.Artifact` (`Metadata["note"]` on an `ArtifactKindEndpoint` artifact — same path used today for endpoint URLs in `service_target_agent.go`). | -The `isTerminal` check uses the existing helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). All other extension code that needs to detect interactive vs. non-interactive output should call the same helper to keep the source of truth singular. +The `isTerminal` check uses the existing helper in `internal/cmd/helpers.go` (which wraps `golang.org/x/term.IsTerminal`). All other extension code that needs to detect interactive vs. non-interactive output calls the same helper to keep the source of truth singular. ### Multi-line note pre-indentation @@ -247,14 +264,16 @@ The `azdext.Artifact` SDK field used here renders only the **first line** at the ## Hook Points +Wherever possible, hooks fire at the **end of `runE`** — the same call site every command uses for cleanup — so all next-step emission flows through one helper, `nextstep.PrintNext`. Two commands cannot use that hook; both deviations are documented in the table below. + | Command | Trigger | Mechanism | |---|---|---| | `init` | end of `runE` after success message | direct `nextstep.PrintNext(w, …)` | -| `run` | "agent is listening" callback | direct print before `Press Ctrl+C` line | | `invoke --local` / `invoke` | end of `runE` on success | direct print | -| `show` | end of table render | direct print | -| `doctor` | after rendering check report | direct print | -| `deploy` | inside `Deploy()` return path | populate `Metadata["note"]` on the existing `ArtifactKindEndpoint` artifact returned by `service_target_agent.go` (~line 720). `pkg/project/artifact.go` (`ToString`, ~line 128) renders the value as a continuation line under the artifact bullet, gated on `Kind == ArtifactKindEndpoint`. We do **not** introduce a new artifact kind. | +| `show` | end of `runE` (after table render returns) | direct print | +| `doctor` | end of `runE` (after report renders, before `os.Exit`) | direct print | +| `run` | "agent is listening" callback (**not** end of `runE`) | direct print before `Press Ctrl+C` line. `run` blocks until SIGINT, so end-of-`runE` would print on shutdown — too late to be useful. | +| `deploy` | inside `Deploy()` return path (**not** the deploy command's `runE`, which lives in core azd) | populate `Metadata["note"]` on the existing `ArtifactKindEndpoint` artifact returned by `service_target_agent.go` (~line 720). `pkg/project/artifact.go` (`ToString`, ~line 128) renders the value as a continuation line under the artifact bullet, gated on `Kind == ArtifactKindEndpoint`. We do **not** introduce a new artifact kind. | The deploy hook returns its Next: block through the same `azdext.Artifact` SDK field that `service_target_agent.go` already uses for endpoint URLs. No new APIs, no edits outside the extension. Everywhere else, the extension owns its terminal directly. @@ -298,7 +317,7 @@ All failures are silent. The fetch itself is best-effort and already cached — | 3 | azd environment selected | "" | `azd env new` / `azd env select` | | 4 | Agent service in `azure.yaml` | service count + names | `azd ai agent init` | | 5 | `AZURE_AI_PROJECT_ENDPOINT` set | URL value | `azd provision` | -| 6 | `agent.yaml` per service is valid (placeholder for unified-schema service stanza) | path | fix YAML | +| 6 | `agent.yaml` per service is valid | path | fix YAML | | 7 (post-MVP) | Authentication via `azidentity.NewAzureDeveloperCLICredential` (the credential `agent_context.go` already uses) | signed-in user + token validity | `azd auth login` | | 8 (post-MVP) | Foundry project reachable | endpoint probe OK | network/firewall | | 9 (post-MVP) | Model deployments exist | name + version | `azd provision` | @@ -370,7 +389,7 @@ The `redacted` field is `true` when `--output json` is combined with non-interac | Layer | What | How | |---|---|---| | `nextstep/format` | indentation, blank-line rules, secondary suppression, empty-suggestions early return | table-driven unit tests | -| `nextstep/format` | TTY vs piped routing (stdout vs stderr); `--output json` suppression of the human `Next:` block | snapshot tests with stub `isTerminal` and a fake `OutputFormat` flag | +| `nextstep/format` | TTY vs non-TTY routing (stdout vs suppressed); `--output json` suppression of the human `Next:` block | snapshot tests with stub `isTerminal` and a fake `OutputFormat` flag | | `nextstep/resolver` | every branch of every per-command tree | table-driven, fake `State` | | `nextstep/resolver` | non-interactive auth handling — when `IsAuthenticated == AuthUnknown`, no auth-conditional advice (no "azd auth login" appears in `Next:`) | dedicated table cases | | `nextstep/openapi` | each extraction path + corruption / empty fallback / unresolved `$ref` → `""` | golden specs in `testdata/` | @@ -390,7 +409,7 @@ The Next: block is purely additive; no existing exit code, error type, or stdout - The deploy hook adds a `Metadata["note"]` value to artifacts that already exist; the artifact `Kind` stays `ArtifactKindEndpoint`. Tools parsing artifacts by kind see no change. - `azd ai agent doctor` is a new command — its addition cannot break existing scripts. The new `--local-only` and `--output json` flags are opt-in. - `nextstep` and `doctor` are new internal packages under the extension. No public Go API surface is added. -- Existing extension-side env vars and behaviors (e.g., `AZD_AGENT_NO_NEXTSTEPS`, if shipped — see Open Questions) are documented in `cli/azd/docs/environment-variables.md` per repo convention. +- No new environment variables are introduced. The Next: block is always rendered when output discipline allows it; there is no opt-out env var. If any future change to the `Metadata["note"]` rendering behavior in `pkg/project/artifact.go` ships, the regression test pinned in this design's [Output Discipline](#output-discipline) section will break loudly. The fix is extension-side reformatting; no core edits. @@ -409,7 +428,7 @@ If any future change to the `Metadata["note"]` rendering behavior in `pkg/projec ## Deferred follow-ups -The reviewer surfaced three additional concerns that are out of scope for this design pass and will be handled separately: +Three additional concerns are out of scope for this design pass and will be handled separately: - **User personas / target audiences.** Not standard in design docs in this repo (see `cli/azd/docs/design/`); we treat "developers building with the agents extension" as the implicit audience. - **Success metrics / KPIs.** Same rationale; metrics will be wired up alongside telemetry in the follow-up design once we have something concrete to measure. @@ -421,7 +440,7 @@ The reviewer surfaced three additional concerns that are out of scope for this d |---|---| | Resolver guidance contradicts what the command actually did (false advice) | Each resolver receives the *post-execution* state. Tests cover deployed / not-deployed / partial-env permutations. | | OpenAPI fetch slows down `run` startup | Already capped + cached + silent on failure. We do not block the listening notice on the fetch. | -| Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on piped stderr; user can ignore. | +| Output noise: every command grows by 3 lines | One primary + ≤1 secondary; suppressed under `--output json` and on non-TTY output; user can ignore. | | Decision tree drifts from sample reality | All trees are table-driven and unit-tested per branch. Sample contract changes go through the same review. | | Artifact-note rendering behavior could shift in a future SDK update | Pin the indentation contract with an extension-side regression test that constructs an artifact and asserts the formatted string. If the SDK changes shape, the test breaks loudly and we adapt formatting — still no edits required outside the extension. | | Multi-agent projects produce wall-of-text Next: blocks | `ResolveAfterDeploy` produces one `show ` and one `invoke ` line per agent. Multi-agent walls remain readable for the project sizes we expect (≤ ~10). If the wall becomes a real problem in practice, fall back to a single `azd ai agent show` (no args, lists all) line; this can be tuned in a follow-up without changing the resolver contract. | @@ -432,7 +451,7 @@ The reviewer surfaced three additional concerns that are out of scope for this d 2. **Wire success paths** — `init`, `run`, `invoke`, `show`. Resolver per command. 3. **Deploy hook** — attach Next: block to the returned `azdext.Artifact` (same SDK field already used for endpoint URLs in `service_target_agent.go`). README-on-disk verification. 4. **`doctor` command** — checks 1–6 (local-only). Wire trailing Next: through resolver. -5. **(Follow-up)** Doctor checks 7–11 (auth, reachability, RBAC, deployments, agent status). See [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md). +5. **(Follow-up)** Doctor checks 7–12 (auth, reachability, model deployments, RBAC, agent status, end-to-end probe). See [`azd-ai-agent-doctor-remote-checks.md`](./azd-ai-agent-doctor-remote-checks.md). Each phase is independently shippable and reviewable. Phases 1–4 are the MVP for #7975.