
Add stream-stall telemetry to Anthropic Messages streaming (#321432) #321671

Open
meganrogge wants to merge 4 commits into main from meganrogge/messages-api-stream-idle-watchdog


@meganrogge meganrogge commented Jun 16, 2026

Collaborator

What

Adds an observe-only idle watchdog to processResponseFromMessagesEndpoint in messagesApi.ts (the Anthropic Messages API path used by vscode-copilot-cli with Claude models). It detects when the streaming body stalls and emits a new messagesApi.streamIdleTimeout telemetry event. It does not change request behavior — the stream is not aborted.

Why

The streaming-stall class of hang in #321432: the HTTP response can return 200 with headers and then stall mid-stream — the body iterator (for await (const chunk of response.body)) stops producing chunks but never completes and never errors, so the request hangs indefinitely. Today only the success path is instrumented, so when this happens the stall is invisible in telemetry.

This was the root cause of 4 of the 6 X_AGENT_STILL_RESPONDING timeouts in MSBench run 27640643629 (the other 2 were genuine 60-min budget exhaustion). The stall is environment-agnostic (the eval harness just amplifies its frequency), so it likely affects real users too. We want to measure how often this happens in the wild before changing any request behavior.
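The failure mode in this section can be reproduced in isolation. The sketch below is illustrative only: `stalledBody` and `probeStream` are hypothetical names, and the generator merely stands in for `response.body`. It shows a `for await` loop that has received chunks but neither completes nor throws.

```typescript
// Hypothetical stand-in for response.body: yields two chunks, then stalls forever.
async function* stalledBody(): AsyncGenerator<string> {
  yield "chunk-1";
  yield "chunk-2";
  // Never settles: no further chunk, no error, no completion.
  await new Promise<never>(() => {});
}

type StreamState = "completed" | "failed" | "pending";

// Drain the body, but stop waiting after probeMs so the caller can see
// which state the loop is in. The loop itself is left untouched.
async function probeStream(
  body: AsyncIterable<string>,
  probeMs: number,
): Promise<{ state: StreamState; chunks: string[] }> {
  const chunks: string[] = [];

  const drained = (async (): Promise<StreamState> => {
    try {
      for await (const chunk of body) {
        chunks.push(chunk);
      }
      return "completed";
    } catch {
      return "failed";
    }
  })();

  let timer: ReturnType<typeof setTimeout> | undefined;
  const probe = new Promise<StreamState>((resolve) => {
    timer = setTimeout(() => resolve("pending"), probeMs);
  });

  const state = await Promise.race([drained, probe]);
  clearTimeout(timer);
  return { state, chunks };
}
```

With the stalled generator, the probe reports `pending` with both chunks received, which is the situation that success-path instrumentation never sees.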

What this change does

  • Arms an idle watchdog that is reset on every received chunk.
  • If no chunk arrives within the window, it emits messagesApi.streamIdleTimeout once and otherwise leaves the stream untouched.
  • The watchdog timer is always cleared in a finally.

This is intentionally non-invasive: no abort, no reject, no behavior change. A follow-up can add mitigation once we have data on frequency and shape.
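The behavior described in this section can be sketched roughly as follows. This is not the code in messagesApi.ts: `IdleWatchdog`, `Scheduler`, and `consumeWithWatchdog` are hypothetical names, and the injectable scheduler is an assumption added so that the timer logic can be exercised without real delays.

```typescript
// Hypothetical names throughout; an illustration of the observe-only pattern.
interface Scheduler {
  set(callback: () => void, ms: number): unknown;
  clear(handle: unknown): void;
}

const realScheduler: Scheduler = {
  set: (callback, ms) => setTimeout(callback, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
};

class IdleWatchdog {
  private handle: unknown = undefined;
  private fired = false;
  private readonly windowMs: number;
  private readonly onIdle: () => void;
  private readonly scheduler: Scheduler;

  constructor(windowMs: number, onIdle: () => void, scheduler: Scheduler = realScheduler) {
    this.windowMs = windowMs;
    this.onIdle = onIdle;
    this.scheduler = scheduler;
  }

  /** Arm or re-arm the timer: call once at stream start and once per chunk. */
  reset(): void {
    this.dispose();
    if (this.fired) {
      return; // report at most once per stream
    }
    this.handle = this.scheduler.set(() => {
      this.fired = true;
      this.handle = undefined;
      this.onIdle(); // observe only: nothing is aborted or rejected here
    }, this.windowMs);
  }

  /** Clear any pending timer; safe to call repeatedly. */
  dispose(): void {
    if (this.handle !== undefined) {
      this.scheduler.clear(this.handle);
      this.handle = undefined;
    }
  }
}

async function consumeWithWatchdog<T>(
  body: AsyncIterable<T>,
  windowMs: number,
  onIdle: () => void,
  onChunk: (chunk: T) => void,
): Promise<void> {
  const watchdog = new IdleWatchdog(windowMs, onIdle);
  watchdog.reset();
  try {
    for await (const chunk of body) {
      watchdog.reset(); // every received chunk restarts the idle window
      onChunk(chunk);
    }
  } finally {
    watchdog.dispose(); // always cleared, whether the stream ends or throws
  }
}
```

In this sketch `onIdle` is where a telemetry event would be emitted. Because the callback neither throws nor cancels anything, a stalled loop stays pending exactly as it would without the watchdog.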

Telemetry dimensions

messagesApi.streamIdleTimeout (classification SystemMetaData, purpose PerformanceAndHealth):

  • requestId, ghRequestId — correlate with server-side logs
  • idleMs — ms since the last chunk when the watchdog tripped
  • elapsedMs — ms since the stream started
  • chunksReceived — chunks received before the stall
  • completionsEmitted — completions emitted before the stall (0 = first-turn hang, >0 = mid-stream hang)
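
As a rough illustration of how these dimensions relate to each other, the sketch below derives them from simple per-stream bookkeeping. The interface and function names (`StreamIdleTimeoutPayload`, `StreamProgress`, `buildIdleTimeoutPayload`, `classifyStall`) are hypothetical and are not taken from the PR. Only the field names mirror the list above.

```typescript
// Hypothetical payload shape; field names mirror the dimensions listed above.
interface StreamIdleTimeoutPayload {
  requestId: string;
  ghRequestId: string;
  idleMs: number;
  elapsedMs: number;
  chunksReceived: number;
  completionsEmitted: number;
}

// Hypothetical bookkeeping that a streaming loop would maintain.
interface StreamProgress {
  startedAt: number; // timestamp (ms) when streaming began
  lastChunkAt: number; // timestamp (ms) of the latest chunk, or startedAt if none yet
  chunksReceived: number;
  completionsEmitted: number;
}

function buildIdleTimeoutPayload(
  progress: StreamProgress,
  ids: { requestId: string; ghRequestId: string },
  now: number,
): StreamIdleTimeoutPayload {
  return {
    requestId: ids.requestId,
    ghRequestId: ids.ghRequestId,
    idleMs: now - progress.lastChunkAt, // time since the last chunk
    elapsedMs: now - progress.startedAt, // time since the stream started
    chunksReceived: progress.chunksReceived,
    completionsEmitted: progress.completionsEmitted,
  };
}

// 0 completions means the hang happened before any output was produced.
function classifyStall(payload: StreamIdleTimeoutPayload): "first-turn" | "mid-stream" {
  return payload.completionsEmitted === 0 ? "first-turn" : "mid-stream";
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, is a design choice of this sketch that keeps the arithmetic easy to test.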

Related: #321640

The Anthropic Messages streaming body can return 200 with headers and then
hang mid-stream with no further chunk and no error, leaving the for-await
loop pending indefinitely. Add a 120s idle watchdog that resets on each
chunk, emits a messagesApi.streamIdleTimeout telemetry event when it trips
(so the stall is observable in the wild), rejects the iterator, and cancels
the underlying reader so the stream settles instead of hanging.
Copilot AI review requested due to automatic review settings June 16, 2026 21:14
@meganrogge meganrogge self-assigned this Jun 16, 2026
@meganrogge meganrogge added this to the 1.126.0 milestone Jun 16, 2026
Drop the abort/reject behavior so request behavior is unchanged. The idle
watchdog now only detects a stalled Anthropic Messages stream and emits the
messagesApi.streamIdleTimeout event once, so we can first measure how often
the #321432 stall happens in the wild before changing behavior.
@meganrogge meganrogge enabled auto-merge (squash) June 16, 2026 21:18

@meganrogge meganrogge changed the title from "Add idle watchdog + telemetry to Anthropic Messages streaming (#321432)" to "Add stream-stall telemetry to Anthropic Messages streaming (#321432)" Jun 16, 2026
kycutler previously approved these changes Jun 16, 2026
Comment thread extensions/copilot/src/platform/endpoint/node/messagesApi.ts Outdated

5 participants