fix(agent-core): recover interrupted tool exchanges #273

Open: haosenwang1018 wants to merge 1 commit into MoonshotAI:main from haosenwang1018:fix/recover-interrupted-tool-exchange

@haosenwang1018
Related Issue

Resolves #269

Problem

If Kimi Code is force-closed after a tool.call is recorded but before the matching tool.result and step.end are written, session replay rebuilds an assistant message with unresolved tool_calls. The next LLM request then fails with a provider-side 400 because the replayed message history has no matching tool response.
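The broken history shape can be illustrated with a small sketch. The message types and the `findUnresolvedToolCallIds` helper below are hypothetical simplifications for illustration only; the real agent-core and provider types are assumed to differ.

```typescript
// Hypothetical, simplified message shapes (assumptions, not the agent-core types).
type ToolCall = { id: string; name: string };
type Message =
  | { role: "user"; content: string }
  | { role: "assistant"; content: string; toolCalls?: ToolCall[] }
  | { role: "tool"; toolCallId: string; content: string };

// Returns the ids of tool calls that never received a matching tool message.
function findUnresolvedToolCallIds(history: Message[]): string[] {
  const pending = new Set<string>();
  for (const message of history) {
    if (message.role === "assistant") {
      for (const call of message.toolCalls ?? []) pending.add(call.id);
    } else if (message.role === "tool") {
      pending.delete(message.toolCallId);
    }
  }
  return Array.from(pending);
}

// History as it might look after replaying a session that was force-closed
// between tool.call and tool.result.
const replayedHistory: Message[] = [
  { role: "user", content: "List the files" },
  { role: "assistant", content: "", toolCalls: [{ id: "call_1", name: "shell" }] },
  // No { role: "tool", toolCallId: "call_1", ... } entry follows.
];

console.log(findUnresolvedToolCallIds(replayedHistory)); // one unresolved id: "call_1"
```

A history for which this helper returns a non-empty list is the kind of request the problem statement describes as being rejected with a 400.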

What changed

  • Added replay-time recovery for interrupted open steps: any recorded tool call still missing a result is completed with an error tool.result, then deferred user/background messages are released back into history.
  • Kept legitimate sealed pending exchanges intact. The recovery only synthesizes missing tool results when replay reaches EOF with an open step, so existing async-tool/compaction behavior that intentionally keeps a pending exchange is not converted into an error.
  • Added a regression test that replays the broken wire shape from the issue and verifies the recovered history is provider-valid and durably appends the synthetic tool.result.
  • Added a patch changeset for @moonshot-ai/agent-core and @moonshot-ai/kimi-code.
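The recovery described in the first two bullets might be sketched as follows. This is a minimal sketch under stated assumptions, not the PR's implementation: `openSteps` and `pendingToolResultIds` echo names from the reviewed snippet, while `ReplayState`, `HistoryEntry`, `deferredMessages`, `recoverAtReplayEof`, the error text, and the decision to clear `openSteps` are all hypothetical.

```typescript
// Hypothetical history entries (assumed shapes, for illustration only).
type HistoryEntry =
  | { kind: "tool.result"; toolCallId: string; isError: boolean; output: string }
  | { kind: "message"; role: "user" | "background"; text: string };

interface ReplayState {
  openSteps: Set<string>;
  pendingToolResultIds: Set<string>;
  deferredMessages: HistoryEntry[];
  history: HistoryEntry[];
}

// Called once replay reaches EOF. Returns the tool call ids that were repaired.
function recoverAtReplayEof(state: ReplayState): string[] {
  // Only an open step signals an interruption; a sealed pending exchange is left alone.
  const missing = state.openSteps.size > 0 ? Array.from(state.pendingToolResultIds) : [];
  for (const toolCallId of missing) {
    state.history.push({
      kind: "tool.result",
      toolCallId,
      isError: true,
      output: "Tool call was interrupted before a result was recorded.", // assumed wording
    });
    state.pendingToolResultIds.delete(toolCallId);
  }
  if (missing.length > 0) {
    state.openSteps.clear(); // assumption: the repaired step is treated as finished
    // Release deferred user/background messages only after the exchange is complete.
    state.history.push(...state.deferredMessages);
    state.deferredMessages = [];
  }
  return missing;
}

const state: ReplayState = {
  openSteps: new Set(["step_1"]),
  pendingToolResultIds: new Set(["call_1"]),
  deferredMessages: [{ kind: "message", role: "user", text: "continue" }],
  history: [],
};
const recovered = recoverAtReplayEof(state);
console.log(recovered, state.history.length);
```

Ordering matters in this sketch: the synthetic result is appended before the deferred messages so that every tool call is answered before the next user turn appears in history.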

Verification

  • pnpm --filter @moonshot-ai/agent-core exec vitest run test/agent/records/index.test.ts test/agent/compaction/full.test.ts
  • pnpm --filter @moonshot-ai/agent-core run typecheck
  • pnpm exec oxlint packages/agent-core/src/agent/context/index.ts packages/agent-core/src/agent/records/index.ts packages/agent-core/test/agent/records/index.test.ts
  • pnpm --filter @moonshot-ai/agent-core run test
  • pnpm --filter @moonshot-ai/kimi-code run typecheck
  • pnpm --filter @moonshot-ai/kimi-code run test
  • pnpm changeset status --since=origin/main

Checklist

  • I have read the CONTRIBUTING document.
  • I have linked a related issue, or explained the problem above.
  • I have added tests that prove my feature works.
  • Ran gen-changesets skill, or this PR needs no changeset.
  • Ran gen-docs skill, or this PR needs no doc update.

changeset-bot Bot commented Jun 1, 2026

🦋 Changeset detected

Latest commit: 9d7eae9

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages:

| Name | Type |
| --- | --- |
| @moonshot-ai/agent-core | Patch |
| @moonshot-ai/kimi-code | Patch |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d7eae9796

// before the loop could finish pairing recorded tool calls.
const missingToolResultIds = this.openSteps.size > 0 ? [...this.pendingToolResultIds] : [];
for (const toolCallId of missingToolResultIds) {
  this.appendLoopEvent({
[P2] Include recovered messages in replay output

When the first resume repairs an interrupted tool exchange, this call runs after AgentRecords.restore has cleared records.restoring. ReplayBuilder.push only records messages while restoring, so the synthetic tool result created here, along with any deferred user/background messages flushed immediately afterward, is absent from ResumedAgentState.replay. The TUI renders that replay on resume, so the first resume after a crash can still show an unresolved tool call and omit the queued continuation, even though context and persistence were repaired. The transcript only looks correct after a second resume from the newly appended record.
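The ordering problem can be reproduced in miniature. In the sketch below only the `restoring` flag and the `push` method echo names from the comment; `ReplayBuilderSketch`, `resume`, and the string records are hypothetical stand-ins, and running the recovery while the flag is still set is shown as one possible direction rather than the fix the PR adopts.

```typescript
// Hypothetical miniature of a builder that only records while restoring.
class ReplayBuilderSketch {
  restoring = false;
  readonly replay: string[] = [];

  push(entry: string): void {
    if (this.restoring) this.replay.push(entry);
  }
}

const SYNTHETIC_RESULT = "tool.result(error, synthetic)";

// Replays the given records, then runs the recovery either before or after
// the restoring flag is cleared, and returns what the UI would render.
function resume(records: string[], recoverWhileRestoring: boolean): string[] {
  const builder = new ReplayBuilderSketch();
  builder.restoring = true;
  for (const record of records) builder.push(record);
  if (recoverWhileRestoring) builder.push(SYNTHETIC_RESULT);
  builder.restoring = false;
  // Pushed after the flag is cleared: silently dropped from the replay output.
  if (!recoverWhileRestoring) builder.push(SYNTHETIC_RESULT);
  return builder.replay;
}

const wire = ["user: list the files", "tool.call(call_1)"];
console.log(resume(wire, false)); // replay still ends with the unresolved tool.call
console.log(resume(wire, true)); // replay also contains the synthetic result
```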

// A sealed step can intentionally keep waiting for async tool output across
// context operations. An open step at replay EOF means the process stopped
// before the loop could finish pairing recorded tool calls.
const missingToolResultIds = this.openSteps.size > 0 ? [...this.pendingToolResultIds] : [];
[P2] Recover pending calls after compaction clears the step

If manual or auto compaction completes after a tool.call is recorded but before its tool.result, context.applyCompaction preserves the assistant/tool-call message but clears openSteps. After a crash at that point, replay reaches EOF with pendingToolResultIds still populated and openSteps.size === 0, so this guard synthesizes no result. The next model request still contains the compacted history with an assistant tool call and no matching tool message, reproducing the provider 400 this recovery is meant to prevent.
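The gap follows directly from the guard expression. The standalone function below lifts that expression out of its class so the two states can be compared; the function name and the sample ids are hypothetical, and the sets stand in for the instance fields.

```typescript
// Same conditional as the reviewed guard, extracted into a free function
// (hypothetical name) so it can be evaluated in isolation.
function computeMissingToolResultIds(
  openSteps: Set<string>,
  pendingToolResultIds: Set<string>,
): string[] {
  return openSteps.size > 0 ? Array.from(pendingToolResultIds) : [];
}

// Interrupted mid-step: the step is still open, so the pending call is repaired.
const interrupted = computeMissingToolResultIds(
  new Set<string>(["step_1"]),
  new Set<string>(["call_1"]),
);

// Crash after compaction cleared openSteps: the pending call is skipped
// even though it still has no result.
const afterCompaction = computeMissingToolResultIds(
  new Set<string>(),
  new Set<string>(["call_1"]),
);

console.log(interrupted.length, afterCompaction.length); // prints "1 0"
```

Because the guard also has to leave sealed async exchanges untouched, an empty `openSteps` alone cannot tell the two cases apart; distinguishing them would need some additional signal, which this sketch does not attempt to model.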


Development

Successfully merging this pull request may close these issues.

Session resume breaks after force-interrupt during tool execution (400 tool_call_ids missing)
