Rework compaction to use history-replacement model

Junior's current visible-thread compaction accumulates structured XML summaries alongside a trimmed live transcript. Codex uses a full history-replacement model that is simpler to reason about, avoids unbounded summary accumulation, and better preserves recent user context across long threads.

Important correction: in Junior, `conversation-memory.ts` is not the only context-growth surface. After a completed agent turn, the next normal turn primarily seeds the model from durable Pi checkpoint history, not from `conversation.compactions`.

## Current Behavior

- `packages/junior/src/chat/services/conversation-memory.ts` compacts persisted Slack conversation state pre-turn:
  - strategy: summarize the oldest message batch, append it as a `<compaction>` inside `<thread-compactions>`, then trim `conversation.messages` from the front
  - summary format: XML sections for active asks, completed/superseded asks, and facts
  - cap: 16 compaction records, then oldest summaries are merged
- `buildConversationContext(...)` renders the compacted Slack transcript for routing/thinking and for initial prompt background when no Pi history exists.
- `reply-executor.ts` usually loads durable Pi messages from the active or last completed turn checkpoint with `loadPiMessagesForTurn(...)`.
- `respond.ts` suppresses `<thread-background>` when `context.piMessages` exists, then seeds `agent.state.messages` directly from those Pi messages.

That means conversation-memory compaction can shrink the routing and thread-background context, while Pi history can still grow because no Codex-style replacement step is applied to it.

## Gap

- The primary agent history is durable Pi checkpoint history, which is not compacted by `conversation-memory.ts`.
- Compaction records in visible conversation state still consume bounded-but-real routing/background context.
- Older user wording only survives through summaries unless it remains in the live message tail or reusable Pi history.
- Estimated tokens can diverge from actual model usage, especially with tool-heavy turns, richer message parts, and attachments.
- Codex's initial-context reinjection distinction matters for Junior only if we later add mid-turn compaction. Pre-turn compaction can rely on normal per-turn runtime context injection.

## Proposed Direction

Implement Codex-style history replacement for Junior's durable Pi history, and keep conversation-memory compaction as the secondary visible-thread/routing compaction surface.

### 1. Compact Durable Pi History

Add a local context compactor that operates on `PiMessage[]` from a completed turn checkpoint. It should summarize current history, then install a bounded replacement history instead of appending more summary records.

Replacement shape:

- recent real user messages up to a configured token budget
- one synthetic handoff summary message

Do not include stale `<runtime-turn-context>` parts, raw image/base64 payloads, or unbounded tool output in the retained messages.
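The replacement shape above can be sketched roughly as follows. This is a minimal illustration, not Junior's implementation: the simplified `PiMessage` shape, the `buildReplacementHistory` name, the 4-characters-per-token estimate, the ordering (retained user messages first, summary last), and carrying the summary as a user-role message are all assumptions made for the example.

```typescript
// Hypothetical, simplified message shape; the real PiMessage type may differ.
type PiMessage = { role: "user" | "assistant" | "tool"; text: string };

// Rough fallback estimate (about 4 characters per token). An assumption, not a tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Drop stale runtime context blocks from text that will be retained.
const stripRuntimeContext = (text: string): string =>
  text.replace(/<runtime-turn-context>[\s\S]*?<\/runtime-turn-context>/g, "").trim();

function buildReplacementHistory(
  history: PiMessage[],
  summary: string,
  userTokenBudget: number,
): PiMessage[] {
  const retained: PiMessage[] = [];
  let used = 0;
  // Walk newest to oldest so the most recent user messages claim the budget first.
  for (const message of [...history].reverse()) {
    if (message.role !== "user") continue; // assistant and tool output is not retained
    const text = stripRuntimeContext(message.text);
    if (text.length === 0) continue;
    const cost = estimateTokens(text);
    if (used + cost > userTokenBudget) break; // stop at the first message that does not fit
    retained.unshift({ role: "user", text }); // keep chronological order
    used += cost;
  }
  // Exactly one synthetic handoff summary (role is an assumption of this sketch).
  return [...retained, { role: "user", text: summary }];
}
```

Because the function returns a fresh array instead of appending to the existing history, repeated compactions cannot accumulate summary records.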

### 2. Handoff-Oriented Summary Prompt

Switch from XML-section summaries to a concise handoff prompt modeled on Codex:

- current outstanding asks
- key decisions and completed outcomes
- durable context, constraints, user preferences, IDs, URLs, artifact/canvas/sandbox references
- clear next steps or unresolved blockers

The summary should be easy for the resumed agent to consume as narrative context, not a nested compaction log.
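As a rough sketch of what a handoff-oriented prompt builder might look like: the wording below is illustrative only and is not Codex's actual prompt, and `buildHandoffPrompt` and `HANDOFF_SECTIONS` are hypothetical names.

```typescript
// Topics taken from the list above; the exact wording is an assumption.
const HANDOFF_SECTIONS = [
  "Current outstanding asks",
  "Key decisions and completed outcomes",
  "Durable context: constraints, user preferences, IDs, URLs, and artifact/canvas/sandbox references",
  "Clear next steps or unresolved blockers",
] as const;

// Build a single narrative-summary prompt from a rendered transcript.
function buildHandoffPrompt(transcript: string): string {
  const bullets = HANDOFF_SECTIONS.map((section) => `- ${section}`).join("\n");
  return [
    "You are handing this conversation off to an agent that will resume it.",
    "Write a concise narrative summary that covers:",
    bullets,
    "Fold any earlier summaries into one narrative instead of nesting them.",
    "",
    "Conversation:",
    transcript,
  ].join("\n");
}
```

Asking for one flat narrative, instead of XML sections appended per batch, is what keeps the output from turning into a nested compaction log.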

### 3. Preserve Recent User Messages

Retain recent user-authored messages verbatim up to a token budget. Codex's local compaction retains up to 20K tokens of recent user messages. Junior should start with a local constant and tune it from observed usage.

### 4. Trigger Before Fresh Turns

Run automatic compaction after `loadPiMessagesForTurn(...)` has loaded reusable history and before `generateAssistantReply(...)` receives `piMessages`.

Do not compact an `awaiting_resume` checkpoint. Timeout/auth resume depends on safe Pi `continue()` boundaries and should keep exact checkpoint history until the turn completes.
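The trigger rule can be expressed as a small guard evaluated between history load and agent execution. This is a sketch under assumptions: the `LoadedHistory` shape, the `"completed"` state name, and `shouldCompactBeforeTurn` are hypothetical, and only `awaiting_resume` comes from the text above.

```typescript
// "awaiting_resume" is named in this issue; "completed" is an assumed label.
type CheckpointState = "completed" | "awaiting_resume";

// Hypothetical view of what the history loader returns.
interface LoadedHistory {
  state: CheckpointState;
  estimatedTokens: number;
}

function shouldCompactBeforeTurn(
  loaded: LoadedHistory | undefined,
  thresholdTokens: number,
): boolean {
  if (!loaded) return false; // no reusable Pi history, nothing to compact
  if (loaded.state === "awaiting_resume") return false; // keep exact history for continue()
  return loaded.estimatedTokens >= thresholdTokens;
}
```

Keeping the resume check ahead of the size check means an oversized `awaiting_resume` checkpoint is still left untouched until its turn completes.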

### 5. Derive Thresholds From Model Context Windows

Do not keep a hardcoded 9K-token threshold. Automatic Pi-history compaction should derive its threshold from the active agent model's advertised context window and reserve output headroom. Visible conversation-state compaction should use the same budget rule against the fast model.

`AI_MODEL_CONTEXT_WINDOW_TOKENS` and `AI_FAST_MODEL_CONTEXT_WINDOW_TOKENS` should override advertised context windows when provider metadata is missing, stale, or intentionally constrained.
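One possible way to derive the threshold is sketched below. The function names, the fallback window, the output reserve, and the 0.8 fill fraction are all assumptions for illustration; only the two environment variable names come from this issue.

```typescript
// Prefer a valid positive override, then provider metadata, then a fallback.
function resolveContextWindow(
  envValue: string | undefined,
  advertised: number | undefined,
  fallback: number,
): number {
  const override = Number(envValue);
  if (envValue !== undefined && Number.isFinite(override) && override > 0) {
    return Math.floor(override);
  }
  return advertised ?? fallback;
}

// Reserve output headroom, then trigger at a fraction of what remains.
function compactionThreshold(
  contextWindow: number,
  outputReserve: number,
  fraction = 0.8, // assumed default, to be tuned
): number {
  const usable = Math.max(0, contextWindow - outputReserve);
  return Math.floor(usable * fraction);
}
```

A caller might pass `process.env.AI_MODEL_CONTEXT_WINDOW_TOKENS` for the agent model and `process.env.AI_FAST_MODEL_CONTEXT_WINDOW_TOKENS` for the fast model, so both compaction surfaces share one budget rule.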

### 6. Internal Context Forking

No Slack message, slash command, or model tool should invoke compaction directly. Manual/user-facing compaction is out of scope.

Junior should expose an internal compaction command for future orchestration strategies, such as handing off to a coding agent or upgrading the model. That internal command should take the current reusable completed Pi history plus visible thread state, summarize/compact it, create a fresh compacted model history, and make that compacted history the next reusable context for the thread.

Implementation should create a new synthetic completed compaction checkpoint/session and update `conversation.processing.lastSessionId` to it, rather than destructively rewriting the prior completed turn checkpoint in place. Keeping the pre-compaction checkpoint available makes the fork auditable and avoids corrupting timeout/auth recovery state.
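The non-destructive fork could look roughly like this. It is a sketch over an in-memory `Map`, not Junior's persistence layer: the `Checkpoint` and `ConversationState` shapes, the `compactedFrom` field, and `forkCompactedCheckpoint` are hypothetical, and only `conversation.processing.lastSessionId` is taken from the text above.

```typescript
// Hypothetical, simplified shapes for illustration only.
type PiMessage = { role: string; text: string };

interface Checkpoint {
  sessionId: string;
  status: "completed";
  piMessages: PiMessage[];
  compactedFrom: string | undefined; // assumed audit pointer to the source checkpoint
}

interface ConversationState {
  processing: { lastSessionId: string | undefined };
}

function forkCompactedCheckpoint(
  store: Map<string, Checkpoint>,
  conversation: ConversationState,
  replacement: PiMessage[],
  newSessionId: string,
): Checkpoint {
  if (store.has(newSessionId)) {
    throw new Error(`session ${newSessionId} already exists`);
  }
  const forked: Checkpoint = {
    sessionId: newSessionId,
    status: "completed",
    piMessages: [...replacement],
    compactedFrom: conversation.processing.lastSessionId,
  };
  store.set(newSessionId, forked); // the prior checkpoint is left untouched
  conversation.processing.lastSessionId = newSessionId; // next turn seeds from the fork
  return forked;
}
```

Since the earlier checkpoint is never mutated, it stays available for auditing and for any recovery path that still references it.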

### 7. Keep Conversation State Bounded Too

Keep visible conversation-state compaction bounded for subscribed routing, thinking selection, and no-Pi-history prompt background. It can continue to retain bounded chunk summaries plus recent visible messages as long as it does not grow without bound.

### 8. Improve Token Accounting

Track server-reported input tokens for the last model call, or for the largest single call, where available, and use character estimates only as a fallback. Avoid using cumulative turn usage as the trigger by itself, because multi-step or tool-heavy turns sum input tokens across several model calls and overcount.
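A minimal sketch of that preference order, assuming a hypothetical `ModelCallUsage` record per model call and a 4-characters-per-token fallback (both are assumptions, not existing Junior types):

```typescript
// Hypothetical per-call usage record; inputTokens is absent when the provider omits it.
interface ModelCallUsage {
  inputTokens?: number;
}

function estimateHistoryTokens(calls: ModelCallUsage[], historyChars: number): number {
  const reported = calls
    .map((call) => call.inputTokens)
    .filter((n): n is number => typeof n === "number" && n > 0);
  if (reported.length > 0) {
    // Use the largest single call, not the sum across calls.
    return Math.max(...reported);
  }
  // Fallback only: rough character-based estimate.
  return Math.ceil(historyChars / 4);
}
```

For a turn with calls reporting 1,000 and 4,000 input tokens, this returns 4,000, whereas a cumulative total would report 5,000 and trigger compaction earlier than the context size warrants.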

### 9. Defer Mid-Turn Compaction

Do not implement Codex's mid-turn compaction first. It requires special initial-context injection before the last real user message and must preserve Pi `continue()` semantics. Pre-turn compaction covers the main Junior risk with much lower recovery complexity.

## Prior Art

- Codex local compaction: `codex-rs/core/src/compact.rs`
  - full history replacement
  - handoff summary prompt
  - recent user-message retention up to 20K tokens
  - automatic and internally callable triggers
  - optional initial-context injection for mid-turn compaction

Codex should be treated as algorithmic prior art only. Junior does not have, and should not depend on, a remote/server-side compaction endpoint. The summarization, retained-message selection, replacement-history construction, persistence, triggers, and verification all need to be implemented inside Junior.

## Implementation Plan

1. Add/update the relevant spec to state that compaction owns durable Pi history, while Slack conversation state remains the visible thread/routing source.
2. Add a `chat/services/context-compaction.ts` compactor for message selection, token estimation, summary prompting, and replacement-history construction.
3. Add persistence support to create a new completed compaction checkpoint/session from replacement Pi history and point `conversation.processing.lastSessionId` at it.
4. Wire automatic pre-turn compaction in `reply-executor.ts` after Pi history load and before agent execution.
5. Update `conversation-memory.ts` thresholds to derive from model context windows instead of a fixed 9K/7K window.
6. Add an internal compaction command for future orchestration flows; do not add user-facing Slack commands or model tools.
7. Add focused unit tests for replacement selection/truncation and checkpoint forking.
8. Add integration coverage proving a long Slack thread uses compacted Pi history on the next turn.
9. Add an eval for long-thread continuity after compaction if the expected behavior depends on model interpretation.

## Scope Notes

- Remote/server-side compaction is not available for Junior and should not be part of this design.
- User-facing/manual compaction is out of scope.
- Pre/post compact hooks are optional and can be deferred.
- Mid-turn compaction should be deferred until pre-turn replacement is working and verified.
- This should not change timeout/auth checkpoint resume semantics.

Action taken on behalf of David Cramer.


Rework compaction to use history-replacement model #431