feat: translate Anthropic image content blocks across all providers #82

Merged
rsdouglas merged 2 commits into openseed-dev:main from lucamorettibuilds:feat/proxy-image-content on Feb 28, 2026

@lucamorettibuilds
Contributor

What

Adds image block translation to the LLM proxy so creatures can include images (screenshots, UI mockups, web content) in their conversations, regardless of which provider backs the model.

Changes

Single file change to src/host/proxy.ts:

  • Helper: anthropicImageToDataUrl(), which converts an Anthropic image source into a URL string (a data URL for base64 sources; URL sources pass through as the original URL)
  • OpenAI Responses API: Anthropic image blocks → input_image items with data URL
  • Chat Completions (OpenRouter): Anthropic image blocks → image_url content parts in the standard content: [{type: 'image_url', ...}] format
  • Gemini: base64 image blocks → inlineData parts with mimeType and data. URL-based images are skipped since Gemini requires inline data.
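
The helper described above might look roughly like the following. This is a minimal sketch rather than the code from src/host/proxy.ts: the `ImageSource` interface is a hypothetical stand-in (the PR types the parameter as `any`), and the URL pass-through is inferred from the later review note that the helper "returns the URL string" for URL sources.

```typescript
// Hypothetical loose shape for an Anthropic image source.
// The actual PR types this parameter as `any`.
interface ImageSource {
  type: string;
  media_type?: string;
  data?: string;
  url?: string;
}

// Sketch: base64 sources become a data URL; URL sources pass through unchanged.
// Returns null when the source is missing or incomplete.
function anthropicImageToDataUrl(source: ImageSource | undefined): string | null {
  if (source?.type === "base64" && source.media_type && source.data) {
    return `data:${source.media_type};base64,${source.data}`;
  }
  if (source?.type === "url" && source.url) {
    return source.url;
  }
  return null;
}
```

Returning null for unusable sources lets each translator decide whether to skip the block or log it, instead of throwing inside the translation layer.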

Context

Addresses #73. Creatures speak Anthropic format natively, which already supports image content blocks. The proxy currently silently drops these blocks during translation — this change preserves them.

Complements #75 (image stripping for event logs): images flow through to the LLM via the proxy, but are stripped from persisted event data to avoid bloating logs.

Not included

  • URL-to-base64 fetching for Gemini (would need async fetch + error handling in the translation layer)
  • Image support in tool_result content blocks (Anthropic supports images there too, but it's a rarer pattern)

Add image handling to the LLM proxy translation layer so creatures
can send images (screenshots, UI mockups, web images) in their
conversations regardless of which provider backs the model.

- OpenAI Responses API: image blocks → input_image with data URL
- Chat Completions (OpenRouter): image blocks → image_url content parts
- Gemini: base64 image blocks → inlineData parts
  (URL-based images skipped — Gemini requires inline data)

Addresses openseed-dev#73 (image handling for creatures) — this is the proxy
translation piece. Creatures already speak Anthropic format which
supports image blocks natively; this ensures those blocks survive
translation to other providers.

Works alongside the image stripping in events.ts (openseed-dev#75) — images
flow through to the LLM but are stripped from event logs.
Contributor

@openseed-patch openseed-patch Bot left a comment

Reviewed the diff carefully. The translation logic is sound and the Gemini inline-data path is correct. Two things worth addressing:

1. Silent data loss when a Gemini URL-image is the only content in a turn (lines ~475-479)

When a user message is only an image with source.type === 'url', the parts array stays empty, the if (parts.length) check fails, and the entire user turn is silently dropped. The model sees a conversation with a gap. This is probably rare today (genomes don't send image-only turns), but it will bite someone. A warning log here would at least make it diagnosable:

} else if (block.type === 'image') {
  if (block.source?.type === 'base64' && ...) {
    parts.push({ inlineData: { ... } });
  } else if (block.source?.type === 'url') {
    // Gemini doesn't support remote URLs inline; skip but warn
    console.warn('[proxy] Gemini: dropping URL-based image block (not supported)');
  }
}

2. translateMessagesToChat: mixed text+image turn emits two consecutive user messages

A single Anthropic user turn containing [{type:"text",...}, {type:"image",...}] produces:

{ role: 'user', content: "Look at this:" }   // from text block
{ role: 'user', content: [{type:"image_url",...}] }  // from image block

OpenAI/OpenRouter chat completions reject (or silently merge) consecutive same-role messages. The fix is to accumulate all blocks from one turn into a single content array before pushing:

const parts: Array<{type: string, [k: string]: any}> = [];
for (const block of content) {
  if (block.type === 'text') parts.push({ type: 'text', text: block.text });
  else if (block.type === 'image') {
    const url = anthropicImageToDataUrl(block.source);
    if (url) parts.push({ type: 'image_url', image_url: { url } });
  }
}
if (parts.length) chatMessages.push({ role: 'user', content: parts });

(Note: the text+image split is pre-existing behaviour for the non-image case too — this PR doesn't make it worse, but adding images here makes the collision more likely to actually occur.)

The Responses API and Gemini translators are fine — Responses API accepts multiple items per role, Gemini accumulates into parts[] already.

Addresses openseed-patch review feedback on PR openseed-dev#82:

1. translateMessagesToChat: accumulate text+image blocks from a single
   Anthropic user turn into one chat message with a content array, instead
   of emitting consecutive same-role messages (rejected by OpenAI/OpenRouter).
   Tool result blocks still flush to separate tool-role messages.

2. translateMessagesToGemini: log a warning when URL-based image blocks
   are silently dropped (Gemini only supports inline base64). This makes
   the data loss diagnosable instead of silent.
@lucamorettibuilds
Contributor Author

Good catches on both points. Fixed in b99b272:

1. Gemini URL-image warning — Added a console.warn when URL-based image blocks are dropped so it's diagnosable instead of silently swallowing the turn.

2. Consecutive same-role messages — Reworked the Chat translation to accumulate all text + image blocks from a single Anthropic user turn into one content array before pushing. Tool result blocks still flush to separate tool-role messages (they need their own tool_call_id), but the interleaved text/image blocks no longer produce consecutive user messages.

The Responses API and Gemini translators were already correct as you noted — only the Chat path needed the fix.

@lucamorettibuilds
Contributor Author

Both concerns from the review are addressed in b99b272:

  1. Mixed content accumulation: text + image blocks from one user turn now accumulate into a single content array instead of being emitted as consecutive user messages, so multi-block messages translate correctly.

  2. Gemini URL-image drops: a console.warn is emitted when a URL-based image bound for Gemini is dropped (Gemini requires inline data), so callers get visibility into the lost content.

Ready for re-review when you get a chance.

Contributor

@openseed-patch openseed-patch Bot left a comment

Re-reviewed b99b272. Both fixes are correct.

Chat path accumulation — the userParts[] buffer correctly collects text and image blocks and flushes them as a single user message, either before a tool_result or at end-of-loop. The flush logic (length === 1 && type === 'text' ? string : array) preserves the simple string format for the common case and falls back to an array for mixed content. Edge cases I checked: only tool_result blocks (userParts stays empty, flush is skipped ✓), multiple consecutive tool_results (each flush guard handles it ✓), URL-only image in Chat path (anthropicImageToDataUrl returns the URL string, gets pushed as image_url block correctly ✓).
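
For readers without the diff open, the accumulate-and-flush pattern described here can be sketched as below. This is an illustration under assumptions, not the merged code: `translateUserTurn`, `imageUrlFor`, and the `AnthropicBlock` interface are hypothetical names, and tool_result content is stringified in a simplified way.

```typescript
type ChatPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

type ChatMessage =
  | { role: "user"; content: string | ChatPart[] }
  | { role: "tool"; tool_call_id: string; content: string };

// Hypothetical loose shape for an Anthropic content block.
interface AnthropicBlock {
  type: string;
  text?: string;
  source?: { type: string; media_type?: string; data?: string; url?: string };
  tool_use_id?: string;
  content?: unknown;
}

// Stand-in for the PR's anthropicImageToDataUrl helper (assumed behavior).
function imageUrlFor(source: AnthropicBlock["source"]): string | null {
  if (source?.type === "base64" && source.media_type && source.data) {
    return `data:${source.media_type};base64,${source.data}`;
  }
  if (source?.type === "url" && source.url) return source.url;
  return null;
}

function translateUserTurn(blocks: AnthropicBlock[]): ChatMessage[] {
  const out: ChatMessage[] = [];
  let userParts: ChatPart[] = [];

  // Emit buffered text/image parts as ONE user message, then reset the buffer.
  const flush = (): void => {
    if (userParts.length === 0) return;
    const [first] = userParts;
    // A lone text part keeps the plain-string form; anything else uses an array.
    const content: string | ChatPart[] =
      userParts.length === 1 && first?.type === "text" ? first.text : [...userParts];
    out.push({ role: "user", content });
    userParts = [];
  };

  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      userParts.push({ type: "text", text: block.text });
    } else if (block.type === "image") {
      const url = imageUrlFor(block.source);
      if (url) userParts.push({ type: "image_url", image_url: { url } });
    } else if (block.type === "tool_result" && block.tool_use_id) {
      flush(); // tool results need their own tool-role message
      const content =
        typeof block.content === "string" ? block.content : JSON.stringify(block.content ?? "");
      out.push({ role: "tool", tool_call_id: block.tool_use_id, content });
    }
  }
  flush(); // end-of-loop flush for trailing text/image parts
  return out;
}
```

With this shape, a turn of `[text, image]` yields a single user message whose content is a two-element array, while a turn containing only tool_result blocks never triggers a user-message push.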

Gemini URL-image warning — placed correctly inside the else if (block.type === 'image') branch, fires only when source.type === 'url'. The turn is still dropped when it's URL-image-only (the if (parts.length) guard), but it's now diagnosable. That's exactly what I asked for.
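
The Gemini behavior described here (inline base64 only, a warning for URL sources, and the turn dropped when no parts survive) can be sketched as follows. The function name `translateUserTurnToGemini`, the `Block` interface, and the injectable `warn` parameter are assumptions made for illustration; only the warning text is taken from the first review.

```typescript
type GeminiPart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

interface GeminiContent {
  role: "user";
  parts: GeminiPart[];
}

// Hypothetical loose shape for an Anthropic content block.
interface Block {
  type: string;
  text?: string;
  source?: { type: string; media_type?: string; data?: string; url?: string };
}

function translateUserTurnToGemini(
  blocks: Block[],
  // Injectable for testing; defaults to console.warn.
  warn: (msg: string) => void = (msg) => console.warn(msg),
): GeminiContent | null {
  const parts: GeminiPart[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      parts.push({ text: block.text });
    } else if (block.type === "image") {
      const src = block.source;
      if (src?.type === "base64" && src.media_type && src.data) {
        parts.push({ inlineData: { mimeType: src.media_type, data: src.data } });
      } else if (src?.type === "url") {
        // URL images are skipped for Gemini, but the drop is logged.
        warn("[proxy] Gemini: dropping URL-based image block (not supported)");
      }
    }
  }
  // A turn with no surviving parts is still dropped; the warning makes it diagnosable.
  return parts.length ? { role: "user", parts } : null;
}
```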

Ship it.

Contributor

@openseed-patch openseed-patch Bot left a comment

Both fixes look correct.

Consecutive user messages (Chat): The accumulator pattern is solid — text and image blocks merge into a single content array, and tool_result blocks flush the accumulator first. The single-text optimization (userParts.length === 1 && userParts[0].type === 'text' ? userParts[0].text : [...userParts]) preserves the old behavior for the common case, which avoids any regressions with providers that prefer the string form.

Gemini URL warning: Clean and diagnosable.

Responses API: Each text/image block as a separate { role: 'user', content: [...] } item is fine — the Responses API input array allows this unlike Chat Completions.
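
A rough sketch of the one-item-per-block layout for the Responses API path is shown below. The names `translateUserTurnToResponses`, `imageUrlFor`, and `Block` are hypothetical; the `input_text` and `input_image` part types follow the public Responses API input format, and the rest is assumed rather than taken from the diff.

```typescript
type ResponsesPart =
  | { type: "input_text"; text: string }
  | { type: "input_image"; image_url: string };

interface ResponsesItem {
  role: "user";
  content: ResponsesPart[];
}

// Hypothetical loose shape for an Anthropic content block.
interface Block {
  type: string;
  text?: string;
  source?: { type: string; media_type?: string; data?: string; url?: string };
}

// Stand-in for the PR's anthropicImageToDataUrl helper (assumed behavior).
function imageUrlFor(source: Block["source"]): string | null {
  if (source?.type === "base64" && source.media_type && source.data) {
    return `data:${source.media_type};base64,${source.data}`;
  }
  if (source?.type === "url" && source.url) return source.url;
  return null;
}

// Each text or image block becomes its own user item in the input array.
function translateUserTurnToResponses(blocks: Block[]): ResponsesItem[] {
  const items: ResponsesItem[] = [];
  for (const block of blocks) {
    if (block.type === "text" && typeof block.text === "string") {
      items.push({ role: "user", content: [{ type: "input_text", text: block.text }] });
    } else if (block.type === "image") {
      const url = imageUrlFor(block.source);
      if (url) items.push({ role: "user", content: [{ type: "input_image", image_url: url }] });
    }
  }
  return items;
}
```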

One minor note for awareness: anthropicImageToDataUrl is typed source: any. That's fine for now given the surrounding proxy code is already untyped, just something to tighten up if someone adds Anthropic SDK types later.
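
If the typing is tightened later, one possible direction is a discriminated union plus a runtime guard, sketched below. The type names and the `isImageSource` guard are hypothetical and not taken from any SDK; the field names mirror the Anthropic image source format (`media_type`, `data`, `url`).

```typescript
// Hypothetical local types; an SDK's own types could replace these.
type Base64ImageSource = { type: "base64"; media_type: string; data: string };
type UrlImageSource = { type: "url"; url: string };
type AnthropicImageSource = Base64ImageSource | UrlImageSource;

// Runtime guard so untyped proxy input can be narrowed safely.
function isImageSource(value: unknown): value is AnthropicImageSource {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (v["type"] === "base64") {
    return typeof v["media_type"] === "string" && typeof v["data"] === "string";
  }
  if (v["type"] === "url") return typeof v["url"] === "string";
  return false;
}

// Typed variant of the helper: accepts unknown input and narrows before use.
function anthropicImageToDataUrl(source: unknown): string | null {
  if (!isImageSource(source)) return null;
  return source.type === "base64"
    ? `data:${source.media_type};base64,${source.data}`
    : source.url;
}
```

Accepting `unknown` keeps call sites in the untyped proxy code unchanged while moving the shape check into one place.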

LGTM.

@rsdouglas rsdouglas merged commit 7141502 into openseed-dev:main Feb 28, 2026