feat: translate Anthropic image content blocks across all providers #82
Add image handling to the LLM proxy translation layer so creatures can send images (screenshots, UI mockups, web images) in their conversations regardless of which provider backs the model.

- OpenAI Responses API: image blocks → input_image with data URL
- Chat Completions (OpenRouter): image blocks → image_url content parts
- Gemini: base64 image blocks → inlineData parts (URL-based images are skipped because Gemini requires inline data)

Addresses openseed-dev#73 (image handling for creatures); this is the proxy translation piece. Creatures already speak Anthropic format, which supports image blocks natively; this ensures those blocks survive translation to other providers.

Works alongside the image stripping in events.ts (openseed-dev#75): images flow through to the LLM but are stripped from event logs.
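All three mappings start by turning an Anthropic image source into a URL string. The PR's actual helper is not shown in this thread, so the following is only a minimal sketch of what `anthropicImageToDataUrl` might look like. The `ImageSource` type is an assumption (the review notes the real parameter is typed `any`), and the field names follow the Anthropic image block format.

```typescript
// Assumed shape of an Anthropic image source (hypothetical type name;
// the helper in the PR takes `source: any`).
type ImageSource =
  | { type: 'base64'; media_type: string; data: string }
  | { type: 'url'; url: string };

// Sketch: base64 sources become a data URL, URL sources pass through
// unchanged, and a missing source yields null so callers can skip the block.
function anthropicImageToDataUrl(source: ImageSource | undefined): string | null {
  if (!source) return null;
  if (source.type === 'base64') {
    return `data:${source.media_type};base64,${source.data}`;
  }
  return source.url;
}
```

Returning `null` rather than throwing keeps the translators tolerant of malformed blocks; whether the real helper behaves that way is not confirmed here.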
Reviewed the diff carefully. The translation logic is sound and the Gemini inline-data path is correct. Two things worth addressing:
1. Silent data loss when a Gemini URL-image is the only content in a turn (lines ~475-479)
When a user message is only an image with source.type === 'url', the parts array stays empty, the if (parts.length) guard skips the push, and the entire user turn is silently dropped. The model sees a conversation with a gap. This is probably rare today (genomes don't send image-only turns), but it will bite someone. A warning log here would at least make it diagnosable:
    } else if (block.type === 'image') {
      if (block.source?.type === 'base64' && ...) {
        parts.push({ inlineData: { ... } });
      } else if (block.source?.type === 'url') {
        // Gemini doesn't support remote URLs inline; skip but warn
        console.warn('[proxy] Gemini: dropping URL-based image block (not supported)');
      }
    }

2. translateMessagesToChat: mixed text+image turn emits two consecutive user messages
A single Anthropic user turn containing [{type:"text",...}, {type:"image",...}] produces:
    { role: 'user', content: "Look at this:" }              // from text block
    { role: 'user', content: [{type:"image_url",...}] }     // from image block

OpenAI/OpenRouter chat completions reject (or silently merge) consecutive same-role messages. The fix is to accumulate all blocks from one turn into a single content array before pushing:
    const parts: Array<{type: string, [k: string]: any}> = [];
    for (const block of content) {
      if (block.type === 'text') parts.push({ type: 'text', text: block.text });
      else if (block.type === 'image') {
        const url = anthropicImageToDataUrl(block.source);
        if (url) parts.push({ type: 'image_url', image_url: { url } });
      }
    }
    if (parts.length) chatMessages.push({ role: 'user', content: parts });

(Note: the text+image split is pre-existing behaviour for the non-image case too. This PR doesn't make it worse, but adding images here makes the collision more likely to actually occur.)
The Responses API and Gemini translators are fine — Responses API accepts multiple items per role, Gemini accumulates into parts[] already.
Addresses openseed-patch review feedback on PR openseed-dev#82:

1. translateMessagesToChat: accumulate text+image blocks from a single Anthropic user turn into one chat message with a content array, instead of emitting consecutive same-role messages (rejected by OpenAI/OpenRouter). Tool result blocks still flush to separate tool-role messages.
2. translateMessagesToGemini: log a warning when URL-based image blocks are silently dropped (Gemini only supports inline base64). This makes the data loss diagnosable instead of silent.
Good catches on both points. Fixed in b99b272:

1. Gemini URL-image warning: added a warning log when a URL-based image block is dropped.
2. Consecutive same-role messages: reworked the Chat translation to accumulate all text + image blocks from a single Anthropic user turn into one chat message with a content array.

The Responses API and Gemini translators were already correct, as you noted; only the Chat path needed the fix.
Both concerns from the review are addressed in b99b272. Ready for re-review when you get a chance.
Re-reviewed b99b272. Both fixes are correct.
Chat path accumulation — the userParts[] buffer correctly collects text and image blocks and flushes them as a single user message, either before a tool_result or at end-of-loop. The flush logic (length === 1 && type === 'text' ? string : array) preserves the simple string format for the common case and falls back to an array for mixed content. Edge cases I checked: only tool_result blocks (userParts stays empty, flush is skipped ✓), multiple consecutive tool_results (each flush guard handles it ✓), URL-only image in Chat path (anthropicImageToDataUrl returns the URL string, gets pushed as image_url block correctly ✓).
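For readers without the diff open, the accumulate-and-flush behaviour described above can be sketched roughly as follows. This is not the code from b99b272: the function name `translateUserTurn`, the type names, and the simplification of `tool_result` content to a plain string are all assumptions made for illustration. Only `userParts`, the flush rule, and the `image_url` part shape come from the review.

```typescript
// Hypothetical, simplified block and message types for illustration only.
type ImageSource =
  | { type: 'base64'; media_type: string; data: string }
  | { type: 'url'; url: string };

type Block =
  | { type: 'text'; text: string }
  | { type: 'image'; source: ImageSource }
  | { type: 'tool_result'; tool_use_id: string; content: string };

type ChatPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string } };

type ChatMessage =
  | { role: 'user'; content: string | ChatPart[] }
  | { role: 'tool'; tool_call_id: string; content: string };

function translateUserTurn(blocks: Block[]): ChatMessage[] {
  const out: ChatMessage[] = [];
  let userParts: ChatPart[] = [];

  // Emit buffered text/image parts as ONE user message, then reset the buffer.
  const flush = (): void => {
    const first = userParts[0];
    if (first === undefined) return; // nothing buffered, nothing to emit
    // A lone text part keeps the plain string form; anything else uses an array.
    const content =
      userParts.length === 1 && first.type === 'text' ? first.text : userParts;
    out.push({ role: 'user', content });
    userParts = [];
  };

  for (const block of blocks) {
    if (block.type === 'text') {
      userParts.push({ type: 'text', text: block.text });
    } else if (block.type === 'image') {
      const src = block.source;
      const url =
        src.type === 'base64' ? `data:${src.media_type};base64,${src.data}` : src.url;
      userParts.push({ type: 'image_url', image_url: { url } });
    } else {
      // tool_result: flush pending user content first, then emit a tool message.
      flush();
      out.push({ role: 'tool', tool_call_id: block.tool_use_id, content: block.content });
    }
  }
  flush(); // end-of-loop flush for trailing text/image parts
  return out;
}
```

Under these assumptions, a `[text, image]` turn yields one user message with a two-element content array, while a `[text, tool_result, text]` turn yields user, tool, user in that order, so no two user messages end up adjacent.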
Gemini URL-image warning — placed correctly inside the else if (block.type === 'image') branch, fires only when source.type === 'url'. The turn is still dropped when it's URL-image-only (the if (parts.length) guard), but it's now diagnosable. That's exactly what I asked for.
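To make the remaining behaviour concrete, here is a small sketch of the Gemini path as described in this thread: base64 images become `inlineData` parts, URL images are dropped with a warning, and a turn with no surviving parts is not emitted. The function name `toGeminiUserTurn`, the injectable `warn` parameter, and the type names are hypothetical; the warning text and the `parts.length` guard are taken from the review above.

```typescript
// Hypothetical, simplified types for illustration only.
type ImageSource =
  | { type: 'base64'; media_type: string; data: string }
  | { type: 'url'; url: string };

type Block =
  | { type: 'text'; text: string }
  | { type: 'image'; source: ImageSource };

type GeminiPart =
  | { text: string }
  | { inlineData: { mimeType: string; data: string } };

type GeminiTurn = { role: 'user'; parts: GeminiPart[] };

// `warn` is injectable so the drop can be observed in tests;
// it defaults to console.warn.
function toGeminiUserTurn(
  blocks: Block[],
  warn: (message: string) => void = (m) => console.warn(m),
): GeminiTurn | null {
  const parts: GeminiPart[] = [];
  for (const block of blocks) {
    if (block.type === 'text') {
      parts.push({ text: block.text });
      continue;
    }
    const src = block.source;
    if (src.type === 'base64') {
      parts.push({ inlineData: { mimeType: src.media_type, data: src.data } });
    } else {
      // URL images cannot be sent inline; skip, but leave a trace in the logs.
      warn('[proxy] Gemini: dropping URL-based image block (not supported)');
    }
  }
  // Equivalent of the `if (parts.length)` guard: an empty turn is not emitted.
  return parts.length > 0 ? { role: 'user', parts } : null;
}
```

A turn containing only a URL image therefore returns `null` here, matching the "still dropped, but now diagnosable" outcome the re-review describes.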
Ship it.
Both fixes look correct.
Consecutive user messages (Chat): The accumulator pattern is solid — text and image blocks merge into a single content array, and tool_result blocks flush the accumulator first. The single-text optimization (userParts.length === 1 && userParts[0].type === 'text' ? userParts[0].text : [...userParts]) preserves the old behavior for the common case, which avoids any regressions with providers that prefer the string form.
Gemini URL warning: Clean and diagnosable.
Responses API: Each text/image block as a separate { role: 'user', content: [...] } item is fine — the Responses API input array allows this unlike Chat Completions.
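As a rough illustration of the one-item-per-block shape mentioned above, the sketch below maps each text or image block to its own user item. It is an assumption-laden sketch rather than the PR's code: the function name `toResponsesItems` and the type names are hypothetical, and the `input_text` part with an `image_url` string field on `input_image` reflects my understanding of the Responses API input format, not something confirmed in this thread.

```typescript
// Hypothetical, simplified types for illustration only.
type ImageSource =
  | { type: 'base64'; media_type: string; data: string }
  | { type: 'url'; url: string };

type Block =
  | { type: 'text'; text: string }
  | { type: 'image'; source: ImageSource };

type ResponsesPart =
  | { type: 'input_text'; text: string }
  | { type: 'input_image'; image_url: string };

type ResponsesItem = { role: 'user'; content: ResponsesPart[] };

function toResponsesItems(blocks: Block[]): ResponsesItem[] {
  const items: ResponsesItem[] = [];
  for (const block of blocks) {
    let part: ResponsesPart;
    if (block.type === 'text') {
      part = { type: 'input_text', text: block.text };
    } else {
      const src = block.source;
      const url =
        src.type === 'base64' ? `data:${src.media_type};base64,${src.data}` : src.url;
      part = { type: 'input_image', image_url: url };
    }
    // One item per block: consecutive user items are acceptable in the
    // Responses API input array, unlike consecutive messages in Chat Completions.
    items.push({ role: 'user', content: [part] });
  }
  return items;
}
```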
One minor note for awareness: anthropicImageToDataUrl is typed source: any. That's fine for now given the surrounding proxy code is already untyped, just something to tighten up if someone adds Anthropic SDK types later.
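If someone does tighten the typing later, one option is a small runtime guard that narrows an untyped value to a discriminated union before the helper touches it. This is only a sketch of that idea: `ImageSource` and `parseImageSource` are hypothetical names, not Anthropic SDK types, and the field names are assumed from the Anthropic image block format.

```typescript
// Hypothetical union for the two source kinds discussed in this PR.
type ImageSource =
  | { type: 'base64'; media_type: string; data: string }
  | { type: 'url'; url: string };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null;
}

// Returns a well-typed source, or null when the value does not match
// either expected shape.
function parseImageSource(value: unknown): ImageSource | null {
  if (!isRecord(value)) return null;
  const { type, media_type, data, url } = value;
  if (type === 'base64' && typeof media_type === 'string' && typeof data === 'string') {
    return { type: 'base64', media_type, data };
  }
  if (type === 'url' && typeof url === 'string') {
    return { type: 'url', url };
  }
  return null;
}
```

Callers can then branch on `source.type` with full type checking instead of relying on optional chaining over `any`.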
LGTM.
What
Adds image block translation to the LLM proxy so creatures can include images (screenshots, UI mockups, web content) in their conversations, regardless of which provider backs the model.
Changes
Single file change to src/host/proxy.ts:

- anthropicImageToDataUrl(): converts an Anthropic image source (base64 or URL) to a data URL string
- OpenAI Responses API: image blocks → input_image items with data URL
- Chat Completions: image blocks → image_url content parts in the standard content: [{type: 'image_url', ...}] format
- Gemini: base64 image blocks → inlineData parts with mimeType and data. URL-based images are skipped since Gemini requires inline data.

Context
Addresses #73. Creatures speak Anthropic format natively, which already supports image content blocks. The proxy currently silently drops these blocks during translation; this change preserves them.

Complements #75 (image stripping for event logs): images flow through to the LLM via the proxy, but are stripped from persisted event data to avoid bloating logs.
Not included
- Images in tool_result content blocks (Anthropic supports images there too, but it's a rarer pattern)