perf(mm): ephemeral pixel return + smaller image cache to bound rollout RAM by mikasenghaas · Pull Request #85 · PrimeIntellect-ai/renderers

mikasenghaas · 2026-06-11T05:26:47Z

Summary

The renderer-client rollout path keeps and ships the full processed pixel_values (tens of MB per large image) for every image on every turn, so resident multimodal memory grows with turns × concurrency. At 256 concurrent 1024² rollouts the trace/env-worker alone retains ~86 GB.

Two opt-in, default-off modes shrink this (behaviour unchanged unless a consumer sets the env flag), plus a cache-size default cut:

- RENDERERS_MM_EPHEMERAL (stored data): generate returns a descriptor-only multi_modal_data (image_grid_thw + mm_hashes + mm_placeholders, no pixel_values), so the trajectory never retains decoded tensors. Stored mm becomes O(1) per image (a descriptor) instead of O(image-pixels). The consumer re-derives pixels downstream from the message images. Purely client-side; safe on any engine, incl. vLLM 0.22.
- RENDERERS_MM_HASH_CACHE (transported data): send each image's pixels once, then descriptor-only (None kwargs) so the engine serves it from its mm-hash cache. _build_qwen_vl_features is now descriptor-aware (per-item None slots aligned to mm_hashes), driven by a sent-hash memory with a cache-miss fallback. Requires an engine that resolves None from cache (the disagg router topology); a plain single-server vLLM 0.22 forces skip_mm_cache=True and crashes on an unresolved None, so this stays OFF until the engine supports it.
- image_cache_max default 256 → 32: each entry holds a decoded pixel tensor and the pool holds one cache per renderer, so 256 capped resident cache memory at ~pool_size × 256 × pixel_bytes (tens of GB for large images).
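
To make the ephemeral mode concrete, here is a minimal sketch of descriptor-only stripping. It is not the code in this PR: the helper names `to_descriptor` and `finalize_mm`, the plain-dict shape of multi_modal_data, and the way the env flag is parsed are all assumptions made for illustration. Only the field names and the flag name come from the summary above.

```python
import os
from typing import Any, Dict

# Assumption: the flag is treated as a truthy string; the real parsing may differ.
MM_EPHEMERAL = os.environ.get("RENDERERS_MM_EPHEMERAL", "0").lower() in {"1", "true", "yes"}

# Descriptor fields named in the summary; pixel_values is deliberately absent.
DESCRIPTOR_KEYS = ("image_grid_thw", "mm_hashes", "mm_placeholders")


def to_descriptor(multi_modal_data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict holding only the descriptor fields (hypothetical helper)."""
    return {k: multi_modal_data[k] for k in DESCRIPTOR_KEYS if k in multi_modal_data}


def finalize_mm(multi_modal_data: Dict[str, Any], ephemeral: bool = MM_EPHEMERAL) -> Dict[str, Any]:
    """Hypothetical post-generate hook: drop pixels only when the mode is enabled."""
    return to_descriptor(multi_modal_data) if ephemeral else multi_modal_data
```

With the flag off the input is returned untouched, which matches the "behaviour unchanged" guarantee. With it on, the stored object no longer references the pixel tensor, so its size depends on the number of images rather than their resolution.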

Verification

Stored RAM, retaining K concurrent rollouts' image mm (1024², 8 turns), full vs ephemeral:

K (rollouts)	FULL (pixels)	EPHEMERAL (descriptor)
32	12.8 GB	3.1 GB
64	23.2 GB	3.4 GB
128	44.4 GB	4.3 GB
256	86.0 GB	5.8 GB

~15× less at K=256; per-rollout slope drops ~20× (0.33 → 0.015 GB/rollout) — stored mm stops scaling with concurrency/resolution.
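
The headline reduction factor can be re-derived from the table. The snippet below is a small arithmetic check, not part of the PR; the dictionary simply transcribes the table rows.

```python
# Stored RAM in GB, transcribed from the table: K -> (full, ephemeral).
stored_gb = {32: (12.8, 3.1), 64: (23.2, 3.4), 128: (44.4, 4.3), 256: (86.0, 5.8)}


def reduction(k: int) -> float:
    """Ratio of full to ephemeral stored RAM at concurrency k."""
    full, ephemeral = stored_gb[k]
    return full / ephemeral


for k in sorted(stored_gb):
    print(f"K={k}: {reduction(k):.1f}x less")
# K=32: 4.1x less
# K=64: 6.8x less
# K=128: 10.3x less
# K=256: 14.8x less
```

The ratio grows with K because the full-pixel column scales almost linearly with concurrency while the descriptor column stays nearly flat.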

tests/test_client.py: 12 passing, incl. new ephemeral-return and hash-only-serialization tests.

🤖 Generated with Claude Code

Note

Add ephemeral pixel return and reduce image cache size to bound multimodal RAM usage

- Adds two environment-controlled multimodal memory modes in renderers/client.py: _MM_EPHEMERAL strips pixel_values from returned MultiModalData after sending (keeping descriptors and hashes), and _MM_HASH_ONLY skips re-sending pixel payloads for already-seen image hashes using a module-level LRU cache (_mm_sent).
- Adds helpers _mm_seen, _mm_record, _mm_forget, and _strip_pixels_for to manage the in-process hash cache and rebuild MultiModalData without pixel data.
- Updates _build_qwen_vl_features to support per-item None slots for hash-only images and omit kwargs_data entirely when all items are hash-only.
- Reduces default image_cache_max from 256 to 32 for Qwen3.5, Qwen3.6, Qwen3-VL, and Kimi K2.5 renderer configs in renderers/configs.py.
- Behavioral Change: the reduced image_cache_max default (256 → 32) affects all deployments using these renderer configs without explicit overrides; cached pixel tensor memory drops proportionally.
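
The hash-only path described above can be pictured as a bounded LRU of sent hashes plus a per-item slot builder. The sketch below is an illustration under assumptions, not the implementation in renderers/client.py: the class `SentHashes`, the function `build_slots`, the default capacity, and the use of opaque objects in place of pixel tensors are all hypothetical. Its methods only mirror the roles the note gives to _mm_seen, _mm_record, and _mm_forget.

```python
from collections import OrderedDict
from typing import Any, List, Optional, Sequence, Tuple


class SentHashes:
    """Bounded LRU set of image hashes whose pixels were already sent (hypothetical)."""

    def __init__(self, max_size: int = 1024) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_size = max_size

    def seen(self, mm_hash: str) -> bool:
        if mm_hash in self._seen:
            self._seen.move_to_end(mm_hash)  # refresh recency on a hit
            return True
        return False

    def record(self, mm_hash: str) -> None:
        self._seen[mm_hash] = None
        self._seen.move_to_end(mm_hash)
        while len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # evict the least recently used hash

    def forget(self, mm_hash: str) -> None:
        # Cache-miss fallback: forget the hash so the next build resends pixels.
        self._seen.pop(mm_hash, None)


def build_slots(
    mm_hashes: Sequence[str], pixels: Sequence[Any], sent: SentHashes
) -> Tuple[List[str], Optional[List[Optional[Any]]]]:
    """Return (hashes, slots): slots[i] is None when the hash was already sent.

    The slots value is None altogether when every item is hash-only, mirroring
    the "omit kwargs_data entirely" behaviour described in the note.
    """
    slots: List[Optional[Any]] = []
    for mm_hash, px in zip(mm_hashes, pixels):
        if sent.seen(mm_hash):
            slots.append(None)
        else:
            slots.append(px)
            sent.record(mm_hash)
    if all(slot is None for slot in slots):
        return list(mm_hashes), None
    return list(mm_hashes), slots
```

If the engine reports that it no longer holds an image, calling `forget` on that hash and rebuilding the request sends the pixels again. Bounding the LRU keeps the client-side memory of sent hashes small, at the cost of occasionally resending an image whose hash was evicted.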

Macroscope summarized 92361ef.

…ut RAM

The renderer-client rollout path keeps and ships the full processed ``pixel_values`` (tens of MB per large image) for every image on every turn, so resident multimodal memory grows with turns x concurrency: at 256 concurrent 1024^2 rollouts the trace/env-worker alone retains ~86 GB.

Two opt-in, default-off modes shrink this; behaviour is unchanged unless a consumer sets the env flag:

- ``RENDERERS_MM_EPHEMERAL`` (stored data): ``generate`` returns a descriptor-only ``multi_modal_data`` (image_grid_thw + mm_hashes + mm_placeholders, no ``pixel_values``), so the trajectory never retains decoded tensors. Stored mm becomes O(1) per image (a descriptor) instead of O(image-pixels): 86 GB -> 5.8 GB at 256 concurrent 1024^2 rollouts, and the per-rollout slope drops ~20x. The consumer re-derives pixels downstream from the message images. Purely client-side; safe on any engine incl. vLLM 0.22.
- ``RENDERERS_MM_HASH_CACHE`` (transported data): send each image's pixels once, then descriptor-only (``None`` kwargs) so the engine serves it from its mm-hash cache. ``_build_qwen_vl_features`` is now descriptor-aware (per-item ``None`` slots aligned to ``mm_hashes``); a sent-hash memory drives it with a cache-miss fallback. REQUIRES an engine that resolves ``None`` from cache (the disagg router topology); a plain single-server vLLM 0.22 forces ``skip_mm_cache=True`` and crashes on an unresolved ``None``, so this stays OFF until the engine supports it.

Also lower ``image_cache_max`` default 256 -> 32: each entry holds a decoded pixel tensor and the pool holds one cache per renderer, so 256 capped resident cache memory at ~pool_size x 256 x pixel_bytes (tens of GB for large images).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mikasenghaas wants to merge 1 commit into main from perf/mm-hash-only-cache
