perf(mm): ephemeral pixel return + smaller image cache to bound rollout RAM #85

Draft
mikasenghaas wants to merge 1 commit into main from perf/mm-hash-only-cache


@mikasenghaas mikasenghaas commented Jun 11, 2026

Summary

The renderer-client rollout path keeps and ships the full processed pixel_values (tens of MB per large image) for every image on every turn, so resident multimodal memory grows with turns × concurrency. At 256 concurrent 1024² rollouts the trace/env-worker alone retains ~86 GB.

Two opt-in, default-off modes shrink this (behaviour unchanged unless a consumer sets the env flag), plus a cache-size default cut:

  • RENDERERS_MM_EPHEMERAL (stored data): generate returns a descriptor-only multi_modal_data (image_grid_thw + mm_hashes + mm_placeholders, no pixel_values), so the trajectory never retains decoded tensors. Stored mm becomes O(1) per image (a descriptor) instead of O(image-pixels). The consumer re-derives pixels downstream from the message images. Purely client-side — safe on any engine, incl. vLLM 0.22.
  • RENDERERS_MM_HASH_CACHE (transported data): send each image's pixels once, then send only its descriptor (None kwargs) afterwards so the engine serves the image from its mm-hash cache. _build_qwen_vl_features is now descriptor-aware (per-item None slots aligned to mm_hashes), driven by a sent-hash memory with a cache-miss fallback. Requires an engine that resolves None from cache (the disagg router topology); a plain single-server vLLM 0.22 forces skip_mm_cache=True and crashes on an unresolved None, so this stays OFF until the engine supports it.
  • image_cache_max default 256 → 32: each entry holds a decoded pixel tensor and the pool holds one cache per renderer, so the old default of 256 let resident cache memory grow to ~pool_size × 256 × pixel_bytes (tens of GB for large images).
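
As a rough illustration of the RENDERERS_MM_EPHEMERAL idea in the first bullet, the sketch below drops pixel_values from a dict-shaped multi_modal_data and keeps only the three descriptor fields named above. The dict layout, the helper names (`mm_ephemeral_enabled`, `strip_pixels`), and the accepted flag values are assumptions made for illustration; they are not taken from the PR's code.

```python
import os
from typing import Any, Mapping

# Descriptor fields named in the PR summary; everything else is dropped.
DESCRIPTOR_KEYS = ("image_grid_thw", "mm_hashes", "mm_placeholders")


def mm_ephemeral_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical flag parsing: off unless the variable holds a truthy value."""
    value = env.get("RENDERERS_MM_EPHEMERAL", "")
    return value.strip().lower() in {"1", "true", "yes"}


def strip_pixels(mm: Mapping[str, Any]) -> dict[str, Any]:
    """Return a descriptor-only copy of `mm`, without pixel_values."""
    return {key: mm[key] for key in DESCRIPTOR_KEYS if key in mm}


full = {
    "pixel_values": [[0.0] * 8] * 4,  # stand-in for a large decoded tensor
    "image_grid_thw": [[1, 2, 2]],
    "mm_hashes": ["abc123"],
    "mm_placeholders": [{"offset": 5, "length": 4}],
}

# With the flag set, only the descriptor is kept for storage.
enabled = mm_ephemeral_enabled({"RENDERERS_MM_EPHEMERAL": "1"})
stored = strip_pixels(full) if enabled else full
```

Because the stored value no longer references the tensor, its size depends on the number of images rather than on their resolution, which is the O(1)-per-image property the bullet describes.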

Verification

Stored RAM, retaining K concurrent rollouts' image mm (1024², 8 turns), full vs ephemeral:

| K (rollouts) | FULL (pixels) | EPHEMERAL (descriptor) |
| --- | --- | --- |
| 32 | 12.8 GB | 3.1 GB |
| 64 | 23.2 GB | 3.4 GB |
| 128 | 44.4 GB | 4.3 GB |
| 256 | 86.0 GB | 5.8 GB |

~15× less at K=256; per-rollout slope drops ~20× (0.33 → 0.015 GB/rollout) — stored mm stops scaling with concurrency/resolution.
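
The headline ratio can be re-derived from the table. The snippet below is only arithmetic on the four measured rows; the endpoint-slope calculation is one assumed way of estimating growth per rollout and may differ from how the quoted slopes were obtained.

```python
# Figures copied from the table above (GB retained for K concurrent rollouts).
full_gb = {32: 12.8, 64: 23.2, 128: 44.4, 256: 86.0}
ephemeral_gb = {32: 3.1, 64: 3.4, 128: 4.3, 256: 5.8}

# Reduction factor at each concurrency level.
reduction = {k: full_gb[k] / ephemeral_gb[k] for k in full_gb}
for k, factor in reduction.items():
    print(f"K={k:>3}: {factor:.1f}x less stored RAM")

# Endpoint slope of the FULL column, in GB per additional rollout.
full_slope = (full_gb[256] - full_gb[32]) / (256 - 32)
print(f"FULL slope: {full_slope:.2f} GB/rollout")
```

The reduction factor grows with K, from about 4x at K=32 to about 15x at K=256, so the saving matters most at high concurrency.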

tests/test_client.py: 12 passing, incl. new ephemeral-return and hash-only-serialization tests.

🤖 Generated with Claude Code

Note

Add ephemeral pixel return and reduce image cache size to bound multimodal RAM usage

  • Adds two environment-controlled multimodal memory modes in renderers/client.py: _MM_EPHEMERAL strips pixel_values from returned MultiModalData after sending (keeping descriptors and hashes), and _MM_HASH_ONLY skips re-sending pixel payloads for already-seen image hashes using a module-level LRU cache (_mm_sent).
  • Adds helpers _mm_seen, _mm_record, _mm_forget, and _strip_pixels_for to manage the in-process hash cache and rebuild MultiModalData without pixel data.
  • Updates _build_qwen_vl_features to support per-item None slots for hash-only images and omit kwargs_data entirely when all items are hash-only.
  • Reduces default image_cache_max from 256 to 32 for Qwen3.5, Qwen3.6, Qwen3-VL, and Kimi K2.5 renderer configs in renderers/configs.py.
  • Behavioral Change: the reduced image_cache_max default (256 → 32) affects all deployments using these renderer configs without explicit overrides; cached pixel tensor memory drops proportionally.
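
The hash-only path described in the bullets above can be sketched as a bounded "already sent" memory plus a slot builder. This is a minimal sketch under stated assumptions: the class `SentHashMemory`, the function `build_slots`, and the capacity value are hypothetical stand-ins for the module-level helpers the summary names (_mm_seen, _mm_record, _mm_forget), whose real signatures are not shown here.

```python
from collections import OrderedDict
from typing import Any, Optional


class SentHashMemory:
    """Bounded LRU set of image hashes whose pixels were already sent."""

    def __init__(self, max_size: int = 1024) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_size = max_size

    def seen(self, mm_hash: str) -> bool:
        if mm_hash in self._seen:
            self._seen.move_to_end(mm_hash)  # mark as most recently used
            return True
        return False

    def record(self, mm_hash: str) -> None:
        self._seen[mm_hash] = None
        self._seen.move_to_end(mm_hash)
        while len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # evict least recently used

    def forget(self, mm_hash: str) -> None:
        # Cache-miss fallback: the next request sends full pixels again.
        self._seen.pop(mm_hash, None)


def build_slots(
    hashes: list[str], pixels: list[Any], memory: SentHashMemory
) -> Optional[list[Any]]:
    """Per-item payloads aligned to `hashes`; None marks a hash-only item.

    Returns None when every item is hash-only, mirroring the idea of
    omitting the kwargs payload entirely in that case.
    """
    slots: list[Any] = []
    for mm_hash, pixel in zip(hashes, pixels):
        if memory.seen(mm_hash):
            slots.append(None)
        else:
            slots.append(pixel)
            memory.record(mm_hash)
    return None if all(slot is None for slot in slots) else slots


memory = SentHashMemory(max_size=2)
first = build_slots(["a", "b"], ["px_a", "px_b"], memory)   # both sent in full
second = build_slots(["a", "c"], ["px_a", "px_c"], memory)  # "a" is hash-only
third = build_slots(["a", "c"], ["px_a", "px_c"], memory)   # all hash-only
memory.forget("a")                                          # engine reported a miss
fourth = build_slots(["a"], ["px_a"], memory)               # resent in full
```

Keeping the memory bounded means a long-running client can forget an old hash and resend its pixels, which costs bandwidth but stays correct as long as the engine accepts a full payload for a hash it already holds.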

Macroscope summarized 92361ef.

…ut RAM

The renderer-client rollout path keeps and ships the full processed
``pixel_values`` (tens of MB per large image) for every image on every turn,
so resident multimodal memory grows with turns x concurrency — at 256
concurrent 1024^2 rollouts the trace/env-worker alone retains ~86 GB.

Two opt-in, default-off modes shrink this; behaviour is unchanged unless a
consumer sets the env flag:

- ``RENDERERS_MM_EPHEMERAL`` (stored data): ``generate`` returns a
  descriptor-only ``multi_modal_data`` (image_grid_thw + mm_hashes +
  mm_placeholders, no ``pixel_values``), so the trajectory never retains
  decoded tensors. Stored mm becomes O(1) per image (a descriptor) instead of
  O(image-pixels) — 86 GB -> 5.8 GB at 256 concurrent 1024^2 rollouts, and the
  per-rollout slope drops ~20x. The consumer re-derives pixels downstream from
  the message images. Purely client-side; safe on any engine incl. vLLM 0.22.

- ``RENDERERS_MM_HASH_CACHE`` (transported data): send each image's pixels once,
  then descriptor-only (``None`` kwargs) so the engine serves it from its
  mm-hash cache. ``_build_qwen_vl_features`` is now descriptor-aware (per-item
  ``None`` slots aligned to ``mm_hashes``); a sent-hash memory drives it with a
  cache-miss fallback. REQUIRES an engine that resolves ``None`` from cache (the
  disagg router topology) — a plain single-server vLLM 0.22 forces
  ``skip_mm_cache=True`` and crashes on an unresolved ``None``, so this stays OFF
  until the engine supports it.

Also lower ``image_cache_max`` default 256 -> 32: each entry holds a decoded
pixel tensor and the pool holds one cache per renderer, so 256 capped resident
cache memory at ~pool_size x 256 x pixel_bytes (tens of GB for large images).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
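
The image_cache_max bound stated in the commit message (~pool_size x image_cache_max x pixel_bytes) can be made concrete with a small calculation. The pool size of 8 and the 25 MB per decoded image used below are assumed values chosen only to illustrate the formula; neither number comes from the PR.

```python
def cache_bound_bytes(pool_size: int, image_cache_max: int, pixel_bytes: int) -> int:
    """Upper bound on resident cache memory: one full cache per pooled renderer."""
    return pool_size * image_cache_max * pixel_bytes


POOL_SIZE = 8              # hypothetical number of pooled renderers
PIXEL_BYTES = 25 * 10**6   # assumed size of one decoded pixel tensor (25 MB)

old_bound = cache_bound_bytes(POOL_SIZE, 256, PIXEL_BYTES)
new_bound = cache_bound_bytes(POOL_SIZE, 32, PIXEL_BYTES)

print(f"image_cache_max=256: {old_bound / 1e9:.1f} GB")
print(f"image_cache_max=32:  {new_bound / 1e9:.1f} GB")
```

Whatever the real pool size and tensor size are, the bound is linear in image_cache_max, so moving from 256 to 32 lowers it by a factor of 8.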