perf(mm): ephemeral pixel return + smaller image cache to bound rollout RAM #85
Draft
mikasenghaas wants to merge 1 commit into
…ut RAM

The renderer-client rollout path keeps and ships the full processed ``pixel_values`` (tens of MB per large image) for every image on every turn, so resident multimodal memory grows with turns x concurrency: at 256 concurrent 1024^2 rollouts the trace/env-worker alone retains ~86 GB.

Two opt-in, default-off modes shrink this; behaviour is unchanged unless a consumer sets the env flag:

- ``RENDERERS_MM_EPHEMERAL`` (stored data): ``generate`` returns a descriptor-only ``multi_modal_data`` (image_grid_thw + mm_hashes + mm_placeholders, no ``pixel_values``), so the trajectory never retains decoded tensors. Stored mm becomes O(1) per image (a descriptor) instead of O(image-pixels): 86 GB -> 5.8 GB at 256 concurrent 1024^2 rollouts, and the per-rollout slope drops ~20x. The consumer re-derives pixels downstream from the message images. Purely client-side; safe on any engine incl. vLLM 0.22.

- ``RENDERERS_MM_HASH_CACHE`` (transported data): send each image's pixels once, then descriptor-only (``None`` kwargs) so the engine serves it from its mm-hash cache. ``_build_qwen_vl_features`` is now descriptor-aware (per-item ``None`` slots aligned to ``mm_hashes``); a sent-hash memory drives it with a cache-miss fallback. REQUIRES an engine that resolves ``None`` from cache (the disagg router topology). A plain single-server vLLM 0.22 forces ``skip_mm_cache=True`` and crashes on an unresolved ``None``, so this stays OFF until the engine supports it.

Also lower ``image_cache_max`` default 256 -> 32: each entry holds a decoded pixel tensor and the pool holds one cache per renderer, so 256 capped resident cache memory at ~pool_size x 256 x pixel_bytes (tens of GB for large images).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
The renderer-client rollout path keeps and ships the full processed `pixel_values` (tens of MB per large image) for every image on every turn, so resident multimodal memory grows with `turns × concurrency`. At 256 concurrent 1024² rollouts the trace/env-worker alone retains ~86 GB.

Two opt-in, default-off modes shrink this (behaviour unchanged unless a consumer sets the env flag), plus a cache-size default cut:

- `RENDERERS_MM_EPHEMERAL` (stored data): `generate` returns a descriptor-only `multi_modal_data` (`image_grid_thw` + `mm_hashes` + `mm_placeholders`, no `pixel_values`), so the trajectory never retains decoded tensors. Stored mm becomes O(1) per image (a descriptor) instead of O(image-pixels). The consumer re-derives pixels downstream from the message images. Purely client-side, so it is safe on any engine, incl. vLLM 0.22.
- `RENDERERS_MM_HASH_CACHE` (transported data): send each image's pixels once, then descriptor-only (`None` kwargs) so the engine serves it from its mm-hash cache. `_build_qwen_vl_features` is now descriptor-aware (per-item `None` slots aligned to `mm_hashes`), driven by a sent-hash memory with a cache-miss fallback. Requires an engine that resolves `None` from cache (the disagg router topology); a plain single-server vLLM 0.22 forces `skip_mm_cache=True` and crashes on an unresolved `None`, so this stays OFF until the engine supports it.
- `image_cache_max` default 256 → 32: each entry holds a decoded pixel tensor and the pool holds one cache per renderer, so 256 capped resident cache memory at `~pool_size × 256 × pixel_bytes` (tens of GB for large images).

Verification
Stored RAM, retaining K concurrent rollouts' image mm (1024², 8 turns), full vs ephemeral:
~15× less at K=256 (86 GB → 5.8 GB); the per-rollout slope drops ~20× (0.33 → 0.015 GB/rollout), so stored mm no longer scales with image resolution and grows only slightly with concurrency.
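As a rough cross-check of the figures above, the quoted per-rollout slopes can be extrapolated to K=256. This is a minimal sketch that assumes stored memory is close to linear in K and ignores any fixed overhead, so its totals only approximate the measured ones; the function name is illustrative.

```python
# Per-rollout slopes quoted above, in GB per retained rollout.
FULL_SLOPE_GB = 0.33
EPHEMERAL_SLOPE_GB = 0.015


def stored_gb(slope_gb: float, rollouts: int) -> float:
    """Linear estimate of stored multimodal memory (fixed overhead ignored)."""
    return slope_gb * rollouts


k = 256
full = stored_gb(FULL_SLOPE_GB, k)                # about 84.5 GB
ephemeral = stored_gb(EPHEMERAL_SLOPE_GB, k)      # about 3.8 GB
slope_ratio = FULL_SLOPE_GB / EPHEMERAL_SLOPE_GB  # about 22x

print(f"full ~{full:.1f} GB, ephemeral ~{ephemeral:.1f} GB, slope ratio ~{slope_ratio:.0f}x")
```

The slope-only estimate for the full mode lands near the measured ~86 GB. For the ephemeral mode it lands below the 5.8 GB quoted in the commit message, which would be consistent with a small fixed overhead that does not scale with K, and would explain why the total ratio (~15×) is smaller than the slope ratio (~20×).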
`tests/test_client.py`: 12 passing, incl. new ephemeral-return and hash-only-serialization tests.

🤖 Generated with Claude Code
Note
Add ephemeral pixel return and reduce image cache size to bound multimodal RAM usage
- `_MM_EPHEMERAL` strips `pixel_values` from returned `MultiModalData` after sending (keeping descriptors and hashes), and `_MM_HASH_ONLY` skips re-sending pixel payloads for already-seen image hashes using a module-level LRU cache (`_mm_sent`).
- Adds `_mm_seen`, `_mm_record`, `_mm_forget`, and `_strip_pixels_for` to manage the in-process hash cache and rebuild `MultiModalData` without pixel data.
- Updates `_build_qwen_vl_features` to support per-item `None` slots for hash-only images and omit `kwargs_data` entirely when all items are hash-only.
- Reduces `image_cache_max` from 256 to 32 for Qwen3.5, Qwen3.6, Qwen3-VL, and Kimi K2.5 renderer configs in `renderers/configs.py`.
- The `image_cache_max` default change (256 → 32) affects all deployments using these renderer configs without explicit overrides; cached pixel tensor memory drops proportionally.

Macroscope summarized 92361ef.
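The sent-hash memory and pixel stripping described above might look roughly like the sketch below. It is not the PR's implementation: `SentHashLRU`, `build_kwargs_slots`, and `strip_pixels` are hypothetical names, plain dicts and lists stand in for `MultiModalData` and pixel tensors, and the LRU size is an arbitrary assumption.

```python
from collections import OrderedDict
from typing import Any, Dict, List, Optional


class SentHashLRU:
    """Hypothetical stand-in for the module-level sent-hash memory."""

    def __init__(self, max_size: int = 1024) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_size = max_size

    def seen(self, mm_hash: str) -> bool:
        if mm_hash in self._seen:
            self._seen.move_to_end(mm_hash)  # mark as most recently used
            return True
        return False

    def record(self, mm_hash: str) -> None:
        self._seen[mm_hash] = None
        self._seen.move_to_end(mm_hash)
        while len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # evict the least recently used hash

    def forget(self, mm_hash: str) -> None:
        self._seen.pop(mm_hash, None)


def build_kwargs_slots(
    mm_hashes: List[str], pixels_by_hash: Dict[str, Any], sent: SentHashLRU
) -> Optional[List[Any]]:
    """Per-item slots aligned to mm_hashes: pixels on first send, None afterwards.

    Returns None when every item is hash-only, mirroring "omit the kwargs entirely".
    """
    slots: List[Any] = []
    for h in mm_hashes:
        if sent.seen(h):
            slots.append(None)
        else:
            slots.append(pixels_by_hash[h])
            sent.record(h)
    return None if all(s is None for s in slots) else slots


def strip_pixels(mm_data: Dict[str, Any]) -> Dict[str, Any]:
    """Descriptor-only copy: keep everything except the pixel payload."""
    return {k: v for k, v in mm_data.items() if k != "pixel_values"}


sent = SentHashLRU(max_size=2)
pixels = {"a": [0.1, 0.2], "b": [0.3]}
first = build_kwargs_slots(["a", "b"], pixels, sent)   # both images sent in full
second = build_kwargs_slots(["a", "b"], pixels, sent)  # all hash-only
sent.forget("a")                                       # simulate a cache-miss fallback
third = build_kwargs_slots(["a", "b"], pixels, sent)   # "a" is re-sent, "b" stays hash-only
stripped = strip_pixels({"pixel_values": pixels, "mm_hashes": ["a", "b"]})
```

One simplification to note: the sketch records a hash as soon as the slot is built, whereas a real client would more plausibly record it only after the request succeeds, and call the equivalent of `forget` when the engine reports a miss.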