gemini_context_cache_manager.create gate uses request total tokens, not cacheable prefix tokens #5847

@rannn505

Summary

GeminiContextCacheManager._create_new_cache_with_contents decides whether to call caches.create() using a token count that represents the full request (system_instruction + tools + all chat history + user turn), but the actual caches.create() call only sends the cacheable prefix (system_instruction + tools + contents[:cache_contents_count]).

This mismatch causes caches.create() to fail with 400 INVALID_ARGUMENT ("The cached content is of N tokens. The minimum token count to start caching is 4096.") on every cold-prefix request where chat history pushes the full prompt count to 4096 or above while the prefix itself stays below it.

Worse, on failure the manager returns CacheMetadata(fingerprint=..., contents_count=..., cache_name=None). Because cache_name is unset, the next request with the same fingerprint takes the same path and tries to create the cache again. The result is a create-fail loop that repeats on every request: zero cache hits and a flood of "Failed to create cache: 400 INVALID_ARGUMENT" warnings.

Affected version

  • google-adk == 1.34.1

Environment

  • Model: gemini-3.5-flash on Vertex AI
  • ContextCacheConfig(cache_intervals=10, ttl_seconds=1800, min_tokens=4096)
  • App-level config (App.context_cache_config)
  • ~24h prod observation: hundreds of failures/min, cache_hit_pct ~= 0% across all routed models

Reproduction (code path walkthrough)

  1. google/adk/flows/llm_flows/context_cache_processor.py:81

    llm_request.cacheable_contents_token_count = previous_token_count

    previous_token_count is the prior response's usage_metadata.prompt_token_count, i.e. the full prompt the model just saw (system_instruction + tools + chat history + the latest user turn).

  2. google/adk/models/gemini_context_cache_manager.py:312-324 (gate inside _create_new_cache_with_contents):

    if cacheable_contents_token_count < ctx_cache_config.min_tokens:
        logger.info("Previous request too small to cache: ...")
        return None
    if cacheable_contents_token_count < _GEMINI_MIN_CACHE_TOKENS:
        logger.info("Token count below Gemini minimum: ...")
        return None

    This uses the full prompt count from step 1 as the gate.

  3. google/adk/models/gemini_context_cache_manager.py:372-441 (_create_gemini_cache):

    cache_contents = llm_request.contents[:cache_contents_count]   # line 388
    ...
    cache_config = CreateCachedContentConfig(
        system_instruction=llm_request.config.system_instruction,
        tools=llm_request.config.tools,
        contents=cache_contents,
        ...
    )
    await self.genai_client.aio.caches.create(model=..., config=cache_config)   # line 423

    The actual caches.create() call only sends the prefix slice. The server measures that prefix, finds it < 4096, and rejects with 400.

  4. On failure (gemini_context_cache_manager.py:~133), the manager returns:

    CacheMetadata(fingerprint=..., contents_count=cache_contents_count)

    with no cache_name. The next request matches the same fingerprint, sees no cache_name, and goes back through _create_new_cache_with_contents — same gate, same failure.
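The four steps above can be reproduced in isolation with a small simulation. This is a minimal sketch, not ADK code: CacheMetadata here is a simplified stand-in, and fake_caches_create and handle_request are hypothetical names that only model the behavior described in this report (the gate sees the full prompt count, the server sees the prefix).

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Server-side minimum quoted in the 400 error message.
GEMINI_MIN_CACHE_TOKENS = 4096


@dataclass
class CacheMetadata:
    """Simplified stand-in for ADK's CacheMetadata (assumption)."""
    fingerprint: str
    contents_count: int
    cache_name: Optional[str] = None


def fake_caches_create(prefix_tokens: int) -> str:
    """Hypothetical server stub: it measures only the prefix it receives."""
    if prefix_tokens < GEMINI_MIN_CACHE_TOKENS:
        raise ValueError(
            f"The cached content is of {prefix_tokens} tokens. The minimum "
            f"token count to start caching is {GEMINI_MIN_CACHE_TOKENS}."
        )
    return "cachedContents/example"


def handle_request(
    full_prompt_tokens: int, prefix_tokens: int, min_tokens: int = 4096
) -> Tuple[Optional[CacheMetadata], bool]:
    """Returns (metadata, attempted_create) for one request."""
    # Steps 1-2: the gate compares the FULL prompt count.
    if full_prompt_tokens < max(min_tokens, GEMINI_MIN_CACHE_TOKENS):
        return None, False
    # Step 3: the create call only carries the prefix.
    try:
        cache_name = fake_caches_create(prefix_tokens)
    except ValueError:
        cache_name = None
    # Step 4: on failure the metadata comes back without a cache_name,
    # so nothing stops the next request from trying again.
    return CacheMetadata("fp", 3, cache_name), True


failures = 0
for _ in range(3):
    meta, attempted = handle_request(full_prompt_tokens=6000, prefix_tokens=1820)
    if attempted and meta.cache_name is None:
        failures += 1
print(failures)  # 3: every request with this fingerprint fails the same way
```

With a 6000-token full prompt and an 1820-token prefix, every request passes the gate and every create attempt is rejected, which matches the production pattern reported here.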

Observed log spam (production)

WARNING Failed to create cache: 400 INVALID_ARGUMENT.
{'error': {'code': 400, 'message': 'The cached content is of 1820 tokens. The minimum token count to start caching is 4096.', 'status': 'INVALID_ARGUMENT'}}

Token counts in real failures: 1820, 1857, 1991, 1829 — all well below 4096 because the prefix (system_instruction + tools + few-content sub-agents) is small even when the full chat history pushes the gate's measurement above 4096.

Suggested fix

The gate must be measured on the same payload caches.create() will receive. Two options:

  1. Estimate prefix tokens at gate time. Compute system_instruction + tools + contents[:cache_contents_count] token count locally and compare against min_tokens / _GEMINI_MIN_CACHE_TOKENS. This avoids the extra count_tokens API round trip if you can use the on-device tokenizer; otherwise one client.models.count_tokens call is cheap relative to letting the request fail at the server.
  2. Cache the negative result by fingerprint. Even with the gate fix, transient API rejections still happen; refusing to retry the same fingerprint for ttl_seconds would prevent the infinite-loop log flood.

Either fix on its own breaks the loop; both together would be ideal.
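Both options could be combined in one small gate object. The sketch below is illustrative only: PrefixGate, should_create, and record_failure are hypothetical names, not ADK APIs. The token counter is injected as a callable, so it could be backed by a local tokenizer or by a client.models.count_tokens call over the same prefix payload. It also assumes that a given fingerprint always maps to the same prefix, so a rejected fingerprint can be skipped without recounting.

```python
import time
from typing import Any, Callable, Dict

GEMINI_MIN_CACHE_TOKENS = 4096


class PrefixGate:
    """Hypothetical gate: measures the prefix and remembers rejections."""

    def __init__(
        self,
        count_prefix_tokens: Callable[[Any], int],
        min_tokens: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._count = count_prefix_tokens
        self._min = max(min_tokens, GEMINI_MIN_CACHE_TOKENS)
        self._ttl = ttl_seconds
        self._clock = clock
        # fingerprint -> time until which create attempts are skipped
        self._rejected_until: Dict[str, float] = {}

    def should_create(self, fingerprint: str, prefix: Any) -> bool:
        # Option 2: skip fingerprints that were rejected recently.
        expiry = self._rejected_until.get(fingerprint)
        if expiry is not None:
            if self._clock() < expiry:
                return False
            del self._rejected_until[fingerprint]
        # Option 1: gate on the payload caches.create() would receive.
        if self._count(prefix) < self._min:
            self.record_failure(fingerprint)  # avoid recounting next time
            return False
        return True

    def record_failure(self, fingerprint: str) -> None:
        """Call this when caches.create() itself is rejected."""
        self._rejected_until[fingerprint] = self._clock() + self._ttl


# Toy usage: len() stands in for a real token counter.
gate = PrefixGate(count_prefix_tokens=len, min_tokens=4096, ttl_seconds=1800)
print(gate.should_create("fp-1", ["sys", "tool"]))  # False: toy count of 2 is below the minimum
```

Recording a negative result for undersized prefixes is a design choice: it trades a little memory for not re-measuring a prefix that cannot have grown under the same fingerprint.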

Workaround for users

Set App.context_cache_config = None to disable caching entirely until a fix lands. There is no per-agent disable in 1.34.1: App.context_cache_config is global (llm_agent.py:266), so the only safe workaround is to disable caching entirely.

Metadata

Labels

  • core [Component]: This issue is related to the core interface and implementation
  • request clarification [Status]: The maintainer needs clarification or more information from the author
