gemini_context_cache_manager.create gate uses request total tokens, not cacheable prefix tokens

## Summary

`GeminiContextCacheManager._create_new_cache_with_contents` decides whether to call `caches.create()` using a token count that represents the **full request** (system_instruction + tools + all chat history + user turn), but the actual `caches.create()` call only sends the **cacheable prefix** (system_instruction + tools + `contents[:cache_contents_count]`).

This mismatch causes `caches.create()` to fail with `400 INVALID_ARGUMENT` ("The cached content is of <N> tokens. The minimum token count to start caching is 4096.") on every cold-prefix request whose chat history happens to push the full token count over 4096 while the prefix itself stays below.

Worse, on failure the manager returns `CacheMetadata(fingerprint=..., contents_count=..., cache_name=None)`, so the next request with the same fingerprint finds no `cache_name`, takes the same path, and tries to create again. The result is an infinite create-fail loop: zero cache hits and a flood of `Failed to create cache: 400 INVALID_ARGUMENT` warnings.

## Affected version

- `google-adk == 1.34.1`

## Environment

- Model: `gemini-3.5-flash` on Vertex AI
- `ContextCacheConfig(cache_intervals=10, ttl_seconds=1800, min_tokens=4096)`
- App-level config (`App.context_cache_config`)
- ~24h prod observation: hundreds of failures/min, `cache_hit_pct ~= 0%` across all routed models

## Reproduction (code path walkthrough)

1. `google/adk/flows/llm_flows/context_cache_processor.py:81`
   ```python
   llm_request.cacheable_contents_token_count = previous_token_count
   ```
   `previous_token_count` is the **prior response's** `usage_metadata.prompt_token_count` — i.e. the full prompt the model just saw (incl. chat history + tools + system_instruction + the latest user turn).

2. `google/adk/models/gemini_context_cache_manager.py:312-324` (gate inside `_create_new_cache_with_contents`):
   ```python
   if cacheable_contents_token_count < ctx_cache_config.min_tokens:
       logger.info("Previous request too small to cache: ...")
       return None
   if cacheable_contents_token_count < _GEMINI_MIN_CACHE_TOKENS:
       logger.info("Token count below Gemini minimum: ...")
       return None
   ```
   This uses the full prompt count from step 1 as the gate.

3. `google/adk/models/gemini_context_cache_manager.py:372-441` (`_create_gemini_cache`):
   ```python
   cache_contents = llm_request.contents[:cache_contents_count]   # line 388
   ...
   cache_config = CreateCachedContentConfig(
       system_instruction=llm_request.config.system_instruction,
       tools=llm_request.config.tools,
       contents=cache_contents,
       ...
   )
   await self.genai_client.aio.caches.create(model=..., config=cache_config)   # line 423
   ```
   The actual `caches.create()` call only sends the prefix slice. The server measures *that* prefix, finds it < 4096, and rejects with 400.

4. On failure (`gemini_context_cache_manager.py:~133`), the manager returns:
   ```python
   CacheMetadata(fingerprint=..., contents_count=cache_contents_count)
   ```
   with no `cache_name`. The next request matches the same fingerprint, sees no `cache_name`, and goes back through `_create_new_cache_with_contents` — same gate, same failure.
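
The four steps above can be condensed into a toy model. This is only a sketch and does not use ADK code: `ToyCacheMetadata`, `server_create`, and `handle_request` are hypothetical stand-ins, the prefix size of 1820 is taken from the logged error, and the full prompt count of 5200 is an assumed value chosen only to be above 4096.

```python
from dataclasses import dataclass
from typing import Optional

# Server-side minimum quoted in the 400 error message.
GEMINI_MIN_CACHE_TOKENS = 4096


@dataclass
class ToyCacheMetadata:
    """Hypothetical stand-in for CacheMetadata (fields from the walkthrough only)."""
    fingerprint: str
    contents_count: int
    cache_name: Optional[str] = None


def server_create(prefix_tokens: int) -> str:
    """Mimics the server: it measures only the prefix it was actually sent."""
    if prefix_tokens < GEMINI_MIN_CACHE_TOKENS:
        raise ValueError(
            f"The cached content is of {prefix_tokens} tokens. "
            f"The minimum token count to start caching is {GEMINI_MIN_CACHE_TOKENS}."
        )
    return "cachedContents/toy"


def handle_request(full_prompt_tokens: int, prefix_tokens: int,
                   min_tokens: int = 4096) -> Optional[ToyCacheMetadata]:
    """Gate on the full prompt count (steps 1-2), create with the prefix (step 3)."""
    if full_prompt_tokens < min_tokens:
        return None  # gate closed: no create attempt at all
    try:
        name = server_create(prefix_tokens)
    except ValueError:
        name = None  # step 4: metadata comes back without a cache_name
    return ToyCacheMetadata("fp-1", contents_count=2, cache_name=name)


# Same fingerprint, three consecutive requests: every one retries and fails.
failures = 0
for _ in range(3):
    meta = handle_request(full_prompt_tokens=5200, prefix_tokens=1820)
    if meta is not None and meta.cache_name is None:
        failures += 1
print(failures)
```

Because nothing in the returned metadata records the failure, each iteration passes the gate and is rejected again, which is the loop described in step 4.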

## Observed log spam (production)

```
WARNING Failed to create cache: 400 INVALID_ARGUMENT.
{'error': {'code': 400, 'message': 'The cached content is of 1820 tokens. The minimum token count to start caching is 4096.', 'status': 'INVALID_ARGUMENT'}}
```

Token counts in real failures: 1820, 1857, 1991, 1829. All are well below 4096 because the prefix (system_instruction + tools + the contents of sub-agents that carry only a few contents) is small, even when the full chat history pushes the gate's measurement above 4096.

## Suggested fix

The gate must be measured on the same payload `caches.create()` will receive. Two options:

1. **Estimate prefix tokens at gate time.** Compute `system_instruction + tools + contents[:cache_contents_count]` token count locally and compare against `min_tokens` / `_GEMINI_MIN_CACHE_TOKENS`. This avoids the extra `count_tokens` API round trip if you can use the on-device tokenizer; otherwise one `client.models.count_tokens` call is cheap relative to letting the request fail at the server.
2. **Cache the negative result by fingerprint.** Even with the gate fix, transient API rejections still happen; refusing to retry the same fingerprint for `ttl_seconds` would prevent the infinite-loop log flood.

Either fix on its own breaks the loop; both together would be ideal.
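
A minimal sketch of both options, under stated assumptions: it is not ADK code, every name in it (`prefix_passes_gate`, `rough_token_count`, `NegativeCreateCache`) is hypothetical, and the prefix is modelled as plain strings with a crude 4-characters-per-token estimate. A real implementation would plug in a proper tokenizer or a `count_tokens` call via the `count_tokens` parameter.

```python
import time
from typing import Callable, Dict, Sequence

# Minimum quoted in the server's 400 error message.
MIN_CACHE_TOKENS = 4096


def rough_token_count(parts: Sequence[str]) -> int:
    """Crude stand-in tokenizer: roughly 4 characters per token (assumption)."""
    return sum(len(p) for p in parts) // 4


def prefix_passes_gate(
    system_instruction: str,
    tools: Sequence[str],
    contents: Sequence[str],
    cache_contents_count: int,
    min_tokens: int,
    count_tokens: Callable[[Sequence[str]], int] = rough_token_count,
) -> bool:
    """Option 1: gate on the same payload the create call would send."""
    prefix = [system_instruction, *tools, *contents[:cache_contents_count]]
    return count_tokens(prefix) >= max(min_tokens, MIN_CACHE_TOKENS)


class NegativeCreateCache:
    """Option 2: remember failed fingerprints and skip retries for a TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._failed_at: Dict[str, float] = {}

    def record_failure(self, fingerprint: str) -> None:
        self._failed_at[fingerprint] = self._clock()

    def should_skip(self, fingerprint: str) -> bool:
        failed_at = self._failed_at.get(fingerprint)
        if failed_at is None:
            return False
        if self._clock() - failed_at >= self._ttl:
            del self._failed_at[fingerprint]  # entry expired: allow a retry
            return False
        return True
```

In this sketch the caller would check `should_skip(fingerprint)` before attempting a create and call `record_failure(fingerprint)` when the create raises. The clock is injectable so the expiry logic can be tested without sleeping.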

## Workaround for users

Set `App.context_cache_config = None` to disable caching entirely until a fix lands. There is no per-agent disable in 1.34.1 — `App.context_cache_config` is global (`llm_agent.py:266`), and the only safe workaround is full disable.
