feat: meter Gemini thinking tokens and grounding requests #3178
- Thinking tokens: Extracted from standard completion tokens to ensure they are billed accurately at the correct model-specific rate.
- Grounding requests: Added flat-fee metering for Google Search by tracking `grounding_metadata` across both streaming and non-streaming responses.
- Pricing updates: Corrected stale rates for Gemini 2.5 Flash output, cached tokens, thinking tokens, and grounding requests.
Are thinking tokens not already included in output tokens in the usage object?
Yes, they are already included, but we split them because they are billed at different rates, like this:
Pull request overview
This PR updates Gemini metering in the AI chat driver to correctly account for Gemini “thinking” tokens (billed at a distinct rate) and to add flat-fee metering for grounded Google Search requests by detecting grounding_metadata in both streaming and non-streaming responses.
Changes:
- Split Gemini `reasoning_tokens` (“thinking tokens”) out of `completion_tokens` and meter each at its own model-specific rate.
- Add `grounding_requests` usage metering (1 per response when `grounding_metadata` is present) for streaming and non-streaming Gemini completions.
- Update Gemini model pricing entries to include `thinking_tokens` and `grounding_requests` rates (and refresh some existing token rates).
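To make the first two bullets concrete, here is a minimal sketch of the usage shaping they describe. It is not the code in `GeminiChatProvider.ts`: the `RawUsage` shape, the location of `reasoning_tokens` and `cached_tokens` under `*_details`, and the function name `shapeGeminiUsage` are all assumptions made for illustration. The only things taken from this PR are that thinking tokens arrive already counted inside `completion_tokens`, that cached tokens are excluded, and that a grounded response counts as one grounding request.

```typescript
// Hypothetical input shape (assumption): an OpenAI-style usage object in which
// completion_tokens already includes the reasoning ("thinking") tokens.
interface RawUsage {
    prompt_tokens: number;
    completion_tokens: number;
    prompt_tokens_details?: { cached_tokens?: number };
    completion_tokens_details?: { reasoning_tokens?: number };
}

// Output keys follow the cost keys named in this PR.
interface MeteredUsage {
    prompt_tokens: number;
    cached_tokens: number;
    completion_tokens: number;
    thinking_tokens: number;
    grounding_requests: number;
}

// Hypothetical helper: splits each token count so that nothing is billed twice.
function shapeGeminiUsage(usage: RawUsage, hasGroundingMetadata: boolean): MeteredUsage {
    const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
    const thinking = usage.completion_tokens_details?.reasoning_tokens ?? 0;
    return {
        // Cached tokens are metered under their own key, so remove them here.
        prompt_tokens: Math.max(0, usage.prompt_tokens - cached),
        cached_tokens: cached,
        // Thinking tokens are part of completion_tokens, so subtract before metering.
        completion_tokens: Math.max(0, usage.completion_tokens - thinking),
        thinking_tokens: thinking,
        // Flat fee: one grounding request per response that carried grounding metadata.
        grounding_requests: hasGroundingMetadata ? 1 : 0,
    };
}
```

Subtracting rather than adding matters here: if thinking tokens were metered on top of an unmodified `completion_tokens`, they would be billed twice.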
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/backend/drivers/ai-chat/utils/OpenAIUtil.js` | Captures streamed `extra_content` and forwards it into the usage calculator for provider-specific metering. |
| `src/backend/drivers/ai-chat/providers/gemini/models.ts` | Adds/updates Gemini cost keys for `thinking_tokens` and `grounding_requests` (and adjusts some stale token rates). |
| `src/backend/drivers/ai-chat/providers/gemini/GeminiChatProvider.ts` | Implements Gemini-specific usage shaping: cached token exclusion, thinking token split, and grounding request detection. |
| `src/backend/drivers/ai-chat/providers/gemini/GeminiChatProvider.test.ts` | Updates expected usage shapes and adds unit tests for thinking-token and grounding-request metering (streaming + non-streaming). |
// Cast to access Gemini-specific extras passed alongside usage:
// - choices: non-stream grounding metadata lives in choices[0].message.extra_content
// - extra_content: streaming grounding metadata accumulated by the stream handler
const { usage, choices, extra_content } = args as {
@@ -285,6 +286,7 @@ export const create_chat_stream_handler =
// Apps have to choose to handle extra_content themselves, it doesn't seem like theres a way we can do it in a backwards
// compatible fashion since most streaming apps will handle chat history by continuously updating content themselves
// This doesn't present us a chance to add in an extra object for gemini's chat continuing features
last_extra_content = choice.delta.extra_content;
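One caveat with a plain assignment like this: if a later chunk carries `extra_content` without `grounding_metadata`, it replaces an earlier chunk that had it, and the grounding request would go unmetered. A minimal sketch of a guard follows. The helper names `containsGroundingMetadata` and `mergeExtraContent` are hypothetical, and because the exact nesting of `grounding_metadata` inside `extra_content` is not shown in this thread, the check searches for the key at any depth rather than assuming a path.

```typescript
// Hypothetical helper: true if a grounding_metadata key with a non-null value
// appears anywhere inside the given value. The nesting is an assumption, so
// the whole object is searched instead of a fixed path.
function containsGroundingMetadata(value: unknown): boolean {
    if (value === null || typeof value !== "object") return false;
    if (Array.isArray(value)) return value.some((item) => containsGroundingMetadata(item));
    const record = value as Record<string, unknown>;
    if (record["grounding_metadata"] != null) return true;
    return Object.values(record).some((item) => containsGroundingMetadata(item));
}

// Hypothetical helper: picks which extra_content to remember after each chunk.
function mergeExtraContent(previous: unknown, incoming: unknown): unknown {
    // Chunks without extra_content never clear what was already captured.
    if (incoming == null) return previous;
    // Keep the earlier object if it had grounding metadata and the new one does not.
    if (containsGroundingMetadata(previous) && !containsGroundingMetadata(incoming)) {
        return previous;
    }
    return incoming;
}
```

In the handler this would read `last_extra_content = mergeExtraContent(last_extra_content, choice.delta.extra_content);`. The trade-off is that other fields on the later chunk are dropped when the earlier one wins; tracking a separate boolean flag for grounding would avoid that.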
@ProgrammerIn-wonderland is this mergeable?
Salazareo left a comment
Code itself looks OK, but prices are a bit off. I'm gonna fix those and merge this.
Thanks for the contribution!
prompt_tokens: 30,
completion_tokens: 250,
cached_tokens: 3,
completion_tokens: 100,
These look off; pretty sure they're still 250.
https://ai.google.dev/gemini-api/docs/pricing#gemini-2.5-flash
completion_tokens: 300,
thinking_tokens: 300,
cached_tokens: 5,
grounding_requests: 1_400_000,
these are cheaper on some of these models
Pricing corrections (verified against ai.google.dev/gemini-api/docs/pricing):
- gemini-2.5-flash output is $2.50/M, not $1.00/M: restore completion_tokens to 250 and bill thinking_tokens at the same output rate (250). The previous 100 under-billed output ~60%, and this is the provider's default model.
- gemini-2.5-flash cache read is $0.03/M: restore cached_tokens to 3 (the 7.5 value over-billed).
- Grounding with Google Search is $35 / 1,000 requests for Gemini 2.x models and $14 / 1,000 for 3.x. Set grounding_requests to 3_500_000 for gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-flash-lite and gemini-2.5-pro; 3.x models keep 1_400_000.

Streaming robustness:
- In create_chat_stream_handler, don't let a later extra_content chunk without grounding_metadata overwrite an earlier one that carried it, so grounding requests are still metered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
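The figures in this commit message line up if the cost keys are stored in millionths of a US cent per unit: $2.50 per 1M tokens becomes 250 per token, and $35 per 1,000 requests becomes 3_500_000 per request. That unit is inferred from the numbers above, not confirmed from the codebase, so the following is only a sketch for sanity-checking rates. All function and constant names are hypothetical.

```typescript
// Assumption: rates are in millionths of a US cent per unit (100,000,000 per dollar).
// Inferred from "$2.50/M <-> 250" and "$35 / 1,000 <-> 3_500_000" in the commit message.
const RATE_UNITS_PER_DOLLAR = 100_000_000;

// Dollars per 1M tokens -> rate units per token.
function rateFromDollarsPerMillion(dollars: number): number {
    return Math.round((dollars * RATE_UNITS_PER_DOLLAR) / 1_000_000);
}

// Dollars per 1,000 requests -> rate units per request.
function rateFromDollarsPerThousand(dollars: number): number {
    return Math.round((dollars * RATE_UNITS_PER_DOLLAR) / 1_000);
}

// Sum of usage[key] * rates[key]; keys without a rate cost nothing.
function totalCostUnits(usage: Record<string, number>, rates: Record<string, number>): number {
    let total = 0;
    for (const [key, amount] of Object.entries(usage)) {
        total += amount * (rates[key] ?? 0);
    }
    return total;
}

// Hypothetical rate table for gemini-2.5-flash, built only from the prices quoted above.
const gemini25FlashRates: Record<string, number> = {
    completion_tokens: rateFromDollarsPerMillion(2.5),
    thinking_tokens: rateFromDollarsPerMillion(2.5),
    cached_tokens: rateFromDollarsPerMillion(0.03),
    grounding_requests: rateFromDollarsPerThousand(35),
};

// Fraction by which the stale rate of 100 under-billed output relative to 250.
const staleUnderBilling = (250 - 100) / 250;
```

Under that assumed unit, a response with 1,000 output tokens, 500 thinking tokens, and one grounded search totals 250,000 + 125,000 + 3,500,000 = 3,875,000 units, or about $0.039. The flat grounding fee dominates small responses, which is why the 2.x versus 3.x rate difference is worth getting right.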
Summary
`grounding_metadata` across both streaming and non-streaming responses.
Closes #3132
Test
`thinking_tokens: 0` and `grounding_requests: 0` in the expected usage shapes