Add native Anthropic, OpenAI, and Gemini SDK support #3

Open
anassg-lago wants to merge 5 commits into main from feature/anthropic-openai-gemini-native

@anassg-lago
Collaborator

Summary

Adds three new native LLM provider integrations to the SDK, bringing total coverage to five (Bedrock, Mistral, Anthropic, OpenAI, Gemini). Also includes a canonical-schema extension for audio-output tokens and a flaky-test fix.

Commits in this PR

Commit   Summary
c23e692  Fix flaky test_repeated_overflow_keeps_window_sliding (race condition with max_batch_size == max_buffer_size)
8da7b82  Native Anthropic SDK support (messages.create sync + stream, messages.stream context manager)
0f79c5a  Native OpenAI SDK support (Chat Completions + Responses API, sync + stream + async)
6c487ab  Native Gemini SDK support (google-genai, sync + stream + async via client.aio.models)

Provider coverage matrix

Provider          Sync  Stream  Async  Tool calls  Reasoning               Cache                  Multimodal
Anthropic native  ✓     ✓       ✓                  ✗ (folded)              ✓ (5m + 1h)
OpenAI native     ✓     ✓       ✓      ✓           ✓ (subset of output)    ✓ (auto)               ✓ (audio)
Gemini native     ✓     ✓       ✓      ✓           ✓ (additive to output)  ✓ (CachedContent API)  ✓ (audio + image, per modality)

Key design decisions

  • CanonicalUsage extended with audio_output, mapped to the llm_audio_output_tokens metric code. Populated by GPT-4o-audio output and Gemini TTS responses.
  • Detector now returns gemini (was google) for google-genai clients — keeps naming consistent with bedrock/anthropic/openai/mistral.
  • OpenAI streaming auto-injects stream_options.include_usage=True when missing. Without it, OpenAI emits no usage on streamed responses — silent under-billing.
  • OpenAI adapter auto-detects Chat Completions vs Responses API by usage-field shape (prompt_tokens vs input_tokens).
  • Reasoning-token semantics differ between providers: OpenAI's reasoning_tokens is a SUBSET of completion_tokens, while Gemini's thoughtsTokenCount is ADDITIVE to candidatesTokenCount. Customers configuring per-metric billing must account for this; the difference is documented in the adapter docstrings and the README.

Tests

  • 237 → 304 unit tests (+19 Anthropic, +18 OpenAI adapter, +9 OpenAI wrapper, +15 Gemini adapter, +6 Gemini wrapper)
  • Live integration tests (skipped without API keys): 3 Anthropic + 5 OpenAI + 4 Gemini — all pass against the real APIs with a mock Lago HTTP server
  • 24 captured response fixtures from real APIs (9 Anthropic + 10 OpenAI + 5 Gemini)
  • Coverage maintained ≥ 80%

Quality gate

  • ruff check src tests — clean
  • mypy --strict src — 0 issues in 21 source files
  • pytest tests/unit — 304/304 pass
  • Live integration tests verified manually against real APIs

Known gaps (intentional, documented)

  • OpenAI Predicted Outputs tokens (accepted_/rejected_prediction_tokens) — not surfaced; would risk double-counting against completion_tokens
  • Gemini Vertex AI mode — same adapter works (same response shape), not specifically tested
  • Multimodal image input on OpenAI — Phase 3

Test plan

  • Review the canonical-schema change (single new field: audio_output)
  • Verify the detector change doesn't break existing kind == "google" consumers (none today)
  • Run unit suite locally (pytest tests/unit -q)
  • Optionally run live integration tests with API keys exported

- src/lago_agent_sdk/adapters/anthropic_native.py — extract_anthropic_native
- src/lago_agent_sdk/wrappers/anthropic.py — wraps messages.create (sync + async,
  streaming and non-streaming) and messages.stream context manager
- Wired into sdk.wrap() dispatch and adapters/__init__.py exports
- anthropic = ["anthropic>=0.30"] optional-dep group
- 19 new unit tests + 3 live integration tests; 256 unit tests pass
- Coverage 80.71% — gate maintained
- 9 captured response fixtures from real Anthropic API
- README + CHANGELOG updated

The test set max_batch_size == max_buffer_size == 100, which caused the
push that brings the buffer to 100 to trigger a wake on the background
worker. The worker would take a batch (emptying the buffer) and then race
with the remaining 150 pushes to call slow_sender. On CI's slower runners
the worker sometimes squeezed in additional batches before slow_sender
finally blocked, leaving the buffer with fewer items than the expected
sliding window.

Setting max_batch_size > max_buffer_size guarantees push() never sets the
wake event (buffer can never reach max_batch_size). Combined with a long
flush_interval the worker only runs once shutdown() releases the pause in
the finally block — fully deterministic. Verified with 5 consecutive runs.
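
The wake condition behind this fix can be modelled in a few lines. This is a minimal sketch, not the SDK's buffer: the `SlidingBuffer` class, its `wake` event, and the drop-oldest overflow policy are assumptions made for illustration. Only the two size parameters come from the commit message, and the 250 pushes are inferred from its "100 plus the remaining 150".

```python
import threading
from collections import deque


class SlidingBuffer:
    """Toy bounded buffer that wakes a worker when a full batch is ready.

    Hypothetical stand-in for the SDK's buffer; only the two size
    parameters are taken from the commit message.
    """

    def __init__(self, max_batch_size: int, max_buffer_size: int) -> None:
        self.max_batch_size = max_batch_size
        # deque(maxlen=...) drops the oldest item on overflow (sliding window).
        self.items = deque(maxlen=max_buffer_size)
        self.wake = threading.Event()

    def push(self, item: int) -> None:
        self.items.append(item)
        # The worker is woken early only when a full batch is available.
        if len(self.items) >= self.max_batch_size:
            self.wake.set()


def wake_fires(max_batch_size: int, max_buffer_size: int, pushes: int = 250) -> bool:
    """Push `pushes` items and report whether the early wake was ever set."""
    buf = SlidingBuffer(max_batch_size, max_buffer_size)
    for i in range(pushes):
        buf.push(i)
    return buf.wake.is_set()
```

In this model `wake_fires(100, 100)` reports that the event was set, while `wake_fires(101, 100)` reports that it never was, because the buffer length is capped below the batch size.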

The OpenAI adapter handles both API shapes with auto-detection:

  Chat Completions (client.chat.completions.create):
    usage.prompt_tokens                                -> input
    usage.completion_tokens                            -> output
    usage.prompt_tokens_details.cached_tokens          -> cache_read
    usage.prompt_tokens_details.audio_tokens           -> audio_input
    usage.completion_tokens_details.reasoning_tokens   -> reasoning   (o-series)
    usage.completion_tokens_details.audio_tokens       -> audio_output
    count of choices[0].message.tool_calls             -> tool_calls

  Responses API (client.responses.create):
    usage.input_tokens                                 -> input
    usage.output_tokens                                -> output
    usage.input_tokens_details.cached_tokens           -> cache_read
    usage.output_tokens_details.reasoning_tokens       -> reasoning
    count of output[].type == "function_call"          -> tool_calls
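
The two mappings above, together with detection by usage-field shape, could look roughly like the following. This is a sketch under assumptions, not the adapter in `src/lago_agent_sdk`: `extract_openai_usage` and `_tokens` are hypothetical names, and the input is a plain dict shaped like the API's JSON rather than an SDK response object.

```python
from __future__ import annotations

from typing import Any


def _tokens(node: Any, *path: str) -> int:
    """Follow `path` through nested dicts; missing or null fields count as 0."""
    for key in path:
        if not isinstance(node, dict):
            return 0
        node = node.get(key)
    return node if isinstance(node, int) else 0


def extract_openai_usage(response: dict[str, Any]) -> dict[str, int]:
    """Hypothetical helper: map either OpenAI response shape to canonical keys."""
    usage = response.get("usage") or {}
    if "prompt_tokens" in usage:
        # Chat Completions shape (prompt_tokens / completion_tokens).
        choices = response.get("choices") or [{}]
        message = choices[0].get("message") or {}
        return {
            "input": _tokens(usage, "prompt_tokens"),
            "output": _tokens(usage, "completion_tokens"),
            "cache_read": _tokens(usage, "prompt_tokens_details", "cached_tokens"),
            "audio_input": _tokens(usage, "prompt_tokens_details", "audio_tokens"),
            "reasoning": _tokens(usage, "completion_tokens_details", "reasoning_tokens"),
            "audio_output": _tokens(usage, "completion_tokens_details", "audio_tokens"),
            "tool_calls": len(message.get("tool_calls") or []),
        }
    # Responses API shape (input_tokens / output_tokens).
    items = response.get("output") or []
    return {
        "input": _tokens(usage, "input_tokens"),
        "output": _tokens(usage, "output_tokens"),
        "cache_read": _tokens(usage, "input_tokens_details", "cached_tokens"),
        "reasoning": _tokens(usage, "output_tokens_details", "reasoning_tokens"),
        "tool_calls": sum(1 for item in items if item.get("type") == "function_call"),
    }
```

Treating absent detail objects as zero keeps the helper usable on responses from models that report no cached, audio, or reasoning tokens.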

Wrapper covers both methods, sync + streaming, on both OpenAI and AsyncOpenAI.
For Chat Completions streaming, auto-injects stream_options.include_usage=true
when missing so the final chunk carries usage data (without that flag, OpenAI
emits no usage on streamed responses).
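
The injection step can be sketched as a small pure function over the call's keyword arguments. `with_usage_on_stream` is a hypothetical name, and the choice to leave an explicit `include_usage=False` untouched is an assumption about what "when missing" means; `stream`, `stream_options` and `include_usage` are the Chat Completions request parameters named above.

```python
from __future__ import annotations

from typing import Any


def with_usage_on_stream(kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return call kwargs with stream_options.include_usage enabled for streams.

    Hypothetical helper; the input dict is never mutated.
    """
    if not kwargs.get("stream"):
        # Non-streaming responses already carry a usage object.
        return kwargs
    stream_options = dict(kwargs.get("stream_options") or {})
    # setdefault only fills the flag when the caller did not set it at all.
    stream_options.setdefault("include_usage", True)
    return {**kwargs, "stream_options": stream_options}
```

A wrapper would apply this to the caller's keyword arguments just before forwarding them to `client.chat.completions.create`.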

CanonicalUsage extended with audio_output (mapped to llm_audio_output_tokens)
to capture GPT-4o-audio output usage.

OpenAI is the first provider to actually populate llm_reasoning_tokens
(o-series surfaces reasoning tokens separately; Anthropic/Bedrock fold them
into output_tokens).

Predicted Outputs tokens (accepted/rejected_prediction_tokens) are
intentionally not surfaced -- documented in the adapter docstring as a
v1 gap.

27 new unit tests (18 adapter + 9 wrapper). 5 live integration tests gated
on OPENAI_API_KEY. 10 captured response fixtures from the real OpenAI API.

Total: 283 unit tests passing, ruff + mypy strict clean.

The Gemini adapter maps usage_metadata fields to CanonicalUsage:

  prompt_token_count                                   -> input
  candidates_token_count                               -> output
  cached_content_token_count                           -> cache_read
  thoughts_token_count                                 -> reasoning
  prompt_tokens_details[modality=AUDIO].token_count    -> audio_input
  prompt_tokens_details[modality=IMAGE].token_count    -> image_input
  candidates_tokens_details[modality=AUDIO].token_count -> audio_output
  count of candidates[0].content.parts[].function_call -> tool_calls
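
A rough illustration of this mapping, including the per-modality lookups, is shown below. It is a sketch under assumptions: `extract_gemini_usage` and `_modality_tokens` are hypothetical names, and the input is a plain dict using the snake_case field names listed above rather than a google-genai response object.

```python
from __future__ import annotations

from typing import Any


def _modality_tokens(details: list[dict[str, Any]] | None, modality: str) -> int:
    """Sum token_count over the detail entries that match one modality."""
    return sum(
        (entry.get("token_count") or 0)
        for entry in (details or [])
        if entry.get("modality") == modality
    )


def extract_gemini_usage(response: dict[str, Any]) -> dict[str, int]:
    """Hypothetical helper: map usage_metadata to canonical usage keys."""
    meta = response.get("usage_metadata") or {}
    candidates = response.get("candidates") or [{}]
    parts = (candidates[0].get("content") or {}).get("parts") or []
    return {
        "input": meta.get("prompt_token_count") or 0,
        "output": meta.get("candidates_token_count") or 0,
        "cache_read": meta.get("cached_content_token_count") or 0,
        "reasoning": meta.get("thoughts_token_count") or 0,
        "audio_input": _modality_tokens(meta.get("prompt_tokens_details"), "AUDIO"),
        "image_input": _modality_tokens(meta.get("prompt_tokens_details"), "IMAGE"),
        "audio_output": _modality_tokens(meta.get("candidates_tokens_details"), "AUDIO"),
        # A part counts as a tool call when it carries a function_call payload.
        "tool_calls": sum(1 for part in parts if part.get("function_call")),
    }
```

The `or 0` fallbacks cover fields that are absent or null, which is how a text-only response without thinking or caching would look in this dict form.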

Wrapper covers client.models.generate_content + generate_content_stream
(sync) and the async variants under client.aio.models. Idempotent via
_lago_instrumented sentinel.
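
The sentinel-based idempotency can be illustrated as follows. Only the `_lago_instrumented` attribute name comes from the commit message; the `instrument` function, its signature, and the `on_result` callback are assumptions for this sketch and are not the wrapper's real interface.

```python
import functools
from typing import Any, Callable


def instrument(obj: Any, method_name: str, on_result: Callable[[Any], None]) -> Any:
    """Wrap obj.<method_name> once; later calls on the same object are no-ops."""
    if getattr(obj, "_lago_instrumented", False):
        # Already wrapped: avoid stacking wrappers and double-reporting usage.
        return obj
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def wrapped(*args: Any, **kwargs: Any) -> Any:
        result = original(*args, **kwargs)
        on_result(result)  # e.g. extract usage and enqueue an event
        return result

    setattr(obj, method_name, wrapped)
    obj._lago_instrumented = True
    return obj
```

Without the sentinel check, wrapping the same client twice would report every call twice.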

Detector now returns 'gemini' (was 'google') for google-genai clients --
matches the naming convention used by other providers (bedrock, anthropic,
openai, mistral).

Semantic note vs OpenAI:
  Gemini's `thoughts_token_count` is ADDITIVE to `candidates_token_count`
  (verified by math across all 5 fixtures: input + output + reasoning = total).
  OpenAI's `reasoning_tokens` is a SUBSET of `completion_tokens`.
  Documented in adapter docstring + README for customers configuring
  per-metric billing.
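
For per-metric billing, the practical consequence is that the same pair of canonical numbers yields different totals depending on the provider. The helper below is an illustrative sketch; `split_output_tokens` is a hypothetical name and not part of the SDK.

```python
def split_output_tokens(provider: str, output: int, reasoning: int) -> dict[str, int]:
    """Split output-side usage into non-overlapping visible and reasoning parts."""
    if provider == "openai":
        # reasoning_tokens is already counted inside completion_tokens.
        visible = output - reasoning
    elif provider == "gemini":
        # thoughts_token_count is reported on top of candidates_token_count.
        visible = output
    else:
        raise ValueError(f"unsupported provider: {provider}")
    return {"visible": visible, "reasoning": reasoning, "total": visible + reasoning}
```

With output=100 and reasoning=30, the OpenAI case gives 70 visible tokens and a total of 100, while the Gemini case gives 100 visible tokens and a total of 130. Billing both the output and reasoning metrics at full value would therefore count the 30 reasoning tokens twice for OpenAI.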

Gemini 2.5 emits reasoning tokens by default (no explicit thinking_config
needed) -- second provider populating llm_reasoning_tokens.

21 new unit tests (15 adapter + 6 wrapper). 4 live integration tests
gated on GEMINI_API_KEY. 5 captured response fixtures (plain, tool use,
streaming, thinking, multi-turn).

Total: 304 unit tests passing, ruff + mypy strict clean.

CI runs `ruff format --check`, which was failing because earlier development
only ran `ruff check` (the linter) locally, not the formatter. Auto-formatting
restores whitespace consistency in:

- src/lago_agent_sdk/adapters/gemini_native.py
- src/lago_agent_sdk/wrappers/openai.py
- tests/unit/adapters/fixtures/capture_openai.py
- tests/unit/adapters/test_gemini_native.py
- tests/unit/test_wrapper_gemini.py

No functional changes.