Support OpenAI Responses API instrumentation#210

Open
sipercai wants to merge 2 commits into main from feat/openai-responses-api
sipercai commented Jun 2, 2026 (Collaborator)

Description

This PR adds latest-experimental OpenAI Responses API instrumentation for OpenAI.responses.create and AsyncOpenAI.responses.create.

The new instrumentation records request and response metadata including token usage, response status, service tier, reasoning request details, cached input tokens, reasoning output tokens, tool definitions, and message content when content capture is enabled. It supports non-streaming calls, stream=True, raw .parse() responses, sync and async clients, and error handling while keeping older OpenAI SDKs compatible.
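As an illustration of the token-usage part of this mapping, here is a minimal sketch, not the PR's actual code. It assumes a Responses-style usage payload passed in as a plain dict. `usage_to_attributes` is a hypothetical helper, and the cached-token attribute name is an assumption chosen by analogy with the reasoning-token attribute named later in this PR.

```python
from typing import Any, Dict, Mapping


def usage_to_attributes(usage: Mapping[str, Any]) -> Dict[str, int]:
    """Flatten a Responses-style usage payload into span attributes."""
    attributes: Dict[str, int] = {}

    def put(key: str, value: Any) -> None:
        # Skip missing counters; bool is an int subclass, so exclude it.
        if isinstance(value, int) and not isinstance(value, bool):
            attributes[key] = value

    put("gen_ai.usage.input_tokens", usage.get("input_tokens"))
    put("gen_ai.usage.output_tokens", usage.get("output_tokens"))

    input_details = usage.get("input_tokens_details") or {}
    # Assumed attribute name; the PR text does not spell this one out.
    put(
        "gen_ai.usage.input_tokens_details.cached_tokens",
        input_details.get("cached_tokens"),
    )

    output_details = usage.get("output_tokens_details") or {}
    put(
        "gen_ai.usage.output_tokens_details.reasoning_tokens",
        output_details.get("reasoning_tokens"),
    )
    return attributes


example = usage_to_attributes(
    {
        "input_tokens": 12,
        "output_tokens": 30,
        "input_tokens_details": {"cached_tokens": 4},
        "output_tokens_details": {"reasoning_tokens": 8},
    }
)
```

Counters that are absent from the payload are simply left off the span, so a provider that omits the detail objects still yields valid attributes.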

Follow-up updates also cover Responses helper paths: responses.stream(model=..., input=...), existing-response streaming through responses.stream(response_id=...), and responses.parse() / AsyncResponses.parse().

Fixes #209

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Validation Evidence

  • .tox/py311-test-instrumentation-openai-v2-latest/bin/python -m pytest instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -q - 20 passed.
  • uvx tox -e py311-test-instrumentation-openai-v2-latest -- instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -ra - 146 passed, 2 existing async stream close warnings.
  • uvx tox -e py311-test-instrumentation-openai-v2-oldest -- instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -ra - 106 passed, 2 skipped, 40 existing/deprecation warnings.
  • uvx tox -e lint-instrumentation-openai-v2 - passed, pylint 10.00/10.
  • uvx tox -e precommit - passed.
  • git diff --check - passed.
  • Claude team review loop - /tmp/codex-claude-review/loongsuite-python-agent-a82236395d/run-20260603-161349, rounds r1-r4 completed; final P2 registry/constant follow-up deferred.
  • Weaver JSON live-check sample - /tmp/openai-responses-weaver-sample.json, report /tmp/openai-responses-weaver-report.json. The sample produced 4 mocked Responses spans. Weaver ran successfully but reported that the local registry does not yet define gen_ai.openai.response.status, gen_ai.openai.request.previous_response_id, or gen_ai.usage.output_tokens_details.reasoning_tokens; this is a registry/schema follow-up rather than an instrumentation runtime failure.
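The kind of gap Weaver reported can be reproduced in miniature. The sketch below is a stand-in for that check, not Weaver itself: `undefined_attributes` is a hypothetical helper, and the `registry` set is an assumed, deliberately incomplete subset of attribute names.

```python
from typing import Dict, Iterable, List, Set


def undefined_attributes(
    spans: Iterable[Dict[str, object]], registry: Set[str]
) -> List[str]:
    """Return attribute names used by any span but missing from the registry."""
    seen: Set[str] = set()
    for attributes in spans:
        seen.update(attributes)
    return sorted(seen - registry)


# Assumed registry subset, for illustration only.
registry = {
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
}

# One mocked span carrying the three attributes named in the report.
spans = [
    {
        "gen_ai.request.model": "example-model",
        "gen_ai.openai.response.status": "completed",
        "gen_ai.openai.request.previous_response_id": "resp_123",
        "gen_ai.usage.output_tokens_details.reasoning_tokens": 8,
    }
]

missing = undefined_attributes(spans, registry)
```

Here `missing` lists the three names that the span emits and the registry lacks, which is a schema gap rather than a runtime failure of the instrumentation.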

Note: check_loongsuite_pr_readiness.py --repo . is not applicable for this upstream-style instrumentation-genai/opentelemetry-instrumentation-openai-v2 change. The checker currently rejects non-instrumentation-loongsuite/ plugin paths as forbidden for new LoongSuite plugin PRs.

Does This PR Require a Core Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

See contributing.md for styleguide, changelog guidelines, and more.

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

sipercai left a comment (Collaborator, Author)

Thanks for adding Responses API instrumentation. I verified the PR on the head commit and the core responses.create paths are working. I also found several official OpenAI Responses SDK helper surfaces from #209 that still need to be fixed or explicitly scoped before this can be considered complete.

What I verified locally:

  • PR head: 46e58a70 (feat/openai-responses-api).
  • SDK surface checked with openai==1.109.1: Responses / AsyncResponses expose create, stream, retrieve, parse, input_items, cancel, and delete.
  • Focused checks passed:
    • uvx tox -e py311-test-instrumentation-openai-v2-latest -- instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -ra: 132 passed.
    • uvx tox -e py311-test-instrumentation-openai-v2-oldest -- instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -ra: 106 passed, 2 skipped.
    • uvx tox -e lint-instrumentation-openai-v2: passed.
    • uvx tox -e precommit: passed.
  • Live smoke against an OpenAI-compatible provider:
    • client.responses.create(...): request succeeded and produced 1 GenAI span with request/response model and token attributes.
    • client.responses.create(..., stream=True): request succeeded and produced 1 GenAI span with request/response model and token attributes.
    • client.responses.stream(model=..., input=...): request succeeded and produced 1 GenAI span, but emitted invalid OpenTelemetry attribute warnings for openai.Omit sentinel values.
    • client.responses.stream(response_id=...): request succeeded, but instrumentation produced 0 GenAI spans.
    • client.responses.parse(...): request succeeded and returned a parsed object, but instrumentation produced 0 GenAI spans.

Findings to address:

  1. responses.stream(model=..., input=...) passes OpenAI SDK Omit sentinels into the new telemetry mapping.

    The SDK helper delegates to self.create(..., stream=True) and forwards omitted optional parameters as openai.Omit. The PR's value_is_set() only filters openai.NotGiven, so the new Responses mapping treats Omit as a real value. In live smoke this produced invalid OpenTelemetry attribute warnings for fields such as gen_ai.request.temperature, gen_ai.request.top_p, gen_ai.openai.request.previous_response_id, gen_ai.openai.request.background, gen_ai.openai.request.store, and gen_ai.openai.request.parallel_tool_calls.

    Please treat openai.Omit the same as NotGiven at value_is_set() or at the Responses mapping boundary, and add a test for client.responses.stream(model=..., input=...) that asserts no invalid attributes/warnings are emitted.

  2. responses.stream(response_id=...) / async existing-response streaming is not instrumented.

    The OpenAI SDK's responses.stream(response_id=..., starting_after=...) helper uses retrieve(stream=True), not create(stream=True). This PR only wraps Responses.create and AsyncResponses.create, so the existing-response stream helper is outside the current instrumentation. A live smoke request succeeded but produced 0 GenAI spans.

    Please either support this path, for example by wrapping Responses.retrieve / AsyncResponses.retrieve or by explicitly handling the stream helper, or document why existing-response streaming is intentionally out of scope for #209. If it is in scope, add sync and async tests that demonstrate that the span count goes from 0 to 1 and that token/model attributes are preserved.

  3. responses.parse() / AsyncResponses.parse() structured-output helpers are not instrumented or scoped.

    The current OpenAI SDK exposes responses.parse() as a structured-output helper, and it does not call the wrapped create() method; it has its own POST/parser path. In live smoke, client.responses.parse(...) succeeded and returned a parsed object, but instrumentation produced 0 GenAI spans. OpenLLMetry also treats Responses.parse as a separate wrapper surface, which is a useful signal that this is not just an alias of create().

    Please either instrument Responses.parse / AsyncResponses.parse, or explicitly document that structured-output helpers are deferred to a follow-up issue. Given #209 is about supporting the newer OpenAI Responses SDK surface, I would prefer covering it in this PR or at least making the scope boundary explicit.

  4. The test matrix is still too narrow for claiming broad Responses API support.

    The new tests cover direct responses.create, async create, direct stream=True, raw response, status mapping, errors, NO_CONTENT, and a function-tool output. They do not cover the official SDK helper surfaces above, and they do not yet cover built-in Responses tools, multimodal input, previous_response_id / conversation state, background/cancel behavior, async helper parity, concurrency isolation, or SPAN_AND_EVENT content mode.

    Please add at least targeted tests for the helper paths above, and either add or explicitly defer the broader Responses API matrix items. OpenInference has examples/conformance coverage for multimodal, async stream, function calling, file search, web search, and structured outputs; OpenLLMetry covers additional wrapper surfaces such as retrieve and parse. Those should be used only as reference signals, not copied as a schema.
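For finding 1, the fix could look roughly like the sketch below. It is only an illustration: `NotGiven` and `Omit` are local stand-ins for the SDK sentinels so that the snippet runs without the SDK installed, `request_attributes` is a hypothetical mapping helper, and dropping `None` is an assumption about the desired behavior rather than something the PR states.

```python
from typing import Any, Dict


class NotGiven:
    """Local stand-in for openai.NotGiven."""


class Omit:
    """Local stand-in for openai.Omit."""


NOT_GIVEN = NotGiven()
OMIT = Omit()

# Both sentinel types mean "the caller did not pass this parameter".
_UNSET_TYPES = (NotGiven, Omit)


def value_is_set(value: Any) -> bool:
    """Return True only for values that should reach telemetry."""
    # Assumption: None is also treated as unset.
    return value is not None and not isinstance(value, _UNSET_TYPES)


# Attribute names taken from the warnings listed in finding 1.
_REQUEST_MAPPING = {
    "temperature": "gen_ai.request.temperature",
    "top_p": "gen_ai.request.top_p",
    "previous_response_id": "gen_ai.openai.request.previous_response_id",
}


def request_attributes(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Map request kwargs to span attributes, skipping unset values."""
    return {
        attribute: kwargs[name]
        for name, attribute in _REQUEST_MAPPING.items()
        if name in kwargs and value_is_set(kwargs[name])
    }


attributes = request_attributes(
    {"temperature": 0.2, "top_p": OMIT, "previous_response_id": NOT_GIVEN}
)
```

Only `gen_ai.request.temperature` survives in `attributes`. In the real package the sentinel classes would come from the installed SDK, and an older SDK may not define `Omit`, so the lookup would probably need a guard.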

CI is green and the core responses.create implementation is promising, but I do not think this PR fully resolves #209 until the helper-path instrumentation gaps and Omit sentinel handling are fixed or explicitly scoped out.
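The root cause behind findings 2 and 3 can be shown with a toy model. `FakeResponses` below is not the real SDK class; it only mimics the delegation pattern described above, where `stream()` goes through `create()` and `parse()` uses its own request path. `wrap_method` is a hypothetical stand-in for the instrumentation's wrapping step.

```python
import functools
from typing import Any, Dict, List


class FakeResponses:
    """Toy stand-in for the SDK resource; not the real openai class."""

    def _post(self, path: str, body: Dict[str, Any]) -> Dict[str, Any]:
        return {"path": path, "body": body}

    def create(self, **kwargs: Any) -> Dict[str, Any]:
        return self._post("/responses", kwargs)

    def stream(self, **kwargs: Any) -> Dict[str, Any]:
        # Delegates to create(), so a wrapper around create() sees this call.
        return self.create(stream=True, **kwargs)

    def parse(self, **kwargs: Any) -> Dict[str, Any]:
        # Own request path: never calls create(), so the wrapper is bypassed.
        return self._post("/responses", kwargs)


calls: List[str] = []


def wrap_method(cls: type, name: str) -> None:
    """Replace cls.<name> with a wrapper that records each invocation."""
    original = getattr(cls, name)

    @functools.wraps(original)
    def wrapper(self: Any, *args: Any, **kwargs: Any) -> Any:
        calls.append(name)
        return original(self, *args, **kwargs)

    setattr(cls, name, wrapper)


wrap_method(FakeResponses, "create")

client = FakeResponses()
client.create(model="m")
client.stream(model="m")
client.parse(model="m")
```

After the three calls, `calls` holds two entries, both `"create"`: the `stream()` helper is observed through the wrapped `create()`, while `parse()` leaves no trace. That matches the 0-span result seen in the live smoke test and is why `parse` needs its own wrapper.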

sipercai commented Jun 3, 2026 (Collaborator, Author)

Updated this PR to address the Responses API helper-path review feedback:

  • Treat openai.Omit the same as NotGiven, preventing responses.stream(model=..., input=...) from exporting Omit sentinels as invalid attributes.
  • Instrument Responses.parse / AsyncResponses.parse.
  • Instrument existing-response streaming through Responses.retrieve(stream=True) / AsyncResponses.retrieve(stream=True), while keeping non-streaming retrieve as a no-op for GenAI spans.
  • Added focused sync/async tests for the SDK stream helper, parse helpers, existing-response streaming, and non-streaming retrieve no-op behavior.
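The stream-only behavior for retrieve can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `start_span` is a stand-in for a tracer, `instrument_retrieve` and `fake_retrieve` are hypothetical, and the span name is made up.

```python
import functools
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List

finished_spans: List[str] = []


@contextmanager
def start_span(name: str) -> Iterator[None]:
    # Stand-in for a tracer: records the span name when the block exits.
    try:
        yield
    finally:
        finished_spans.append(name)


def instrument_retrieve(retrieve: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap retrieve so that only stream=True calls produce a span."""

    @functools.wraps(retrieve)
    def wrapper(response_id: str, *, stream: Any = False, **kwargs: Any) -> Any:
        if stream is not True:
            # Plain lookup (or an unset sentinel): pass through, no span.
            return retrieve(response_id, stream=stream, **kwargs)
        with start_span("responses.retrieve"):  # hypothetical span name
            return retrieve(response_id, stream=True, **kwargs)

    return wrapper


def fake_retrieve(response_id: str, *, stream: Any = False) -> Dict[str, Any]:
    """Hypothetical stand-in for the SDK method."""
    return {"id": response_id, "stream": stream}


retrieve = instrument_retrieve(fake_retrieve)
plain = retrieve("resp_1")
streamed = retrieve("resp_1", stream=True)
```

After both calls, `finished_spans` holds a single entry, produced by the streaming call. A real implementation would likely keep the span open until the stream is fully consumed or closed, whereas this sketch ends it as soon as the call returns.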

Validation:

  • uvx tox -e py311-test-instrumentation-openai-v2-latest -- instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -ra: 146 passed, 2 existing warnings.
  • uvx tox -e py311-test-instrumentation-openai-v2-oldest -- instrumentation-genai/opentelemetry-instrumentation-openai-v2/tests/test_responses.py -ra: 106 passed, 2 skipped, 40 existing/deprecation warnings.
  • uvx tox -e lint-instrumentation-openai-v2: passed, pylint 10.00/10.
  • uvx tox -e precommit: passed.
  • git diff --check: passed.
  • Claude review loop: /tmp/codex-claude-review/loongsuite-python-agent-a82236395d/run-20260603-161349, rounds r1-r4 completed; final P2 registry/constant follow-up deferred.
  • Weaver JSON live-check sample: /tmp/openai-responses-weaver-sample.json, 4 mocked Responses spans. Weaver ran successfully, but the local registry does not yet define gen_ai.openai.response.status, gen_ai.openai.request.previous_response_id, or gen_ai.usage.output_tokens_details.reasoning_tokens. This is a registry follow-up rather than an instrumentation runtime failure.

Copilot AI left a comment (Contributor)

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Development

Successfully merging this pull request may close these issues.

Support OpenAI Responses API and enrich OpenAI v2 GenAI telemetry

2 participants