Add streaming timing metrics to generic stream wrappers by Nik-Reddy · Pull Request #13 · open-telemetry/opentelemetry-python-genai

Nik-Reddy · 2026-05-13T23:56:01Z

Closes #8

What this does

Adds three streaming timing measurements to the shared utils layer:

gen_ai.client.operation.time_to_first_chunk histogram
gen_ai.response.time_to_first_chunk span attribute
gen_ai.client.operation.time_per_output_chunk histogram (one data point per inter-chunk gap)

All timing logic lives in the stream wrapper base classes (SyncStreamWrapper/AsyncStreamWrapper in util/opentelemetry-util-genai). The OpenAI chat wrappers just pass through the invocation start time and copy measured values back, keeping provider-specific code minimal.

How timing works

TTFC: measured from invocation start (when the HTTP request was issued) to when the first chunk arrives
time_per_output_chunk: measures the blocking read duration for each chunk after the first, so user processing time between pulls doesn't inflate the metric
Uses timeit.default_timer() around the blocking next()/anext() call

Scope

This wires up timing for OpenAI chat completions streams (which use the shared ABC). Anthropic and the Responses API have their own stream wrappers that don't extend the ABC, so they will need separate follow-up work.

Testing

5 unit tests for the timing logic in stream.py (sync, async, single chunk, error cases)
2 integration tests verifying histogram recording in metrics.py
All existing tests still pass

Copilot

Pull request overview

Adds shared streaming timing support in the GenAI utility layer and wires OpenAI v2 chat stream wrappers to propagate those timings into inference telemetry.

Changes:

Adds TTFC and per-output-chunk timing capture to sync/async stream wrapper base classes.
Records new streaming timing histograms and the TTFC span attribute from InferenceInvocation.
Updates OpenAI v2 chat stream wrappers, changelogs, and utility tests for the new timing behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`util/opentelemetry-util-genai/src/opentelemetry/util/genai/stream.py`	Measures TTFC and per-chunk read durations in shared sync/async stream wrappers.
`util/opentelemetry-util-genai/src/opentelemetry/util/genai/metrics.py`	Records streaming timing histograms from invocation timing fields.
`util/opentelemetry-util-genai/src/opentelemetry/util/genai/instruments.py`	Defines histogram creation helpers for streaming timing metrics.
`util/opentelemetry-util-genai/src/opentelemetry/util/genai/_inference_invocation.py`	Adds timing fields and emits TTFC as an inference span attribute.
`instrumentation/opentelemetry-instrumentation-openai-v2/src/opentelemetry/instrumentation/openai_v2/chat_wrappers.py`	Passes invocation start time into stream wrappers and copies measured timings back.
`util/opentelemetry-util-genai/tests/test_stream.py`	Adds sync/async stream wrapper timing tests.
`util/opentelemetry-util-genai/tests/test_handler_metrics.py`	Adds metric recorder tests for streaming timing histograms.
`util/opentelemetry-util-genai/CHANGELOG.md`	Documents utility streaming timing support.
`instrumentation/opentelemetry-instrumentation-openai-v2/CHANGELOG.md`	Documents OpenAI v2 chat streaming timing metrics.

Comments suppressed due to low confidence (1)

util/opentelemetry-util-genai/src/opentelemetry/util/genai/stream.py:217

The async wrapper has the same unbounded accumulation issue: _self_chunk_gaps stores one entry per output chunk until the stream finalizes. For large or long-lived streams this can grow without bound; prefer recording each gap immediately or passing timings through a bounded recorder.

        self._self_chunk_gaps: list[float] = []

Nik-Reddy · 2026-05-14T01:29:56Z

@lmolkova This implements the three items you mentioned in #8:

gen_ai.response.time_to_first_chunk span attribute
gen_ai.client.operation.time_to_first_chunk histogram
gen_ai.client.operation.time_per_output_chunk histogram (one record per inter-chunk gap)

For chunk gaps I'm measuring blocking read time rather than wall clock between returns, so user-side processing doesn't inflate it. Let me know if that's not what you had in mind.

Anthropic/Responses API don't use the ABC yet so they'll need follow-up work in a separate PR.

MikeGoldsmith

Looks mostly okay, I've left a few of comments we should resolve before accepting.

Nik-Reddy · 2026-05-16T23:16:09Z

@MikeGoldsmith resolved all review thread, can you re-review when you have a moment.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

eternalcuriouslearner

LGTM!! Couple of changes and a dumb performance question:

Can we please add a VCR backed test to verify the span and metrics?
Can we extend the tests we have for sync wrapper to async wrapper?
Any idea on how large a chunk_gap_seconds can grow for long running streams or multi turn conversations?

aabmass · 2026-05-18T15:39:43Z

FYI if your PR needs a changelog entry, please update it to use towncrier #16

Apply the inter-chunk gap semantics requested by MikeGoldsmith on PR open-telemetry#13. The wrapper now tracks only `_self_last_chunk_at` and pushes each gap to the timing target via the new `_record_chunk_gap` method on the `_TimingTarget` protocol; `InferenceInvocation` implements the method and continues to drain buffered gaps via `_consume_streaming_timing` at stop. The `time_per_output_chunk` histogram now matches the semconv definition `t_after_read[n] - t_after_read[n-1]`. Document the TTFC contract on the `monotonic_start_s` property: the anchor is invocation creation, so instrumentations must not perform meaningful work between `start_inference()` and the wrapped SDK call. Re-baseline the affected `test_stream.py` cases to the inter-chunk math, assert exact gap values on the timing target, drop the obsolete wrapper-side list checks, and hoist the `unittest.mock.patch` imports to module scope. Add towncrier news fragments for PR open-telemetry#13 under `util/opentelemetry-util-genai/.changelog/13.added` and `instrumentation/opentelemetry-instrumentation-openai-v2/.changelog/13.added` per the changelog migration requested by aabmass.

The towncrier news fragments at util/opentelemetry-util-genai/.changelog/13.added and instrumentation/opentelemetry-instrumentation-openai-v2/.changelog/13.added already cover the streaming TTFC and time_per_output_chunk metrics, so remove the duplicated Unreleased lines from the two CHANGELOG.md files. Keeping them would have caused the changelog workflow to fail because it blocks PRs that diff CHANGELOG.md against main.

Nik-Reddy · 2026-05-20T07:25:59Z

FYI if your PR needs a changelog entry, please update it to use towncrier #16

@aabmass Migrated the entries to towncrier news fragments at util/opentelemetry-util-genai/.changelog/13.added and instrumentation/opentelemetry-instrumentation-openai-v2/.changelog/13.added, and removed the duplicate Unreleased lines from the two CHANGELOG.md files

lzchen · 2026-05-20T21:11:01Z

+
+    def _record_chunk_gap(self, gap: float) -> None: ...
+
+


Shouldn't _consume_streaming_timing be defined here as well?

_TimingTarget is the write side that the stream wrapper uses to push timing into an invocation: ttfc_seconds gets set once at first chunk and chunk gaps are pushed through _record_chunk_gap. _consume_streaming_timing is the read side, called by InvocationMetricsRecorder at finalize to drain the buffered values. The wrapper never calls it, so adding it to the Protocol would advertise a method the only caller of the Protocol does not need.

I can add it if you want both sides of the contract documented in one place, but on its own it would be unused by the wrapper.

lzchen · 2026-05-20T21:15:09Z

 _logger = logging.getLogger(__name__)


+class _TimingTarget(Protocol):


Does this mean we need all invocations to implement this? Shouldn't we just include this as part of GenAIInvocation?

Two reasons it is a Protocol today:

Keeps stream.py off the concrete GenAIInvocation. The wrapper only needs an object that can receive ttfc_seconds and accept _record_chunk_gap, nothing else from the invocation surface.

Non-streaming invocation types (embeddings, tools, workflows) would have to carry unused ttfc_seconds / chunk-gap state if those fields lived on the base. With the Protocol, only invocations that actually stream declare it.

If you want it on GenAIInvocation anyway, I can move it in a follow-up. Trade is two extra fields on every invocation (None / empty on non-streaming ones) for a single place to document the contract. Let me know which way you want it.

lzchen · 2026-05-20T21:20:22Z

        attrs.update(self.metric_attributes)
        return attrs

+    def _record_chunk_gap(self, gap: float) -> None:


Will this logic be pretty similar across different invocations? Probably can abstract this?

Same thread as _TimingTarget above. If we lift the streaming timing state up to GenAIInvocation, both _record_chunk_gap and _consume_streaming_timing move with it and the Protocol goes away.

If we keep the Protocol, those methods stay on InferenceInvocation since that is the only invocation that streams today. Will follow whichever direction you pick on the thread above and batch it with the next push.

Leaning toward keeping the Protocol for this PR, batching the decision with the next push.

The trade I see:

Protocol: stream.py stays decoupled from the concrete invocation types, and EmbeddingInvocation, ToolInvocation, AgentInvocation, WorkflowInvocation do not carry ttfc_seconds and the chunk-gap buffer as dead state. A future streaming invocation type implements the Protocol surface (ttfc_seconds: float | None attr and _record_chunk_gap method) and gets type-checking from pyright.

On GenAIInvocation: one doc site for the streaming-timing contract, but every invocation pays for state it does not use, and there is no signal that distinguishes streaming from non-streaming invocations at the type level.

Threads at stream.py:46 and _inference_invocation.py:185 are the same call, so I will follow your direction here for all three.

Happy to flip to the base-class version if you prefer the single-site documentation.

lzchen · 2026-05-20T21:36:32Z

+        Buffered gaps are drained by ``_consume_streaming_timing`` when the
+        invocation stops.
+        """
+        self.chunk_gap_seconds.append(gap)


Seems like there's no limit to this unbounded list, which might be a concern if the stream is really long. Can we record to the histogram directly every time instead?

One snag with inline recording: gen_ai.response.model is only set after the first response chunk arrives, so gaps between chunks 1 and 2 would emit without it while later gaps have it (same
stream split across two attribute keys).

Options I see:

Inline + drop response.model from this metric.

Inline + snapshot attributes at first-chunk time.

Sliding-window cap on the buffer, keep finalize record.

Keep as is, document the memory shape.

@lzchen I'm leaning toward option 4 (keep as is, document the memory shape) for this PR. The buffer lifetime is bounded by the invocation, and the per-invocation list length is the SDK's chunk count for that one call. For openai-v2 chat completion streams in scope here, that stays in the thousands of floats per invocation and is released at stop. Can add a short comment near chunk_gap_seconds documenting the shape.

Option 2 (inline record with snapshot attributes at first-chunk) is the cleaner path if we later target agent-style indefinite streams. Happy to switch to option 2 here if you would rather not defer it.

…review Two changes from the second round of review on PR open-telemetry#13: 1. Reword the TTFC and time-per-output-chunk histogram descriptions so the measurement window is explicit. TTFC now spells out that the timer runs from the client issuing the generation request to the first chunk being received in the response stream. time-per-output-chunk now clarifies that the value is recorded for every chunk after the first, measured end-to-end between consecutive chunks. 2. Validate that start_time_s is provided whenever a timing_target is wired up. TTFC has no meaning without a start time, so passing a timing target alone is a misconfiguration. SyncStreamWrapper and AsyncStreamWrapper now raise ValueError at construction rather than silently emitting no TTFC. start_time_s stays optional when no timing_target is provided so the wrapper is still usable as a plain chunk-processing wrapper. Tests cover the new failure mode on both wrappers.

Apply the inter-chunk gap semantics requested by MikeGoldsmith on PR open-telemetry#13. The wrapper now tracks only `_self_last_chunk_at` and pushes each gap to the timing target via the new `_record_chunk_gap` method on the `_TimingTarget` protocol; `InferenceInvocation` implements the method and continues to drain buffered gaps via `_consume_streaming_timing` at stop. The `time_per_output_chunk` histogram now matches the semconv definition `t_after_read[n] - t_after_read[n-1]`. Document the TTFC contract on the `monotonic_start_s` property: the anchor is invocation creation, so instrumentations must not perform meaningful work between `start_inference()` and the wrapped SDK call. Re-baseline the affected `test_stream.py` cases to the inter-chunk math, assert exact gap values on the timing target, drop the obsolete wrapper-side list checks, and hoist the `unittest.mock.patch` imports to module scope. Add towncrier news fragments for PR open-telemetry#13 under `util/opentelemetry-util-genai/.changelog/13.added` and `instrumentation/opentelemetry-instrumentation-openai-v2/.changelog/13.added` per the changelog migration requested by aabmass.

The towncrier news fragments at util/opentelemetry-util-genai/.changelog/13.added and instrumentation/opentelemetry-instrumentation-openai-v2/.changelog/13.added already cover the streaming TTFC and time_per_output_chunk metrics, so remove the duplicated Unreleased lines from the two CHANGELOG.md files. Keeping them would have caused the changelog workflow to fail because it blocks PRs that diff CHANGELOG.md against main.

…review Two changes from the second round of review on PR open-telemetry#13: 1. Reword the TTFC and time-per-output-chunk histogram descriptions so the measurement window is explicit. TTFC now spells out that the timer runs from the client issuing the generation request to the first chunk being received in the response stream. time-per-output-chunk now clarifies that the value is recorded for every chunk after the first, measured end-to-end between consecutive chunks. 2. Validate that start_time_s is provided whenever a timing_target is wired up. TTFC has no meaning without a start time, so passing a timing target alone is a misconfiguration. SyncStreamWrapper and AsyncStreamWrapper now raise ValueError at construction rather than silently emitting no TTFC. start_time_s stays optional when no timing_target is provided so the wrapper is still usable as a plain chunk-processing wrapper. Tests cover the new failure mode on both wrappers.

Nik-Reddy · 2026-05-28T01:40:42Z

@MikeGoldsmith @eternalcuriouslearner rebased this PR onto main (eaca532) to clear the conflict from PR #60, then squashed the history into a single commit (6f465ca) so the diff is easier to scan. CI is green across all 82 checks on the new head.

The single conflict was the OpenAI changelog file. PR #60 renamed opentelemetry-instrumentation-openai-v2/ to opentelemetry-instrumentation-genai-openai/, and my changelog entry was sitting at the old path. Moved it to instrumentation/opentelemetry-instrumentation-genai-openai/.changelog/13.added and dropped the now-redundant v2 wording from the entry text. No behavior changes from the rebase, and all prior thread resolutions are preserved.

One side effect of the squash is that previous inline comments are showing as outdated on the new commit. The line context is unchanged, so the resolutions should still hold against 6f465ca.

If everything looks good, could you take another pass when you have a moment?

@lzchen thanks for the deep review. Your seven inline comments are all addressed in the squashed commit, and my replies to each are still on the original threads (now marked outdated by the force push).

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Records two streaming-specific histograms on inference invocations: gen_ai.client.operation.time_to_first_chunk and gen_ai.client.operation.time_per_output_chunk. Adds a _TimingTarget protocol on the shared stream wrappers in opentelemetry-util-genai so provider instrumentations can opt into the timing path without duplicating it, and wires the OpenAI instrumentation through it. Includes unit tests for both the util-genai stream wrappers and the OpenAI stream path, asserting that the time_to_first_chunk span attribute and the two histograms are emitted with the expected attribute set. Towncrier changelog entries live under util/opentelemetry-util-genai/.changelog/13.added and instrumentation/opentelemetry-instrumentation-genai-openai/.changelog/13.added.

Copilot AI review requested due to automatic review settings May 13, 2026 23:56

Nik-Reddy requested a review from a team as a code owner May 13, 2026 23:56

Copilot started reviewing on behalf of Nik-Reddy May 13, 2026 23:56 View session

Copilot AI reviewed May 13, 2026

View reviewed changes

Comment thread util/opentelemetry-util-genai/src/opentelemetry/util/genai/stream.py Outdated

Comment thread util/opentelemetry-util-genai/src/opentelemetry/util/genai/_inference_invocation.py

Nik-Reddy added a commit to Nik-Reddy/opentelemetry-python-genai that referenced this pull request May 14, 2026

chore: update changelog links to PR open-telemetry#13

27dd4e2

lmolkova reviewed May 14, 2026

View reviewed changes

Comment thread ...metry-instrumentation-openai-v2/src/opentelemetry/instrumentation/openai_v2/chat_wrappers.py Outdated

MikeGoldsmith requested changes May 14, 2026

View reviewed changes

MikeGoldsmith self-assigned this May 14, 2026

Nik-Reddy requested review from MikeGoldsmith and lmolkova May 15, 2026 05:54

Nik-Reddy requested a review from Copilot May 18, 2026 01:28

Copilot started reviewing on behalf of Nik-Reddy May 18, 2026 01:28 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

eternalcuriouslearner suggested changes May 18, 2026

View reviewed changes

lzchen reviewed May 20, 2026

View reviewed changes

Comment thread util/opentelemetry-util-genai/src/opentelemetry/util/genai/stream.py

lzchen reviewed May 20, 2026

View reviewed changes

Comment thread util/opentelemetry-util-genai/src/opentelemetry/util/genai/instruments.py Outdated

lzchen reviewed May 20, 2026

View reviewed changes

Comment thread util/opentelemetry-util-genai/src/opentelemetry/util/genai/instruments.py Outdated

lzchen reviewed May 20, 2026

View reviewed changes

Nik-Reddy force-pushed the feat/streaming-timing-metrics branch from 6ea537b to 78917c6 Compare May 21, 2026 05:45

Nik-Reddy requested a review from lzchen May 21, 2026 07:10

Nik-Reddy mentioned this pull request May 23, 2026

Cohere V2 instrumentation: continue scaffold + chat completions work from opentelemetry-python-contrib#4418 #56

Open

Nik-Reddy force-pushed the feat/streaming-timing-metrics branch from 78917c6 to 4d89828 Compare May 27, 2026 20:57

Nik-Reddy added a commit to Nik-Reddy/opentelemetry-python-genai that referenced this pull request May 27, 2026

chore: update changelog links to PR open-telemetry#13

9090fc0

Nik-Reddy force-pushed the feat/streaming-timing-metrics branch from 4d89828 to 6f465ca Compare May 27, 2026 20:59

Nik-Reddy requested a review from eternalcuriouslearner May 28, 2026 01:49

Nik-Reddy mentioned this pull request May 29, 2026

REQUEST: New membership for Nik-Reddy open-telemetry/community#3490

Closed

6 tasks

Nik-Reddy requested a review from Copilot May 31, 2026 19:46

Copilot AI reviewed May 31, 2026

View reviewed changes

Nik-Reddy force-pushed the feat/streaming-timing-metrics branch from 6f465ca to dbcafb1 Compare May 31, 2026 20:28

		_logger = logging.getLogger(__name__)


		class _TimingTarget(Protocol):

Conversation

Nik-Reddy commented May 13, 2026

What this does

How timing works

Scope

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Nik-Reddy commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

MikeGoldsmith left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Nik-Reddy commented May 16, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

eternalcuriouslearner left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aabmass commented May 18, 2026

Uh oh!

Nik-Reddy commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Nik-Reddy May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Nik-Reddy commented May 28, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Nik-Reddy commented May 14, 2026 •

edited

Loading

Nik-Reddy commented May 20, 2026 •

edited

Loading

Nik-Reddy May 21, 2026 •

edited

Loading