
feat: Add STT utterance_end_latency metric for streaming STT #4966

Open

kimdwkimdw wants to merge 2 commits into livekit:main from kimdwkimdw:feature/stt_metric

Conversation

@kimdwkimdw (Contributor)

Motivation

LiveKit's STTMetrics includes a duration field, but the official observability docs state:

duration: for non-streaming STT, the amount of time (seconds) it took to create the transcript. Always 0 for streaming STT.

This leaves a gap: every other pipeline component has a responsiveness metric (ttft for LLM, ttfb for TTS), but streaming STT has none. Operators cannot measure, monitor, or compare streaming STT engine performance.

Solution

This PR adds utterance_end_latency to STTMetrics: the wall-clock delay from when the audio at the transcript's end_time is locally enqueued in RecognizeStream to when the FINAL_TRANSCRIPT is received.

utterance_end_latency = time.now() - local_enqueue_time_of(audio_at_end_time)

| Component | Latency metric | What it measures |
| --- | --- | --- |
| LLM | ttft | Time to first token from LLM |
| TTS | ttfb | Time to first audio byte from TTS |
| STT | utterance_end_latency | Audio-at-end_time local enqueue complete → FINAL_TRANSCRIPT received |

This metric is provider-agnostic: it works for any streaming STT plugin that populates end_time on speech alternatives.

How it works

Audio Input             RecognizeStream                              STT Engine
    │                         │                                          │
    │── push_frame(chunk1) ──▶│── enqueue audio to _input_ch ───────────▶│
    │                         │  record: (cum=0.2, wall_t1_local_enq)    │
    │── push_frame(chunk2) ──▶│── enqueue audio to _input_ch ───────────▶│
    │                         │  record: (cum=0.4, wall_t2_local_enq)    │
    │── push_frame(chunk3) ──▶│── enqueue audio to _input_ch ───────────▶│
    │                         │  record: (cum=0.6, wall_t3_local_enq)    │
    │── push_frame(chunk4) ──▶│── enqueue audio to _input_ch ───────────▶│ (silence)
    │                         │  record: (cum=0.8, wall_t4_local_enq)    │
    │── push_frame(chunk5) ──▶│── enqueue audio to _input_ch ───────────▶│ (silence)
    │                         │  record: (cum=1.0, wall_t5_local_enq)    │
    │                         │                                          │
    │                         │◀──────── FINAL_TRANSCRIPT ───────────────│
    │                         │          (end_time = 0.5)                │
    │                         │                                          │
    │                         │ bisect_left([.2,.4,.6,.8,1.0], 0.5)      │
    │                         │   → idx=2 → wall_t3_local_enq            │
    │                         │                                          │
    │                         │ utterance_end_latency                    │
    │                         │   = now() - wall_t3_local_enq            │
    │                         │                                          │
    │                         │ prune timeline ≤ 0.5                     │
    │                         │ emit STTMetrics                          │

Why bisect on end_time, not "last pushed frame"?

In streaming STT, audio flows continuously, and silence frames are often enqueued after the utterance has already ended. By the time FINAL_TRANSCRIPT arrives, later silence chunks may already exist. Using the latest enqueue time (wall_t5) would mostly measure "recent silence enqueue → FINAL receive", which is not useful. The bisect approach maps FINAL to the chunk containing end_time (chunk3 here), which is the correct anchor for this metric.
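
The anchor selection can be reproduced with the standard library; the timeline values below mirror the hypothetical numbers in the diagram:

```python
from bisect import bisect_left

# Cumulative audio positions (seconds) recorded at each local enqueue,
# mirroring the diagram: chunk1..chunk5.
timeline = [0.2, 0.4, 0.6, 0.8, 1.0]

# FINAL_TRANSCRIPT reports end_time = 0.5; bisect_left returns the index
# of the first chunk whose cumulative position covers that point.
idx = bisect_left(timeline, 0.5)
print(idx)  # → 2, i.e. chunk3, not the last-pushed silence chunk
```

Using the last-pushed entry instead (index 4 here) would anchor the metric to trailing silence.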

Implementation details:

  1. push_frame() appends (cumulative_audio_seconds, wall_clock_time) after local enqueue to _input_ch
  2. On FINAL_TRANSCRIPT, uses bisect_left to find the wall-clock time matching the transcript's end_time
  3. Computes utterance_end_latency = now() - matched_push_time
  4. Prunes the timeline up to end_time to keep memory bounded
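
The four steps above can be sketched as follows; the class and method names (PushTimeline, on_push, latency_for) are illustrative, not the PR's actual identifiers:

```python
from __future__ import annotations

import time
from bisect import bisect_left


class PushTimeline:
    """Tracks (cumulative_audio_seconds, wall_clock_time) pairs."""

    def __init__(self) -> None:
        self._audio_pos: list[float] = []   # cumulative audio seconds
        self._wall_times: list[float] = []  # wall clock at local enqueue
        self._cumulative = 0.0

    def on_push(self, frame_duration: float) -> None:
        # Step 1: record after the frame is locally enqueued.
        self._cumulative += frame_duration
        self._audio_pos.append(self._cumulative)
        self._wall_times.append(time.time())

    def latency_for(self, end_time: float) -> float | None:
        # Step 2: map end_time to the chunk that contains it.
        idx = bisect_left(self._audio_pos, end_time)
        if idx >= len(self._wall_times):
            return None  # end_time beyond anything pushed so far
        # Step 3: wall-clock delta from that chunk's local enqueue.
        latency = max(0.0, time.time() - self._wall_times[idx])
        # Step 4: prune entries before the matched chunk to bound memory.
        del self._audio_pos[:idx], self._wall_times[:idx]
        return latency


tl = PushTimeline()
for _ in range(5):
    tl.on_push(0.2)  # chunk1..chunk5, 200 ms each
print(tl.latency_for(0.5))  # small non-negative float in this toy run
```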

utterance_end_latency vs transcription_delay

These metrics measure fundamentally different things:

|  | transcription_delay | utterance_end_latency |
| --- | --- | --- |
| Measures | End-of-speech → transcript available | Audio-at-end_time local enqueue complete → FINAL received |
| Includes VAD/EOU timing? | Yes | No |
| Computed in | agent_activity.py (voice pipeline) | stt.py (base RecognizeStream) |
| Scope | Per user turn | Per FINAL_TRANSCRIPT event |
| Best for | Pipeline-level latency debugging | STT engine benchmarking |

transcription_delay tells you how long the user waited after they stopped speaking.
utterance_end_latency tells you how quickly FINAL is returned after the relevant audio is locally enqueued, minimizing VAD/EOU coupling.

Changes

| File | Description |
| --- | --- |
| stt/stt.py | Audio push timeline tracking in push_frame(), bisect-based lookup on FINAL_TRANSCRIPT, metric emission, timeline pruning |
| metrics/base.py | New utterance_end_latency: float \| None field on STTMetrics (default None) |
| metrics/utils.py | Log utterance_end_latency in structured metrics output when present |
| voice/agent_activity.py | Collect utterance_end_latency from STT metrics events, attach to per-turn MetricsReport |
| llm/chat_context.py | New utterance_end_latency field on MetricsReport TypedDict |
| cli/cli.py | Display stt_utt_end in console mode turn metrics |
| tests/test_agent_session.py | Updated to expect 4 metric events (was 3), assert new stt_metrics event with streamed=True |
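
Per the metrics/base.py row above, the change amounts to one optional field. A hedged sketch (the other fields shown are placeholders, not the real STTMetrics definition):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class STTMetrics:
    # Existing fields elided; these two are shown only for context.
    duration: float = 0.0  # always 0 for streaming STT, per the docs
    streamed: bool = False
    # New: wall-clock delay from local enqueue of the audio at end_time
    # to FINAL_TRANSCRIPT receipt; None when end_time is unavailable.
    utterance_end_latency: float | None = None
```

Defaulting to None (rather than 0.0) keeps "not measurable" distinguishable from "instantaneous".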

Provider plugin fixes in this branch

  • livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py
    • Fix FINAL SpeechData.end_time mapping to use the last word end timestamp.
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
    • For use_realtime=True, populate FINAL SpeechData.start_time/end_time from audio_start_ms/audio_end_ms.

Manual validation (console)

Validation was performed using the provided AgentSession harness and provider switching.

Validated combinations:

  • google.STT (use_streaming=True)
  • deepgram.STT
  • rtzr.STT
  • openai.STT(use_realtime=True, model="gpt-4o-mini-transcribe")
Validation Example
from dotenv import dotenv_values, load_dotenv

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import anthropic, deepgram, elevenlabs, google, noise_cancellation, openai, rtzr, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env")

config = dotenv_values(".env")


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions=config["AGENT_INSTRUCTIONS"])


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        ########## OpenAI
        # stt=openai.STT(
        #     language="ko",
        #     model="gpt-4o-mini-transcribe",
        #     use_realtime=True,
        # ),
        ########### Azure OpenAI
        llm=openai.LLM.with_azure(
            model="gpt-5-mini",
            azure_deployment=config["AZURE_LLM_DEPLOYMENT"],
            azure_endpoint=config["AZURE_LLM_OPENAI_ENDPOINT"],
            api_key=config["AZURE_LLM_OPENAI_API_KEY"],
            api_version="2024-12-01-preview",
        ),
        ########### Google Cloud STT
        stt=google.STT(
            languages=["ko-KR"],
            use_streaming=True,
            credentials_file=config.get("GOOGLE_APPLICATION_CREDENTIALS"),
        ),
        ########### Deepgram
        # stt=deepgram.STT(
        #     model="nova-2",
        #     language="ko",
        # ),
        ########### RTZR
        # stt=rtzr.STT(
        #     model="sommers_ko",
        #     language="ko",
        #     # keywords=["리턴제로", "음성인식"],
        # ),
        ############ ElevenLabs
        tts=elevenlabs.TTS(
            api_key=config["ELEVENLABS_API_KEY"],
            voice_id=config["ELEVENLABS_VOICE_ID"],
            model="eleven_flash_v2_5",
        ),
        vad=silero.VAD.load(),
        turn_detection=MultilingualModel(),
        resume_false_interruption=True,
        allow_interruptions=True,
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(instructions="Greet the user and offer your assistance.")


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

Track audio push timeline in RecognizeStream.push_frame() and compute
wall-clock delay from audio push to FINAL_TRANSCRIPT receipt. Emitted
as STTMetrics.utterance_end_latency for pure STT-engine-only latency
measurement, analogous to LLM ttft and TTS ttfb.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41f770b803


Comment on lines +431 to +435

    audio_pos = end_time - self._start_time_offset
    push_wall_clock = self._lookup_push_time(audio_pos)
    if push_wall_clock is not None:
        utterance_end_latency = max(0.0, time.time() - push_wall_clock)
        self._prune_push_timestamps(audio_pos)


P1: Bound push timeline when end timestamps are missing

The new latency timeline is only pruned in this FINAL_TRANSCRIPT path when end_time > 0 and lookup succeeds, while push_frame() appends on every audio frame. That means long-running streams with no final timestamps (or long silence before any final transcript) keep growing _audio_push_wall_times/_audio_push_timestamps without bound, which can steadily increase memory usage in production sessions. Add a pruning/capping path that does not depend on end_time availability.
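
One way to add an unconditional bound, assuming the two parallel sequences named in the finding (the cap value and deque approach are a suggestion, not the PR's code):

```python
from collections import deque

# A bounded deque drops its oldest entries automatically, even when no
# FINAL_TRANSCRIPT ever arrives. 2048 entries is an illustrative cap.
MAX_TIMELINE_ENTRIES = 2048

audio_push_timestamps: deque[float] = deque(maxlen=MAX_TIMELINE_ENTRIES)
audio_push_wall_times: deque[float] = deque(maxlen=MAX_TIMELINE_ENTRIES)


def record_push(cumulative_seconds: float, wall_time: float) -> None:
    # Appending past maxlen silently discards the oldest pair, so both
    # sequences stay the same length and memory stays bounded.
    audio_push_timestamps.append(cumulative_seconds)
    audio_push_wall_times.append(wall_time)
```

bisect still works on a deque via indexing, though lookups become O(n); a plain list with periodic truncation in push_frame() is an equivalent alternative.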


Comment on lines +1611 to +1614

    if self._last_user_final_stt_request_id:
        utterance_end_latency = self._stt_utterance_latency_by_request_id.pop(
            self._last_user_final_stt_request_id,
            None,


P2: Purge stale keyed STT latencies beyond the last transcript id

This logic pops only the latest transcript request ID, so earlier keyed latency entries collected in the same user turn are never removed from _stt_utterance_latency_by_request_id. Streaming providers can emit multiple final transcripts with different IDs before turn completion (for example, NVIDIA assigns request_id from each response object), so the dict can grow over time and retain stale per-request state. Clear or bound older keyed entries when a turn is committed.



@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 potential issues.


Comment on lines +427 to +433

    utterance_end_latency: float | None = None
    if ev.alternatives:
        end_time = ev.alternatives[0].end_time
        if end_time > 0.0:
            audio_pos = end_time - self._start_time_offset
            push_wall_clock = self._lookup_push_time(audio_pos)
            if push_wall_clock is not None:


🟡 Cumulative audio timeline not reset on STT stream retry causes incorrect utterance_end_latency

After a streaming STT retry/reconnect in _main_task, the _cumulative_audio_seconds counter and the push timeline lists are never reset. The STT provider's new connection reports end_time starting from 0 (relative to the new session), plus start_time_offset. The computation audio_pos = end_time - self._start_time_offset at livekit-agents/livekit/agents/stt/stt.py:429 recovers the provider's raw position (a small number), but _audio_push_wall_times contains cumulative values from all audio ever pushed (a much larger number).

Root Cause and Impact

Consider a retry scenario:

  1. 10s of audio pushed β†’ _cumulative_audio_seconds = 10.0, timeline = [0.2, 0.4, ..., 10.0]
  2. _run() fails with APIError, retry triggered at livekit-agents/livekit/agents/stt/stt.py:330-358
  3. _start_time_offset is increased by elapsed wall-clock time
  4. New _run() starts, provider receives only new audio and reports end_time relative to the new session (e.g. 3.0) plus start_time_offset
  5. audio_pos = end_time - start_time_offset = 3.0 β€” but the timeline entries start at ~10.0+
  6. bisect_left([10.2, 10.4, ...], 3.0) returns 0, mapping to the push timestamp of the very first frame
  7. utterance_end_latency = now() - very_old_timestamp β†’ a spuriously large value (many seconds)

Impact: After any STT stream retry, the reported utterance_end_latency will be wildly incorrect (inflated) rather than None. The timeline and _cumulative_audio_seconds should be reset (or an offset adjustment applied) when _main_task retries.

Prompt for agents
In livekit-agents/livekit/agents/stt/stt.py, the _cumulative_audio_seconds counter and the _audio_push_wall_times / _audio_push_timestamps lists are never reset when the stream retries via _main_task (lines 330-358). After a retry, the STT provider's reported end_time (minus start_time_offset) reflects the position in the NEW connection (starting near 0), but _cumulative_audio_seconds still holds the total from all audio ever pushed. Fix this by resetting the timeline state at the beginning of each retry iteration in _main_task. Specifically, inside the while loop at line 330 (before calling self._run()), add: self._cumulative_audio_seconds = 0.0, self._audio_push_wall_times.clear(), self._audio_push_timestamps.clear(). Alternatively, record the cumulative audio offset at the start of each _run() and adjust the audio_pos calculation accordingly.
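
The reset suggested in the prompt above, sketched as a helper; the attribute names follow the finding and may differ from the final patch:

```python
class RecognizeStreamSketch:
    """Minimal stand-in for the timeline state in RecognizeStream."""

    def __init__(self) -> None:
        self._cumulative_audio_seconds = 0.0
        self._audio_push_wall_times: list[float] = []
        self._audio_push_timestamps: list[float] = []

    def _reset_push_timeline(self) -> None:
        # Call at the top of each retry iteration in _main_task, before
        # self._run(), so cumulative positions restart from 0 alongside
        # the provider's new session-relative end_time values.
        self._cumulative_audio_seconds = 0.0
        self._audio_push_wall_times.clear()
        self._audio_push_timestamps.clear()
```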


Comment on lines +1150 to +1154

    if isinstance(ev, STTMetrics) and ev.utterance_end_latency is not None:
        if ev.request_id:
            self._stt_utterance_latency_by_request_id[ev.request_id] = ev.utterance_end_latency
        else:
            self._last_unkeyed_stt_utterance_latency = ev.utterance_end_latency


🟡 _stt_utterance_latency_by_request_id dict accumulates unpopped entries across turns

Within a single user turn, multiple FINAL_TRANSCRIPT events may be received (each producing an STTMetrics with utterance_end_latency), but on_final_transcript at livekit-agents/livekit/agents/voice/agent_activity.py:1390 overwrites _last_user_final_stt_request_id with only the latest request_id. When _user_turn_completed_task runs, it only pops the entry for the last request_id at line 1612, leaving all earlier entries orphaned in _stt_utterance_latency_by_request_id.

Detailed Explanation

For STT providers that use distinct request_id values per FINAL_TRANSCRIPT (rather than a stable session-level ID), the flow is:

  1. FINAL_TRANSCRIPT #1 (request_id="r1") → _on_metrics_collected stores {"r1": latency1}; on_final_transcript sets _last_user_final_stt_request_id = "r1"
  2. FINAL_TRANSCRIPT #2 (request_id="r2") → _on_metrics_collected stores {"r1": latency1, "r2": latency2}; on_final_transcript overwrites to _last_user_final_stt_request_id = "r2"
  3. _user_turn_completed_task pops only "r2", leaving "r1" in the dict forever

Over many turns, the dict grows without bound. Each entry is small (str → float), so the memory impact is slow but unbounded.

Impact: Slow memory leak proportional to the number of intermediate FINAL_TRANSCRIPT events across all turns, for providers that use per-response request IDs.

Prompt for agents
In livekit-agents/livekit/agents/voice/agent_activity.py, the _stt_utterance_latency_by_request_id dict (line 132) accumulates entries that are never cleaned up when multiple FINAL_TRANSCRIPT events occur per turn. To fix, either: (1) clear the entire dict after each turn in _user_turn_completed_task around line 1627 where _last_user_final_stt_request_id is reset to None (add self._stt_utterance_latency_by_request_id.clear()), or (2) store only the latest keyed latency (as a single value) instead of a dict, since only the last FINAL_TRANSCRIPT per turn is used.
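
Option (1) from the prompt above, sketched with a minimal stand-in class (names mirror the finding, not the actual agent_activity.py layout):

```python
from __future__ import annotations


class TurnLatencySketch:
    """Minimal stand-in for the per-turn latency bookkeeping."""

    def __init__(self) -> None:
        self._stt_utterance_latency_by_request_id: dict[str, float] = {}
        self._last_user_final_stt_request_id: str | None = None

    def on_turn_completed(self) -> float | None:
        latency = None
        if self._last_user_final_stt_request_id:
            latency = self._stt_utterance_latency_by_request_id.pop(
                self._last_user_final_stt_request_id, None
            )
        # Drop earlier keyed entries from the same turn so the dict
        # cannot accumulate stale per-request state across turns.
        self._stt_utterance_latency_by_request_id.clear()
        self._last_user_final_stt_request_id = None
        return latency
```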

