
feat: Add STT utterance_end_latency metric for streaming STT #4966

Open

kimdwkimdw wants to merge 2 commits into livekit:main from kimdwkimdw:feature/stt_metric

Conversation

@kimdwkimdw (Contributor)

Motivation

LiveKit's STTMetrics includes a duration field, but the official observability docs state:

duration: for non-streaming STT, the amount of time (seconds) it took to create the transcript. Always 0 for streaming STT.

This leaves a gap: every other pipeline component has a responsiveness metric (ttft for LLM, ttfb for TTS), but streaming STT has none. Operators cannot measure, monitor, or compare streaming STT engine performance.

Solution

This PR adds utterance_end_latency to STTMetrics: the wall-clock delay from when the audio at the transcript's end_time is locally enqueued in RecognizeStream to when the FINAL_TRANSCRIPT is received.

utterance_end_latency = time.now() - local_enqueue_time_of(audio_at_end_time)

| Component | Latency metric | What it measures |
| --- | --- | --- |
| LLM | ttft | Time to first token from LLM |
| TTS | ttfb | Time to first audio byte from TTS |
| STT | utterance_end_latency | Audio-at-end_time local enqueue complete → FINAL_TRANSCRIPT received |

This metric is provider-agnostic: it works for any streaming STT plugin that populates end_time on speech alternatives.

How it works

Audio Input             RecognizeStream                              STT Engine
    │                         │                                          │
    │── push_frame(chunk1) ──▶│── enqueue audio to _input_ch ───────────▶│
    │                         │  record: (cum=0.2, wall_t1_local_enq)    │
    │── push_frame(chunk2) ──▶│── enqueue audio to _input_ch ───────────▶│
    │                         │  record: (cum=0.4, wall_t2_local_enq)    │
    │── push_frame(chunk3) ──▶│── enqueue audio to _input_ch ───────────▶│
    │                         │  record: (cum=0.6, wall_t3_local_enq)    │
    │── push_frame(chunk4) ──▶│── enqueue audio to _input_ch ───────────▶│ (silence)
    │                         │  record: (cum=0.8, wall_t4_local_enq)    │
    │── push_frame(chunk5) ──▶│── enqueue audio to _input_ch ───────────▶│ (silence)
    │                         │  record: (cum=1.0, wall_t5_local_enq)    │
    │                         │                                          │
    │                         │◀──────── FINAL_TRANSCRIPT ───────────────│
    │                         │          (end_time = 0.5)                │
    │                         │                                          │
    │                         │ bisect_left([.2,.4,.6,.8,1.0], 0.5)      │
    │                         │   → idx=2 → wall_t3_local_enq            │
    │                         │                                          │
    │                         │ utterance_end_latency                    │
    │                         │   = now() - wall_t3_local_enq            │
    │                         │                                          │
    │                         │ prune timeline ≤ 0.5                     │
    │                         │ emit STTMetrics                          │

Why bisect on end_time, not "last pushed frame"?

In streaming STT, audio flows continuously, and silence frames are often enqueued after the utterance has already ended. By the time FINAL_TRANSCRIPT arrives, later silence chunks may already exist. Using the latest enqueue time (wall_t5) would mostly measure "recent silence enqueue → FINAL receive", which is not useful. The bisect approach maps FINAL to the chunk containing end_time (chunk3 here), which is the correct anchor for this metric.
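
The anchor selection can be reproduced with the standard library; the timeline values below mirror the hypothetical numbers in the diagram:

```python
from bisect import bisect_left

# Cumulative audio positions (seconds) recorded at each local enqueue,
# mirroring the diagram: chunk1..chunk5.
timeline = [0.2, 0.4, 0.6, 0.8, 1.0]

# FINAL_TRANSCRIPT reports end_time = 0.5; bisect_left returns the index
# of the first chunk whose cumulative position covers that point.
idx = bisect_left(timeline, 0.5)
print(idx)  # → 2, i.e. chunk3, not the last-pushed silence chunk
```

Using the last-pushed entry instead (index 4 here) would anchor the metric to trailing silence.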

Implementation details:

  1. push_frame() appends (cumulative_audio_seconds, wall_clock_time) after local enqueue to _input_ch
  2. On FINAL_TRANSCRIPT, uses bisect_left to find the wall-clock time matching the transcript's end_time
  3. Computes utterance_end_latency = now() - matched_push_time
  4. Prunes the timeline up to end_time to keep memory bounded
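
The four steps above can be sketched as follows; the class and method names (PushTimeline, on_push, latency_for) are illustrative, not the PR's actual identifiers:

```python
from __future__ import annotations

import time
from bisect import bisect_left


class PushTimeline:
    """Tracks (cumulative_audio_seconds, wall_clock_time) pairs."""

    def __init__(self) -> None:
        self._audio_pos: list[float] = []   # cumulative audio seconds
        self._wall_times: list[float] = []  # wall clock at local enqueue
        self._cumulative = 0.0

    def on_push(self, frame_duration: float) -> None:
        # Step 1: record after the frame is locally enqueued.
        self._cumulative += frame_duration
        self._audio_pos.append(self._cumulative)
        self._wall_times.append(time.time())

    def latency_for(self, end_time: float) -> float | None:
        # Step 2: map end_time to the chunk that contains it.
        idx = bisect_left(self._audio_pos, end_time)
        if idx >= len(self._wall_times):
            return None  # end_time beyond anything pushed so far
        # Step 3: wall-clock delta from that chunk's local enqueue.
        latency = max(0.0, time.time() - self._wall_times[idx])
        # Step 4: prune entries before the matched chunk to bound memory.
        del self._audio_pos[:idx], self._wall_times[:idx]
        return latency


tl = PushTimeline()
for _ in range(5):
    tl.on_push(0.2)  # chunk1..chunk5, 200 ms each
print(tl.latency_for(0.5))  # small non-negative float in this toy run
```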

utterance_end_latency vs transcription_delay

These metrics measure fundamentally different things:

|  | transcription_delay | utterance_end_latency |
| --- | --- | --- |
| Measures | End-of-speech → transcript available | Audio-at-end_time local enqueue complete → FINAL received |
| Includes VAD/EOU timing? | Yes | No |
| Computed in | agent_activity.py (voice pipeline) | stt.py (base RecognizeStream) |
| Scope | Per user turn | Per FINAL_TRANSCRIPT event |
| Best for | Pipeline-level latency debugging | STT engine benchmarking |

transcription_delay tells you how long the user waited after they stopped speaking.
utterance_end_latency tells you how quickly FINAL is returned after the relevant audio is locally enqueued, minimizing VAD/EOU coupling.

Changes

| File | Description |
| --- | --- |
| stt/stt.py | Audio push timeline tracking in push_frame(), bisect-based lookup on FINAL_TRANSCRIPT, metric emission, timeline pruning |
| metrics/base.py | New utterance_end_latency: float \| None field on STTMetrics (default None) |
| metrics/utils.py | Log utterance_end_latency in structured metrics output when present |
| voice/agent_activity.py | Collect utterance_end_latency from STT metrics events, attach to per-turn MetricsReport |
| llm/chat_context.py | New utterance_end_latency field on MetricsReport TypedDict |
| cli/cli.py | Display stt_utt_end in console mode turn metrics |
| tests/test_agent_session.py | Updated to expect 4 metric events (was 3), assert new stt_metrics event with streamed=True |
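
Per the metrics/base.py row above, the change amounts to one optional field. A hedged sketch (the other fields shown are placeholders, not the real STTMetrics definition):

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class STTMetrics:
    # Existing fields elided; these two are shown only for context.
    duration: float = 0.0  # always 0 for streaming STT, per the docs
    streamed: bool = False
    # New: wall-clock delay from local enqueue of the audio at end_time
    # to FINAL_TRANSCRIPT receipt; None when end_time is unavailable.
    utterance_end_latency: float | None = None
```

Defaulting to None (rather than 0.0) keeps "not measurable" distinguishable from "instantaneous".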

Provider plugin fixes in this branch

  • livekit-plugins/livekit-plugins-deepgram/livekit/plugins/deepgram/stt.py
    • Fix FINAL SpeechData.end_time mapping to use the last word end timestamp.
  • livekit-plugins/livekit-plugins-openai/livekit/plugins/openai/stt.py
    • For use_realtime=True, populate FINAL SpeechData.start_time/end_time from audio_start_ms/audio_end_ms.

Manual validation (console)

Validation was performed using the provided AgentSession harness and provider switching.

Validated combinations:

  • google.STT (use_streaming=True)
  • deepgram.STT
  • rtzr.STT
  • openai.STT(use_realtime=True, model="gpt-4o-mini-transcribe")
Validation Example
from dotenv import dotenv_values, load_dotenv

from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import anthropic, deepgram, elevenlabs, google, noise_cancellation, openai, rtzr, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env")

config = dotenv_values(".env")


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions=config["AGENT_INSTRUCTIONS"])


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        ########## OpenAI
        # stt=openai.STT(
        #     language="ko",
        #     model="gpt-4o-mini-transcribe",
        #     use_realtime=True,
        # ),
        ########### Azure OpenAI
        llm=openai.LLM.with_azure(
            model="gpt-5-mini",
            azure_deployment=config["AZURE_LLM_DEPLOYMENT"],
            azure_endpoint=config["AZURE_LLM_OPENAI_ENDPOINT"],
            api_key=config["AZURE_LLM_OPENAI_API_KEY"],
            api_version="2024-12-01-preview",
        ),
        ########### Google Cloud STT
        stt=google.STT(
            languages=["ko-KR"],
            use_streaming=True,
            credentials_file=config.get("GOOGLE_APPLICATION_CREDENTIALS"),
        ),
        ########### Deepgram
        # stt=deepgram.STT(
        #     model="nova-2",
        #     language="ko",
        # ),
        ########### RTZR
        # stt=rtzr.STT(
        #     model="sommers_ko",
        #     language="ko",
        #     # keywords=["리턴제로", "음성인식"],
        # ),
        ############ ElevenLabs
        tts=elevenlabs.TTS(
            api_key=config["ELEVENLABS_API_KEY"],
            voice_id=config["ELEVENLABS_VOICE_ID"],
            model="eleven_flash_v2_5",
        ),
        vad=silero.VAD.load(),
        turn_detection=MultilingualModel(),
        resume_false_interruption=True,
        allow_interruptions=True,
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(instructions="Greet the user and offer your assistance.")


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))

Track audio push timeline in RecognizeStream.push_frame() and compute
wall-clock delay from audio push to FINAL_TRANSCRIPT receipt. Emitted
as STTMetrics.utterance_end_latency for pure STT-engine-only latency
measurement, analogous to LLM ttft and TTS ttfb.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 41f770b803


Comment on lines +431 to +435

    audio_pos = end_time - self._start_time_offset
    push_wall_clock = self._lookup_push_time(audio_pos)
    if push_wall_clock is not None:
        utterance_end_latency = max(0.0, time.time() - push_wall_clock)
        self._prune_push_timestamps(audio_pos)


P1: Bound push timeline when end timestamps are missing

The new latency timeline is only pruned in this FINAL_TRANSCRIPT path when end_time > 0 and lookup succeeds, while push_frame() appends on every audio frame. That means long-running streams with no final timestamps (or long silence before any final transcript) keep growing _audio_push_wall_times/_audio_push_timestamps without bound, which can steadily increase memory usage in production sessions. Add a pruning/capping path that does not depend on end_time availability.
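
One way to add an unconditional bound, assuming the two parallel sequences named in the finding (the cap value and deque approach are a suggestion, not the PR's code):

```python
from collections import deque

# A bounded deque drops its oldest entries automatically, even when no
# FINAL_TRANSCRIPT ever arrives. 2048 entries is an illustrative cap.
MAX_TIMELINE_ENTRIES = 2048

audio_push_timestamps: deque[float] = deque(maxlen=MAX_TIMELINE_ENTRIES)
audio_push_wall_times: deque[float] = deque(maxlen=MAX_TIMELINE_ENTRIES)


def record_push(cumulative_seconds: float, wall_time: float) -> None:
    # Appending past maxlen silently discards the oldest pair, so both
    # sequences stay the same length and memory stays bounded.
    audio_push_timestamps.append(cumulative_seconds)
    audio_push_wall_times.append(wall_time)
```

bisect still works on a deque via indexing, though lookups become O(n); a plain list with periodic truncation in push_frame() is an equivalent alternative.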


Comment on lines +1611 to +1614

    if self._last_user_final_stt_request_id:
        utterance_end_latency = self._stt_utterance_latency_by_request_id.pop(
            self._last_user_final_stt_request_id,
            None,


P2: Purge stale keyed STT latencies beyond the last transcript id

This logic pops only the latest transcript request ID, so earlier keyed latency entries collected in the same user turn are never removed from _stt_utterance_latency_by_request_id. Streaming providers can emit multiple final transcripts with different IDs before turn completion (for example, NVIDIA assigns request_id from each response object), so the dict can grow over time and retain stale per-request state. Clear or bound older keyed entries when a turn is committed.



@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 potential issues.


Comment on lines +427 to +433

    utterance_end_latency: float | None = None
    if ev.alternatives:
        end_time = ev.alternatives[0].end_time
        if end_time > 0.0:
            audio_pos = end_time - self._start_time_offset
            push_wall_clock = self._lookup_push_time(audio_pos)
            if push_wall_clock is not None:


🟡 Cumulative audio timeline not reset on STT stream retry causes incorrect utterance_end_latency

After a streaming STT retry/reconnect in _main_task, the _cumulative_audio_seconds counter and the push timeline lists are never reset. The STT provider's new connection reports end_time starting from 0 (relative to the new session), plus start_time_offset. The computation audio_pos = end_time - self._start_time_offset at livekit-agents/livekit/agents/stt/stt.py:429 recovers the provider's raw position (a small number), but _audio_push_wall_times contains cumulative values from all audio ever pushed (a much larger number).

Root Cause and Impact

Consider a retry scenario:

  1. 10s of audio pushed β†’ _cumulative_audio_seconds = 10.0, timeline = [0.2, 0.4, ..., 10.0]
  2. _run() fails with APIError, retry triggered at livekit-agents/livekit/agents/stt/stt.py:330-358
  3. _start_time_offset is increased by elapsed wall-clock time
  4. New _run() starts, provider receives only new audio and reports end_time relative to the new session (e.g. 3.0) plus start_time_offset
  5. audio_pos = end_time - start_time_offset = 3.0 β€” but the timeline entries start at ~10.0+
  6. bisect_left([10.2, 10.4, ...], 3.0) returns 0, mapping to the push timestamp of the very first frame
  7. utterance_end_latency = now() - very_old_timestamp β†’ a spuriously large value (many seconds)

Impact: After any STT stream retry, the reported utterance_end_latency will be wildly incorrect (inflated) rather than None. The timeline and _cumulative_audio_seconds should be reset (or an offset adjustment applied) when _main_task retries.

Prompt for agents
In livekit-agents/livekit/agents/stt/stt.py, the _cumulative_audio_seconds counter and the _audio_push_wall_times / _audio_push_timestamps lists are never reset when the stream retries via _main_task (lines 330-358). After a retry, the STT provider's reported end_time (minus start_time_offset) reflects the position in the NEW connection (starting near 0), but _cumulative_audio_seconds still holds the total from all audio ever pushed. Fix this by resetting the timeline state at the beginning of each retry iteration in _main_task. Specifically, inside the while loop at line 330 (before calling self._run()), add: self._cumulative_audio_seconds = 0.0, self._audio_push_wall_times.clear(), self._audio_push_timestamps.clear(). Alternatively, record the cumulative audio offset at the start of each _run() and adjust the audio_pos calculation accordingly.
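
The reset suggested in the prompt above, sketched as a helper; the attribute names follow the finding and may differ from the final patch:

```python
class RecognizeStreamSketch:
    """Minimal stand-in for the timeline state in RecognizeStream."""

    def __init__(self) -> None:
        self._cumulative_audio_seconds = 0.0
        self._audio_push_wall_times: list[float] = []
        self._audio_push_timestamps: list[float] = []

    def _reset_push_timeline(self) -> None:
        # Call at the top of each retry iteration in _main_task, before
        # self._run(), so cumulative positions restart from 0 alongside
        # the provider's new session-relative end_time values.
        self._cumulative_audio_seconds = 0.0
        self._audio_push_wall_times.clear()
        self._audio_push_timestamps.clear()
```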


Comment on lines +1150 to +1154

    if isinstance(ev, STTMetrics) and ev.utterance_end_latency is not None:
        if ev.request_id:
            self._stt_utterance_latency_by_request_id[ev.request_id] = ev.utterance_end_latency
        else:
            self._last_unkeyed_stt_utterance_latency = ev.utterance_end_latency


🟡 _stt_utterance_latency_by_request_id dict accumulates unpopped entries across turns

Within a single user turn, multiple FINAL_TRANSCRIPT events may be received (each producing an STTMetrics with utterance_end_latency), but on_final_transcript at livekit-agents/livekit/agents/voice/agent_activity.py:1390 overwrites _last_user_final_stt_request_id with only the latest request_id. When _user_turn_completed_task runs, it only pops the entry for the last request_id at line 1612, leaving all earlier entries orphaned in _stt_utterance_latency_by_request_id.

Detailed Explanation

For STT providers that use distinct request_id values per FINAL_TRANSCRIPT (rather than a stable session-level ID), the flow is:

  1. FINAL_TRANSCRIPT #1 (request_id="r1") → _on_metrics_collected stores {"r1": latency1}; on_final_transcript sets _last_user_final_stt_request_id = "r1"
  2. FINAL_TRANSCRIPT #2 (request_id="r2") → _on_metrics_collected stores {"r1": latency1, "r2": latency2}; on_final_transcript overwrites to _last_user_final_stt_request_id = "r2"
  3. _user_turn_completed_task pops only "r2", leaving "r1" in the dict forever

Over many turns, the dict grows without bound. Each entry is small (str → float), so the memory impact is slow but unbounded.

Impact: Slow memory leak proportional to the number of intermediate FINAL_TRANSCRIPT events across all turns, for providers that use per-response request IDs.

Prompt for agents
In livekit-agents/livekit/agents/voice/agent_activity.py, the _stt_utterance_latency_by_request_id dict (line 132) accumulates entries that are never cleaned up when multiple FINAL_TRANSCRIPT events occur per turn. To fix, either: (1) clear the entire dict after each turn in _user_turn_completed_task around line 1627 where _last_user_final_stt_request_id is reset to None (add self._stt_utterance_latency_by_request_id.clear()), or (2) store only the latest keyed latency (as a single value) instead of a dict, since only the last FINAL_TRANSCRIPT per turn is used.
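
Option (1) from the prompt above, sketched with a minimal stand-in class (names mirror the finding, not the actual agent_activity.py layout):

```python
from __future__ import annotations


class TurnLatencySketch:
    """Minimal stand-in for the per-turn latency bookkeeping."""

    def __init__(self) -> None:
        self._stt_utterance_latency_by_request_id: dict[str, float] = {}
        self._last_user_final_stt_request_id: str | None = None

    def on_turn_completed(self) -> float | None:
        latency = None
        if self._last_user_final_stt_request_id:
            latency = self._stt_utterance_latency_by_request_id.pop(
                self._last_user_final_stt_request_id, None
            )
        # Drop earlier keyed entries from the same turn so the dict
        # cannot accumulate stale per-request state across turns.
        self._stt_utterance_latency_by_request_id.clear()
        self._last_user_final_stt_request_id = None
        return latency
```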

