
feat(rtzr): Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh#4940

Open
kimdwkimdw wants to merge 1 commit intolivekit:mainfrom
kimdwkimdw:feature/rtzr_patch_260223

Conversation

@kimdwkimdw
Contributor

Summary

  • Add SpeechEventType.RECOGNITION_USAGE to RTZR streaming STT so that streamed audio duration is collected into STTMetrics(audio_duration).
    • For observability, STTMetrics now includes audio_duration for streamed recognition.
  • Stabilize websocket shutdown flow (_send_lock, deterministic cleanup, safer recv cancellation).
  • Fix token refresh behavior for concurrent and near-expiry cases.
  • Align RTZR websocket query options with docs and update README.
  • Add sommers_en model to RTZR streaming STT.
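The usage event above is derived from the amount of PCM audio actually sent. As a minimal sketch (a hypothetical helper, not the plugin's actual code), assuming 16-bit mono PCM, the duration follows directly from the byte count:

```python
def pcm_duration_seconds(
    num_bytes: int,
    sample_rate: int,
    num_channels: int = 1,
    bytes_per_sample: int = 2,  # 16-bit PCM
) -> float:
    """Duration of a raw PCM buffer, derived from its byte length."""
    return num_bytes / (sample_rate * num_channels * bytes_per_sample)

# One second of 16 kHz mono 16-bit audio is 32,000 bytes.
print(pcm_duration_seconds(32_000, 16_000))  # → 1.0
```

Accumulating this value per sent chunk gives the audio_duration reported in RECOGNITION_USAGE.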

Changes

  • livekit-plugins/livekit-plugins-rtzr/livekit/plugins/rtzr/stt.py
    • Add/validate STT options (sample_rate, encoding, domain, ITN/filter flags).
    • Serialize send_bytes and EOS on one send path.
    • Emit RECOGNITION_USAGE from sent audio duration.
  • livekit-plugins/livekit-plugins-rtzr/livekit/plugins/rtzr/rtzrapi.py
    • Fix token refresh logic:
      • add _token_lock to prevent concurrent refresh races
      • refresh proactively before expiry window
      • use canonical Bearer <token> auth header
    • Add use_itn, use_disfluency_filter, use_profanity_filter, and language (whisper-only) to build_config().
  • livekit-plugins/livekit-plugins-rtzr/README.md
    • Add streaming docs link and concise option examples.

Testing

Manual sanity:

  • Confirm transcripts flow normally.
  • Confirm STT metrics include a non-zero streamed audio_duration.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5b267425c


Comment on lines +426 to +430
audio_bstream = utils.audio.AudioByteStream(
    sample_rate=sample_rate,
    num_channels=1,
    samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
)


P1: Preserve buffered samples when switching chunk sizes

Reinitializing audio_bstream here can drop unsent PCM buffered in the previous AudioByteStream instance, because that class keeps partial-frame bytes internally until flush() (see livekit-agents/livekit/agents/utils/audio.py). In common streaming inputs where frame duration is not an exact divisor of 50ms (e.g. 20ms frames), switching to _STREAMING_CHUNK_MS at this point discards the remainder (typically ~10ms), clipping audio at the start of each segment and degrading transcript accuracy/usage accounting.


Contributor

@devin-ai-integration (bot) left a comment


Devin Review found 1 potential issue.

View 8 additional findings in Devin Review.


Comment on lines 421 to +431

if frames and not self._ws:
    await self._ensure_connected()
    # After initial _INITIAL_CHUNK_MS chunk sent, switch to _STREAMING_CHUNK_MS for low-latency streaming
    if not switched_to_streaming:
        audio_bstream = utils.audio.AudioByteStream(
            sample_rate=sample_rate,
            num_channels=1,
            samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
        )
        switched_to_streaming = True

🟡 Audio data silently dropped when replacing AudioByteStream after initial connection

When the first audio frames are produced and the WebSocket connects, the AudioByteStream is replaced with a new one using _STREAMING_CHUNK_MS (200ms) chunk size. Any partial frame data buffered inside the old AudioByteStream (up to ~49ms worth of audio) is silently discarded because the old stream is never flushed before being replaced.

Root Cause and Impact

The AudioByteStream.write() method (livekit-agents/livekit/agents/utils/audio.py:82-117) buffers incoming audio data and only returns complete frames once enough data has accumulated. Any leftover bytes remain in the internal _buf.

At stt.py:421-431, after the first successful _ensure_connected(), the old stream is replaced:

if not switched_to_streaming:
    audio_bstream = utils.audio.AudioByteStream(
        sample_rate=sample_rate,
        num_channels=1,
        samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
    )
    switched_to_streaming = True

The old audio_bstream's internal buffer (containing up to 49ms of audio at the beginning of the speech segment) is garbage-collected without being flushed. This means the first ~50ms of each new segment's audio can be clipped.

Impact: Up to 50ms of audio at the very beginning of each speech segment is lost, which could clip the onset of speech and slightly affect recognition accuracy.
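The buffering behavior described above can be demonstrated with a toy analogue (ChunkBuffer is hypothetical; the real class is utils.audio.AudioByteStream): complete chunks are emitted from write(), and the remainder stays buffered until flush() is called, so replacing the object without flushing silently drops that remainder.

```python
class ChunkBuffer:
    """Toy stand-in for AudioByteStream's internal partial-chunk buffering."""

    def __init__(self, chunk_bytes: int):
        self._chunk = chunk_bytes
        self._buf = b""

    def write(self, data: bytes) -> list:
        # Emit only complete chunks; keep the remainder buffered.
        self._buf += data
        out = []
        while len(self._buf) >= self._chunk:
            out.append(self._buf[: self._chunk])
            self._buf = self._buf[self._chunk :]
        return out

    def flush(self) -> list:
        # Return whatever partial chunk is still buffered.
        rest, self._buf = self._buf, b""
        return [rest] if rest else []


# 50 ms at 16 kHz mono 16-bit PCM = 1600 bytes per chunk.
buf = ChunkBuffer(1600)
frames = buf.write(b"\x00" * 2000)  # one complete 1600-byte chunk emitted
leftover = buf.flush()              # the remaining 400 bytes; lost if never flushed
```

Dropping `leftover` by reassigning `buf` to a new instance is exactly the clipping the review describes.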

Suggested change:

  if frames and not self._ws:
      await self._ensure_connected()
      # After initial _INITIAL_CHUNK_MS chunk sent, switch to _STREAMING_CHUNK_MS for low-latency streaming
-     if not switched_to_streaming:
-         audio_bstream = utils.audio.AudioByteStream(
-             sample_rate=sample_rate,
-             num_channels=1,
-             samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
-         )
-         switched_to_streaming = True
+     if not switched_to_streaming:
+         # Flush remaining buffered data from the initial-chunk stream
+         remaining_frames = audio_bstream.flush()
+         if remaining_frames:
+             frames.extend(remaining_frames)
+         audio_bstream = utils.audio.AudioByteStream(
+             sample_rate=sample_rate,
+             num_channels=1,
+             samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
+         )
+         switched_to_streaming = True


@kimdwkimdw kimdwkimdw changed the title Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh feat(rtzr): Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh Feb 26, 2026
@kimdwkimdw kimdwkimdw force-pushed the feature/rtzr_patch_260223 branch from b5b2674 to 3a4f98d Compare February 26, 2026 08:01
