
feat(rtzr): Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh#4940

Open
kimdwkimdw wants to merge 1 commit intolivekit:mainfrom
kimdwkimdw:feature/rtzr_patch_260223

Conversation

@kimdwkimdw
Contributor

Summary

  • Add SpeechEventType.RECOGNITION_USAGE to RTZR streaming STT so that streamed audio duration is collected into STTMetrics(audio_duration).
    • For observability, STTMetrics now includes audio_duration for streamed recognition.
  • Stabilize websocket shutdown flow (_send_lock, deterministic cleanup, safer recv cancellation).
  • Fix token refresh behavior for concurrent and near-expiry cases.
  • Align RTZR websocket query options with docs and update README.
  • Add sommers_en model to RTZR streaming STT.
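The usage event above is derived from the amount of PCM audio actually sent. As a minimal sketch (a hypothetical helper, not the plugin's actual code), assuming 16-bit mono PCM, the duration follows directly from the byte count:

```python
def pcm_duration_seconds(
    num_bytes: int,
    sample_rate: int,
    num_channels: int = 1,
    bytes_per_sample: int = 2,  # 16-bit PCM
) -> float:
    """Duration of a raw PCM buffer, derived from its byte length."""
    return num_bytes / (sample_rate * num_channels * bytes_per_sample)

# One second of 16 kHz mono 16-bit audio is 32,000 bytes.
print(pcm_duration_seconds(32_000, 16_000))  # → 1.0
```

Accumulating this value per sent chunk gives the audio_duration reported in RECOGNITION_USAGE.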

Changes

  • livekit-plugins/livekit-plugins-rtzr/livekit/plugins/rtzr/stt.py
    • Add/validate STT options (sample_rate, encoding, domain, ITN/filter flags).
    • Serialize send_bytes and EOS on one send path.
    • Emit RECOGNITION_USAGE from sent audio duration.
  • livekit-plugins/livekit-plugins-rtzr/livekit/plugins/rtzr/rtzrapi.py
    • Fix token refresh logic:
      • add _token_lock to prevent concurrent refresh races
      • refresh proactively before expiry window
      • use canonical Bearer <token> auth header
    • Add use_itn, use_disfluency_filter, use_profanity_filter, and language (whisper-only) to build_config().
  • livekit-plugins/livekit-plugins-rtzr/README.md
    • Add streaming docs link and concise option examples.

Testing

Manual sanity:

  • Confirm transcripts flow normally.
  • Confirm STT metrics include a non-zero streamed audio_duration.


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b5b267425c


Comment on lines +426 to +430
audio_bstream = utils.audio.AudioByteStream(
    sample_rate=sample_rate,
    num_channels=1,
    samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
)


P1: Preserve buffered samples when switching chunk sizes

Reinitializing audio_bstream here can drop unsent PCM buffered in the previous AudioByteStream instance, because that class keeps partial-frame bytes internally until flush() (see livekit-agents/livekit/agents/utils/audio.py). In common streaming inputs where frame duration is not an exact divisor of 50ms (e.g. 20ms frames), switching to _STREAMING_CHUNK_MS at this point discards the remainder (typically ~10ms), clipping audio at the start of each segment and degrading transcript accuracy/usage accounting.


Contributor

@devin-ai-integration (bot) left a comment


Devin Review found 1 potential issue.

View 8 additional findings in Devin Review.


Comment on lines 421 to +431

if frames and not self._ws:
    await self._ensure_connected()
    # After initial _INITIAL_CHUNK_MS chunk sent, switch to _STREAMING_CHUNK_MS for low-latency streaming
    if not switched_to_streaming:
        audio_bstream = utils.audio.AudioByteStream(
            sample_rate=sample_rate,
            num_channels=1,
            samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
        )
        switched_to_streaming = True

🟡 Audio data silently dropped when replacing AudioByteStream after initial connection

When the first audio frames are produced and the WebSocket connects, the AudioByteStream is replaced with a new one using _STREAMING_CHUNK_MS (200ms) chunk size. Any partial frame data buffered inside the old AudioByteStream (up to ~49ms worth of audio) is silently discarded because the old stream is never flushed before being replaced.

Root Cause and Impact

The AudioByteStream.write() method (livekit-agents/livekit/agents/utils/audio.py:82-117) buffers incoming audio data and only returns complete frames once enough data has accumulated. Any leftover bytes remain in the internal _buf.

At stt.py:421-431, after the first successful _ensure_connected(), the old stream is replaced:

if not switched_to_streaming:
    audio_bstream = utils.audio.AudioByteStream(
        sample_rate=sample_rate,
        num_channels=1,
        samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
    )
    switched_to_streaming = True

The old audio_bstream's internal buffer (containing up to 49ms of audio at the beginning of the speech segment) is garbage-collected without being flushed. This means the first ~50ms of each new segment's audio can be clipped.

Impact: Up to 50ms of audio at the very beginning of each speech segment is lost, which could clip the onset of speech and slightly affect recognition accuracy.
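The buffering behavior described above can be demonstrated with a toy analogue (ChunkBuffer is hypothetical; the real class is utils.audio.AudioByteStream): complete chunks are emitted from write(), and the remainder stays buffered until flush() is called, so replacing the object without flushing silently drops that remainder.

```python
class ChunkBuffer:
    """Toy stand-in for AudioByteStream's internal partial-chunk buffering."""

    def __init__(self, chunk_bytes: int):
        self._chunk = chunk_bytes
        self._buf = b""

    def write(self, data: bytes) -> list:
        # Emit only complete chunks; keep the remainder buffered.
        self._buf += data
        out = []
        while len(self._buf) >= self._chunk:
            out.append(self._buf[: self._chunk])
            self._buf = self._buf[self._chunk :]
        return out

    def flush(self) -> list:
        # Return whatever partial chunk is still buffered.
        rest, self._buf = self._buf, b""
        return [rest] if rest else []


# 50 ms at 16 kHz mono 16-bit PCM = 1600 bytes per chunk.
buf = ChunkBuffer(1600)
frames = buf.write(b"\x00" * 2000)  # one complete 1600-byte chunk emitted
leftover = buf.flush()              # the remaining 400 bytes; lost if never flushed
```

Dropping `leftover` by reassigning `buf` to a new instance is exactly the clipping the review describes.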

Suggested change:

  if frames and not self._ws:
      await self._ensure_connected()
      # After initial _INITIAL_CHUNK_MS chunk sent, switch to _STREAMING_CHUNK_MS for low-latency streaming
-     if not switched_to_streaming:
-         audio_bstream = utils.audio.AudioByteStream(
-             sample_rate=sample_rate,
-             num_channels=1,
-             samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
-         )
-         switched_to_streaming = True
+     if not switched_to_streaming:
+         # Flush remaining buffered data from the initial-chunk stream
+         remaining_frames = audio_bstream.flush()
+         if remaining_frames:
+             frames.extend(remaining_frames)
+         audio_bstream = utils.audio.AudioByteStream(
+             sample_rate=sample_rate,
+             num_channels=1,
+             samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
+         )
+         switched_to_streaming = True


@kimdwkimdw kimdwkimdw changed the title Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh feat(rtzr): Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh Feb 26, 2026
@kimdwkimdw kimdwkimdw force-pushed the feature/rtzr_patch_260223 branch from b5b2674 to 3a4f98d Compare February 26, 2026 08:01
