feat(rtzr): Support SpeechEventType.RECOGNITION_USAGE, harden websocket stream lifecycle, and fix token refresh #4940
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b5b267425c
    audio_bstream = utils.audio.AudioByteStream(
        sample_rate=sample_rate,
        num_channels=1,
        samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
    )
Preserve buffered samples when switching chunk sizes
Reinitializing audio_bstream here can drop unsent PCM buffered in the previous AudioByteStream instance, because that class keeps partial-frame bytes internally until flush() (see livekit-agents/livekit/agents/utils/audio.py). In common streaming inputs where frame duration is not an exact divisor of 50ms (e.g. 20ms frames), switching to _STREAMING_CHUNK_MS at this point discards the remainder (typically ~10ms), clipping audio at the start of each segment and degrading transcript accuracy/usage accounting.
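The remainder mentioned above can be checked with a little arithmetic. This is a minimal sketch, assuming the initial chunk is 50 ms (the figure used in the comment) and that the swap happens right after the first full chunk is emitted; `leftover_ms` is a hypothetical helper, not part of the plugin.

```python
def leftover_ms(frame_ms: int, chunk_ms: int) -> int:
    """Audio (in ms) still buffered right after the first full chunk is emitted."""
    written = 0
    # Keep writing fixed-size input frames until one full chunk is available.
    while written < chunk_ms:
        written += frame_ms
    # Whatever exceeds the chunk size stays in the chunker's internal buffer.
    return written - chunk_ms


for frame_ms in (10, 20, 30):
    print(f"{frame_ms} ms frames -> {leftover_ms(frame_ms, 50)} ms left in the buffer")
```

With 10 ms frames nothing is left over, because 10 divides 50 evenly; with 20 ms or 30 ms frames, 10 ms of audio is still buffered at the moment the stream is replaced.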
    if frames and not self._ws:
        await self._ensure_connected()
        # After initial _INITIAL_CHUNK_MS chunk sent, switch to _STREAMING_CHUNK_MS for low-latency streaming
        if not switched_to_streaming:
            audio_bstream = utils.audio.AudioByteStream(
                sample_rate=sample_rate,
                num_channels=1,
                samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
            )
            switched_to_streaming = True
🟡 Audio data silently dropped when replacing AudioByteStream after initial connection
When the first audio frames are produced and the WebSocket connects, the AudioByteStream is replaced with a new one using _STREAMING_CHUNK_MS (200ms) chunk size. Any partial frame data buffered inside the old AudioByteStream (up to ~49ms worth of audio) is silently discarded because the old stream is never flushed before being replaced.
Root Cause and Impact
The AudioByteStream.write() method (livekit-agents/livekit/agents/utils/audio.py:82-117) buffers incoming audio data and only returns complete frames once enough data has accumulated. Any leftover bytes remain in the internal _buf.
At stt.py:421-431, after the first successful _ensure_connected(), the old stream is replaced:
    if not switched_to_streaming:
        audio_bstream = utils.audio.AudioByteStream(
            sample_rate=sample_rate,
            num_channels=1,
            samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
        )
        switched_to_streaming = True

The old audio_bstream's internal buffer (containing up to 49ms of audio at the beginning of the speech segment) is garbage-collected without being flushed. This means the first ~50ms of each new segment's audio can be clipped.
Impact: Up to 50ms of audio at the very beginning of each speech segment is lost, which could clip the onset of speech and slightly affect recognition accuracy.
Suggested change:

      if frames and not self._ws:
          await self._ensure_connected()
          # After initial _INITIAL_CHUNK_MS chunk sent, switch to _STREAMING_CHUNK_MS for low-latency streaming
    -     if not switched_to_streaming:
    -         audio_bstream = utils.audio.AudioByteStream(
    -             sample_rate=sample_rate,
    -             num_channels=1,
    -             samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
    -         )
    -         switched_to_streaming = True
    +     if not switched_to_streaming:
    +         # Flush remaining buffered data from the initial-chunk stream
    +         remaining_frames = audio_bstream.flush()
    +         if remaining_frames:
    +             frames.extend(remaining_frames)
    +         audio_bstream = utils.audio.AudioByteStream(
    +             sample_rate=sample_rate,
    +             num_channels=1,
    +             samples_per_channel=sample_rate * _STREAMING_CHUNK_MS // 1000,
    +         )
    +         switched_to_streaming = True
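The effect of flushing before the swap can be reproduced without livekit. The sketch below uses a hypothetical `ByteChunker` class as a stand-in for a buffering chunker; it is not the real `AudioByteStream`. The 50 ms and 200 ms chunk sizes are the figures quoted in the review comments, and 16 kHz mono 16-bit PCM is an assumption made only for this example.

```python
class ByteChunker:
    """Hypothetical stand-in for a buffering chunker (not the livekit class)."""

    def __init__(self, chunk_bytes: int) -> None:
        self._chunk_bytes = chunk_bytes
        self._buf = bytearray()

    def write(self, data: bytes) -> list[bytes]:
        """Buffer data and return only the complete chunks."""
        self._buf.extend(data)
        out = []
        while len(self._buf) >= self._chunk_bytes:
            out.append(bytes(self._buf[: self._chunk_bytes]))
            del self._buf[: self._chunk_bytes]
        return out

    def flush(self) -> list[bytes]:
        """Return whatever partial chunk is still buffered."""
        if not self._buf:
            return []
        out = [bytes(self._buf)]
        self._buf.clear()
        return out


BYTES_PER_MS = 16_000 * 2 // 1000  # assumed 16 kHz, mono, 16-bit PCM


def stream(frames: list[bytes], flush_before_swap: bool) -> int:
    """Return total bytes emitted; swap 50 ms -> 200 ms chunker after the first chunk."""
    chunker = ByteChunker(50 * BYTES_PER_MS)
    switched = False
    sent = 0
    for frame in frames:
        chunks = chunker.write(frame)
        if chunks and not switched:
            if flush_before_swap:
                chunks.extend(chunker.flush())  # keep the partial chunk
            chunker = ByteChunker(200 * BYTES_PER_MS)
            switched = True
        sent += sum(len(c) for c in chunks)
    sent += sum(len(c) for c in chunker.flush())  # end of input
    return sent


frames = [bytes(20 * BYTES_PER_MS)] * 50  # one second of 20 ms frames
total = sum(len(f) for f in frames)
lost_without = total - stream(frames, flush_before_swap=False)
lost_with = total - stream(frames, flush_before_swap=True)
print(lost_without // BYTES_PER_MS, "ms lost without flush;", lost_with, "bytes lost with flush")
```

In this toy setup the unflushed swap loses 10 ms of audio, while flushing first loses nothing. The flushed partial chunk is shorter than a full chunk, so whatever consumes the chunks has to accept variable-length input.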
b5b2674 to 3a4f98d
Summary
- … SpeechEventType.RECOGNITION_USAGE to RTZR streaming STT so streamed STTMetrics(audio_duration) is collected.
- … audio_duration in STTMetrics
- … (_send_lock, deterministic cleanup, safer recv cancellation).
- … sommers_en model to RTZR streaming STT.
Changes
livekit-plugins/livekit-plugins-rtzr/livekit/plugins/rtzr/stt.py
- … (sample_rate, encoding, domain, ITN/filter flags).
- … send_bytes and EOS on one send path.
- … RECOGNITION_USAGE from sent audio duration.
livekit-plugins/livekit-plugins-rtzr/livekit/plugins/rtzr/rtzrapi.py
- … _token_lock to prevent concurrent refresh races
- … Bearer <token> auth header
- … use_itn, use_disfluency_filter, use_profanity_filter, and language (whisper-only) to build_config().
livekit-plugins/livekit-plugins-rtzr/README.md
Testing
Manual sanity:
audio_duration.
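For a manual check of audio_duration, the expected value can be derived from the number of PCM bytes actually sent. This is a rough sketch, assuming 16-bit linear PCM; `pcm_duration_seconds` and `SentAudioMeter` are hypothetical names for illustration and do not describe how the plugin computes the value.

```python
def pcm_duration_seconds(
    num_bytes: int,
    sample_rate: int,
    num_channels: int = 1,
    bytes_per_sample: int = 2,  # assumption: 16-bit linear PCM
) -> float:
    """Duration of raw PCM audio, in seconds, given its size in bytes."""
    return num_bytes / (sample_rate * num_channels * bytes_per_sample)


class SentAudioMeter:
    """Accumulates bytes as they are sent and reports the duration on demand."""

    def __init__(self, sample_rate: int) -> None:
        self._sample_rate = sample_rate
        self._bytes = 0

    def add(self, num_bytes: int) -> None:
        self._bytes += num_bytes

    def take(self) -> float:
        """Return the accumulated duration and reset the counter."""
        duration = pcm_duration_seconds(self._bytes, self._sample_rate)
        self._bytes = 0
        return duration


meter = SentAudioMeter(sample_rate=16_000)
for _ in range(10):
    meter.add(3_200)  # ten 100 ms chunks at 16 kHz mono, 16-bit
reported = meter.take()
print(reported)
```

Resetting the counter in `take()` means each usage report covers only the audio sent since the previous report, which avoids double counting if reports are emitted periodically.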