feat(assemblyai): add u3-rt-pro model plus mid-stream updates, SpeechStarted, and ForceEndpoint support#4965

Open
gsharp-aai wants to merge 9 commits into livekit:main from gsharp-aai:assemblyai-u3-pro-streaming-new


gsharp-aai commented Feb 27, 2026

Summary

Adds Universal-3-Pro (u3-rt-pro) model support to the AssemblyAI streaming plugin with several improvements to the existing streaming implementation.

New model

  • Add u3-rt-pro to the supported model literals
  • Accept deprecated u3-pro name with a warning, remapped to u3-rt-pro
  • Add prompt parameter (u3-rt-pro only) for custom transcription instructions, validated at init
  • Default language_detection to True for u3-rt-pro model (previously only defaulted for multilingual models)
  • Default min_end_of_turn_silence_when_confident and max_turn_silence to 100ms for u3-rt-pro for optimal out-of-the-box performance/latency across most LiveKit configurations. This provides finals quickly for third-party turn detection models while still working well with built-in turn detection.
    • If a user sets min without setting max, max defaults to match min rather than its API default of 1000ms. Both parameters are fully overridable. For reference, the AssemblyAI API defaults are min=100ms and max=1000ms. We will clearly document the plugin's defaults and how to override them.
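
A minimal sketch of the defaulting rules above, not the plugin's actual code. The function names (`normalize_model`, `resolve_turn_silence`) and the constants are hypothetical; only the model names, the 100ms defaults, and the "max follows min" rule come from this description.

```python
import warnings
from typing import Optional, Tuple

U3_RT_PRO = "u3-rt-pro"
DEPRECATED_ALIASES = {"u3-pro": U3_RT_PRO}  # deprecated name -> current name
U3_DEFAULT_SILENCE_MS = 100  # plugin default described above


def normalize_model(model: str) -> str:
    """Remap a deprecated model name to its replacement and warn the caller."""
    if model in DEPRECATED_ALIASES:
        new_name = DEPRECATED_ALIASES[model]
        warnings.warn(
            f"model '{model}' is deprecated, use '{new_name}' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_name
    return model


def resolve_turn_silence(
    model: str,
    min_silence_ms: Optional[int] = None,
    max_silence_ms: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (min, max) in ms; None means 'leave it to the API default'."""
    if normalize_model(model) != U3_RT_PRO:
        # Other models are left untouched in this sketch.
        return min_silence_ms, max_silence_ms
    resolved_min = U3_DEFAULT_SILENCE_MS if min_silence_ms is None else min_silence_ms
    # max follows min when the user did not set it explicitly.
    resolved_max = resolved_min if max_silence_ms is None else max_silence_ms
    return resolved_min, resolved_max
```

With these rules, setting only `min_silence_ms=300` yields `(300, 300)`, while setting only `max_silence_ms=800` yields `(100, 800)`.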

Speaker diarization support

  • Add speaker_labels parameter (bool) to enable speaker diarization on the streaming connection
  • Add max_speakers parameter (int, 1-10) to set the maximum number of speakers when diarization is enabled
  • Both are connection-level parameters sent as WebSocket query params at connect time (not updatable mid-stream via UpdateConfiguration)
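
As an illustration of connection-level parameters, the sketch below builds a WebSocket URL with the diarization options as query params. The base URL is a placeholder, and the `model` query key and the lowercase `true`/`false` encoding are assumptions; only `speaker_labels`, `max_speakers`, and the 1-10 range come from this description.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

# Placeholder endpoint for illustration only, not the real streaming URL.
BASE_URL = "wss://streaming.example.invalid/ws"


def build_ws_url(
    model: str,
    speaker_labels: Optional[bool] = None,
    max_speakers: Optional[int] = None,
) -> str:
    """Build a connect URL; unset options are omitted from the query string."""
    if max_speakers is not None and not 1 <= max_speakers <= 10:
        raise ValueError("max_speakers must be between 1 and 10")

    params: Dict[str, str] = {"model": model}  # hypothetical query key
    if speaker_labels is not None:
        # Assumed encoding: booleans as lowercase strings.
        params["speaker_labels"] = "true" if speaker_labels else "false"
    if max_speakers is not None:
        params["max_speakers"] = str(max_speakers)
    return f"{BASE_URL}?{urlencode(params)}"
```

Because these values are fixed in the URL at connect time, changing them would require a new connection rather than an UpdateConfiguration message.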

Mid-stream configuration updates

  • Replace reconnect-based update_options() with in-place UpdateConfiguration websocket messages (previously, updating options would tear down and restart the entire websocket connection)
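  • Queue-based approach (asyncio.Queue) for handing updates from the synchronous update_options() call to the async connection, with a dedicated coroutine that sends config messages immediately and independently of audio flow. Note that asyncio.Queue itself is not thread-safe, so this assumes update_options() is called from the event loop thread.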
  • Supported fields: prompt, keyterms_prompt, max_turn_silence, min_end_of_turn_silence_when_confident, end_of_turn_confidence_threshold, vad_threshold
  • Add keyterms_prompt to update_options() (previously only available at connection time)
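
The queue-plus-sender pattern described above could look roughly like the following. This is a sketch, not the plugin's code: `build_update_message`, `send_config_loop`, and `demo` are hypothetical names, `fake_send` stands in for the websocket send call, and the flat message layout (a `type` field next to the option fields) is an assumption.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, List

# Fields listed above as updatable mid-stream.
UPDATABLE_FIELDS = frozenset({
    "prompt",
    "keyterms_prompt",
    "max_turn_silence",
    "min_end_of_turn_silence_when_confident",
    "end_of_turn_confidence_threshold",
    "vad_threshold",
})


def build_update_message(**options: Any) -> str:
    """Serialize an UpdateConfiguration message (assumed flat layout)."""
    unknown = set(options) - UPDATABLE_FIELDS
    if unknown:
        raise ValueError(f"not updatable mid-stream: {sorted(unknown)}")
    payload = {"type": "UpdateConfiguration"}
    payload.update({k: v for k, v in options.items() if v is not None})
    return json.dumps(payload)


async def send_config_loop(
    queue: "asyncio.Queue[str]", send: Callable[[str], Awaitable[None]]
) -> None:
    """Forward queued config messages as soon as they arrive."""
    while True:
        message = await queue.get()
        await send(message)
        queue.task_done()


async def demo() -> List[str]:
    sent: List[str] = []

    async def fake_send(message: str) -> None:  # stand-in for ws.send_str
        sent.append(message)

    queue: "asyncio.Queue[str]" = asyncio.Queue()
    sender = asyncio.create_task(send_config_loop(queue, fake_send))

    # A synchronous caller on the event loop thread can enqueue without awaiting.
    queue.put_nowait(build_update_message(max_turn_silence=500))
    queue.put_nowait(build_update_message(keyterms_prompt=["LiveKit"]))

    await queue.join()  # wait until both messages were sent
    sender.cancel()
    try:
        await sender
    except asyncio.CancelledError:
        pass
    return sent
```

Running `asyncio.run(demo())` returns the two serialized messages in the order they were queued, without any audio frames being involved.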

New websocket message support

  • SpeechStarted: Handle new server event, mapped to SpeechEventType.START_OF_SPEECH for barge-in detection
  • ForceEndpoint: Add force_endpoint() method to immediately finalize the current turn via {"type": "ForceEndpoint"}
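
A small sketch of the two message paths above. The `SpeechEventType` enum here is a local stand-in for the framework's enum, and the helper names are hypothetical. The `{"type": "ForceEndpoint"}` payload is taken from this description; that SpeechStarted arrives as JSON with a matching `type` field is an assumption.

```python
import enum
import json
from typing import Optional


class SpeechEventType(enum.Enum):
    """Local stand-in for the framework's event enum."""
    START_OF_SPEECH = "start_of_speech"


def map_server_event(raw: str) -> Optional[SpeechEventType]:
    """Map an incoming server message to a speech event, if it has one."""
    data = json.loads(raw)
    if data.get("type") == "SpeechStarted":
        return SpeechEventType.START_OF_SPEECH
    return None  # other message types are handled elsewhere


def force_endpoint_message() -> str:
    """Serialize the message that asks the server to finalize the current turn."""
    return json.dumps({"type": "ForceEndpoint"})
```

In the plugin, the string returned by a helper like `force_endpoint_message()` would be sent over the same websocket as the configuration updates.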

Fixes

  • Set interim_results=True in STTCapabilities (was incorrectly False despite emitting INTERIM_TRANSCRIPT events)
  • Fix send_config_task shutdown hang by separating it from asyncio.gather so it is cancelled in finally instead of blocking graceful shutdown
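
The shutdown fix can be illustrated with a generic asyncio sketch. The task bodies below are simulated stand-ins and the names are hypothetical. The point is that a loop which waits forever on a queue would keep `asyncio.gather` from returning if it were included in it, so it runs as a separate task and is cancelled in `finally`.

```python
import asyncio
from typing import List


async def run_session(log: List[str]) -> None:
    config_queue: "asyncio.Queue[str]" = asyncio.Queue()

    async def send_audio() -> None:
        await asyncio.sleep(0.01)  # simulated: audio input ends
        log.append("send_audio done")

    async def recv_events() -> None:
        await asyncio.sleep(0.01)  # simulated: server closes the stream
        log.append("recv_events done")

    async def send_config() -> None:
        try:
            while True:
                await config_queue.get()  # never finishes on its own
        except asyncio.CancelledError:
            log.append("send_config cancelled")
            raise

    # Kept out of gather(): gather would otherwise wait on it forever.
    config_task = asyncio.create_task(send_config())
    try:
        await asyncio.gather(send_audio(), recv_events())
    finally:
        config_task.cancel()
        try:
            await config_task
        except asyncio.CancelledError:
            pass
```

Here `run_session` returns as soon as the audio and receive tasks finish, and the config sender is cleaned up afterwards.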

…eEndpoint

- Rename model from u3-pro to u3-rt-pro
- Replace reconnect-based update_options with UpdateConfiguration websocket messages
- Add SpeechStarted event handler (maps to START_OF_SPEECH)
- Add force_endpoint() to immediately finalize turns
- Add keyterms_prompt to update_options/UpdateConfiguration
- Fix interim_results capability (True, not False)
gsharp-aai marked this pull request as draft February 27, 2026 01:20

Move queue drain into a separate send_config_task coroutine so
ForceEndpoint and UpdateConfiguration messages are sent immediately,
even when no audio frames are flowing.
… u3-rt-pro

Separate send_config_task from gather so it is cancelled in finally
instead of blocking shutdown. Default language_detection to True for
u3-rt-pro model.
Accept 'u3-pro' as a deprecated model name that remaps to 'u3-rt-pro'
with a warning. For u3-rt-pro, default min_end_of_turn_silence_when_confident
and max_turn_silence to 100ms (max follows min if only min is set) to
minimize latency for external turn detectors.
gsharp-aai marked this pull request as ready for review February 27, 2026 20:29
Adds support for the new streaming diarization params (speaker_labels bool,
max_speakers 1-10) as connection-level query params on the WebSocket URL.