feat(tts): add gradium tts support by BenWeekes · Pull Request #2193 · TEN-framework/ten-framework

BenWeekes · 2026-06-24T11:22:39Z

Summary

Adds the Gradium TTS extension (gradium_tts_python), a websocket streaming TTS built on AsyncTTS2BaseExtension, plus a voice_assistant_gradium example graph.

Vendor client (gradium_tts.py): websocket setup → ready → text/end_of_stream → audio/error, with auth-error classification (401/403/1008 → fatal).
Extension (extension.py): streams audio through a single finalize path (_finalize_request → finish_request); request-level TTFB metric (emitted once per request even across multiple vendor segments); audio framing prefers the server-ready sample rate, falling back to config.
Config (config.py): pydantic model with param normalization, output-format/sample-rate reconciliation, and api-key redaction in logs.
Example wiring: new voice_assistant_gradium predefined graph (deepgram ASR + openai LLM + gradium TTS), manifest dependency, regenerated lockfile.

Tests

Standalone suite mirrors the sibling TTS extensions (basic audio, flush, state machine, robustness, params, metrics, error handling), plus gradium-specific coverage for its text-batching and per-request socket model.

task test-extension EXTENSION=agents/ten_packages/extension/gradium_tts_python → 27 passed
task tts-guarder-test EXTENSION=gradium_tts_python CONFIG_DIR=tests/configs → 15 passed, 1 skipped

Follow-ups (non-blocking)

Add punctuation-only / list-transition robustness regressions to match the documented robustness matrix.
Add an explicit 48 kHz config/test if we advertise pcm_48000.
If a future refactor reuses one websocket across segments, revisit reconnect/backoff coverage.

…l to url

BenWeekes · 2026-06-25T14:24:48Z

TTS guarder + standalone test records (commit `931cc24`)

Addressed both review comments on config.py:

sample_rate is the single source of truth — Gradium only supports PCM, so the wire output_format is now derived as pcm_<sample_rate> and is no longer a user-settable knob (removed the two conflicting normalize/derive helpers). This matches the closest peer minimax_tts_websocket and the convention across other TTS extensions (sample_rate is the canonical rate knob).
base_url → url — the field holds the full websocket endpoint and is used as-is (no path appended in code), matching the url convention used by the modern WS extensions (minimax, vibevoice).

TTS guarder — `task tts-guarder-test EXTENSION=gradium_tts_python`

Run against the real Gradium API, from the branch worktree (container /app is a bind mount of this branch). 15 passed, 1 skipped in 363.94s.

Case	Result
test_append_input	✅ passed
test_append_input_stress	✅ passed
test_append_input_without_text_input_end	✅ passed
test_append_interrupt	✅ passed
test_basic_audio_setting	✅ passed
test_corner_input	✅ passed
test_dump	✅ passed
test_dump_each_request_id	✅ passed
test_empty_text_request	✅ passed
test_flush	✅ passed
test_interleaved_requests	✅ passed
test_invalid_required_params	✅ passed
test_invalid_text_handling	✅ passed
test_metrics	✅ passed
test_miss_required_params	✅ passed
test_subtitle_alignment	⏭️ skipped (disabled by default; optional — only for providers that emit word/segment timing, which Gradium does not)

Standalone suite — `task test-extension EXTENSION=...gradium_tts_python`

28 passed (includes a new test_output_format_derived_from_sample_rate_only that locks in: a user-supplied output_format is ignored, sample_rate drives the derived pcm_<rate>, and it is not leaked as a vendor passthrough param).

Both runs were executed on the branch code; black format-check is clean.

github-actions · 2026-06-25T14:36:06Z

Code Review: Gradium TTS extension

Thanks for the thorough work here — the per-request socket model, the single finalize path, and the test suite mirroring the sibling extensions are all solid. A few things worth a look before merge.

Potential bugs

1. on_data double-queues every tts_text_input message (extension.py).
This is the most important one. The new on_data override appends each tts_text_input to self.ingress_messages, then calls super().on_data(...), which dispatches to request_tts(t) with that same parsed payload. Inside request_tts you then call self._pop_next_ingress_message(t.request_id, t), popping from the queue and processing the popped message instead of the one the base class handed you.

For a strictly FIFO base class these line up, but the design is fragile:

It assumes the base class delivers messages to request_tts in exactly the same order and count as on_data sees them. If the base class ever filters, coalesces, or reorders (e.g. drops a message after text_input_end), the queue and the dispatch stream drift permanently out of sync — every subsequent request processes a stale t.
_pop_next_ingress_message is also called once per request_tts, but on_data enqueues once per data event. If the base class ever invokes request_tts more than once for a payload (or zero times), the counts diverge.

What's the queue actually buying you over just using the t the base class passes in? If it's only to recover metadata/ordering, consider dropping the parallel queue and working directly off t. If it's genuinely needed, please add a comment explaining the invariant ("base class calls request_tts exactly once per tts_text_input, in order") since the correctness hinges entirely on it.

2. cancel_tts finalizes but request_contexts may already be gone.
cancel_tts sets context.finished = True then calls _finalize_request, which does del self.request_contexts[request_id]. If cancel_tts fires twice, or after a natural REQUEST_END finalize, the second _finalize_request returns early (good — context is None guard), but cancel_tts still calls self.client.cancel() and re-issues a send_tts_audio_end path only if context exists. Worth a quick test for double-cancel / cancel-after-completion to confirm no duplicate tts_audio_end.

3. Clean-close detection relies on getattr(exc, "code", None) == 1000 (gradium_tts.py).
You depend on websockets 14.x for additional_headers (good — the old extra_headers was removed). But the ConnectionClosedOK/ConnectionClosed exceptions in 14.x expose the close code via exc.rcvd.code / exc.sent.code, not a top-level exc.code attribute. The tests pass because they construct a fake exception with a literal .code = 1000, so this path is only ever exercised against the mock, never the real library exception. Please verify against a real ConnectionClosedOK from websockets 14 — if .code isn't present, a normal end-of-stream close (1000) will be reported as EVENT_TTS_ERROR instead of EVENT_TTS_END. The sibling ASR extensions parse the code out of str(ConnectionClosed) precisely because of this.

4. _wait_for_ready maps every recv failure to 401/500.
A timeout waiting for ready (server slow, network hiccup) becomes status_code=500 → NON_FATAL, which is fine, but a genuinely transient close gets classified via _closed_during_setup which treats codes {1000, 1008, 1011} as auth (401 → FATAL). A 1011 (server internal error) or even a clean 1000 during setup isn't necessarily an auth problem, and marking it FATAL stops retries. Consider narrowing the FATAL set to actual auth signals (1008 + the auth token list) and letting the rest stay NON_FATAL.

Code quality

Dead/duplicated state. The extension now keeps both per-request RequestContext and a parallel set of instance mirrors (self.current_request_finished, self.total_audio_bytes, self._audio_start_sent, self.current_request_sample_rate, self.sent_ts) that are continually copied back and forth via _apply_request_context and manual assignments. This is a lot of surface area for the two to disagree. Since synthesize_audio_sample_rate() is the only base-class hook that reads instance state mid-stream, consider keeping only the context dict and a single current_request_id, and deriving everything else from self.request_contexts[current_request_id].
_calculate_audio_duration_ms hardcodes bytes_per_sample = 2 and channels = 1 instead of synthesize_audio_sample_width() / synthesize_audio_channels() like the prior version. That's correct for Gradium's 16-bit mono PCM today, but a magic constant that silently breaks if the format ever changes. A short comment ("Gradium is always 16-bit mono PCM") would help.
_clear_local_request_state swallows flush errors with a bare except Exception: pass. At least a debug log would help diagnose dump-file issues.
README/manifest drift: the README documents params.url and a cLONiZ4hQ8VpQ4Sz default, which matches the new config — good. But the PR summary still references output_format/pcm_48000 follow-ups; just confirm the README's removal of output_format as a user knob is intentional and complete (it is, per config.py, since it's now derived).

Performance

The per-request connect → ready → text → end → disconnect cycle pays a full websocket handshake + ready round-trip on every flush segment. With the new text-batching (MIN_STREAM_FLUSH_CHARS=80) this is mitigated, but a long streamed LLM response that flushes several times will reconnect several times within one request, each adding handshake latency to TTFB for that segment. The TTFB metric is only emitted once per request, so this added latency on later segments is invisible in metrics. The follow-up note about reusing one socket across segments is the right call — worth prioritizing.

Security

API-key redaction in to_str now covers both self.api_key and params["api_key"] — nice catch, that's a real leak the old code had.
_connect falls back to status_code=401 whenever "401" in message or "403" in message. A server error body that merely contains the substring "401" (e.g. a request ID) would be misclassified as fatal auth. Low risk, but string-sniffing status codes is brittle; prefer the structured InvalidStatus.response.status_code path you already have and only fall back for the truly opaque case.

Tests

Good coverage of batching, interleaving, metrics-once, and the state machine. Gaps:

No test exercises the double-queue invariant in on_data (bug Update README.md #1) — e.g. interleaved messages where the base dispatch order could differ.
The clean-close test (Fix speech sdk #3) uses a hand-rolled exception with .code, so it doesn't protect against the real websockets 14 exception shape. Consider importing ConnectionClosedOK from websockets.exceptions and asserting your handler maps a real instance to EVENT_TTS_END.
No cancel-after-finalize / double-cancel test (modify playground default settings，fix voice type error #2).

Overall this is close. The on_data double-queue (#1) and the clean-close .code assumption (#3) are the two I'd want resolved or explicitly justified before merge; the rest are polish.

github-actions · 2026-06-29T13:30:51Z

Review: feat(tts): add gradium tts support

Solid, well-structured extension that closely mirrors the sibling TTS extensions. The per-request socket model, request-scoped RequestContext, single-emit TTFB, and the websockets >= 14 close-code handling are all done carefully, and the test suite is genuinely comprehensive. A few things worth a look before merge.

Potential bugs

tts_flush clears all request state, not just the flushed request. In on_data, after delegating to the base class, _clear_local_request_state() wipes request_contexts, recorder_map, and ingress_messages wholesale. The whole PR is built around interleaved request_id handling (the on_data ordering override, per-request contexts). A flush that targets one turn discarding the buffered state of other in-flight requests seems to contradict that invariant. If flush is genuinely whole-pipeline this is fine, but worth confirming against base-class flush semantics and adding an interleaved-flush test.
TTFB handler overwrites context.sent_ts. In _process_tts_text, the EVENT_TTS_TTFB_METRIC branch does context.sent_ts = datetime.now(). Since request_event_interval_ms in _finalize_request is computed as now - context.sent_ts, the reported interval is measured from first-audio rather than from request send. The TTFB value itself is correct (the client computes it from its own _sent_ts), but the interval semantics may not be what you intend. Worth a comment explaining the reset, or dropping it.
current_request_sample_rate is instance-level under interleaving. synthesize_audio_sample_rate() returns the instance field, set per-chunk in _process_tts_text. Duration math correctly uses context.sample_rate, but anything the base class reads via synthesize_audio_sample_rate() can reflect a different request's rate when requests interleave. Probably benign given serialized request_tts, but flagging since the rest of the design is rigorously per-context.

Code quality

Auth classification by substring matching is brittle. _looks_like_auth_error / _is_auth_error scan for \"401\", \"403\", \"1008\", \"unauthorized\", etc. in free-text error bodies. A non-auth error whose message happens to contain 401 would be promoted to FATAL_ERROR (or vice-versa). The InvalidStatus.response.status_code path is the reliable signal; the string heuristic is a reasonable fallback but worth a comment noting its best-effort nature.
json_config type mismatch. The field is typed dict[str, Any] | str | None and forwarded raw in _send_setup, but manifest.json declares json_config as \"type\": \"string\". Either accept only the string form or document that a dict is also valid; right now schema and model disagree.
_disconnect swallows all errors silently (except Exception: pass). Fine for close, but a log_debug would help diagnose stuck-socket cases.
The _close_code helper and the on_data ordering override are both well-documented — nice.

Security

API-key redaction in to_str now covers both the top-level field and params[\"api_key\"] — good. Other params values are still logged verbatim; fine today, but if Gradium adds another secret-bearing param it won't be masked.
Dump files are never cleaned up. dump_path defaults to /tmp and _setup_recorder does os.makedirs(..., exist_ok=True); synthesized audio accumulates with no retention bound. dump defaults False, so low risk, but worth a README note that dumps are persistent and operator-managed.

Tests

Good coverage across basic audio, flush, state machine, params, metrics, and error handling, plus the gradium-specific interleaving and websockets-14 close-code regression. Gaps to consider:

An interleaved-flush test exercising concern Update README.md #1.
A test asserting request_event_interval_ms semantics (concern modify playground default settings，fix voice type error #2), so a future refactor doesn't silently change it.
Confirm the websockets >= 14 lower bound is pinned in the extension's Python requirements — the code now depends on additional_headers, websockets.asyncio.client.ClientConnection, and State.OPEN, none of which exist on older releases. A stale environment would fail at import/connect rather than gracefully.

The follow-ups you listed (punctuation/list robustness regressions, explicit 48 kHz coverage) are reasonable to defer. Nice work overall.

BenWeekes · 2026-06-29T13:56:01Z

Update — commits `3bb3e909` and `b2c1b819`

On the `on_data` override (re: "the base class already implements `on_data`, so it shouldn't be needed")

I looked into removing the override and it turns out it's load-bearing — removing it drops text on interleaved requests. Keeping it, with a comment now explaining why.

Mechanism. Gradium is the only TTS extension that batches multiple same-request_id segments into a single vendor request (it accumulates pending_text and flushes on a sentence boundary or text_input_end, then deletes the per-request context on finalize). The base class buffers messages for a not-yet-active request_id and, in finish_request, re-queues them to the tail of its input queue. For interleaved requests this can hand request_tts() a request's text_input_end segment before an earlier segment of the same request — which flushes and deletes the context early, dropping the earlier text.

ingress_messages is a per-request_id FIFO captured at on_data arrival time and replayed in order in request_tts(), which neutralises that reordering.

Evidence. I empirically removed the override and ran the standalone suite — test_interleaved_requests_keep_separate_buffers fails, dropping interleave_req_2's first segment:

At index 1: ('interleave_req_2', 'It also finishes in one vendor request.')
  != ('interleave_req_2', 'Request two starts now and stays separate. It also finishes...')

Why the other extensions don't need this: I checked minimax, cartesia, elevenlabs, and xai_tts — they all send each segment to the vendor immediately inside request_tts, with no per-request text buffer, so an out-of-order text_input_end can't drop earlier text. Gradium's per-request-socket model makes tiny per-fragment requests expensive, which is why it batches — and why it alone is exposed.

b2c1b819 adds a thorough comment on on_data/ingress_messages documenting this invariant and pointing at the covering test.

Note: the cleaner long-term fix is in the base class — if finish_request preserved per-request arrival order when re-releasing buffered messages (rather than re-queuing to the tail), gradium could drop ingress_messages entirely and every extension would be safe. Happy to file that as a separate issue against ten_ai_base.

websockets 14 close-code (`3bb3e909`)

Separately fixed a latent bug: websockets~=14.0 exposes the close code on exc.rcvd/exc.sent (each a Close frame with .code), not a top-level exc.code. The clean-close path in _iter_messages relied on getattr(exc, "code", ...), so a real 1000 end-of-stream close would be misclassified as EVENT_TTS_ERROR instead of EVENT_TTS_END (the existing test passed only because it hand-built an exception with a top-level .code). Added a _close_code helper that reads rcvd/sent first and falls back to .code, plus a regression test using the real websockets-14 exception shape.

Standalone suite: 29 passed (28 + the new regression test); black clean.

YiminW · 2026-06-30T02:53:52Z

I don't think Gradium should batch all text segments with the same request_id before sending them to the vendor.

A few reasons:

This is not how the other TTS extensions behave. Cartesia / ElevenLabs / xAI all forward each incoming TTSTextInput segment to the vendor/client layer as it arrives, and use text_input_end only to finalize the request. Gradium is currently different: it accumulates pending_text and only sends when the buffer reaches a sentence/size threshold or text_input_end.
I don't think the AsyncTTS2BaseExtension ordering issue described here is accurate. tts2.py buffers messages from other request_ids while one request is being processed, and finish_request() releases the next buffered request back to the input queue in FIFO order. It serializes interleaved requests, but it should not reorder messages within the same request_id such that a text_input_end=True segment is processed before earlier segments of that same request. If we have a concrete repro for that, it should probably be fixed/tested in AsyncTTS2BaseExtension, not worked around by overriding on_data() in the Gradium extension.
Gradium's official streaming guidance does not recommend waiting for the whole logical request before sending text. Their LLM-to-TTS recipe says to open one TTS WebSocket before the LLM starts, buffer tokens into word/phrase chunks, send each chunk as a text message, add <flush> at sentence/end boundaries, and read audio concurrently:
https://docs.gradium.ai/guides/recipes/llm-to-tts.md

The documented chunking rules are:
- send complete words, phrases, or sentences
- keep punctuation attached to the preceding word
- avoid one-token-per-message unless tokens are already clean word chunks
- add <flush> at natural boundaries
- avoid frequent flushes because they can make prosody choppy
In our pipeline, the LLM already guarantees that every emitted segment ends with punctuation, so the incoming segments should already satisfy Gradium's recommended chunking constraints.

github-actions · 2026-06-30T08:18:44Z

Review: feat(tts): add gradium tts support

Solid extension. It follows the sibling-TTS pattern, the per-request streaming session model is clean, and test coverage is good. A few issues worth addressing before merge.

Bugs / correctness

json_config schema/type mismatch (likely runtime failure). In config.py, json_config is typed dict[str, Any] | str | None, and test_params.py passes a dict ({"speed": 1.1}). But manifest.json declares the param as "type": "string". The TEN property-schema validator will reject a dict value for a string-typed property when the graph passes an object, so a real graph using an object json_config may fail validation even though the unit test (which builds the config directly, bypassing the manifest) passes. Either declare it as an object in the manifest or constrain the Python type to str.
Example graph still uses removed base_url key. property.json (the new voice_assistant_gradium graph) sets params.base_url: "wss://api.gradium.ai/...", but the config dropped base_url in favor of url. update_params() no longer extracts base_url, so it silently falls into vendor params and gets sent to the server in the setup payload as an unexpected base_url field, while the actual URL falls back to the default. The endpoint happens to match the default so it "works", but the stray key is sent to the vendor and a non-default region URL would be silently ignored. Rename to url.
request_tts resets reader state on every new request_id without ensuring the prior reader finished. When request_id changes, _reset_request_state() runs but the previous _reader_task is not cancelled/awaited here (only cancel_tts/on_stop cancel it). If a new request arrives before the prior stream's reader finishes, the orphaned task can still call send_tts_audio_*/_finalize_request. _finalize_request guards on request_id != self.current_request_id (good), but the orphaned reader can still write to a stale recorder and emit audio frames against the old request. Consider awaiting _cancel_reader_task() on request transition.

Robustness

audio_events() / _iter_messages has no idle timeout cap across a full request. WS_RECV_TIMEOUT=10s is per-recv. A vendor that keeps the socket open and dribbles a byte every 9s would never time out the request. The base class flush path covers interrupts, but a hung-but-alive socket has no overall ceiling. Minor, but worth noting given the follow-ups already mention reconnect/backoff.
_disconnect() swallows all close exceptions silently (except Exception: pass). Fine for cleanup, but a debug log would help diagnose stuck-close situations.

Style / minor

extension.py imports asyncio and uses a background reader task — good — but _cancel_reader_task swallows all exceptions with a bare except Exception: pass. At least log at debug so a crashing reader isn't invisible.
config._ensure_dict silently converts any non-dict params (including a populated list/str) to {}, discarding user input without warning. A log_warn on the discard path would save debugging time.
Auth-error detection is duplicated in two places: extension._is_auth_error (checks 401/403/unauthorized/forbidden) and gradium_tts._looks_like_auth_error (a superset). Consider consolidating to one source of truth to avoid divergence.

Test coverage

Coverage is strong — basic audio, dump, flush, segmented sessions, metrics (incl. once-per-request TTFB), params passthrough, sample-rate derivation, robustness (empty/whitespace/punctuation/long/special), and auth-error classification. Two gaps:

No test asserts the base_url→url migration, which is exactly why issue modify playground default settings，fix voice type error #2 slipped through. A test that loads the example graph property or asserts base_url is rejected/ignored would catch it.
The mock client (gradium_mocks.py) models the streaming API faithfully but never exercises the real _close_code / _closed_during_setup logic for websockets >= 14 close frames. Given the careful version-compat handling there, a small unit test against a synthetic exception carrying rcvd.code would lock that behavior in.

Overall this is close. Issues #1 and #2 are the ones I'd want resolved before merge since they affect real graph configs, not just internals.

BenWeekes · 2026-06-30T09:15:28Z

Update — stream segments immediately (commit `89da4f70`)

This reworks the extension to forward each tts_text_input segment to Gradium as it arrives, instead of accumulating segments in a local buffer. It resolves the earlier review feedback about overriding on_data, and fixes the dropped-audio behaviour seen in manual testing.

What changed

gradium_tts.py — the client is now a persistent per-request streaming session: start_session() (connect + setup + ready) → send_text(segment) for each segment → end_input() (end_of_stream) → audio read concurrently via audio_events(). (Previously it opened one websocket per get() call.)
extension.py — request_tts() forwards each segment immediately over a single session for the request; a background reader streams audio back and finalizes on stream end. Removed the pending_text batching, _should_flush_pending_text, the on_data override, ingress_messages, and the per-request RequestContext dict.
Retains the websockets-14 close-code fix (exc.rcvd/exc.sent vs top-level exc.code).

Why

Matches the other websocket TTS extensions. Cartesia, ElevenLabs, xAI and Minimax all forward each segment to the vendor immediately and use text_input_end only to finalize. Gradium was the only one batching, which is also what forced the non-standard on_data override.
Matches Gradium's official LLM-to-TTS guidance (open one socket, send each chunk as a text message, read audio concurrently). Our LLM already emits punctuation-terminated segments, so each incoming segment is a valid chunk to send as-is.
Fixes dropped audio. Holding text in a local buffer that a finalize could flush-and-discard was dropping sections of speech. With immediate forwarding there is no buffer to lose, and the base class owns queuing/ordering — so the on_data override is no longer needed.

Verification

Manual: voice_assistant_gradium graph — clean multi-turn conversation; logs show each request completing with reason=REQUEST_END, one TTFB metric per request, no errors/warnings, api_key redacted.
TTS guarder (real Gradium API): 15 passed, 1 skipped (incl. test_interleaved_requests, test_flush, test_append_interrupt, test_corner_input, test_metrics).
Standalone suite: 29 passed (rewritten to the streaming API; adds immediate-forwarding, single-session-per-request, and punctuation-only coverage).
Formatting: task check (black, line-length 80) clean.

Net change: roughly −260 lines; the extension is now simpler and consistent with the rest of the TTS extensions.

Gradium previously accumulated same-request_id segments into a local buffer and only flushed on a sentence/size threshold or text_input_end. Combined with the base class re-queuing interleaved messages, this could drop sections of TTS audio and required a non-standard on_data override. Forward each segment to the vendor as it arrives over one persistent per-request websocket session (start_session/send_text/end_input plus a concurrent audio reader), matching Gradium's LLM-to-TTS streaming guidance and the other websocket TTS extensions. The base class now owns queuing/ordering, so the on_data override and ingress_messages are removed. Also in this change: - parse json_config (a JSON string per the manifest schema) into an object before sending it on the wire, so manifest, config and README agree. - cancel the background reader task on a send/setup failure so it cannot outlive the request. - read the websockets-14 close code from exc.rcvd/exc.sent (it is no longer a top-level exc.code) so a clean 1000 close is treated as end-of-stream. Tests rewritten to the streaming API, with added coverage for immediate forwarding, single-session-per-request, json_config parsing, reader-task cancellation, and punctuation-only input.

github-actions · 2026-06-30T10:03:00Z

Review: `feat(tts): add gradium tts support` (#2193)

Solid extension. The refactor to a single-session-per-request streaming model with a background reader task, the once-per-request TTFB metric, and the websockets >= 14 close-code handling are all well done, and the test suite is genuinely thorough (state machine, metrics, params, robustness, error paths). A few things worth a look before merge.

Potential bugs / correctness

Background reader task is not cancelled when a new request_id arrives mid-stream. request_tts resets state via _reset_request_state() when t.request_id != self.current_request_id, but that method does not touch self._reader_task. The only places the reader is cancelled are cancel_tts and on_stop. If a new request begins before the prior request's vendor stream has closed (reader still blocked on recv()), start_session() will _disconnect()/_connect() and replace self.ws, then a second reader task is created and assigned to self._reader_task — the orphaned first task is still iterating audio_events() -> self.ws.recv(), so two coroutines read the same new socket and can steal each other's frames. Recommend await self._cancel_reader_task() at the top of the new-request branch (before _reset_request_state). Reachable for back-to-back requests that arrive without an intervening flush.
WS_RECV_TIMEOUT = 10s between chunks emits a hard error. In _iter_messages, an asyncio.TimeoutError yields EVENT_TTS_ERROR and finalizes the request. For long synthesis where the vendor pauses >10s between audio frames, this surfaces a spurious TTS error. If the vendor can legitimately go quiet that long, consider distinguishing "no data yet but socket open" from a true stall, or making the timeout configurable.
assert self.ws is not None in _iter_messages / _send_json. Asserts are stripped under python -O, and a teardown race that nulls self.ws would become an AttributeError rather than a clean handled path. The cancel ordering mitigates this today, but an explicit if self.ws is None: return is more robust than an assert.

Style / maintainability

Hardcoded bytes_per_sample = 2 / channels = 1 in _calculate_audio_duration_ms. Correct for Gradium's 16-bit mono PCM, but it diverges from the base helpers (synthesize_audio_sample_width() / synthesize_audio_channels()) the prior version used. The comment helps; consider deriving from the base config so duration math can't silently drift.
README/manifest sync. Config table (params.url, retry_for_s, pronunciation_id) matches the manifest schema today — just keep them in lockstep on future edits.

Security

API key handling is correct: sent only via the x-api-key header, popped out of params in update_params so it can't leak into the setup passthrough payload (verified by test_params_passthrough), and redacted in to_str for both the top-level field and params.
User-configurable websocket url is expected for a vendor endpoint; no concern.

Test coverage

Strong overall. One gap maps to bug #1: no test exercises a new request arriving while the previous reader is still streaming. Worth adding given the concurrency model. The punctuation-only/list-transition robustness and explicit 48 kHz config are already noted as non-blocking follow-ups — agreed.

Nice work. Nothing here is a hard blocker except possibly #1, depending on how the base class serializes requests; the rest is polish.

Capture the gates that were easy to miss this round: the commitlint body-max-line-length rule (there is no local commit-msg hook, so CI is the first thing to catch a long body line), and a single pre-push checklist covering black, lint, standalone tests, guarder tests and commit messages. Also note the stale .ten/ gotcha: a leftover standalone install makes task check report spurious reformatting and breaks the next install.

github-actions · 2026-06-30T11:40:02Z

Review: feat(tts): add gradium tts support

Solid, well-structured extension. It uses the correct websockets 14 API (additional_headers, ClientConnection.state == State.OPEN, close code read from exc.rcvd/exc.sent), redacts the API key in both the top-level field and params, emits TTFB once per request across multiple segments, and ships a genuinely good test suite — including explicit websockets-14 close-frame regression tests and connection-error classification. Nice work.

A few things worth considering before merge.

Potential issues

1. WS_RECV_TIMEOUT (10s) can conflict with retry_for_s.
_iter_messages aborts with an error if any single recv() exceeds 10s:

raw_msg = await asyncio.wait_for(self.ws.recv(), timeout=WS_RECV_TIMEOUT)

But the config exposes retry_for_s, which lets the server hold the connection while retrying. If a user sets retry_for_s above ~10s, the client-side recv timeout tears down a connection the server still considers healthy and surfaces a spurious EVENT_TTS_ERROR ("Timeout waiting for Gradium audio"). Consider deriving the recv timeout from retry_for_s (plus margin) when set, or documenting the ceiling.

2. Passthrough params can clobber reserved setup fields.
In _send_setup, the vendor-extras loop runs after the reserved keys are set:

if self.config.voice_id:
    payload["voice_id"] = self.config.voice_id
...
for key, value in self.config.params.items():
    payload[key] = value

A passthrough param named type, voice_id, voice, or json_config would silently overwrite the framework-set value (e.g. type: "setup"). Recommend skipping keys already present in payload, or keeping a reserved-key denylist, so a stray param can't break the setup envelope.

3. Possible orphaned reader task on request switch without text_input_end.
request_tts starts a reader task on first text, but on a new request_id it calls _reset_request_state() (not _cancel_reader_task()) before creating a new task:

self._reader_task = asyncio.create_task(self._read_audio(t.request_id))

In the normal flow the prior request ends via end_input and its reader has finished. But if a new request_id arrives while a prior reader is still live (no text_input_end seen), the old task reference is overwritten without being cancelled or awaited — an orphan. And since _finalize_request guards on request_id != self.current_request_id and returns early, that orphan's eventual finalize is dropped silently. cancel_tts/flush handle this correctly; the gap is the request-switch path. Cheap fix: await self._cancel_reader_task() at the top of the new-request branch.

Minor

_calculate_audio_duration_ms hardcodes bytes_per_sample = 2 / channels = 1 instead of using synthesize_audio_sample_width() / synthesize_audio_channels(). Correct for Gradium today (16-bit mono PCM) and the inline comment says so — just noting the divergence from the base-class helpers.
README advertises sample_rate ∈ {8000,16000,22050,24000,44100,48000} and the PR lists a follow-up to add an explicit 48 kHz config/test — worth doing since it's sent on the wire as pcm_48000.

Test coverage

Strong. Covers basic audio, flush, dump byte-for-byte comparison, segmented/single-TTFB metrics, params passthrough + json_config parsing, sample-rate variants, empty/whitespace/long/special/punctuation-only text, sequential requests, reconnect-after-error, clean-close → END (both legacy .code and websockets-14 .rcvd/.sent shapes), 1008 → fatal auth mapping, and reader-task cancellation on send failure. The two gaps mapping to the issues above would be good additions: a recv-timeout-vs-retry_for_s case, and a request-switch-without-text_input_end case asserting no orphaned task.

Overall this is in good shape — items 1–3 are the ones I'd want addressed or consciously waived before merge.

Add a pyproject.toml for the gradium_tts_python extension (project metadata plus the websockets runtime dependency, version kept in sync with the manifest) and list it in the manifest package include so it ships with the package. This is prep for switching to the full uv toolchain, which expects each extension to carry its own pyproject.toml.

github-actions · 2026-07-01T06:17:02Z

Review: `feat(tts): add gradium tts support`

Nice work — the refactor to a single-session-per-request model with a request-level TTFB metric is clean, the config normalization is well thought out, and the test suite is genuinely broad (state machine, metrics, params, robustness). A few things are worth addressing before merge.

Bug: `url` vs `base_url` key mismatch (endpoint override is silently ignored)

There's an inconsistency in how the endpoint parameter is named:

manifest.json schema declares the param as url
extension property.json default sets url
all tests/configs/*.json set url
config.py update_params() only pops base_url — url is never recognized
the example graph (voice-assistant/tenapp/property.json) uses base_url

Because leftover keys in params are forwarded verbatim into the vendor setup payload (the for key, value in self.config.params.items() loop in _send_setup), a user-supplied url does two wrong things:

It does not change the connection endpoint — websocket_url() falls back to GRADIUM_DEFAULT_WS_URL.
It gets leaked into the setup payload as payload[\"url\"] = ....

Today this is masked only because the default happens to equal the intended URL. A custom endpoint via url (exactly what the schema/defaults advertise) would be ignored.

test_params_passthrough doesn't catch this: it passes url and asserts websocket_url() equals the default string — which is true regardless of whether url was consumed — and never asserts url was removed from params.

Fix: pick one name. Either add url to the popped-keys tuple in update_params() (aliasing it to base_url), or standardize everything (manifest, property.json, test configs, example graph) on base_url. Then add an assertion that the endpoint key is consumed and not present in the emitted setup payload.

Concern: shared client + background reader race across overlapping requests

request_tts now starts a background _read_audio task and returns before audio finishes; the reader finalizes when the vendor closes the stream. On a normal end_of_stream this is fine. But if a new request_id arrives before the previous request's reader has drained, request_tts calls _reset_request_state() (which does not cancel the running reader) and then start_session() calls _disconnect() on the shared self.ws — pulling the socket out from under the still-running previous reader. The previous _reader_task reference is also overwritten without cancellation.

The sequential test passes because each request fully finalizes (the test waits for tts_audio_end) before the next is sent, so the race is never exercised. Worth confirming the base class strictly serializes request_tts calls; if not, consider cancelling/awaiting the prior reader in the new-request branch, the same way _cancel_reader_task() is used in the error paths.

Minor

Auth-error classification is substring-based. Both _is_auth_error (extension) and _looks_like_auth_error (client) match tokens like \"401\"/\"403\" anywhere in the message. A benign vendor message containing \"403 items\" would be misclassified as FATAL. Low probability, but prefer matching on the structured code / status_code where available rather than free text.
WS_RECV_TIMEOUT = 10s between chunks yields a hard error. For long syntheses with natural pauses this could truncate output. Confirm the vendor keeps the stream active (keepalive/heartbeat) within that window, or make it configurable.
_calculate_audio_duration_ms() uses synthesize_audio_sample_rate() (config rate) rather than the reconciled current_request_sample_rate from the server ready message. If the server ever returns a rate different from the requested one, the reported duration will be off. Since setup sends output_format=pcm_<rate> this should normally agree, so it's an accuracy nit.

Style / conventions

Commit author lines show Ubuntu <ubuntu@...compute.internal> — per AGENTS.md, please make sure the final commits use a proper author identity (no build-host identities).
pyproject.toml pins websockets~=14.0 (good — the code relies on the asyncio.client API and additional_headers). Double-check any lingering requirements.txt doesn't pin an incompatible older websockets.

Tests

Coverage is solid. Two gaps tied to the findings above:

No test asserts the endpoint override key is actually consumed / not leaked into the setup payload (see the url/base_url bug).
No test exercises an overlapping-request scenario (new request_id before the prior reader drains).

Overall this is close — the endpoint key mismatch is the one I'd consider blocking, since it makes a documented, defaulted parameter silently non-functional.

github-actions · 2026-07-01T06:30:23Z

Code Review: `feat(tts): add gradium tts support`

Thanks for this — the streaming rewrite is a clear improvement over the old batch-and-flush model, the auth-error classification is thorough, and the test suite is genuinely comprehensive (state machine, metrics-once-per-request, websockets-14 close-code handling, reader-task cancellation). A few things worth addressing, roughly in priority order.

Potential bugs

1. close_ws_on_eos=false likely deadlocks on the recv timeout (gradium_tts.py).
_send_setup forwards close_ws_on_eos to the vendor and the README documents it as configurable. But end_input sends end_of_stream and the reader then relies on either an explicit end_of_stream message or a clean 1000 close to emit EVENT_TTS_END. If a user sets close_ws_on_eos=false and the vendor signals completion only by leaving the socket open, _iter_messages sits until WS_RECV_TIMEOUT (10s) and surfaces a synthetic "Timeout waiting for Gradium audio" error for what was actually a successful request. If the extension can't truly tolerate close_ws_on_eos=false, consider forcing it to true (or validating) rather than advertising it as a knob. There's no test for this path.

2. Reader-task lifecycle on rapid consecutive requests (extension.py).
_cancel_reader_task() is called on the exception paths in request_tts and in cancel_tts, but I don't see it on the normal new-request path. When t.request_id != self.current_request_id, the code calls _reset_request_state() (which clears current_request_id and _session_started but does not cancel _reader_task). If a new request arrives before the previous reader finished, the prior detached task can still be running. _finalize_request guards with if request_id != self.current_request_id: return, so a stale finalize is dropped — but the task itself and its socket may linger. Worth confirming the base class serializes request_tts such that this can't happen, or add _cancel_reader_task() to the new-request branch. A test for "new request_id mid-stream without a flush" would cover it.

3. synthesize_audio_sample_rate() can disagree with the framed audio.
current_request_sample_rate comes from the server ready frame; before it's set it falls back to config.get_sample_rate(). If the server ever returns a rate different from the configured sample_rate, early framing (send_tts_audio_start) may use the config value while duration math later uses the server value. Fine if server/config rates are guaranteed equal for PCM — worth a comment either way.

Code quality

4. Substring-based auth detection is fragile (_is_auth_error / _looks_like_auth_error). Matching "401"/"403" anywhere in a message risks misclassifying an unrelated error that happens to contain those digits as fatal. Prefer keying off the structured code field where available (as _message_to_exception already does for 1008).

5. except (asyncio.CancelledError, Exception) in _cancel_reader_task swallows all reader exceptions silently. The intent (CancelledError is BaseException, not Exception) is correct and the comment explains it, but consider a debug log so a real reader bug isn't invisible.

6. _message_to_exception maps every non-auth error to status_code=500. Harmless today (only used for fatal/non-fatal split) but a bit of a fiction for vendor content errors. Minor.

Security

API-key redaction in to_str covers both the top-level field and params["api_key"] — good. Key is sent as the x-api-key header, never in the setup payload (asserted by test_params_passthrough).
_send_setup forwards arbitrary self.config.params extras onto the wire. Acceptable since they come from operator config; worth a one-line note that unknown params pass through verbatim.

Tests

Strong coverage. Gaps for follow-up (some already in your PR's own follow-ups):

No close_ws_on_eos=false test (see Update README.md #1) — most likely production breakage.
No assertion that the reader task is cancelled when a new request_id arrives mid-stream without a flush (see modify playground default settings，fix voice type error #2).
pcm_48000 is in SUPPORTED_SAMPLE_RATES but not config-tested (already noted in follow-ups).

Nits

The commit-message / pre-push docs added under docs/ai/L1/ are a nice touch and fit the progressive-disclosure convention.
voice_id = "cLONiZ4hQ8VpQ4Sz" is hardcoded as the default across model, manifest, property.json, README and all tests. Fine if it's a stable public default voice — just confirm it won't 404 for accounts that don't have it.

Nothing here is a hard blocker except clarifying #1 and confirming #2 (reader-task lifecycle). Nice work overall.

github-actions · 2026-07-01T06:55:34Z

Code Review — feat(tts): add gradium tts support

Solid, well-tested extension. The refactor to a per-request streaming session with a background reader task and a single finalize path is clean, the config normalization + api-key redaction is careful, and the websockets>=14 close-code handling (_close_code reading exc.rcvd/exc.sent) is a nice touch with a dedicated regression test. Test coverage is strong (state machine, metrics, robustness, params, errors).

A few things worth addressing before merge.

Bugs / correctness

1. Example graph uses base_url, but the config only reads url now.
In examples/voice-assistant/tenapp/property.json the new voice_assistant_gradium node has "params": { "base_url": "wss://api.gradium.ai/api/speech/tts", ... }. GradiumTTSConfig.update_params() was changed to read url (not base_url), so base_url is no longer recognized as an extension field. It falls through into self.params and gets forwarded verbatim into the vendor setup payload as an unknown base_url key, while websocket_url() silently uses the default. It happens to equal the default here so it still connects, but the key is misleading and an endpoint override placed there would be silently ignored. Rename it to url to match config.py, property.json, and the README.

2. cancel_tts sends audio_end but never calls finish_request.
The normal and error paths go through _finalize_request, which sends audio_end and then calls finish_request(...) to complete the base-class request state transition. cancel_tts sends audio_end with INTERRUPTED and resets local state, but skips finish_request. The sibling extensions (cartesia_tts, elevenlabs_tts2_python) both call finish_request on the cancel/interrupt path. If the base class relies on it to advance its request state machine, interrupted requests may be left half-finished. Please route the cancel path through the same finalize logic (or add the finish_request call).

3. Stale reader task on back-to-back requests (potential ordering hazard).
request_tts returns immediately after end_input(), leaving _reader_task running until the vendor sends end_of_stream. The reader is only cancelled on an exception path — on a new request_id, request_tts calls _reset_request_state() (which nulls current_request_id) and starts a fresh session, but does not cancel the previous reader. The old reader can then reach _finalize_request, hit the request_id != self.current_request_id guard, and bail — so the previous request may never emit its audio_end/finish_request. The sequential-requests test passes only because the tester waits for tts_audio_end before sending the next input. Consider awaiting or cancelling the prior reader when a new request arrives, and add a test that fires a second request before the first reader drains.

Minor

Aggressive recv timeout. WS_RECV_TIMEOUT = 10.0 turns any >10s gap between audio frames into an EVENT_TTS_ERROR. For long synthesis or a slow network this could cut off a legitimate stream. Confirm Gradium always streams within 10s, or make it configurable / more lenient.
Duplicated auth-error heuristic. extension._is_auth_error and gradium_tts._looks_like_auth_error implement the same substring match — consider consolidating. Substring matching on "401"/"403" can also false-positive on unrelated messages containing those digits; acceptable as a heuristic, just flagging.
_finalize_request early-return. The if request_id != self.current_request_id: return guard silently drops finalization for a mismatched id. That's the intended dedupe, but combined with Fix speech sdk #3 it's the mechanism by which an audio_end can get lost — a debug log there would aid diagnosis.
Docs additions (04_conventions.md, 05_workflows.md commitlint / pre-push notes) are useful and well-written, though somewhat unrelated to the TTS feature — fine to keep, just noting scope.

Security

Looks good — x-api-key header only, to_str() now redacts both api_key and params["api_key"], with a test asserting the raw key never appears in the stringified config. No secrets logged.

Nice work overall. Items 1–3 are the ones I'd want resolved (1 is a quick fix; 2 and 3 relate to lifecycle correctness under interrupt / rapid-turn scenarios).

Ubuntu added 2 commits June 24, 2026 09:03

feat(tts): add gradium tts support

26cca08

fix(tts): avoid duplicate gradium ttfb metrics

a05b995

BenWeekes requested review from halajohn and plutoless as code owners June 24, 2026 11:22

chore(tts): format gradium tts code with black

f7e8119

YiminW reviewed Jun 25, 2026

View reviewed changes

Comment thread ai_agents/agents/ten_packages/extension/gradium_tts_python/config.py Outdated

YiminW reviewed Jun 25, 2026

View reviewed changes

Comment thread ai_agents/agents/ten_packages/extension/gradium_tts_python/config.py Outdated

YiminW requested changes Jun 25, 2026

View reviewed changes

fix(tts): make gradium sample_rate source of truth and rename base_ur…

931cc24

…l to url

BenWeekes force-pushed the feat/gradium-tts branch from 89da4f7 to cb1a38c Compare June 30, 2026 09:53

Merge branch 'main' into feat/gradium-tts

91a6ec9

feat: update version for gradium tts

f21a0d9

wangyoucao577 approved these changes Jul 1, 2026

View reviewed changes

wangyoucao577 merged commit abea947 into main Jul 1, 2026
41 checks passed

wangyoucao577 deleted the feat/gradium-tts branch July 1, 2026 07:25

Uh oh!

Conversation

BenWeekes commented Jun 24, 2026

Summary

Tests

Follow-ups (non-blocking)

Uh oh!

Uh oh!

Uh oh!

BenWeekes commented Jun 25, 2026

TTS guarder + standalone test records (commit 931cc24)

TTS guarder — task tts-guarder-test EXTENSION=gradium_tts_python

Standalone suite — task test-extension EXTENSION=...gradium_tts_python

Uh oh!

github-actions Bot commented Jun 25, 2026

Code Review: Gradium TTS extension

Potential bugs

Code quality

Performance

Security

Tests

Uh oh!

github-actions Bot commented Jun 29, 2026

Review: feat(tts): add gradium tts support

Potential bugs

Code quality

Security

Tests

Uh oh!

BenWeekes commented Jun 29, 2026

Update — commits 3bb3e909 and b2c1b819

On the on_data override (re: "the base class already implements on_data, so it shouldn't be needed")

websockets 14 close-code (3bb3e909)

Uh oh!

YiminW commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

BenWeekes commented Jun 30, 2026

Update — stream segments immediately (commit 89da4f70)

What changed

Why

Verification

Uh oh!

github-actions Bot commented Jun 30, 2026

Review: feat(tts): add gradium tts support (#2193)

Potential bugs / correctness

Style / maintainability

Security

Test coverage

Uh oh!

github-actions Bot commented Jun 30, 2026

Review: feat(tts): add gradium tts support

Potential issues

Minor

Test coverage

Uh oh!

github-actions Bot commented Jul 1, 2026

Review: feat(tts): add gradium tts support

Bug: url vs base_url key mismatch (endpoint override is silently ignored)

Concern: shared client + background reader race across overlapping requests

Minor

Style / conventions

Tests

Uh oh!

github-actions Bot commented Jul 1, 2026

Code Review: feat(tts): add gradium tts support

Potential bugs

Code quality

Security

Tests

Nits

Uh oh!

github-actions Bot commented Jul 1, 2026

Code Review — feat(tts): add gradium tts support

Bugs / correctness

Minor

Security

Uh oh!

Uh oh!

TTS guarder + standalone test records (commit `931cc24`)

TTS guarder — `task tts-guarder-test EXTENSION=gradium_tts_python`

Standalone suite — `task test-extension EXTENSION=...gradium_tts_python`

Update — commits `3bb3e909` and `b2c1b819`

On the `on_data` override (re: "the base class already implements `on_data`, so it shouldn't be needed")

websockets 14 close-code (`3bb3e909`)

Update — stream segments immediately (commit `89da4f70`)

Review: `feat(tts): add gradium tts support` (#2193)

Review: `feat(tts): add gradium tts support`

Bug: `url` vs `base_url` key mismatch (endpoint override is silently ignored)

Code Review: `feat(tts): add gradium tts support`