@jjmaldonis
Contributor

Add TTS Latency Measurement Tools

This PR adds tools for measuring how quickly Deepgram's TTS service delivers audio—and whether it delivers audio fast enough for smooth, uninterrupted playback.

What's included

| File | Description |
| --- | --- |
| stream_tts.py | Sends text to Deepgram and records when each piece of audio arrives |
| analyze_tts_latency.py | Analyzes the timing data and produces a report |
| README.md | Setup and usage instructions |

Why two separate scripts?

Separating collection from analysis lets customers send us their raw timing data (the JSON file), so we can analyze their exact results and compare them against our own tests. This makes debugging latency issues much easier.


What the metrics mean

TTFB (Time To First Byte)

How long until audio starts playing?

This is the delay between sending text and receiving the first audio. Lower is better—users notice delays over ~200ms.

  • TTFB: Time from sending text to first audio (excludes connection setup)
  • TTFB including network: Time from the very start, including establishing the connection (matters for cold starts)

RTF (Real-Time Factor)

Is audio arriving fast enough?

RTF compares delivery speed to playback speed: it is the duration of the audio divided by the time taken to deliver it. If a 10-second audio clip arrives in 5 seconds, that's 2.0x RTF.

  • > 1.0x: Audio arrives faster than it plays—good!
  • = 1.0x: Audio arrives exactly as fast as it plays—just barely keeping up
  • < 1.0x: Audio arrives slower than it plays—playback will stutter

Min Cumulative RTF (Streaming Health)

Did the stream ever fall behind?

This is the most important metric for real-world playback. It tracks whether, at any moment during the stream, we had received enough audio to keep playing without interruption.

  • ≥ 1.0x: Stream always stayed ahead—smooth playback ✓
  • < 1.0x: Stream fell behind at some point—playback would have stuttered ✗

Jitter

How consistent is the delivery?

Even if audio arrives fast enough on average, inconsistent packet timing can cause problems. Jitter measures this variability—lower is better.


Example output

$ uv run analyze_tts_latency.py -i phrases_internet_troubleshooting.json
======================================================================
DEEPGRAM TTS LATENCY ANALYSIS REPORT
======================================================================
SESSION OVERVIEW
----------------------------------------
  Duration:        8965.78 ms
  Phrases:         5
  Total packets:   455
  Total audio:     18200.00 ms
  Total bytes:     873,600
LATENCY
----------------------------------------
  TTFB:            161.67 ms
  TTFB (incl net): 622.37 ms
  TTLB:            8502.82 ms
  Overall RTF:     2.18x
STREAMING HEALTH
----------------------------------------
  (Min cumulative RTF >= 1.0 means stream never fell behind real-time)
  Min cumulative RTF: 2.12x
  Status:          ✓ Stream kept ahead of real-time
JITTER (Inter-Arrival Time Variability)
----------------------------------------
  Mean IAT:        18.02 ms
  Jitter (σ):      2.21 ms

Example files


Technical details: How calculations are performed

Timestamps collected

| Timestamp | Description |
| --- | --- |
| session_start | Before websocket connection attempt |
| connected_at | After websocket connection established |
| text_sent_at | Before sending each Speak message |
| flush_sent_at | Before sending each Flush message |
| packets[].received_at | When each audio packet arrives |
| flushed_received_at | When the Flushed control message arrives |
| session_end | After connection closes |

All timestamps are in UTC, in ISO 8601 format.

Audio duration calculation

audio_duration = (byte_size / bytes_per_sample) / sample_rate

Here bytes_per_sample is 2 for linear16 and 1 for mulaw/alaw. All audio is mono, so with sample_rate in Hz the result is in seconds.
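As a rough illustration of the duration formula, the sketch below converts a byte count into milliseconds of audio. The function name `audio_duration_ms` and the 24000 Hz sample rate are assumptions made for this example, not taken from the scripts; that rate is simply the one consistent with the example report's totals (873,600 bytes and 18200.00 ms) if the audio is linear16.

```python
# Bytes per sample for each encoding named above; all audio is mono.
BYTES_PER_SAMPLE = {"linear16": 2, "mulaw": 1, "alaw": 1}


def audio_duration_ms(byte_size: int, encoding: str, sample_rate: int) -> float:
    """Return the playback duration, in milliseconds, of a mono audio payload.

    Hypothetical helper for illustration; the real scripts may differ.
    """
    samples = byte_size / BYTES_PER_SAMPLE[encoding]
    # Multiply before dividing so round numbers stay exact in floating point.
    return samples * 1000.0 / sample_rate


# Totals from the example report, assuming linear16 at 24000 Hz (an assumption).
total_ms = audio_duration_ms(873_600, "linear16", 24_000)
print(f"{total_ms:.2f} ms")
```

With those assumed parameters the result matches the report's "Total audio" line.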

Formulas

TTFB:

TTFB = first_audio_packet.received_at - first_phrase.text_sent_at
TTFB_incl_net = first_audio_packet.received_at - session_start
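
The two TTFB values can be sketched from ISO 8601 timestamps as follows. The helper `ms_between` and the timestamp values are made up for illustration; only the field names come from the table above, and the real JSON layout may differ.

```python
from datetime import datetime


def ms_between(start_iso: str, end_iso: str) -> float:
    """Milliseconds elapsed between two ISO 8601 timestamps (hypothetical helper)."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds() * 1000.0


# Illustrative timestamps only; these are not real measurements.
session_start = "2026-01-26T15:00:00.000000+00:00"
text_sent_at = "2026-01-26T15:00:00.460000+00:00"         # first phrase
first_packet_received_at = "2026-01-26T15:00:00.622000+00:00"

ttfb = ms_between(text_sent_at, first_packet_received_at)
ttfb_incl_net = ms_between(session_start, first_packet_received_at)
print(f"TTFB: {ttfb:.2f} ms, TTFB (incl net): {ttfb_incl_net:.2f} ms")
```

Because both values share the same end point, the difference between them is the time spent before the first Speak message was sent, which includes connection setup.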

RTF:

overall_rtf = total_audio_duration / (last_packet.received_at - first_packet.received_at)
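
As a quick sanity check of this formula against the example output, the sketch below treats the report's TTFB and TTLB as the arrival times of the first and last packets relative to the same origin. That reading of TTLB is an assumption, and `overall_rtf` here is a hypothetical function, not the script's own.

```python
def overall_rtf(total_audio_ms: float, first_packet_ms: float, last_packet_ms: float) -> float:
    """Audio duration divided by the wall-clock span between first and last packet."""
    delivery_ms = last_packet_ms - first_packet_ms
    if delivery_ms <= 0:
        raise ValueError("need at least two packets with distinct arrival times")
    return total_audio_ms / delivery_ms


# Values from the example report: total audio, TTFB, and TTLB (assumed same origin).
rtf = overall_rtf(18200.00, 161.67, 8502.82)
print(f"{rtf:.2f}x")  # prints 2.18x
```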

Cumulative RTF (calculated for each packet after the first):

cumulative_rtf = cumulative_audio_received / wall_clock_since_first_packet

The minimum value across all packets determines streaming health.
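
A minimal sketch of the cumulative check is shown below. It assumes that the cumulative audio at packet i includes packet i itself, which the formula above does not spell out, and it represents packets as hypothetical `(received_at_ms, audio_ms)` tuples rather than the script's actual data structure.

```python
def min_cumulative_rtf(packets: list[tuple[float, float]]) -> float:
    """Lowest cumulative RTF over all packets after the first.

    packets: (received_at_ms, audio_ms) pairs in arrival order.
    Assumption: audio carried by packet i counts toward the total at packet i.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets")
    first_arrival = packets[0][0]
    cumulative_audio = packets[0][1]
    lowest = float("inf")
    for received_at, audio_ms in packets[1:]:
        cumulative_audio += audio_ms
        elapsed = received_at - first_arrival
        if elapsed <= 0:
            continue  # same timestamp as the first packet; ratio is undefined
        lowest = min(lowest, cumulative_audio / elapsed)
    return lowest


# Illustrative 40 ms packets: one stream keeps ahead, the other stalls.
healthy = [(0.0, 40.0), (20.0, 40.0), (40.0, 40.0), (150.0, 40.0)]
stalled = [(0.0, 40.0), (20.0, 40.0), (200.0, 40.0)]
print(min_cumulative_rtf(healthy) >= 1.0)  # True: never fell behind
print(min_cumulative_rtf(stalled) >= 1.0)  # False: fell behind at the gap
```

The stalled stream shows why the minimum matters more than the average: a single long gap is enough to drain the playback buffer even when earlier packets arrived quickly.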

Jitter:

inter_arrival_time[i] = packet[i].received_at - packet[i-1].received_at
jitter = standard_deviation(all_inter_arrival_times)
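
The jitter formula could be sketched as below. The description does not say whether the standard deviation is the population or the sample form; this example assumes the population form (`statistics.pstdev`), and the function names and arrival times are illustrative only.

```python
import statistics


def inter_arrival_times(received_at_ms: list[float]) -> list[float]:
    """Gaps between consecutive packet arrival times, in milliseconds."""
    return [later - earlier for earlier, later in zip(received_at_ms, received_at_ms[1:])]


def jitter_ms(received_at_ms: list[float]) -> float:
    """Standard deviation of the inter-arrival times (population form assumed)."""
    iats = inter_arrival_times(received_at_ms)
    if not iats:
        raise ValueError("need at least two packets")
    return statistics.pstdev(iats)


# Illustrative arrival times alternating between 18 ms and 20 ms gaps.
arrivals = [0.0, 18.0, 38.0, 56.0, 76.0]
mean_iat = statistics.mean(inter_arrival_times(arrivals))
jitter = jitter_ms(arrivals)
print(f"Mean IAT: {mean_iat:.2f} ms, Jitter: {jitter:.2f} ms")
```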

Usage

# Collect timing data
export DEEPGRAM_API_KEY="your-api-key"
uv run stream_tts.py -i phrases.txt -o results.json -a output.wav

# Analyze results
uv run analyze_tts_latency.py -i results.json

# Export metrics as JSON
uv run analyze_tts_latency.py -i results.json -o metrics.json

@jjmaldonis jjmaldonis requested a review from a team as a code owner January 26, 2026 15:51
@jkroll-deepgram jkroll-deepgram left a comment

Awesome! LGTM

@jeniya-DG jeniya-DG merged commit cba6262 into main Jan 29, 2026