74 changes: 58 additions & 16 deletions stt_stream_file/README.md
@@ -38,12 +38,27 @@ uv run stream_audio_file.py --ui --live \

### Save & Print Mode

**Stream to file (realtime mode - natural pace):**
**Stream and save JSON output:**
```bash
uv run stream_audio_file.py -o output.json -f audio.wav --realtime \
uv run stream_audio_file.py -f audio.wav \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"
```

Output is automatically saved to `audio.json` (derived from input filename).

**Specify a custom output file:**
```bash
uv run stream_audio_file.py -o output.json -f audio.wav \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"
```
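The `--url` value in these commands is the Deepgram listen endpoint followed by query parameters. As a minimal sketch, the same string can be assembled programmatically; `build_listen_url` is a hypothetical helper for illustration, not part of these scripts, and only the endpoint and the two parameters shown above are taken from this README.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the README examples.
BASE_URL = "wss://api.deepgram.com/v1/listen"


def build_listen_url(**params: str) -> str:
    """Hypothetical helper: append query parameters to the listen endpoint."""
    return f"{BASE_URL}?{urlencode(params)}"


url = build_listen_url(model="nova-3", interim_results="true")
print(url)
```

The resulting string can be passed to `--url` unchanged, which makes it easier to compare models or settings by varying one parameter at a time.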

**Live recording saves with timestamp:**
```bash
uv run stream_audio_file.py --live \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"
# Saves to recording_20250114_153022.json (or similar)
```
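The default output names described above (input filename with a `.json` extension, or a timestamped `recording_*.json` for live sessions) could be derived roughly as follows. This is a sketch under assumptions: `default_output_path` is a hypothetical name, and the timestamp format is inferred from the `recording_20250114_153022.json` example, so the script's actual logic may differ.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def default_output_path(audio_file: Optional[str] = None,
                        now: Optional[datetime] = None) -> Path:
    """Hypothetical helper: choose an output path when -o is not given."""
    if audio_file is not None:
        # File input: audio.wav -> audio.json
        return Path(audio_file).with_suffix(".json")
    # Live input: recording_YYYYMMDD_HHMMSS.json (format assumed from the example)
    now = now or datetime.now()
    return Path(f"recording_{now:%Y%m%d_%H%M%S}.json")


print(default_output_path("audio.wav"))
print(default_output_path(now=datetime(2025, 1, 14, 15, 30, 22)))
```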

**Print basic transcript:**
```bash
uv run print_transcript.py -f output.json
@@ -56,7 +71,7 @@ uv run print_transcript.py -f output.json
**Print with all the details:**
```bash
uv run print_transcript.py -f output.json \
--print-speakers --print-channels --print-interim --print-delay --colorize
--print-speakers --print-channels --print-interim --print-latency --colorize
```
```
[18:30:24.066 (0.665s since EOS)] [00:00:00.00 - 00:00:03.48] [Speaker 0] [Channel 0] [IsFinal]: The missile knows where it is at all times.
@@ -74,24 +89,51 @@ It knows this because it knows where it isn't.
## Key Options

### stream_audio_file.py
- `--ui` - Interactive terminal UI with live updates
- `-f, --audio` - Audio file to stream
- `-l, --live` - Stream from microphone
- `-o, --output` - Save JSON messages to file
- `-v, -vv, -vvv` - Increase verbosity

| Option | Description |
|--------|-------------|
| `--url, -u` | Deepgram websocket URL (required) |
| `--ui` | Interactive terminal UI with live updates |
| `-f, --audio` | Audio file to stream |
| `-l, --live` | Stream from microphone |
| `-o, --output` | Save JSON messages to file (defaults to input filename or timestamped name) |
| `-v, -vv, -vvv` | Increase verbosity |

### print_transcript.py
- `--print-speakers` - Show speaker labels
- `--print-channels` - Show audio channels
- `--print-interim` - Include interim results
- `--print-delay` - Show latency (time since end of speech)
- `--colorize` - Color words by confidence
- `--only-transcript` - Just the text, no metadata

| Option | Description |
|--------|-------------|
| `--print-speakers` | Show speaker labels |
| `--print-channels` | Show audio channels |
| `--print-interim` | Include interim results |
| `--print-received` | Show received timestamp for streamed messages |
| `--print-latency` | Show latency metrics (TTFT, update frequency, message latency, EOT latency) |
| `--print-entities` | Show detected entities |
| `--colorize` | Color words by confidence |
| `--only-transcript` | Just the text, no metadata |

Run either script with `--help` for full options.

## What's Happening?
### Shell Completion

Generate shell completions for your preferred shell:

```bash
uv run stream_audio_file.py completion bash # or zsh, fish
```

## Metrics Calculated

When using `--print-latency`, the following metrics are computed:

The UI mode shows transcription speed in real-time - watch words appear as you speak and see exactly how fast Deepgram processes your audio. The `--print-delay` option reveals latency metrics, perfect for testing different models and configurations.
**Session-level:**
- **TTFT (Time To First Transcript)**: Wall-clock time from when audio streaming begins to when the first transcript message is received. Measures initial responsiveness.
- **Update Frequency**: Number of interim transcript updates per second of audio. Higher values mean a more fluid, responsive transcription experience.

**Per-message:**
- **Message Latency**: How far the transcription lags behind the audio being sent, calculated as `audio_cursor - transcript_cursor`. Measured on interim results only, per Deepgram's methodology.
- **EOT Latency (End-of-Turn Latency)**: Time between the last interim result and the finalizing event (e.g., `speech_final`, `UtteranceEnd`, `EndOfTurn`). Critical for voice agents—they can't respond until they know the user stopped speaking.
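The four metrics above can be sketched in a few lines of Python. This is an illustration of the definitions, not the code in `print_transcript.py`: the `Message` record and its bookkeeping fields (`received_at`, `audio_cursor`) are assumptions, `transcript_cursor` is taken to be `start + duration`, and only `speech_final` is treated as a finalizing event.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """Hypothetical record of one received transcript message."""
    received_at: float   # wall-clock seconds since streaming began
    audio_cursor: float  # seconds of audio sent when the message arrived
    start: float         # start of the transcribed span, in seconds
    duration: float      # length of the transcribed span, in seconds
    is_final: bool = False
    speech_final: bool = False


def ttft(messages: List[Message]) -> float:
    """Time from stream start to the first transcript message."""
    return min(m.received_at for m in messages)


def update_frequency(messages: List[Message], audio_seconds: float) -> float:
    """Interim updates per second of audio."""
    return sum(1 for m in messages if not m.is_final) / audio_seconds


def message_latencies(messages: List[Message]) -> List[float]:
    """audio_cursor - transcript_cursor, interim results only."""
    return [m.audio_cursor - (m.start + m.duration)
            for m in messages if not m.is_final]


def eot_latencies(messages: List[Message]) -> List[float]:
    """Time from the last interim result to each finalizing message."""
    out, last_interim = [], None
    for m in messages:
        if not m.is_final:
            last_interim = m.received_at
        elif m.speech_final and last_interim is not None:
            out.append(m.received_at - last_interim)
            last_interim = None
    return out


# Made-up sample session: three interim results, then one final.
session = [
    Message(0.5, 0.5, 0.0, 0.25),
    Message(1.0, 1.0, 0.0, 0.75),
    Message(1.5, 1.5, 0.0, 1.25),
    Message(2.0, 2.0, 0.0, 1.5, is_final=True, speech_final=True),
]
print(ttft(session), update_frequency(session, 2.0))
print(message_latencies(session), eot_latencies(session))
```

Keeping each metric as a separate pure function over the recorded messages means the same saved JSON could be re-scored later under a different definition, for example one that also counts `UtteranceEnd` as a finalizing event.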

## What's Happening?

The UI mode shows transcription speed in real-time—watch words appear as you speak and see exactly how fast Deepgram processes your audio. The `--print-latency` option reveals latency metrics, perfect for testing different models and configurations.