Pure Rust implementation of NVIDIA's Parakeet-TDT-0.6B-v3 speech-to-text model, built on Candle with Metal GPU acceleration for Apple Silicon.
- 3.83% WER on a 500-sample subset of LibriSpeech test-clean (NVIDIA published: 1.93%)
- 7.6x real-time on Apple Silicon with Metal GPU
- 25 languages — English, French, German, Spanish, and 21 more European languages
- No Python runtime — single static binary, no ONNX/CoreML/MLX dependency
- Automatic model download from HuggingFace Hub
- Output formats: plain text, JSON, SRT, VTT subtitles
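
To illustrate the subtitle output, here is a minimal sketch of how a segment's start and end times could be rendered as an SRT cue. The function names `srt_timestamp` and `srt_cue` are hypothetical and are not taken from this crate; only the `HH:MM:SS,mmm` layout comes from the SRT format itself.

```rust
/// Format a time offset in seconds as an SRT timestamp (`HH:MM:SS,mmm`).
/// Hypothetical helper, not part of this crate's API.
fn srt_timestamp(seconds: f64) -> String {
    // Round to whole milliseconds; negative inputs are clamped to zero.
    let total_ms = (seconds.max(0.0) * 1000.0).round() as u64;
    let ms = total_ms % 1000;
    let s = (total_ms / 1000) % 60;
    let m = (total_ms / 60_000) % 60;
    let h = total_ms / 3_600_000;
    format!("{h:02}:{m:02}:{s:02},{ms:03}")
}

/// Render one SRT cue: index, time range, text, then a blank line.
fn srt_cue(index: usize, start: f64, end: f64, text: &str) -> String {
    format!(
        "{index}\n{} --> {}\n{text}\n\n",
        srt_timestamp(start),
        srt_timestamp(end)
    )
}
```

A VTT writer would look much the same, except that WebVTT separates seconds and milliseconds with a period instead of a comma and starts the file with a `WEBVTT` header line.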
Full implementation of the FastConformer-TDT pipeline:
| Component | Description |
|---|---|
| Preprocessor | 128-bin log-mel spectrogram (rustfft + GPU filterbank) |
| Subsampling | 8x depthwise-separable conv stack (3 stride-2 layers) |
| Encoder | 24-layer FastConformer (relative positional attention, Macaron FFN, GLU conv) |
| Decoder | Token-and-Duration Transducer (2-layer LSTM + joint network) |
| Tokenizer | 8192 BPE vocabulary via SentencePiece (multilingual) |
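
The decoder row is the least conventional part of the pipeline, so a rough sketch of greedy Token-and-Duration decoding may help. At each step the joint network proposes a token (or blank) and a duration, and the duration says how many encoder frames to skip. Everything named here (`Step`, `greedy_tdt`, the `joint` closure, `max_symbols_per_frame`) is an assumption made for illustration; the real code in `decoding.rs` may be organised differently.

```rust
/// One greedy decision: a token (`None` means blank) and how many
/// encoder frames to skip. Hypothetical type, for illustration only.
struct Step {
    token: Option<u32>,
    duration: usize,
}

/// Greedy TDT decoding over `num_frames` encoder frames.
/// `joint(t, prev)` stands in for the LSTM prediction network plus joint
/// network; it receives the frame index and the last emitted token.
/// Returns the emitted tokens paired with the frame where each was emitted.
fn greedy_tdt<F>(
    num_frames: usize,
    max_symbols_per_frame: usize,
    mut joint: F,
) -> Vec<(u32, usize)>
where
    F: FnMut(usize, Option<u32>) -> Step,
{
    let mut out = Vec::new();
    let mut prev = None;
    let mut t = 0;
    let mut emitted_here = 0;
    while t < num_frames {
        let step = joint(t, prev);
        let mut skip = step.duration;
        match step.token {
            Some(tok) => {
                out.push((tok, t));
                prev = Some(tok);
                emitted_here += 1;
                // Cap zero-duration emissions so one frame cannot loop forever.
                if skip == 0 && emitted_here >= max_symbols_per_frame {
                    skip = 1;
                }
            }
            // A blank with duration 0 would make no progress; force one frame.
            None => {
                if skip == 0 {
                    skip = 1;
                }
            }
        }
        if skip > 0 {
            t += skip;
            emitted_here = 0;
        }
    }
    out
}
```

Because the predicted duration can be larger than one, the loop usually runs the joint network fewer times than there are encoder frames, which is where the duration skipping saves work. The recorded frame indices can be turned into timestamps once the duration of one encoder frame is known.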
# Build (requires Rust 1.85+)
cargo build --release --features metal -p parakeet
# Transcribe audio (downloads model on first run, ~2.4GB)
./target/release/parakeet --input recording.wav
# With timestamps
./target/release/parakeet --input recording.wav --timestamps
# SRT subtitles
./target/release/parakeet --input recording.wav --format srt --output recording.srt
# JSON output with word-level timestamps
./target/release/parakeet --input recording.wav --format json

# Convert NeMo checkpoint to SafeTensors (one-time)
python3 scripts/convert_nemo.py
# Use local model directory
./target/release/parakeet --model-dir converted_model --input recording.wav

Supported languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian
- Rust 1.85+ (edition 2024)
- macOS with Apple Silicon (M1/M2/M3/M4) for Metal acceleration
- 16-bit mono WAV input at any sample rate (resampled internally to 16kHz)
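
As a rough picture of what the input requirement implies, the sketch below converts 16-bit PCM samples to floats and resamples them by linear interpolation. This is an assumption made for illustration: the README does not say which resampling method the crate uses, and `pcm16_to_f32` and `resample_linear` are hypothetical names.

```rust
/// Convert 16-bit PCM samples to f32 in the range [-1.0, 1.0).
fn pcm16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

/// Resample by linear interpolation. Illustrative only: there is no
/// anti-aliasing filter, so downsampling real audio this way will alias.
fn resample_linear(input: &[f32], src_rate: u32, dst_rate: u32) -> Vec<f32> {
    if input.is_empty() || src_rate == dst_rate {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * dst_rate as u64 / src_rate as u64) as usize;
    // Distance between output samples, measured in input samples.
    let step = src_rate as f64 / dst_rate as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            // Clamp at the end so the final samples repeat the last value.
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

For example, `resample_linear(&pcm16_to_f32(&samples), 44_100, 16_000)` would bring a 44.1 kHz recording to the 16 kHz rate the model expects; a production resampler would normally low-pass filter first.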
Measured on Apple Silicon with --features metal:
| Metric | Value |
|---|---|
| WER (LibriSpeech test-clean, 500 samples) | 3.83% |
| Real-time factor | 0.131 |
| Speed | 7.6x real-time |
| Model size | 0.6B parameters (~2.4GB) |
Also supports v2 (English-only, 2.66% WER) — auto-detected from model directory.
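
The real-time factor and speed rows in the table are two views of the same measurement, which the short sketch below makes explicit. The function names are hypothetical, and the 4 bytes per parameter used for the size estimate assumes 32-bit float weights, which the README does not state.

```rust
/// Real-time factor: processing time divided by audio duration (lower is faster).
fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

/// Speed relative to real time is the reciprocal of the real-time factor.
fn speed_multiple(rtf: f64) -> f64 {
    1.0 / rtf
}

/// Approximate weight size in GB from parameter count and bytes per parameter.
/// 4.0 bytes per parameter corresponds to f32 weights (an assumption here).
fn weights_gb(params: f64, bytes_per_param: f64) -> f64 {
    params * bytes_per_param / 1e9
}
```

With the table's figures, `speed_multiple(0.131)` comes out at about 7.6, and `weights_gb(0.6e9, 4.0)` at about 2.4, matching the reported speed and download size.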
parakeet/
  src/
    audio.rs            # Log-mel spectrogram (rustfft + GPU matmul)
    subsampling.rs      # 8x depthwise-separable conv subsampling
    encoder.rs          # 24-layer FastConformer encoder
    decoder.rs          # TDT decoder (LSTM + joint network)
    decoding.rs         # Greedy TDT decoding with duration skipping
    tokenizer.rs        # BPE tokenizer
    pipeline.rs         # End-to-end transcription pipeline
    config.rs           # Model configuration (v2 + v3)
    weights.rs          # HF Hub download + SafeTensors loading
    wer.rs              # Word error rate computation
    bin/parakeet.rs     # CLI entry point
  scripts/
    convert_nemo.py     # NeMo .nemo → SafeTensors converter
    benchmark_wer.py    # LibriSpeech WER benchmark
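
For readers unfamiliar with the metric behind `wer.rs` and the benchmark script, word error rate is the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The sketch below is a generic version under that definition; it is not copied from `wer.rs`, and it assumes both strings have already been normalised (case, punctuation) however the benchmark requires.

```rust
/// Word-level Levenshtein distance (substitutions, deletions, insertions).
fn word_edit_distance(reference: &[&str], hypothesis: &[&str]) -> usize {
    // prev[j] = distance between the reference prefix so far and hypothesis[..j].
    let mut prev: Vec<usize> = (0..=hypothesis.len()).collect();
    for (i, r) in reference.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, h) in hypothesis.iter().enumerate() {
            let sub = prev[j] + usize::from(r != h);
            let del = prev[j + 1] + 1;
            let ins = cur[j] + 1;
            cur.push(sub.min(del).min(ins));
        }
        prev = cur;
    }
    prev[hypothesis.len()]
}

/// WER for one utterance, or `None` when the reference has no words
/// (the ratio is undefined in that case).
fn wer(reference: &str, hypothesis: &str) -> Option<f64> {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return None;
    }
    Some(word_edit_distance(&r, &h) as f64 / r.len() as f64)
}
```

Corpus-level figures are usually computed by summing edit distances and reference word counts over all utterances before dividing, not by averaging per-utterance ratios.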
License: MIT
Model weights are subject to NVIDIA's CC-BY-4.0 license.