On-device speech SDK for Android, powered by ONNX Runtime and speech-core.
Speech recognition (114 languages), text-to-speech (8 languages), voice activity detection, and noise cancellation — all running locally. No cloud APIs, no data leaves the device.
Demo APK · Models · speech-swift (Apple counterpart) · speech-core (pipeline engine + Linux/embedded build)
This repo is the Android packaging: Kotlin SDK, JNI bridge, demo app. The C++ engine and ONNX model wrappers (Silero VAD, Parakeet STT, Kokoro TTS, DeepFilterNet3) live in speech-core and are pulled in via a git submodule. Linux / automotive (Yocto, Qualcomm SA8295P/SA8255P) lives at speech-core/examples/linux.
| Model | Task | INT8 Size | Languages |
|---|---|---|---|
| Parakeet TDT v3 | Speech recognition | 891 MB | 114 |
| Kokoro 82M | Text-to-speech | 330 MB | 8 (en, fr, es, it, pt, hi, ja, zh) |
| Supertonic-3 | Text-to-speech (LiteRT, flow-matching, G2P-free, 44.1 kHz) | ~380 MB | 31 |
| Silero VAD v5 | Voice activity detection | 2 MB | Any |
| DeepFilterNet3 | Noise cancellation | ~8 MB | Any |
| FunctionGemma 270M | On-device LLM — structured function / tool calls | 283 MB | EN-tuned |
Models are downloaded automatically on first launch via ModelManager.ensureModels().
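As a rough planning aid, the first-launch download size can be derived from the table above. The sketch below sums the four ONNX models (Parakeet, Kokoro, Silero VAD, DeepFilterNet3), which is an assumption about what ensureModels() fetches by default, and checks free space with java.io.File. The helper names are hypothetical and not part of the SDK, and 1 MB is taken as 1,000,000 bytes.

```kotlin
import java.io.File

// Sizes in MB copied from the model table; treating these four as the
// default download set is an assumption, not something the SDK documents.
val defaultModelSizesMb = mapOf(
    "Parakeet TDT v3" to 891,
    "Kokoro 82M" to 330,
    "Silero VAD v5" to 2,
    "DeepFilterNet3" to 8,
)

// Total bytes needed, using 1 MB = 1,000,000 bytes.
fun requiredBytes(sizesMb: Map<String, Int>): Long =
    sizesMb.values.sumOf { it.toLong() } * 1_000_000L

// True when the volume holding `dir` has at least `bytes` of usable space.
fun hasRoomFor(dir: File, bytes: Long): Boolean = dir.usableSpace >= bytes

fun main() {
    val needed = requiredBytes(defaultModelSizesMb)
    println("${needed / 1_000_000} MB") // 1231 MB, about 1.2 GB
    println(hasRoomFor(File("."), needed))
}
```

On Android, the directory passed to hasRoomFor would typically be the app's files directory; checking before the download gives the user a clearer error than a failed transfer.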
Supertonic-3 is an opt-in higher-quality multilingual TTS — select it with
SpeechConfig(ttsModel = TtsModel.SUPERTONIC) (requires the LiteRT backend). The host runs its four
non-autoregressive flow-matching graphs on-device at 44.1 kHz; the front-end is G2P-free (NFKD +
Unicode index — no phonemizer), so all 31 languages go through one path.
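To illustrate what a G2P-free front-end means, here is a minimal sketch of NFKD normalization followed by a code-point-to-index lookup. It is not the Supertonic tokenizer: the tiny vocabulary, the unknown-ID convention, and the function name are all hypothetical.

```kotlin
import java.text.Normalizer

// Hypothetical sketch: normalize to NFKD, then map each Unicode code point
// to an integer ID. Code points missing from the vocabulary get unknownId.
fun textToIds(text: String, vocab: Map<Int, Int>, unknownId: Int = 0): List<Int> {
    val normalized = Normalizer.normalize(text, Normalizer.Form.NFKD)
    return normalized.codePoints().toArray().map { cp -> vocab[cp] ?: unknownId }
}

fun main() {
    // Toy vocabulary: 'a', 'e', and the combining acute accent U+0301.
    val vocab = mapOf('a'.code to 1, 'e'.code to 2, 0x0301 to 3)
    // NFKD splits the precomposed U+00E9 into 'e' + U+0301.
    println(textToIds("\u00E9", vocab)) // [2, 3]
}
```

Because the lookup works on code points rather than phonemes, the same function handles any script without a per-language phonemizer, which is the property the paragraph above describes.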
FunctionGemma 270M is a Gemma 3 derivative trained for structured tool
calls. The Kotlin wrapper (audio.soniqo.speech.llm.FunctionGemma) is a
runtime-agnostic shell: bring your own LiteRT-LM runtime adapter (see the
Kotlin usage section) and the SDK handles prompt
formatting and call parsing. The model bundle ships as a single 283 MB
.litertlm file.
Download the signed APK and install it on any arm64 device running Android 8 or later. Models (~1.2 GB) download automatically on first launch.
dependencies {
implementation("audio.soniqo:speech:0.0.9")
}

val modelDir = ModelManager.ensureModels(context)
val pipeline = SpeechPipeline(
SpeechConfig(modelDir = modelDir, useNnapi = true)
)
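// Assuming events is a Kotlin Flow: collect suspends while the flow is active,
// so run it in its own coroutine (for example inside a launch { } block);
// otherwise the start() call below is never reached.
pipeline.events.collect { event ->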
when (event) {
is SpeechEvent.TranscriptionCompleted -> println(event.text)
is SpeechEvent.ResponseDone -> pipeline.resumeListening()
else -> {}
}
}
pipeline.start()
// Feed 16 kHz mono float32 PCM from the microphone
pipeline.pushAudio(samples)

The SDK ships the prompt formatter (FunctionGemmaPrompt), the parser
(FunctionGemmaParser), and a small façade (FunctionGemma). You bring
the LiteRT-LM runtime (for example, the com.google.ai.edge.litert:litert-lm-runtime
Maven artifact) and adapt it to the FunctionGemma.Runtime interface
(generate and cancel in the example below), so the SDK stays free of that transitive dependency.
import audio.soniqo.speech.llm.*
val runtime = object : FunctionGemma.Runtime {
private val engine = /* load model.litertlm via your chosen runtime */
override fun generate(prompt: String, maxNewTokens: Int): String =
engine.generateResponse(prompt, maxNewTokens)
override fun cancel() { engine.cancel() }
}
val llm = FunctionGemma(runtime)
val tools = listOf(
FunctionDeclaration(
name = "get_weather",
description = "Get current weather",
parameters = mapOf(
"type" to "object",
"properties" to mapOf(
"location" to mapOf("type" to "string"),
),
),
),
)
val rawResponse = llm.generateToolCall("What's the weather in Tokyo?", tools)
val calls = llm.parseToolCalls(rawResponse)
// -> [FunctionCall(name="get_weather",
//      arguments={"location": ArgumentValue.Str("Tokyo")})]

The model bundle (model.litertlm, 283 MB) is published at
soniqo/FunctionGemma-270M-LiteRT-LM.
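Once tool calls are parsed, the app still has to route each one to real code. The sketch below shows one way to do that with a name-to-handler map. It uses a stand-in ToolCall type with plain string arguments so that it runs on its own; the SDK's FunctionCall carries ArgumentValue instances instead, and the dispatch helper is hypothetical.

```kotlin
// Stand-in for the SDK's FunctionCall (assumption: simplified to string arguments).
data class ToolCall(val name: String, val arguments: Map<String, String>)

typealias ToolHandler = (Map<String, String>) -> String

// Run each call through its registered handler; unknown names are reported, not thrown.
fun dispatch(calls: List<ToolCall>, handlers: Map<String, ToolHandler>): List<String> =
    calls.map { call ->
        val handler = handlers[call.name]
        if (handler == null) "unknown tool: ${call.name}" else handler(call.arguments)
    }

fun main() {
    val handlers: Map<String, ToolHandler> = mapOf(
        "get_weather" to { args: Map<String, String> ->
            "weather lookup for ${args["location"] ?: "?"}"
        },
    )
    val calls = listOf(ToolCall("get_weather", mapOf("location" to "Tokyo")))
    println(dispatch(calls, handlers)) // [weather lookup for Tokyo]
}
```

Returning a message for unknown tool names, rather than throwing, keeps a single malformed model output from aborting the whole turn.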
git clone --recursive https://github.com/soniqo/speech-android.git
cd speech-android
./setup.sh
./gradlew :app:assembleDebug
./gradlew :sdk:connectedAndroidTest   # 34 e2e tests

./setup.sh initializes the speech-core submodule and downloads ONNX Runtime
into ./ort/.
The app/ module is a minimal voice assistant demo with:
- Real-time VAD waveform visualization
- Echo mode: transcribes speech and synthesizes it back (no LLM)
- Dictation mode: streaming partial results
- SpeechRecognizer test screen: exercises the system-wide voice input path
- Chat bubble UI with STT/TTS latency display

./gradlew :app:installDebug

The SDK ships a ready-made audio.soniqo.speech.service.SpeechRecognitionService
that plugs into Android's framework SpeechRecognizer API — no code to write.
Once your app is selected as the default voice recognizer, any third-party app
calling SpeechRecognizer.createSpeechRecognizer(context) (with no
ComponentName) gets fully on-device STT through your pipeline.
1. Declare RECORD_AUDIO and the service in AndroidManifest.xml:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<application>
<service
android:name="audio.soniqo.speech.service.SpeechRecognitionService"
android:exported="true"
android:permission="android.permission.RECORD_AUDIO">
<intent-filter>
<action android:name="android.speech.RecognitionService" />
</intent-filter>
<meta-data
android:name="android.speech"
android:resource="@xml/recognition_service" />
</service>
</application>

2. Add app/src/main/res/xml/recognition_service.xml:
<?xml version="1.0" encoding="utf-8"?>
<recognition-service xmlns:android="http://schemas.android.com/apk/res/android" />

(Optionally add android:settingsActivity="..." to expose a gear icon in the
system Voice-input picker.)
3. Set the service as the system default (Settings → System → Languages & input → Voice input picker on stock Android, or via adb):
adb shell settings put secure voice_recognition_service \
your.package/audio.soniqo.speech.service.SpeechRecognitionService

4. Verify by running the demo app's Recognizer test screen, which calls
SpeechRecognizer.createSpeechRecognizer(ctx) (no component) and logs every
framework callback. This confirms the binder round-trip without
needing logcat.
The service implements onCheckRecognitionSupport (API 33+) returning the
27 BCP-47 languages Parakeet TDT v3 covers, marked
installedOnDeviceLanguage once models are present (or
pendingOnDeviceLanguage while they're downloading). Audio focus is
acquired with AUDIOFOCUS_GAIN_TRANSIENT for the duration of a session.
Caveat: Gboard, Samsung Keyboard, and Google Assistant bundle their own
recognizers and skip the system default. Apps that explicitly call the
framework SpeechRecognizer API (or build their own UI on top of it) are
the ones that flow through your service.
Measured on an Android emulator (arm64-v8a, no NNAPI). Real hardware is significantly faster.
| Model | Task | Audio | Inference | RTF |
|---|---|---|---|---|
| Parakeet TDT v3 | STT | 1.5s | 175ms | 0.12 |
| Kokoro 82M | TTS | 1.9s output | 1,075ms | 0.58 |
| Silero VAD v5 | VAD | 32ms chunk | <1ms | <0.01 |
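The RTF column is the real-time factor: inference time divided by the duration of the audio processed, so values below 1 mean faster than real time. The sketch below reproduces that arithmetic for the Parakeet row and also converts the 32 ms VAD chunk into a sample count, assuming the 16 kHz input rate used by pushAudio. Both helper names are illustrative and not part of the SDK.

```kotlin
// Real-time factor: processing time divided by the duration of the audio processed.
fun realTimeFactor(inferenceMs: Double, audioMs: Double): Double = inferenceMs / audioMs

// Number of samples in a chunk of the given duration at the given sample rate.
fun samplesPerChunk(sampleRateHz: Int, chunkMs: Int): Int = sampleRateHz * chunkMs / 1000

fun main() {
    // Parakeet row: 175 ms of inference for 1.5 s of audio, roughly 0.12.
    println(realTimeFactor(175.0, 1500.0))
    // A 32 ms chunk at 16 kHz.
    println(samplesPerChunk(16_000, 32)) // 512
}
```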
Idle → Listening → Transcribing → Speaking → Idle
↑ |
└─── resumeListening() ───┘
Barge-in is supported: if the user speaks during TTS playback, playback is interrupted and a new transcription starts.
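The state flow above can be read as a small transition table. The sketch below is one possible encoding and not the speech-core implementation: the event names are hypothetical, and it assumes that resumeListening() returns the pipeline to Listening and that barge-in moves directly from Speaking to Transcribing.

```kotlin
enum class State { IDLE, LISTENING, TRANSCRIBING, SPEAKING }

// Hypothetical event names chosen for this sketch.
enum class Event { START, SPEECH_ENDED, TRANSCRIPT_READY, PLAYBACK_DONE, RESUME_LISTENING, BARGE_IN }

// Pure transition function; any pair not listed leaves the state unchanged.
fun next(state: State, event: Event): State = when (state to event) {
    State.IDLE to Event.START,
    State.IDLE to Event.RESUME_LISTENING,
    State.SPEAKING to Event.RESUME_LISTENING -> State.LISTENING
    State.LISTENING to Event.SPEECH_ENDED -> State.TRANSCRIBING
    State.TRANSCRIBING to Event.TRANSCRIPT_READY -> State.SPEAKING
    State.SPEAKING to Event.PLAYBACK_DONE -> State.IDLE
    State.SPEAKING to Event.BARGE_IN -> State.TRANSCRIBING
    else -> state
}

fun main() {
    // One full turn, then the user interrupts playback.
    val events = listOf(Event.START, Event.SPEECH_ENDED, Event.TRANSCRIPT_READY, Event.BARGE_IN)
    println(events.fold(State.IDLE, ::next)) // TRANSCRIBING
}
```

Keeping the transitions in a pure function makes the barge-in path easy to unit test without audio hardware.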
┌──────────────────────────────────────────────┐
│ SpeechPipeline (Kotlin) │
│ │ │
│ ▼ │
│ jni_bridge.cpp (~250 lines) │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ speech_core_models (git submodule) │ │
│ │ SileroVad / ParakeetStt / │ │
│ │ KokoroTts / DeepFilterEnhancer │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ speech_core (orchestration: │ │
│ │ pipeline · turn · interruptions) │ │
│ └──────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ONNX Runtime (CPU / NNAPI) │
└──────────────────────────────────────────────┘
Each model class directly implements the corresponding speech-core interface
(VADInterface, STTInterface, TTSInterface, EnhancerInterface) — the
JNI bridge instantiates them and hands references to VoicePipeline. No
C-vtable adapter boilerplate.
| Chipset | Acceleration |
|---|---|
| Snapdragon 8 Gen 1+ | NNAPI → Hexagon NPU |
| Samsung Exynos 2200+ | NNAPI → Samsung NPU |
| Google Tensor G2+ | NNAPI → Google TPU |
| CPU fallback | XNNPACK |
For automotive Qualcomm SA8295P / SA8255P with QNN (Hexagon DSP), see speech-core/examples/linux.
| Repository | Scope |
|---|---|
| speech-swift | Apple (macOS, iOS) — MLX + CoreML |
| speech-core | Cross-platform C++ pipeline engine + ONNX model wrappers + Linux/embedded examples |
| speech-android | Android wrapper — Kotlin SDK + JNI bridge over speech-core |
Apache 2.0