Speech Android

📖 Read in: English · 中文 · 日本語 · 한국어 · Español · Deutsch · Français · हिन्दी · Português · Русский

On-device speech SDK for Android, powered by ONNX Runtime and speech-core.

Speech recognition (114 languages), text-to-speech (8 languages), voice activity detection, and noise cancellation — all running locally. No cloud APIs, no data leaves the device.

Demo APK · Models · speech-swift (Apple counterpart) · speech-core (pipeline engine + Linux/embedded build)

Scope

This repo is the Android packaging: Kotlin SDK, JNI bridge, demo app. The C++ engine and ONNX model wrappers (Silero VAD, Parakeet STT, Kokoro TTS, DeepFilterNet3) live in speech-core and are pulled in via a git submodule. Linux / automotive (Yocto, Qualcomm SA8295P/SA8255P) lives at speech-core/examples/linux.

Models

Model	Task	INT8 Size	Languages
Parakeet TDT v3	Speech recognition	891 MB	114
Kokoro 82M	Text-to-speech	330 MB	8 (en, fr, es, it, pt, hi, ja, zh)
Supertonic-3	Text-to-speech (LiteRT, flow-matching, G2P-free, 44.1 kHz)	~380 MB	31
Silero VAD v5	Voice activity detection	2 MB	Any
DeepFilterNet3	Noise cancellation	~8 MB	Any
FunctionGemma 270M	On-device LLM — structured function / tool calls	283 MB	EN-tuned

Models are downloaded automatically on first launch via ModelManager.ensureModels().

Supertonic-3 is an opt-in higher-quality multilingual TTS — select it with SpeechConfig(ttsModel = TtsModel.SUPERTONIC) (requires the LiteRT backend). The host runs its four non-autoregressive flow-matching graphs on-device at 44.1 kHz; the front-end is G2P-free (NFKD + Unicode index — no phonemizer), so all 31 languages go through one path.
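
To make the G2P-free front-end concrete, here is a minimal sketch of NFKD normalization followed by a per-code-point mapping. It assumes "Unicode index" means the raw code point value; the real model may remap code points through its own table. The helper name toUnicodeIndices is hypothetical and not part of the SDK.

```kotlin
import java.text.Normalizer

// Hypothetical helper (not an SDK API): decompose the text with NFKD,
// then emit one integer per Unicode code point.
fun toUnicodeIndices(text: String): List<Int> {
    val decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD)
    val indices = ArrayList<Int>()
    var i = 0
    while (i < decomposed.length) {
        val codePoint = decomposed.codePointAt(i)
        indices.add(codePoint)
        // Skip two chars for code points outside the Basic Multilingual Plane.
        i += Character.charCount(codePoint)
    }
    return indices
}
```

Because NFKD splits a precomposed character such as U+00E9 into a base letter plus a combining accent, the same small symbol inventory can cover many scripts without a per-language phonemizer.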

FunctionGemma 270M is a Gemma 3 derivative trained for structured tool calls. The Kotlin wrapper (audio.soniqo.speech.llm.FunctionGemma) is a runtime-agnostic shell: bring your own LiteRT-LM runtime adapter (see the Kotlin usage section) and the SDK handles prompt formatting and call parsing. The model bundle ships as a single 283 MB .litertlm file.

Try the demo

Download the signed APK and install it on any arm64 device running Android 8 or later. Models (~1.2 GB) download automatically on first launch.

Add dependency

dependencies {
    implementation("audio.soniqo:speech:0.0.9")
}

Kotlin usage

val modelDir = ModelManager.ensureModels(context)

val pipeline = SpeechPipeline(
    SpeechConfig(modelDir = modelDir, useNnapi = true)
)

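// collect suspends (assuming events is a kotlinx.coroutines Flow), so run it in
// its own coroutine; otherwise pipeline.start() below would never be reached.
// `scope` is any CoroutineScope you own, e.g. lifecycleScope (import kotlinx.coroutines.launch).
scope.launch {
    pipeline.events.collect { event ->
        when (event) {
            is SpeechEvent.TranscriptionCompleted -> println(event.text)
            is SpeechEvent.ResponseDone -> pipeline.resumeListening()
            else -> {}
        }
    }
}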

pipeline.start()

// Feed 16 kHz mono float32 PCM from the microphone
pipeline.pushAudio(samples)
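
Microphone capture on Android often yields signed 16-bit PCM, while pushAudio expects float32. A minimal conversion sketch follows; it assumes the pipeline wants samples normalized to the range [-1.0, 1.0), which the text above does not state explicitly. The helper name pcm16ToFloat is hypothetical.

```kotlin
// Hypothetical helper (not an SDK API): scale signed 16-bit PCM samples
// into float32 values in [-1.0, 1.0) by dividing by 2^15.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }
```

Under that assumption, a buffer read from the recorder could be passed along as pipeline.pushAudio(pcm16ToFloat(buffer)). Resampling is not handled here; the input is assumed to be 16 kHz mono already.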

FunctionGemma 270M (on-device tool-calling LLM)

The SDK ships the prompt formatter (FunctionGemmaPrompt), the parser (FunctionGemmaParser), and a small façade (FunctionGemma). You bring the LiteRT-LM runtime (e.g. the com.google.ai.edge.litert:litert-lm-runtime Maven artifact) and adapt it to the small FunctionGemma.Runtime interface (generate and cancel in the example below), so the SDK stays free of that transitive dependency.

import audio.soniqo.speech.llm.*

val runtime = object : FunctionGemma.Runtime {
    private val engine = /* load model.litertlm via your chosen runtime */
    override fun generate(prompt: String, maxNewTokens: Int): String =
        engine.generateResponse(prompt, maxNewTokens)
    override fun cancel() { engine.cancel() }
}

val llm = FunctionGemma(runtime)

val tools = listOf(
    FunctionDeclaration(
        name = "get_weather",
        description = "Get current weather",
        parameters = mapOf(
            "type" to "object",
            "properties" to mapOf(
                "location" to mapOf("type" to "string"),
            ),
        ),
    ),
)

val rawResponse = llm.generateToolCall("What's the weather in Tokyo?", tools)
val calls = llm.parseToolCalls(rawResponse)
// -> [FunctionCall(name="get_weather",
//                  arguments={"location": ArgumentValue.Str("Tokyo")})]

The model bundle (model.litertlm, 283 MB) is published at soniqo/FunctionGemma-270M-LiteRT-LM.
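
Once tool calls are parsed, the app still has to route each call to real code. The sketch below shows one way to do that with a name-to-handler map. ToolCall and ToolDispatcher are hypothetical stand-ins written for this example; the SDK's own FunctionCall type carries typed ArgumentValue entries rather than plain strings.

```kotlin
// Stand-in for a parsed call (hypothetical; not the SDK's FunctionCall type).
data class ToolCall(val name: String, val arguments: Map<String, String>)

// Hypothetical registry: tool name -> handler that returns a textual result.
class ToolDispatcher(
    private val handlers: Map<String, (Map<String, String>) -> String>
) {
    // Returns the handler's result, or null when the model named an unknown tool.
    fun dispatch(call: ToolCall): String? =
        handlers[call.name]?.invoke(call.arguments)
}

val dispatcher = ToolDispatcher(
    mapOf(
        "get_weather" to { args: Map<String, String> ->
            "Weather requested for ${args["location"] ?: "unknown"}"
        }
    )
)
```

Returning null for unknown names keeps a hallucinated tool name from crashing the app; the caller can then decide whether to ignore the call or re-prompt the model.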

Build from source

git clone --recursive https://github.com/soniqo/speech-android.git
cd speech-android
./setup.sh
./gradlew :app:assembleDebug
./gradlew :sdk:connectedAndroidTest   # 34 e2e tests

./setup.sh initializes the speech-core submodule and downloads ONNX Runtime into ./ort/.

Demo app

The app/ module is a minimal voice assistant demo with:

- Real-time VAD waveform visualization
- Echo mode: transcribes speech and synthesizes it back (no LLM)
- Dictation mode: streaming partial results
- SpeechRecognizer test screen: exercises the system-wide voice input path
- Chat bubble UI with STT/TTS latency display

./gradlew :app:installDebug

System voice input (`RecognitionService`)

The SDK ships a ready-made audio.soniqo.speech.service.SpeechRecognitionService that plugs into Android's framework SpeechRecognizer API — no code to write. Once your app is selected as the default voice recognizer, any third-party app calling SpeechRecognizer.createSpeechRecognizer(context) (with no ComponentName) gets fully on-device STT through your pipeline.

1. Declare RECORD_AUDIO and the service in AndroidManifest.xml:

<uses-permission android:name="android.permission.RECORD_AUDIO" />

<application>
    <service
        android:name="audio.soniqo.speech.service.SpeechRecognitionService"
        android:exported="true"
        android:permission="android.permission.RECORD_AUDIO">
        <intent-filter>
            <action android:name="android.speech.RecognitionService" />
        </intent-filter>
        <meta-data
            android:name="android.speech"
            android:resource="@xml/recognition_service" />
    </service>
</application>

2. Add app/src/main/res/xml/recognition_service.xml:

<?xml version="1.0" encoding="utf-8"?>
<recognition-service xmlns:android="http://schemas.android.com/apk/res/android" />

(Optionally add android:settingsActivity="..." to expose a gear icon in the system Voice-input picker.)

3. Set the service as the system default (Settings → System → Languages & input → Voice input picker on stock Android, or via adb):

adb shell settings put secure voice_recognition_service \
  your.package/audio.soniqo.speech.service.SpeechRecognitionService

4. Verify by running the demo app's Recognizer test screen, which calls SpeechRecognizer.createSpeechRecognizer(ctx) (no component) and logs every framework callback — useful for confirming the binder round-trip without needing logcat.

The service implements onCheckRecognitionSupport (API 33+) returning the 27 BCP-47 languages Parakeet TDT v3 covers, marked installedOnDeviceLanguage once models are present (or pendingOnDeviceLanguage while they're downloading). Audio focus is acquired with AUDIOFOCUS_GAIN_TRANSIENT for the duration of a session.

Caveat: Gboard, Samsung Keyboard, and Google Assistant bundle their own recognizers and skip the system default. Apps that explicitly call the framework SpeechRecognizer API (or build their own UI on top of it) are the ones that flow through your service.

Performance

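Measured on an Android emulator (arm64-v8a, no NNAPI). RTF is the real-time factor: inference time divided by audio duration, so values below 1 are faster than real time. Physical devices should be faster, though no device measurements are listed here.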

Model	Task	Audio	Inference	RTF
Parakeet TDT v3	STT	1.5s	175ms	0.12
Kokoro 82M	TTS	1.9s output	1,075ms	0.58
Silero VAD v5	VAD	32ms chunk	<1ms	<0.01
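
The RTF column can be reproduced from the Audio and Inference columns. A small sketch, using a hypothetical helper name:

```kotlin
// Real-time factor: processing time divided by the duration of the audio
// that was processed. Values below 1.0 mean faster than real time.
fun realTimeFactor(inferenceMs: Double, audioMs: Double): Double {
    require(audioMs > 0.0) { "audio duration must be positive" }
    return inferenceMs / audioMs
}
```

For the Parakeet row, realTimeFactor(175.0, 1500.0) is roughly 0.117, which rounds to the 0.12 shown in the table.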

Pipeline

Idle → Listening → Transcribing → Speaking → Idle
              ↑                         |
              └─── resumeListening() ───┘

Barge-in is supported: if the user speaks during TTS playback, playback is interrupted and a new transcription starts.
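
The diagram can be read as a small state machine. The sketch below is an illustration only: the trigger names are hypothetical, and mapping barge-in to the Listening state is an assumption (the interrupting utterance has to be captured before it can be transcribed). The actual orchestration lives in speech-core.

```kotlin
enum class State { IDLE, LISTENING, TRANSCRIBING, SPEAKING }

// Hypothetical trigger names; the SDK's own events may differ.
enum class Trigger {
    START, SPEECH_ENDED, TRANSCRIPT_READY, PLAYBACK_DONE, RESUME_LISTENING, BARGE_IN
}

// Pure transition function: unknown (state, trigger) pairs leave the state unchanged.
fun nextState(state: State, trigger: Trigger): State = when (state to trigger) {
    State.IDLE to Trigger.START -> State.LISTENING
    State.LISTENING to Trigger.SPEECH_ENDED -> State.TRANSCRIBING
    State.TRANSCRIBING to Trigger.TRANSCRIPT_READY -> State.SPEAKING
    State.SPEAKING to Trigger.PLAYBACK_DONE -> State.IDLE
    State.SPEAKING to Trigger.RESUME_LISTENING -> State.LISTENING
    // Assumption: barge-in stops playback and returns to capturing audio.
    State.SPEAKING to Trigger.BARGE_IN -> State.LISTENING
    else -> state
}
```

Keeping the transition function pure makes the allowed paths easy to unit-test without audio hardware.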

Architecture

┌──────────────────────────────────────────────┐
│      SpeechPipeline (Kotlin)                 │
│            │                                 │
│            ▼                                 │
│      jni_bridge.cpp  (~250 lines)            │
│            │                                 │
│            ▼                                 │
│  ┌──────────────────────────────────────┐    │
│  │  speech_core_models (git submodule)  │    │
│  │   SileroVad / ParakeetStt /          │    │
│  │   KokoroTts / DeepFilterEnhancer     │    │
│  │            │                         │    │
│  │            ▼                         │    │
│  │  speech_core  (orchestration:        │    │
│  │   pipeline · turn · interruptions)   │    │
│  └──────────────────────────────────────┘    │
│            │                                 │
│            ▼                                 │
│      ONNX Runtime (CPU / NNAPI)              │
└──────────────────────────────────────────────┘

Each model class directly implements the corresponding speech-core interface (VADInterface, STTInterface, TTSInterface, EnhancerInterface) — the JNI bridge instantiates them and hands references to VoicePipeline. No C-vtable adapter boilerplate.

Hardware Acceleration

Chipset	Acceleration
Snapdragon 8 Gen 1+	NNAPI → Hexagon NPU
Samsung Exynos 2200+	NNAPI → Samsung NPU
Google Tensor G2+	NNAPI → Google TPU
CPU fallback	XNNPACK

For automotive Qualcomm SA8295P / SA8255P with QNN (Hexagon DSP), see speech-core/examples/linux.

License

Apache 2.0

Related

Repository	Scope
speech-swift	Apple (macOS, iOS) — MLX + CoreML
speech-core	Cross-platform C++ pipeline engine + ONNX model wrappers + Linux/embedded examples
speech-android	Android wrapper — Kotlin SDK + JNI bridge over speech-core
