feat(parakeet-cpp): dynamic batching for concurrent transcription requests #10112
Open
localai-bot wants to merge 6 commits into
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…ed JSON C-API Drop SingleThread; route unary transcription through the in-process batcher which coalesces concurrent requests into one batched engine call. Streaming stays mutually exclusive via engineMu. Adds batch_max_size / batch_max_wait_ms options (size=1 disables; recommended on CPU). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…eallocate; clarify stream lock Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… with per-request fallback The batched JSON C-API symbol exists only in newer libparakeet.so (ABI >= 2); probe it with Dlsym and register optionally so the backend still loads against an older library, falling back to per-request transcription. Rewrites the batcher unit tests as Ginkgo/Gomega specs (forbidigo bans t.Fatal in tests). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
a39a3b8 to 1c9bff2
Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Dynamic batching now defaults off (batch_max_size:1, one request at a time). Raise batch_max_size to opt in: it is a large throughput win on GPU under concurrent load, but on CPU and low-concurrency setups it only adds latency, so off is the safer default. The startup log now states whether batching is on or off, and the audio-to-text docs are updated to match. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]
795d2ed to 27d7d0d
Summary
Adds dynamic batching to the `parakeet-cpp` backend so concurrent `/v1/audio/transcriptions` requests are coalesced into one batched call through `parakeet.cpp`'s batched encoder. This is a GPU throughput feature: under concurrent load the batched-GEMM path raises utilization. On CPU it does not help (the GEMMs already saturate the threads and padding adds work), so it can be disabled.
What changed (all under `backend/go/parakeet-cpp/`):
- `batcher.go`: handler goroutines submit requests; one dispatcher goroutine accumulates them until `batch_max_size` or `batch_max_wait_ms` is reached, then makes a single batched engine call. The dispatcher is the sole caller of the C engine, so engine access stays single-threaded.
- Swaps `base.SingleThread` (which serialized every call) for `base.Base`, so concurrent `AudioTranscription` handlers actually run and reach the batcher. An `engineMu` keeps the streaming path and the batched unary path mutually exclusive on the one shared engine context.
- `AudioTranscription` decodes the file, submits to the batcher, and shapes the per-item JSON exactly as before (text, word/segment timestamps, tokens).
- Options: `batch_max_size` (default 8) and `batch_max_wait_ms` (default 15). `batch_max_size: 1` disables batching (recommended on CPU).
- Docs: `docs/content/features/audio-to-text.md`.
Dependency
Requires the parakeet.cpp side that adds the `parakeet_capi_transcribe_pcm_batch_json` C-API (batched transcription with timestamps) and bumps the ABI to 2. The backend binds that symbol via purego at runtime, so this Go code builds without it, but the backend image must ship a `libparakeet.so` built from that parakeet.cpp branch for the feature to work.
Test plan
- Batcher unit tests with `-race`: `go test ./backend/go/parakeet-cpp/ -run TestBatcher -race` (coalescing, size trigger, window trigger, size-1 bypass).
- `go build` / `go vet` clean (one pre-existing unrelated unsafe.Pointer warning).
- … `libparakeet.so`, then `make test-extra-backend-parakeet-cpp-transcription`; fire N concurrent transcription requests and confirm correct per-request transcripts, and that throughput improves at `batch_max_size > 1` on GPU.
Assisted-by: Claude:claude-opus-4-8 [Claude Code]