[WIP] EasyMP vllm-omni model definition #15741

Open
vklimkov-nvidia wants to merge 45 commits into NVIDIA-NeMo:easymp_voiceagent from vklimkov-nvidia:easymp_vllm_omni

@vklimkov-nvidia

Member

EasyMP model definition, where the backbone and the local transformer (LT) are compiled into a single CUDA graph for uniform batches.
Loads real weights, but doesn't produce valid acoustic tokens at this point.

@vklimkov-nvidia vklimkov-nvidia requested a review from a team as a code owner June 1, 2026 16:59
@copy-pr-bot

copy-pr-bot Bot commented Jun 1, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the TTS label Jun 1, 2026
…cate speaker encoder application

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ition of Easy Magpie

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…embeddings and prepare prefill embeddings

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…eckpoint to vllm omni one

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… prediction processing

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ic token prediction

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…do ckpt conversion without precision loss

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ce of cudagraph-friendly LT re-implementation

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…d scaling

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ampled tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… fix sending back chunks of audio

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… model definition

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…memory utilization

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…t other plugin

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… fix preprocessing start_idx usage

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…t: increased batched tokens in case of a lot of simultaneous prefill

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…el weights

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… model

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… demo for streaming text tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ext tokens are all streamed

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…g of streaming mode, simplify

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…ts service

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…operly

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Lower the first service audio chunk to one frame based on local TTFA benchmarks, and record the measured codec/streaming investigation notes.
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…1d with Linear

The local transformer's feed-forward used kernel-1 Conv1d layers, forcing a
[b,t,c]<->[b,c,t] transpose on entry/exit that torch.compile could not fuse
away (showed up as transpose/convolution triton kernels in profiling). Switch
to bias-free nn.Linear operating directly on [b,t,c]; the conv submodule
attribute name is kept and load_weights squeezes the trailing singleton dim so
existing checkpoints still load 1:1. Also cache the positional arange index to
avoid re-running an embedding gather every autoregressive step.

Benchmark (Nemotron-H, -n 32 -c 1 32 --max-new-tokens 64): c=32 ITL 45.6->26.4ms,
req/s 10.54->11.82.
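
The equivalence this commit relies on can be checked without any framework. The sketch below is illustrative only: it uses nested Python lists in place of tensors, and every function name in it is hypothetical rather than taken from the PR. It shows that a bias-free kernel-1 convolution on [b,c,t], wrapped in transposes, gives the same result as a bias-free linear layer on [b,t,c] once the trailing singleton kernel dimension of the weight is squeezed away.

```python
# Framework-free sketch: a kernel-size-1 Conv1d is a per-timestep matmul.
# All names are illustrative assumptions, not identifiers from the PR.

def conv1d_k1(x_bct, w_oik):
    """Kernel-1, bias-free Conv1d on [b, c, t] input with [out, in, 1] weight."""
    out = []
    for sample in x_bct:                      # sample is [c][t]
        t_len = len(sample[0])
        out.append([
            [sum(w_row[i][0] * sample[i][t] for i in range(len(sample)))
             for t in range(t_len)]
            for w_row in w_oik                # one output channel per weight row
        ])
    return out                                # [b, out, t]


def linear(x_btc, w_oi):
    """Bias-free Linear on [b, t, c] input with [out, in] weight."""
    return [
        [[sum(w * v for w, v in zip(w_row, frame)) for w_row in w_oi]
         for frame in sample]
        for sample in x_btc
    ]                                         # [b, t, out]


def transpose_last_two(x):
    """[b, m, n] -> [b, n, m]."""
    return [[list(col) for col in zip(*sample)] for sample in x]


def squeeze_kernel_dim(w_oik):
    """Drop the trailing singleton kernel dim: [out, in, 1] -> [out, in]."""
    return [[taps[0] for taps in w_row] for w_row in w_oik]


x_btc = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]                     # b=1, t=3, c=2
conv_weight = [[[0.5], [-1.0]], [[2.0], [0.25]], [[0.0], [1.0]]]   # out=3, in=2, k=1

# Old path: transpose in, convolve, transpose out.
via_conv = transpose_last_two(conv1d_k1(transpose_last_two(x_btc), conv_weight))
# New path: squeeze the checkpoint weight once, then apply Linear on [b, t, c].
via_linear = linear(x_btc, squeeze_kernel_dim(conv_weight))
```

Both paths compute the same per-timestep dot products, which is why squeezing the weight at load time is enough for existing checkpoints to keep loading unchanged.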
…in a single graph

The per-frame codebook loop replayed the compiled transformer N times with eager
projection/sampling in between. Move the whole loop (transformer stack +
per-codebook heads + sampling) into one @support_torch_compile module
(EasyMagpieCodeLoop) so vLLM captures a single CUDA graph replayed once per frame
instead of N times. Same FLOPs; removes per-step Python/launch overhead that
dominates throughput scaling.

Sampling is kept graph-safe: Gumbel noise is drawn eagerly outside the graph and
injected (so RNG isn't replayed), temperature is a runtime tensor (per-request
temperature still works), and top_k is a capture-time constant. The loop owns no
params; the heads/embeddings/mask stay on EasyMagpieCodePredictor so the
checkpoint still loads 1:1. Also squeeze the kernel-1 Conv1d->Linear weight in the
test's NeMo->vLLM copy (follow-up to the FFN dense change).

Benchmark (Nemotron-H, -n 32 -c 1 32 --max-new-tokens 64): c=32 req/s 11.82->17.55.
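
The graph-safe sampling described in this commit message can be illustrated with the Gumbel-max trick. The sketch below is an assumption about the general technique, not the PR's code: function names are hypothetical and plain Python lists stand in for tensors. The point it shows is that the sampling function itself makes no RNG call, so it would behave identically on every replay; the randomness arrives as an input drawn beforehand, temperature is an ordinary runtime argument, and top_k is fixed.

```python
import math
import random


def draw_gumbel(n, rng):
    """Standard Gumbel noise, -log(-log(U)); drawn outside the captured region."""
    return [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in range(n)]


def sample_token(logits, gumbel, temperature, top_k):
    """Gumbel-max sampling: argmax over the top-k of logits / T + noise.

    `gumbel` and `temperature` are runtime inputs; `top_k` plays the role of a
    capture-time constant. No RNG call happens inside this function.
    """
    # Keep only the top_k largest logits (ties broken by lower index).
    keep = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:top_k]
    return max(keep, key=lambda i: logits[i] / temperature + gumbel[i])


rng = random.Random(0)
logits = [2.0, 0.5, 1.0, -3.0]
noise = draw_gumbel(len(logits), rng)      # fresh noise per step, drawn eagerly
token = sample_token(logits, noise, temperature=0.8, top_k=3)
```

Calling `sample_token` again with the same `noise` returns the same token, which is the property that makes it safe to replay; a new draw from `draw_gumbel` is what changes the outcome between steps.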
preprocess runs on the host, once per request, serially on the runner's
critical path; shipping a per-request (T_audio, embedding_dim) speaker
embedding (ZMQ serialize/deserialize + H2D) and reassembling the prefill
context there dominated TTFT under concurrency.

For the fixed speaker set we serve, bake the speaker embeddings into model
state: load each speaker_embeddings/<id>.pt once at construction into a
GPU-resident, model-dtype tensor, so a request carries only a short
speaker_id string. Custom / one-off voices may still pass a raw
speaker_embedding tensor. Loaded in __init__ (not load_weights) so it is
present under --load-format dummy too.

Also drop the silent zero/last-row padding of short prefill chunks in favor
of an assertion (the backbone was never trained on padded context).

Benchmark (dummy weights, RTX A6000, n=64, speaker_id path):
c=32 TTFT 188ms -> ~95ms, c=1 31ms -> 27ms.
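
The request-side effect of this change can be sketched as a lookup table built once at construction. Everything below is a hypothetical illustration: the class and function names, the request-as-dict shape, the rule that a raw `speaker_embedding` takes precedence over `speaker_id`, and the exact assertion condition are assumptions, and plain lists stand in for GPU-resident tensors.

```python
class SpeakerTable:
    """Hypothetical stand-in for speaker embeddings baked into model state."""

    def __init__(self, embeddings_by_id):
        # The commit message describes loading each embedding once at construction
        # and keeping it resident on the GPU; lists stand in for tensors here.
        self._table = dict(embeddings_by_id)

    def resolve(self, request):
        """Return the embedding for a request dict.

        Assumed precedence: a raw `speaker_embedding` (custom voice) wins,
        otherwise `speaker_id` is looked up in the preloaded table.
        """
        raw = request.get("speaker_embedding")
        if raw is not None:
            return raw
        speaker_id = request.get("speaker_id")
        if speaker_id not in self._table:
            raise KeyError(f"unknown speaker_id: {speaker_id!r}")
        return self._table[speaker_id]


def check_prefill_chunk(chunk, expected_len):
    """Fail loudly on a short chunk instead of silently padding it."""
    assert len(chunk) == expected_len, (
        f"prefill chunk has {len(chunk)} rows, expected {expected_len}"
    )
    return chunk


table = SpeakerTable({"alice": [[0.1, 0.2], [0.3, 0.4]]})
by_id = table.resolve({"speaker_id": "alice"})              # short string on the wire
custom = table.resolve({"speaker_embedding": [[9.0, 9.0]]})  # one-off voice
```

With this shape a request for a served speaker carries only the id string, and the large per-request tensor is needed only for custom voices.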
…d, no-op for codec as debug

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
…st of text tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
… chunks of tokens

Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>
Signed-off-by: Viacheslav Klimkov <vklimkov@nvidia.com>