Fast CUDA backend: vendored mmq matmul, VMM weight layout, speculative-decoding plumbing by Entrpi · Pull Request #187 · antirez/ds4

Entrpi · 2026-05-18T07:55:58Z

This PR is a follow-up from the initial PR cluster I submitted a few days ago, with continuing work producing more significant performance gains.

What this branch does.

PRO 6000 Blackwell, V4 Flash, ctx=2048: upstream prefills at ~373 tok/s; this branch does ~2190, i.e. 5.9×. Generation: 38 → 44 tok/s (+16%). GB10 Spark: 401 → 461 prefill (+15%); integrated LPDDR5X is bandwidth-limited so most of the wins don't translate there. Reproduces with ds4-bench on both machines; CSVs and SVGs in speed-bench/.

The speed comes from two independent changes that compound:

1. Faster matmul. Upstream dispatches every Q8_0 dense matmul through a Q8→FP16 expansion cache plus cublasGemmEx. This branch vendors llama.cpp's mmq/mmvq family verbatim into cuda/mmq/ (pinned to 5c0e9468) with a ds4-side adapter, dispatcher, and CPU-reference parity harness on top. mmq handles prefill; mmvq handles single-token decode. cuBLAS stays as a fallback. Licensing is clean: both projects are MIT and LICENSE already credits both author lines.

2. Better weight layout. cudaMalloc packs all ~80 GiB of weights into one region; each tensor lands at an arbitrary internal offset. Switching to per-tensor allocation via the CUDA Driver VMM API (cuMemCreate + cuMemAddressReserve) gives each tensor its own 2 MiB-aligned virtual address... and roughly doubles discrete-GPU prefill on its own. The bisect that found this is informative: VMM with one big chunk performs identically to cudaMalloc. So the win is specifically per-tensor 2 MiB-aligned bases, not 2 MiB pages; same memory either way, but tile-load coalescing and L2 spatial locality care which addresses the tensors land at. A small, deterministic FP32 reduction-order drift comes with this (one of four logprob test vectors flips to a textually-equivalent alternative); documented in misc/cuda-env-vars.md, opt-out via DS4_CUDA_VMM_ARENA=0.

Stack: 2.93× from mmq, 2.04× from VMM layout, 5.88× combined.

Why we can land this confidently. tests/ds4_proof.py is a proof harness modeled as profile × suite × prompt × budget × contract: boots ds4 in any two configurations, feeds them identical prompts, verifies byte-identical generated tokens. Every default-on change here was gated on it.

Weight-server sidecar. For N concurrent ds4 processes against the same model, each paying 20 to 70 seconds of weight upload is painful. tools/ds4_weight_server.cu is a small CUDA process that owns the weights once and shares VMM ranges with workers over a Unix-domain socket (SCM_RIGHTS). Auto-disabled when not in use; single-process users never notice it.

CUDA speculative decoding (MTP): plumbed, but experimental. The CUDA backend had no exact speculative-decoding path before; this branch lands the full pipeline behind 14 purpose-built CUDA kernels. It proves byte-equivalent. But throughput is currently neutral-to-negative on the configurations measured (GB10 draft=2: 12.62 vs 12.68 gen tok/s); the setup tax doesn't amortize at gen=128. Ships behind opt-in env flags; main/default config stays best-nomtp. The exact verifier mostly earns its keep here as a correctness floor for future MTP work. Path to net-positive: deeper draft, lower verifier scheduling overhead, longer generation regimes.

Suggested split if you'd prefer. All logically independent and revertible:

cuda/mmq/ vendor + adapter + parity harness.
Dispatcher routes Q8_0 / Q4_K / IQ2_XXS / Q2_K through mmq.
In-process VMM arena.
ds4_weight_server + import API.
tests/ds4_proof.py + docs.
MTP exact verifier path (opt-in).
Option D: MTP-verifier kernel routing onto legacy.
Bench/server quality-of-life: steady-state gen_tps, token_ids in SSE, no-MTP baselines.

(1) and (2) stay together; mmq is dead code without the dispatcher. Everything else can be sequenced.

Full annotated state with charts and per-commit indexing: https://entrpi.github.io/misc/ds4/ds4_fork_state_2026-05-18.html

Testing per CONTRIBUTING.md.

Default build and correctness:

make clean && make: green on macOS (Metal) and Linux (CUDA).
make cpu: green; CPU path is reference-only.
make test (i.e. ./ds4_test --all): run on both CUDA targets (PRO 6000 Blackwell sm_120, GB10 Spark sm_121). All 8 --logprob-vectors failures categorized:
- 1 attributable to the VMM arena: short_code_completion step 1 flips to a textually-equivalent alternative on the c language tag after triple-backticks. Documented in misc/cuda-env-vars.md; opt-out via DS4_CUDA_VMM_ARENA=0 restores byte-equivalence at ~50% prefill cost.
- 7 long_memory_archive failures verified pre-existing by building at upstream merge-base c9dd9499 with none of this branch applied: same 7 failures, same pattern. Most likely q2-quant noise on a 16k retrieval task; not introduced here.

CUDA regression:

make cuda-regression: green on both targets. (Fixed a missing $(MMQ_OBJS) link in the tests/cuda_long_context_smoke target as part of this branch.)

Speed regressions:

ds4-bench frontier sweeps on both CUDA targets; CSVs and SVGs committed under speed-bench/: pro6000_blackwell.csv / _ts.svg, gb10_spark.csv / _ts.svg. Both prefill_tps and gen_tps reported per CONTRIBUTING.md. PRO 6000 ctx=2048: 1078.86 → 2193.29 prefill, 38.0 → 44.2 gen. GB10 is flat on VMM (integrated LPDDR5X, as expected); GB10 gen is also flat within run-to-run noise vs upstream, since LPDDR5X bandwidth caps decode regardless of dispatch path.

Quality scorer: not run; no quantization changes in this PR.

The VMM sharing path needs an explicit capability checkpoint before any ownership or import lifetime code depends on it. Add the build and dry-run plumbing needed to establish that CUDA Driver VMM is available on the target without changing the default IPC backend. Link CUDA builds with libcuda, add --backend ipc|vmm with ipc as the default, query VMM/POSIX-FD/UVA device attributes, and report VMM allocation granularity and logical-vs-rounded allocation size during dry-run planning. The non-dry-run VMM path is intentionally guarded in this checkpoint so a run cannot claim VMM semantics while still producing legacy IPC allocations. The existing IPC path remains the default behavior. Validated on Spark/GB10 by rebuilding ds4 and ds4_weight_server, running an MTP-only --backend vmm --dry-run with a 32 GiB reserve, observing vmm=1 posix_fd=1 uva=1 and 2 MiB granularity, confirming the 3.55 GiB MTP allocation plan, and verifying no lingering DS4/weight-server processes or swap use afterward.

Move the isolated weight owner from VMM capability probing to actual CUDA Driver VMM allocation ownership. This avoids spending effort scaling the legacy full-base raw IPC owner and instead exercises the allocation model intended for imported worker mappings. Owned ranges now carry backend-specific state. The VMM backend allocates with cuMemCreate, reserves and maps owner VA with cuMemAddressReserve and cuMemMap, grants owner access with cuMemSetAccess, and reuses the pinned staged GGUF upload path against the mapped VA. The server emits DS4_WEIGHT_SERVER_VMM_V1 alloc records containing logical bytes and granularity-rounded allocation bytes. Cleanup releases VMM mappings with cuMemUnmap, cuMemAddressFree, and cuMemRelease. The IPC backend remains the default and keeps its existing manifest format. Validated on Spark/GB10 by rebuilding ds4_weight_server, starting an MTP-only --backend vmm owner with a 32 GiB reserve, observing an 8-record DS4_WEIGHT_SERVER_VMM_V1 manifest, uploading 3.55 GiB across 8 MTP ranges, terminating with SIGTERM, seeing the shutdown path run, and confirming no lingering DS4/weight-server processes or swap use afterward.

VMM manifests can describe exported allocations but cannot directly contain live POSIX file descriptors. Add an owner-mediated Unix-domain socket broker so short-lived worker processes can obtain importable handles from the long-lived weight owner. VMM-owned ranges are exported with cuMemExportToShareableHandle. The VMM manifest now records a broker socket path, and the server listens on an AF_UNIX stream socket. Clients send GET <alloc-id>; the broker replies with status text and transfers the exported FD with SCM_RIGHTS. Shutdown closes the broker socket, unlinks the socket path, closes exported FDs, releases VMM allocations, and reports the number of broker requests served. Validated on Spark/GB10 by rebuilding ds4_weight_server, starting an MTP-only --backend vmm owner, confirming the VMM manifest advertised a broker path and 8 MTP alloc records, requesting GET 0 from a Python Unix-socket client, receiving one FD with OK 0 144703488 via SCM_RIGHTS, observing broker served alloc=0 requests=1 in the server log, and confirming socket cleanup, no lingering DS4/weight-server processes, and zero swap use after shutdown.

Imported VMM allocations need to look like ordinary cached model ranges to the existing CUDA kernels. Add the worker-side half of the VMM sharing path so process-isolated DS4 workers can consume allocations owned by ds4_weight_server without changing tensor call sites. The importer now recognizes DS4_WEIGHT_SERVER_VMM_V1 manifests, reads the broker socket and alloc records, requests each allocation by id, receives the exported POSIX FD with SCM_RIGHTS, imports it with cuMemImportFromShareableHandle, maps it into worker VA with cuMemAddressReserve and cuMemMap, grants device read access with cuMemSetAccess, and registers the logical span in g_model_ranges. Model range cleanup now has a VMM-owned branch that unmaps worker VA, frees the reserved address range, and releases the imported generic allocation handle. Existing IPC manifests remain supported by the same public import entry point. Validated on Spark/GB10 by rebuilding ds4 and ds4_weight_server, starting an MTP-only --backend vmm owner, running a direct CUDA worker with DS4_CUDA_WEIGHT_IPC_MANIFEST and DS4_CUDA_WEIGHT_IPC_SCOPE=mtp, observing 'CUDA imported shared VMM weight cache for mtp: 3.55 GiB across 8 ranges', seeing the broker serve all 8 allocation FDs, generating output successfully, and confirming no lingering DS4/weight-server processes and zero swap use after shutdown.

The proof harness needs to own VMM runs with the same rigor as the legacy IPC path before longer MTP comparisons can rely on it. Add first-class VMM backend selection and validation so a JSON report can distinguish a successful VMM import path from a server that merely started. The runner now accepts --weight-server-backend ipc|vmm and passes it through both dry-run preflight and persistent ds4_weight_server startup. VMM manifests are validated with DS4_WEIGHT_SERVER_VMM_V1, broker, and alloc records, while existing IPC manifests remain supported. Weight-server telemetry parsing now records VMM support flags, granularity, logical and allocated VMM plans, backend allocation totals, broker socket path, and broker request counts. The validation verdict checks backend matching, VMM preflight support, VMM backend telemetry, broker listening, broker requests, and per-model VMM plans. Validated on Spark/GB10 with an MTP-only proof run using --start-weight-server --weight-server-scope mtp --weight-server-backend vmm. The run completed with failures=0, weight_server_validation.passed=true, preflight_vmm_supported=true, vmm_backend_telemetry=true, vmm_broker_listening=true, vmm_broker_requests=true, vmm_plan_mtp=true, broker_requests=8, generated worker output, clean owner termination, no lingering DS4/weight-server processes, and zero swap use.

The VMM backend is now the primary Spark/GB10 path for shared MTP raw tensor spans, so the proof documentation needs to present that workflow directly instead of describing only the legacy IPC owner. Update the proof harness README to describe the two weight-server backends, VMM broker semantics, VMM validation checks, and manual VMM startup/import arguments. Update the CUDA MTP README so multi-profile proof examples use --weight-server-backend vmm with MTP scope and explain that validation now covers backend and broker activity. Validated by running local Python compile checks for the proof harness and by completing the Spark VMM MTP workflow proof before documenting the commands and observed behavior: failures=0, exact-byte profile comparisons passed for two prompts, weight_server_validation.passed=true, broker_requests=32, owner cleanup terminated, no lingering DS4/weight-server processes, and zero swap use.

Full-scope VMM ownership is only useful if the owner uses the same loading-speed assumptions as the CUDA runtime. The standalone owner was still reading model bytes through normal pread into pinned buffers, which made base+MTP startup substantially slower than the main CUDA loading path. Port the CUDA loader's direct-I/O staging approach into ds4_weight_server. The server now reopens model files through /proc/self/fd with O_DIRECT when available, tracks filesystem alignment, over-allocates and aligns pinned staging buffers, reads aligned windows for direct I/O, and falls back to normal pread if direct reads are unavailable. The existing four-stage pinned buffering, async H2D copies, and POSIX_FADV_DONTNEED page dropping remain in place. Validated on Spark/GB10 by rebuilding ds4_weight_server and starting a full-scope base+MTP VMM owner with a 16 GiB reserve. Direct I/O enabled for both GGUFs with 4096-byte alignment, base uploaded 80.76 GiB across 138 ranges, MTP uploaded 3.55 GiB across 8 ranges, owner reached broker-ready in 22.044s versus the previous 64-66s full-scope owner baseline, then shut down cleanly with no lingering DS4/weight-server processes and zero swap use.

The CUDA backend's fd-cache (cuda_model_range_ptr_from_fd) reads bytes from g_model_fd, which is set once via ds4_gpu_set_model_fd() for the main model. If the runtime registers a second model_map (e.g. via a separate ds4_gpu_set_model_map_range call for an auxiliary model whose weights live in a different file), subsequent weight lookups for that model would still hit the fd-cache and read bytes from the main model's fd at the wrong offsets, silently returning garbage that propagates as NaN through any kernel that consumes the result. Track which model_map owns the fd in g_model_fd_host_base, bound on the first set_model_map call after set_model_fd, and refuse the fd-cache path for any other model_map. Non-fd-owning models fall through to the cudaMemcpy path that uses (const char *)model_map + offset, the correct host pointer for any registered mmap. No behavior change for the main model.

accelerator_cache_model_tensors was gated on !e->mtp_ready, meaning neither the main model nor the MTP draft model got their tensor spans pre-staged into device memory when MTP was loaded. The first MTP draft invocation then paid a multi-second cudaMemcpy outlier copying the entire 3.6 GiB Q4_K MoE weight set synchronously, one tensor at a time. Call accelerator_cache_model_tensors for both the main and MTP models on startup when MTP is loaded. No effect when MTP is not loaded.

Remove the duplicate g_model_fd_host_base declaration left by merging upstream CUDA loader changes with the MTP prep branch's fd-cache owner guard. Keep the explanatory ownership comment on the single declaration used by the guard. Validation: Spark CUDA build failed on the duplicate declaration before this fix; the branch will be revalidated with make cuda-spark in the build matrix.

Today's vanilla-vs-best bench on PRO 6000 (sm_120, 1.6 TB/s GDDR7) and GB10 (sm_121, 273 GB/s LPDDR5X) confirmed that no single Q8_0 strategy is best across arches: arch mmq cublas warp8 PRO 6000: 1078 374 374 (mmq wins, cuBLAS dead code) GB10 Spark: 114 398 40 (cublas wins, mmq is a 3.5x regression) Picking mmq always (the prior default) regressed Spark prefill by 3.5x vs upstream antirez/main. Picking cuBLAS always sacrifices ~2.88x on PRO 6000. Picking warp8 always sacrifices both. Introduce `ds4_cuda_q8_strategy()`: queries `cudaGetDeviceProperties` once, computes memory bandwidth from `2 * memoryClockRate * memoryBusWidth / 8`, picks one of {mmq, cublas, warp8} from a tier table, and pins it for the process lifetime. Logs the choice once. `ds4_cuda_use_mmq()` becomes "strategy == MMQ" gated on lazy `ds4_mmq_init()` success (auto-downgrades to cublas on init failure). `ds4_cuda_use_cublas_q8()` is a new helper used by the cuBLAS branch of `cuda_matmul_q8_0_tensor_labeled` so warp8 is reachable when strategy says so. Env-var surface: - `DS4_CUDA_PREFILL_PATH=mmq|cublas|warp8|auto` (new explicit override) - `DS4_CUDA_USE_MMQ=0` (legacy): equivalent to PREFILL_PATH=cublas - `DS4_CUDA_MTP_VERIFIER_USE_MMQ`: unchanged Option D semantics Auto tiers (calibrated against the only two arches we've benched): - > 800 GB/s -> mmq - 200..800 -> cublas - <= 200 -> warp8

cudaDeviceProp::memoryClockRate was removed in CUDA 13; build fails with "class \"cudaDeviceProp\" has no member \"memoryClockRate\"". Query the rate via cudaDeviceGetAttribute(&v, cudaDevAttrMemoryClockRate, 0). memoryBusWidth is still in cudaDeviceProp on CUDA 13, but query it the same way for forward-compat. Add a guard: if either attribute query returns 0 (no driver value), fall back to mmq instead of mis-tiering into warp8.

Investigation across pod (sm_120, 1.8 TB/s GDDR7) and GB10 Spark (sm_121, 546 GB/s LPDDR5X) showed mmq is the empirically-fastest path on every CUDA arch we've validated, including the low-bandwidth one that the prior bandwidth-tier policy steered to cublas: arch mmq cublas warp8 sm_120: 1092 373 373 (mmq 2.9x) sm_121: 458 401 56 (mmq +14% over cublas) The cuBLAS path remains resident and is initialized at startup regardless of strategy. An undiagnosed CUDA driver-state side effect of cublasCreate gives mmq a ~4x speedup on sm_121 vs a binary that omits cublas init - confirmed via A/B (old binary at 8df4b2a: mmq ~114; new binary at 38572f9 with cublas restored: mmq ~458, both stable across 5-iter repro probes). We don't delete the cublas path partly for this reason. New policy: - default: mmq (with auto-downgrade chain mmq -> cublas -> warp8 on init failures) - DS4_CUDA_PREFILL_PATH=cublas: explicit cuBLAS path - DS4_CUDA_PREFILL_PATH=warp8: explicit native warp kernel - DS4_CUDA_USE_MMQ=0 (legacy): equivalent to PREFILL_PATH=cublas Removed the device-bandwidth tier table; we now log the bandwidth at startup for diagnostics but don't dispatch on it. Avoids miscalling a future low-bandwidth-but-mmq-friendly arch into cublas. Validated on pod (auto picks mmq -> 1092 t/s) and GB10 (auto picks mmq -> 458 t/s). All four explicit overrides honored on both.

Adds two CSV columns: - gen_tps_ss = (gen_tokens - first_call_tokens) / (t_end - t_after_first) - first_token_sec = t_after_first - gen_t0 The existing gen_tps stays as total_gen_tokens / total_decode_wall, which includes the first-token post-prefill setup (~1.0-1.3s). The new steady-state metric matches llama-benchy semantics ((N-1)/(t_last-t_first) for vanilla decode; subtracts the whole first speculative batch for MTP). Backward-compatible for csv.DictReader-based parsers.

OpenAI-compatible streaming clients that re-count tokens (e.g., llama-benchy) need the model's actual vocab token IDs to avoid re-tokenizing each SSE chunk through an unrelated HF BPE. With chunks that span sub-word fragments, per-chunk re-tokenization over-counts - especially in thinking mode where the model picks coarser vocab tokens like 'need to ' that don't exist as a single HF merge. This adds a return_token_ids boolean to /v1/chat/completions parsing (matching the field llama-benchy already sends in its payload) and plumbs model token IDs through the openai_stream emitter so each delta carries a token_ids array of exactly the IDs whose piece text falls in that delta's byte range. Hold-back semantics preserved: tokens whose byte_end exceeds the emission limit (e.g., text held back behind the </think> close-tag lookahead) stay queued in openai_stream until the next emit moves emit_pos past them. Verified locally with cc -c -Wall -Wextra; deferring full build and SSE-round-trip verification to the GB10 deployment.

llama-benchy reads token_ids from chunk['choices'][0]['token_ids'], not from inside delta (see client.py:237). vLLM emits it at the choice level; matching that convention.

Without token-boundary alignment, the </think> hold-back can land mid-token: ds4-server would emit a partial-token text fragment whose covering model token hasn't been drained from the queue yet. The delta chunk would go out without token_ids; llama-benchy then logs 'No token_ids in response, using local tokenization' and falls back to HF re-tokenization for the rest of the stream - reintroducing the exact thinking-mode over-count this commit set out to fix. openai_stream_align_limit snaps the candidate emission limit down to the largest pending byte_end that fits under the held-back limit, so every emitted chunk's text spans exactly the tokens it carries.

cuMemCreate with CU_MEM_HANDLE_TYPE_NONE works end-to-end on both PRO 6000 Blackwell (sm_120, discrete) and GB10 Spark (sm_121, integrated), per tools/ds4_vmm_probe.cu. Recommended granularity is 2 MiB on both. This is the layout that gave the ds4_weight_server its 1.7-2.0x prefill win on PRO 6000 versus the in-process cudaMalloc arena. Step 1 of 5: introduce probe state and the supported() helper. No call site yet (will land in a follow-up). Hard-gated when DS4_CUDA_WEIGHT_IPC_MANIFEST is set (sidecar would double-book the model). Soft env opt-out via DS4_CUDA_VMM_ARENA=0. Falls back to the existing cudaMalloc arena when the probe fails, so no behaviour change on hardware where VMM is missing. Compiles cleanly on sm_120; the only nvcc warning is the unused-function warning for the new helper, which goes away in step 3.

Bump-pointer allocator over CUmem ranges obtained via cuMemCreate + cuMemAddressReserve + cuMemMap, 1 GiB default chunks rounded up to the VMM granularity reported by the device (2 MiB on PRO 6000 Blackwell and GB10 Spark). Mirrors ds4_weight_server's owner-side allocation, minus the FD export -- HANDLE_TYPE_NONE because nothing leaves this process. Knob: DS4_CUDA_VMM_ARENA_CHUNK_MB (default 1024, clamped [64, 4096]). Teardown hooks into the existing cuda_model_range_release_all so model rebinds free the VMM chunks in the right order (unmap, address-free, release). Per-range device_ptr aliases are left alone -- they'll be marked arena_allocated=1 by the wiring commit so the existing release loop already skips cudaFree on them. Step 2 of 5: allocator + release. Still no call site, so still a no-op at runtime; the only nvcc warning is the unused-function warning for cuda_vmm_arena_alloc, which goes away in step 3.

Single call-site change in cuda_model_range_ptr_from_fd: prefer the new VMM arena, fall back to the existing cudaMalloc arena on any driver error. The push_back into g_model_ranges already sets arena_allocated=1, which means the existing release_all loop skips per-range cudaFree on both arena flavours -- the lifetime is owned by g_vmm_arenas / g_model_arenas, not by the individual range pointers. Effect: when DS4_CUDA_WEIGHT_IPC_MANIFEST is unset and the device supports VMM (verified on PRO 6000 Blackwell + GB10 Spark), weight spans land in 2 MiB device pages, matching the layout that gives ds4_weight_server its 1.7-2.0x prefill speedup. With the manifest set, or on a device without VMM, the cudaMalloc arena keeps running exactly as before. Step 3 of 5: wiring. ds4 + ds4-bench link clean with no warnings.

Default chunk size in Commit 2 was 1024 MiB minimum. With weight spans averaging ~530-670 MiB and bump-arena residuals too small to host a second span, this allocated one fresh 1024 MiB chunk per span and wasted ~30-50% of every chunk. For a 80 GiB model this would have pushed VMM usage to ~138 GiB, OOM-ing the 96 GiB Blackwell. Fix: default to 0 MiB minimum, so the chunk is just the request size rounded up to granularity (2 MiB on supported hardware). DS4_CUDA_VMM_ARENA_CHUNK_MB remains as an explicit minimum override for users who want to coalesce small allocations into fewer mappings. Verified on PRO 6000: 138 chunks, 80.77 GiB allocated for 80.76 GiB cached -- byte-for-byte parity with the ds4_weight_server's plan.

Update AGENT.md to reflect the new loader story: - Discrete GPUs (PRO 6000 Blackwell): in-process VMM arena now ships the same 2 MiB-page layout the weight server gave us as a sidecar. Bench A/B (16-row sweep, ctx 2048-32768): arena 1076 -> vmm 2196 at ctx=2048 (2.04x), arena 968 -> vmm 1810 at ctx=32768 (1.87x), within 0.5% of the WS ceiling on every row. - Integrated GPUs (GB10 Spark): in-process VMM is no-op-ish (-0.6% across 8 frontiers from 2048-16384), neither helping nor regressing meaningfully. Kept enabled for code-path uniformity; users can flip DS4_CUDA_VMM_ARENA=0 if profiling shows a real regression in their workload. - Weight server stays the right tool when N>=2 processes share the same model on the same card (proof harness, multi-profile sweeps, MTP correctness). Documents the new env-var surface (DS4_CUDA_VMM_ARENA, DS4_CUDA_VMM_ARENA_CHUNK_MB) and the hard gate against double-booking when DS4_CUDA_WEIGHT_IPC_MANIFEST is set.

…dispatch Brings in 14 commits since the 2026-05-15 merge-base d0357ec: CUDA correctness (high priority): - c9dd949 cuda: fix compressed prefill RoPE positions - 5bc1e6d Apply Flash graph correctness fixes Shared experts use the same swiglu_limit clamp as routed experts. Ratio-4 indexer Q / indexer compressor KV run Hadamard rotation + FP4 act-sim round trip before top-k. Adds CUDA + Metal kernels for the QAT step. Breaking signature: ds4_gpu_shared_gate_up_swiglu_q8_0_tensor gains a float clamp parameter (all callers updated atomically). Server features and protocol: - 037ee39 Ignore tool calls emitted inside thinking - 312935e Add opt-in CORS support (--cors) - f074c7b Anchor cold KV checkpoints at chat task boundary - ef0a490 Add ds4-server --chdir option - 613e9b2 Default sampling to min-p filtering (DS4_DEFAULT_MIN_P=0.05) Eval and dataset (orthogonal): - 1f6a8fe Broaden imatrix calibration prompts - e258d51 Clean up ds4-eval benchmark items - b166a73 Audit ds4-eval benchmark prompts - 2ae607f Improve ds4-eval controls and reporting - 4441e56 Auto-size ds4-eval context - 48c4d4d Add COMPSEC ds4-eval cases - 011aa67 Refine COMPSEC eval localization cases Analysis: local/docs/ds4_upstream_merge_2026-05-18.html Auto-merged cleanly across README.md, ds4.c, ds4.h, ds4_cli.c, ds4_cuda.cu, ds4_gpu.h, ds4_metal.m, ds4_server.c. Our VMM arena (ds4_cuda.cu lines 90-300, 1075-1340) is disjoint from upstream CUDA hunks (RoPE ~2344, FP8/indexer ~2416-2550, compressor ~6275-6700, dense Q8_0 ~7489+). Our SSE token_ids work (ds4_server.c) coexists with upstream's thinking/CORS/KV-anchor edits in different functions of the same file.

Upstream 5bc1e6d added a 'float clamp' parameter to ds4_gpu_shared_gate_up_swiglu_q8_0_tensor and updated three callers upstream had visibility into. Our tree carries two additional callers in metal_graph_encode_decode_ffn_half_exact and metal_graph_encode_decode_ffn_shared_post_exact (added during local Metal graph capture work) that need the same DS4_SWIGLU_CLAMP_EXP argument the upstream callers use. Same value upstream chose for the non-Metal-graph paths.

Diagnosis of the GB10 corruption recorded in commit b66b5d6 (R1). The captured routed-MoE and dense Q8_0 vec graphs run on g_moe_stream while the rest of the layer (HC, RoPE, head_rms_norm, KV store, attention, router select, shared SwiGLU, layer-end add) runs on stream=0. Two data-dependency races existed across that boundary: 1. POST-launch: captured kernels write down/mid/out buffers on g_moe_stream; the next layer's stream=0 kernels read those buffers without an explicit wait. 2. PRE-launch: captured kernels read x and selected, both produced by stream=0 kernels (router_select, previous layer's add). cudaGraphLaunch is asynchronous so g_moe_stream may start executing before stream=0 has finished writing those inputs. Empirically only fixing (1) leaves the corruption (verified by an on-pod 32-token greedy smoke with this commit's first attempt: still garbled output). Both legs are needed. Fix: a pair of reusable cuda events and two helper functions that bracket every cudaGraphLaunch: ds4_cuda_moe_stream_sync_pre(s) - record on stream=0, wait on s. Closes the input-read race. ds4_cuda_moe_stream_sync_post(s) - record on s, wait on stream=0. Closes the output-read race. Four bracketed call sites: - routed_moe replay-hit - routed_moe first-launch (after capture+instantiate) - dense Q8_0 vec replay-hit - dense Q8_0 vec first-launch (after capture+instantiate) Default remains OFF. After validation on GB10 + PRO 6000 (32-token smoke parity vs graphs-OFF + MTP acceptance recovery + bench parity), flip the static enabled=0 to 1 in a separate commit.

The cross-stream race that motivated commit b66b5d6's revert is closed by the previous commit's bidirectional ds4_cuda_moe_stream_sync_pre/_post() brackets around every cudaGraphLaunch. Validated on both target arches: PRO 6000 (sm_120): - 32-token greedy smoke: graphs ON output bit-identical to OFF - ds4-bench ctx=2048: gen 43.31 -> 44.20 t/s (+2.06%) - ds4-bench ctx=4096: gen 42.47 -> 43.45 t/s (+2.31%) GB10 (sm_121): - 32-token greedy smoke: graphs ON output bit-identical to OFF - MTP-active 64-token: clean coherent output (was garbled pre-fix) - ds4-bench ctx=2048: gen 14.10 -> 14.84 t/s (+5.2%) - ds4-bench ctx=4096: gen 13.95 -> 14.73 t/s (+5.6%) Changes: - ds4_cuda_moe_graphs_enabled() default: enabled = 0 -> enabled = 1. - Env-var semantics inverted back to opt-out: DS4_CUDA_MOE_GRAPHS=0 (or off/no/false) now disables. Anyone who was setting =1 to opt-in will silently no-op (graphs are already on by default). The perf gain is smaller than the pre-revert benchmark (which showed +9.36%) because the new cudaEventRecord+cudaStreamWaitEvent brackets serialize stream=0 with g_moe_stream, trading some parallelism for the correctness fix. Still strictly positive on both targets.

Upstream's convention is to keep extra committed docs under misc/ (.gitignored by default; force-add specific files like ANTHROPIC_LIVE_CONTINUATION.md and RESPONSE_API.md when they're worth keeping in-tree). We had been stashing ours under a new docs/ directory that upstream doesn't have. Move: docs/cuda-mtp/README.md -> misc/cuda-mtp/README.md docs/proof-harness/README.md -> misc/proof-harness/README.md Update the one in-tree reference (README.md) and the new misc/ proof-harness entry it now picks up. AGENT.md slim-down to follow.

AGENT.md had grown to 286 lines (~5.4x upstream's 53) by accreting operator-manual content as the CUDA backend matured. Upstream's AGENT.md is terse policy-and-pointers prose: no tables, no code blocks, no benchmark numbers, no commit hashes. Bring our version back to that tone: AGENT.md 286 -> 68 lines (15 net over upstream, all in Layout + one short Testing paragraph for the weight-server pattern) The dropped material consolidates into a new force-added reference file: misc/cuda-env-vars.md (new, 93 lines) - Q8_0 dispatcher table (mmq / cublas / warp8) - full env-var inventory with intent behind each default - in-process VMM weight arena env vars - one pointer each to misc/proof-harness/README.md and misc/cuda-mtp/README.md for content that lives there Stale content dropped during the move: - 'DS4_CUDA_MOE_GRAPHS default OFF' rationale paragraph (R1, since resolved by commits 687c783 + 7967154) - companion-repo cross-refs to local/docs/*.html files (private to the auto-round repo; not reachable from a ds4 checkout) - per-arch prefill benchmark tables (point-in-time snapshots that age out of correctness; the live numbers belong in the optimization report HTMLs, not in AGENT.md) - '~60 s base upload' figure (the storage-dependent range is 20-70 s; full discussion in misc/proof-harness/README.md) No code paths change; AGENT.md and misc/ are docs-only.

Three lines still framed ds4 as Metal-only when both backends are now real parallel implementations: the production-path goal, the smoke-test safety guidance, and the testing prerequisite. Update each to name both: Goals: 'whole-model Metal graph inference' -> 'whole-model GPU graph inference (Metal on macOS, CUDA on Linux)' Safety: 'Prefer short Metal smoke tests for build verification' -> 'Prefer short GPU smoke tests for build verification (Metal on macOS, CUDA on Linux)' Testing: 'when a model and Metal are available' -> 'when a model and a GPU backend are available' The Objective-C-only-where-Metal-requires-it line in the intro is preserved verbatim: that's a code-style policy, not a backend-coverage claim.

Reason antirez#6 in the Motivations section said the KV cache is compressed enough for long-context inference + on-disk persistence. True but understated: the same compression is what keeps the model's speed from collapsing at deep contexts the way similarly-sized recent peers (MiMo-V2.5 and friends) do beyond ~100k tokens. Add one sentence after the existing bullet making that speed-durability claim explicit, in the section's first-person-plural voice. Qualitative claim, no specific benchmark cited.

CONTRIBUTING.md asks for a clean make cpu build. Four static MTP helpers in ds4.c were defined unconditionally but called only from inside #ifndef DS4_NO_GPU regions (the CPU build stubs out ds4_session_eval_speculative_argmax to an empty body, so none of the callers compile in CPU mode). Result: four -Wunused-function warnings on every make cpu invocation. Wrap the four helper definitions (ds4_mtp_accept_trace, ds4_mtp_accept_gate_should_skip, ds4_mtp_exact_policy_use_seq, ds4_mtp_accept_gate_record) in the same #ifndef DS4_NO_GPU as their callers. make cpu and make (Metal) both build warning-free; CUDA builds compile the helpers as before because DS4_NO_GPU isn't defined.

… README The in-process VMM weight arena introduces a small, deterministic FP32 reduction-order drift relative to the cudaMalloc-backed arena when matmul kernels read identical weight bytes through 2 MiB-page- aligned per-tensor VA ranges instead of packed sub-granularity allocations. Investigation chain (all ruled out and documented in the commit notes): - Uploaded bytes are identical (checksum) - Kernels do not read OOB (poison-fill test) - Not cross-stream coherency (cudaDeviceSynchronize doesn't help) - Not the warp8 load_i8x4_i32_aligned helper alone (mmq also drifts) The most plausible mechanism is L2-cache-hit / tile-arrival-order sensitivity in the kernels' parallel reduction across CTAs (FP32 non-associativity). Magnitude ~0.08 logprob units; bounded to tight-margin token choices. One of four ./ds4_test --logprob-vectors official-vector cases flips a textually-equivalent alternative. Document this in misc/cuda-env-vars.md alongside the existing mmq-vs-warp8 drift (DS4_CUDA_MTP_VERIFIER_USE_MMQ / Option D). Same shape of trade; same workaround pattern: DS4_CUDA_VMM_ARENA=0 for users who need exact official-vector byte equivalence, default ON otherwise for the ~2x prefill win. Also add misc/cuda-env-vars.md to README's More Documentation list (was missing; was discoverable only via in-tree cross-reference).

The chunk-size bisect during the drift investigation updated our understanding of why the VMM arena is fast. The 'VMM gives 2 MiB pages' framing is incomplete: VMM with one large 1792 MiB chunk performs identically to cudaMalloc (~1080 t/s prefill on PRO 6000), even though the memory is still 2 MiB-paged. The actual differentiator is per-tensor 2 MiB-aligned base addresses: when each weight tensor sits at its own fresh cuMemAddressReserve-handed VA, the matmul kernels' tile-load coalescing and L2 spatial-locality improve enough to roughly double prefill on discrete GPUs. Pack the same memory into one big chunk and the bases land at sub-granularity offsets - perf advantage disappears. Same root cause as the reduction-order drift: the cache/tile- arrival-order change that gives the perf also changes the order partial sums reach the FP32 accumulator. Add a 'Why per-tensor chunks specifically' subsection explaining this, then link it explicitly to the drift section. Drop the redundant 'plausible mechanism' paragraph from the drift discussion.

Frontier sweeps run with ds4-bench: ./ds4-bench --cuda \ -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \ --prompt-file speed-bench/promessi_sposi.txt \ --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 \ --csv speed-bench/<arch>.csv Both runs use the current branch's post-merge defaults: mmq dispatch, mmvq decode, CUDA graphs with bidirectional stream sync, in-process VMM weight arena. q2-imatrix variant on both targets. CSVs use the extended schema ds4_bench.c emits today (8 columns): ctx_tokens, prefill_tokens, prefill_tps, gen_tokens, gen_tps, gen_tps_ss, first_token_sec, kvcache_bytes. plot_speed.py is already schema-tolerant (DictReader with subset semantics on its required columns) so it reads either the upstream 6-column files or these 8-column files without modification. PRO 6000 long-prompt headline: 2192 -> 1508 t/s prefill across ctx 2k -> 65k; gen 43.1 -> 35.3 t/s. GB10 long-prompt headline: 465 -> 308 t/s prefill across the same sweep; gen 14.2 -> 11.7 t/s.

make cuda-regression was broken by our work because the test target links only ds4_cuda.o and not the cuda/mmq/*.o adapter objects. ds4_cuda.o calls into ds4_mmq_init / ds4_mmq_q8_0_dense / ds4_mmq_q4_K_moe* / etc, which now live in libraries we vendored under cuda/mmq/. Without the link, the smoke test fails with 'undefined reference to ds4_mmq_init' and the cuda-regression CONTRIBUTING.md check can't run. Add $(MMQ_OBJS) to the cuda_long_context_smoke link line. Verified on pod: make cuda-regression now passes 'cuda long-context regression: OK'

…_ids Three updates that round out the README to match what the engine now exposes: - Speed table: broaden the 'Metal CLI numbers' footnote to 'CLI numbers' since both Mac and NVIDIA targets are now represented; add RTX PRO 6000 Blackwell short + long-prompt rows and refresh the DGX Spark GB10 row to a long-prompt run matching the new PRO 6000 entry. Numbers from the in-tree speed-bench sweeps, captured with the q2-imatrix variant on both CUDA targets for apples-to-apples. - Benchmarking: describe the steady-state gen_tps_ss column we added to ds4-bench (excludes first-token amortisation, useful for apples-to-apples decode comparisons across short and long generations). Point at the newly committed speed-bench/pro6000_blackwell_ts.svg and gb10_spark_ts.svg. - Server SSE: document return_token_ids on chat-completion SSE (per-token IDs at the choice level, matching vLLM/llama-benchy wire shape; emission snaps to token boundaries to keep IDs in sync with text).

Until antirez/ds4#187 lands upstream, the one-shot installer pulls the CUDA performance work (mmq dispatch + in-process VMM weight arena + stream-sync'd CUDA graphs) from our PR branch. Revert the two DS4_REPO / DS4_REF defaults to antirez/ds4 + main once merged. Fast-forward path now repoints `origin` in place when the existing clone tracks a different URL, so users who already ran the installer against antirez/ds4 main get a clean repoint instead of a fetch error.

Entrpi added 30 commits May 14, 2026 14:48

chore: ignore local workspace notes

95ff49d

proof: generalize engine proof runner

08d8fa1

proof: add CUDA weight server lifecycle

f2f424e

proof: support scoped weight server imports

76cbc9a

proof: record weight server telemetry

5a6d3ed

proof: guard weight server parent lifetime

0b1370d

proof: lock CUDA weight server ownership

7268f68

proof: reject stale weight manifests

cda4b4d

proof: report weight server validation verdict

7536533

proof: add weight server harness smoke test

0acb50e

proof: default shared weights to base without mtp

dbeeec9

cuda: add MTP top2 verifier paths

ec2dd7c

mtp: reduce exact verifier output scheduling

2141a37

cuda: fuse MTP verifier MoE down sum for two tokens

6b9fe82

cuda: batch Q8 pair projections for MTP verifier

3cad23c

cuda: pair decode Q and KV projections

c5a0c36

docs: add CUDA MTP enablement guide

c39ce1e

docs: add CUDA MTP profiling breakdown

d1d5a8b

cuda: write attention output A directly to low layout

143623f

cuda: add opt-in N2 Q8 attention output B path

de5bde3

Entrpi added 26 commits May 16, 2026 16:43

server: place token_ids at choice level to match vLLM/llama-benchy

4698319

llama-benchy reads token_ids from chunk['choices'][0]['token_ids'], not from inside delta (see client.py:237). vLLM emits it at the choice level; matching that convention.

Entrpi marked this pull request as ready for review May 18, 2026 08:23

Entrpi mentioned this pull request May 19, 2026

Model request: DeepSeek V4 Series ggml-org/llama.cpp#22319

Open

tnhnyc mentioned this pull request May 19, 2026

bench: add MTP options to ds4-bench #205

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast CUDA backend: vendored mmq matmul, VMM weight layout, speculative-decoding plumbing#187

Fast CUDA backend: vendored mmq matmul, VMM weight layout, speculative-decoding plumbing#187
Entrpi wants to merge 132 commits into
antirez:mainfrom
Entrpi:pr-prep-2026-05-18

Entrpi commented May 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Entrpi commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Entrpi commented May 18, 2026 •

edited

Loading