Skip to content

Fast CUDA backend: vendored mmq matmul, VMM weight layout, speculative-decoding plumbing#187

Open
Entrpi wants to merge 132 commits into
antirez:mainfrom
Entrpi:pr-prep-2026-05-18
Open

Fast CUDA backend: vendored mmq matmul, VMM weight layout, speculative-decoding plumbing#187
Entrpi wants to merge 132 commits into
antirez:mainfrom
Entrpi:pr-prep-2026-05-18

Conversation

@Entrpi
Copy link
Copy Markdown

@Entrpi Entrpi commented May 18, 2026

This PR is a follow-up from the initial PR cluster I submitted a few days ago, with continuing work producing more significant performance gains.

What this branch does.
Screen Shot 2026-05-18 at 18 15 12 PM

PRO 6000 Blackwell, V4 Flash, ctx=2048: upstream prefills at ~373 tok/s; this branch does ~2190, i.e. 5.9×. Generation: 38 → 44 tok/s (+16%). GB10 Spark: 401 → 461 prefill (+15%); integrated LPDDR5X is bandwidth-limited so most of the wins don't translate there. Reproduces with ds4-bench on both machines; CSVs and SVGs in speed-bench/.

The speed comes from two independent changes that compound:

1. Faster matmul. Upstream dispatches every Q8_0 dense matmul through a Q8→FP16 expansion cache plus cublasGemmEx. This branch vendors llama.cpp's mmq/mmvq family verbatim into cuda/mmq/ (pinned to 5c0e9468) with a ds4-side adapter, dispatcher, and CPU-reference parity harness on top. mmq handles prefill; mmvq handles single-token decode. cuBLAS stays as a fallback. Licensing is clean: both projects are MIT and LICENSE already credits both author lines.

2. Better weight layout. cudaMalloc packs all ~80 GiB of weights into one region; each tensor lands at an arbitrary internal offset. Switching to per-tensor allocation via the CUDA Driver VMM API (cuMemCreate + cuMemAddressReserve) gives each tensor its own 2 MiB-aligned virtual address... and roughly doubles discrete-GPU prefill on its own. The bisect that found this is informative: VMM with one big chunk performs identically to cudaMalloc. So the win is specifically per-tensor 2 MiB-aligned bases, not 2 MiB pages; same memory either way, but tile-load coalescing and L2 spatial locality care which addresses the tensors land at. A small, deterministic FP32 reduction-order drift comes with this (one of four logprob test vectors flips to a textually-equivalent alternative); documented in misc/cuda-env-vars.md, opt-out via DS4_CUDA_VMM_ARENA=0.

Stack: 2.93× from mmq, 2.04× from VMM layout, 5.88× combined.

Why we can land this confidently. tests/ds4_proof.py is a proof harness modeled as profile × suite × prompt × budget × contract: boots ds4 in any two configurations, feeds them identical prompts, verifies byte-identical generated tokens. Every default-on change here was gated on it.

Weight-server sidecar. For N concurrent ds4 processes against the same model, each paying 20 to 70 seconds of weight upload is painful. tools/ds4_weight_server.cu is a small CUDA process that owns the weights once and shares VMM ranges with workers over a Unix-domain socket (SCM_RIGHTS). Auto-disabled when not in use; single-process users never notice it.

CUDA speculative decoding (MTP): plumbed, but experimental. The CUDA backend had no exact speculative-decoding path before; this branch lands the full pipeline behind 14 purpose-built CUDA kernels. It proves byte-equivalent. But throughput is currently neutral-to-negative on the configurations measured (GB10 draft=2: 12.62 vs 12.68 gen tok/s); the setup tax doesn't amortize at gen=128. Ships behind opt-in env flags; main/default config stays best-nomtp. The exact verifier mostly earns its keep here as a correctness floor for future MTP work. Path to net-positive: deeper draft, lower verifier scheduling overhead, longer generation regimes.

Suggested split if you'd prefer. All logically independent and revertible:

  1. cuda/mmq/ vendor + adapter + parity harness.
  2. Dispatcher routes Q8_0 / Q4_K / IQ2_XXS / Q2_K through mmq.
  3. In-process VMM arena.
  4. ds4_weight_server + import API.
  5. tests/ds4_proof.py + docs.
  6. MTP exact verifier path (opt-in).
  7. Option D: MTP-verifier kernel routing onto legacy.
  8. Bench/server quality-of-life: steady-state gen_tps, token_ids in SSE, no-MTP baselines.

(1) and (2) stay together; mmq is dead code without the dispatcher. Everything else can be sequenced.

Full annotated state with charts and per-commit indexing: https://entrpi.github.io/misc/ds4/ds4_fork_state_2026-05-18.html


Testing per CONTRIBUTING.md.

Default build and correctness:

  • make clean && make: green on macOS (Metal) and Linux (CUDA).
  • make cpu: green; CPU path is reference-only.
  • make test (i.e. ./ds4_test --all): run on both CUDA targets (PRO 6000 Blackwell sm_120, GB10 Spark sm_121). All 8 --logprob-vectors failures categorized:
    • 1 attributable to the VMM arena: short_code_completion step 1 flips to a textually-equivalent alternative on the c language tag after triple-backticks. Documented in misc/cuda-env-vars.md; opt-out via DS4_CUDA_VMM_ARENA=0 restores byte-equivalence at ~50% prefill cost.
    • 7 long_memory_archive failures verified pre-existing by building at upstream merge-base c9dd9499 with none of this branch applied: same 7 failures, same pattern. Most likely q2-quant noise on a 16k retrieval task; not introduced here.

CUDA regression:

  • make cuda-regression: green on both targets. (Fixed a missing $(MMQ_OBJS) link in the tests/cuda_long_context_smoke target as part of this branch.)

Speed regressions:

  • ds4-bench frontier sweeps on both CUDA targets; CSVs and SVGs committed under speed-bench/: pro6000_blackwell.csv / _ts.svg, gb10_spark.csv / _ts.svg. Both prefill_tps and gen_tps reported per CONTRIBUTING.md. PRO 6000 ctx=2048: 1078.86 → 2193.29 prefill, 38.0 → 44.2 gen. GB10 is flat on VMM (integrated LPDDR5X, as expected); GB10 gen is also flat within run-to-run noise vs upstream, since LPDDR5X bandwidth caps decode regardless of dispatch path.

Quality scorer: not run; no quantization changes in this PR.

Entrpi added 30 commits May 14, 2026 14:48
The VMM sharing path needs an explicit capability checkpoint before any ownership or import lifetime code depends on it. Add the build and dry-run plumbing needed to establish that CUDA Driver VMM is available on the target without changing the default IPC backend.

Link CUDA builds with libcuda, add --backend ipc|vmm with ipc as the default, query VMM/POSIX-FD/UVA device attributes, and report VMM allocation granularity and logical-vs-rounded allocation size during dry-run planning.

The non-dry-run VMM path is intentionally guarded in this checkpoint so a run cannot claim VMM semantics while still producing legacy IPC allocations. The existing IPC path remains the default behavior.

Validated on Spark/GB10 by rebuilding ds4 and ds4_weight_server, running an MTP-only --backend vmm --dry-run with a 32 GiB reserve, observing vmm=1 posix_fd=1 uva=1 and 2 MiB granularity, confirming the 3.55 GiB MTP allocation plan, and verifying no lingering DS4/weight-server processes or swap use afterward.
Move the isolated weight owner from VMM capability probing to actual CUDA Driver VMM allocation ownership. This avoids spending effort scaling the legacy full-base raw IPC owner and instead exercises the allocation model intended for imported worker mappings.

Owned ranges now carry backend-specific state. The VMM backend allocates with cuMemCreate, reserves and maps owner VA with cuMemAddressReserve and cuMemMap, grants owner access with cuMemSetAccess, and reuses the pinned staged GGUF upload path against the mapped VA.

The server emits DS4_WEIGHT_SERVER_VMM_V1 alloc records containing logical bytes and granularity-rounded allocation bytes. Cleanup releases VMM mappings with cuMemUnmap, cuMemAddressFree, and cuMemRelease. The IPC backend remains the default and keeps its existing manifest format.

Validated on Spark/GB10 by rebuilding ds4_weight_server, starting an MTP-only --backend vmm owner with a 32 GiB reserve, observing an 8-record DS4_WEIGHT_SERVER_VMM_V1 manifest, uploading 3.55 GiB across 8 MTP ranges, terminating with SIGTERM, seeing the shutdown path run, and confirming no lingering DS4/weight-server processes or swap use afterward.
VMM manifests can describe exported allocations but cannot directly contain live POSIX file descriptors. Add an owner-mediated Unix-domain socket broker so short-lived worker processes can obtain importable handles from the long-lived weight owner.

VMM-owned ranges are exported with cuMemExportToShareableHandle. The VMM manifest now records a broker socket path, and the server listens on an AF_UNIX stream socket. Clients send GET <alloc-id>; the broker replies with status text and transfers the exported FD with SCM_RIGHTS.

Shutdown closes the broker socket, unlinks the socket path, closes exported FDs, releases VMM allocations, and reports the number of broker requests served.

Validated on Spark/GB10 by rebuilding ds4_weight_server, starting an MTP-only --backend vmm owner, confirming the VMM manifest advertised a broker path and 8 MTP alloc records, requesting GET 0 from a Python Unix-socket client, receiving one FD with OK 0 144703488 via SCM_RIGHTS, observing broker served alloc=0 requests=1 in the server log, and confirming socket cleanup, no lingering DS4/weight-server processes, and zero swap use after shutdown.
Imported VMM allocations need to look like ordinary cached model ranges to the existing CUDA kernels. Add the worker-side half of the VMM sharing path so process-isolated DS4 workers can consume allocations owned by ds4_weight_server without changing tensor call sites.

The importer now recognizes DS4_WEIGHT_SERVER_VMM_V1 manifests, reads the broker socket and alloc records, requests each allocation by id, receives the exported POSIX FD with SCM_RIGHTS, imports it with cuMemImportFromShareableHandle, maps it into worker VA with cuMemAddressReserve and cuMemMap, grants device read access with cuMemSetAccess, and registers the logical span in g_model_ranges.

Model range cleanup now has a VMM-owned branch that unmaps worker VA, frees the reserved address range, and releases the imported generic allocation handle. Existing IPC manifests remain supported by the same public import entry point.

Validated on Spark/GB10 by rebuilding ds4 and ds4_weight_server, starting an MTP-only --backend vmm owner, running a direct CUDA worker with DS4_CUDA_WEIGHT_IPC_MANIFEST and DS4_CUDA_WEIGHT_IPC_SCOPE=mtp, observing 'CUDA imported shared VMM weight cache for mtp: 3.55 GiB across 8 ranges', seeing the broker serve all 8 allocation FDs, generating output successfully, and confirming no lingering DS4/weight-server processes and zero swap use after shutdown.
The proof harness needs to own VMM runs with the same rigor as the legacy IPC path before longer MTP comparisons can rely on it. Add first-class VMM backend selection and validation so a JSON report can distinguish a successful VMM import path from a server that merely started.

The runner now accepts --weight-server-backend ipc|vmm and passes it through both dry-run preflight and persistent ds4_weight_server startup. VMM manifests are validated with DS4_WEIGHT_SERVER_VMM_V1, broker, and alloc records, while existing IPC manifests remain supported.

Weight-server telemetry parsing now records VMM support flags, granularity, logical and allocated VMM plans, backend allocation totals, broker socket path, and broker request counts. The validation verdict checks backend matching, VMM preflight support, VMM backend telemetry, broker listening, broker requests, and per-model VMM plans.

Validated on Spark/GB10 with an MTP-only proof run using --start-weight-server --weight-server-scope mtp --weight-server-backend vmm. The run completed with failures=0, weight_server_validation.passed=true, preflight_vmm_supported=true, vmm_backend_telemetry=true, vmm_broker_listening=true, vmm_broker_requests=true, vmm_plan_mtp=true, broker_requests=8, generated worker output, clean owner termination, no lingering DS4/weight-server processes, and zero swap use.
The VMM backend is now the primary Spark/GB10 path for shared MTP raw tensor spans, so the proof documentation needs to present that workflow directly instead of describing only the legacy IPC owner.

Update the proof harness README to describe the two weight-server backends, VMM broker semantics, VMM validation checks, and manual VMM startup/import arguments. Update the CUDA MTP README so multi-profile proof examples use --weight-server-backend vmm with MTP scope and explain that validation now covers backend and broker activity.

Validated by running local Python compile checks for the proof harness and by completing the Spark VMM MTP workflow proof before documenting the commands and observed behavior: failures=0, exact-byte profile comparisons passed for two prompts, weight_server_validation.passed=true, broker_requests=32, owner cleanup terminated, no lingering DS4/weight-server processes, and zero swap use.
Full-scope VMM ownership is only useful if the owner uses the same loading-speed assumptions as the CUDA runtime. The standalone owner was still reading model bytes through normal pread into pinned buffers, which made base+MTP startup substantially slower than the main CUDA loading path.

Port the CUDA loader's direct-I/O staging approach into ds4_weight_server. The server now reopens model files through /proc/self/fd with O_DIRECT when available, tracks filesystem alignment, over-allocates and aligns pinned staging buffers, reads aligned windows for direct I/O, and falls back to normal pread if direct reads are unavailable. The existing four-stage pinned buffering, async H2D copies, and POSIX_FADV_DONTNEED page dropping remain in place.

Validated on Spark/GB10 by rebuilding ds4_weight_server and starting a full-scope base+MTP VMM owner with a 16 GiB reserve. Direct I/O enabled for both GGUFs with 4096-byte alignment, base uploaded 80.76 GiB across 138 ranges, MTP uploaded 3.55 GiB across 8 ranges, owner reached broker-ready in 22.044s versus the previous 64-66s full-scope owner baseline, then shut down cleanly with no lingering DS4/weight-server processes and zero swap use.
The CUDA backend's fd-cache (cuda_model_range_ptr_from_fd) reads bytes
from g_model_fd, which is set once via ds4_gpu_set_model_fd() for the
main model.  If the runtime registers a second model_map (e.g. via a
separate ds4_gpu_set_model_map_range call for an auxiliary model whose
weights live in a different file), subsequent weight lookups for that
model would still hit the fd-cache and read bytes from the main model's
fd at the wrong offsets, silently returning garbage that propagates as
NaN through any kernel that consumes the result.

Track which model_map owns the fd in g_model_fd_host_base, bound on the
first set_model_map call after set_model_fd, and refuse the fd-cache
path for any other model_map.  Non-fd-owning models fall through to the
cudaMemcpy path that uses (const char *)model_map + offset, the correct
host pointer for any registered mmap.

No behavior change for the main model.
accelerator_cache_model_tensors was gated on !e->mtp_ready, meaning
neither the main model nor the MTP draft model got their tensor spans
pre-staged into device memory when MTP was loaded.  The first MTP draft
invocation then paid a multi-second cudaMemcpy outlier copying the
entire 3.6 GiB Q4_K MoE weight set synchronously, one tensor at a time.

Call accelerator_cache_model_tensors for both the main and MTP models
on startup when MTP is loaded.  No effect when MTP is not loaded.
Remove the duplicate g_model_fd_host_base declaration left by merging upstream CUDA loader changes with the MTP prep branch's fd-cache owner guard. Keep the explanatory ownership comment on the single declaration used by the guard.

Validation: Spark CUDA build failed on the duplicate declaration before this fix; the branch will be revalidated with make cuda-spark in the build matrix.
Entrpi added 26 commits May 16, 2026 16:43
Today's vanilla-vs-best bench on PRO 6000 (sm_120, 1.6 TB/s GDDR7) and
GB10 (sm_121, 273 GB/s LPDDR5X) confirmed that no single Q8_0 strategy
is best across arches:

  arch           mmq   cublas  warp8
  PRO 6000:    1078     374    374   (mmq wins, cuBLAS dead code)
  GB10 Spark:   114     398     40   (cublas wins, mmq is a 3.5x regression)

Picking mmq always (the prior default) regressed Spark prefill by 3.5x
vs upstream antirez/main.  Picking cuBLAS always sacrifices ~2.88x on
PRO 6000.  Picking warp8 always sacrifices both.

Introduce `ds4_cuda_q8_strategy()`: queries
`cudaGetDeviceProperties` once, computes memory bandwidth from
`2 * memoryClockRate * memoryBusWidth / 8`, picks one of {mmq, cublas,
warp8} from a tier table, and pins it for the process lifetime.  Logs
the choice once.

`ds4_cuda_use_mmq()` becomes "strategy == MMQ" gated on lazy
`ds4_mmq_init()` success (auto-downgrades to cublas on init failure).
`ds4_cuda_use_cublas_q8()` is a new helper used by the cuBLAS branch
of `cuda_matmul_q8_0_tensor_labeled` so warp8 is reachable when
strategy says so.

Env-var surface:
- `DS4_CUDA_PREFILL_PATH=mmq|cublas|warp8|auto` (new explicit override)
- `DS4_CUDA_USE_MMQ=0` (legacy): equivalent to PREFILL_PATH=cublas
- `DS4_CUDA_MTP_VERIFIER_USE_MMQ`: unchanged Option D semantics

Auto tiers (calibrated against the only two arches we've benched):
- > 800 GB/s -> mmq
- 200..800   -> cublas
- <= 200     -> warp8
cudaDeviceProp::memoryClockRate was removed in CUDA 13; build fails
with "class \"cudaDeviceProp\" has no member \"memoryClockRate\"".

Query the rate via cudaDeviceGetAttribute(&v, cudaDevAttrMemoryClockRate, 0).
memoryBusWidth is still in cudaDeviceProp on CUDA 13, but query it the
same way for forward-compat.  Add a guard: if either attribute query
returns 0 (no driver value), fall back to mmq instead of mis-tiering
into warp8.
Investigation across pod (sm_120, 1.8 TB/s GDDR7) and GB10 Spark
(sm_121, 546 GB/s LPDDR5X) showed mmq is the empirically-fastest path
on every CUDA arch we've validated, including the low-bandwidth one
that the prior bandwidth-tier policy steered to cublas:

  arch       mmq      cublas   warp8
  sm_120:    1092     373      373    (mmq 2.9x)
  sm_121:    458      401      56     (mmq +14% over cublas)

The cuBLAS path remains resident and is initialized at startup
regardless of strategy.  An undiagnosed CUDA driver-state side effect
of cublasCreate gives mmq a ~4x speedup on sm_121 vs a binary that
omits cublas init - confirmed via A/B (old binary at 8df4b2a: mmq
~114; new binary at 38572f9 with cublas restored: mmq ~458, both
stable across 5-iter repro probes).  We don't delete the cublas path
partly for this reason.

New policy:
- default: mmq (with auto-downgrade chain mmq -> cublas -> warp8 on
  init failures)
- DS4_CUDA_PREFILL_PATH=cublas: explicit cuBLAS path
- DS4_CUDA_PREFILL_PATH=warp8:  explicit native warp kernel
- DS4_CUDA_USE_MMQ=0 (legacy):  equivalent to PREFILL_PATH=cublas

Removed the device-bandwidth tier table; we now log the bandwidth at
startup for diagnostics but don't dispatch on it.  Avoids miscalling
a future low-bandwidth-but-mmq-friendly arch into cublas.

Validated on pod (auto picks mmq -> 1092 t/s) and GB10 (auto picks
mmq -> 458 t/s).  All four explicit overrides honored on both.
Adds two CSV columns:
- gen_tps_ss = (gen_tokens - first_call_tokens) / (t_end - t_after_first)
- first_token_sec = t_after_first - gen_t0

The existing gen_tps stays as total_gen_tokens / total_decode_wall, which
includes the first-token post-prefill setup (~1.0-1.3s). The new
steady-state metric matches llama-benchy semantics ((N-1)/(t_last-t_first)
for vanilla decode; subtracts the whole first speculative batch for MTP).

Backward-compatible for csv.DictReader-based parsers.
OpenAI-compatible streaming clients that re-count tokens (e.g.,
llama-benchy) need the model's actual vocab token IDs to avoid
re-tokenizing each SSE chunk through an unrelated HF BPE. With chunks
that span sub-word fragments, per-chunk re-tokenization over-counts -
especially in thinking mode where the model picks coarser vocab tokens
like 'need to ' that don't exist as a single HF merge.

This adds a return_token_ids boolean to /v1/chat/completions parsing
(matching the field llama-benchy already sends in its payload) and
plumbs model token IDs through the openai_stream emitter so each
delta carries a token_ids array of exactly the IDs whose piece text
falls in that delta's byte range.

Hold-back semantics preserved: tokens whose byte_end exceeds the
emission limit (e.g., text held back behind the </think> close-tag
lookahead) stay queued in openai_stream until the next emit moves
emit_pos past them.

Verified locally with cc -c -Wall -Wextra; deferring full build and
SSE-round-trip verification to the GB10 deployment.
llama-benchy reads token_ids from chunk['choices'][0]['token_ids'],
not from inside delta (see client.py:237). vLLM emits it at the
choice level; matching that convention.
Without token-boundary alignment, the </think> hold-back can land
mid-token: ds4-server would emit a partial-token text fragment whose
covering model token hasn't been drained from the queue yet. The
delta chunk would go out without token_ids; llama-benchy then logs
'No token_ids in response, using local tokenization' and falls back
to HF re-tokenization for the rest of the stream - reintroducing the
exact thinking-mode over-count this commit set out to fix.

openai_stream_align_limit snaps the candidate emission limit down to
the largest pending byte_end that fits under the held-back limit, so
every emitted chunk's text spans exactly the tokens it carries.
cuMemCreate with CU_MEM_HANDLE_TYPE_NONE works end-to-end on both PRO 6000
Blackwell (sm_120, discrete) and GB10 Spark (sm_121, integrated), per
tools/ds4_vmm_probe.cu. Recommended granularity is 2 MiB on both. This is
the layout that gave the ds4_weight_server its 1.7-2.0x prefill win on
PRO 6000 versus the in-process cudaMalloc arena.

Step 1 of 5: introduce probe state and the supported() helper. No call
site yet (will land in a follow-up). Hard-gated when DS4_CUDA_WEIGHT_IPC_MANIFEST
is set (sidecar would double-book the model). Soft env opt-out via
DS4_CUDA_VMM_ARENA=0. Falls back to the existing cudaMalloc arena when
the probe fails, so no behaviour change on hardware where VMM is missing.

Compiles cleanly on sm_120; the only nvcc warning is the unused-function
warning for the new helper, which goes away in step 3.
Bump-pointer allocator over CUmem ranges obtained via cuMemCreate +
cuMemAddressReserve + cuMemMap, 1 GiB default chunks rounded up to the
VMM granularity reported by the device (2 MiB on PRO 6000 Blackwell and
GB10 Spark). Mirrors ds4_weight_server's owner-side allocation, minus
the FD export -- HANDLE_TYPE_NONE because nothing leaves this process.

Knob: DS4_CUDA_VMM_ARENA_CHUNK_MB (default 1024, clamped [64, 4096]).

Teardown hooks into the existing cuda_model_range_release_all so model
rebinds free the VMM chunks in the right order (unmap, address-free,
release). Per-range device_ptr aliases are left alone -- they'll be
marked arena_allocated=1 by the wiring commit so the existing release
loop already skips cudaFree on them.

Step 2 of 5: allocator + release. Still no call site, so still a no-op
at runtime; the only nvcc warning is the unused-function warning for
cuda_vmm_arena_alloc, which goes away in step 3.
Single call-site change in cuda_model_range_ptr_from_fd: prefer the new
VMM arena, fall back to the existing cudaMalloc arena on any driver
error. The push_back into g_model_ranges already sets arena_allocated=1,
which means the existing release_all loop skips per-range cudaFree on
both arena flavours -- the lifetime is owned by g_vmm_arenas /
g_model_arenas, not by the individual range pointers.

Effect: when DS4_CUDA_WEIGHT_IPC_MANIFEST is unset and the device
supports VMM (verified on PRO 6000 Blackwell + GB10 Spark), weight
spans land in 2 MiB device pages, matching the layout that gives
ds4_weight_server its 1.7-2.0x prefill speedup. With the manifest set,
or on a device without VMM, the cudaMalloc arena keeps running exactly
as before.

Step 3 of 5: wiring. ds4 + ds4-bench link clean with no warnings.
Default chunk size in Commit 2 was 1024 MiB minimum. With weight spans
averaging ~530-670 MiB and bump-arena residuals too small to host a
second span, this allocated one fresh 1024 MiB chunk per span and wasted
~30-50% of every chunk. For a 80 GiB model this would have pushed VMM
usage to ~138 GiB, OOM-ing the 96 GiB Blackwell.

Fix: default to 0 MiB minimum, so the chunk is just the request size
rounded up to granularity (2 MiB on supported hardware). DS4_CUDA_VMM_ARENA_CHUNK_MB
remains as an explicit minimum override for users who want to coalesce
small allocations into fewer mappings.

Verified on PRO 6000: 138 chunks, 80.77 GiB allocated for 80.76 GiB
cached -- byte-for-byte parity with the ds4_weight_server's plan.
Update AGENT.md to reflect the new loader story:

- Discrete GPUs (PRO 6000 Blackwell): in-process VMM arena now ships
  the same 2 MiB-page layout the weight server gave us as a sidecar.
  Bench A/B (16-row sweep, ctx 2048-32768): arena 1076 -> vmm 2196 at
  ctx=2048 (2.04x), arena 968 -> vmm 1810 at ctx=32768 (1.87x), within
  0.5% of the WS ceiling on every row.

- Integrated GPUs (GB10 Spark): in-process VMM is no-op-ish (-0.6%
  across 8 frontiers from 2048-16384), neither helping nor regressing
  meaningfully. Kept enabled for code-path uniformity; users can flip
  DS4_CUDA_VMM_ARENA=0 if profiling shows a real regression in their
  workload.

- Weight server stays the right tool when N>=2 processes share the
  same model on the same card (proof harness, multi-profile sweeps,
  MTP correctness).

Documents the new env-var surface (DS4_CUDA_VMM_ARENA,
DS4_CUDA_VMM_ARENA_CHUNK_MB) and the hard gate against double-booking
when DS4_CUDA_WEIGHT_IPC_MANIFEST is set.
…dispatch

Brings in 14 commits since the 2026-05-15 merge-base d0357ec:

CUDA correctness (high priority):
- c9dd949 cuda: fix compressed prefill RoPE positions
- 5bc1e6d Apply Flash graph correctness fixes
    Shared experts use the same swiglu_limit clamp as routed experts.
    Ratio-4 indexer Q / indexer compressor KV run Hadamard rotation + FP4
    act-sim round trip before top-k. Adds CUDA + Metal kernels for the QAT
    step. Breaking signature: ds4_gpu_shared_gate_up_swiglu_q8_0_tensor
    gains a float clamp parameter (all callers updated atomically).

Server features and protocol:
- 037ee39 Ignore tool calls emitted inside thinking
- 312935e Add opt-in CORS support (--cors)
- f074c7b Anchor cold KV checkpoints at chat task boundary
- ef0a490 Add ds4-server --chdir option
- 613e9b2 Default sampling to min-p filtering (DS4_DEFAULT_MIN_P=0.05)

Eval and dataset (orthogonal):
- 1f6a8fe Broaden imatrix calibration prompts
- e258d51 Clean up ds4-eval benchmark items
- b166a73 Audit ds4-eval benchmark prompts
- 2ae607f Improve ds4-eval controls and reporting
- 4441e56 Auto-size ds4-eval context
- 48c4d4d Add COMPSEC ds4-eval cases
- 011aa67 Refine COMPSEC eval localization cases

Analysis: local/docs/ds4_upstream_merge_2026-05-18.html

Auto-merged cleanly across README.md, ds4.c, ds4.h, ds4_cli.c,
ds4_cuda.cu, ds4_gpu.h, ds4_metal.m, ds4_server.c. Our VMM arena
(ds4_cuda.cu lines 90-300, 1075-1340) is disjoint from upstream CUDA
hunks (RoPE ~2344, FP8/indexer ~2416-2550, compressor ~6275-6700,
dense Q8_0 ~7489+). Our SSE token_ids work (ds4_server.c) coexists
with upstream's thinking/CORS/KV-anchor edits in different functions
of the same file.
Upstream 5bc1e6d added a 'float clamp' parameter to
ds4_gpu_shared_gate_up_swiglu_q8_0_tensor and updated three callers
upstream had visibility into. Our tree carries two additional callers
in metal_graph_encode_decode_ffn_half_exact and
metal_graph_encode_decode_ffn_shared_post_exact (added during local
Metal graph capture work) that need the same DS4_SWIGLU_CLAMP_EXP
argument the upstream callers use. Same value upstream chose for the
non-Metal-graph paths.
Diagnosis of the GB10 corruption recorded in commit b66b5d6 (R1).
The captured routed-MoE and dense Q8_0 vec graphs run on g_moe_stream
while the rest of the layer (HC, RoPE, head_rms_norm, KV store,
attention, router select, shared SwiGLU, layer-end add) runs on
stream=0.  Two data-dependency races existed across that boundary:

  1. POST-launch: captured kernels write down/mid/out buffers on
     g_moe_stream; the next layer's stream=0 kernels read those
     buffers without an explicit wait.

  2. PRE-launch:  captured kernels read x and selected, both produced
     by stream=0 kernels (router_select, previous layer's add).
     cudaGraphLaunch is asynchronous so g_moe_stream may start
     executing before stream=0 has finished writing those inputs.

Empirically only fixing (1) leaves the corruption (verified by an
on-pod 32-token greedy smoke with this commit's first attempt: still
garbled output).  Both legs are needed.

Fix: a pair of reusable cuda events and two helper functions that
bracket every cudaGraphLaunch:

  ds4_cuda_moe_stream_sync_pre(s)  - record on stream=0, wait on s.
                                     Closes the input-read race.
  ds4_cuda_moe_stream_sync_post(s) - record on s, wait on stream=0.
                                     Closes the output-read race.

Four bracketed call sites:
  - routed_moe replay-hit
  - routed_moe first-launch (after capture+instantiate)
  - dense Q8_0 vec replay-hit
  - dense Q8_0 vec first-launch (after capture+instantiate)

Default remains OFF.  After validation on GB10 + PRO 6000 (32-token
smoke parity vs graphs-OFF + MTP acceptance recovery + bench parity),
flip the static enabled=0 to 1 in a separate commit.
The cross-stream race that motivated commit b66b5d6's revert is closed
by the previous commit's bidirectional ds4_cuda_moe_stream_sync_pre/_post()
brackets around every cudaGraphLaunch.  Validated on both target arches:

  PRO 6000 (sm_120):
    - 32-token greedy smoke: graphs ON output bit-identical to OFF
    - ds4-bench ctx=2048: gen 43.31 -> 44.20 t/s (+2.06%)
    - ds4-bench ctx=4096: gen 42.47 -> 43.45 t/s (+2.31%)

  GB10 (sm_121):
    - 32-token greedy smoke: graphs ON output bit-identical to OFF
    - MTP-active 64-token: clean coherent output (was garbled pre-fix)
    - ds4-bench ctx=2048: gen 14.10 -> 14.84 t/s (+5.2%)
    - ds4-bench ctx=4096: gen 13.95 -> 14.73 t/s (+5.6%)

Changes:
  - ds4_cuda_moe_graphs_enabled() default: enabled = 0 -> enabled = 1.
  - Env-var semantics inverted back to opt-out: DS4_CUDA_MOE_GRAPHS=0
    (or off/no/false) now disables.  Anyone who was setting =1 to
    opt-in will silently no-op (graphs are already on by default).

The perf gain is smaller than the pre-revert benchmark (which showed
+9.36%) because the new cudaEventRecord+cudaStreamWaitEvent brackets
serialize stream=0 with g_moe_stream, trading some parallelism for
the correctness fix.  Still strictly positive on both targets.
Upstream's convention is to keep extra committed docs under misc/
(.gitignored by default; force-add specific files like ANTHROPIC_LIVE_CONTINUATION.md
and RESPONSE_API.md when they're worth keeping in-tree).  We had been
stashing ours under a new docs/ directory that upstream doesn't have.

Move:
  docs/cuda-mtp/README.md      -> misc/cuda-mtp/README.md
  docs/proof-harness/README.md -> misc/proof-harness/README.md

Update the one in-tree reference (README.md) and the new misc/
proof-harness entry it now picks up.  AGENT.md slim-down to follow.
AGENT.md had grown to 286 lines (~5.4x upstream's 53) by accreting
operator-manual content as the CUDA backend matured.  Upstream's
AGENT.md is terse policy-and-pointers prose: no tables, no code
blocks, no benchmark numbers, no commit hashes.  Bring our version
back to that tone:

  AGENT.md  286 -> 68 lines  (15 net over upstream, all in Layout +
                              one short Testing paragraph for the
                              weight-server pattern)

The dropped material consolidates into a new force-added reference
file:

  misc/cuda-env-vars.md  (new, 93 lines)
    - Q8_0 dispatcher table (mmq / cublas / warp8)
    - full env-var inventory with intent behind each default
    - in-process VMM weight arena env vars
    - one pointer each to misc/proof-harness/README.md and
      misc/cuda-mtp/README.md for content that lives there

Stale content dropped during the move:
  - 'DS4_CUDA_MOE_GRAPHS default OFF' rationale paragraph (R1, since
    resolved by commits 687c783 + 7967154)
  - companion-repo cross-refs to local/docs/*.html files (private to
    the auto-round repo; not reachable from a ds4 checkout)
  - per-arch prefill benchmark tables (point-in-time snapshots that
    age out of correctness; the live numbers belong in the
    optimization report HTMLs, not in AGENT.md)
  - '~60 s base upload' figure (the storage-dependent range is
    20-70 s; full discussion in misc/proof-harness/README.md)

No code paths change; AGENT.md and misc/ are docs-only.
Three lines still framed ds4 as Metal-only when both backends are now
real parallel implementations: the production-path goal, the smoke-test
safety guidance, and the testing prerequisite.  Update each to name
both:

  Goals:   'whole-model Metal graph inference'
        -> 'whole-model GPU graph inference (Metal on macOS, CUDA on Linux)'

  Safety:  'Prefer short Metal smoke tests for build verification'
        -> 'Prefer short GPU smoke tests for build verification
            (Metal on macOS, CUDA on Linux)'

  Testing: 'when a model and Metal are available'
        -> 'when a model and a GPU backend are available'

The Objective-C-only-where-Metal-requires-it line in the intro is
preserved verbatim: that's a code-style policy, not a backend-coverage
claim.
Reason antirez#6 in the Motivations section said the KV cache is compressed
enough for long-context inference + on-disk persistence.  True but
understated: the same compression is what keeps the model's speed
from collapsing at deep contexts the way similarly-sized recent peers
(MiMo-V2.5 and friends) do beyond ~100k tokens.  Add one sentence
after the existing bullet making that speed-durability claim
explicit, in the section's first-person-plural voice.

Qualitative claim, no specific benchmark cited.
CONTRIBUTING.md asks for a clean make cpu build.  Four static MTP
helpers in ds4.c were defined unconditionally but called only from
inside #ifndef DS4_NO_GPU regions (the CPU build stubs out
ds4_session_eval_speculative_argmax to an empty body, so none of the
callers compile in CPU mode).  Result: four -Wunused-function
warnings on every make cpu invocation.

Wrap the four helper definitions (ds4_mtp_accept_trace,
ds4_mtp_accept_gate_should_skip, ds4_mtp_exact_policy_use_seq,
ds4_mtp_accept_gate_record) in the same #ifndef DS4_NO_GPU as their
callers.  make cpu and make (Metal) both build warning-free; CUDA
builds compile the helpers as before because DS4_NO_GPU isn't
defined.
… README

The in-process VMM weight arena introduces a small, deterministic
FP32 reduction-order drift relative to the cudaMalloc-backed arena
when matmul kernels read identical weight bytes through 2 MiB-page-
aligned per-tensor VA ranges instead of packed sub-granularity
allocations.  Investigation chain (all ruled out and documented in
the commit notes):
  - Uploaded bytes are identical (checksum)
  - Kernels do not read OOB (poison-fill test)
  - Not cross-stream coherency (cudaDeviceSynchronize doesn't help)
  - Not the warp8 load_i8x4_i32_aligned helper alone (mmq also drifts)
The most plausible mechanism is L2-cache-hit / tile-arrival-order
sensitivity in the kernels' parallel reduction across CTAs (FP32
non-associativity).  Magnitude ~0.08 logprob units; bounded to
tight-margin token choices.  One of four ./ds4_test --logprob-vectors
official-vector cases flips a textually-equivalent alternative.

Document this in misc/cuda-env-vars.md alongside the existing
mmq-vs-warp8 drift (DS4_CUDA_MTP_VERIFIER_USE_MMQ / Option D).  Same
shape of trade; same workaround pattern: DS4_CUDA_VMM_ARENA=0 for
users who need exact official-vector byte equivalence, default ON
otherwise for the ~2x prefill win.

Also add misc/cuda-env-vars.md to README's More Documentation list
(was missing; was discoverable only via in-tree cross-reference).
The chunk-size bisect during the drift investigation updated our
understanding of why the VMM arena is fast.  The 'VMM gives 2 MiB
pages' framing is incomplete: VMM with one large 1792 MiB chunk
performs identically to cudaMalloc (~1080 t/s prefill on PRO 6000),
even though the memory is still 2 MiB-paged.

The actual differentiator is per-tensor 2 MiB-aligned base
addresses: when each weight tensor sits at its own fresh
cuMemAddressReserve-handed VA, the matmul kernels' tile-load
coalescing and L2 spatial-locality improve enough to roughly double
prefill on discrete GPUs.  Pack the same memory into one big chunk
and the bases land at sub-granularity offsets - perf advantage
disappears.

Same root cause as the reduction-order drift: the cache/tile-
arrival-order change that gives the perf also changes the order
partial sums reach the FP32 accumulator.  Add a 'Why per-tensor
chunks specifically' subsection explaining this, then link it
explicitly to the drift section.  Drop the redundant 'plausible
mechanism' paragraph from the drift discussion.
Frontier sweeps run with ds4-bench:

  ./ds4-bench --cuda \
    -m gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf \
    --prompt-file speed-bench/promessi_sposi.txt \
    --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 \
    --csv speed-bench/<arch>.csv

Both runs use the current branch's post-merge defaults: mmq dispatch,
mmvq decode, CUDA graphs with bidirectional stream sync, in-process
VMM weight arena.  q2-imatrix variant on both targets.

CSVs use the extended schema ds4_bench.c emits today (8 columns):
ctx_tokens, prefill_tokens, prefill_tps, gen_tokens, gen_tps,
gen_tps_ss, first_token_sec, kvcache_bytes.  plot_speed.py is
already schema-tolerant (DictReader with subset semantics on its
required columns) so it reads either the upstream 6-column files or
these 8-column files without modification.

PRO 6000 long-prompt headline: 2192 -> 1508 t/s prefill across
ctx 2k -> 65k; gen 43.1 -> 35.3 t/s.

GB10 long-prompt headline: 465 -> 308 t/s prefill across the same
sweep; gen 14.2 -> 11.7 t/s.
make cuda-regression was broken by our work because the test target
links only ds4_cuda.o and not the cuda/mmq/*.o adapter objects.
ds4_cuda.o calls into ds4_mmq_init / ds4_mmq_q8_0_dense /
ds4_mmq_q4_K_moe* / etc, which now live in libraries we vendored
under cuda/mmq/.  Without the link, the smoke test fails with
'undefined reference to ds4_mmq_init' and the cuda-regression
CONTRIBUTING.md check can't run.

Add $(MMQ_OBJS) to the cuda_long_context_smoke link line.

Verified on pod: make cuda-regression now passes
  'cuda long-context regression: OK'
…_ids

Three updates that round out the README to match what the engine
now exposes:

  - Speed table: broaden the 'Metal CLI numbers' footnote to 'CLI
    numbers' since both Mac and NVIDIA targets are now represented;
    add RTX PRO 6000 Blackwell short + long-prompt rows and refresh
    the DGX Spark GB10 row to a long-prompt run matching the new
    PRO 6000 entry.  Numbers from the in-tree speed-bench sweeps,
    captured with the q2-imatrix variant on both CUDA targets for
    apples-to-apples.

  - Benchmarking: describe the steady-state gen_tps_ss column we
    added to ds4-bench (excludes first-token amortisation, useful
    for apples-to-apples decode comparisons across short and long
    generations).  Point at the newly committed
    speed-bench/pro6000_blackwell_ts.svg and gb10_spark_ts.svg.

  - Server SSE: document return_token_ids on chat-completion SSE
    (per-token IDs at the choice level, matching vLLM/llama-benchy
    wire shape; emission snaps to token boundaries to keep IDs in
    sync with text).
@Entrpi Entrpi marked this pull request as ready for review May 18, 2026 08:23
Entrpi added a commit to Entrpi/ds4-on-spark that referenced this pull request May 18, 2026
Until antirez/ds4#187 lands upstream, the one-shot installer pulls
the CUDA performance work (mmq dispatch + in-process VMM weight
arena + stream-sync'd CUDA graphs) from our PR branch. Revert the
two DS4_REPO / DS4_REF defaults to antirez/ds4 + main once merged.

Fast-forward path now repoints `origin` in place when the existing
clone tracks a different URL, so users who already ran the
installer against antirez/ds4 main get a clean repoint instead of
a fetch error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant