
rocm: add wmma indexer support #180

Open
alantsev wants to merge 61 commits into antirez:rocm from alantsev:rocm


@alantsev

most of the changes are from the upstream main branch.

the rocm-related changes introduce the wmma indexer and keep changes to the pure cuda path minimal

the long-context test is still failing:

$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
ds4-test: long-context wrong assignment for Alice: got 50 expected 52
tests/ds4_test.c:354: assertion failed: test_output_has_fact(text, &test_long_facts[i])
long-context: ERR
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-evict-test.ZkrbI4/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-current-store-evict-test.SPxImn/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-oversize-store-evict-test.WQKFyJ/2222222222222222222222222222222222222222.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=2048 hits=15 size=0.00 MiB file=/tmp/ds4-kv-stale-hit-evict-test.h3s2dj/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-live-prefix-test.J3pmsF/1111111111111111111111111111111111111111.kv
server: OK
ds4 tests: 1 failure(s)

while the cpu run produces proper logprobs. step 6 of the failing run selects "50":

$ cat /tmp/long_context_story.ds4.json | grep "\"step\":6,"
    {"step":6,"selected":{"id":1328,"text":"50","bytes":[53,48]},"top_logprobs":[{"token":{"id":1328,"text":"50","bytes":[53,48]},"logit":39.0537186,"logprob":-0.441582799},{"token":{"id":4157,"text":"52","bytes":[53,50]},"logit":38.4644012,"logprob":-1.03090012},{"token":{"id":926,"text":"16","bytes":[49,54]},"logit":29.8901711,"logprob":-9.6051302},{"token":{"id":4287,"text":"51","bytes":[53,49]},"logit":29.4500008,"logprob":-10.0453005},{"token":{"id":5863,"text":"71","bytes":[55,49]},"logit":29.2099457,"logprob":-10.2853556},{"token":{"id":2181,"text":"31","bytes":[51,49]},"logit":28.3799171,"logprob":-11.1153841},{"token":{"id":3286,"text":"41","bytes":[52,49]},"logit":28.0461922,"logprob":-11.4491091},{"token":{"id":4414,"text":"53","bytes":[53,51]},"logit":27.9420624,"logprob":-11.5532389},{"token":{"id":2315,"text":"55","bytes":[53,53]},"logit":27.8421669,"logprob":-11.6531343},{"token":{"id":5817,"text":"73","bytes":[55,51]},"logit":27.7103462,"logprob":-11.784955},{"token":{"id":1671,"text":"33","bytes":[51,51]},"logit":27.5818291,"logprob":-11.9134722},{"token":{"id":1059,"text":"30","bytes":[51,48]},"logit":27.4494438,"logprob":-12.0458574},{"token":{"id":4364,"text":"54","bytes":[53,52]},"logit":27.2013245,"logprob":-12.2939768},{"token":{"id":3175,"text":"58","bytes":[53,56]},"logit":26.9953213,"logprob":-12.49998},{"token":{"id":3351,"text":"57","bytes":[53,55]},"logit":26.9138756,"logprob":-12.5814257},{"token":{"id":4610,"text":"72","bytes":[55,50]},"logit":26.860775,"logprob":-12.6345263},{"token":{"id":3661,"text":"56","bytes":[53,54]},"logit":26.7848511,"logprob":-12.7104502},{"token":{"id":2111,"text":"32","bytes":[51,50]},"logit":26.6977062,"logprob":-12.797595},{"token":{"id":2491,"text":"47","bytes":[52,55]},"logit":26.6845646,"logprob":-12.8107367},{"token":{"id":3180,"text":"42","bytes":[52,50]},"logit":26.6252499,"logprob":-12.8700514}]},

vs the cpu one, which selects the expected "52":

$ cat /tmp/long_context_story_cpu.ds4.json | grep "\"step\":6,"
    {"step":6,"selected":{"id":4157,"text":"52","bytes":[53,50]},"top_logprobs":[{"token":{"id":4157,"text":"52","bytes":[53,50]},"logit":44.3264122,"logprob":-9.7539878e-06},{"token":{"id":926,"text":"16","bytes":[49,54]},"logit":32.3199005,"logprob":-12.0065212},{"token":{"id":2012,"text":"34","bytes":[51,52]},"logit":30.0317593,"logprob":-14.2946625},{"token":{"id":1328,"text":"50","bytes":[53,48]},"logit":29.7194824,"logprob":-14.6069393},{"token":{"id":4287,"text":"51","bytes":[53,49]},"logit":29.5467281,"logprob":-14.7796936},{"token":{"id":4414,"text":"53","bytes":[53,51]},"logit":29.1445732,"logprob":-15.1818485},{"token":{"id":3180,"text":"42","bytes":[52,50]},"logit":28.6198616,"logprob":-15.7065601},{"token":{"id":4364,"text":"54","bytes":[53,52]},"logit":28.4588375,"logprob":-15.8675842},{"token":{"id":1671,"text":"33","bytes":[51,51]},"logit":28.2291546,"logprob":-16.0972672},{"token":{"id":5863,"text":"71","bytes":[55,49]},"logit":28.0653591,"logprob":-16.2610626},{"token":{"id":3286,"text":"41","bytes":[52,49]},"logit":27.9963379,"logprob":-16.3300838},{"token":{"id":2315,"text":"55","bytes":[53,53]},"logit":27.9557056,"logprob":-16.3707161},{"token":{"id":2372,"text":"46","bytes":[52,54]},"logit":27.9028664,"logprob":-16.4235554},{"token":{"id":2597,"text":"78","bytes":[55,56]},"logit":27.7379379,"logprob":-16.5884838},{"token":{"id":1173,"text":"24","bytes":[50,52]},"logit":27.5626965,"logprob":-16.7637253},{"token":{"id":2491,"text":"47","bytes":[52,55]},"logit":27.5278931,"logprob":-16.7985287},{"token":{"id":929,"text":"14","bytes":[49,52]},"logit":27.4220829,"logprob":-16.9043388},{"token":{"id":907,"text":"13","bytes":[49,51]},"logit":27.4163208,"logprob":-16.9101009},{"token":{"id":5817,"text":"73","bytes":[55,51]},"logit":27.3305702,"logprob":-16.9958515},{"token":{"id":3661,"text":"56","bytes":[53,54]},"logit":27.2389526,"logprob":-17.0874691}]},

so far I was not able to identify the reason for the discrepancy.
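
One quick sanity check on the two dumps, sketched in Python: a logprob is the logit minus a shared log-sum-exp term, so within one dump the gap between two tokens' logits should equal the gap between their logprobs. The numbers are copied from the step-6 lines above; the variable and function names are illustrative only.

```python
# Step-6 (logit, logprob) pairs for tokens "50" and "52", copied from the dumps above.
# first  = /tmp/long_context_story.ds4.json      (selects "50")
# second = /tmp/long_context_story_cpu.ds4.json  (selects "52")
first = {"50": (39.0537186, -0.441582799), "52": (38.4644012, -1.03090012)}
second = {"50": (29.7194824, -14.6069393), "52": (44.3264122, -9.7539878e-06)}


def gaps(dump, winner, loser):
    """Return (logit gap, logprob gap) of winner over loser within one dump."""
    logit_gap = dump[winner][0] - dump[loser][0]
    logprob_gap = dump[winner][1] - dump[loser][1]
    return logit_gap, logprob_gap


first_gaps = gaps(first, "50", "52")
second_gaps = gaps(second, "52", "50")
print("first dump:  logit gap %.4f, logprob gap %.4f" % first_gaps)
print("second dump: logit gap %.4f, logprob gap %.4f" % second_gaps)
```

Each dump is internally consistent (its two gaps agree), but the first run prefers "50" by well under one logit while the second prefers "52" by more than fourteen. That suggests the divergence sits upstream of the final softmax rather than in sampling.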

mitsuhiko and others added 30 commits May 11, 2026 12:30
Implements the Responses API endpoint that Codex CLI (and other modern
OpenAI tooling) speaks instead of /v1/chat/completions. The wire format
is documented in OpenAI's Responses API; this implementation has been
iterated against the Codex CLI binary's SSE parser shape until no
remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call,
  function_call_output, reasoning, custom_tool_call(_output),
  local_shell_call(_output), web_search_call(_output),
  tool_search_call(_output), image_generation_call(_output),
  compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so
  prior actions survive across turns; rejects unknown item types and
  non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized
  text blocks (input_text/output_text/text/summary_text/
  reasoning_text); rejects non-text modalities (input_image/file/
  audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant
  message so text + tool-call turns render as a single assistant
  block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates
  reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice
  explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created,
  output_item.added/.done, reasoning_summary_part.added/.done,
  reasoning_summary_text.delta/.done, content_part.added/.done,
  output_text.delta/.done, function_call_arguments.delta/.done,
  response.completed.
- Branches the terminal event by finish reason: response.failed for
  errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries
  annotations:[]; function_call output_item.added ships with an empty
  arguments string (full args arrive via function_call_arguments.done
  and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream
  marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model
  was suppressed at the tool marker: the suppressed tail is flushed
  as additional output_text deltas so the streamed message matches
  output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same
client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks.

Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.
This reverts commit 2a7a5f3.

There was no ack from the user. Don't want to take a fix
that is astronautically produced from an unclear error
trace.
Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth.

Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.
Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation.

The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)
Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first.

Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations.

Fixes antirez#127.
Return a 400 error with error type "context_exceeded" when prompt tokens exceed
context size. The response includes both n_prompt_tokens and n_ctx fields so
clients can determine exactly why the request failed and how far over the limit
they went.

Error response format:
  {
    "error": {
      "message": "Prompt tokens (N) exceeds context size (M)",
      "type": "context_exceeded",
      "n_prompt_tokens": N,
      "n_ctx": M
    }
  }
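
A minimal Python sketch of the error shape above, together with how a client might use the two extra fields. The helper names `context_exceeded_error` and `tokens_over_limit` are hypothetical and are not taken from the server code; only the JSON layout and the 400 status come from the commit message.

```python
import json


def context_exceeded_error(n_prompt_tokens, n_ctx):
    """Hypothetical helper: return (status, body) when the prompt is too long, else None."""
    if n_prompt_tokens <= n_ctx:
        return None
    body = {
        "error": {
            "message": "Prompt tokens (%d) exceeds context size (%d)"
            % (n_prompt_tokens, n_ctx),
            "type": "context_exceeded",
            "n_prompt_tokens": n_prompt_tokens,
            "n_ctx": n_ctx,
        }
    }
    return 400, body


def tokens_over_limit(body):
    """Client side: how far over the limit the request was, or None for other errors."""
    err = body.get("error", {})
    if err.get("type") != "context_exceeded":
        return None
    return err["n_prompt_tokens"] - err["n_ctx"]


status, body = context_exceeded_error(30474, 16384)
print(status, json.dumps(body))
print("over by", tokens_over_limit(body))
```

A client can branch on the `type` field and trim its history by the reported overshoot instead of parsing the message string.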
dwarfstar is typoed to drawfstar
antirez and others added 26 commits May 15, 2026 19:58
Render tool result bodies as raw text so file contents containing angle brackets reach the model unchanged. Escape only the tool_result closing sentinel, and keep the imatrix prompt renderer mirror in sync.

Fixes antirez#152
ds4-server auto-injects "## Tools" and the per-request tool-schema
listing into the system content of every chat-style request that
declares tools. Place that block at the head of the system region,
before any client-supplied system messages, so the dynamic tail of
the client's system prompt (commonly a per-request timestamp or
agent-id emitted by an SDK) sits at the very end of the system
region, immediately before <|User|>.

That position is what --kv-cache-boundary-trim-tokens subtracts from
when computing the cached cut. With tool schemas at the head, trim
can chop only the variable bytes and keep the much larger tool-schema
region inside the cached prefix, so cross-session lookups hit instead
of always missing on the dynamic tail.

The model still sees the same content in the same role; only the
order inside the system block changes. Includes a unit test asserting
that the tool block is rendered before the client's system content
and that the client content stays immediately before the user marker.
The disk cache eviction score is (hits + 1) * tokens / file_size.
hits is monotonic: it only ever goes up on a successful disk reuse.
That is fine while the workload stays the same, but it locks
once-popular files in place after a prompt schema change (system
prompt rewritten, tools removed, model swapped, etc.).  The stale
file can no longer match anything, so it never accumulates new hits,
but its existing bonus keeps it well above every freshly-stored entry
indefinitely.

Add an optional time-based decay so the bonus erodes when an entry
goes untouched.  When --kv-cache-hit-half-life-days N is set, the
score uses

    effective_hits = hits * 2 ** -((now - last_used) / half_life)

so a matching prompt that refreshes last_used keeps the entry hot,
and an entry that stops getting hits gradually falls toward the
(0 + 1) baseline that protects fresh stores.  Default is 0, which
disables decay and preserves the current behavior exactly.
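
The score and decay formula above can be sketched in Python as follows. This is an illustration, not the server's code: the function name is hypothetical, and it assumes `now` and `last_used` are Unix timestamps in seconds while the half-life is given in days.

```python
SECONDS_PER_DAY = 86400.0


def eviction_score(hits, tokens, file_size, now, last_used, half_life_days=0.0):
    """Score = (effective_hits + 1) * tokens / file_size; a lower score is evicted first.

    Assumption: timestamps are in seconds, half_life_days is in days.
    half_life_days == 0 disables decay, matching the default described above.
    """
    effective_hits = float(hits)
    if half_life_days > 0:
        age_days = (now - last_used) / SECONDS_PER_DAY
        effective_hits = hits * 2 ** -(age_days / half_life_days)
    return (effective_hits + 1) * tokens / file_size


# A once-popular entry, untouched for exactly one half-life (7 days).
now = 7 * SECONDS_PER_DAY
no_decay = eviction_score(15, 2048, 1024, now, last_used=0.0)
decayed = eviction_score(15, 2048, 1024, now, last_used=0.0, half_life_days=7)
fresh = eviction_score(0, 2048, 1024, now, last_used=now, half_life_days=7)
print(no_decay, decayed, fresh)
```

After one half-life the hit bonus is halved (15 becomes 7.5), and as the idle time grows the score approaches the `(0 + 1)` baseline of a freshly stored entry with the same token density.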
PR antirez#29 added useful cache usage reporting, but its OpenAI-compatible usage shape reported cached_tokens as cache reads plus newly-prefilled cache writes. OpenAI defines cached_tokens as prompt tokens retrieved from cache, so including writes makes a future reusable suffix look like an actual cache hit. Keep cache_write_tokens as a DS4 extension and report cached_tokens as read hits only for Chat/Completions and Responses usage. Anthropic usage already has separate standard read/create fields and is left unchanged.
PR antirez#162 introduced the right score shape by decaying only the hit bonus while preserving the token/byte baseline. Make that behavior unconditional so stale once-hot checkpoints cannot remain pinned unless the operator opts in. Remove the extra CLI knob, keep the half-life visible in startup logs, and add an eviction test for same-density fresh checkpoints versus stale high-hit files.

Fixes antirez#161.
Add language/prose calibration tasks, include ds4-eval benchmark prompts without answer keys, and expand long-context prompts with code synthesis, agent transcript replay, trace diagnosis, prose fact recovery, document comparison, delayed constraints, and a small needle task. Regenerate the rendered prompt files and manifest so the corpus stays reproducible.
Cold disk KV checkpoints should maximize reuse across independent agent sessions, not preserve the variable user task. The reusable boundary is the last rendered <User> special token before the first <Assistant> special token: this keeps stable user-role scaffolding such as Codex environment_context, while avoiding later multi-turn user markers that belong to conversation history and are better handled by live/continued checkpoints.

Implement the anchor at token level, using the exact chat special token IDs, so the saved prefix is the same boundary the model sees. If no valid chat anchor is present, keep the existing trim/alignment heuristic for non-chat or too-short prompts.

Checked with ./ds4_test --server and real fresh-session pairs for pi, Claude Code, opencode, and Codex. The second run reused disk prefixes of 1188/1215 tokens for pi, 23164/23973 for Claude Code, 520/551 plus 10725/10756 for opencode's two stages, and 11554/11584 for Codex Responses, confirming that Codex's stable environment_context is now included while task text is left out.
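
The anchor rule described in this commit message (last `<User>` special token before the first `<Assistant>` special token) can be sketched at token level like this. The token IDs and the function name are made up for illustration, and whether the saved prefix ends just before or just after the anchor token is not stated in the commit, so the sketch only returns the anchor index.

```python
def find_cold_anchor(tokens, user_id, assistant_id):
    """Return the index of the last user_id before the first assistant_id.

    Returns None when there is no valid chat anchor, in which case the caller
    would fall back to its existing trim/alignment heuristic.
    """
    try:
        first_assistant = tokens.index(assistant_id)
    except ValueError:
        return None
    anchor = None
    for i in range(first_assistant):
        if tokens[i] == user_id:
            anchor = i  # keep overwriting so the last match wins
    return anchor


USER, ASSISTANT = 100, 200  # hypothetical special-token IDs
# stable scaffolding, stable user-role context, task text, then later turns
prompt = [1, USER, 5, 6, USER, 7, ASSISTANT, 8, USER, 9, ASSISTANT]
print(find_cold_anchor(prompt, USER, ASSISTANT))
```

Later `<User>` markers (index 8 here) are ignored because they come after the first `<Assistant>` token and belong to conversation history.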
Add a --cors server flag for browser JavaScript clients that call ds4-server directly from another origin. When enabled, normal JSON responses, errors, and SSE streams include broad Access-Control-Allow-* headers, and OPTIONS preflight requests return 204 No Content.

The flag only changes HTTP headers. It deliberately does not change the bind address, so LAN exposure remains explicit via --host 0.0.0.0. This keeps the useful part of PR antirez#94 while avoiding the surprising localhost-to-all-interfaces behavior.

Checked with ./ds4_test --server and real curl requests for OPTIONS, GET /v1/models, and streaming /v1/chat/completions. Also verified that responses without --cors omit the CORS headers and that default listening remains 127.0.0.1.
Remove SuperGPQA rows with broken upstream keys from ds4-eval instead of locally re-keying them. Replace them with cleaner self-contained SuperGPQA items, document the audited SuperGPQA slice, and regenerate the imatrix calibration prompts so benchmark-derived calibration no longer includes the broken items.
Add a q key path that exits the TUI cleanly and prints the normal plain-text report, show the pause/quit hints in the TUI title, and keep the final report compact by omitting question titles and ANSI escapes. Also clamp per-case generation to the remaining context budget so small-ctx runs stop or skip individual cases instead of aborting the whole benchmark on KV capacity errors.
Refs antirez#167.

DSML tool calls are executable only after the assistant closes the thinking stanza. The live decoder now skips tool-marker detection while thinking is open, and the final parser searches for executable DSML only after the last </think> when thinking mode is active. This prevents a DSML block inside reasoning from being duplicated as both thinking text and a structured tool call.

We were not able to prove from traces that the model actually made this exact mistake in the reported case, because the trace gives text rather than the original sampled-token/KV state. It is nevertheless possible, and it matches the reported shape of the bug. We did see evidence that the model can discuss tool usage and DSML syntax inside the thinking stanza, so this is a conservative protocol fix: it does not force the end of thinking, and it does not execute tool calls while still inside thinking.

It remains unclear what the model would do if it truly mis-emits a tool call before </think>: it might recover by closing thinking later, or it might not. A follow-up commit should address the likely root source by changing sampler behavior itself instead of relying only on protocol filtering.
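
As a rough illustration of the final-parser rule above (look for executable tool calls only after the last `</think>` when thinking mode is active), here is a small Python sketch. The marker string `<TOOL>` is a placeholder, not the real DSML marker, and the function name is hypothetical.

```python
THINK_CLOSE = "</think>"


def find_executable_tool_call(text, marker, thinking):
    """Return the offset of the first tool marker that may be executed, or None.

    With thinking active, only text after the last </think> is searched; if the
    thinking stanza was never closed, nothing is executable.
    """
    start = 0
    if thinking:
        close = text.rfind(THINK_CLOSE)
        if close < 0:
            return None
        start = close + len(THINK_CLOSE)
    pos = text.find(marker, start)
    return pos if pos >= 0 else None


MARKER = "<TOOL>"  # placeholder marker for illustration only
out = "<think>maybe call <TOOL>ls here</think>Running it now. <TOOL>ls"
pos = find_executable_tool_call(out, MARKER, thinking=True)
print(pos, out[pos:])
```

The marker mentioned inside the reasoning is skipped, so it is reported only as thinking text and not also as a structured tool call.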
Use a shared DS4_DEFAULT_MIN_P of 0.05 for sampled generation. The default keeps top_p at 1.0 and uses min_p as the active filter, preventing very low relative-probability tokens from entering the sampling set while preserving plausible alternatives.

Apply the same default to the CLI, server, and ds4-eval, expose --min-p in CLI/eval, and make the server thinking-mode fixed sampler use min_p=0.05 instead of disabling min-p filtering. Explicit non-thinking API min_p values continue to be honored.

Builds checked with make -B ds4 ds4-server ds4-eval. Server unit tests pass; full make test could not run because another DS4 process was already holding /tmp/ds4.lock.
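
For readers unfamiliar with min-p, here is a minimal Python sketch assuming the common definition: a token survives when its probability is at least `min_p` times the probability of the most likely token. The function name is hypothetical and the real sampler may differ in detail; the example logits are the top three from the failing step-6 dump earlier in this thread.

```python
import math


def min_p_filter(logits, min_p=0.05):
    """Return {index: renormalized probability} for tokens that pass min-p.

    p_i / p_max == exp(logit_i - logit_max), so the test can be done on logits
    directly without computing the full softmax first.
    """
    top = max(logits)
    if min_p <= 0:
        keep = list(range(len(logits)))  # filter disabled
    else:
        cutoff = math.log(min_p)
        keep = [i for i, x in enumerate(logits) if x - top >= cutoff]
    norm = sum(math.exp(logits[i] - top) for i in keep)
    return {i: math.exp(logits[i] - top) / norm for i in keep}


logits = [39.0537186, 38.4644012, 29.8901711]
kept = min_p_filter(logits, min_p=0.05)
print(kept)
```

With `min_p=0.05` the close runner-up stays in the sampling set, while the third token, more than nine logits below the top, is dropped.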
Add --chdir so service managers and external launchers can run ds4-server from any current directory while still resolving relative runtime files, including metal/*.metal, from the project or install tree they select.

This intentionally keeps path handling explicit instead of adding per-loader executable-relative probing: after chdir succeeds, model paths, trace paths, KV cache directories, and Metal sources all follow the same normal relative-path rule.

Fixes antirez#49
changed:
  - used intrinsics from llama-cpp repo
  - no fast math
  - WMMA kernels for rocm
  - minimal changes in the vanilla cuda path

Test results

still have issues with the long context test


```
$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
ds4-test: long-context wrong assignment for Alice: got 50 expected 52
tests/ds4_test.c:354: assertion failed: test_output_has_fact(text, &test_long_facts[i])
long-context: ERR
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-evict-test.ZkrbI4/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-current-store-evict-test.SPxImn/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-oversize-store-evict-test.WQKFyJ/2222222222222222222222222222222222222222.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=2048 hits=15 size=0.00 MiB file=/tmp/ds4-kv-stale-hit-evict-test.h3s2dj/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-live-prefix-test.J3pmsF/1111111111111111111111111111111111111111.kv
server: OK
ds4 tests: 1 failure(s)

$
```

Strangely enough, the cpu version produces similar results for the logprobs.
alantsev changed the title from "Rocm" to "rocm: add wmma indexer support" on May 17, 2026
@antirez
Owner

antirez commented May 17, 2026

@alantsev please check the CUDA fix for the long-context test and, if possible, ask your agent to fix the ROCm path in the same way.

@gundemirbas

gundemirbas commented May 19, 2026

Any progress?

@alantsev
Author

huge progress in understanding :) - will fix it this weekend


10 participants