rocm: add wmma indexer support by alantsev · Pull Request #180 · antirez/ds4

alantsev · 2026-05-17T07:06:25Z

Most of the changes come from the upstream main branch.

The ROCm-related changes introduce the WMMA indexer and keep changes to the pure CUDA path to a minimum.

The long-context test is still failing:

$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
ds4-test: long-context wrong assignment for Alice: got 50 expected 52
tests/ds4_test.c:354: assertion failed: test_output_has_fact(text, &test_long_facts[i])
long-context: ERR
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-evict-test.ZkrbI4/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-current-store-evict-test.SPxImn/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-oversize-store-evict-test.WQKFyJ/2222222222222222222222222222222222222222.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=2048 hits=15 size=0.00 MiB file=/tmp/ds4-kv-stale-hit-evict-test.h3s2dj/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-live-prefix-test.J3pmsF/1111111111111111111111111111111111111111.kv
server: OK
ds4 tests: 1 failure(s)

The ROCm run (/tmp/long_context_story.ds4.json) ranks "50" above the expected "52" at step 6:

$ cat /tmp/long_context_story.ds4.json | grep "\"step\":6,"
    {"step":6,"selected":{"id":1328,"text":"50","bytes":[53,48]},"top_logprobs":[{"token":{"id":1328,"text":"50","bytes":[53,48]},"logit":39.0537186,"logprob":-0.441582799},{"token":{"id":4157,"text":"52","byte
s":[53,50]},"logit":38.4644012,"logprob":-1.03090012},{"token":{"id":926,"text":"16","bytes":[49,54]},"logit":29.8901711,"logprob":-9.6051302},{"token":{"id":4287,"text":"51","bytes":[53,49]},"logit":29.4500008
,"logprob":-10.0453005},{"token":{"id":5863,"text":"71","bytes":[55,49]},"logit":29.2099457,"logprob":-10.2853556},{"token":{"id":2181,"text":"31","bytes":[51,49]},"logit":28.3799171,"logprob":-11.1153841},{"to
ken":{"id":3286,"text":"41","bytes":[52,49]},"logit":28.0461922,"logprob":-11.4491091},{"token":{"id":4414,"text":"53","bytes":[53,51]},"logit":27.9420624,"logprob":-11.5532389},{"token":{"id":2315,"text":"55",
"bytes":[53,53]},"logit":27.8421669,"logprob":-11.6531343},{"token":{"id":5817,"text":"73","bytes":[55,51]},"logit":27.7103462,"logprob":-11.784955},{"token":{"id":1671,"text":"33","bytes":[51,51]},"logit":27.5
818291,"logprob":-11.9134722},{"token":{"id":1059,"text":"30","bytes":[51,48]},"logit":27.4494438,"logprob":-12.0458574},{"token":{"id":4364,"text":"54","bytes":[53,52]},"logit":27.2013245,"logprob":-12.2939768
},{"token":{"id":3175,"text":"58","bytes":[53,56]},"logit":26.9953213,"logprob":-12.49998},{"token":{"id":3351,"text":"57","bytes":[53,55]},"logit":26.9138756,"logprob":-12.5814257},{"token":{"id":4610,"text":"
72","bytes":[55,50]},"logit":26.860775,"logprob":-12.6345263},{"token":{"id":3661,"text":"56","bytes":[53,54]},"logit":26.7848511,"logprob":-12.7104502},{"token":{"id":2111,"text":"32","bytes":[51,50]},"logit":
26.6977062,"logprob":-12.797595},{"token":{"id":2491,"text":"47","bytes":[52,55]},"logit":26.6845646,"logprob":-12.8107367},{"token":{"id":3180,"text":"42","bytes":[52,50]},"logit":26.6252499,"logprob":-12.8700
514}]},

The CPU run (/tmp/long_context_story_cpu.ds4.json) produces the expected logprobs, with "52" on top:

$ cat /tmp/long_context_story_cpu.ds4.json | grep "\"step\":6,"
    {"step":6,"selected":{"id":4157,"text":"52","bytes":[53,50]},"top_logprobs":[{"token":{"id":4157,"text":"52","bytes":[53,50]},"logit":44.3264122,"logprob":-9.7539878e-06},{"token":{"id":926,"text":"16","byt
es":[49,54]},"logit":32.3199005,"logprob":-12.0065212},{"token":{"id":2012,"text":"34","bytes":[51,52]},"logit":30.0317593,"logprob":-14.2946625},{"token":{"id":1328,"text":"50","bytes":[53,48]},"logit":29.7194
824,"logprob":-14.6069393},{"token":{"id":4287,"text":"51","bytes":[53,49]},"logit":29.5467281,"logprob":-14.7796936},{"token":{"id":4414,"text":"53","bytes":[53,51]},"logit":29.1445732,"logprob":-15.1818485},{
"token":{"id":3180,"text":"42","bytes":[52,50]},"logit":28.6198616,"logprob":-15.7065601},{"token":{"id":4364,"text":"54","bytes":[53,52]},"logit":28.4588375,"logprob":-15.8675842},{"token":{"id":1671,"text":"3
3","bytes":[51,51]},"logit":28.2291546,"logprob":-16.0972672},{"token":{"id":5863,"text":"71","bytes":[55,49]},"logit":28.0653591,"logprob":-16.2610626},{"token":{"id":3286,"text":"41","bytes":[52,49]},"logit":
27.9963379,"logprob":-16.3300838},{"token":{"id":2315,"text":"55","bytes":[53,53]},"logit":27.9557056,"logprob":-16.3707161},{"token":{"id":2372,"text":"46","bytes":[52,54]},"logit":27.9028664,"logprob":-16.423
5554},{"token":{"id":2597,"text":"78","bytes":[55,56]},"logit":27.7379379,"logprob":-16.5884838},{"token":{"id":1173,"text":"24","bytes":[50,52]},"logit":27.5626965,"logprob":-16.7637253},{"token":{"id":2491,"t
ext":"47","bytes":[52,55]},"logit":27.5278931,"logprob":-16.7985287},{"token":{"id":929,"text":"14","bytes":[49,52]},"logit":27.4220829,"logprob":-16.9043388},{"token":{"id":907,"text":"13","bytes":[49,51]},"lo
git":27.4163208,"logprob":-16.9101009},{"token":{"id":5817,"text":"73","bytes":[55,51]},"logit":27.3305702,"logprob":-16.9958515},{"token":{"id":3661,"text":"56","bytes":[53,54]},"logit":27.2389526,"logprob":-1
7.0874691}]},

So far I have not been able to identify the reason for the discrepancy.
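One way to quantify the discrepancy between the two dumps is to compare the logit margin between "52" and "50" at step 6. The sketch below is only illustrative: `logit_margin` is a hypothetical helper, and the two records are trimmed copies of the step-6 entries shown above, named after the file each one came from.

```python
def logit_margin(step_record, token_a, token_b):
    """Return logit(token_a) - logit(token_b) from one step's top_logprobs list."""
    logits = {e["token"]["text"]: e["logit"] for e in step_record["top_logprobs"]}
    return logits[token_a] - logits[token_b]

# Trimmed from /tmp/long_context_story.ds4.json (the dump that selects "50").
story_step = {"step": 6, "top_logprobs": [
    {"token": {"id": 1328, "text": "50"}, "logit": 39.0537186},
    {"token": {"id": 4157, "text": "52"}, "logit": 38.4644012},
]}

# Trimmed from /tmp/long_context_story_cpu.ds4.json (the dump that selects "52").
story_cpu_step = {"step": 6, "top_logprobs": [
    {"token": {"id": 4157, "text": "52"}, "logit": 44.3264122},
    {"token": {"id": 1328, "text": "50"}, "logit": 29.7194824},
]}

print("story     margin (52 - 50):", logit_margin(story_step, "52", "50"))
print("story_cpu margin (52 - 50):", logit_margin(story_cpu_step, "52", "50"))
```

The first dump prefers "50" by well under one logit, while the second prefers "52" by more than fourteen, so the two runs disagree by a wide margin rather than by rounding noise.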

Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found.

Request parsing (parse_responses_request, parse_responses_input):
- Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction).
- Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss.
- Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/reasoning_text); rejects non-text modalities (input_image/file/audio) instead of accepting an empty prompt.
- Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block.
- Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in.
- Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported).

Output (responses_sse_*, responses_final_response):
- Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed.
- Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length.
- Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done.
- Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed".
- Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done.

Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.

Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.

Measured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.

Correctness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.

Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.

Correctness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.

Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.

Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance.

Tested on 0.180 with:
- make cpu
- make -B cuda-spark
- make cuda-regression
- ./ds4_test --server --metal-kernels
- ./ds4_test --logprob-vectors --tool-call-quality
- ds4-bench ctx-alloc 32768, 250000, and 1000000
- ds4-server --ctx 1000000 startup smoke

(cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)

Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.

Return a 400 error with error type "context_exceeded" when prompt tokens exceed context size. The response includes both n_prompt_tokens and n_ctx fields so clients can determine exactly why the request failed and how far over the limit they went.

Error response format:

    {
      "error": {
        "message": "Prompt tokens (N) exceeds context size (M)",
        "type": "context_exceeded",
        "n_prompt_tokens": N,
        "n_ctx": M
      }
    }
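To make the error shape concrete, here is a minimal sketch of how a server might build that body and how a client might read the overshoot from it. The function names `context_exceeded_error` and `tokens_over_limit` are hypothetical and are not taken from ds4-server; only the field names and the 400 status come from the description above.

```python
def context_exceeded_error(n_prompt_tokens, n_ctx):
    """Return (status, body) for an oversized prompt, or None if it fits."""
    if n_prompt_tokens <= n_ctx:
        return None
    body = {
        "error": {
            "message": f"Prompt tokens ({n_prompt_tokens}) exceeds context size ({n_ctx})",
            "type": "context_exceeded",
            "n_prompt_tokens": n_prompt_tokens,
            "n_ctx": n_ctx,
        }
    }
    return 400, body


def tokens_over_limit(body):
    """Client side: how many tokens must be trimmed to fit the context."""
    err = body["error"]
    return err["n_prompt_tokens"] - err["n_ctx"]


status, body = context_exceeded_error(40000, 32768)
print(status, tokens_over_limit(body))
```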

dwarfstar is typoed to drawfstar

fix typo in readme

Render tool result bodies as raw text so file contents containing angle brackets reach the model unchanged. Escape only the tool_result closing sentinel, and keep the imatrix prompt renderer mirror in sync. Fixes antirez#152

ds4-server auto-injects "## Tools" and the per-request tool-schema listing into the system content of every chat-style request that declares tools. Place that block at the head of the system region, before any client-supplied system messages, so the dynamic tail of the client's system prompt (commonly a per-request timestamp or agent-id emitted by an SDK) sits at the very end of the system region, immediately before <｜User｜>. That position is what --kv-cache-boundary-trim-tokens subtracts from when computing the cached cut. With tool schemas at the head, trim can chop only the variable bytes and keep the much larger tool-schema region inside the cached prefix, so cross-session lookups hit instead of always missing on the dynamic tail. The model still sees the same content in the same role; only the order inside the system block changes. Includes a unit test asserting that the tool block is rendered before the client's system content and that the client content stays immediately before the user marker.

The disk cache eviction score is (hits + 1) * tokens / file_size. hits is monotonic: it only ever goes up on a successful disk reuse. That is fine while the workload stays the same, but it locks once-popular files in place after a prompt schema change (system prompt rewritten, tools removed, model swapped, etc.). The stale file can no longer match anything, so it never accumulates new hits, but its existing bonus keeps it well above every freshly-stored entry indefinitely.

Add an optional time-based decay so the bonus erodes when an entry goes untouched. When --kv-cache-hit-half-life-days N is set, the score uses

    effective_hits = hits * 2 ** -((now - last_used) / half_life)

so a matching prompt that refreshes last_used keeps the entry hot, and an entry that stops getting hits gradually falls toward the (0 + 1) baseline that protects fresh stores. Default is 0, which disables decay and preserves the current behavior exactly.
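The decayed score can be sketched in a few lines. This is an illustration of the formula quoted in the commit message, not the project's code: the function name `eviction_score` is hypothetical, and it assumes `now` and `last_used` are Unix timestamps in seconds while the half-life is given in days.

```python
SECONDS_PER_DAY = 86400.0

def eviction_score(hits, tokens, file_size, now, last_used, half_life_days=0.0):
    """(effective_hits + 1) * tokens / file_size, with optional hit decay.

    half_life_days == 0 disables decay, matching the described default.
    """
    if half_life_days > 0:
        age_days = max(0.0, (now - last_used) / SECONDS_PER_DAY)
        effective_hits = hits * 2 ** -(age_days / half_life_days)
    else:
        effective_hits = hits
    return (effective_hits + 1) * tokens / file_size


week = 7 * SECONDS_PER_DAY
# A once-hot entry (15 hits) untouched for exactly one half-life.
print(eviction_score(15, 2048, 1024, now=week, last_used=0, half_life_days=7))
# The same entry with decay disabled.
print(eviction_score(15, 2048, 1024, now=week, last_used=0))
```

After one half-life the hit bonus halves from 15 to 7.5, so the score drops from 32.0 to 17.0; with enough idle time it approaches the 2.0 baseline that a fresh zero-hit entry of the same density would get.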

…nd-system

PR antirez#29 added useful cache usage reporting, but its OpenAI-compatible usage shape reported cached_tokens as cache reads plus newly-prefilled cache writes. OpenAI defines cached_tokens as prompt tokens retrieved from cache, so including writes makes a future reusable suffix look like an actual cache hit. Keep cache_write_tokens as a DS4 extension and report cached_tokens as read hits only for Chat/Completions and Responses usage. Anthropic usage already has separate standard read/create fields and is left unchanged.
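The distinction can be illustrated with a small sketch of a Chat Completions style usage object. `prompt_tokens_details.cached_tokens` is the field OpenAI documents; where ds4-server actually places `cache_write_tokens` is not stated here, so nesting it alongside `cached_tokens` is an assumption, and `build_usage` is a hypothetical name.

```python
def build_usage(prompt_tokens, completion_tokens, cache_read_tokens, cache_write_tokens):
    """Report cache reads as cached_tokens; keep writes in a separate extension field."""
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "prompt_tokens_details": {
            # Read hits only: tokens actually served from the cache.
            "cached_tokens": cache_read_tokens,
            # DS4 extension (placement assumed): tokens newly written to the cache.
            "cache_write_tokens": cache_write_tokens,
        },
    }


usage = build_usage(prompt_tokens=1200, completion_tokens=80,
                    cache_read_tokens=1000, cache_write_tokens=200)
print(usage["prompt_tokens_details"])
```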

…it-decay

PR antirez#162 introduced the right score shape by decaying only the hit bonus while preserving the token/byte baseline. Make that behavior unconditional so stale once-hot checkpoints cannot remain pinned unless the operator opts in. Remove the extra CLI knob, keep the half-life visible in startup logs, and add an eviction test for same-density fresh checkpoints versus stale high-hit files.

Fixes antirez#161.

Add language/prose calibration tasks, include ds4-eval benchmark prompts without answer keys, and expand long-context prompts with code synthesis, agent transcript replay, trace diagnosis, prose fact recovery, document comparison, delayed constraints, and a small needle task. Regenerate the rendered prompt files and manifest so the corpus stays reproducible.

Cold disk KV checkpoints should maximize reuse across independent agent sessions, not preserve the variable user task. The reusable boundary is the last rendered <User> special token before the first <Assistant> special token: this keeps stable user-role scaffolding such as Codex environment_context, while avoiding later multi-turn user markers that belong to conversation history and are better handled by live/continued checkpoints. Implement the anchor at token level, using the exact chat special token IDs, so the saved prefix is the same boundary the model sees. If no valid chat anchor is present, keep the existing trim/alignment heuristic for non-chat or too-short prompts. Checked with ./ds4_test --server and real fresh-session pairs for pi, Claude Code, opencode, and Codex. The second run reused disk prefixes of 1188/1215 tokens for pi, 23164/23973 for Claude Code, 520/551 plus 10725/10756 for opencode's two stages, and 11554/11584 for Codex Responses, confirming that Codex's stable environment_context is now included while task text is left out.
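The anchor rule described above can be sketched over a plain list of token IDs. This is a hedged illustration only: `cold_checkpoint_anchor` is a hypothetical name, the special-token IDs are made up, and returning `None` when no `<Assistant>` token is present (so the caller falls back to the older heuristic) is an assumption. Whether the saved prefix includes the anchor token itself is also not specified here, so the function returns only its index.

```python
def cold_checkpoint_anchor(tokens, user_id, assistant_id):
    """Index of the last <User> token before the first <Assistant> token, or None."""
    try:
        first_assistant = tokens.index(assistant_id)
    except ValueError:
        return None  # no chat anchor; caller keeps the trim/alignment heuristic
    for i in range(first_assistant - 1, -1, -1):
        if tokens[i] == user_id:
            return i
    return None


USER, ASSISTANT = 100, 200  # placeholder IDs, not the real special tokens
# <User> scaffolding, <User> task, <Assistant>, then a later multi-turn <User>.
prompt = [1, USER, 5, 6, USER, 7, ASSISTANT, 8, USER, 9, ASSISTANT]
print(cold_checkpoint_anchor(prompt, USER, ASSISTANT))
```

The later `<User>` marker at index 8 is ignored because it comes after the first `<Assistant>` token and therefore belongs to conversation history.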

Add a --cors server flag for browser JavaScript clients that call ds4-server directly from another origin. When enabled, normal JSON responses, errors, and SSE streams include broad Access-Control-Allow-* headers, and OPTIONS preflight requests return 204 No Content. The flag only changes HTTP headers. It deliberately does not change the bind address, so LAN exposure remains explicit via --host 0.0.0.0. This keeps the useful part of PR antirez#94 while avoiding the surprising localhost-to-all-interfaces behavior. Checked with ./ds4_test --server and real curl requests for OPTIONS, GET /v1/models, and streaming /v1/chat/completions. Also verified that responses without --cors omit the CORS headers and that default listening remains 127.0.0.1.

Remove SuperGPQA rows with broken upstream keys from ds4-eval instead of locally re-keying them. Replace them with cleaner self-contained SuperGPQA items, document the audited SuperGPQA slice, and regenerate the imatrix calibration prompts so benchmark-derived calibration no longer includes the broken items.

Add a q key path that exits the TUI cleanly and prints the normal plain-text report, show the pause/quit hints in the TUI title, and keep the final report compact by omitting question titles and ANSI escapes. Also clamp per-case generation to the remaining context budget so small-ctx runs stop or skip individual cases instead of aborting the whole benchmark on KV capacity errors.

Refs antirez#167.

DSML tool calls are executable only after the assistant closes the thinking stanza. The live decoder now skips tool-marker detection while thinking is open, and the final parser searches for executable DSML only after the last </think> when thinking mode is active. This prevents a DSML block inside reasoning from being duplicated as both thinking text and a structured tool call.

We were not able to prove from traces that the model actually made this exact mistake in the reported case, because the trace gives text rather than the original sampled-token/KV state. It is nevertheless possible, and it matches the reported shape of the bug. We did see evidence that the model can discuss tool usage and DSML syntax inside the thinking stanza, so this is a conservative protocol fix: it does not force the end of thinking, and it does not execute tool calls while still inside thinking.

It remains unclear what the model would do if it truly mis-emits a tool call before </think>: it might recover by closing thinking later, or it might not. A follow-up commit should address the likely root source by changing sampler behavior itself instead of relying only on protocol filtering.
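The final-parser rule can be sketched as a string search restricted to the text after the last `</think>`. The names below are hypothetical and the `<<TOOL>>` marker is a placeholder, since the real DSML syntax is not shown here. Treating an unclosed thinking stanza as having no executable region is an assumption consistent with the description, not a confirmed detail.

```python
THINK_CLOSE = "</think>"

def executable_region(text, thinking_active):
    """Return the part of the output where tool calls may be executed."""
    if not thinking_active:
        return text
    pos = text.rfind(THINK_CLOSE)
    if pos < 0:
        return ""  # thinking never closed: nothing is executable
    return text[pos + len(THINK_CLOSE):]


def find_tool_call(text, thinking_active, marker="<<TOOL>>"):
    """Return the text from the first marker in the executable region, or None."""
    region = executable_region(text, thinking_active)
    idx = region.find(marker)
    return None if idx < 0 else region[idx:]


out = "I could call <<TOOL>>draft here.</think>Done. <<TOOL>>real_call"
print(find_tool_call(out, thinking_active=True))
```

Only the marker after `</think>` is returned, so a tool call that the model merely discusses inside its reasoning is not parsed as a structured call.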

Use a shared DS4_DEFAULT_MIN_P of 0.05 for sampled generation. The default keeps top_p at 1.0 and uses min_p as the active filter, preventing very low relative-probability tokens from entering the sampling set while preserving plausible alternatives. Apply the same default to the CLI, server, and ds4-eval, expose --min-p in CLI/eval, and make the server thinking-mode fixed sampler use min_p=0.05 instead of disabling min-p filtering. Explicit non-thinking API min_p values continue to be honored. Builds checked with make -B ds4 ds4-server ds4-eval. Server unit tests pass; full make test could not run because another DS4 process was already holding /tmp/ds4.lock.
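For readers unfamiliar with min-p filtering, the sketch below shows the commonly used rule: keep every token whose probability is at least `min_p` times the probability of the most likely token, then renormalize. It is a generic illustration, not ds4's sampler; `min_p_filter` is a hypothetical name and the exact renormalization ds4 performs is an assumption.

```python
def min_p_filter(probs, min_p=0.05):
    """Drop tokens with probability below min_p * max probability, then renormalize."""
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


probs = {"a": 0.60, "b": 0.37, "c": 0.02, "d": 0.01}
# Threshold is 0.05 * 0.60 = 0.03, so "c" and "d" are removed.
print(min_p_filter(probs))
```

Because the threshold scales with the top probability, a confident distribution prunes aggressively while a flat one keeps many plausible alternatives, which is the behavior the commit message describes.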

Add --chdir so service managers and external launchers can run ds4-server from any current directory while still resolving relative runtime files, including metal/*.metal, from the project or install tree they select. This intentionally keeps path handling explicit instead of adding per-loader executable-relative probing: after chdir succeeds, model paths, trace paths, KV cache directories, and Metal sources all follow the same normal relative-path rule. Fixes antirez#49

changed:
- used intrinsics from llama-cpp repo
- no fast math
- WMMA kernels for rocm
- minimal changes in the vanilla cuda path

Test results still have issues with the long context test

```
$ ./ds4_test
long-context:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: long-context prefill 0/30474
ds4-test: long-context prefill 8192/30474
ds4-test: long-context prefill 16384/30474
ds4-test: long-context prefill 24576/30474
ds4-test: long-context prefill 30474/30474
ds4-test: long-context wrong assignment for Alice: got 50 expected 52
tests/ds4_test.c:354: assertion failed: test_output_has_fact(text, &test_long_facts[i])
long-context: ERR
tool-call-quality:
ds4-test: tool-call quality fast path
ds4-test: tool-call quality exact path
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
tool-call-quality: OK
logprob-vectors:
ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
ds4-test: vector short_italian_fact
ds4-test: vector short_code_completion
ds4-test: vector short_reasoning_plain
ds4-test: vector long_memory_archive
ds4-test: vector long_code_audit
logprob-vectors: OK
metal-kernels:
ds4: CUDA registered 0.00 GiB model mapping for device access
metal-kernels: OK
server:
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-evict-test.ZkrbI4/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-current-store-evict-test.SPxImn/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=4096 hits=0 size=0.00 MiB file=/tmp/ds4-kv-oversize-store-evict-test.WQKFyJ/2222222222222222222222222222222222222222.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=2048 hits=15 size=0.00 MiB file=/tmp/ds4-kv-stale-hit-evict-test.h3s2dj/1111111111111111111111111111111111111111.kv
0517 13:00:04 ds4-server: kv cache evicted reason=disk-cache-full tokens=512 hits=0 size=0.00 MiB file=/tmp/ds4-kv-live-prefix-test.J3pmsF/1111111111111111111111111111111111111111.kv
server: OK
ds4 tests: 1 failure(s)
$
```

Strangely enough the cpu version produces similar results for the logprobs

antirez · 2026-05-17T16:42:03Z

@alantsev please check the CUDA fix for the long-context test; try asking your agent to fix the ROCm path in the same way if possible.

gundemirbas · 2026-05-19T20:52:21Z

Any progress?

alantsev · 2026-05-19T23:36:45Z

huge progress in understanding :) - will fix it this weekend

mitsuhiko and others added 30 commits May 11, 2026 12:30

feat(server): report KV cache usage

0ca2e28

feat(server): report Anthropic cache usage

38800bf

README: separate motivations.

c5ef7ac

Merge branch 'pr-91-responses' into responses-api

2174611

Tighten Responses tool_search replay

6396966

Fix Responses tool checkpoint cache reuse

a01bf1d

Fix Responses API live continuation

acb40bf

metal: cover q4 expert tensors in model views

2a7a5f3

Skip tool checkpoint canonicalization for exact DSML replay

b4c5f7c

Merge responses-api

e88a71e

Use visible live checkpoints for toolless thinking

5453ad0

Clarify server progress logs

646798f

Add Anthropic live tool continuation

43535e1

Revert "metal: cover q4 expert tensors in model views"

67e6146

This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.

Tag Responses API server logs

0083475

Recover Responses replays without hidden reasoning

0610591

Stream Anthropic tool calls live

94c1f38

Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.

fix typo in readme

741d0cc

dwarfstar is typoed to drawfstar

Merge pull request antirez#155 from kernelzeroday/main

98593ec

fix typo in readme

Fix typos in README.md

f6fa52b

Merge branch 'pr-150-context-error' into merge-pr-150-standard-context

157873b

antirez and others added 26 commits May 15, 2026 19:58

Polish ds4-eval TUI and report

336fbd6

Preserve literal tool result text

950e8e6

Render tool result bodies as raw text so file contents containing angle brackets reach the model unchanged. Escape only the tool_result closing sentinel, and keep the imatrix prompt renderer mirror in sync. Fixes antirez#152

Merge branch 'main' into cache-report

822ab36

Merge pull request antirez#68 from unsaltedbutter-ai/feat/tools-prepe…

03a43b8

…nd-system

Merge pull request antirez#29 from mitsuhiko/cache-report

e5b9463

Merge pull request antirez#162 from unsaltedbutter-ai/feat/kv-cache-h…

277c198

…it-decay

merge from main@upstream

88906ca

merge from rocm@upstream

1e41fb3

Clean up ds4-eval benchmark items

e258d51

Auto-size ds4-eval context

4441e56

Add COMPSEC ds4-eval cases

48c4d4d

Refine COMPSEC eval localization cases

011aa67

merge from main@upstream

33a9f32

alantsev changed the title ~~Rocm~~ rocm: add wmma indexer support May 17, 2026


alantsev wants to merge 61 commits into antirez:rocm from alantsev:rocm


10 participants
