Compressed KV disk cache via streaming LZ4 by jasontitus · Pull Request #186 · antirez/ds4

jasontitus · 2026-05-18T06:14:30Z

Summary

Add optional LZ4 compression to the KV disk cache. ~1.7× more cache entries in the same disk budget on realistic prompt sizes (ratio 1.56–1.74× on the measured A/B below), with no user-visible cache-hit cost and no measurable inference regression.

Compression is streamed through a cookie wrapper around ds4_session_save_payload / load_payload in ds4_kvstore.c, so peak extra RAM during a save is 2 * threads * chunk_size (~256 MiB at the default threads=8, chunk=16 MiB) instead of the full uncompressed payload (1–16 GiB on long contexts).
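
The RAM bound quoted above is plain arithmetic, sketched here in C. The function name is hypothetical and not part of the patch. Reading the factor 2 as one raw plus one compressed buffer per worker is an assumption; the text only states the formula.

```c
#include <stdint.h>

/* Upper bound on extra RAM held during a streamed save, per the formula in
 * the description: 2 * threads * chunk_size. Hypothetical helper; the
 * "raw + compressed buffer per worker" reading of the 2 is an assumption. */
uint64_t ex_peak_extra_bytes(uint32_t threads, uint64_t chunk_size) {
    return 2ull * (uint64_t)threads * chunk_size;
}
```

With threads=8 and a 16 MiB chunk this gives 2 * 8 * 16 MiB = 256 MiB, independent of how large the uncompressed payload is.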

What's in the change

ds4_kvstore.c — kv_lz4_writer_open / kv_lz4_reader_open cookie wrappers around the payload region; chunked, parallel LZ4_compress_default. Shared kvstore, so both ds4-server and ds4-agent get the codec (the agent's sysprompt.kv and /save checkpoints are now compressed too — see numbers below).
Format — KV_CACHE_VERSION 1 → 2. Codec byte at header offset 7, chunk size at offset 20 (both were reserved in v1). Loader accepts v1 and v2.
CLI (ds4-server) — --kv-cache-compression-threads N. N=0 writes uncompressed v2 files. Default min(8, online cpus). Env DS4_KV_CACHE_COMPRESSION_THREADS.
lz4.c / lz4.h — vendored from upstream lz4 v1.10.0 (BSD-2-Clause), linked into ds4-server, ds4-agent, and the _cpu variants.
Makefile — adds lz4.o to the link lines that already include ds4_kvstore.o, and lz4.h to the header dependency lists.
misc/COMPRESSED_KV_CACHE.md — design note.
ds4.c, ds4.h, ds4_cli.c, ds4_bench.c, ds4_eval.c, ds4_cuda.cu, ds4_metal.m and Metal kernels are not modified.
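
A minimal sketch of how the two v2 header fields could be written and read back. Only the offsets 7 and 20 come from the description. The 32-byte header size, the 32-bit little-endian chunk size, the codec values, and every `ex_`/`EX_` name are assumptions made for illustration and may not match ds4_kvstore.c.

```c
#include <stdint.h>

enum { EX_HDR_BYTES = 32,   /* assumed header size, illustration only */
       EX_OFF_CODEC = 7,    /* codec byte offset, from the description */
       EX_OFF_CHUNK = 20 }; /* chunk size offset, from the description */
enum { EX_CODEC_NONE = 0, EX_CODEC_LZ4 = 1 }; /* hypothetical values */

/* Store the codec byte and the chunk size (assumed u32, little-endian). */
void ex_put_codec(uint8_t hdr[EX_HDR_BYTES], uint8_t codec, uint32_t chunk_size) {
    hdr[EX_OFF_CODEC] = codec;
    for (int i = 0; i < 4; i++)
        hdr[EX_OFF_CHUNK + i] = (uint8_t)(chunk_size >> (8 * i));
}

/* Read both fields back from a header buffer. */
void ex_get_codec(const uint8_t hdr[EX_HDR_BYTES], uint8_t *codec, uint32_t *chunk_size) {
    uint32_t v = 0;
    for (int i = 0; i < 4; i++)
        v |= (uint32_t)hdr[EX_OFF_CHUNK + i] << (8 * i);
    *codec = hdr[EX_OFF_CODEC];
    *chunk_size = v;
}
```

If the reserved bytes were zero in v1 files (an assumption), a v1 header read this way comes back as codec 0 with chunk size 0, which is one way a single loader could accept both versions.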

Test env + reproduce

M1 Ultra, 128 GB, macOS 25.4, Metal backend, model DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf.

git fetch && git checkout kv-cache-lz4
make clean && make && make ds4_test

# Correctness
./ds4_test --server                              # server: OK   ds4 tests: ok   (~0.5 s)

# Speed
./ds4-bench -m ds4flash.gguf \
    --prompt-file speed-bench/promessi_sposi.txt \
    --ctx-start 2048 --ctx-max 16384 --step-incr 2048 --gen-tokens 128 \
    --csv /tmp/after.csv

# Compressed cache, end-to-end
./ds4-server --ctx 400000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 102400 \
    --kv-cache-cold-max-tokens 300000 --kv-cache-continued-interval-tokens 0
# Send any long-ish prompt twice; second send hits the cache.
# Startup: codec=lz4 threads=8 chunk=16384K.
# Stores log: codec=lz4 size=… comp=…x save=… ms.   Hits log: load=… ms.

Speed — no inference regression

A/B with ds4-bench at the merge base vs HEAD, same sweep as above:

ctx	prefill before	prefill after	gen before	gen after
2048	237.38	237.40	20.23	20.15
8192	243.34	243.62	19.81	20.07
16384	228.45	227.85	19.70	19.62

Prefill within ±0.26 %, gen within run-to-run noise. ds4-bench / ds4_bench.o are bit-identical at HEAD vs base — the change adds no inference-path code.

Cache feature — A/B vs `codec=none`

Same ds4-server instance, three prompt sizes, miss + hit pairs:

prompt	on-disk none	on-disk lz4	ratio	save lz4	load lz4
3.5 K	69 MiB	44 MiB	1.56×	67 ms	22.1 ms
17 K	246 MiB	144 MiB	1.71×	135 ms	66.1 ms
47 K	640 MiB	367 MiB	1.74×	316 ms	163 ms
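
To relate the on-disk sizes to the "more entries in the same budget" claim, here is a small arithmetic sketch. It uses the 102400 budget from the reproduce command and treats MB and MiB as the same unit, which is an assumption. The helper name is hypothetical, and header overhead and eviction policy are ignored.

```c
#include <stdint.h>

/* Whole cache entries of a given on-disk size that fit in a disk budget.
 * Both arguments use the same unit. Hypothetical helper for illustration. */
uint64_t ex_entries_in_budget(uint64_t budget, uint64_t entry_size) {
    return entry_size ? budget / entry_size : 0;
}
```

For the 47 K prompt this gives 102400 / 640 = 160 uncompressed entries against 102400 / 367 = 279 compressed ones, about 1.74× as many.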

End-to-end TTLB, miss → hit:

prompt	none miss → hit	lz4 miss → hit
3.5 K	18.31 s → 0.67 s	18.22 s → 0.66 s
17 K	74.47 s → 0.79 s	73.82 s → 0.81 s
47 K	227.7 s → 1.10 s	225.7 s → 1.15 s

Save is post-generation, so the +20–45 ms never hits user-visible TTLB. Cache-hit TTLB is identical to uncompressed within noise.

Agent — same codec, same story

ds4-agent shares ds4_kvstore, so the system-prompt cache (~/.ds4/kvcache/sysprompt.kv) and /save session files also go through the codec. Agent startup with a 1274-token system prompt on the same machine, four-cell A/B (codec via DS4_KV_CACHE_COMPRESSION_THREADS):

	wall	cache file	codec
cold (no cache, generate + save)	7.01 s	—	—
warm + uncompressed cache	1.32 s	40 MiB	NONE
warm + compressed cache	1.29 s	26 MiB	LZ4

Same conclusion as the server A/B: compression saves disk (40 → 26 MiB, about 1.5×) with no measurable cost on the warm load. The 5.4× faster startup vs cold is the cache being a cache, not the codec.

Future: HC via out-of-band recompressor

The codec byte reserves room for DS4_KVSTORE_CODEC_LZ4HC. HC reaches ~1.8× ratio (vs ~1.6× for -1) and decompresses faster per byte, so HC-encoded hits would shave another ~25–30 % off the load times above. Measured with lz4 -b (lz4 1.10):

	M1 Ultra	M4 Pro
-1 decompress, 33 MiB	4980 MB/s	6519 MB/s
HC decompress, 33 MiB	6501 MB/s	8716 MB/s
HC compress -T1, 2.1 GiB	9 MB/s (~245 s)	11 MB/s (~205 s)
HC compress -T8, 2.1 GiB	68 MB/s (~32 s)	78 MB/s (~28 s)

Too slow for inline save; fine for a sidecar that re-encodes settled files during idle. The chunked v2 format would also permit multi-threaded decompress — implementation-ready, not measured here. No further format change needed when this lands.
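
As a rough cross-check of the decompress rows in the table, the fraction of decode time saved at a higher throughput can be estimated as below. This covers decompression time only and does not model file I/O or the smaller HC file size, so it is not a prediction of end-to-end load time. The function name is hypothetical.

```c
/* Fraction of decode time saved when throughput rises from slow to fast
 * (same units for both). Time is size / throughput, so the ratio is
 * independent of the payload size. Hypothetical helper. */
double ex_time_saved(double slow_mb_s, double fast_mb_s) {
    return 1.0 - slow_mb_s / fast_mb_s;
}
```

With the table's numbers, 4980 → 6501 MB/s on the M1 Ultra works out to about 23 % less decode time, and 6519 → 8716 MB/s on the M4 Pro to about 25 %.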

jasontitus · 2026-05-20T06:34:12Z

Rebased onto the new agent changes.

The disk KV cache payload is the only large section of a .kv file. Wrap the engine's ds4_session_save_payload(FILE*) / load_payload(FILE*) with a funopen/fopencookie cookie that chunks bytes into fixed 16 MiB raw blocks, compresses each block with LZ4_compress_default on a per-batch fork/join pool, and writes (u64 uncompressed_total, u32 chunk_count, [chunk records]) into the existing payload region. Reader is symmetric.

Format: KV_CACHE_VERSION bumps from 1 to 2. The codec byte goes in the header at offset 7 (was reserved) and the chunk size at offset 20 (was reserved). Both bytes existed in v1; the on-disk header shape is unchanged. Loader accepts v1 and v2; writer always writes v2.

CLI: --kv-cache-compression-threads N. N=0 disables (writes uncompressed v2 files); default min(8, online cpus), the same heuristic ds4_threads_init uses in ds4.c. Env override DS4_KV_CACHE_COMPRESSION_THREADS for parity with DS4_THREADS.

Compression lives in ds4_kvstore.c so both ds4-server and ds4-agent share the codec. ds4_agent.c writes uncompressed for now (its payloads are typically small).

Streaming via cookie wrappers, not via the snapshot APIs in ds4.h, so peak extra RAM during a save is 2*N*chunk_size (~256 MiB at N=8 / 16 MiB) instead of the full uncompressed payload (1-16 GiB on long contexts). Engine APIs in ds4.h are not touched.

The load path uses an eager-framing-peek variant of kv_lz4_reader_open so the call site learns the uncompressed payload size before the engine starts reading, and passes that as the read budget to ds4_session_load_payload. The engine treats its remaining argument as a hard upper bound and rejects both shortfall ("truncated session payload") and trailing bytes, so neither hdr.payload_bytes (compressed) nor UINT64_MAX is correct.

ds4_agent.c, ds4_server.c, and tests/ds4_test.c are touched only to pass the new codec and chunk_size arguments to ds4_kvstore_fill_header.
ds4.c, ds4.h, ds4_cli.c, ds4_bench.c, ds4_eval.c, ds4_cuda.cu, ds4_metal.m and Metal kernels are not modified. lz4.c / lz4.h vendored from upstream lz4 v1.10.0 (BSD-2-Clause), same treatment as rax.c and linenoise.c. Linked into ds4-server, ds4-server_cpu, ds4-agent, and ds4-agent_cpu.
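
The chunking step described above can be sketched as a small accumulator that cuts an arbitrary stream of writes into fixed-size blocks. This is an illustration under assumptions, not the code in ds4_kvstore.c: all `ex_` names are hypothetical, the callback stands in for the compress-and-write step, and in a cookie design `ex_chunker_write` would be the body of the write function handed to funopen or fopencookie.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Called once per full block, and once for a trailing short block. */
typedef void (*ex_block_fn)(const char *block, size_t len, void *user);

typedef struct {
    char *buf;          /* raw block currently being filled */
    size_t cap, used;   /* block size and bytes buffered so far */
    ex_block_fn emit;   /* stand-in for "compress and write this block" */
    void *user;
} ex_chunker;

/* Demo sink: counts blocks instead of compressing them. */
typedef struct { size_t blocks, bytes, last_len; } ex_stats;

void ex_count_block(const char *block, size_t len, void *user) {
    ex_stats *st = user;
    (void)block;
    st->blocks++;
    st->bytes += len;
    st->last_len = len;
}

int ex_chunker_init(ex_chunker *c, size_t block_size, ex_block_fn emit, void *user) {
    if (block_size == 0) return -1;
    c->buf = malloc(block_size);
    if (!c->buf) return -1;
    c->cap = block_size;
    c->used = 0;
    c->emit = emit;
    c->user = user;
    return 0;
}

/* Accept n bytes, emitting every block that becomes full. Returns n. */
size_t ex_chunker_write(ex_chunker *c, const char *p, size_t n) {
    size_t done = 0;
    while (done < n) {
        size_t take = c->cap - c->used;
        if (take > n - done) take = n - done;
        memcpy(c->buf + c->used, p + done, take);
        c->used += take;
        done += take;
        if (c->used == c->cap) {
            c->emit(c->buf, c->used, c->user);
            c->used = 0;
        }
    }
    return done;
}

/* Flush the trailing short block, if any, and release the buffer. */
void ex_chunker_finish(ex_chunker *c) {
    if (c->used) c->emit(c->buf, c->used, c->user);
    free(c->buf);
    c->buf = NULL;
    c->used = 0;
}
```

Writing 10 bytes through a chunker with a 4-byte block size emits two full blocks during the write and one 2-byte block on finish. Because only one block is buffered per chunker, memory use is tied to the block size rather than to the total payload size.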

jasontitus force-pushed the kv-cache-lz4 branch 2 times, most recently from 734fe79 to 65581ca (May 20, 2026 06:34)

jasontitus force-pushed the kv-cache-lz4 branch from 65581ca to 33f309e (May 20, 2026 15:16)

jasontitus force-pushed the kv-cache-lz4 branch from 33f309e to cdace08 (May 20, 2026 15:29)

jasontitus wants to merge 1 commit into antirez:main from jasontitus:kv-cache-lz4