Compressed KV disk cache via streaming LZ4 #186

Open
jasontitus wants to merge 1 commit into antirez:main from jasontitus:kv-cache-lz4
@jasontitus jasontitus commented May 18, 2026

Summary

Add optional LZ4 compression to the KV disk cache. This fits ~1.7× more cache entries in the same disk budget on realistic prompt sizes (ratio 1.56–1.74× in the measured A/B below), with no user-visible cache-hit cost and no measurable inference regression.

Compression is streamed through a cookie wrapper around ds4_session_save_payload / load_payload in ds4_kvstore.c, so peak extra RAM during a save is 2 * threads * chunk_size (~256 MiB at the default threads=8, chunk=16 MiB) instead of the full uncompressed payload (1–16 GiB on long contexts).
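
As a quick check of the RAM figure above, the bound can be written as a one-line helper. This is only a sketch: `kv_peak_extra_bytes` is a hypothetical name and not a function from the PR; it just encodes the 2 * threads * chunk_size expression quoted in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the PR): peak extra RAM during a save,
 * using the 2 * threads * chunk_size figure from the description. */
uint64_t kv_peak_extra_bytes(unsigned threads, uint64_t chunk_size)
{
    assert(chunk_size > 0);
    return 2ull * threads * chunk_size;
}
```

With threads=8 and a 16 MiB chunk this evaluates to 256 MiB, matching the number quoted above.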

What's in the change

  • ds4_kvstore.c: kv_lz4_writer_open / kv_lz4_reader_open cookie wrappers around the payload region; chunked, parallel LZ4_compress_default. The kvstore is shared, so both ds4-server and ds4-agent get the codec (the agent's sysprompt.kv and /save checkpoints are now compressed too; see numbers below).
  • Format: KV_CACHE_VERSION 1 → 2. Codec byte at header offset 7, chunk size at offset 20 (both were reserved in v1). Loader accepts v1 and v2.
  • CLI (ds4-server): --kv-cache-compression-threads N. N=0 writes uncompressed v2 files. Defaults to min(8, online CPUs). Environment override: DS4_KV_CACHE_COMPRESSION_THREADS.
  • lz4.c / lz4.h: vendored from upstream lz4 v1.10.0 (BSD-2-Clause), linked into ds4-server, ds4-agent, and the _cpu variants.
  • Makefile: adds lz4.o to the link lines that already include ds4_kvstore.o, and lz4.h to the header dependency lists.
  • misc/COMPRESSED_KV_CACHE.md: design note.
  • ds4.c, ds4.h, ds4_cli.c, ds4_bench.c, ds4_eval.c, ds4_cuda.cu, ds4_metal.m and Metal kernels are not modified.
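
The thread-count rule from the CLI bullet can be sketched roughly as follows. The function name `kv_compression_threads`, the precedence (explicit CLI value over the environment variable), and the fallback on malformed values are all assumptions made for illustration; only the option semantics (N=0 disables, default min(8, online CPUs), DS4_KV_CACHE_COMPRESSION_THREADS override) come from the description.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical resolver, not the PR's code.
 * cli_value < 0 means "flag not given"; 0 means "write uncompressed". */
int kv_compression_threads(int cli_value)
{
    if (cli_value >= 0) return cli_value;      /* assumption: CLI wins */

    const char *env = getenv("DS4_KV_CACHE_COMPRESSION_THREADS");
    if (env && *env) {
        char *end = NULL;
        errno = 0;
        long v = strtol(env, &end, 10);
        if (errno == 0 && *end == '\0' && v >= 0 && v <= INT_MAX)
            return (int)v;
        /* assumption: malformed values fall through to the default */
    }

    long cpus = sysconf(_SC_NPROCESSORS_ONLN); /* online CPUs */
    if (cpus < 1) cpus = 1;
    return cpus < 8 ? (int)cpus : 8;           /* min(8, online cpus) */
}
```

A caller would pass -1 when --kv-cache-compression-threads is absent and treat a result of 0 as "codec=none".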

Test env + reproduce

M1 Ultra, 128 GB, macOS 25.4, Metal backend, model DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf.

git fetch && git checkout kv-cache-lz4
make clean && make && make ds4_test

# Correctness
./ds4_test --server                              # server: OK   ds4 tests: ok   (~0.5 s)

# Speed
./ds4-bench -m ds4flash.gguf \
    --prompt-file speed-bench/promessi_sposi.txt \
    --ctx-start 2048 --ctx-max 16384 --step-incr 2048 --gen-tokens 128 \
    --csv /tmp/after.csv

# Compressed cache, end-to-end
./ds4-server --ctx 400000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 102400 \
    --kv-cache-cold-max-tokens 300000 --kv-cache-continued-interval-tokens 0
# Send any long-ish prompt twice; second send hits the cache.
# Startup: codec=lz4 threads=8 chunk=16384K.
# Stores log: codec=lz4 size=… comp=…x save=… ms.   Hits log: load=… ms.

Speed — no inference regression

A/B with ds4-bench at the merge base vs HEAD, same sweep as above:

| ctx | prefill before | prefill after | gen before | gen after |
|---|---|---|---|---|
| 2048 | 237.38 | 237.40 | 20.23 | 20.15 |
| 8192 | 243.34 | 243.62 | 19.81 | 20.07 |
| 16384 | 228.45 | 227.85 | 19.70 | 19.62 |

Prefill within ±0.26 %, gen within run-to-run noise. ds4-bench / ds4_bench.o are bit-identical at HEAD vs base — the change adds no inference-path code.
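
The ±0.26 % bound can be reproduced from the prefill columns with a relative-change helper. This is a small illustrative sketch; `pct_change` is a hypothetical name and not part of ds4-bench.

```c
#include <assert.h>

/* Hypothetical helper: relative change from `before` to `after`, in percent. */
double pct_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

For the 16384-token row, pct_change(228.45, 227.85) is about -0.26 %, the largest deviation among the three rows shown; the 2048 and 8192 rows come out near +0.01 % and +0.12 %.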

Cache feature — A/B vs codec=none

Same ds4-server instance, three prompt sizes, miss + hit pairs:

| prompt | on-disk none | on-disk lz4 | ratio | save lz4 | load lz4 |
|---|---|---|---|---|---|
| 3.5 K | 69 MiB | 44 MiB | 1.56× | 67 ms | 22.1 ms |
| 17 K | 246 MiB | 144 MiB | 1.71× | 135 ms | 66.1 ms |
| 47 K | 640 MiB | 367 MiB | 1.74× | 316 ms | 163 ms |
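
To connect the ratio column to the "more entries in the same disk budget" claim, here is a rough capacity estimate. It is a sketch under simplifying assumptions: every entry has the same size, per-file overhead is ignored, and `kv_entries_in_budget` is a hypothetical helper rather than anything in ds4_kvstore.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many equally sized entries fit in a disk budget.
 * Both arguments are in MiB; integer division drops the partial entry. */
uint64_t kv_entries_in_budget(uint64_t budget_mib, uint64_t entry_mib)
{
    assert(entry_mib > 0);
    return budget_mib / entry_mib;
}
```

With the 102400 MiB budget from the reproduce command and the 47 K row, this gives 160 entries uncompressed (640 MiB each) versus 279 with LZ4 (367 MiB each), which is about 1.74× as many.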

End-to-end TTLB, miss → hit:

| prompt | none miss → hit | lz4 miss → hit |
|---|---|---|
| 3.5 K | 18.31 s → 0.67 s | 18.22 s → 0.66 s |
| 17 K | 74.47 s → 0.79 s | 73.82 s → 0.81 s |
| 47 K | 227.7 s → 1.10 s | 225.7 s → 1.15 s |

Save is post-generation, so the +20–45 ms never hits user-visible TTLB. Cache-hit TTLB is identical to uncompressed within noise.

Agent — same codec, same story

ds4-agent shares ds4_kvstore, so the system-prompt cache (~/.ds4/kvcache/sysprompt.kv) and /save session files also go through the codec. Agent startup with a 1274-token system prompt on the same machine, four-cell A/B (codec via DS4_KV_CACHE_COMPRESSION_THREADS):

| | wall | cache file | codec |
|---|---|---|---|
| cold (no cache, generate + save) | 7.01 s | | |
| warm + uncompressed cache | 1.32 s | 40 MiB | NONE |
| warm + compressed cache | 1.29 s | 26 MiB | LZ4 |

Same conclusion as the server A/B: compression saves disk (40 → 26 MiB, 1.50× ratio) with no measurable cost on the warm load. The 5.4× faster startup vs cold is the cache being a cache, not the codec.

Future: HC via out-of-band recompressor

The codec byte reserves room for DS4_KVSTORE_CODEC_LZ4HC. HC reaches ~1.8× ratio (vs ~1.6× for -1) and decompresses 1.31–1.34× faster per byte, so HC-encoded hits would shave roughly another 23–25 % off the decompression part of the load times above. Measured with lz4 -b (lz4 1.10):

| | M1 Ultra | M4 Pro |
|---|---|---|
| -1 decompress, 33 MiB | 4980 MB/s | 6519 MB/s |
| HC decompress, 33 MiB | 6501 MB/s | 8716 MB/s |
| HC compress -T1, 2.1 GiB | 9 MB/s (~245 s) | 11 MB/s (~205 s) |
| HC compress -T8, 2.1 GiB | 68 MB/s (~32 s) | 78 MB/s (~28 s) |

Too slow for inline save; fine for a sidecar that re-encodes settled files during idle. The chunked v2 format would also permit multi-threaded decompress — implementation-ready, not measured here. No further format change needed when this lands.
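
Assuming load time is dominated by decompression (an assumption, since the load also includes disk reads), the decompress throughputs in the table convert to a time saving as sketched below. `decompress_time_saved` is a hypothetical helper for the arithmetic only.

```c
#include <assert.h>

/* Hypothetical helper: fraction of decompression time saved when throughput
 * rises from `base_mbps` to `faster_mbps` for the same number of bytes.
 * time is proportional to 1 / throughput, so saved = 1 - base / faster. */
double decompress_time_saved(double base_mbps, double faster_mbps)
{
    assert(base_mbps > 0.0 && faster_mbps > 0.0);
    return 1.0 - base_mbps / faster_mbps;
}
```

For the M1 Ultra row pair (4980 → 6501 MB/s) this is about 0.23, and for the M4 Pro pair (6519 → 8716 MB/s) about 0.25.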

@jasontitus jasontitus force-pushed the kv-cache-lz4 branch 2 times, most recently from 734fe79 to 65581ca on May 20, 2026 06:34
@jasontitus
Author

Rebased onto the new agent changes.

The disk KV cache payload is the only large section of a .kv file.  Wrap
the engine's ds4_session_save_payload(FILE*) / load_payload(FILE*) with a
funopen/fopencookie cookie that chunks bytes into fixed 16 MiB raw blocks,
compresses each block with LZ4_compress_default on a per-batch fork/join
pool, and writes (u64 uncompressed_total, u32 chunk_count, [chunk records])
into the existing payload region.  Reader is symmetric.
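
The chunking arithmetic behind this framing can be sketched as below. The helper names are hypothetical, and the sketch assumes the final chunk may be shorter than the fixed block size. The bound formula mirrors the LZ4_COMPRESSBOUND macro from lz4.h (isize + isize/255 + 16), written out here so the snippet builds without the vendored library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of chunk records for a payload, assuming
 * fixed-size raw blocks with a possibly shorter final block. */
uint32_t kv_chunk_count(uint64_t uncompressed_total, uint32_t chunk_size)
{
    assert(chunk_size > 0);
    return (uint32_t)((uncompressed_total + chunk_size - 1) / chunk_size);
}

/* Worst-case compressed size of one raw block; same formula as
 * LZ4_COMPRESSBOUND in lz4.h, useful for sizing the output buffer
 * handed to LZ4_compress_default. */
uint64_t kv_chunk_bound(uint32_t raw_len)
{
    return (uint64_t)raw_len + raw_len / 255 + 16;
}
```

A 640 MiB payload at 16 MiB per block yields 40 chunk records, and each 16 MiB block needs an output buffer of at most 16843025 bytes.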

Format: KV_CACHE_VERSION bumps from 1 to 2.  The codec byte goes in the
header at offset 7 (was reserved) and the chunk size at offset 20 (was
reserved).  Both bytes existed in v1; the on-disk header shape is
unchanged.  Loader accepts v1 and v2; writer always writes v2.
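
Reading the two new header fields might look roughly like this. Only the offsets (7 and 20) come from the text above; the struct, the function name, the 4-byte little-endian encoding of the chunk size, and the example codec value are assumptions for illustration, so the real layout in ds4_kvstore.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { KV_HDR_CODEC_OFF = 7, KV_HDR_CHUNK_OFF = 20 };  /* offsets from the PR */

typedef struct {
    uint8_t  codec;       /* codec byte; numeric values are not given in the PR */
    uint32_t chunk_size;  /* assumption: stored as a little-endian u32 */
} kv_codec_info;

/* Hypothetical parser: returns 0 on success, -1 if the buffer is too short. */
int kv_read_codec_info(const uint8_t *hdr, size_t len, kv_codec_info *out)
{
    assert(hdr != NULL && out != NULL);
    if (len < (size_t)KV_HDR_CHUNK_OFF + 4) return -1;
    const uint8_t *p = hdr + KV_HDR_CHUNK_OFF;
    out->codec = hdr[KV_HDR_CODEC_OFF];
    out->chunk_size = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
                      ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    return 0;
}
```

Because both fields were reserved in v1, a v1 header read this way would presumably report zeros for both, which fits the note that the loader accepts v1 and v2.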

CLI: --kv-cache-compression-threads N.  N=0 disables (writes uncompressed
v2 files); default min(8, online cpus), the same heuristic
ds4_threads_init uses in ds4.c.  Env override
DS4_KV_CACHE_COMPRESSION_THREADS for parity with DS4_THREADS.

Compression lives in ds4_kvstore.c so both ds4-server and ds4-agent share
the codec.  ds4_agent.c writes uncompressed for now (its payloads are
typically small).

Streaming via cookie wrappers, not via the snapshot APIs in ds4.h, so
peak extra RAM during a save is 2*N*chunk_size (~256 MiB at N=8 / 16
MiB) instead of the full uncompressed payload (1-16 GiB on long
contexts).  Engine APIs in ds4.h are not touched.

The load path uses an eager-framing-peek variant of kv_lz4_reader_open
so the call site learns the uncompressed payload size before the engine
starts reading, and passes that as the read budget to
ds4_session_load_payload.  The engine treats its remaining argument as a
hard upper bound and rejects both shortfall ("truncated session
payload") and trailing bytes, so neither hdr.payload_bytes (compressed)
nor UINT64_MAX is correct.
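
A toy model of the budget rule described above, to show why only the uncompressed total works as the read budget. This is not the engine's code: the enum and function are hypothetical, and the real ds4_session_load_payload checks happen while parsing rather than as a single comparison.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    KV_BUDGET_OK,         /* budget equals the bytes the engine consumes */
    KV_BUDGET_TRUNCATED,  /* budget too small: "truncated session payload" */
    KV_BUDGET_TRAILING    /* budget too large: leftover bytes are rejected */
} kv_budget_result;

/* Hypothetical model: compare the read budget passed to the engine with the
 * number of uncompressed payload bytes it will actually consume. */
kv_budget_result kv_check_budget(uint64_t budget, uint64_t uncompressed_bytes)
{
    if (budget < uncompressed_bytes) return KV_BUDGET_TRUNCATED;
    if (budget > uncompressed_bytes) return KV_BUDGET_TRAILING;
    return KV_BUDGET_OK;
}
```

In this model, passing the compressed size from the header lands in the truncated case, UINT64_MAX lands in the trailing case, and only the total peeked from the framing is accepted.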

ds4_agent.c, ds4_server.c, and tests/ds4_test.c are touched only to pass
the new codec and chunk_size arguments to ds4_kvstore_fill_header.
ds4.c, ds4.h, ds4_cli.c, ds4_bench.c, ds4_eval.c, ds4_cuda.cu,
ds4_metal.m and Metal kernels are not modified.

lz4.c / lz4.h vendored from upstream lz4 v1.10.0 (BSD-2-Clause), same
treatment as rax.c and linenoise.c.  Linked into ds4-server,
ds4-server_cpu, ds4-agent, and ds4-agent_cpu.