
src: embed zstd dictionary for further compile cache size wins #16

Merged
anonrig merged 1 commit into anonrig:compile-cache-perf from lemire:compile-cache-zstd-dict-pr
Jun 12, 2026


@lemire lemire commented Jun 12, 2026

Collaborator

Experimental, for discussion

Gist: When you have lots of small files, a pregenerated dictionary is a compression win, at the cost of a few KB of extra payload.

Summary

This builds on the zstd compression added in nodejs#63861 by embedding a small (16
KiB) zstd dictionary trained on a diverse corpus of real modules. For each
compile-cache entry we compress with and without the dictionary and keep
whichever frame is smaller, so the dictionary only ever helps.

The dictionary mainly benefits the common "many small/medium modules" case,
where individual code caches are too short for plain zstd to find much
redundancy on its own. Large single blobs are left untouched (plain zstd
already wins there, and the dictionary path is skipped for them entirely).
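
The per-entry policy described above can be sketched in a few lines. This is only an illustration, not the code in compile_cache.cc: it uses Python's zlib preset dictionary (`zdict`) as a stand-in for a zstd dictionary, and the names `DICT_SIZE_GATE`, `choose_frame` and `decode` are hypothetical. The 256 KiB gate and level 1 are taken from the description.

```python
import zlib

# Hypothetical constant mirroring the 256 KiB gate from the PR description.
DICT_SIZE_GATE = 256 * 1024


def compress_plain(raw: bytes) -> bytes:
    """Compress without a dictionary (zlib level 1 as a stand-in for zstd level 1)."""
    return zlib.compress(raw, 1)


def compress_with_dict(raw: bytes, dictionary: bytes) -> bytes:
    """Compress with a preset dictionary (zlib zdict as a stand-in for a zstd CDict)."""
    comp = zlib.compressobj(level=1, zdict=dictionary)
    return comp.compress(raw) + comp.flush()


def choose_frame(raw: bytes, dictionary: bytes) -> tuple[str, bytes]:
    """Return ("plain" | "dict", frame), keeping whichever frame is smaller."""
    plain = compress_plain(raw)
    if len(raw) > DICT_SIZE_GATE:
        # Large entries skip the second compression entirely.
        return "plain", plain
    with_dict = compress_with_dict(raw, dictionary)
    if len(with_dict) < len(plain):
        return "dict", with_dict
    return "plain", plain


def decode(frame: bytes, dictionary: bytes) -> bytes:
    """One decoder configuration reads both plain and dictionary frames."""
    dec = zlib.decompressobj(zdict=dictionary)
    return dec.decompress(frame) + dec.flush()
```

Because the plain frame is always computed and the dictionary frame is only kept when it is strictly smaller, the chosen frame can never be larger than the no-dictionary baseline.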

Size benefits

The compile cache is already compressed by nodejs#63861; the question this PR answers
is how much more the dictionary saves on top of that plain zstd. All
numbers below use the shipped policy (per-entry min(plain, dict), level 1) and
compare against the no-dictionary baseline. The ratios are on-disk size vs. the
raw V8 code cache, so higher is smaller-on-disk.

| corpus (held out from training) | plain zstd (nodejs#63861) | + dictionary | extra saving |
|---|---|---|---|
| diverse modules (npm/lib/deps, 226 unseen files) | 1.87× | 2.44× | −24% |
| test/parallel (4,119 files, not in training) | 1.74× | 2.22× | −22% |
| npm --version end-to-end (~70 real modules) | 138 KB | 117 KB | −15% |

Reading this:

  • The first two rows are measured offline on code caches the dictionary was
    never trained on (a held-out file split, and an entire corpus —
    test/parallel — absent from training), so they reflect generalization, not
    memorization.
  • The third row is a real end-to-end run through the actual module loader and
    persist path: the on-disk cache for npm's module graph shrinks from 138 KB to
    117 KB.
  • Put plainly: nodejs#63861 already made the cache ~1.7–1.9× smaller than raw; this
    dictionary takes it to ~2.2–2.4×, recovering roughly another 15–24% of the
    on-disk footprint, concentrated in the small/medium modules that dominate real
    workloads.
  • Large single blobs (e.g. the typescript.js fixture, > 256 KiB) stay on the
    plain path and are byte-for-byte unchanged.
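
As a quick check of how the "extra saving" figures relate to the other columns, here is a small sketch. The function names are hypothetical, and it assumes the saving is defined as one minus the ratio of dictionary size to plain size. Since on-disk size is raw size divided by the compression ratio, the same saving can be derived from two ratios.

```python
def saving_from_sizes(plain_bytes: float, dict_bytes: float) -> float:
    """Fraction of the plain-zstd on-disk size removed by the dictionary."""
    return 1.0 - dict_bytes / plain_bytes


def saving_from_ratios(plain_ratio: float, dict_ratio: float) -> float:
    """Same saving from compression ratios (size = raw / ratio)."""
    return 1.0 - plain_ratio / dict_ratio


if __name__ == "__main__":
    # Figures copied from the tables in this description.
    print(f"npm --version:      {saving_from_sizes(138, 117):.1%}")
    print(f"120 small modules:  {saving_from_sizes(116_852, 86_633):.1%}")
    print(f"diverse modules:    {saving_from_ratios(1.87, 2.44):.1%}")
    print(f"test/parallel:      {saving_from_ratios(1.74, 2.22):.1%}")
```

The two size-based rows reproduce the reported 15% and 25.9%. The ratio-based rows come out at roughly 23.4% and 21.6%, within a point of the table's −24% and −22%; a small gap like that would be expected if the table's percentages were computed from unrounded ratios, though that is an assumption.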

The cost is +16 KiB in the node binary (the embedded dictionary). A 32 KiB
dictionary was measured to add only ~1 percentage point and 48 KiB nothing
beyond that, so 16 KiB is the size/benefit knee.

Timing (does the dictionary make things slower?)

A/B against the no-dictionary baseline (this commit's parent), same tree, only
compile_cache.cc/.h differ. Trimmed median wall time per process (Apple
Silicon, both binaries measured back-to-back). The nocache row uses no compile
cache at all, so its delta is the run-to-run noise floor — read the other rows
against it.

Big single blob — typescript.js (~1.8 MB cache, 1 entry; above the 256 KiB
threshold, so the dictionary is skipped on write):

| regime | baseline | +dict | Δ |
|---|---|---|---|
| nocache (noise) | 86.2 ms | 84.4 ms | −1.8 ms |
| cold (write/persist) | 104.4 ms | 101.9 ms | equal (gated) |
| warm (read/decomp) | 53.6 ms | 53.9 ms | +0.3 ms (≈ noise) |
| cache size | 752,364 B | 751,529 B | equal |

Many small modules (120 entries; all below the threshold, dictionary applied):

| regime | baseline | +dict | Δ |
|---|---|---|---|
| nocache (noise) | 31.8 ms | 32.5 ms | +0.7 ms |
| cold (write/persist) | 53.2 ms | 53.9 ms | +0.7 ms (≈ noise) |
| warm (read/decomp) | 30.6 ms | 30.5 ms | −0.1 ms |
| cache size | 116,852 B | 86,633 B | −25.9% |

Takeaways: the read path — paid on every warm-cache startup — is within
noise; decompress_usingDDict is not measurably slower than plain decompress,
and the one-time per-process DDict digest of a 16 KiB dictionary is
negligible. Write overhead (only at persist, on shutdown) is sub-millisecond for
many modules and zero for the big blob (the size gate skips it). On-disk size
never regresses.

Why embed the dictionary

The compile cache must be usable early during startup, portably, and without
relying on any additional filesystem state, so the dictionary is compiled into
the binary rather than loaded at runtime. Only the small binary .dict is
checked in; the C array is generated at build time.
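
For readers unfamiliar with the pattern, a build-time generator of this kind can be very small. The following is a sketch of what such a step might look like, not the contents of tools/generate_compile_cache_dict.py; the function name, the symbol name `kCompileCacheZstdDict` and the header layout are all hypothetical.

```python
def render_header(data: bytes,
                  symbol: str = "kCompileCacheZstdDict",
                  per_line: int = 12) -> str:
    """Render a binary blob as a C++ header defining a byte array and its size."""
    if not data:
        raise ValueError("refusing to emit an empty array")
    lines = [
        "// Generated at build time. Do not edit.",
        "#include <cstddef>",
        "#include <cstdint>",
        "",
        f"static const uint8_t {symbol}[] = {{",
    ]
    # Emit a fixed number of hex bytes per line to keep the header diff-friendly.
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"static const size_t {symbol}Size = sizeof({symbol});")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 0xEC30A437 is the zstd dictionary magic number, stored little-endian.
    print(render_header(bytes([0x37, 0xA4, 0x30, 0xEC])))
```

Generating the header from the checked-in .dict keeps the repository free of a large generated source file while still letting the compiler place the dictionary in the binary's read-only data.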

How the dictionary is trained (reproducible)

The dictionary is trained on V8 code caches harvested via vm.compileFunction
(the same shape the CJS loader produces) from a diverse in-tree corpus: bundled
npm packages (deps/npm/node_modules), lib/, tools/, and a few deps/
libraries — ~1,200 modules. Those samples are fed to zstd --train --maxdict=16384. The measurement corpora above are disjoint from this training
set. The harvest+train script can be committed under tools/ so the .dict is
regenerable from the tree rather than an opaque drop-in.
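
The training step could be driven from a short script like the one below. It is a sketch under assumptions: it presumes the harvested code caches have already been written as one file per module into a sample directory (the harvest itself needs Node and is not shown), and the function names and paths are hypothetical. The `--train`, `--maxdict` and `-o` flags are the zstd CLI options named above.

```python
import subprocess
from pathlib import Path


def build_train_command(sample_dir, out_path, max_dict: int = 16384) -> list[str]:
    """Build the zstd CLI invocation that trains a dictionary from sample files."""
    samples = sorted(str(p) for p in Path(sample_dir).iterdir() if p.is_file())
    if not samples:
        raise ValueError(f"no sample files found in {sample_dir}")
    # Sorting the samples keeps the command deterministic between runs.
    return ["zstd", "--train", f"--maxdict={max_dict}", "-o", str(out_path), *samples]


def train(sample_dir, out_path, max_dict: int = 16384) -> None:
    """Run the training; requires the zstd CLI to be available on PATH."""
    subprocess.run(build_train_command(sample_dir, out_path, max_dict), check=True)
```

Calling `train("samples/", "compile_cache_zstd.dict")` would then regenerate the dictionary, assuming the same samples and zstd version are used.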

On-disk compatibility

No format change. The dictionary is a trained zstd dictionary, so dict-assisted
frames carry its dictID and plain frames carry none; the reader decompresses
both correctly with a single DDict. Compile-cache directories are already
keyed by Node version, arch, and a cache-data version tag, so a future change to
the embedded dictionary is naturally isolated to a fresh cache directory.
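
To make the dictID point concrete, the sketch below reads the Dictionary_ID field from a zstd frame header, following the frame layout in the zstd format specification (RFC 8878). The function name is hypothetical; the C library exposes the same information through `ZSTD_getDictID_fromFrame`. Returning 0 for "no dictionary ID" is a choice made here for illustration.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528, little-endian


def frame_dict_id(frame: bytes) -> int:
    """Return the Dictionary_ID stored in a zstd frame header, or 0 if absent."""
    if len(frame) < 5 or frame[:4] != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    descriptor = frame[4]
    # Bit 5: Single_Segment_flag. When set, the Window_Descriptor byte is omitted.
    single_segment = (descriptor >> 5) & 1
    # Bits 1-0: Dictionary_ID_flag, selecting a 0-, 1-, 2- or 4-byte field.
    did_len = (0, 1, 2, 4)[descriptor & 0b11]
    offset = 5 + (0 if single_segment else 1)
    field = frame[offset:offset + did_len]
    if len(field) != did_len:
        raise ValueError("truncated frame header")
    return int.from_bytes(field, "little")
```

A reader can use this field to tell the two kinds of entry apart without any side channel, which is why no new on-disk format marker is needed.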

Builds on the zstd compression in nodejs#63861 by embedding a small zstd
dictionary trained on a diverse corpus of real modules, so each
small/medium compile-cache entry compresses better. Per entry we keep
the smaller of the plain and dictionary-assisted frame, so the
dictionary only ever helps.

- Add src/compile_cache_zstd.dict (16 KiB). It is trained on V8 code
  caches harvested (via vm.compileFunction, the same shape the CJS
  loader produces) from a diverse corpus: bundled npm packages, lib/,
  tools/ and a few deps.
- Add tools/generate_compile_cache_dict.py and a node.gyp action that
  generates compile_cache_zstd_dict.h into SHARED_INTERMEDIATE_DIR at
  build time; no generated header is checked in. libnode include_dirs
  updated to pick it up.
- Prepare the CDict/DDict once per process (shared across all handlers
  and Workers, matching the lazy-context approach from nodejs#63861) and use
  them in Persist() and ReadCacheFile(). Persist() compresses the plain
  and dict frames into separate buffers and selects the smaller, so the
  written bytes and recorded size always agree. The dictionary is only
  tried for entries up to 256 KiB; larger blobs never benefit, so the
  second compression is skipped to avoid wasted work. Falls back to
  plain zstd if dictionary preparation fails.
- The dictionary is embedded in the binary because the compile cache
  must be usable early, portably, and without extra filesystem state.
- No on-disk format change: dict-assisted frames carry the dictID, plain
  frames carry none, and a single DDict decompresses both.
- Size, measured on data held out from training (per-entry min policy):
  diverse modules go from ~1.87x (plain zstd) to ~2.44x with the
  dictionary (~24% smaller on disk); on test/parallel, which is not in
  the training corpus at all, ~1.74x -> ~2.22x (~22% smaller). A real
  end-to-end run (npm --version, ~70 modules) is ~15% smaller. Read
  time is unchanged and the extra write-time work is negligible.
- Add a multi-module write/read roundtrip test and a startup benchmark
  (standard createBenchmark harness).
@anonrig anonrig merged commit a6273b1 into anonrig:compile-cache-perf Jun 12, 2026