src: improve compile cache performance and size #63861
Verification results (macOS arm64, release build, vs. baseline at the merge-base): Tests: all 23 On-disk size
Warm-start, self-controlled (same binary, cache on vs. off, interleaved 20 runs, trimmed means — avoids binary-layout skew between builds):
Warm-start with a hot page cache is neutral within noise (~1 ms on the pathological single-10 MB-file case, which is dominated by one large zstd decompress; with cold page cache or slower storage the 2.4x smaller read wins). Cold-start adds one-time compression at persist (~35 ms for the 1.8 MB blob at level 1, proportionally less for typical files). The second commit reuses zstd contexts (one |
Improve the compile cache by:

- Reading cache files with a single exactly-sized read using the file size from fstat instead of reading into an exponentially growing buffer, which previously cost O(log N) syscalls and allocations and about 2N bytes of copying per file.
- Compressing the cache content on disk with zstd at level 1, falling back to raw storage when the data is not compressible. This shrinks cache directories by about 2-4x. The magic number is bumped so that files in the old format are discarded as cache misses and then overwritten in place.
- Handing the cache to V8 through a non-owning CachedData wrapper instead of copying the whole buffer on every cache hit.

Corrupted cache files keep degrading to silent cache misses and are regenerated, now covered by a regression test.

Co-authored-by: Grok <grok@x.ai>
Signed-off-by: Yagiz Nizipli <yagiz@nizipli.com>
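For readers unfamiliar with the first item, here is a minimal sketch of an exactly-sized read. It is written in Python for brevity rather than the C++ used in src/compile_cache.cc, and the function name `read_exact` is hypothetical; it only illustrates the idea of sizing the buffer from fstat up front instead of growing it.

```python
import os


def read_exact(path: str) -> bytes:
    """Read a whole file using the size reported by fstat()."""
    # buffering=0 gives a raw FileIO object, so read(size) issues at most
    # one read(2) call into a buffer of exactly `size` bytes.
    with open(path, "rb", buffering=0) as f:
        size = os.fstat(f.fileno()).st_size
        data = f.read(size)
        # Short reads are unusual for regular files; top up only if needed.
        while len(data) < size:
            more = f.read(size - len(data))
            if not more:  # the file shrank after fstat()
                break
            data += more
    return data
```

In the common case this performs one allocation and one read, whereas a doubling buffer needs a number of reads that grows with the logarithm of the file size and copies data on every regrowth.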
816f4ef to 0cf8818
Creating and freeing a zstd context for every cache file costs more than the (de)compression itself for small caches. Lazily create one decompression context on the handler and reuse it across reads, and share one compression context across all entries in Persist().

Co-authored-by: Grok <grok@x.ai>
Signed-off-by: Yagiz Nizipli <yagiz@nizipli.com>
0cf8818 to f427fb3
… on nodejs#63861)

- Add src/compile_cache_zstd.dict (48 KiB trained on 5.7k objective V8 code cache samples harvested via vm.Script from test/parallel + fixtures + benchmarks; the 300 small benchmark measurement set and the big TS fixture raw are held completely out of training).
- Generate and include src/compile_cache_zstd_dict.h (the embeddable form).
- Add tools/generate-compile-cache-dict.js (run after updating the .dict).
- Wire always-on use of the prepared CDict/DDict in Persist() (pick best of plain vs dict-assisted per entry) and ReadCacheFile() (decompress_usingDDict).
- Reuses the ctxs and "only if smaller than raw" policy from Yagiz's PR.
- Yes, the dictionary is joined/embedded in the node binary (tiny, must be available with no extra FS for portable/early/restricted cache use).
- Measurements on the benchmark scenarios (big TS fixture + 300 held-out small/medium objective samples) with the representative dict:
  small/medium: raw -> plain-zstd-l1 2.13x -> +dict 3.06x (1.44x further win over the zstd already in nodejs#63861).
  big: ~2.69x plain (dict neutral/slightly worse, still >> raw; we take the min so big stays optimal).
- See the investigation notes in the branch for full details and reproduction.

Co-authored-by: Grok (investigation + prototype)

PR on top of nodejs#63861
It seems the performance gains are within noise, and it's mostly only the compression that changes the size? Can you split them into different PRs and measure them individually?
Builds on the zstd compression in nodejs#63861 by embedding a small zstd dictionary trained on a diverse corpus of real modules, so each small/medium compile-cache entry compresses better. Per entry we keep the smaller of the plain and dictionary-assisted frame, so the dictionary only ever helps.

- Add src/compile_cache_zstd.dict (16 KiB). It is trained on V8 code caches harvested (via vm.compileFunction, the same shape the CJS loader produces) from a diverse corpus: bundled npm packages, lib/, tools/ and a few deps.
- Add tools/generate_compile_cache_dict.py and a node.gyp action that generates compile_cache_zstd_dict.h into SHARED_INTERMEDIATE_DIR at build time; no generated header is checked in. libnode include_dirs updated to pick it up.
- Prepare the CDict/DDict once per process (shared across all handlers and Workers, matching the lazy-context approach from nodejs#63861) and use them in Persist() and ReadCacheFile(). Persist() compresses the plain and dict frames into separate buffers and selects the smaller, so the written bytes and recorded size always agree. The dictionary is only tried for entries up to 256 KiB; larger blobs never benefit, so the second compression is skipped to avoid wasted work. Falls back to plain zstd if dictionary preparation fails.
- The dictionary is embedded in the binary because the compile cache must be usable early, portably, and without extra filesystem state.
- No on-disk format change: dict-assisted frames carry the dictID, plain frames carry none, and a single DDict decompresses both.
- Size, measured on data held out from training (per-entry min policy): diverse modules go from ~1.87x (plain zstd) to ~2.44x with the dictionary (~24% smaller on disk); on test/parallel, which is not in the training corpus at all, ~1.74x -> ~2.22x (~22% smaller). A real end-to-end run (npm --version, ~70 modules) is ~15% smaller. Read time is unchanged and the extra write-time work is negligible.
- Add a multi-module write/read roundtrip test and a startup benchmark (standard createBenchmark harness) plus the many-modules fixture list.
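The per-entry selection policy in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the Persist() implementation: Python's zlib with a preset dictionary (`zdict`) stands in for zstd with a CDict/DDict, the one-byte `RAW`/`COMPRESSED` tags are a hypothetical framing, and `encode_entry`/`decode_entry` are made-up names. The 256 KiB cutoff and compression level 1 come from the text above.

```python
import zlib

DICT_MAX_ENTRY = 256 * 1024          # only try the dictionary up to 256 KiB
RAW, COMPRESSED = b"\x00", b"\x01"   # hypothetical one-byte tags


def _deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress at level 1, optionally primed with a preset dictionary."""
    comp = zlib.compressobj(1, zdict=zdict) if zdict else zlib.compressobj(1)
    return comp.compress(data) + comp.flush()


def encode_entry(data: bytes, zdict: bytes) -> bytes:
    """Keep the smallest of: raw, plain frame, dictionary-assisted frame."""
    candidates = [_deflate(data)]
    if len(data) <= DICT_MAX_ENTRY:
        # Second compression is skipped for large entries to avoid wasted work.
        candidates.append(_deflate(data, zdict))
    best = min(candidates, key=len)
    if len(best) >= len(data):
        return RAW + data            # not compressible: store raw
    return COMPRESSED + best


def decode_entry(blob: bytes, zdict: bytes) -> bytes:
    """One decompressor configuration handles plain and dictionary frames."""
    tag, body = blob[:1], blob[1:]
    if tag == RAW:
        return body
    # zlib only consults zdict when the stream header asks for a dictionary.
    dec = zlib.decompressobj(zdict=zdict)
    return dec.decompress(body) + dec.flush()
```

Because each candidate is compressed into its own buffer and the minimum is chosen afterwards, the bytes written and the size recorded for them cannot disagree, and the dictionary can never make an entry larger than the plain frame.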
If we do include dictionary files in our source tree, I'd say we should either include instructions for how to (at least approximately) re-create this file, or even generate it at build time entirely. Running strings src/compile_cache_zstd.dict shows a surprising amount of Node.js-core specificity, which is odd given that the commit that adds it claims it was trained on public npm packages.
(strings src/compile_cache_zstd.dict output in the fold)
wUUu2
_HOFJW
[Z)S
nding
options
Paz
types
PaN
process
versions
openssl
internal/process/pre_execution
PdR`
prepareMainThreadExecution
Pc6Wl
markBootstrapComplete
internal/options
"<";
__createBinding
__setModuleDefault
PaZY4J
birname
note
signatures
ObjectSetPrototypeOf
PromisePrototypeThen
PromiseWithResolvers
RegExpPrototypeExec
RegExpPrototypeSymbolReplace
StringPrototypeSplit
StringPrototypeToLowerCase
_createBlob
_createBlobFromFilePath
getDataObject
kMaxLength
TextDecoder
markTransferMode
Pbvyo
isAnyArrayBuffer
PcBS
isArrayBufferView
PaZ
require
PaN"
util
InvalidArgumentError
ConnectTimeoutError
SessionCache
maybeNormalizeConnectError
PromiseWithResolvers
nonOpStart
Pf&
writableStreamMarkFirstWriteRequestInFlight
createPromiseCallback
Pb:6k
nonOpWrite
$Pg>
writableStreamDefaultWriterEnsureReadyPromiseRejected
writableStreamDefaultControllerClose
DOMException
Pc^v
extractHighWaterMark
extractSizeAlgorithm
Pb~:
kEmptyObject
wgIgAygCACIAIAQgAWtqIQUgASAAa0ECaiEGAkADQCABLQAAIABButUAai0AAEcNAyAAQQJGDQEgAEEBaiEAIAQgAUEBaiIBRw0ACyADIAU2AgAMwwILIAMoAgQhACADQgA3AwAgAyAAIAZBAWoiARAuIgBFBEBB4wEhAgyqAgsgA0H1ATYCHCADIAE2AhQgAyAANgIMQQAhAgzCAgtB9AEhAiABIARGDcECIAMoAgAiACAEIAFraiEFIAEgAGtBAWohBgJAA0AgAS0AACAAQbjVAGotAABHDQIgAEEBRg0BIABBAWohACAEIAFBAWoiAUcNAAsgAyAFNgIADMICCyADQYEEOwEoIAMoAgQhACADQgA3AwAgAyAAIAZBAWoiARAuIgANAwwCCyADQQA2AgALQQAhAiADQQA2AhwgAyABNgIUIANB5R82AhAgA0EINgIMDL8CC0HVASECDKUCCyADQfMBNgIcIAMgATYCFCADIAA2AgxBACECDL0CC0EAIQACQCADKAI4IgJFDQAgAigCQ
memoMethod
perf
calculatedSizcludes
kTypes
uvErrmapGet
ArrayPrototypeIndexOf
classRegExp
MathMax
ArrayIsArray
isWindows
ObjectIsExtensible
isPermissionModelError
getExpectedArgumentLength
ArrayPrototypeJoin
kIsNodeError
get source
get ports
CloseEvent
get listening
header.js.map
@[@b
node:path
./large-numbers.js
"<";
escapeHTML
PcB_C+
convertChangesToXML
#noProxy
#ProxyAgent
#getProxy
#timeoutConnection
Pc~}
#drainPendingRequests
connect
Pbv2&_
addRequest
createSocket
.get
.get
getMilestoneTimestamp
getTimeOriginTimestamp
PbjK
internalBinding
performance
constants
ERR_INVALID_ARG_TYPE
ERR_INVALID_ARG_VALUE
Pb:e
ERR_OUT_OF_RANGE
isErrorStackTraceLimitWritable
Buffer
Pa&
inspect
validateBoolean
validateFunction
validateNumber
validateString
validateOneOf
PbbH
validateObject
validateInteger
isReadableStream
isWritableStream
isNodeStream
utilColors
PbJ%
lazyUtilColors
PbNn6
__importStar
DQCABLQAAQSBHDUggBCABQQFqIgFHDQALQSUhAwxpC0ElIQMMaAsgAi0ALUEBcQRAQcMBIQMMTwsgAigCBCEAQQAhAyACQQA2AgQgAiAAIAEQKSIABEAgAkEmNgIcIAIgADYCDCACIAFBAWo2AhQMaAsgAUEBaiEBDFwLIAFBAWohASACLwEwIgBBgAFxBEBBACEAAkAgAigCOCIDRQ0AIAMoAlQiA0UNACACIAMRAAAhAAsgAEUNBiAAQRVHDR8gAkEFNgIcIAIgATYCFCACQfkXNgIQIAJBFTYCDEEAIQMMZwsCQCAAQaAEcUGgBEcNACACLQAtQQJxDQBBACEDIAJBADYCHCACIAE2AhQgAkGWEzYCECACQQQ2AgwMZwsgAgJ/IAIvATBBFHFBFEYEQEEBIAItAChBAUYNARogAi8BMkHlAEYMAQsgAi0AKUEFRgs6AC5BACEAAkAgAigCOCIDRQ0AIAMoAiQiA0UNACACIAMRAAAhAAsCQAJAAkACQAJAIAAOFgIBAAQEBAQEBAQE
get ok
get statusText
get headers
get body
get bodyUsed
clone
parsePullArgs
PcNs
validateBackpressure
primordials
PcJ=
internal/encoding
TextEncoder
internal/errors
Pav
codes
internal/util
internal/util/types
internal/validators
onComplete
E`w
PaN
onError
node:assert
../core/symbols
../core/errors
../core/util
Pab
beep
ArrayFrom
ArrayPrototypeFilter
ArrayPrototypeIncludes
ArrayPrototypeMap
ArrayPrototypePush
PcB,t
ArrayPrototypePushApply
ArrayPrototypeSlice
ObjectDefineProperty
Pb>c
ObjectKeys
PdRc
ObjectPrototypeHasOwnProperty
Pbf)
ReflectGet
SafeMap
SafeSet
StringPrototypeSlice
Error
PbFOT
_flushFlag
__esModule
.desc.get
stop
destroyer
primordials
internal/errors
Pav
codes
internal/streams/utils
__esModule
fs/promises
PaF-M
path
PaZ
require
Pa"/
module
__filename
__dirname
@anonrig I'm going to mark this as un-resolved again, since there was no response here as far as I can tell
Sorry, I didn't see this until now, and I didn't press the "resolve" button at all. Will address your concerns.
Improves the on-disk compile cache (NODE_COMPILE_CACHE / module.enableCompileCache()):

- Reads cache files with a single exactly-sized read (file size from fstat) instead of an exponentially growing buffer, which previously cost O(log N) syscalls/allocations and ~2N bytes of copying per file.
- Compresses the cache content on disk with zstd at level 1, falling back to raw storage when the data is not compressible.
- Hands the cache to V8 through a non-owning CachedData wrapper (BufferNotOwned) instead of copying the entire buffer on every cache hit. The underlying buffer is owned by the cache entry, which outlives the synchronous compilation (same pattern as the vm cached-data path in node_contextify.cc).

Corrupted cache files keep degrading to silent cache misses and are regenerated; a corrupted size header can no longer cause an oversized allocation since the zstd frame content size is cross-checked first. Added test/parallel/test-compile-cache-corrupted.js covering bad magic, truncation, content bit-flips, and header corruption.

No public API or documented behavior changes; the file format is private to src/compile_cache.cc. Benchmark numbers (cache size and warm-startup timings) to follow in a comment.

This change was developed with AI assistance (see Co-authored-by trailer).
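The "degrade to a silent cache miss" behavior can be illustrated with a small sketch. The header layout here (magic, payload length, CRC-32) and the names `MAGIC`, `pack_entry`, and `load_entry` are hypothetical; the real format is private to src/compile_cache.cc and cross-checks against the zstd frame content size, which this sketch replaces with a check against the bytes actually present.

```python
import struct
import zlib
from typing import Optional

MAGIC = 0xC0DECAC4                  # hypothetical magic number
HEADER = struct.Struct("<III")      # magic, payload length, CRC-32 of payload


def pack_entry(payload: bytes) -> bytes:
    """Prefix the payload with the hypothetical header."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def load_entry(blob: bytes) -> Optional[bytes]:
    """Return the payload, or None (a cache miss) if anything looks wrong."""
    if len(blob) < HEADER.size:
        return None                 # truncated before the header ends
    magic, length, crc = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return None                 # old or foreign format
    # Cross-check the claimed size against an independent source before
    # trusting it, so a corrupted header cannot drive a huge allocation.
    if length != len(blob) - HEADER.size:
        return None
    payload = blob[HEADER.size:]
    if zlib.crc32(payload) != crc:
        return None                 # content bit-flip
    return payload
```

Every failure path returns the same "miss" result rather than raising, which mirrors the four corruption cases the new test is described as covering: bad magic, truncation, content bit-flips, and header corruption.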