Guest page-table pool exhausts under multiple V8 isolates, Node worker_threads/cluster hard-abort #74

@Max042004

Summary

The guest page-table pool is a fixed 960 KiB arena carved out of the 4 MiB
infrastructure reserve (src/core/guest.h:53-54). It is filled by a bump
allocator whose pages are never reclaimed within a process: g->pt_pool_next
only advances and is reset solely at guest init / reset / fork
(src/core/guest.c:358,470,1590, src/runtime/forkipc.c). Because each 2 MiB
guest block that needs 4 KiB granularity costs one 4 KiB L3 page from this pool
(guest_split_block → guest_alloc_pt_page, src/core/guest.c:166-194), the
pool budgets roughly 480 MiB of distinct mapped address space over the whole
process lifetime (240 pages × 2 MiB), regardless of how much is later
munmap'd.
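
The budget arithmetic can be reproduced with a few lines of JavaScript. This is only a sketch: the two offsets are the INFRA_PT_POOL_OFF / INFRA_PT_POOL_END_OFF values quoted under Root cause, while `blocksTouched` is a hypothetical helper (not an elfuse function) that gives an upper bound on L3 pages for a mapping, assuming every touched 2 MiB block has to be split to 4 KiB granularity.

```javascript
const KiB = 1024;
const MiB = 1024 * KiB;

// Offsets as quoted in this issue (src/core/guest.h).
const INFRA_PT_POOL_OFF = 0x10000;
const INFRA_PT_POOL_END_OFF = 0x100000;

const PT_PAGE_SIZE = 4 * KiB; // one L3 table page
const BLOCK_SIZE = 2 * MiB;   // address range covered by one L3 page

const poolBytes = INFRA_PT_POOL_END_OFF - INFRA_PT_POOL_OFF;
const poolPages = poolBytes / PT_PAGE_SIZE;
const coverageMiB = (poolPages * BLOCK_SIZE) / MiB;

// Hypothetical helper: number of distinct 2 MiB blocks that the byte range
// [addr, addr + len) touches. Upper bound on L3 pages if every block is split.
function blocksTouched(addr, len) {
  if (len <= 0) return 0;
  const first = Math.floor(addr / BLOCK_SIZE);
  const last = Math.floor((addr + len - 1) / BLOCK_SIZE);
  return last - first + 1;
}

console.log(poolBytes, poolPages, coverageMiB); // 983040 240 480
```

Note that an unaligned mapping costs more than its size suggests: a 2 MiB range starting at a 1 MiB offset straddles two blocks, so `blocksTouched(1 * MiB, 2 * MiB)` is 2.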

A single-isolate Node process stays under that ceiling and runs fine. But every
additional V8 isolate (worker_threads, cluster) reserves and commits its own
multi-hundred-MiB regions, and the cumulative, non-reclaimed consumption crosses
the 240-page budget. When guest_alloc_pt_page returns 0 the guest mmap fails,
V8's allocator then trips on a follow-up munmap that returns non-zero, and the
process hard-aborts:

src/core/guest.c:178: guest: page table pool exhausted (used 983040 / 983040 bytes)

# Fatal error in , line 0
# Check failed: 0 == munmap(address, size).

(983040 == 0xF0000 == 960 KiB, i.e. the pool is 100% consumed.)
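
For triaging logs from other workloads, the exhaustion line can be turned into a page count and utilisation figure. A minimal sketch, assuming the message keeps exactly the format shown above; `parsePoolExhaustion` is a hypothetical helper name and the 4 KiB page size is taken from the Summary.

```javascript
const PT_PAGE_SIZE = 4096; // one L3 page, per the Summary

// Hypothetical helper: extract used/total bytes from the exhaustion message.
// Returns null when the line does not contain that message.
function parsePoolExhaustion(line) {
  const m = /page table pool exhausted \(used (\d+) \/ (\d+) bytes\)/.exec(line);
  if (!m) return null;
  const used = Number(m[1]);
  const total = Number(m[2]);
  return {
    used,
    total,
    pages: used / PT_PAGE_SIZE,
    percent: (used / total) * 100,
  };
}

const sample =
  "src/core/guest.c:178: guest: page table pool exhausted (used 983040 / 983040 bytes)";
console.log(parsePoolExhaustion(sample));
```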

This surfaced while testing advanced Node workloads on oci run node:alpine
(node v26.3.0) — oci run is the OCI image support from PR #34 — on top of the
oci-image-rebase branch with PR #73 applied. Single-threaded, event-loop web
servers (Express, raw http) are unaffected; the failure is specific to
spinning up multiple live V8 isolates.

Reproduction

Assumes node:alpine has already been pulled. --entrypoint node is required
because the image entrypoint is docker-entrypoint.sh (a shell script, not an ELF).

Crashes — concurrent worker isolates after normal allocator traffic

The advanced built-in battery below (crypto / zlib / fs / streams / net, then
two sequential workers, then four concurrent workers) deterministically
exhausts the pool at the concurrent-worker stage:

./build/elfuse oci run --entrypoint node node:alpine -e '
const {Worker}=require("worker_threads");
const mk=n=>new Promise((res,rej)=>{
  const w=new Worker(`require("worker_threads").parentPort.postMessage(${n}*${n})`,{eval:true});
  w.on("message",m=>{w.terminate();res(m);}); w.on("error",rej);
});
Promise.all([mk(2),mk(3),mk(4),mk(5)]).then(r=>console.log(r)).catch(e=>console.error(e));
'
# -> guest: page table pool exhausted (used 983040 / 983040 bytes)
# -> # Check failed: 0 == munmap(address, size).

Works — single isolate / single worker

# plain node, no worker .................. OK (well under 80% warn threshold)
./build/elfuse oci run --entrypoint node node:alpine --version          # v26.3.0

# one worker (2 isolates) ................ OK, prints 500500
./build/elfuse oci run --entrypoint node node:alpine -e '
const {Worker}=require("worker_threads");
new Promise((res,rej)=>{const w=new Worker("let s=0;for(let i=0;i<=1000;i++)s+=i;require(\"worker_threads\").parentPort.postMessage(s)",{eval:true});
w.on("message",m=>{w.terminate();res(m)});w.on("error",rej);}).then(s=>console.log(s));'

Threshold

Measured with minimal workers held live until all are up: 1 and 2 concurrent
worker isolates pass cleanly; at 3 the run no longer completes (consistent with
the abort hanging guest teardown), and the 4-concurrent-worker battery above is
the deterministic, cleanly-observed crash. The boundary drops further once the
main isolate has already done real work (an HTTP server plus the
crypto/zlib/stream/net battery), because the pool is cumulative and never
reclaimed. A worker_threads pool of size os.cpus().length (the common
default for CPU-bound work) or any cluster-based server will not start.
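
A threshold probe of the "held live until all are up" kind could look like the sketch below. It is not the exact script used for the measurement above; `holdWorkersLive` and the worker body are illustrative names, and only the standard worker_threads API is assumed.

```javascript
const { Worker } = require("worker_threads");

// Worker body: report readiness, then stay alive until the parent terminates it.
const WORKER_SRC = `
  const { parentPort } = require("worker_threads");
  parentPort.postMessage("up");
  setInterval(() => {}, 1000); // keep this isolate live
`;

// Start n workers, wait until every one has reported "up" (so all n isolates
// exist at the same time), then terminate them. Resolves to the worker count.
async function holdWorkersLive(n) {
  const workers = [];
  try {
    const ready = [];
    for (let i = 0; i < n; i++) {
      const w = new Worker(WORKER_SRC, { eval: true });
      workers.push(w);
      ready.push(
        new Promise((resolve, reject) => {
          w.once("message", resolve);
          w.once("error", reject);
        })
      );
    }
    await Promise.all(ready);
    return workers.length;
  } finally {
    // Always tear the workers down so the process can exit.
    await Promise.all(workers.map((w) => w.terminate()));
  }
}

holdWorkersLive(2).then((n) => console.log(`${n} workers were live concurrently`));
```

Raising the argument one step at a time, with the script passed through the same `oci run --entrypoint node node:alpine -e` invocation as the repros, gives the concurrent-isolate count at which the pool runs out.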

For contrast, everything that does not add isolates passes on the same
binary: HTTP server reachable from the host, crypto (sha256/aes-256-cbc/
randomBytes), zlib gzip, fs, stream pipeline, timers, raw net TCP, and a
full Express 5.2.1 REST app serving 100 concurrent requests.

Root cause

  1. Fixed, boxed-in pool. INFRA_PT_POOL_OFF 0x10000 .. INFRA_PT_POOL_END_OFF
    0x100000 = 960 KiB, sitting inside the 4 MiB INFRA_RESERVE. The shim
    code slot starts immediately above at +0x100000
    (INFRA_SHIM_OFF, src/core/guest.h:55), so the pool cannot grow in place
    without relocating the shim/shim-data slots and enlarging the reserve.

  2. Bump allocator, no reclaim. guest_alloc_pt_page (guest.c:166-194)
    only ever advances pt_pool_next; there is no free path. sys_munmap tears
    down regions and invalidates PTEs but does not return L3 pages to the
    pool. So a worker_threads pool that spawns and joins workers leaks PT
    pages on every churn, even at steady-state isolate count.

  3. Per-isolate footprint. Each V8 isolate maps its own committed heap /
    code / cage regions; one L3 page per touched 2 MiB block. Two isolates'
    footprints already approach the 240-page budget; three crosses it.

  4. Failure mode is an abort, not a clean OOM. On exhaustion the guest mmap
    fails, but V8's subsequent munmap returns non-zero and trips a CHECK,
    killing the process instead of surfacing a JS RangeError/OOM. The
    post-exhaustion mmap-bookkeeping is left inconsistent enough that the paired
    munmap cannot succeed.
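
The interaction of points 1 and 2 can be modelled in a few lines. This is a toy model in JavaScript, not the C code in src/core/guest.c: `PtPool`, `cyclesUntilExhaustion`, and the pages-per-worker figure are hypothetical, and the model assumes every new worker lands on 2 MiB blocks that have not been split before.

```javascript
const PT_PAGE = 4096;
const POOL_START = 0x10000; // INFRA_PT_POOL_OFF, as quoted above
const POOL_END = 0x100000;  // INFRA_PT_POOL_END_OFF, as quoted above

class PtPool {
  constructor() {
    this.next = POOL_START; // plays the role of g->pt_pool_next
  }

  // Bump allocation: hand out the next page, or 0 once the arena is full.
  allocPage() {
    if (this.next + PT_PAGE > POOL_END) return 0;
    const off = this.next;
    this.next += PT_PAGE;
    return off;
  }

  // Model of munmap: PTEs go away, but no page returns to the pool.
  releasePages(_count) {
    /* intentionally a no-op: there is no free path */
  }

  get usedBytes() {
    return this.next - POOL_START;
  }
}

// Spawn-and-join churn: each cycle allocates pagesPerWorker pages and then
// "frees" them. Returns how many full cycles fit before allocation fails.
function cyclesUntilExhaustion(pagesPerWorker) {
  if (pagesPerWorker <= 0) throw new RangeError("pagesPerWorker must be positive");
  const pool = new PtPool();
  let cycles = 0;
  for (;;) {
    for (let i = 0; i < pagesPerWorker; i++) {
      if (pool.allocPage() === 0) {
        return { cycles, usedBytes: pool.usedBytes };
      }
    }
    pool.releasePages(pagesPerWorker);
    cycles++;
  }
}

// 60 pages per worker is an illustrative figure, not a measurement.
console.log(cyclesUntilExhaustion(60)); // { cycles: 4, usedBytes: 983040 }
```

In this model the number of live workers never exceeds one, yet the pool still fills after 240 / 60 = 4 spawn-and-join cycles, which is the steady-state leak described in point 2.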

Notes
