Guest page-table pool exhausts under multiple V8 isolates, Node `worker_threads`/`cluster` hard-abort #74

## Summary

The guest page-table pool is a fixed **960 KiB** arena carved out of the 4 MiB
infrastructure reserve (`src/core/guest.h:53-54`). It is filled by a **bump
allocator that is never reclaimed** within a process — `g->pt_pool_next` only
advances and is reset solely at guest init / reset / fork
(`src/core/guest.c:358,470,1590`, `src/runtime/forkipc.c`). Because each 2 MiB
guest block that needs 4 KiB granularity costs one 4 KiB L3 page from this pool
(`guest_split_block` → `guest_alloc_pt_page`, `src/core/guest.c:166-194`), the
pool budgets roughly **480 MiB of distinct mapped address space over the whole
process lifetime** (240 pages × 2 MiB), regardless of how much is later
`munmap`'d.
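The budget arithmetic can be checked with a small sketch. The two offsets are the values quoted from `src/core/guest.h` in the Root cause section; `PT_PAGE_SIZE`, `BLOCK_SIZE`, and the helper function names are illustrative and are not taken from the source tree. It assumes every pool page ends up as an L3 table, which is why the figure is an upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as quoted from src/core/guest.h in this report. */
#define INFRA_PT_POOL_OFF      0x10000u
#define INFRA_PT_POOL_END_OFF  0x100000u

/* Illustrative names, not taken from the source tree. */
#define PT_PAGE_SIZE  4096u                  /* one 4 KiB L3 table page */
#define BLOCK_SIZE    (2u * 1024u * 1024u)   /* one 2 MiB guest block */

/* Bytes available to the page-table pool. */
static inline uint64_t pt_pool_bytes(void) {
    return (uint64_t)INFRA_PT_POOL_END_OFF - INFRA_PT_POOL_OFF;
}

/* Number of 4 KiB pages the pool can hand out. */
static inline uint64_t pt_pool_pages(void) {
    return pt_pool_bytes() / PT_PAGE_SIZE;
}

/* Upper bound on address space that can be split to 4 KiB granularity,
 * assuming one L3 page per 2 MiB block and no reclaim. */
static inline uint64_t pt_pool_coverage_bytes(void) {
    return pt_pool_pages() * BLOCK_SIZE;
}
```

Under these assumptions `pt_pool_bytes()` is 983040 (0xF0000), `pt_pool_pages()` is 240, and `pt_pool_coverage_bytes()` is 480 MiB, matching the figures above and the byte count in the abort message.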

A single-isolate Node process stays under that ceiling and runs fine. But every
additional V8 isolate (`worker_threads`, `cluster`) reserves and commits its own
multi-hundred-MiB regions, and the cumulative, non-reclaimed consumption crosses
the 240-page budget. When `guest_alloc_pt_page` returns 0, the guest `mmap`
fails; V8's allocator then trips on a follow-up `munmap` that returns non-zero,
and the process **hard-aborts**:

```
src/core/guest.c:178: guest: page table pool exhausted (used 983040 / 983040 bytes)

# Fatal error in , line 0
# Check failed: 0 == munmap(address, size).
```

(`983040 == 0xF0000 == 960 KiB`, i.e. the pool is 100% consumed.)

This surfaced while testing advanced Node workloads on `oci run node:alpine`
(node v26.3.0) — `oci run` is the OCI image support from PR #34 — on top of the
`oci-image-rebase` branch with PR #73 applied. Single-threaded, event-loop web
servers (Express, raw `http`) are unaffected; the failure is specific to
spinning up **multiple live V8 isolates**.

## Reproduction

Assumes `node:alpine` is already pulled. `--entrypoint node` is required because
the image entrypoint is `docker-entrypoint.sh` (a shell script, not an ELF).

### Crashes — concurrent worker isolates after normal allocator traffic

The advanced built-in battery (crypto / zlib / fs / streams / net, then two
sequential workers, then **four concurrent workers**) deterministically
exhausts the pool at the concurrent-worker stage. The script below is that
final stage:

```sh
./build/elfuse oci run --entrypoint node node:alpine -e '
const {Worker}=require("worker_threads");
const mk=n=>new Promise((res,rej)=>{
  const w=new Worker(`require("worker_threads").parentPort.postMessage(${n}*${n})`,{eval:true});
  w.on("message",m=>{w.terminate();res(m);}); w.on("error",rej);
});
Promise.all([mk(2),mk(3),mk(4),mk(5)]).then(r=>console.log(r)).catch(e=>console.error(e));
'
# -> guest: page table pool exhausted (used 983040 / 983040 bytes)
# -> # Check failed: 0 == munmap(address, size).
```

### Works — single isolate / single worker

```sh
# plain node, no worker .................. OK (well under 80% warn threshold)
./build/elfuse oci run --entrypoint node node:alpine --version          # v26.3.0

# one worker (2 isolates) ................ OK, prints 500500
./build/elfuse oci run --entrypoint node node:alpine -e '
const {Worker}=require("worker_threads");
new Promise((res,rej)=>{const w=new Worker("let s=0;for(let i=0;i<=1000;i++)s+=i;require(\"worker_threads\").parentPort.postMessage(s)",{eval:true});
w.on("message",m=>{w.terminate();res(m)});w.on("error",rej);}).then(s=>console.log(s));'
```

### Threshold

Measured with minimal workers held live until all are up: **1 and 2 concurrent
worker isolates pass cleanly; with 3, the run no longer completes** (consistent
with the abort hanging guest teardown). The 4-concurrent-worker script above is
the deterministic, cleanly observed crash. The boundary drops further once the
main isolate has already done real work (an HTTP server plus the
crypto/zlib/stream/net battery), because pool consumption is cumulative and
never reclaimed. A `worker_threads` pool of size `os.cpus().length` (the common
default for CPU-bound work) or any `cluster`-based server will not start.

For contrast, everything that does **not** add isolates passes on the same
binary: HTTP server reachable from the host, `crypto` (sha256/aes-256-cbc/
randomBytes), `zlib` gzip, `fs`, `stream` pipeline, timers, raw `net` TCP, and a
full Express 5.2.1 REST app serving 100 concurrent requests.

## Root cause

1. **Fixed, boxed-in pool.** `INFRA_PT_POOL_OFF 0x10000 .. INFRA_PT_POOL_END_OFF
   0x100000` = 960 KiB, sitting inside the 4 MiB `INFRA_RESERVE`. The **shim
   code slot starts immediately above at `+0x100000`**
   (`INFRA_SHIM_OFF`, `src/core/guest.h:55`), so the pool cannot grow in place
   without relocating the shim/shim-data slots and enlarging the reserve.

2. **Bump allocator, no reclaim.** `guest_alloc_pt_page` (`guest.c:166-194`)
   only ever advances `pt_pool_next`; there is no free path. `sys_munmap` tears
   down regions and invalidates PTEs but does **not** return L3 pages to the
   pool. So a `worker_threads` pool that spawns and joins workers *leaks* PT
   pages on every churn, even at steady-state isolate count.

3. **Per-isolate footprint.** Each V8 isolate maps its own committed heap /
   code / cage regions, at one L3 page per touched 2 MiB block. With the main
   isolate plus two workers, the footprint already approaches the 240-page
   budget; a third worker crosses it.

4. **Failure mode is an abort, not a clean OOM.** On exhaustion the guest mmap
   fails, but V8's subsequent `munmap` returns non-zero and trips a `CHECK`,
   killing the process instead of surfacing a JS `RangeError`/OOM. The
   post-exhaustion mmap-bookkeeping is left inconsistent enough that the paired
   `munmap` cannot succeed.
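Points 1 and 2 can be illustrated with a toy model of the allocator. This is a sketch of the behaviour described above, not the code in `guest.c`: `struct pt_pool_model` and every `model_*` function are hypothetical names, and the model assumes each spawn/join cycle splits 2 MiB blocks that have not been split before.

```c
#include <assert.h>
#include <stdint.h>

#define POOL_START 0x10000u    /* INFRA_PT_POOL_OFF as quoted above */
#define POOL_END   0x100000u   /* INFRA_PT_POOL_END_OFF as quoted above */
#define PT_PAGE    4096u       /* one L3 table page */

/* Hypothetical stand-in for the guest state; only the cursor is modeled. */
struct pt_pool_model {
    uint32_t next;             /* plays the role of g->pt_pool_next */
};

/* Only init / reset / fork rewind the cursor. */
static void model_reset(struct pt_pool_model *p) {
    p->next = POOL_START;
}

/* Returns the offset of a fresh page, or 0 when the pool is exhausted. */
static uint32_t model_alloc_pt_page(struct pt_pool_model *p) {
    if (p->next + PT_PAGE > POOL_END)
        return 0;
    uint32_t page = p->next;
    p->next += PT_PAGE;        /* bump; nothing ever moves this back */
    return page;
}

/* munmap in this model invalidates mappings but returns no pages. */
static void model_munmap(struct pt_pool_model *p, uint32_t blocks) {
    (void)p;
    (void)blocks;
}

/* One worker lifetime: split `blocks` fresh 2 MiB blocks, then unmap them.
 * Returns 1 on success, 0 if the pool ran out part-way through. */
static int model_worker_cycle(struct pt_pool_model *p, uint32_t blocks) {
    for (uint32_t i = 0; i < blocks; i++)
        if (model_alloc_pt_page(p) == 0)
            return 0;
    model_munmap(p, blocks);
    return 1;
}

/* Count how many spawn/join cycles complete before exhaustion. */
static uint32_t model_cycles_until_exhausted(uint32_t blocks_per_worker) {
    struct pt_pool_model p;
    uint32_t cycles = 0;
    if (blocks_per_worker == 0)
        return 0;              /* nothing is consumed; avoid looping forever */
    model_reset(&p);
    while (model_worker_cycle(&p, blocks_per_worker))
        cycles++;
    return cycles;
}
```

With a purely illustrative 60 blocks per worker, `model_cycles_until_exhausted(60)` returns 4: the fifth worker fails even though the previous four have all been joined and unmapped, which is the churn leak described in point 2. The per-worker block count is an assumption chosen to divide the budget evenly, not a measured V8 footprint.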

## Notes

- The pool already warns at 80% (`guest.c:185-194`,
  `guest: page table pool at N%`), so the runway is observable before the abort.
- Reachable through the OCI `oci run` path added by PR #34 (`oci run node:alpine`),
  but the page-table pool is core guest infrastructure (`src/core/guest.*`) and
  the limit is not specific to OCI — any multi-isolate guest hits it.
- Independent of PR #73 (epoll-mt). #73 is what lets multi-threaded Node get far
  enough to *hit* this; the pool limit is a separate, pre-existing constraint.
- `io_uring_setup` (syscall 425) is unimplemented and warns, but Node falls back
  to the libuv threadpool cleanly — unrelated to this issue.
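For the first note, a small helper shows how the remaining runway could be derived from the `used N / 983040 bytes` figure in the log. It is a sketch under stated assumptions: the function names are hypothetical, and treating the warning as firing at `>= 80` percent with integer rounding is a guess about the check in `guest.c`, not a reading of it.

```c
#include <assert.h>
#include <stdint.h>

#define PT_POOL_BYTES  983040u   /* 0xF0000, from the log line in this report */
#define PT_PAGE_SIZE   4096u     /* one L3 table page */
#define WARN_PERCENT   80u       /* warning threshold named in the notes */

/* Integer percentage of the pool consumed (rounded down). */
static inline uint32_t pt_pool_percent(uint32_t used_bytes) {
    if (used_bytes > PT_POOL_BYTES)
        used_bytes = PT_POOL_BYTES;
    return (uint32_t)(((uint64_t)used_bytes * 100u) / PT_POOL_BYTES);
}

/* L3 pages, and so further 2 MiB blocks that can be split, still available. */
static inline uint32_t pt_pool_pages_left(uint32_t used_bytes) {
    if (used_bytes > PT_POOL_BYTES)
        used_bytes = PT_POOL_BYTES;
    return (PT_POOL_BYTES - used_bytes) / PT_PAGE_SIZE;
}

/* Assumed comparison; the real check in guest.c may differ. */
static inline int pt_pool_should_warn(uint32_t used_bytes) {
    return pt_pool_percent(used_bytes) >= WARN_PERCENT;
}
```

Under these assumptions the warning would first fire at 192 pages used, leaving 48 pages, or about 96 MiB of further 4 KiB-granular address space, before the abort.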

