Grow the page-table pool for multi-isolate guests #75

Merged. jserv merged 1 commit into sysprog21:main from Max042004:pt-pool-grow on Jun 6, 2026.

Max042004 (Collaborator) commented Jun 5, 2026

Fixes #74.

The guest page-table pool was a fixed 960 KiB arena (240 x 4 KiB L3 pages) at the bottom of the 4 MiB infra reserve. It is a bump allocator that never reclaims on munmap, and each 2 MiB block that needs mixed permissions (e.g. V8 JIT W^X) draws one L3 page. The pool therefore budgeted only ~480 MiB (240 x 2 MiB) of split address space over a process lifetime. A single guest stays well under that, but every extra V8 isolate that a Node worker_threads pool or cluster spins up reserves its own committed regions. The third isolate exhausted the pool. After that, even munmap (which must split a block to invalidate a sub-range) failed, and V8 hard-aborted on CHECK(0 == munmap) instead of getting a clean ENOMEM.
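
To make the ~480 MiB figure concrete, here is a minimal sketch of a bump-style L3 page pool. The names (`pt_pool`, `pt_pool_alloc`, `pt_pool_split_budget`) are hypothetical and not taken from the elfuse source; only the sizes come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_4KIB  4096u
#define BLOCK_2MIB (2u * 1024u * 1024u)

/* One L3 table (4 KiB, 512 entries) describes exactly one 2 MiB block. */
static_assert(BLOCK_2MIB / PAGE_4KIB == 512, "512 x 4 KiB pages per 2 MiB block");

/* Hypothetical pool descriptor; field names are illustrative only. */
struct pt_pool {
    uint64_t base; /* address of the first L3 page in the pool */
    uint64_t next; /* bump pointer: next unused L3 page */
    uint64_t end;  /* one past the last usable byte */
};

/* Hand out one 4 KiB L3 page. Nothing is ever returned to the pool, so
 * 'next' only moves forward. Returns 0 on success, -ENOMEM when full. */
static int pt_pool_alloc(struct pt_pool *p, uint64_t *page_out)
{
    if (p->end - p->next < PAGE_4KIB)
        return -ENOMEM;
    *page_out = p->next;
    p->next += PAGE_4KIB;
    return 0;
}

/* Total address space that can be split into 4 KiB mappings over the
 * pool's lifetime: one 2 MiB block per L3 page. */
static uint64_t pt_pool_split_budget(const struct pt_pool *p)
{
    return (p->end - p->base) / PAGE_4KIB * BLOCK_2MIB;
}
```

For a 240-page pool, `pt_pool_split_budget()` evaluates to 480 MiB, and the 241st `pt_pool_alloc()` call returns `-ENOMEM`, because pages consumed by earlier splits are never handed back.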

Grow INFRA_RESERVE 4 MiB -> 16 MiB and tighten the shim code slot from a round 1 MiB to 40 KiB (INFRA_SHIM_SLOT, ~6x the ~7 KiB shim blob) so the freed space falls through to the pool, which becomes ~13.9 MiB (3558 pages, ~7 GiB of split address space) -- enough for an os.cpus()-sized worker pool on a 24-core Ultra. The reserve is demand-paged and sits in the ~4 GiB dead zone below interp_base, so the unused pool costs no host RAM and the larger virtual reserve is free.
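
The page counts quoted above can be reproduced with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the 64 KiB guard at the bottom of the reserve is an assumption inferred from the numbers (it is the value that makes both the old 240-page and the new 3558-page figures come out exactly).

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ull
#define MIB (1024ull * KIB)

#define NULL_GUARD  (64 * KIB) /* assumed, inferred from the page counts */
#define SHIM_DATA   (2 * MIB)  /* top 2 MiB block of the reserve */
#define L3_PAGE     (4 * KIB)  /* size of one L3 table */
#define L3_COVERAGE (2 * MIB)  /* address space described by one L3 table */

static_assert(L3_COVERAGE / L3_PAGE == 512, "one L3 table holds 512 entries");

/* L3 pages left for the pool once the guard, the shim code slot, and
 * the shim data block are carved out of the reserve. */
static uint64_t pool_pages(uint64_t reserve, uint64_t shim_slot)
{
    return (reserve - NULL_GUARD - shim_slot - SHIM_DATA) / L3_PAGE;
}

/* Split address space the pool can describe over a process lifetime. */
static uint64_t split_budget(uint64_t reserve, uint64_t shim_slot)
{
    return pool_pages(reserve, shim_slot) * L3_COVERAGE;
}
```

Under these assumptions, `pool_pages(4 * MIB, 1 * MIB)` gives the old 240 pages and `pool_pages(16 * MIB, 40 * KIB)` gives 3558 pages, i.e. 7116 MiB (just under 7 GiB) of split address space.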

The layout keeps every invariant: shim data still occupies the top 2 MiB block (shim_data_base + 2 MiB == interp_base), shim code still shares the PT pool's last 2 MiB block, and the pool still ends where the shim slot begins. Only the constants in core/guest.h change; consumers derive their addresses from g->shim_base / pt_pool_base, and fork copies only the used pool, so nothing else moves.
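
As a rough illustration of how consumers can derive every address from interp_base, the sketch below recomputes the layout and checks the invariants listed above. The constant names echo those in the PR, but their definitions here, the `infra_layout` struct, and both helper functions are assumptions made for the example, not the contents of core/guest.h.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ull
#define MIB (1024ull * KIB)

/* Values from the PR description; the definitions themselves are assumed. */
#define INFRA_RESERVE       (16 * MIB)
#define BLOCK_2MIB          (2 * MIB)
#define INFRA_SHIM_SLOT     (40 * KIB)
#define INFRA_SHIM_DATA_OFF (INFRA_RESERVE - BLOCK_2MIB)
#define INFRA_SHIM_OFF      (INFRA_SHIM_DATA_OFF - INFRA_SHIM_SLOT)

static_assert(INFRA_SHIM_OFF % (4 * KIB) == 0, "shim slot must be page-aligned");

/* Hypothetical container for the derived guest addresses. */
struct infra_layout {
    uint64_t reserve_base;   /* interp_base - INFRA_RESERVE */
    uint64_t shim_base;      /* start of the shim code slot (pool ends here) */
    uint64_t shim_data_base; /* start of the top 2 MiB block */
};

/* Anchor the reserve directly below interp_base and derive the rest. */
static struct infra_layout layout_from_interp_base(uint64_t interp_base)
{
    struct infra_layout l;
    l.reserve_base = interp_base - INFRA_RESERVE;
    l.shim_base = l.reserve_base + INFRA_SHIM_OFF;
    l.shim_data_base = l.reserve_base + INFRA_SHIM_DATA_OFF;
    return l;
}

/* True when the pool's last byte and the shim slot's last byte fall in
 * the same 2 MiB block, i.e. shim code shares the pool's tail block. */
static int shim_shares_pool_tail_block(const struct infra_layout *l)
{
    return (l->shim_base - 1) / BLOCK_2MIB ==
           (l->shim_data_base - 1) / BLOCK_2MIB;
}
```

For any 2 MiB-aligned interp_base, `shim_data_base + BLOCK_2MIB` equals interp_base and the shim slot ends exactly where shim data begins, so changing the constants moves every derived address together.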

Because the shim slot is now tight, a _Static_assert in main.c and a runtime re-check in bootstrap.c fail loudly if the shim ever outgrows INFRA_SHIM_SLOT instead of letting it silently overlap the shim-data block.
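
The two guards could look roughly like the following. This is a sketch, not the code in main.c or bootstrap.c: `shim_bin_stub` is a zero-filled stand-in for the real shim blob, and `check_shim_fits` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define INFRA_SHIM_SLOT (40u * 1024u) /* slot size from the PR description */

/* Stand-in for the embedded shim blob (~7 KiB per the PR description). */
static const unsigned char shim_bin_stub[7 * 1024];

/* Build-time guard: compilation fails if the blob outgrows its slot. */
static_assert(sizeof(shim_bin_stub) <= INFRA_SHIM_SLOT,
              "shim blob must fit in INFRA_SHIM_SLOT");

/* Runtime guard for a blob whose size is only known at load time.
 * Returns 0 if it fits, -1 after logging both sizes otherwise. */
static int check_shim_fits(size_t shim_size)
{
    if (shim_size > INFRA_SHIM_SLOT) {
        fprintf(stderr, "shim blob is %zu bytes but the slot is %u bytes\n",
                shim_size, (unsigned)INFRA_SHIM_SLOT);
        return -1;
    }
    return 0;
}
```

Logging the slot size next to the blob size tells whoever hits the failure how far INFRA_SHIM_SLOT needs to grow.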

Surfaced while bringing up node:alpine through the OCI image work, where worker_threads pools of 8 and 16 isolates now run to completion; the fix is in the core guest runtime and is independent of that. The full guest test suite passes, including test-mremap-infra (infra-reserve boundary) and test-mprotect-mt (W^X L3 splitting).


Summary by cubic

Expand the infra reserve to 16 MiB and tighten the shim code slot to 40 KiB to grow the guest page‑table pool to ~13.9 MiB and avoid PT pool exhaustion in multi‑isolate V8 workloads. Adds build‑time invariants and runtime checks for the infra layout and shim size.

  • Bug Fixes
    • Increased INFRA_RESERVE 4 MiB → 16 MiB; PT pool now ~13.9 MiB (3558 pages, ~7 GiB split space).
    • Tightened shim code slot to 40 KiB (INFRA_SHIM_SLOT); shim data stays in the top 2 MiB; layout invariants preserved.
    • Added build‑time _Static_asserts for layout invariants (pool end == shim base, shim_data in top 2 MiB, 2 MiB/page alignment) and a runtime size check with slot size in logs.
    • Reserve is demand‑paged; unused pool costs no RAM. Tests pass; Node pools with 8–16 isolates run to completion.

Written for commit 51c9b0a. Summary will update on new commits.

cubic-dev-ai[bot]

This comment was marked as resolved.

jserv (Contributor) left a comment

Layout math, fork-IPC compatibility, finalize_block_perms behavior, and the icache-invalidate cache-line edge all check out. Two small follow-ups before this lands, both P3.

One: the layout invariants documented in src/core/guest.h:62-66 (INFRA_PT_POOL_END_OFF == INFRA_SHIM_OFF, INFRA_SHIM_DATA_OFF + BLOCK_2MIB == INFRA_RESERVE, alignments) aren't compile-time enforced -- see inline comment on src/main.c:131.

Two: src/runtime/forkipc.c:132 still says "plus the 4 MiB infra reserve below it." Functional check (8 GiB lower bound) still passes, but the comment will mislead anyone debugging guest-size limits. One-line update.

Separately (not this PR's job): docs/internals.md:95-113 carries a pre-existing memory-map table showing a legacy layout with the infra reserve at LOW addresses; the actual code places it at the top of IPA. Worth a separate doc-cleanup PR rather than folding into this change.

Comment thread src/main.c
* pool. If the shim ever outgrows the slot it would overlap the shim-data
* block; fail the build loudly rather than corrupt memory at boot. Enlarge
* INFRA_SHIM_SLOT (and shrink the pool to match) if this fires. */
_Static_assert(sizeof(shim_bin) <= INFRA_SHIM_SLOT,

The shim-blob size is asserted, but the layout invariants documented in guest.h:62-66 are only guarded by comment. A future edit that grows the pool by shifting INFRA_PT_POOL_END_OFF without touching INFRA_SHIM_OFF would silently overlap the pool and the shim slot. Consider adding next to this one:

_Static_assert(INFRA_PT_POOL_END_OFF == INFRA_SHIM_OFF,
               "PT pool must end exactly where the shim slot begins");
_Static_assert(INFRA_SHIM_DATA_OFF + BLOCK_2MIB == INFRA_RESERVE,
               "shim_data must occupy the top 2 MiB block of the reserve");
_Static_assert((INFRA_SHIM_DATA_OFF & (BLOCK_2MIB - 1)) == 0,
               "shim_data must be 2 MiB-aligned");
_Static_assert((INFRA_PT_POOL_OFF & 0xFFF) == 0 &&
               (INFRA_PT_POOL_END_OFF & 0xFFF) == 0,
               "PT pool offsets must be page-aligned");

Max042004 added a commit to Max042004/elfuse that referenced this pull request on Jun 6, 2026:

The Memory Layout table in internals.md showed the page-table pool, shim
code, and shim data at low addresses (0x10000-0x3FFFFF), a legacy layout.
The code (compute_infra_layout in src/core/guest.c) anchors the infra
reserve at [interp_base - INFRA_RESERVE, interp_base), in the dead zone
above mmap_limit, so EL0 binaries are free to load at low addresses without
colliding with the runtime.

Rewrite the table to split low fixed addresses from the interp_base-anchored
high reserve, and describe the reserve's internal layout (null guard, page
table pool, shim code sharing the pool's tail 2 MiB block, shim data) to
match src/core/guest.h. Numbers reflect the grown 16 MiB reserve / ~13.9 MiB
pool from sysprog21#75; this doc PR should land together with or after that change.

jserv merged commit 7e0fe90 into sysprog21:main on Jun 6, 2026
4 checks passed

jserv (Contributor) commented Jun 6, 2026

Thanks, @Max042004, for contributing!


Development

Successfully merging this pull request may close these issues.

Guest page-table pool exhausts under multiple V8 isolates, Node worker_threads/cluster hard-abort

2 participants