Grow the page-table pool for multi-isolate guests #75
jserv left a comment
Layout math, fork-IPC compatibility, finalize_block_perms behavior, and the icache-invalidate cache-line edge all check out. Two small follow-ups before this lands, both P3.
One: the layout invariants documented in src/core/guest.h:62-66 (INFRA_PT_POOL_END_OFF == INFRA_SHIM_OFF, INFRA_SHIM_DATA_OFF + BLOCK_2MIB == INFRA_RESERVE, alignments) aren't compile-time enforced -- see inline comment on src/main.c:131.
Two: src/runtime/forkipc.c:132 still says "plus the 4 MiB infra reserve below it." Functional check (8 GiB lower bound) still passes, but the comment will mislead anyone debugging guest-size limits. One-line update.
Separately (not this PR's job): docs/internals.md:95-113 carries a pre-existing memory-map table showing a legacy layout with the infra reserve at LOW addresses; the actual code places it at the top of IPA. Worth a separate doc-cleanup PR rather than folding into this change.
 * pool. If the shim ever outgrows the slot it would overlap the shim-data
 * block; fail the build loudly rather than corrupt memory at boot. Enlarge
 * INFRA_SHIM_SLOT (and shrink the pool to match) if this fires. */
_Static_assert(sizeof(shim_bin) <= INFRA_SHIM_SLOT,
The shim-blob size is asserted, but the layout invariants documented in guest.h:62-66 are only guarded by comment. A future edit that grows the pool by shifting INFRA_PT_POOL_END_OFF without touching INFRA_SHIM_OFF would silently overlap the pool and the shim slot. Consider adding next to this one:
_Static_assert(INFRA_PT_POOL_END_OFF == INFRA_SHIM_OFF,
               "PT pool must end exactly where the shim slot begins");
_Static_assert(INFRA_SHIM_DATA_OFF + BLOCK_2MIB == INFRA_RESERVE,
               "shim_data must occupy the top 2 MiB block of the reserve");
_Static_assert((INFRA_SHIM_DATA_OFF & (BLOCK_2MIB - 1)) == 0,
               "shim_data must be 2 MiB-aligned");
_Static_assert((INFRA_PT_POOL_OFF & 0xFFF) == 0 &&
               (INFRA_PT_POOL_END_OFF & 0xFFF) == 0,
               "PT pool offsets must be page-aligned");
The Memory Layout table in internals.md showed the page-table pool, shim code, and shim data at low addresses (0x10000-0x3FFFFF), a legacy layout. The code (compute_infra_layout in src/core/guest.c) anchors the infra reserve at [interp_base - INFRA_RESERVE, interp_base), in the dead zone above mmap_limit, so EL0 binaries are free to load at low addresses without colliding with the runtime. Rewrite the table to split low fixed addresses from the interp_base-anchored high reserve, and describe the reserve's internal layout (null guard, page table pool, shim code sharing the pool's tail 2 MiB block, shim data) to match src/core/guest.h. Numbers reflect the grown 16 MiB reserve / ~13.9 MiB pool from sysprog21#75; this doc PR should land together with or after that change.
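To make the interp_base-anchored layout concrete, here is a minimal sketch of how the reserve's sub-regions could be derived from interp_base. The macro names mirror those quoted in this thread, but their values are inferred from the stated figures (16 MiB reserve, 40 KiB shim slot, 3558-page pool). The 64 KiB null guard, `struct infra_layout`, and `infra_layout_at` are assumptions for illustration, not the contents of src/core/guest.h or compute_infra_layout.

```c
#include <stdint.h>

#define KIB(n) ((uint64_t)(n) << 10)
#define MIB(n) ((uint64_t)(n) << 20)

/* Offsets within the reserve. Values are inferred from this thread,
 * not copied from guest.h. */
#define BLOCK_2MIB            MIB(2)
#define INFRA_RESERVE         MIB(16)
#define INFRA_SHIM_SLOT       KIB(40)
#define INFRA_PT_POOL_OFF     KIB(64)  /* assumed: pool follows a 64 KiB null guard */
#define INFRA_SHIM_DATA_OFF   (INFRA_RESERVE - BLOCK_2MIB)
#define INFRA_SHIM_OFF        (INFRA_SHIM_DATA_OFF - INFRA_SHIM_SLOT)
#define INFRA_PT_POOL_END_OFF INFRA_SHIM_OFF

/* Hypothetical result type; the real guest struct has other fields. */
struct infra_layout {
    uint64_t reserve_base;   /* interp_base - INFRA_RESERVE */
    uint64_t pt_pool_base;   /* first L3 page in the pool */
    uint64_t pt_pool_end;    /* one past the last pool page */
    uint64_t shim_base;      /* shim code, in the pool's tail 2 MiB block */
    uint64_t shim_data_base; /* top 2 MiB block of the reserve */
};

/* Anchor the reserve at [interp_base - INFRA_RESERVE, interp_base). */
static struct infra_layout infra_layout_at(uint64_t interp_base)
{
    struct infra_layout l;
    l.reserve_base   = interp_base - INFRA_RESERVE;
    l.pt_pool_base   = l.reserve_base + INFRA_PT_POOL_OFF;
    l.pt_pool_end    = l.reserve_base + INFRA_PT_POOL_END_OFF;
    l.shim_base      = l.reserve_base + INFRA_SHIM_OFF;
    l.shim_data_base = l.reserve_base + INFRA_SHIM_DATA_OFF;
    return l;
}
```

With these assumed offsets the pool spans exactly 3558 pages and the shim slot ends where shim data begins, which agrees with the numbers quoted from #75.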
Thank @Max042004 for contributing!
Fixes #74.
The guest page-table pool was a fixed 960 KiB arena (240 x 4 KiB L3 pages) at the bottom of the 4 MiB infra reserve. It is a bump allocator that never reclaims on munmap, and each 2 MiB block that needs mixed permissions (e.g. V8 JIT W^X) draws one L3 page, so the pool budgeted only ~480 MiB of split address space over a process lifetime. A single guest stays well under that, but every extra V8 isolate a Node worker_threads pool or cluster spins up reserves its own committed regions; the third isolate exhausted the pool, after which even munmap (which must split a block to invalidate a sub-range) failed and V8 hard-aborted on CHECK(0 == munmap) instead of getting a clean ENOMEM.
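The exhaustion mode described above can be sketched with a toy bump allocator. `struct pt_pool`, `pt_pool_alloc`, and `split_block` are hypothetical names for illustration; the runtime's actual allocator and its error paths may differ.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PT_PAGE_SIZE 4096u

/* Hypothetical pool descriptor; the real runtime's fields may differ. */
struct pt_pool {
    uint64_t base;  /* guest address of the first pool page */
    size_t   total; /* pages in the pool (240 before this PR) */
    size_t   used;  /* bump cursor; it only ever grows */
};

/* Hand out the next L3 table page, or -ENOMEM once the pool is spent.
 * There is deliberately no free path: munmap never returns pages. */
static int pt_pool_alloc(struct pt_pool *p, uint64_t *out)
{
    if (p->used >= p->total)
        return -ENOMEM;
    *out = p->base + (uint64_t)p->used * PT_PAGE_SIZE;
    p->used++;
    return 0;
}

/* Splitting a 2 MiB block into 4 KiB entries costs one L3 page, whether
 * the caller is mprotect (mixed permissions) or munmap (partial unmap). */
static int split_block(struct pt_pool *p, uint64_t *l3_page)
{
    return pt_pool_alloc(p, l3_page);
}
```

Because a partial munmap also goes through the split path in this model, an exhausted pool makes unmapping fail too, which is the situation V8's CHECK(0 == munmap) ran into.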
Grow INFRA_RESERVE 4 MiB -> 16 MiB and tighten the shim code slot from a round 1 MiB to 40 KiB (INFRA_SHIM_SLOT, ~6x the ~7 KiB shim blob) so the freed space falls through to the pool, which becomes ~13.9 MiB (3558 pages, ~7 GiB of split address space) -- enough for an os.cpus()-sized worker pool on a 24-core Ultra. The reserve is demand-paged and sits in the ~4 GiB dead zone below interp_base, so the unused pool costs no host RAM and the larger virtual reserve is free.
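The capacity figures follow from one L3 page covering one 2 MiB block at 4 KiB granularity. A small sketch of that arithmetic, where `split_capacity` and `pool_bytes` are illustrative helpers rather than functions from the codebase:

```c
#include <stdint.h>

#define PT_PAGE_SIZE 4096ull
#define BLOCK_2MIB   (2ull << 20)

/* Bytes of address space that `pages` L3 tables can split:
 * each L3 page maps exactly one 2 MiB block. */
static uint64_t split_capacity(uint64_t pages)
{
    return pages * BLOCK_2MIB;
}

/* Size of a pool holding `pages` 4 KiB table pages, in bytes. */
static uint64_t pool_bytes(uint64_t pages)
{
    return pages * PT_PAGE_SIZE;
}
```

For 240 pages this gives 960 KiB of pool and 480 MiB of split space; for 3558 pages it gives 14232 KiB (about 13.9 MiB) of pool and 7116 MiB (about 6.95 GiB) of split space.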
The layout keeps every invariant: shim data still occupies the top 2 MiB block (shim_data_base + 2 MiB == interp_base), shim code still shares the PT pool's last 2 MiB block, and the pool still ends where the shim slot begins. Only the constants in core/guest.h change; consumers derive their addresses from g->shim_base / pt_pool_base, and fork copies only the used pool, so nothing else moves.
Because the shim slot is now tight, a _Static_assert in main.c and a runtime re-check in bootstrap.c fail loudly if the shim ever outgrows INFRA_SHIM_SLOT instead of letting it silently overlap the shim-data block.
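A minimal sketch of the two guards, assuming a placeholder blob in place of the real embedded shim. `check_shim_fits` and its log message are hypothetical and do not reproduce the code in main.c or bootstrap.c.

```c
#include <stddef.h>
#include <stdio.h>

#define INFRA_SHIM_SLOT (40u * 1024u) /* 40 KiB, per the description */

/* Placeholder for the embedded shim blob (~7 KiB per the description). */
static const unsigned char shim_bin[7 * 1024] = {0};

/* Build-time guard: compilation fails if the blob outgrows its slot. */
_Static_assert(sizeof(shim_bin) <= INFRA_SHIM_SLOT,
               "shim blob exceeds INFRA_SHIM_SLOT");

/* Runtime guard for a size known only at load time.
 * Returns 0 if the shim fits, -1 (after logging both sizes) otherwise. */
static int check_shim_fits(size_t shim_size)
{
    if (shim_size > INFRA_SHIM_SLOT) {
        fprintf(stderr, "shim is %zu bytes but the slot is %u bytes\n",
                shim_size, (unsigned)INFRA_SHIM_SLOT);
        return -1;
    }
    return 0;
}
```

The compile-time check catches a blob that grows in the source tree, while the runtime check covers any path where the size is not a compile-time constant.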
Surfaced while bringing up node:alpine through the OCI image work, where worker_threads pools of 8 and 16 isolates now run to completion; the fix is in the core guest runtime and is independent of that. The full guest test suite passes, including test-mremap-infra (infra-reserve boundary) and test-mprotect-mt (W^X L3 splitting).
Summary by cubic
Expand the infra reserve to 16 MiB and tighten the shim code slot to 40 KiB to grow the guest page‑table pool to ~13.9 MiB and avoid PT pool exhaustion in multi‑isolate V8 workloads. Adds build‑time invariants and runtime checks for the infra layout and shim size.
- INFRA_RESERVE 4 MiB → 16 MiB; PT pool now ~13.9 MiB (3558 pages, ~7 GiB split space).
- Shim code slot tightened to 40 KiB (INFRA_SHIM_SLOT); shim data stays in the top 2 MiB; layout invariants preserved.
- Adds _Static_asserts for layout invariants (pool end == shim base, shim_data in top 2 MiB, 2 MiB/page alignment) and a runtime size check with slot size in logs.

Written for commit 51c9b0a.