
Back read-only MAP_SHARED file mappings with MAP_PRIVATE #84

Open
Max042004 wants to merge 1 commit into sysprog21:main from Max042004:fix-mmap-shared-ro

Conversation

Collaborator

Max042004 commented Jun 6, 2026

A MAP_SHARED, PROT_READ mapping of a file opened O_RDONLY could never be
installed. hvf_apply_file_overlay_quiesced() always mmap'd the host page
PROT_READ|PROT_WRITE and mapped the HVF segment RWX. On a read-only fd the
host mmap fails with EACCES (writable mapping of an O_RDONLY fd). Forcing
PROT_READ instead makes hv_vm_map() fail with HV_ERROR, because a
MAP_SHARED mapping of an O_RDONLY fd has macOS max_protection=READ and HVF
cannot grant stage-2 rights (RWX) beyond the host region's max_protection.
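
The host-side half of this failure can be reproduced outside elfuse. The sketch below is illustrative only: try_map_rdonly() is a hypothetical helper, not code from this PR, and it exercises only the POSIX mmap() rule (a writable MAP_SHARED mapping needs a writable fd), not the macOS max_protection or hv_vm_map() behavior described above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page of a freshly created file that has been
 * reopened O_RDONLY, using the given prot and flags.
 * Returns 0 if mmap() succeeds, the errno it reported if it fails, or -1 if
 * the temporary file could not be set up. */
static int try_map_rdonly(int prot, int flags)
{
    char path[] = "/tmp/mmap-ro-XXXXXX";
    int wfd = mkstemp(path);
    if (wfd < 0)
        return -1;

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || ftruncate(wfd, (off_t) page) != 0) {
        close(wfd);
        unlink(path);
        return -1;
    }
    close(wfd);

    /* Reopen read-only; this is the fd mode the PR description is about. */
    int fd = open(path, O_RDONLY);
    unlink(path);
    if (fd < 0)
        return -1;

    errno = 0;
    void *p = mmap(NULL, (size_t) page, prot, flags, fd, 0);
    int err = (p == MAP_FAILED) ? errno : 0;
    if (p != MAP_FAILED)
        munmap(p, (size_t) page);
    close(fd);
    return err;
}
```

Calling it with PROT_READ | PROT_WRITE and MAP_SHARED should report EACCES, while PROT_READ with MAP_SHARED, or PROT_READ | PROT_WRITE with MAP_PRIVATE, should succeed on the same read-only fd.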

This blocked every workload that maps a read-only file MAP_SHARED -- most
visibly the JVM, which maps its ~135 MiB lib/modules image exactly this
way and crashed on startup.

Choose the host backing from what the fd and the guest actually need:

  • guest wants PROT_WRITE: MAP_SHARED PROT_READ|PROT_WRITE (writes reach
    the file; an O_RDONLY fd still yields EACCES, matching Linux).
  • guest read-only on a writable fd: MAP_SHARED PROT_READ (max_protection
    is RWX, so the segment maps and cross-mapping coherence is preserved).
  • guest read-only on an O_RDONLY fd: MAP_PRIVATE PROT_READ. Its
    max_protection is RWX so the segment maps; the pages still show file
    content, and the guest's stage-1 tables keep the region read-only so
    the private copy is never dirtied -- no observable MAP_SHARED
    divergence for a read-only mapping.
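
The three cases above reduce to a small decision function. This is a minimal sketch that mirrors the two expressions quoted from src/syscall/mem.c in the review thread; the struct host_backing type and the choose_host_backing() name are assumptions made for illustration and do not appear in the PR.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical pair of host mmap() arguments chosen for an overlay. */
struct host_backing {
    int prot;  /* PROT_* bits passed to the host mmap() */
    int flags; /* MAP_SHARED or MAP_PRIVATE */
};

/* want_write:  the guest requested PROT_WRITE.
 * fd_writable: the backing fd was opened O_RDWR or O_WRONLY. */
static struct host_backing choose_host_backing(bool want_write, bool fd_writable)
{
    struct host_backing b;
    /* Only ask the host for write access when the guest needs it. */
    b.prot = want_write ? (PROT_READ | PROT_WRITE) : PROT_READ;
    /* Fall back to MAP_PRIVATE only for a read-only mapping of an
     * O_RDONLY fd; every other case keeps MAP_SHARED. */
    b.flags = (want_write || fd_writable) ? MAP_SHARED : MAP_PRIVATE;
    return b;
}
```

Note that a write request on an O_RDONLY fd still selects MAP_SHARED with PROT_READ | PROT_WRITE, so the host mmap() itself produces the EACCES that Linux would return.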

The guest-requested prot is threaded through hvf_apply_file_overlay(),
hvf_apply_file_overlay_quiesced(), and restore_file_overlay_range() so
every overlay install/restore site picks the correct backing.

Add test-mmap-shared-ro covering the O_RDONLY read path, a second
concurrent read-only mapping, EACCES on a writable request, and the
read-only-mapping-on-O_RDWR-fd branch.

(cherry picked from commit 337d39a4313109884112a86a0c4147bddfe18fa1)


Summary by cubic

Fixes read-only MAP_SHARED mappings of O_RDONLY files by backing them with MAP_PRIVATE when needed. This unblocks common workloads (e.g., JVM lib/modules) and restores Linux-compatible behavior.

  • Bug Fixes
    • Choose host backing based on guest prot and fd mode:
      • If guest needs write: MAP_SHARED | PROT_READ|PROT_WRITE (returns EACCES on O_RDONLY, matching Linux).
      • If guest is read-only on writable fd: MAP_SHARED | PROT_READ.
      • If guest is read-only on O_RDONLY fd: MAP_PRIVATE | PROT_READ (segment maps; no divergence since guest pages stay read-only).
    • Thread prot through overlay paths (apply/restore, sys_mmap, mremap, and fork install/restore) so each site picks the correct backing.
    • Preserve expected error semantics: writable shared mapping on O_RDONLY fd yields EACCES.
    • Add test-mmap-shared-ro and manifest entry covering:
      • Read-only MAP_SHARED on O_RDONLY.
      • A second concurrent read-only mapping.
      • Rejection of writable MAP_SHARED on O_RDONLY.
      • Read-only mapping on an O_RDWR fd.

Written for commit ace1dd6. Summary will update on new commits.


cubic-dev-ai[bot]

This comment was marked as resolved.

Comment thread src/syscall/mem.c
bool fd_writable = acc >= 0 && ((acc & O_ACCMODE) == O_RDWR ||
                                (acc & O_ACCMODE) == O_WRONLY);
int host_prot = want_write ? (PROT_READ | PROT_WRITE) : PROT_READ;
int share = (want_write || fd_writable) ? MAP_SHARED : MAP_PRIVATE;
Contributor


The MAP_PRIVATE substitution is sound while the mapping stays read-only, but sys_mprotect at mem.c:3275 doesn't know about the backing decision -- it just calls guest_region_set_prot + guest_update_perms(prot_to_perms(prot)). A guest that does:

int fd = open(path, O_RDONLY);
char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
mprotect(p, len, PROT_READ | PROT_WRITE);   // Linux: EACCES; here: succeeds
*p = 0xff;                                   // Writes to COW copy, not the file

will silently upgrade stage-1 to RW and write into the COW copy. Linux returns EACCES because the mapping remembers max_prot=READ from the O_RDONLY fd. Before this PR the upgrade was unreachable (the initial mmap failed); the MAP_PRIVATE fallback exposes it.

The cleanest fix is to track a host_backing_kind (or max_prot) on guest_region_t, set it when this branch is taken, carry it through snapshots/splits/merges/mremap, and have sys_mprotect return -LINUX_EACCES when PROT_WRITE would exceed it. That also closes the downstream gap where sys_msync at mem.c:3560 skips its pwrite-refresh path for overlay_active=true regions on the assumption "the page cache already keeps them coherent with the file" -- false for a MAP_PRIVATE backing.
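
A rough sketch of the max_prot idea follows. It is not elfuse code: struct region_sketch, region_max_prot(), and region_mprotect_check() are hypothetical stand-ins for guest_region_t and sys_mprotect, and the host's EACCES is used where the real code would return -LINUX_EACCES.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical, reduced stand-in for a guest region record. */
struct region_sketch {
    int prot;     /* current guest protection */
    int max_prot; /* ceiling recorded when the mapping was installed */
};

/* Ceiling for a MAP_SHARED file mapping: write access is only ever
 * allowed when the fd itself was opened writable. */
static int region_max_prot(bool fd_writable)
{
    return fd_writable ? (PROT_READ | PROT_WRITE) : PROT_READ;
}

/* Reject an upgrade to PROT_WRITE that exceeds the recorded ceiling;
 * otherwise record the new protection. Returns 0 or a negative errno. */
static int region_mprotect_check(struct region_sketch *r, int new_prot)
{
    if ((new_prot & PROT_WRITE) && !(r->max_prot & PROT_WRITE))
        return -EACCES;
    r->prot = new_prot;
    return 0;
}
```

With this in place, the mprotect() call in the example above would fail before stage-1 permissions are touched, so the MAP_PRIVATE copy could never be dirtied. The sketch leaves out the harder part named in the comment: carrying the field through snapshots, splits, merges, and mremap.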

Comment thread src/syscall/mem.c
* divergence for a read-only mapping).
*/
bool want_write = (prot & LINUX_PROT_WRITE) != 0;
int acc = fcntl(fd, F_GETFL);
Contributor


Treating fcntl failure as "not writable" silently picks MAP_PRIVATE for a valid writable fd whose F_GETFL transiently failed. Vanishingly rare on a host fd elfuse already holds via host_fd_ref_open, but the failure mode (losing MAP_SHARED coherence) is silent rather than surfaced.

Two options: return -linux_errno() on acc < 0, or hoist fd-writability detection up to where host_backing_fd is resolved and thread a plain bool fd_writable through. The latter also eliminates the per-install fcntl on the hot mmap path.
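
The first option could look roughly like the sketch below. The fd_is_writable() helper is hypothetical, and it returns the host's negative errno for simplicity where the real code would translate it with linux_errno().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper: report whether fd was opened for writing.
 * Returns 0 and stores the answer in *out, or a negative host errno if
 * F_GETFL fails, so the caller can surface the error instead of silently
 * falling back to MAP_PRIVATE. */
static int fd_is_writable(int fd, bool *out)
{
    int acc = fcntl(fd, F_GETFL);
    if (acc < 0)
        return -errno;

    int mode = acc & O_ACCMODE;
    *out = (mode == O_RDWR || mode == O_WRONLY);
    return 0;
}
```

For the second option, the same helper would be called once where the backing fd is resolved, and only the resulting bool would be passed down the overlay install path.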


/* Several guest pages so the overlay spans more than one host page and the
 * containing 2 MiB segment is split and remapped over a realistic range. */
#define NPAGES 64
Contributor


64 x 4 KiB = 256 KiB stays entirely within one 2 MiB segment, so hvf_segment_split's multi-block path isn't exercised. JVM lib/modules is ~135 MiB and crosses many. Bump to at least NPAGES 768 (3 MiB, two segments) so the segment-split + per-page-marker check catches a misaligned split.
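
The arithmetic behind that suggestion can be checked with a short helper. This is only a sketch: segments_spanned() and the two SKETCH_ constants are hypothetical names, and the 4 KiB page and 2 MiB segment sizes are taken from the comment above rather than from the elfuse sources.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE    4096u                /* assumed guest page size */
#define SKETCH_SEGMENT_SIZE (2u * 1024u * 1024u) /* assumed 2 MiB segment */

/* Number of 2 MiB segments touched by npages pages starting at the byte
 * address start. */
static size_t segments_spanned(uint64_t start, size_t npages)
{
    if (npages == 0)
        return 0;
    uint64_t last = start + (uint64_t) npages * SKETCH_PAGE_SIZE - 1;
    return (size_t) (last / SKETCH_SEGMENT_SIZE - start / SKETCH_SEGMENT_SIZE + 1);
}
```

From a segment-aligned start, 64 pages touch one segment and 768 pages touch two. The result depends on alignment: a 64-page mapping that begins one page below a segment boundary already touches two segments, so the test only exercises the multi-segment path reliably once the mapping is larger than one segment.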

A further test would lock in the corner this PR introduces:

// Linux returns EACCES; with the MAP_PRIVATE fallback in place but no
// backing-kind tracking, elfuse currently lets this through silently.
static void test_rdonly_mprotect_write_rejected(const char *path) {
    int fd = open(path, O_RDONLY);
    char *p = mmap(NULL, FILE_LEN, PROT_READ, MAP_SHARED, fd, 0);
    EXPECT_EQ(mprotect(p, FILE_LEN, PROT_READ | PROT_WRITE), -1, "must reject");
    EXPECT_EQ(errno, EACCES, "errno must be EACCES");
    munmap(p, FILE_LEN); close(fd);
}


2 participants