This document describes the security properties, trust boundaries, and hardening recommendations for go-microvm.
- Trust Boundaries
- Guest-VMM Trust Boundary
- Networking Trust Boundary
- Guest Escape Blast Radius
- Hardening Recommendations
- Guest Hardening
- Egress Policy Security Model
- Tar Extraction Defenses
- Process Identity Verification
- File Permissions
- SSH Client Security
go-microvm has two trust boundaries:
Trust boundary 1: KVM/HVF
(hardware isolation)
|
+-------------------------+|+------------------------------------------+
| Guest VM ||| go-microvm-runner |
| (untrusted code) ||| +------------------+ +------------------+|
| ||| | libkrun VMM | | VirtualNetwork ||
| Runs user workload ||| | (virtio devices, | | (gvisor-tap-vsock||
| inside hardware- ||| | VM supervisor) | | DHCP, DNS, NAT, ||
| isolated VM ||| | | | port forwarding)||
| ||| +------------------+ +------------------+|
+-------------------------+|+------------------------------------------+
| |
| Trust boundary 2: process isolation
| (OS-level, user privileges)
| |
+------+-----------+------+
| Host OS |
| (caller's user context)|
+-------------------------+
libkrun runs the guest and VMM in the same process. The microVM provides hardware-level isolation via KVM (Linux) or Hypervisor.framework (macOS), but the VMM itself is not sandboxed from the host process.
This means:
- The VM provides stronger isolation than containers (hardware MMU separation)
- The VMM process has the same privileges as the user running it
- A guest escape would land in the VMM process context
- This is the same model used by krunvm and crun+libkrun
The location of the VirtualNetwork depends on the networking mode:
- Default (runner-side): The runner process hosts both the libkrun VMM and the VirtualNetwork in the same OS process. The network stack's lifecycle is tied to the VM's lifetime.
- Hosted (caller-side): The VirtualNetwork runs in the caller's process
via
net/hosted.Provider. The runner connects over a Unix socket.
| Component | Role |
|---|---|
| libkrun VMM | VM supervisor, virtio device emulation |
| VirtualNetwork | Userspace TCP/IP stack (gVisor netstack) |
| DHCP server | Assigns guest IP (192.168.127.2) |
| DNS server | Resolves names for the guest |
| Port forward listeners | TCP listeners on 127.0.0.1 for host-to-guest forwarding |
| Component | Role |
|---|---|
| libkrun VMM | VM supervisor, virtio device emulation |
In hosted mode, the VirtualNetwork, DHCP, DNS, port forwards, and any HTTP services run in the caller's process instead. A guest escape still lands in the runner process, but the network stack is in a separate process, providing better isolation.
-
Single-VM model. There is exactly one guest per runner process. The network stack serves only that guest. There is no cross-tenant risk because there is no second tenant.
-
Port forwards are localhost-only. All port forward listeners bind to
127.0.0.1, hardcoded in the runner. There is no configuration path to bind to0.0.0.0or an external interface. -
The VMM was already unsandboxed. A guest escape already gives the attacker full user-level host access (the runner's UID). Adding the network stack to the same process does not meaningfully widen this -- the attacker could already open sockets and access the network.
-
No secrets in the network stack. The VirtualNetwork does not hold TLS keys, credentials, or tokens. Port forwards are simple TCP proxies. The DHCP/DNS servers have no auth material.
If an attacker exploits a KVM/HVF vulnerability or a libkrun bug to escape the guest, they land in the runner process with:
| Capability | Default mode | Hosted mode |
|---|---|---|
| Runner's UID privileges | Yes | Yes |
| Rootfs directory access | Yes | Yes |
| Console/VM log files | Yes | Yes |
| VirtualNetwork goroutines | Yes | No (separate process) |
| Port forward listeners on 127.0.0.1 | Yes | No (separate process) |
| DHCP/DNS servers | Yes | No (separate process) |
| Socketpair fd to krun | Yes (inject/modify frames) | Unix socket (inject/modify frames) |
What they do NOT get:
- Root privileges (unless the runner was run as root)
- Access to other users' processes or files
- Kernel-level access (the escape lands in userspace)
- Network listeners on external interfaces
In hosted mode, the network stack is in the caller's process, so a guest escape does not directly compromise it. This provides better isolation for security-sensitive deployments.
The collapsed trust boundary (default mode) would matter more if:
- Multiple VMs shared a VirtualNetwork (multi-tenant): one VM could sniff or poison another's traffic. go-microvm does not support this.
- The network stack held secrets (TLS termination, auth tokens): a guest escape would expose them. go-microvm port forwards are plain TCP.
- The network stack ran with higher privileges than the VMM: not the case here, both share the same process and UID.
For deployments where these concerns apply, use the hosted provider
(net/hosted.Provider) which moves the network stack to the caller's
process, providing natural process-level separation.
For security-critical deployments, layer additional isolation around the runner process:
The guest/harden package provides a built-in seccomp BPF blocklist
filter for the guest VM. Enable it with boot.WithSeccomp(true) or
call harden.ApplySeccomp() directly. See Seccomp BPF filter
for the full blocklist.
Restrict the runner's syscalls to only what libkrun and gvisor-tap-vsock need. This is the highest-impact host-side hardening measure. After a guest escape, the attacker lands in a seccomp jail that limits what syscalls they can make.
Runner process
└── seccomp filter: allow only needed syscalls
├── ioctl (KVM/HVF)
├── read/write/close (fd operations)
├── mmap/mprotect (memory management)
├── socket/bind/listen/accept (networking)
├── epoll/poll (event loop)
└── deny everything else
Run the runner in restricted namespaces:
- User namespace: map the runner's UID to an unprivileged range
- Mount namespace: limit filesystem visibility to rootfs + data dir
- Network namespace: not applicable (the network stack needs host access for port forwarding), but PID namespace limits process visibility
Confine the runner to a policy that restricts:
- File access to the data directory and rootfs only
- Network access to localhost port forwards only
- No access to other users' home directories or system paths
Use the hosted networking provider (net/hosted.Provider) to move the
network stack out of the runner process. This achieves process-level
separation without a custom helper binary:
go-microvm-runner (VMM only)
└── Unix socket → hosted provider in caller's process
caller's process (network stack)
└── hosted.Provider → VirtualNetwork + HTTP services
This is not the default because it requires the caller's process to remain alive for networking to function. For the simplest deployments, the default runner-side networking ties the network stack to the VM's lifetime with no extra coordination.
The guest/harden package provides reusable kernel, capability, and
syscall hardening for microVM init processes. It is guest-side code
(//go:build linux) with no CGO or krun dependencies.
Call the hardening functions in your guest init boot sequence:
- Mount
/procand/sysfirst (sysctls need procfs). - Call
harden.KernelDefaults(logger)to apply sysctls. - Perform all privileged operations (mounts, network config, chown).
- Call
harden.DropBoundingCaps(keep...)as the last privileged step. - Call
harden.ApplySeccomp()as the very last hardening step.
Alternatively, use boot.WithSeccomp(true) to have the boot sequence
apply the seccomp filter automatically after SSH starts.
KernelDefaults applies the following sysctls. Each is set
independently; individual failures are logged as warnings rather than
aborting boot, because not all kernels support every sysctl.
| Sysctl | Value | Purpose |
|---|---|---|
kernel.kptr_restrict |
2 |
Hide kernel pointers from all users. Prevents information leaks that aid exploit development. |
kernel.dmesg_restrict |
1 |
Restrict dmesg to privileged users. Prevents unprivileged processes from reading kernel log messages that may contain sensitive addresses or operations. |
kernel.unprivileged_bpf_disabled |
1 |
Disable unprivileged BPF. Prevents unprivileged users from loading BPF programs, which have historically been a source of kernel privilege escalation vulnerabilities. |
kernel.perf_event_paranoid |
3 |
Disallow all perf events for unprivileged users. Prevents unprivileged access to performance counters, which can be used for side-channel attacks. |
kernel.yama.ptrace_scope |
2 |
Restrict ptrace to CAP_SYS_PTRACE holders. Prevents unprivileged processes from attaching to other processes to inspect memory or inject code. |
net.core.bpf_jit_harden |
2 |
Harden BPF JIT against spraying attacks. Forces constant blinding and disables JIT kallsyms exposure. |
kernel.sysrq |
0 |
Disable magic SysRq key. Prevents unprivileged users from triggering kernel debugging and recovery commands. |
DropBoundingCaps(keep...) drops all Linux capabilities from the
bounding set except those explicitly listed. This limits what
capabilities child processes can acquire, even through setuid binaries
or file capabilities.
For a typical SSH-based guest, the minimal keep set is:
| Capability | Number | Reason |
|---|---|---|
CAP_SETUID |
7 | sshd credential switching to sandbox user |
CAP_SETGID |
6 | sshd group switching |
CAP_NET_BIND_SERVICE |
10 | Binding port 22 (privileged port) |
SetNoNewPrivs() sets the PR_SET_NO_NEW_PRIVS bit on the calling
process. Once set, the process and all descendants (via fork/exec)
cannot gain new privileges through execve — setuid binaries run
without elevation and file capabilities are ignored.
This is intended to be called after all privileged operations are
complete (mounts, network config, credential setup, capability
dropping). Consumers that spawn child processes via os/exec inherit
the bit automatically; consumers that need to set it on the calling
process itself (e.g., an init that doesn't use os/exec) can call
SetNoNewPrivs() directly.
Note: no_new_privs does not affect setresuid/setresgid syscalls
used by Go's SysProcAttr.Credential — credential switching for SSH
sessions continues to work after the bit is set.
ApplySeccomp() installs a seccomp BPF blocklist filter on all OS
threads. The filter uses SECCOMP_FILTER_FLAG_TSYNC to synchronize
across threads and sets no_new_privs. Once applied, it is inherited
by all child processes and cannot be removed.
The blocklist is split into two tiers:
Exploitation indicators (SECCOMP_RET_KILL_PROCESS): syscalls that
legitimate workloads never call — any attempt strongly indicates
active exploitation.
| Syscall | Reason |
|---|---|
io_uring_setup/enter/register |
Prolific source of kernel CVEs |
ptrace, process_vm_readv/writev |
Process debugging / cross-process memory access |
kexec_load/file_load |
Kernel replacement |
init_module/finit_module/delete_module |
Kernel module loading |
bpf |
eBPF — kernel attack surface |
seccomp |
Prevent installing additional filters |
Operational blocks (SECCOMP_RET_ERRNO/EPERM): syscalls that
runtimes or build tools might innocuously probe — fail gracefully.
| Category | Syscalls |
|---|---|
| Filesystem manipulation | mount, umount2, pivot_root, chroot, fsopen, fsconfig, fspick, move_mount, open_tree |
| Namespace creation | unshare, setns, clone (with CLONE_NEW* flags) |
| Side-channel attacks | perf_event_open, userfaultfd |
| Kernel keyring | add_key, request_key, keyctl |
| Landlock | landlock_create_ruleset, landlock_add_rule, landlock_restrict_self |
| Dangerous socket families | AF_NETLINK, AF_PACKET, AF_KEY, AF_ALG, AF_VSOCK |
AF_INET, AF_INET6, and AF_UNIX remain allowed for normal
workload operation.
Note: clone3 is NOT blocked because glibc 2.34+ uses it for
fork(). Its namespace flags live inside a struct pointer (arg0)
which seccomp BPF cannot inspect.
Consumers should lock down /root/ (mode 0700) after completing
initial setup so the sandbox user cannot read root's home directory
contents (bootstrap config, debug logs, credentials). This is not
performed by the harden package itself but is a recommended consumer
practice — see apiary's lockdownRoot() for an example.
These hardening measures are defense-in-depth for the guest environment. An attacker who has compromised the guest workload would need a hypervisor escape to reach the host; however, guest hardening:
- Raises the bar for local privilege escalation within the guest
- Reduces information available for exploit development (kernel pointers, dmesg)
- Limits the attack surface of dangerous subsystems (BPF)
- Constrains what a compromised process can do even with root inside the guest
The DNS-based egress policy (WithEgressPolicy()) restricts VM outbound
connections to a set of allowed hostnames. It operates at the relay level,
intercepting DNS traffic between the VM and the VirtualNetwork.
The egress policy prevents a compromised or untrusted VM from:
- Exfiltrating data to arbitrary external hosts
- Communicating with command-and-control servers
- Scanning or connecting to internal network resources
- DNS queries for non-allowed hostnames receive NXDOMAIN responses directly from the relay — the query never reaches the DNS server.
- DNS responses for allowed hostnames are snooped to extract A-record IPs. These IPs become temporary firewall rules with TTLs matching the DNS record TTLs (minimum 60 seconds).
- All other egress traffic is denied by the default-deny firewall policy.
- Implicit rules allow DNS to the gateway, DHCP, and ingress on port-forwarded ports.
| Vector | Mitigation |
|---|---|
| Hardcoded IPs | Default-deny blocks connections to IPs not learned from DNS. The VM cannot reach arbitrary IPs without first resolving them through the interceptor. |
| DNS-over-HTTPS (DoH) | HTTPS to DoH providers (e.g., 1.1.1.1, 8.8.8.8) is blocked by default-deny unless the DoH server hostname is in the allowlist. |
| DNS-over-TLS (DoT) | TCP port 853 is blocked by default-deny. |
| IP-in-hostname | The policy matches hostname strings, not resolved IPs. An attacker-controlled DNS server could return arbitrary IPs for an allowed hostname, but the attacker would need to control the DNS server the gateway uses. |
| Tunneling over DNS | DNS queries to allowed domains pass through, so DNS tunneling to an allowed domain is theoretically possible. This is a limitation of DNS-level filtering. |
| TTL racing | After a dynamic rule expires, the connection may survive via conntrack (if already established). This is by design — conntrack ensures established connections are not disrupted by TTL expiry. |
- Established connection survival: Once a connection is tracked by conntrack, it persists until the conntrack entry expires (5 minutes for TCP, 30 seconds for UDP), even if the dynamic rule has expired.
- Same-subnet traffic: Traffic within the virtual network subnet (192.168.127.0/24) is subject to the same firewall rules but does not typically involve DNS resolution.
- IPv6: AAAA records are not processed. IPv6 traffic is non-IPv4 and passes through the firewall unfiltered (same as ARP).
When extracting OCI image layers, go-microvm applies multiple layers of defense against malicious tar archives:
Path traversal prevention. Every tar entry name is cleaned via
filepath.Clean(), absolute paths are rejected, and the resolved path
is verified to stay under the destination directory.
Symlink traversal prevention. Directories are created one component
at a time. Each existing component is checked with os.Lstat() (not
os.Stat(), which would follow symlinks). If any component is a symlink,
extraction is refused.
Symlink leaf validation. Before writing a regular file, the target
path is checked with os.Lstat(). If the target is a symlink or a
directory, writing is refused.
Symlink target validation. Absolute symlink targets are resolved relative to the rootfs and checked to stay within bounds. Relative symlink targets are resolved from the parent directory.
Hardlink boundary enforcement. Hard links are validated to ensure both source and target remain within the rootfs directory.
Decompression bomb limit. The tar reader is wrapped in
io.LimitedReader with a 30 GiB cap.
Entry type filtering. Character devices, block devices, and FIFOs are silently skipped.
When managing VM lifecycle, go-microvm verifies process identity before sending signals:
IsAlive()sends signal 0 to the PID (no-op that verifies the process exists and the caller has permission to signal it)Stop()checksIsAlive()before SIGTERM, and checks again during the poll loop before SIGKILL- The
isNoSuchProcess()helper handlesESRCHgracefully
This prevents sending signals to unrelated processes if the PID has been reused.
When running as a non-root user on Linux, go-microvm requires CAP_CHOWN on the
process to set correct file ownership during OCI image extraction. Without it,
all extracted rootfs files are owned by the host user, and the guest sees
incorrect ownership — which can cause permission errors for processes running
as different UIDs inside the VM.
A non-required preflight check (cap-chown) warns at startup if CAP_CHOWN
is missing. To grant it:
# File capability on the binary (persistent across invocations):
sudo setcap cap_chown+ep /path/to/your-binary
# Or grant via ambient capabilities for a specific invocation:
capsh --addamb=cap_chown -- -c '/path/to/your-binary'go-microvm also sets the user.containers.override_stat extended attribute on
extracted files so that libkrun's virtiofs server reports correct ownership to
the guest. This is the same mechanism that podman uses on macOS.
On Linux, the xattr is set on regular files and directories. The kernel
restricts user.* xattrs on symlinks and special files, so those are silently
skipped. Once libkrun's Linux virtiofs passthrough adds support for reading
these xattrs (the same support already exists on macOS), file ownership in the
guest will be correct without requiring CAP_CHOWN.
See the internal/xattr package for details.
| Resource | Mode | Description |
|---|---|---|
| State directory | 0700 | Owner only |
| State lock file | flock | Exclusive access |
| SSH private keys | 0600 | Owner read/write |
| SSH public keys | 0644 | World readable |
| VM log files | 0600 | Owner only |
| Cache directories | 0700 | Owner only |
| Unix socket (vnet.sock) | 0700 dir | Protected by parent directory permissions |
The SSH client uses InsecureIgnoreHostKey() for host key verification.
This is acceptable because the client only connects to VMs that go-microvm
just created -- the guest was booted from an image we pulled and configured,
and the connection is over a localhost port forward that is not exposed to
the network.
If go-microvm is used in a scenario where the SSH connection traverses an untrusted network, host key verification should be implemented by the caller using a custom SSH client rather than the built-in one.