
Networking and Firewall

This document provides a deep dive into the go-microvm networking subsystem, including the in-process userspace network stack, wire protocol, firewall architecture, and extension points.

Table of Contents

  • Overview
  • Architecture
  • QEMU Wire Protocol
  • Network Topology
  • VirtualNetwork Lifecycle
  • Firewall Architecture
  • Performance
  • Usage Examples
  • DNS-Based Egress Policy
  • Provider Interface

Overview

go-microvm uses a userspace network stack powered by gvisor-tap-vsock. All VM traffic flows as Ethernet frames with no kernel networking between host and guest, and no separate gvproxy binary is needed.

The gvisor-tap-vsock library (used by podman machine, lima, and libkrun) is imported directly as a Go dependency. It provides a complete virtual network stack including DHCP, DNS, and TCP port forwarding.

There are two networking modes:

  • Runner-side (default): When no WithNetProvider() is set, the runner process creates a VirtualNetwork connected to libkrun via a socketpair. Port forwards are passed in the runner config JSON. This is the simplest path -- no Unix socket, no external process coordination.
  • Hosted (caller-side): When WithNetProvider() is set (e.g., with net/hosted.Provider), the VirtualNetwork runs in the caller's process and exposes a Unix socket that the runner connects to. This allows the caller to access the VirtualNetwork directly for gonet listeners and HTTP services on the gateway IP.

Key properties:

  • Userspace only: All packet processing happens in Go. No iptables, no eBPF, no network namespaces.
  • Frame-level access: Every Ethernet frame passes through Go code, enabling the optional firewall to inspect and filter traffic.
  • Shared topology: Network constants (subnet, gateway, IPs, MTU) are centralized in the net/topology package.

Architecture

The networking subsystem has two modes depending on how the caller configures it, and within the hosted mode, an optional firewall relay.

Runner-Side Networking (Default)

When no WithNetProvider() is set, the runner creates a VirtualNetwork in-process and connects it to libkrun via a socketpair:

+----------------------------------go-microvm-runner process------+
|                                                                  |
| +----------+   socketpair      +-------------------+  Go net     |
| | libkrun  |   (fd pair)       | VirtualNetwork    |----------+ |
| | virtio-  |<===============>  | (gVisor netstack) |          | |
| | net      |   Ethernet frames | DHCP, DNS,        |          | |
| +----------+                   | port forwarding   |          | |
|      |                         +-------------------+          | |
|  +---v------+                       |                         | |
|  | Guest VM |                  127.0.0.1:<port>               | |
|  | eth0:    |                  port forward listeners         | |
|  | 192.168. |                       |                    +---------+
|  | 127.2    |                       +------>             |  Host   |
|  +----------+                                            | Network |
+----------------------------------------------------------+---------+

Port forwards are configured via the runner config JSON. The VirtualNetwork runs as goroutines in the runner process, tied to the VM's lifetime.

Hosted Networking (Custom Provider)

When WithNetProvider() is set (e.g., net/hosted.Provider), the VirtualNetwork runs in the caller's process and exposes a Unix socket:

+----------+                    +-------------------+          +---------+
| Guest VM |                    | VirtualNetwork    |          |  Host   |
|          |   Unix socket      | (gVisor netstack) |  Go net  | Network |
| virtio-  |===(QEMU wire)===>  |                   |--------->|         |
| net      |   SOCK_STREAM      | DHCP, DNS,        |          |         |
|          |   4B BE + frame    | port forwarding   |          |         |
+----------+                    +-------------------+          +---------+
     (in runner process)         (in caller's process)

Hosted Networking with Firewall

When firewall rules are configured via WithFirewallRules() with a hosted provider, a relay is inserted between the VM socket and the VirtualNetwork. The relay intercepts every Ethernet frame, parses headers, and applies allow/deny rules with stateful connection tracking:

+----------+                  +-----------------+               +-------------------+
| Guest VM |                  | Relay + Filter  |               | VirtualNetwork    |
|          |   Unix socket    |                 |   net.Pipe    | (gVisor netstack) |
| virtio-  |===(QEMU wire)===>| egress gor. --->|===(in-mem)===>|                   |
| net      |   SOCK_STREAM    | ingress gor.<---|<==(in-mem)====|   DHCP, DNS,      |
|          |   4B BE + frame  |                 |               |   port forwarding |
+----------+                  | - parse ETH/IP  |               +-------------------+
                              | - conntrack     |                        |
                              | - rule matching |                   +---------+
                              | - metrics       |                   |  Host   |
                              +-----------------+                   | Network |
                                                                    +---------+

The relay creates a net.Pipe() -- one end is passed to VirtualNetwork.AcceptQemu(), the other is used by the relay. Two goroutines handle egress (VM to network) and ingress (network to VM) independently.

QEMU Wire Protocol

The QEMU transport is a stream protocol over a Unix domain socket (SOCK_STREAM). Every Ethernet frame is prefixed with a 4-byte big-endian length header.

Frame Format

+---+---+---+---+---+---+---+---+---+---+---+...
| Length (4B BE) |         Ethernet Frame          |
+---+---+---+---+---+---+---+---+---+---+---+...
|<--- 4 bytes -->|<-------- N bytes ------------->|
  • Length field: uint32, big-endian. Value is the number of bytes in the Ethernet frame that follows (does NOT include the 4-byte header itself).
  • Ethernet frame: Raw L2 frame starting with destination MAC address.
  • No handshake: Data flows immediately after socket connection.
  • Max frame size: Practically limited by MTU (default 1500 bytes).

Why QEMU Mode

libkrun's krun_add_net_unixstream speaks the QEMU wire protocol: a SOCK_STREAM Unix socket with 4-byte big-endian length-prefixed Ethernet frames. The gvisor-tap-vsock library's AcceptQemu() method uses the same framing, making them directly compatible.

Protocol Comparison

| Protocol | Header  | Byte Order | Socket Type    | Use Case             |
|----------|---------|------------|----------------|----------------------|
| QEMU     | 4 bytes | Big-endian | SOCK_STREAM    | libkrun, QEMU        |
| VfKit    | None    | N/A        | SOCK_DGRAM     | macOS Virt.framework |
| BESS     | None    | N/A        | SOCK_SEQPACKET | User Mode Linux      |

Network Topology

+---------------------------------------------------+
|                   Host Machine                     |
|                                                    |
|  +---------------+    Unix socket   +-----------+  |
|  | VirtualNetwork|---(SOCK_STREAM)->|  libkrun  |  |
|  | (in-process)  |  4-byte BE len  |  virtio-  |  |
|  |               |  prefix frames  |  net      |  |
|  | Gateway:      |                 |           |  |
|  | 192.168.127.1 |                 +-----------+  |
|  |               |                      |         |
|  | DHCP server   |                 +----v-----+   |
|  | DNS server    |                 | Guest VM |   |
|  | Port forwards |                 |          |   |
|  +---------------+                 | eth0:    |   |
|        |                           | 192.168. |   |
|        |  Port forwards:           | 127.2    |   |
|        |  localhost:8080           |          |   |
|        +-----> guest:80            +----------+   |
|        |  localhost:2222                          |
|        +-----> guest:22                           |
+---------------------------------------------------+

| Property        | Value                                            |
|-----------------|--------------------------------------------------|
| Gateway         | 192.168.127.1 (VirtualNetwork, in-process)       |
| Guest IP        | 192.168.127.2 (DHCP assigned)                    |
| Subnet          | 192.168.127.0/24                                 |
| Socket type     | Unix domain, SOCK_STREAM                         |
| Wire format     | 4-byte big-endian length prefix + Ethernet frame |
| DHCP            | Built into VirtualNetwork                        |
| DNS             | Built into VirtualNetwork                        |
| Port forwarding | TCP, host-to-guest only                          |

VirtualNetwork Lifecycle

Runner-Side (Default Path)

When no custom provider is set, the runner's setupInProcessNetworking() creates the VirtualNetwork during VM startup:

  1. Builds the port forward map from runner.Config.PortForwards.
  2. Creates a virtualnetwork.New() instance using constants from net/topology (subnet, gateway IP/MAC, MTU).
  3. Creates a socketpair(AF_UNIX, SOCK_STREAM).
  4. Wraps one end as a net.Conn and passes it to AcceptQemu() in a background goroutine.
  5. Returns the other fd to be passed to krun_add_net_unixstream().

The VirtualNetwork goroutines run alongside krun_start_enter() and are torn down when the runner process exits (i.e., when the guest shuts down).

Hosted Provider Path

When using net/hosted.Provider, the lifecycle is managed in the caller's process:

Start

Provider.Start() performs the following:

  1. Creates a virtualnetwork.New() instance with the network configuration (subnet, gateway, port forwards, DHCP, DNS) using net/topology constants.
  2. Starts any registered HTTP services on the VirtualNetwork via VirtualNetwork.Listen().
  3. Creates a Unix listener at the socket path (hosted-net.sock in the data directory).
  4. If firewall rules are configured, creates a firewall.Filter and firewall.Relay, and starts the conntrack expiry goroutine.
  5. Starts an accept loop goroutine. For each runner connection:
    • With firewall: creates a net.Pipe(), runs the relay between the runner connection and the pipe, passes the pipe to AcceptQemu().
    • Without firewall: passes the runner connection directly to AcceptQemu().
  6. Returns once the listener is ready.

Stop

Provider.Stop() tears down everything:

  1. Gracefully shuts down HTTP services (5-second timeout per service).
  2. Cancels the context, which signals all goroutines to exit.
  3. Closes the Unix listener.
  4. Waits for the accept loop and all connection handlers to finish.
  5. Removes the socket file.

All goroutines are tracked via sync.WaitGroup and context cancellation.

Firewall Architecture

The firewall provides frame-level packet filtering with stateful connection tracking. It operates entirely in userspace by intercepting Ethernet frames as they pass between the VM socket and the VirtualNetwork.

Frame-Level Interception

The firewall inserts a relay between the VM's Unix socket and the VirtualNetwork. The relay reads each frame, parses the Ethernet and IP headers, applies firewall rules, and either forwards or drops the frame.

Packet Parsing

Each Ethernet frame is parsed at fixed offsets with zero allocations:

  1. Ethernet header (14 bytes): Destination MAC (6B), Source MAC (6B), EtherType (2B). EtherType 0x0800 = IPv4, 0x0806 = ARP, 0x86DD = IPv6.

  2. IPv4 header (20+ bytes, starts at offset 14): Protocol field at byte 23 (6=TCP, 17=UDP, 1=ICMP). Source IP at bytes 26-29, destination IP at bytes 30-33. IHL field gives header length.

  3. TCP/UDP header (starts at offset 14 + IHL*4): Source port (2B), destination port (2B).

Non-IPv4 frames (ARP, IPv6, LLDP) are always passed through without filtering. ARP in particular is essential for the network stack to function (address resolution between guest and gateway).

Rule Model

Each firewall rule specifies:

| Field     | Type           | Description                                     |
|-----------|----------------|-------------------------------------------------|
| Direction | Ingress/Egress | Ingress = outside to VM, Egress = VM to outside |
| Action    | Allow/Deny     | What to do when matched                         |
| Protocol  | uint8          | 6=TCP, 17=UDP, 1=ICMP; 0=any                    |
| SrcCIDR   | net.IPNet      | Source IP range                                 |
| DstCIDR   | net.IPNet      | Destination IP range                            |
| SrcPort   | uint16         | Source port; 0=any                              |
| DstPort   | uint16         | Destination port; 0=any                         |

Rules are evaluated in order. First match wins (same as iptables). If no rule matches, the default action applies (configurable via WithFirewallDefaultAction(); defaults to Allow when no rules are set).

Stateful Connection Tracking

The firewall tracks active connections using a 5-tuple key:

connKey = { protocol, srcIP, dstIP, srcPort, dstPort }

When a rule allows a packet, the connection tracker records the flow. Return traffic (with source and destination swapped) is automatically allowed via a reverse-lookup in the connection table. This means you do not need explicit ingress rules for return traffic from allowed egress connections.

TTLs:

  • TCP connections: 5 minutes idle timeout
  • UDP flows: 30 seconds idle timeout

An expiry goroutine periodically sweeps the connection table to remove stale entries.

Memory: Each conntrack entry is approximately 100 bytes. A typical VM workload of 200-500 concurrent flows uses around 50 KB.

Filter Verdict Flow

For each frame, the filter follows this path:

  1. Conntrack fast path: Check if the packet belongs to an already-allowed flow via reverse-lookup. If yes, allow immediately (most common case for established connections).
  2. Rule walk: Iterate through rules in order. First match wins. If the matching rule allows the packet, record it in the connection tracker.
  3. Default action: If no rule matches, apply the configured default action (Allow or Deny).

Relay Hot Path

The relay runs two goroutines:

  • Egress goroutine: Reads frames from the VM socket, applies filter, writes to the VirtualNetwork pipe.
  • Ingress goroutine: Reads frames from the VirtualNetwork pipe, applies filter, writes to the VM socket.

Each goroutine uses a buffered reader (64 KB) and a reusable frame buffer. Frames that the filter denies are silently dropped (not forwarded). Atomic counters track forwarded frames, dropped frames, and bytes forwarded.

Performance

The firewall adds minimal overhead per Ethernet frame:

| Operation                            | Cost                                        |
|--------------------------------------|---------------------------------------------|
| Read frame (4-byte prefix + payload) | Required regardless -- no added cost        |
| Parse Ethernet + IPv4 headers        | ~10ns -- fixed-offset reads, no allocations |
| Connection tracker lookup            | ~20ns -- map lookup under RLock             |
| Rule matching (per rule, on miss)    | ~5ns -- simple comparisons                  |
| Write frame (forward)                | Required regardless -- no added cost        |

Total added latency: ~50-100ns per frame. At 1 Gbps with 1500-byte frames (~83,000 frames/sec), the firewall adds roughly 4ms of CPU time per second. This is negligible at typical VM throughput.

Memory overhead:

  • Connection tracker: ~100 bytes per entry, typically 200-500 entries = ~50 KB
  • Frame buffer: ~2 KB per direction, reused
  • Rule slice: typically <20 rules = negligible

Usage Examples

Default-Deny with DNS and HTTPS Egress

Allow the VM to make DNS queries and HTTPS connections, but deny all other outbound traffic. Inbound traffic is denied except on explicitly allowed ports. Return traffic for allowed connections is automatically permitted via connection tracking.

import "github.com/stacklok/go-microvm/net/firewall"

vm, err := microvm.Run(ctx, "my-app:latest",
    microvm.WithPorts(
        microvm.PortForward{Host: 8080, Guest: 80},
        microvm.PortForward{Host: 2222, Guest: 22},
    ),
    microvm.WithFirewallDefaultAction(firewall.Deny),
    microvm.WithFirewallRules(
        // Egress: allow DNS and HTTPS
        firewall.Rule{
            Direction: firewall.Egress,
            Action:    firewall.Allow,
            Protocol:  17, // UDP
            DstPort:   53, // DNS
        },
        firewall.Rule{
            Direction: firewall.Egress,
            Action:    firewall.Allow,
            Protocol:  6, // TCP
            DstPort:   443, // HTTPS
        },
        // Ingress: allow SSH and HTTP
        firewall.Rule{
            Direction: firewall.Ingress,
            Action:    firewall.Allow,
            Protocol:  6,
            DstPort:   22,
        },
        firewall.Rule{
            Direction: firewall.Ingress,
            Action:    firewall.Allow,
            Protocol:  6,
            DstPort:   80,
        },
    ),
)

Allow Specific Ingress Ports Only

vm, err := microvm.Run(ctx, "my-server:latest",
    microvm.WithPorts(
        microvm.PortForward{Host: 8443, Guest: 443},
        microvm.PortForward{Host: 6443, Guest: 6443},
    ),
    microvm.WithFirewallDefaultAction(firewall.Deny),
    microvm.WithFirewallRules(
        // Allow all egress (VM can reach the internet)
        firewall.Rule{
            Direction: firewall.Egress,
            Action:    firewall.Allow,
        },
        // Allow specific ingress ports
        firewall.Rule{
            Direction: firewall.Ingress,
            Action:    firewall.Allow,
            Protocol:  6,
            DstPort:   443,
        },
        firewall.Rule{
            Direction: firewall.Ingress,
            Action:    firewall.Allow,
            Protocol:  6,
            DstPort:   6443,
        },
        firewall.Rule{
            Direction: firewall.Ingress,
            Action:    firewall.Allow,
            Protocol:  6,
            DstPort:   22,
        },
    ),
)

No Firewall (Default)

When no firewall rules are configured, all traffic passes through unrestricted. This is the default behavior:

vm, err := microvm.Run(ctx, "alpine:latest",
    microvm.WithPorts(microvm.PortForward{Host: 8080, Guest: 80}),
)

DNS-Based Egress Policy

WithEgressPolicy() restricts VM outbound traffic to a set of allowed DNS hostnames. Instead of writing firewall rules for specific IPs (which change often), you specify hostnames and let go-microvm handle the rest.

vm, err := microvm.Run(ctx, "my-app:latest",
    microvm.WithPorts(microvm.PortForward{Host: 8080, Guest: 80}),
    microvm.WithEgressPolicy(microvm.EgressPolicy{
        AllowedHosts: []microvm.EgressHost{
            {Name: "api.github.com", Ports: []uint16{443}},
            {Name: "*.docker.io"},
            {Name: "ntp.ubuntu.com", Ports: []uint16{123}, Protocol: 17},
        },
    }),
)

How it works:

  1. The firewall default action is forced to Deny. A hosted network provider is auto-created if none was configured.
  2. Implicit firewall rules are added for DNS (to gateway), DHCP, and port-forwarded ingress ports.
  3. A DNSInterceptor is wired into the relay between the VM and the VirtualNetwork.
  4. Egress DNS queries: The interceptor parses each outbound DNS query. If the queried hostname is not in the allowlist, it returns an NXDOMAIN response directly to the VM. Allowed queries pass through normally.
  5. Ingress DNS responses: For allowed hostnames, the interceptor parses A records from responses and creates temporary firewall rules for those IPs. The rule TTL matches the DNS record TTL (minimum 60 seconds).
  6. The VM can only connect to IPs that were resolved from allowed hostnames. All other egress traffic is denied by the default-deny policy.

Interaction with static firewall rules:

Static rules added via WithFirewallRules() are evaluated before dynamic rules. You can use static rules alongside an egress policy to allow additional traffic (e.g., specific IP ranges) that doesn't go through DNS. Implicit rules (DNS, DHCP, port forwards) are prepended before user rules.

Limitations:

  • Hardcoded IPs bypass DNS: If the VM connects to an IP directly (without resolving it first), no dynamic allow rule exists for that IP. The default-deny policy still blocks such connections, since only IPs learned from allowed DNS responses receive dynamic allow rules.
  • DNS-over-HTTPS (DoH): Blocked by the default-deny policy since HTTPS to DoH servers would need to be in the allowlist. Standard DNS over UDP port 53 is the only supported resolution path.
  • IPv6: Only IPv4 A records create dynamic rules. AAAA records are ignored.

Provider Interface

The networking layer is abstracted behind the net.Provider interface:

type Provider interface {
    // Start launches the network provider. Must block until ready.
    Start(ctx context.Context, cfg Config) error

    // SocketPath returns the Unix socket path for virtio-net.
    SocketPath() string

    // Stop terminates the provider and cleans up.
    Stop()
}

Config contains:

  • LogDir -- directory for log files
  • Forwards -- slice of PortForward{Host, Guest} for TCP forwarding
  • FirewallRules -- optional packet filtering rules for frame-level filtering
  • FirewallDefaultAction -- default action when no rule matches (Allow or Deny)

By default (no WithNetProvider()), networking runs inside the runner process. The net/hosted package provides a ready-made hosted provider that runs the VirtualNetwork in the caller's process with support for HTTP services on the gateway IP.

Implementing a Custom Provider

To replace the default runner-side networking with an alternative backend (e.g., passt, slirp4netns, or a custom bridge):

  1. Implement the net.Provider interface.
  2. Start() must block until the Unix socket is ready for connections.
  3. The socket must use SOCK_STREAM with 4-byte big-endian length-prefixed Ethernet frames (the QEMU transport protocol).
  4. Pass your provider via microvm.WithNetProvider(myProvider).

The SocketPath() return value is passed to the runner as the Unix socket path for krun_add_net_unixstream. See net/hosted/provider.go for the reference implementation.