Linux guest unreachable after vm clone --on-demand from hibernate snapshot (no DHCP lease for hot-swapped NIC MAC) #47

@tonicmuroq

Summary

After cocoon vm clone --on-demand <hibernate-import-snapshot> resumes an Ubuntu 24.04 guest, the guest is reachable on neither its pre-hibernate IP nor any new IP. cocoon vm ls shows state=running, ip=- indefinitely. The symptom is identical to #28 (Windows, fixed via the per-image CocoonNicAutoHeal scheduled task + in-guest PnP rebind), but that in-guest recovery path is image-resident for Windows and has no Linux equivalent: our Ubuntu base image ships nothing cocoon-specific to handle it.

Environment

  • Cocoon cluster: cocoonset-gke, vk-cocoon on cocoonset-node-2
  • vm-service env: testing
  • Hot snapshot: epoch.simular.cloud/simular/ubuntu-hot-testing:v1 (fresh bake at 12:00 UTC today)
  • Per-VM hibernate snapshot: vk-default-vm-c547fa0a-0 (saved at 2026-05-14 12:21:48 UTC)

Reproduce

The repro is driven by vm-service, but the underlying vk-cocoon CLI sequence is:

sudo cocoon vm rm --force <pre-hibernate-vm-id>
sudo cocoon snapshot inspect vk-default-vm-c547fa0a-0
sudo cocoon vm clone --output json --name vk-default-vm-c547fa0a-0 \
     --network cocoon-dhcp --on-demand vk-default-vm-c547fa0a-0

(This is exactly what vk-cocoon logs during a spec.suspend=false reconcile after hibernate.)

Observations

Pre-hibernate VM:

  • guest MAC: (whatever was assigned originally)
  • DHCP-assigned IP: 172.20.1.58
  • working agent → vm-service token-exchange, etc.

Post-wake (after the clone above):

  • cocoon vm ls:
    ID                          NAME                      STATE    CPU  MEMORY  STORAGE  IP  ...
    E5LFZLS2QQXYPBRQEQ5OYQISOQ  vk-default-vm-c547fa0a-0  running  4    8GiB    20GiB    -   ...
    
  • Host-side veth/netns: present, MAC 2a:98:96:a6:fc:65 on veth8e430d83, peer in cocoon-E5LFZLS2QQXYPBRQEQ5OYQISOQ.
  • /var/lib/cocoon/net/leases.json: no entry for 2a:98:96:a6:fc:65 (the new MAC). The old MAC's lease (for 172.20.1.58) is also gone. So cocoon-dhcp IPAM lost the binding too.
  • ping 172.20.1.58 from cni0 / from a sibling cocoon pod: "Destination Host Unreachable", ip neigh shows the entry as FAILED.
  • kubectl exec and cocoon vm exec both hang (no vsock progress) — guest is alive but doesn't progress past the wake point because its NIC stack is hot-swapped to a new MAC and there's no in-guest path to renegotiate DHCP.

This is the same shape as #28 — virtio-net hot-swap leaves the guest with a fresh MAC the guest hasn't bound to. The Linux symptom is that systemd-networkd / NetworkManager (or whatever's managing eth0) doesn't notice the new device, so no DHCPDISCOVER goes out on the new interface, so no lease, so no IPAM entry, so cocoon-dhcp doesn't even know the VM exists.
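
To confirm from the host that this chain has stalled, one option is to watch the lease file for the new MAC. The following is a minimal sketch: the helper names are made up for illustration, and because the layout of leases.json isn't documented here, it only does a case-insensitive fixed-string search for the MAC rather than parsing JSON.

```shell
#!/bin/sh
# Hypothetical host-side helpers; only the leases.json path comes from this issue.
LEASES_FILE="${LEASES_FILE:-/var/lib/cocoon/net/leases.json}"

# Succeeds if the given MAC appears anywhere in the lease file.
# Format-agnostic on purpose: the JSON schema is an unknown here.
lease_exists() {
    grep -qiF -- "$1" "${2:-$LEASES_FILE}" 2>/dev/null
}

# Polls once per second until the MAC shows up or the timeout (seconds) expires.
wait_for_lease() {
    mac="$1"; timeout="${2:-60}"; file="${3:-$LEASES_FILE}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        lease_exists "$mac" "$file" && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # One last check so that a timeout of 0 still looks at the file once.
    lease_exists "$mac" "$file"
}
```

For the VM in this report, something like wait_for_lease 2a:98:96:a6:fc:65 120 would be expected to time out, matching the missing entry noted above.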

Why this matters

vm clone --on-demand <hibernate-snapshot> is the wake path that vk-cocoon uses for spec.suspend=false on a CocoonSet. For us, that's every hibernate-cycle on the Linux cocoon path. As shipped today, it's a one-way road: hibernate works, but the woken guest is never reachable again.

#28's resolution baked CocoonNicAutoHeal into the Windows base image. The Linux analog would have to be image-resident as well (we can't run anything via cocoon vm exec from the host until the guest comes back), but there's no cocoonstack/ubuntu analog of cocoonstack/windows shipping such a recovery hook in the base. Two paths I can see:

  1. Image-side fix: ship a small systemd unit in the cocoon Ubuntu base that watches for link-up on a freshly attached virtio-net interface and triggers networkctl renew (or dhclient -r && dhclient) on it. Belt-and-suspenders, but it's a property of the image, not of cocoon, and we'd have to add it to every Ubuntu base that downstream wants to wake from hibernate.

  2. Host-side fix in cocoon: at clone-from-hibernate time, re-use the saved MAC instead of regenerating a new one. The saved snapshot already encodes the guest's view of its NIC (driver state, IP, etc.); regenerating the MAC is what breaks the guest. If the post-wake MAC matches the pre-hibernate MAC, the guest never knew anything changed and DHCP/leases just keep working. That's an in-cocoon change to vm clone when the snapshot is a hibernate-import.
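
For path 1, the script such a unit runs could look roughly like the sketch below. It is not an existing cocoon component: the function names and the SYSFS_NET override are hypothetical, and it assumes the new interface is already managed by systemd-networkd or that dhclient is installed. If the image's network config matches interfaces by MAC, a renew alone would likely not be enough.

```shell
#!/bin/sh
# Hypothetical in-guest helper for an image-resident unit; names are illustrative.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the name of every interface whose bound driver is virtio_net.
list_virtio_ifaces() {
    for dev in "$SYSFS_NET"/*; do
        [ -e "$dev/device/driver" ] || continue
        drv=$(basename "$(readlink "$dev/device/driver")")
        [ "$drv" = "virtio_net" ] && basename "$dev"
    done
    return 0
}

# Ask whichever DHCP client is present to obtain a fresh lease on one interface.
renew_dhcp() {
    iface="$1"
    if command -v networkctl >/dev/null 2>&1; then
        networkctl renew "$iface"
    elif command -v dhclient >/dev/null 2>&1; then
        dhclient -r "$iface" && dhclient "$iface"
    else
        echo "no supported DHCP client found for $iface" >&2
        return 1
    fi
}

# Renew on every virtio-net interface; keep going if one of them fails.
renew_all() {
    for iface in $(list_virtio_ifaces); do
        renew_dhcp "$iface" || true
    done
}
```

A unit or udev rule would call renew_all when a virtio-net device appears; how that trigger is wired up is left open here.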

(2) is the cleaner fix: it makes hibernate→wake actually transparent to the guest regardless of OS, and cocoon-dhcp's existing lease for the old MAC can stay valid for the lease duration, provided the lease is preserved across the vm rm (in the observations above it was not). (1) is the workaround if (2) isn't desirable for some reason (e.g. MAC collisions across cross-node clones).
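
The decision that path 2 proposes can be sketched in a few lines. This is illustrative only: how cocoon actually picks MACs is not known here, and pick_clone_mac, gen_local_mac, and the "hibernate" kind string are hypothetical names. The fallback generates a random locally administered unicast MAC, which may or may not match what cocoon does today.

```shell
#!/bin/sh
# Hypothetical sketch of the MAC choice at clone time; not cocoon's real code.

# Generate a random locally administered, unicast MAC address.
gen_local_mac() {
    # Six random bytes as hex, e.g. " 3f 9a 01 bc 7e 22".
    set -- $(od -An -N6 -tx1 /dev/urandom)
    # Set the locally-administered bit (0x02) and clear the multicast bit (0x01).
    first=$(( (0x$1 | 0x02) & 0xfe ))
    printf '%02x:%s:%s:%s:%s:%s\n' "$first" "$2" "$3" "$4" "$5" "$6"
}

# Reuse the saved MAC for a hibernate snapshot, otherwise generate a new one.
#   $1: snapshot kind ("hibernate" is an assumed label)
#   $2: MAC recorded in the snapshot, may be empty
pick_clone_mac() {
    kind="$1"; saved="$2"
    if [ "$kind" = "hibernate" ] && [ -n "$saved" ]; then
        printf '%s\n' "$saved"
    else
        gen_local_mac
    fi
}
```

With this shape, a cross-node collision check (the concern raised above) would slot in just before the saved MAC is returned.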

Repro artifacts

  • vk-cocoon journal on cocoonset-node-2 around 2026-05-14T12:24:07Z to 12:24:19Z — full sequence.
  • cocoonset name: default/vm-c547fa0a, vm-id E5LFZLS2QQXYPBRQEQ5OYQISOQ, still in this state at time of filing.

If you want hands-on access let me know and I'll keep the VM around; otherwise vm-service will tear it down after the e2e times out.
