vm_clone produces VMs that are silently fragile to multi-disk attach + reboot (no bootDevices, vda-anchored grub) #370

@ddemlow

Summary

VMs cloned via scale_computing.hypercore.vm_clone from a standard Canonical Ubuntu cloud image inherit two independent boot-fragility issues that surface the first time a second VIRTIO_DISK is attached AND the VM is rebooted:

  1. bootDevices is empty on the cloned VirDomain. HC's BIOS implicitly falls back to "the only virtio disk", but that fallback gives up once a second virtio disk is attached → Boot failed: not a bootable disk / No bootable device.
  2. The cloud image's /etc/default/grub ships with GRUB_DISABLE_LINUX_UUID=true, so the generated /boot/grub/grub.cfg uses root=/dev/vda1 (a device path) instead of root=UUID=… (a filesystem UUID). When a second virtio disk reorders PCI enumeration, the OS disk becomes vdb, the kernel can't find /dev/vda1, and initramfs hangs at Btrfs loaded ….

Combined effect: a cloud-image VM cloned via this module boots correctly for years while it has exactly one virtio disk, then bricks on the first reboot after any tool (CSI driver, manual disk-add, Terraform provider, or even another playbook in the same suite) attaches a second virtio disk.

This is the underlying root cause for a pair of related issues filed in ScaleComputing/k3s-on-hypercore (#7 and #8), but the same exposure exists in every downstream consumer of vm_clone: k3s-ansible-hypercore, ansible_edge_playbooks, and customer playbooks that clone from cloud-image templates.

Reproducer

# Clone an Ubuntu cloud-image VM via this collection (any of the standard
# patterns, e.g. ansible_edge_playbooks/simple_vm_deploy.yml):
ansible-playbook simple_vm_deploy.yml

# Inspect the resulting VM:
curl -sk -u admin:admin "https://<hc-host>/rest/v1/VirDomain/<new-vm-uuid>" \
  | jq '.[0] | {name, bootDevices}'
# → bootDevices: []   ← issue #1

# SSH into the new VM:
grep GRUB_DISABLE_LINUX_UUID /etc/default/grub
# → GRUB_DISABLE_LINUX_UUID=true     ← issue #2
cat /proc/cmdline
# → root=/dev/vda1    ← issue #2 consequence

# Attach a second VIRTIO_DISK, then reboot:
#   - If issue #1 unaddressed: BIOS "No bootable device" — VM unbootable.
#   - If issue #1 fixed but #2 unaddressed: kernel loads but initramfs hangs.
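
The in-guest half of the reproducer can be folded into a small check that tells whether a given kernel command line is exposed to issue #2. This is only a sketch: the function name root_arg_kind and its three output labels are made up for illustration and are not part of the collection.

```shell
#!/bin/sh
# Classify the root= argument of a kernel command line passed as $1.
# Prints "stable" for UUID/PARTUUID/LABEL, /dev/disk/by-* or /dev/mapper/*
# references, "device-path" for any other /dev/ name (e.g. /dev/vda1, which
# depends on enumeration order), and "unknown" if no root= argument is found.
root_arg_kind() {
    # $1 is left unquoted on purpose so the command line splits into words.
    for arg in $1; do
        case "$arg" in
            root=UUID=*|root=PARTUUID=*|root=LABEL=*|root=/dev/disk/by-*|root=/dev/mapper/*)
                echo "stable"; return 0 ;;
            root=/dev/*)
                echo "device-path"; return 0 ;;
        esac
    done
    echo "unknown"
}
```

Inside the guest it would be called as root_arg_kind "$(cat /proc/cmdline)"; a result of device-path means the VM is still anchored to vda.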

Suggested fixes in this collection

Layer 1: vm_clone should set bootDevices after clone

The most natural fix is to add a parameter (or change defaults) so the cloned VM's bootDevices is populated with the primary disk's UUID immediately after creation. Something like:

- name: Clone the VM
  scale_computing.hypercore.vm_clone:
    vm_name: "{{ inventory_hostname }}"
    source_vm_name: "{{ source_template }}"
    set_boot_devices: true   # proposed new parameter, defaulting to true

Or simpler: always populate bootDevices with the cloned VM's largest VIRTIO_DISK by default. Users who want to manage boot order manually can override.
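
To make the "largest VIRTIO_DISK" heuristic concrete, here is a minimal sketch of the selection using jq on the same GET /rest/v1/VirDomain/<uuid> response shown in the reproducer. The field names blockDevs, type, capacity and uuid are assumptions about the VirDomain payload and should be checked against a real response; the function name largest_virtio_uuid is hypothetical.

```shell
#!/bin/sh
# Read a VirDomain GET response (a one-element JSON array) from stdin and
# print the UUID of its largest VIRTIO_DISK. Prints nothing if there is none.
# ASSUMPTION: block devices are listed under .blockDevs with the fields
# .type, .capacity and .uuid.
largest_virtio_uuid() {
    jq -r '(.[0].blockDevs // [])
           | map(select(.type == "VIRTIO_DISK"))
           | max_by(.capacity)
           | .uuid // empty'
}
```

Right after a clone the template normally has a single virtio disk, so the heuristic is unambiguous. On a VM that already carries a data disk larger than the OS disk it would pick the wrong device, which is one reason to keep a manual override.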

Layer 2: optional cloud_init.runcmd injection

If the module accepts cloud_init.user_data, the generated cloud-init should include a runcmd to fix the grub UUID issue at first boot:

runcmd:
  - sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
  - update-grub

Either bake this into the module's default user_data, or document it prominently in the role's README so every downstream playbook can add it.
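
For VMs that are already deployed, the same edit can be applied by hand. The sketch below wraps the sed expression from the runcmd in a function that takes the file path as an argument, so it can be tried on a copy first; the function name fix_grub_uuid is hypothetical.

```shell
#!/bin/sh
# Comment out GRUB_DISABLE_LINUX_UUID=true in a grub defaults file.
# $1 is the file to edit; it defaults to /etc/default/grub.
# Running it twice is harmless: once the line starts with '#', the
# pattern no longer matches.
fix_grub_uuid() {
    grub_defaults=${1:-/etc/default/grub}
    sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' "$grub_defaults"
}
```

On a live VM this would be followed by update-grub, as in the runcmd above, and a reboot; afterwards /proc/cmdline should show root=UUID=… rather than root=/dev/vda1.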

Workaround for the field today

I've documented both layers + per-VM fix recipes in an internal hypercore-api-notes.md reference. The TL;DR for any playbook today:

  1. After vm_clone, PATCH /VirDomain/{uuid} with bootDevices: [<primary-disk-uuid>].
  2. Include the grub runcmd snippet in the cloud-init user data.
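
Step 1 can be sketched with curl in the same style as the reproducer. The environment variable names HC_HOST, HC_USER and HC_PASS and both function names are hypothetical, and -k skips TLS verification only because the reproducer does the same.

```shell
#!/bin/sh
# Build the JSON body for the bootDevices PATCH from a disk UUID ($1).
boot_devices_payload() {
    printf '{"bootDevices":["%s"]}' "$1"
}

# PATCH /rest/v1/VirDomain/<vm-uuid> so the VM boots from the given disk.
# $1 = VM UUID, $2 = primary disk UUID.
# ASSUMPTION: HC_HOST, HC_USER and HC_PASS are set by the caller.
set_boot_device() {
    vm_uuid=$1
    disk_uuid=$2
    curl -sk -u "${HC_USER}:${HC_PASS}" \
         -X PATCH \
         -H 'Content-Type: application/json' \
         -d "$(boot_devices_payload "$disk_uuid")" \
         "https://${HC_HOST}/rest/v1/VirDomain/${vm_uuid}"
}
```

Re-running the GET from the reproducer afterwards should show the disk UUID in bootDevices instead of an empty list.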

I've also applied both layers manually on an 8-VM reference cluster, and it works cleanly.

Why this matters

Without these fixes, every k8s-on-HyperCore deployment built from cloned cloud-image VMs is a ticking time bomb. The VMs come up fine and work fine for months or years; then the first time someone adds storage (CSI driver, data disk, NFS export disk, etc.) and the VM reboots, it bricks. That's exactly the wrong failure pattern: silent and recovery-blocking.

Related

  • HC platform team issue (BIOS auto-fallback should be smarter or refuse empty bootDevices): github.lab.local/dev/dev#394
  • k3s-on-hypercore#7 (bootDevices)
  • k3s-on-hypercore#8 (grub UUID)

Suggested labels

bug, priority:high — silent failure mode that affects every multi-disk workload on HC.
