feat(vm): add conntrack synchronization for live migration#1939

Open
loktev-d wants to merge 14 commits into main from feat/vm/ct-sync-live-migration

@loktev-d loktev-d commented Jan 30, 2026

Description

What this PR does

This PR implements Conntrack (CT) synchronization for VM live migration. During live migration, TCP connection state is preserved by exporting conntrack entries from Cilium on the source node and importing them on the target node before the VM becomes active.

Key Changes

New conntrack package (pkg/virt-handler/conntrack/):

  • cilium.go - HTTP client for Cilium's conntrack export/import API via Unix socket
  • source.go - Source-side handler that exports CT entries and sends them to target
  • target.go - Target-side handler that receives CT entries and imports them to Cilium
  • protocol.go - Wire protocol for CT data transfer (version + length-prefixed binary)
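
A minimal sketch of what a "version + length-prefixed binary" frame could look like. The PR text does not give the exact layout, so the field widths (two uint32 values), the big-endian byte order, the size cap, and the names writeFrame, readFrame, and roundTrip are assumptions for illustration, not the contents of protocol.go.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// maxPayload caps the accepted payload size (hypothetical limit).
const maxPayload = 64 << 20

// writeFrame writes one frame.
// Assumed layout: uint32 version | uint32 length | payload, big-endian.
func writeFrame(w io.Writer, version uint32, payload []byte) error {
	if len(payload) > maxPayload {
		return errors.New("payload too large")
	}
	var hdr [8]byte
	binary.BigEndian.PutUint32(hdr[0:4], version)
	binary.BigEndian.PutUint32(hdr[4:8], uint32(len(payload)))
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}

// readFrame reads one frame and rejects oversized or truncated input.
func readFrame(r io.Reader) (uint32, []byte, error) {
	var hdr [8]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return 0, nil, fmt.Errorf("read header: %w", err)
	}
	version := binary.BigEndian.Uint32(hdr[0:4])
	n := binary.BigEndian.Uint32(hdr[4:8])
	if n > maxPayload {
		return 0, nil, errors.New("payload too large")
	}
	payload := make([]byte, n)
	if _, err := io.ReadFull(r, payload); err != nil {
		return 0, nil, fmt.Errorf("read payload: %w", err)
	}
	return version, payload, nil
}

// roundTrip encodes a frame into memory and decodes it again.
func roundTrip(version uint32, payload []byte) (uint32, []byte, error) {
	var buf bytes.Buffer
	if err := writeFrame(&buf, version, payload); err != nil {
		return 0, nil, err
	}
	return readFrame(&buf)
}

func main() {
	v, p, err := roundTrip(1, []byte("ct-entries"))
	fmt.Println(v, string(p), err)
}
```

Carrying the version in the frame lets the receiving side refuse data it cannot interpret before handing anything to Cilium. Bounding the length before allocating keeps a corrupt header from triggering a huge allocation.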

cmd/virt-launcher-hook/main.go - libvirt QEMU hook binary installed at /etc/libvirt/hooks/qemu:

  • Fires on the started begin event (after memory transfer, before the VM resumes on the target)
  • Connects to virt-handler's hook socket, sends wait, and blocks until CT import completes or the timeout expires
  • Acts as the synchronization gate that prevents VM resume until conntrack is injected
  • Uses a 200ms hook timeout: the timer starts when the started begin hook fires and bounds how long the VM can stay paused if CT data does not arrive in time. On timeout the VM resumes without CT sync, which is the same behavior as without this feature
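
The gate described above can be sketched roughly as follows. The "wait" and "ok" messages and the 200ms budget come from the description; the newline framing, the socket path, and the names waitForCTSync and demo are assumptions. The listener inside demo is only a stand-in for virt-handler so the example runs on its own.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitForCTSync connects to the hook socket, sends "wait", and blocks until
// a reply arrives or the timeout expires. Newline-terminated messages are an
// assumption of this sketch.
func waitForCTSync(socketPath string, timeout time.Duration) (bool, error) {
	deadline := time.Now().Add(timeout)
	conn, err := net.DialTimeout("unix", socketPath, timeout)
	if err != nil {
		return false, err
	}
	defer conn.Close()
	// One deadline covers both the write and the blocking read.
	if err := conn.SetDeadline(deadline); err != nil {
		return false, err
	}
	if _, err := conn.Write([]byte("wait\n")); err != nil {
		return false, err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return false, err // includes the timeout case
	}
	return strings.TrimSpace(reply) == "ok", nil
}

// demo starts a stand-in listener that answers "ok" after delayMs and runs
// the hook logic against it with a budget of timeoutMs.
func demo(delayMs, timeoutMs int) (bool, error) {
	dir, err := os.MkdirTemp("", "hook")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "hook.sock") // hypothetical path
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return false, err
	}
	defer ln.Close()
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		if _, err := bufio.NewReader(c).ReadString('\n'); err != nil {
			return
		}
		time.Sleep(time.Duration(delayMs) * time.Millisecond)
		c.Write([]byte("ok\n"))
	}()
	return waitForCTSync(sock, time.Duration(timeoutMs)*time.Millisecond)
}

func main() {
	ok, err := demo(10, 200)
	fmt.Println("synced:", ok, "err:", err)
}
```

In this sketch a timeout surfaces as an error from the read, and the caller can then let the VM resume without CT sync, matching the fallback described above.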

CT Sync Flow

SOURCE NODE                                          TARGET NODE
============                                         ============

1. Migration starts
        |
        v
2. virt-handler detects migration
        |                                            3. Target virt-handler starts:
        |                                               - Migration proxy (TCP -> Unix socket)
        |                                               - CT sync listener (Unix socket)
        |                                               - Hook listener (Unix socket)
        |                                                        |
        v                                                        v
4. Source gets target ports from VMI status          5. Target publishes ports in VMI status
        |                                               (49154 -> TCP ports)
        v
6. Source virt-handler:
   - Calls Cilium API: GET /v1/conntrack/export?ip4=<VM_IP>
   - Gets binary CT entries + version
        |
        v
7. Source sends CT data via migration proxy:
   Unix socket -> TCP (port 49154) -> Target
        |                                                        |
        |                                                        v
        |                                            8. Target receives CT data:
        |                                               - Decodes protocol (version + data)
        |                                               - Calls Cilium API: POST /v1/conntrack/import
        |                                               - Marks injection as done
        |                                                        |
        v                                                        v
9. libvirt migrates VM memory/state               10. libvirt hook (started begin):
        |                                               - Connects to hook socket
        |                                               - Sends "wait"
        |                                               - Waits for CT import completion
        |                                               - Receives "ok"
        |                                                        |
        v                                                        v
11. Migration completes                           12. VM activates on target with preserved connections
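
Step 6 of the flow, the export call over Cilium's Unix socket, could look roughly like this. The endpoint path and the ip4 query parameter are taken from the diagram; the response format, how the version is returned, the socket path, and the names newUnixClient, exportCT, and demo are assumptions. The server inside demo is a stand-in for Cilium, not its real behavior.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/url"
	"os"
	"path/filepath"
)

// newUnixClient returns an http.Client whose connections all go to socketPath,
// regardless of the host in the request URL.
func newUnixClient(socketPath string) *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
}

// exportCT fetches the raw CT entries for vmIP. Version handling is omitted
// because the PR text does not say how the version is returned.
func exportCT(ctx context.Context, c *http.Client, vmIP string) ([]byte, error) {
	// The host part ("unix") is a placeholder; the dialer ignores it.
	u := "http://unix/v1/conntrack/export?ip4=" + url.QueryEscape(vmIP)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, u, nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("export: unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// demo serves a fake export endpoint on a temporary Unix socket and queries it.
func demo(vmIP string) (string, error) {
	dir, err := os.MkdirTemp("", "cilium")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "cilium.sock") // hypothetical path
	ln, err := net.Listen("unix", sock)
	if err != nil {
		return "", err
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/conntrack/export", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "entries-for-%s", r.URL.Query().Get("ip4"))
	})
	srv := &http.Server{Handler: mux}
	go srv.Serve(ln)
	defer srv.Close() // also closes the listener
	data, err := exportCT(context.Background(), newUnixClient(sock), vmIP)
	return string(data), err
}

func main() {
	out, err := demo("10.0.0.5")
	fmt.Println(out, err)
}
```

The import call in step 8 would follow the same pattern with a POST to /v1/conntrack/import and the received entries as the request body.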

Why do we need it, and what problem does it solve?

What is the expected result?

Checklist

  • The code is covered by unit tests.
  • e2e tests passed.
  • Documentation updated according to the changes.
  • Changes were tested in the Kubernetes cluster manually.

Changelog entries

section: vm
type: feature
summary: add conntrack synchronization for live migration

Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
@loktev-d loktev-d marked this pull request as draft February 3, 2026 08:22
Signed-off-by: Daniil Loktev <70405899+loktev-d@users.noreply.github.com>
@loktev-d loktev-d added the e2e/run Run e2e test on cluster of PR author label Feb 12, 2026
Signed-off-by: Daniil Loktev <70405899+loktev-d@users.noreply.github.com>
@loktev-d loktev-d added e2e/run Run e2e test on cluster of PR author and removed e2e/run Run e2e test on cluster of PR author labels Feb 12, 2026

deckhouse-BOaTswain commented Feb 12, 2026

Workflow has started.
Follow the progress here: Workflow Run

The target step completed with status: failure.

@deckhouse-BOaTswain deckhouse-BOaTswain removed the e2e/run Run e2e test on cluster of PR author label Feb 12, 2026
loktev-d and others added 6 commits February 13, 2026 10:23
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
@loktev-d loktev-d added the e2e/run Run e2e test on cluster of PR author label Feb 17, 2026

deckhouse-BOaTswain commented Feb 17, 2026

Workflow has started.
Follow the progress here: Workflow Run

The target step completed with status: failure.

@deckhouse-BOaTswain deckhouse-BOaTswain removed the e2e/run Run e2e test on cluster of PR author label Feb 17, 2026
@loktev-d loktev-d added the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026

deckhouse-BOaTswain commented Feb 18, 2026

Workflow has started.
Follow the progress here: Workflow Run

The target step completed with status: failure.

@deckhouse-BOaTswain deckhouse-BOaTswain removed the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026
@loktev-d loktev-d added the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026

deckhouse-BOaTswain commented Feb 18, 2026

Workflow has started.
Follow the progress here: Workflow Run

The target step completed with status: failure.

@deckhouse-BOaTswain deckhouse-BOaTswain removed the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026
@loktev-d loktev-d added the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026

deckhouse-BOaTswain commented Feb 18, 2026

Workflow has started.
Follow the progress here: Workflow Run

The target step completed with status: failure.

@deckhouse-BOaTswain deckhouse-BOaTswain removed the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026
@loktev-d loktev-d added the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026

deckhouse-BOaTswain commented Feb 18, 2026

Workflow has started.
Follow the progress here: Workflow Run

The target step completed with status: failure.

@deckhouse-BOaTswain deckhouse-BOaTswain removed the e2e/run Run e2e test on cluster of PR author label Feb 18, 2026
loktev-d and others added 3 commits March 5, 2026 11:37
Signed-off-by: Daniil Loktev <70405899+loktev-d@users.noreply.github.com>
Signed-off-by: Daniil Loktev <lokt.daniil@gmail.com>
Signed-off-by: Daniil Loktev <70405899+loktev-d@users.noreply.github.com>
@loktev-d loktev-d marked this pull request as ready for review March 5, 2026 14:59
@loktev-d loktev-d added this to the v1.6.1 milestone Mar 5, 2026
2 participants