fix: enable LibreMesh mesh networking in QEMU testbed #2

Open: luandro wants to merge 9 commits into main from fix/mesh-networking-qemu-testbed

luandro commented Jun 6, 2026

Contributor

Summary

Fix LibreMesh mesh networking (babeld) to work between QEMU VMs in the testbed, while preserving DHCP IP assignment, SSH access, and test infrastructure. The fix is fully automated with no manual serial console steps.

Root cause

LibreMesh's lime-config runs at first boot and creates VLAN-tagged batman-adv interfaces (eth0_29, bat0) that don't exist in the QEMU testbed's plain-Ethernet bridge. The previous rc.local workaround (ip link set eth0 up + udhcpc) bypassed lime-config's network setup entirely, so babeld was either not running or running on non-existent VLAN interfaces.

Strategy

Let lime-config run (preserving LibreMesh services like thisnode.info and shared-state), then override the network layer in rc.local to use plain DHCP on br-lan and restart babeld on br-lan via its init script. This is the same wired-bridge approach the bare-OpenWrt path uses, so both image types converge to the same mesh topology.
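The override step can be sketched as an atomic file replace. This is a minimal illustration, not the PR's actual rc.local: the helper names `atomic_write` and `write_testbed_network`, the `CONFIG_DIR` variable, and the exact UCI stanzas (written in the `config device` syntax of recent OpenWrt releases, with `eth0` as the only bridge port) are all assumptions.

```shell
# Write stdin to a temp file next to the target, then rename over it.
# A rename within one filesystem is atomic, so readers see either the
# old file or the new one, never a half-written config.
atomic_write() {
    target=$1
    tmp="${target}.tmp.$$"
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target"
}

# Hypothetical wrapper: CONFIG_DIR defaults to /etc/config on the VM,
# but can point elsewhere for a dry run.
write_testbed_network() {
    atomic_write "${CONFIG_DIR:-/etc/config}/network" <<'EOF'
config interface 'loopback'
    option device 'lo'
    option proto 'static'
    option ipaddr '127.0.0.1'
    option netmask '255.0.0.0'

config device
    option name 'br-lan'
    option type 'bridge'
    list ports 'eth0'

config interface 'lan'
    option device 'br-lan'
    option proto 'dhcp'
EOF
}
```

Running `CONFIG_DIR=/tmp/dry-run write_testbed_network` (after creating that directory) writes the file without touching the live configuration.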

Changes

  • scripts/qemu/configure-source-image.sh: New rc.local that atomically replaces /etc/config/network and /etc/config/babeld (UCI/netifd, not raw ip commands), waits for br-lan state UP, starts babeld via init script (with direct-invocation fallback), and only sets the first-boot marker when both br-lan is UP and the mesh daemon is verified listening. Pre-creates /etc/config/babeld in the rootfs as defense-in-depth.
  • scripts/qemu/start-mesh.sh: Disable multicast snooping on mesha-br0 (babeld uses multicast for neighbor discovery; snooping can suppress hellos).
  • scripts/qemu/configure-vms.sh: Move mesh_proto detection to top of configure_vm; add post-lime-config verification that re-applies UCI network config and restarts babeld if lime-config clobbered them; unify start_mesh_daemon_on_vm to run for BOTH image paths; expand Phase 3 with cross-node ping and 120s timeout (up from 90s).
  • scripts/qemu/configure-vms.sh: New collect_diagnostics that dumps per-node ip addr/ip link/ip route/UCI/daemon/logread state, plus host bridge state, to run/logs/convergence-diagnostics.log on convergence failure or --debug; new phase_start/phase_end timing helpers; --debug/--help flags.
  • tests/qemu/common.sh: count_mesh_neighbors now probes three signals (control socket dump, kernel routes proto babel, PID + UDP listener) and returns the max. New wait_for_mesh_ping for cross-node L3 reachability checks.
  • scripts/qemu/libremesh-testbed.defconfig: Add lime-proto-babeld (lime-config silently falls back to bmx7 without it).
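
The "only set the marker when healthy" rule from the first bullet could look roughly like the sketch below. The function names, the marker path, and the use of `netstat` to detect a listener on Babel's UDP port 6696 are assumptions for illustration; the PR's rc.local may check these conditions differently.

```shell
# Hypothetical defaults; both can be overridden for a dry run.
SYSFS_NET=${SYSFS_NET:-/sys/class/net}
MARKER=${MARKER:-/etc/testbed-configured}

# True when the kernel reports the interface as operationally up.
bridge_is_up() {
    [ "$(cat "$SYSFS_NET/$1/operstate" 2>/dev/null)" = "up" ]
}

# True when something is listening on UDP port 6696 (the Babel port).
babeld_is_listening() {
    netstat -lnu 2>/dev/null | grep -q ':6696[[:space:]]'
}

# Set the first-boot marker only when both checks pass; otherwise leave
# it unset so the next boot retries the whole sequence.
finish_first_boot() {
    if bridge_is_up br-lan && babeld_is_listening; then
        touch "$MARKER"
    else
        echo "testbed: br-lan or babeld not ready, will retry on next boot" >&2
        return 1
    fi
}
```

Because the marker is written last and only on success, a boot that fails halfway leaves no marker behind and the sequence runs again.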

Verification

All Codex gpt-5-codex reviews reached 5/5 confidence (4-5 review rounds per task). Fast test suite passes (5 suites, 28 tests, 0 failures). Lab/adapter suites require an actual running QEMU lab and were not run in this CI environment.

Test plan

  1. bin/libremesh-lab build-image (or bash scripts/qemu/convert-prebuilt.sh for prebuilt)
  2. sudo bin/libremesh-lab start
  3. bin/libremesh-lab configure
  4. bin/libremesh-lab test (fast suite, 5/5 passes locally)
  5. bin/libremesh-lab test --suite lab (mesh convergence + multi-hop)
  6. MESHA_ROOT=/path/to/mesha bin/libremesh-lab test --suite adapter (mesha adapters)

Plan reference

See PLAN.md at the repo root for the full plan with root-cause analysis, implementation tasks, risk mitigations, and verification criteria.

luandro added 7 commits June 5, 2026 21:26
Replace the raw ip/udhcpc workaround with a proper UCI-based post-boot
network reconfiguration that lets lime-config run, then rewrites
/etc/config/network to DHCP-on-br-lan via atomic file replace and
restarts netifd. Babeld is configured to run on br-lan via the init
script, with a direct invocation fallback when the init script is
missing. The first-boot marker is only set when both br-lan is up and
the mesh daemon is verified listening, so failed boots retry.

Codex gpt-5-codex review: 5/5 confidence.

Linux bridges default to multicast_snooping=1 since ~3.x, which can
suppress babeld's multicast hellos to TAP ports that haven't joined
the relevant IGMP/MLD group. On a wired shared-L2 topology this breaks
babeld neighbor discovery. Use the iproute2 spelling 'mcast_snooping'
with an explicit if/else so older iproute2 versions don't break the
lab (they print a WARN and continue with snooping enabled).
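
A minimal sketch of that if/else, assuming a helper named `disable_mcast_snooping` (the name and the message wording are illustrative, not taken from start-mesh.sh):

```shell
# Disable multicast snooping on a bridge; warn and carry on if the
# installed iproute2 does not understand the option.
disable_mcast_snooping() {
    bridge=$1
    if ip link set dev "$bridge" type bridge mcast_snooping 0 2>/dev/null; then
        echo "INFO: multicast snooping disabled on $bridge"
    else
        echo "WARN: could not disable multicast snooping on $bridge; continuing" >&2
    fi
}
```

Calling `disable_mcast_snooping mesha-br0` as root applies the setting. The function returns success in both branches, so a failure here does not abort lab startup.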

Codex gpt-5-codex review: 5/5 confidence.

Move mesh_proto detection to the top of configure_vm so it's available
to both image paths. After the LibreMesh lime-config sequence, verify
br-lan has the expected IP and re-apply the testbed UCI network config
if not (rc.local may not have run, or lime-config may have clobbered
it). Also verify babeld/bmx7 is listening on br-lan and restart via
the init script if not. Unify the mesh daemon re-verify: start_mesh_daemon_on_vm
now runs once for BOTH LibreMesh and bare OpenWrt paths, ensuring
consistent dual-interface mode (wlan0+br-lan when available, br-lan
otherwise) regardless of image type.

Codex gpt-5-codex review: 5/5 confidence.

…ility

count_mesh_neighbors in common.sh now probes three independent signals
for babeld and returns the maximum: (1) control socket dump at
/var/run/babeld.sock (authoritative neighbour count), (2) kernel routes
'ip route show proto babel', (3) baseline PID + UDP listener. The
socket parser only counts 'add/change neighbour' records to avoid
false positives from route/interface lines.
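
The parsing and max-of-signals logic might be sketched as follows. The helper names are hypothetical, the dump is read from stdin rather than from the socket, and de-duplicating by the neighbour id field is an assumption added here so that an add followed by a change for the same neighbour counts once.

```shell
# Count distinct neighbours in a babeld control-socket dump read from stdin.
# Only "add neighbour" / "change neighbour" records are considered; route
# and interface lines are ignored.
count_neighbours_from_dump() {
    awk '($1 == "add" || $1 == "change") && $2 == "neighbour" { seen[$3] = 1 }
         END { n = 0; for (id in seen) n++; print n }'
}

# Count non-empty lines of "ip route show proto babel" output on stdin.
count_babel_routes() {
    grep -c . || true
}

# Print the largest of the integer arguments.
max_of() {
    max=0
    for v in "$@"; do
        [ "$v" -gt "$max" ] && max=$v
    done
    echo "$max"
}
```

A caller would combine the three probes with something like `max_of "$socket_count" "$route_count" "$baseline"`, where the baseline is 1 when the daemon is alive and listening, and 0 otherwise.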

New wait_for_mesh_ping helper for cross-node L3 reachability checks.

configure-vms.sh Phase 3: babeld detection inlined (common.sh is not
sourced here) with identical logic, timeout increased 90s → 120s, and
a post-loop cross-node ping check added as a definitive L3
reachability signal.

Codex gpt-5-codex review: 5/5 confidence.

Add collect_diagnostics: when Phase 3 convergence fails, SSH to each
node and collect ip addr/link/route, /etc/config/{network,babeld},
mesh daemon process state, UDP listeners, kernel babel routes,
logread (babeld/netifd/lime-config), and /etc/rc.local. Plus host
bridge state and dnsmasq leases. Writes to
run/logs/convergence-diagnostics.log.

Add phase_start/phase_end for per-phase timing and --debug / DEBUG=1
flag: surfaces SSH stderr, wraps each phase with timing, and dumps
a snapshot on early non-zero exit (success path snapshot at end of
main). --help shows usage.
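
Per-phase timing helpers can be as small as the sketch below. The names match those in the commit message, but the bodies, the global variables, and the log format are assumptions.

```shell
# Record the start time of a named phase.
phase_start() {
    PHASE_NAME=$1
    PHASE_T0=$(date +%s)
    echo "[phase] $PHASE_NAME: start" >&2
}

# Report how long the current phase took, in whole seconds.
phase_end() {
    elapsed=$(( $(date +%s) - PHASE_T0 ))
    echo "[phase] $PHASE_NAME: done in ${elapsed}s" >&2
}
```

Writing the timing lines to stderr keeps them out of any stdout that a caller captures with command substitution.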

Codex gpt-5-codex review: 5/5 confidence.

…ation

lime-config silently falls back to bmx7 when lime-proto-babeld is not
installed, even if the babeld package itself is present. Add the
lime-proto-babeld package so lime-config wires babeld as the primary
mesh routing protocol (matching the babeld-first detection in
configure-vms.sh and tests/qemu/common.sh).

Codex gpt-5-codex review: 5/5 confidence.

greptile-apps Bot commented Jun 6, 2026


Greptile Summary

This PR fixes LibreMesh mesh networking (babeld) in the QEMU testbed by replacing the old raw-ip bypass in rc.local with a proper UCI-based network reconfiguration that lets lime-config run, then atomically rewrites /etc/config/network and /etc/config/babeld to DHCP-on-br-lan and starts babeld directly. Supporting changes disable multicast snooping on the host bridge, strengthen the babeld convergence signal with three independent probes, add a cross-node ping check, and introduce --debug/collect_diagnostics tooling.

  • configure-source-image.sh: New procd-compatible S99testbed init script calls a rewritten rc.local that kills conflicting daemons, atomically rewrites network/babeld UCI configs, starts netifd + babeld + dropbear directly (bypassing missing rc.common), and only sets a marker file when both br-lan and babeld are healthy — retrying on the next boot otherwise.
  • configure-vms.sh: Moves mesh_proto detection to the top of configure_vm, adds a post-lime-config SSH verification pass that re-applies a static network config and restarts babeld if needed, unifies start_mesh_daemon_on_vm for both image paths, and expands Phase 3 with a 120 s timeout and a definitive cross-node ping.
  • tests/qemu/common.sh + start-mesh.sh + libremesh-testbed.defconfig: Three-signal babeld convergence probe, bridge multicast snooping disabled, and lime-proto-babeld added to the defconfig so lime-config selects babeld instead of silently falling back to bmx7.

Confidence Score: 5/5

Safe to merge; the changes are self-contained testbed infrastructure and the logic is well-defended with || true guards, idempotent marker files, and retry-on-next-boot semantics.

The core rc.local rewrite, post-lime-config SSH repair, and multicast-snooping fix are all sound. Failures are handled gracefully throughout (|| true, marker-based retry, collect_diagnostics on failure). The one finding — collect_diagnostics overwriting failure-time data in debug runs — is a minor observability gap that does not affect correctness or testbed operation.

No files require special attention. The diagnostic overwrite in configure-vms.sh is the only actionable item and it does not affect production behavior.
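
One possible way to close the diagnostic-overwrite gap is to append labelled snapshots rather than truncate the log. This is a sketch of that idea only, not a change made in this PR; `append_snapshot` and `LOG_FILE` are hypothetical names.

```shell
# Append one labelled snapshot (read from stdin) to the diagnostics log
# instead of truncating it, so an earlier failure-time dump survives.
append_snapshot() {
    label=$1
    log=${LOG_FILE:-run/logs/convergence-diagnostics.log}
    mkdir -p "$(dirname "$log")"
    {
        echo "===== $label @ $(date -u +%Y-%m-%dT%H:%M:%SZ) ====="
        cat
        echo
    } >> "$log"
}
```

With this shape, a failure-time dump and a later debug dump end up as two separate sections in the same file.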

Important Files Changed

| Filename | Overview |
| --- | --- |
| scripts/qemu/configure-source-image.sh | Replaces the old ip link up + udhcpc rc.local bypass with a proper UCI-based post-boot network reconfiguration: atomically rewrites /etc/config/network to DHCP-on-br-lan, starts netifd/babeld/dropbear directly (sidestepping missing rc.common), and sets a marker file only when both br-lan and babeld are healthy. Adds a procd-compatible S99testbed init script and pre-creates the babeld UCI config as defense-in-depth. |
| scripts/qemu/configure-vms.sh | Extensive rework: moves mesh_proto detection to top of configure_vm, adds post-lime-config verification that re-applies UCI network config and restarts babeld if lime-config clobbered them, unifies start_mesh_daemon_on_vm for both paths, expands Phase 3 with a cross-node ping check and 120s timeout, and adds collect_diagnostics/phase timing/--debug/--help. The overwrite mode in collect_diagnostics can silently clobber failure-time data on debug runs. |
| tests/qemu/common.sh | count_mesh_neighbors now probes three babeld signals (control socket, kernel routes, PID+UDP) and returns the max; adds wait_for_mesh_ping for L3 reachability checks. Baseline signal (daemon alive+listening) still returns 1 even for an isolated daemon, which is intentional and documented. |
| scripts/qemu/start-mesh.sh | Adds explicit multicast snooping disable on mesha-br0 so babeld's multicast neighbor-discovery hellos are flooded to all TAP ports rather than suppressed; gracefully falls back if older iproute2 lacks the option. |
| scripts/qemu/libremesh-testbed.defconfig | Adds CONFIG_PACKAGE_lime-proto-babeld=y so lime-config wires babeld as the mesh routing protocol instead of silently falling back to bmx7. |
| PLAN.md | New implementation plan document explaining root cause, strategy, tasks, risks, and alternatives for the LibreMesh QEMU mesh networking fix. |

Sequence Diagram

sequenceDiagram
    participant img as configure-source-image.sh
    participant vm as QEMU VM (first boot)
    participant procd as procd (PID 1)
    participant cfg as configure-vms.sh

    img->>vm: Inject S99testbed + rc.local + babeld UCI + SSH keys
    img->>vm: Re-enable lime-config uci-defaults

    Note over vm,procd: VM boots
    procd->>procd: Run uci-defaults (91_lime-config) — creates VLAN batman-adv interfaces
    procd->>procd: "Run S99testbed (last S*)"
    procd->>vm: Execute rc.local

    Note over vm: rc.local
    vm->>vm: killall babeld / netifd
    vm->>vm: Rewrite /etc/config/network to DHCP on br-lan
    vm->>vm: Rewrite /etc/config/babeld to br-lan
    vm->>vm: Start netifd (direct)
    vm->>vm: Wait 30s for br-lan UP
    vm->>vm: Start babeld -D -I /var/run/babeld.pid br-lan
    vm->>vm: Start dropbear -R -B
    vm->>vm: Set marker (only if br-lan UP + babeld listening)

    cfg->>vm: Phase -1: detect/repair VM IPs
    cfg->>vm: Phase 0: wait for SSH reachable
    cfg->>vm: Phase 1: configure_vm
    Note over cfg,vm: Post-lime verify and repair
    cfg->>vm: Check br-lan IP correct?
    vm-->>cfg: brlan_ip
    cfg->>vm: if wrong — re-apply UCI static config + restart network
    cfg->>vm: Check babeld listening?
    cfg->>vm: if not — restart babeld via init.d or direct
    cfg->>vm: start_mesh_daemon_on_vm (safety re-bind)

    cfg->>vm: Phase 2: inject SSH keys
    cfg->>vm: "Phase 3: convergence (120s, expected_peers=1 for babeld)"
    cfg->>vm: Cross-node ping node[0] to node[1]
    cfg->>cfg: collect_diagnostics on failure

Reviews (2): Last reviewed commit: "test: temporary commit for greptile CLI ..."

Comment thread scripts/qemu/configure-vms.sh
Comment thread scripts/qemu/configure-vms.sh
…ip --debug from main args

P1: _ssh_vm_try used 2>&1 in DEBUG mode, merging SSH stderr into stdout.
This polluted command substitution captures (brlan_ip, babeld_ok, etc.)
with diagnostic noise, breaking IP parsing and status detection. Now
stderr flows to fd 2 (terminal) so callers get clean stdout regardless
of DEBUG mode.

P2: The --debug flag was parsed but never stripped from $@, so
main "$@" received it as a positional argument. Currently harmless
since main() ignores positional args, but this prevents future
regressions. A flag-stripping loop now rebuilds $@ without recognized
flags before calling main.

luandro commented Jun 6, 2026

Contributor Author

Addressed Greptile Review Feedback

Both review comments have been fixed in commit 04b5a22:

P1: DEBUG mode stderr polluting captured output

Fixed: _ssh_vm_try no longer uses 2>&1 in DEBUG mode. Stderr now flows to fd 2 (terminal) directly, so callers capturing stdout via command substitution (brlan_ip, babeld_ok, mesh_proto, peer_count) get clean output.
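
The effect is easy to reproduce in isolation. In this sketch `fake_ssh` is a stand-in for a remote command, and the address and debug line are made-up illustration values:

```shell
# Stand-in for a remote command: a value on stdout, diagnostics on stderr.
fake_ssh() {
    echo "debug1: connection established" >&2
    echo "192.0.2.11"
}

# Merging stderr into stdout pollutes the captured value:
# it now holds the debug line as well as the address.
merged=$(fake_ssh 2>&1)

# Leaving stderr on fd 2 keeps the capture clean: only the address is
# captured, and the debug line still reaches the terminal.
clean=$(fake_ssh)
```

Any caller that parses `merged` as an IP address would fail, which matches the breakage described in P1.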

P2: --debug forwarded to main() as positional arg

Fixed: Added a flag-stripping loop after arg parsing that rebuilds $@ without recognized flags before calling main "$@".
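
A flag-stripping loop of this kind can be written portably as below. This is a sketch under assumptions, not the code from commit 04b5a22: `run_with_flags` is a hypothetical wrapper, and `main` here only prints what it receives.

```shell
# Placeholder main: report the debug setting and remaining positional args.
main() {
    echo "debug=${DEBUG:-0} args=$*"
}

# Consume recognised flags and rebuild "$@" from everything else.
# The for-loop iterates over the original argument list; each pass shifts
# one argument off the front and re-appends it unless it is a known flag.
run_with_flags() {
    for arg in "$@"; do
        shift
        case "$arg" in
            --debug) DEBUG=1 ;;
            *) set -- "$@" "$arg" ;;
        esac
    done
    main "$@"
}
```

For example, `run_with_flags --debug node1 node2` calls `main` with only `node1 node2`, with `DEBUG` set to 1.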

Both changes have been validated with bash -n syntax check and reviewed for regressions.
