fix: enable LibreMesh mesh networking in QEMU testbed #2
Replace the raw ip/udhcpc workaround with a proper UCI-based post-boot network reconfiguration that lets lime-config run, then rewrites /etc/config/network to DHCP-on-br-lan via atomic file replace and restarts netifd. Babeld is configured to run on br-lan via the init script, with a direct invocation fallback when the init script is missing. The first-boot marker is only set when both br-lan is up and the mesh daemon is verified listening, so failed boots retry. Codex gpt-5-codex review: 5/5 confidence.
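The atomic-replace and marker-gating behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual rc.local: `write_atomic` and `set_marker_if_ready` are hypothetical names, and the two readiness checks are passed in as commands so the sketch runs anywhere.

```shell
#!/bin/sh
# Sketch only: write_atomic and set_marker_if_ready are hypothetical helper
# names, not functions from the PR.

# Write stdin to "$1" through a temp file in the same directory, then rename.
# A rename within one filesystem is atomic, so readers never see a partial file.
write_atomic() {
    target=$1
    tmp="${target}.tmp.$$"
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target"
}

# Create the marker only when both checks succeed. If either fails, the marker
# stays absent, so the next boot runs the reconfiguration again.
set_marker_if_ready() {
    marker=$1
    link_check=$2     # command that succeeds when the bridge is up
    daemon_check=$3   # command that succeeds when the mesh daemon is listening
    if $link_check && $daemon_check; then
        : > "$marker"
        return 0
    fi
    return 1
}
```

A caller might feed the new config with `write_atomic /etc/config/network < /tmp/network.new` (paths assumed for illustration) and pass its own check functions to `set_marker_if_ready`.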
Linux bridges default to multicast_snooping=1 since ~3.x, which can suppress babeld's multicast hellos to TAP ports that haven't joined the relevant IGMP/MLD group. On a wired shared-L2 topology this breaks babeld neighbor discovery. Use the iproute2 spelling 'mcast_snooping' with an explicit if/else so older iproute2 versions don't break the lab (they print a WARN and continue with snooping enabled). Codex gpt-5-codex review: 5/5 confidence.
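A minimal sketch of the if/else guard this commit describes, assuming the bridge name is passed in by the caller. The function name `disable_mcast_snooping` and the exact message wording are hypothetical; only the `ip link set ... type bridge mcast_snooping 0` call is the iproute2 spelling named above.

```shell
#!/bin/sh
# Sketch only: disable_mcast_snooping is a hypothetical name.

# Try to turn off multicast snooping on a bridge. If iproute2 rejects the
# option (older versions), warn and carry on with snooping left enabled.
disable_mcast_snooping() {
    bridge=$1
    if ip link set dev "$bridge" type bridge mcast_snooping 0 2>/dev/null; then
        echo "multicast snooping disabled on $bridge"
    else
        echo "WARN: could not disable multicast snooping on $bridge; continuing" >&2
    fi
    return 0   # never abort the lab start because of this step
}
```

Returning 0 in both branches is the design point: a failed tuning step degrades neighbor discovery but should not stop the lab from starting.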
Move mesh_proto detection to the top of configure_vm so it's available to both image paths. After the LibreMesh lime-config sequence, verify br-lan has the expected IP and re-apply the testbed UCI network config if not (rc.local may not have run, or lime-config may have clobbered it). Also verify babeld/bmx7 is listening on br-lan and restart via the init script if not. Unify the mesh daemon re-verify: start_mesh_daemon_on_vm now runs once for BOTH LibreMesh and bare OpenWrt paths, ensuring consistent dual-interface mode (wlan0+br-lan when available, br-lan otherwise) regardless of image type. Codex gpt-5-codex review: 5/5 confidence.
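The "verify br-lan has the expected IP" step could look roughly like the sketch below. It assumes the address is read from `ip -4 -o addr show dev br-lan` output on the VM; `first_ipv4` and `needs_repair` are hypothetical names, and the addresses in the usage note are made up for illustration.

```shell
#!/bin/sh
# Sketch only: first_ipv4 and needs_repair are hypothetical helper names.

# Read one-line-per-address output (as printed by `ip -4 -o addr show dev IF`)
# on stdin and print the first IPv4 address without its prefix length.
first_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Succeed (exit 0) when the address found on the VM differs from the expected
# one, including the case where no address was found at all.
needs_repair() {
    expected=$1
    actual=$2
    [ "$actual" != "$expected" ]
}
```

For example, with an assumed expected address of 10.99.0.11, a caller would capture `brlan_ip` from the remote `ip` output via `first_ipv4` and re-apply the UCI network config only when `needs_repair 10.99.0.11 "$brlan_ip"` succeeds.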
…ility count_mesh_neighbors in common.sh now probes three independent signals for babeld and returns the maximum: (1) control socket dump at /var/run/babeld.sock (authoritative neighbour count), (2) kernel routes 'ip route show proto babel', (3) baseline PID + UDP listener. The socket parser only counts 'add/change neighbour' records to avoid false positives from route/interface lines. New wait_for_mesh_ping helper for cross-node L3 reachability checks. configure-vms.sh Phase 3: babeld detection inlined (common.sh is not sourced here) with identical logic, timeout increased 90s → 120s, and a post-loop cross-node ping check added as a definitive L3 reachability signal. Codex gpt-5-codex review: 5/5 confidence.
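To illustrate the "count only neighbour records, then take the maximum of three signals" idea, here is a minimal sketch. The helper names `count_neighbour_records` and `max3` are hypothetical, the parser reads dump text from stdin rather than from the control socket, and counting distinct neighbour identifiers (rather than raw lines) is an assumption of this sketch.

```shell
#!/bin/sh
# Sketch only: count_neighbour_records and max3 are hypothetical names.

# Count distinct neighbours in babeld dump text on stdin. Only lines that
# start with "add neighbour" or "change neighbour" are considered, so route
# and interface records cannot inflate the count.
count_neighbour_records() {
    awk '($1 == "add" || $1 == "change") && $2 == "neighbour" { seen[$3] = 1 }
         END { n = 0; for (k in seen) n++; print n }'
}

# Print the largest of three non-negative integers.
max3() {
    m=$1
    if [ "$2" -gt "$m" ]; then m=$2; fi
    if [ "$3" -gt "$m" ]; then m=$3; fi
    echo "$m"
}
```

Taking the maximum means a single failing probe (for example, an unreadable socket reporting 0) cannot hide neighbours that another signal has already confirmed.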
Add collect_diagnostics: when Phase 3 convergence fails, SSH to each
node and collect ip addr/link/route, /etc/config/{network,babeld},
mesh daemon process state, UDP listeners, kernel babel routes,
logread (babeld/netifd/lime-config), and /etc/rc.local. Plus host
bridge state and dnsmasq leases. Writes to
run/logs/convergence-diagnostics.log.
Add phase_start/phase_end for per-phase timing and --debug / DEBUG=1
flag: surfaces SSH stderr, wraps each phase with timing, and dumps
a snapshot on early non-zero exit (success path snapshot at end of
main). --help shows usage.
Codex gpt-5-codex review: 5/5 confidence.
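Per-phase timing of the kind described here can be sketched with two small helpers. The names match the commit message, but the bodies, the `PHASE_NAME`/`PHASE_T0` variables, and the log format are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: the bodies and log format below are assumed, not taken from
# the PR.

# Record the phase name and its start time (seconds since the epoch).
phase_start() {
    PHASE_NAME=$1
    PHASE_T0=$(date +%s)
    echo "[phase] $PHASE_NAME: start"
}

# Report how long the most recently started phase took.
phase_end() {
    elapsed=$(( $(date +%s) - PHASE_T0 ))
    echo "[phase] $PHASE_NAME: done in ${elapsed}s"
}
```

A phase would then be wrapped as `phase_start "Phase 3"; ...; phase_end`, with whole-second resolution being adequate for timeouts measured in tens of seconds.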
…ation lime-config silently falls back to bmx7 when lime-proto-babeld is not installed, even if the babeld package itself is present. Add the lime-proto-babeld package so lime-config wires babeld as the primary mesh routing protocol (matching the babeld-first detection in configure-vms.sh and tests/qemu/common.sh). Codex gpt-5-codex review: 5/5 confidence.
Greptile Summary
This PR fixes LibreMesh mesh networking (babeld) in the QEMU testbed by replacing the old raw-
Confidence Score: 5/5
Safe to merge; the changes are self-contained testbed infrastructure and the logic is well-defended with || true guards, idempotent marker files, and retry-on-next-boot semantics. The core rc.local rewrite, post-lime-config SSH repair, and multicast-snooping fix are all sound. Failures are handled gracefully throughout (|| true, marker-based retry, collect_diagnostics on failure). The one finding, collect_diagnostics overwriting failure-time data in debug runs, is a minor observability gap that does not affect correctness or testbed operation. No files require special attention. The diagnostic overwrite in configure-vms.sh is the only actionable item and it does not affect production behavior.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant img as configure-source-image.sh
participant vm as QEMU VM (first boot)
participant procd as procd (PID 1)
participant cfg as configure-vms.sh
img->>vm: Inject S99testbed + rc.local + babeld UCI + SSH keys
img->>vm: Re-enable lime-config uci-defaults
Note over vm,procd: VM boots
procd->>procd: Run uci-defaults (91_lime-config) — creates VLAN batman-adv interfaces
procd->>procd: "Run S99testbed (last S*)"
procd->>vm: Execute rc.local
Note over vm: rc.local
vm->>vm: killall babeld / netifd
vm->>vm: Rewrite /etc/config/network to DHCP on br-lan
vm->>vm: Rewrite /etc/config/babeld to br-lan
vm->>vm: Start netifd (direct)
vm->>vm: Wait 30s for br-lan UP
vm->>vm: Start babeld -D -I /var/run/babeld.pid br-lan
vm->>vm: Start dropbear -R -B
vm->>vm: Set marker (only if br-lan UP + babeld listening)
cfg->>vm: Phase -1: detect/repair VM IPs
cfg->>vm: Phase 0: wait for SSH reachable
cfg->>vm: Phase 1: configure_vm
Note over cfg,vm: Post-lime verify and repair
cfg->>vm: Check br-lan IP correct?
vm-->>cfg: brlan_ip
cfg->>vm: if wrong — re-apply UCI static config + restart network
cfg->>vm: Check babeld listening?
cfg->>vm: if not — restart babeld via init.d or direct
cfg->>vm: start_mesh_daemon_on_vm (safety re-bind)
cfg->>vm: Phase 2: inject SSH keys
cfg->>vm: "Phase 3: convergence (120s, expected_peers=1 for babeld)"
cfg->>vm: Cross-node ping node[0] to node[1]
cfg->>cfg: collect_diagnostics on failure
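The bounded waits in the diagram (30s for br-lan UP, 120s for convergence) follow a common poll-until-deadline pattern. A minimal sketch, assuming a generic helper named `wait_until` (hypothetical, not a function from the PR):

```shell
#!/bin/sh
# Sketch only: wait_until is a hypothetical generic helper.

# Usage: wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or the deadline passes.
# Returns 0 on success, 1 on timeout. COMMAND always runs at least once.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
}
```

A caller could write `wait_until 30 1 br_lan_is_up` or `wait_until 120 5 mesh_has_peers`, where both check functions are placeholders for whatever probe the script actually uses.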
Reviews (2): Last reviewed commit: "test: temporary commit for greptile CLI ..."
…ip --debug from main args P1: _ssh_vm_try used 2>&1 in DEBUG mode, merging SSH stderr into stdout. This polluted command substitution captures (brlan_ip, babeld_ok, etc.) with diagnostic noise, breaking IP parsing and status detection. Now stderr flows to fd 2 (terminal) so callers get clean stdout regardless of DEBUG mode. P2: The --debug flag was parsed but never stripped from $@, so main "$@" received it as a positional argument. Currently harmless since main() ignores positional args, but this prevents future regressions. A flag-stripping loop now rebuilds $@ without recognized flags before calling main.
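Both fixes can be illustrated with a short sketch. `run_try` stands in for the SSH wrapper and `parse_flags` for the flag-stripping loop; both names and bodies are hypothetical and only show the pattern, not the actual script.

```shell
#!/bin/sh
# Sketch only: run_try and parse_flags are hypothetical stand-ins.
DEBUG=${DEBUG:-0}

# P1 pattern: never merge stderr into stdout. In debug mode stderr is left on
# fd 2 for the terminal; otherwise it is discarded. Either way, command
# substitution around run_try captures clean stdout only.
run_try() {
    if [ "$DEBUG" = "1" ]; then
        "$@"
    else
        "$@" 2>/dev/null
    fi
}

# P2 pattern: rebuild the positional parameters without recognized flags
# before handing them to main. Each iteration shifts the current argument off
# the front and re-appends it only if it is not a flag.
parse_flags() {
    for arg in "$@"; do
        shift
        case "$arg" in
            --debug) DEBUG=1 ;;
            *) set -- "$@" "$arg" ;;
        esac
    done
    main "$@"
}

main() {
    echo "main received $# positional argument(s)"
}
```

The `for` word list is expanded once before the loop starts, so modifying the positional parameters inside the loop body is safe and preserves the order of the remaining arguments.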
Addressed Greptile Review Feedback
Both review comments have been fixed in commit 04b5a22:
P1: DEBUG mode stderr polluting captured output
Fixed:
P2:
Summary
Fix LibreMesh mesh networking (babeld) to work between QEMU VMs in the testbed, while preserving DHCP IP assignment, SSH access, and test infrastructure. The fix is fully automated with no manual serial console steps.
Root cause
LibreMesh's lime-config runs at first boot and creates VLAN-tagged batman-adv interfaces (eth0_29, bat0) that don't exist in the QEMU testbed's plain-Ethernet bridge. The previous rc.local workaround (ip link set eth0 up + udhcpc) bypassed lime-config's network setup entirely, so babeld was either not running or running on non-existent VLAN interfaces.
Strategy
Let lime-config run (preserving LibreMesh services like thisnode.info and shared-state), then override the network layer in rc.local to use plain DHCP on br-lan and restart babeld on br-lan via its init script. This is the same wired-bridge approach the bare-OpenWrt path uses, so both image types converge to the same mesh topology.
Changes
- scripts/qemu/configure-source-image.sh: New rc.local that atomically replaces /etc/config/network and /etc/config/babeld (UCI/netifd, not raw ip commands), waits for br-lan state UP, starts babeld via init script (with direct-invocation fallback), and only sets the first-boot marker when both br-lan is UP and the mesh daemon is verified listening. Pre-creates /etc/config/babeld in the rootfs as defense-in-depth.
- scripts/qemu/start-mesh.sh: Disable multicast snooping on mesha-br0 (babeld uses multicast for neighbor discovery; snooping can suppress hellos).
- scripts/qemu/configure-vms.sh: Move mesh_proto detection to top of configure_vm; add post-lime-config verification that re-applies UCI network config and restarts babeld if lime-config clobbered them; unify start_mesh_daemon_on_vm to run for BOTH image paths; expand Phase 3 with cross-node ping and 120s timeout (up from 90s).
- scripts/qemu/configure-vms.sh: New collect_diagnostics that dumps per-node ip/ip link/ip route/UCI/daemon/logread/bridge state to run/logs/convergence-diagnostics.log on convergence failure or --debug; new phase_start/phase_end timing helpers; --debug/--help flags.
- tests/qemu/common.sh: count_mesh_neighbors now probes three signals (control socket dump, kernel routes proto babel, PID + UDP listener) and returns the max. New wait_for_mesh_ping for cross-node L3 reachability checks.
- scripts/qemu/libremesh-testbed.defconfig: Add lime-proto-babeld (lime-config silently falls back to bmx7 without it).
Verification
All Codex gpt-5-codex reviews reached 5/5 confidence (4-5 review rounds per task). Fast test suite passes (5 suites, 28 tests, 0 failures). Lab/adapter suites require an actual running QEMU lab and were not run in this CI environment.
Test plan
- bin/libremesh-lab build-image (or bash scripts/qemu/convert-prebuilt.sh for prebuilt)
- sudo bin/libremesh-lab start
- bin/libremesh-lab configure
- bin/libremesh-lab test (fast suite, 5/5 passes locally)
- bin/libremesh-lab test --suite lab (mesh convergence + multi-hop)
- MESHA_ROOT=/path/to/mesha bin/libremesh-lab test --suite adapter (mesha adapters)
Plan reference
See PLAN.md at the repo root for the full plan with root-cause analysis, implementation tasks, risk mitigations, and verification criteria.