refactor(connd): refactor of tasks; modem self-healing; new OES events #1109

Merged
vmenge merged 7 commits into main from vm/connd-actors
Mar 25, 2026

@vmenge (Collaborator) commented Mar 24, 2026

this PR refactors task usage in `connd` to rely on `speare` for easier task management, free restart / backoff logic and a free broker for named channels between tasks.

it also publishes new data on the OES, and introduces modem self-healing

new

  • modem self-healing (powercycle whenever it is blacklisted by modem manager)
  • `fw_revision` field on `CellularStatus`
  • `CellularStatus` now published on OES
  • `ConndReport` and `ActiveConnections` both simplified to publish based on `net-state` event published internally on `speare` broker
  • `NetStats` now on OES
  • Datadog reporter now reporting usage for eth / wwan / wlan instead of only wlan
  • logging of number of wifi profiles on startup to help debug potential issues
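
As a rough illustration of the self-healing item above, the sketch below powercycles only when the modem is reported as blacklisted. `ModemHealth`, `PowerControl`, `heal_if_blacklisted` and `CountingPower` are hypothetical names made up for this example; they are not connd's actual types, and the real task is async.

```rust
/// Hypothetical summary of what the modem manager reports for the modem.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModemHealth {
    Healthy,
    Blacklisted,
}

/// Hypothetical abstraction over whatever can cut and restore modem power.
trait PowerControl {
    fn powercycle(&mut self) -> Result<(), String>;
}

/// Powercycles the modem only when it is blacklisted.
/// Returns Ok(true) if a powercycle was performed.
fn heal_if_blacklisted(
    health: ModemHealth,
    power: &mut impl PowerControl,
) -> Result<bool, String> {
    match health {
        ModemHealth::Healthy => Ok(false),
        ModemHealth::Blacklisted => {
            power.powercycle()?;
            Ok(true)
        }
    }
}

/// Test double that only counts how often a powercycle was requested.
#[derive(Default)]
struct CountingPower {
    cycles: u32,
}

impl PowerControl for CountingPower {
    fn powercycle(&mut self) -> Result<(), String> {
        self.cycles += 1;
        Ok(())
    }
}
```

Keeping the power control behind a trait is only a convenience for the sketch: it lets the decision logic be exercised without real hardware.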

tested on an orb

yes

do not merge

until #1094 is merged (soon, i just need to unclankerfy it)

@vmenge vmenge changed the title from "feat(connd): modem-self-healing; + refactor of tasks" to "refactor(connd): refactor of tasks; modem self-healing; new OES events" Mar 24, 2026
@vmenge vmenge marked this pull request as ready for review March 24, 2026 18:26
@vmenge vmenge requested a review from a team as a code owner March 24, 2026 18:26
@AlexKaravaev (Contributor) commented

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 20874ec307

Comment on lines +23 to +24
    while let Ok(net_state) = net_state_rx.recv_async().await {
        let _ = build_and_send_report(&ctx.nm, &ctx.resolved, net_state, &ctx.zsender)

[P1] Publish active-connections once at reporter startup

This reporter only reacts to `net-state` broker events and never emits an initial report on its own. In `reporters::spawn`, `net_state::report` is started before this task, so the initial state event can be published before this subscription is ready; if the connection state then stays stable, `oes/active_connections` is never sent for that boot. This creates a startup race where downstream consumers can remain stale indefinitely.
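
One way to close this kind of race, sketched here with `std::sync::mpsc` rather than the speare broker or the async receiver used in connd, is to emit one report from the current state before entering the event loop. `NetState` and `run_reporter` are hypothetical stand-ins, not the names used in the PR.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// Hypothetical, heavily simplified network state.
#[derive(Debug, Clone, PartialEq)]
struct NetState {
    connected: bool,
}

/// Forwards one report per state change, plus one report at startup.
///
/// `initial` is whatever the reporter can read for itself when it starts,
/// so a state event published before the subscription existed is not needed.
fn run_reporter(initial: NetState, events: Receiver<NetState>, reports: Sender<NetState>) {
    // Startup report: sent even if no event ever arrives on `events`.
    let _ = reports.send(initial);

    // Afterwards, one report per event until every sender has been dropped.
    for state in events {
        let _ = reports.send(state);
    }
}
```

If the state never changes after boot, consumers still receive exactly one report instead of none. The cost is a possible duplicate when the first event matches the startup state.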

@vmenge (Collaborator, Author) replied Mar 25, 2026

thanks for all the noise, clanker

when the apocalypse comes please remember i was the one who hated you the most

you can quote that sentence back to me as well

Comment on lines +81 to +85
    if let Err(e) = refresh_snapshot().await {
        error!("failed to refresh modem snapshot with err: {e}");
        error!("powercycling modem");

        let _ = powercycle_modem(ctx.mcu_util.as_ref(), &ctx.systemd)

[P1] Avoid powercycling on every snapshot refresh failure

Any error from `refresh_snapshot()` immediately triggers a modem power cycle here, but `take_snapshot()` fails for generic transient ModemManager/DBus issues (for example `list_modems`, `modem_info`, or `sim_info` errors), not just blacklist conditions. That means short-lived control-plane glitches now cause hard modem resets and connection drops, which is much more disruptive than retrying and can create repeated outage loops.
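
For illustration only, the retry-first alternative described here could look roughly like the sketch below, which escalates after several consecutive failures instead of on the first one. `FailureGate` and its threshold are hypothetical; the snippet under review powercycles on any refresh failure.

```rust
/// Counts consecutive failures and signals when to escalate.
struct FailureGate {
    threshold: u32,
    consecutive: u32,
}

impl FailureGate {
    fn new(threshold: u32) -> Self {
        Self { threshold, consecutive: 0 }
    }

    /// Records the outcome of one refresh attempt.
    /// Returns true when the caller should escalate (e.g. powercycle).
    fn record<T, E>(&mut self, result: &Result<T, E>) -> bool {
        match result {
            Ok(_) => {
                // Any success clears the streak.
                self.consecutive = 0;
                false
            }
            Err(_) => {
                self.consecutive += 1;
                if self.consecutive >= self.threshold {
                    // Start a fresh streak after escalating.
                    self.consecutive = 0;
                    true
                } else {
                    false
                }
            }
        }
    }
}
```

With a threshold of 3, two failures followed by a success never escalate, while three failures in a row do. Whether that trade-off suits a field device is a design decision and not something the sketch settles.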

@vmenge (Collaborator, Author) replied

this is by design

Comment on lines +147 to +149
    mm.set_current_bands(id, &ALLOWED_BANDS)
        .await
        .map_err(|e| eyre!("could not set modem bands: {e}"))?;

[P2] Restore preferred modem mode configuration during setup

The new setup path applies signal refresh and allowed bands, but it no longer applies allowed/preferred modes (`["3g", "4g"]` with preferred `"4g"`) that the previous startup logic configured. After modem replacement/power-cycle, `connd` now leaves mode selection at modem defaults, which can regress RAT behavior and throughput in the field.

@vmenge (Collaborator, Author) replied

go away

@vmenge vmenge enabled auto-merge (squash) March 25, 2026 11:37
@vmenge vmenge merged commit 9ff918e into main Mar 25, 2026
23 checks passed
@vmenge vmenge deleted the vm/connd-actors branch March 25, 2026 11:39
pophilpo pushed a commit that referenced this pull request Apr 2, 2026