fix(bootstrap): load kernel modules on install and fix Podman socket detection#24

Merged
maxamillion merged 4 commits into midstream from fix/rpm-modules-and-podman-socket on Apr 7, 2026
Conversation

@maxamillion

Summary

Fix two issues that cause gateway startup failures on Podman-only Fedora systems.

Issue 2: Kernel Module Loading in %post

The RPM spec ships a modules-load.d/openshell.conf file, but systemd-modules-load.service runs at boot — long before package installation. Modules are never loaded until reboot, causing gateway start to fail on fresh installs.

Additionally, br_netfilter (required by K3s for net.bridge.bridge-nf-call-iptables) was missing entirely.

Changes:

  • Add br_netfilter to modules-load.d/openshell.conf
  • Ship sysctl.d/99-openshell.conf with bridge netfilter settings
  • Add %post scriptlet that runs modprobe -a immediately + %sysctl_apply
  • Add sysctl config file to %files
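
For illustration, such a scriptlet could look like the following. This is a minimal sketch assuming the module and file names from the summary above; the exact merged spec may differ:

```spec
# modules-load.d only takes effect at boot, so load the modules
# immediately at install time and apply the shipped sysctl settings.
%post
/usr/sbin/modprobe -a br_netfilter 2>/dev/null || :
%sysctl_apply 99-openshell.conf
```

The `|| :` keeps a failed modprobe (for example, inside a container build where no modules can load) from aborting the package transaction.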

Issue 3: Podman Socket Detection

Multiple code paths call Docker::connect_with_local_defaults(), which hardcodes /var/run/docker.sock. On Podman-only systems without podman-docker, this fails with a "Socket not found" error.

The crate already has a runtime-aware docker::connect_local(runtime) function, but 7 call sites bypassed it.

Changes:

  • Add Recommends: podman-docker to RPM spec (belt-and-suspenders)
  • Add connect_local_auto() helper that auto-detects runtime and connects
  • Replace all 7 connect_with_local_defaults() calls with runtime-aware alternatives:
    • Sites with runtime in scope → docker::connect_local(runtime)
    • Sites with gateway name → metadata lookup for stored runtime, fallback to auto-detect
    • Sites with no context → docker::connect_local_auto()
    • Best-effort diagnostic → connect_local_auto() with graceful fallback
  • Remove unused bollard::Docker import from build.rs
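
The core of the auto-detection can be sketched as a socket-path probe. This is a hypothetical illustration, not the actual connect_local_auto() implementation: the real helper presumably hands the chosen path to a bollard client, and the path list here is an assumption based on common Podman/Docker defaults.

```rust
use std::path::PathBuf;

/// Candidate sockets, most specific first: the rootless Podman socket
/// under XDG_RUNTIME_DIR, the system Podman socket, then the classic
/// Docker socket that connect_with_local_defaults() hardcodes.
fn candidate_sockets(xdg_runtime_dir: Option<&str>) -> Vec<PathBuf> {
    let mut paths = Vec::new();
    if let Some(dir) = xdg_runtime_dir {
        paths.push(PathBuf::from(dir).join("podman/podman.sock"));
    }
    paths.push(PathBuf::from("/run/podman/podman.sock"));
    paths.push(PathBuf::from("/var/run/docker.sock"));
    paths
}

fn main() {
    // Pick the first socket that actually exists on this host.
    let xdg = std::env::var("XDG_RUNTIME_DIR").ok();
    let found = candidate_sockets(xdg.as_deref())
        .into_iter()
        .find(|p| p.exists());
    println!("{:?}", found);
}
```

Probing in this order means a rootless Podman user is matched before the system-wide sockets, which avoids silently defaulting to Docker.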

Related Issue

See plan: architecture/plans/fix-rpm-modules-and-podman-socket.md

Changes

| File | Change |
| --- | --- |
| openshell.spec | Add br_netfilter, sysctl config, %post scriptlet, Recommends: podman-docker |
| crates/openshell-bootstrap/src/docker.rs | Add connect_local_auto() helper |
| crates/openshell-bootstrap/src/lib.rs | Replace 5 connect_with_local_defaults() calls |
| crates/openshell-bootstrap/src/build.rs | Replace 2 connect_with_local_defaults() calls, remove unused import |

Testing

  • cargo check — full workspace, zero errors
  • cargo test -p openshell-bootstrap — 125/125 tests pass
  • cargo fmt --check — clean
  • Only remaining connect_with_local_defaults() is the legitimate one inside connect_local() for the Docker runtime path

Checklist

  • Follows Conventional Commits format
  • Code compiles without errors
  • All existing tests pass
  • No secrets or credentials committed
  • Changes scoped to the issue at hand

…detection

RPM spec:
- Add br_netfilter to modules-load.d config for K3s bridge netfilter
- Ship sysctl.d/99-openshell.conf with net.bridge.bridge-nf-call-iptables
- Add %post scriptlet to modprobe modules immediately (no reboot required)
- Add Recommends: podman-docker as belt-and-suspenders for socket compat

Podman socket detection:
- Add connect_local_auto() helper in docker.rs for auto-detecting runtime
- Replace all 7 Docker::connect_with_local_defaults() calls outside docker.rs
  with runtime-aware alternatives (connect_local, connect_local_auto, or
  metadata-based lookup with fallback)
- Remove unused bollard::Docker import from build.rs
@coderabbitai

coderabbitai bot commented Apr 7, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


…ctions

Add connect_for_gateway(name) helper that resolves the container runtime
from stored gateway metadata first, falling back to detect_runtime() with
full error propagation instead of silently defaulting to Docker.

Replace the duplicated inline metadata-detect-fallback blocks in
extract_and_store_pki and gateway_container_logs with the new helper.
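
The metadata-first resolution order this commit describes can be sketched as follows. The function name and signature are illustrative, not the actual connect_for_gateway() code; the point is that detection errors propagate instead of being swallowed by a Docker default:

```rust
/// Resolve the container runtime for a gateway: prefer the runtime
/// recorded in stored metadata, otherwise fall back to detection,
/// propagating any detection error to the caller.
fn resolve_runtime(
    stored: Option<&str>,
    detect: impl Fn() -> Result<String, String>,
) -> Result<String, String> {
    match stored {
        // Metadata wins: the gateway was created with this runtime.
        Some(rt) => Ok(rt.to_string()),
        // No metadata: detect, and surface the error rather than
        // silently defaulting to Docker.
        None => detect(),
    }
}

fn main() {
    let rt = resolve_runtime(Some("podman"), || Ok("docker".to_string()));
    println!("{:?}", rt);
}
```

Centralizing this in one helper is what lets the duplicated inline detect-fallback blocks in extract_and_store_pki and gateway_container_logs be removed.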
@cgwalters

Humm...I am skeptical of this "load kernel modules" stuff. First of all most of that stuff should be dynamically loaded already - it's an anti-pattern to eagerly load modules. Is something blocking the load?

@maxamillion
Author

@cgwalters I don't know if anything is blocking it, but it's definitely not loading dynamically. The usage of it happens inside the k3s container if that provides any insight as to what it might be.

@cgwalters

In order for logic to work on e.g. MacOS, kernel modules have to be loaded inside the podman machine VM. So anything that involves having the client tool (in this case, an RPM) manipulate the host system state is I think wrong.

I'm not an expert in the iptables bits, it's possible that the problem is k3s is trying to do iptables/nft from inside a privileged container, which would break dynamic module loading. A general fix with privileged containers like this is to have them fork off host level operations via systemd-run; that may be the simplest.

@cgwalters

In order for logic to work on e.g. MacOS, kernel modules have to be loaded inside the podman machine VM. So anything that involves having the client tool (in this case, an RPM) manipulate the host system state is I think wrong.

(followup since I know this was confusing) - For sure most people on Linux use podman without podman-machine, but it is architecturally valid to do so, and there are some use cases for it (albeit obscure).

But it's helpful to think of it this way - anything we do in shipping the client binary should I think work symmetrically across MacOS and Linux. And the client binary shouldn't have anything to do with kmods itself.

…es modules

When running under Podman, the k3s cluster now uses:
- Native nftables kube-proxy mode (--kube-proxy-arg=proxy-mode=nftables)
- Host DNS resolution instead of iptables DNAT proxy (Podman DNS is routable)
- Skipped iptables backend probe (unnecessary with nftables kube-proxy)

This eliminates the need for legacy iptables kernel modules (ip_tables,
iptable_nat, iptable_filter, iptable_mangle) on the host when using Podman.
The Docker path is completely unchanged — all new behavior is gated on
CONTAINER_RUNTIME=podman.

Container image: add nftables package (provides nft binary for kube-proxy).

RPM spec: modules-load.d now only loads br_netfilter (still required for
bridged pod traffic regardless of iptables/nftables). Remove podman-docker
recommends (no longer needed with native Podman socket detection and
nftables networking).
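
The runtime-gated flag selection described above might be assembled along these lines. Variable names here are assumptions for illustration; only the --kube-proxy-arg flag itself comes from the commit message:

```shell
# Gate the nftables kube-proxy mode on the Podman runtime; the Docker
# path is left untouched, matching the commit's description.
CONTAINER_RUNTIME="podman"
K3S_ARGS="server"
if [ "$CONTAINER_RUNTIME" = "podman" ]; then
  # Native nftables kube-proxy avoids needing the legacy ip_tables
  # module family (iptable_nat, iptable_filter, iptable_mangle).
  K3S_ARGS="$K3S_ARGS --kube-proxy-arg=proxy-mode=nftables"
fi
echo "$K3S_ARGS"
```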
@maxamillion
Author

@cgwalters

A general fix with privileged containers like this is to have them fork off host level operations via systemd-run; that may be the simplest.

Can you elaborate? It's not immediately obvious to me what you mean there. Thanks! :)

@cgwalters

Basically in a privileged container, bind mounting in the host /run/systemd then systemd-run works to completely escape the container and run arbitrary code on the host.
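
A dry-run illustration of that escape hatch: a privileged container that bind-mounts the host's /run/systemd can talk to the host's service manager, so systemd-run inside the container schedules a transient unit that executes on the host. The image name and unit command below are hypothetical, and echo is used so nothing actually runs:

```shell
# Hypothetical invocation, printed rather than executed. A privileged
# container with the host's /run/systemd bind-mounted can use
# systemd-run to perform host-level operations such as modprobe.
cmd="podman run --privileged -v /run/systemd:/run/systemd quay.io/example/tool \
  systemd-run --wait /usr/sbin/modprobe br_netfilter"
echo "$cmd"
```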

Add :dev tag to both gateway and cluster multi-arch manifests in the
midstream container build workflow. Local cargo builds default to the
dev tag (OPENSHELL_IMAGE_TAG is unset), so this ensures locally-built
CLI binaries can pull images from GHCR without needing to override
the tag. The dev and midstream tags are kept in sync — both point to
the same image built from the midstream branch on every merge.
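
The defaulting behavior this commit relies on can be shown in one line. The OPENSHELL_IMAGE_TAG variable name comes from the commit message; the expansion here illustrates the fallback, not the actual CLI code:

```shell
# When OPENSHELL_IMAGE_TAG is unset, the CLI falls back to the dev tag,
# which is why the workflow must also publish :dev alongside :midstream.
unset OPENSHELL_IMAGE_TAG
TAG="${OPENSHELL_IMAGE_TAG:-dev}"
echo "$TAG"
```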
@maxamillion
Author

@cgwalters I don't follow ... openshell doesn't bind mount in the host /run/systemd

@maxamillion
Author

@cgwalters I want to continue this discussion, but I'm going to merge this for now to unblock some other efforts.

@maxamillion maxamillion merged commit 4b67305 into midstream Apr 7, 2026
15 of 16 checks passed