Agent Diagnostic
Loaded the debug-openshell-cluster skill and investigated a live Fedora 44 host running the gateway as a rootless Podman container. Both the gateway and sandbox containers had exited ~21 hours prior.
Skills loaded: debug-openshell-cluster
Investigation steps and findings:
- `openshell status` → Connection refused on `http://127.0.0.1:8080`; the gateway container had exited (code 0, clean shutdown signal)
- `podman logs openshell-gateway` → sandbox containers were exiting immediately with code 1; the gateway log showed that sandbox JWT auth was not configured, plus a subuid/subgid warning
- Inspected the sandbox environment → `OPENSHELL_ENDPOINT=http://host.containers.internal:8080`
- Inspected the gateway container → port mapped to `127.0.0.1:8080` only. The sandbox sits on the `openshell` bridge (`10.89.1.0/24`) and resolves `host.containers.internal` to `10.89.1.1`, where nothing listens on port 8080 → Connection refused
- Recreated the gateway with `-p 0.0.0.0:8080:8080` → exited with Permission denied on the Podman socket
- The gateway image now runs as UID 1000:1000; without `--userns=keep-id`, container UID 1000 maps to host UID 100999, not to the socket owner `tt` (1000)
- Added `--userns=keep-id` → the DB volume files were owned by host UID 100999 (written by the old container without keep-id) → `unable to open database file`
- Fixed with `podman unshare chown -R 0:0` on the volume data → the gateway started but hit an SELinux AVC denial: `container_t` denied `write` on a `user_tmp_t` `sock_file`
- Added `--security-opt label=disable` → the gateway reached Podman driver init; the subuid warning fired (a false positive: the check runs inside the container, which does not have the host's `/etc/subuid` mounted, while the host does have valid entries)
- The gateway started, but sandbox creation failed with "create sandbox token directory failed": the gateway tries to write to `/app/.local/state/openshell/podman-sandbox-tokens/` inside the container, and that path is not writable by UID 1000 in the image
- Set `XDG_STATE_HOME=/run/user/1000/openshell-state` → the token file was written, but the Podman API returned 500: `statfs ... sandbox.jwt: no such file or directory`. The gateway passes the container-internal path to the Podman API, which resolves it on the host, so the paths diverge unless the bind mount's source and destination are identical
- Used a mirrored bind mount (`-v /run/user/1000/openshell-state:/run/user/1000/openshell-state`, host path == container path) → the gateway wrote the token, Podman found it, and the sandbox came up Ready
- Also required: an Ed25519 JWT signing key pair and a TOML config (`[openshell.gateway.gateway_jwt]` + `[openshell.gateway.auth] allow_unauthenticated_users = true`); neither is documented for the Podman-in-container case
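The UID numbers in the steps above follow from rootless Podman's default user-namespace mapping: container UID 0 is the invoking host user, and container UIDs 1 and up come from the user's subordinate range. The sketch below reproduces that arithmetic, assuming user `tt` (UID 1000) has the common range `tt:100000:65536` in `/etc/subuid`; the function name `rootless_host_uid` is hypothetical. With `--userns=keep-id`, Podman instead maps the host UID to the same UID inside the container, so container UID 1000 is host UID 1000.

```bash
#!/usr/bin/env bash
# Sketch of rootless Podman's default UID mapping (no --userns flag).
# Assumption: host user UID 1000 with /etc/subuid entry "tt:100000:65536".
# rootless_host_uid is a hypothetical helper, not part of Podman or OpenShell.
#   container UID 0      -> the invoking host user
#   container UID n >= 1 -> subuid_start + n - 1
rootless_host_uid() {
  local container_uid=$1 host_uid=$2 subuid_start=$3 subuid_count=$4
  if (( container_uid == 0 )); then
    echo "$host_uid"
  elif (( container_uid <= subuid_count )); then
    echo $(( subuid_start + container_uid - 1 ))
  else
    echo "unmapped"   # outside the delegated subordinate range
    return 1
  fi
}

# Gateway process (UID 1000) without --userns=keep-id:
rootless_host_uid 1000 1000 100000 65536   # prints 100999, not the socket owner 1000
# podman unshare uses the same mapping, so its UID 0 is host UID 1000:
rootless_host_uid 0 1000 100000 65536      # prints 1000
```

Because `podman unshare` uses the default mapping, `chown -R 0:0` inside it hands the volume files to host UID 1000, which `keep-id` then presents to the gateway as its own UID 1000.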
Description
I was following the official documentation at https://docs.nvidia.com/openshell/home. I'm happy to contribute improvements so that the next people using Fedora or SELinux don't run into these issues.
Actual behavior: Containerized Podman gateway fails to start or create sandboxes on Fedora 44 due to a cascade of issues: wrong port binding, UID/SELinux mismatches between container and host, missing sandbox JWT configuration, and a state-directory path mismatch between what the gateway writes and what the Podman API can find on the host.
Expected behavior: The documented podman run invocation for the gateway works out of the box on Fedora 44 with rootless Podman and SELinux enforcing, or the documentation covers each required flag and config step.
Reproduction Steps
- Fedora 44 host, rootless Podman, SELinux enforcing
- Run the gateway container as documented (or as shown in issue logs) with `-p 127.0.0.1:8080:8080` and no `--userns` or SELinux flags
- Attempt `openshell sandbox create`
Full working invocation (reached after debugging):
```bash
# Generate JWT signing keys (one-time)
mkdir -p ~/.config/openshell/jwt
openssl genpkey -algorithm ed25519 -out ~/.config/openshell/jwt/signing.pem
openssl pkey -in ~/.config/openshell/jwt/signing.pem -pubout -out ~/.config/openshell/jwt/public.pem
uuidgen > ~/.config/openshell/jwt/kid

# TOML config (key paths are as seen inside the container, under the /etc/openshell/jwt mount)
cat > ~/.config/openshell/gateway.toml <<'TOML'
[openshell.gateway.auth]
allow_unauthenticated_users = true

[openshell.gateway.gateway_jwt]
signing_key_path = "/etc/openshell/jwt/signing.pem"
public_key_path = "/etc/openshell/jwt/public.pem"
kid_path = "/etc/openshell/jwt/kid"
gateway_id = "openshell"
TOML

# State dir must be mirrored (same path on host and in container).
# /run/user/1000 assumes host UID 1000; it is a tmpfs, so recreate this dir after a reboot.
mkdir -p /run/user/1000/openshell-state

# Note: 0.0.0.0 publishes the gateway (TLS off, unauthenticated users allowed) on all host interfaces.
podman run -d --name openshell-gateway \
  --userns=keep-id \
  --security-opt label=disable \
  -p 0.0.0.0:8080:8080 \
  -v openshell-state:/var/openshell \
  -v /run/user/1000/podman/podman.sock:/var/run/podman.sock \
  -v ~/.config/openshell/jwt:/etc/openshell/jwt:ro \
  -v ~/.config/openshell/gateway.toml:/etc/openshell/gateway.toml:ro \
  -v /run/user/1000/openshell-state:/run/user/1000/openshell-state \
  -e OPENSHELL_DRIVERS=podman \
  -e OPENSHELL_PODMAN_SOCKET=/var/run/podman.sock \
  -e OPENSHELL_DB_URL=sqlite:/var/openshell/openshell.db \
  -e OPENSHELL_DISABLE_TLS=true \
  -e OPENSHELL_GATEWAY_CONFIG=/etc/openshell/gateway.toml \
  -e XDG_STATE_HOME=/run/user/1000/openshell-state \
  ghcr.io/nvidia/openshell/gateway:latest \
  --log-level debug --bind-address 0.0.0.0 --port 8080
```
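The mirrored state mount in the invocation above comes down to path translation across a bind mount. The hypothetical helpers below (`container_to_host_path`, `is_mirrored`) sketch that translation for a single `-v SRC:DST[:OPTS]` spec; they deliberately ignore named volumes such as `openshell-state:/var/openshell` and nested mounts. When SRC equals DST, the path the gateway hands to the host-side Podman API is already valid on the host.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: model how a path inside the container corresponds to a
# host path through one bind-mount spec "SRC:DST[:OPTS]". Named volumes and
# nested mounts are out of scope for this sketch.
container_to_host_path() {
  local spec=$1 container_path=$2
  local src=${spec%%:*}        # text before the first ':'
  local rest=${spec#*:}
  local dst=${rest%%:*}        # drop the optional ':OPTS'
  case "$container_path" in
    "$dst"|"$dst"/*) echo "${src}${container_path#"$dst"}" ;;
    *) return 1 ;;             # path is not under this mount
  esac
}

# Succeeds when host path == container path, i.e. the gateway can pass its own
# path straight to the Podman API.
is_mirrored() {
  local spec=$1
  local src=${spec%%:*} rest=${spec#*:}
  [ "$src" = "${rest%%:*}" ]
}

# Mirrored spec from the working invocation:
is_mirrored /run/user/1000/openshell-state:/run/user/1000/openshell-state && echo mirrored
# Hypothetical non-mirrored spec: the host sees a different path than the gateway sends.
container_to_host_path /home/tt/openshell-state:/run/user/1000/openshell-state \
  /run/user/1000/openshell-state/openshell/podman-sandbox-tokens
# prints /home/tt/openshell-state/openshell/podman-sandbox-tokens
```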
Root Causes
| # | Root cause | Required fix |
|---|---|---|
| 1 | Port bound to `127.0.0.1:8080`; the sandbox bridge can't reach host loopback | `-p 0.0.0.0:8080:8080` |
| 2 | Gateway image runs as UID 1000:1000; without `keep-id` it maps to host UID 100999, not the socket owner | `--userns=keep-id` |
| 3 | SELinux denies `container_t` write on the `user_tmp_t` Podman socket | `--security-opt label=disable` |
| 4 | After the image UID change, existing volume data is owned by host UID 100999, which `keep-id` then can't read | `podman unshare chown -R 0:0 <volume_data>` (upgrade path) |
| 5 | No sandbox JWT keys, so the gateway can't issue tokens to the sandbox supervisor | Generate an Ed25519 key pair, mount it, and add `[openshell.gateway.gateway_jwt]` to the TOML config |
| 6 | No user auth configured, so CLI calls are rejected | `[openshell.gateway.auth] allow_unauthenticated_users = true` |
| 7 | Gateway writes sandbox token files to a container-internal path, then passes that same path to the Podman API (which runs on the host), where the path does not exist | State dir must use a mirrored bind mount so host path == container path; set via `XDG_STATE_HOME` |
| 8 | Subuid/subgid warning fires falsely: the check reads `/etc/subuid` inside the container, not on the host | Suppress the warning, or base the check on host-side `podman system info` output |
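For root cause 8, a check that reads the host's files rather than the container's would avoid the false positive. The sketch below is a hypothetical host-side version of such a check, not OpenShell's code: `has_subid_range` looks for an `/etc/subuid` or `/etc/subgid` entry (format `NAME_OR_UID:START:COUNT`) for the user, requiring at least 65536 IDs, a threshold chosen here as an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical host-side subordinate-ID check. Run it on the host, not inside
# the gateway container. The 65536 default minimum is an assumption.
# Entries look like NAME_OR_UID:START:COUNT, e.g. "tt:100000:65536".
has_subid_range() {
  local file=$1 user=$2 uid=$3 min_count=${4:-65536}
  [ -r "$file" ] || return 1
  awk -F: -v u="$user" -v id="$uid" -v min="$min_count" '
    ($1 == u || $1 == id) && $3 + 0 >= min + 0 { found = 1 }
    END { exit !found }   # exit 0 only if a matching entry was seen
  ' "$file"
}

# Example on the host (user tt, UID 1000, as in this report):
# has_subid_range /etc/subuid tt 1000 && has_subid_range /etc/subgid tt 1000 \
#   && echo "subuid/subgid OK"
```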
Environment
- OS: Fedora 44
- Podman: rootless, SELinux enforcing
- OpenShell: 0.0.59
- Gateway deployment: containerized (`podman run ghcr.io/nvidia/openshell/gateway:latest`)
Logs
Gateway log (original failure: sandbox connection refused):
```
WARN openshell_server::compute: Sandbox failed to become ready sandbox_name=national-mammal reason=ContainerExited Container exited with code 1
```
Sandbox log (original failure):
```
Error: × Policy fetch failed after 5 attempts: failed to connect to OpenShell server
```
SELinux AVC denial (during debugging):
```
AVC avc: denied { write } for comm="openshell-gatew" name="podman.sock" dev="tmpfs"
scontext=system_u:system_r:container_t:s0:c487,c938
tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=sock_file permissive=0
```
State path error (during debugging):
```
Error: × status: Internal, message: "create sandbox failed: podman API error (500):
statfs /run/user/1000/openshell-state/openshell/podman-sandbox-tokens/.../sandbox.jwt:
no such file or directory"
```
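The SELinux types in an AVC record identify which policy pairing blocked the access. The hypothetical `avc_types` helper below only parses text and does not query SELinux; on a live host, `ausearch -m avc -ts recent` (from the audit package) lists such records.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the source and target SELinux types from an
# AVC record. A context is user:role:type:level, so the type is field 3.
avc_types() {
  local line=$1 scon tcon
  scon=$(grep -o 'scontext=[^ ]*' <<<"$line" | cut -d= -f2)
  tcon=$(grep -o 'tcontext=[^ ]*' <<<"$line" | cut -d= -f2)
  echo "$(cut -d: -f3 <<<"$scon") -> $(cut -d: -f3 <<<"$tcon")"
}

avc_types 'scontext=system_u:system_r:container_t:s0:c487,c938 tcontext=unconfined_u:object_r:user_tmp_t:s0 tclass=sock_file permissive=0'
# prints: container_t -> user_tmp_t
```

`container_t -> user_tmp_t` is the pairing that `--security-opt label=disable` sidesteps by turning off SELinux label separation for the gateway container.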