fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols by johntmyers · Pull Request #718 · NVIDIA/OpenShell

johntmyers · 2026-04-01T01:24:39Z

Summary

Detect HTTP 101 Switching Protocols in the L7 relay and switch to raw bidirectional TCP instead of re-entering the HTTP parsing loop
Validate client sent Upgrade + Connection: Upgrade headers before accepting 101 (rejects unsolicited upgrades from non-compliant upstream servers)
Replace bool return type with RelayOutcome enum across the L7Provider trait and all relay functions for cleaner upgrade signaling

Related Issue

Fixes [Bug] Egress proxy fails to relay WebSocket frames after successful HTTP CONNECT + 101 Switching Protocols #652
Supersedes fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols #683 (carries the core fix with additional hardening and tests)
Related: [Bug] openshell-sandbox egress proxy kills WebSocket connections after ~2 minutes — Discord (and other WS-based plugins) cannot run inside sandbox NemoClaw#409

Changes

crates/openshell-sandbox/src/l7/provider.rs: Add RelayOutcome enum (Reusable/Consumed/Upgraded { overflow }). Update L7Provider::relay trait return type from Result<bool> to Result<RelayOutcome>.
crates/openshell-sandbox/src/l7/rest.rs: Detect 101 in relay_response() before the generic 1xx/bodiless handler. Return RelayOutcome directly (no intermediate tuple). Add client_requested_upgrade() validation in relay_http_request_with_resolver(). Update relay_response_to_client wrapper to return RelayOutcome.
crates/openshell-sandbox/src/l7/relay.rs: Add shared handle_upgrade() helper. Update relay_rest() and relay_passthrough_with_credentials() to match on RelayOutcome. Add l7_decision = "allow_upgrade" audit log annotation.
crates/openshell-sandbox/tests/websocket_upgrade.rs: Integration test spinning up a real WebSocket echo server, upgrading through L7Provider::relay, and exchanging a text frame bidirectionally.
crates/openshell-sandbox/Cargo.toml: Add tokio-tungstenite and futures dev-dependencies for integration tests.

Improvements over #683

Area	#683	This PR
Return type	`(bool, u16, Vec<u8>)` tuple	`RelayOutcome` enum directly from `relay_response()`
Trait consistency	`L7Provider::relay` still returned `bool`	Trait updated to `Result<RelayOutcome>`
Unsolicited 101	No validation	Validates client sent `Upgrade` + `Connection: Upgrade` headers
Code duplication	Duplicate upgrade handling in two relay paths	Shared `handle_upgrade()` helper
Audit logging	No upgrade-specific annotation	`l7_decision = "allow_upgrade"` for observability
Dead code	`relay_response_to_client` silently discarded overflow	Updated to return `RelayOutcome`
Integration test	Unit test only (relay_response in isolation)	Full WebSocket echo test through `L7Provider::relay`

Security Notes

Post-upgrade L7 bypass is inherent and documented: After copy_bidirectional, no further L7 inspection occurs. The initial HTTP request is policy-checked before the upstream ever sees it.
Unsolicited 101 rejection: If the upstream sends 101 but the client never requested an upgrade, the connection is killed (Consumed). Prevents a non-compliant upstream from triggering a policy bypass.
Overflow buffer is bounded: ~17 KiB max (header read buffer + one read call), immediately forwarded, then dropped.

Testing

mise run pre-commit passes (format fixed)
All 405 existing unit tests pass
8 new unit tests: 101 overflow capture, 101 no overflow, unsolicited 101 rejection, accepted 101 with upgrade headers, 4 client_requested_upgrade header parsing tests
2 new integration tests: WebSocket upgrade through L7 relay with echo exchange, normal HTTP regression test
cargo clippy clean (no new warnings)

Checklist

Follows Conventional Commits
Architecture docs updated (not applicable — no new components)
Tests added for new behavior

Detect 101 Switching Protocols in relay_response() and switch to raw bidirectional TCP relay instead of re-entering the HTTP parsing loop. Previously, is_bodiless_response() treated 101 as a generic 1xx informational response, forwarding only the headers and returning to the HTTP parsing loop. After a 101, subsequent bytes are upgraded protocol frames (e.g. WebSocket), not HTTP — causing the relay to block or silently drop all post-upgrade traffic. Changes: - Add RelayOutcome enum (Reusable/Consumed/Upgraded) replacing bool return type across L7Provider::relay trait and all relay functions - Detect 101 before generic 1xx handler in relay_response(), capture overflow bytes, return RelayOutcome::Upgraded - Validate client sent Upgrade + Connection: Upgrade headers before accepting 101 (rejects unsolicited upgrades from non-compliant upstream servers) - Extract shared handle_upgrade() helper used by both relay_rest() and relay_passthrough_with_credentials() - Add l7_decision=allow_upgrade audit log annotation for upgrades - Add unit tests for 101 overflow capture, unsolicited 101 rejection, and client_requested_upgrade header validation - Add integration test: WebSocket echo through L7Provider::relay Fixes: #652

davidpeden3 · 2026-04-01T14:48:52Z

tested pr #718 on docker desktop + wsl2 (windows 11, amd64, openshell v0.0.16). clean before/after results confirm the fix works.

setup:

icingdeath: windows 11, docker desktop 29.3.1, wsl2 kernel 6.6.87.2
openclaw sandbox connecting to gateway at MY_GATEWAY_IP:18789
test: socat proxy tunnel inside sandbox netns → raw python websocket upgrade + frame exchange
policy: openclaw_gateway allowing socat → MY_GATEWAY_IP:18789

before (stock binary, v0.0.16):
binary md5: 6ed90711e33ec487abf39e32dea43a60

[1/4] connecting to socat tunnel at 127.0.0.1:18790...
[2/4] sending websocket upgrade request...
    upgrade response: http/1.1 101 switching protocols
[3/4] sending websocket text frame ('hello')...
[4/4] waiting for websocket echo response...
[fail] timeout — websocket frames not relayed after 101 (issue #652 confirmed)

after (pr #718 fix/websocket-101-upgrade-relay-v2):
binary md5: 99be46a66d97a8012fed6e36d16a1693

[1/4] connecting to socat tunnel at 127.0.0.1:18790...
[2/4] sending websocket upgrade request...
    upgrade response: http/1.1 101 switching protocols
[3/4] sending websocket text frame ('hello')...
[4/4] waiting for websocket echo response...
    received: opcode=8 length=2 payload='...'
[pass] websocket frame received

proxy logs (after):

debug openshell_sandbox::proxy: evaluate_opa_tcp total: 3ms host=MY_GATEWAY_IP port=18789
debug openshell_sandbox::proxy: handle_tcp_connection dns_resolve_and_tcp_connect: 4ms host=MY_GATEWAY_IP
debug openshell_sandbox::l7::rest: relay_response framing
debug openshell_sandbox::l7::rest: 101 switching protocols — signaling protocol upgrade

the 101 is correctly detected in relay_response(), relayoutcome::upgraded is returned, and handle_upgrade() switches to copy_bidirectional for raw frame relay. the gateway responded with a close frame, confirming end-to-end bidirectional websocket communication through the proxy.

note on pr #720 interaction:

we also merged feat/ocsf-log-integration (#720) into the same build to test the new ocsf structured logging alongside the websocket fix. there's a minor merge conflict: handle_upgrade() in relay.rs uses info!() (from #718), but #720 removes info from the tracing imports in that file in favor of ocsf events. we resolved it by converting the handle_upgrade() log to a networkactivitybuilder ocsf event at severityid::informational (preserving the intended log level from #718).

depending on merge order:

if feat(sandbox): integrate OCSF structured logging for sandbox events #720 lands first: fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols #718 will fail to compile (info! macro not in scope) — the handle_upgrade() log needs to be converted to an ocsf event.
if fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols #718 lands first: feat(sandbox): integrate OCSF structured logging for sandbox events #720 should update handle_upgrade() logging as part of its migration sweep.

either way it's a one-liner, just flagging it so it doesn't surprise anyone.

mjamiv · 2026-04-01T15:39:39Z

Tested this against a live Slack Socket Mode deployment running OpenClaw inside an OpenShell sandbox. Different setup than @davidpeden3's socat echo test — this exercises the full application path including bidirectional file transfers.

Environment

Ubuntu 22.04, Hetzner, x86_64
OpenShell 0.0.16 cluster container (unchanged)
Sandbox binary built from this branch, swapped in via overlay remount
OpenClaw 2026.3.31 with Slack Socket Mode (@slack/socket-mode + ws)
All traffic through CONNECT proxy at 10.200.0.1:3128

How we tested

We run two sandboxes under the same gateway — agent (production, stock 0.0.16 binary) and agent2 (test, this PR). Side-by-side, no risk to production.

Built openshell-sandbox from this branch
Swapped binary in agent2 pod only (remount overlay rw, cp, restart pod)
Confirmed Slack connects with our existing tls: skip workaround (baseline)
Removed tls: skip from the slack_websocket policy — on 0.0.16 this kills the connection
Restarted gateway to force fresh WebSocket through the L7 path

Results

Everything works:

Slack Socket Mode connects without tls: skip — L7 proxy handles 101 upgrade correctly
Inbound/outbound messages normal
Inbound file processing (image sent in Slack → downloaded through proxy → vision model) — works
Outbound file delivery (image generated → uploaded to Slack via *.slack-files.com) — works
No frame corruption, socket hang ups, or connection drops during testing
Production sandbox on 0.0.16 completely unaffected throughout

The working policy (for anyone else testing)

slack_websocket:
  name: slack_websocket
  endpoints:
    - host: wss-primary.slack.com
      port: 443
      # tls: skip not needed with this fix
      enforcement: enforce
      access: full
    - host: wss-backup.slack.com
      port: 443
      enforcement: enforce
      access: full
  binaries:
    - { path: /usr/bin/node }
    - { path: /usr/bin/curl }

Bottom line

This fixes #652 for us. On 0.0.16 the same config fails because is_bodiless_response() treats 101 as generic 1xx and re-enters the HTTP parser. With this PR, the transition to raw bidirectional TCP works cleanly. We can drop our tls: skip split-policy workaround once this ships.

Nice work @johntmyers — appreciate the quick turnaround. We didn't test the #720 interaction — only learned about it from @davidpeden3's comment. Our build was this branch only, no other PRs merged in.

PR #718 added two log calls for WebSocket upgrade handling: - 101 Switching Protocols info → NetworkActivity with Upgrade activity. This is a significant state change (L7 enforcement drops to raw relay). - Unsolicited 101 without client Upgrade header → DetectionFinding with High severity. A non-compliant upstream sending 101 without a client Upgrade request could be attempting to bypass L7 inspection.

…720) * feat(sandbox): integrate OCSF structured logging for all sandbox events WIP: Replace ad-hoc tracing calls with OCSF event builders across all sandbox subsystems (network, SSH, process, filesystem, config, lifecycle). - Register ocsf_logging_enabled setting (defaults false) - Replace stdout/file fmt layers with OcsfShorthandLayer - Add conditional OcsfJsonlLayer for /var/log/openshell-ocsf.log - Update LogPushLayer to extract OCSF shorthand for gRPC push - Migrate ~106 log sites to OCSF builders (NetworkActivity, HttpActivity, SshActivity, ProcessActivity, DetectionFinding, ConfigStateChange, AppLifecycle) - Add openshell-ocsf to all Docker build contexts * fix(scripts): attach provider to all smoke test phases to avoid rate limits GitHub's unauthenticated API rate limit (60/hour) causes flaky 403s for Phases 1, 2, and 4. Fix by attaching the provider to all sandboxes and upgrading the Phase 1 policy to L7 so credential injection works. Phase 4 (tls:skip) cannot inject credentials by design, so relax the assertion to accept either 200 or 403 from upstream -- both prove the proxy forwarded the request. * fix(ocsf): remove timestamp from shorthand format to avoid double-timestamp The display layer (gateway logs, TUI, sandbox logs CLI) already prepends a timestamp. Having one in the shorthand output too produces redundant double-timestamps like: 15:49:11 sandbox INFO 15:49:11.649 I NET:OPEN ALLOWED ... Now the shorthand is just the severity + structured content: 15:49:11 sandbox INFO I NET:OPEN ALLOWED ... * refactor(ocsf): replace single-char severity with bracketed labels Replace cryptic single-character severity codes (I/L/M/H/C/F) with readable bracketed labels: [LOW], [MED], [HIGH], [CRIT], [FATAL]. Informational severity (the happy-path default) is omitted entirely to keep normal log output clean and avoid redundancy with the tracing-level INFO that the display layer already provides. Before: sandbox INFO I NET:OPEN ALLOWED ... After: sandbox INFO NET:OPEN ALLOWED ... Before: sandbox INFO M NET:OPEN DENIED ... After: sandbox INFO [MED] NET:OPEN DENIED ... * feat(sandbox): use OCSF level label for structured events in log push Set the level field to 'OCSF' instead of 'INFO' for OCSF events in the gRPC log push. This visually distinguishes structured OCSF events from plain tracing output in the TUI and CLI sandbox logs: sandbox OCSF NET:OPEN [INFO] ALLOWED python3(42) -> api.example.com:443 sandbox OCSF NET:OPEN [MED] DENIED python3(42) -> blocked.com:443 sandbox INFO Fetching sandbox policy via gRPC * fix(sandbox): convert new Landlock path-skip warning to OCSF PR #677 added a warn!() for inaccessible Landlock paths in best-effort mode. Convert to ConfigStateChangeBuilder with degraded state so it flows through the OCSF shorthand format consistently. * fix(sandbox): use rolling appender for OCSF JSONL file Match the main openshell.log rotation mechanics (daily, 3 files max) instead of a single unbounded append-only file. Prevents disk exhaustion when ocsf_logging_enabled is left on in long-running sandboxes. * fix(sandbox): address reviewer warnings for OCSF integration W1: Remove redundant 'OCSF' prefix from shorthand file layer — the class name (NET:OPEN, HTTP:GET) already identifies structured events and the LogPushLayer separately sets the level field. W2: Log a debug message when OCSF_CTX.set() is called a second time instead of silently discarding via let _. W3: Document the boundary between OCSF-migrated events and intentionally plain tracing calls (DEBUG/TRACE, transient, internal plumbing). W4: Migrate remaining iptables LOG rule failure warnings in netns.rs (IPv4 TCP/UDP, IPv6 TCP/UDP) to ConfigStateChangeBuilder for consistency with the IPv4 bypass rule failure already migrated. W5: Migrate malformed inference request warn to NetworkActivity with ActivityId::Refuse and SeverityId::Medium. W6: Use Medium severity for L7 deny decisions (both CONNECT tunnel and FORWARD proxy paths) to match the CONNECT deny severity pattern. Allows and audits remain Informational. * refactor(sandbox): rename ocsf_logging_enabled to ocsf_json_enabled The shorthand logs are already OCSF-structured events. The setting specifically controls the JSONL file export, so the name should reflect that: ocsf_json_enabled. * fix(ocsf): add timestamps to shorthand file layer output The OcsfShorthandLayer writes directly to the log file with no outer display layer to supply timestamps. Add a UTC timestamp prefix to every line so the file output matches what tracing::fmt used to provide. Before: CONFIG:VALIDATED [INFO] Validated 'sandbox' user exists in image After: 2026-04-01T15:49:11.649Z CONFIG:VALIDATED [INFO] Validated ... * fix(docker): touch openshell-ocsf source to invalidate cargo cache The supervisor-workspace stage touches sandbox and core sources to force recompilation over the rust-deps dummy stubs, but openshell-ocsf was missing. This caused the Docker cargo cache to use stale ocsf objects from the deps stage, preventing changes to the ocsf crate (like the timestamp fix) from appearing in the final binary. Also adds a shorthand layer test verifying timestamp output, and drafts the observability docs section. * fix(ocsf): add OCSF level prefix to file layer shorthand output Without a level prefix, OCSF events in the log file have no visual anchor at the position where standard tracing lines show INFO/WARN. This makes scanning the file harder since the eye has nothing consistent to lock onto after the timestamp. Before: 2026-04-01T04:04:13.065Z CONFIG:DISCOVERY [INFO] ... After: 2026-04-01T04:04:13.065Z OCSF CONFIG:DISCOVERY [INFO] ... * fix(ocsf): clean up shorthand formatting for listen and SSH events - Fix double space in NET:LISTEN, SSH:LISTEN, and other events where action is empty (e.g., 'NET:LISTEN [INFO] 10.200.0.1' -> 'NET:LISTEN [INFO] 10.200.0.1') - Add listen address to SSH:LISTEN event (was empty) - Downgrade SSH handshake intermediate steps (reading preface, verifying) from OCSF events to debug!() traces. Only the final verdict (accepted/denied) is an OCSF event now, reducing noise from 3 events to 1 per SSH connection. - Apply same spacing fix to HTTP shorthand for consistency. * docs(observability): update examples with OCSF prefix and formatting fixes Align doc examples with the deployed output: - Add OCSF level prefix to all shorthand examples in the log file - Show mixed OCSF + standard tracing in the file format section - Update listen events (no double space, SSH includes address) - Show one SSH:OPEN per connection instead of three - Update grep patterns to use 'OCSF NET:' etc. * docs(agents): add OCSF logging guidance to AGENTS.md Add a Sandbox Logging (OCSF) section to AGENTS.md so agents have in-context guidance for deciding whether new log emissions should use OCSF structured logging or plain tracing. Covers event class selection, severity guidelines, builder API usage, dual-emit pattern for security findings, and the no-secrets rule. Also adds openshell-ocsf to the Architecture Overview table. * fix: remove workflow files accidentally included during rebase These files were already merged to main in separate PRs. They got pulled into our branch during rebase conflict resolution for the deleted docs-preview-pr.yml file. * docs(observability): use sandbox connect instead of raw SSH Users access sandboxes via 'openshell sandbox connect', not direct SSH. * fix(docs): correct settings CLI syntax in OCSF JSON export page The settings CLI requires --key and --value named flags, not positional arguments. Also fix the per-sandbox form: the sandbox name is a positional argument, not a --sandbox flag. * fix(e2e): update log assertions for OCSF shorthand format The E2E tests asserted on the old tracing::fmt key=value format (action=allow, l7_decision=audit, FORWARD, L7_REQUEST, always-blocked). Update to match the new OCSF shorthand (ALLOWED/DENIED, HTTP:, NET:, engine:ssrf, policy:). * feat(sandbox): convert WebSocket upgrade log calls to OCSF PR #718 added two log calls for WebSocket upgrade handling: - 101 Switching Protocols info → NetworkActivity with Upgrade activity. This is a significant state change (L7 enforcement drops to raw relay). - Unsolicited 101 without client Upgrade header → DetectionFinding with High severity. A non-compliant upstream sending 101 without a client Upgrade request could be attempting to bypass L7 inspection.

johntmyers requested a review from a team as a code owner April 1, 2026 01:24

johntmyers mentioned this pull request Apr 1, 2026

[Bug] Egress proxy fails to relay WebSocket frames after successful HTTP CONNECT + 101 Switching Protocols #652

Closed

3 tasks

johntmyers self-assigned this Apr 1, 2026

johntmyers added test:e2e Requires end-to-end coverage labels Apr 1, 2026

johntmyers mentioned this pull request Apr 1, 2026

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols #683

Closed

6 tasks

johntmyers mentioned this pull request Apr 2, 2026

Transparent proxy blocks WebSocket and inconsistently returns 403 for policy-allowed domains #728

Closed

3 tasks

drew approved these changes Apr 2, 2026

View reviewed changes

johntmyers merged commit c6f3087 into main Apr 2, 2026
16 checks passed

johntmyers deleted the fix/websocket-101-upgrade-relay-v2 branch April 2, 2026 04:39

mjamiv mentioned this pull request Apr 6, 2026

CONNECT tunnel with access: full + tls: skip still causes 405 from upstream (WhatsApp/Baileys) #760

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols#718

fix(sandbox): relay WebSocket frames after HTTP 101 Switching Protocols#718
johntmyers merged 1 commit intomainfrom
fix/websocket-101-upgrade-relay-v2

johntmyers commented Apr 1, 2026

Uh oh!

davidpeden3 commented Apr 1, 2026 •

edited

Loading

Uh oh!

mjamiv commented Apr 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

johntmyers commented Apr 1, 2026

Summary

Related Issue

Changes

Improvements over #683

Security Notes

Testing

Checklist

Uh oh!

davidpeden3 commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mjamiv commented Apr 1, 2026

Environment

How we tested

Results

The working policy (for anyone else testing)

Bottom line

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

davidpeden3 commented Apr 1, 2026 •

edited

Loading