feat(observability): investigate portable sandbox log collection #1922

@TaylorMutch

Problem Statement

Operators need a durable way to collect sandbox agent and supervisor logs, especially on Kubernetes. Today openshell logs is useful for interactive diagnosis, but it is backed by a bounded in-memory gateway buffer and is not sufficient as the production log export path. OpenShell already writes the complete sandbox log record to files inside the sandbox, so the investigation should look for the simplest architecture that lets operators collect those files across compute drivers.

This is a focused follow-up to #1055 and was motivated by the OCSF JSONL validation work in #1917 / #1921.

Technical Context

OpenShell currently has two sandbox log paths:

  • The supervisor process writes local files inside the sandbox, including the shorthand log and opt-in OCSF JSONL file.
  • A tracing layer pushes sandbox log lines over gRPC to the gateway, where they are kept in a per-sandbox in-memory tail buffer for openshell logs, the TUI, and watch streams.

The file-backed path is the complete record. The gRPC push path is best-effort and volatile. On Kubernetes, operators naturally expect cluster log collection via pod logs, sidecars, daemonsets, or OpenTelemetry collectors, but the OpenShell files live inside the agent container filesystem and are not directly exposed to kubectl logs or standard file collectors unless the pod has an appropriate shared volume or sidecar.

Affected Components

| Component | Key Files | Role |
|---|---|---|
| Public API / CLI log access | proto/openshell.proto, crates/openshell-cli/src/run.rs | Defines GetSandboxLogs, PushSandboxLogs, WatchSandbox, and implements openshell logs. |
| Gateway log buffer | crates/openshell-server/src/tracing_bus.rs, crates/openshell-server/src/grpc/policy.rs | Stores pushed logs in memory and serves tail/watch requests. |
| Sandbox logging | crates/openshell-sandbox/src/main.rs, crates/openshell-supervisor-process/src/log_push.rs, crates/openshell-ocsf/src/tracing_layers/jsonl_layer.rs | Writes local files, emits OCSF shorthand/JSONL, and pushes best-effort lines to the gateway. |
| Settings | crates/openshell-core/src/settings.rs | Defines ocsf_json_enabled, which controls full OCSF JSONL output. |
| Compute-driver contract | proto/compute_driver.proto | Driver-neutral contract currently has lifecycle/status operations but no log collection/export contract. |
| Kubernetes driver and Helm chart | crates/openshell-driver-kubernetes/src/driver.rs, crates/openshell-driver-kubernetes/src/config.rs, deploy/helm/openshell/values.yaml | Builds sandbox pod specs and is the natural place for sidecar/shared-volume support. |
| Docker / Podman drivers | crates/openshell-driver-docker/src/lib.rs, crates/openshell-driver-podman/src/container.rs, crates/openshell-core/src/driver_mounts.rs | Already support validated driver-owned mounts that could support file-based log collection on local drivers. |
| Docs / architecture | docs/observability/accessing-logs.mdx, docs/observability/ocsf-json-export.mdx, architecture/sandbox.md, architecture/gateway.md | Document that gateway log storage is bounded/volatile and files are durable. |

Technical Investigation

Architecture Overview

openshell logs resolves a sandbox name to an ID and then either calls WatchSandbox for tailing or GetSandboxLogs for one-shot reads. Both paths read from gateway-side streams, not from the sandbox filesystem. The gateway receives sandbox-originated logs through PushSandboxLogs, normalizes their source to sandbox, and publishes them into TracingLogBus.

TracingLogBus maintains a broadcast channel and a per-sandbox VecDeque tail buffer. The default tail cap is 2000 lines. The buffer is process-local memory and is dropped on gateway restart, sandbox deletion, or gateway rotation.
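
The bounded tail behavior described above can be illustrated with a minimal sketch. This is not the TracingLogBus implementation: the `TailBuffers` type and its methods are hypothetical names, the broadcast channel is omitted, and only the per-sandbox VecDeque with a fixed cap is modeled.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for a per-sandbox, in-memory tail buffer.
/// Names and layout are assumptions, not the gateway's actual types.
pub struct TailBuffers {
    cap: usize,
    tails: HashMap<String, VecDeque<String>>,
}

impl TailBuffers {
    pub fn new(cap: usize) -> Self {
        Self { cap, tails: HashMap::new() }
    }

    /// Append a line, evicting the oldest line once the cap is reached.
    pub fn push(&mut self, sandbox_id: &str, line: String) {
        if self.cap == 0 {
            return;
        }
        let tail = self.tails.entry(sandbox_id.to_string()).or_default();
        if tail.len() == self.cap {
            tail.pop_front();
        }
        tail.push_back(line);
    }

    /// Return up to `n` of the most recent lines, oldest first.
    pub fn tail(&self, sandbox_id: &str, n: usize) -> Vec<String> {
        match self.tails.get(sandbox_id) {
            Some(tail) => {
                let skip = tail.len().saturating_sub(n);
                tail.iter().skip(skip).cloned().collect()
            }
            None => Vec::new(),
        }
    }

    /// Drop a sandbox's buffer, as would happen on sandbox deletion.
    pub fn remove(&mut self, sandbox_id: &str) {
        self.tails.remove(sandbox_id);
    }
}
```

With a cap of 2000, as in the default above, line 2001 evicts line 1. Because the map lives only in process memory, nothing in this structure survives a restart, which is why it cannot serve as an archival path.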

Inside the sandbox, tracing is layered so the shorthand formatter writes to stderr and the local file appender, while the JSONL layer writes OCSF JSONL when enabled. The same subscriber also installs the best-effort LogPushLayer. That push layer uses bounded channels, try_send, batching, reconnect, and backoff. Its behavior is intentionally non-blocking, so it can drop logs when the sandbox is under pressure or disconnected.
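
To make the drop-under-pressure behavior concrete, here is a minimal sketch of a non-blocking send over a bounded channel. It uses the standard library's `sync_channel` as a stand-in; the actual LogPushLayer may use a different channel type, and `BestEffortSender` with its `offer` and `dropped` methods are hypothetical names introduced for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical best-effort sender: never blocks the caller, and
/// counts lines that could not be queued.
pub struct BestEffortSender {
    tx: SyncSender<String>,
    dropped: u64,
}

impl BestEffortSender {
    /// Create a sender with a bounded queue of `capacity` lines.
    pub fn new(capacity: usize) -> (Self, Receiver<String>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx, dropped: 0 }, rx)
    }

    /// Try to queue a line. Returns false (and counts a drop) when the
    /// queue is full or the receiver has gone away.
    pub fn offer(&mut self, line: String) -> bool {
        match self.tx.try_send(line) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                self.dropped += 1;
                false
            }
        }
    }

    /// Number of lines dropped so far.
    pub fn dropped(&self) -> u64 {
        self.dropped
    }
}
```

The design choice is that the producer pays a constant cost per log line regardless of gateway health. The trade-off is exactly the one noted above: lines are lost silently on this path, so the file-backed path has to remain the complete record.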

Kubernetes sandbox pods are built by the Kubernetes driver from a generated pod template. The driver injects a single agent container, required volumes for TLS/bootstrap identity, optional SPIFFE mounts, supervisor side-loading, and workspace persistence. Current Kubernetes platform_config support is intentionally narrow and does not expose arbitrary sidecars or arbitrary shared log mounts. The Helm chart exposes gateway and sandbox driver settings, but no log collector configuration today.

Docker and Podman already have validated driver mount configuration. They can map named volumes, bind mounts, and tmpfs/image mounts depending on driver support and safety settings. This makes file-based log collection plausible outside Kubernetes, but there is no OpenShell-owned log directory/mount convention yet.

Code References

| Location | Description |
|---|---|
| proto/openshell.proto:156 | Public API defines GetSandboxLogs as the one-shot recent log fetch path. |
| proto/openshell.proto:159 | Public API defines PushSandboxLogs as the supervisor-to-gateway client-streaming path. |
| proto/openshell.proto:768 | WatchSandboxRequest.follow_logs streams gateway-correlated logs. |
| proto/openshell.proto:1329 | GetSandboxLogsRequest supports tail count, time, source, and level filters only. |
| crates/openshell-cli/src/run.rs:7211 | sandbox_logs resolves sandbox identity and calls gateway log APIs. |
| crates/openshell-cli/src/run.rs:7255 | Tail mode uses WatchSandbox. |
| crates/openshell-cli/src/run.rs:7283 | One-shot mode uses GetSandboxLogs. |
| crates/openshell-server/src/tracing_bus.rs:17 | Gateway log bus is keyed by sandbox ID. |
| crates/openshell-server/src/tracing_bus.rs:88 | Tail reads from the in-memory buffer. |
| crates/openshell-server/src/tracing_bus.rs:114 | Default tail capacity is 2000 lines per sandbox. |
| crates/openshell-server/src/grpc/policy.rs:2075 | handle_get_sandbox_logs serves the gateway buffer. |
| crates/openshell-server/src/grpc/policy.rs:2114 | handle_push_sandbox_logs ingests supervisor-pushed log batches. |
| crates/openshell-supervisor-process/src/log_push.rs:17 | LogPushLayer captures tracing events for gateway push. |
| crates/openshell-supervisor-process/src/log_push.rs:87 | Log push is best-effort and drops if the channel is full. |
| crates/openshell-supervisor-process/src/log_push.rs:97 | Background task batches and streams logs to the gateway. |
| crates/openshell-sandbox/src/main.rs:276 | Sandbox sets up local file logging guards. |
| crates/openshell-sandbox/src/main.rs:279 | OCSF JSONL file appender writes openshell-ocsf*.log under /var/log. |
| crates/openshell-sandbox/src/main.rs:299 | Tracing subscriber combines shorthand file output, JSONL output, and gateway push. |
| crates/openshell-ocsf/src/tracing_layers/jsonl_layer.rs:21 | JSONL output is gated by a runtime-enabled flag. |
| crates/openshell-core/src/settings.rs:118 | ocsf_json_enabled is documented as writing /var/log/openshell-ocsf*.log. |
| proto/compute_driver.proto:18 | Compute driver service has lifecycle/status operations but no log export RPC. |
| proto/compute_driver.proto:117 | DriverSandboxTemplate.platform_config is the platform-specific escape hatch. |
| crates/openshell-driver-kubernetes/src/driver.rs:1249 | Kubernetes driver builds the sandbox pod template. |
| crates/openshell-driver-kubernetes/src/driver.rs:1418 | Kubernetes driver assembles the agent container volume mounts. |
| crates/openshell-driver-kubernetes/src/driver.rs:1454 | Kubernetes driver assembles pod volumes. |
| crates/openshell-driver-kubernetes/src/driver.rs:1515 | Kubernetes workspace persistence uses an injected PVC/init-container pattern. |
| crates/openshell-driver-kubernetes/src/driver.rs:1525 | Kubernetes pod driver config currently merges node selector, priority class, and tolerations. |
| crates/openshell-driver-kubernetes/src/config.rs:161 | Kubernetes compute config schema has no log collector fields today. |
| deploy/helm/openshell/values.yaml:147 | Helm server values expose sandbox driver settings but no log collection options. |
| crates/openshell-driver-docker/src/lib.rs:1656 | Docker driver parses and validates mount config. |
| crates/openshell-driver-docker/src/lib.rs:2142 | Docker driver builds the container spec and host config. |
| crates/openshell-driver-podman/src/container.rs:787 | Podman driver builds the container spec. |
| crates/openshell-driver-podman/src/container.rs:810 | Podman driver already creates a named /sandbox workspace volume. |
| crates/openshell-core/src/driver_mounts.rs:56 | Shared mount target validation does not currently reserve /var/log. |
| docs/observability/accessing-logs.mdx:36 | Docs state that gateway log storage is bounded and not persisted. |
| docs/observability/accessing-logs.mdx:57 | Docs state that files contain the complete record and the push channel can drop events. |
| docs/observability/ocsf-json-export.mdx:36 | Docs state OCSF JSONL writes to /var/log/openshell-ocsf.YYYY-MM-DD.log. |
| architecture/sandbox.md:91 | Architecture says sandbox logs are local and can also be pushed to the gateway. |
| architecture/gateway.md:104 | Gateway observability scope includes pushing/streaming sandbox logs, not durable archival. |

Current Behavior

  • Operators can use openshell logs and the TUI for recent or live logs, but those tools read the gateway's in-memory log bus.
  • Gateway restarts, gateway rotation, buffer overflow, sandbox deletion, or push-channel pressure can lose data from the gateway-access path.
  • The durable logs are inside the sandbox filesystem at /var/log/openshell*.log and /var/log/openshell-ocsf*.log.
  • Kubernetes pod logs do not expose those files unless the sandbox process also writes all desired logs to stdout/stderr or another container tails a shared volume.
  • Existing Docker/Podman mount support may let advanced users build a file-collection workaround, but OpenShell does not provide a documented, portable log collection model.
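
Any file-based collector has to decide which files in the log directory belong to OpenShell. As a rough sketch of that selection step, the code below matches the documented `openshell-ocsf.YYYY-MM-DD.log` name shape and orders files by date. The function names are hypothetical, and the exact file naming should be confirmed against the appender configuration before relying on it.

```rust
/// If `name` looks like `openshell-ocsf.YYYY-MM-DD.log`, return the date part.
/// Only the shape is checked (digits and dashes), not calendar validity.
pub fn ocsf_log_date(name: &str) -> Option<&str> {
    let date = name.strip_prefix("openshell-ocsf.")?.strip_suffix(".log")?;
    let bytes = date.as_bytes();
    let shape_ok = bytes.len() == 10
        && bytes.iter().enumerate().all(|(i, b)| match i {
            4 | 7 => *b == b'-',
            _ => b.is_ascii_digit(),
        });
    if shape_ok {
        Some(date)
    } else {
        None
    }
}

/// Keep only OCSF log file names and sort them oldest to newest.
/// ISO-style dates sort correctly as plain strings.
pub fn ocsf_logs_oldest_first<'a>(names: &[&'a str]) -> Vec<&'a str> {
    let mut matched: Vec<&'a str> = names
        .iter()
        .copied()
        .filter(|n| ocsf_log_date(n).is_some())
        .collect();
    matched.sort();
    matched
}
```

A collector sidecar or host agent could apply a filter like this to a directory listing so that unrelated files under /var/log are never shipped.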

What Would Need to Change

A buildable design should answer these questions:

  1. Define the canonical log collection source. The simplest answer is the existing files, but the design should decide whether /var/log remains the path or whether OpenShell introduces a configurable log directory such as /var/log/openshell to simplify shared-volume mounts.
  2. Define the user/operator configuration surface. This may be gateway TOML, Helm values, sandbox template fields, or settings. If it lands in gateway TOML or Helm values, docs/reference/gateway-config.mdx and the Helm docs would need updates.
  3. Define how Kubernetes exposes file-backed logs. Likely options include a shared emptyDir/PVC for OpenShell log files and an optional sidecar that tails/ships them, or support for operator-provided collector sidecars in the sandbox pod template.
  4. Define how Docker/Podman/VM drivers expose file-backed logs. Options include generated named volumes, documented driver mounts, or a driver-neutral log directory convention that host collectors can read.
  5. Decide whether OpenTelemetry log shipping is implemented directly in the supervisor, delegated to an external collector sidecar/daemon, or deferred behind a later sink interface.
  6. Preserve the existing openshell logs behavior as interactive/recent diagnostics rather than making it the archival export mechanism.
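
For the Kubernetes question in item 3, one possible shape is a pod-template transform that adds a shared volume, mounts it into the agent container, and optionally appends a read-only collector sidecar. The sketch below uses simplified, hypothetical structs rather than the real Kubernetes API types or the driver's own types; the volume name `openshell-logs`, the container name `log-collector`, and the `LogCollection` config are all assumptions made for illustration.

```rust
/// Simplified, hypothetical pod model (not the Kubernetes API types).
#[derive(Debug, Clone, PartialEq)]
pub struct VolumeMount {
    pub name: String,
    pub mount_path: String,
    pub read_only: bool,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Container {
    pub name: String,
    pub image: String,
    pub volume_mounts: Vec<VolumeMount>,
}

#[derive(Debug, Clone, PartialEq, Default)]
pub struct PodTemplate {
    pub containers: Vec<Container>,
    /// Names of emptyDir-style volumes declared on the pod.
    pub empty_dir_volumes: Vec<String>,
}

/// Hypothetical operator-facing config for log collection.
pub struct LogCollection {
    pub log_dir: String,
    pub sidecar_image: Option<String>,
}

const LOG_VOLUME: &str = "openshell-logs";
const SIDECAR_NAME: &str = "log-collector";

/// Idempotently add the shared log volume, the agent mount, and,
/// if configured, a sidecar that mounts the same volume read-only.
pub fn apply_log_collection(pod: &mut PodTemplate, agent: &str, cfg: &LogCollection) {
    if !pod.empty_dir_volumes.iter().any(|v| v == LOG_VOLUME) {
        pod.empty_dir_volumes.push(LOG_VOLUME.to_string());
    }
    let mount = |read_only| VolumeMount {
        name: LOG_VOLUME.to_string(),
        mount_path: cfg.log_dir.clone(),
        read_only,
    };
    for c in pod.containers.iter_mut().filter(|c| c.name == agent) {
        if !c.volume_mounts.iter().any(|m| m.name == LOG_VOLUME) {
            c.volume_mounts.push(mount(false));
        }
    }
    if let Some(image) = &cfg.sidecar_image {
        if !pod.containers.iter().any(|c| c.name == SIDECAR_NAME) {
            pod.containers.push(Container {
                name: SIDECAR_NAME.to_string(),
                image: image.clone(),
                volume_mounts: vec![mount(true)],
            });
        }
    }
}
```

Keeping the transform additive and idempotent means it never edits mounts it did not create, which is one way to avoid disturbing the existing TLS, bootstrap, and SPIFFE mounts. Whether the volume should be an emptyDir or a PVC is left open here, as it is in the list above.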

Alternative Approaches Considered

File-driven collection with driver-specific exposure. Keep sandbox files as source of truth. Add a driver-neutral log directory convention and expose that directory through Kubernetes shared volumes/sidecars and local driver mounts. This is the simplest operational model and aligns with current implementation, but needs careful path/backward-compatibility decisions.

Kubernetes collector sidecar first. Add Helm/Kubernetes driver support for a configurable sidecar that tails OpenShell log files and writes to stdout, Fluent Bit, Vector, or an OpenTelemetry Collector. This solves the immediate Kubernetes operator problem, but it should be framed as one driver's implementation of a broader log sink model so that the Docker, Podman, and VM drivers are not left behind.

Supervisor-owned OTLP exporter. Add an OpenTelemetry logs exporter in the supervisor that ships directly from the tracing/event stream. This is portable across drivers and avoids sidecars, but introduces network policy, credentials, batching/backpressure, retry, and configuration complexity inside every sandbox.

Gateway persistence/export. Persist pushed logs at the gateway or add an export API. This improves openshell logs, but it keeps the gateway in the hot path and does not solve file-backed collection or Kubernetes-native log collection. It should not be the primary archival design.

Patterns to Follow

  • Logging must remain non-blocking for sandbox execution. Existing push behavior drops under pressure rather than blocking the supervisor.
  • OCSF structured events should remain the machine-readable/security-audit format, while shorthand remains optimized for humans and agents.
  • Kubernetes driver changes should follow the existing pod-template transform pattern used for supervisor side-loading and workspace persistence.
  • Driver mount behavior should reuse crates/openshell-core/src/driver_mounts.rs validation and existing Docker/Podman driver config patterns.
  • Operator-facing config must be documented in docs/reference/gateway-config.mdx and Helm values/README if it affects gateway TOML or Helm rendering.
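
If a dedicated log directory is introduced, shared mount validation would likely need to reserve it so that user-supplied mounts cannot mask or write into it. The sketch below shows one way such a check could look. It is not the code in driver_mounts.rs: the function name, the error enum, and the choice to reject both ancestors and descendants of the reserved path are assumptions, and it only handles Unix-style paths.

```rust
use std::path::{Component, Path};

/// Hypothetical validation errors for a user-supplied mount target.
#[derive(Debug, PartialEq)]
pub enum MountTargetError {
    NotAbsolute,
    ParentTraversal,
    Reserved,
}

/// Reject mount targets that are relative, contain `..`, or overlap the
/// reserved log directory (equal to it, inside it, or an ancestor that
/// would mask it).
pub fn validate_mount_target(
    target: &str,
    reserved_log_dir: &str,
) -> Result<(), MountTargetError> {
    let target = Path::new(target);
    let reserved = Path::new(reserved_log_dir);
    if !target.is_absolute() {
        return Err(MountTargetError::NotAbsolute);
    }
    if target.components().any(|c| c == Component::ParentDir) {
        return Err(MountTargetError::ParentTraversal);
    }
    // `Path::starts_with` compares whole components, so
    // `/var/log/openshell-extra` does not match `/var/log/openshell`.
    if target.starts_with(reserved) || reserved.starts_with(target) {
        return Err(MountTargetError::Reserved);
    }
    Ok(())
}
```

Rejecting ancestors is the stricter option: it also blocks a user mount at /var/log when the reserved directory is /var/log/openshell. A real implementation might relax that, which ties back to the open question about keeping logs directly under /var/log.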

Proposed Approach

Investigate a small, file-first log collection architecture. Treat local sandbox log files as the durable source of truth, keep openshell logs as a recent/live diagnostic view, and add a driver-neutral concept of a sandbox log export directory or log sink. For Kubernetes, explore mounting that directory on a shared volume and optionally adding an operator-configurable collector sidecar that can tail files to stdout or ship to OTLP-compatible collectors. For the Docker, Podman, and VM drivers, explore how the same directory can map to host-visible volumes or driver state paths without requiring the gateway to persist log streams.
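
One way to picture the driver-neutral concept is a single log directory setting that each driver resolves into its own exposure mechanism. The sketch below is purely illustrative: the `Driver` and `LogExposure` enums, the `exposure_for` function, and the volume and path naming are all hypothetical and do not correspond to anything in proto/compute_driver.proto today.

```rust
/// Hypothetical list of compute drivers.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Driver {
    Kubernetes,
    Docker,
    Podman,
    Vm,
}

/// Hypothetical ways a driver could expose the sandbox log directory.
#[derive(Debug, PartialEq)]
pub enum LogExposure {
    /// Pod-level shared volume mounted at the log directory.
    SharedPodVolume { mount_path: String },
    /// Named volume that a host-side collector can read.
    NamedVolume { volume: String, mount_path: String },
    /// Directory under driver-managed state on the host (relative path).
    DriverStatePath { host_path: String },
}

/// Resolve one driver-neutral log directory into a per-driver exposure.
pub fn exposure_for(driver: Driver, sandbox_id: &str, log_dir: &str) -> LogExposure {
    match driver {
        Driver::Kubernetes => LogExposure::SharedPodVolume {
            mount_path: log_dir.to_string(),
        },
        Driver::Docker | Driver::Podman => LogExposure::NamedVolume {
            volume: format!("openshell-logs-{}", sandbox_id),
            mount_path: log_dir.to_string(),
        },
        Driver::Vm => LogExposure::DriverStatePath {
            host_path: format!("sandboxes/{}/logs", sandbox_id),
        },
    }
}
```

The point of the shape is that operators configure one directory while the gateway stays out of the data path, since each driver only decides how that directory becomes visible to an external collector.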

Scope Assessment

  • Complexity: Medium-High
  • Confidence: Medium. The existing file path is clear, but the right portable configuration surface needs human review.
  • Estimated files to change: 8-15 for an initial implementation, depending on whether the first build includes Kubernetes only or a driver-neutral config surface.
  • Issue type: feat

Risks & Open Questions

  • Should OpenShell keep writing directly under /var/log, or write to a dedicated (possibly configurable) subdirectory that can be mounted cleanly without masking unrelated system logs?
  • Should the first implementation support Kubernetes only, or define the driver-neutral log sink model before adding the Kubernetes sidecar?
  • Should OpenTelemetry support be implemented by shipping files through a collector sidecar, or by adding OTLP export directly in the supervisor?
  • How should collector credentials and network egress be modeled without weakening sandbox isolation or leaking secrets into logs?
  • What are the retention and rotation expectations when a collector is enabled? The OpenShell log files currently rotate daily, with 3 files retained.
  • How should this interact with OCSF JSONL enablement? Operators likely need to enable full JSONL and log collection independently.
  • LSM impact: shared volume and sidecar access should be checked under AppArmor/SELinux. Podman already has SELinux-aware mount behavior; Kubernetes sidecar access depends on pod security context, volume type, and labels.

Test Considerations

  • Unit tests for any new log directory or sink config parsing/defaults.
  • Kubernetes pod-template unit tests proving the shared log volume, agent mount, and optional sidecar are rendered correctly and do not disturb TLS/bootstrap/SPIFFE mounts.
  • Helm lint/docs tests if Helm values are added.
  • Docker/Podman unit tests if log directory mounts are generated or reserved paths change.
  • Kubernetes e2e test that creates a sandbox, emits OCSF events, and verifies a collector sidecar or shared volume can read the file-backed logs.
  • Optional OTLP integration test with a mock collector if direct OTLP export is selected.
  • Docs update for operator workflows under docs/observability/ and any gateway config changes under docs/reference/gateway-config.mdx.

Created by spike investigation. Use build-from-issue to plan and implement after human review.

Labels

  • area:cluster: Related to running OpenShell on k3s/docker
  • area:gateway: Gateway server and control-plane work
  • area:sandbox: Sandbox runtime and isolation work
  • spike
  • state:review-ready: Ready for human review
  • topic:observability: Logging, metrics, and observability work
