feat(compute): support composable lifecycle extensions across compute drivers #1915

@cheese-head

Spike: Composable Lifecycle Extensions Across Compute Drivers

Problem Statement

OpenShell now has VM-specific lifecycle extension hooks via
#1583, but the extension model is local to the VM driver. The
same kind of integration is needed for Kubernetes, Docker, Podman, BlueField/DPU, and
future external drivers: operators want to reuse the built-in driver lifecycle
while injecting deployment-specific validation, plan mutation, resource
allocation, network attribution, and cleanup.

This should be generalized as a compute-driver lifecycle extension model, with
VM as the reference implementation rather than the only implementation.

Technical Context

The gateway already has a common internal compute-driver protocol. The common
surface is compute_driver.proto, and ComputeRuntime translates public
Sandbox resources into DriverSandbox messages before calling a selected
driver. That protocol currently has lifecycle RPCs for validate, create, stop,
delete, get, list, and watch, but it does not expose composable hook phases or
driver plan mutation points.

The VM driver has a more expressive in-process extension framework. It defines a
LifecycleExtension trait, a LifecycleExtensionRegistry, a mutable
LaunchPlan, activation by policy-stamped labels, and failure/delete/restore
hooks. Kubernetes, Docker, and Podman instead have driver-local plan builders and
driver-specific driver_config parsing. Those are useful, but they do not let an
operator-installed extension compose with the existing driver implementation.

The design target is not "external drivers replace everything." It is "external
or specialized drivers can wrap or reuse existing drivers and customize only the
parts that differ."
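
A minimal sketch of that "wrap and customize only what differs" target, in Rust. Everything here is hypothetical and reduced for illustration: the Driver trait, SandboxRequest, BaseDriver, TenantWrapper, and the tenant label are assumptions, not the ComputeDriver service or any existing OpenShell type.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for the driver-facing sandbox request.
#[derive(Debug, Clone, Default)]
pub struct SandboxRequest {
    pub name: String,
    pub labels: BTreeMap<String, String>,
}

/// Hypothetical, reduced driver surface (not the real ComputeDriver service).
pub trait Driver {
    fn validate(&self, req: &SandboxRequest) -> Result<(), String>;
    fn create(&mut self, req: SandboxRequest) -> Result<String, String>;
}

/// Stand-in for a built-in driver that records what it created.
#[derive(Default)]
pub struct BaseDriver {
    pub created: Vec<SandboxRequest>,
}

impl Driver for BaseDriver {
    fn validate(&self, req: &SandboxRequest) -> Result<(), String> {
        if req.name.is_empty() {
            return Err("sandbox name must not be empty".to_string());
        }
        Ok(())
    }

    fn create(&mut self, req: SandboxRequest) -> Result<String, String> {
        self.validate(&req)?;
        let id = format!("sandbox-{}", req.name);
        self.created.push(req);
        Ok(id)
    }
}

/// Wrapper that reuses the inner driver and changes only two things:
/// one extra validation rule and one wrapper-owned label.
pub struct TenantWrapper<D: Driver> {
    pub inner: D,
    pub tenant: String,
}

impl<D: Driver> Driver for TenantWrapper<D> {
    fn validate(&self, req: &SandboxRequest) -> Result<(), String> {
        // Fail closed if the caller tries to set the wrapper-owned label.
        if req.labels.contains_key("tenant") {
            return Err("label `tenant` is reserved for the wrapper".to_string());
        }
        self.inner.validate(req)
    }

    fn create(&mut self, mut req: SandboxRequest) -> Result<String, String> {
        self.validate(&req)?;
        req.labels.insert("tenant".to_string(), self.tenant.clone());
        // Everything else is delegated to the wrapped driver unchanged.
        self.inner.create(req)
    }
}
```

The wrapper never reimplements create; it narrows validation and stamps one label, then delegates. A real composition model would also need to settle which config block the wrapper reads, which this sketch leaves out.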

Affected Components

| Component | Key Files | Role |
| --- | --- | --- |
| Internal compute-driver protocol | proto/compute_driver.proto | Defines the gateway-to-driver lifecycle RPCs and DriverSandboxTemplate fields. |
| Gateway compute runtime | crates/openshell-server/src/compute/mod.rs | Selects the active driver, translates public sandbox templates, forwards selected driver_config, and calls the compute-driver RPC surface. |
| Gateway driver selection | crates/openshell-server/src/lib.rs, crates/openshell-core/src/config.rs, crates/openshell-server/src/config_file.rs | Still switches over fixed ComputeDriverKind values and per-driver config tables. |
| VM lifecycle extensions | crates/openshell-driver-vm/src/lifecycle.rs, crates/openshell-driver-vm/src/driver.rs | Current reference implementation for lifecycle extension hooks, mutable launch plans, activation, rollback, delete, and restore. |
| Kubernetes driver | crates/openshell-driver-kubernetes/src/driver.rs, crates/openshell-driver-kubernetes/src/grpc.rs | Builds Kubernetes Sandbox CR and pod template JSON, parses Kubernetes driver_config, and is the strongest first non-VM target. |
| Docker driver | crates/openshell-driver-docker/src/lib.rs | Builds Docker container create requests, validates mount/CDI/resource config, and already has a narrow driver_config surface. |
| Podman driver | crates/openshell-driver-podman/src/driver.rs, crates/openshell-driver-podman/src/container.rs | Builds Podman libpod specs, manages volumes/token files, and already has a narrow driver_config surface. |
| Driver utility contracts | crates/openshell-core/src/driver_utils.rs | Shared driver labels, supervisor mount paths, token paths, and capability response helper. |
| Architecture/docs | architecture/compute-runtimes.md, architecture/README.md, docs/reference/sandbox-compute-drivers.mdx, docs/reference/gateway-config.mdx | Documents runtime boundaries and user/operator-facing compute driver config. |

Technical Investigation

Architecture Overview

OpenShell has a gateway-owned compute orchestration layer over pluggable compute
drivers. The internal protocol is intentionally driver-native: proto/compute_driver.proto:10
states the file owns driver-native request/response/observation types, and
proto/compute_driver.proto:18 defines the ComputeDriver service.

The protocol's lifecycle RPCs include:

  • GetCapabilities at proto/compute_driver.proto:20
  • ValidateSandboxCreate at proto/compute_driver.proto:23
  • CreateSandbox at proto/compute_driver.proto:33
  • StopSandbox at proto/compute_driver.proto:36
  • DeleteSandbox at proto/compute_driver.proto:39
  • WatchSandboxes at proto/compute_driver.proto:42

The protocol does not currently model lifecycle extension phases, extension
capabilities, or a way to compose one driver with another.

The gateway wraps concrete drivers behind SharedComputeDriver at
crates/openshell-server/src/compute/mod.rs:50. ComputeRuntime::from_driver
stores the selected ComputeDriverKind and reads driver capabilities at
crates/openshell-server/src/compute/mod.rs:245. In-tree constructors wire
Docker, Kubernetes, VM, and Podman independently at
crates/openshell-server/src/compute/mod.rs:295,
crates/openshell-server/src/compute/mod.rs:330,
crates/openshell-server/src/compute/mod.rs:359, and
crates/openshell-server/src/compute/mod.rs:386.

Driver selection is still enum-bound. ComputeDriverKind supports only
kubernetes, vm, docker, and podman at
crates/openshell-core/src/config.rs:50. build_compute_runtime switches over
that enum at crates/openshell-server/src/lib.rs:701 and has a
TODO(driver-abstraction) at crates/openshell-server/src/lib.rs:12 saying the
per-driver wiring should eventually collapse to a driver-agnostic path. Config
inheritance is also enum-bound through driver_table and inheritable_keys at
crates/openshell-server/src/config_file.rs:225 and
crates/openshell-server/src/config_file.rs:258.

The gateway translates public sandbox templates into driver templates at
crates/openshell-server/src/compute/mod.rs:1374. It forwards only the matching
nested driver_config block for the selected driver at
crates/openshell-server/src/compute/mod.rs:1423. Tests at
crates/openshell-server/src/compute/mod.rs:1981 verify that only the matching
driver block is passed through and non-object blocks are rejected.
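
The forwarding rule above can be sketched as follows. This is an illustration only: ConfigValue and select_driver_config are hypothetical names, and the real gateway code uses its own value types. The sketch rejects a non-object block only for the active driver; whether the gateway also rejects non-object blocks for inactive drivers is not stated here.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON-like value; the real gateway types differ.
#[derive(Debug, Clone, PartialEq)]
pub enum ConfigValue {
    Text(String),
    Object(BTreeMap<String, ConfigValue>),
}

/// Returns the block keyed by `active_driver` and ignores blocks that belong
/// to other drivers. A matching block that is not an object is rejected.
pub fn select_driver_config(
    driver_config: &BTreeMap<String, ConfigValue>,
    active_driver: &str,
) -> Result<Option<BTreeMap<String, ConfigValue>>, String> {
    match driver_config.get(active_driver) {
        // No block for this driver: nothing to forward.
        None => Ok(None),
        // Object block: forward a copy to the driver.
        Some(ConfigValue::Object(block)) => Ok(Some(block.clone())),
        // Anything else is a malformed request.
        Some(_) => Err(format!(
            "driver_config.{} must be an object",
            active_driver
        )),
    }
}
```

Keying the lookup on the active driver name is also what creates the wrapper-driver identity question raised later: a wrapper named differently from its base driver would receive no block under this rule.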

The existing VM lifecycle framework is VM-local. The extension activation label
prefix is defined at crates/openshell-driver-vm/src/lifecycle.rs:13.
LaunchPlan is defined at crates/openshell-driver-vm/src/lifecycle.rs:218.
LifecycleExtension is defined at crates/openshell-driver-vm/src/lifecycle.rs:308.
The registry validates names/descriptors at
crates/openshell-driver-vm/src/lifecycle.rs:505, runs configure_launch at
crates/openshell-driver-vm/src/lifecycle.rs:553, runs before_launch at
crates/openshell-driver-vm/src/lifecycle.rs:591, and runs cleanup/restore
hooks at crates/openshell-driver-vm/src/lifecycle.rs:603,
crates/openshell-driver-vm/src/lifecycle.rs:624,
crates/openshell-driver-vm/src/lifecycle.rs:637, and
crates/openshell-driver-vm/src/lifecycle.rs:644.

The VM driver wires those hooks into provisioning and cleanup. It calls
configure_launch around crates/openshell-driver-vm/src/driver.rs:727, calls
before_launch around crates/openshell-driver-vm/src/driver.rs:785, invokes
after_restore around crates/openshell-driver-vm/src/driver.rs:963, invokes
after_delete around crates/openshell-driver-vm/src/driver.rs:1043, and
invokes before_restore around crates/openshell-driver-vm/src/driver.rs:1204.
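
To make the hook flow concrete, here is a reduced sketch of a registry that runs configure_launch and before_launch in registration order and notifies extensions in reverse order on failure. The hook names come from this issue; the signatures, the Registry and Recorder types, and the choice to call after_launch_failed on every registered extension are assumptions, not the code in lifecycle.rs.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical, reduced launch plan; the real LaunchPlan has more fields.
#[derive(Debug, Default)]
pub struct LaunchPlan {
    pub args: Vec<String>,
}

/// Hook names follow the issue; the signatures are assumptions.
pub trait LifecycleExtension {
    fn name(&self) -> &str;
    fn configure_launch(&mut self, _plan: &mut LaunchPlan) -> Result<(), String> {
        Ok(())
    }
    fn before_launch(&mut self, _plan: &LaunchPlan) -> Result<(), String> {
        Ok(())
    }
    fn after_launch_failed(&mut self) {}
}

#[derive(Default)]
pub struct Registry {
    extensions: Vec<Box<dyn LifecycleExtension>>,
}

impl Registry {
    /// Rejects a second extension with an already-registered name.
    pub fn register(&mut self, ext: Box<dyn LifecycleExtension>) -> Result<(), String> {
        if self.extensions.iter().any(|e| e.name() == ext.name()) {
            return Err(format!("duplicate extension name: {}", ext.name()));
        }
        self.extensions.push(ext);
        Ok(())
    }

    /// Runs setup hooks in registration order. On any error, calls
    /// after_launch_failed on all extensions in reverse registration order.
    pub fn prepare(&mut self, plan: &mut LaunchPlan) -> Result<(), String> {
        let result = self.run_setup(plan);
        if result.is_err() {
            for ext in self.extensions.iter_mut().rev() {
                ext.after_launch_failed();
            }
        }
        result
    }

    fn run_setup(&mut self, plan: &mut LaunchPlan) -> Result<(), String> {
        for ext in self.extensions.iter_mut() {
            ext.configure_launch(&mut *plan)?;
        }
        for ext in self.extensions.iter_mut() {
            ext.before_launch(&*plan)?;
        }
        Ok(())
    }
}

pub type Log = Rc<RefCell<Vec<String>>>;

/// Example extension that records each hook call and can be told to fail.
pub struct Recorder {
    pub name: String,
    pub log: Log,
    pub fail_before_launch: bool,
}

impl LifecycleExtension for Recorder {
    fn name(&self) -> &str {
        &self.name
    }
    fn configure_launch(&mut self, plan: &mut LaunchPlan) -> Result<(), String> {
        plan.args.push(format!("--ext={}", self.name));
        self.log.borrow_mut().push(format!("{}:configure_launch", self.name));
        Ok(())
    }
    fn before_launch(&mut self, _plan: &LaunchPlan) -> Result<(), String> {
        self.log.borrow_mut().push(format!("{}:before_launch", self.name));
        if self.fail_before_launch {
            Err(format!("{} refused launch", self.name))
        } else {
            Ok(())
        }
    }
    fn after_launch_failed(&mut self) {
        self.log.borrow_mut().push(format!("{}:after_launch_failed", self.name));
    }
}
```

Notifying every extension on failure, rather than only those that already ran, keeps the sketch short and relies on cleanup hooks being idempotent.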

Kubernetes has a natural plan-building boundary but no extension hook around it.
KubernetesComputeDriver::create_sandbox builds a DynamicObject and assigns
obj.data = sandbox_to_k8s_spec(...) at
crates/openshell-driver-kubernetes/src/driver.rs:376 and
crates/openshell-driver-kubernetes/src/driver.rs:420. The pod template and
Sandbox CR spec are built in sandbox_to_k8s_spec at
crates/openshell-driver-kubernetes/src/driver.rs:1182 and
sandbox_template_to_k8s at
crates/openshell-driver-kubernetes/src/driver.rs:1249. Kubernetes-specific
driver_config is parsed at crates/openshell-driver-kubernetes/src/driver.rs:94
and applied to pod scheduling/resources at
crates/openshell-driver-kubernetes/src/driver.rs:1525 and
crates/openshell-driver-kubernetes/src/driver.rs:1553.

Docker and Podman show the same pattern in container-driver form. Docker parses
DockerSandboxDriverConfig at crates/openshell-driver-docker/src/lib.rs:273,
validates templates at crates/openshell-driver-docker/src/lib.rs:447, and
builds driver-owned mounts from config at
crates/openshell-driver-docker/src/lib.rs:1656. Podman parses
PodmanSandboxDriverConfig at
crates/openshell-driver-podman/src/container.rs:64, validates create requests
at crates/openshell-driver-podman/src/driver.rs:290, creates volumes/token
files/container state at crates/openshell-driver-podman/src/driver.rs:343, and
builds the libpod container spec in try_build_container_spec_with_token around
crates/openshell-driver-podman/src/driver.rs:426.

Current Behavior

Today, a deployment-specific integration has three imperfect options:

  1. Use driver_config if the built-in driver already exposes the exact knob.
    This works for simple typed driver options but does not support side effects,
    external allocation, rollback, or composition.
  2. Implement a full external compute driver. This is appropriate for a new
    backend, but too heavy for "Kubernetes plus platform-specific mutation" or
    "Podman plus custom mounts/IPAM."
  3. Add bespoke hooks inside one driver. VM has already done this, but repeating
    VM-specific patterns independently across Kubernetes/Docker/Podman would make
    extension behavior inconsistent.

There is also a config identity problem for wrapper drivers. Public docs state
that compute_drivers currently accepts one of docker, podman,
kubernetes, or vm at docs/reference/sandbox-compute-drivers.mdx:17 and
docs/reference/sandbox-compute-drivers.mdx:24. driver_config is keyed by
driver name and the gateway forwards only the active driver's block. A wrapper
or external driver that delegates to Kubernetes must decide whether it consumes
driver_config.kubernetes, driver_config.<wrapper-name>, or both.

What Would Need to Change

The first issue should be an RFC/tracking issue, not an immediate broad
implementation. It should define a common lifecycle extension model that can be
implemented incrementally by individual drivers.

The likely shape:

  • Add a shared lifecycle vocabulary in a common crate, probably
    openshell-core or a new compute-extension module.
  • Keep VM as the reference implementation and map current VM phases:
    configure_launch, before_launch, after_launch_failed, after_delete,
    before_restore, and after_restore.
  • Define common phases such as validate_request, prepare_plan,
    before_create, after_create, on_create_failed, before_start,
    after_ready, before_stop, after_stop, before_delete, after_delete,
    and reconcile.
  • Define a common hook context with sandbox ID/name, selected driver, template,
    typed resources, driver config, policy/tenant/user metadata when available,
    and audit/span context. Do not pass raw provider secrets by default.
  • Define a driver-owned mutable plan abstraction. It can be typed per driver
    rather than forcing one universal plan type.
  • Add capability discovery so a driver can declare which hook phases and plan
    mutation points it supports.
  • Add composition guidance for external/specialized drivers: they should be able
    to fully implement compute_driver.proto, or wrap/delegate to an existing
    driver and apply extensions around that driver's plan.
  • Define config identity for composed drivers: the wrapper's own config,
    delegated base-driver config, and any policy-controlled extension config must
    not be conflated.
  • Choose Kubernetes as the first non-VM proof of concept because it already has
    a clear native plan boundary and important use cases: namespace selection,
    NetworkPolicy, Services, DRA/resource claims, pod mutation, labels,
    annotations, placement, warm pools, and IP attribution.
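
One possible starting point for the shared vocabulary and capability discovery, sketched under stated assumptions: the Phase enum lists only a subset of the phases named above, the VM-hook-to-phase mapping is a guess that the RFC would need to confirm, and DriverCapabilities is a hypothetical type rather than anything returned by GetCapabilities today.

```rust
use std::collections::BTreeSet;

/// Driver-neutral phases named in this issue (subset, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Phase {
    ValidateRequest,
    PreparePlan,
    BeforeCreate,
    AfterCreate,
    OnCreateFailed,
    BeforeDelete,
    AfterDelete,
}

/// Maps current VM hook names onto common phases. This mapping is an
/// assumption; the restore hooks have no counterpart in the subset above.
pub fn vm_hook_to_phase(hook: &str) -> Option<Phase> {
    match hook {
        "configure_launch" => Some(Phase::PreparePlan),
        "before_launch" => Some(Phase::BeforeCreate),
        "after_launch_failed" => Some(Phase::OnCreateFailed),
        "after_delete" => Some(Phase::AfterDelete),
        _ => None,
    }
}

/// Hypothetical capability declaration: the phases a driver can run.
pub struct DriverCapabilities {
    pub supported: BTreeSet<Phase>,
}

impl DriverCapabilities {
    /// Fail closed: returns the required phases the driver does not support,
    /// in the order they were requested.
    pub fn check(&self, required: &[Phase]) -> Result<(), Vec<Phase>> {
        let missing: Vec<Phase> = required
            .iter()
            .copied()
            .filter(|p| !self.supported.contains(p))
            .collect();
        if missing.is_empty() {
            Ok(())
        } else {
            Err(missing)
        }
    }
}
```

Checking an extension's required phases against the driver's declared set at registration time would let the gateway refuse an unsupported combination before any sandbox is created.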

Alternative Approaches Considered

Only extend driver_config. This is insufficient. driver_config is useful
for selected-driver schema knobs, but it is request data. It does not model
operator-installed code, external resource allocation, rollback, delete hooks,
capability discovery, or composition.

Only use external compute drivers. This works for complete replacement but
creates unnecessary forks when an integration wants 90 percent of the in-tree
Kubernetes/Docker/Podman/VM behavior.

Copy the VM trait into each driver. This would be fast but would ossify
VM-specific names like LaunchPlan and before_launch. A common contract should
be driver-neutral, with VM-specific adapters.

Change compute_driver.proto immediately. This may eventually be needed for
external driver capability discovery, but the first RFC can define in-process
driver composition and capability metadata before committing to protocol fields.

Patterns to Follow

  • Keep the gateway/driver boundary clean. DriverSandboxTemplate.driver_config
    is already selected by the gateway and validated by the driver.
  • Keep platform-specific schema inside the selected driver, following the
    Kubernetes comment at crates/openshell-driver-kubernetes/src/driver.rs:88.
  • Treat extensions as operator-installed, not user-supplied.
  • Preserve fail-closed behavior for unsupported capabilities, invalid
    activation keys, and unsafe config.
  • Follow VM cleanup ordering: setup in registration order, cleanup in reverse
    order, and idempotent cleanup hooks.
  • Keep internal tracking labels and auth/security fields driver-owned, following
    the Docker/Podman/Kubernetes patterns where managed labels and required env
    vars override user input.
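
The label-ownership pattern in the last bullet can be sketched as below. The prefix example.managed/ and both function names are hypothetical; the real managed label keys live in driver_utils.rs and are not reproduced here. The sketch shows two behaviors: driver-owned labels override user input on merge, and extension-supplied labels under the managed prefix are rejected outright.

```rust
use std::collections::BTreeMap;

/// Hypothetical reserved prefix for driver-owned labels.
pub const MANAGED_PREFIX: &str = "example.managed/";

/// Rejects any label under the managed prefix (fail closed).
pub fn validate_extension_labels(labels: &BTreeMap<String, String>) -> Result<(), String> {
    match labels.keys().find(|k| k.starts_with(MANAGED_PREFIX)) {
        Some(k) => Err(format!("label {} is driver-owned", k)),
        None => Ok(()),
    }
}

/// Merges labels with driver-owned values applied last, so they always win
/// over a user-supplied value for the same key.
pub fn merge_labels(
    user: &BTreeMap<String, String>,
    managed: &BTreeMap<String, String>,
) -> BTreeMap<String, String> {
    let mut out = user.clone();
    out.extend(managed.iter().map(|(k, v)| (k.clone(), v.clone())));
    out
}
```

Overriding suits untrusted user input, where silently correcting the value is acceptable; rejecting suits operator-installed extensions, where an attempt to write a reserved key is more likely a bug worth surfacing.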

Proposed Approach

Create an RFC issue under the OpenShell extensibility umbrella that defines
composable lifecycle extensions across compute drivers. The RFC should use the
merged VM lifecycle implementation as the reference implementation, but
deliberately generalize the vocabulary to driver-neutral lifecycle phases and
driver-owned plans.

The first implementation target should be Kubernetes because the current driver
already builds a rich native plan before create, and the highest-value use cases
are Kubernetes-native: tenant namespace selection, DRA/resource claims,
NetworkPolicy/Service creation, warm pool/SandboxClaim integration, placement,
labels, annotations, and IP attribution.

The RFC should explicitly support external/specialized drivers reusing existing
drivers. A driver should be able to wrap the in-tree Kubernetes or VM driver and
contribute validation, plan mutation, allocation, and cleanup without
reimplementing the full compute lifecycle.

Scope Assessment

  • Complexity: High
  • Confidence: Medium-high for an RFC and Kubernetes-first proof of concept;
    medium for a stable external-driver protocol shape.
  • Estimated files to change for first implementation: 8-15 files, depending
    on whether the first cut adds only in-process hooks or also extends
    compute_driver.proto.
  • Issue type: feat

Risks & Open Questions

  • Where should the common extension API live: openshell-core, a new crate, or
    per-driver adapter traits over a shared context?
  • Should external compute drivers advertise hook capabilities through
    GetCapabilities, a new RPC, or only through docs/config in the first cut?
  • How much should extensions be allowed to mutate? Some fields must be immutable
    after validation, especially identity labels, auth material, and gateway-owned
    metadata.
  • How should policy authorize extension activation? VM currently uses
    policy-stamped labels. A cross-driver contract needs the same fail-closed
    property.
  • How should failures roll back multi-resource Kubernetes extensions, such as
    creating a Service, NetworkPolicy, ResourceClaim, or SandboxClaim around the
    Sandbox CR?
  • How should tenant/user metadata be represented before the OIDC/tenant model is
    fully settled?
  • DRA-style resource requests need a clear boundary between user request,
    policy-approved resource class, and driver/extension-created Kubernetes
    objects.
  • Linux Security Module (LSM) impact is low for the RFC itself, but non-VM container hooks that add bind
    mounts, device mounts, or /proc-visible behavior must account for SELinux and
    AppArmor. Podman already has SELinux-aware mount handling in
    crates/openshell-driver-podman/src/container.rs:17.

Test Considerations

  • Shared unit tests for lifecycle ordering: validate/prepare/create/failure/delete
    ordering, reverse cleanup order, no-op unsupported phases, and duplicate
    extension names.
  • VM regression tests should continue covering existing hook ordering, cleanup,
    backend validation, and restore behavior.
  • Kubernetes unit tests should exercise pod/Sandbox CR plan mutation before
    create, including rejecting attempts to overwrite reserved labels/annotations.
  • Failure tests should cover rollback when an extension fails after allocating a
    resource but before the main driver create succeeds.
  • Capability tests should verify unsupported hooks fail predictably or no-op
    according to the RFC.
  • External-driver tests may need a stub driver if compute_driver.proto gains
    capability discovery fields.
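
As one way to shape the rollback failure test, the sketch below models an extension that allocates side resources before the main driver create and releases them in reverse order if that create fails. Allocations and create_with_rollback are hypothetical helpers for illustration, and the resource names stand in for objects such as a Service or NetworkPolicy.

```rust
/// Hypothetical tracker for resources an extension has allocated.
#[derive(Debug, Default)]
pub struct Allocations {
    live: Vec<String>,
    pub released: Vec<String>,
}

impl Allocations {
    pub fn allocate(&mut self, name: &str) {
        self.live.push(name.to_string());
    }

    /// Releases in reverse allocation order. Idempotent: a second call
    /// finds nothing live and does nothing.
    pub fn release_all(&mut self) {
        while let Some(name) = self.live.pop() {
            self.released.push(name);
        }
    }

    pub fn live(&self) -> &[String] {
        &self.live
    }
}

/// Allocates the side resources, then runs the main driver create.
/// If the create fails, everything allocated so far is rolled back.
pub fn create_with_rollback<F>(
    allocs: &mut Allocations,
    side_resources: &[&str],
    driver_create: F,
) -> Result<String, String>
where
    F: FnOnce() -> Result<String, String>,
{
    for r in side_resources {
        allocs.allocate(r);
    }
    match driver_create() {
        Ok(id) => Ok(id),
        Err(e) => {
            allocs.release_all();
            Err(e)
        }
    }
}
```

A test built on this shape would assert three things: nothing stays live after a failed create, release order is the reverse of allocation order, and a repeated cleanup call is harmless.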

Documentation & Config Impact

  • architecture/compute-runtimes.md should document lifecycle extension phases
    and the distinction between driver replacement and driver composition.
  • docs/reference/sandbox-compute-drivers.mdx should explain which drivers
    support which lifecycle phases.
  • docs/reference/gateway-config.mdx must be updated if operator config is
    added for extension registration, extension enablement, or external driver
    composition.
  • The compute-driver docs currently state that only one gateway driver is
    selected and supported values are fixed. Those sections need updates if
    wrapper/composed drivers become selectable.
  • Kubernetes setup docs may need RBAC updates if extensions create resources
    beyond Sandbox CRs, such as Service, NetworkPolicy, ResourceClaim, or
    SandboxClaim.

Related Work


Created by spike investigation. Use build-from-issue to plan and implement
after human review.

Labels: state:triage-needed (opened without agent diagnostics and needs triage)