
Cgroup adopt #2

Open
reboss wants to merge 16 commits into master from cgroup-adopt

reboss commented May 6, 2026

- What I did

Added support for automatic cgroup adoption via a new daemon configuration option --adopt-user-cgroups. When enabled, containers automatically inherit their creator's cgroup parent instead of running under the default Docker cgroup. This enables better resource isolation and accounting in multi-user environments where users should not be able to escape their systemd resource constraints.

This feature is particularly useful in:

  • Multi-tenant environments where users have resource limits enforced via systemd
  • HPC environments where job schedulers track resource usage via cgroups
  • Systems where containers should respect the resource constraints of the user who created them

- How I did it

The implementation has five parts:

  1. Peer credential extraction - Created middleware (daemon/server/middleware/peercred_linux.go) that extracts UID/GID/PID from Unix socket connections by reading the SO_PEERCRED socket option and stores them in the request context.

  2. Cgroup derivation - Added utility (pkg/cgroups/adoption_linux.go) that reads /proc/<pid>/cgroup and parses both cgroup v1 and v2 formats to derive the user's cgroup parent. It initially extracted the deepest .slice component; a later commit in this PR adopts the full cgroup path instead.

  3. Daemon configuration - Added AdoptUserCgroups bool field to daemon config with corresponding --adopt-user-cgroups CLI flag.

  4. Enforcement at daemon layer - Modified adaptContainerSettings() in daemon/daemon_unix.go to call applyCgroupAdoption() when the feature is enabled. The enforcement uses a strict model: if a user tries to specify a different cgroup parent than the adopted one, the request is rejected with an InvalidParameter error.

  5. Platform support - Linux-specific implementations with no-op stubs for other platforms.

The code was developed test-first and is covered by unit and integration tests.

- How to verify it

  1. Start dockerd with the feature enabled:

    sudo dockerd --adopt-user-cgroups

  2. Check your current cgroup:

    cat /proc/$$/cgroup
    # Example output (cgroup v2): 0::/user.slice/user-1000.slice/session-3.scope

  3. Create a container:

    docker run -d --name test nginx

  4. Verify the container inherited your cgroup parent:

    docker inspect test | jq '.[0].HostConfig.CgroupParent'
    # Should show: /user.slice/user-1000.slice (or similar)

  5. Verify enforcement - attempting to override should fail:

    docker run --cgroup-parent /custom/parent nginx
    # Should error: "cannot set cgroup parent when --adopt-user-cgroups is enabled"

  6. Run the test suite:

    # Unit tests
    make test-unit TESTDIRS='./pkg/cgroups ./daemon/server/middleware'

    # Integration tests
    make test-integration TESTFLAGS='-test.run TestCgroupAdoption'

- Human readable description for the release notes

+ Add `--adopt-user-cgroups` daemon flag to automatically set container cgroup parent based on the API client's cgroup (Linux only)

- A picture of a cute animal (not mandatory but encouraged)

[Image: Cute penguin]

reboss added 10 commits May 6, 2026 14:27
Implements middleware to extract peer credentials (PID, UID, GID) from
Unix socket connections via the SO_PEERCRED socket option. This provides the
foundation for per-user enforcement features like cgroup adoption.

- Linux implementation uses SO_PEERCRED to extract credentials from fd
- Non-Linux platforms get a no-op implementation
- Credentials are stored in request context for downstream handlers
- Gracefully handles non-Unix socket connections (TCP, etc.)

Signed-off-by: John Robbins <John.Robbins@amd.com>
Tests verify:
- Context value storage and retrieval of peer credentials
- Middleware structure and handler wrapping
- Graceful handling of missing credentials

Signed-off-by: John Robbins <John.Robbins@amd.com>
Implements DeriveParentFromPid() to read /proc/<pid>/cgroup and extract
the appropriate cgroup parent slice for container placement.

- Supports both cgroup v1 and v2 formats
- Prioritizes systemd slices over other controllers
- Extracts deepest .slice component (e.g., user-1000.slice)
- Falls back gracefully for non-systemd setups

Signed-off-by: John Robbins <John.Robbins@amd.com>
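A minimal sketch of the parsing this commit describes, operating on the text of /proc/<pid>/cgroup rather than on a live PID. It is not `DeriveParentFromPid()` itself: the function names are hypothetical, and two details are assumptions, namely that the cgroup v2 entry is preferred over the v1 `name=systemd` entry, and that "deepest .slice component" means the path prefix ending at that component.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// cgroupPath picks one path from the contents of /proc/<pid>/cgroup.
// Each line has the form "hierarchy-ID:controller-list:path".
func cgroupPath(content string) (string, error) {
	var systemd, first string
	for _, line := range strings.Split(content, "\n") {
		parts := strings.SplitN(strings.TrimSpace(line), ":", 3)
		if len(parts) != 3 || !strings.HasPrefix(parts[2], "/") {
			continue // skip blank or malformed lines
		}
		switch {
		case parts[0] == "0" && parts[1] == "":
			return parts[2], nil // cgroup v2 unified hierarchy
		case parts[1] == "name=systemd" && systemd == "":
			systemd = parts[2] // cgroup v1 systemd hierarchy
		case first == "":
			first = parts[2] // any other v1 controller, used as a fallback
		}
	}
	if systemd != "" {
		return systemd, nil
	}
	if first != "" {
		return first, nil
	}
	return "", errors.New("no cgroup entry found")
}

// deepestSlice returns the path prefix ending at the last ".slice" component,
// or "" when the path contains none (for example the root cgroup "/").
func deepestSlice(path string) string {
	parts := strings.Split(path, "/")
	for i := len(parts) - 1; i >= 0; i-- {
		if strings.HasSuffix(parts[i], ".slice") {
			return strings.Join(parts[:i+1], "/")
		}
	}
	return ""
}

func main() {
	v2 := "0::/user.slice/user-1000.slice/session-3.scope\n"
	p, err := cgroupPath(v2)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(deepestSlice(p))
}
```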
Tests verify:
- Cgroup v2 format parsing (unified hierarchy)
- Cgroup v1 format parsing (systemd and cpu controllers)
- Extraction of deepest .slice component
- Error handling for invalid/empty files
- Edge cases (root cgroup, no slice components)

Signed-off-by: John Robbins <John.Robbins@amd.com>
Adds new boolean config field to enable user cgroup adoption.
When enabled, containers will automatically inherit their creator's
cgroup parent based on the API client's process ID.

This is the configuration knob for the cgroup adoption enforcement
feature, allowing administrators to enable/disable it via daemon.json
or CLI flags.

Signed-off-by: John Robbins <John.Robbins@amd.com>
Adds CLI flag registration for the cgroup adoption feature.
Administrators can enable it via:
  dockerd --adopt-user-cgroups

Placed near other security and cgroup-related flags for logical grouping.

Signed-off-by: John Robbins <John.Robbins@amd.com>
Adds peer credential middleware to the middleware chain, positioned
after version middleware and before authorization middleware.

The middleware extracts peer credentials from Unix socket connections
and makes them available in request context for downstream handlers and
enforcement logic.

Signed-off-by: John Robbins <John.Robbins@amd.com>
Add placeholder tests for daemon-level cgroup adoption enforcement.
Tests are marked as Skip until the applyCgroupAdoption method is
implemented in the next commit.

Tests will verify:
- Adoption works when enabled with valid peer credentials
- Error when peer credentials missing
- Rejection of user-specified cgroup parent
- Acceptance of matching cgroup parent values

Signed-off-by: John Robbins <John.Robbins@amd.com>
Adds applyCgroupAdoption() method that:
- Extracts peer credentials from request context
- Derives cgroup parent from the client's PID
- Enforces that containers cannot override the cgroup parent
- Rejects requests with error if user tries to set different parent

Integrated into adaptContainerSettings() which is called during
container creation. Only enforces when AdoptUserCgroups config is enabled.

The enforcement is strict: containers MUST run under their creator's
cgroup when adoption is enabled, with no exceptions.

Signed-off-by: John Robbins <John.Robbins@amd.com>
- TestCgroupAdoptionEnabled: Verifies containers inherit creator's cgroup
- TestCgroupAdoptionDisabled: Verifies feature is off by default
- TestCgroupAdoptionUserOverrideRejected: Verifies enforcement (rejects non-matching parent)
- TestCgroupAdoptionMatchingParentAccepted: Allows matching parent override
- TestCgroupAdoptionNoPeerCredentials: Handles missing peer credentials gracefully

Tests require root and use daemon.New() test harness.

Signed-off-by: John Robbins <John.Robbins@amd.com>
reboss added 6 commits May 11, 2026 10:33
Add ConnContext to http.Server to store the net.Conn in request context.
This is required for the peer credential middleware to reach the
underlying Unix socket file descriptor and read SO_PEERCRED from it.

Without this, r.Context().Value(http.LocalAddrContextKey) returns nil
and peer credentials cannot be extracted.

Signed-off-by: John Robbins <John.Robbins@amd.com>
Change cgroup adoption to use the entire cgroup hierarchy path
instead of extracting only the deepest .slice component. This
ensures containers properly inherit resource limits from SLURM
jobs, systemd scopes, and other complex cgroup hierarchies.

For example, a SLURM job's cgroup:
  /system.slice/slurmstepd.scope/job_123/step_0/user/task_0
will now be adopted in full, rather than just "system.slice".
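A small sketch of full-path adoption for the SLURM example in this commit message. It handles only the cgroup v2 entry; cgroup v1 handling is omitted, and the function name `unifiedCgroupPath` is hypothetical rather than taken from the PR.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// unifiedCgroupPath returns the full path of the cgroup v2 entry ("0::<path>")
// found in the contents of /proc/<pid>/cgroup, without trimming any components.
func unifiedCgroupPath(content string) (string, bool) {
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "0::/") {
			continue // not a cgroup v2 entry
		}
		// Clean removes duplicate or trailing slashes but keeps every component.
		return path.Clean(strings.TrimPrefix(line, "0::")), true
	}
	return "", false
}

func main() {
	content := "0::/system.slice/slurmstepd.scope/job_123/step_0/user/task_0\n"
	p, ok := unifiedCgroupPath(content)
	fmt.Println(p, ok)
}
```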

Add middleware to extract UID/GID/PID from Unix socket connections
using SO_PEERCRED. This allows API handlers to identify the client
process for security and resource management features.

The middleware stores credentials in the request context using
PeerCredKey, and uses a custom PeerConnKey to avoid conflicts with
http.LocalAddrContextKey which gets overwritten by the HTTP stack.

Register the peer credential middleware in the API server chain
and configure ConnContext to store connections in the request context.

Uses middleware.PeerConnKey to avoid conflicts with the standard
http.LocalAddrContextKey which the HTTP stack overwrites with the
local address value.
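The ConnContext wiring with a dedicated key can be sketched with the standard library alone. `http.Server.ConnContext` and `http.LocalAddrContextKey` are real net/http names; `peerConnKey`, `saveConn`, `report`, and `probe` are hypothetical, and the demo uses a TCP test server rather than the daemon's Unix socket.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
)

// peerConnKey is a private context key type, so it cannot collide with
// http.LocalAddrContextKey or any other package's key.
type peerConnKey struct{}

// saveConn has the signature that http.Server.ConnContext expects.
func saveConn(ctx context.Context, c net.Conn) context.Context {
	return context.WithValue(ctx, peerConnKey{}, c)
}

// report writes what a handler can find in the request context.
func report(w http.ResponseWriter, r *http.Request) {
	_, haveConn := r.Context().Value(peerConnKey{}).(net.Conn)
	_, haveAddr := r.Context().Value(http.LocalAddrContextKey).(net.Addr)
	fmt.Fprintf(w, "conn=%t addr=%t", haveConn, haveAddr)
}

// probe starts a throwaway server with ConnContext set and returns the report.
func probe() (string, error) {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(report))
	srv.Config.ConnContext = saveConn
	srv.Start()
	defer srv.Close()

	resp, err := http.Get(srv.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	out, err := probe()
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println(out) // prints "conn=true addr=true"
}
```

The handler sees the connection under the private key and a net.Addr under http.LocalAddrContextKey, which is why the two values need separate keys.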

Update cgroup adoption tests to expect full path instead of
deepest .slice component. Add test for SLURM cgroup hierarchy.

Add tests for peer credential middleware to verify:
- Middleware handles missing connections gracefully
- PeerConnKey is distinct from http.LocalAddrContextKey
- Credentials are properly stored in context