dstackai · Bihan · Jun 25, 2026
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -305,6 +305,7 @@ nav:
             - Exports: docs/concepts/exports.md
       - Guides:
           - CLI & API: docs/guides/cli-api.md
+          - Endpoint harness: docs/guides/endpoint-harness.md
           - Server deployment: docs/guides/server-deployment.md
           - Troubleshooting: docs/guides/troubleshooting.md
           - More:

diff --git a/mkdocs/docs/guides/endpoint-harness.md b/mkdocs/docs/guides/endpoint-harness.md
@@ -0,0 +1,177 @@
+---
+title: Endpoint harness
+description: Deploy inference endpoints with dstack endpoint create and the agent harness
+---
+
+# Endpoint harness
+
+The endpoint harness powers `dstack endpoint create`.
+It uses an LLM to generate a [`type: service`](../concepts/services.md) configuration,
+then deploys it through the same code path as [`dstack apply`](../reference/cli/dstack/apply.md).
+
+You describe what to deploy (model, GPU, backends, and other profile options). The harness:
+
+1. Asks an LLM to produce a service YAML (including container `commands`)
+2. Validates and saves the configuration
+3. Submits the run via dstack
+4. Monitors logs and, on failure, may ask the LLM to fix the config and redeploy
+
+The harness does **not** pick cloud offers or provision instances. dstack's scheduler
+does that after submission, the same way it does for a hand-written service config.
+
+
+## Quick start
+
+<div class="termy">
+
+```shell
+$ export DSTACK_HARNESS_API_KEY=sk-ant-...
+$ export DSTACK_HARNESS_MODEL=claude-sonnet-4-8
+$ dstack endpoint create \
+    --model meta-llama/Llama-3.1-8B-Instruct \
+    --gpu 24GB \
+    --max-attempts 3 \
+    -y
+```
+
+</div>
+
+`DSTACK_HARNESS_MODEL` is optional. If unset, the harness defaults to `claude-sonnet-4-6`
+for Anthropic.
+
+!!! note "`--max-attempts`"
+    Controls how many times the harness tries to deploy the endpoint. If the container
+    fails to start, it stops the run, asks the LLM to fix the configuration from the
+    error logs, and redeploys. Default is `3`. Set `--max-attempts 1` for a single
+    attempt with no retries.
+
+The command accepts the same resource and profile flags as [`dstack apply`](../reference/cli/dstack/apply.md)
+for services (`--gpu`, `--cpu`, `--memory`, `--disk`, `--backend`, `--region`, `--fleet`,
+`--max-price`, `--spot-policy`, and others). Run `dstack endpoint create --help` for the full list.
+
+## How it works
+
+```mermaid
+flowchart TD
+    A[dstack endpoint create] --> B[Build EndpointCreateParams from CLI]
+    B --> C["LLM: generate service YAML"]
+    C --> D[Validate with parse_apply_configuration]
+    D --> E[Apply CLI overrides via ServiceConfigurator]
+    E --> F["Save to .dstack-harness-configs/"]
+    F --> I[ServiceConfigurator.apply_configuration]
+    I --> M[Monitor container logs]
+    M --> N{Ready?}
+    N -->|yes| O[Print service URLs]
+    N -->|failed| P[Stop run]
+    P --> Q["LLM: fix YAML from error logs"]
+    Q --> R{Attempts left?}
+    R -->|yes| I
+    R -->|no| S[Give up]
+```
+
+Orchestration is **programmatic**: the harness drives the deployment from Python through
+`ServiceConfigurator` and never runs LLM-generated `dstack` shell commands. The LLM only
+authors the service configuration and the container `commands` that run on the GPU instance.
+
+
+## Relationship to `skills/dstack/SKILL.md`
+
+On every LLM call, the harness loads `skills/dstack/SKILL.md` (or the file passed with
+`--skill-path`) and appends it to the system prompt.
+
+## Prompts sent to the LLM
+
+### Call 1: Generate configuration
+
+Fixed prefix:
+
+```
+You generate dstack service configuration files for model inference endpoints.
+
+Rules:
+- Output a single valid YAML document for `type: service`
+- Do not wrap the YAML in markdown unless you also include the YAML body in a fenced block
+- Use only documented dstack service fields
+- Put secret values only as env var names in `env`, never inline values
+- Include `model`, `port`, `commands`, and `resources.gpu` when possible
+- Prefer `python: "3.12"` unless the user requests a custom image
+- User-provided CLI options in the request are mandatory: use the exact GPU, backends,
+  regions, fleets, CPU, memory, disk, and other resource/profile values given
+- Do not substitute different resource sizes or backends than those specified by the user
+- Do not invent unsupported CLI flags or YAML properties
+
+Reference skill:
+
+<entire contents of skills/dstack/SKILL.md>
+```
+
+
+Request built from the CLI options:
+
+```
+Generate a dstack service configuration for an inference endpoint.
+The user passed these CLI options. You MUST use them exactly in the YAML. Do not substitute different GPU memory, backends, regions, fleets, or other resource/profile values.
+{
+  "model": "meta-llama/Llama-3.1-8B-Instruct",
+  "name": "meta-llama-3-1-8b-instruct",
+  "gpu": "24GB"
+}
+
+Return only the YAML configuration.
+```
+
+
+### Call 2: Fix configuration
+
+Fixed prefix:
+
+```
+You fix dstack service configurations that failed to start on the GPU instance.
+
+You are given the previous configuration and the container error logs. Return a
+corrected single YAML document for `type: service`.
+
+Rules:
+- Change as little as possible to address the specific error in the logs
+- Keep `model`, `name`, and `resources` unless the error requires changing them
+- For vLLM KV-cache / out-of-memory errors, prefer adding serve flags such as
+  `--max-model-len` or `--gpu-memory-utilization` rather than changing the GPU
+- Keep secret values as env var names only, never inline values
+- Use only documented dstack fields and valid serving CLI flags
+- Do not invent unsupported CLI flags or YAML properties
+
+Reference skill:
+
+<entire contents of skills/dstack/SKILL.md>
+```
+
+
+
+Request built from the failed configuration and the error logs:
+
+````
+The following dstack service configuration failed to start:
+```yaml
+type: service
+name: meta-llama-3-1-8b-instruct
+model: meta-llama/Llama-3.1-8B-Instruct
+python: "3.12"
+port: 8000
+commands:
+  - |
+    pip install vllm
+    vllm serve meta-llama/Llama-3.1-8B-Instruct \
+      --host 0.0.0.0 \
+      --port 8000
+resources:
+  gpu: 24GB:1
+```
+
+Container error logs (tail):
+```
+...
+torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
+```
+
+Return only the corrected YAML configuration.
+````
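
The deploy, monitor, fix, and redeploy loop described in "How it works" can be sketched in a few lines of Python. This is a minimal illustration, not the harness implementation: `DeployResult`, `deploy_with_retries`, and the `deploy` and `fix` callables are hypothetical names standing in for the real deployment and LLM calls. As a simplification of the flowchart, the sketch checks for remaining attempts before asking for a fix, so no LLM call is spent on a configuration that will never be deployed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeployResult:
    """Outcome of one deploy attempt (hypothetical type, for illustration only)."""

    ready: bool
    error_logs: str = ""


def deploy_with_retries(
    config: str,
    deploy: Callable[[str], DeployResult],
    fix: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the configuration that became ready, or None once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = deploy(config)
        if result.ready:
            return config
        if attempt < max_attempts:
            # Only ask for a fix when another attempt will actually use it.
            config = fix(config, result.error_logs)
    return None
```

With `max_attempts=1` the loop deploys once and never calls `fix`, which matches the documented behavior of `--max-attempts 1`.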
diff --git a/skills/dstack/SKILL.md b/skills/dstack/SKILL.md
@@ -237,10 +237,17 @@ port: 8000
 model: meta-llama/Meta-Llama-3.1-8B-Instruct
 
 resources:
-  gpu: 80GB
+  gpu: 24GB
   disk: 200GB
 ```
 
+**GPU sizing rule**
+
+If `--gpu` is not provided:
+- 7B/8B models -> `gpu: 24GB`
+- 13B/14B models -> `gpu: 40GB` or `48GB`
+- 30B+ models -> `gpu: 80GB`
+
 **Service endpoints:**
 - Without gateway: `<server URL>/proxy/services/<project name>/<run name>/`
 - With gateway: `https://<run name>.<gateway domain>/`
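
The GPU sizing rule added to `SKILL.md` is written for the LLM to follow, but the same mapping can be expressed as a small function, which may help when reviewing what the model is expected to produce. The sketch below is an assumption-laden illustration: `default_gpu_memory` is a hypothetical helper, it guesses the parameter count from a `<number>B` token in the model name, and the cut-offs between the named tiers (for example treating everything up to 8B as the 7B/8B tier, and picking `40GB` where the rule allows `40GB` or `48GB`) are choices made here, not part of the rule.

```python
import re
from typing import Optional


def default_gpu_memory(model_name: str) -> Optional[str]:
    """Map a model name to a default `gpu` value, following the sizing rule.

    Hypothetical helper: returns None when the name has no "<number>B" token
    or when the size falls in a range the rule does not cover.
    """
    match = re.search(
        r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[bB](?![A-Za-z0-9])", model_name
    )
    if match is None:
        return None
    params_b = float(match.group(1))
    if params_b <= 8:
        return "24GB"  # 7B/8B tier (smaller models included by assumption)
    if params_b <= 14:
        return "40GB"  # 13B/14B tier; the rule also allows 48GB
    if params_b >= 30:
        return "80GB"  # 30B+ tier
    return None  # sizes between 14B and 30B are not covered by the rule
```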

diff --git a/src/dstack/_internal/cli/commands/endpoint.py b/src/dstack/_internal/cli/commands/endpoint.py
@@ -0,0 +1,147 @@
+import argparse
+import shlex
+from typing import cast
+
+from dstack._internal.cli.commands import APIBaseCommand
+from dstack._internal.cli.services.configurators.run import ServiceConfigurator
+from dstack._internal.cli.utils.common import console
+from dstack._internal.core.errors import CLIError
+from dstack._internal.core.models.configurations import TaskConfiguration
+from dstack._internal.harness import (
+    EndpointCreateParams,
+    deploy_service_configuration,
+    deploy_service_with_self_healing,
+)
+from dstack._internal.harness.generator import (
+    generate_service_configuration,
+    save_service_configuration,
+)
+
+
+class EndpointCommand(APIBaseCommand):
+    NAME = "endpoint"
+    DESCRIPTION = "Manage inference endpoints"
+    ACCEPT_EXTRA_ARGS = True
+
+    def _register(self):
+        super()._register()
+        self._parser.set_defaults(subfunc=self._print_help)
+        subparsers = self._parser.add_subparsers(dest="action")
+
+        create_parser = subparsers.add_parser(
+            "create",
+            help="Create an inference endpoint",
+            formatter_class=self._parser.formatter_class,
+        )
+        create_parser.add_argument(
+            "--model",
+            required=True,
+            metavar="NAME",
+            help="The model to deploy",
+        )
+        create_parser.add_argument(
+            "--skill-path",
+            metavar="PATH",
+            help="Path to [code]skills/dstack/SKILL.md[/]. Defaults to project skill file",
+        )
+        create_parser.add_argument(
+            "--dry-run",
+            action="store_true",
+            help="Generate and save the configuration without deploying",
+        )
+        create_parser.add_argument(
+            "-y",
+            "--yes",
+            help="Do not ask for confirmation",
+            action="store_true",
+        )
+        create_parser.add_argument(
+            "-d",
+            "--detach",
+            help="Exit immediately after submitting instead of streaming container logs",
+            action="store_true",
+        )
+        create_parser.add_argument(
+            "-v",
+            "--verbose",
+            help="Show all plan properties including those with default values",
+            action="store_true",
+        )
+        create_parser.add_argument(
+            "--force",
+            help="Force apply when no changes detected",
+            action="store_true",
+        )
+        create_parser.add_argument(
+            "--max-attempts",
+            type=int,
+            default=3,
+            metavar="N",
+            help=(
+                "Max deploy attempts. On container failure, the harness stops the run,"
+                " asks the model to fix the configuration from the error logs, and redeploys."
+                " Set to 1 to disable self-healing"
+            ),
+        )
+        ServiceConfigurator.register_args(create_parser)
+        create_parser.set_defaults(subfunc=self._create)
+
+    def _command(self, args: argparse.Namespace):
+        super()._command(args)
+        args.subfunc(args)
+
+    def _print_help(self, args: argparse.Namespace):
+        self._parser.print_help()
+
+    def _create(self, args: argparse.Namespace):
+        configurator_parser = ServiceConfigurator.get_parser()
+        _, unknown_args = configurator_parser.parse_known_args(args.extra_args)
+        if unknown_args:
+            raise CLIError(f"Unrecognized arguments: {shlex.join(unknown_args)}")
+
+        params = EndpointCreateParams.from_namespace(args, model=args.model)
+
+        with console.status("Generating service configuration..."):
+            configuration = generate_service_configuration(
+                params=params,
+                skill_path=args.skill_path,
+            )
+
+        configurator = ServiceConfigurator(api_client=self.api)
+        configurator.apply_args(cast(TaskConfiguration, configuration), args)
+        configuration.model = args.model
+
+        config_path = save_service_configuration(configuration)
+        console.print(f"Saved configuration to [code]{config_path}[/]")
+
+        if args.dry_run:
+            console.print("Dry run complete. Skipping deployment.")
+            return
+
+        apply_args = argparse.Namespace(
+            yes=args.yes,
+            detach=args.detach,
+            verbose=args.verbose,
+            force=args.force,
+        )
+
+        if args.detach:
+            deploy_service_configuration(
+                api=self.api,
+                configuration=configuration,
+                configuration_path=config_path,
+                command_args=apply_args,
+                configurator_args=args,
+            )
+            return
+
+        deploy_service_with_self_healing(
+            api=self.api,
+            configuration=configuration,
+            params=params,
+            configuration_path=config_path,
+            command_args=apply_args,
+            configurator_args=args,
+            skill_path=args.skill_path,
+            max_attempts=args.max_attempts,
+        )
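
`--max-attempts` is registered with `type=int`, so the parser itself accepts `0` or negative values; how the harness treats them is not visible in this diff. One option is to reject such values at parse time. The sketch below shows that approach with plain `argparse`; `positive_int` and `build_parser` are hypothetical names used only for illustration.

```python
import argparse


def positive_int(value: str) -> int:
    """argparse `type` callable that accepts only integers >= 1."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected an integer, got {value!r}")
    if number < 1:
        raise argparse.ArgumentTypeError(f"must be at least 1, got {number}")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Standalone parser with only the flag under discussion."""
    parser = argparse.ArgumentParser(prog="endpoint-create-sketch")
    parser.add_argument("--max-attempts", type=positive_int, default=3, metavar="N")
    return parser
```

With this in place, `--max-attempts 0` fails with a usage error instead of reaching the deploy loop.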
diff --git a/src/dstack/_internal/cli/main.py b/src/dstack/_internal/cli/main.py
@@ -8,6 +8,7 @@
 from dstack._internal.cli.commands.attach import AttachCommand
 from dstack._internal.cli.commands.completion import CompletionCommand
 from dstack._internal.cli.commands.delete import DeleteCommand
+from dstack._internal.cli.commands.endpoint import EndpointCommand
 from dstack._internal.cli.commands.event import EventCommand
 from dstack._internal.cli.commands.export import ExportCommand
 from dstack._internal.cli.commands.fleet import FleetCommand
@@ -67,6 +68,7 @@ def main():
     ApplyCommand.register(subparsers)
     AttachCommand.register(subparsers)
     DeleteCommand.register(subparsers)
+    EndpointCommand.register(subparsers)
     EventCommand.register(subparsers)
     ExportCommand.register(subparsers)
     FleetCommand.register(subparsers)

diff --git a/src/dstack/_internal/cli/services/configurators/run.py b/src/dstack/_internal/cli/services/configurators/run.py
@@ -203,6 +203,10 @@ def apply_configuration(
             console.print(detach_message)
             return
 
+        pre_attach_hook = getattr(command_args, "pre_attach_hook", None)
+        if pre_attach_hook is not None:
+            pre_attach_hook(run.name)
+
         abort_at_exit = False
         try:
             # We can attach to run multiple times if it goes from running to pending (retried).
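
The `pre_attach_hook` lookup uses `getattr` with a default, so existing callers that build `command_args` without the attribute are unaffected. A minimal standalone sketch of the same pattern follows; `run_pre_attach_hook` and the `seen` list are hypothetical and exist only to show both cases.

```python
import argparse
from typing import Callable, List, Optional


def run_pre_attach_hook(command_args: argparse.Namespace, run_name: str) -> None:
    """Call `command_args.pre_attach_hook(run_name)` if the attribute is set."""
    hook: Optional[Callable[[str], None]] = getattr(
        command_args, "pre_attach_hook", None
    )
    if hook is not None:
        hook(run_name)


seen: List[str] = []

# No hook attribute: nothing happens and no AttributeError is raised.
run_pre_attach_hook(argparse.Namespace(yes=True), "my-run")

# Hook attribute present: it receives the run name before attaching.
run_pre_attach_hook(argparse.Namespace(yes=True, pre_attach_hook=seen.append), "my-run")
```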