1 change: 1 addition & 0 deletions mkdocs.yml
@@ -305,6 +305,7 @@ nav:
- Exports: docs/concepts/exports.md
- Guides:
- CLI & API: docs/guides/cli-api.md
- Endpoint harness: docs/guides/endpoint-harness.md
- Server deployment: docs/guides/server-deployment.md
- Troubleshooting: docs/guides/troubleshooting.md
- More:
177 changes: 177 additions & 0 deletions mkdocs/docs/guides/endpoint-harness.md
@@ -0,0 +1,177 @@
---
title: Endpoint harness
description: Deploy inference endpoints with dstack endpoint create and the agent harness
---

# Endpoint harness

The endpoint harness powers `dstack endpoint create`.
It uses an LLM to generate a [`type: service`](../concepts/services.md) configuration,
then deploys it through the same code path as [`dstack apply`](../reference/cli/dstack/apply.md).

You describe what to deploy (model, GPU, backends, and other profile options). The harness:

1. Asks an LLM to produce a service YAML (including container `commands`)
2. Validates and saves the configuration
3. Submits the run via dstack
4. Monitors logs and, on failure, may ask the LLM to fix the config and redeploy

The harness does **not** pick cloud offers or provision instances. dstack's scheduler
does that after submission, the same way it does for a hand-written service config.


## Quick start

<div class="termy">

```shell
$ export DSTACK_HARNESS_API_KEY=sk-ant-...
$ export DSTACK_HARNESS_MODEL=claude-sonnet-4-8
$ dstack endpoint create \
--model meta-llama/Llama-3.1-8B-Instruct \
--gpu 24GB \
--max-attempts 3 \
-y
```

</div>

`DSTACK_HARNESS_MODEL` is optional. If unset, the harness defaults to `claude-sonnet-4-6`
for Anthropic.

!!! note "`--max-attempts`"
    Controls how many times the harness tries to deploy the endpoint. If the container
    fails to start, it stops the run, asks the LLM to fix the configuration from the
    error logs, and redeploys. Default is `3`. Set `--max-attempts 1` for a single
    attempt with no retries.

The command accepts the same resource and profile flags as [`dstack apply`](../reference/cli/dstack/apply.md)
for services (`--gpu`, `--cpu`, `--memory`, `--disk`, `--backend`, `--region`, `--fleet`,
`--max-price`, `--spot-policy`, and others). Run `dstack endpoint create --help` for the full list.

## How it works

```mermaid
flowchart TD
A[dstack endpoint create] --> B[Build EndpointCreateParams from CLI]
B --> C["LLM: generate service YAML"]
C --> D[Validate with parse_apply_configuration]
D --> E[Apply CLI overrides via ServiceConfigurator]
E --> F["Save to .dstack-harness-configs/"]
F --> I[ServiceConfigurator.apply_configuration]
I --> M[Monitor container logs]
M --> N{Ready?}
N -->|yes| O[Print service URLs]
N -->|failed| P[Stop run]
P --> Q["LLM: fix YAML from error logs"]
Q --> R{attempts left?}
R -->|yes| I
R -->|no| S[Give up]
```

Orchestration is **programmatic**: the harness drives the deployment from Python through
`ServiceConfigurator` and never runs LLM-generated `dstack` shell commands. The LLM only
authors the service configuration, including the container `commands` that run on the GPU instance.
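
The retry behaviour in the flowchart can be sketched as a plain loop. This is a minimal illustration, not the harness's actual code: `DeployResult`, `deploy_with_retries`, and the `deploy` and `fix` callables are hypothetical names standing in for "submit and monitor the run" and "ask the LLM to fix the YAML". Unlike the flowchart, the sketch checks for remaining attempts before calling `fix`, so that no LLM call is spent on a configuration that will never be deployed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeployResult:
    """Outcome of one deploy attempt (hypothetical type for this sketch)."""

    ready: bool
    error_logs: str = ""


def deploy_with_retries(
    config: str,
    deploy: Callable[[str], DeployResult],
    fix: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the config that became ready, or None once attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = deploy(config)
        if result.ready:
            return config
        if attempt < max_attempts:
            # Only ask for a corrected config if it will actually be deployed.
            config = fix(config, result.error_logs)
    return None
```

With `max_attempts=1` the loop deploys once and never calls `fix`, which matches the "single attempt with no retries" behaviour described for `--max-attempts 1`.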


## Relationship to `skills/dstack/SKILL.md`

On every LLM call, the harness loads `skills/dstack/SKILL.md` and appends it to the system
prompt.
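
As a rough sketch of what "appends it to the system prompt" could look like, the helper below reads the skill file and places it under a `Reference skill:` marker, mirroring the prompt layout shown in this guide. `build_system_prompt` is a hypothetical name used for illustration; the harness's real function and its handling of `--skill-path` may differ.

```python
from pathlib import Path


def build_system_prompt(fixed_prefix: str, skill_path: Path) -> str:
    """Join the fixed rules with the skill file contents (illustrative only)."""
    skill_text = skill_path.read_text(encoding="utf-8")
    # The layout mirrors the documented prompt: rules first, skill text last.
    return f"{fixed_prefix.rstrip()}\n\nReference skill:\n\n{skill_text}"
```

Because the skill text is re-read on each call, edits to `SKILL.md` would take effect on the next LLM call under this assumption.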

## Prompts sent to the LLM

### Call 1: Generate configuration

Fixed prefix:

```
You generate dstack service configuration files for model inference endpoints.

Rules:
- Output a single valid YAML document for `type: service`
- Do not wrap the YAML in markdown unless you also include the YAML body in a fenced block
- Use only documented dstack service fields
- Put secret values only as env var names in `env`, never inline values
- Include `model`, `port`, `commands`, and `resources.gpu` when possible
- Prefer `python: "3.12"` unless the user requests a custom image
- User-provided CLI options in the request are mandatory: use the exact GPU, backends,
regions, fleets, CPU, memory, disk, and other resource/profile values given
- Do not substitute different resource sizes or backends than those specified by the user
- Do not invent unsupported CLI flags or YAML properties

Reference skill:

<entire contents of skills/dstack/SKILL.md>
```
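
The rules above allow the model to return YAML inside a fenced block, so the caller presumably has to cope with both fenced and bare responses. The sketch below shows one way to do that. `extract_yaml` is a hypothetical helper, not a function confirmed to exist in the harness, and the fence marker is built at runtime purely to keep this example easy to embed in Markdown.

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime to avoid nesting fences here

# Matches an optional "yaml"/"yml" info string, then captures the fenced body.
_FENCED = re.compile(
    FENCE + r"(?:yaml|yml)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_yaml(response: str) -> str:
    """Return the YAML body of an LLM response, fenced or not."""
    match = _FENCED.search(response)
    body = match.group(1) if match else response
    return body.strip() + "\n"
```

The result would still need to go through validation (`parse_apply_configuration` in the flowchart) before being saved.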


Request built from the CLI options:

```
Generate a dstack service configuration for an inference endpoint.
The user passed these CLI options. You MUST use them exactly in the YAML. Do not substitute different GPU memory, backends, regions, fleets, or other resource/profile values.
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "name": "meta-llama-3-1-8b-instruct",
  "gpu": "24GB"
}

Return only the YAML configuration.
```
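
A request like the one above can be assembled from a dictionary of CLI options. The following is a sketch under assumptions: `build_generate_request` is a hypothetical name, and dropping unset (`None`) options is a guess at how the harness keeps the JSON limited to what the user passed.

```python
import json
from typing import Any, Mapping


def build_generate_request(cli_options: Mapping[str, Any]) -> str:
    """Render the generation request with only the options the user set."""
    passed = {key: value for key, value in cli_options.items() if value is not None}
    return "\n".join(
        [
            "Generate a dstack service configuration for an inference endpoint.",
            "The user passed these CLI options. You MUST use them exactly in the YAML. "
            "Do not substitute different GPU memory, backends, regions, fleets, "
            "or other resource/profile values.",
            json.dumps(passed, indent=2),
            "",
            "Return only the YAML configuration.",
        ]
    )
```

Serializing the options as JSON keeps the mandatory values unambiguous, which supports the "use them exactly" rule in the system prompt.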


### Call 2: Fix configuration

Fixed prefix:

```
You fix dstack service configurations that failed to start on the GPU instance.

You are given the previous configuration and the container error logs. Return a
corrected single YAML document for `type: service`.

Rules:
- Change as little as possible to address the specific error in the logs
- Keep `model`, `name`, and `resources` unless the error requires changing them
- For vLLM KV-cache / out-of-memory errors, prefer adding serve flags such as
`--max-model-len` or `--gpu-memory-utilization` rather than changing the GPU
- Keep secret values as env var names only, never inline values
- Use only documented dstack fields and valid serving CLI flags
- Do not invent unsupported CLI flags or YAML properties

Reference skill:

<entire contents of skills/dstack/SKILL.md>
```



Request built from the failed configuration and the error logs:

````
The following dstack service configuration failed to start:
```yaml
type: service
name: meta-llama-3-1-8b-instruct
model: meta-llama/Llama-3.1-8B-Instruct
python: "3.12"
port: 8000
commands:
  - |
    pip install vllm
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
      --host 0.0.0.0 \
      --port 8000
resources:
  gpu: 24GB:1
```

Container error logs (tail):
```
...
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate ...
```

Return only the corrected YAML configuration.
````
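
The fix request includes only the tail of the container logs. A minimal sketch of how such a request might be assembled follows. `tail_lines`, `build_fix_request`, and the default of 50 lines are assumptions for illustration; the source does not say how many lines the harness keeps.

```python
from typing import List

FENCE = "`" * 3  # three backticks, built at runtime to avoid nesting fences here


def tail_lines(logs: str, max_lines: int = 50) -> str:
    """Keep only the last max_lines lines of the container logs."""
    if max_lines <= 0:
        return ""
    return "\n".join(logs.splitlines()[-max_lines:])


def build_fix_request(config_yaml: str, error_logs: str, max_lines: int = 50) -> str:
    """Render the failed config and the log tail in the documented layout."""
    parts: List[str] = [
        "The following dstack service configuration failed to start:",
        FENCE + "yaml",
        config_yaml.rstrip("\n"),
        FENCE,
        "",
        "Container error logs (tail):",
        FENCE,
        tail_lines(error_logs, max_lines),
        FENCE,
        "",
        "Return only the corrected YAML configuration.",
    ]
    return "\n".join(parts)
```

Truncating to the tail keeps the request small while still including the lines nearest the failure, where errors such as `torch.cuda.OutOfMemoryError` usually appear.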
9 changes: 8 additions & 1 deletion skills/dstack/SKILL.md
@@ -237,10 +237,17 @@ port: 8000
model: meta-llama/Meta-Llama-3.1-8B-Instruct

resources:
-  gpu: 80GB
+  gpu: 24GB
  disk: 200GB
```

**GPU sizing rule**

If `--gpu` is not provided:
- 7B/8B models -> `gpu: 24GB`
- 13B/14B models -> `gpu: 40GB` or `48GB`
- 30B+ models -> `gpu: 80GB`
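
Expressed as code, the sizing rule might look like the sketch below. `default_gpu` is a hypothetical helper, and the thresholds for sizes the rule does not mention (for example 20B) are assumptions: anything from 13B up to 30B is treated like the 13B/14B tier, and anything smaller like the 7B/8B tier.

```python
from typing import Optional


def default_gpu(params_billions: float, user_gpu: Optional[str] = None) -> str:
    """Pick a GPU memory spec from model size unless the user set --gpu."""
    if user_gpu is not None:
        # An explicit --gpu value always wins over the sizing rule.
        return user_gpu
    if params_billions >= 30:
        return "80GB"
    if params_billions >= 13:
        return "40GB"  # the rule also allows 48GB for this tier
    return "24GB"
```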

**Service endpoints:**
- Without gateway: `<server URL>/proxy/services/<project name>/<run name>/`
- With gateway: `https://<run name>.<gateway domain>/`
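
The two endpoint patterns can be illustrated with a small helper. `service_url` and its parameter names are hypothetical; only the two URL templates come from the text above.

```python
from typing import Optional


def service_url(
    run_name: str,
    *,
    server_url: Optional[str] = None,
    project: Optional[str] = None,
    gateway_domain: Optional[str] = None,
) -> str:
    """Build a service endpoint URL for the gateway or the proxy case."""
    if gateway_domain:
        return f"https://{run_name}.{gateway_domain}/"
    if not server_url or not project:
        raise ValueError("server_url and project are required without a gateway")
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/"
```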
147 changes: 147 additions & 0 deletions src/dstack/_internal/cli/commands/endpoint.py
@@ -0,0 +1,147 @@
import argparse
import shlex
from typing import cast

from dstack._internal.cli.commands import APIBaseCommand
from dstack._internal.cli.services.configurators.run import ServiceConfigurator
from dstack._internal.cli.utils.common import console
from dstack._internal.core.errors import CLIError
from dstack._internal.core.models.configurations import ServiceConfiguration
from dstack._internal.harness import (
    EndpointCreateParams,
    deploy_service_configuration,
    deploy_service_with_self_healing,
)
from dstack._internal.harness.generator import (
    generate_service_configuration,
    save_service_configuration,
)


class EndpointCommand(APIBaseCommand):
    NAME = "endpoint"
    DESCRIPTION = "Manage inference endpoints"
    ACCEPT_EXTRA_ARGS = True

    def _register(self):
        super()._register()
        self._parser.set_defaults(subfunc=self._print_help)
        subparsers = self._parser.add_subparsers(dest="action")

        create_parser = subparsers.add_parser(
            "create",
            help="Create an inference endpoint",
            formatter_class=self._parser.formatter_class,
        )
        create_parser.add_argument(
            "--model",
            required=True,
            metavar="NAME",
            help="The model to deploy",
        )
        create_parser.add_argument(
            "--skill-path",
            metavar="PATH",
            help="Path to [code]skills/dstack/SKILL.md[/]. Defaults to project skill file",
        )
        create_parser.add_argument(
            "--dry-run",
            action="store_true",
            help="Generate and save the configuration without deploying",
        )
        create_parser.add_argument(
            "-y",
            "--yes",
            help="Do not ask for confirmation",
            action="store_true",
        )
        create_parser.add_argument(
            "-d",
            "--detach",
            help="Exit immediately after submitting instead of streaming container logs",
            action="store_true",
        )
        create_parser.add_argument(
            "-v",
            "--verbose",
            help="Show all plan properties including those with default values",
            action="store_true",
        )
        create_parser.add_argument(
            "--force",
            help="Force apply when no changes detected",
            action="store_true",
        )
        create_parser.add_argument(
            "--max-attempts",
            type=int,
            default=3,
            metavar="N",
            help=(
                "Max deploy attempts. On container failure, the harness stops the run,"
                " asks the model to fix the configuration from the error logs, and redeploys."
                " Set to 1 to disable self-healing"
            ),
        )
        ServiceConfigurator.register_args(create_parser)
        create_parser.set_defaults(subfunc=self._create)

    def _command(self, args: argparse.Namespace):
        super()._command(args)
        args.subfunc(args)

    def _print_help(self, args: argparse.Namespace):
        self._parser.print_help()

    def _create(self, args: argparse.Namespace):
        configurator_parser = ServiceConfigurator.get_parser()
        _, unknown_args = configurator_parser.parse_known_args(args.extra_args)
        if unknown_args:
            raise CLIError(f"Unrecognized arguments: {shlex.join(unknown_args)}")

        params = EndpointCreateParams.from_namespace(args, model=args.model)

        with console.status("Generating service configuration..."):
            configuration = generate_service_configuration(
                params=params,
                skill_path=args.skill_path,
            )

        configurator = ServiceConfigurator(api_client=self.api)
        configurator.apply_args(cast(ServiceConfiguration, configuration), args)
        configuration.model = args.model

        config_path = save_service_configuration(configuration)
        console.print(f"Saved configuration to [code]{config_path}[/]")

        if args.dry_run:
            console.print("Dry run complete. Skipping deployment.")
            return

        apply_args = argparse.Namespace(
            yes=args.yes,
            detach=args.detach,
            verbose=args.verbose,
            force=args.force,
        )

        if args.detach:
            deploy_service_configuration(
                api=self.api,
                configuration=configuration,
                configuration_path=config_path,
                command_args=apply_args,
                configurator_args=args,
            )
            return

        deploy_service_with_self_healing(
            api=self.api,
            configuration=configuration,
            params=params,
            configuration_path=config_path,
            command_args=apply_args,
            configurator_args=args,
            skill_path=args.skill_path,
            max_attempts=args.max_attempts,
        )
2 changes: 2 additions & 0 deletions src/dstack/_internal/cli/main.py
@@ -8,6 +8,7 @@
from dstack._internal.cli.commands.attach import AttachCommand
from dstack._internal.cli.commands.completion import CompletionCommand
from dstack._internal.cli.commands.delete import DeleteCommand
from dstack._internal.cli.commands.endpoint import EndpointCommand
from dstack._internal.cli.commands.event import EventCommand
from dstack._internal.cli.commands.export import ExportCommand
from dstack._internal.cli.commands.fleet import FleetCommand
@@ -67,6 +68,7 @@ def main():
    ApplyCommand.register(subparsers)
    AttachCommand.register(subparsers)
    DeleteCommand.register(subparsers)
    EndpointCommand.register(subparsers)
    EventCommand.register(subparsers)
    ExportCommand.register(subparsers)
    FleetCommand.register(subparsers)
4 changes: 4 additions & 0 deletions src/dstack/_internal/cli/services/configurators/run.py
@@ -203,6 +203,10 @@ def apply_configuration(
            console.print(detach_message)
            return

        pre_attach_hook = getattr(command_args, "pre_attach_hook", None)
        if pre_attach_hook is not None:
            pre_attach_hook(run.name)

        abort_at_exit = False
        try:
            # We can attach to run multiple times if it goes from running to pending (retried).
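
The hook added in this hunk is looked up with `getattr`, so it is optional: namespaces built by the regular `dstack apply` path simply lack the attribute. The standalone sketch below reproduces that pattern outside dstack. `run_pre_attach_hook` is a hypothetical wrapper written for illustration, and what the harness actually does inside its hook is not shown in this diff.

```python
import argparse
from typing import Callable, Optional


def run_pre_attach_hook(command_args: argparse.Namespace, run_name: str) -> bool:
    """Call command_args.pre_attach_hook(run_name) if present; report whether it ran."""
    hook: Optional[Callable[[str], None]] = getattr(command_args, "pre_attach_hook", None)
    if hook is None:
        # Callers that never set the attribute keep their old behaviour.
        return False
    hook(run_name)
    return True
```

A caller opts in by attaching a callable to the namespace, for example `argparse.Namespace(yes=True, pre_attach_hook=my_callback)`, where `my_callback` is any function that accepts the run name.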