This document describes how PyLet works and why it's designed this way.
┌──────────────┐ poke ┌──────────────────┐
│ Controller │──────────────>│ Scheduler │
│ (FastAPI) │ │ (in-process) │
└──────┬───────┘ └────────┬─────────┘
│ │
│ ┌──────────────┐ │
└────────>│ SQLite │<──────┘
│ (WAL) │
└──────────────┘
^
│ heartbeat (long-poll)
┌───────────────┴───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ Worker │ │ Worker │ │ Worker │
└─────────┘ └─────────┘ └─────────┘
Head node: Runs the controller (FastAPI server) and scheduler. Single source of truth via SQLite.
Workers: Connect to head, receive desired state via heartbeat, reconcile local processes.
PyLet has exactly one concept: the instance, a process with a resource allocation.
An instance has:
- A command to run
- Resource requirements (CPU, GPU, memory)
- A lifecycle (PENDING → ASSIGNED → RUNNING → COMPLETED/FAILED)
- An optional endpoint (host:port) for service discovery
That's it. No pods, replicas, services, deployments, or jobs. Higher-level systems compose instances via labels and application logic.
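The whole model fits in a handful of fields. A minimal sketch of the concept as a dataclass (field names like `cpus` and `memory_mb` are illustrative, not PyLet's actual schema):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Status(str, Enum):
    PENDING = "PENDING"
    ASSIGNED = "ASSIGNED"
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"

@dataclass
class Instance:
    command: str                     # the process to run
    cpus: float = 1.0                # resource requirements
    gpus: int = 0
    memory_mb: int = 512
    labels: dict = field(default_factory=dict)  # for higher-level composition
    status: Status = Status.PENDING  # lifecycle state
    endpoint: Optional[str] = None   # host:port, set once the service binds

inst = Instance(command="python serve.py", gpus=1, labels={"app": "llm"})
```

Everything a higher-level system needs (replicas, services, jobs) is built by creating many such instances and filtering on `labels`.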
PyLet provides precise GPU control because research workloads need it. These features emerged from real use cases (ServerlessLLM, etc.).
- Physical GPU indices (`gpu_indices`): Request specific GPUs by index. Exposed via `CUDA_VISIBLE_DEVICES`.
- GPU sharing (`exclusive`): When `false`, GPUs aren't reserved exclusively. Enables daemons (e.g., model storage servers) to coexist with inference instances.
- Worker placement (`target_worker`): Target a specific worker (e.g., where a model is cached).
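As a hedged illustration, a submission combining all three controls might look like the following; the field names match the features above, but the exact request shape and the daemon command are assumptions:

```python
# Hypothetical request body for a model-storage daemon that shares GPUs.
request = {
    "command": "python -m storage_server",   # illustrative daemon command
    "gpu_indices": [0, 1],        # pin to physical GPUs 0 and 1
    "exclusive": False,           # share those GPUs with inference instances
    "target_worker": "worker-a",  # run where the model is already cached
}

# On the worker, pinning maps directly onto CUDA_VISIBLE_DEVICES:
env = {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in request["gpu_indices"])}
```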
Workers don't receive commands. They receive desired state and reconcile:
Desired (from head): [instance_a@attempt=2, instance_b@attempt=1]
Actual (local): [instance_a@attempt=1, instance_c@attempt=1]
Reconcile:
- instance_a@attempt=1: stale attempt → kill
- instance_b@attempt=1: not running → start
- instance_c@attempt=1: not desired → kill
This is declarative: head says "what should be", worker figures out "how to get there".
Benefits:
- Crash recovery: worker restarts, receives desired state, reconciles
- Network partition: stale workers don't affect correctness (attempt fencing)
- Simplicity: no command queue, no ack/retry logic
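The reconcile step above can be sketched as a pure function over `{instance_id: attempt}` maps (names and return shape are illustrative). Note that killing a stale attempt and starting the current one fall out of the same comparison:

```python
def reconcile(desired: dict[str, int], actual: dict[str, int]):
    """Compare desired vs actual state; return (to_kill, to_start)."""
    to_kill, to_start = [], []
    for iid, attempt in actual.items():
        # Running something not desired, or at a stale attempt -> kill.
        if desired.get(iid) != attempt:
            to_kill.append((iid, attempt))
    for iid, attempt in desired.items():
        # Desired but not running at the current attempt -> start.
        if actual.get(iid) != attempt:
            to_start.append((iid, attempt))
    return to_kill, to_start

kill, start = reconcile(
    desired={"instance_a": 2, "instance_b": 1},
    actual={"instance_a": 1, "instance_c": 1},
)
# kill: instance_a@1 (stale attempt), instance_c@1 (not desired)
# start: instance_a@2 (current attempt), instance_b@1 (not running)
```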
PENDING ──[assign]──> ASSIGNED ──[start]──> RUNNING ──[exit]──> COMPLETED
│ │ │ │
│ │ │ FAILED
│ │ │
│ └─[worker offline]────┴──> UNKNOWN
│ │
└──[cancel]──────────────────────────────────> CANCELLED
| State | Meaning |
|---|---|
| PENDING | Waiting for worker assignment |
| ASSIGNED | Worker selected, process not yet started |
| RUNNING | Process is running |
| UNKNOWN | Worker went offline, outcome unknown |
| COMPLETED | Process exited with code 0 |
| FAILED | Process exited with code != 0 |
| CANCELLED | User cancelled the instance |
Valid transitions (see `schemas.py:VALID_TRANSITIONS`):
PENDING -> ASSIGNED, CANCELLED
ASSIGNED -> RUNNING, UNKNOWN, FAILED, CANCELLED
RUNNING -> COMPLETED, FAILED, UNKNOWN, CANCELLED
UNKNOWN -> RUNNING, COMPLETED, FAILED, CANCELLED
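The table above can be sketched as a dict keyed by source state, in the spirit of `schemas.py:VALID_TRANSITIONS` (the actual code shape may differ):

```python
# Terminal states (COMPLETED, FAILED, CANCELLED) have no outgoing edges.
VALID_TRANSITIONS = {
    "PENDING":  {"ASSIGNED", "CANCELLED"},
    "ASSIGNED": {"RUNNING", "UNKNOWN", "FAILED", "CANCELLED"},
    "RUNNING":  {"COMPLETED", "FAILED", "UNKNOWN", "CANCELLED"},
    "UNKNOWN":  {"RUNNING", "COMPLETED", "FAILED", "CANCELLED"},
}

def can_transition(src: str, dst: str) -> bool:
    return dst in VALID_TRANSITIONS.get(src, set())

assert can_transition("RUNNING", "UNKNOWN")        # worker went offline
assert can_transition("UNKNOWN", "RUNNING")        # worker came back
assert not can_transition("COMPLETED", "RUNNING")  # terminal, no way back
```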
Cancellation uses a timestamp model (like Kubernetes `deletionTimestamp`):
- User requests cancel → `cancellation_requested_at` is set
- Instance excluded from desired state
- Worker sees absence, sends SIGTERM
- Grace period (default 30s)
- SIGKILL if still running
- Worker reports CANCELLED
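The SIGTERM → grace period → SIGKILL sequence is standard process management; a minimal sketch of the worker-side kill path (function name is illustrative):

```python
import signal
import subprocess

def stop_process(proc: subprocess.Popen, grace_s: float = 30.0) -> int:
    """SIGTERM, wait up to grace_s seconds, then SIGKILL. Returns exit code."""
    proc.send_signal(signal.SIGTERM)       # polite shutdown request
    try:
        return proc.wait(timeout=grace_s)  # give it the grace period
    except subprocess.TimeoutExpired:
        proc.kill()                        # SIGKILL: no more waiting
        return proc.wait()

# A well-behaved process exits on SIGTERM well inside the grace period.
p = subprocess.Popen(["sleep", "60"])
code = stop_process(p, grace_s=5.0)  # on POSIX, code == -signal.SIGTERM
```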
Workers use generation-based long-polling:
- Worker sends heartbeat with `last_seen_gen` and instance reports
- Controller processes reports (with attempt fencing)
- Controller waits for generation change or timeout (30s)
- Controller returns new `gen` and `desired_instances`
Cancel-and-reissue: When local state changes (process starts/exits), worker cancels in-flight heartbeat and issues a new one immediately.
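A simplified sketch of the controller-side generation wait, using an `asyncio.Event` in place of whatever primitive PyLet actually uses:

```python
import asyncio

class GenState:
    """Tracks the desired-state generation and lets heartbeats wait on it."""
    def __init__(self) -> None:
        self.gen = 0
        self._changed = asyncio.Event()

    def bump(self) -> None:
        # Any change to desired state bumps the generation and wakes waiters.
        self.gen += 1
        self._changed.set()
        self._changed = asyncio.Event()  # fresh event for the next change

    async def wait_for_change(self, last_seen: int, timeout: float = 30.0) -> int:
        # Block until the generation moves past last_seen, or time out.
        # The timeout doubles as the liveness check mentioned above.
        if self.gen <= last_seen:
            try:
                await asyncio.wait_for(self._changed.wait(), timeout)
            except asyncio.TimeoutError:
                pass
        return self.gen

async def demo() -> int:
    st = GenState()
    waiter = asyncio.create_task(st.wait_for_change(last_seen=0, timeout=5))
    await asyncio.sleep(0.05)  # let the heartbeat start waiting
    st.bump()                  # e.g. a new instance was submitted
    return await waiter

gen = asyncio.run(demo())
```

The heartbeat returns immediately on a change and after at most `timeout` seconds otherwise, which is what makes the long-poll double as a liveness signal.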
Each instance has an attempt counter that increments on each assignment:
Instance assigned to worker A (attempt=1)
Network partition
Instance reassigned to worker B (attempt=2)
Worker A reconnects, reports for attempt=1
Controller ignores (stale attempt)
Only reports matching the current attempt can mutate state. This prevents:
- Stale reports from affecting current state
- Duplicate execution after partition
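A minimal sketch of the fencing check on the controller (dict-based for brevity; the real code presumably works on Pydantic models):

```python
def apply_report(instance: dict, report: dict) -> bool:
    """Apply a worker report only if it carries the current attempt."""
    if report["attempt"] != instance["current_attempt"]:
        return False  # stale: e.g. worker A reconnecting after a partition
    instance["status"] = report["status"]
    return True

inst = {"id": "instance_a", "current_attempt": 2, "status": "RUNNING"}
# Worker A reconnects and reports for attempt 1: ignored.
assert not apply_report(inst, {"attempt": 1, "status": "FAILED"})
assert inst["status"] == "RUNNING"
# Worker B reports for the current attempt: applied.
assert apply_report(inst, {"attempt": 2, "status": "COMPLETED"})
```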
These must always hold:
- Allocated resources on a worker ≤ the worker's total resources
- An instance has at most one worker executing it
- Reports with attempt != current_attempt are ignored
| Status | assigned_to | endpoint | Process exists? | Resources held? |
|---|---|---|---|---|
| PENDING | None | None | No | No |
| ASSIGNED | set | None | No | Yes (reserved) |
| RUNNING | set | set* | Yes | Yes |
| UNKNOWN | set | maybe | Unknown | Yes |
| COMPLETED | set | stale | No | No |
| FAILED | set | stale | No | No |
| CANCELLED | set/None | None | No | No |
*Set once the instance binds to a port.

`endpoint != None` implies the instance is or was RUNNING. The endpoint may be stale after completion.
| File | Purpose |
|---|---|
| `controller.py` | Core scheduling and state management |
| `worker.py` | Process management and reconciliation |
| `schemas.py` | Pydantic models, state transitions |
| `db.py` | SQLite persistence layer |
| `server.py` | FastAPI HTTP endpoints |
| `client.py` | Async HTTP client |
All state under ~/.pylet/:
| Path | Contents |
|---|---|
| `~/.pylet/pylet.db` | SQLite database (WAL mode) |
| `~/.pylet/run/` | Worker local state files |
| `~/.pylet/logs/` | Instance log files |
Why SQLite:
- Single file, no external dependencies
- WAL mode handles concurrent reads
- Good enough for ~100 nodes (target scale)
- State survives head restart

Why long-polling:
- Workers get updates immediately (no polling delay)
- Natural distributed rate limiting
- Timeout = liveness check built-in

Why reconciliation instead of commands:
- Crash recovery is automatic
- No command queue to manage
- Network partitions don't cause duplicate execution
- Simpler than command/ack protocols

Why a single head (no consensus):
- Dramatically simpler than consensus
- Single source of truth, no split-brain
- Good enough for target scale (~100 nodes)
- Can always run head on reliable hardware
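Enabling WAL is a one-line pragma at connection time; a minimal sketch (a temp path stands in for `~/.pylet/pylet.db`):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "pylet.db")
conn = sqlite3.connect(path)
# WAL mode: readers no longer block the single writer, and vice versa.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
conn.close()
```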
Instance stdout/stderr is captured using a sidecar pattern:
- Worker wraps each command: `(cmd) 2>&1 | python3 -m pylet.log_sidecar <log_dir> <instance_id>`
- Sidecar writes to rotating log files in `~/.pylet/logs/`
- Worker runs an HTTP server (port 15599) for log retrieval
- Head proxies log requests to workers via `/instances/{id}/logs`
Why sidecar? The log capture process survives even if the instance crashes, ensuring logs aren't lost. The pipe pattern also allows log rotation without instance cooperation.
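A sketch of what such a sidecar's core loop could look like, using stdlib rotation; the real `pylet.log_sidecar`'s arguments and rotation policy are assumptions here, and the real process would read its lines from `sys.stdin`:

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler
from typing import Iterable

def pump(lines: Iterable[str], log_path: str,
         max_bytes: int = 10 * 1024 * 1024, backups: int = 3) -> None:
    """Append each line of the instance's output to a rotating log file."""
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(message)s"))
    log = logging.Logger("pylet-sidecar")  # standalone logger, no global state
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    for line in lines:
        log.info(line.rstrip("\n"))  # EOF on the pipe ends the loop cleanly
    handler.close()

path = os.path.join(tempfile.mkdtemp(), "inst-1.log")
pump(["starting up\n", "listening on :8000\n"], path)
```

Because the sidecar only sees a pipe, it keeps draining buffered output even after the instance dies, and rotation happens entirely on its side of the pipe.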