
PyLet Architecture

This document describes how PyLet works and why it's designed this way.

Overview

┌──────────────┐     poke      ┌──────────────────┐
│  Controller  │──────────────>│    Scheduler     │
│  (FastAPI)   │               │   (in-process)   │
└──────┬───────┘               └────────┬─────────┘
       │                                │
       │         ┌──────────────┐       │
       └────────>│    SQLite    │<──────┘
                 │     (WAL)    │
                 └──────────────┘
                       ^
                       │ heartbeat (long-poll)
       ┌───────────────┴───────────────┐
       │               │               │
  ┌────┴────┐    ┌────┴────┐    ┌────┴────┐
  │ Worker  │    │ Worker  │    │ Worker  │
  └─────────┘    └─────────┘    └─────────┘

Head node: Runs the controller (FastAPI server) and scheduler. Single source of truth via SQLite.

Workers: Connect to head, receive desired state via heartbeat, reconcile local processes.

The One Primitive: Instance

PyLet has exactly one concept: the instance, a process with a resource allocation.

An instance has:

  • A command to run
  • Resource requirements (CPU, GPU, memory)
  • A lifecycle (PENDING → ASSIGNED → RUNNING → COMPLETED/FAILED)
  • An optional endpoint (host:port) for service discovery

That's it. No pods, replicas, services, deployments, or jobs. Higher-level systems compose instances via labels and application logic.
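The fields above can be sketched as a minimal data model. This is illustrative only, assuming hypothetical names (`InstanceSpec`, `Instance`), not PyLet's actual schema in schemas.py:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class InstanceSpec:
    command: str                                 # the command to run
    cpus: float = 1.0                            # resource requirements
    gpus: int = 0
    memory_mb: int = 512
    labels: dict = field(default_factory=dict)   # hook for higher-level composition

@dataclass
class Instance:
    spec: InstanceSpec
    status: str = "PENDING"                      # PENDING -> ASSIGNED -> RUNNING -> ...
    endpoint: Optional[str] = None               # "host:port", set once the process binds

# A service-like workload is just an instance with labels and an endpoint.
svc = Instance(InstanceSpec(command="python -m serve", gpus=1,
                            labels={"app": "inference"}))
```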

Fine-Grained GPU Scheduling

PyLet provides precise GPU control because research workloads need it. These features emerged from real use cases (ServerlessLLM, etc.).

  • Physical GPU indices (gpu_indices): Request specific GPUs by index. Exposed via CUDA_VISIBLE_DEVICES.
  • GPU sharing (exclusive): When false, GPUs aren't reserved exclusively. Enables daemons (e.g., model storage servers) to coexist with inference instances.
  • Worker placement (target_worker): Target a specific worker (e.g., where a model is cached).
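As a sketch of how `gpu_indices` could map to the environment of a spawned process (the helper name is hypothetical; only the `CUDA_VISIBLE_DEVICES` mechanism comes from the text):

```python
import os

def gpu_env(gpu_indices):
    """Build a child-process environment exposing only the requested GPUs."""
    env = dict(os.environ)
    # CUDA enumerates only the listed physical devices, in this order.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_indices)
    return env

env = gpu_env([0, 2])   # instance sees GPUs 0 and 2 as devices 0 and 1
```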

Worker Reconciliation Model

Workers don't receive commands. They receive desired state and reconcile:

Desired (from head):  [instance_a@attempt=2, instance_b@attempt=1]
Actual (local):       [instance_a@attempt=1, instance_c@attempt=1]

Reconcile:
  - instance_a@attempt=1: stale attempt → kill
  - instance_b@attempt=1: not running → start
  - instance_c@attempt=1: not desired → kill

This is declarative: the head says "what should be"; the worker figures out "how to get there".
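The reconcile step above can be sketched as set arithmetic over `(instance_id, attempt)` pairs. This is a conceptual sketch, not the actual worker.py implementation:

```python
def reconcile(desired, actual):
    """Return (to_kill, to_start) given sets of (instance_id, attempt) pairs."""
    desired_ids = {inst for inst, _ in desired}
    # Kill anything not desired at all, or running under a stale attempt.
    to_kill = [(inst, att) for inst, att in actual
               if inst not in desired_ids or (inst, att) not in desired]
    # Start anything desired that isn't already running at that attempt.
    to_start = [(inst, att) for inst, att in desired if (inst, att) not in actual]
    return to_kill, to_start

desired = {("instance_a", 2), ("instance_b", 1)}
actual = {("instance_a", 1), ("instance_c", 1)}
kill, start = reconcile(desired, actual)
# kill: instance_a@1 (stale attempt), instance_c@1 (not desired)
# start: instance_a@2, instance_b@1
```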

Benefits:

  • Crash recovery: worker restarts, receives desired state, reconciles
  • Network partition: stale workers don't affect correctness (attempt fencing)
  • Simplicity: no command queue, no ack/retry logic

Instance Lifecycle

PENDING ──[assign]──> ASSIGNED ──[start]──> RUNNING ──[exit]──> COMPLETED
    │                    │                     │                    │
    │                    │                     │                 FAILED
    │                    │                     │
    │                    └─[worker offline]────┴──> UNKNOWN
    │                                                   │
    └──[cancel]──────────────────────────────────> CANCELLED
State      Meaning
PENDING    Waiting for worker assignment
ASSIGNED   Worker selected, process not yet started
RUNNING    Process is running
UNKNOWN    Worker went offline, outcome unknown
COMPLETED  Process exited with code 0
FAILED     Process exited with code != 0
CANCELLED  User cancelled the instance

Valid transitions (see schemas.py:VALID_TRANSITIONS):

PENDING -> ASSIGNED, CANCELLED
ASSIGNED -> RUNNING, UNKNOWN, FAILED, CANCELLED
RUNNING -> COMPLETED, FAILED, UNKNOWN, CANCELLED
UNKNOWN -> RUNNING, COMPLETED, FAILED, CANCELLED
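The transition table reads naturally as a mapping plus a guard. A sketch mirroring the rules above (schemas.py holds the authoritative VALID_TRANSITIONS; the helper name here is illustrative):

```python
VALID_TRANSITIONS = {
    "PENDING":  {"ASSIGNED", "CANCELLED"},
    "ASSIGNED": {"RUNNING", "UNKNOWN", "FAILED", "CANCELLED"},
    "RUNNING":  {"COMPLETED", "FAILED", "UNKNOWN", "CANCELLED"},
    "UNKNOWN":  {"RUNNING", "COMPLETED", "FAILED", "CANCELLED"},
    # COMPLETED / FAILED / CANCELLED are terminal: no outgoing edges.
}

def can_transition(frm, to):
    return to in VALID_TRANSITIONS.get(frm, set())

assert can_transition("UNKNOWN", "RUNNING")        # worker came back
assert not can_transition("COMPLETED", "RUNNING")  # terminal state
```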

Cancellation

Cancellation uses a timestamp model (like Kubernetes deletionTimestamp):

  1. User requests cancel → cancellation_requested_at is set
  2. Instance excluded from desired state
  3. Worker sees absence, sends SIGTERM
  4. Grace period (default 30s)
  5. SIGKILL if still running
  6. Worker reports CANCELLED
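Steps 3-5 amount to a standard SIGTERM / grace / SIGKILL sequence on the worker. A sketch, assuming a `subprocess.Popen` handle (not the actual worker.py code):

```python
import signal
import time

def stop_process(proc, grace_seconds=30):
    """SIGTERM, wait up to grace_seconds, then SIGKILL. Returns the exit code."""
    proc.send_signal(signal.SIGTERM)       # step 3: polite shutdown
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:     # step 4: grace period
        if proc.poll() is not None:
            return proc.returncode         # exited within grace period
        time.sleep(0.1)
    proc.kill()                            # step 5: SIGKILL
    proc.wait()
    return proc.returncode
```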

Heartbeat Protocol

Workers use generation-based long-polling:

  1. Worker sends heartbeat with last_seen_gen and instance reports
  2. Controller processes reports (with attempt fencing)
  3. Controller waits for generation change or timeout (30s)
  4. Controller returns new gen and desired_instances

Cancel-and-reissue: When local state changes (process starts/exits), worker cancels in-flight heartbeat and issues a new one immediately.
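The worker side of this protocol is a simple loop. A conceptual sketch with hypothetical callables (`post_heartbeat` stands in for the HTTP call the controller holds open until the generation changes or the 30s timeout fires):

```python
def heartbeat_loop(post_heartbeat, collect_reports, apply_desired, stop):
    """Long-poll heartbeat: send gen + reports, receive new gen + desired state."""
    last_seen_gen = 0
    while not stop():
        # Controller blocks this call until gen > last_seen_gen or timeout.
        reply = post_heartbeat(gen=last_seen_gen, reports=collect_reports())
        last_seen_gen = reply["gen"]
        apply_desired(reply["desired_instances"])   # reconcile against local state
```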

Attempt-Based Fencing

Each instance has an attempt counter that increments on each assignment:

Instance assigned to worker A (attempt=1)
Network partition
Instance reassigned to worker B (attempt=2)
Worker A reconnects, reports for attempt=1
Controller ignores (stale attempt)

Only reports matching the current attempt can mutate state. This prevents:

  • Stale reports from affecting current state
  • Duplicate execution after partition
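The fencing check itself is one comparison at report-processing time. A sketch with illustrative names:

```python
def apply_report(current_attempts, report):
    """Accept a worker report only if its attempt matches the current one."""
    instance_id, attempt, status = report
    if current_attempts.get(instance_id) != attempt:
        return False                  # stale attempt: ignore, never mutate state
    # ... mutate instance state from `status` here ...
    return True

current = {"instance_x": 2}           # instance_x was reassigned (attempt=2)
assert apply_report(current, ("instance_x", 1, "RUNNING")) is False  # worker A, stale
assert apply_report(current, ("instance_x", 2, "RUNNING")) is True   # worker B
```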

System Invariants

These must always hold:

1. Resource Conservation

Allocated resources on worker ≤ worker's total resources

2. Assignment Uniqueness

An instance has at most one worker executing it.

3. Attempt Fencing

Reports with attempt != current_attempt are ignored.

4. Status-Reality Consistency

Status     assigned_to  endpoint  Process exists?  Resources held?
PENDING    None         None      No               No
ASSIGNED   set          None      No               Yes (reserved)
RUNNING    set          set*      Yes              Yes
UNKNOWN    set          maybe     Unknown          Yes
COMPLETED  set          stale     No               No
FAILED     set          stale     No               No
CANCELLED  set/None     None      No               No

*The endpoint is set once the instance binds to a port.

5. Endpoint Validity

endpoint != None implies the instance is or was RUNNING. The endpoint may be stale after completion.

Components

File           Purpose
controller.py  Core scheduling and state management
worker.py      Process management and reconciliation
schemas.py     Pydantic models, state transitions
db.py          SQLite persistence layer
server.py      FastAPI HTTP endpoints
client.py      Async HTTP client

Data Storage

All state under ~/.pylet/:

Path               Contents
~/.pylet/pylet.db  SQLite database (WAL mode)
~/.pylet/run/      Worker local state files
~/.pylet/logs/     Instance log files

Design Decisions

Why SQLite?

  • Single file, no external dependencies
  • WAL mode handles concurrent reads
  • Good enough for ~100 nodes (target scale)
  • State survives head restart

Why Long-Poll Heartbeat?

  • Workers get updates immediately (no polling delay)
  • Natural distributed rate limiting
  • Timeout = liveness check built-in

Why Declarative Reconciliation?

  • Crash recovery is automatic
  • No command queue to manage
  • Network partitions don't cause duplicate execution
  • Simpler than command/ack protocols

Why Single Head Node?

  • Dramatically simpler than consensus
  • Single source of truth, no split-brain
  • Good enough for target scale (~100 nodes)
  • Can always run head on reliable hardware

Log Capture

Instance stdout/stderr is captured using a sidecar pattern:

  1. Worker wraps each command: (cmd) 2>&1 | python3 -m pylet.log_sidecar <log_dir> <instance_id>
  2. Sidecar writes to rotating log files in ~/.pylet/logs/
  3. Worker runs an HTTP server (port 15599) for log retrieval
  4. Head proxies log requests to workers via /instances/{id}/logs

Why sidecar? The log capture process survives even if the instance crashes, ensuring logs aren't lost. The pipe pattern also allows log rotation without instance cooperation.
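The wrapping in step 1 is plain string composition. A sketch (the `pylet.log_sidecar` module path and pipe shape come from the text; the helper name is illustrative):

```python
def wrap_with_sidecar(cmd, log_dir, instance_id):
    """Wrap a command so combined stdout/stderr flows through the log sidecar."""
    # The subshell keeps the pipe intact regardless of what `cmd` does;
    # 2>&1 merges stderr into the stream the sidecar captures.
    return f"({cmd}) 2>&1 | python3 -m pylet.log_sidecar {log_dir} {instance_id}"

wrapped = wrap_with_sidecar("python train.py", "~/.pylet/logs", "abc123")
```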