Draft
34 commits
0947b1e
refactor(server): extract state module and introduce StateBackend abs…
Yunnglin May 29, 2026
7df3d66
feat(server): replace Ray metrics with OpenTelemetry observability (t…
Yunnglin May 29, 2026
cc7eed8
fix(server): remove duplicate config methods in ServerStateProxy
Yunnglin May 29, 2026
81bcd6d
feat(server): add persistence layer with FileBackend, RedisBackend, a…
Yunnglin May 29, 2026
b8a27b3
feat(server): integrate telemetry into Ray Serve workers and fix pers…
Yunnglin May 29, 2026
0d021c4
fix(server): replace FastAPIInstrumentor with custom tracing middlewa…
Yunnglin May 29, 2026
2d783f4
update dockerfile
Yunnglin Jun 1, 2026
e75deeb
fix(server): harden telemetry/persistence wiring and middleware order
Yunnglin Jun 1, 2026
0819aff
docs(observability): add LGTM-based docker-compose stack for telemetry
Yunnglin Jun 1, 2026
c8439b4
test(server): add client-API contract harness and baseline snapshot (…
Yunnglin Jun 1, 2026
5c33080
refactor(server): convert TaskQueueConfig to Pydantic with field cons…
Yunnglin Jun 1, 2026
2a074bf
feat(server): introduce typed ServerConfig aggregate root (Phase 0c)
Yunnglin Jun 1, 2026
a558f32
refactor(server): direct-backend ServerState, drop detached actor (Ph…
Yunnglin Jun 1, 2026
2940f9c
feat(server): mock model + sampler backends with case-sensitive dispa…
Yunnglin Jun 1, 2026
e4ed2ba
feat(server): business-layer tracing + correlation + resource metrics…
Yunnglin Jun 1, 2026
81cc51f
feat(server): typer CLI with launch-time config-drift validation (Pha…
Yunnglin Jun 1, 2026
65775ad
feat(server): trace context carrier for cross-deployment propagation …
Yunnglin Jun 1, 2026
d55766b
docs(server): observability + server-configuration guides (Phase 5)
Yunnglin Jun 1, 2026
db48a1c
fix(server): address self-review findings (Phase 0a–5)
Yunnglin Jun 2, 2026
5064e78
test(server): Docker-backed integration tests (Phase 0d/3/4/5)
Yunnglin Jun 2, 2026
6a673af
test(server): OTLP trace integration test runs against Jaeger fallback
Yunnglin Jun 2, 2026
afcd573
fix(server): bind OTLP LoggingHandler to twinkle logger so server log…
Yunnglin Jun 2, 2026
7959a26
feat(observability): multi-user SFT demo + declare redis as optional …
Yunnglin Jun 2, 2026
e51a76f
style: pre-commit pass — flake8 / isort / yapf / pyupgrade / quote fixes
Yunnglin Jun 2, 2026
4b59671
chore: gitignore .kiro/ (local spec/planning notes)
Yunnglin Jun 2, 2026
cc29139
style: convert double-quoted f-strings to single quotes (CI Python 3.11)
Yunnglin Jun 2, 2026
c807814
fix(server): address code-review gaps from server-config-observabilit…
Yunnglin Jun 2, 2026
77113bd
refactor(server): drop ServerStateProxy alias, use ServerState directly
Yunnglin Jun 2, 2026
fad76f3
docs(observability): add load.py to populate every Grafana overview p…
Yunnglin Jun 2, 2026
0142f8e
fix(server): MockSampler accepts handler kwargs; load.py uses TELEMET…
Yunnglin Jun 2, 2026
90fb3e1
fix(server): lazy-start ServerState cleanup loop on first request
Yunnglin Jun 2, 2026
c9b77c1
docs(observability): load.py uses server-issued session_id, requires …
Yunnglin Jun 2, 2026
74b21f4
chore: drop unfinalized mock/observability surface from docs + cookbook
Yunnglin Jun 2, 2026
8cd73a8
docs(zh): drop 服务配置.md too — same unfinalized refactor
Yunnglin Jun 2, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -146,6 +146,7 @@ images
/custom/
megatron_output/
.qoder
.kiro/

# Pytorch
*.pth
2 changes: 1 addition & 1 deletion Dockerfile
@@ -37,7 +37,7 @@ RUN pip install flash-linear-attention -U --no-cache-dir
RUN pip install numpy==2.2 --no-cache-dir

# Install tinker, ray, and other deps
RUN pip install --no-cache-dir tinker==0.16.1 "ray[serve]" transformers peft<=0.18 accelerate -U
RUN pip install --no-cache-dir tinker==0.16.1 "ray[serve]" transformers "peft<=0.18" accelerate redis opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp -U

# Clone and install twinkle, checkout to latest v-tag
RUN git clone https://github.com/modelscope/twinkle.git
13 changes: 12 additions & 1 deletion cookbook/client/server/megatron/server_config.yaml
@@ -9,6 +9,17 @@ http_options:
host: 0.0.0.0 # Listen on all network interfaces
port: 9000 # Port number for the server

# Persistence configuration for ServerState (sessions, models, futures, ...).
# Top-level placement makes the launcher propagate this to every Ray worker
# via env vars, so the configured backend is used regardless of which
# deployment initializes the ServerState actor first.
# mode: memory | file | redis
# file_path: required for `file` mode
# redis_url / key_prefix: required for `redis` mode
# persistence:
# mode: file
# file_path: /tmp/twinkle_state.json

# Applications: each entry defines a service component deployed on the server
applications:

@@ -84,7 +95,7 @@ applications:
route_prefix: /api/v1/model/Qwen/Qwen3.6-27B
import_path: model
args:
use_megatron: true # Use Megatron-LM backend
backend: megatron # Use Megatron-LM backend
model_id: "ms://Qwen/Qwen3.6-27B" # ModelScope model identifier
max_length: 65536 # model max length
max_loras: 3 # model max loras
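The comment block added to this config states three rules for `persistence`: `mode` is one of `memory`, `file`, or `redis`; `file_path` is required for `file` mode; `redis_url` and `key_prefix` are required for `redis` mode. A minimal sketch of those rules follows, using only the standard library. `PersistenceConfig` and `parse_persistence` are hypothetical names chosen for illustration, not the classes introduced by this PR, and the default of `memory` when the block is absent is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MODES = ('memory', 'file', 'redis')


@dataclass
class PersistenceConfig:
    """Hypothetical stand-in for the `persistence:` block (not the PR's class)."""
    mode: str = 'memory'  # assumed default when the block is omitted
    file_path: Optional[str] = None
    redis_url: Optional[str] = None
    key_prefix: Optional[str] = None

    def validate(self) -> None:
        # Rule 1: mode must be one of the three documented values.
        if self.mode not in VALID_MODES:
            raise ValueError(f'mode must be one of {VALID_MODES}, got {self.mode!r}')
        # Rule 2: file mode needs a path.
        if self.mode == 'file' and not self.file_path:
            raise ValueError('file_path is required for file mode')
        # Rule 3: redis mode needs both a URL and a key prefix.
        if self.mode == 'redis' and not (self.redis_url and self.key_prefix):
            raise ValueError('redis_url and key_prefix are required for redis mode')


def parse_persistence(raw: Optional[dict]) -> PersistenceConfig:
    """Build and validate a config from the parsed YAML mapping (or None)."""
    cfg = PersistenceConfig(**(raw or {}))
    cfg.validate()
    return cfg
```

For example, `parse_persistence({'mode': 'file', 'file_path': '/tmp/twinkle_state.json'})` passes, while `parse_persistence({'mode': 'redis'})` raises `ValueError`.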
2 changes: 1 addition & 1 deletion cookbook/client/server/megatron/server_config_4b.yaml
@@ -38,7 +38,7 @@ applications:
route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
import_path: model
args:
use_megatron: true
backend: megatron
model_id: "ms://Qwen/Qwen3.5-4B" # ModelScope model identifier
max_length: 10240
nproc_per_node: 2 # Number of GPU processes per node
81 changes: 46 additions & 35 deletions cookbook/client/server/transformer/server_config.yaml
@@ -9,6 +9,17 @@ http_options:
host: 0.0.0.0 # Listen on all network interfaces
port: 8000 # Port number for the server

# Persistence configuration for ServerState (sessions, models, futures, ...).
# Top-level placement makes the launcher propagate this to every Ray worker
# via env vars, so the configured backend is used regardless of which
# deployment initializes the ServerState actor first.
# mode: memory | file | redis
# file_path: required for `file` mode
# redis_url / key_prefix: required for `redis` mode
persistence:
mode: file
file_path: /tmp/twinkle_state.json

# Applications: each entry defines a service component deployed on the server
applications:

@@ -38,7 +49,7 @@ applications:
route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
import_path: model
args:
use_megatron: false # Use HuggingFace Transformers backend
backend: transformers # Model backend: transformers | megatron
model_id: "ms://Qwen/Qwen3.5-4B" # ModelScope model identifier
max_length: 10240
nproc_per_node: 1 # Number of GPU processes per node
@@ -64,43 +75,43 @@ applications:
num_cpus: 0.1
runtime_env:
env_vars:
TWINKLE_TRUST_REMOTE_CODE: "0"
TWINKLE_TRUST_REMOTE_CODE: "1"

# 3. Sampler Service - Runs inference / sampling using vLLM engine
# Used for generating text from the model (e.g., evaluating LoRA results).
- name: sampler-Qwen3.5-4B
route_prefix: /api/v1/sampler/Qwen/Qwen3.5-4B
import_path: sampler
args:
model_id: "ms://Qwen/Qwen3.5-4B" # ModelScope model identifier
nproc_per_node: 2 # Number of GPU processes per node
sampler_type: vllm # Inference engine: 'vllm' (fast) or 'torch' (TorchSampler)
engine_args: # vLLM engine-specific settings
max_model_len: 4096 # Maximum sequence length the engine supports
gpu_memory_utilization: 0.5 # Fraction of GPU memory to use (0.0-1.0)
enable_lora: true # Allow loading LoRA adapters during inference
logprobs_mode: processed_logprobs # Logprobs mode for sampling results
device_group: # Logical device group for the sampler
name: sampler
ranks: 1 # Number of GPUs to use
device_type: cuda
device_mesh:
device_type: cuda
dp_size: 1
queue_config:
rps_limit: 100 # Max requests per second
tps_limit: 100000 # Max tokens per second
deployments:
- name: SamplerManagement
autoscaling_config:
min_replicas: 1
max_replicas: 1
target_ongoing_requests: 16
ray_actor_options:
num_cpus: 0.1
runtime_env:
env_vars:
TWINKLE_TRUST_REMOTE_CODE: "0"
# - name: sampler-Qwen3.5-4B
# route_prefix: /api/v1/sampler/Qwen/Qwen3.5-4B
# import_path: sampler
# args:
# model_id: "ms://Qwen/Qwen3.5-4B" # ModelScope model identifier
# nproc_per_node: 2 # Number of GPU processes per node
# sampler_type: vllm # Inference engine: 'vllm' (fast) or 'torch' (TorchSampler)
# engine_args: # vLLM engine-specific settings
# max_model_len: 4096 # Maximum sequence length the engine supports
# gpu_memory_utilization: 0.5 # Fraction of GPU memory to use (0.0-1.0)
# enable_lora: true # Allow loading LoRA adapters during inference
# logprobs_mode: processed_logprobs # Logprobs mode for sampling results
# device_group: # Logical device group for the sampler
# name: sampler
# ranks: 1 # Number of GPUs to use
# device_type: cuda
# device_mesh:
# device_type: cuda
# dp_size: 1
# queue_config:
# rps_limit: 100 # Max requests per second
# tps_limit: 100000 # Max tokens per second
# deployments:
# - name: SamplerManagement
# autoscaling_config:
# min_replicas: 1
# max_replicas: 1
# target_ongoing_requests: 16
# ray_actor_options:
# num_cpus: 0.1
# runtime_env:
# env_vars:
# TWINKLE_TRUST_REMOTE_CODE: "1"

# 4. Processor Service
- name: processor
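All three cookbook configs in this diff replace the boolean `use_megatron` with the string key `backend` (`transformers | megatron`). For anyone carrying their own configs forward, the mapping can be sketched as below. `migrate_model_args` is a hypothetical helper written for this note; nothing in the diff indicates that the server performs this translation itself.

```python
def migrate_model_args(args: dict) -> dict:
    """Return a copy of `args` with legacy `use_megatron` replaced by `backend`.

    Hypothetical helper: mirrors the edits made by hand in the cookbook YAML.
    """
    migrated = dict(args)  # do not mutate the caller's mapping
    if 'use_megatron' in migrated:
        legacy = migrated.pop('use_megatron')
        # An explicit `backend` wins if both keys are present.
        migrated.setdefault('backend', 'megatron' if legacy else 'transformers')
    return migrated
```

Applied to `{'use_megatron': False, 'model_id': 'ms://Qwen/Qwen3.5-4B'}`, this yields the same `backend: transformers` entry that the transformer config now carries.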
14 changes: 13 additions & 1 deletion cookbook/client/twinkle/self_host/self_cognition.py
@@ -24,7 +24,7 @@
base_model = 'Qwen/Qwen3.5-4B'
base_url = 'http://localhost:8000'
api_key = 'EMPTY_API_KEY'
save_dir = '/model'
save_dir = '/tmp/twinkle_sft_output'


# Step 2: Initialize the Twinkle client to communicate with the remote server.
@@ -108,8 +108,10 @@ def train():
start_step = progress['cur_step']

# Step 7: Run the training loop
max_steps = 10 # Limit to 10 steps for quick verification
logger.info(model.get_train_configs().model_dump())

global_step = 0
for epoch in range(3):
logger.info(f'Starting epoch {epoch}')
for cur_step, batch in enumerate(dataloader, start=start_step + 1):
@@ -128,12 +130,22 @@ def train():
# # Advance the learning rate scheduler by one step
# model.lr_step()

global_step += 1

# Log the loss every 2 steps (aligned with gradient accumulation)
if cur_step % 2 == 0:
# Print metric
metric = model.calculate_metric(is_training=True)
logger.info(f'Current is step {cur_step} of {len(dataloader)}, metric: {metric.result}')

# Stop after max_steps
if global_step >= max_steps:
logger.info(f'Reached max_steps={max_steps}, stopping training.')
break

if global_step >= max_steps:
break

# Step 8: Save the trained checkpoint
twinkle_path = model.save(
name=f'twinkle-epoch-{epoch}',
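The `max_steps` change needs two `break` statements because `break` only leaves the innermost loop: one check sits inside the batch loop and a second, identical check sits in the epoch loop. The control flow can be isolated as in the sketch below, with the model calls stripped out. `run_training` and its parameters are illustrative names, not part of the cookbook script.

```python
def run_training(num_epochs: int, batches_per_epoch: int, max_steps: int):
    """Count steps across epochs and stop once `max_steps` is reached.

    Illustrative only: the real loop iterates a dataloader and calls the model.
    """
    global_step = 0
    epoch = 0
    for epoch in range(num_epochs):
        for _cur_step in range(1, batches_per_epoch + 1):
            # forward / backward / optimizer step would happen here
            global_step += 1
            if global_step >= max_steps:
                break  # leaves the batch loop only
        if global_step >= max_steps:
            break  # leaves the epoch loop as well
    return epoch, global_step
```

With 3 epochs of 4 batches and `max_steps=10`, the function returns `(2, 10)`. The loop variable `epoch` stays bound after the early exit, which matters in the script because the checkpoint name in Step 8 is built from it.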
7 changes: 7 additions & 0 deletions pyproject.toml
@@ -14,8 +14,12 @@ dependencies = [
"safetensors",
"peft>=0.11.0,<=0.19.0",
"transformers",
"typer>=0.9.0",
]

[project.scripts]
twinkle-server = "twinkle.server.cli:main"

[project.optional-dependencies]
transformers = [
"accelerate",
@@ -27,6 +31,9 @@ megatron = ["megatron-core>=0.12.0", "transformer-engine[pytorch]", "mcore_bridg
vllm = ["vllm>=0.11"]
ray = ["ray[serve]"]
tinker = ["tinker==0.14.0"]
test = ["hypothesis>=6.0"]
telemetry = ["psutil>=5.9.0", "pynvml>=11.0.0"]
redis = ["redis>=5.0"]
docs = [
"sphinx>=5.3.0,<6.0.0",
"docutils>=0.16.0,<0.17.0",
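The new `[project.scripts]` entry maps the command name `twinkle-server` to the object reference `twinkle.server.cli:main`, that is, "import the module before the colon, then look up the attribute after it". The sketch below shows that lookup in simplified form. `resolve_entry_point` is a hypothetical helper for illustration, not what installers actually run, and the examples use standard-library targets so they work without twinkle installed.

```python
import importlib


def resolve_entry_point(spec: str):
    """Resolve a 'module:attr' object reference to the object it names.

    Simplified illustration of how a console-script target is located.
    """
    module_name, _, attr_path = spec.partition(':')
    obj = importlib.import_module(module_name)
    if attr_path:
        # Dotted attribute paths such as 'pkg.mod:Class.method' are allowed.
        for attr in attr_path.split('.'):
            obj = getattr(obj, attr)
    return obj
```

For instance, `resolve_entry_point('json:dumps')([1, 2])` returns `'[1, 2]'`. The generated `twinkle-server` wrapper does the equivalent for `twinkle.server.cli:main` and passes the return value to `sys.exit`.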
119 changes: 12 additions & 107 deletions src/twinkle/server/__main__.py
@@ -1,117 +1,22 @@
# Copyright (c) ModelScope Contributors. All rights reserved.
"""
CLI entry point for Twinkle Server.
"""CLI entry point for Twinkle Server.

Thin shim — delegates to the typer-based :mod:`twinkle.server.cli` so the
``python -m twinkle.server`` command and the ``twinkle-server`` console
script share one implementation.

Usage::

Usage:
# From config file
python -m twinkle.server --config server_config.yaml
python -m twinkle.server launch --config server_config.yaml
python -m twinkle.server check-config --config server_config.yaml
python -m twinkle.server print-config --config server_config.yaml
python -m twinkle.server clear persistence --config server_config.yaml
"""
from __future__ import annotations

import argparse
import os
import sys
from pathlib import Path

from twinkle import get_logger

logger = get_logger()


def create_parser() -> argparse.ArgumentParser:
"""Create the argument parser."""
parser = argparse.ArgumentParser(
prog='python -m twinkle.server',
description='Twinkle Server Launcher - Unified launcher supporting both Tinker and Twinkle clients',
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Start server from YAML config file
python -m twinkle.server --config server_config.yaml
""",
)

# Config file option
parser.add_argument(
'-c',
'--config',
type=str,
required=True,
metavar='PATH',
help='Path to YAML configuration file (required)',
)

# Ray options
parser.add_argument(
'--namespace',
type=str,
metavar='NS',
help="Ray namespace (default: 'twinkle_cluster')",
)

# Runtime options
parser.add_argument(
'--log-level',
type=str,
default='INFO',
choices=['DEBUG', 'INFO', 'WARNING', 'ERROR'],
metavar='LEVEL',
help='Logging level (default: INFO)',
)

return parser


def main(args: list[str] | None = None) -> int:
"""
Main entry point for the CLI.

Args:
args: Command line arguments (uses sys.argv if None)

Returns:
Exit code (0 for success, non-zero for error)
"""
parser = create_parser()
parsed_args = parser.parse_args(args)

try:
from twinkle.server.launcher import launch_server

# Apply log level so that all loggers (including those created later)
# pick up the user-specified level via the LOG_LEVEL env var that
# get_logger() already reads.
os.environ['LOG_LEVEL'] = parsed_args.log_level

config_path = Path(parsed_args.config)
if not config_path.exists():
logger.error(f'Config file not found: {config_path}')
return 1

launch_server(
config_path=config_path,
ray_namespace=parsed_args.namespace,
)

return 0

except KeyboardInterrupt:
logger.info('Server stopped by user')
return 0
except FileNotFoundError as e:
logger.error(f'File not found: {e}')
return 1
except ValueError as e:
logger.error(f'Configuration error: {e}')
return 1
except ImportError as e:
logger.error(f'Import error: {e}')
logger.error('Make sure all required dependencies are installed')
return 1
except Exception as e:
logger.exception(f'Unexpected error: {e}')
return 1

from twinkle.server.cli import main

if __name__ == '__main__':
sys.exit(main())
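The rewritten docstring replaces the single `--config` invocation with subcommands (`launch`, `check-config`, `print-config`, `clear persistence`). The PR implements these with typer in `twinkle.server.cli`, which is not shown in this part of the diff. The sketch below only illustrates the command shape, using `argparse` from the standard library as a stand-in. The handler body is a placeholder and the nested `clear persistence` command is omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser: argparse here, whereas the PR itself uses typer."""
    parser = argparse.ArgumentParser(prog='twinkle-server')
    sub = parser.add_subparsers(dest='command', required=True)
    for name in ('launch', 'check-config', 'print-config'):
        cmd = sub.add_parser(name)
        cmd.add_argument('--config', required=True, metavar='PATH')
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    # Placeholder: a real handler would load the YAML and act on it.
    print(f'{args.command}: {args.config}')
    return 0
```

Calling `main(['check-config', '--config', 'server_config.yaml'])` parses the same argument layout as the second line of the usage block above.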
5 changes: 5 additions & 0 deletions src/twinkle/server/cli/__init__.py
@@ -0,0 +1,5 @@
# Copyright (c) ModelScope Contributors. All rights reserved.
"""Twinkle Server CLI (typer)."""
from .app import app, main

__all__ = ['app', 'main']