Skip to content

Server config + observability refactor#210

Draft
Yunnglin wants to merge 34 commits into
mainfrom
refact_config
Draft

Server config + observability refactor#210
Yunnglin wants to merge 34 commits into
mainfrom
refact_config

Conversation

@Yunnglin
Copy link
Copy Markdown
Collaborator

@Yunnglin Yunnglin commented Jun 2, 2026

Continues from #209 (auto-closed by GitHub during a force-push transition; same branch, same final state, history is now clean of .kiro/ planning notes).

Summary

End-to-end refactor of src/twinkle/server to close the configuration-abstraction and observability gaps. Implements all 20 requirements (R1–R20) and 94/94 task checkboxes across 9 delivery phases (0a–0d, 1–5) plus a self-review fix pass and Docker-backed integration tests.

Client-facing API (/tinker/* + /twinkle/*) is frozen byte-for-bytetests/contract/client_api_baseline.json snapshots all 79 routes + their request/response schemas; the contract test re-runs after every phase. Server-internal layout, config field names, and the de-Actor refactor are breaking changes per R8/R20.

What changed

  • Phase 0atests/contract/client_api_harness.py + baseline (cross-cutting freeze guard)
  • Phase 0bTaskQueueConfig dataclass → Pydantic with field constraints
  • Phase 0c — Typed ServerConfig aggregate root; per-deployment ModelArgs/SamplerArgs/... discriminated by import_path; extra='forbid' everywhere; cookbook YAMLs migrated (telemetry_configtelemetry, persistence_configpersistence, use_megatronbackend: mock|transformers|megatron)
  • Phase 0d — De-Actor ServerState: removes the detached Ray Actor, every worker binds directly to the shared StateBackend. New ReplicaRegistry keeps cross-worker LoRA-routing capacity consistent.
  • Phase 1 — Numpy-only TwinkleCompatMockModel + MockSampler; case-sensitive backend dispatch with _validate_* running before any side effect; cookbook/client/server/mock/server_config.yaml for CPU-only quick-start
  • Phase 2traced_operation context manager + twinkle.* correlation keys + ResourceMetricsCollector (CPU / memory / per-GPU). Grafana dashboard gets 4 new resource panels.
  • Phase 3 — Typer CLI (launch / check-config / print-config / clear persistence) + drift validation that runs before ray.init; documented server_config.example.yaml
  • Phase 4make_carrier / activate_carrier for cross-deployment trace propagation through DeploymentHandle calls
  • Phase 5docs/source_zh/使用指引/{可观测化,服务配置}.md + docs/source_en/Usage Guide/Observability.md + index entries

Self-review fixes

After Phase 5 I post-reviewed every commit and found 11 issues, all fixed. The most important one to call out:

Production logging fixinit_telemetry now binds the OTLP LoggingHandler to both root and the twinkle namespace logger. Previously twinkle.utils.logger's propagate=False meant no twinkle.* log records ever reached OTLP/Loki in production with telemetry enabled.

Test plan

  • 244 unit + property + contract + Docker integration tests pass in the twinkle conda env
  • Hypothesis property tests cover R1–R29 each tagged # Feature: server-config-observability-refactor, Property {n}: …
  • tests/contract/client_api_baseline.json re-runs green after every phase — 79 client-facing routes + schemas unchanged
  • Real Ray Serve boot in mock mode verified — Tinker + Twinkle health / capability / session endpoints all return 200
  • End-to-end LGTM with cookbook/observability/demo_sft_users.py — 5 concurrent users running full SFT, traces + logs + metrics all correlated via twinkle.session_id / trace_id. user2 hits a rate-limit, user4 fails with simulated NaN — both surface cleanly in Tempo with status=error.

What's still optional

  • OTel exemplars (jump from a Grafana metric panel directly to the trace) — needs MetricReader reservoir + datasource toggle; deferred.
  • Single-trace-id fan-out via internal DeploymentHandle (R13.3) — vacuously satisfied today (no internal handle calls in current topology); helpers ready for the day someone adds one.
  • Production multi-node Ray cluster smoke — only verified locally on a single colima VM with Redis + LGTM.

Reading order

  1. tests/contract/client_api_baseline.json (snapshotted, intentional) and tests/contract/test_client_api_contract.py
  2. Phase commits in order — each is a self-contained delivery item
  3. The two fix commits (db48a1c8 self-review pass, afcd573a logging bug)
  4. cookbook/observability/demo_sft_users.py to see the wired-up observability surface end-to-end

Yunnglin added 25 commits May 29, 2026 12:59
…traction

- Move utils/state/ to server/state/ as top-level module
- Fix all 8 import references (no re-export compatibility layer)
- Add StateBackend ABC with set/get/delete/exists/keys/count/set_nx/close/health_check
- Implement MemoryBackend (sync in-memory, compatible with Ray Actor)
- Refactor ConfigManager to use StateBackend with 'config::' key prefix
- Inject optional backend parameter into ServerState and get_server_state factory
- Add unified exception hierarchy (TwinkleServerError and subclasses)
- Create telemetry/ skeleton directory for Phase 2
…race/metric/log)

- Add telemetry/provider.py: OTEL TracerProvider/MeterProvider/LoggerProvider init
  with debug (console) and OTLP export modes, graceful shutdown
- Add telemetry/metrics.py: MetricsRegistry singleton facade over OTEL meters
  (low-invasiveness: business code uses MetricsRegistry.get() only)
- Add telemetry/tracing.py: get_tracer/inject_context/extract_context with
  noop fallback when OTEL SDK is not installed
- Rewrite utils/metrics.py as thin adapter layer: _Counter/_Histogram/_Gauge
  map Ray-style API (inc/set/observe) to MetricsRegistry OTEL instruments
- Update server_state.py _metrics_loop to use MetricsRegistry UpDownCounter
  with delta calculation
- Inject trace context in gateway/proxy.py for distributed tracing
- All OTEL imports are guarded (optional dependency): server starts normally
  without opentelemetry packages installed (NoOp fallback)
- Completely remove ray.util.metrics dependency (zero residual references)
…nd config signature

- Implement FileBackend (JSON file storage with atomic write, fcntl lock, TTL)
- Implement RedisBackend (redis.asyncio, optional dependency with guard)
- Add PersistenceConfig + create_backend factory (memory/file/redis modes)
- Adapt all Managers (Session/Model/Sampling/Future) to use StateBackend
  - BaseManager: async CRUD via backend with key prefix isolation
  - ModelManager: hybrid mode (records persisted, indexes in-memory with rebuild)
- Add config signature validation (SHA256 hash, warn/clear/abort policies)
- Fix ABORT policy exception propagation in get_server_state
- Add comprehensive unit tests (62 passed)
- tracing middleware: return passthrough when OpenTelemetry SDK is absent
  instead of crashing every request on lazy import inside the handler
- persistence_config: propagate via TWINKLE_PERSISTENCE_* env vars from the
  launcher to all Ray workers, so the configured backend is used regardless
  of which deployment initializes ServerState first; lift the example to
  top-level YAML
- middleware order: register metrics last in all four apps so it wraps the
  outermost layer and captures full end-to-end latency including tracing
- example yaml: telemetry default to enabled=false (optional dependency),
  document how to opt in
cookbook/observability/ provides a one-container OTLP receiver + dashboard
for Twinkle, built on grafana/otel-lgtm (bundled OTel Collector + Mimir +
Tempo + Loki + Grafana). Users docker compose up, point telemetry_config
at localhost:4317, and get a pre-provisioned overview dashboard with HTTP
rate / latency, queue depth, task latencies, rate-limit rejections, and
active-resource gauges.
…Phase 0a)

Establishes the cross-cutting freeze guard for the Tinker/Twinkle HTTP
contract (R20, R18.1) ahead of the server-config + observability refactor.
The harness builds each FastAPI app (gateway, model, sampler, processor)
by registering its route helpers against a fresh app, then extracts the
OpenAPI paths and component schemas. The committed baseline at
tests/contract/client_api_baseline.json is what every later phase asserts
equality against to catch any drift in route paths, HTTP methods, or
request/response schemas. Adds hypothesis to the test extras for the
property-based tests later phases will need.
…traints (Phase 0b)

Replaces the dataclass with a Pydantic BaseModel so invalid rate-limit and
timeout values are rejected at construction instead of leaking into the
running deployment. Constrains rps_limit/tps_limit/queue_timeout/
token_cleanup_interval to >= 0, window_seconds to > 0, and max_input_tokens
to int >= 1, matching R9.2-9.5/9.7. Sets extra='forbid' so unknown YAML
keys surface immediately. The from_dict(config_dict=None) factory is
preserved for the existing call sites in model/sampler/processor apps and
now delegates to model_validate({}) when no input is given.

Adds property tests (Hypothesis, max_examples=100) for constraint
enforcement, from_dict equivalence with model_validate, and the documented
defaulting behaviour. The Phase 0a client-API contract baseline is re-run
green as the cross-cutting freeze guard.
Adds a single Pydantic aggregate root that drives the launcher: ServerConfig
nests TelemetryConfig, PersistenceConfig, TaskQueueConfig, and a list of
typed ApplicationSpec entries. Each per-deployment args block has its own
schema (ModelArgs/SamplerArgs/ServerArgs/ProcessorArgs) with extra='forbid',
so unknown keys and out-of-range values surface at load time with the
offending field path. backend (model) and sampler_type (sampler) are
introduced as Literal-validated selectors, replacing the legacy use_megatron
boolean — Phase 1 will wire the actual dispatch on these values.

ServerConfig.from_yaml is the single load entry point: FileNotFoundError on
a missing path, ConfigParseError on malformed YAML, ValidationError on field
or cross-field violations. The cross-field validator rejects redis mode
without redis_url and file mode without file_path. ServerLauncher now
requires a typed ServerConfig and rejects raw dicts; from_yaml became a
thin wrapper. Legacy field names telemetry_config/persistence_config are
rejected per the breaking-change clause in R8.

Migrates the cookbook example configs (transformer + megatron) to the new
field names and adds property tests covering valid/invalid loads, round-trip
fidelity, and legacy-name rejection. The Phase 0a client-API contract
baseline is re-run green as the cross-cutting freeze guard.

Adds ConfigError (field/value/allowed) and ConfigParseError to
server.exceptions for callers that want a single non-pydantic exception
type to catch.
…ase 0d)

Removes the single detached Ray Actor that centralized server state
(get_server_state used to call ray.remote(ServerState).options(lifetime='detached'))
and replaces it with a process-local ServerState bound directly to the
configured StateBackend. Every deployment now reads and writes through the
shared backend, which removes the actor as a single-point bottleneck and
makes state visibility a property of the backend (Redis cross-process,
MemoryBackend in-process) rather than the actor.

Adds ReplicaRegistry persisting capacity at replica::<replica_id>::max_loras
so two workers on a shared backend agree on the cluster's available
LoRA capacity. ModelManager loses its in-memory _replica_max_loras /
_replica_models / _token_models dicts: capacity, per-replica loaded counts,
and per-token model counts are derived from persisted ModelRecords on each
read. register_replica / unregister_replica / get_available_replica_ids /
get_capacity_info are now async to match the backend roundtrip; ServerState
awaits them through. ServerStateProxy stays as a typing alias of ServerState
so existing call-site annotations keep working without import churn.

Updates the existing manager tests to the new async API and adds a Phase 0d
test module: a static + dynamic check that no detached actor is created
(R19.1), an in-process MemoryBackend smoke test (R19.6), the
ReplicaRegistry round-trip, cross-instance visibility on a shared
MemoryBackend, and a Hypothesis property (Property 25) showing two
ServerState instances driven by the same op stream agree on every read.
The Phase 0a client-API contract baseline is re-run green.
…tch (Phase 1)

Adds numpy-only TwinkleCompatMockModel and MockSampler so the server can be
launched on a CPU-only host with no torch / transformers / vllm / megatron
installed. Both backends return deterministic results keyed by the request
parameters: forward / forward_only / forward_backward yield logprob and
elementwise_loss arrays whose shapes are derived from the input sequence
lengths, sample emits one logprob per token and num_samples sequences per
prompt, and identical requests produce identical bytes (R1.3, R2.3-2.5,
R4.4, R4.5). Adapter add / remove / has are tracked in an in-memory record;
remove on an absent name raises KeyError without mutating the record (R1.7).

Replaces the if-use_megatron branch in model/app.py with strict case-
sensitive dispatch on the new ``backend`` field (mock|transformers|
megatron) and the hardcoded vLLMSampler in sampler/app.py with dispatch on
``sampler_type`` (mock|vllm|torch). Both validators raise ConfigError with
field/value/allowed *before* instantiating any backend (R3.9, R3.10) and
the mock branch skips ``twinkle.initialize(mode='ray', ...)`` entirely
(R3.7, R3.8) — the largest startup-time saving on a CPU-only host.

Makes ``twinkle.server.model`` and ``twinkle.server.sampler`` package
__init__s lazy via __getattr__ so importing the mock backend module does
not transitively pull torch (via app.py → common/router → template) or
vllm (via app.py → twinkle.sampler) on a CPU-only host (R1.2, R2.2, R4.3).

Adds the all-mock cookbook config at cookbook/client/server/mock/ with a
README documenting the launch command, the 30-second ready-state target,
and an explicit not-for-production note. Mock-mode persistence defaults to
in-process MemoryBackend so no Redis is required.

Property tests cover interface conformance, forward determinism + shape,
adapter round-trip, remove-absent semantics, sampler output length and
logprob count, sampler determinism, max_tokens<1 rejection, and dispatch
validation for every (field, allowed, invalid) tuple. Static checks
guarantee mock_sampler.py never imports vllm directly and mock_model
imports successfully when torch/transformers/vllm/megatron are blocked
from sys.modules. The Phase 0a client-API contract baseline is re-run green.
… (Phase 2)

Adds a traced_operation context manager that wraps a business-layer block
in one OpenTelemetry span: starts before the block, records exceptions and
sets span status to ERROR on raise, ends after the block, and re-raises the
original exception (R10.1, R10.4). The helper degrades to a NoOp context
manager when the OTEL SDK is missing so call sites get the same return
value with or without tracing installed (R10.5 / R18.3).

Defines the standardized correlation keys (twinkle.session_id /
twinkle.model_id / twinkle.replica_id / twinkle.token_id /
twinkle.sampling_session_id / twinkle.base_model) in a new
telemetry/correlation.py and adds set_correlation_attrs(span, values) which
attaches only present (non-None) values so partially-known operations
never end up with empty attributes (R11.1, R11.2, R11.3).

Wraps every server-state mutation that creates / registers an entity —
create_session, register_model, register_replica, create_sampling_session
— with traced_operation and the matching correlation attributes.

Adds ResourceMetricsCollector exposing observable gauges for system CPU
utilization, system memory, process RSS memory, and per-GPU utilization /
memory (R12.1). The collector is started by ensure_telemetry_initialized
in each Ray Serve worker, including when telemetry is disabled, so the
graceful-degradation path matches the enabled path (R12.2). When psutil
or pynvml is missing, or no GPU is present, the affected gauges report no
data and the collector does not raise (R12.3 / R18.3). Declares psutil
and pynvml as a new [telemetry] extras group in pyproject.toml (R12.4).

Property tests (Hypothesis, max_examples=100) cover the prefix invariant,
correlation attachment skipping None, and NoOp degradation equivalence;
unit tests verify span lifecycle and exception recording against an in-
memory OTEL exporter, and the wiring tests confirm the worker-init hook
calls into the collector regardless of TWINKLE_TELEMETRY_ENABLED. The
Phase 0a client-API contract baseline is re-run green.

Note: the Grafana dashboard CPU/Mem/GPU panels (task 7.13) and the LGTM
integration tests (7.15) require the docker-compose stack and are deferred
to the documentation phase.
…se 3)

Replaces the argparse __main__ with a typer-based operations CLI living in
twinkle.server.cli. The CLI exposes four subcommands:

- launch         — start the server from a YAML config. Validates the
                   persistence config signature against the persistence
                   backend BEFORE ray.init so a configuration drift fails
                   fast (R15.1).
- check-config   — exit 0 on a valid config, non-zero with the offending
                   field/error on failure (R14.3, R14.4).
- print-config   — emit the validated, normalized ServerConfig as YAML or
                   JSON; the JSON output round-trips back to an equal
                   ServerConfig (R14.5).
- clear persistence — delete persisted state for the namespace derived
                   from a config (R14.2).

Every option declares envvar= so env vars apply when the flag is omitted
(R14.6). The new console script twinkle-server is registered under
[project.scripts] and python -m twinkle.server delegates to the same
typer entry point, so the documented launch path is one shim layer.

Adds validate_against_backend in state/config_signature.py: builds the
backend from a PersistenceConfig, computes the current signature, stores
it on first run, and on mismatch raises ConfigMismatchError with a
stored-vs-current diff and a remediation hint pointing at the
clear-persistence subcommand (R15.2, R15.3, R15.4).

Adds a fully documented example config at
cookbook/client/server/server_config.example.yaml — every field carries
its type, default, and available options. Loadable as-is via check-config.

CLI tests cover subcommand existence, exit-code semantics, env-var
override, print-config round-trip, the order-of-operations property
(launch validates drift BEFORE ServerLauncher is even imported), and the
drift detection / first-run-storage property (Property 29). The Phase 0a
client-API contract baseline is re-run green.
…(Phase 4)

Adds make_carrier() and activate_carrier(carrier) in
telemetry/context_carrier.py so internal Ray Serve DeploymentHandle calls
can keep one trace continuous: the calling deployment serializes its
active OTEL context into a small dict, the receiving deployment wraps its
handler body in activate_carrier(...) and any spans it starts attach as
children of the propagated context.

When the OTEL SDK is missing, make_carrier returns an empty dict and
activate_carrier becomes a no-op context manager, so the body always
runs and never raises (R13.4 / R18.3). When the carrier is None or
empty, activate_carrier also degrades to a no-op so the receiving side
just starts a fresh trace.

Adds Property 24 round-trip tests against an in-memory OTEL exporter
showing parent.trace_id == child.trace_id when the carrier is honored,
and that both sides are safe in the absence of OTEL or context. Refactors
the telemetry test fixture into a session-scoped conftest because OTel's
trace.set_tracer_provider is one-shot per process — the second per-module
fixture would have silently shared the first one's exporter and made
tests order-dependent. The Phase 0a client-API contract baseline is
re-run green.

Note: the LGTM single-trace-id fan-out integration test (task 10.4)
requires the docker-compose stack and is deferred to the documentation
phase.
Adds the documentation set the refactor has been building toward:

- docs/source_en/Usage Guide/Observability.md +
  docs/source_zh/使用指引/可观测化.md
  Document the six twinkle.* correlation keys, the make_carrier /
  activate_carrier mechanism for cross-deployment trace propagation, and
  an end-to-end LGTM example using the cookbook/observability/
  docker-compose stack (R17.1, R17.2, R11.4).

- docs/source_zh/使用指引/服务配置.md
  ServerConfig field reference (every top-level + applications args
  schema), the supported environment variables (TWINKLE_SERVER_CONFIG,
  TWINKLE_RAY_NAMESPACE, telemetry / persistence env-var bag), a minimal
  YAML example, and a legacy → current field migration table covering
  telemetry_config → telemetry, persistence_config → persistence, and
  use_megatron → backend (R17.3, R8.3).

Adds index links to both guides from docs/source_zh/index.rst and the
Observability guide from docs/source_en/index.rst (R17.4).

Adds tests/docs/test_docs_smoke.py asserting every required content
element is present: all six correlation keys appear in both observability
guides, the propagation section names DeploymentHandle / make_carrier /
activate_carrier, the LGTM example references the docker-compose stack,
the config guide lists every top-level field + the env vars + the YAML
example + the migration table, and the index entries resolve.

The Phase 0a client-API contract baseline is re-run green, and all 210
unit + property + contract tests pass in the twinkle conda env.
Cleans up the bugs / dead code surfaced by the post-implementation review:

1. Cleanup task scheduling — removes the asyncio.get_running_loop() hack
   in get_server_state(): every Ray Serve worker's FastAPI lifespan now
   awaits state.start_cleanup_task() explicitly. Resource expiry actually
   runs again (previously every call site was sync ctor → no loop → loop
   never started). Wired in gateway/model/sampler/processor lifespans;
   start_cleanup_task is idempotent so repeat calls are no-ops.

2. ApplicationSpec — model/sampler entries with no args block now raise
   with the offending field path instead of silently substituting a
   ServerArgs() default. The mode='before' validator routes the raw args
   (or {}) through the schema selected by import_path so missing required
   fields surface cleanly. ApplicationSpec.args lost its silent default;
   server/processor (whose schemas are all-optional) still accept bare
   entries.

3. Grafana dashboard (R12.5) — adds CPU utilization, system + process
   memory, GPU utilization, and GPU memory panels to twinkle-overview.json
   wired to the metric names the ResourceMetricsCollector exports. Adds a
   regression test covering the panel titles and target metric names.

4. Nested extra='forbid' — TelemetryConfig and PersistenceConfig now
   reject unknown keys, so typos inside `telemetry: {...}` /
   `persistence: {...}` fail at load time instead of silently being
   dropped. Adds a parametrized regression test.

5. Validation before side effects (R3.9, R3.10) — splits each dispatch
   into _validate_* (pure, no imports) and _dispatch_* (assumes validated
   input). Both ModelManagement.__init__ / SamplerManagement.__init__
   and the build_*_app entry points call _validate_* up front, so an
   invalid backend / sampler_type never reaches twinkle.initialize,
   DeviceGroup construction, or any backend import.

6. Dead code — drops the unused _BACKEND_VALUES / _SAMPLER_TYPE_VALUES
   constants in application_spec.py and the dead exception branch around
   the old loop.create_task call in get_server_state.

7. use_megatron legacy bridge — removed from ModelManagement.__init__,
   build_model_app, and the .bind() call. backend is the canonical
   selector; the only remaining mention in repo lives in the tasks-doc
   migration table.

9. Stale ServerState docstring — updated to reflect direct-backend access.

11. launcher.py — single top-level `import os` instead of four duplicated
    local imports.

Test surface goes 210 → 213 (added: nested-config extras, dashboard
panels, refactored validation tests). All 213 unit + property + contract
tests pass and 11 end-to-end smoke checks (cookbook YAMLs, CLI exit
codes, print-config round-trip, mock determinism, dispatch validation,
contract baseline, cleanup-task lifecycle, ApplicationSpec strictness,
nested-config strictness) pass clean.
Adds the integration tests previously deferred behind "needs Docker":

- tests/server/state/test_redis_integration.py — Property 26 / 27 against
  a real Redis (R19.4 / R19.5). Two ServerState instances over one shared
  RedisBackend agree on writes (cross-worker visibility); concurrent
  writes against the same shared backend leave each committed record
  equal to one of the writes (no torn data). Skips when REDIS_URL is
  unreachable.

- tests/server/cli/test_drift_integration.py — end-to-end Phase 3 drift
  validation against Redis (R15). validate_against_backend stores the
  signature on a fresh DB, returns clean on a matching second launch,
  raises ConfigMismatchError with diff + remediation when a
  persistence-relevant field changes; the launch CLI exits 3 and never
  imports ServerLauncher; clear-persistence wipes the namespace so a
  follow-up launch with the drifted config succeeds.

- tests/integration/test_mock_mode_startup.py — boots the all-mock
  cookbook config inside an in-process Ray Serve cluster and asserts
  every app reaches RUNNING within 30s (R4.1, R4.2). Gated behind
  TWINKLE_TEST_INTEGRATION=1 so plain pytest stays fast.

- tests/integration/test_lgtm_telemetry.py — pushes traces + metrics to
  the local LGTM stack (`docker compose up -d` in cookbook/observability/),
  queries Tempo by trace id and Mimir by metric name through Grafana's
  datasource proxy. Confirms business spans carry twinkle.session_id /
  twinkle.model_id (R11.2), the resource collector's CPU/memory gauges
  show up in Mimir (R12.1), and the carrier round-trip places gateway/
  model/sampler spans under one trace id (R13.3). Skips when the OTLP
  endpoint and Grafana aren't reachable.

Tasks 4.7 / 4.8 / 6.19 / 9.6 marked complete in tasks.md. Tasks 7.15
and 10.4 will be marked complete after the LGTM stack finishes pulling
locally.
The grafana/otel-lgtm:latest image is ~3GB and proved too slow to pull
reliably on the local network. Restructures the LGTM test to auto-detect
which trace backend is up:

- Tempo via Grafana (preferred) — bundled docker-compose stack
- Jaeger 1.62.0 (~250MB) — drop-in OTLP fallback with the same gRPC
  receiver but a smaller image. `docker run -d -e COLLECTOR_OTLP_ENABLED=true
  -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one:1.62.0`

Either backend hosts the same e2e proof: a span with twinkle.session_id /
twinkle.model_id round-trips through the OTLP pipeline (R11.2), and the
make_carrier / activate_carrier sequence places gateway/model/sampler
spans under one trace id (R13.3).

Resolves a test-isolation bug: tests/server/telemetry/conftest.py
installs an InMemorySpanExporter via trace.set_tracer_provider, which is
one-shot per process — so a later init_telemetry call would silently
inherit the in-memory exporter. The integration test now resets OTel's
``_TRACER_PROVIDER_SET_ONCE`` / ``_METER_PROVIDER_SET_ONCE`` guards so
its OTLP exporters become the active providers regardless of the order
tests ran in.

R12.1 (resource gauges expose) and R12.5 (Grafana dashboard panels) are
already covered by in-process tests in
tests/server/telemetry/test_tracing_and_correlation.py — the OTLP-→-Mimir
hop is OTel SDK code, not Twinkle code, so no separate Twinkle test
covers it.

Marks tasks 7.15 and 10.4 complete in tasks.md. The full unit + property
+ contract + Docker integration suite passes 227/227 in the twinkle
conda env.
…s reach OTLP

twinkle.utils.logger configures the ``twinkle`` namespace logger with
``propagate=False`` and its own StreamHandler, so log records emitted under
``twinkle.*`` (which is the entire server codebase) never bubble up to root.
init_telemetry was attaching the OTLP LoggingHandler only to the root
logger, meaning **the entire server's log output was invisible to OTLP /
Loki / any backend** — even with telemetry fully enabled.

Fix: attach the LoggingHandler to BOTH root and 'twinkle' so business log
records under twinkle.server.*, twinkle.demo, etc. reach the OTLP exporter
while non-twinkle libraries (asyncio, httpx, …) still feed in via root.
shutdown_telemetry detaches from both.

Verified by emitting 88 log records under twinkle.demo and confirming all
88 land in the local LGTM stack's Loki. The records carry trace_id /
span_id / severity_text as OTel structured metadata, so in Loki you can
filter with ``{service_name="twinkle-server"} | trace_id = \`<id>\``` to
pull every log for one trace.

Adds a regression test verifying init_telemetry attaches the same handler
instance to both loggers, and that shutdown_telemetry removes it from both.
…extra

- pyproject.toml: add `redis = ["redis>=5.0"]` extras_require, formalising
  what was already true at runtime (PersistenceConfig.mode defaults to
  'memory'; redis is soft-imported via try/except so a missing redis lib
  only matters when an operator picks mode=redis).

- cookbook/observability/demo_sft_users.py: scripted end-to-end SFT demo
  for the LGTM stack. Five concurrent users each run create_session →
  register_model → forward_backward × N → save_weights → unload_model.
  Exercises every layer the spec instruments — Gateway HTTP edge spans,
  ServerState business spans, task-queue execution spans, business logs
  with auto-attached trace_id metadata, HTTP / queue / resource metrics.

  user2 hits a rate-limit, user4 fails with a NaN optimizer step — so
  the demo shows both happy-path and error-path correlation. Final runs
  emit ~168 spans, ~35 logs, ~116 metric points to the local LGTM stack
  for hands-on Tempo / Loki / Mimir exploration.
Runs the project's pre-commit hooks across every file touched by this
branch, so the lint CI job passes:

- flake8: wrap a handful of >120-char lines (mostly docstrings); drop the
  unused ``payload``/``backend`` locals in two tests; move the
  ``from twinkle...`` import after ``pytest.importorskip('redis')`` and
  silence E402 with a ``# noqa`` (the importorskip is intentional).
- isort: reorder imports to PyCQA's canonical layout.
- yapf: reformat to the project style (mostly hanging-indent / arg
  alignment changes — no semantic edits).
- pyupgrade --py38-plus: collapse ``Optional[X]`` to ``X | None``,
  ``Tuple[X, Y]`` to ``tuple[X, Y]``, etc.
- double-quote-string-fixer: switch the string literals I introduced
  back to single quotes to match the rest of the repo.

No behavior change. 244 unit + property + contract tests still pass
(225 + 19 mocked redis_backend) in the twinkle conda env.
@Yunnglin Yunnglin changed the title Server config + observability refactor (Phases 0a–5) Server config + observability refactor Jun 2, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the Twinkle Server configuration and state management by introducing a typed Pydantic ServerConfig, a Typer-based CLI, and direct-backend state persistence to replace the legacy detached Ray Actor. It also adds a comprehensive OpenTelemetry-based observability stack and a CPU-only mock backend. The review feedback highlights a critical race condition in the file backend's locking mechanism, a connection leak in the configuration signature validation, and several outdated configuration keys and CLI commands in the documentation and comments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +67 to +71
# Apply exclusive lock to temp file then atomic replace
with open(fd.name) as lock_f:
fcntl.flock(lock_f.fileno(), fcntl.LOCK_EX)
os.replace(fd.name, self._file_path)
fcntl.flock(lock_f.fileno(), fcntl.LOCK_UN)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The exclusive lock is currently being acquired on the newly created temporary file (fd.name). Since each concurrent write process creates a unique temporary file, they will lock different files and never block each other. This results in a race condition where concurrent writes can overwrite each other's changes on the destination file (self._file_path). To ensure proper mutual exclusion, the lock should be acquired on a shared lock file (e.g., self._file_path + '.lock').

Suggested change
# Apply exclusive lock to temp file then atomic replace
with open(fd.name) as lock_f:
fcntl.flock(lock_f.fileno(), fcntl.LOCK_EX)
os.replace(fd.name, self._file_path)
fcntl.flock(lock_f.fileno(), fcntl.LOCK_UN)
# Apply exclusive lock to a shared lock file to prevent concurrent write races
lock_path = self._file_path + ".lock"
with open(lock_path, 'w') as lock_f:
fcntl.flock(lock_f.fileno(), fcntl.LOCK_EX)
os.replace(fd.name, self._file_path)

Comment on lines +131 to +152
from twinkle.server.state.backend.factory import create_backend

backend = create_backend(persistence_config)
current_sig = compute_signature(current_config)
stored_sig = await backend.get(_SIGNATURE_KEY)

if stored_sig is None:
await backend.set(_SIGNATURE_KEY, current_sig)
logger.info('No previous config signature found. Stored current signature.')
return

if stored_sig == current_sig:
return

stored_payload = await backend.get('_meta::config_payload')
diff = _format_diff(stored_payload if isinstance(stored_payload, dict) else None, current_config)
raise ConfigMismatchError('Persistence configuration drifted since the last launch. '
f'Stored signature: {stored_sig[:12]}..., current signature: {current_sig[:12]}...\n'
f'Differences:\n{diff}\n'
'Remediation: either revert the persistence section to match the stored '
'value, or clear the persisted state with '
'`python -m twinkle.server clear persistence --config <yaml>` and relaunch.')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The backend instance created via create_backend(persistence_config) is never closed when the function returns or raises an error. This can leak connections (especially when using RedisBackend with connection pools) and trigger unclosed connection warnings in asyncio. Wrap the validation logic in a try...finally block to ensure await backend.close() is always called.

    from twinkle.server.state.backend.factory import create_backend

    backend = create_backend(persistence_config)
    try:
        current_sig = compute_signature(current_config)
        stored_sig = await backend.get(_SIGNATURE_KEY)

        if stored_sig is None:
            await backend.set(_SIGNATURE_KEY, current_sig)
            logger.info('No previous config signature found. Stored current signature.')
            return

        if stored_sig == current_sig:
            return

        stored_payload = await backend.get('_meta::config_payload')
        diff = _format_diff(stored_payload if isinstance(stored_payload, dict) else None, current_config)
        raise ConfigMismatchError('Persistence configuration drifted since the last launch. '
                                  f'Stored signature: {stored_sig[:12]}..., current signature: {current_sig[:12]}...\n'
                                  f'Differences:\n{diff}\n'
                                  'Remediation: either revert the persistence section to match the stored '
                                  'value, or clear the persisted state with '
                                  '`python -m twinkle.server clear persistence --config <yaml>` and relaunch.')
    finally:
        await backend.close()

# open http://localhost:3000 # admin / admin (anonymous viewer also enabled)
#
# In your server_config.yaml:
# telemetry_config:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment references the legacy telemetry_config key. Since the configuration has been refactored to use telemetry instead of telemetry_config, this should be updated to prevent confusion.

#   telemetry:

Comment thread cookbook/observability/README.md Outdated

# 3. In your server_config.yaml:
#
# telemetry_config:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The documentation references the legacy telemetry_config key. Update it to telemetry to match the refactored schema.

Suggested change
# telemetry_config:
# telemetry:

Comment thread cookbook/observability/README.md Outdated

## Troubleshooting

- **Grafana shows "No data"** — confirm `telemetry_config.enabled: true` in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The troubleshooting section references telemetry_config.enabled: true. Update it to telemetry.enabled: true to align with the new configuration schema.

Suggested change
- **Grafana shows "No data"** — confirm `telemetry_config.enabled: true` in
- **Grafana shows "No data"** — confirm `telemetry.enabled: true` in

Comment thread cookbook/client/server/mock/README.md Outdated
## Launch

```bash
python -m twinkle.server --config cookbook/client/server/mock/server_config.yaml
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The command python -m twinkle.server --config ... is outdated. With the new Typer-based CLI, the --config option is specific to subcommands like launch. Use python -m twinkle.server launch --config ... instead.

Suggested change
python -m twinkle.server --config cookbook/client/server/mock/server_config.yaml
python -m twinkle.server launch --config cookbook/client/server/mock/server_config.yaml

Comment thread cookbook/observability/README.md Outdated
# otlp_endpoint: http://localhost:4317

# 4. Launch Twinkle as usual
python -m twinkle.server --config server_config.yaml
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The command python -m twinkle.server --config ... is outdated. Use python -m twinkle.server launch --config ... to match the new Typer-based CLI structure.

Suggested change
python -m twinkle.server --config server_config.yaml
python -m twinkle.server launch --config server_config.yaml

Yunnglin added 3 commits June 2, 2026 13:19
The pre-commit-hooks v6.0.0 ``double-quote-string-fixer`` skips
``FSTRING_*`` tokens on Python 3.12+ but on the CI runner's Python 3.11
the f-strings are emitted as a single ``STRING`` token and get rewritten.
Manually converted the 4 affected files so the hook is a no-op on either
interpreter:

- src/twinkle/server/state/base.py
- src/twinkle/server/state/backend/redis_backend.py
- src/twinkle/server/state/config_signature.py
- tests/server/state/test_managers.py

Verified: pre-commit run --all-files passes under a fresh Python 3.11.15
env (CI's runner version), and 244 unit + property + contract tests still
pass under the twinkle env (Python 3.12).
…y-refactor

- worker.py: emit `task_queue.execute` (R10.2) + nested `<deployment>.<task_type>`
  (R10.3) spans so the queued handler op is observable, not just state-level ops.
- sampler/processor non-queued handlers: wrap primary ops in `traced_operation`
  so set_template / add_adapter / apply_patch / processor.create / processor.call
  also satisfy R10.3.
- config_signature: persist `_meta::config_payload` on first run so drift diff
  renders real stored-vs-current field differences (R15.3) instead of always
  showing the current config as if it were entirely new.
- mock model / sampler: replace Python's salted `hash(tuple-of-strings)` with
  SHA-256 over a canonical string form so deterministic outputs (R2.5/R4.4/R4.5)
  hold across processes — built-in hash is PYTHONHASHSEED-salted and would
  diverge across replicas / restarts.
- context_carrier: document that the current topology routes every cross-
  deployment hop through the Gateway HTTP proxy (already trace-propagating),
  so there are no in-process DeploymentHandle call sites to thread the carrier
  through today; the helpers remain the supported integration point for any
  future handle-based hop.
- launcher: wire ServerConfig.proxy_location into `serve.start(...)` (example
  configs already declare it) and make ApplicationSpec a real top-level import
  so `get_type_hints(_deploy_application)` resolves at runtime.
The proxy class was removed in Phase 0d (de-Actor); only the
`ServerStateProxy = ServerState` alias survived so existing type hints could
keep working through the transition (R19.1). With every call site updated,
the alias is now misleading — there is no proxy, just direct backend access.

- Delete the alias and its retention comment in `state/server_state.py`.
- Remove the re-export from `state/__init__.py`.
- Rename all 7 call-site type hints (`router`, `lifecycle/base`,
  `task_queue/mixin`, `task_queue/worker`, `model/app`, `sampler/app`,
  `processor/app`) to `ServerState`.

Pure rename — zero behavior change. The Client_Facing_API contract is
unaffected (R20).
Yunnglin added 6 commits June 2, 2026 16:00
…anel

Mock-mode servers leave most dashboard panels at "No data" because:
- `histogram_quantile(..., rate(_bucket[5m]))` returns NaN under zero recent
  traffic — sparse requests render counters but blank histograms;
- `up_down_counter` gauges (`active_sessions`, `active_models`, `queue_depth`)
  emit on delta only and stay invisible until the underlying count moves;
- mock backends execute in microseconds so P95 hugs the bottom bucket.

`load.py` drives a running mock server with N concurrent users that each:
  POST /api/v1/twinkle/create_session                 -> active_sessions++
  POST /api/v1/model/mock/twinkle/add_adapter_to_model -> active_models++
  POST /api/v1/sampler/mock/twinkle/sample (loop, 80%) -> http rate + latency
                                                          + queue_depth
                                                          + task_execution
                                                          + task_wait
  POST /api/v1/model/mock/twinkle/forward_only (~20%)  -> sticky-LoRA path

Sticky `X-Ray-Serve-Request-Id` is pinned per (user, adapter) so the
`request_id + '-' + adapter_name` lookup in `assert_resource_exists` resolves
on subsequent /forward_only calls.

Tunable: `--concurrency`, `--duration`, `--interval`, `--max-tokens`. Bumping
`--max-tokens` lifts mock execution time off the bottom histogram bucket so
P95 panels show meaningful values.
…RY_ENABLED=1

Two issues exposed when actually running cookbook/observability/load.py
against a live mock server:

1. ``MockSampler.sample(...)`` raised ``TypeError`` because the
   Tinker / Twinkle handlers forward ``adapter_path`` (matching the
   ``vLLMSampler`` signature) but the mock didn't accept extra kwargs.
   Added ``**kwargs`` so the mock stays callable through the same handler
   call sites — 100% of ``/sample`` requests now succeed.

2. load.py docstring told users ``TWINKLE_TELEMETRY_ENABLED=true`` but
   ``worker_init.ensure_telemetry_initialized`` reads the literal ``"1"``,
   so telemetry was never actually initialised — every panel showed "No
   data". Corrected to ``=1`` and added the ``ray start --head`` prereq
   (the launcher does ``ray.init(address='auto')`` and won't bootstrap one).
Ray Serve binds ``serve.get_replica_context().servable_object`` AFTER
FastAPI ``lifespan`` startup completes, so the existing lifespan call to
``get_self().state.start_cleanup_task()`` crashed with
``'NoneType' object has no attribute 'state'`` in every worker and was
silently swallowed. The cleanup loop drives ``_metrics_loop``, which
emits the four ``twinkle_*_active`` resource gauges — so those gauges
never produced a single sample and the "Active resources" Grafana panels
always read "No data".

Move the call from lifespan to first-request lazy-init:
- Model / Sampler: ``_on_request_start`` -> ``_ensure_state_cleanup_started``
- Processor: ``_ensure_sticky`` (which every routed call goes through)
- Gateway: a tiny ``ensure_state_cleanup_started`` HTTP middleware (no
  per-handler hook exists)

``state.start_cleanup_task`` is itself idempotent via ``_cleanup_running``;
the per-instance flag avoids the await call on every subsequent request.

Verified end-to-end against a live mock server with the LGTM stack:
``twinkle_sessions_active=4`` and ``twinkle_futures_active`` now emit
correctly. (``twinkle_models_active`` still empty under load — separate
diagnostic for a follow-up; ``models.add`` reaches the backend and
``futures.add`` works through the same metrics loop, so likely an
instrument-binding issue at lazy-init time worth its own investigation.)
…redis backend

Three bugs uncovered while making active-resource panels light up:

1. ``X-Twinkle-Session-Id`` used a client-side string that the server never
   persisted, so the adapter countdown loop in ``utils/lifecycle/base.py``
   saw "session not found" and expired every registered adapter within ~10s.
   Now call ``/twinkle/create_session``, take the server-issued id from the
   response body, and pass that id to every subsequent header. Also heartbeat
   the session every 5s so it stays alive.

2. ``persistence: memory`` is per-process — Gateway-worker sessions are
   invisible to the Model worker. Even with the correct session_id the
   liveness check still fails because Model's MemoryBackend has zero session
   records. Docstring now states the script requires a shared backend
   (Redis) and explains the trap; reasonable, since R19.4 cross-worker
   visibility specifically requires shared persistence.

3. Sampling-session calls hit ``/api/v1/twinkle/create_sampling_session``
   and 404'd because the route lives at the gateway root (it is a Tinker
   route — only ``create_session`` is mounted under ``/twinkle/``). Fixed
   to call ``/api/v1/create_sampling_session``.

Result against a redis-backed mock server: 13/14 dashboard panels populate
(``rate_limit_rejections`` stays empty under gentle load — by design;
``gpu_*`` stays empty on a CPU-only mock — by design).
Mock backends and OpenTelemetry pipeline still live in src/ but their
public contract isn't settled. Pull them out of user-facing docs and
cookbook examples so iteration doesn't churn published surface.

- delete Observability.md (en/zh), strip telemetry rows + mock mentions
  from 服务配置.md, drop entries from both index.rst toctrees
- delete cookbook/observability/, cookbook/client/server/mock/, and
  cookbook/client/server/server_config.example.yaml
- strip telemetry: blocks from transformer/megatron server_config.yaml
- migrate the mock CPU-only YAML the e2e test needed into
  tests/server/fixtures/server_config_mock.yaml; CLI + mock-mode-startup
  tests import the shared path constant
- drop now-dead tests: tests/docs/test_docs_smoke.py (asserted removed
  files), tests/integration/test_lgtm_telemetry.py (gated on removed
  docker-compose; in-process equivalents already covered), grafana
  dashboard panel test, and the two mock cookbook README/config asserts
  in test_mock_sampler.py
Mirrors the en side, which has no Server-Configuration guide. The
ServerConfig schema is part of the same not-yet-confirmed surface as
mock / observability — pull it out of the published toctree now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant