fix(agentex): bootstrap OTel auto-instrumentation in uvicorn spawn workers #305

Open
james-cardenas wants to merge 7 commits into main from jamesc-fix-auto-intrumentation

@james-cardenas james-cardenas commented Jun 12, 2026

Contributor

Summary

  • Bootstrap OpenTelemetry auto-instrumentation at import time so uvicorn spawn workers get HTTP/library instrumentation and custom metrics, not just the parent process
  • Assign a per-worker service.instance.id by patching OTEL_RESOURCE_ATTRIBUTES before initialize(), fixing shared timeseries when multiple workers inherit the same pod-level resource attrs (opentelemetry-python#4390)
  • Move otel_metrics to the first import in app.py so instrumentors patch FastAPI/httpx/SQLAlchemy at import time; init_otel_metrics() at lifespan startup attaches custom instruments to the existing global MeterProvider

Problem

The OTel Operator injects auto-instrumentation via sitecustomize, which runs initialize() in the parent process and then strips auto-instrumentation from PYTHONPATH. Uvicorn spawn workers are fresh Python processes without sitecustomize, so they previously served plain FastAPI with no OTel HTTP middleware or metrics.

Separately, spawn workers share the pod-level OTEL_RESOURCE_ATTRIBUTES env var. Auto-instrumentation builds provider resources from env at initialize() time via Resource.create() — so all workers would emit on the same service.instance.id without a per-process suffix (opentelemetry-python#4390).
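
A minimal sketch of the env patch described above, for illustration only. It assumes the suffix is appended as `.<pid>` to any existing service.instance.id and that the hostname is the fallback when none is set. The helper names `with_pid_instance_id` and `sync_instance_id_to_env` are hypothetical and are not taken from the PR; values are treated as opaque strings (no percent-decoding).

```python
import os
import socket


def with_pid_instance_id(raw: str, pid: int, fallback: str) -> str:
    """Return `raw` (comma-separated key=value pairs) with a pid-suffixed
    service.instance.id. All other attributes are passed through unchanged."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.strip().partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()

    base = attrs.get("service.instance.id", fallback)
    suffix = f".{pid}"
    # Keep the operation idempotent: do not append the suffix twice.
    if not base.endswith(suffix):
        base = f"{base}{suffix}"
    attrs["service.instance.id"] = base
    return ",".join(f"{k}={v}" for k, v in attrs.items())


def sync_instance_id_to_env() -> None:
    """Patch OTEL_RESOURCE_ATTRIBUTES for the current process.

    Assumption: this runs before auto-instrumentation reads the env var.
    """
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    os.environ["OTEL_RESOURCE_ATTRIBUTES"] = with_pid_instance_id(
        raw, os.getpid(), socket.gethostname()
    )
```

Because each spawn worker has its own PID, running this in every worker before initialize() would give each one a distinct service.instance.id while keeping the pod-level attributes intact.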

Solution

  1. bootstrap_auto_instrumentation(): runs on otel_metrics import; writes a PID-suffixed service.instance.id into OTEL_RESOURCE_ATTRIBUTES, then calls initialize() to create the global TracerProvider/MeterProvider and load instrumentors (see the OpenTelemetry auto-instrumentation reference)
  2. init_otel_metrics(): unchanged coexistence model. It attaches custom app metrics (auth_cache_*, db_*) to the bootstrap provider when present, and builds a standalone OTLP pipeline only when no global provider exists
  3. Import order: app.py imports otel_metrics first, before FastAPI and other auto-instrumented libraries
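
The bootstrap step can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the PR's implementation: the names `bootstrap_once`, `_load_initialize`, and `_bootstrapped` are hypothetical, and the injectable `initializer` argument exists only to make the sketch testable. It assumes `initialize` is importable from `opentelemetry.instrumentation.auto_instrumentation` when the contrib packages are installed.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

# Process-local flag; each spawn worker starts with a fresh copy.
_bootstrapped = False


def _load_initialize() -> Optional[Callable[[], None]]:
    """Return OTel's initialize() if the package is installed, else None."""
    try:
        from opentelemetry.instrumentation.auto_instrumentation import initialize
    except ImportError:
        return None
    return initialize


def bootstrap_once(initializer: Optional[Callable[[], None]] = None) -> bool:
    """Run auto-instrumentation at most once per process.

    Returns True if instrumentation is (or already was) bootstrapped.
    """
    global _bootstrapped
    if _bootstrapped:
        return True

    init = initializer or _load_initialize()
    if init is None:
        # OTel packages are not installed: serve without instrumentation.
        return False

    try:
        init()
    except Exception:
        # Never let telemetry setup crash the service.
        logger.exception("OTel auto-instrumentation bootstrap failed")
        return False

    # Set the flag only after initialize() succeeds, so a failed attempt
    # can be retried.
    _bootstrapped = True
    return True
```

In this sketch, calling `bootstrap_once()` at module scope in a module that app.py imports first would run it in every worker before FastAPI is imported.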

Notes

  • Helm (single worker): bootstrap runs in the worker process; operator k8s resource labels are preserved; duplicate initialize() on --workers 1 only produces set-once warnings
  • Dockerfile (--workers 4): each spawn worker bootstraps independently with a distinct pid-suffixed service.instance.id
  • ddtrace coexistence: documented in module docstring; Helm uses ddtrace-run only when datadog.env is set
  • Removed ineffective NoOpMeterProvider reset on shutdown — OTel global MeterProvider is set-once and cannot be replaced

Test plan

  • pytest agentex/tests/unit/utils/test_otel_metrics.py (24 tests)
  • Deploy to cluster; confirm _InstrumentedFastAPI in spawn workers
  • Confirm distinct service.instance.id per worker when --workers > 1
  • Confirm FastAPI HTTP metrics include http_route and k8s resource labels from operator env
  • Confirm custom metrics (auth_cache_*, db_*) export on the same provider resource

Made with Cursor

Greptile Summary

This PR fixes OTel auto-instrumentation in uvicorn spawn workers by importing otel_metrics first in app.py so bootstrap_auto_instrumentation() runs at module load time in each worker process, before FastAPI and other instrumented libraries are imported. It also assigns a per-worker service.instance.id by appending the process PID to OTEL_RESOURCE_ATTRIBUTES before initialize(), resolving duplicate timeseries when multiple workers share pod-level resource attributes.

  • bootstrap_auto_instrumentation() is called at module import, calls OTel's initialize() once per process when contrib packages are present, handles ImportError and runtime failures gracefully, and sets the idempotency flag only after initialize() succeeds.
  • _sync_instance_id_to_env() writes a PID-suffixed service.instance.id into OTEL_RESOURCE_ATTRIBUTES before initialize() reads it; _resource_with_unique_instance_id() applies the same in the standalone metrics path.
  • Removed the NoOpMeterProvider reset on shutdown, which was incorrect since the OTel global provider slot is set-once and cannot be meaningfully replaced.

Confidence Score: 5/5

Safe to merge. The bootstrap path is fully guarded, the idempotency flag is set only after success, and the env mutation is scoped to module import time before any other process state is established.

The core bootstrap logic is correct: initialize() is called exactly once per worker process, exception handling prevents OTel failures from crashing the service, and the per-worker service.instance.id is computed and injected into env before initialize() reads it. The concerns raised in previous review threads have both been addressed in this version. The 14 new tests cover failure modes, idempotency, and env mutation semantics. No logic errors were found across all three changed files.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| agentex/src/utils/otel_metrics.py | Adds bootstrap_auto_instrumentation() called at import time to patch OTel instrumentors in spawn workers; adds per-worker service.instance.id via env mutation; removes NoOpMeterProvider reset on shutdown. Logic is sound, exception handling is correct, flag is set only after success. |
| agentex/src/api/app.py | Moves otel_metrics import to the top of the file so bootstrap runs before FastAPI and other auto-instrumented libraries are imported in each spawn worker. Change is minimal and correct. |
| agentex/tests/unit/utils/test_otel_metrics.py | Adds 14 new unit tests covering bootstrap success/failure, idempotency, ImportError path, service.instance.id construction, env mutation, and shutdown behavior. Autouse fixture correctly saves/restores _auto_instrumentation_bootstrapped state. |

Sequence Diagram

sequenceDiagram
    participant UV as uvicorn parent
    participant W as spawn worker
    participant OM as otel_metrics (import)
    participant ENV as os.environ
    participant INIT as OTel initialize()
    participant LP as lifespan startup

    UV->>W: fork/spawn new process
    W->>OM: import otel_metrics (first import in app.py)
    OM->>ENV: _sync_instance_id_to_env(service.pod.pid)
    OM->>INIT: initialize() reads env, creates TracerProvider+MeterProvider
    INIT-->>OM: global providers installed
    OM->>OM: "_auto_instrumentation_bootstrapped = True"
    W->>W: import FastAPI, httpx, SQLAlchemy (already patched)
    W->>LP: lifespan startup
    LP->>OM: init_otel_metrics()
    OM->>OM: _global_meter_provider() finds bootstrap provider
    OM-->>LP: returns existing MeterProvider
    LP->>LP: "register auth_cache_*, db_* instruments"

Reviews (3): Last reviewed commit: "refactor(agentex): polish otel_metrics m..."

@james-cardenas james-cardenas requested a review from a team as a code owner June 12, 2026 05:14