fix(agentex): bootstrap OTel auto-instrumentation in uvicorn spawn workers #305
Open
james-cardenas wants to merge 7 commits into
Summary
- Set a per-worker `service.instance.id` by patching `OTEL_RESOURCE_ATTRIBUTES` before `initialize()`, fixing shared timeseries when multiple workers inherit the same pod-level resource attrs (opentelemetry-python#4390)
- Move `otel_metrics` to the first import in `app.py` so instrumentors patch FastAPI/httpx/SQLAlchemy at import time; `init_otel_metrics()` at lifespan startup attaches custom instruments to the existing global `MeterProvider`

Problem
The OTel Operator injects auto-instrumentation via `sitecustomize`, which runs `initialize()` in the parent process and then strips auto-instrumentation from `PYTHONPATH`. Uvicorn spawn workers are fresh Python processes without `sitecustomize`, so they previously served plain `FastAPI` with no OTel HTTP middleware or metrics.

Separately, spawn workers share the pod-level `OTEL_RESOURCE_ATTRIBUTES` env var. Auto-instrumentation builds provider resources from env at `initialize()` time via `Resource.create()`, so all workers would emit on the same `service.instance.id` without a per-process suffix (opentelemetry-python#4390).

Solution
- `bootstrap_auto_instrumentation()`: runs on `otel_metrics` import; syncs `service.instance.id.<pid>` into env, then calls `initialize()` to create global `TracerProvider`/`MeterProvider` and load instrumentors (auto-instrumentation reference)
- `init_otel_metrics()`: unchanged coexistence model; attaches custom app metrics (`auth_cache_*`, `db_*`) to the bootstrap provider when present, and sets up a standalone OTLP pipeline only when no global provider exists
- `app.py` imports `otel_metrics` first, before FastAPI and other auto-instrumented libraries

References
- `initialize()` creates providers via `Resource.create()` from env, not patching-only
- `service.instance.id` for multi-worker processes

Notes
- […] `initialize()` on `--workers 1` only produces set-once warnings
- […] (`--workers 4`): each spawn worker bootstraps independently with a distinct pid-suffixed `service.instance.id`
- […] `ddtrace-run` only when `datadog.env` is set
- Removed `NoOpMeterProvider` reset on shutdown: the OTel global `MeterProvider` is set-once and cannot be replaced

Test plan
- `pytest agentex/tests/unit/utils/test_otel_metrics.py` (24 tests)
- […] `_InstrumentedFastAPI` in spawn workers
- […] `service.instance.id` per worker when `--workers > 1`
- […] `http_route` and k8s resource labels from operator env
- Custom app metrics (`auth_cache_*`, `db_*`) export on the same provider resource

Made with Cursor
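For readers who want a concrete picture of the per-worker `service.instance.id` step, here is a minimal sketch of rewriting `OTEL_RESOURCE_ATTRIBUTES` before `initialize()` reads it. It is not the PR's `_sync_instance_id_to_env()`: the function name `sync_instance_id_to_env`, its parameters, the `<existing id>.<pid>` format, and the bare-pid fallback are all assumptions made for illustration. It also ignores percent-encoding of attribute values.

```python
import os
from typing import MutableMapping, Optional

ENV_KEY = "OTEL_RESOURCE_ATTRIBUTES"
INSTANCE_KEY = "service.instance.id"


def sync_instance_id_to_env(
    environ: Optional[MutableMapping[str, str]] = None,
    pid: Optional[int] = None,
) -> str:
    """Hypothetical helper: make service.instance.id end in ".<pid>"."""
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    suffix = f".{pid}"

    # Parse the comma-separated key=value pairs, preserving their order.
    attrs = {}
    for pair in environ.get(ENV_KEY, "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()

    base = attrs.get(INSTANCE_KEY, "")
    if not base:
        instance_id = str(pid)  # assumption: fall back to the bare pid
    elif base.endswith(suffix):
        instance_id = base  # already suffixed, so repeated calls are no-ops
    else:
        instance_id = base + suffix

    # Write the result back so anything reading the env afterwards sees it.
    attrs[INSTANCE_KEY] = instance_id
    environ[ENV_KEY] = ",".join(f"{k}={v}" for k, v in attrs.items())
    return instance_id
```

Passing a plain dict instead of `os.environ` keeps the helper easy to unit test; in a worker it would be called with no arguments before the OTel bootstrap runs.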
Greptile Summary
This PR fixes OTel auto-instrumentation in uvicorn spawn workers by importing `otel_metrics` first in `app.py` so `bootstrap_auto_instrumentation()` runs at module load time in each worker process, before FastAPI and other instrumented libraries are imported. It also assigns a per-worker `service.instance.id` by appending the process PID to `OTEL_RESOURCE_ATTRIBUTES` before `initialize()`, resolving duplicate timeseries when multiple workers share pod-level resource attributes.

- `bootstrap_auto_instrumentation()` is called at module import, calls OTel's `initialize()` once per process when contrib packages are present, handles `ImportError` and runtime failures gracefully, and sets the idempotency flag only after `initialize()` succeeds.
- `_sync_instance_id_to_env()` writes a PID-suffixed `service.instance.id` into `OTEL_RESOURCE_ATTRIBUTES` before `initialize()` reads it; `_resource_with_unique_instance_id()` applies the same in the standalone metrics path.
- Removes the `NoOpMeterProvider` reset on shutdown, which was incorrect since the OTel global provider slot is set-once and cannot be meaningfully replaced.

Confidence Score: 5/5
Safe to merge. The bootstrap path is fully guarded, the idempotency flag is set only after success, and the env mutation is scoped to module import time before any other process state is established.
The core bootstrap logic is correct: initialize() is called exactly once per worker process, exception handling prevents OTel failures from crashing the service, and the per-worker service.instance.id is computed and injected into env before initialize() reads it. The concerns raised in previous review threads have both been addressed in this version. The 14 new tests cover failure modes, idempotency, and env mutation semantics. No logic errors were found across all three changed files.
No files require special attention.
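The guard behaviour described in this review (one bootstrap per process, failures swallowed, flag set only after success) can be sketched as below. This is an illustration under stated assumptions, not the PR's code: the `initializer` parameter and the `_bootstrapped` flag name are hypothetical, and the import path `opentelemetry.instrumentation.auto_instrumentation.initialize` is assumed to be available from the installed `opentelemetry-instrumentation` package.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)

_bootstrapped = False  # module-level flag: at most one bootstrap per process


def bootstrap_auto_instrumentation(
    initializer: Optional[Callable[[], None]] = None,
) -> bool:
    """Run the OTel initializer at most once; return True if bootstrapped."""
    global _bootstrapped
    if _bootstrapped:
        return True

    if initializer is None:
        try:
            # Assumed import path; absent when contrib packages are missing.
            from opentelemetry.instrumentation.auto_instrumentation import (
                initialize,
            )
        except ImportError:
            logger.debug("OTel contrib packages not installed; skipping")
            return False
        initializer = initialize

    try:
        initializer()
    except Exception:
        # Telemetry failures must not take the service down.
        logger.exception("OTel auto-instrumentation bootstrap failed")
        return False

    _bootstrapped = True  # set only after the initializer succeeded
    return True
```

Because the flag is set last, a failed attempt leaves the process free to retry, while a successful one turns every later call into a no-op. The injectable `initializer` exists only so the guard can be exercised without OTel installed.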
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant UV as uvicorn parent
    participant W as spawn worker
    participant OM as otel_metrics (import)
    participant ENV as os.environ
    participant INIT as OTel initialize()
    participant LP as lifespan startup
    UV->>W: fork/spawn new process
    W->>OM: import otel_metrics (first import in app.py)
    OM->>ENV: _sync_instance_id_to_env(service.pod.pid)
    OM->>INIT: initialize() reads env, creates TracerProvider+MeterProvider
    INIT-->>OM: global providers installed
    OM->>OM: "_auto_instrumentation_bootstrapped = True"
    W->>W: import FastAPI, httpx, SQLAlchemy (already patched)
    W->>LP: lifespan startup
    LP->>OM: init_otel_metrics()
    OM->>OM: _global_meter_provider() finds bootstrap provider
    OM-->>LP: returns existing MeterProvider
    LP->>LP: "register auth_cache_*, db_* instruments"

Reviews (3): Last reviewed commit: "refactor(agentex): polish otel_metrics m..."