feat(observability): replace App Insights SDK with OpenTelemetry #3823
TaprootFreak wants to merge 2 commits
App Insights is no longer active in the new deployment. Replace the App Insights SDK with OpenTelemetry so request/dependency/exception telemetry flows into our self-hosted Grafana stack via OTLP.

- add src/tracing.ts: NodeSDK with auto-instrumentations (HTTP/DB/NestJS) and an OTLP/HTTP trace exporter. The collector endpoint comes exclusively from OTEL_EXPORTER_OTLP_ENDPOINT; tracing is disabled when it is unset. 4xx server responses are marked non-failures in the HTTP response hook, replicating the old App Insights telemetry processor.
- import the tracer first in main.ts (before any instrumented module) and drop the App Insights setup block.
- DfxLogger: record exceptions/events on the active span and append the trace id to each log line so logs can be correlated to traces.
- remove the applicationinsights dependency; document OTEL_EXPORTER_OTLP_ENDPOINT in .env.example.
- unit tests for the 4xx classifier and the logger span integration.

The collector/backend lives in a sibling infra change in our private deployment-config repo.
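The log-correlation part of the DfxLogger change can be illustrated with a small sketch. This is not the PR's code: `SpanContextLike` and `appendTraceContext` are hypothetical names, and the `key=value` suffix format is an assumption. In the real logger the context would presumably come from the active span (for example via `trace.getActiveSpan()?.spanContext()` from `@opentelemetry/api`); here it is passed in explicitly so the example runs without any dependency.

```typescript
// Hypothetical stand-in for the traceId/spanId fields of an OpenTelemetry SpanContext.
interface SpanContextLike {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
}

// Append trace/span ids to a log line. The " trace_id=… span_id=…" layout is an
// assumption for illustration, not the format used by DfxLogger.
function appendTraceContext(line: string, ctx?: SpanContextLike): string {
  // No active span: leave the line untouched.
  if (!ctx) return line;
  // An all-zero trace id is the OpenTelemetry marker for an invalid context.
  if (/^0+$/.test(ctx.traceId)) return line;
  return `${line} trace_id=${ctx.traceId} span_id=${ctx.spanId}`;
}
```

With ids on every line, a log backend can link a log entry to its trace by searching for the `trace_id` value.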
The HTTP responseHook ran before the instrumentation's own server-span status logic, so it could not override 4xx statuses. Replace it with a ClientErrorSpanProcessor that runs on span end (after instrumentation): it resets 4xx SERVER spans that were flagged ERROR (e.g. by a request that logs through DfxLogger.error) back to UNSET, so only 5xx count as failures — matching the old App Insights telemetry processor. Add unit tests for the processor and the status classifier.
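A minimal sketch of the span-end approach described above, under stated assumptions. The constants and the `EndedSpan` shape are local stand-ins that mirror the numeric values of `SpanKind` and `SpanStatusCode` in `@opentelemetry/api`; the class name carries a `Sketch` suffix because it is not the PR's `ClientErrorSpanProcessor`, and reading the status from `http.response.status_code` with a fallback to the legacy `http.status_code` attribute is an assumption about which attribute the instrumentation sets.

```typescript
// Local stand-ins mirroring the numeric values used by @opentelemetry/api.
const SpanKind = { INTERNAL: 0, SERVER: 1, CLIENT: 2 } as const;
const SpanStatusCode = { UNSET: 0, OK: 1, ERROR: 2 } as const;

// Minimal structural view of an ended span; the SDK's ReadableSpan has more fields.
interface EndedSpan {
  kind: number;
  status: { code: number; message?: string };
  attributes: Record<string, unknown>;
}

// True only for numeric HTTP status codes in the 4xx range.
function isClientError(statusCode: unknown): boolean {
  return typeof statusCode === 'number' && statusCode >= 400 && statusCode < 500;
}

// Same method names as the SDK's SpanProcessor interface, with simplified signatures.
class ClientErrorSpanProcessorSketch {
  onStart(): void {
    // Nothing to do at span start.
  }

  onEnd(span: EndedSpan): void {
    // Only inbound (SERVER) spans that were flagged ERROR are candidates.
    if (span.kind !== SpanKind.SERVER) return;
    if (span.status.code !== SpanStatusCode.ERROR) return;
    const status =
      span.attributes['http.response.status_code'] ?? span.attributes['http.status_code'];
    // Reset 4xx back to UNSET so that only 5xx remain failures.
    if (isClientError(status)) span.status = { code: SpanStatusCode.UNSET };
  }

  forceFlush(): Promise<void> {
    return Promise.resolve();
  }

  shutdown(): Promise<void> {
    return Promise.resolve();
  }
}
```

For the reset to be visible in exported data, a processor like this would have to run before the exporting span processor sees the span.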
✅ End-to-end verification (local)

The instrumentation was verified by booting this branch's […]

Traces arrive: request + dependency spans

Outbound HTTP dependency spans were captured with the real target host + status, e.g.: […]

So 4xx-not-a-failure ([…]
Summary
App Insights is no longer active in the new deployment, so the deep App Insights SDK integration only adds dead weight. This replaces it with OpenTelemetry, exporting request/dependency/exception telemetry over OTLP to our self-hosted observability stack — restoring request-duration tracking, dependency traces, latency percentiles and exception correlation.
The collector and trace backend are provisioned by a sibling infra change in our private deployment-config repo; this PR is the application half. The two are designed to be merged independently (tracing is a no-op until the endpoint env var is set).
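The "no-op until the endpoint env var is set" behaviour can be sketched as a small pure function. This is an illustration, not the contents of `src/tracing.ts`: `resolveTracingConfig`, `TracingConfig` and `Env` are hypothetical names, and the environment is passed in as a plain record so the example runs anywhere. The `/v1/traces` suffix follows the OTLP/HTTP convention for a base endpoint; the OTLP exporter normally derives that path itself when it reads the variable, so computing it here is only for clarity.

```typescript
// Hypothetical types for illustration; not part of the PR.
type Env = Record<string, string | undefined>;

interface TracingConfig {
  enabled: boolean;
  // Traces URL derived from the base endpoint; absent when tracing is disabled.
  tracesUrl?: string;
}

function resolveTracingConfig(env: Env): TracingConfig {
  const endpoint = env['OTEL_EXPORTER_OTLP_ENDPOINT']?.trim();
  // Unset or blank variable: tracing stays off, so local/test runs are unaffected.
  if (!endpoint) return { enabled: false };
  // Strip trailing slashes, then append the OTLP/HTTP per-signal path.
  return { enabled: true, tracesUrl: `${endpoint.replace(/\/+$/, '')}/v1/traces` };
}
```

A bootstrap module could call such a function once at startup and only construct and start the SDK when `enabled` is true.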
Changes
- `src/tracing.ts` (new): `NodeSDK` with `getNodeAutoInstrumentations()` (HTTP/HTTPS, `pg`/TypeORM, NestJS) and an `OTLPTraceExporter` (OTLP/HTTP). Started before any instrumented module loads (imported first in `main.ts`). The exporter target comes exclusively from `OTEL_EXPORTER_OTLP_ENDPOINT` (no hardcoded address), and tracing is disabled when the variable is unset, so local/test runs are unaffected. `responseHook` marks 4xx server responses as `OK`, replicating the old App Insights telemetry processor (only 5xx count as server errors). Outbound client 4xx keep their default status.
- `src/main.ts`: import the tracer first; remove the `applicationinsights` setup block.
- `src/shared/services/dfx-logger.ts`: drop the App Insights telemetry sink; record exceptions/events on the active span and append `trace_id`/`span_id` to each log line so logs correlate to traces in the log backend.
- `package.json`: remove `applicationinsights`; add `@opentelemetry/{sdk-node,auto-instrumentations-node,exporter-trace-otlp-http,api}`. Lockfile change is OTel-only (no unrelated transitive bumps).
- `.env.example`: document `OTEL_EXPORTER_OTLP_ENDPOINT` (replaces `APPINSIGHTS_INSTRUMENTATIONKEY`).

What replaces what
- `pg`/TypeORM auto-instrumentation
- `span.recordException` in `DfxLogger`
- `trace_id` appended to every log line
- `responseHook` sets 4xx server spans to OK

Out of scope
The App Insights query feature (`app-insights-query.service.ts`, `Config.azure.appInsights`) reads telemetry via the REST API and is unrelated to the SDK/sink, so it is left untouched.

Verification
`npm ci` ✓ · `npm run lint` ✓ · `npm run format:check` ✓ · `npm run build` ✓ · `npm test` ✓ (1012 passed). Full end-to-end verification against a local OTLP collector + trace backend is documented in a follow-up comment.

Notes for reviewers / ops
- `OTEL_EXPORTER_OTLP_ENDPOINT` must be added to the Vault items (DEV + PRD) for tracing to activate. No code change is needed to toggle it.
- `applicationinsights` is fully removed; no source imports it anymore.