feat(observability): replace App Insights SDK with OpenTelemetry #3823

Open
TaprootFreak wants to merge 2 commits into develop from feat/observability-opentelemetry

@TaprootFreak
Collaborator

Summary

App Insights is no longer active in the new deployment, so the deep App Insights SDK integration only adds dead weight. This replaces it with OpenTelemetry, exporting request/dependency/exception telemetry over OTLP to our self-hosted observability stack — restoring request-duration tracking, dependency traces, latency percentiles and exception correlation.

The collector and trace backend are provisioned by a sibling infra change in our private deployment-config repo; this PR is the application half. The two are designed to be merged independently (tracing is a no-op until the endpoint env var is set).

Changes

  • src/tracing.ts (new): NodeSDK with getNodeAutoInstrumentations() (HTTP/HTTPS, pg/TypeORM, NestJS) and an OTLPTraceExporter (OTLP/HTTP). Started before any instrumented module loads (imported first in main.ts). The exporter target comes exclusively from OTEL_EXPORTER_OTLP_ENDPOINT — no hardcoded address — and tracing is disabled when the variable is unset, so local/test runs are unaffected.
  • 4xx-not-a-failure: the HTTP responseHook marks 4xx server responses as OK, replicating the old App Insights telemetry processor (only 5xx count as server errors). Outbound client 4xx keep their default status.
  • src/main.ts: import the tracer first; remove the applicationinsights setup block.
  • src/shared/services/dfx-logger.ts: drop the App Insights telemetry sink; record exceptions/events on the active span and append trace_id/span_id to each log line so logs correlate to traces in the log backend.
  • package.json: remove applicationinsights; add @opentelemetry/{sdk-node,auto-instrumentations-node,exporter-trace-otlp-http,api}. Lockfile change is OTel-only (no unrelated transitive bumps).
  • .env.example: document OTEL_EXPORTER_OTLP_ENDPOINT (replaces APPINSIGHTS_INSTRUMENTATIONKEY).
  • Tests: unit tests for the 4xx classifier and the logger's span integration.
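
The endpoint gating described in the list above can be sketched without the SDK itself. This is a minimal sketch, not the contents of src/tracing.ts: `startTracing` and its `startSdk` callback are hypothetical names, the callback stands in for wherever the NodeSDK would be constructed and started, and treating a blank value as unset is an assumption.

```typescript
// Hypothetical helper; names are illustrative and not taken from src/tracing.ts.
type Env = Readonly<Record<string, string | undefined>>;

// Starts tracing only when OTEL_EXPORTER_OTLP_ENDPOINT is set.
// Returns true if the SDK was started, false if tracing stays a no-op.
function startTracing(env: Env, startSdk: (endpoint: string) => void): boolean {
  const raw = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  const endpoint = raw === undefined ? "" : raw.trim();

  // Assumption: an empty or whitespace-only value counts as "unset".
  if (endpoint === "") {
    return false;
  }

  // In the real module this is where the NodeSDK and the OTLP exporter
  // would be created; here the caller supplies that step.
  startSdk(endpoint);
  return true;
}

// Usage: local/test runs pass an environment without the variable.
const startedEndpoints: string[] = [];
const localRun = startTracing({}, (e) => { startedEndpoints.push(e); });
// localRun is false and startedEndpoints is still empty
```

Passing the environment in as a parameter rather than reading it globally keeps the gate easy to unit test; in the application the argument would presumably be process.env.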

What replaces what

| App Insights feature | Replacement |
| --- | --- |
| Request tracking (duration, status) | OTel HTTP/NestJS auto-instrumentation |
| Dependency tracking (HTTP, DB) | OTel HTTP + pg/TypeORM auto-instrumentation |
| Exception tracking | span.recordException in DfxLogger |
| Log ↔ request correlation | trace_id appended to every log line |
| 4xx not counted as failure | HTTP responseHook sets 4xx server spans to OK |

Out of scope

The App Insights query feature (app-insights-query.service.ts, Config.azure.appInsights) reads telemetry via the REST API and is unrelated to the SDK/sink — left untouched.

Verification

npm ci ✓ · npm run lint ✓ · npm run format:check ✓ · npm run build ✓ · npm test ✓ (1012 passed). Full end-to-end verification against a local OTLP collector + trace backend is documented in a follow-up comment.

Notes for reviewers / ops

  • New env var OTEL_EXPORTER_OTLP_ENDPOINT must be added to the Vault items (DEV + PRD) for tracing to activate. No code change is needed to toggle it.
  • applicationinsights is fully removed; no source imports it anymore.

App Insights is no longer active in the new deployment. Replace the App
Insights SDK with OpenTelemetry so request/dependency/exception telemetry
flows into our self-hosted Grafana stack via OTLP.

- add src/tracing.ts: NodeSDK with auto-instrumentations (HTTP/DB/NestJS)
  and an OTLP/HTTP trace exporter. The collector endpoint comes exclusively
  from OTEL_EXPORTER_OTLP_ENDPOINT; tracing is disabled when it is unset.
  4xx server responses are marked non-failures in the HTTP response hook,
  replicating the old App Insights telemetry processor.
- import the tracer first in main.ts (before any instrumented module) and
  drop the App Insights setup block.
- DfxLogger: record exceptions/events on the active span and append the
  trace id to each log line so logs can be correlated to traces.
- remove the applicationinsights dependency; document
  OTEL_EXPORTER_OTLP_ENDPOINT in .env.example.
- unit tests for the 4xx classifier and the logger span integration.

The collector/backend lives in a sibling infra change in our private
deployment-config repo.

The HTTP responseHook ran before the instrumentation's own server-span
status logic, so it could not override 4xx statuses. Replace it with a
ClientErrorSpanProcessor that runs on span end (after instrumentation):
it resets 4xx SERVER spans that were flagged ERROR (e.g. by a request that
logs through DfxLogger.error) back to UNSET, so only 5xx count as failures —
matching the old App Insights telemetry processor. Add unit tests for the
processor and the status classifier.
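
The span-end logic described in this commit message might look roughly like the following. It is a sketch under stated assumptions, not the PR's ClientErrorSpanProcessor: the enums and the `EndedSpan` interface are local stand-ins for the OpenTelemetry types, `resetClientErrorStatus` is a hypothetical name for what the processor does when a span ends, and reading the status from the `http.status_code` attribute is an assumption based on the attribute name shown in the verification output.

```typescript
// Local stand-ins for the OpenTelemetry span shapes (assumption: the real
// processor works on the SDK's own span type instead).
enum SpanKind { INTERNAL, SERVER, CLIENT }
enum StatusCode { UNSET, OK, ERROR }

interface EndedSpan {
  kind: SpanKind;
  status: { code: StatusCode; message?: string };
  attributes: Record<string, unknown>;
}

// Status classifier: 4xx is a client error, not a server failure.
function isClientError(statusCode: unknown): boolean {
  return typeof statusCode === "number" && statusCode >= 400 && statusCode < 500;
}

// Intended to run when a span ends, i.e. after the instrumentation has
// already written its own status.
function resetClientErrorStatus(span: EndedSpan): void {
  if (span.kind !== SpanKind.SERVER) return;            // outbound CLIENT spans keep their status
  if (span.status.code !== StatusCode.ERROR) return;    // nothing to reset
  if (!isClientError(span.attributes["http.status_code"])) return; // 5xx stays ERROR
  span.status = { code: StatusCode.UNSET };
}

// Usage: a 401 server span that was flagged ERROR ends up UNSET.
const unauthorized: EndedSpan = {
  kind: SpanKind.SERVER,
  status: { code: StatusCode.ERROR },
  attributes: { "http.status_code": 401 },
};
resetClientErrorStatus(unauthorized);
```

Resetting to UNSET rather than OK leaves the span looking the same as any other successful request that never had an explicit status set.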
@TaprootFreak
Collaborator Author

✅ End-to-end verification (local)

The instrumentation was verified by booting this branch's dfx-api image against a full local OpenTelemetry stack via docker compose: a local OTLP collector (Grafana Alloy) → Tempo (trace backend) → Prometheus (RED metrics) and Grafana, plus Loki for logs. This is the same topology used in our deployment. dfx-api ran in ENVIRONMENT=loc against a local Postgres, with OTEL_EXPORTER_OTLP_ENDPOINT pointed at the local collector (env-only, nothing hardcoded). ~600 real HTTP requests were sent.

Traces arrive — request + dependency spans

A GET /v1/asset request produced a complete span tree:

[SERVER]   GET /v1/asset                         ← request duration
[INTERNAL] AssetController.getAllAsset / NestJS middleware
[CLIENT]   pg.query:SELECT   db.system=postgresql   ← DB dependency (TypeORM/pg)
[CLIENT]   pg-pool.connect   db.system=postgresql

Outbound HTTP dependency spans were captured with the real target host + status, e.g.:

[CLIENT] POST  net.peer.name=eth-mainnet.g.alchemy.com   http.status_code=401
[CLIENT] GET   net.peer.name=static.polygon.technology   http.status_code=200

So getNodeAutoInstrumentations() correctly traces inbound requests, outbound HTTP, and DB queries.

4xx-not-a-failure (ClientErrorSpanProcessor)

  • GET /v1/lnurld/:id → 500 → server span status = STATUS_CODE_ERROR
  • GET /v1/user → 401 → server span status = STATUS_CODE_UNSET

RED-metric breakdown over the run: STATUS_CODE_ERROR = 20 (only the 5xx), STATUS_CODE_UNSET = 299 (all 200s and all 4xx). 4xx are excluded from the error rate, matching the old App Insights telemetry processor. Latency percentiles from the derived metrics: p50 1.5 ms / p95 7.6 ms / p99 15.9 ms.

Log ↔ trace correlation

A dfx-api log line carried trace_id=448ce7b040ec801b88e933196a2c4144 span_id=726dd894a1c9f8ac (appended by DfxLogger), and that exact trace was present in the trace backend (root span GET /v1/lnurld/:id) — log lines link straight to their trace.
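
As an illustration of the suffix format shown above, the following sketch appends the two ids to a log message. It is not the DfxLogger code: `withTraceContext` and the `TraceIds` interface are hypothetical, the message text is made up, and how the logger obtains the active span is left out; only the `trace_id=… span_id=…` shape and the two id values come from the log line quoted above.

```typescript
// Stand-in for the trace and span ids carried by an active span's context.
interface TraceIds {
  traceId: string;
  spanId: string;
}

// Appends trace_id/span_id to a log message when a span is active.
function withTraceContext(message: string, active?: TraceIds): string {
  if (active === undefined) {
    return message; // no active span, e.g. tracing disabled: line is unchanged
  }
  return `${message} trace_id=${active.traceId} span_id=${active.spanId}`;
}

// Usage with the ids from the verification run; the message is illustrative.
const correlatedLine = withTraceContext("request failed", {
  traceId: "448ce7b040ec801b88e933196a2c4144",
  spanId: "726dd894a1c9f8ac",
});
// correlatedLine ===
// "request failed trace_id=448ce7b040ec801b88e933196a2c4144 span_id=726dd894a1c9f8ac"
```

Keeping the ids as plain key=value tokens at the end of the line means a log backend can extract them with a simple pattern and link the line to its trace.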

Boot & SDK ordering

The image built from this branch (npm ci → nest build) boots cleanly; the compiled main.js runs require("./tracing") as its first statement, so the SDK starts before any instrumented module is loaded. With OTEL_EXPORTER_OTLP_ENDPOINT unset, tracing is a no-op and the app is unchanged.

CI

npm ci · lint · format:check · build · test (1016 passed, incl. the new tracing/logger specs) all green.

@TaprootFreak TaprootFreak marked this pull request as ready for review June 4, 2026 21:45
@TaprootFreak TaprootFreak requested a review from davidleomay as a code owner June 4, 2026 21:45