otel: explicit traceparent injection + linked-trace mode for bounded per-invocation traces #2363
karthikscale3 wants to merge 8 commits into
…per-invocation traces
- Add WORKFLOW_TRACE_MODE ('linked' default, 'continuous' legacy) to the
workflow and step queue handlers. In linked mode, WORKFLOW_V2/STEP spans
start a new trace root with span links to the incoming delivery context
and the run-origin context, and re-enqueued messages forward the
ORIGINAL run-origin trace carrier unchanged.
- world-vercel now explicitly injects W3C traceparent/tracestate/baggage
headers on outgoing workflow-server HTTP requests from inside the
client span (no-op without an OTEL SDK registered).
- New workflow.trace.mode span attribute; unit tests for both modes and
for header injection.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
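The header injection described in this commit message can be illustrated with a minimal sketch. It assumes nothing about the real helper beyond what the PR states: the actual code delegates to the registered OTEL propagator, whereas this sketch formats the W3C `traceparent` value by hand. `SpanContextSketch`, `formatTraceparent`, and `injectTraceparentSketch` are hypothetical names used only for illustration.

```typescript
// Hypothetical shape for illustration only; the real world-vercel helper
// relies on the registered OTEL propagator instead of formatting by hand.
interface SpanContextSketch {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// W3C Trace Context, version 00: "00-<trace-id>-<parent-id>-<trace-flags>".
function formatTraceparent(ctx: SpanContextSketch): string {
  const flags = ctx.sampled ? '01' : '00';
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Returns a copy of the headers with traceparent added. When there is no
// active span context the headers are returned unchanged, mirroring the
// "no-op without an OTEL SDK registered" behavior described above.
function injectTraceparentSketch(
  headers: Record<string, string>,
  active: SpanContextSketch | undefined
): Record<string, string> {
  if (!active) return headers;
  return { ...headers, traceparent: formatTraceparent(active) };
}

const clientSpan: SpanContextSketch = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
  sampled: true,
};
const outgoing = injectTraceparentSketch(
  { accept: 'application/json' },
  clientSpan
);
// outgoing.traceparent is
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
```

Because the value is derived from the client span's own context, a receiver that extracts `traceparent` would parent its spans to that client span.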
🧪 E2E Test Results
❌ Some tests failed
Failed tests: ▲ Vercel Production (1 failed), in astro (1 failed).
Details by category:
❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
❌ Some E2E test jobs failed. Check the workflow run for details.
🦋 Changeset detected
Latest commit: 5b3ca9f
The changes in this PR will be included in the next version bump. This PR includes changesets to release 22 packages.
WORKFLOW_V2/STEP prefixes with full machine names (workflow//./src/...//fn) become workflow.execute / step.execute / workflow.start with the short function name. New workflowDisplayName/stepDisplayName helpers in @workflow/utils handle both raw and queue-sanitized name forms; full names remain in the workflow.name/step.name attributes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pranaygp
left a comment
Reviewed with a focus on attribute consistency, forwards compatibility, DX, and perf overhead. Overall this is solid: the linked-mode semantics are coherent with the other three PRs in the stack (baggage keys match what workflow-server#514 reads; the run-origin carrier semantics match the server's executionContext.traceCarrier-based span links, which are also pinned to run origin; world-vercel consumes deliveries via @vercel/queue.handleCallback, so vqs#181's consumer-side extraction is exactly what feeds linkToCurrentContext). Perf-wise the change is a net reduction when OTEL is active (linked mode skips a propagation.inject per re-enqueue) and stays a memoized no-op without an SDK. Ran the new test suites and typecheck locally — all green.
No blocking issues. Inline comments below: one behavioral edge around empty {} carriers in linked mode, a code-duplication suggestion, a DX nit on unrecognized WORKFLOW_TRACE_MODE values, two display-name edge cases, and a doc accuracy fix on span kinds.
// continuous mode the current (active) context is serialized so the
// trace keeps chaining.
const getNextTraceCarrier = (): Promise<Record<string, string>> =>
  traceMode === 'linked' && traceContext
traceContext here can be {} and still take the linked branch: start() always attaches a carrier, and serializeTraceCarrier() returns {} both when no OTEL SDK is registered at the origin and when OTEL is registered but start() runs outside any active span (background job, script). For such runs, linked mode forwards the empty object forever, while the undefined branch adaptively falls back to serializeTraceCarrier() (making the first instrumented invocation the de-facto run origin for future links).
Consider treating an empty carrier like an absent one — e.g. traceContext && Object.keys(traceContext).length > 0 — so both "no usable origin" shapes behave the same (same applies to the copy in step-handler.ts). Related side effect (pre-existing, but more visible now): workflow.trace.propagated is !!traceContext, so it reports true for {} even though there's nothing usable to link to.
Fixed in 5b3ca9f. Added isUsableTraceCarrier() and normalized the incoming carrier at the top of both queue handlers, so {} counts as "no usable origin" everywhere the mode logic branches — linked mode falls back to serializing the current context (first instrumented invocation becomes the de-facto origin) instead of forwarding {} forever. Also took the related side effect: workflow.trace.propagated now reports whether a usable (non-empty) carrier arrived. Test added pinning traceCarrier: {} ≡ no carrier.
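A minimal sketch of the empty-carrier normalization discussed in this thread. The PR names `isUsableTraceCarrier()` but does not show its body, so the implementation below is an assumption based on the reviewer's suggested `Object.keys(...).length > 0` check; `normalizeCarrierSketch` is a hypothetical helper name.

```typescript
type TraceCarrier = Record<string, string>;

// Assumed body: a carrier is usable only when it is present and has at
// least one entry, so `{}` is treated the same as an absent carrier.
function isUsableTraceCarrierSketch(
  carrier: TraceCarrier | null | undefined
): carrier is TraceCarrier {
  return carrier != null && Object.keys(carrier).length > 0;
}

// Hypothetical normalization step run once at the top of a queue handler,
// so every later branch only has to distinguish "carrier" from "undefined".
function normalizeCarrierSketch(
  carrier: TraceCarrier | null | undefined
): TraceCarrier | undefined {
  return isUsableTraceCarrierSketch(carrier) ? carrier : undefined;
}

const fromUninstrumentedOrigin = normalizeCarrierSketch({});
const fromInstrumentedOrigin = normalizeCarrierSketch({
  traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
});

// Under this sketch, a `workflow.trace.propagated`-style flag would be
// false for `{}` and true for a carrier with entries.
const propagatedEmpty = fromUninstrumentedOrigin !== undefined;
const propagatedReal = fromInstrumentedOrigin !== undefined;
```

Normalizing once keeps the mode logic from having to re-check for `{}` at each branch.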
// so every future invocation links back to the same origin; in
// continuous mode the current (active) context is serialized so the
// trace keeps chaining.
const getNextTraceCarrier = (): Promise<Record<string, string>> =>
This block — getNextTraceCarrier plus the origin-link dedup below (lines ~191–210) — is duplicated nearly verbatim from runtime.ts (~343–366). Since this encodes the core linked-mode invariants (forward the original carrier; dedup origin vs delivery link), consider extracting two small helpers into telemetry.ts, e.g. nextTraceCarrier(traceMode, traceContext) and buildInvocationSpanLinks(traceMode, traceContext), so the semantics can't drift between the workflow and step handlers.
Bonus if you do: resume-hook.ts (~186–193) has a hand-rolled version of carrier→link that lacks the isSpanContextValid guard your new linkToTraceCarrier has — it could reuse the helper too.
Fixed in 5b3ca9f. Extracted both invariants into telemetry.ts as getNextTraceCarrier(traceMode, incomingCarrier) and buildInvocationSpanLinks(traceMode, incomingCarrier) (exact prior semantics, pinned by the existing trace-mode tests), now used by both runtime.ts and step-handler.ts. Took the bonus too: resume-hook.ts now uses linkToTraceCarrier and gains the isSpanContextValid guard it was missing.
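The two linked-mode invariants named in this thread (forward the original carrier; dedup the origin link against the delivery link) can be sketched without any OTEL dependency. This is an illustration under stated assumptions, not the helpers from `telemetry.ts`: the real `getNextTraceCarrier` returns a Promise and serializes the active context itself, while this sketch is synchronous and takes a hypothetical `serializeCurrent` callback. All names ending in `Sketch`, plus `parseTraceparent`, are invented for the example.

```typescript
type TraceMode = 'linked' | 'continuous';
type TraceCarrier = Record<string, string>;
interface LinkSketch {
  traceId: string;
  spanId: string;
}

// linked + usable origin: forward the original carrier unchanged.
// continuous, or no usable origin: serialize the current context instead.
function nextTraceCarrierSketch(
  mode: TraceMode,
  incoming: TraceCarrier | undefined,
  serializeCurrent: () => TraceCarrier
): TraceCarrier {
  if (
    mode === 'linked' &&
    incoming !== undefined &&
    Object.keys(incoming).length > 0
  ) {
    return incoming;
  }
  return serializeCurrent();
}

// Parses a W3C traceparent value; malformed or all-zero ids are invalid.
function parseTraceparent(value: string | undefined): LinkSketch | undefined {
  if (value === undefined) return undefined;
  const m = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(
    value
  );
  if (!m) return undefined;
  if (/^0+$/.test(m[1]) || /^0+$/.test(m[2])) return undefined;
  return { traceId: m[1], spanId: m[2] };
}

// Linked mode only: link to the delivery context, then to the run origin
// unless it is absent, invalid, or identical to the delivery link.
function invocationLinksSketch(
  delivery: LinkSketch | undefined,
  origin: TraceCarrier | undefined
): LinkSketch[] {
  const links: LinkSketch[] = [];
  if (delivery) links.push(delivery);
  const originLink = parseTraceparent(origin ? origin.traceparent : undefined);
  if (!originLink) return links;
  const sameAsDelivery =
    delivery !== undefined &&
    delivery.traceId === originLink.traceId &&
    delivery.spanId === originLink.spanId;
  if (!sameAsDelivery) links.push(originLink);
  return links;
}

const originCarrier: TraceCarrier = {
  traceparent: '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
};
const currentCarrier: TraceCarrier = {
  traceparent: '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01',
};
const forwarded = nextTraceCarrierSketch(
  'linked',
  originCarrier,
  () => currentCarrier
);
```

Keeping both rules in one place is what the review asks for: the workflow and step handlers then cannot drift apart in how they forward carriers or build links.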
 * Defaults to `'linked'`; any value other than `'continuous'` selects it.
 */
export function getWorkflowTraceMode(): WorkflowTraceMode {
  return process.env.WORKFLOW_TRACE_MODE === 'continuous'
Any unrecognized value silently selects linked — a typo like WORKFLOW_TRACE_MODE=continous changes trace topology with zero signal, and if a future SDK version adds a third mode, older SDKs will silently reinterpret it as linked. A one-time runtimeLogger.warn for non-empty unrecognized values would make misconfiguration debuggable and give forward compatibility a soft landing. (Resolving once into a module-level constant would also give you the warn-once behavior for free — the env var can't meaningfully change mid-process anyway.)
Fixed in 5b3ca9f. Kept the dynamic per-call env read (the trace-mode tests flip WORKFLOW_TRACE_MODE per test, so a module-level constant would break them) and added a one-time runtimeLogger.warn per distinct unrecognized non-empty value, naming the value and the accepted ones before falling back to linked. Test asserts the warning fires exactly once for a continous typo.
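The warn-once-per-distinct-value behavior described in this reply could look roughly like the sketch below. It is an assumption-laden illustration: the raw value and the logger are passed in as parameters (the real code reads `process.env.WORKFLOW_TRACE_MODE` and uses `runtimeLogger.warn`), and `resolveTraceModeSketch` and the message wording are hypothetical.

```typescript
type WorkflowTraceMode = 'linked' | 'continuous';

// Values already warned about; kept at module level so each distinct
// unrecognized value is reported at most once per process.
const warnedValues: string[] = [];

function resolveTraceModeSketch(
  raw: string | undefined,
  warn: (message: string) => void
): WorkflowTraceMode {
  if (raw === 'continuous') return 'continuous';
  const unrecognized = raw !== undefined && raw !== '' && raw !== 'linked';
  if (unrecognized && warnedValues.indexOf(raw as string) === -1) {
    warnedValues.push(raw as string);
    warn(
      `Unrecognized WORKFLOW_TRACE_MODE "${raw}"; ` +
        'accepted values are "linked" and "continuous". Using "linked".'
    );
  }
  return 'linked';
}

// Usage: the typo from the review thread warns once, then stays quiet.
const warnings: string[] = [];
const collect = (message: string): void => {
  warnings.push(message);
};
const first = resolveTraceModeSketch('continous', collect);
const second = resolveTraceModeSketch('continous', collect);
```

Reading the value on every call, as the reply explains, keeps per-test environment changes working; the module-level list is what provides the once-only warning.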
if (!name.startsWith(`${tag}--`)) return null;
// The `//` separators became `--`, and within the function-name part any
// nested-function `/` became `-`. Function names are JS identifiers (no
// dashes), so the innermost name is the last dash-free segment.
Two best-effort edges worth noting in this comment (or handling):
1. `$` is a valid JS identifier character and gets sanitized to `-`, so `step//…//process$Order` in sanitized form displays as `Order`; "no dashes" isn't strictly true for identifiers.
2. Default exports diverge between the two input forms: `parseName` maps `default`/`__default` to the module short name, but this sanitized path returns the literal `default`. The same workflow can then show as `workflow.start order` (raw name in `start()`) but `workflow.execute default` (sanitized name in the queue handler). Mapping `default` to the preceding segment here would keep the two span names consistent.
Fixed in 5b3ca9f. (2) is handled: shortNameFromSanitized now maps default/__default to the preceding module segment's short name, mirroring parseName, so default exports display consistently (e.g. order) in both workflow.start and workflow.execute — pinned by a test. (1) is documented as an accepted best-effort limitation in the comment ($ sanitizes to -, so process$Order displays as Order), with a test pinning the behavior.
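The display-name rules from this thread can be sketched as follows. This is a simplified illustration, not the code in `@workflow/utils`: it assumes the sanitized form looks like the PR's own example (`workflow----src-jobs-order--processOrder`) and that the short name is the last dash-free segment, with `default`/`__default` mapped to the preceding segment. `shortNameFromSanitizedSketch` is a hypothetical name.

```typescript
// Best-effort short name from a queue-sanitized machine name.
// Returns null when the name does not carry the expected tag prefix.
function shortNameFromSanitizedSketch(
  name: string,
  tag: 'workflow' | 'step'
): string | null {
  const prefix = `${tag}--`;
  if (name.slice(0, prefix.length) !== prefix) return null;

  // After sanitization every separator is a dash, so splitting on "-" and
  // dropping empty pieces leaves the path and function-name segments.
  const segments = name
    .slice(prefix.length)
    .split('-')
    .filter((segment) => segment.length > 0);
  if (segments.length === 0) return null;

  const last = segments[segments.length - 1];
  const isDefaultExport = last === 'default' || last === '__default';
  if (isDefaultExport && segments.length > 1) {
    // Assumed mapping: a default export displays as the module's short name.
    return segments[segments.length - 2];
  }
  return last;
}

const named = shortNameFromSanitizedSketch(
  'workflow----src-jobs-order--processOrder',
  'workflow'
);
const defaultExport = shortNameFromSanitizedSketch(
  'workflow----src-jobs-order--default',
  'workflow'
);
// Accepted limitation from the thread: `$` sanitizes to `-`, so a step
// named process$Order arrives as process-Order and displays as "Order".
const dollarName = shortNameFromSanitizedSketch(
  'step----src-jobs-order--process-Order',
  'step'
);
```

Under these assumptions `named` is `processOrder`, `defaultExport` is `order`, and `dollarName` is `Order`, matching the examples given in the review and reply.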
| --- | --- | --- |
| `workflow.start <name>` | internal | `start()` is called in your application code |
| `workflow.execute <name>` | internal (root) | a queue delivery invokes the workflow; replay, orchestration, and inline steps run under it |
| `step.execute <name>` | internal | a step function executes |
Kind is inaccurate for the queue-delivered case: step-handler.ts creates this span with SpanKind.CONSUMER (only inline steps executed within workflow.execute are internal), and in linked mode the queue-delivered step.execute span is also a new trace root, same as workflow.execute. Suggest something like: internal (inline) / consumer + root (queue-delivered).
Fixed in 5b3ca9f. Table now reads workflow.execute → consumer (root) (it's CONSUMER as of this commit, see the other thread) and step.execute → internal (inline) / consumer + root (queue-delivered), per your suggested wording.
return trace(
  `WORKFLOW_V2 ${workflowName}`,
  { links: spanLinks },
  `workflow.execute ${workflowDisplayName(workflowName)}`,
Pre-existing inconsistency, but this PR's v5 window is the cheapest moment to fix it: this queue-delivered span has default INTERNAL kind while the equivalent queue-delivered step.execute span uses CONSUMER. Messaging semconv would suggest CONSUMER here too — and it would pair nicely with the PRODUCER-kind vqs.send span being added on the other side in vercel/vqs#181. Fine as a follow-up, but if you want it, doing it inside the same beta avoids a second span-shape change.
Done in 5b3ca9f — agreed this beta is the cheapest window. The queue-delivered workflow.execute span now sets kind: CONSUMER via the same getSpanKind('CONSUMER') pattern step-handler uses (both modes), pairing with the PRODUCER vqs.send span in vercel/vqs#181. Added a SpanKind.CONSUMER assertion to the trace-mode test and a changeset bullet noting the internal→consumer kind change.
pranaygp
left a comment
Pre-emptively approving — no blocking bugs, perf is clean (net reduction when OTEL is active, memoized no-op without it), and the cross-repo semantics line up with vqs-server#615 / workflow-server#514 / vqs#181.
@karthikscale3 please address the inline comments from my review (#2363 (review)) before merging — in particular:
- Empty `{}` carrier in linked mode (`runtime.ts:344` + the `step-handler.ts` copy): runs started from uninstrumented contexts silently lose run-level correlation links; treat an empty carrier like an absent one.
- Silent fallback on unrecognized `WORKFLOW_TRACE_MODE` values (`telemetry.ts:31`): a typo flips trace topology with zero signal; add a warn-once.
The rest (dedup extraction, display-name edges, docs span-kind row, CONSUMER kind for workflow.execute) are nice-to-haves — fine in this PR or as follow-ups.
…ame edge cases, consumer span kind
- Treat an empty ({}) trace carrier as absent everywhere the trace-mode
logic branches, so linked mode falls back to a fresh origin instead of
forwarding a useless {} forever; workflow.trace.propagated now reports
whether a usable carrier arrived.
- Extract the duplicated linked-mode logic into shared telemetry helpers
getNextTraceCarrier() and buildInvocationSpanLinks(), used by both the
workflow and step queue handlers; resume-hook now uses
linkToTraceCarrier (gaining the isSpanContextValid guard).
- Warn once per distinct unrecognized WORKFLOW_TRACE_MODE value instead
of silently selecting linked.
- shortNameFromSanitized: map default/__default to the module short name
(mirroring parseName) and document the `$`-sanitization limitation.
- Queue-delivered workflow.execute spans now use the CONSUMER span kind,
matching queue-delivered step.execute spans; docs span table and
changeset updated accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem: mega-traces
Today the workflow queue handlers restore the run-origin trace context from the message's `traceCarrier` and make it the parent of every `WORKFLOW_V2`/`STEP` invocation span. Since each invocation re-serializes its own context onto the next queue message, a single workflow run becomes one giant trace, spanning hours of sleeps/retries and dozens of stitched-together function invocations. These mega-traces are slow to load, hard to read, and frequently broken in Datadog (span limits, late-arriving spans, partial flushes).

Separately, the `world-vercel` HTTP client creates a CLIENT span for every workflow-server request but never injects `traceparent` into the outgoing headers. Propagation only happened if the customer's app happened to have undici auto-instrumentation, so workflow-server spans usually couldn't join the caller's trace.

Linked-trace mode (new default)
This PR introduces `WORKFLOW_TRACE_MODE` with two values:
- `linked` (new default): each invocation's `WORKFLOW_V2 <name>` / `STEP <name>` span is created as a new trace root (`SpanOptions.root: true`) with span links to:
  - the incoming delivery context, and
  - the run-origin context restored from `traceCarrier` (skipped when absent/invalid or identical to the delivery link).

  Re-enqueued messages forward the original run-origin `traceCarrier` unchanged, so every future invocation of the run links back to the same origin. Traces stay small and bounded per invocation, while links preserve full run-level correlation.
- `continuous`: exactly the previous behavior. The restored run-origin context parents the invocation span, with a link to the delivery context, and re-enqueues serialize the current context. Set `WORKFLOW_TRACE_MODE=continuous` to opt back in.

Both modes keep `withWorkflowBaggage` wrapping, all existing span attributes (including `workflow.trace.propagated`), and add a new `workflow.trace.mode` attribute recording the active mode.

Explicit traceparent injection on workflow-server calls
`world-vercel`'s `makeRequest` now injects W3C context (`traceparent`, `tracestate`, `baggage`) into the outgoing request headers from inside the `http <method>` CLIENT span, via a new `injectTraceContextIntoHeaders(headers)` helper in `world-vercel`'s lazy telemetry module. workflow-server can now reliably parent its spans to the SDK's client span regardless of the customer's instrumentation setup.

Queue sends (`@vercel/queue`) are intentionally untouched here: VQS treats message `headers` as allowlisted custom headers, and HTTP-layer injection for queue sends is handled in the `@vercel/queue` client itself.

Behavioral changes to telemetry (please read)
The API is backward compatible, but the new `linked` default changes the shape of emitted traces in ways existing dashboards and queries can feel. Set `WORKFLOW_TRACE_MODE=continuous` to restore the previous shape exactly.
- Previously, the trace that began at `start()` contained the entire workflow execution: every `WORKFLOW_V2`/`STEP` span across all invocations carried the run-origin trace ID. Now each invocation is its own root trace. Anything keyed on a shared per-run trace ID (saved trace queries, "open my request's trace and see the run" debugging flows, trace-ID joins) must switch to span links or the `workflow.run.id` attribute.
- Previously, a sampling decision was made once at `start()` that covered the whole run consistently. Each invocation root now samples independently: ratio samplers will produce partially-sampled runs, and the number of root spans/traces increases to one per invocation (relevant for trace-volume-based vendor billing and rate-limiting samplers).
- Previously, `WORKFLOW_V2`/`STEP` spans had a remote parent; they are now parentless roots. Queries filtering on parent relationships and service-map edges from the calling service to the workflow handler will change.
- `traceCarrier` semantics change. Queue messages now forward the original run-origin carrier unchanged, rather than each invocation's current context. Custom worlds or tooling that introspect message carriers and assume "carrier = most recent invocation context" will observe different values.

Not changed: all existing span attributes and baggage keys, and the no-OTEL no-op behavior. One footnote: app-set baggage entries now also leave the process as a `baggage` HTTP request header on backend calls (they already left via `traceCarrier` in events).

Friendlier span names
Workflow/step span names previously used uppercase prefixes with full machine names (`WORKFLOW_V2 workflow//./src/jobs/order//processOrder`). They are now short and lowercase: `workflow.execute processOrder`, `step.execute chargeCard`, `workflow.start processOrder`. New `workflowDisplayName`/`stepDisplayName` helpers in `@workflow/utils` resolve both the raw machine name and the queue-sanitized form (`workflow----src-jobs-order--processOrder`) seen by queue handlers; unrecognized formats fall back to the raw string. The full machine name remains available in the `workflow.name`/`step.name` span attributes. This is also a span-name change for anyone querying `WORKFLOW_V2`/`STEP` names (same v5-beta reasoning as above).

Backward compatibility
- `@opentelemetry/api` stays an optional peer dep; without an OTEL SDK registered, the default no-op propagator injects nothing and no headers are added.
- `WORKFLOW_TRACE_MODE=continuous` restores the prior trace shape bit-for-bit (parenting, links, and carrier chaining).
- `traceparent`/`tracestate`/`baggage` are standard W3C headers; receivers without tracing simply drop them.

Testing
- `packages/core/src/runtime-trace-mode.test.ts`: default is linked; linked creates a root span with links to both delivery + run-origin contexts; continuous preserves the legacy parented shape; linked forwards the original `traceCarrier` on re-enqueues while continuous serializes the current context. Uses a real in-memory OTEL SDK (`BasicTracerProvider` + `InMemorySpanExporter` + W3C propagator).
- `packages/world-vercel/src/trace-propagation.test.ts`: `traceparent` lands on the outgoing request and matches the `http GET` client span; clean no-op without an active span context.
- `pnpm build`, `pnpm typecheck`, full unit suites for `packages/core` (1124 passed) and `packages/world-vercel` (134 passed), Biome format/lint clean.

Backport policy
Do not backport to `stable` (v4). The `linked` default is a deliberate telemetry-shape change scoped to the v5 beta major. Backporting it would change trace topology, per-run trace IDs, and sampling behavior for GA v4 users mid-major. v4 keeps its current behavior until users upgrade to v5; the platform side is fully tolerant of v4 SDKs.

Documentation
Adds `docs/content/docs/v5/observability/tracing.mdx` (linked from the Observability index): enabling OTEL, emitted spans and attributes, linked trace mode and span links, `WORKFLOW_TRACE_MODE` reference with a v4 behavior-change callout, and context-propagation/baggage notes. v4 docs intentionally untouched.

Rollout notes
Server-side support for storing and re-injecting trace context on queue deliveries ships separately on the platform. This PR is safe to merge and release independently: without the platform-side support, behavior is unchanged apart from the new (ignorable) W3C headers and the bounded linked-trace shape.

Follow-up: bump `@vercel/queue` in `@workflow/world-vercel` once a release with HTTP-layer trace-context injection is published, so queue sends carry trace headers as well.

🤖 Generated with Claude Code