feat(openfeature): emit server-side EVP flagevaluation by leoromanovsky · Pull Request #3984 · DataDog/dd-trace-php

leoromanovsky · 2026-06-14T14:02:30Z

Motivation

Server-side flag evaluations are currently invisible in the Datadog EVP flagevaluation track for PHP. This PR makes dd-trace-php emit schema-conformant, aggregated EVP flagevaluation payloads while leaving the existing OTel feature_flag.evaluations behavior unchanged (no regression). It mirrors the Go reference implementation (#4886) — the same two-tier aggregation, frozen caps, comparable canonical-context bucket key, and killswitch — adapted to PHP's native (Rust/C bridge → sidecar) architecture.

In PHP the OpenFeature evaluator is owned by native code (libdatadog/datadog-ffe, reached through components-rs/ffe.rs and tracer/ffe.c); the PHP layer owns the provider wrapper and the request-shutdown flush dispatch. So unlike the pure-SDK tracers, the aggregation lives in the native bridge and the payload is delivered through the sidecar's EVP path rather than an in-process HTTP writer.

What we learned

Recording flagevaluation is not like emitting the existing OTel metric. The metric is fire-and-forget; flagevaluation must aggregate per evaluation — bucket by flag + variant + reason + allocation + (full tier) targeting key + pruned context — and only emit the merged counts at flush. The aggregation and the flush therefore have to live where the evaluation already crosses into native code: every ddog_ffe_evaluate call does a cheap mutex insert into a process-global aggregator, and the request-shutdown handler drains/serializes/POSTs the batch through the sidecar.

flowchart LR
    A["PHP OpenFeature provider<br/>(DataDogProvider.php)"] --> B["ddog_ffe_evaluate()<br/>components-rs/ffe.rs (native eval)"]
    B --> C["cheap mutex insert into<br/>two-tier EVP_AGGREGATOR"]
    C --> D["RSHUTDOWN flush:<br/>ddtrace_ffe_flush_flag_evaluation_batch()<br/>tracer/ffe.c"]
    D --> E["ddog_ffe_flush_flag_evaluation_batch()<br/>drain + serialize"]
    E --> F["sidecar ffe_flagevaluation_flusher"]
    F --> G["POST /evp_proxy/v2/api/v2/flagevaluations"]
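The record-then-drain split in the flow above can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the code in components-rs/ffe.rs: the `Aggregator`, `BucketKey`, and `Bucket` names, the millisecond timestamps, and the per-instance (rather than process-global) mutex are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical full-tier bucket key: exact strings, hashed and compared natively.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct BucketKey {
    flag: String,
    variant: String,
    allocation_key: String,
    reason: String,
    targeting_key: String,
    canonical_context: String,
}

/// Merged counters for one bucket.
#[derive(Clone, Debug, PartialEq, Eq)]
struct Bucket {
    evaluation_count: u64,
    first_evaluation_ms: u64,
    last_evaluation_ms: u64,
}

/// Illustrative stand-in for the aggregator (the real one is process-global).
struct Aggregator {
    buckets: Mutex<HashMap<BucketKey, Bucket>>,
}

impl Aggregator {
    fn new() -> Self {
        Aggregator { buckets: Mutex::new(HashMap::new()) }
    }

    /// Evaluation path: one short critical section per evaluation.
    fn record(&self, key: BucketKey, now_ms: u64) {
        let mut buckets = self.buckets.lock().unwrap();
        let bucket = buckets.entry(key).or_insert(Bucket {
            evaluation_count: 0,
            first_evaluation_ms: now_ms,
            last_evaluation_ms: now_ms,
        });
        bucket.evaluation_count += 1;
        // min/max merge, so out-of-order timestamps still give correct bounds.
        bucket.first_evaluation_ms = bucket.first_evaluation_ms.min(now_ms);
        bucket.last_evaluation_ms = bucket.last_evaluation_ms.max(now_ms);
    }

    /// Flush path: take every bucket out and leave the map empty.
    fn drain(&self) -> Vec<(BucketKey, Bucket)> {
        let mut buckets = self.buckets.lock().unwrap();
        buckets.drain().collect()
    }
}
```

Calling `record` three times with the same key and then `drain` once yields a single bucket with `evaluation_count` 3; a second `drain` returns nothing, which is the behavior a once-per-request flush relies on.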

Transferable lesson for the fan-out: a non-self-describing IPC codec is part of the contract, not just the EVP schema. PHP's worker→sidecar hand-off serializes SidecarAction with bincode, and two idioms that are perfectly correct for the JSON POST are fatal over bincode: a serde_json::Value field (bincode cannot deserialize_any) and #[serde(skip_serializing_if = …)] (serialize omits the field, but bincode's positional deserialize still expects it → every subsequent field misaligns). Either makes the sidecar drop the batch with IPC serve: failed to decode request — and because the worker's enqueue still returns ok, it presents as a delivery failure, not a serialization one (it took instrumenting both ends to see the action was enqueued, never received). The resolution keeps the JSON-shaping (pruned-context object, omitted optionals) on the POST side in the sidecar flusher, while the bincode wire types stay plain (no serde_json::Value, no skip_serializing_if); a bincode round-trip test now locks it. This PR also adds tests driving the real FFI entry point ddog_ffe_evaluate, so a missing recording call can't pass silently. Lesson: for SDKs that cross a binary IPC before the HTTP POST, validate the wire codec round-trip, not only the EVP schema.
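To make the misalignment concrete, here is a toy positional codec written with the standard library only. It is not bincode and does not use serde; the `Event` type, the one-byte length prefix, and the `skip_none` flag (which imitates a skip-when-None serializer attribute) are assumptions made for illustration.

```rust
/// Toy wire type with one optional field.
#[derive(Debug, PartialEq)]
struct Event {
    flag: String,
    variant: Option<String>,
    count: u32,
}

fn put_str(out: &mut Vec<u8>, s: &str) {
    out.push(s.len() as u8); // one-byte length prefix; toy only (strings < 256 bytes)
    out.extend_from_slice(s.as_bytes());
}

/// Positional encoding: fields in declaration order, no names, no type tags.
fn encode(e: &Event, skip_none: bool) -> Vec<u8> {
    let mut out = Vec::new();
    put_str(&mut out, &e.flag);
    match &e.variant {
        Some(v) => {
            out.push(1);
            put_str(&mut out, v);
        }
        None if skip_none => {} // field omitted entirely
        None => out.push(0),    // explicit "absent" marker keeps positions aligned
    }
    out.extend_from_slice(&e.count.to_le_bytes());
    out
}

/// Split `n` bytes off the front of the buffer, or fail if it is too short.
fn take<'a>(buf: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let slice: &'a [u8] = *buf;
    if slice.len() < n {
        return None;
    }
    let (head, tail) = slice.split_at(n);
    *buf = tail;
    Some(head)
}

fn get_str(buf: &mut &[u8]) -> Option<String> {
    let len = take(buf, 1)?[0] as usize;
    String::from_utf8(take(buf, len)?.to_vec()).ok()
}

/// Positional decoding always expects every field, in order.
fn decode(mut buf: &[u8]) -> Option<Event> {
    let flag = get_str(&mut buf)?;
    let variant = match take(&mut buf, 1)?[0] {
        0 => None,
        1 => Some(get_str(&mut buf)?),
        _ => return None,
    };
    let b = take(&mut buf, 4)?;
    let count = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    if buf.is_empty() {
        Some(Event { flag, variant, count })
    } else {
        None
    }
}
```

With `skip_none = false` both the `Some` and `None` cases round-trip. With `skip_none = true` and `variant: None`, the decoder reads the first byte of `count` as the option tag and the decode fails, while the encoder itself reported no error. A round-trip test over both cases is the kind of guard the paragraph above describes.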

Design Decisions

Transport = the sidecar EVP path. PHP has no in-process EVP HTTP writer; the existing exposures/metrics flushers already route through the sidecar, so the new flagevaluation batch is enqueued the same way (SidecarAction::FfeFlagEvaluationBatch) and the sidecar POSTs to /evp_proxy/v2/api/v2/flagevaluations. The payload is aggregated counts per bucket, matching the worker's evaluation_count / first_evaluation / last_evaluation schema, serialized as camelCase flagEvaluations with nested {key} objects.
Recording = cheap mutex insert on the eval path; drain/serialize/POST at RSHUTDOWN (PHP's answer to the synchronous-hook problem). PHP's request model is synchronous and short-lived, so there is no long-running background worker to offload onto; instead the per-evaluation cost is kept to a bounded mutex insert (scalar copies + a canonical-context key build), and the expensive drain + serialization + hand-off happens once per request at shutdown (tracer/ddtrace.c RSHUTDOWN), off the evaluation calls themselves and alongside the existing OTel metric and exposure flushes.
Bucket identity = comparable canonical-context key, NOT a hash. The five enumerable dims (flag, variant, allocationKey, reason, targetingKey) are exact strings, and the context attributes are encoded into a type-tagged, length-delimited canonical string that is itself a field of the Rust map key — so the language hashes and compares it natively. Distinct contexts always land in distinct buckets: no manual digest, no collisions, no misattribution.
Two-tier degradation with explicit frozen caps: full (globalCap=131072 total, perFlagCap=10000/flag; context pruned 256 fields / 256-char values) → degraded (degradedCap=32768; drops targeting key + context, = OTel cardinality) → drop (counted). No ultra-degraded tier (removed in the Go reference; never triggers at the team's ≥2,500-flag target once degradedCap is sized correctly, and it lossily collapsed allocation+reason).
Existing OTel path is preserved byte-for-byte. The native record_ffe_evaluation_metric path (EvaluationMetricRecorder.php + sidecar metric flush) and the RSHUTDOWN ddtrace_ffe_flush_evaluation_metrics() call are untouched; the EVP path is purely additive.
Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default on) gates only the EVP path; with it off the aggregator is never touched and nothing is enqueued.
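The "type-tagged, length-delimited canonical string" used for bucket identity can be sketched as follows. The exact encoding in this PR is not shown in the description, so the layout here (`<name_len>:<name><tag><value_len>:<value>`, attributes sorted by name) and the `AttrValue` enum are assumptions chosen only to show why such a key is comparable and collision-free.

```rust
use std::collections::BTreeMap;

/// Hypothetical attribute value types for an evaluation context.
#[derive(Clone, Debug)]
enum AttrValue {
    Str(String),
    Int(i64),
    Bool(bool),
}

/// Build a canonical string: attributes sorted by name, each written as
/// `<name_len>:<name><tag><value_len>:<value>` with byte lengths.
fn canonical_context_key(attrs: &BTreeMap<String, AttrValue>) -> String {
    let mut out = String::new();
    // BTreeMap iterates in key order, so insertion order does not matter.
    for (name, value) in attrs {
        let (tag, text) = match value {
            AttrValue::Str(s) => ('s', s.clone()),
            AttrValue::Int(i) => ('i', i.to_string()),
            AttrValue::Bool(b) => ('b', b.to_string()),
        };
        out.push_str(&format!("{}:{}{}{}:{}", name.len(), name, tag, text.len(), text));
    }
    out
}
```

The length prefixes mean a value containing `:` or another attribute's encoding cannot be confused with a separate attribute, and the type tag keeps the string `"1"` distinct from the integer `1`. Because the result is an ordinary `String`, it can sit directly in a map key and be hashed and compared by the language.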

High-load behavior — when do we drop?

Counts are preserved; only dimensional fidelity degrades. Every evaluation contributes exactly one count to some tier (or the bounded drop counter); Σ(counts across tiers + drops) == evaluations processed.

| Tier | Key dimensions | Bound | Entered when |
| --- | --- | --- | --- |
| Full | flag, variant, allocationKey, reason, targetingKey, canonical context (+ pruned context payload) | `globalCap=131,072` total and `perFlagCap=10,000`/flag | default; context pruned 256 fields / 256-char values in place |
| Degraded | flag, variant, allocationKey, reason | `degradedCap=32,768` | a flag reaches `perFlagCap` distinct full buckets, or `globalCap` is full |
| Drop (counted) | - | n/a (a single observable counter) | `degradedCap` is full |

The degraded key is exactly the OTel feature_flag.evaluations cardinality (flag × variant × allocation × reason), sized to hold the full legitimate cardinality at the 2,500-flag target. Over-cap counts increment the dropped_degraded_overflow counter rather than cascading into a lossy third tier — in practice this only fires under genuine abuse. This PR makes that drop observable: the counter is read-and-reset at drain and logged via tracing::warn when non-zero (mirroring the Go reference), so an undersized degradedCap surfaces as a warning instead of a silent loss of legitimate counts.
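The tier cascade and the count-preservation invariant can be sketched like this. It is a simplified model, not the PR's implementation: the `TwoTier`, `FullKey`, `DegradedKey`, and `Caps` names are hypothetical, the caps are passed in so small values can be used in a test, and the assumption that an existing full bucket keeps merging after a cap is reached is mine.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct DegradedKey {
    flag: String,
    variant: String,
    allocation_key: String,
    reason: String,
}

#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct FullKey {
    base: DegradedKey,
    targeting_key: String,
    canonical_context: String,
}

struct Caps {
    global: usize,
    per_flag: usize,
    degraded: usize,
}

#[derive(Default)]
struct TwoTier {
    full: HashMap<FullKey, u64>,
    full_per_flag: HashMap<String, usize>,
    degraded: HashMap<DegradedKey, u64>,
    dropped_degraded_overflow: u64,
}

impl TwoTier {
    fn record(&mut self, key: FullKey, caps: &Caps) {
        // An existing full bucket always merges (assumption of this sketch).
        if let Some(count) = self.full.get_mut(&key) {
            *count += 1;
            return;
        }
        // A new full bucket needs room under both the global and per-flag caps.
        let flag_buckets = self.full_per_flag.get(&key.base.flag).copied().unwrap_or(0);
        if self.full.len() < caps.global && flag_buckets < caps.per_flag {
            *self.full_per_flag.entry(key.base.flag.clone()).or_insert(0) += 1;
            self.full.insert(key, 1);
            return;
        }
        // Degraded tier: targeting key and context are dropped from the key.
        if let Some(count) = self.degraded.get_mut(&key.base) {
            *count += 1;
            return;
        }
        if self.degraded.len() < caps.degraded {
            self.degraded.insert(key.base, 1);
            return;
        }
        // Beyond the degraded cap: count the drop instead of losing it silently.
        self.dropped_degraded_overflow += 1;
    }

    /// Sum over both tiers plus drops; equals the number of `record` calls.
    fn total(&self) -> u64 {
        self.full.values().sum::<u64>()
            + self.degraded.values().sum::<u64>()
            + self.dropped_degraded_overflow
    }
}
```

Every call to `record` ends in exactly one increment, so `total()` always equals the number of evaluations recorded; only the dimensions attached to a count vary by tier.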

Performance

PHP recording adds a bounded, lock-guarded insert to each native ddog_ffe_evaluate call (scalar copies + building the comparable canonical-context key); the heavier drain + serialization + sidecar hand-off is paid once per request at shutdown, not per evaluation. Context pruning (256/256) caps the per-bucket payload, and the global/per-flag/degraded caps bound total memory. The existing OTel metric path is unchanged, so the only added per-evaluation cost is the EVP insert.
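As a rough illustration of the 256/256 pruning bound, the sketch below skips oversized string values rather than truncating them and then keeps at most 256 fields. The function signature, the string-only attribute model, and the choice to apply the field cap after skipping oversized values are assumptions; the shared `prune_context` in this PR may differ on all three.

```rust
const MAX_FIELDS: usize = 256;
const MAX_VALUE_LEN: usize = 256; // measured in bytes in this sketch

/// Keep at most MAX_FIELDS attributes; values longer than MAX_VALUE_LEN
/// are skipped entirely, not truncated.
fn prune_context(attrs: &[(String, String)]) -> Vec<(String, String)> {
    attrs
        .iter()
        .filter(|(_, value)| value.len() <= MAX_VALUE_LEN)
        .take(MAX_FIELDS)
        .cloned()
        .collect()
}
```

Skipping instead of truncating avoids emitting a value that looks real but is not what the application supplied, and the two constants together bound the payload carried by each bucket.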

A phpbench hot-path benchmark is wired into the canonical suite at tests/Benchmarks/API/FlagEvaluationBench.php — auto-discovered by tests/phpbench.json (Benchmarks/API, *Bench.php), run by make benchmarks and the GitLab benchmarks-tracer microbenchmark job (gated on tests/Benchmarks/** changes). It drives the real \DDTrace\ffe_evaluate() → ddog_ffe_evaluate record+aggregation path against a static in-memory UFC config (split / targeting-match / distinct-context subjects), plus a counting-disabled (benchEvaluateWithoutCounting) baseline subject that isolates the EVP record+aggregation cost from base evaluation. Per-op ns/µs figures are produced by the benchmarking-platform CI run; no figures are fabricated here.

Validation

Proven through ffe-dogfooding mock-intake against app-php7 (port 8087):

SDK=php7 APP_URL=http://localhost:8087 SKIP_OTEL_NON_REGRESSION=1 CHECK_EXPOSURES_NON_REGRESSION=1

Negative control → green: baseline shows NO EVP flagevaluation payload, then count > 0 with the expected flag.key / variant.key / service after evaluations + RSHUTDOWN flush.
OTel non-regression / exposures: the existing feature_flag.evaluations and exposure paths remain wired (native record_ffe_evaluation_metric + RSHUTDOWN flush untouched).
Native logic proven: unit/integration tests in components-rs/ffe.rs drive the real FFI entry point ddog_ffe_evaluate and assert it populates the aggregator the sidecar flush drains, plus tests for two-tier overflow, context pruning (256/256), comparable-key bucket identity, runtime-default-from-absent-variant, and the observable overflow-drop reset. The libdatadog change adds a bincode round-trip test for FfeFlagEvaluationBatch (with both Some and None/absent fields) — the mechanical guard that locks the wire-codec fix.
Confirmed end-to-end: mock-intake shows context.evaluation as a proper JSON object ({country, plan, version}), variant.key / allocation.key / targeting_key / dd.service populated, and no null placeholders in the degraded tier; the sidecar logs sent flag evaluation batch, status=202.

Cross-repo: the native fix lives in libdatadog (datadog-ffe payload types + datadog-sidecar flusher/IPC), shipped as a companion libdatadog PR; this dd-trace-php PR bumps the libdatadog submodule to it and adds the PHP-layer components-rs/ffe.rs wiring + regenerated FFI bindings. The root cause was a worker→sidecar bincode wire-codec incompatibility (serde_json::Value + skip_serializing_if), not the EVP schema or a stale build.

Resolution of the 8 PoC (#4874) reviewer concerns

| # | Concern | Addressed |
| --- | --- | --- |
| 1 | Context bounds before buffering | Shared `prune_context` (≤256 fields / ≤256-char string values, oversized skipped not truncated) applied to the full tier before it is buffered |
| 2 | Tiers validated vs `flageval-worker` schema | Both tiers serialize to the camelCase `flagEvaluations` / `{key}` schema via optional-field omission (no null placeholders) |
| 3 | Bucket identity not FNV-1a-alone | Exact enumerable struct key + comparable canonical-context string key (native hashing + comparison; no digest, no collisions) |
| 4 | `first/last_evaluation` via min/max | min/max merged per bucket under the aggregator lock; no wall-clock assumptions |
| 5 | Runtime-default from absence of variant | `runtime_default_used` derived from the missing variant (empty variant string), not the reason alone |
| 6 | Benchmark vs base eval cost | phpbench `tests/Benchmarks/API/FlagEvaluationBench.php` wired into the canonical suite (incl. a counting-disabled baseline subject isolating EVP cost); per-eval cost is a bounded mutex insert, drain/serialize deferred to RSHUTDOWN; numbers from the benchmarking-platform CI run |
| 7 | Hook counts error/default paths | Recording happens at the native evaluation result (covering success/error/default), not on a success-only `After` step |
| 8 | Explicit bounds on degraded/overflow | Per-tier frozen caps (full `131072` / per-flag `10000` / degraded `32768`) + explicit drop-and-count beyond degraded, now surfaced via `tracing::warn` at drain |

🚧 Draft

…ith PREP-01 libdatadog

- Enable 'flagevaluation-evp' feature on datadog-ffe dep (FfeFlagEvaluationBatch type now compiled)
- Fix components-rs/bytes.rs: update 4x VecMap::remove() -> remove_slow() for libdatadog compat post-commit 74284cac7 (VecMap API renamed); this unblocks compilation against the PREP-01 libdatadog ref

…patch

- Two-tier aggregation in components-rs/ffe.rs: full→degraded→drop-counted with caps GLOBAL_CAP=131072/PER_FLAG_CAP=10000/DEGRADED_CAP=32768
- Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default: on) via evp_enabled() in Rust and isEvpEnabled() in EvaluationMetricRecorder.php
- ddog_ffe_flush_flag_evaluation_batch() Rust C-export dispatches SidecarAction::FfeFlagEvaluationBatch via sidecar_blocking::enqueue_actions
- ddtrace_ffe_flush_flag_evaluation_batch() C wrapper in tracer/ffe.c mirrors existing exposure/metric flush pattern with sidecar globals
- RSHUTDOWN call added in tracer/ddtrace.c after existing flush calls
- 11 Rust unit tests covering both tiers, overflow, drain, killswitch

…EVP aggregator race

ddog_ffe_evaluate() records into the global EVP_AGGREGATOR; without EVP_TEST_LOCK the test ran concurrently with degraded_tier_overflow tests, causing dropped_degraded_overflow to be 2 instead of 1.

… + regen Cargo.lock

Points dd-trace-php's libdatadog submodule at the local PREP-01 commit containing the flagevaluation EVP emitter (FfeFlagEvaluationBatch), so components-rs builds against it via the datadog-ffe path dep with the flagevaluation-evp feature.

NOTE: 89a2ba7fc is local/unpushed; re-point to the merged upstream libdatadog SHA before any PR.

The Rust C-export ddog_ffe_flush_flag_evaluation_batch (components-rs/ffe.rs) was added without a matching prototype in the committed cbindgen header components-rs/datadog.h. tracer/ffe.c calls it, so PHP8's stricter toolchain fails with -Werror=implicit-function-declaration (ddtrace.so link Error 2). PHP7 only warned and linked, masking the bug. Prototype matches the Rust signature (SidecarTransport**/InstanceId*/QueueId*/CharSlice x3).

…ow drops

The full-tier EVP flagevaluation drain previously emitted context: None and drained the degraded-overflow drop count silently.

- Full tier now carries the pruned evaluation context (shared prune_context bounds: <=256 fields, string values >256 bytes skipped) plus context.dd.service, matching the degraded tier's cap enforcement. The pruned context is captured once per bucket at insertion and carried verbatim into the drained event.
- The degraded-tier overflow drop counter is read-and-reset at drain and logged via tracing::warn when non-zero, so an undersized degradedCap is observable instead of a silent loss of legitimate counts.

…low surfacing

- ddog_ffe_evaluate_populates_evp_aggregator_for_flush / _respects_killswitch: drive the real FFI entry point ddog_ffe_evaluate (the function the PHP/C layer calls) and assert it feeds the aggregator that the sidecar flush drains, closing the 'unit-green but emits nothing' gap that earlier tests left uncovered.
- full_tier_event_carries_pruned_context / _prunes_oversized_string_values / _empty_context_emits_no_context_object: assert the full tier carries the pruned context and enforces the field/value bounds.
- drain_resets_degraded_overflow_drop_counter: assert drain reads-and-resets the observable overflow drop counter.

…ncode-safe wire + reliable enqueue)

Bump the libdatadog submodule to the bincode-safe flagevaluation fix (DataDog/libdatadog#2117): the worker->sidecar IPC is bincode, which the old serde_json::Value + skip_serializing_if wire types could not deserialize, so the sidecar silently dropped every batch.

- Stringify the pruned full-tier context (JSON object string) at drain so the bincode wire stays plain; the sidecar flusher re-expands it into a JSON object for the POST.
- Use sidecar_blocking::enqueue_actions_reliable for the one-shot RSHUTDOWN flush.

datadog-official · 2026-06-14T14:10:43Z

Tests

⚠️ Warnings

🚦 61 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | ASAN test_c: [8.0, arm64]

🧪 3 Tests failed

tmp/build_extension/tests/ext/close_spans_until.phpt (Test DDTrace\close_spans_until) from PHP.tmp.build_extension.tests.ext

--
     [ddtrace] [span] [%d] Switching to different SpanStack: %d
     int(1)
     int(0)
024+ [ddtrace] [span] [4743] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: 7436922682114359497, span_id: 7436922682114359497, parent_id: 0, start: 1781449944278516898, duration: 11696705, error: 0, meta: VecMap { data: [(_dd.tags.process, entrypoint.basedir:ext,entrypoint.name:close_spans_until,entrypoint.type:script,entrypoint.workdir:dd-trace-php,runtime.sapi:cli), (runtime-id, f009df55-565f-48f7-9537-1f0a29e0a6ee), (_dd.code_origin.type, entry), (_dd.code_origin.frames.0.file, tmp/build_extension/tests/ext/close_spans_until.php), (_dd.code_origin.frames.0.line, 1), (_dd.p.dm, -0), (_dd.p.tid, 6a2ec4d800000000)], deduped: false }, metrics: VecMap { data: [(process_id, 4743.0), (_dd.agent_psr, 1.0), (_sampling_priority_v1, 1.0), (php.compilation.total_time_ms, 5.382), (php.memory.peak_usage_bytes, 0.0), (php.memory.peak_real_usage_bytes, 0.0)], deduped: false }, meta_struct: VecMap { data: [], deduped: false }, span_links: [], span_events: [] }
024- [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: %d, span_id: %d, parent_id: 0, start: %d, duration: %d, error: 0, meta: %s, metrics: %s, meta_struct: {}, span_links: [], span_events: [] }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: traced, resource: traced, type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
--

tmp/build_extension/tests/ext/sandbox/hook_function/hook_does_not_leak_error.phpt (Check that sandboxed hooks do not invoke error handlers or set the error code) from PHP.tmp.build_extension.tests.ext.sandbox.hook_function

--
     foo
007+ [ddtrace] [datadog_ipc::client] [6905] drain_acks: connection error (broken pipe), marking closed
008+ [ddtrace] [datadog_sidecar::service::blocking] [6905] The sidecar transport is closed. Reconnecting... This generally indicates a problem with the sidecar, most likely a crash. Check the logs / core dump locations and possibly report a bug.
009+ [ddtrace] [datadog_ipc::client] [6905] drain_acks: connection error (Connection reset by peer (os error 104)), marking closed
010+ [ddtrace] [datadog_sidecar::service::blocking] [6905] The sidecar transport is closed. Reconnecting... This generally indicates a problem with the sidecar, most likely a crash. Check the logs / core dump locations and possibly report a bug.

DataDog/apm-reliability/dd-trace-php | ASAN test_c: [8.1, arm64]

🧪 1 Test failed

tmp/build_extension/tests/ext/close_spans_until.phpt (Test DDTrace\close_spans_until) from php.tmp.build_extension.tests.ext

--
     [ddtrace] [span] [%d] Switching to different SpanStack: %d
     int(1)
     int(0)
024+ [ddtrace] [span] [6566] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: 13494954271331726157, span_id: 13494954271331726157, parent_id: 0, start: 1781451561562073867, duration: 12440298, error: 0, meta: VecMap { data: [(_dd.tags.process, entrypoint.basedir:ext,entrypoint.name:close_spans_until,entrypoint.type:script,entrypoint.workdir:dd-trace-php,runtime.sapi:cli), (runtime-id, 8eed3faf-8bb1-4b28-8c1d-71c82e48a0ba), (_dd.code_origin.type, entry), (_dd.code_origin.frames.0.file, tmp/build_extension/tests/ext/close_spans_until.php), (_dd.code_origin.frames.0.line, 1), (_dd.p.dm, -0), (_dd.p.tid, 6a2ecb2900000000)], deduped: false }, metrics: VecMap { data: [(process_id, 6566.0), (_dd.agent_psr, 1.0), (_sampling_priority_v1, 1.0), (php.compilation.total_time_ms, 5.486), (php.memory.peak_usage_bytes, 0.0), (php.memory.peak_real_usage_bytes, 0.0)], deduped: false }, meta_struct: VecMap { data: [], deduped: false }, span_links: [], span_events: [] }
024- [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: %d, span_id: %d, parent_id: 0, start: %d, duration: %d, error: 0, meta: %s, metrics: %s, meta_struct: {}, span_links: [], span_events: [] }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: traced, resource: traced, type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
--
...

DataDog/apm-reliability/dd-trace-php | ASAN test_c with multiple observers: [8.1]

ℹ️ Info

No other issues found

❄️ No new flaky tests detected

🔄 Datadog auto-retried 3 jobs - 0 passed on retry

🎯 Code Coverage (details)
• Patch Coverage: 100.00%
• Overall Coverage: 54.08% (-0.04%)

🔗 Commit SHA: f5e4087

leoromanovsky added 10 commits June 12, 2026 15:47

chore(openfeature): remove internal planning annotations

862b74d

chore(openfeature): wire flagevaluation benchmark into benchmark suite

63fdebe

leoromanovsky wants to merge 10 commits into master from leo.romanovsky/ffl-2446-evp-flagevaluation-php
