
feat(openfeature): emit server-side EVP flagevaluation #3984

Draft
leoromanovsky wants to merge 10 commits into master from leo.romanovsky/ffl-2446-evp-flagevaluation-php

Conversation

@leoromanovsky

Contributor

Motivation

Server-side flag evaluations are currently invisible in the Datadog EVP flagevaluation track for PHP. This PR makes dd-trace-php emit schema-conformant, aggregated EVP flagevaluation payloads while leaving the existing OTel feature_flag.evaluations behavior unchanged (no regression). It mirrors the Go reference implementation (#4886) — the same two-tier aggregation, frozen caps, comparable canonical-context bucket key, and killswitch — adapted to PHP's native (Rust/C bridge → sidecar) architecture.

In PHP the OpenFeature evaluator is owned by native code (libdatadog/datadog-ffe, reached through components-rs/ffe.rs and tracer/ffe.c); the PHP layer owns the provider wrapper and the request-shutdown flush dispatch. So unlike the pure-SDK tracers, the aggregation lives in the native bridge and the payload is delivered through the sidecar's EVP path rather than an in-process HTTP writer.

What we learned

Recording flagevaluation is not like emitting the existing OTel metric. The metric is fire-and-forget; flagevaluation must aggregate per evaluation — bucket by flag + variant + reason + allocation + (full tier) targeting key + pruned context — and only emit the merged counts at flush. The aggregation and the flush therefore have to live where the evaluation already crosses into native code: every ddog_ffe_evaluate call does a cheap mutex insert into a process-global aggregator, and the request-shutdown handler drains/serializes/POSTs the batch through the sidecar.

flowchart LR
    A["PHP OpenFeature provider<br/>(DataDogProvider.php)"] --> B["ddog_ffe_evaluate()<br/>components-rs/ffe.rs (native eval)"]
    B --> C["cheap mutex insert into<br/>two-tier EVP_AGGREGATOR"]
    C --> D["RSHUTDOWN flush:<br/>ddtrace_ffe_flush_flag_evaluation_batch()<br/>tracer/ffe.c"]
    D --> E["ddog_ffe_flush_flag_evaluation_batch()<br/>drain + serialize"]
    E --> F["sidecar ffe_flagevaluation_flusher"]
    F --> G["POST /evp_proxy/v2/api/v2/flagevaluations"]

Transferable lesson for the fan-out: a non-self-describing IPC codec is part of the contract, not just the EVP schema. PHP's worker→sidecar hand-off serializes SidecarAction with bincode, and two idioms that are perfectly correct for the JSON POST are fatal over bincode: a serde_json::Value field (bincode cannot deserialize_any) and #[serde(skip_serializing_if = …)] (serialize omits the field, but bincode's positional deserialize still expects it → every subsequent field misaligns). Either makes the sidecar drop the batch with IPC serve: failed to decode request — and because the worker's enqueue still returns ok, it presents as a delivery failure, not a serialization one (it took instrumenting both ends to see the action was enqueued, never received). The resolution keeps the JSON-shaping (pruned-context object, omitted optionals) on the POST side in the sidecar flusher, while the bincode wire types stay plain (no serde_json::Value, no skip_serializing_if); a bincode round-trip test now locks it. This PR also adds tests driving the real FFI entry point ddog_ffe_evaluate, so a missing recording call can't pass silently. Lesson: for SDKs that cross a binary IPC before the HTTP POST, validate the wire codec round-trip, not only the EVP schema.
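A minimal sketch of the misalignment described above, assuming a toy positional codec rather than bincode itself. `Event`, `encode`, and `decode` are hypothetical names for illustration; `skip_none` imitates what `skip_serializing_if` does to an absent optional field.

```rust
/// Hypothetical two-field wire type: an optional variant followed by a count.
#[derive(Debug, PartialEq)]
struct Event {
    variant: Option<String>,
    count: u8,
}

/// Toy positional encoder (NOT bincode). With `skip_none` set, a `None`
/// writes no bytes at all, imitating `skip_serializing_if`.
fn encode(e: &Event, skip_none: bool) -> Vec<u8> {
    let mut out = Vec::new();
    match &e.variant {
        Some(s) => {
            out.push(1); // presence tag
            out.push(s.len() as u8); // length prefix (sketch: short strings only)
            out.extend_from_slice(s.as_bytes());
        }
        None if skip_none => {} // field omitted entirely
        None => out.push(0), // explicit "absent" tag keeps positions stable
    }
    out.push(e.count);
    out
}

/// Positional decoder: always expects the presence tag first, then the count.
fn decode(buf: &[u8]) -> Result<Event, String> {
    let mut pos = 0;
    let mut next = |what: &str| -> Result<u8, String> {
        let b = *buf
            .get(pos)
            .ok_or_else(|| format!("unexpected end of input reading {}", what))?;
        pos += 1;
        Ok(b)
    };
    let variant = match next("variant tag")? {
        0 => None,
        1 => {
            let len = next("variant length")? as usize;
            let mut bytes = Vec::with_capacity(len);
            for _ in 0..len {
                bytes.push(next("variant byte")?);
            }
            Some(String::from_utf8(bytes).map_err(|e| e.to_string())?)
        }
        other => return Err(format!("invalid variant tag {}", other)),
    };
    let count = next("count")?;
    Ok(Event { variant, count })
}
```

In this sketch, encoding `Event { variant: None, count: 7 }` with `skip_none = true` yields the single byte `7`, which the decoder reads as a variant tag and rejects. With `skip_none = false` the same value round-trips. A round-trip test of this shape is the kind of guard the PR describes for the real wire types.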

Design Decisions

  • Transport = the sidecar EVP path. PHP has no in-process EVP HTTP writer; the existing exposures/metrics flushers already route through the sidecar, so the new flagevaluation batch is enqueued the same way (SidecarAction::FfeFlagEvaluationBatch) and the sidecar POSTs to /evp_proxy/v2/api/v2/flagevaluations. The payload is aggregated counts per bucket, matching the worker's evaluation_count / first_evaluation / last_evaluation schema, serialized as camelCase flagEvaluations with nested {key} objects.
  • Recording = cheap mutex insert on the eval path; drain/serialize/POST at RSHUTDOWN (PHP's answer to the synchronous-hook problem). PHP's request model is synchronous and short-lived, so there is no long-running background worker to offload onto; instead the per-evaluation cost is kept to a bounded mutex insert (scalar copies + a canonical-context key build), and the expensive drain + serialization + hand-off happens once per request at shutdown (tracer/ddtrace.c RSHUTDOWN), off the evaluation calls themselves and alongside the existing OTel metric and exposure flushes.
  • Bucket identity = comparable canonical-context key, NOT a hash. The five enumerable dims (flag, variant, allocationKey, reason, targetingKey) are exact strings, and the context attributes are encoded into a type-tagged, length-delimited canonical string that is itself a field of the Rust map key — so the language hashes and compares it natively. Distinct contexts always land in distinct buckets: no manual digest, no collisions, no misattribution.
  • Two-tier degradation with explicit frozen caps: full (globalCap=131072 total, perFlagCap=10000/flag; context pruned 256 fields / 256-char values) → degraded (degradedCap=32768; drops targeting key + context, = OTel cardinality) → drop (counted). No ultra-degraded tier (removed in the Go reference; never triggers at the team's ≥2,500-flag target once degradedCap is sized correctly, and it lossily collapsed allocation+reason).
  • Existing OTel path is preserved byte-for-byte. The native record_ffe_evaluation_metric path (EvaluationMetricRecorder.php + sidecar metric flush) and the RSHUTDOWN ddtrace_ffe_flush_evaluation_metrics() call are untouched; the EVP path is purely additive.
  • Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default on) gates only the EVP path; with it off the aggregator is never touched and nothing is enqueued.
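
To illustrate the bucket-identity decision, here is a hedged sketch of a comparable canonical-context key. The `Attr` enum, the exact `<len>:<key><tag><len>:<value>` layout, and the `BucketKey` field names are assumptions for illustration; the encoding in components-rs/ffe.rs may differ.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical attribute value type; the real evaluator's type may differ.
#[derive(Clone, Debug)]
enum Attr {
    Str(String),
    Int(i64),
    Bool(bool),
}

/// Builds a comparable canonical string: attributes in sorted key order, each
/// encoded as `<key len>:<key><type tag><value len>:<value>`.
fn canonical_context_key(ctx: &BTreeMap<String, Attr>) -> String {
    let mut out = String::new();
    // BTreeMap iterates in key order, so insertion order cannot change the key.
    for (k, v) in ctx {
        let (tag, val) = match v {
            Attr::Str(s) => ('s', s.clone()),
            Attr::Int(i) => ('i', i.to_string()),
            Attr::Bool(b) => ('b', b.to_string()),
        };
        out.push_str(&format!("{}:{}{}{}:{}", k.len(), k, tag, val.len(), val));
    }
    out
}

/// Hypothetical full-tier bucket key: exact strings plus the canonical context.
/// Deriving Hash + Eq lets the map hash and compare it natively.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct BucketKey {
    flag: String,
    variant: String,
    allocation_key: String,
    reason: String,
    targeting_key: String,
    context: String,
}

/// Adds one evaluation to the bucket identified by `key`.
fn record(counts: &mut HashMap<BucketKey, u64>, key: BucketKey) {
    *counts.entry(key).or_insert(0) += 1;
}
```

Because the type tag is part of the string, a string attribute `"1"` and an integer attribute `1` produce different keys, and the length prefixes keep adjacent fields from running together. Equality is decided on the full string, so two distinct contexts cannot share a bucket the way two colliding digests could.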

High-load behavior — when do we drop?

Counts are preserved; only dimensional fidelity degrades. Every evaluation contributes exactly one count to some tier (or the bounded drop counter); Σ(counts across tiers + drops) == evaluations processed.

| Tier | Key dimensions | Bound | Entered when |
| --- | --- | --- | --- |
| Full | flag, variant, allocationKey, reason, targetingKey, canonical context (+ pruned context payload) | globalCap=131,072 total and perFlagCap=10,000/flag | default; context pruned 256 fields / 256-char values in place |
| Degraded | flag, variant, allocationKey, reason | degradedCap=32,768 | a flag reaches perFlagCap distinct full buckets, or globalCap is full |
| Drop (counted) | n/a | (a single observable counter) | degradedCap is full |

The degraded key is exactly the OTel feature_flag.evaluations cardinality (flag × variant × allocation × reason), sized to hold the full legitimate cardinality at the 2,500-flag target. Over-cap counts increment the dropped_degraded_overflow counter rather than cascading into a lossy third tier — in practice this only fires under genuine abuse. This PR makes that drop observable: the counter is read-and-reset at drain and logged via tracing::warn when non-zero (mirroring the Go reference), so an undersized degradedCap surfaces as a warning instead of a silent loss of legitimate counts.
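
The tier transitions and the count-conservation invariant can be sketched as follows. This is an illustration under stated assumptions, not the code in components-rs/ffe.rs: the type and field names are hypothetical, the caps are fields so the sketch can be run with tiny values, and it assumes an existing full bucket keeps merging after a cap is reached.

```rust
use std::collections::HashMap;

/// Caps are fields so the sketch can be exercised with tiny values; the PR
/// freezes them at 131072 (global), 10000 (per flag) and 32768 (degraded).
struct Caps {
    global: usize,
    per_flag: usize,
    degraded: usize,
}

/// Degraded-tier key: the OTel-equivalent dimensions only.
#[derive(Clone, PartialEq, Eq, Hash)]
struct DegradedKey {
    flag: String,
    variant: String,
    allocation_key: String,
    reason: String,
}

/// Full-tier key: degraded dimensions plus targeting key and canonical context.
#[derive(Clone, PartialEq, Eq, Hash)]
struct FullKey {
    base: DegradedKey,
    targeting_key: String,
    context: String,
}

struct Aggregator {
    caps: Caps,
    full: HashMap<FullKey, u64>,
    full_buckets_per_flag: HashMap<String, usize>,
    degraded: HashMap<DegradedKey, u64>,
    dropped_degraded_overflow: u64,
}

impl Aggregator {
    fn new(caps: Caps) -> Self {
        Aggregator {
            caps,
            full: HashMap::new(),
            full_buckets_per_flag: HashMap::new(),
            degraded: HashMap::new(),
            dropped_degraded_overflow: 0,
        }
    }

    /// Adds exactly one count to the full tier, the degraded tier, or the drop counter.
    fn record(&mut self, key: FullKey) {
        if let Some(count) = self.full.get_mut(&key) {
            *count += 1; // existing full bucket: merge
            return;
        }
        let flag_buckets = self.full_buckets_per_flag.get(&key.base.flag).copied().unwrap_or(0);
        if self.full.len() < self.caps.global && flag_buckets < self.caps.per_flag {
            *self.full_buckets_per_flag.entry(key.base.flag.clone()).or_insert(0) += 1;
            self.full.insert(key, 1);
            return;
        }
        // Full tier is closed for this key: drop targeting key and context.
        if let Some(count) = self.degraded.get_mut(&key.base) {
            *count += 1;
            return;
        }
        if self.degraded.len() < self.caps.degraded {
            self.degraded.insert(key.base, 1);
            return;
        }
        self.dropped_degraded_overflow += 1;
    }

    /// Sum of every count held anywhere, including drops.
    fn total(&self) -> u64 {
        self.full.values().sum::<u64>()
            + self.degraded.values().sum::<u64>()
            + self.dropped_degraded_overflow
    }

    /// Read-and-reset the drop counter, as a drain would before logging it.
    fn take_dropped(&mut self) -> u64 {
        std::mem::replace(&mut self.dropped_degraded_overflow, 0)
    }
}
```

Every path through `record` increments exactly one counter, so `total()` equals the number of `record` calls regardless of how many evaluations spilled into the degraded tier or the drop counter.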

Performance

PHP recording adds a bounded, lock-guarded insert to each native ddog_ffe_evaluate call (scalar copies + building the comparable canonical-context key); the heavier drain + serialization + sidecar hand-off is paid once per request at shutdown, not per evaluation. Context pruning (256/256) caps the per-bucket payload, and the global/per-flag/degraded caps bound total memory. The existing OTel metric path is unchanged, so the only added per-evaluation cost is the EVP insert.

A phpbench hot-path benchmark is wired into the canonical suite at tests/Benchmarks/API/FlagEvaluationBench.php. It is auto-discovered by tests/phpbench.json (Benchmarks/API, *Bench.php) and run by make benchmarks and the GitLab benchmarks-tracer microbenchmark job (gated on tests/Benchmarks/** changes). It drives the real \DDTrace\ffe_evaluate() → ddog_ffe_evaluate record+aggregation path against a static in-memory UFC config (split / targeting-match / distinct-context subjects), plus a counting-disabled baseline subject (benchEvaluateWithoutCounting) that isolates the EVP record+aggregation cost from base evaluation. Per-op ns/µs figures are produced by the benchmarking-platform CI run; no figures are fabricated here.

Validation

Proven through ffe-dogfooding mock-intake against app-php7 (port 8087):

SDK=php7 APP_URL=http://localhost:8087 SKIP_OTEL_NON_REGRESSION=1 CHECK_EXPOSURES_NON_REGRESSION=1
  • Negative control → green: baseline shows NO EVP flagevaluation payload, then count > 0 with the expected flag.key / variant.key / service after evaluations + RSHUTDOWN flush.
  • OTel non-regression / exposures: the existing feature_flag.evaluations and exposure paths remain wired (native record_ffe_evaluation_metric + RSHUTDOWN flush untouched).
  • Native logic proven: unit/integration tests in components-rs/ffe.rs drive the real FFI entry point ddog_ffe_evaluate and assert it populates the aggregator the sidecar flush drains, plus tests for two-tier overflow, context pruning (256/256), comparable-key bucket identity, runtime-default-from-absent-variant, and the observable overflow-drop reset. The libdatadog change adds a bincode round-trip test for FfeFlagEvaluationBatch (with both Some and None/absent fields) — the mechanical guard that locks the wire-codec fix.
  • Confirmed end-to-end: mock-intake shows context.evaluation as a proper JSON object ({country, plan, version}), variant.key / allocation.key / targeting_key / dd.service populated, and no null placeholders in the degraded tier; the sidecar logs sent flag evaluation batch, status=202.

Cross-repo: the native fix lives in libdatadog (datadog-ffe payload types + datadog-sidecar flusher/IPC), shipped as a companion libdatadog PR; this dd-trace-php PR bumps the libdatadog submodule to it and adds the PHP-layer components-rs/ffe.rs wiring + regenerated FFI bindings. The root cause was a worker→sidecar bincode wire-codec incompatibility (serde_json::Value + skip_serializing_if), not the EVP schema or a stale build.

Resolution of the 8 PoC (#4874) reviewer concerns

| # | Concern | Addressed |
| --- | --- | --- |
| 1 | Context bounds before buffering | Shared prune_context (≤256 fields / ≤256-char string values, oversized skipped not truncated) applied to the full tier before it is buffered |
| 2 | Tiers validated vs flageval-worker schema | Both tiers serialize to the camelCase flagEvaluations / {key} schema via optional-field omission (no null placeholders) |
| 3 | Bucket identity not FNV-1a-alone | Exact enumerable struct key + comparable canonical-context string key (native hashing + comparison; no digest, no collisions) |
| 4 | first/last_evaluation via min/max | min/max merged per bucket under the aggregator lock; no wall-clock assumptions |
| 5 | Runtime-default from absence of variant | runtime_default_used derived from the missing variant (empty variant string), not the reason alone |
| 6 | Benchmark vs base eval cost | phpbench tests/Benchmarks/API/FlagEvaluationBench.php wired into the canonical suite (incl. a counting-disabled baseline subject isolating EVP cost); per-eval cost is a bounded mutex insert, drain/serialize deferred to RSHUTDOWN; numbers from the benchmarking-platform CI run |
| 7 | Hook counts error/default paths | Recording happens at the native evaluation result (covering success/error/default), not on a success-only After step |
| 8 | Explicit bounds on degraded/overflow | Per-tier frozen caps (full 131072 / per-flag 10000 / degraded 32768) + explicit drop-and-count beyond degraded, now surfaced via tracing::warn at drain |
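
The bounds from concern 1 can be sketched as below. This is a simplified illustration, not the shared prune_context itself: it assumes string-only attributes, counts characters (one commit message in this PR phrases the limit in bytes), and assumes the first 256 surviving fields in sorted key order are the ones kept.

```rust
use std::collections::BTreeMap;

const MAX_FIELDS: usize = 256;
const MAX_VALUE_CHARS: usize = 256;

/// Keeps at most MAX_FIELDS attributes and skips (does not truncate) any
/// string value longer than MAX_VALUE_CHARS.
fn prune_context(ctx: &BTreeMap<String, String>) -> BTreeMap<String, String> {
    ctx.iter()
        // Oversized values are dropped whole, so no partial value is ever emitted.
        .filter(|(_, v)| v.chars().count() <= MAX_VALUE_CHARS)
        // Assumption: the field cap applies after oversized values are skipped.
        .take(MAX_FIELDS)
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect()
}
```

Skipping rather than truncating means a pruned context never contains a value that differs from what the application supplied, which keeps bucket keys built from the pruned context comparable across requests.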

🚧 Draft

…ith PREP-01 libdatadog

- Enable 'flagevaluation-evp' feature on datadog-ffe dep (FfeFlagEvaluationBatch type now compiled)
- Fix components-rs/bytes.rs: update 4x VecMap::remove() -> remove_slow() for libdatadog compat post-commit 74284cac7 (VecMap API renamed); this unblocks compilation against the PREP-01 libdatadog ref
…patch

- Two-tier aggregation in components-rs/ffe.rs: full→degraded→drop-counted
  with caps GLOBAL_CAP=131072/PER_FLAG_CAP=10000/DEGRADED_CAP=32768
- Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default: on) via
  evp_enabled() in Rust and isEvpEnabled() in EvaluationMetricRecorder.php
- ddog_ffe_flush_flag_evaluation_batch() Rust C-export dispatches
  SidecarAction::FfeFlagEvaluationBatch via sidecar_blocking::enqueue_actions
- ddtrace_ffe_flush_flag_evaluation_batch() C wrapper in tracer/ffe.c
  mirrors existing exposure/metric flush pattern with sidecar globals
- RSHUTDOWN call added in tracer/ddtrace.c after existing flush calls
- 11 Rust unit tests covering both tiers, overflow, drain, killswitch
…EVP aggregator race

ddog_ffe_evaluate() records into the global EVP_AGGREGATOR; without
EVP_TEST_LOCK the test ran concurrently with degraded_tier_overflow
tests, causing dropped_degraded_overflow to be 2 instead of 1.
… + regen Cargo.lock

Points dd-trace-php's libdatadog submodule at the local PREP-01 commit
containing the flagevaluation EVP emitter (FfeFlagEvaluationBatch), so
components-rs builds against it via the datadog-ffe path dep with the
flagevaluation-evp feature. NOTE: 89a2ba7fc is local/unpushed — re-point
to the merged upstream libdatadog SHA before any PR.
The Rust C-export ddog_ffe_flush_flag_evaluation_batch (components-rs/ffe.rs)
was added without a matching prototype in the committed cbindgen header
components-rs/datadog.h. tracer/ffe.c calls it, so PHP8's stricter toolchain
fails with -Werror=implicit-function-declaration (ddtrace.so link Error 2).
PHP7 only warned and linked, masking the bug. Prototype matches the Rust
signature (SidecarTransport**/InstanceId*/QueueId*/CharSlice x3).
…ow drops

The full-tier EVP flagevaluation drain previously emitted context: None and
drained the degraded-overflow drop count silently.

- Full tier now carries the pruned evaluation context (shared prune_context
  bounds: <=256 fields, string values >256 bytes skipped) plus context.dd.service,
  matching the degraded tier's cap enforcement. The pruned context is captured
  once per bucket at insertion and carried verbatim into the drained event.
- The degraded-tier overflow drop counter is read-and-reset at drain and logged
  via tracing::warn when non-zero, so an undersized degradedCap is observable
  instead of a silent loss of legitimate counts.
…low surfacing

- ddog_ffe_evaluate_populates_evp_aggregator_for_flush / _respects_killswitch:
  drive the real FFI entry point ddog_ffe_evaluate (the function the PHP/C layer
  calls) and assert it feeds the aggregator that the sidecar flush drains, closing
  the 'unit-green but emits nothing' gap that earlier tests left uncovered.
- full_tier_event_carries_pruned_context / _prunes_oversized_string_values /
  _empty_context_emits_no_context_object: assert the full tier carries the pruned
  context and enforces the field/value bounds.
- drain_resets_degraded_overflow_drop_counter: assert drain reads-and-resets the
  observable overflow drop counter.
…ncode-safe wire + reliable enqueue)

Bump the libdatadog submodule to the bincode-safe flagevaluation fix (DataDog/libdatadog#2117): the worker->sidecar IPC is bincode, which the old serde_json::Value + skip_serializing_if wire types could not deserialize, so the sidecar silently dropped every batch.

- Stringify the pruned full-tier context (JSON object string) at drain so the bincode wire stays plain; the sidecar flusher re-expands it into a JSON object for the POST.

- Use sidecar_blocking::enqueue_actions_reliable for the one-shot RSHUTDOWN flush.
@datadog-official

datadog-official Bot commented Jun 14, 2026


⚠️ Warnings

🚦 61 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | ASAN test_c: [8.0, arm64]   View in Datadog   GitLab

🧪 3 Tests failed

tmp/build_extension/tests/ext/close_spans_until.phpt (Test DDTrace\close_spans_until) from PHP.tmp.build_extension.tests.ext   View in Datadog (Fix with Cursor)
--
     [ddtrace] [span] [%d] Switching to different SpanStack: %d
     int(1)
     int(0)
024+ [ddtrace] [span] [4743] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: 7436922682114359497, span_id: 7436922682114359497, parent_id: 0, start: 1781449944278516898, duration: 11696705, error: 0, meta: VecMap { data: [(_dd.tags.process, entrypoint.basedir:ext,entrypoint.name:close_spans_until,entrypoint.type:script,entrypoint.workdir:dd-trace-php,runtime.sapi:cli), (runtime-id, f009df55-565f-48f7-9537-1f0a29e0a6ee), (_dd.code_origin.type, entry), (_dd.code_origin.frames.0.file, tmp/build_extension/tests/ext/close_spans_until.php), (_dd.code_origin.frames.0.line, 1), (_dd.p.dm, -0), (_dd.p.tid, 6a2ec4d800000000)], deduped: false }, metrics: VecMap { data: [(process_id, 4743.0), (_dd.agent_psr, 1.0), (_sampling_priority_v1, 1.0), (php.compilation.total_time_ms, 5.382), (php.memory.peak_usage_bytes, 0.0), (php.memory.peak_real_usage_bytes, 0.0)], deduped: false }, meta_struct: VecMap { data: [], deduped: false }, span_links: [], span_events: [] }
024- [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: %d, span_id: %d, parent_id: 0, start: %d, duration: %d, error: 0, meta: %s, metrics: %s, meta_struct: {}, span_links: [], span_events: [] }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: traced, resource: traced, type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
--
tmp/build_extension/tests/ext/sandbox/hook_function/hook_does_not_leak_error.phpt (Check that sandboxed hooks do not invoke error handlers or set the error code) from PHP.tmp.build_extension.tests.ext.sandbox.hook_function   View in Datadog (Fix with Cursor)
--
     foo
007+ [ddtrace] [datadog_ipc::client] [6905] drain_acks: connection error (broken pipe), marking closed
008+ [ddtrace] [datadog_sidecar::service::blocking] [6905] The sidecar transport is closed. Reconnecting... This generally indicates a problem with the sidecar, most likely a crash. Check the logs / core dump locations and possibly report a bug.
009+ [ddtrace] [datadog_ipc::client] [6905] drain_acks: connection error (Connection reset by peer (os error 104)), marking closed
010+ [ddtrace] [datadog_sidecar::service::blocking] [6905] The sidecar transport is closed. Reconnecting... This generally indicates a problem with the sidecar, most likely a crash. Check the logs / core dump locations and possibly report a bug.
View all 3 test failures

DataDog/apm-reliability/dd-trace-php | ASAN test_c: [8.1, arm64]   View in Datadog   GitLab

🧪 1 Test failed

tmp/build_extension/tests/ext/close_spans_until.phpt (Test DDTrace\close_spans_until) from php.tmp.build_extension.tests.ext   View in Datadog (Fix with Cursor)
--
     [ddtrace] [span] [%d] Switching to different SpanStack: %d
     int(1)
     int(0)
024+ [ddtrace] [span] [6566] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: 13494954271331726157, span_id: 13494954271331726157, parent_id: 0, start: 1781451561562073867, duration: 12440298, error: 0, meta: VecMap { data: [(_dd.tags.process, entrypoint.basedir:ext,entrypoint.name:close_spans_until,entrypoint.type:script,entrypoint.workdir:dd-trace-php,runtime.sapi:cli), (runtime-id, 8eed3faf-8bb1-4b28-8c1d-71c82e48a0ba), (_dd.code_origin.type, entry), (_dd.code_origin.frames.0.file, tmp/build_extension/tests/ext/close_spans_until.php), (_dd.code_origin.frames.0.line, 1), (_dd.p.dm, -0), (_dd.p.tid, 6a2ecb2900000000)], deduped: false }, metrics: VecMap { data: [(process_id, 6566.0), (_dd.agent_psr, 1.0), (_sampling_priority_v1, 1.0), (php.compilation.total_time_ms, 5.486), (php.memory.peak_usage_bytes, 0.0), (php.memory.peak_real_usage_bytes, 0.0)], deduped: false }, meta_struct: VecMap { data: [], deduped: false }, span_links: [], span_events: [] }
024- [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: close_spans_until.php, resource: close_spans_until.php, type: cli, trace_id: %d, span_id: %d, parent_id: 0, start: %d, duration: %d, error: 0, meta: %s, metrics: %s, meta_struct: {}, span_links: [], span_events: [] }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: traced, resource: traced, type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
     [ddtrace] [span] [%d] Encoding span: Span { service: close_spans_until.php, name: , resource: , type: cli, trace_id: %d, span_id: %d, parent_id: %d, start: %d, duration: %d, error: %d, meta: %s, metrics: %s, meta_struct: %s, span_links: %s, span_events: %s }
--
...

DataDog/apm-reliability/dd-trace-php | ASAN test_c with multiple observers: [8.1]   View in Datadog   GitLab

View all 61 failed jobs.

ℹ️ Info

No other issues found (see more)

❄️ No new flaky tests detected

🔄 Datadog auto-retried 3 jobs - 0 passed on retry View in Datadog

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 54.08% (-0.04%)

🔗 Commit SHA: f5e4087
