feat(openfeature): emit server-side EVP flagevaluation#3984
Draft
leoromanovsky wants to merge 10 commits into
Draft
feat(openfeature): emit server-side EVP flagevaluation#3984leoromanovsky wants to merge 10 commits into
leoromanovsky wants to merge 10 commits into
Conversation
…ith PREP-01 libdatadog - Enable 'flagevaluation-evp' feature on datadog-ffe dep (FfeFlagEvaluationBatch type now compiled) - Fix components-rs/bytes.rs: update 4x VecMap::remove() -> remove_slow() for libdatadog compat post-commit 74284cac7 (VecMap API renamed); this unblocks compilation against the PREP-01 libdatadog ref
…patch - Two-tier aggregation in components-rs/ffe.rs: full→degraded→drop-counted with caps GLOBAL_CAP=131072/PER_FLAG_CAP=10000/DEGRADED_CAP=32768 - Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default: on) via evp_enabled() in Rust and isEvpEnabled() in EvaluationMetricRecorder.php - ddog_ffe_flush_flag_evaluation_batch() Rust C-export dispatches SidecarAction::FfeFlagEvaluationBatch via sidecar_blocking::enqueue_actions - ddtrace_ffe_flush_flag_evaluation_batch() C wrapper in tracer/ffe.c mirrors existing exposure/metric flush pattern with sidecar globals - RSHUTDOWN call added in tracer/ddtrace.c after existing flush calls - 11 Rust unit tests covering both tiers, overflow, drain, killswitch
…EVP aggregator race ddog_ffe_evaluate() records into the global EVP_AGGREGATOR; without EVP_TEST_LOCK the test ran concurrently with degraded_tier_overflow tests, causing dropped_degraded_overflow to be 2 instead of 1.
… + regen Cargo.lock Points dd-trace-php's libdatadog submodule at the local PREP-01 commit containing the flagevaluation EVP emitter (FfeFlagEvaluationBatch), so components-rs builds against it via the datadog-ffe path dep with the flagevaluation-evp feature. NOTE: 89a2ba7fc is local/unpushed — re-point to the merged upstream libdatadog SHA before any PR.
The Rust C-export ddog_ffe_flush_flag_evaluation_batch (components-rs/ffe.rs) was added without a matching prototype in the committed cbindgen header components-rs/datadog.h. tracer/ffe.c calls it, so PHP8's stricter toolchain fails with -Werror=implicit-function-declaration (ddtrace.so link Error 2). PHP7 only warned and linked, masking the bug. Prototype matches the Rust signature (SidecarTransport**/InstanceId*/QueueId*/CharSlice x3).
…ow drops The full-tier EVP flagevaluation drain previously emitted context: None and drained the degraded-overflow drop count silently. - Full tier now carries the pruned evaluation context (shared prune_context bounds: <=256 fields, string values >256 bytes skipped) plus context.dd.service, matching the degraded tier's cap enforcement. The pruned context is captured once per bucket at insertion and carried verbatim into the drained event. - The degraded-tier overflow drop counter is read-and-reset at drain and logged via tracing::warn when non-zero, so an undersized degradedCap is observable instead of a silent loss of legitimate counts.
…low surfacing - ddog_ffe_evaluate_populates_evp_aggregator_for_flush / _respects_killswitch: drive the real FFI entry point ddog_ffe_evaluate (the function the PHP/C layer calls) and assert it feeds the aggregator that the sidecar flush drains, closing the 'unit-green but emits nothing' gap that earlier tests left uncovered. - full_tier_event_carries_pruned_context / _prunes_oversized_string_values / _empty_context_emits_no_context_object: assert the full tier carries the pruned context and enforces the field/value bounds. - drain_resets_degraded_overflow_drop_counter: assert drain reads-and-resets the observable overflow drop counter.
…ncode-safe wire + reliable enqueue) Bump the libdatadog submodule to the bincode-safe flagevaluation fix (DataDog/libdatadog#2117): the worker->sidecar IPC is bincode, which the old serde_json::Value + skip_serializing_if wire types could not deserialize, so the sidecar silently dropped every batch. - Stringify the pruned full-tier context (JSON object string) at drain so the bincode wire stays plain; the sidecar flusher re-expands it into a JSON object for the POST. - Use sidecar_blocking::enqueue_actions_reliable for the one-shot RSHUTDOWN flush.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Server-side flag evaluations are currently invisible in the Datadog EVP
flagevaluationtrack for PHP. This PR makesdd-trace-phpemit schema-conformant, aggregated EVPflagevaluationpayloads while leaving the existing OTelfeature_flag.evaluationsbehavior unchanged (no regression). It mirrors the Go reference implementation (#4886) — the same two-tier aggregation, frozen caps, comparable canonical-context bucket key, and killswitch — adapted to PHP's native (Rust/C bridge → sidecar) architecture.In PHP the OpenFeature evaluator is owned by native code (
libdatadog/datadog-ffe, reached throughcomponents-rs/ffe.rsandtracer/ffe.c); the PHP layer owns the provider wrapper and the request-shutdown flush dispatch. So unlike the pure-SDK tracers, the aggregation lives in the native bridge and the payload is delivered through the sidecar's EVP path rather than an in-process HTTP writer.What we learned
Recording
flagevaluationis not like emitting the existing OTel metric. The metric is fire-and-forget;flagevaluationmust aggregate per evaluation — bucket by flag + variant + reason + allocation + (full tier) targeting key + pruned context — and only emit the merged counts at flush. The aggregation and the flush therefore have to live where the evaluation already crosses into native code: everyddog_ffe_evaluatecall does a cheap mutex insert into a process-global aggregator, and the request-shutdown handler drains/serializes/POSTs the batch through the sidecar.flowchart LR A["PHP OpenFeature provider<br/>(DataDogProvider.php)"] --> B["ddog_ffe_evaluate()<br/>components-rs/ffe.rs (native eval)"] B --> C["cheap mutex insert into<br/>two-tier EVP_AGGREGATOR"] C --> D["RSHUTDOWN flush:<br/>ddtrace_ffe_flush_flag_evaluation_batch()<br/>tracer/ffe.c"] D --> E["ddog_ffe_flush_flag_evaluation_batch()<br/>drain + serialize"] E --> F["sidecar ffe_flagevaluation_flusher"] F --> G["POST /evp_proxy/v2/api/v2/flagevaluations"]Transferable lesson for the fan-out: a non-self-describing IPC codec is part of the contract, not just the EVP schema. PHP's worker→sidecar hand-off serializes
SidecarActionwith bincode, and two idioms that are perfectly correct for the JSON POST are fatal over bincode: aserde_json::Valuefield (bincode cannotdeserialize_any) and#[serde(skip_serializing_if = …)](serialize omits the field, but bincode's positional deserialize still expects it → every subsequent field misaligns). Either makes the sidecar drop the batch withIPC serve: failed to decode request— and because the worker's enqueue still returnsok, it presents as a delivery failure, not a serialization one (it took instrumenting both ends to see the action was enqueued, never received). The resolution keeps the JSON-shaping (pruned-context object, omitted optionals) on the POST side in the sidecar flusher, while the bincode wire types stay plain (noserde_json::Value, noskip_serializing_if); a bincode round-trip test now locks it. This PR also adds tests driving the real FFI entry pointddog_ffe_evaluate, so a missing recording call can't pass silently. Lesson: for SDKs that cross a binary IPC before the HTTP POST, validate the wire codec round-trip, not only the EVP schema.Design Decisions
flagevaluationbatch is enqueued the same way (SidecarAction::FfeFlagEvaluationBatch) and the sidecar POSTs to/evp_proxy/v2/api/v2/flagevaluations. The payload is aggregated counts per bucket, matching the worker'sevaluation_count/first_evaluation/last_evaluationschema, serialized as camelCaseflagEvaluationswith nested{key}objects.tracer/ddtrace.cRSHUTDOWN), off the evaluation calls themselves and alongside the existing OTel metric and exposure flushes.flag,variant,allocationKey,reason,targetingKey) are exact strings, and the context attributes are encoded into a type-tagged, length-delimited canonical string that is itself a field of the Rust map key — so the language hashes and compares it natively. Distinct contexts always land in distinct buckets: no manual digest, no collisions, no misattribution.globalCap=131072total,perFlagCap=10000/flag; context pruned 256 fields / 256-char values) → degraded (degradedCap=32768; drops targeting key + context, = OTel cardinality) → drop (counted). No ultra-degraded tier (removed in the Go reference; never triggers at the team's ≥2,500-flag target oncedegradedCapis sized correctly, and it lossily collapsed allocation+reason).record_ffe_evaluation_metricpath (EvaluationMetricRecorder.php+ sidecar metric flush) and the RSHUTDOWNddtrace_ffe_flush_evaluation_metrics()call are untouched; the EVP path is purely additive.DD_FLAGGING_EVALUATION_COUNTS_ENABLED(default on) gates only the EVP path; with it off the aggregator is never touched and nothing is enqueued.High-load behavior — when do we drop?
Counts are preserved; only dimensional fidelity degrades. Every evaluation contributes exactly one count to some tier (or the bounded drop counter);
Σ(counts across tiers + drops) == evaluations processed.globalCap=131,072total andperFlagCap=10,000/flagdegradedCap=32,768perFlagCapdistinct full buckets, orglobalCapis fulldegradedCapis fullThe degraded key is exactly the OTel
feature_flag.evaluationscardinality (flag × variant × allocation × reason), sized to hold the full legitimate cardinality at the 2,500-flag target. Over-cap counts increment thedropped_degraded_overflowcounter rather than cascading into a lossy third tier — in practice this only fires under genuine abuse. This PR makes that drop observable: the counter is read-and-reset at drain and logged viatracing::warnwhen non-zero (mirroring the Go reference), so an undersizeddegradedCapsurfaces as a warning instead of a silent loss of legitimate counts.Performance
PHP recording adds a bounded, lock-guarded insert to each native
ddog_ffe_evaluatecall (scalar copies + building the comparable canonical-context key); the heavier drain + serialization + sidecar hand-off is paid once per request at shutdown, not per evaluation. Context pruning (256/256) caps the per-bucket payload, and the global/per-flag/degraded caps bound total memory. The existing OTel metric path is unchanged, so the only added per-evaluation cost is the EVP insert.A phpbench hot-path benchmark is wired into the canonical suite at
tests/Benchmarks/API/FlagEvaluationBench.php— auto-discovered bytests/phpbench.json(Benchmarks/API,*Bench.php), run bymake benchmarksand the GitLabbenchmarks-tracermicrobenchmark job (gated ontests/Benchmarks/**changes). It drives the real\DDTrace\ffe_evaluate()→ddog_ffe_evaluaterecord+aggregation path against a static in-memory UFC config (split / targeting-match / distinct-context subjects), plus a counting-disabled (benchEvaluateWithoutCounting) baseline subject that isolates the EVP record+aggregation cost from base evaluation. Per-op ns/µs figures are produced by the benchmarking-platform CI run; no figures are fabricated here.Validation
Proven through
ffe-dogfoodingmock-intake againstapp-php7(port 8087):flagevaluationpayload, thencount > 0with the expectedflag.key/variant.key/serviceafter evaluations + RSHUTDOWN flush.feature_flag.evaluationsand exposure paths remain wired (nativerecord_ffe_evaluation_metric+ RSHUTDOWN flush untouched).components-rs/ffe.rsdrive the real FFI entry pointddog_ffe_evaluateand assert it populates the aggregator the sidecar flush drains, plus tests for two-tier overflow, context pruning (256/256), comparable-key bucket identity, runtime-default-from-absent-variant, and the observable overflow-drop reset. The libdatadog change adds a bincode round-trip test forFfeFlagEvaluationBatch(with bothSomeandNone/absent fields) — the mechanical guard that locks the wire-codec fix.context.evaluationas a proper JSON object ({country, plan, version}),variant.key/allocation.key/targeting_key/dd.servicepopulated, and no null placeholders in the degraded tier; the sidecar logssent flag evaluation batch, status=202.Resolution of the 8 PoC (#4874) reviewer concerns
prune_context(≤256 fields / ≤256-char string values, oversized skipped not truncated) applied to the full tier before it is bufferedflageval-workerschemaflagEvaluations/{key}schema via optional-field omission (no null placeholders)first/last_evaluationvia min/maxruntime_default_usedderived from the missing variant (empty variant string), not the reason alonetests/Benchmarks/API/FlagEvaluationBench.phpwired into the canonical suite (incl. a counting-disabled baseline subject isolating EVP cost); per-eval cost is a bounded mutex insert, drain/serialize deferred to RSHUTDOWN; numbers from the benchmarking-platform CI runAfterstep131072/ per-flag10000/ degraded32768) + explicit drop-and-count beyond degraded, now surfaced viatracing::warnat drain🚧 Draft