Recover OpenFeature provider after initialization timeout #11474
leoromanovsky wants to merge 5 commits into
🟢 Java Benchmark SLOs: all performance SLOs passed
Load and DaCapo benchmarks can be triggered manually in the GitLab pipeline. Results will appear in the Benchmarking Platform UI after completion.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fcfa34c9cd
      }
      initializationState.set(InitializationState.READY);
    } catch (final OpenFeatureError e) {
      initializationState.set(InitializationState.ERROR);
Avoid reverting READY recovery after a timeout race
If the initialization timeout wins the compareAndSet(INITIALIZING, ERROR) path and a real config arrives before this catch runs, onConfigurationChange() can already transition the provider back to READY and emit PROVIDER_READY; this unconditional assignment then puts the internal state back to ERROR while configuration is present. In that state later config-loss updates are ignored by onConfigurationUnavailable(), so the OpenFeature client can remain ready while evaluations return PROVIDER_NOT_READY. Preserve an already-recovered READY state instead of always overwriting it here.
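One way to preserve an already-recovered READY state is to make the failure path a conditional update rather than a plain set. The following is a minimal sketch, not the PR's implementation: the class `TimeoutRaceSketch` and its method names are hypothetical, and only the `InitializationState` constants come from this PR.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the provider's state handling; names are illustrative only.
final class TimeoutRaceSketch {
  enum InitializationState { NOT_STARTED, INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  private final AtomicReference<InitializationState> state =
      new AtomicReference<>(InitializationState.INITIALIZING);

  // Timeout path: only a provider that is still INITIALIZING may become ERROR.
  boolean onTimeout() {
    return state.compareAndSet(InitializationState.INITIALIZING, InitializationState.ERROR);
  }

  // Config listener path: a real config recovers a provider that is in ERROR.
  boolean onRealConfig() {
    return state.compareAndSet(InitializationState.ERROR, InitializationState.READY);
  }

  // Catch-block path: record the failure unless a recovery already reached READY.
  InitializationState onInitializationFailure() {
    return state.updateAndGet(
        current ->
            current == InitializationState.READY ? current : InitializationState.ERROR);
  }

  InitializationState current() {
    return state.get();
  }
}
```

With this shape, the interleaving described above (timeout, then real config, then the catch block) leaves the state at READY, while a failure with no intervening config still ends in ERROR.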
    if (state != InitializationState.READY) {
      return;
Handle null config while initial readiness is pending
When a real config releases the initialization latch, the provider is in INITIAL_CONFIG_RECEIVED until initialize() finishes; if a null config update arrives in that window, this branch drops it because the state is not yet READY, and initialize() can subsequently set the provider to READY even though evaluator.hasConfiguration() is already false. That leaves the client advertising readiness while evaluations return PROVIDER_NOT_READY until another RC update arrives, so the unavailable transition needs to account for INITIAL_CONFIG_RECEIVED too.
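A possible shape for the fix is to let the unavailable transition act on both READY and INITIAL_CONFIG_RECEIVED, and to have the tail of initialize() promote to READY only if the state is still INITIAL_CONFIG_RECEIVED. The sketch below is an assumption about how that could look. `PendingReadinessSketch`, `finishInitialize()`, and the choice to emit PROVIDER_ERROR only when the provider was already READY are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of the pending-readiness window; not the PR's actual code.
final class PendingReadinessSketch {
  enum State { INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  private final AtomicReference<State> state = new AtomicReference<>(State.INITIALIZING);

  // A real config arrived while initialize() is still blocked.
  void onRealConfig() {
    state.compareAndSet(State.INITIALIZING, State.INITIAL_CONFIG_RECEIVED);
  }

  // A null config arrived. Returns true when the caller should emit PROVIDER_ERROR,
  // which this sketch assumes is only needed if readiness was already advertised.
  boolean onConfigurationUnavailable() {
    State previous =
        state.getAndUpdate(
            current ->
                current == State.READY || current == State.INITIAL_CONFIG_RECEIVED
                    ? State.ERROR
                    : current);
    return previous == State.READY;
  }

  // Tail of initialize(): promote only if no null config arrived in the meantime.
  boolean finishInitialize() {
    return state.compareAndSet(State.INITIAL_CONFIG_RECEIVED, State.READY);
  }

  State current() {
    return state.get();
  }
}
```

If `finishInitialize()` returns false, the caller would treat initialization as failed instead of reporting readiness without configuration.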
Motivation
A Java application can start while its Datadog Agent is already running but does not yet have an FFE_FLAGS payload cached for that tracer. That is the customer-reported shape we have been investigating: the tracer starts, asks for feature flag configuration, and the OpenFeature provider waits for usable configuration during initialization.

That blocking behavior is intentional. setProviderAndWait() should wait until the provider receives usable flag configuration, or fail with the configured timeout. The bug is what happens after the timeout path: the OpenFeature provider transitions to ERROR, but when FFE_FLAGS arrives later it must transition back to READY without requiring an application restart.

There are two edge cases this PR needs to handle correctly. First, a real config can arrive right at the timeout boundary, after the evaluator has received it but before provider initialization has finished deciding whether it timed out. Second, usable config can disappear after the provider is already READY; in that case the provider state should not remain READY while evaluations return PROVIDER_NOT_READY.

Changes
This keeps provider initialization blocking. DDEvaluator.initialize() registers for Feature Flagging Gateway updates and waits on the initialization latch until a non-null ServerConfiguration arrives or the configured timeout expires. A null config update means there is still no usable FFE product for the provider, so it does not satisfy initialization.

The provider tracks initialization state explicitly. If real configuration arrives while initialization is still blocked, the provider records that initial config was received and lets the OpenFeature SDK publish PROVIDER_READY after initialize() returns. If initialization has already timed out and the provider is in ERROR, the next real config update moves the provider to READY and emits PROVIDER_READY. After the provider is ready, later real config updates emit PROVIDER_CONFIGURATION_CHANGED.

Nullable config updates after readiness are handled as a real lifecycle transition. If the provider was READY and FFE config becomes unavailable, it emits PROVIDER_ERROR; evaluations then return the caller default with PROVIDER_NOT_READY. A later real config moves the provider back to READY.

Decisions
Blocking is not the problem we are fixing here. Blocking until config or timeout is the Java behavior we want because the provider should not advertise readiness before it has usable flag configuration. The missing piece was recovery after timeout and after losing usable configuration. This PR fixes those recovery paths without changing the public provider API or the caller-facing timeout option.
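The block-then-recover behavior can be sketched with a latch and an atomic state. This is an illustrative model under stated assumptions, not the code in this PR: `BlockingInitSketch` is a hypothetical class, `initialize` returns a boolean where the real provider throws ProviderNotReadyError, events are returned as strings instead of being emitted through the OpenFeature SDK, and the listener is assumed to be registered only once initialization has started.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model of blocking initialization with later recovery.
final class BlockingInitSketch {
  enum State { NOT_STARTED, INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  private final CountDownLatch firstRealConfig = new CountDownLatch(1);
  private final AtomicReference<State> state = new AtomicReference<>(State.NOT_STARTED);

  // Blocks until a real config arrives or the timeout expires; true means READY.
  boolean initialize(long timeout, TimeUnit unit) {
    state.set(State.INITIALIZING);
    boolean arrived;
    try {
      arrived = firstRealConfig.await(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      arrived = false;
    }
    // The timeout only wins if no real config slipped in at the boundary.
    if (!arrived && state.compareAndSet(State.INITIALIZING, State.ERROR)) {
      return false;
    }
    return state.compareAndSet(State.INITIAL_CONFIG_RECEIVED, State.READY);
  }

  // Config listener. Returns the event to emit, or null when nothing is emitted.
  String onConfiguration(Object config) {
    if (config == null) {
      return null; // null config never satisfies initialization (loss handling omitted)
    }
    if (state.compareAndSet(State.INITIALIZING, State.INITIAL_CONFIG_RECEIVED)) {
      firstRealConfig.countDown();
      return null; // the SDK publishes PROVIDER_READY once initialize() returns
    }
    if (state.compareAndSet(State.ERROR, State.READY)) {
      return "PROVIDER_READY"; // recovery after timeout, no restart needed
    }
    return state.get() == State.READY ? "PROVIDER_CONFIGURATION_CHANGED" : null;
  }

  State current() {
    return state.get();
  }
}
```

The public surface stays the same in this model: the caller still passes a timeout and still sees a failed initialization, and recovery happens entirely inside the config listener.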
This also pairs with the first-RC subscription work that landed separately: that work makes the tracer request FFE_FLAGS as early as possible from an already-running Agent, while this change makes the Java OpenFeature provider recover correctly if that first attempt still times out. The related Go tracer reference is SubscribeRC, which subscribes FFE_FLAGS during tracer startup so it is included in the first RC request: https://github.com/DataDog/dd-trace-go/blob/3ded6653e44aeb0d27bd5944e1e8033775473768/internal/openfeature/rc_subscription.go#L40-L44

Evidence
Dogfooding validation is captured in DataDog/ffe-dogfooding#71. That PR adds the local Java build path used to run this PR against the full ffe-dogfooding compose stack before the Java artifacts are published.
The local validation built dd-java-agent-1.63.0-SNAPSHOT.jar and dd-openfeature-1.63.0-SNAPSHOT.jar from this dd-trace-java branch, staged the local provider JAR into the Java dogfooding app image, and mounted the local dd-trace-java checkout so the Java container started with the local agent JAR.

Commands used for the dogfooding smoke test:
The full compose stack started successfully with app-go, app-python, app-nodejs, app-java, app-ruby, app-dotnet, datadog-agent, mock-intake, otlp-intake, and evaluator all running. Health checks passed on app ports 8081 through 8086, and the evaluator /stats endpoint responded. The Java container log confirmed the local Java tracer was used:

State Transitions
stateDiagram-v2
    [*] --> NOT_STARTED
    NOT_STARTED --> INITIALIZING: initialize()
    INITIALIZING --> INITIALIZING: null config / keep blocking
    INITIALIZING --> INITIAL_CONFIG_RECEIVED: real config arrives before initialize returns
    INITIALIZING --> ERROR: timeout without real config / throw ProviderNotReadyError
    INITIAL_CONFIG_RECEIVED --> READY: initialize returns / OpenFeature SDK emits PROVIDER_READY
    ERROR --> READY: later real config / emit PROVIDER_READY
    ERROR --> ERROR: null config / remain unavailable
    READY --> READY: later real config / emit PROVIDER_CONFIGURATION_CHANGED
    READY --> ERROR: null config / emit PROVIDER_ERROR
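
For reviewers who prefer code to diagrams, the transitions above can be written as a pure function from (state, input) to (next state, emitted event). This is only a restatement of the diagram under assumptions: `TransitionTableSketch`, the `Input` enum, and the `Step` holder are hypothetical names, and any pair not drawn in the diagram is treated here as a no-op.

```java
// Hypothetical transition table mirroring the state diagram; not the PR's code.
final class TransitionTableSketch {
  enum State { NOT_STARTED, INITIALIZING, INITIAL_CONFIG_RECEIVED, READY, ERROR }

  enum Input { INITIALIZE_CALLED, REAL_CONFIG, NULL_CONFIG, TIMEOUT, INITIALIZE_RETURNED }

  // Next state plus the provider event to emit (null when nothing is emitted).
  static final class Step {
    final State next;
    final String emitted;

    Step(State next, String emitted) {
      this.next = next;
      this.emitted = emitted;
    }
  }

  static Step apply(State current, Input input) {
    switch (current) {
      case NOT_STARTED:
        if (input == Input.INITIALIZE_CALLED) return new Step(State.INITIALIZING, null);
        break;
      case INITIALIZING:
        if (input == Input.REAL_CONFIG) return new Step(State.INITIAL_CONFIG_RECEIVED, null);
        // On timeout the caller throws ProviderNotReadyError; no event comes from here.
        if (input == Input.TIMEOUT) return new Step(State.ERROR, null);
        break; // null config keeps blocking
      case INITIAL_CONFIG_RECEIVED:
        // PROVIDER_READY is published by the OpenFeature SDK in this case.
        if (input == Input.INITIALIZE_RETURNED) return new Step(State.READY, "PROVIDER_READY");
        break;
      case ERROR:
        if (input == Input.REAL_CONFIG) return new Step(State.READY, "PROVIDER_READY");
        break; // null config remains unavailable
      case READY:
        if (input == Input.REAL_CONFIG) {
          return new Step(State.READY, "PROVIDER_CONFIGURATION_CHANGED");
        }
        if (input == Input.NULL_CONFIG) return new Step(State.ERROR, "PROVIDER_ERROR");
        break;
    }
    return new Step(current, null); // pairs not in the diagram: ignored in this sketch
  }
}
```

A table like this is easy to unit test exhaustively, which helps when checking that no input can leave the provider READY without configuration.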