adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1#37025
Draft
jasonhernandez wants to merge 1 commit into
Draft
adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1#37025jasonhernandez wants to merge 1 commit into
jasonhernandez wants to merge 1 commit into
Conversation
Move launchdarkly-server-sdk from the MaterializeInc/rust-server-sdk fork back to upstream crates.io 3.1.1, restoring the launchdarkly-sdk- transport + MetricsTransport setup and dropping the [patch.crates-io] override. The fork existed for launchdarkly/rust-server-sdk#116: a StreamingData Source/eventsource StreamClosed bug where a non-Eof stream error left the data source stuck with no reconnect, silently breaking LD sync. A prior upgrade to upstream 3.0.1 had to be reverted (incident-984) because that bug was still unfixed upstream. The fixes have since landed — rust-server-sdk#168 and rust-eventsource-client#134/#135 — and 3.1.1 resolves eventsource-client to 0.17.5, which carries them. Use the rustls + aws-lc-rs features (hyper-rustls-native-roots, crypto-aws-lc-rs), now the upstream defaults, instead of the prior attempt's native-tls/crypto-openssl, avoiding the OpenSSL path. The transport build_https() call is identical either way. deny.toml gains skips for the duplicate versions the transport stack pulls (older tower/rustls-native-certs; newer rand/rand_core/getrandom/ cpufeatures) and re-adds the launchdarkly-sdk-transport wrapper. Adds a test, test_metric_frozen_on_midstream_error, modeling the exact incident-984 failure mode (200 OK then a mid-stream timeout): it asserts the last_sse_time_seconds gauge freezes so the staleness alert can detect a stuck data source. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
This moves
launchdarkly-server-sdkfrom ourMaterializeInc/rust-server-sdkfork back to upstream crates.io (3.1.1) and drops the[patch.crates-io]override.The fork existed for launchdarkly/rust-server-sdk#116: both the streaming and polling data sources spawn a background fetch task that
breaks (and never reconnects) on most non-Eofstream errors, silently wedging LD sync with no way to detect it. A previous upgrade to upstream3.0.1(#35903 + the streaming-data-source reapply) had to be reverted in #36320 (incident-984) because the bug was still unfixed upstream and broke LD sync in the prod sandbox.In incident-984 the failure surfaced as:
i.e. a non-
Eoferror hitting the body stream after a successful 200, which the SDK did not recover from. SSE then went stale and thelaunchdarkly-stale-sse-tier-2alert fired.The fixes have since landed upstream:
rust-server-sdk#168 — keepStreamingDataSourcealive on non-Eofstream errors (v3.0.3)rust-eventsource-client#134 / #135 — schedule reconnect after a stream/parse error (v0.17.4)At
3.1.1, Cargo resolveseventsource-clientto0.17.5andlaunchdarkly-sdk-transportto0.1.3, all carrying these fixes.What changed
Cargo.toml:launchdarkly-server-sdk→3.1.1; re-addlaunchdarkly-sdk-transport = "0.1"; remove the fork[patch.crates-io]entry.hyper-rustls-native-roots+crypto-aws-lc-rsfeatures (now the upstream defaults) instead of the prior attempt'snative-tls/crypto-openssl. This avoids re-introducing the OpenSSL path and aligns with the rustls/aws-lc-rs standardization.transport.build_https()is the same call either way;3.0.1simply didn't offer rustls/aws-lc-rs yet.frontend.rs/dyncfg-launchdarkly: restore thelaunchdarkly-sdk-transportHyperTransportAPI and theMetricsTransportwrapper that powersmz_parameter_frontend_last_sse_time_seconds(the metric this migration exists to enable). The post-revert "skip bad entries" fix inlib.rsis preserved.deny.toml: re-add thelaunchdarkly-sdk-transportwrapper, and skip the duplicate versions the transport stack introduces — oldertower/rustls-native-certs(proxy/rustls), and newerrand/rand_core/getrandom/cpufeatures(RNG). No unrelated crate bumps;hyper-timeout 0.4.1+tokio-io-timeoutduplicates are actually eliminated.Tests
Adds
test_metric_frozen_on_midstream_error(src/adapter/src/config/frontend.rs), which models the exact incident-984 failure mode — a transport that returns 200 OK, delivers one event, then errors mid-stream with a timeout — and assertslast_sse_time_secondsadvances only for the event that arrived and then freezes, so the staleness alert can detect a stuck data source. The existingMetricsTransporttests covered the success path and an initial-request failure, but not a mid-stream error after a successful 200.Validation note
The SDK-level reconnect on non-
Eofstream errors can't be reproduced in a unit test (a prior real-LD-keys repro attempt in #36325 could not trigger it); the added test covers the metric/detection path we own, and we rely on the upstream fixes for the reconnect itself. Before un-drafting, this should be validated against the Nightly LaunchDarkly test (the same signal the revert PR used), since the previous attempt passed CI and failed only at runtime.