Skip to content

adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1#37025

Draft
jasonhernandez wants to merge 1 commit into
MaterializeInc:mainfrom
jasonhernandez:jason/ld-sdk-upstream-migration
Draft

adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1#37025
jasonhernandez wants to merge 1 commit into
MaterializeInc:mainfrom
jasonhernandez:jason/ld-sdk-upstream-migration

Conversation

@jasonhernandez

Copy link
Copy Markdown
Contributor

Motivation

This moves launchdarkly-server-sdk from our MaterializeInc/rust-server-sdk fork back to upstream crates.io (3.1.1) and drops the [patch.crates-io] override.

The fork existed for launchdarkly/rust-server-sdk#116: both the streaming and polling data sources spawn a background fetch task that breaks (and never reconnects) on most non-Eof stream errors, silently wedging LD sync with no way to detect it. A previous upgrade to upstream 3.0.1 (#35903 + the streaming-data-source reapply) had to be reverted in #36320 (incident-984) because the bug was still unfixed upstream and broke LD sync in the prod sandbox.

In incident-984 the failure surfaced as:

ERROR launchdarkly_server_sdk::data_source: unhandled error on event stream:
Transport(TransportError { inner: hyper::Error(Body, Kind(TimedOut)) })

i.e. a non-Eof error hitting the body stream after a successful 200, which the SDK did not recover from. SSE then went stale and the launchdarkly-stale-sse-tier-2 alert fired.

The fixes have since landed upstream:

  • rust-server-sdk #168 — keep StreamingDataSource alive on non-Eof stream errors (v3.0.3)
  • rust-eventsource-client #134 / #135 — schedule reconnect after a stream/parse error (v0.17.4)

At 3.1.1, Cargo resolves eventsource-client to 0.17.5 and launchdarkly-sdk-transport to 0.1.3, all carrying these fixes.

What changed

  • Cargo.toml: launchdarkly-server-sdk3.1.1; re-add launchdarkly-sdk-transport = "0.1"; remove the fork [patch.crates-io] entry.
  • TLS/crypto: use the hyper-rustls-native-roots + crypto-aws-lc-rs features (now the upstream defaults) instead of the prior attempt's native-tls/crypto-openssl. This avoids re-introducing the OpenSSL path and aligns with the rustls/aws-lc-rs standardization. transport.build_https() is the same call either way; 3.0.1 simply didn't offer rustls/aws-lc-rs yet.
  • frontend.rs / dyncfg-launchdarkly: restore the launchdarkly-sdk-transport HyperTransport API and the MetricsTransport wrapper that powers mz_parameter_frontend_last_sse_time_seconds (the metric this migration exists to enable). The post-revert "skip bad entries" fix in lib.rs is preserved.
  • deny.toml: re-add the launchdarkly-sdk-transport wrapper, and skip the duplicate versions the transport stack introduces — older tower / rustls-native-certs (proxy/rustls), and newer rand / rand_core / getrandom / cpufeatures (RNG). No unrelated crate bumps; hyper-timeout 0.4.1 + tokio-io-timeout duplicates are actually eliminated.

Tests

Adds test_metric_frozen_on_midstream_error (src/adapter/src/config/frontend.rs), which models the exact incident-984 failure mode — a transport that returns 200 OK, delivers one event, then errors mid-stream with a timeout — and asserts last_sse_time_seconds advances only for the event that arrived and then freezes, so the staleness alert can detect a stuck data source. The existing MetricsTransport tests covered the success path and an initial-request failure, but not a mid-stream error after a successful 200.

Validation note

The SDK-level reconnect on non-Eof stream errors can't be reproduced in a unit test (a prior real-LD-keys repro attempt in #36325 could not trigger it); the added test covers the metric/detection path we own, and we rely on the upstream fixes for the reconnect itself. Before un-drafting, this should be validated against the Nightly LaunchDarkly test (the same signal the revert PR used), since the previous attempt passed CI and failed only at runtime.

Move launchdarkly-server-sdk from the MaterializeInc/rust-server-sdk
fork back to upstream crates.io 3.1.1, restoring the launchdarkly-sdk-
transport + MetricsTransport setup and dropping the [patch.crates-io]
override.

The fork existed for launchdarkly/rust-server-sdk#116: a StreamingData
Source/eventsource StreamClosed bug where a non-Eof stream error left
the data source stuck with no reconnect, silently breaking LD sync. A
prior upgrade to upstream 3.0.1 had to be reverted (incident-984) because
that bug was still unfixed upstream. The fixes have since landed —
rust-server-sdk#168 and rust-eventsource-client#134/#135 — and 3.1.1
resolves eventsource-client to 0.17.5, which carries them.

Use the rustls + aws-lc-rs features (hyper-rustls-native-roots,
crypto-aws-lc-rs), now the upstream defaults, instead of the prior
attempt's native-tls/crypto-openssl, avoiding the OpenSSL path. The
transport build_https() call is identical either way.

deny.toml gains skips for the duplicate versions the transport stack
pulls (older tower/rustls-native-certs; newer rand/rand_core/getrandom/
cpufeatures) and re-adds the launchdarkly-sdk-transport wrapper.

Adds a test, test_metric_frozen_on_midstream_error, modeling the exact
incident-984 failure mode (200 OK then a mid-stream timeout): it asserts
the last_sse_time_seconds gauge freezes so the staleness alert can detect
a stuck data source.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant