adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1 by jasonhernandez · Pull Request #37025 · MaterializeInc/materialize

jasonhernandez · 2026-06-12T23:47:03Z

Motivation

This moves launchdarkly-server-sdk from our MaterializeInc/rust-server-sdk fork back to upstream crates.io (3.1.1) and drops the [patch.crates-io] override.

The fork existed for launchdarkly/rust-server-sdk#116: both the streaming and polling data sources spawn a background fetch task that breaks (and never reconnects) on most non-Eof stream errors, silently wedging LD sync with no way to detect it. A previous upgrade to upstream 3.0.1 (#35903 + the streaming-data-source reapply) had to be reverted in #36320 (incident-984) because the bug was still unfixed upstream and broke LD sync in the prod sandbox.

In incident-984 the failure surfaced as:

ERROR launchdarkly_server_sdk::data_source: unhandled error on event stream:
Transport(TransportError { inner: hyper::Error(Body, Kind(TimedOut)) })

i.e. a non-Eof error hitting the body stream after a successful 200, which the SDK did not recover from. SSE then went stale and the launchdarkly-stale-sse-tier-2 alert fired.

The fixes have since landed upstream:

rust-server-sdk #168 — keep StreamingDataSource alive on non-Eof stream errors (v3.0.3)
rust-eventsource-client #134 / #135 — schedule reconnect after a stream/parse error (v0.17.4)

At 3.1.1, Cargo resolves eventsource-client to 0.17.5 and launchdarkly-sdk-transport to 0.1.3, all carrying these fixes.

What changed

Cargo.toml: launchdarkly-server-sdk → 3.1.1; re-add launchdarkly-sdk-transport = "0.1"; remove the fork [patch.crates-io] entry.
TLS/crypto: use the hyper-rustls-native-roots + crypto-aws-lc-rs features (now the upstream defaults) instead of the prior attempt's native-tls/crypto-openssl. This avoids re-introducing the OpenSSL path and aligns with the rustls/aws-lc-rs standardization. transport.build_https() is the same call either way; 3.0.1 simply didn't offer rustls/aws-lc-rs yet.
frontend.rs / dyncfg-launchdarkly: restore the launchdarkly-sdk-transport HyperTransport API and the MetricsTransport wrapper that powers mz_parameter_frontend_last_sse_time_seconds (the metric this migration exists to enable). The post-revert "skip bad entries" fix in lib.rs is preserved.
deny.toml: re-add the launchdarkly-sdk-transport wrapper, and skip the duplicate versions the transport stack introduces — older tower / rustls-native-certs (proxy/rustls), and newer rand / rand_core / getrandom / cpufeatures (RNG). No unrelated crate bumps; hyper-timeout 0.4.1 + tokio-io-timeout duplicates are actually eliminated.

Tests

Adds test_metric_frozen_on_midstream_error (src/adapter/src/config/frontend.rs), which models the exact incident-984 failure mode — a transport that returns 200 OK, delivers one event, then errors mid-stream with a timeout — and asserts last_sse_time_seconds advances only for the event that arrived and then freezes, so the staleness alert can detect a stuck data source. The existing MetricsTransport tests covered the success path and an initial-request failure, but not a mid-stream error after a successful 200.

Validation note

The SDK-level reconnect on non-Eof stream errors can't be reproduced in a unit test (a prior real-LD-keys repro attempt in #36325 could not trigger it); the added test covers the metric/detection path we own, and we rely on the upstream fixes for the reconnect itself. Before un-drafting, this should be validated against the Nightly LaunchDarkly test (the same signal the revert PR used), since the previous attempt passed CI and failed only at runtime.

Move launchdarkly-server-sdk from the MaterializeInc/rust-server-sdk fork back to upstream crates.io 3.1.1, restoring the launchdarkly-sdk- transport + MetricsTransport setup and dropping the [patch.crates-io] override. The fork existed for launchdarkly/rust-server-sdk#116: a StreamingData Source/eventsource StreamClosed bug where a non-Eof stream error left the data source stuck with no reconnect, silently breaking LD sync. A prior upgrade to upstream 3.0.1 had to be reverted (incident-984) because that bug was still unfixed upstream. The fixes have since landed — rust-server-sdk#168 and rust-eventsource-client#134/#135 — and 3.1.1 resolves eventsource-client to 0.17.5, which carries them. Use the rustls + aws-lc-rs features (hyper-rustls-native-roots, crypto-aws-lc-rs), now the upstream defaults, instead of the prior attempt's native-tls/crypto-openssl, avoiding the OpenSSL path. The transport build_https() call is identical either way. deny.toml gains skips for the duplicate versions the transport stack pulls (older tower/rustls-native-certs; newer rand/rand_core/getrandom/ cpufeatures) and re-adds the launchdarkly-sdk-transport wrapper. Adds a test, test_metric_frozen_on_midstream_error, modeling the exact incident-984 failure mode (200 OK then a mid-stream timeout): it asserts the last_sse_time_seconds gauge freezes so the staleness alert can detect a stuck data source. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jasonhernandez mentioned this pull request Jun 13, 2026

adapter: add LaunchDarkly reconnect integration test #37026

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1#37025

adapter: migrate LaunchDarkly SDK off fork to upstream 3.1.1#37025
jasonhernandez wants to merge 1 commit into
MaterializeInc:mainfrom
jasonhernandez:jason/ld-sdk-upstream-migration

jasonhernandez commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jasonhernandez commented Jun 12, 2026

Motivation

What changed

Tests

Validation note

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant