fix: fall back to default DNS resolver when system config is unparseable#613
n13 wants to merge 2 commits
litep2p's `with_system_resolver()` reads the OS resolver config via `hickory read_system_conf()` at node startup. When `/etc/resolv.conf` contains a nameserver the parser rejects (e.g. macOS zoned link-local IPv6 `fe80::1%en0`), the read errors and the whole service aborts with `CannotReadSystemDnsConfig`, so the node cannot start at all. Log the parse failure and fall back to the default resolver instead of aborting, keeping #586's "prefer system DNS" intent while no longer bricking startup on hosts with a nameserver hickory can't parse.
n13
left a comment
Review
Thanks for chasing this down — the startup abort on macOS is real and the diagnosis of read_system_conf() as the trigger is correct. My concern is with the fallback target, which I think re-introduces the exact bug #586 was created to fix, just in a quieter form.
The fallback lands back on the #586 failure
The fallback is (Default::default(), Default::default()). In hickory-resolver 0.26.1 (the version this crate pins), ResolverConfig is #[derive(Default)], so ResolverConfig::default() has an empty name_servers list — no nameservers at all. That is precisely the state #586 fixed ("ResolverConfig::default() … empty config with no nameservers, so all /dns/… bootnode and telemetry dials fail").
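The gap between a derived and a hand-written `Default` can be sketched with stand-in types. `DerivedConfig` and `GoogleDefaultConfig` below are hypothetical models, not hickory's types; the only point is that `#[derive(Default)]` on a struct holding a `Vec` yields an empty list. The addresses 8.8.8.8 and 8.8.4.4 are Google's public DNS servers.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Hypothetical stand-in for a resolver config whose `Default` is derived
/// (the shape the review describes for hickory-resolver 0.26.1).
#[derive(Debug, Default)]
pub struct DerivedConfig {
    pub name_servers: Vec<IpAddr>,
}

/// Hypothetical stand-in for a config whose `Default` is hand-written to
/// return a populated public resolver (the 0.24-style behaviour).
#[derive(Debug)]
pub struct GoogleDefaultConfig {
    pub name_servers: Vec<IpAddr>,
}

impl Default for GoogleDefaultConfig {
    fn default() -> Self {
        Self {
            // Google public DNS, used here only as an illustrative value.
            name_servers: vec![
                IpAddr::V4(Ipv4Addr::new(8, 8, 8, 8)),
                IpAddr::V4(Ipv4Addr::new(8, 8, 4, 4)),
            ],
        }
    }
}
```

With these models, `DerivedConfig::default().name_servers` is empty while `GoogleDefaultConfig::default().name_servers` holds two entries, which is the behavioural difference between the two backends described above.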
So on the exact hosts this PR targets:
- Before: loud crash at startup (`CannotReadSystemDnsConfig`).
- After: node boots, logs one error, then runs with zero nameservers → every `/dns/` dial fails. That includes all of our bootnodes (`/dns/a1-p2p-heisenberg.quantus.cat`, …) and telemetry (`/dns/shard-telemetry.quantus.cat`) in `node/src/chain_spec.rs` → 0 peers, no telemetry.
The --dev test can't surface this: the dev chain spec has no DNS bootnodes/telemetry, so DNS resolution is never exercised. The regression only shows on testnet/mainnet.
Why the libp2p backend doesn't have this problem
The two network backends use different hickory versions, and 0.26 carries two regressions vs 0.24:
| | libp2p backend | litep2p backend |
|---|---|---|
| DNS lib | libp2p-dns 0.42 → hickory 0.24.4 | vendored litep2p → hickory 0.26.1 |
| `ResolverConfig::default()` | `Self::google()` → Google DNS (populated) | derived `Default` → empty (no nameservers) |
| macOS system read | no `apple.rs`; `unix.rs` reads `/etc/resolv.conf` via `resolv-conf`, which parses `fe80::1%en0` fine (alphanumeric scope) | new `apple.rs` reads SystemConfiguration then `IpAddr::from_str("fe80::1%en0")` → std rejects the `%en0` zone id, and one bad entry aborts the whole read via `?` |
So libp2p "just works" on the same Mac because hickory 0.24 (a) reads the real system nameservers successfully via /etc/resolv.conf, and (b) even when it can't, its default is Google — never empty.
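The failure mode in the right-hand column can be reproduced with the standard library alone. The sketch below is not hickory's `apple.rs`; `parse_all_or_abort` is a hypothetical helper that only mirrors the "parse each entry with `?`" pattern described above.

```rust
use std::net::{AddrParseError, IpAddr};
use std::str::FromStr;

/// Parses every entry as a plain IP address. Because of the `?`, the first
/// entry that fails to parse makes the whole read fail, even if the other
/// entries are perfectly usable.
pub fn parse_all_or_abort(entries: &[&str]) -> Result<Vec<IpAddr>, AddrParseError> {
    let mut servers = Vec::new();
    for entry in entries {
        // `IpAddr::from_str` has no notion of an IPv6 zone id such as `%en0`.
        servers.push(IpAddr::from_str(entry)?);
    }
    Ok(servers)
}
```

Calling `parse_all_or_abort(&["192.168.1.1", "fe80::1%en0"])` returns an error even though the first nameserver is valid, which is how a single zoned entry can take down the entire system-config read.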
Suggested change
Keep preferring system DNS, but fall back to a populated public resolver instead of an empty config — reproducing the default() == google() behavior libp2p relies on:
```rust
use hickory_resolver::config::{ResolverConfig, GOOGLE};

let (resolver_config, resolver_opts) = if litep2p_config.use_system_dns_config {
    match hickory_resolver::system_conf::read_system_conf() {
        Ok(conf) => conf,
        Err(error) => {
            tracing::error!(
                target: LOG_TARGET,
                ?error,
                "failed to read system DNS config; falling back to public DNS",
            );
            (ResolverConfig::udp_and_tcp(&GOOGLE), Default::default())
        },
    }
} else {
    (Default::default(), Default::default())
};
```

This way the node never crashes and never ends up with 0 nameservers. (Google matches libp2p exactly; CLOUDFLARE/QUAD9 are equivalent consts.) It's also worth treating a successfully-read-but-empty config the same way.
If preserving an operator's custom/split-horizon DNS on macOS matters, the more faithful option is to parse /etc/resolv.conf via the resolv-conf crate directly on apple targets (what hickory 0.24 does) — but for public /dns/ bootnodes the public-DNS fallback above is sufficient.
Longer term, the cleanest "once and for all" is to align this vendored litep2p's hickory-resolver with libp2p's 0.24 line (or patch its apple.rs to tolerate zoned nameservers and avoid the empty default), which removes both regressions permanently.
Minor
client/litep2p/src/config.rs — the with_system_resolver doc comment still says "instead of default (Google)", which is stale under hickory 0.26 (the default is now empty, not Google).
Also refer to the upstream fix PR hickory-dns/hickory-dns#3765. Upstream main already fixes the crash: it ignores unknown items in the Apple DNS config instead of crashing.
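As a rough model of that "ignore what you cannot parse" behaviour, the sketch below keeps the entries that parse and reports the rest. It is an illustration using only the standard library, not the code from hickory-dns#3765; `parse_tolerant` is a hypothetical name.

```rust
use std::net::IpAddr;

/// Keeps every entry that parses as an IP address and returns the rejected
/// entries separately so the caller can log them instead of aborting.
pub fn parse_tolerant<'a>(entries: &[&'a str]) -> (Vec<IpAddr>, Vec<&'a str>) {
    let mut servers = Vec::new();
    let mut skipped = Vec::new();
    for &entry in entries {
        match entry.parse::<IpAddr>() {
            Ok(addr) => servers.push(addr),
            // A zoned address such as `fe80::1%en0` lands here and is skipped.
            Err(_) => skipped.push(entry),
        }
    }
    (servers, skipped)
}
```

For the input `["192.168.1.1", "fe80::1%en0", "1.1.1.1"]` this yields two usable nameservers and one skipped entry, so one unparseable line no longer discards the whole config.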
n13
left a comment
Re-review — verdict: LGTM (non-blocking nits below)
I'm the PR author so GitHub won't let me click Approve, but the blocking concern from the previous round is resolved correctly and the change is sound.
What changed since last round
The fallback moved from (Default::default(), Default::default()) → (ResolverConfig::udp_and_tcp(&GOOGLE), Default::default()) (commit 3e39d41). This closes the regression I flagged (empty resolver ⇒ every /dns/ bootnode + telemetry dial fails on the exact macOS hosts this PR targets).
Verified
- `ResolverConfig::default()` really is empty in 0.26.1: `ResolverConfig` is `#[derive(Default)]` with `name_servers: Vec<NameServerConfig>` (`hickory-resolver-0.26.1/src/config.rs:49-60`), so the original fallback yielded 0 nameservers. Confirmed.
- New fallback is populated: `ResolverConfig::udp_and_tcp(&GOOGLE)` fills `name_servers` from the `GOOGLE` `ServerGroup` (`config.rs:66`, const at `config.rs:920`), reproducing the hickory-0.24 `default() == google()` behavior the libp2p backend relies on.
- Compiles cleanly: import path `hickory_resolver::config::{ResolverConfig, GOOGLE}` is valid (`pub mod config`), `LOG_TARGET` is in scope, and `cargo check -p litep2p` passes.
- Path is actually exercised: the node always calls `with_system_resolver()` (`client/network/src/litep2p/mod.rs:327`), so the fix takes effect on the affected hosts.
Non-blocking nits (both raised last round, still open)
- Ok-but-empty config isn't covered. The `match` only falls back on `Err`. A successful read that returns an empty `name_servers` list would still leave the node with 0 nameservers. A cheap guard (fall back when the parsed `name_servers` is empty too) would fully close the gap. Optional.
- Stale doc comment. `client/litep2p/src/config.rs:257` still says `with_system_resolver` sets DNS "instead of default (Google)", but under hickory 0.26 the non-system default (the `else` branch, `Default::default()`) is empty, not Google. Doc-only: quantus always calls `with_system_resolver()`, so no runtime impact.
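The guard from the first nit could look roughly like the sketch below. `Config`, `public_fallback`, and `choose_config` are hypothetical stand-ins that model only the nameserver list; the real code would apply the same `match` shape to hickory's `ResolverConfig`.

```rust
use std::net::{IpAddr, Ipv4Addr};

/// Hypothetical stand-in for a resolver config; only the nameserver list matters here.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub name_servers: Vec<IpAddr>,
}

/// Hypothetical public fallback, populated with Google's public DNS addresses.
pub fn public_fallback() -> Config {
    Config {
        name_servers: vec![
            IpAddr::V4(Ipv4Addr::new(8, 8, 8, 8)),
            IpAddr::V4(Ipv4Addr::new(8, 8, 4, 4)),
        ],
    }
}

/// Uses the system config only when the read succeeded and produced at least
/// one nameserver; otherwise falls back to the public resolver.
pub fn choose_config<E>(system: Result<Config, E>) -> Config {
    match system {
        Ok(conf) if !conf.name_servers.is_empty() => conf,
        // Covers both `Err(_)` and `Ok` with an empty nameserver list.
        _ => public_fallback(),
    }
}
```

The match guard keeps the happy path unchanged and routes both failure shapes through one fallback arm, so the node cannot end up with zero nameservers from either cause.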
Minor
- The Linux `resolv.conf` sanity-check box in the test plan is still unchecked.
Fix logic and safety look good — ship it once you're comfortable with the nits.
Summary
The litep2p backend calls `with_system_resolver()` (added in "fix: use system DNS resolver in litep2p backend", #586), which reads the OS resolver config via `hickory read_system_conf()` during `Litep2p::new()`.

If `/etc/resolv.conf` contains a nameserver the parser rejects, notably the macOS zoned link-local IPv6 entry `nameserver fe80::1%en0`, the read returns an error, mapped to `Error::CannotReadSystemDnsConfig`, which propagates out and aborts the whole service on startup.

The node cannot start at all on such hosts (reproducible on macOS with the default network config).

This change logs the parse failure at `error` level and falls back to litep2p's default resolver instead of aborting, preserving #586's "prefer system DNS" intent while no longer bricking startup.

Root cause

`hickory_resolver::system_conf::read_system_conf()` (via `resolv-conf`) fails to parse a nameserver with an IPv6 zone id (`%en0`). The vendored litep2p treated that read as fatal.

Test plan

- [x] `cargo check -p litep2p`
- [x] `quantus-node`: started a `--dev` node on a macOS host whose `/etc/resolv.conf` has `nameserver fe80::1%en0`. Before: startup aborts with `CannotReadSystemDnsConfig`. After: node logs the fallback and boots, mines blocks, and serves RPC (verified transfers, multisig, and tech-collective governance via the CLI).
- [ ] Linux `resolv.conf` sanity check (system resolver path unchanged).