Fix/backport prod fixes#39
Conversation
Three fixes that landed on the production node but never made it here: - trigger_sync: 30s timeout with the body read inside it (was 5s covering headers only — large unpaginated repo lists from canonical nodes and transpacific peers aborted mid-body), plus distinct logging for non-2xx vs timeout - announce: reject self-announcements by public URL or own DID; complements the existing boot-time prune_self_peers so self-loop rows never enter the peers table in the first place - list_all_repos_paged: report MAX(updated_at) across each logical repo group instead of the canonical row's own value — gossip pushes touch only the mirror row, so canonical timestamps go stale Monorepo refs: 8c2fe1b, c1dc33e, ef21c75. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Backports the monorepo TCP-era p2p fixes (9f875b6, 267d371) adapted to the QUIC transport: - listen on /ip6/:: as well as /ip4/0.0.0.0 — Fly's 6PN inter-app network is IPv6-only (<app>.internal resolves to AAAA records), so without an IPv6 socket peers on the private network can't reach us - wrap the QUIC transport in libp2p-dns so bootstrap multiaddrs can use /dns6/<app>.internal — dialing peers through the public anycast edge breaks the handshake when the proxy closes the connection mid-stream Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A push propagated via the /api/v1/sync/notify HTTP fallback replicated the repo but left no trace in received_ref_updates — only the libp2p gossip path recorded it, so /api/v1/events/ref-updates went blind whenever the mesh wasn't delivering. Insert the same record from the notify handler, with optional provenance fields (pusher_did, old_sha, timestamp, cert_id) that new senders already include. Monorepo ref: 087eeac. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
📝 WalkthroughWalkthroughThis PR enhances peer synchronization and P2P infrastructure across four main areas: DNS-based QUIC transport with dual-stack listening support, improved peer sync timeout and error handling, self-announcement rejection validation, HTTP notification tracking with database persistence, and repository deduplication query logic. ChangesPeer Synchronization and P2P Infrastructure
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/gitlawb-node/src/api/peers.rs`:
- Around line 147-168: The outer tokio::time::timeout(30s, ...) in trigger_sync
is being masked by a shorter client-level timeout on state.http_client, so slow
peers surface as "peer fetch failed" instead of the intended "peer timed out";
update the request to set a per-request timeout that is >= the outer timeout
(e.g., client.get(&url).timeout(std::time::Duration::from_secs(31))) or
construct the Client without a global timeout and rely on the outer
tokio::time::timeout; change the code around client.get(&url).send().await? in
trigger_sync (and ensure state.http_client construction is adjusted if you
choose the global-client approach) so the outer tokio timeout is effective.
- Around line 264-283: Currently both the gossipsub path
(crates::gitlawb_node::p2p::mod::/* handler */) and the HTTP handler in
api::peers.rs create ReceivedRefUpdate rows with fresh UUIDs, so the DB dedupe
on id fails and the same logical push can be stored twice; change id generation
so both paths produce the same deterministic id for the same logical update
(instead of Uuid::new_v4()). Implement a stable ID function (called from both
places) that computes a collision-resistant id from the canonical fields of
ReceivedRefUpdate (e.g., repo, ref_name, new_sha, old_sha, pusher_did,
timestamp) using a stable hash/UUID v5 strategy, then use that id when
constructing ReceivedRefUpdate before calling state.db.insert_ref_update; update
both the gossipsub handler and api::peers.rs to call this new id generator and
ensure insert_ref_update behavior remains unchanged.
In `@crates/gitlawb-node/src/p2p/mod.rs`:
- Around line 250-260: The current start() logic only warns on failures from
swarm.listen_on for v4 and v6 so the node can start with zero listeners; change
it to record both results and if both calls fail return or propagate an error
from start() (or panic) so startup fails visibly. Specifically, capture the
Result from swarm.listen_on(v4) and swarm.listen_on(v6), log each failure as
before, and after both attempts check if both are Err; if so return an Err (or
map one of the errors into a StartError) from the start() function so callers
cannot continue with no listeners; reference the variables v4, v6, listen_port
and the swarm.listen_on calls to locate the code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 0adb6ed9-c93c-473b-b808-4cf046747ba8
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (4)
crates/gitlawb-node/Cargo.tomlcrates/gitlawb-node/src/api/peers.rscrates/gitlawb-node/src/db/mod.rscrates/gitlawb-node/src/p2p/mod.rs
| let result = tokio::time::timeout(std::time::Duration::from_secs(30), async { | ||
| let resp = client.get(&url).send().await?; | ||
| if !resp.status().is_success() { | ||
| return Err(anyhow::anyhow!("peer returned status {}", resp.status())); | ||
| } | ||
| let repos: Vec<serde_json::Value> = resp.json().await?; | ||
| Ok::<_, anyhow::Error>(repos) | ||
| }) | ||
| .await; | ||
|
|
||
| let repos: Vec<serde_json::Value> = match result { | ||
| Ok(Ok(resp)) if resp.status().is_success() => { | ||
| let repos = match result { | ||
| Ok(Ok(repos)) => { | ||
| peers_reached += 1; | ||
| resp.json().await.unwrap_or_default() | ||
| repos | ||
| } | ||
| Ok(Err(e)) => { | ||
| tracing::warn!(peer = %peer.did, err = %e, "trigger_sync: peer fetch failed"); | ||
| continue; | ||
| } | ||
| _ => { | ||
| tracing::warn!(peer = %peer.did, "trigger_sync: could not reach peer"); | ||
| Err(_) => { | ||
| tracing::warn!(peer = %peer.did, "trigger_sync: peer timed out"); | ||
| continue; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find the node-side shared reqwest client construction and any configured request timeouts.
rg -n --type rust -C4 'http_client|reqwest::Client::builder|Client::builder\(|\.timeout\(' crates/gitlawb-node/srcRepository: Gitlawb/node
Length of output: 13556
Timeout diagnostics split likely masked by shared reqwest timeout
crates/gitlawb-node/src/main.rs constructs state.http_client with reqwest::Client::builder().timeout(Duration::from_secs(10)), which will typically make client.get(...).send().await? fail before the outer tokio::time::timeout(30s, ...) can fire—so slow peers will be logged as peer fetch failed (Ok(Err(_))) rather than peer timed out (Err(_)). There’s also an alternate AppState path in crates/gitlawb-node/src/auth/mod.rs that uses reqwest::Client::new() (no explicit timeout), so the outer timeout branch may only be reachable when that state is used for trigger_sync.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/api/peers.rs` around lines 147 - 168, The outer
tokio::time::timeout(30s, ...) in trigger_sync is being masked by a shorter
client-level timeout on state.http_client, so slow peers surface as "peer fetch
failed" instead of the intended "peer timed out"; update the request to set a
per-request timeout that is >= the outer timeout (e.g.,
client.get(&url).timeout(std::time::Duration::from_secs(31))) or construct the
Client without a global timeout and rely on the outer tokio::time::timeout;
change the code around client.get(&url).send().await? in trigger_sync (and
ensure state.http_client construction is adjusted if you choose the
global-client approach) so the outer tokio timeout is effective.
| // Mirror the gossipsub-receive handler: insert the same record we'd | ||
| // get from the libp2p path, so /api/v1/events/ref-updates reflects | ||
| // pushes that arrive over either transport. | ||
| let now = chrono::Utc::now().to_rfc3339(); | ||
| let update = crate::db::ReceivedRefUpdate { | ||
| id: uuid::Uuid::new_v4().to_string(), | ||
| node_did: req.node_did.clone(), | ||
| pusher_did: req.pusher_did.clone().unwrap_or_default(), | ||
| repo: req.repo.clone(), | ||
| ref_name: req.ref_name.clone(), | ||
| old_sha: req.old_sha.clone().unwrap_or_default(), | ||
| new_sha: req.new_sha.clone(), | ||
| timestamp: req.timestamp.clone().unwrap_or_else(|| now.clone()), | ||
| cert_id: req.cert_id.clone(), | ||
| received_at: now, | ||
| from_peer: format!("http:{}", req.node_did), | ||
| }; | ||
| if let Err(e) = state.db.insert_ref_update(&update).await { | ||
| tracing::warn!(err = %e, repo = %req.repo, "failed to insert ref-update from sync notify"); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find push fanout code that emits ref updates.
rg -n --type rust -C4 'publish_ref_update\s*\(|sync/notify|insert_ref_update\s*\(' crates
# Inspect readers of received_ref_updates to see whether they dedupe by content/source.
rg -n --type rust -C4 'received_ref_updates|from_peer' crates/gitlawb-node/srcRepository: Gitlawb/node
Length of output: 18423
Prevent duplicate received_ref_updates rows for one logical push across HTTP notify + gossipsub
- Both the gossipsub receive path (
crates/gitlawb-node/src/p2p/mod.rs) and the HTTP handler (crates/gitlawb-node/src/api/peers.rs) insert aReceivedRefUpdateusing a freshid: Uuid::new_v4(). received_ref_updatesonly dedupes onid(PRIMARY KEY (id)andINSERT ... ON CONFLICT(id) DO NOTHING), so two different UUIDs will both persist; there’s no reader-side dedupe inlist_ref_updates.- As a result, a single logical ref update can persist twice when a peer receives both the gossipsub delivery and the
/api/v1/sync/notifyfallback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/api/peers.rs` around lines 264 - 283, Currently both
the gossipsub path (crates::gitlawb_node::p2p::mod::/* handler */) and the HTTP
handler in api::peers.rs create ReceivedRefUpdate rows with fresh UUIDs, so the
DB dedupe on id fails and the same logical push can be stored twice; change id
generation so both paths produce the same deterministic id for the same logical
update (instead of Uuid::new_v4()). Implement a stable ID function (called from
both places) that computes a collision-resistant id from the canonical fields of
ReceivedRefUpdate (e.g., repo, ref_name, new_sha, old_sha, pusher_did,
timestamp) using a stable hash/UUID v5 strategy, then use that id when
constructing ReceivedRefUpdate before calling state.db.insert_ref_update; update
both the gossipsub handler and api::peers.rs to call this new id generator and
ensure insert_ref_update behavior remains unchanged.
| // Listen on both IPv4 (local/mDNS + any IPv4 dials) and IPv6 (required | ||
| // for Fly's 6PN inter-app network — <app>.internal DNS only returns AAAA | ||
| // records, so peers dial us via IPv6 and need a matching IPv6 socket). | ||
| let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?; | ||
| if let Err(e) = swarm.listen_on(v4) { | ||
| warn!(err = %e, "failed to listen on IPv4"); | ||
| } | ||
| let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?; | ||
| if let Err(e) = swarm.listen_on(v6) { | ||
| warn!(err = %e, "failed to listen on IPv6"); | ||
| } |
There was a problem hiding this comment.
Fail startup when neither QUIC socket binds.
Line 254 and Line 258 only warn, so start() still succeeds when both listen_on calls fail. That leaves the node running with zero swarm listeners and no inbound P2P reachability while boot looks healthy.
Proposed fix
+ let mut listen_ok = false;
let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?;
- if let Err(e) = swarm.listen_on(v4) {
- warn!(err = %e, "failed to listen on IPv4");
- }
+ match swarm.listen_on(v4) {
+ Ok(_) => listen_ok = true,
+ Err(e) => warn!(err = %e, "failed to listen on IPv4"),
+ }
let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?;
- if let Err(e) = swarm.listen_on(v6) {
- warn!(err = %e, "failed to listen on IPv6");
- }
+ match swarm.listen_on(v6) {
+ Ok(_) => listen_ok = true,
+ Err(e) => warn!(err = %e, "failed to listen on IPv6"),
+ }
+ if !listen_ok {
+ return Err(anyhow::anyhow!(
+ "failed to bind any QUIC listen address on port {listen_port}"
+ ));
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Listen on both IPv4 (local/mDNS + any IPv4 dials) and IPv6 (required | |
| // for Fly's 6PN inter-app network — <app>.internal DNS only returns AAAA | |
| // records, so peers dial us via IPv6 and need a matching IPv6 socket). | |
| let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?; | |
| if let Err(e) = swarm.listen_on(v4) { | |
| warn!(err = %e, "failed to listen on IPv4"); | |
| } | |
| let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?; | |
| if let Err(e) = swarm.listen_on(v6) { | |
| warn!(err = %e, "failed to listen on IPv6"); | |
| } | |
| let mut listen_ok = false; | |
| let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?; | |
| match swarm.listen_on(v4) { | |
| Ok(_) => listen_ok = true, | |
| Err(e) => warn!(err = %e, "failed to listen on IPv4"), | |
| } | |
| let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?; | |
| match swarm.listen_on(v6) { | |
| Ok(_) => listen_ok = true, | |
| Err(e) => warn!(err = %e, "failed to listen on IPv6"), | |
| } | |
| if !listen_ok { | |
| return Err(anyhow::anyhow!( | |
| "failed to bind any QUIC listen address on port {listen_port}" | |
| )); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/gitlawb-node/src/p2p/mod.rs` around lines 250 - 260, The current
start() logic only warns on failures from swarm.listen_on for v4 and v6 so the
node can start with zero listeners; change it to record both results and if both
calls fail return or propagate an error from start() (or panic) so startup fails
visibly. Specifically, capture the Result from swarm.listen_on(v4) and
swarm.listen_on(v6), log each failure as before, and after both attempts check
if both are Err; if so return an Err (or map one of the errors into a
StartError) from the start() function so callers cannot continue with no
listeners; reference the variables v4, v6, listen_port and the swarm.listen_on
calls to locate the code.
Summary
Backports the remaining production fixes that landed on the monorepo node but never made it into this repo, closing out code parity ahead of the monorepo sunset. After this merges (together with #
and the idle_timeout PR), the monorepo's node code is fully superseded and frozen.
1.
trigger_sync: 30s timeout covering the body read (monorepo8c2fe1b)The 5s timeout only covered the response headers — the body read happened outside it, so large unpaginated repo lists from canonical nodes (and transpacific round trips) aborted mid-body. The timeout now wraps
send + status check + body read, with distinct log lines for non-2xx responses vs timeouts.
2.
announce: reject self-announcements (monorepoc1dc33e)A peer row whose
http_urlis our own public URL makes the HTTP-notify path fan out to ourselves — seen in production when misconfigured dev nodes announce with their upstream's URL. This repo already had theboot-time
prune_self_peersand the skip-self check in the notify fanout; this adds the missing piece, rejecting them at the door (by public URL or own DID).3.
list_all_repos_paged: fresh timestamps across mirror rows (monorepoef21c75)The
DISTINCT ONdedupe here already collapses the canonical + mirror row pair per logical repo, but it reported the canonical row's ownupdated_at— and pushes arriving via gossip only touch the mirror row,so timestamps went stale. Now reports
MAX(updated_at)over the group.4. p2p: dual-stack QUIC listen + DNS multiaddr dialing (monorepo
9f875b6+267d371, adapted from TCP to QUIC)/ip6/::alongside/ip4/0.0.0.0— Fly's 6PN inter-app network is IPv6-only (<app>.internalresolves to AAAA records), so without an IPv6 socket peers on the private network can't reach us.libp2p-dnsso bootstrap multiaddrs can use/dns6/<app>.internal/udp/<port>/quic-v1. Dialing through the public anycast edge breaks the handshake when the proxy closes theconnection mid-stream.
Deploy note: Fly bootstrap multiaddrs should move to the
/dns6/<app>.internal/udp/<port>/quic-v1form to take advantage of this. Likely relevant to theconnected_peers=0issue seen in production.5. HTTP notify records
received_ref_updates(monorepo087eeac)A push propagated via the
/api/v1/sync/notifyHTTP fallback replicated the repo but left no trace inreceived_ref_updates— only the libp2p gossip path recorded it, so/api/v1/events/ref-updateswent blindwhenever the mesh wasn't delivering. The notify handler now inserts the same record, with new optional provenance fields (
pusher_did,old_sha,timestamp,cert_id) that senders already include; oldersenders without them still work.
Testing
cargo fmt --check,cargo clippy -- -D warnings, and all 88 tests passlibp2p-dns = 0.44dependency resolves cleanly against the existing libp2p 0.43-family cratesSummary by CodeRabbit
New Features
Bug Fixes
Improvements