Skip to content

Fix/backport prod fixes#39

Open
kevincodex1 wants to merge 3 commits into
mainfrom
fix/backport-prod-fixes
Open

Fix/backport prod fixes#39
kevincodex1 wants to merge 3 commits into
mainfrom
fix/backport-prod-fixes

Conversation

@kevincodex1

@kevincodex1 kevincodex1 commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

Backports the remaining production fixes that landed on the monorepo node but never made it into this repo, closing out code parity ahead of the monorepo sunset. After this merges (together with #
and the idle_timeout PR), the monorepo's node code is fully superseded and frozen.

1. trigger_sync: 30s timeout covering the body read (monorepo 8c2fe1b)

The 5s timeout only covered the response headers — the body read happened outside it, so large unpaginated repo lists from canonical nodes (and transpacific round trips) aborted mid-body. The timeout now wraps
send + status check + body read, with distinct log lines for non-2xx responses vs timeouts.

2. announce: reject self-announcements (monorepo c1dc33e)

A peer row whose http_url is our own public URL makes the HTTP-notify path fan out to ourselves — seen in production when misconfigured dev nodes announce with their upstream's URL. This repo already had the
boot-time prune_self_peers and the skip-self check in the notify fanout; this adds the missing piece, rejecting them at the door (by public URL or own DID).

3. list_all_repos_paged: fresh timestamps across mirror rows (monorepo ef21c75)

The DISTINCT ON dedupe here already collapses the canonical + mirror row pair per logical repo, but it reported the canonical row's own updated_at — and pushes arriving via gossip only touch the mirror row,
so timestamps went stale. Now reports MAX(updated_at) over the group.

4. p2p: dual-stack QUIC listen + DNS multiaddr dialing (monorepo 9f875b6 + 267d371, adapted from TCP to QUIC)

  • Listen on /ip6/:: alongside /ip4/0.0.0.0 — Fly's 6PN inter-app network is IPv6-only (<app>.internal resolves to AAAA records), so without an IPv6 socket peers on the private network can't reach us.
  • Wrap the QUIC transport in libp2p-dns so bootstrap multiaddrs can use /dns6/<app>.internal/udp/<port>/quic-v1. Dialing through the public anycast edge breaks the handshake when the proxy closes the
    connection mid-stream.

Deploy note: Fly bootstrap multiaddrs should move to the /dns6/<app>.internal/udp/<port>/quic-v1 form to take advantage of this. Likely relevant to the connected_peers=0 issue seen in production.

5. HTTP notify records received_ref_updates (monorepo 087eeac)

A push propagated via the /api/v1/sync/notify HTTP fallback replicated the repo but left no trace in received_ref_updates — only the libp2p gossip path recorded it, so /api/v1/events/ref-updates went blind
whenever the mesh wasn't delivering. The notify handler now inserts the same record, with new optional provenance fields (pusher_did, old_sha, timestamp, cert_id) that senders already include; older
senders without them still work.

Testing

  • cargo fmt --check, cargo clippy -- -D warnings, and all 88 tests pass
  • New libp2p-dns = 0.44 dependency resolves cleanly against the existing libp2p 0.43-family crates

Summary by CodeRabbit

  • New Features

    • Added optional metadata fields to peer notifications (pusher_did, old_sha, timestamp, cert_id)
    • Enabled DNS resolution for peer connections
  • Bug Fixes

    • Prevents nodes from announcing themselves as peers
    • Improved IPv4/IPv6 network connectivity resilience
  • Improvements

    • Extended peer fetch timeout from 5 seconds to 30 seconds
    • Fixed repo update timestamps in deduplicated results

kevincodex1 and others added 3 commits June 12, 2026 19:38
Three fixes that landed on the production node but never made it here:

- trigger_sync: 30s timeout with the body read inside it (was 5s
  covering headers only — large unpaginated repo lists from canonical
  nodes and transpacific peers aborted mid-body), plus distinct logging
  for non-2xx vs timeout
- announce: reject self-announcements by public URL or own DID;
  complements the existing boot-time prune_self_peers so self-loop
  rows never enter the peers table in the first place
- list_all_repos_paged: report MAX(updated_at) across each logical
  repo group instead of the canonical row's own value — gossip pushes
  touch only the mirror row, so canonical timestamps go stale

Monorepo refs: 8c2fe1b, c1dc33e, ef21c75.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Backports the monorepo TCP-era p2p fixes (9f875b6, 267d371) adapted to
the QUIC transport:

- listen on /ip6/:: as well as /ip4/0.0.0.0 — Fly's 6PN inter-app
  network is IPv6-only (<app>.internal resolves to AAAA records), so
  without an IPv6 socket peers on the private network can't reach us
- wrap the QUIC transport in libp2p-dns so bootstrap multiaddrs can use
  /dns6/<app>.internal — dialing peers through the public anycast edge
  breaks the handshake when the proxy closes the connection mid-stream

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A push propagated via the /api/v1/sync/notify HTTP fallback replicated
the repo but left no trace in received_ref_updates — only the libp2p
gossip path recorded it, so /api/v1/events/ref-updates went blind
whenever the mesh wasn't delivering. Insert the same record from the
notify handler, with optional provenance fields (pusher_did, old_sha,
timestamp, cert_id) that new senders already include.

Monorepo ref: 087eeac.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR enhances peer synchronization and P2P infrastructure across four main areas: DNS-based QUIC transport with dual-stack listening support, improved peer sync timeout and error handling, self-announcement rejection validation, HTTP notification tracking with database persistence, and repository deduplication query logic.

Changes

Peer Synchronization and P2P Infrastructure

Layer / File(s) Summary
P2P Networking: DNS-QUIC and Dual-Stack
crates/gitlawb-node/Cargo.toml, crates/gitlawb-node/src/p2p/mod.rs
Adds libp2p-dns 0.44.0 dependency with tokio feature; wraps QUIC transport with libp2p_dns::tokio::Transport::system for DNS multiaddr resolution; changes swarm listening to attempt both IPv4 and IPv6 with warnings instead of hard failures.
Peer Sync Reliability: Timeout and Error Handling
crates/gitlawb-node/src/api/peers.rs
Increases per-peer HTTP fetch timeout from 5s to 30s; adds explicit non-success HTTP status handling; distinguishes fetch errors and timeouts with separate logging while only counting successful fetches toward reachable peer count.
Self-Announcement Rejection
crates/gitlawb-node/src/api/peers.rs
Validates that announced http_url (after trailing slash trim) does not match the node's public URL and announced did does not match the node's own DID; rejects both cases with BadRequest.
HTTP Notification Tracking
crates/gitlawb-node/src/api/peers.rs
Extends NotifyRequest JSON payload with optional provenance fields (pusher_did, old_sha, timestamp, cert_id); records each HTTP-received ref update in the database via ReceivedRefUpdate with generated id, request fields, defaulted optional values, current timestamp default, and http:<node_did> peer identifier.
Repository Deduplication Query
crates/gitlawb-node/src/db/mod.rs
Updates list_all_repos_paged deduped CTE to compute updated_at as MAX(updated_at) over partition (split_part(owner_did, ':', -1), name) instead of row-level value; ensures deduplicated paged results reflect the most recent update across collapsed rows.

🎯 3 (Moderate) | ⏱️ ~25 minutes


🐰 Through DNS we hop and dual-stack we stand,
Peers shall not announce themselves, oh what a plan!
Notifications tracked and timeouts extended wide,
With dedup fixed, no stale updates hide. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is vague and generic, using the phrase 'Fix/backport prod fixes' which does not clearly convey what specific changes are being made. While the PR objectives explain the detailed changes (timeout extension, self-announcement rejection, query optimization, DNS support, HTTP notify handler), the title itself is non-descriptive. Consider using a more specific title that highlights the primary change, such as 'Backport production fixes: timeout, self-announcement, DNS, and logging improvements' or focus on the most impactful change if a shorter title is preferred.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/backport-prod-fixes

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/gitlawb-node/src/api/peers.rs`:
- Around line 147-168: The outer tokio::time::timeout(30s, ...) in trigger_sync
is being masked by a shorter client-level timeout on state.http_client, so slow
peers surface as "peer fetch failed" instead of the intended "peer timed out";
update the request to set a per-request timeout that is >= the outer timeout
(e.g., client.get(&url).timeout(std::time::Duration::from_secs(31))) or
construct the Client without a global timeout and rely on the outer
tokio::time::timeout; change the code around client.get(&url).send().await? in
trigger_sync (and ensure state.http_client construction is adjusted if you
choose the global-client approach) so the outer tokio timeout is effective.
- Around line 264-283: Currently both the gossipsub path
(crates::gitlawb_node::p2p::mod::/* handler */) and the HTTP handler in
api::peers.rs create ReceivedRefUpdate rows with fresh UUIDs, so the DB dedupe
on id fails and the same logical push can be stored twice; change id generation
so both paths produce the same deterministic id for the same logical update
(instead of Uuid::new_v4()). Implement a stable ID function (called from both
places) that computes a collision-resistant id from the canonical fields of
ReceivedRefUpdate (e.g., repo, ref_name, new_sha, old_sha, pusher_did,
timestamp) using a stable hash/UUID v5 strategy, then use that id when
constructing ReceivedRefUpdate before calling state.db.insert_ref_update; update
both the gossipsub handler and api::peers.rs to call this new id generator and
ensure insert_ref_update behavior remains unchanged.

In `@crates/gitlawb-node/src/p2p/mod.rs`:
- Around line 250-260: The current start() logic only warns on failures from
swarm.listen_on for v4 and v6 so the node can start with zero listeners; change
it to record both results and if both calls fail return or propagate an error
from start() (or panic) so startup fails visibly. Specifically, capture the
Result from swarm.listen_on(v4) and swarm.listen_on(v6), log each failure as
before, and after both attempts check if both are Err; if so return an Err (or
map one of the errors into a StartError) from the start() function so callers
cannot continue with no listeners; reference the variables v4, v6, listen_port
and the swarm.listen_on calls to locate the code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 0adb6ed9-c93c-473b-b808-4cf046747ba8

📥 Commits

Reviewing files that changed from the base of the PR and between 6abaf1d and ad27ca9.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • crates/gitlawb-node/Cargo.toml
  • crates/gitlawb-node/src/api/peers.rs
  • crates/gitlawb-node/src/db/mod.rs
  • crates/gitlawb-node/src/p2p/mod.rs

Comment on lines +147 to 168
let result = tokio::time::timeout(std::time::Duration::from_secs(30), async {
let resp = client.get(&url).send().await?;
if !resp.status().is_success() {
return Err(anyhow::anyhow!("peer returned status {}", resp.status()));
}
let repos: Vec<serde_json::Value> = resp.json().await?;
Ok::<_, anyhow::Error>(repos)
})
.await;

let repos: Vec<serde_json::Value> = match result {
Ok(Ok(resp)) if resp.status().is_success() => {
let repos = match result {
Ok(Ok(repos)) => {
peers_reached += 1;
resp.json().await.unwrap_or_default()
repos
}
Ok(Err(e)) => {
tracing::warn!(peer = %peer.did, err = %e, "trigger_sync: peer fetch failed");
continue;
}
_ => {
tracing::warn!(peer = %peer.did, "trigger_sync: could not reach peer");
Err(_) => {
tracing::warn!(peer = %peer.did, "trigger_sync: peer timed out");
continue;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find the node-side shared reqwest client construction and any configured request timeouts.
rg -n --type rust -C4 'http_client|reqwest::Client::builder|Client::builder\(|\.timeout\(' crates/gitlawb-node/src

Repository: Gitlawb/node

Length of output: 13556


Timeout diagnostics split likely masked by shared reqwest timeout
crates/gitlawb-node/src/main.rs constructs state.http_client with reqwest::Client::builder().timeout(Duration::from_secs(10)), which will typically make client.get(...).send().await? fail before the outer tokio::time::timeout(30s, ...) can fire—so slow peers will be logged as peer fetch failed (Ok(Err(_))) rather than peer timed out (Err(_)). There’s also an alternate AppState path in crates/gitlawb-node/src/auth/mod.rs that uses reqwest::Client::new() (no explicit timeout), so the outer timeout branch may only be reachable when that state is used for trigger_sync.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/api/peers.rs` around lines 147 - 168, The outer
tokio::time::timeout(30s, ...) in trigger_sync is being masked by a shorter
client-level timeout on state.http_client, so slow peers surface as "peer fetch
failed" instead of the intended "peer timed out"; update the request to set a
per-request timeout that is >= the outer timeout (e.g.,
client.get(&url).timeout(std::time::Duration::from_secs(31))) or construct the
Client without a global timeout and rely on the outer tokio::time::timeout;
change the code around client.get(&url).send().await? in trigger_sync (and
ensure state.http_client construction is adjusted if you choose the
global-client approach) so the outer tokio timeout is effective.

Comment on lines +264 to +283
// Mirror the gossipsub-receive handler: insert the same record we'd
// get from the libp2p path, so /api/v1/events/ref-updates reflects
// pushes that arrive over either transport.
let now = chrono::Utc::now().to_rfc3339();
let update = crate::db::ReceivedRefUpdate {
id: uuid::Uuid::new_v4().to_string(),
node_did: req.node_did.clone(),
pusher_did: req.pusher_did.clone().unwrap_or_default(),
repo: req.repo.clone(),
ref_name: req.ref_name.clone(),
old_sha: req.old_sha.clone().unwrap_or_default(),
new_sha: req.new_sha.clone(),
timestamp: req.timestamp.clone().unwrap_or_else(|| now.clone()),
cert_id: req.cert_id.clone(),
received_at: now,
from_peer: format!("http:{}", req.node_did),
};
if let Err(e) = state.db.insert_ref_update(&update).await {
tracing::warn!(err = %e, repo = %req.repo, "failed to insert ref-update from sync notify");
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find push fanout code that emits ref updates.
rg -n --type rust -C4 'publish_ref_update\s*\(|sync/notify|insert_ref_update\s*\(' crates

# Inspect readers of received_ref_updates to see whether they dedupe by content/source.
rg -n --type rust -C4 'received_ref_updates|from_peer' crates/gitlawb-node/src

Repository: Gitlawb/node

Length of output: 18423


Prevent duplicate received_ref_updates rows for one logical push across HTTP notify + gossipsub

  • Both the gossipsub receive path (crates/gitlawb-node/src/p2p/mod.rs) and the HTTP handler (crates/gitlawb-node/src/api/peers.rs) insert a ReceivedRefUpdate using a fresh id: Uuid::new_v4().
  • received_ref_updates only dedupes on id (PRIMARY KEY (id) and INSERT ... ON CONFLICT(id) DO NOTHING), so two different UUIDs will both persist; there’s no reader-side dedupe in list_ref_updates.
  • As a result, a single logical ref update can persist twice when a peer receives both the gossipsub delivery and the /api/v1/sync/notify fallback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/api/peers.rs` around lines 264 - 283, Currently both
the gossipsub path (crates::gitlawb_node::p2p::mod::/* handler */) and the HTTP
handler in api::peers.rs create ReceivedRefUpdate rows with fresh UUIDs, so the
DB dedupe on id fails and the same logical push can be stored twice; change id
generation so both paths produce the same deterministic id for the same logical
update (instead of Uuid::new_v4()). Implement a stable ID function (called from
both places) that computes a collision-resistant id from the canonical fields of
ReceivedRefUpdate (e.g., repo, ref_name, new_sha, old_sha, pusher_did,
timestamp) using a stable hash/UUID v5 strategy, then use that id when
constructing ReceivedRefUpdate before calling state.db.insert_ref_update; update
both the gossipsub handler and api::peers.rs to call this new id generator and
ensure insert_ref_update behavior remains unchanged.

Comment on lines +250 to +260
// Listen on both IPv4 (local/mDNS + any IPv4 dials) and IPv6 (required
// for Fly's 6PN inter-app network — <app>.internal DNS only returns AAAA
// records, so peers dial us via IPv6 and need a matching IPv6 socket).
let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?;
if let Err(e) = swarm.listen_on(v4) {
warn!(err = %e, "failed to listen on IPv4");
}
let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?;
if let Err(e) = swarm.listen_on(v6) {
warn!(err = %e, "failed to listen on IPv6");
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail startup when neither QUIC socket binds.

Line 254 and Line 258 only warn, so start() still succeeds when both listen_on calls fail. That leaves the node running with zero swarm listeners and no inbound P2P reachability while boot looks healthy.

Proposed fix
+    let mut listen_ok = false;
     let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?;
-    if let Err(e) = swarm.listen_on(v4) {
-        warn!(err = %e, "failed to listen on IPv4");
-    }
+    match swarm.listen_on(v4) {
+        Ok(_) => listen_ok = true,
+        Err(e) => warn!(err = %e, "failed to listen on IPv4"),
+    }
     let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?;
-    if let Err(e) = swarm.listen_on(v6) {
-        warn!(err = %e, "failed to listen on IPv6");
-    }
+    match swarm.listen_on(v6) {
+        Ok(_) => listen_ok = true,
+        Err(e) => warn!(err = %e, "failed to listen on IPv6"),
+    }
+    if !listen_ok {
+        return Err(anyhow::anyhow!(
+            "failed to bind any QUIC listen address on port {listen_port}"
+        ));
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Listen on both IPv4 (local/mDNS + any IPv4 dials) and IPv6 (required
// for Fly's 6PN inter-app network — <app>.internal DNS only returns AAAA
// records, so peers dial us via IPv6 and need a matching IPv6 socket).
let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?;
if let Err(e) = swarm.listen_on(v4) {
warn!(err = %e, "failed to listen on IPv4");
}
let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?;
if let Err(e) = swarm.listen_on(v6) {
warn!(err = %e, "failed to listen on IPv6");
}
let mut listen_ok = false;
let v4: Multiaddr = format!("/ip4/0.0.0.0/udp/{listen_port}/quic-v1").parse()?;
match swarm.listen_on(v4) {
Ok(_) => listen_ok = true,
Err(e) => warn!(err = %e, "failed to listen on IPv4"),
}
let v6: Multiaddr = format!("/ip6/::/udp/{listen_port}/quic-v1").parse()?;
match swarm.listen_on(v6) {
Ok(_) => listen_ok = true,
Err(e) => warn!(err = %e, "failed to listen on IPv6"),
}
if !listen_ok {
return Err(anyhow::anyhow!(
"failed to bind any QUIC listen address on port {listen_port}"
));
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/gitlawb-node/src/p2p/mod.rs` around lines 250 - 260, The current
start() logic only warns on failures from swarm.listen_on for v4 and v6 so the
node can start with zero listeners; change it to record both results and if both
calls fail return or propagate an error from start() (or panic) so startup fails
visibly. Specifically, capture the Result from swarm.listen_on(v4) and
swarm.listen_on(v6), log each failure as before, and after both attempts check
if both are Err; if so return an Err (or map one of the errors into a
StartError) from the start() function so callers cannot continue with no
listeners; reference the variables v4, v6, listen_port and the swarm.listen_on
calls to locate the code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant