
Stop user RPC on local DB too big #175

Open
ZanCorDX wants to merge 14 commits into main from dx/stop-rpc-on-disk-full

Conversation

@ZanCorDX (Collaborator) commented Feb 5, 2026

This PR adds a new parameter disk_max_size_to_accept_user_rpc_mb.
When the user RPC detects that the local DB backup size exceeds disk_max_size_to_accept_user_rpc_mb, it starts rejecting flow.
This allows us to configure it to guarantee we never lose archive flow.
It also adds a metric for disk_max_size_to_accept_user_rpc_mb to allow adding alarms easily.

@metachris requested a review from Copilot February 11, 2026 07:55
@metachris (Contributor)

If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.
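
For illustration, a readiness check along these lines could surface the same condition. This is only a sketch: the use of axum, the ReadyState type, and the field names are assumptions, not the code added in this PR.

use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

use axum::{extract::State, http::StatusCode};

/// Shared view of the local backup size and the configured threshold (illustrative names).
#[derive(Clone)]
struct ReadyState {
    backup_disk_size_bytes: Arc<AtomicU64>,
    disk_max_size_to_accept_user_rpc: u64,
}

/// `/readyz` handler: report NOT ready while user RPC is being rejected,
/// so the load balancer drops the instance until the local backup drains.
async fn readyz(State(state): State<ReadyState>) -> (StatusCode, &'static str) {
    if state.backup_disk_size_bytes.load(Ordering::Relaxed)
        > state.disk_max_size_to_accept_user_rpc
    {
        (StatusCode::SERVICE_UNAVAILABLE, "local backup DB too large")
    } else {
        (StatusCode::OK, "ok")
    }
}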

Copilot AI left a comment

Pull request overview

Adds a configurable safeguard to reject user RPC requests when the local ClickHouse backup DB grows beyond a configured size, protecting archive flow and exposing the threshold via metrics.

Changes:

  • Introduces disk_max_size_to_accept_user_rpc_mb CLI/config and propagates it into ingress runtime state.
  • Adds a DiskFull JSON-RPC error and rejects user RPC when ClickHouse local backup exceeds threshold.
  • Exposes configured threshold as a metric and adds an integration test covering reject→drain→accept behavior.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

Summary per file:

  • src/cli.rs: Adds the new CLI/env config for the max disk size threshold (MB)
  • src/lib.rs: Publishes the threshold metric and passes the threshold (in bytes) into ingress
  • src/metrics.rs: Adds a gauge metric for the configured threshold
  • src/jsonrpc.rs: Adds the DiskFull JSON-RPC error and its mappings
  • src/ingress/mod.rs: Enforces the threshold by rejecting requests based on shared disk-size state
  • src/indexer/click/mod.rs: Tracks the backup DB disk size globally, updates it from ClickHouse backup metrics, and adds an integration test
  • src/utils.rs: Adds helpers for generating bundles with predictable tx input sizes (test support)
  • Cargo.toml: Adds the parking_lot dependency


metrics::BUILD_INFO_METRICS.info(env!("CARGO_PKG_VERSION"), env!("GIT_HASH")).set(1);
metrics::CLICKHOUSE_METRICS
.disk_max_size_to_accept_user_rpc_bytes()
.set(args.disk_max_size_to_accept_user_rpc_mb * 1024 * 1024);

Copilot AI Feb 11, 2026

Converting *_mb to bytes via * 1024 * 1024 can overflow u64 for large CLI/env values and panic in debug builds (or wrap in release depending on build settings). Use checked_mul (and handle failure) or saturating_mul when computing bytes.

Suggested change
.set(args.disk_max_size_to_accept_user_rpc_mb * 1024 * 1024);
.set(args.disk_max_size_to_accept_user_rpc_mb.saturating_mul(1024 * 1024));
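
If the checked_mul route is taken instead, a minimal sketch could fail fast at startup with a clear message (illustrative, not the PR's code):

// Refuse to start when the configured MiB value cannot be represented in bytes.
let disk_max_size_to_accept_user_rpc_bytes = args
    .disk_max_size_to_accept_user_rpc_mb
    .checked_mul(1024 * 1024)
    .expect("disk-max-size-to-accept-user-rpc-mb is too large: MiB-to-bytes conversion overflows u64");
metrics::CLICKHOUSE_METRICS
    .disk_max_size_to_accept_user_rpc_bytes()
    .set(disk_max_size_to_accept_user_rpc_bytes);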

Comment on lines +269 to +277
/// Maximum local ClickHouse backup disk size in MB above which user RPC (e.g. eth_sendBundle)
/// is rejected with disk full. Defaults to 1024 MB (1 GiB).
#[clap(
long = "disk-max-size-to-accept-user-rpc-mb",
default_value_t = 1024,
env = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC",
id = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC"
)]
pub disk_max_size_to_accept_user_rpc_mb: u64,

Copilot AI Feb 11, 2026

The flag is documented as 'MB' but the runtime conversion uses 1024 * 1024 (MiB). Either update docs/flag naming to MiB (recommended for clarity) or change the conversion to use decimal MB (1_000_000) to match the current documentation. Also consider aligning the env var name with the unit (e.g., ..._MB or ..._MIB) to avoid ambiguity.
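
To make the unit ambiguity concrete, a standalone illustration (hypothetical constants, not code from this PR):

const MIB: u64 = 1024 * 1024; // binary mebibyte: what the conversion currently uses
const MB: u64 = 1_000_000; // decimal megabyte: what the flag docs currently say

fn main() {
    // With the default of 1024, the two readings differ by roughly 5%.
    assert_eq!(1024 * MIB, 1_073_741_824); // 1 GiB
    assert_eq!(1024 * MB, 1_024_000_000); // ~0.95 GiB
}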

Self::MethodNotFound(_) => StatusCode::NOT_FOUND,
Self::RateLimited => StatusCode::TOO_MANY_REQUESTS,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,

Copilot AI Feb 11, 2026

Mapping DiskFull to HTTP 500 makes it look like a server crash rather than a capacity/backpressure condition. Consider using 507 Insufficient Storage (closest semantic match) or 503 Service Unavailable so clients can distinguish this from internal failures.

Suggested change
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::DiskFull => StatusCode::INSUFFICIENT_STORAGE,

self.bundle_receipts_size = size;
}
pub(crate) fn disk_size(&self) -> u64 {
self.bundles_size + self.bundle_receipts_size

Copilot AI Feb 11, 2026

Summing two u64 sizes with + can overflow and wrap/panic depending on build settings. Using saturating_add (or checked_add with fallback/logging) avoids incorrect disk-size readings when sizes get large.

Suggested change
self.bundles_size + self.bundle_receipts_size
self.bundles_size.saturating_add(self.bundle_receipts_size)
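
The checked_add-with-logging variant mentioned above could look roughly like this (a sketch; it assumes a tracing-style logger is available):

pub(crate) fn disk_size(&self) -> u64 {
    // Clamp instead of wrapping, and leave a trace of the (unlikely) overflow.
    self.bundles_size
        .checked_add(self.bundle_receipts_size)
        .unwrap_or_else(|| {
            tracing::warn!("backup disk size overflowed u64, clamping to u64::MAX");
            u64::MAX
        })
}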

Comment on lines 161 to 164
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
    ingress.disk_max_size_to_accept_user_rpc
{
    return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);

Copilot AI Feb 11, 2026

This takes a global mutex on every user RPC request (hot path). Since the value is just a size snapshot, consider switching to atomics (e.g., AtomicU64 for total size, or two atomics and saturating_add on read) to avoid contention and reduce tail latency under load.

Suggested change
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
    ingress.disk_max_size_to_accept_user_rpc
{
    return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
if let Ok(backup_disk_size) = CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.try_lock() {
    if backup_disk_size.disk_size() > ingress.disk_max_size_to_accept_user_rpc {
        return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
    }
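
The atomics approach mentioned in the comment could look roughly like this; the static names and the helper are illustrative, not what the PR ships (which keeps this state behind a parking_lot mutex):

use std::sync::atomic::{AtomicU64, Ordering};

// Written by the ClickHouse backup task, read lock-free on the RPC hot path.
static BACKUP_BUNDLES_SIZE: AtomicU64 = AtomicU64::new(0);
static BACKUP_BUNDLE_RECEIPTS_SIZE: AtomicU64 = AtomicU64::new(0);

/// Lock-free total of the two tracked table sizes.
fn backup_disk_size() -> u64 {
    BACKUP_BUNDLES_SIZE
        .load(Ordering::Relaxed)
        .saturating_add(BACKUP_BUNDLE_RECEIPTS_SIZE.load(Ordering::Relaxed))
}

/// Hot-path check: two relaxed loads instead of taking a global mutex per request.
fn over_disk_threshold(disk_max_size_to_accept_user_rpc: u64) -> bool {
    backup_disk_size() > disk_max_size_to_accept_user_rpc
}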

(DISK_MAX_SIZE_TO_ACCEPT_USER_RPC_MB * 1024 * 1024) as usize /
(BUNDLE_TX_COUNT * BUNDLE_TX_INPUT_SIZE);

const FLOWPROXY_START_DELAY_MS: Duration = Duration::from_millis(800);

Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Comment on lines +692 to +696
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;

let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();


Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Suggested change
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let readiness_client = reqwest_client.clone();
tokio::time::timeout(FLOWPROXY_START_DELAY_MS, async move {
    loop {
        match readiness_client.get(&user_url).send().await {
            Ok(_) => break,
            Err(_) => tokio::time::sleep(Duration::from_millis(50)).await,
        }
    }
})
.await
.expect("flowproxy did not become ready in time");

(BUNDLE_TX_COUNT * BUNDLE_TX_INPUT_SIZE);

const FLOWPROXY_START_DELAY_MS: Duration = Duration::from_millis(800);
// Asume 100ms per bundle to clickhouse (it's a LOT)

Copilot AI Feb 11, 2026

Correct spelling of 'Asume' to 'Assume'.

Suggested change
// Asume 100ms per bundle to clickhouse (it's a LOT)
// Assume 100ms per bundle to clickhouse (it's a LOT)

@ZanCorDX (Collaborator, Author)

> If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.

Good point, added.
