
Stop user RPC on local DB too big #175

Open
ZanCorDX wants to merge 14 commits into main from dx/stop-rpc-on-disk-full

Conversation

@ZanCorDX (Collaborator) commented Feb 5, 2026

This PR adds a new parameter disk_max_size_to_accept_user_rpc_mb.
When the user RPC detects that the local DB backup size exceeds disk_max_size_to_accept_user_rpc_mb, it starts rejecting flow.
This allows us to configure it to guarantee we never lose archive flow.
It also adds a metric for disk_max_size_to_accept_user_rpc_mb to allow adding alarms easily.

@metachris requested a review from Copilot February 11, 2026 07:55
@metachris (Contributor)

If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.
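
For illustration, a readiness check along these lines could surface the same condition. This is only a sketch: the use of axum, the ReadyState type, and the field names are assumptions, not the code added in this PR.

use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

use axum::{extract::State, http::StatusCode};

/// Shared view of the local backup size and the configured threshold (illustrative names).
#[derive(Clone)]
struct ReadyState {
    backup_disk_size_bytes: Arc<AtomicU64>,
    disk_max_size_to_accept_user_rpc: u64,
}

/// `/readyz` handler: report NOT ready while user RPC is being rejected,
/// so the load balancer drops the instance until the local backup drains.
async fn readyz(State(state): State<ReadyState>) -> (StatusCode, &'static str) {
    if state.backup_disk_size_bytes.load(Ordering::Relaxed)
        > state.disk_max_size_to_accept_user_rpc
    {
        (StatusCode::SERVICE_UNAVAILABLE, "local backup DB too large")
    } else {
        (StatusCode::OK, "ok")
    }
}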

Copilot AI left a comment

Pull request overview

Adds a configurable safeguard to reject user RPC requests when the local ClickHouse backup DB grows beyond a configured size, protecting archive flow and exposing the threshold via metrics.

Changes:

  • Introduces disk_max_size_to_accept_user_rpc_mb CLI/config and propagates it into ingress runtime state.
  • Adds a DiskFull JSON-RPC error and rejects user RPC when ClickHouse local backup exceeds threshold.
  • Exposes configured threshold as a metric and adds an integration test covering reject→drain→accept behavior.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 8 comments.

Summary per file:

  • src/cli.rs: Adds the new CLI/env config for the max disk size threshold (MB)
  • src/lib.rs: Publishes the threshold metric and passes the threshold (in bytes) into ingress
  • src/metrics.rs: Adds a gauge metric for the configured threshold
  • src/jsonrpc.rs: Adds the DiskFull JSON-RPC error and its mappings
  • src/ingress/mod.rs: Enforces the threshold by rejecting requests based on shared disk-size state
  • src/indexer/click/mod.rs: Tracks the backup DB disk size globally, updates it from ClickHouse backup metrics, and adds an integration test
  • src/utils.rs: Adds helpers for generating bundles with predictable tx input sizes (test support)
  • Cargo.toml: Adds the parking_lot dependency


metrics::BUILD_INFO_METRICS.info(env!("CARGO_PKG_VERSION"), env!("GIT_HASH")).set(1);
metrics::CLICKHOUSE_METRICS
.disk_max_size_to_accept_user_rpc_bytes()
.set(args.disk_max_size_to_accept_user_rpc_mb * 1024 * 1024);

Copilot AI Feb 11, 2026

Converting *_mb to bytes via * 1024 * 1024 can overflow u64 for large CLI/env values and panic in debug builds (or wrap in release depending on build settings). Use checked_mul (and handle failure) or saturating_mul when computing bytes.

Suggested change
.set(args.disk_max_size_to_accept_user_rpc_mb * 1024 * 1024);
.set(args.disk_max_size_to_accept_user_rpc_mb.saturating_mul(1024 * 1024));
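
If the checked_mul route is taken instead, a minimal sketch could fail fast at startup with a clear message (illustrative, not the PR's code):

// Refuse to start when the configured MiB value cannot be represented in bytes.
let disk_max_size_to_accept_user_rpc_bytes = args
    .disk_max_size_to_accept_user_rpc_mb
    .checked_mul(1024 * 1024)
    .expect("disk-max-size-to-accept-user-rpc-mb is too large: MiB-to-bytes conversion overflows u64");
metrics::CLICKHOUSE_METRICS
    .disk_max_size_to_accept_user_rpc_bytes()
    .set(disk_max_size_to_accept_user_rpc_bytes);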

Comment on lines +269 to +277
/// Maximum local ClickHouse backup disk size in MB above which user RPC (e.g. eth_sendBundle)
/// is rejected with disk full. Defaults to 1024 MB (1 GiB).
#[clap(
long = "disk-max-size-to-accept-user-rpc-mb",
default_value_t = 1024,
env = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC",
id = "DISK_MAX_SIZE_TO_ACCEPT_USER_RPC"
)]
pub disk_max_size_to_accept_user_rpc_mb: u64,

Copilot AI Feb 11, 2026

The flag is documented as 'MB' but the runtime conversion uses 1024 * 1024 (MiB). Either update docs/flag naming to MiB (recommended for clarity) or change the conversion to use decimal MB (1_000_000) to match the current documentation. Also consider aligning the env var name with the unit (e.g., ..._MB or ..._MIB) to avoid ambiguity.
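
To make the unit ambiguity concrete, a standalone illustration (hypothetical constants, not code from this PR):

const MIB: u64 = 1024 * 1024; // binary mebibyte: what the conversion currently uses
const MB: u64 = 1_000_000; // decimal megabyte: what the flag docs currently say

fn main() {
    // With the default of 1024, the two readings differ by roughly 5%.
    assert_eq!(1024 * MIB, 1_073_741_824); // 1 GiB
    assert_eq!(1024 * MB, 1_024_000_000); // ~0.95 GiB
}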

Self::MethodNotFound(_) => StatusCode::NOT_FOUND,
Self::RateLimited => StatusCode::TOO_MANY_REQUESTS,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,

Copilot AI Feb 11, 2026

Mapping DiskFull to HTTP 500 makes it look like a server crash rather than a capacity/backpressure condition. Consider using 507 Insufficient Storage (closest semantic match) or 503 Service Unavailable so clients can distinguish this from internal failures.

Suggested change
Self::Internal | Self::Unknown(_) | Self::DiskFull => StatusCode::INTERNAL_SERVER_ERROR,
Self::Internal | Self::Unknown(_) => StatusCode::INTERNAL_SERVER_ERROR,
Self::DiskFull => StatusCode::INSUFFICIENT_STORAGE,

self.bundle_receipts_size = size;
}
pub(crate) fn disk_size(&self) -> u64 {
self.bundles_size + self.bundle_receipts_size

Copilot AI Feb 11, 2026

Summing two u64 sizes with + can overflow and wrap/panic depending on build settings. Using saturating_add (or checked_add with fallback/logging) avoids incorrect disk-size readings when sizes get large.

Suggested change
self.bundles_size + self.bundle_receipts_size
self.bundles_size.saturating_add(self.bundle_receipts_size)
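
The checked_add-with-logging variant mentioned above could look roughly like this (a sketch; it assumes a tracing-style logger is available):

pub(crate) fn disk_size(&self) -> u64 {
    // Clamp instead of wrapping, and leave a trace of the (unlikely) overflow.
    self.bundles_size
        .checked_add(self.bundle_receipts_size)
        .unwrap_or_else(|| {
            tracing::warn!("backup disk size overflowed u64, clamping to u64::MAX");
            u64::MAX
        })
}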

Comment on lines 161 to 164
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
    ingress.disk_max_size_to_accept_user_rpc
{
    return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);

Copilot AI Feb 11, 2026

This takes a global mutex on every user RPC request (hot path). Since the value is just a size snapshot, consider switching to atomics (e.g., AtomicU64 for total size, or two atomics and saturating_add on read) to avoid contention and reduce tail latency under load.

Suggested change
if CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.lock().disk_size() >
    ingress.disk_max_size_to_accept_user_rpc
{
    return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
if let Ok(backup_disk_size) = CLICKHOUSE_LOCAL_BACKUP_DISK_SIZE.try_lock() {
    if backup_disk_size.disk_size() > ingress.disk_max_size_to_accept_user_rpc {
        return JsonRpcResponse::error(Value::Null, JsonRpcError::DiskFull);
    }
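
The atomics approach mentioned in the comment could look roughly like this; the static names and the helper are illustrative, not what the PR ships (which keeps this state behind a parking_lot mutex):

use std::sync::atomic::{AtomicU64, Ordering};

// Written by the ClickHouse backup task, read lock-free on the RPC hot path.
static BACKUP_BUNDLES_SIZE: AtomicU64 = AtomicU64::new(0);
static BACKUP_BUNDLE_RECEIPTS_SIZE: AtomicU64 = AtomicU64::new(0);

/// Lock-free total of the two tracked table sizes.
fn backup_disk_size() -> u64 {
    BACKUP_BUNDLES_SIZE
        .load(Ordering::Relaxed)
        .saturating_add(BACKUP_BUNDLE_RECEIPTS_SIZE.load(Ordering::Relaxed))
}

/// Hot-path check: two relaxed loads instead of taking a global mutex per request.
fn over_disk_threshold(disk_max_size_to_accept_user_rpc: u64) -> bool {
    backup_disk_size() > disk_max_size_to_accept_user_rpc
}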

(DISK_MAX_SIZE_TO_ACCEPT_USER_RPC_MB * 1024 * 1024) as usize /
(BUNDLE_TX_COUNT * BUNDLE_TX_INPUT_SIZE);

const FLOWPROXY_START_DELAY_MS: Duration = Duration::from_millis(800);

Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Comment on lines +692 to +696
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;

let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();


Copilot AI Feb 11, 2026

The test uses a fixed sleep to wait for the server to start, which is a common source of flaky CI failures. Prefer polling a readiness condition (e.g., hitting /readyz or retrying the RPC until it responds) with a bounded timeout instead of sleeping a fixed duration.

Suggested change
tokio::time::sleep(FLOWPROXY_START_DELAY_MS).await;
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let reqwest_client = reqwest::Client::default();
let signer = PrivateKeySigner::random();
let readiness_client = reqwest_client.clone();
tokio::time::timeout(FLOWPROXY_START_DELAY_MS, async move {
    loop {
        match readiness_client.get(&user_url).send().await {
            Ok(_) => break,
            Err(_) => tokio::time::sleep(Duration::from_millis(50)).await,
        }
    }
})
.await
.expect("flowproxy did not become ready in time");

(BUNDLE_TX_COUNT * BUNDLE_TX_INPUT_SIZE);

const FLOWPROXY_START_DELAY_MS: Duration = Duration::from_millis(800);
// Asume 100ms per bundle to clickhouse (it's a LOT)

Copilot AI Feb 11, 2026

Correct spelling of 'Asume' to 'Assume'.

Suggested change
// Asume 100ms per bundle to clickhouse (it's a LOT)
// Assume 100ms per bundle to clickhouse (it's a LOT)

@ZanCorDX (Collaborator, Author)

> If it starts rejecting flow, does it also return an error on the readyz endpoint? It probably should, so it's removed from the load balancer for the time being.

Good point, added.
