
feat: add atlas-server observability metrics#47

Open
pthmas wants to merge 5 commits into main from pthmas/observability

Conversation

Collaborator

@pthmas pthmas commented Mar 31, 2026

Summary

  • add Prometheus-backed observability to atlas-server, including a shared metrics registry and a /metrics endpoint
  • expose richer health probes with /health/live and /health/ready
  • instrument HTTP requests, API errors, indexer/DA/metadata work, SSE connections, and DB pool gauges
  • add configurable text/json log formatting and thread observability state through server startup and test helpers

Summary by CodeRabbit

  • New Features

    • Added health endpoints (/health/live, /health/ready), a Prometheus metrics endpoint (/metrics), and a log format option (text/json).
    • Exposed SSE connection tracking and API error labeling in Prometheus output.
  • Refactor

    • Instrumented services and background workers with structured logging and broad metrics (HTTP, indexer, DA worker, metadata, DB pools, errors) and added HTTP middleware for request/latency/status.
  • Tests

    • Updated unit/integration tests and test harness to initialize and verify metrics and adjust DB pool usage.
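The SSE connection tracking mentioned above follows an RAII-guard pattern: a gauge is incremented when the guard is constructed and decremented when it is dropped, so a connection is always un-counted even if the stream ends with an error. A minimal, dependency-free sketch; an `AtomicI64` stands in for the real `metrics`-crate gauge, and the actual `SseConnectionGuard` in `src/metrics.rs` may differ:

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Toy stand-in for the SSE connection gauge. The real guard records to the
/// Prometheus registry; an atomic keeps this sketch self-contained.
pub struct SseConnectionGuard {
    gauge: Arc<AtomicI64>,
}

impl SseConnectionGuard {
    pub fn new(gauge: Arc<AtomicI64>) -> Self {
        gauge.fetch_add(1, Ordering::SeqCst); // connection opened
        Self { gauge }
    }
}

impl Drop for SseConnectionGuard {
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::SeqCst); // connection closed
    }
}
```

Holding the guard inside the event stream means the gauge drops back automatically when the client disconnects, with no explicit cleanup path needed.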


coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds Prometheus metrics and middleware, a Metrics module with SSE guard, new /metrics and health endpoints, threads Metrics into API/indexer/DA/metadata components, updates AppState and tests, and adds a log-format CLI option with structured logging adjustments.

Changes

Cohort / File(s) Summary
Workspace deps
backend/Cargo.toml, backend/crates/atlas-server/Cargo.toml
Add metrics = "0.24" and metrics-exporter-prometheus = "0.16"; change testcontainers entry to table with features = ["blocking"].
Metrics core
backend/crates/atlas-server/src/metrics.rs, backend/crates/atlas-server/src/lib.rs
New metrics module: installs Prometheus recorder, exposes Metrics, SseConnectionGuard, and http_metrics_middleware; exported via pub mod metrics.
API surface & router
backend/crates/atlas-server/src/api/mod.rs, .../handlers/mod.rs
Add metrics and prometheus_handle to AppState, register /metrics, /health/live, /health/ready, and insert HTTP metrics middleware into router layering.
Health & metrics handlers
backend/crates/atlas-server/src/api/handlers/health.rs, .../handlers/metrics.rs
Add liveness/readiness handlers (DB check + indexer staleness) and /metrics handler returning Prometheus text; include unit tests.
SSE changes
backend/crates/atlas-server/src/api/handlers/sse.rs
make_event_stream gains metrics: Option<Metrics> and uses SseConnectionGuard; block_events passes Some(state.metrics); tests updated to supply None where appropriate.
Api error instrumentation
backend/crates/atlas-server/src/api/error.rs
Add error_type(&AtlasError) helper and increment atlas_errors_total{component="api",error_type=...} in IntoResponse when applicable; unit test added.
Indexer / DA / metadata instrumentation
backend/crates/atlas-server/src/indexer/...
Thread Metrics into Indexer, DaWorker, MetadataFetcher, and fetcher functions; add numerous metric recordings and convert logs to structured fields.
Indexer helpers & batch
backend/crates/atlas-server/src/indexer/indexer.rs, .../batch.rs, .../status.rs
Add lag_blocks, latest_indexed_block, BlockBatch::last_block_timestamp, adjust latest_height_and_indexed_at to use helper; tests updated.
Main/init & tracing
backend/crates/atlas-server/src/main.rs
Install Prometheus recorder at startup, construct Metrics, wire into AppState and workers, spawn DB-pool sampler, and extend init_tracing to accept a format (text or json).
CLI/config
backend/crates/atlas-server/src/cli.rs, .../config.rs
Add --atlas.log.format / LOG_FORMAT to LogArgs (default "text"); tests updated to include format.
Integration tests & harness
backend/crates/atlas-server/tests/integration/common.rs, many integration tests, .../handlers/faucet.rs
Refactor test harness lazy init, change pool() to return PgPool, inject Metrics/prometheus_handle into test AppState, and update many tests to pass &PgPool.
Minor call-site fixes
backend/crates/atlas-server/tests/integration/*
Adjust many seeding helper calls to pass &pool to match signatures.
Logging-only edits
various files
Switch many logs to structured tracing fields; no control-flow changes.
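One of the changes above, the API error instrumentation, maps each AtlasError variant to a stable error_type Prometheus label. A reduced, self-contained sketch; the variant names follow the review discussion and may not match the real enum exactly:

```rust
/// Hypothetical reduced version of AtlasError for illustration only.
#[derive(Debug)]
pub enum AtlasError {
    Database(String),
    Internal(String),
    Config(String),
    Rpc(String),
    MetadataFetch(String),
}

/// Maps a variant to the label used in atlas_errors_total{error_type=...}.
pub fn error_type(err: &AtlasError) -> &'static str {
    match err {
        AtlasError::Database(_) => "database",
        AtlasError::Internal(_) => "internal",
        AtlasError::Config(_) => "config",
        AtlasError::Rpc(_) => "rpc_request",
        AtlasError::MetadataFetch(_) => "metadata_fetch",
    }
}
```

Keeping the label set small and static like this avoids unbounded label cardinality in the error counter.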

Sequence Diagrams

sequenceDiagram
  participant Client
  participant ReadinessHandler as /health/ready
  participant Database
  participant HeadTracker

  Client->>ReadinessHandler: GET /health/ready
  ReadinessHandler->>Database: SELECT 1
  alt DB unreachable
    Database-->>ReadinessHandler: Error
    ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"database unreachable"}
  else DB reachable
    Database-->>ReadinessHandler: OK
    ReadinessHandler->>HeadTracker: latest_indexed_block()
    HeadTracker-->>ReadinessHandler: Some(block,timestamp) or None
    alt None
      ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"indexer state unavailable"}
    else timestamp too old (>5m)
      ReadinessHandler-->>Client: 503 {"status":"not_ready","reason":"indexer stale: ...s ago"}
    else recent
      ReadinessHandler-->>Client: 200 {"status":"ready"}
    end
  end
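The readiness flow above boils down to an age check against a five-minute window once the DB ping succeeds. A self-contained sketch of just that comparison; the real handler also performs the DB ping and reads the head tracker, and this helper's name and signature are illustrative:

```rust
/// Returns the age of the last indexed block if it exceeds `max_age_secs`,
/// else None. Illustrative helper; the actual readiness handler inlines this.
pub fn staleness_secs(indexed_at_unix: u64, now_unix: u64, max_age_secs: u64) -> Option<u64> {
    let age = now_unix.saturating_sub(indexed_at_unix);
    if age > max_age_secs {
        Some(age) // maps to 503 {"status":"not_ready","reason":"indexer stale: ...s ago"}
    } else {
        None // maps to 200 {"status":"ready"}
    }
}
```

With the 5-minute window from the diagram, `max_age_secs` would be 300.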
sequenceDiagram
  participant Client
  participant Middleware as HTTP Metrics Middleware
  participant Router
  participant Prometheus as PrometheusHandle

  Client->>Middleware: HTTP Request
  Middleware->>Middleware: start timer, capture method/path via MatchedPath
  Middleware->>Router: forward request
  Router-->>Middleware: Response (status)
  Middleware->>Middleware: compute latency
  Middleware->>Prometheus: increment atlas_http_requests_total{method,path,status}
  Middleware->>Prometheus: observe atlas_http_request_duration_seconds{method,path}
  Middleware-->>Client: Return Response
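The middleware sequence above can be sketched without axum by using a toy in-memory registry in place of the Prometheus recorder. The label format mirrors the metric names in the diagram; everything else (the registry, function names) is illustrative:

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Toy in-memory registry standing in for the Prometheus recorder, so the
/// middleware ordering (start timer, forward, record) is visible.
#[derive(Default)]
pub struct Registry {
    pub counters: HashMap<String, u64>,
    pub durations: HashMap<String, Vec<f64>>,
}

/// Wraps a handler the way the HTTP metrics middleware does: start a timer,
/// run the handler, then record a count and latency labelled by method/path.
pub fn handle_with_metrics(
    reg: &mut Registry,
    method: &str,
    path: &str,
    handler: impl FnOnce() -> u16,
) -> u16 {
    let start = Instant::now();
    let status = handler(); // forward the request to the router
    let latency = start.elapsed().as_secs_f64();

    let counter_key = format!(
        "atlas_http_requests_total{{method=\"{method}\",path=\"{path}\",status=\"{status}\"}}"
    );
    *reg.counters.entry(counter_key).or_insert(0) += 1;

    let histo_key =
        format!("atlas_http_request_duration_seconds{{method=\"{method}\",path=\"{path}\"}}");
    reg.durations.entry(histo_key).or_default().push(latency);

    status
}
```

Labelling by the matched route pattern (as the real middleware does via MatchedPath) rather than the raw URL keeps label cardinality bounded.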

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • chore: audit changes #26 — Modifies IntoResponse/error handling in api/error.rs; overlaps with error-type mapping and Prometheus increments.
  • test: add API integration tests #34 — Changes integration test harness and testcontainers usage; closely related to test-harness refactor here.
  • feat: status page #16 — Touches AppState and status/indexer handling; overlaps with AppState extensions and readiness logic.

Suggested reviewers

  • tac0turtle

Poem

🐇 I hopped through code with metrics bright,

Counters ticking in the night,
Health and metrics now in sync,
SSE guards watch every link,
Dashboards gleam — a carrot-light! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 68.66%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title 'feat: add atlas-server observability metrics' accurately describes the main change: adding Prometheus-backed observability, including metrics, health endpoints, and instrumentation throughout the codebase.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@pthmas pthmas marked this pull request as ready for review March 31, 2026 16:22

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/crates/atlas-server/src/indexer/fetcher.rs (1)

132-150: ⚠️ Potential issue | 🟡 Minor

Missing record_rpc_request("error") on parse failure.

When JSON parsing fails (line 138), only metrics.error("rpc", "rpc_parse") is called, but metrics.record_rpc_request("error") is not. This is inconsistent with the HTTP failure path (lines 101-102) which records both. The atlas_indexer_rpc_requests_total metric will undercount failures when the HTTP request succeeds but the response cannot be parsed.

🛠️ Proposed fix
             Err(e) => {
                 let delay = RPC_RETRY_DELAYS
                     .get(attempt)
                     .copied()
                     .unwrap_or(*RPC_RETRY_DELAYS.last().unwrap_or(&30));

+                metrics.record_rpc_request("error");
                 metrics.error("rpc", "rpc_parse");
                 tracing::warn!(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/fetcher.rs` around lines 132 - 150,
The parse-failure branch in fetcher.rs calls metrics.error("rpc", "rpc_parse")
but omits metrics.record_rpc_request("error"), causing undercounted RPC
failures; update the Err(e) arm (around RPC_RETRY_DELAYS / last_error handling)
to also call metrics.record_rpc_request("error") (same place where the HTTP
failure path records both) before logging and sleeping so parse errors increment
atlas_indexer_rpc_requests_total consistently with other failures.
🧹 Nitpick comments (2)
backend/crates/atlas-server/src/cli.rs (1)

373-380: Constrain --atlas.log.format to accepted values at parse time.

The format field accepts any string value, causing invalid inputs to silently fall back to text format via the wildcard pattern in init_tracing(). Add value_parser to reject invalid values at CLI parse time instead.

Proposed change
     #[arg(
         long = "atlas.log.format",
         env = "LOG_FORMAT",
         default_value = "text",
+        value_parser = ["text", "json"],
         value_name = "FORMAT",
         help = "Log output format: text or json"
     )]
     pub format: String,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/cli.rs` around lines 373 - 380, The CLI
currently defines pub format: String in the CLI struct and accepts any string,
letting init_tracing() silently handle invalid values; update the #[arg(...)]
for the format field in backend/crates/atlas-server/src/cli.rs to add a
value_parser that restricts accepted values to "text" and "json" (e.g., use
clap's PossibleValuesParser or value_parser = ["text","json"]) so invalid inputs
are rejected at parse time and users get an immediate error instead of falling
back in init_tracing().
backend/crates/atlas-server/src/indexer/indexer.rs (1)

415-420: Batch duration includes wait time, not just processing time.

elapsed is measured from last_log_time, which is reset at the end of each batch. This means record_batch_duration captures the full cycle time including any sleep at chain head, not just active processing. If the intent is to measure processing time only, the timer should start at the beginning of each batch iteration.

However, measuring full cycle time may be intentional for understanding overall throughput. If this is the desired behavior, consider clarifying the metric description in metrics.rs (line 57-59) to indicate it includes idle time at chain head.
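If processing-only timing is the intent, the usual pattern is to time a closure around the batch work and record the duration before any sleep. A dependency-free sketch; the `timed` helper is illustrative, not the project's actual API:

```rust
use std::time::{Duration, Instant};

/// Times only the work inside the closure, so any sleep before or after the
/// call is excluded from the recorded duration.
pub fn timed<T>(work: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = work();
    (out, start.elapsed())
}
```

In the indexer loop this would look like calling `timed(|| process_batch(...))` at the top of each iteration, recording the returned duration, and only then sleeping at chain head.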

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/crates/atlas-server/src/api/error.rs`:
- Around line 54-70: Add same-file unit tests in a #[cfg(test)] mod tests block
that exercise the error-type branching for AtlasError by constructing
AtlasError::Database(...), AtlasError::Internal(...), AtlasError::Config(...),
AtlasError::Rpc(...), and AtlasError::MetadataFetch(...) and asserting the
mapped error_type matches "database", "internal", "config", "rpc_request", and
"metadata_fetch" respectively; if the match logic lives inside a private method
(the block with let error_type = match &self.0 { ... }), either call the public
API that triggers that method or add a small pub(crate) helper to expose the
classification so tests can call it directly, then run cargo test --workspace to
verify.

In `@backend/crates/atlas-server/src/api/handlers/health.rs`:
- Around line 14-60: Add a #[cfg(test)] mod tests block in this same file and
implement unit tests that call liveness() and readiness(...) to cover all
branches: a liveness test asserting Json(HealthResponse{status:"ok",..}), a
readiness test simulating DB failure by making sqlx query execution return Err
(mock or use a test pool) to assert SERVICE_UNAVAILABLE and reason contains
"database unreachable", a readiness test where state.head_tracker.latest()
returns a stale block (indexed_at > 5 minutes ago) asserting SERVICE_UNAVAILABLE
and "indexer stale" in the reason, and a readiness test with healthy DB and
fresh head asserting OK/"ready"; reference the functions liveness, readiness,
HealthResponse, AppState, state.pool and head_tracker.latest() to locate code
and use a minimal test AppState/Arc<AppState> setup or mocks to control DB and
head_tracker behavior.
- Around line 25-31: The readiness handler currently returns the raw DB error
string in the JSON body (from the sqlx::query(...).execute(&state.pool).await
error branch), which leaks internal details; instead, log the full error on the
server (e.g., with tracing::error! or the existing logger on state) and change
the HealthResponse reason to a generic message like "database unreachable" (or
None) while keeping StatusCode::SERVICE_UNAVAILABLE and status "not_ready";
locate the error branch around the sqlx::query execute call and replace the
Json(HealthResponse { reason: Some(format!("database unreachable: {e}")) }) with
a generic reason and a server-side error log that includes e.
- Around line 35-51: The current readiness check treats None from
state.head_tracker.latest().await as healthy; instead, when
head_tracker.latest() returns None you must fall back to the persisted indexer
state the same way status.rs does: call or obtain the equivalent indexer_state
(e.g., state.indexer_state or its accessor) to get the last indexed
block/timestamp, then compute age = now - indexed_at and return
SERVICE_UNAVAILABLE with the same "indexer stale" HealthResponse if age > 5
minutes; keep the existing path when head_tracker.latest() returns Some but
reuse the indexer_state fallback when head is empty so the health handler
mirrors status.rs behavior (refer to head_tracker.latest(), state.indexer_state,
and HealthResponse).

In `@backend/crates/atlas-server/src/api/handlers/metrics.rs`:
- Around line 6-8: Add a same-file unit test in a #[cfg(test)] mod tests block
that constructs a minimal Arc<AppState> with a Prometheus handle (matching how
AppState is normally built), calls the async metrics(State(state):
State<Arc<AppState>>) handler (awaiting it) and asserts the returned String
contains expected Prometheus output (e.g., non-empty or contains a known metric
name). Put the test in the same file (metrics.rs), reference AppState,
prometheus_handle.render() and the metrics handler to ensure the state wiring is
exercised; use tokio::test or an async test runner to await the handler.

In `@backend/crates/atlas-server/src/api/mod.rs`:
- Around line 49-52: The /metrics route is currently created as metrics_routes
and merged after the main TimeoutLayer so it bypasses the 10s timeout; update
the routing so that metrics_routes is wrapped with the same TimeoutLayer (10s,
StatusCode::REQUEST_TIMEOUT) or move its route back into the main Router that
the TimeoutLayer is applied to, ensuring handlers::metrics::metrics is not the
only non-SSE route running unbounded; make the same change for the other
instance referenced around lines 196-205 so all non-SSE routes are uniformly
wrapped by TimeoutLayer.

---

Outside diff comments:
In `@backend/crates/atlas-server/src/indexer/fetcher.rs`:
- Around line 132-150: The parse-failure branch in fetcher.rs calls
metrics.error("rpc", "rpc_parse") but omits metrics.record_rpc_request("error"),
causing undercounted RPC failures; update the Err(e) arm (around
RPC_RETRY_DELAYS / last_error handling) to also call
metrics.record_rpc_request("error") (same place where the HTTP failure path
records both) before logging and sleeping so parse errors increment
atlas_indexer_rpc_requests_total consistently with other failures.

---

Nitpick comments:
In `@backend/crates/atlas-server/src/cli.rs`:
- Around line 373-380: The CLI currently defines pub format: String in the CLI
struct and accepts any string, letting init_tracing() silently handle invalid
values; update the #[arg(...)] for the format field in
backend/crates/atlas-server/src/cli.rs to add a value_parser that restricts
accepted values to "text" and "json" (e.g., use clap's PossibleValuesParser or
value_parser = ["text","json"]) so invalid inputs are rejected at parse time and
users get an immediate error instead of falling back in init_tracing().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7e9015e8-37ad-403a-9af0-b066ffd16ce6

📥 Commits

Reviewing files that changed from the base of the PR and between 1f86eb2 and df1ebb2.

📒 Files selected for processing (21)
  • backend/Cargo.toml
  • backend/crates/atlas-server/Cargo.toml
  • backend/crates/atlas-server/src/api/error.rs
  • backend/crates/atlas-server/src/api/handlers/faucet.rs
  • backend/crates/atlas-server/src/api/handlers/health.rs
  • backend/crates/atlas-server/src/api/handlers/metrics.rs
  • backend/crates/atlas-server/src/api/handlers/mod.rs
  • backend/crates/atlas-server/src/api/handlers/sse.rs
  • backend/crates/atlas-server/src/api/handlers/status.rs
  • backend/crates/atlas-server/src/api/mod.rs
  • backend/crates/atlas-server/src/cli.rs
  • backend/crates/atlas-server/src/config.rs
  • backend/crates/atlas-server/src/indexer/da_worker.rs
  • backend/crates/atlas-server/src/indexer/evnode.rs
  • backend/crates/atlas-server/src/indexer/fetcher.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/indexer/metadata.rs
  • backend/crates/atlas-server/src/lib.rs
  • backend/crates/atlas-server/src/main.rs
  • backend/crates/atlas-server/src/metrics.rs
  • backend/crates/atlas-server/tests/integration/common.rs


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
backend/crates/atlas-server/tests/integration/common.rs (1)

117-127: Consider logging test skips more visibly in CI.

Tests are silently skipped (only eprintln!) when the database environment cannot be initialized. In CI pipelines, this could mask configuration issues. Consider using #[ignore] attribute with a custom test harness or ensuring CI always has ATLAS_TEST_DATABASE_URL set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 117 -
127, The run function currently skips tests silently using eprintln! when ENV is
Err, which can hide CI configuration issues; update run (referencing run and
ENV) to (1) emit a visible log (use the tracing or log crate instead of
eprintln!) so CI logs show the skip, and (2) if running in CI (detect via
std::env::var("CI") or "GITHUB_ACTIONS"), make the test fail early (panic! with
a clear message referencing ATLAS_TEST_DATABASE_URL) instead of skipping so CI
surfaces the missing configuration; otherwise keep the local skip behavior for
developer runs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/crates/atlas-server/tests/integration/common.rs`:
- Around line 117-127: The run function currently skips tests silently using
eprintln! when ENV is Err, which can hide CI configuration issues; update run
(referencing run and ENV) to (1) emit a visible log (use the tracing or log
crate instead of eprintln!) so CI logs show the skip, and (2) if running in CI
(detect via std::env::var("CI") or "GITHUB_ACTIONS"), make the test fail early
(panic! with a clear message referencing ATLAS_TEST_DATABASE_URL) instead of
skipping so CI surfaces the missing configuration; otherwise keep the local skip
behavior for developer runs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0fa92730-4a14-4864-b25f-20bbc3a0eaf9

📥 Commits

Reviewing files that changed from the base of the PR and between df1ebb2 and 9675bff.

📒 Files selected for processing (15)
  • backend/Cargo.toml
  • backend/crates/atlas-server/src/api/error.rs
  • backend/crates/atlas-server/src/api/handlers/health.rs
  • backend/crates/atlas-server/src/api/handlers/metrics.rs
  • backend/crates/atlas-server/src/api/handlers/status.rs
  • backend/crates/atlas-server/src/api/mod.rs
  • backend/crates/atlas-server/tests/integration/addresses.rs
  • backend/crates/atlas-server/tests/integration/blocks.rs
  • backend/crates/atlas-server/tests/integration/common.rs
  • backend/crates/atlas-server/tests/integration/nfts.rs
  • backend/crates/atlas-server/tests/integration/schema.rs
  • backend/crates/atlas-server/tests/integration/search.rs
  • backend/crates/atlas-server/tests/integration/status.rs
  • backend/crates/atlas-server/tests/integration/tokens.rs
  • backend/crates/atlas-server/tests/integration/transactions.rs
✅ Files skipped from review due to trivial changes (1)
  • backend/crates/atlas-server/tests/integration/addresses.rs
🚧 Files skipped from review as they are similar to previous changes (5)
  • backend/Cargo.toml
  • backend/crates/atlas-server/src/api/handlers/metrics.rs
  • backend/crates/atlas-server/src/api/error.rs
  • backend/crates/atlas-server/src/api/handlers/status.rs
  • backend/crates/atlas-server/src/api/handlers/health.rs


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
backend/crates/atlas-server/tests/integration/common.rs (2)

117-127: Silent test skipping may hide CI failures.

When ENV initialization fails, the test prints to stderr and returns successfully. In test frameworks, this counts as a passing test. If Docker is unavailable and ATLAS_TEST_DATABASE_URL isn't set in CI, all integration tests will appear to pass despite not running.

Consider making skips explicit:

Option 1: Panic to fail explicitly in CI
 pub fn run<F: std::future::Future<Output = ()>>(f: F) {
     if let Err(error) = ENV.as_ref() {
-        eprintln!("skipping integration test: {error}");
-        return;
+        if env::var("CI").is_ok() {
+            panic!("integration test environment unavailable in CI: {error}");
+        } else {
+            eprintln!("skipping integration test: {error}");
+            return;
+        }
     }
Option 2: Use #[ignore] on integration tests and run explicitly in CI

Mark integration tests with #[ignore] and run them in CI with cargo test -- --ignored.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 117 -
127, The current run function quietly returns when ENV.as_ref() is Err, causing
tests to be counted as passing; change this to fail loudly by replacing the
silent eprintln!/return with a panic! (or assert!) that includes the ENV
initialization error and a hint about setting ATLAS_TEST_DATABASE_URL or
enabling Docker; update the logic inside run (the ENV.as_ref() branch) to call
panic!("integration test skipped due to ENV init error: {error}. Set
ATLAS_TEST_DATABASE_URL or enable Docker to run integration tests.") so failing
CI will surface missing integration setup.

91-93: Prometheus recorder is built but not installed globally.

The recorder is created with build_recorder() and its handle is extracted, but the recorder itself is dropped without being installed. Per the Metrics::new() docstring in src/metrics.rs, it "Must be called after install_prometheus_recorder()".

For tests that don't verify metrics output, this works fine (types compile, code runs). However, if any test tries to verify metrics via prometheus_handle.render(), it won't see any registered metrics since Metrics::new() on line 110 registers with the default no-op recorder.

If metrics verification is not needed in tests, consider adding a brief comment explaining this is intentional. Otherwise, install the recorder:

Suggested fix if metrics verification is needed
-    let prometheus_handle = metrics_exporter_prometheus::PrometheusBuilder::new()
-        .build_recorder()
-        .handle();
+    let recorder = metrics_exporter_prometheus::PrometheusBuilder::new()
+        .build_recorder();
+    let prometheus_handle = recorder.handle();
+    // Note: install() can only be called once per process; tests may conflict
+    let _ = recorder.install();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/common.rs` around lines 91 -
93, The Prometheus recorder returned by PrometheusBuilder::build_recorder() is
being dropped (only the handle is kept) so Metrics::new() registers against the
default no-op recorder; capture both the recorder and handle (let (recorder,
prometheus_handle) =
metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder()), then
call install_prometheus_recorder(recorder) (the function referenced in
src/metrics.rs) before calling Metrics::new() so tests that call
prometheus_handle.render() will see registered metrics; alternatively, if you
intentionally don't verify metrics in these tests, add a brief comment next to
the current code explaining that dropping the recorder is intentional.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/crates/atlas-server/tests/integration/common.rs`:
- Around line 117-127: The current run function quietly returns when
ENV.as_ref() is Err, causing tests to be counted as passing; change this to fail
loudly by replacing the silent eprintln!/return with a panic! (or assert!) that
includes the ENV initialization error and a hint about setting
ATLAS_TEST_DATABASE_URL or enabling Docker; update the logic inside run (the
ENV.as_ref() branch) to call panic!("integration test skipped due to ENV init
error: {error}. Set ATLAS_TEST_DATABASE_URL or enable Docker to run integration
tests.") so failing CI will surface missing integration setup.
- Around line 91-93: The Prometheus recorder returned by
PrometheusBuilder::build_recorder() is being dropped (only the handle is kept)
so Metrics::new() registers against the default no-op recorder; capture both the
recorder and handle (let (recorder, prometheus_handle) =
metrics_exporter_prometheus::PrometheusBuilder::new().build_recorder()), then
call install_prometheus_recorder(recorder) (the function referenced in
src/metrics.rs) before calling Metrics::new() so tests that call
prometheus_handle.render() will see registered metrics; alternatively, if you
intentionally don't verify metrics in these tests, add a brief comment next to
the current code explaining that dropping the recorder is intentional.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 597a4861-10cf-4f61-9db2-907b37575796

📥 Commits

Reviewing files that changed from the base of the PR and between 9675bff and 5f3ada6.

📒 Files selected for processing (1)
  • backend/crates/atlas-server/tests/integration/common.rs

@pthmas pthmas requested a review from tac0turtle April 1, 2026 15:19

auricom commented Apr 1, 2026

This is a nice set of metrics! I'd suggest adding a few more so we get a bit more insight into indexing performance.

atlas_indexer_missing_blocks (Gauge)

Tracks the current count of blocks known to be missing from the index and pending backfill. Must be set to 0 when fully caught up.

atlas_indexer_block_processing_duration_seconds (Histogram)

CodeRabbit flagged that atlas_indexer_batch_duration_seconds includes idle/sleep time between batches, not just processing time. The p95 would be inflated and not really usable for a "processing time" alert.

This new histogram should be timed from batch start to write completion, before any sleep/wait. Suggested buckets: .001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5.

atlas_db_pool_checkout_duration_seconds (Histogram, label: pool)

Time from requesting a connection from the pool to receiving one. Must be instrumented at every pool.acquire() / pool.get() call site.
Suggested buckets (fine-grained sub-millisecond): .0001, .0005, .001, .005, .01, .025, .05, .1.

atlas_indexer_head_block_timestamp_seconds (Gauge)

Unix timestamp (seconds) of the most recently indexed block — specifically the chain-level block timestamp, not when it was processed. Updated every time atlas_indexer_head_block is updated.
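Prometheus histogram buckets are cumulative upper bounds: each observation counts toward every bucket whose bound it does not exceed. A small sketch showing which bucket an observation first lands in, using the processing-duration bounds suggested above (the constant and function names are made up for illustration):

```rust
/// Bucket upper bounds suggested above for the processing-duration histogram.
const PROCESSING_BUCKETS: &[f64] = &[
    0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0,
];

/// Index of the first bucket whose upper bound covers `value`; returning
/// bounds.len() means the implicit +Inf bucket.
pub fn bucket_index(bounds: &[f64], value: f64) -> usize {
    bounds.iter().position(|&b| value <= b).unwrap_or(bounds.len())
}
```

In a real cumulative export, an observation at 0.03s would also increment every bucket with a bound of 0.05 or larger, not just the first one it fits in.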

…uration metrics

- Track indexer lag (chain head vs indexed head) as a gauge
- Track known missing blocks from failed_blocks table
- Add block processing duration histogram (excluding idle sleep)
- Add head block timestamp gauge
- Make install_prometheus_recorder idempotent via OnceLock
- Add custom histogram buckets for processing duration

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/crates/atlas-server/src/indexer/indexer.rs (1)

407-434: ⚠️ Potential issue | 🟠 Major

Refresh atlas_indexer_missing_blocks from failed_blocks after these upserts.

known_missing_blocks += failed_blocks.len() assumes every INSERT ... ON CONFLICT added a new row. If a block is already present, the gauge grows even though table cardinality did not, so the exported missing-block count drifts upward. Re-query the count after the loop, or derive it from insert-vs-update results instead of incrementing the cached value.

Proposed fix
-                    known_missing_blocks += failed_blocks.len() as u64;
-                    self.metrics
-                        .set_indexer_missing_blocks(known_missing_blocks);
+                    known_missing_blocks = self.get_missing_block_count().await?;
+                    self.metrics
+                        .set_indexer_missing_blocks(known_missing_blocks);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/indexer.rs` around lines 407 - 434,
The code increments known_missing_blocks by failed_blocks.len() after performing
INSERT ... ON CONFLICT upserts, which overcounts when rows already exist;
instead, after executing the upserts in the loop (the sqlx::query that INSERTs
INTO failed_blocks ... ON CONFLICT), re-query the actual cardinality of the
failed_blocks table (e.g. SELECT COUNT(*) FROM failed_blocks) and call
self.metrics.set_indexer_missing_blocks with that result (or track per-upsert
whether rows were newly inserted vs updated and adjust known_missing_blocks
accordingly); update the logic around known_missing_blocks and
set_indexer_missing_blocks to use the authoritative count rather than adding
failed_blocks.len().
🧹 Nitpick comments (1)
backend/crates/atlas-server/src/indexer/batch.rs (1)

289-296: Add an empty-batch test for the None path

Line 295 validates the happy path; please also assert behavior when b_timestamps is empty to lock the contract.

✅ Suggested test addition
 #[test]
 fn last_block_timestamp_returns_latest_collected_timestamp() {
     let mut batch = BlockBatch::new();
     batch.b_timestamps.push(1_700_000_001);
     batch.b_timestamps.push(1_700_000_042);

     assert_eq!(batch.last_block_timestamp(), Some(1_700_000_042));
 }
+
+#[test]
+fn last_block_timestamp_returns_none_for_empty_batch() {
+    let batch = BlockBatch::new();
+    assert_eq!(batch.last_block_timestamp(), None);
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/batch.rs` around lines 289 - 296, Add
a unit test asserting the None path for an empty BlockBatch: create a new
BlockBatch (do not push any values into b_timestamps) and assert that
BlockBatch::last_block_timestamp() returns None to lock the contract; add this
alongside the existing last_block_timestamp_returns_latest_collected_timestamp
test and reference the BlockBatch type and the b_timestamps field in the test
name or body for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/crates/atlas-server/src/indexer/indexer.rs`:
- Around line 323-328: The metrics are being driven from end_block instead of
the persisted watermark, so if batch.tail remains in failed_blocks after retries
we report an advanced head/counts incorrectly; before calling write_batch(&mut
copy_client, batch, true) capture the current persisted watermark/head and count
from indexer_state (the actual persisted head/timestamp/counter), then after
write_batch (and after folding in any single-block retry successes that may have
been persisted) read the persisted head/count again and use those values to
update metrics via metrics.record_db_write_duration,
record_block_processing_duration and the exported counters/gauges (e.g.,
atlas_indexer_head_block and atlas_indexer_head_block_timestamp_seconds) instead
of using batch.last_block or end_block; apply the same change in the other
occurrence referenced (around the 438–448 block).

In `@backend/crates/atlas-server/src/metrics.rs`:
- Around line 319-324: The idempotence test
install_prometheus_recorder_is_idempotent currently asserts render() is
non-empty without emitting any metrics, which makes it order-dependent; modify
the test to emit a metric after acquiring the two recorder handles (from
install_prometheus_recorder())—for example increment a simple counter or record
a gauge—then call render() on both handles and assert the rendered output
contains the emitted metric (or that both renders are equal/non-empty). This
mirrors the approach used in new_metrics_render_when_emitted and ensures
install_prometheus_recorder() is truly idempotent in isolation.

---

Outside diff comments:
In `@backend/crates/atlas-server/src/indexer/indexer.rs`:
- Around line 407-434: The code increments known_missing_blocks by
failed_blocks.len() after performing INSERT ... ON CONFLICT upserts, which
overcounts when rows already exist; instead, after executing the upserts in the
loop (the sqlx::query that INSERTs INTO failed_blocks ... ON CONFLICT), re-query
the actual cardinality of the failed_blocks table (e.g. SELECT COUNT(*) FROM
failed_blocks) and call self.metrics.set_indexer_missing_blocks with that result
(or track per-upsert whether rows were newly inserted vs updated and adjust
known_missing_blocks accordingly); update the logic around known_missing_blocks
and set_indexer_missing_blocks to use the authoritative count rather than adding
failed_blocks.len().

---

Nitpick comments:
In `@backend/crates/atlas-server/src/indexer/batch.rs`:
- Around line 289-296: Add a unit test asserting the None path for an empty
BlockBatch: create a new BlockBatch (do not push any values into b_timestamps)
and assert that BlockBatch::last_block_timestamp() returns None to lock the
contract; add this alongside the existing
last_block_timestamp_returns_latest_collected_timestamp test and reference the
BlockBatch type and the b_timestamps field in the test name or body for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e2d9f746-45e0-4930-a433-c0736a2364b8

📥 Commits

Reviewing files that changed from the base of the PR and between 5f3ada6 and b629530.

📒 Files selected for processing (3)
  • backend/crates/atlas-server/src/indexer/batch.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/metrics.rs

- Capture batch.last_block before write_batch consumes the batch and
  use it for set_indexer_head_block and indexed_head; if the tail block
  landed in failed_blocks the persisted watermark is lower than end_block,
  so using end_block would under-report lag and misalign the timestamp gauge
- Strengthen install_prometheus_recorder_is_idempotent: emit a metric
  inside the test so render() is non-empty regardless of execution order,
  then assert both handles see the same output

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
backend/crates/atlas-server/src/metrics.rs (1)

130-133: Consider adding atlas_db_pool_checkout_duration_seconds histogram.

The PR comments suggested a histogram to measure DB connection checkout latency (pool.acquire() timing). The current implementation includes pool size/idle/max gauges but not the checkout duration metric. This would help diagnose connection pool contention.

If this is intentionally deferred, consider adding a TODO comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/metrics.rs` around lines 130 - 133, The
metrics file currently only registers gauges (describe_gauge! for
atlas_db_pool_size, atlas_db_pool_idle, atlas_db_pool_max); add a histogram
registration for DB checkout latency (e.g., describe_histogram! for
"atlas_db_pool_checkout_duration_seconds" with an appropriate help string) and
then instrument the DB pool acquisition path (wrap the pool.acquire()/checkout
logic to observe/start/stop the histogram) to record checkout latency; if you
intentionally want to defer recording, add a clear TODO next to the existing
gauge registrations mentioning the missing checkout-duration histogram and where
it should be measured (pool.acquire).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/crates/atlas-server/src/metrics.rs`:
- Around line 130-133: The metrics file currently only registers gauges
(describe_gauge! for atlas_db_pool_size, atlas_db_pool_idle, atlas_db_pool_max);
add a histogram registration for DB checkout latency (e.g., describe_histogram!
for "atlas_db_pool_checkout_duration_seconds" with an appropriate help string)
and then instrument the DB pool acquisition path (wrap the
pool.acquire()/checkout logic to observe/start/stop the histogram) to record
checkout latency; if you intentionally want to defer recording, add a clear TODO
next to the existing gauge registrations mentioning the missing
checkout-duration histogram and where it should be measured (pool.acquire).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c9c1d619-ba10-47e5-97e5-6766899400a9

📥 Commits

Reviewing files that changed from the base of the PR and between b629530 and 948d0fc.

📒 Files selected for processing (2)
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/metrics.rs

@pthmas
Collaborator Author

pthmas commented Apr 2, 2026

@auricom
Thanks for the review! I implemented all the metrics except for this one:

atlas_db_pool_checkout_duration_seconds (Histogram, label: pool)

Time from requesting a connection from the pool to receiving one. Must be instrumented at every pool.acquire() / pool.get() call site. Suggested buckets (fine-grained sub-millisecond): .0001, .0005, .001, .005, .01, .025, .05, .1.

I would skip this for now unless it's a hard requirement. Because we use sqlx calls like sqlx::query_as(...).fetch_one(&state.pool).await?, the pool checkout wait happens under the hood, so it's hard to measure without a small refactor that adds some complexity to the code:

query.fetch_one(&pool).await

would become:

let start = Instant::now();
let mut conn = pool.acquire().await?;
metrics.observe_checkout(start.elapsed(), "api");
query.fetch_one(&mut *conn).await
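For reference, the refactor sketched above amounts to a timed wrapper around acquisition. Here is a self-contained stand-in that shows the shape of that wrapper without sqlx or async; fake_acquire simulates pool.acquire(), observe_checkout stands in for the metrics call, and all names are illustrative, not the actual atlas-server API:

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Simulates a pool checkout that takes a few milliseconds under contention.
fn fake_acquire() {
    sleep(Duration::from_millis(5));
}

// Stand-in for the metrics histogram observation (e.g. observe_checkout).
fn observe_checkout(elapsed: Duration, pool: &str) {
    println!("checkout on pool '{pool}' took {:?}", elapsed);
}

// The wrapper: time the acquisition, report it, return the elapsed time.
fn acquire_timed(pool: &str) -> Duration {
    let start = Instant::now();
    fake_acquire(); // real code: let mut conn = pool.acquire().await?;
    let elapsed = start.elapsed();
    observe_checkout(elapsed, pool);
    elapsed
}

fn main() {
    let elapsed = acquire_timed("api");
    assert!(elapsed >= Duration::from_millis(5));
}
```

The cost pthmas describes is real: every fetch_one(&pool) call site would have to switch to explicit acquire-then-query, which is why deferring the metric is a reasonable trade-off.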

