retry connection lost by giurgiur99 · Pull Request #1213 · oceanprotocol/ocean-node

giurgiur99 · 2026-02-19T15:31:56Z

Fixes #1211

Changes proposed in this PR:

Retry es connection when db is down
Add debounce for multiple fired events
Stop indexer on lost connection
Adjust config timeouts

alexcos20 · 2026-02-20T11:05:01Z

/run-security-scan

alexcos20

AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request significantly enhances the resilience of the Ocean Node's Elasticsearch connection. It introduces an event-driven mechanism to detect and react to database connection loss and restoration. The OceanIndexer now listens for these events, gracefully stopping and restarting chain indexers as the database connection fluctuates. The ElasticsearchClientSingleton has been refactored to remove its explicit retry loop in favor of continuous health monitoring that attempts single reconnections and emits events. Default Elasticsearch configuration parameters (timeouts, health check intervals) have also been adjusted to be more aggressive, leading to quicker detection of connection issues.

Comments:
• [ERROR][bug] The variable numCrawlAttempts is used here (numCrawlAttempts = 0) but does not appear to be defined or imported within this file's scope. This will lead to a runtime error. It is likely defined in src/components/Indexer/utils.ts and needs to be explicitly imported if it's a mutable shared variable. Please ensure numCrawlAttempts is correctly defined or imported.
• [WARNING][bug] The logic for sniffOnStart has been changed from process.env.ELASTICSEARCH_SNIFF_ON_START !== 'false' to process.env.ELASTICSEARCH_SNIFF_ON_START === 'false'. This inverts the default behavior. Previously, sniffOnStart was true by default unless explicitly set to 'false'. With the current change, if ELASTICSEARCH_SNIFF_ON_START is undefined or any value other than the exact string 'false', sniffOnStart will become false. This might unintentionally disable sniffing on start, which can impact the client's ability to discover available nodes. Please confirm if this inversion is intended; if not, consider reverting to !== 'false' or using process.env.ELASTICSEARCH_SNIFF_ON_START === 'true' for explicit opt-in.
• [WARNING][bug] Similar to sniffOnStart, the sniffInterval logic has changed from defaulting to 30000 (unless explicitly 'false') to process.env.ELASTICSEARCH_SNIFF_INTERVAL === 'true' ? 30000 : false. This means sniffInterval will be false by default if the environment variable is not explicitly set to 'true'. This also inverts the previous default behavior, potentially disabling periodic sniffing of cluster state. Please confirm if this change in default behavior is intentional.
• [INFO][other] When !isHealthy, the code attempts this.client.close() within a try-catch block that silences any errors. While client.close() is generally robust, it's good practice to log any errors that might occur during the close operation, even if they are ultimately handled by setting client = null. This can aid in debugging unexpected behavior.
• [INFO][performance] The removal of the startRetryConnection method with its exponential backoff and maxRetries in favor of a continuous fixed-interval retry within startHealthMonitoring represents a significant change in the reconnection strategy. The new approach continuously attempts reconnection at the healthCheckInterval rate until success, which is generally robust for transient and prolonged outages. However, it means the system will keep trying indefinitely without ever giving up, which could potentially consume resources during unrecoverable errors. This is a valid design choice, but it changes the resource consumption profile and error handling for persistent connection failures.

giurgiur99 · 2026-02-20T12:47:47Z

/run-security-scan

alexcos20

AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This PR significantly refactors Elasticsearch connection management to enhance robustness and introduce an event-driven mechanism. It replaces a previous explicit retry loop with a continuous health monitoring system that emits CONNECTION_LOST and CONNECTION_RESTORED events. The OceanIndexer now subscribes to these events to gracefully stop and restart all chain indexers upon database connectivity changes. Default Elasticsearch client configuration values (timeouts, max retries, health check interval) have also been adjusted for faster detection and response to connection issues.

Comments:
• [INFO][other] The numCrawlAttempts = 0 reset upon CONNECTION_RESTORED is a good addition to ensure the indexer gets a fresh start. It implicitly assumes numCrawlAttempts is a global variable or part of a shared state that needs to be cleared for proper re-initialization. Could add a comment clarifying its purpose if it's not immediately obvious to a new contributor.
• [INFO][performance] The getDatabase(true) call here is crucial for ensuring the Indexer uses a fresh database client instance after a reconnection. This leverages the new singleton behavior in ElasticsearchConfigHelper.ts that allows re-initialization. The comment for ES_CONNECTION_EVENTS.on(DB_EVENTS.CONNECTION_RESTORED) clarifies this in the other file. Adding a small comment here too, explaining why true is passed to getDatabase in this context, would make the code clearer.
• [WARNING][style] Good use of parseInt(..., 10) || 30000 for sniffInterval to ensure a valid number and handle potential NaN from parseInt. For consistency and robustness, consider applying this pattern (parseInt(process.env.VAR || 'default_value', 10) || default_value) to other parseInt calls within this DEFAULT_ELASTICSEARCH_CONFIG object as well (e.g., requestTimeout, pingTimeout, maxRetries, healthCheckInterval). This ensures that if an environment variable is set to a non-numeric string, it gracefully falls back to the default instead of resulting in NaN.
• [INFO][other] The comment explaining why an extra ping is skipped is very helpful. This prevents potential false errors and redundant operations, especially during rapid reconnect cycles.
• [INFO][performance] Changing return this.startRetryConnection(...) to throw new Error(...) is a key part of the new design. It ensures that initial calls to getClient() fail fast if the database is unhealthy, allowing the new event-driven health monitoring to take over reconnection. This is a sound change given the new architecture.
• [INFO][other] The if (this.isRetrying) { return } check within the setInterval callback is crucial for preventing multiple concurrent reconnection attempts if a health check fires while a reconnection is already in progress. This helps maintain a single source of truth for the connection state.
• [INFO][other] The connectionLostEmitted flag is well-placed to ensure that DB_EVENTS.CONNECTION_LOST is emitted only once per actual connection loss, avoiding redundant events for components listening to them.

giurgiur99 added 4 commits February 19, 2026 14:50

retry connection lost

4f2c968

temporary debug check

4394210

revert nonce comment

6b84b8d

Merge branch 'main' into 1211-enhance-retry-es-connection

17f09cf

giurgiur99 marked this pull request as ready for review February 20, 2026 09:14

giurgiur99 requested review from alexcos20, andreip136, bogdanfazakas, denisiuriet and ndrpp as code owners February 20, 2026 09:14

alexcos20 reviewed Feb 20, 2026

View reviewed changes

revert sniff on start logic and log err

b64ed20

alexcos20 reviewed Feb 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

retry connection lost#1213

retry connection lost#1213
giurgiur99 wants to merge 5 commits intomainfrom
1211-enhance-retry-es-connection

giurgiur99 commented Feb 19, 2026 •

edited

Loading

Uh oh!

alexcos20 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Uh oh!

giurgiur99 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

giurgiur99 commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexcos20 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Choose a reason for hiding this comment

Uh oh!

giurgiur99 commented Feb 20, 2026

Uh oh!

alexcos20 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

giurgiur99 commented Feb 19, 2026 •

edited

Loading