Conversation
|
/run-security-scan |
alexcos20
left a comment
There was a problem hiding this comment.
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request significantly enhances the resilience of the Ocean Node's Elasticsearch connection. It introduces an event-driven mechanism to detect and react to database connection loss and restoration. The OceanIndexer now listens for these events, gracefully stopping and restarting chain indexers as the database connection fluctuates. The ElasticsearchClientSingleton has been refactored to remove its explicit retry loop in favor of continuous health monitoring that attempts single reconnections and emits events. Default Elasticsearch configuration parameters (timeouts, health check intervals) have also been adjusted to be more aggressive, leading to quicker detection of connection issues.
Comments:
• [ERROR][bug] The variable numCrawlAttempts is used here (numCrawlAttempts = 0) but does not appear to be defined or imported within this file's scope. This will lead to a runtime error. It is likely defined in src/components/Indexer/utils.ts and needs to be explicitly imported if it's a mutable shared variable. Please ensure numCrawlAttempts is correctly defined or imported.
• [WARNING][bug] The logic for sniffOnStart has been changed from process.env.ELASTICSEARCH_SNIFF_ON_START !== 'false' to process.env.ELASTICSEARCH_SNIFF_ON_START === 'false'. This inverts the default behavior. Previously, sniffOnStart was true by default unless explicitly set to 'false'. With the current change, if ELASTICSEARCH_SNIFF_ON_START is undefined or any value other than the exact string 'false', sniffOnStart will become false. This might unintentionally disable sniffing on start, which can impact the client's ability to discover available nodes. Please confirm if this inversion is intended; if not, consider reverting to !== 'false' or using process.env.ELASTICSEARCH_SNIFF_ON_START === 'true' for explicit opt-in.
• [WARNING][bug] Similar to sniffOnStart, the sniffInterval logic has changed from defaulting to 30000 (unless explicitly 'false') to process.env.ELASTICSEARCH_SNIFF_INTERVAL === 'true' ? 30000 : false. This means sniffInterval will be false by default if the environment variable is not explicitly set to 'true'. This also inverts the previous default behavior, potentially disabling periodic sniffing of cluster state. Please confirm if this change in default behavior is intentional.
• [INFO][other] When !isHealthy, the code attempts this.client.close() within a try-catch block that silences any errors. While client.close() is generally robust, it's good practice to log any errors that might occur during the close operation, even if they are ultimately handled by setting client = null. This can aid in debugging unexpected behavior.
• [INFO][performance] The removal of the startRetryConnection method with its exponential backoff and maxRetries in favor of a continuous fixed-interval retry within startHealthMonitoring represents a significant change in the reconnection strategy. The new approach continuously attempts reconnection at the healthCheckInterval rate until success, which is generally robust for transient and prolonged outages. However, it means the system will keep trying indefinitely without ever giving up, which could potentially consume resources during unrecoverable errors. This is a valid design choice, but it changes the resource consumption profile and error handling for persistent connection failures.
|
/run-security-scan |
alexcos20
left a comment
There was a problem hiding this comment.
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This PR significantly refactors Elasticsearch connection management to enhance robustness and introduce an event-driven mechanism. It replaces a previous explicit retry loop with a continuous health monitoring system that emits CONNECTION_LOST and CONNECTION_RESTORED events. The OceanIndexer now subscribes to these events to gracefully stop and restart all chain indexers upon database connectivity changes. Default Elasticsearch client configuration values (timeouts, max retries, health check interval) have also been adjusted for faster detection and response to connection issues.
Comments:
• [INFO][other] The numCrawlAttempts = 0 reset upon CONNECTION_RESTORED is a good addition to ensure the indexer gets a fresh start. It implicitly assumes numCrawlAttempts is a global variable or part of a shared state that needs to be cleared for proper re-initialization. Could add a comment clarifying its purpose if it's not immediately obvious to a new contributor.
• [INFO][performance] The getDatabase(true) call here is crucial for ensuring the Indexer uses a fresh database client instance after a reconnection. This leverages the new singleton behavior in ElasticsearchConfigHelper.ts that allows re-initialization. The comment for ES_CONNECTION_EVENTS.on(DB_EVENTS.CONNECTION_RESTORED) clarifies this in the other file. Adding a small comment here too, explaining why true is passed to getDatabase in this context, would make the code clearer.
• [WARNING][style] Good use of parseInt(..., 10) || 30000 for sniffInterval to ensure a valid number and handle potential NaN from parseInt. For consistency and robustness, consider applying this pattern (parseInt(process.env.VAR || 'default_value', 10) || default_value) to other parseInt calls within this DEFAULT_ELASTICSEARCH_CONFIG object as well (e.g., requestTimeout, pingTimeout, maxRetries, healthCheckInterval). This ensures that if an environment variable is set to a non-numeric string, it gracefully falls back to the default instead of resulting in NaN.
• [INFO][other] The comment explaining why an extra ping is skipped is very helpful. This prevents potential false errors and redundant operations, especially during rapid reconnect cycles.
• [INFO][performance] Changing return this.startRetryConnection(...) to throw new Error(...) is a key part of the new design. It ensures that initial calls to getClient() fail fast if the database is unhealthy, allowing the new event-driven health monitoring to take over reconnection. This is a sound change given the new architecture.
• [INFO][other] The if (this.isRetrying) { return } check within the setInterval callback is crucial for preventing multiple concurrent reconnection attempts if a health check fires while a reconnection is already in progress. This helps maintain a single source of truth for the connection state.
• [INFO][other] The connectionLostEmitted flag is well-placed to ensure that DB_EVENTS.CONNECTION_LOST is emitted only once per actual connection loss, avoiding redundant events for components listening to them.
Fixes #1211
Changes proposed in this PR: