
fix: eliminate provider leak and migrate to shared logger/error packages #73

Open

MaximusHaximus wants to merge 5 commits into master from fix/pos-client-provider-leak
Conversation


@MaximusHaximus MaximusHaximus commented Mar 28, 2026

Problem

The service was OOMKilling repeatedly in production (72+ restarts on the current pod, observed since March 25). Kubernetes working-set memory grew monotonically under load until the container hit its 2 GiB limit.

Root cause: every proof generation request — including every retry — called initMatic(), which created two new ethers.js JsonRpcProvider instances and a new POSClient. With maxRetries = 2 × rpcCount and concurrent requests, hundreds of provider instances accumulated between GC cycles. Each holds an HTTP connection pool, event emitters, and (for JsonRpcProvider) a 4-second background polling timer. The GC can reclaim the JS objects but the kernel-mapped socket buffers and libuv handles persist far longer, which is why the leak showed clearly in kubernetes.memory.working_set but was invisible in V8 heap metrics.

Related: 0xPolygon/dry-dock#958

Changes

Fix: provider leak

src/maticClient.ts (moved from src/helpers/maticClient.ts)

  • Cache POSClient instances by (network, version, maticRPC, ethereumRPC) key for the lifetime of the process. All requests and retries share the same initialised client.
  • Use StaticJsonRpcProvider instead of JsonRpcProvider. The static variant makes no background block-polling calls and holds no cached chain state — safe to keep alive indefinitely. The regular provider's polling loop was accumulating memory even in cached long-lived instances.
  • Cache stores Promise<POSClient> so concurrent requests for the same key share a single initialisation rather than racing. Failed initialisations are evicted so the next request retries.
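The caching pattern described above can be sketched as a generic promise cache (a sketch only; `getCached` and `Init` are illustrative names, not the actual maticClient.ts API):

```typescript
// Illustrative promise-cache: concurrent callers for the same key share one
// in-flight initialisation, and failed initialisations are evicted so the
// next caller retries.
type Init<T> = () => Promise<T>;

const cache = new Map<string, Promise<unknown>>();

function getCached<T>(key: string, init: Init<T>): Promise<T> {
  const hit = cache.get(key);
  if (hit) return hit as Promise<T>;
  // Store the promise itself (not the resolved value) so racing callers
  // all await the same initialisation.
  const pending = init().catch((err) => {
    // Evict on failure so a later request can retry initialisation.
    cache.delete(key);
    throw err;
  });
  cache.set(key, pending);
  return pending;
}
```

Because the map holds the promise rather than the value, two requests arriving in the same tick never both call `init()`.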

src/services/v1ProofGenerationServices.ts

  • Remove the 1-second setTimeout between RPC retries. The retry immediately advances to a different provider endpoint — waiting only prolongs how long concurrent requests hold open connections and associated kernel memory.
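The retry shape this produces can be sketched as follows (hedged: `withRpcFailover` and its parameters are illustrative, not the service's actual retry loop):

```typescript
// Illustrative failover loop: a failed attempt advances straight to the next
// RPC endpoint, with no sleep in between.
async function withRpcFailover<T>(
  endpoints: string[],
  attempt: (endpoint: string) => Promise<T>,
  maxRetries: number = endpoints.length * 2 // mirrors maxRetries = 2 × rpcCount
): Promise<T> {
  let lastErr: unknown = new Error('no RPC endpoints configured');
  for (let i = 0; i < maxRetries; i += 1) {
    const endpoint = endpoints[i % endpoints.length];
    if (endpoint === undefined) break; // noUncheckedIndexedAccess guard
    try {
      return await attempt(endpoint);
    } catch (err) {
      lastErr = err; // no setTimeout here: try the next endpoint immediately
    }
  }
  throw lastErr;
}
```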

Profiling tooling (src/scripts/)

  • memory-report.ts — starts the server with --expose-gc --inspect, replays a corpus of real production burnTxHash values from Datadog log exports, and samples heap via CDP at intervals. No forced GC between samples.
  • memory-compare.ts removed — moved to apps-team-ops/src/bin/memory-compare.ts so any service can use it.
  • A copyable scaffold and runbook live in apps-team-ops: 0xPolygon/apps-team-ops#30.

Refactor: shared logger, error packages, and lint upgrade

Logger — Winston replaced with @polygonlabs/logger (pino-based, Sentry-integrated, Datadog-configured):

  • createLogger() in src/logger.ts is an async factory. The only call site is src/bin/apiServer.ts, which awaits it at startup and threads the Logger down explicitly: getExpressApp(logger) → createIndexRouter(logger) → createV1Router(logger) / createZkEVMRouter(logger) → service functions as a parameter. No module-level logger singletons anywhere.
  • src/test/helpers/agent.ts uses a top-level await createLogger() guarded by testEnv.TEST_BASE_URL so env validation is skipped when tests target a remote server.
  • logger.error({ err }, 'message') triggers automatic Sentry capture via the err key.
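In outline, the factory-plus-threading shape looks like this (the Logger interface and factory body are stand-ins; the real @polygonlabs/logger API may differ):

```typescript
// Stand-in Logger shape; the real package is pino-based and Sentry-integrated.
interface Logger {
  info(msg: string): void;
  error(ctx: { err: unknown }, msg: string): void;
}

// Async factory, awaited exactly once at startup (stand-in implementation).
async function createLogger(): Promise<Logger> {
  return {
    info: (msg) => console.log(msg),
    // The real implementation also captures to Sentry via ctx.err.
    error: (_ctx, msg) => console.error(msg),
  };
}

// Route factories receive the logger explicitly; no module-level singleton.
function createV1Router(logger: Logger) {
  return { handle: () => logger.info('v1 request handled') };
}

async function main(): Promise<Logger> {
  const logger = await createLogger();
  const v1 = createV1Router(logger); // threaded down as a parameter
  v1.handle();
  return logger;
}
```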

Errors — custom InfoError(type, message) replaced with a @polygonlabs/verror hierarchy (src/errors.ts):

  • InfoError (base) → BlockNotIncludedError, IncorrectTxError, TxNotCheckpointedError, ZKEVMServiceError
  • Routes convert domain errors to HTTPError subclasses (e.g. NotFound) and throw them; a central Express error handler in src/index.ts maps err.statusCode to the response. Validation failures throw BadRequest directly.
  • src/helpers/responseHandlers.ts deleted — the per-handler response shaping it provided is replaced by the central error handler and plain res.json() calls.
  • Response format normalised: all errors now return { error: true, message } (previously 400s used msg, 404/500 used message).
  • src/constants.ts deleted — errorTypes string constants are no longer needed.

Lint — @polygonlabs/apps-team-lint bumped 1.x → 2.0; commitlint.config.js updated to the commitlint() function-call syntax required by v2. eslint.config.js adds argsIgnorePattern: '^_' so underscore-prefixed parameters (required by the Express 4-arg error handler signature) are not flagged as unused.

File layout — src/helpers/errorHelper.ts → src/errors.ts; src/helpers/maticClient.ts → src/maticClient.ts; src/helpers/ removed.

Validation

Replayed 1,380 real production burnTxHash values against Docker containers (2 GiB memory limit, matching production) running before and after builds:

| | Before | After |
|---|---|---|
| Heap growth trend | +2.3 MB/snapshot | −0.1 MB/snapshot |
| Peak heap | 133 MB | 136 MB |
| Peak RSS | 824 MB | 959 MB |

The trend line flips from positive (monotonic growth → OOMKill) to flat, confirming the leak is eliminated.

To reproduce:

```sh
docker build -t proof-generation-api:test .
docker run -d --name proof-gen-test --env-file .env \
  -p 5001:5000 -p 9230:9229 --memory=2g \
  --entrypoint node proof-generation-api:test \
  --expose-gc --inspect=0.0.0.0:9229 src/bin/apiServer.ts

MEMORY_REPORT_EXTERNAL=true \
  MEMORY_REPORT_CORPUS=path/to/datadog-export.csv \
  MEMORY_REPORT_OUT=reports/memory-after.json \
  MEMORY_REPORT_SERVER_PORT=5001 \
  MEMORY_REPORT_INSPECTOR_PORT=9230 \
  pnpm run memory-report

pnpm run memory-compare reports/memory-before.json reports/memory-after.json
```

Each proof generation request was creating new ethers.js JsonRpcProvider
instances and a POSClient on every attempt, including retries. These held
HTTP connection pools, event emitters, and background polling timers that
the GC could not promptly reclaim. Under concurrent traffic, working-set
memory grew monotonically until the container hit its 2 GiB limit.

- Cache POSClient instances by RPC endpoint pair for the lifetime of the
  process — all requests and retries share the same initialised client
- Switch from JsonRpcProvider to StaticJsonRpcProvider, which makes no
  background block-polling calls and holds no cached chain state, making
  it safe to keep alive indefinitely
- Remove the 1-second sleep between RPC retries — retries immediately
  advance to a different provider, so delaying only holds connections
  open longer without benefit

Validated by replaying 1380 real production burnTxHashes against Docker
containers running the before and after builds with a 2 GiB memory limit
(matching production). Before: +2.3 MB/snapshot heap growth trend. After:
-0.1 MB/snapshot (flat). Adds memory-report and memory-compare scripts
for reproducing this measurement.
@MaximusHaximus marked this pull request as ready for review March 28, 2026 20:37

claude bot commented Mar 28, 2026

Code review

Found 3 issues - 2 team-standards violations (non-null assertions) and 1 bug. Checked for bugs and CLAUDE.md compliance. See inline comments for details.



claude bot commented Mar 28, 2026

Inline review for src/scripts/memory-compare.ts line 38:

Non-null assertions violate team standards (CLAUDE.md adherence)

Lines 37-38 use snaps[i]!.heapUsedMB which violates the team standard: No non-null assertions - Never use x! to force-unwrap a possibly-null/undefined value. Instead, narrow the type explicitly. This applies everywhere, including test files.

With noUncheckedIndexedAccess enabled, snaps[i] is typed as T | undefined. Use a for...of loop or a local variable with a guard instead.

```ts
  for (const [i, snap] of snaps.entries()) {
    sumX += i;
    sumY += snap.heapUsedMB;
    sumXY += i * snap.heapUsedMB;
    sumX2 += i * i;
  }
```
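For context, the trend figures in the validation table (+2.3 and −0.1 MB/snapshot) come from a least-squares slope over the snapshot series. A self-contained, generic version of that computation might look like this (the actual memory-compare.ts may differ in detail):

```typescript
interface Snapshot {
  heapUsedMB: number;
}

// Ordinary least-squares slope of heapUsedMB against snapshot index:
// positive slope means the heap grows monotonically across snapshots.
function growthSlope(snaps: Snapshot[]): number {
  let sumX = 0;
  let sumY = 0;
  let sumXY = 0;
  let sumX2 = 0;
  for (const [i, snap] of snaps.entries()) {
    sumX += i;
    sumY += snap.heapUsedMB;
    sumXY += i * snap.heapUsedMB;
    sumX2 += i * i;
  }
  const n = snaps.length;
  return (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
}
```

Note this style needs no indexed access at all, so it sidesteps the `snaps[i]!` issue entirely.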


claude bot commented Mar 28, 2026

Inline review for src/scripts/memory-report.ts line 57:

Non-null assertion violates team standards (CLAUDE.md adherence)

match[1]! uses a non-null assertion, which violates the team standard: No non-null assertions - Never use x! to force-unwrap a possibly-null/undefined value.

Same issue exists at line 246 (batches[b]!) and line 282 (snapshots[snapshots.length - 1]!). All three should use a guard or fallback instead.

```ts
      const group = match[1];
      if (!group) continue;
      const obj = JSON.parse(group.replace(/'/g, '"')) as { burnTxHash?: string };
```


claude bot commented Mar 28, 2026

Inline review for src/scripts/memory-report.ts line 293:

Bug: totalRequests reports wrong value (bug)

ROUNDS * CONCURRENCY does not equal the actual number of requests fired. ROUNDS controls snapshot frequency, not batch count - the script replays the entire corpus. The actual count is tracked by the totalRequests local variable (line 241). These values diverge whenever corpus.length !== ROUNDS * CONCURRENCY.

For example: corpus=250, CONCURRENCY=20, ROUNDS=10 fires 250 requests but reports 200.

Current code:

```ts
summary: {
  totalRequests: ROUNDS * CONCURRENCY,
  heapGrowthMB: last.heapUsedMB - initial.heapUsedMB,
```

Suggested fix:

```ts
summary: {
  totalRequests,
  heapGrowthMB: last.heapUsedMB - initial.heapUsedMB,
```
memory-compare is fully generic — it reads two JSON report files and
knows nothing about this service. It now lives in apps-team-ops/src/bin/
so any service can use it without duplicating the logic. The package.json
script is updated to reference the shared version via relative path.

claude bot commented Mar 28, 2026

Code review

Found 4 non-null assertion violations across 2 new files. No bugs found in the core fix logic (provider caching, StaticJsonRpcProvider switch, retry sleep removal). The caching pattern is correct and handles concurrent requests and initialization failures properly.

See inline comments below.


claude bot commented Mar 28, 2026

Inline review details

1. src/scripts/memory-compare.ts (lines 37-38) — Team standards violation: no non-null assertions. snaps[i]! is used twice in growthSlope(). The team standard states: Never use x! to force-unwrap. With noUncheckedIndexedAccess, snaps[i] is T | undefined. Fix: const snap = snaps[i]; if (!snap) continue;

2. src/scripts/memory-report.ts (line 57) — Team standards violation: no non-null assertions. match[1]! force-unwraps a possibly-undefined capture group. Fix: const captured = match[1]; if (!captured) continue;

3. src/scripts/memory-report.ts (line 246) — Team standards violation: no non-null assertions. batches[b]! force-unwraps an indexed array access. Fix: const batch = batches[b]; if (!batch) continue;

4. src/scripts/memory-report.ts (line 282) — Team standards violation: no non-null assertions. snapshots[snapshots.length - 1]! force-unwraps. Fix: const last = snapshots[snapshots.length - 1]; if (!last) throw new Error(...);

All four are the same rule from team standards: No non-null assertions. These are in the new profiling scripts only — the core fix in maticClient.ts and v1ProofGenerationServices.ts has no issues.

@MaximusHaximus changed the title from "fix: eliminate provider leak causing OOMKill under load" to "fix: eliminate provider leak and migrate to shared logger/error packages" Mar 28, 2026
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch 2 times, most recently from 00ed5ee to 359263b on March 28, 2026 22:13
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch 3 times, most recently from 2f8a8e0 to 1d22626 on March 28, 2026 23:09
Comment on lines +75 to +112:

```diff
 router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => {
   const result = FastMerkleProofSchema.safeParse({ params: req.params, query: req.query });
   if (!result.success) {
-    return handleBadRequest({
-      res,
-      errMsg: result.error.issues[0]?.message ?? 'Invalid request'
-    });
+    throw new BadRequest(result.error.issues[0]?.message ?? 'Invalid request');
   }

   const { network } = result.data.params;
   const { start, end, number } = result.data.query;

   if (end < start || number > end || number < start) {
-    return handleBadRequest({
-      res,
-      errMsg: 'Invalid start or end or block numbers!'
-    });
+    throw new BadRequest('Invalid start or end or block numbers!');
   }

   const { version, isMainnet } = networkDetails[network];

-  const responseObj = await fastMerkleProof(
-    String(start),
-    String(end),
-    number,
-    isMainnet,
-    version
-  );
-
-  if (
-    !responseObj ||
-    !responseObj.proof ||
-    !verifyMerkleProof(String(number), String(start), responseObj.proof)
-  ) {
-    handleError({ res, errMsg: 'Invalid merkle proof created' });
-    return;
-  }
+  try {
+    const responseObj = await fastMerkleProof(
+      String(start),
+      String(end),
+      number,
+      isMainnet,
+      version,
+      logger
+    );
+
+    if (
+      !responseObj?.proof ||
+      !verifyMerkleProof(String(number), String(start), responseObj.proof)
+    ) {
+      throw new GeneralError('Invalid merkle proof created');
+    }
+
+    res.json(responseObj);
+  } catch (err) {
+    if (err instanceof InfoError) throw new NotFound(err.message);
+    throw err;
+  }
 });
```

Check failure: Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs authorization, but is not rate-limited.

Copilot Autofix (AI), about 5 hours ago:

In general, the fix is to add rate‑limiting middleware to the HTTP route that performs the expensive proof generation and verification, so that any single client cannot flood the service with requests. In Express, this is typically done with a dedicated middleware such as express-rate-limit.

The best way to fix this without changing existing functionality is:

  1. Import a well‑known rate‑limiting middleware (express-rate-limit) at the top of src/routes/v1.ts.
  2. Define a limiter instance tuned for this particular endpoint (e.g., modest per‑IP limit over a reasonable time window).
  3. Attach that limiter specifically to the /:network/fast-merkle-proof route by inserting it as a middleware argument before the async handler. This way, the behavior of other routes remains unchanged and only this sensitive route is protected.
  4. Keep the rest of the handler logic intact so that its current semantics do not change.

Concretely:

  • In src/routes/v1.ts, add import rateLimit from 'express-rate-limit'; near the other imports.
  • Just before export function createV1Router(): Router {, define a fastMerkleProofLimiter using rateLimit({ windowMs: ..., max: ..., standardHeaders: true, legacyHeaders: false }).
  • Change the router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => { ... }); call so that it becomes router.get('/:network/fast-merkle-proof', fastMerkleProofLimiter, async (req: Request, res: Response) => { ... });.
Suggested changeset 2

src/routes/v1.ts: Autofix patch. Run the following command in your local git repository to apply it:

```sh
cat << 'EOF' | git apply
diff --git a/src/routes/v1.ts b/src/routes/v1.ts
--- a/src/routes/v1.ts
+++ b/src/routes/v1.ts
@@ -4,6 +4,8 @@
 
 import { BadRequest, GeneralError } from '@polygonlabs/verror';
 
+import rateLimit from 'express-rate-limit';
+
 import {
   AllExitPayloadsSchema,
   BlockIncludedSchema,
@@ -22,6 +24,13 @@
   amoy: { version: 'amoy' as const, isMainnet: false }
 } as const;
 
+const fastMerkleProofLimiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 100, // limit each IP to 100 requests per windowMs for this endpoint
+  standardHeaders: true,
+  legacyHeaders: false
+});
+
 /**
  * Verify merkle proof
  *
@@ -64,7 +73,7 @@
     res.json(responseObj);
   });
 
-  router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => {
+  router.get('/:network/fast-merkle-proof', fastMerkleProofLimiter, async (req: Request, res: Response) => {
     const result = FastMerkleProofSchema.safeParse({ params: req.params, query: req.query });
     if (!result.success) {
       throw new BadRequest(result.error.issues[0]?.message ?? 'Invalid request');
EOF
```
package.json (outside changed files): Autofix patch. Run the following command in your local git repository to apply it:

```sh
cat << 'EOF' | git apply
diff --git a/package.json b/package.json
--- a/package.json
+++ b/package.json
@@ -66,6 +66,7 @@
     "cors": "^2.8.5",
     "ethers": "5.5.1",
     "express": "^5.2.1",
-    "zod": "^4.3.6"
+    "zod": "^4.3.6",
+    "express-rate-limit": "^8.3.1"
   }
 }
EOF
```
This fix introduces these dependencies:

| Package | Version | Security advisories |
|---|---|---|
| express-rate-limit (npm) | 8.3.1 | None |
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch 11 times, most recently from ddc58b4 to c5fea8f on March 29, 2026 00:13
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch from c5fea8f to 730cc55 on March 29, 2026 00:19
…ps-team-lint v2

Replace Winston with the shared pino-based logger package, replace the
custom InfoError with a VError-based hierarchy, and update commitlint to
the apps-team-lint v2 function-call API.

Logger: createLogger() in src/logger.ts is an async factory — the only
caller is src/bin/apiServer.ts, which awaits it at startup and threads
the Logger down through getExpressApp → createIndexRouter → route
factories → service functions as an explicit parameter. No module-level
logger singletons anywhere. src/test/helpers/agent.ts initialises the
app with a top-level await guarded by testEnv.TEST_BASE_URL so env
validation is skipped when tests target a remote server.
Sentry capture is automatic on logger.error({ err }) calls via the err key.

Errors: InfoError, BlockNotIncludedError, IncorrectTxError,
TxNotCheckpointedError, and ZKEVMServiceError now extend VError. Routes
still use `instanceof InfoError` for 404 detection; service retry loops
use `instanceof InfoError` to stop retrying immediately on domain errors.
The errorTypes string constants and serverCoreErrorCodes stub are removed.

Lint: apps-team-lint bumped from 1.x to 2.0; commitlint.config.js updated
to the new commitlint() function-call syntax required by v2.
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch from 730cc55 to 0cd3878 on March 29, 2026 00:23
Bumps @polygonlabs/logger to 1.0.1 which re-exports pino's Logger type,
so we no longer need a direct pino import just for the type.
Logger type now comes from @polygonlabs/logger — pino is a transitive
dep and no longer needs to be declared directly.
@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot
6.1% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud
