
fix: eliminate provider leak and migrate to shared logger/error packages #73

Open

MaximusHaximus wants to merge 5 commits into master from fix/pos-client-provider-leak
Conversation


@MaximusHaximus MaximusHaximus commented Mar 28, 2026

Problem

The service was OOMKilling repeatedly in production (72+ restarts on the current pod, observed since March 25). Kubernetes working-set memory grew monotonically under load until the container hit its 2 GiB limit.

Root cause: every proof generation request — including every retry — called initMatic(), which created two new ethers.js JsonRpcProvider instances and a new POSClient. With maxRetries = 2 × rpcCount and concurrent requests, hundreds of provider instances accumulated between GC cycles. Each holds an HTTP connection pool, event emitters, and (for JsonRpcProvider) a 4-second background polling timer. The GC can reclaim the JS objects but the kernel-mapped socket buffers and libuv handles persist far longer, which is why the leak showed clearly in kubernetes.memory.working_set but was invisible in V8 heap metrics.

Related: 0xPolygon/dry-dock#958

Changes

Fix: provider leak

src/maticClient.ts (moved from src/helpers/maticClient.ts)

  • Cache POSClient instances by (network, version, maticRPC, ethereumRPC) key for the lifetime of the process. All requests and retries share the same initialised client.
  • Use StaticJsonRpcProvider instead of JsonRpcProvider. The static variant makes no background block-polling calls and holds no cached chain state — safe to keep alive indefinitely. The regular provider's polling loop was accumulating memory even in cached long-lived instances.
  • Cache stores Promise<POSClient> so concurrent requests for the same key share a single initialisation rather than racing. Failed initialisations are evicted so the next request retries.
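The caching pattern described above can be sketched as a generic promise cache (a sketch only; `getCached` and `Init` are illustrative names, not the actual maticClient.ts API):

```typescript
// Illustrative promise-cache: concurrent callers for the same key share one
// in-flight initialisation, and failed initialisations are evicted so the
// next caller retries.
type Init<T> = () => Promise<T>;

const cache = new Map<string, Promise<unknown>>();

function getCached<T>(key: string, init: Init<T>): Promise<T> {
  const hit = cache.get(key);
  if (hit) return hit as Promise<T>;
  // Store the promise itself (not the resolved value) so racing callers
  // all await the same initialisation.
  const pending = init().catch((err) => {
    // Evict on failure so a later request can retry initialisation.
    cache.delete(key);
    throw err;
  });
  cache.set(key, pending);
  return pending;
}
```

Because the map holds the promise rather than the value, two requests arriving in the same tick never both call `init()`.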

src/services/v1ProofGenerationServices.ts

  • Remove the 1-second setTimeout between RPC retries. The retry immediately advances to a different provider endpoint — waiting only prolongs how long concurrent requests hold open connections and associated kernel memory.
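The retry shape this produces can be sketched as follows (hedged: `withRpcFailover` and its parameters are illustrative, not the service's actual retry loop):

```typescript
// Illustrative failover loop: a failed attempt advances straight to the next
// RPC endpoint, with no sleep in between.
async function withRpcFailover<T>(
  endpoints: string[],
  attempt: (endpoint: string) => Promise<T>,
  maxRetries: number = endpoints.length * 2 // mirrors maxRetries = 2 × rpcCount
): Promise<T> {
  let lastErr: unknown = new Error('no RPC endpoints configured');
  for (let i = 0; i < maxRetries; i += 1) {
    const endpoint = endpoints[i % endpoints.length];
    if (endpoint === undefined) break; // noUncheckedIndexedAccess guard
    try {
      return await attempt(endpoint);
    } catch (err) {
      lastErr = err; // no setTimeout here: try the next endpoint immediately
    }
  }
  throw lastErr;
}
```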

Profiling tooling (src/scripts/)

  • memory-report.ts — starts the server with --expose-gc --inspect, replays a corpus of real production burnTxHash values from Datadog log exports, and samples heap via CDP at intervals. No forced GC between samples.
  • memory-compare.ts removed — moved to apps-team-ops/src/bin/memory-compare.ts so any service can use it.
  • A copyable scaffold and runbook live in apps-team-ops: 0xPolygon/apps-team-ops#30.

Refactor: shared logger, error packages, and lint upgrade

Logger — Winston replaced with @polygonlabs/logger (pino-based, Sentry-integrated, Datadog-configured):

  • createLogger() in src/logger.ts is an async factory. The only call site is src/bin/apiServer.ts, which awaits it at startup and threads the Logger down explicitly: getExpressApp(logger) → createIndexRouter(logger) → createV1Router(logger) / createZkEVMRouter(logger) → service functions as a parameter. No module-level logger singletons anywhere.
  • src/test/helpers/agent.ts uses a top-level await createLogger() guarded by testEnv.TEST_BASE_URL so env validation is skipped when tests target a remote server.
  • logger.error({ err }, 'message') triggers automatic Sentry capture via the err key.
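In outline, the factory-plus-threading shape looks like this (the Logger interface and factory body are stand-ins; the real @polygonlabs/logger API may differ):

```typescript
// Stand-in Logger shape; the real package is pino-based and Sentry-integrated.
interface Logger {
  info(msg: string): void;
  error(ctx: { err: unknown }, msg: string): void;
}

// Async factory, awaited exactly once at startup (stand-in implementation).
async function createLogger(): Promise<Logger> {
  return {
    info: (msg) => console.log(msg),
    // The real implementation also captures to Sentry via ctx.err.
    error: (_ctx, msg) => console.error(msg),
  };
}

// Route factories receive the logger explicitly; no module-level singleton.
function createV1Router(logger: Logger) {
  return { handle: () => logger.info('v1 request handled') };
}

async function main(): Promise<Logger> {
  const logger = await createLogger();
  const v1 = createV1Router(logger); // threaded down as a parameter
  v1.handle();
  return logger;
}
```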

Errors — custom InfoError(type, message) replaced with a @polygonlabs/verror hierarchy (src/errors.ts):

  • InfoError (base) → BlockNotIncludedError, IncorrectTxError, TxNotCheckpointedError, ZKEVMServiceError
  • Routes convert domain errors to HTTPError subclasses (e.g. NotFound) and throw them; a central Express error handler in src/index.ts maps err.statusCode to the response. Validation failures throw BadRequest directly.
  • src/helpers/responseHandlers.ts deleted — the per-handler response shaping it provided is replaced by the central error handler and plain res.json() calls.
  • Response format normalised: all errors now return { error: true, message } (previously 400s used msg, 404/500 used message).
  • src/constants.ts deleted — errorTypes string constants are no longer needed.

Lint — @polygonlabs/apps-team-lint bumped 1.x → 2.0; commitlint.config.js updated to the commitlint() function-call syntax required by v2. eslint.config.js adds argsIgnorePattern: '^_' so underscore-prefixed parameters (required by the Express 4-arg error handler signature) are not flagged as unused.

File layout — src/helpers/errorHelper.ts → src/errors.ts; src/helpers/maticClient.ts → src/maticClient.ts; src/helpers/ removed.

Validation

Replayed 1,380 real production burnTxHash values against Docker containers (2 GiB memory limit, matching production) running before and after builds:

| | Before | After |
|---|---|---|
| Heap growth trend | +2.3 MB/snapshot | −0.1 MB/snapshot |
| Peak heap | 133 MB | 136 MB |
| Peak RSS | 824 MB | 959 MB |

The trend line flips from positive (monotonic growth → OOMKill) to flat, confirming the leak is eliminated.

To reproduce:

```sh
docker build -t proof-generation-api:test .
docker run -d --name proof-gen-test --env-file .env \
  -p 5001:5000 -p 9230:9229 --memory=2g \
  --entrypoint node proof-generation-api:test \
  --expose-gc --inspect=0.0.0.0:9229 src/bin/apiServer.ts

MEMORY_REPORT_EXTERNAL=true \
  MEMORY_REPORT_CORPUS=path/to/datadog-export.csv \
  MEMORY_REPORT_OUT=reports/memory-after.json \
  MEMORY_REPORT_SERVER_PORT=5001 \
  MEMORY_REPORT_INSPECTOR_PORT=9230 \
  pnpm run memory-report

pnpm run memory-compare reports/memory-before.json reports/memory-after.json
```

Each proof generation request was creating new ethers.js JsonRpcProvider
instances and a POSClient on every attempt, including retries. These held
HTTP connection pools, event emitters, and background polling timers that
the GC could not promptly reclaim. Under concurrent traffic, working-set
memory grew monotonically until the container hit its 2 GiB limit.

- Cache POSClient instances by RPC endpoint pair for the lifetime of the
  process — all requests and retries share the same initialised client
- Switch from JsonRpcProvider to StaticJsonRpcProvider, which makes no
  background block-polling calls and holds no cached chain state, making
  it safe to keep alive indefinitely
- Remove the 1-second sleep between RPC retries — retries immediately
  advance to a different provider, so delaying only holds connections
  open longer without benefit

Validated by replaying 1380 real production burnTxHashes against Docker
containers running the before and after builds with a 2 GiB memory limit
(matching production). Before: +2.3 MB/snapshot heap growth trend. After:
-0.1 MB/snapshot (flat). Adds memory-report and memory-compare scripts
for reproducing this measurement.
@MaximusHaximus marked this pull request as ready for review March 28, 2026 20:37

claude bot commented Mar 28, 2026

Code review

Found 3 issues - 2 team-standards violations (non-null assertions) and 1 bug. Checked for bugs and CLAUDE.md compliance. See inline comments for details.



claude bot commented Mar 28, 2026

Inline review for src/scripts/memory-compare.ts line 38:

Non-null assertions violate team standards (CLAUDE.md adherence)

Lines 37-38 use snaps[i]!.heapUsedMB which violates the team standard: No non-null assertions - Never use x! to force-unwrap a possibly-null/undefined value. Instead, narrow the type explicitly. This applies everywhere, including test files.

With noUncheckedIndexedAccess enabled, snaps[i] is typed as T | undefined. Use a for...of loop or a local variable with a guard instead.

```ts
  for (const [i, snap] of snaps.entries()) {
    sumX += i;
    sumY += snap.heapUsedMB;
    sumXY += i * snap.heapUsedMB;
    sumX2 += i * i;
  }
```
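For context, the trend figures in the validation table (+2.3 and −0.1 MB/snapshot) come from a least-squares slope over the snapshot series. A self-contained, generic version of that computation might look like this (the actual memory-compare.ts may differ in detail):

```typescript
interface Snapshot {
  heapUsedMB: number;
}

// Ordinary least-squares slope of heapUsedMB against snapshot index:
// positive slope means the heap grows monotonically across snapshots.
function growthSlope(snaps: Snapshot[]): number {
  let sumX = 0;
  let sumY = 0;
  let sumXY = 0;
  let sumX2 = 0;
  for (const [i, snap] of snaps.entries()) {
    sumX += i;
    sumY += snap.heapUsedMB;
    sumXY += i * snap.heapUsedMB;
    sumX2 += i * i;
  }
  const n = snaps.length;
  return (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
}
```

Note this style needs no indexed access at all, so it sidesteps the `snaps[i]!` issue entirely.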


claude bot commented Mar 28, 2026

Inline review for src/scripts/memory-report.ts line 57:

Non-null assertion violates team standards (CLAUDE.md adherence)

match[1]! uses a non-null assertion, which violates the team standard: No non-null assertions - Never use x! to force-unwrap a possibly-null/undefined value.

Same issue exists at line 246 (batches[b]!) and line 282 (snapshots[snapshots.length - 1]!). All three should use a guard or fallback instead.

```ts
      const group = match[1];
      if (!group) continue;
      const obj = JSON.parse(group.replace(/'/g, '"')) as { burnTxHash?: string };
```


claude bot commented Mar 28, 2026

Inline review for src/scripts/memory-report.ts line 293:

Bug: totalRequests reports wrong value (bug)

ROUNDS * CONCURRENCY does not equal the actual number of requests fired. ROUNDS controls snapshot frequency, not batch count - the script replays the entire corpus. The actual count is tracked by the totalRequests local variable (line 241). These values diverge whenever corpus.length !== ROUNDS * CONCURRENCY.

For example: corpus=250, CONCURRENCY=20, ROUNDS=10 fires 250 requests but reports 200.

Current code:

```ts
summary: {
  totalRequests: ROUNDS * CONCURRENCY,
  heapGrowthMB: last.heapUsedMB - initial.heapUsedMB,
```

Suggested fix:

```ts
summary: {
  totalRequests,
  heapGrowthMB: last.heapUsedMB - initial.heapUsedMB,
```
memory-compare is fully generic — it reads two JSON report files and
knows nothing about this service. It now lives in apps-team-ops/src/bin/
so any service can use it without duplicating the logic. The package.json
script is updated to reference the shared version via relative path.

claude bot commented Mar 28, 2026

Code review

Found 4 non-null assertion violations across 2 new files. No bugs found in the core fix logic (provider caching, StaticJsonRpcProvider switch, retry sleep removal). The caching pattern is correct and handles concurrent requests and initialization failures properly.

See inline comments below.


claude bot commented Mar 28, 2026

Inline review details

1. src/scripts/memory-compare.ts (lines 37-38) — Team standards violation: no non-null assertions. snaps[i]! is used twice in growthSlope(). The team standard states: Never use x! to force-unwrap. With noUncheckedIndexedAccess, snaps[i] is T | undefined. Fix: const snap = snaps[i]; if (!snap) continue;

2. src/scripts/memory-report.ts (line 57) — Team standards violation: no non-null assertions. match[1]! force-unwraps a possibly-undefined capture group. Fix: const captured = match[1]; if (!captured) continue;

3. src/scripts/memory-report.ts (line 246) — Team standards violation: no non-null assertions. batches[b]! force-unwraps an indexed array access. Fix: const batch = batches[b]; if (!batch) continue;

4. src/scripts/memory-report.ts (line 282) — Team standards violation: no non-null assertions. snapshots[snapshots.length - 1]! force-unwraps. Fix: const last = snapshots[snapshots.length - 1]; if (!last) throw new Error(...);

All four are the same rule from team standards: No non-null assertions. These are in the new profiling scripts only — the core fix in maticClient.ts and v1ProofGenerationServices.ts has no issues.

@MaximusHaximus changed the title from "fix: eliminate provider leak causing OOMKill under load" to "fix: eliminate provider leak and migrate to shared logger/error packages" Mar 28, 2026
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch 2 times, most recently from 00ed5ee to 359263b on March 28, 2026 22:13
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch 3 times, most recently from 2f8a8e0 to 1d22626 on March 28, 2026 23:09
Comment on lines +75 to +112:

```diff
 router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => {
   const result = FastMerkleProofSchema.safeParse({ params: req.params, query: req.query });
   if (!result.success) {
-    return handleBadRequest({
-      res,
-      errMsg: result.error.issues[0]?.message ?? 'Invalid request'
-    });
+    throw new BadRequest(result.error.issues[0]?.message ?? 'Invalid request');
   }

   const { network } = result.data.params;
   const { start, end, number } = result.data.query;

   if (end < start || number > end || number < start) {
-    return handleBadRequest({
-      res,
-      errMsg: 'Invalid start or end or block numbers!'
-    });
+    throw new BadRequest('Invalid start or end or block numbers!');
   }

   const { version, isMainnet } = networkDetails[network];

-  const responseObj = await fastMerkleProof(
-    String(start),
-    String(end),
-    number,
-    isMainnet,
-    version
-  );
-
-  if (
-    !responseObj ||
-    !responseObj.proof ||
-    !verifyMerkleProof(String(number), String(start), responseObj.proof)
-  ) {
-    handleError({ res, errMsg: 'Invalid merkle proof created' });
-    return;
-  }
+  try {
+    const responseObj = await fastMerkleProof(
+      String(start),
+      String(end),
+      number,
+      isMainnet,
+      version,
+      logger
+    );
+
+    if (
+      !responseObj?.proof ||
+      !verifyMerkleProof(String(number), String(start), responseObj.proof)
+    ) {
+      throw new GeneralError('Invalid merkle proof created');
+    }
+
+    res.json(responseObj);
+  } catch (err) {
+    if (err instanceof InfoError) throw new NotFound(err.message);
+    throw err;
+  }
 });
```

Check failure: Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs authorization, but is not rate-limited.

Copilot Autofix (AI), about 5 hours ago:

In general, the fix is to add rate‑limiting middleware to the HTTP route that performs the expensive proof generation and verification, so that any single client cannot flood the service with requests. In Express, this is typically done with a dedicated middleware such as express-rate-limit.

The best way to fix this without changing existing functionality is:

  1. Import a well‑known rate‑limiting middleware (express-rate-limit) at the top of src/routes/v1.ts.
  2. Define a limiter instance tuned for this particular endpoint (e.g., modest per‑IP limit over a reasonable time window).
  3. Attach that limiter specifically to the /:network/fast-merkle-proof route by inserting it as a middleware argument before the async handler. This way, the behavior of other routes remains unchanged and only this sensitive route is protected.
  4. Keep the rest of the handler logic intact so that its current semantics do not change.

Concretely:

  • In src/routes/v1.ts, add import rateLimit from 'express-rate-limit'; near the other imports.
  • Just before export function createV1Router(): Router {, define a fastMerkleProofLimiter using rateLimit({ windowMs: ..., max: ..., standardHeaders: true, legacyHeaders: false }).
  • Change the router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => { ... }); call so that it becomes router.get('/:network/fast-merkle-proof', fastMerkleProofLimiter, async (req: Request, res: Response) => { ... });.
Suggested changeset 2

src/routes/v1.ts: Autofix patch. Run the following command in your local git repository to apply it:

```sh
cat << 'EOF' | git apply
diff --git a/src/routes/v1.ts b/src/routes/v1.ts
--- a/src/routes/v1.ts
+++ b/src/routes/v1.ts
@@ -4,6 +4,8 @@
 
 import { BadRequest, GeneralError } from '@polygonlabs/verror';
 
+import rateLimit from 'express-rate-limit';
+
 import {
   AllExitPayloadsSchema,
   BlockIncludedSchema,
@@ -22,6 +24,13 @@
   amoy: { version: 'amoy' as const, isMainnet: false }
 } as const;
 
+const fastMerkleProofLimiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 100, // limit each IP to 100 requests per windowMs for this endpoint
+  standardHeaders: true,
+  legacyHeaders: false
+});
+
 /**
  * Verify merkle proof
  *
@@ -64,7 +73,7 @@
     res.json(responseObj);
   });
 
-  router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => {
+  router.get('/:network/fast-merkle-proof', fastMerkleProofLimiter, async (req: Request, res: Response) => {
     const result = FastMerkleProofSchema.safeParse({ params: req.params, query: req.query });
     if (!result.success) {
       throw new BadRequest(result.error.issues[0]?.message ?? 'Invalid request');
EOF
```
package.json (outside changed files): Autofix patch. Run the following command in your local git repository to apply it:

```sh
cat << 'EOF' | git apply
diff --git a/package.json b/package.json
--- a/package.json
+++ b/package.json
@@ -66,6 +66,7 @@
     "cors": "^2.8.5",
     "ethers": "5.5.1",
     "express": "^5.2.1",
-    "zod": "^4.3.6"
+    "zod": "^4.3.6",
+    "express-rate-limit": "^8.3.1"
   }
 }
EOF
```
This fix introduces these dependencies:

| Package | Version | Security advisories |
|---|---|---|
| express-rate-limit (npm) | 8.3.1 | None |
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch 11 times, most recently from ddc58b4 to c5fea8f on March 29, 2026 00:13
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch from c5fea8f to 730cc55 on March 29, 2026 00:19
…ps-team-lint v2

Replace Winston with the shared pino-based logger package, replace the
custom InfoError with a VError-based hierarchy, and update commitlint to
the apps-team-lint v2 function-call API.

Logger: createLogger() in src/logger.ts is an async factory — the only
caller is src/bin/apiServer.ts, which awaits it at startup and threads
the Logger down through getExpressApp → createIndexRouter → route
factories → service functions as an explicit parameter. No module-level
logger singletons anywhere. src/test/helpers/agent.ts initialises the
app with a top-level await guarded by testEnv.TEST_BASE_URL so env
validation is skipped when tests target a remote server.
Sentry capture is automatic on logger.error({ err }) calls via the err key.

Errors: InfoError, BlockNotIncludedError, IncorrectTxError,
TxNotCheckpointedError, and ZKEVMServiceError now extend VError. Routes
still use `instanceof InfoError` for 404 detection; service retry loops
use `instanceof InfoError` to stop retrying immediately on domain errors.
The errorTypes string constants and serverCoreErrorCodes stub are removed.

Lint: apps-team-lint bumped from 1.x to 2.0; commitlint.config.js updated
to the new commitlint() function-call syntax required by v2.
@MaximusHaximus force-pushed the fix/pos-client-provider-leak branch from 730cc55 to 0cd3878 on March 29, 2026 00:23
Bumps @polygonlabs/logger to 1.0.1 which re-exports pino's Logger type,
so we no longer need a direct pino import just for the type.
Logger type now comes from @polygonlabs/logger — pino is a transitive
dep and no longer needs to be declared directly.
@sonarqubecloud

Quality Gate failed

Failed conditions
1 Security Hotspot
6.1% Duplication on New Code (required ≤ 3%)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud
