fix: eliminate provider leak and migrate to shared logger/error packages #73
MaximusHaximus wants to merge 5 commits into master
Conversation
Each proof generation request was creating new ethers.js `JsonRpcProvider` instances and a `POSClient` on every attempt, including retries. These held HTTP connection pools, event emitters, and background polling timers that the GC could not promptly reclaim. Under concurrent traffic, working-set memory grew monotonically until the container hit its 2 GiB limit.

- Cache `POSClient` instances by RPC endpoint pair for the lifetime of the process — all requests and retries share the same initialised client
- Switch from `JsonRpcProvider` to `StaticJsonRpcProvider`, which makes no background block-polling calls and holds no cached chain state, making it safe to keep alive indefinitely
- Remove the 1-second sleep between RPC retries — retries immediately advance to a different provider, so delaying only holds connections open longer without benefit

Validated by replaying 1,380 real production `burnTxHash` values against Docker containers running the before and after builds with a 2 GiB memory limit (matching production). Before: +2.3 MB/snapshot heap growth trend. After: -0.1 MB/snapshot (flat). Adds memory-report and memory-compare scripts for reproducing this measurement.
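The caching pattern described above can be sketched generically. The names below (`getClient`, the string key) are illustrative stand-ins, not the real `maticClient.ts` API; the real code keys `POSClient` instances by RPC endpoint pair. The key idea is to cache the pending Promise rather than the resolved client, so concurrent callers share one initialisation, and to evict on failure so a later request can retry:

```typescript
// Generic sketch of the fix's caching pattern. getClient and the string
// key are hypothetical; the real code caches POSClient by endpoint pair.
const clientCache = new Map<string, Promise<unknown>>();

function getClient<T>(key: string, init: () => Promise<T>): Promise<T> {
  const cached = clientCache.get(key);
  if (cached) return cached as Promise<T>;

  // Store the Promise itself so concurrent requests for the same key
  // share a single initialisation instead of racing.
  const pending = init();
  clientCache.set(key, pending);

  // Evict failed initialisations so the next request retries.
  pending.catch(() => clientCache.delete(key));
  return pending;
}
```

Because the Promise is inserted synchronously before any `await`, two requests arriving in the same tick still resolve to the same client instance.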
Code review

Found 3 issues: 2 team-standards violations (non-null assertions) and 1 bug. Checked for bugs and CLAUDE.md compliance. See inline comments for details.
Inline Comment 1 — `src/scripts/memory-compare.ts` line 38: Non-null assertions violate team standards (CLAUDE.md adherence)

Lines 37-38 use `snaps[i]!.heapUsedMB`, which violates the team standard: "No non-null assertions - never use `x!` to force-unwrap a possibly-null/undefined value. Instead, narrow the type explicitly. This applies everywhere, including test files." With `noUncheckedIndexedAccess` enabled, `snaps[i]` is typed as `T | undefined`. Use a `for...of` loop or a local variable with a guard instead.

Inline Comment 2 — `src/scripts/memory-report.ts` line 57: Non-null assertion violates team standards (CLAUDE.md adherence)

`match[1]!` uses a non-null assertion, which violates the team standard: "No non-null assertions - never use `x!` to force-unwrap a possibly-null/undefined value." The same issue exists at line 246 (`batches[b]!`) and line 282 (`snapshots[snapshots.length - 1]!`). All three should use a guard or fallback instead.

Inline Comment 3 — `src/scripts/memory-report.ts` line 293: Bug: totalRequests reports wrong value (bug)

`ROUNDS * CONCURRENCY` does not equal the actual number of requests fired. `ROUNDS` controls snapshot frequency, not batch count — the script replays the entire corpus. The actual count is tracked by the `totalRequests` local variable (line 241). These values diverge whenever `corpus.length !== ROUNDS * CONCURRENCY`. For example: corpus=250, CONCURRENCY=20, ROUNDS=10 fires 250 requests but reports 200. (proof-generation-api/src/scripts/memory-report.ts, lines 292-294 at 7de8ede)
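The fix requested in Comment 1 can be sketched as follows. The `Snapshot` shape and `heapDeltas` helper are assumptions for illustration, not the actual script code; the point is that iterating with `for...of` avoids indexed access entirely, so no non-null assertion is needed even with `noUncheckedIndexedAccess`:

```typescript
// Hypothetical snapshot shape; only heapUsedMB is taken from the review.
interface Snapshot {
  heapUsedMB: number;
}

// With noUncheckedIndexedAccess, snaps[i] is Snapshot | undefined, so
// snaps[i]!.heapUsedMB is a banned non-null assertion. Iterating with
// for...of sidesteps index access and needs no assertion at all.
function heapDeltas(snaps: Snapshot[]): number[] {
  const deltas: number[] = [];
  let prev: Snapshot | undefined;
  for (const snap of snaps) {
    if (prev !== undefined) {
      deltas.push(snap.heapUsedMB - prev.heapUsedMB);
    }
    prev = snap;
  }
  return deltas;
}
```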
memory-compare is fully generic — it reads two JSON report files and knows nothing about this service. It now lives in apps-team-ops/src/bin/ so any service can use it without duplicating the logic. The package.json script is updated to reference the shared version via relative path.
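The heart of such a generic comparison can be sketched as a least-squares slope over each report's snapshots. The `Report` shape and field names below are assumptions, not the actual JSON format the shared script reads:

```typescript
// Hypothetical shape of a memory-report JSON file.
interface Report {
  snapshots: { heapUsedMB: number }[];
}

// Least-squares slope of heap usage per snapshot index:
// positive => growth trend (leak), near zero => flat.
function heapTrend(report: Report): number {
  const ys = report.snapshots.map((s) => s.heapUsedMB);
  const n = ys.length;
  if (n < 2) return 0;

  const meanX = (n - 1) / 2; // mean of indices 0..n-1
  const meanY = ys.reduce((a, b) => a + b, 0) / n;

  let num = 0;
  let den = 0;
  ys.forEach((y, i) => {
    num += (i - meanX) * (y - meanY);
    den += (i - meanX) ** 2;
  });
  return num / den; // MB per snapshot
}
```

Comparing the sign of `heapTrend(before)` and `heapTrend(after)` is enough to show the flip from growth to flat.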
Code review

Found 4 non-null assertion violations across 2 new files. No bugs found in the core fix logic (provider caching, the `StaticJsonRpcProvider` switch, retry sleep removal). The caching pattern is correct and handles concurrent requests and initialisation failures properly. See inline comments below.
Inline review details

1. `src/scripts/memory-compare.ts` (lines 37-38) — team standards violation: no non-null assertions.
2. `src/scripts/memory-report.ts` (line 57) — team standards violation: no non-null assertions.
3. `src/scripts/memory-report.ts` (line 246) — team standards violation: no non-null assertions.
4. `src/scripts/memory-report.ts` (line 282) — team standards violation: no non-null assertions.

All four are the same rule from team standards: no non-null assertions. These are in the new profiling scripts only — the core fix in `maticClient.ts` and `v1ProofGenerationServices.ts` has no issues.
Force-pushed from 00ed5ee to 359263b.

Force-pushed from 2f8a8e0 to 1d22626.
```diff
 router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => {
   const result = FastMerkleProofSchema.safeParse({ params: req.params, query: req.query });
   if (!result.success) {
-    return handleBadRequest({
-      res,
-      errMsg: result.error.issues[0]?.message ?? 'Invalid request'
-    });
+    throw new BadRequest(result.error.issues[0]?.message ?? 'Invalid request');
   }

   const { network } = result.data.params;
   const { start, end, number } = result.data.query;

   if (end < start || number > end || number < start) {
-    return handleBadRequest({
-      res,
-      errMsg: 'Invalid start or end or block numbers!'
-    });
+    throw new BadRequest('Invalid start or end or block numbers!');
   }

   const { version, isMainnet } = networkDetails[network];

-  const responseObj = await fastMerkleProof(
-    String(start),
-    String(end),
-    number,
-    isMainnet,
-    version
-  );
-
-  if (
-    !responseObj ||
-    !responseObj.proof ||
-    !verifyMerkleProof(String(number), String(start), responseObj.proof)
-  ) {
-    handleError({ res, errMsg: 'Invalid merkle proof created' });
-    return;
-  }
+  try {
+    const responseObj = await fastMerkleProof(
+      String(start),
+      String(end),
+      number,
+      isMainnet,
+      version,
+      logger
+    );
+
+    if (
+      !responseObj?.proof ||
+      !verifyMerkleProof(String(number), String(start), responseObj.proof)
+    ) {
+      throw new GeneralError('Invalid merkle proof created');
+    }
+
+    res.json(responseObj);
+  } catch (err) {
+    if (err instanceof InfoError) throw new NotFound(err.message);
+    throw err;
+  }
 });
```
Check failure — Code scanning / CodeQL: Missing rate limiting (High)

Copilot Autofix (AI, about 5 hours ago):
In general, the fix is to add rate-limiting middleware to the HTTP route that performs the expensive proof generation and verification, so that any single client cannot flood the service with requests. In Express, this is typically done with a dedicated middleware such as `express-rate-limit`.

The best way to fix this without changing existing functionality is:

- Import a well-known rate-limiting middleware (`express-rate-limit`) at the top of `src/routes/v1.ts`.
- Define a limiter instance tuned for this particular endpoint (e.g., a modest per-IP limit over a reasonable time window).
- Attach that limiter specifically to the `/:network/fast-merkle-proof` route by inserting it as a middleware argument before the async handler. This way, the behavior of other routes remains unchanged and only this sensitive route is protected.
- Keep the rest of the handler logic intact so that its current semantics do not change.

Concretely:

- In `src/routes/v1.ts`, add `import rateLimit from 'express-rate-limit';` near the other imports.
- Just before `export function createV1Router(): Router {`, define a `fastMerkleProofLimiter` using `rateLimit({ windowMs: ..., max: ..., standardHeaders: true, legacyHeaders: false })`.
- Change the `router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => { ... });` call so that it becomes `router.get('/:network/fast-merkle-proof', fastMerkleProofLimiter, async (req: Request, res: Response) => { ... });`.
```diff
@@ -4,6 +4,8 @@
 import { BadRequest, GeneralError } from '@polygonlabs/verror';
+
+import rateLimit from 'express-rate-limit';
 import {
   AllExitPayloadsSchema,
   BlockIncludedSchema,
@@ -22,6 +24,13 @@
   amoy: { version: 'amoy' as const, isMainnet: false }
 } as const;

+const fastMerkleProofLimiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 100, // limit each IP to 100 requests per windowMs for this endpoint
+  standardHeaders: true,
+  legacyHeaders: false
+});
+
 /**
  * Verify merkle proof
  *
@@ -64,7 +73,7 @@
   res.json(responseObj);
 });

-router.get('/:network/fast-merkle-proof', async (req: Request, res: Response) => {
+router.get('/:network/fast-merkle-proof', fastMerkleProofLimiter, async (req: Request, res: Response) => {
   const result = FastMerkleProofSchema.safeParse({ params: req.params, query: req.query });
   if (!result.success) {
     throw new BadRequest(result.error.issues[0]?.message ?? 'Invalid request');
```
```diff
@@ -66,6 +66,7 @@
     "cors": "^2.8.5",
     "ethers": "5.5.1",
     "express": "^5.2.1",
-    "zod": "^4.3.6"
+    "zod": "^4.3.6",
+    "express-rate-limit": "^8.3.1"
   }
 }
```
| Package | Version | Security advisories |
| --- | --- | --- |
| express-rate-limit (npm) | 8.3.1 | None |
Force-pushed from ddc58b4 to c5fea8f.

Force-pushed from c5fea8f to 730cc55.
…ps-team-lint v2
Replace Winston with the shared pino-based logger package, replace the
custom InfoError with a VError-based hierarchy, and update commitlint to
the apps-team-lint v2 function-call API.
Logger: createLogger() in src/logger.ts is an async factory — the only
caller is src/bin/apiServer.ts, which awaits it at startup and threads
the Logger down through getExpressApp → createIndexRouter → route
factories → service functions as an explicit parameter. No module-level
logger singletons anywhere. src/test/helpers/agent.ts initialises the
app with a top-level await guarded by testEnv.TEST_BASE_URL so env
validation is skipped when tests target a remote server.
Sentry capture is automatic on logger.error({ err }) calls via the err key.
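A minimal sketch of this wiring follows. The `Logger` interface and factory internals here are stand-ins (the real `createLogger()` wires Sentry and Datadog and returns a pino logger); only the explicit-threading pattern reflects the change:

```typescript
// Minimal stand-in for the Logger type re-exported by @polygonlabs/logger.
interface Logger {
  info(obj: Record<string, unknown>, msg: string): void;
}

// Captured output so the wiring is observable; purely illustrative.
const logLines: string[] = [];

// Stand-in async factory; the real createLogger() configures transports.
async function createLogger(): Promise<Logger> {
  return {
    info: (obj, msg) => logLines.push(`${msg} ${JSON.stringify(obj)}`),
  };
}

// Route factories take the logger as an explicit parameter — no
// module-level logger singleton anywhere.
function createV1Router(logger: Logger) {
  return (path: string): string => {
    logger.info({ path }, 'handling request');
    return path;
  };
}
```

At startup the factory is awaited once and the instance is threaded down through every factory, mirroring `apiServer.ts` → `getExpressApp` → `createIndexRouter` → route factories.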
Errors: InfoError, BlockNotIncludedError, IncorrectTxError,
TxNotCheckpointedError, and ZKEVMServiceError now extend VError. Routes
still use `instanceof InfoError` for 404 detection; service retry loops
use `instanceof InfoError` to stop retrying immediately on domain errors.
The errorTypes string constants and serverCoreErrorCodes stub are removed.
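A sketch of how one `instanceof InfoError` check drives both behaviours. The class bodies are simplified (the real classes extend `@polygonlabs/verror`'s VError, not `Error`), and `statusFor`/`shouldRetry` are illustrative names, not the actual route or retry-loop code:

```typescript
// Simplified hierarchy; the real base extends VError from @polygonlabs/verror.
class InfoError extends Error {}
class BlockNotIncludedError extends InfoError {}
class TxNotCheckpointedError extends InfoError {}

// Routes: any InfoError subclass maps to a 404; everything else is a 500.
function statusFor(err: unknown): number {
  return err instanceof InfoError ? 404 : 500;
}

// Retry loops: domain errors are not transient, so stop retrying immediately.
function shouldRetry(err: unknown): boolean {
  return !(err instanceof InfoError);
}
```

Because subclasses inherit from a single base, new domain errors get both behaviours for free without touching the routes or retry loops.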
Lint: apps-team-lint bumped from 1.x to 2.0; commitlint.config.js updated
to the new commitlint() function-call syntax required by v2.
Force-pushed from 730cc55 to 0cd3878.
Bumps @polygonlabs/logger to 1.0.1 which re-exports pino's Logger type, so we no longer need a direct pino import just for the type.
Logger type now comes from @polygonlabs/logger — pino is a transitive dep and no longer needs to be declared directly.
Problem
The service was OOMKilling repeatedly in production (72+ restarts on the current pod, observed since March 25). Kubernetes working-set memory grew monotonically under load until the container hit its 2 GiB limit.
Root cause: every proof generation request — including every retry — called `initMatic()`, which created two new ethers.js `JsonRpcProvider` instances and a new `POSClient`. With `maxRetries = 2 × rpcCount` and concurrent requests, hundreds of provider instances accumulated between GC cycles. Each holds an HTTP connection pool, event emitters, and (for `JsonRpcProvider`) a 4-second background polling timer. The GC can reclaim the JS objects, but the kernel-mapped socket buffers and libuv handles persist far longer, which is why the leak showed clearly in `kubernetes.memory.working_set` but was invisible in V8 heap metrics.

Related: 0xPolygon/dry-dock#958
Changes

Fix: provider leak

`src/maticClient.ts` (moved from `src/helpers/maticClient.ts`):

- Cache `POSClient` instances by `(network, version, maticRPC, ethereumRPC)` key for the lifetime of the process. All requests and retries share the same initialised client.
- Use `StaticJsonRpcProvider` instead of `JsonRpcProvider`. The static variant makes no background block-polling calls and holds no cached chain state — safe to keep alive indefinitely. The regular provider's polling loop was accumulating memory even in cached long-lived instances.
- The cache stores `Promise<POSClient>` so concurrent requests for the same key share a single initialisation rather than racing. Failed initialisations are evicted so the next request retries.

`src/services/v1ProofGenerationServices.ts`:

- Removed the 1-second `setTimeout` between RPC retries. The retry immediately advances to a different provider endpoint — waiting only prolongs how long concurrent requests hold open connections and associated kernel memory.

Profiling tooling (`src/scripts/`)

- `memory-report.ts` — starts the server with `--expose-gc --inspect`, replays a corpus of real production `burnTxHash` values from Datadog log exports, and samples heap via CDP at intervals. No forced GC between samples.
- `memory-compare.ts` removed — moved to `apps-team-ops/src/bin/memory-compare.ts` so any service can use it. See the companion PR in apps-team-ops: 0xPolygon/apps-team-ops#30.

Refactor: shared logger, error packages, and lint upgrade

Logger — Winston replaced with `@polygonlabs/logger` (pino-based, Sentry-integrated, Datadog-configured):

- `createLogger()` in `src/logger.ts` is an async factory. The only call site is `src/bin/apiServer.ts`, which awaits it at startup and threads the `Logger` down explicitly: `getExpressApp(logger)` → `createIndexRouter(logger)` → `createV1Router(logger)` / `createZkEVMRouter(logger)` → service functions as a parameter. No module-level logger singletons anywhere.
- `src/test/helpers/agent.ts` uses a top-level `await createLogger()` guarded by `testEnv.TEST_BASE_URL` so env validation is skipped when tests target a remote server.
- `logger.error({ err }, 'message')` triggers automatic Sentry capture via the `err` key.

Errors — the custom `InfoError(type, message)` is replaced with a `@polygonlabs/verror` hierarchy (`src/errors.ts`):

- `InfoError` (base) → `BlockNotIncludedError`, `IncorrectTxError`, `TxNotCheckpointedError`, `ZKEVMServiceError`.
- Routes map domain errors to `HTTPError` subclasses (e.g. `NotFound`) and throw them; a central Express error handler in `src/index.ts` maps `err.statusCode` to the response. Validation failures throw `BadRequest` directly.
- `src/helpers/responseHandlers.ts` deleted — the per-handler response shaping it provided is replaced by the central error handler and plain `res.json()` calls.
- Error bodies are now uniformly `{ error: true, message }` (previously 400s used `msg`, 404/500 used `message`).
- `src/constants.ts` deleted — the `errorTypes` string constants are no longer needed.

Lint — `@polygonlabs/apps-team-lint` bumped 1.x → 2.0; `commitlint.config.js` updated to the `commitlint()` function-call syntax required by v2. `eslint.config.js` adds `argsIgnorePattern: '^_'` so underscore-prefixed parameters (required by the Express 4-arg error handler signature) are not flagged as unused.

File layout — `src/helpers/errorHelper.ts` → `src/errors.ts`; `src/helpers/maticClient.ts` → `src/maticClient.ts`; `src/helpers/` removed.

Validation

Replayed 1,380 real production `burnTxHash` values against Docker containers (2 GiB memory limit, matching production) running the before and after builds:

| Build | Heap growth trend |
| --- | --- |
| Before | +2.3 MB/snapshot (monotonic growth) |
| After | -0.1 MB/snapshot (flat) |

The trend line flips from positive (monotonic growth → OOMKill) to flat, confirming the leak is eliminated.
To reproduce:

```shell
docker build -t proof-generation-api:test .

docker run -d --name proof-gen-test --env-file .env \
  -p 5001:5000 -p 9230:9229 --memory=2g \
  --entrypoint node proof-generation-api:test \
  --expose-gc --inspect=0.0.0.0:9229 src/bin/apiServer.ts

MEMORY_REPORT_EXTERNAL=true \
MEMORY_REPORT_CORPUS=path/to/datadog-export.csv \
MEMORY_REPORT_OUT=reports/memory-after.json \
MEMORY_REPORT_SERVER_PORT=5001 \
MEMORY_REPORT_INSPECTOR_PORT=9230 \
pnpm run memory-report

pnpm run memory-compare reports/memory-before.json reports/memory-after.json
```