ci(mutants-core): cap per-process address space at ~48 G via ulimit -v (#590) by avrabe · Pull Request #599 · pulseengine/rivet

avrabe · 2026-06-26T12:12:21Z

Refs #590.

What

In .github/workflows/ci.yml, the mutants-core shard now caps virtual address space with ulimit -v 50331648 before invoking cargo mutants. The value is in KiB, so the cap is 48 GiB. This is a one-line addition: the step's run: becomes a multi-line block, and the rest of the step is unchanged.

+        ulimit -v 50331648
         cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 2 --output mutants-out -- --lib || true
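
In bash, ulimit -v takes its value in 1024-byte units, and a limit set in a shell is inherited by every process started from it afterwards. A minimal sketch of both points follows; gib_to_kib and child_limit_under_cap are hypothetical helper names used only for illustration and are not part of the workflow.

```shell
#!/usr/bin/env bash

# Hypothetical helper: convert GiB to the KiB units that bash's `ulimit -v` expects.
gib_to_kib() {
  echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical helper: set the cap inside a subshell (so it does not leak into
# the calling shell) and print the limit as a child process sees it.
child_limit_under_cap() {
  (
    ulimit -v "$1" || exit 1   # fails if an existing hard limit is already lower
    bash -c 'ulimit -v'        # the child inherits RLIMIT_AS from the subshell
  )
}

gib_to_kib 48   # prints 50331648
```

Running child_limit_under_cap 50331648 on a runner without a stricter pre-existing limit should echo the same number back, which is the inheritance the workflow change relies on.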

Why (per #590 re-diagnosis)

@avrabe's last comment on #590 ruled out an unmutated-source bug — the OOM is a runaway mutant from cargo-mutants that allocates ~100 G in seconds, beating the 30 s per-mutant timeout. The kernel then OOM-kills system-wide and takes down neighboring jobs on the lean-mem pool.

The comment recommends two fixes:

Primary (infra, not in this repo): the MemoryMax ~48 G cgroup cap on the lean-mem runner pool.
Optional repo-side defense-in-depth, works now: ulimit -v 50331648 before cargo mutants. ← this PR.

continue-on-error: true + the existing || true mean a clipped mutant is still recorded as timeout/error rather than failing the gate. The carefully-tuned --jobs 2 is preserved — its rationale (smithy operator tuning, 16 G/worker target) is still valid with the per-process cap in place.
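
The effect of the trailing || true can be sketched in isolation. The example assumes the step body runs under bash with errexit enabled (the default for run steps on GitHub-hosted Linux runners); step_body is a hypothetical wrapper, and false stands in for a cargo mutants invocation that exits non-zero.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper that mimics a step body: errexit is on, the command
# under test fails, and `|| true` keeps the script from aborting.
step_body() {
  (
    set -e
    "$@" || true            # a non-zero exit here is masked
    echo "step continued"   # still reached after the failure
  )
}

step_body false            # prints "step continued"
echo "exit status: $?"     # prints "exit status: 0"
```

Because the masked command always yields status 0, the step itself cannot fail on a clipped mutant; continue-on-error: true is a second layer for failures elsewhere in the job.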

Acceptance criterion (from #590)

No rivet_core process exceeds the lean-mem MemoryMax (~48 G); zero kernel OOM kills attributable to rivet.

| Status | Note |
| --- | --- |
| ⏳ Partially addressable here | This PR caps each cargo-mutants worker process at ~48 G via `RLIMIT_AS`. With `--jobs 2`, the sum can still reach ~96 G, so this PR alone does not bound the cgroup-level total; that is what the infra `MemoryMax` cap will do. The two together meet the criterion. |
| 🔭 Operational verification | The criterion is verifiable only by the nightly `mutants-core` fan-out post-merge (the OOMs cluster ~once per nightly run on high-load days). Not testable inside one PR run. |
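
The arithmetic behind "partially addressable" can be written out as a rough sketch. It assumes at most one runaway process per job, each filling its whole cap, and treats the address-space cap as an upper bound on resident memory; worst_case_gib and exceeds_cgroup_cap are hypothetical names.

```shell
#!/usr/bin/env bash

# Worst-case total if every job has one process that reaches its cap.
# Arguments: number of jobs, per-process cap in GiB.
worst_case_gib() {
  echo $(( $1 * $2 ))
}

# Succeeds (exit 0) when the worst-case total is above the cgroup limit.
# Arguments: number of jobs, per-process cap in GiB, cgroup MemoryMax in GiB.
exceeds_cgroup_cap() {
  [ "$(worst_case_gib "$1" "$2")" -gt "$3" ]
}

worst_case_gib 2 48   # prints 96

if exceeds_cgroup_cap 2 48 48; then
  echo "per-process cap alone does not bound the total"
fi
```

With --jobs 2 and a 48 G cap the bound is 96 G, twice the ~48 G MemoryMax, which is why the per-process limit and the cgroup limit are described as complementary.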

Why draft: opening as a draft for two reasons:

The author's own comment marks the repo-side change as optional and points at infra as the real fix; landing this should be coordinated with whoever owns the MemoryMax rollout.
This run's mandatory pre-PR step — re-reading https://pulseengine.eu/blog/ for current process guidance — failed (the host returned HTTP 503 / expired TLS cert; I did not bypass verification). Mark ready for review once the blog is reachable and the cgroup cap is scheduled.

Test plan

Manually trigger mutants-core via workflow_dispatch (or wait for next nightly).
Confirm shards still complete within the 45 min budget (--jobs 2 is preserved, so wall-clock should be unchanged for non-runaway mutants).
Once the infra MemoryMax cap lands, confirm a runaway mutant aborts as timeout/error in mutants-out/ instead of triggering an entry in the host's kernel OOM log.
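
The last check could be scripted roughly as follows. This is a sketch under assumptions: it expects cargo-mutants to write its results to a mutants.out directory inside the --output directory, with timed-out mutants listed one per line in timeout.txt, and it looks for the phrase "out of memory" in kernel log text. Which outcome file a clipped mutant actually lands in is not confirmed by this PR. count_timeouts and count_oom_kills are hypothetical helpers.

```shell
#!/usr/bin/env bash

# Count mutants recorded as timed out. Argument: the directory given to --output.
# Prints 0 when the file is absent.
count_timeouts() {
  local f="$1/mutants.out/timeout.txt"
  if [ -f "$f" ]; then
    wc -l < "$f" | tr -d ' '
  else
    echo 0
  fi
}

# Count kernel log lines on stdin that mention an out-of-memory kill.
# grep -c exits non-zero when there are no matches, so `|| true` keeps
# the function's status at 0 while still printing the count.
count_oom_kills() {
  grep -ci 'out of memory' || true
}

# Possible usage on a runner host (reading the kernel log may need privileges):
#   count_timeouts mutants-out
#   dmesg | count_oom_kills
```

A run that meets the acceptance criterion would show zero matching lines in the kernel log for the window of the nightly fan-out, regardless of how many mutants were clipped.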

Generated by Claude Code

@avrabe

Adds the optional repo-side defense-in-depth from #590 (comment by @avrabe): RLIMIT_AS=~48 G before cargo mutants in the rivet-core shard.

A runaway mutation can allocate ~100 G in seconds, faster than the 30 s per-mutant timeout, so the kernel OOM-killer fires first and can take down neighboring jobs on the lean-mem pool. With this cap, the runaway aborts inside its own process (ENOMEM); the shard records it as timeout/error and continue-on-error keeps the gate green.

Primary fix is still the infra MemoryMax cgroup cap; the acceptance criterion ("zero kernel OOM kills attributable to rivet") can only be observed by the nightly mutants-core fan-out after this and the cgroup cap both land.

Refs: #590
Refs: #509

codecov · 2026-06-26T12:19:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: `e2d9b14` | Previous: `11db466` | Ratio |
| --- | --- | --- | --- |
| `store_lookup/100` | `2190` ns/iter (`± 4`) | `1680` ns/iter (`± 6`) | `1.30` |
| `store_lookup/1000` | `24729` ns/iter (`± 65`) | `19362` ns/iter (`± 110`) | `1.28` |
| `store_by_type/100` | `146` ns/iter (`± 0`) | `87` ns/iter (`± 0`) | `1.68` |
| `store_by_type/1000` | `145` ns/iter (`± 1`) | `87` ns/iter (`± 2`) | `1.67` |
| `store_by_type/10000` | `145` ns/iter (`± 0`) | `87` ns/iter (`± 0`) | `1.67` |
| `validate/10000` | `1265427817` ns/iter (`± 14461010`) | `914370645` ns/iter (`± 5183624`) | `1.38` |
| `traceability_matrix/1000` | `60195` ns/iter (`± 566`) | `40796` ns/iter (`± 521`) | `1.48` |
| `query/100` | `1143` ns/iter (`± 14`) | `837` ns/iter (`± 4`) | `1.37` |
| `query/1000` | `15242` ns/iter (`± 35`) | `11489` ns/iter (`± 47`) | `1.33` |
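
The Ratio column is consistent with current time divided by previous time, flagged when it exceeds the 1.20 threshold. A small sketch of that arithmetic, using values from the table; ratio and is_regression are hypothetical helpers and do not reproduce the benchmark action's own code.

```shell
#!/usr/bin/env bash

# Ratio of current to previous ns/iter, rounded to two decimals.
ratio() {
  awk -v cur="$1" -v prev="$2" 'BEGIN { printf "%.2f\n", cur / prev }'
}

# Succeeds (exit 0) when current/previous is above the threshold.
is_regression() {
  awk -v cur="$1" -v prev="$2" -v thr="$3" 'BEGIN { exit !(cur / prev > thr) }'
}

ratio 2190 1680   # store_lookup/100, prints 1.30
ratio 146 87      # store_by_type/100, prints 1.68

if is_regression 2190 1680 1.20; then
  echo "store_lookup/100 exceeds the alert threshold"
fi
```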

This comment was automatically generated by workflow using github-action-benchmark.

avrabe mentioned this pull request Jun 26, 2026

rivet_core test binary OOMs CI: up to 100 GB RSS triggers system-wide OOM-killer on lean-mem runners #590

Open

github-actions Bot reviewed Jun 26, 2026

avrabe wants to merge 1 commit into main from fix/issue-590-mutants-ulimit
