
ci(mutants-core): cap per-process address space at ~48 G via ulimit -v (#590) #599

Draft
avrabe wants to merge 1 commit into main from fix/issue-590-mutants-ulimit

@avrabe avrabe commented Jun 26, 2026

Refs #590.

What

In .github/workflows/ci.yml, the mutants-core shard now caps virtual address space via ulimit -v 50331648 before invoking cargo mutants. ulimit -v takes its argument in KiB, so 50331648 is 48 GiB (~48 G). The change adds a single line, which turns the step's run: into a multi-line block; the rest of the step is unchanged.

+        ulimit -v 50331648
         cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 2 --output mutants-out -- --lib || true
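
As a cross-check on the constant, here is a minimal sketch of the unit conversion and of how a limit set in a shell is inherited only by what that shell runs afterwards. The variable names are illustrative and do not come from the workflow; the subshell stands in for the step's own shell process.

```shell
# ulimit -v takes its argument in KiB (1024-byte units).
cap_gib=48
cap_kib=$(( cap_gib * 1024 * 1024 ))
echo "cap in KiB: $cap_kib"

# Record the current limit so we can see that the outer shell is untouched.
before=$(ulimit -v)

# Set the limit inside a subshell and read it back there.
# If the environment forbids changing the limit, 'inner' stays empty.
inner=$( ulimit -v "$cap_kib" 2>/dev/null && ulimit -v )
echo "limit inside subshell: ${inner:-could not be set}"
echo "limit in outer shell:  $(ulimit -v) (was: $before)"
```

In the workflow the limit is set directly in the step's shell, so it applies to cargo mutants and every process it spawns, and it ends when that step's shell exits.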

Why (per #590 re-diagnosis)

@avrabe's last comment on #590 ruled out an unmutated-source bug — the OOM is a runaway mutant from cargo-mutants that allocates ~100 G in seconds, beating the 30 s per-mutant timeout. The kernel then OOM-kills system-wide and takes down neighboring jobs on the lean-mem pool.

The comment recommends two fixes:

  1. Primary (infra, not in this repo): the MemoryMax ~48 G cgroup cap on the lean-mem runner pool.
  2. Optional repo-side defense-in-depth, works now: ulimit -v 50331648 before cargo mutants. ← this PR.

continue-on-error: true + the existing || true mean a clipped mutant is still recorded as timeout/error rather than failing the gate. The carefully-tuned --jobs 2 is preserved — its rationale (smithy operator tuning, 16 G/worker target) is still valid with the per-process cap in place.
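
How `|| true` keeps a non-zero exit from failing the step can be sketched as follows. `fake_cargo_mutants` is a hypothetical stand-in that simply returns a non-zero status; it is not the real tool, and the status value 3 is arbitrary.

```shell
# Hypothetical stand-in for a cargo mutants run that ends with a non-zero status.
fake_cargo_mutants() { return 3; }

# Run in a subshell with errexit enabled, roughly as a CI step shell would.
# The '|| true' turns the failing command into a successful list, so errexit
# does not fire and the status the step would report is 0.
status=$(
  set -e
  fake_cargo_mutants || true
  echo "$?"
)
echo "status seen by the step: $status"
```

Note that in this sketch only the guarded command is protected. A separate line in the same script, such as the ulimit call, would still abort an errexit shell if it failed; in this PR that case would fall under continue-on-error: true.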

Acceptance criterion (from #590)

No rivet_core process exceeds the lean-mem MemoryMax (~48 G); zero kernel OOM kills attributable to rivet.

| Status | Note |
|---|---|
| ⏳ Partially addressable here | This PR caps each cargo-mutants worker process at ~48 G via RLIMIT_AS. With --jobs 2, the sum can still reach ~96 G, so this PR alone does not bound the cgroup-level total; that is what the infra MemoryMax cap will do. The two together meet the criterion. |
| 🔭 Operational verification | The criterion is verifiable only by the nightly mutants-core fan-out post-merge (the OOMs cluster ~once per nightly run on high-load days). Not testable inside one PR run. |
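
The budget reasoning in the status note can be written out as a small calculation. This is a sketch under the PR's own assumption of one capped process per job; the variable names are illustrative. RLIMIT_AS bounds virtual address space per process, not resident memory across the cgroup, so the figure is a worst-case ceiling rather than a prediction.

```shell
per_process_gib=48   # RLIMIT_AS cap set by ulimit -v in this PR
jobs=2               # --jobs 2 from the cargo mutants invocation
pool_cap_gib=48      # planned MemoryMax on the lean-mem pool (infra, not in this repo)

# Worst case if every job's process grows to its individual cap at once.
worst_case_gib=$(( per_process_gib * jobs ))
echo "worst case across jobs: ${worst_case_gib} G"

if [ "$worst_case_gib" -gt "$pool_cap_gib" ]; then
  echo "per-process cap alone does not bound the total; the cgroup cap is still needed"
fi
```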

Why draft: opening as a draft for two reasons:

  1. The author's own comment marks the repo-side change as optional and points at infra as the real fix; landing this should be coordinated with whoever owns the MemoryMax rollout.
  2. This run's mandatory pre-PR step — re-reading https://pulseengine.eu/blog/ for current process guidance — failed (the host returned HTTP 503 / expired TLS cert; I did not bypass verification). Mark ready for review once the blog is reachable and the cgroup cap is scheduled.

Test plan

  • Manually trigger mutants-core via workflow_dispatch (or wait for next nightly).
  • Confirm shards still complete within the 45 min budget (--jobs 2 is preserved, so wall-clock should be unchanged for non-runaway mutants).
  • Once the infra MemoryMax cap lands, confirm a runaway mutant aborts as timeout/error in mutants-out/ instead of triggering an entry in the host's kernel OOM log.
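
For the last check, one possible way to look for kernel OOM kills attributable to rivet is to filter the host's kernel log. The helper below is a hypothetical sketch, not part of this PR: it assumes the kernel log contains the usual "Out of memory: Killed process ..." wording and that the killed process name contains the fragment passed in.

```shell
# Count kernel OOM-kill lines that mention a given process-name fragment.
# Reads kernel log text on stdin and prints the number of matching lines.
count_oom_kills() {
  # $1: process-name fragment to look for, e.g. rivet_core
  # grep -c exits non-zero when the count is 0, so '|| true' keeps the
  # helper's own exit status at 0 in that case.
  grep -i 'out of memory' | grep -c -- "$1" || true
}
```

On the runner host this could be fed from the kernel log, for example `journalctl -k | count_oom_kills rivet_core` or `dmesg | count_oom_kills rivet_core` (reading the kernel log may require elevated privileges). A result of 0 over the nightly window would be consistent with the acceptance criterion.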

Generated by Claude Code

Adds the optional repo-side defense-in-depth from #590 (comment by
@avrabe): RLIMIT_AS=~48 G before cargo mutants in the rivet-core
shard. A runaway mutation can allocate ~100 G in seconds — faster
than the 30 s per-mutant timeout — so the kernel OOM-killer fires
first and can take down neighboring jobs on the lean-mem pool. With
this cap, the runaway aborts inside its own process (ENOMEM); the
shard records it as timeout/error and continue-on-error keeps the
gate green.

Primary fix is still the infra MemoryMax cgroup cap; the acceptance
criterion ("zero kernel OOM kills attributable to rivet") can only
be observed by the nightly mutants-core fan-out after this and the
cgroup cap both land.

Refs: #590
Refs: #509
codecov Bot commented Jun 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@github-actions github-actions Bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: e2d9b14 | Previous: 11db466 | Ratio |
|---|---|---|---|
| store_lookup/100 | 2190 ns/iter (± 4) | 1680 ns/iter (± 6) | 1.30 |
| store_lookup/1000 | 24729 ns/iter (± 65) | 19362 ns/iter (± 110) | 1.28 |
| store_by_type/100 | 146 ns/iter (± 0) | 87 ns/iter (± 0) | 1.68 |
| store_by_type/1000 | 145 ns/iter (± 1) | 87 ns/iter (± 2) | 1.67 |
| store_by_type/10000 | 145 ns/iter (± 0) | 87 ns/iter (± 0) | 1.67 |
| validate/10000 | 1265427817 ns/iter (± 14461010) | 914370645 ns/iter (± 5183624) | 1.38 |
| traceability_matrix/1000 | 60195 ns/iter (± 566) | 40796 ns/iter (± 521) | 1.48 |
| query/100 | 1143 ns/iter (± 14) | 837 ns/iter (± 4) | 1.37 |
| query/1000 | 15242 ns/iter (± 35) | 11489 ns/iter (± 47) | 1.33 |

This comment was automatically generated by workflow using github-action-benchmark.
