ci(mutants-core): cap per-process address space at ~48 G via ulimit -v (#590) #599
Draft
avrabe wants to merge 1 commit into
Adds the optional repo-side defense-in-depth from #590 (comment by @avrabe): RLIMIT_AS=~48 G before cargo mutants in the rivet-core shard.

A runaway mutation can allocate ~100 G in seconds, faster than the 30 s per-mutant timeout, so the kernel OOM-killer fires first and can take down neighboring jobs on the lean-mem pool. With this cap, the runaway aborts inside its own process (ENOMEM); the shard records it as timeout/error and continue-on-error keeps the gate green.

Primary fix is still the infra MemoryMax cgroup cap; the acceptance criterion ("zero kernel OOM kills attributable to rivet") can only be observed by the nightly mutants-core fan-out after this and the cgroup cap both land.

Refs: #590
Refs: #509
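As a rough illustration of the numbers in this message, the sketch below derives the `ulimit -v` argument from 48 GiB and the worst-case total for two parallel workers. It assumes a Linux runner and a shell where `ulimit -v` takes KiB (bash or dash). The variable names are hypothetical and are not taken from the workflow.

```shell
# 48 GiB expressed in KiB, the unit `ulimit -v` expects in bash and dash.
cap_kib=$((48 * 1024 * 1024))
echo "cap: $cap_kib KiB"

# Worst case with two parallel workers (--jobs 2): each process may grow to the cap.
jobs=2
worst_case_gib=$((jobs * 48))
echo "worst case: $worst_case_gib GiB"

# Apply the cap inside a subshell so it does not leak into the calling shell.
# Falls back to "unsupported" where the limit cannot be set (assumption: e.g. a
# lower hard limit is already in force, or the platform lacks RLIMIT_AS).
scoped_limit=$( (ulimit -v "$cap_kib" && ulimit -v) 2>/dev/null || echo "unsupported" )
echo "limit inside subshell: $scoped_limit"
```

The first value is the literal used in the workflow change (`50331648`). The second shows why the per-process cap alone does not bound the total for the shard.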
Codecov Report
✅ All modified and coverable lines are covered by tests.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.
| Benchmark suite | Current: e2d9b14 | Previous: 11db466 | Ratio |
|---|---|---|---|
| `store_lookup/100` | 2190 ns/iter (± 4) | 1680 ns/iter (± 6) | 1.30 |
| `store_lookup/1000` | 24729 ns/iter (± 65) | 19362 ns/iter (± 110) | 1.28 |
| `store_by_type/100` | 146 ns/iter (± 0) | 87 ns/iter (± 0) | 1.68 |
| `store_by_type/1000` | 145 ns/iter (± 1) | 87 ns/iter (± 2) | 1.67 |
| `store_by_type/10000` | 145 ns/iter (± 0) | 87 ns/iter (± 0) | 1.67 |
| `validate/10000` | 1265427817 ns/iter (± 14461010) | 914370645 ns/iter (± 5183624) | 1.38 |
| `traceability_matrix/1000` | 60195 ns/iter (± 566) | 40796 ns/iter (± 521) | 1.48 |
| `query/100` | 1143 ns/iter (± 14) | 837 ns/iter (± 4) | 1.37 |
| `query/1000` | 15242 ns/iter (± 35) | 11489 ns/iter (± 47) | 1.33 |
This comment was automatically generated by workflow using github-action-benchmark.
Refs #590.
What
In `.github/workflows/ci.yml`, the `mutants-core` shard now caps virtual address space via `ulimit -v 50331648` (~48 G) before invoking `cargo mutants`. Single-line change wrapped in a multi-line `run:`; the rest of the step is unchanged.

    + ulimit -v 50331648
      cargo mutants -p ${{ matrix.crate }} --shard ${{ matrix.shard }} --timeout 30 --jobs 2 --output mutants-out -- --lib || true

Why (per #590 re-diagnosis)
@avrabe's last comment on #590 ruled out an unmutated-source bug: the OOM is caused by a runaway mutant from `cargo-mutants` that allocates ~100 G in seconds, beating the 30 s per-mutant timeout. The kernel then OOM-kills system-wide and takes down neighboring jobs on the lean-mem pool.
The comment recommends two fixes:
- `MemoryMax` ~48 G cgroup cap on the `lean-mem` runner pool.
- `ulimit -v 50331648` before `cargo mutants`. ← this PR.

`continue-on-error: true` + the existing `|| true` mean a clipped mutant is still recorded as timeout/error rather than failing the gate. The carefully-tuned `--jobs 2` is preserved; its rationale (smithy operator tuning, 16 G/worker target) is still valid with the per-process cap in place.
Acceptance criterion (from #590)
- `RLIMIT_AS`. With `--jobs 2`, the sum can still reach ~96 G, so this PR alone does not bound the cgroup-level total; that's what the infra `MemoryMax` cap will do. The two together meet the criterion.
- `mutants-core` fan-out post-merge (the OOMs cluster ~once per nightly run on high-load days). Not testable inside one PR run.

Why draft: opening as a draft for two reasons:
- `MemoryMax` rollout.

Test plan
- `mutants-core` via `workflow_dispatch` (or wait for next nightly).
- `--jobs 2` is preserved, so wall-clock should be unchanged for non-runaway mutants.
- `MemoryMax` cap lands, confirm a runaway mutant aborts as `timeout/error` in `mutants-out/` instead of triggering an entry in the host's kernel OOM log.

Generated by Claude Code
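For the last test-plan item, one possible way to check the host's kernel log is to count OOM-killer victim lines. This is a minimal sketch, not part of this PR. The helper name `count_oom_kills` is hypothetical, the match pattern assumes the usual Linux `Killed process <pid>` wording, and the sample log text is made up purely to exercise the function.

```shell
# Count OOM-killer victim lines read from stdin.
# Assumption: the kernel reports each victim with "Killed process <pid>".
count_oom_kills() {
  # grep -c prints 0 and exits non-zero when nothing matches; keep the status clean.
  grep -cE 'Killed process [0-9]+' || true
}

# Made-up sample input (not a real log) used only to exercise the helper.
sample_log='[100.0] Out of memory: Killed process 4242 (example) total-vm:1kB
[101.0] some unrelated kernel message'

kills=$(printf '%s\n' "$sample_log" | count_oom_kills)
echo "OOM kills in sample: $kills"
```

On a runner host this could be fed from `dmesg` or `journalctl -k`, both of which may need elevated privileges. A count of zero over the nightly window would be consistent with the acceptance criterion.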