ci: adjust CPU and memory settings #3671
Conversation
When the node that your job runs on runs out of memory, Linux will kill something. Requesting more memory gives more headroom, but more generally, the request should be equal to the limit so the node is never overcommitted in the first place.
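Setting the request equal to the limit corresponds to Kubernetes' "Guaranteed" QoS class. A minimal sketch of what that looks like in a pod spec (the pod name and image are illustrative, not from this PR):

```yaml
# Illustrative pod spec, not part of this PR: when requests equal limits,
# the pod gets the "Guaranteed" QoS class, so the scheduler never places
# it on a node that cannot hold its worst-case usage.
apiVersion: v1
kind: Pod
metadata:
  name: ci-job            # hypothetical name
spec:
  containers:
    - name: build
      image: alpine:3.19  # placeholder image
      resources:
        requests:
          cpu: "3"
          memory: 6Gi
        limits:
          cpu: "3"        # equal to the request
          memory: 6Gi     # equal to the request: no overcommit headroom
```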
Codecov Report

✅ All modified and coverable lines are covered by tests.

```diff
@@            Coverage Diff             @@
##           master    #3671      +/-   ##
==========================================
- Coverage   62.22%   62.17%   -0.06%
==========================================
  Files         141      141
  Lines       13352    13352
  Branches     1746     1746
==========================================
- Hits         8308     8301       -7
- Misses       4253     4260       +7
  Partials      791      791
```

See 2 files with indirect coverage changes.
Quoted lines from the CI configuration under review:

```yaml
- if [ -f /sbin/apk ] && [ $(uname -m) = "aarch64" ]; then ln -sf ../lib/llvm17/bin/clang /usr/bin/clang; fi
```

```yaml
- cd profiling
- 'echo "nproc: $(nproc)"'
```
Note that this reports quite a large number, such as 66. This means we should be setting the number of workers for parallel systems explicitly rather than letting them discover it!
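One way to pin the worker count is to derive it from the CPU allocation the job actually gets, rather than from `nproc`, which reports the host node's core count. A sketch of a `.gitlab-ci.yml` fragment (the job name and `make` invocation are assumptions, not from this PR):

```yaml
# Sketch: tie parallelism to the requested CPUs, not to nproc.
build-profiling:               # hypothetical job name
  variables:
    KUBERNETES_CPU_REQUEST: 3
    KUBERNETES_CPU_LIMIT: 3
  script:
    - cd profiling
    # nproc would report every core on the node (e.g. 66); use the
    # CPU allocation we asked for as the worker count instead.
    - make -j"${KUBERNETES_CPU_LIMIT}"
```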
```yaml
KUBERNETES_CPU_REQUEST: 3
KUBERNETES_CPU_LIMIT: 3
KUBERNETES_MEMORY_REQUEST: 6Gi
KUBERNETES_MEMORY_LIMIT: 6Gi
```
These numbers look good in one of the dashboards. Note that we are really memory-bound, and for parts of the run we only engage a single core, so there's not a lot of value in increasing CPUs at the moment.
5 GiB is too small, I've observed 4.9 GiB in practice.
Description
K8s schedules based on REQUEST, but a job is allowed to run up to LIMIT. So a job that requests 3 GiB of memory with a 4 GiB limit is allowed to grow to 4 GiB. On a whole-node basis, if enough jobs do this, the node can run out of memory and the kernel starts OOM-killing processes. Note that for CPU, the limit is not enforced unless the request is equal to the limit (though I'm not sure that's actually happening either; it's what it's supposed to be doing).
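With the GitLab Runner Kubernetes executor, the `KUBERNETES_*` job variables override the resources of the build pod the runner creates. Roughly, the resulting spec for the overcommitted case described above would look like this (a sketch with illustrative values, not the pod the runner literally emits):

```yaml
# Approximate effect of KUBERNETES_MEMORY_REQUEST: 3Gi and
# KUBERNETES_MEMORY_LIMIT: 4Gi on the generated build pod:
resources:
  requests:
    memory: 3Gi   # the scheduler packs jobs onto nodes using this number
  limits:
    memory: 4Gi   # the job may grow to this before it is OOM-killed
```

Because the scheduler only accounts for the 3 GiB request, several such jobs can land on one node and collectively exceed its physical memory.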
These job families failed for me in PRs, which is why I adjusted these specific jobs. Note that in the case of appsec, the job did not reach its own limit; rather, the node it was running on became full.
I am increasing the limits based on feedback from a coworker who supports our CI infrastructure.
Reviewer checklist