[CI] diff driven test selection#8077
Conversation
Signed-off-by: Stas Bekman <stas@stason.org>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f84aec4ea5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: Stas Bekman <stas@stason.org>
|
this is absolutely hilarious - why do we even bother to be in the loop - AI CI flags problems another AI then fixes them, I push the change. We are going to be so replaced really soon now. I guess it's important that the CI and the code writing aren't done by the same model, otherwise it'd be pointless to check quality and detect problems if it's the same entity. |
|
@stas00 May I ask why we are trying to replicate HF Transformers' system? It would be great to learn from your insights on this:) |
There was a problem hiding this comment.
Pull request overview
This PR introduces a diff-driven test selection mechanism for the modal-torch-latest GPU CI workflow to reduce CI runtime by running only tests/unit/v1 tests that are (statically or dynamically) impacted by a PR’s changes, with conservative fail-safe fallbacks to running the full suite.
Changes:
- Add a new selector (
ci/tests_fetcher.py) that computes impacted tests from the PR diff via a repo-wide import graph (+ curated dynamic edges), emittingmode=all|subset|noneand a test target list artifact. - Update the modal runner (
ci/torch_latest.py) and workflow (.github/workflows/modal-torch-latest.yml) to consume the generated test target list and gate the deploy job accordingly. - Add docs and self-tests for maintainability (
.github/workflows/TEST_SELECTION*.md,ci/test_tests_fetcher.py, CONTRIBUTING update, and gitignore for selector outputs).
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
CONTRIBUTING.md |
Documents how diff-based test selection works and how to debug/override it. |
ci/torch_latest.py |
Reads selected test targets and passes them to pytest (fallback to full suite). |
ci/tests_fetcher.py |
New diff/import-graph-based test selection engine + GitHub Actions outputs/summary. |
ci/test_tests_fetcher.py |
New stdlib self-tests that validate selector behavior on synthetic repos. |
.gitignore |
Ignores selector output directory ci/.test_selection/. |
.github/workflows/TEST_SELECTION.md |
Design/usage/security documentation for the selector system. |
.github/workflows/TEST_SELECTION_STATE.md |
Implementation handoff/state notes for ongoing maintenance. |
.github/workflows/modal-torch-latest.yml |
Adds collect-tests job, artifacts, and deploy gating based on selection mode. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * **Testable.** All repo/config state lives on ``TestSelector`` (constructed with | ||
| an arbitrary ``repo_root``), so the logic can be unit-tested against synthetic | ||
| repos -- see ``tests/unit/test_tests_fetcher.py``. | ||
|
|
| status = parts[0] | ||
| if status.startswith("R") and len(parts) >= 3: | ||
| deleted.append(parts[1]) | ||
| changed.append(parts[2]) | ||
| elif status.startswith("D"): | ||
| deleted.append(parts[1]) | ||
| elif len(parts) >= 2: | ||
| changed.append(parts[1]) |
|
|
||
| @staticmethod | ||
| def _reachable_with_parents(seed: Path, reverse: dict[Path, set[Path]]) -> dict[Path, Path | None]: | ||
| """BFS from ``seed`` recording a predecessor for each node (for --explain).""" |
| # NOTE: under pull_request_target this runs PR-authored code; the modal/HF | ||
| # secrets live on this runner, which is why the workflow is gated on | ||
| # review_requested / ready_for_review (a maintainer vets the PR first). The | ||
| # CI orchestration itself is restored from base in the next step so a PR | ||
| # can't repoint it at the secrets. |
|
@stas00 I think it's a great idea to trigger different CI tests based on the PRs. I am actually wondering how accurate the algorithms can be. For example, what's the potential risk of under-testing or over-testing? |
|
HF Transformers has been using it for many years now, this is the same code (from their repo) adapted to DeepSpeed. You can make it much more conservative and run more tests. What Claude also proposed is to always run the full test suite either on PR merge or have a nightly run that always does the full thing. So in the worse case scenario you'd catch a problem at most 24h later - so basically make no new release until you have run the full test suite. This covers all bases, but waiting for sometimes days to get a ready PR merged because of flakey tests and meanwhile someone merging a PR and needing to start from scratch is a madness - and when we run out of modal.com free credits who will pay for CI? I'm going to make another version for the new Deepspeed-based RL training repo we are working on https://github.com/Snowflake-AI-Research/Arctic-Platform since there are too many tests already, so I will be testing it out there should the DS maintainers not wanting to take this very small risk. A PR there is at Snowflake-AI-Research/Arctic-Platform#14 |
What do you mean? We aren't replicating their system, we are trying to reuse a sub-system that can reduce the CI run time+$$ by 95+%. It's been on my list of things I wanted to port since forever, since it'd be so time consuming, but with a good AI model this is now a matter of a few hours of tokens. If you'd like to write one from scratch that's a totally valid option. |
TLDR: Analyze PR's diff and filter out tests that aren't exercising what has changed, potentially cutting down runtime and expense by 95-99% most of the time.
Detailed:
Deepspeed's CI takes forever - most of the time burning $$ and wastes dev time for no reason as most changes require just a few tests to run.
HF Transformers has a system to select which tests to run based on the diff of the PR - Sylvain Gugger wrote it many years ago since that repo has now probably thousands of tests. Deepspeed's CI isn't too bad but can easily take hours.
So I asked Claude Opus 4.8 to replicate the system for Deepspeed. Please have a look. It looks super complicated, so I'm not sure how easy it'd be to maintain/operate unless we always use AI to continue maintaining it. I asked Claude to leave a detailed state file so that it or another model could pick it up where it left.
I started with just the slowest costliest workload
.github/workflows/modal-torch-latest.ymlto see if it works well. If you're happy we can replicate it to the rest of the workloads.CC: @loadams, @tjruwase - please tag others if you think they would be helpful to discuss this.