ci: resilient HF pre-download + offline tests (fix main CI flake) #75
Merged
Root cause of the main-branch CI failure (run 28495801728, Python 3.10, 2026-07-01): the "Pre-download embedding model" step had no retry and `continue-on-error: true`, so a transient huggingface.co blip left the HF cache empty on that one matrix leg. The offline-unaware test suite then re-fetched the model at runtime and cascaded into 58 spurious "couldn't connect to huggingface.co" failures while the model itself was fine (3.11/3.12/3.13 legs, which got the cache, all passed).

Fix, applied to all three pre-download sites (test, test-sqlite, test-windows):

- Retry the download 5× with linear backoff so a transient blip self-heals.
- Drop `continue-on-error` so a genuine persistent failure surfaces at the download step instead of cascading into a misleading test failure.
- Set HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE on the test-run steps: the model is already cached by the step above, so tests never touch the network mid-suite, which keeps runs deterministic and flake-free.
- Windows pre-download runs under `shell: bash` for the retry loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01UwnQrVh2tnNMWJabhAQgaN
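For reviewers who want to see the retry shape in isolation, here is a minimal Python sketch of "retry 5× with linear backoff". It is not the workflow's actual bash loop: the function name `retry_with_linear_backoff`, the `base_delay` value, and the injected `sleep` parameter are all assumptions made for illustration. The commit message states the attempt count (5) and the backoff shape (linear), but not the delay.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_linear_backoff(
    action: Callable[[], T],
    attempts: int = 5,          # 5 matches the commit message
    base_delay: float = 10.0,   # assumed value; the PR does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` is exhausted.

    After the n-th failed attempt, wait base_delay * n seconds (linear backoff).
    The last failure is re-raised, so a persistent outage fails loudly.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise  # no continue-on-error equivalent: surface the failure
            delay = base_delay * attempt
            print(f"attempt {attempt}/{attempts} failed ({exc}); retrying in {delay:g}s")
            sleep(delay)
    raise AssertionError("unreachable: the loop always returns or raises")
```

Re-raising on the final attempt mirrors the decision to drop `continue-on-error`: a transient blip is absorbed by the earlier attempts, while a persistent failure stops at the download step.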
Problem
`main` CI failed on run 28495801728: 58 tests failed on the Python 3.10 matrix leg only, all with a "couldn't connect to huggingface.co" error. The other legs (3.11/3.12/3.13) passed. This is a transient-network / cache-cold flake, not a code defect: the model (`all-MiniLM-L6-v2`) is fine.

Root cause
The `Pre-download embedding model` step had no retry and `continue-on-error: true`. When huggingface.co blipped on the 3.10 runner, the step was masked as ✓ but the HF cache stayed empty. The test suite, which is not offline-aware, then tried to re-fetch the model at runtime and cascaded into 58 misleading failures.

Fix (all three pre-download sites: `test`, `test-sqlite`, `test-windows`)
- Retry the download 5× with linear backoff → a transient blip self-heals.
- Drop `continue-on-error` → a genuine persistent failure surfaces clearly at the download step, not as a confusing test cascade.
- Set `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` on the test-run steps → the model is already cached, so tests never touch the network mid-suite (deterministic, flake-free).
- Windows pre-download runs under `shell: bash` for the retry loop.

Verification
yaml.safe_load).

🤖 Generated with Claude Code
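To illustrate the offline-test part of the fix, the sketch below builds an environment in which the Hugging Face libraries are told not to use the network, which is the effect of setting `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE` on the test-run steps. The helper name `offline_test_env` is hypothetical, and the value `"1"` is an assumption: the PR says the variables are set but does not show their values.

```python
import os
from typing import Dict, Mapping, Optional


def offline_test_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `base` (default: the current environment) with
    the Hugging Face offline switches enabled."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"        # assumed value; tells huggingface_hub to stay offline
    env["TRANSFORMERS_OFFLINE"] = "1"  # assumed value; same for transformers
    return env


# Hypothetical usage: run the suite against the already-populated cache.
#   import subprocess
#   subprocess.run(["pytest"], env=offline_test_env(), check=True)
```

With these set, a cold cache should show up as an immediate, local "not found in cache" style failure rather than a network error partway through the suite, which is why the pre-download step has to succeed first.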