diff --git a/errors/caching-artifacts/setup-go-gomod-toolchain-directive-mismatch-tar-conflict.yml b/errors/caching-artifacts/setup-go-gomod-toolchain-directive-mismatch-tar-conflict.yml
new file mode 100644
index 0000000..accc38c
--- /dev/null
+++ b/errors/caching-artifacts/setup-go-gomod-toolchain-directive-mismatch-tar-conflict.yml
@@ -0,0 +1,138 @@
+id: ca-152
+title: '`actions/setup-go` Cache Restore Fails With "Cannot open: File exists" When `go.mod` Has Mismatched `go` and `toolchain` Directives'
+category: caching-artifacts
+severity: warning
+tags:
+  - setup-go
+  - cache
+  - tar
+  - go-mod
+  - toolchain-directive
+  - go-1-21
+  - file-exists
+  - go-toolchain
+patterns:
+  - regex: 'golang\.org/toolchain@.*Cannot open: File exists'
+    flags: 'i'
+  - regex: 'Cannot open: File exists.*toolchain@v0\.0\.1-go'
+    flags: 'i'
+  - regex: 'Warning: Failed to restore.*tar.*exit code 2'
+    flags: 'i'
+  - regex: 'go/pkg/mod/golang\.org/toolchain@.*Cannot open'
+    flags: 'i'
+error_messages:
+  - "/usr/bin/tar: ../../../go/pkg/mod/golang.org/toolchain@v0.0.1-go1.21.1.linux-amd64/...: Cannot open: File exists"
+  - "/usr/bin/tar: Exiting with failure status due to previous errors"
+  - "Warning: Failed to restore: \"/usr/bin/tar\" failed with error: The process '/usr/bin/tar' failed with exit code 2"
+root_cause: |
+  When a `go.mod` file specifies both a `go` version and a **different**
+  `toolchain` version (a Go 1.21+ feature), the `go` command, under the
+  default `GOTOOLCHAIN=auto`, downloads and switches to the requested
+  toolchain during the first `go` invocation in the workflow:
+
+  ```
+  go 1.21.0
+  toolchain go1.21.1
+  ```
+
+  On the **first** `actions/setup-go` cache-enabled run:
+  1. `setup-go` installs Go 1.21.0 (the `go` directive version).
+  2. Go detects the `toolchain go1.21.1` directive and auto-downloads toolchain
+     1.21.1 from the module proxy, writing files to:
+     `$GOPATH/pkg/mod/golang.org/toolchain@v0.0.1-go1.21.1.linux-amd64/`
+  3. `setup-go`'s post-step saves the module cache (which now includes the
+     downloaded toolchain files) to the Actions cache.
+
+  On **subsequent** runs:
+  1. `setup-go` again installs Go 1.21.0, then runs `go env GOMODCACHE` and
+     `go env GOCACHE` to locate the directories to restore. That `go`
+     invocation honors the `toolchain` directive and downloads toolchain
+     1.21.1 into the module cache **before** the restore runs.
+  2. `setup-go`'s cache restore step then extracts the saved archive, which
+     contains the same toolchain files. Because Go keeps module-cache
+     directories read-only, `tar` cannot replace the files already on disk and
+     fails with "Cannot open: File exists" (exit code 2).
+  3. A warning is emitted; the job continues but the cache restore is partial.
+
+  This is distinct from the Go 1.23 telemetry issue (`golang.org/x/telemetry`)
+  and the double-invocation issue — the trigger here is the `toolchain`
+  directive in `go.mod` causing automatic toolchain downloads that race with
+  cache restore.
+fix: |
+  **Option 1**: Upgrade to `actions/setup-go@v5` or later, which improved
+  handling of the `toolchain` directive. If the warning still appears after
+  upgrading, combine it with Option 3.
+
+  **Option 2**: Align the `go` and `toolchain` directives in `go.mod` so they
+  reference the same version, preventing the auto-download entirely.
+
+  **Option 3 (recommended)**: Disable Go's toolchain auto-download by setting
+  `GOTOOLCHAIN=local` in the workflow environment. This prevents Go from
+  downloading a different toolchain version and avoids the file-exists conflict.
+  Builds then use the Go version `setup-go` installed (from the `go` line), not
+  the version named on the `toolchain` line.
+
+  **Option 4**: Pre-delete the conflicting toolchain directory before
+  `setup-go` restores the cache. This only helps when the directory already
+  exists before `setup-go` runs, for example on self-hosted runners that keep
+  `~/go` between jobs.
+fix_code:
+  - language: yaml
+    label: 'Recommended: set GOTOOLCHAIN=local to prevent auto-download'
+    code: |
+      env:
+        GOTOOLCHAIN: local  # Prevents Go from auto-downloading toolchain per go.mod directive
+
+      jobs:
+        build:
+          runs-on: ubuntu-latest
+          steps:
+            - uses: actions/checkout@v4
+
+            - uses: actions/setup-go@v5
+              with:
+                go-version-file: go.mod
+                cache: true
+
+            - run: go build ./...
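+  - language: yaml
+    label: 'Detection sketch (hypothetical step): warn when go.mod go and toolchain lines differ'
+    code: |
+      # Illustrative only, not part of actions/setup-go: the step name and message
+      # are made up for this example. Assumes go.mod is at the repository root.
+      # It compares the two directives as plain strings, so it can also flag a
+      # toolchain line older than the installed Go, which does not trigger a download.
+      - name: Warn on mismatched go/toolchain directives
+        run: |
+          go_ver=$(awk '$1 == "go" { print $2 }' go.mod)
+          tc_ver=$(awk '$1 == "toolchain" { print $2 }' go.mod)
+          # A toolchain line newer than the running Go is downloaded unless GOTOOLCHAIN=local
+          if [ -n "$tc_ver" ] && [ "$tc_ver" != "go${go_ver}" ]; then
+            echo "::warning file=go.mod::go ${go_ver} vs toolchain ${tc_ver}: expect a toolchain download and possible tar 'File exists' cache warnings"
+          fi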
+  - language: yaml
+    label: 'Alternative: align go and toolchain versions in go.mod'
+    code: |
+      # In go.mod — use the same version for both directives to avoid auto-download
+      # BEFORE (causes toolchain auto-download):
+      #   go 1.21.0
+      #   toolchain go1.21.1
+      #
+      # AFTER (no auto-download needed, no tar conflict):
+      #   go 1.21.1
+      #   toolchain go1.21.1
+
+      # Run locally to update:
+      #   go get go@1.21.1
+  - language: yaml
+    label: 'Workaround: clean toolchain cache dir before setup-go'
+    code: |
+      - name: Remove pre-existing toolchain cache to prevent tar conflict
+        run: |
+          # Fixed default-GOPATH path (`go env` here could itself trigger the download); module-cache dirs are read-only
+          chmod -R u+w "$HOME/go/pkg/mod/golang.org" 2>/dev/null || true
+          rm -rf "$HOME"/go/pkg/mod/golang.org/toolchain@*
+
+      - uses: actions/setup-go@v5
+        with:
+          go-version-file: go.mod
+          cache: true
+prevention:
+  - 'Set `GOTOOLCHAIN=local` in workflow-level `env` when using `go.mod` with a `toolchain` directive; this prevents auto-download races with cache restore'
+  - 'Prefer `actions/setup-go@v5` or later, which has improved handling of the `toolchain` directive, and keep `GOTOOLCHAIN=local` if the warning persists'
+  - 'If you intentionally use different `go` and `toolchain` versions in `go.mod`, expect cache restore warnings from the second run onward and set `GOTOOLCHAIN=local` to prevent them'
+  - 'Run `go version` inside the module before committing a `go.mod` with mismatched directives; it reports the toolchain the `go` command actually selects, whereas `go env GOTOOLCHAIN` only shows the policy (for example `auto`)'
+docs:
+  - url: 'https://github.com/actions/setup-go/issues/424'
+    label: 'actions/setup-go#424 — Tar errors on cache restore after toolchain installation (13 reactions)'
+  - url: 'https://go.dev/doc/toolchain'
+    label: 'Go Toolchain documentation — go and toolchain directives in go.mod'
+  - url: 'https://pkg.go.dev/cmd/go#hdr-Go_toolchain_selection'
+    label: 'Go toolchain selection — GOTOOLCHAIN env var behavior'
+  - url: 'https://github.com/actions/setup-go'
+    label: 'actions/setup-go README'
diff --git a/errors/caching-artifacts/stale-state-cache-pagination-miss.yml b/errors/caching-artifacts/stale-state-cache-pagination-miss.yml
new file mode 100644
index 0000000..c704854
--- /dev/null
+++ b/errors/caching-artifacts/stale-state-cache-pagination-miss.yml
@@ -0,0 +1,100 @@
+id: ca-151
+title: '`actions/stale` State Cache Missed Due to Pagination — Processing Restarts From First Issue'
+category: caching-artifacts
+severity: warning
+tags:
+  - actions-stale
+  - cache
+  - pagination
+  - state
+  - operations-per-run
+  - first-page
+  - checkpoint
+patterns:
+  - regex: 'The saved state was not found.*process starts from the first issue'
+    flags: 'i'
+  - regex: 'Unable to reserve cache with key _state.*another job may be creating this cache'
+    flags: 'i'
+  - regex: 'Failed to save.*cache entry with the same key.*scope already exists'
+    flags: 'i'
+  - regex: 'Cache already exists.*Scope.*Key: _state'
+    flags: 'i'
+error_messages:
+  - "The saved state was not found, the process starts from the first issue."
+  - "Failed to save: Unable to reserve cache with key _state, another job may be creating this cache. More details: Cache already exists. Scope: refs/heads/master, Key: _state, Version: ..."
+  - "Received non-retryable error: Failed request: (409) Conflict: cache entry with the same key, version, and scope already exists"
+root_cause: |
+  `actions/stale` uses the GitHub Actions cache API to persist its processing
+  checkpoint between runs (stored under the key `_state`). To check whether
+  this cache entry already exists before a run, the action calls
+  `checkIfCacheExists`, which lists the repository's caches using the default
+  page size of 30 entries — **without filtering by key or ref**.
+
+  If a repository accumulates more than 30 cache entries (common on active
+  repos with many matrix jobs, dependency caches, or build caches), the
+  `_state` entry may be pushed to page 2 or later of the API results. The
+  action only inspects the first page, so `_state` is treated as absent.
+
+  As a result, on each run:
+  1. The state is reported missing, so processing restarts from the first issue
+     instead of continuing from the saved checkpoint.
+  2. At the end of the run, when the action tries to save the new state,
+     a 409 Conflict is returned because the `_state` cache key already exists
+     from the previous run (it was not found, so it was never deleted first).
+
+  This bug was reported in actions/stale#1136. The underlying fix requires
+  passing `key` and `ref` query parameters to the list-caches API call so
+  only the relevant cache entry is checked, regardless of pagination.
+fix: |
+  **Short-term workaround** — manually delete the `_state` cache entry from
+  the repository's Actions cache list before the next run:
+  1. In the repository's **Actions** tab, open **Caches** and delete any `_state` entry.
+  2. The next `actions/stale` run will start fresh without the 409 conflict.
+
+  **Medium-term** — pin to the community fork that has the pagination fix:
+  `itchyny/actions-stale` includes a correct `checkIfCacheExists` implementation.
+
+  **Long-term** — upgrade `actions/stale` when a release incorporating the
+  PR #1152 pagination fix is published.
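+
+  **Checking whether a repository is affected**: a sketch using the `gh` CLI,
+  assuming it is authenticated and run inside a clone so that `gh api` fills
+  in `{owner}/{repo}` from the current repository. The list-caches response
+  includes `total_count`; a value above 30 means `_state` can fall outside
+  the first page that `checkIfCacheExists` reads. Filtering by `key` looks
+  the entry up directly, so pagination no longer hides it:
+
+  ```
+  # Total number of caches (the unfiltered listing returns 30 per page by default)
+  gh api 'repos/{owner}/{repo}/actions/caches' --jq '.total_count'
+
+  # Look up the checkpoint entry directly by key
+  gh api 'repos/{owner}/{repo}/actions/caches?key=_state' --jq '.actions_caches[] | {id, key, ref}'
+  ```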
+fix_code:
+  - language: yaml
+    label: 'Workaround: delete _state cache via GitHub CLI before stale runs (avoids the 409, but discards the checkpoint)'
+    code: |
+      # Add before actions/stale (token needs `actions: write`); ?key=_state avoids the first-page limit behind this bug
+      - name: Clear stale action state cache
+        env:
+          GH_TOKEN: ${{ github.token }}
+        run: |
+          gh api "repos/${{ github.repository }}/actions/caches?key=_state" \
+            --jq '.actions_caches[] | select(.key == "_state") | .id' \
+            | xargs -I{} gh api --method DELETE \
+              repos/${{ github.repository }}/actions/caches/{}
+  - language: yaml
+    label: 'Workaround: pin to community fork with pagination fix'
+    code: |
+      - uses: itchyny/actions-stale@0980a21d84c23bd4d8c62b0958f47f25822286f2
+        with:
+          repo-token: ${{ github.token }}
+          days-before-stale: 60
+          days-before-close: 7
+          operations-per-run: 30
+  - language: yaml
+    label: 'Long-term: watch for actions/stale release with PR #1152 fix'
+    code: |
+      # When actions/stale releases a version including PR #1152, upgrade:
+      - uses: actions/stale@v10  # bump to the version that includes PR #1152
+        with:
+          repo-token: ${{ github.token }}
+          days-before-stale: 60
+          days-before-close: 7
+prevention:
+  - 'If a repository has many cache entries (matrix jobs, large mono-repos), periodically clean up unused caches to keep the total count under 30 to avoid stale state pagination issues'
+  - 'Monitor `actions/stale` runs for the "saved state was not found" message — it indicates the checkpoint is being ignored and processing is restarting unnecessarily'
+  - 'Set `operations-per-run` high enough that stale can complete in a single run, eliminating the need for cross-run state persistence'
+docs:
+  - url: 'https://github.com/actions/stale/issues/1136'
+    label: 'actions/stale#1136 — State restoration fails if a repo has many caches'
+  - url: 'https://github.com/actions/stale/pull/1152'
+    label: 'actions/stale#1152 — Fix checkIfCacheExists to use key and ref filters'
+  - url: 'https://docs.github.com/en/rest/actions/cache#list-github-actions-caches-for-a-repository'
+    label: 'GitHub REST API — list caches for a repository'
diff --git a/errors/silent-failures/download-artifact-v4-azure-blob-timeout-silent-success.yml b/errors/silent-failures/download-artifact-v4-azure-blob-timeout-silent-success.yml
new file mode 100644
index 0000000..3c15045
--- /dev/null
+++ b/errors/silent-failures/download-artifact-v4-azure-blob-timeout-silent-success.yml
@@ -0,0 +1,128 @@
+id: sf-230
+title: '`actions/download-artifact` v4 Azure Blob Request Timeout — Step Exits 0 Despite Download Failure'
+category: silent-failures
+severity: silent-failure
+tags:
+  - download-artifact
+  - azure-blob-storage
+  - timeout
+  - silent-failure
+  - intermittent
+  - matrix
+  - v4
+patterns:
+  - regex: 'Unable to download and extract artifact: Request timeout'
+    flags: 'i'
+  - regex: 'Unable to download artifact\(s\):.*Request timeout'
+    flags: 'i'
+  - regex: 'Unexpected HTTP response from blob storage: 503'
+    flags: 'i'
+  - regex: 'Unable to download and extract artifact: Unexpected HTTP response from blob storage'
+    flags: 'i'
+error_messages:
+  - "Error: Unable to download artifact(s): Unable to download and extract artifact: Request timeout: /actions-results/..."
+  - "Error: Unable to download artifact(s): Unable to download and extract artifact: Unexpected HTTP response from blob storage: 503 The server is busy."
+root_cause: |
+  `actions/download-artifact@v4` moved to a new artifact backend that transfers
+  artifact contents directly to and from Azure Blob Storage (note the
+  `/actions-results/` path in the error URLs). During the v4 early rollout
+  (late 2023–early 2024), this backend exhibited intermittent connection
+  timeouts and 503 responses under load, particularly in matrix jobs where
+  multiple jobs simultaneously download the same artifact.
+
+  The silent-failure aspect: when the download times out, the step may report
+  **exit code 0** (success) instead of failing the job. At least one report in
+  the linked issue describes the step succeeding regardless of whether the
+  artifact was actually downloaded, leaving downstream steps operating on a
+  missing or empty artifact directory without any workflow failure signal.
+
+  Additional HTTP errors observed from the same backend instability:
+  - `503 The server is busy` — Azure Blob Storage overloaded
+  - `409 Public access is not permitted on this storage account` — wrong
+    endpoint or storage account configuration for the new v4 backend
+
+  These are transient infrastructure errors on GitHub's side, not caused by
+  workflow misconfiguration. The issue was most prevalent in December 2023
+  when v4 was first released, but intermittent timeouts continue to occur
+  under high concurrency.
+fix: |
+  1. **Retry on failure**: Re-run the affected job. The errors are transient
+     and typically clear on retry.
+
+  2. **Explicit failure check**: After downloading, verify the expected files
+     actually exist before proceeding. This turns a silent failure into a
+     visible job failure.
+
+  3. **Reduce concurrent downloads**: If many matrix jobs download the same
+     artifact simultaneously, stagger them with `max-parallel` to reduce
+     pressure on the storage backend.
+
+  4. **Version consistency**: Ensure upload and download actions use the
+     same major version (v4 with v4, v3 with v3). The v4 actions cannot read
+     artifacts uploaded by v3 or earlier (and vice versa), so a mismatch makes
+     the download fail even when the backend is healthy.
+fix_code:
+  - language: yaml
+    label: 'Add file existence verification after download to catch silent failures'
+    code: |
+      - name: Download artifact
+        uses: actions/download-artifact@v4
+        with:
+          name: my-artifact
+          path: dist/
+
+      # Explicitly verify download succeeded — surfaces silent timeout failures
+      - name: Verify artifact download
+        run: |
+          # dist/expected-file.txt is a placeholder; check for a file your build always produces
+          if [ ! -f dist/expected-file.txt ]; then
+            echo "::error::Artifact download failed silently — expected file missing"
+            exit 1
+          fi
+  - language: yaml
+    label: 'Limit parallel matrix downloads to reduce backend pressure'
+    code: |
+      jobs:
+        test:
+          runs-on: ${{ matrix.os }}
+          strategy:
+            matrix:
+              os: [ubuntu-latest, windows-latest, macos-latest]
+            max-parallel: 2  # Stagger artifact downloads across matrix legs
+          steps:
+            - uses: actions/download-artifact@v4
+              with:
+                name: build-output
+                path: dist/
+  - language: yaml
+    label: 'Ensure version consistency between upload and download'
+    code: |
+      jobs:
+        build:
+          runs-on: ubuntu-latest
+          steps:
+            - uses: actions/upload-artifact@v4  # ✅ v4
+              with:
+                name: my-artifact
+                path: dist/
+
+        test:
+          needs: build
+          runs-on: ubuntu-latest
+          steps:
+            - uses: actions/download-artifact@v4  # ✅ must match upload version
+              with:
+                name: my-artifact
+                path: dist/
+prevention:
+  - 'Always verify artifact file existence after download in critical workflows — do not assume exit 0 means the file is present'
+  - 'Pin upload and download artifact action versions to the same major version to prevent backend API mismatches'
+  - 'Use `max-parallel` on matrix jobs that download artifacts to avoid thundering-herd pressure on blob storage'
+  - 'Do not rely on `continue-on-error: false` (already the default) to catch this; add a post-download verification step to detect silent timeouts early'
+docs:
+  - url: 'https://github.com/actions/download-artifact/issues/249'
+    label: 'actions/download-artifact#249 — Unable to download and extract artifact: Request timeout (115 reactions)'
+  - url: 'https://github.com/actions/download-artifact/blob/main/docs/MIGRATION.md'
+    label: 'actions/download-artifact migration guide — v3 to v4 breaking changes'
+  - url: 'https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-workflow-data-as-artifacts'
+    label: 'GitHub Docs — Storing workflow data as artifacts'