Prevent shared GitHub GraphQL bucket exhaustion across agent servers #544

Merged
bborn merged 2 commits into main from task/3711-prevent-shared-github-graphql-bucket-exh on May 29, 2026
bborn (Owner) commented on May 29, 2026

Problem

Agents intermittently fail with "GraphQL bucket is exhausted" and fall back to REST.

Root cause (verified 2026-05-29): GitHub's GraphQL rate limit (5,000 points/hr) is PER-USER, not per-token. Multiple TaskYou agent servers all authenticate gh as the same personal account (bborn), so they share one bucket. GraphQL-backed gh pr commands — especially gh pr checks polling loops — collectively drain it. Servers using a GitHub App (bot) identity like offerlab-agents[bot] get their own independent bucket and don't contend.

What this PR does

  1. ty doctor — a new diagnostic command that inspects the local gh auth and GraphQL headroom and warns about the conditions that cause contention:

    • gh not installed / not logged in (GitHub ops silently fail)
    • expired/revoked token (401 Bad credentials)
    • authentication as a personal account (shared per-user bucket)
    • low remaining GraphQL headroom
  2. internal/github/auth.go: CheckAuth() probes gh api user and gh api rate_limit, classifies personal vs. GitHub App/bot identities, and detects logged-out/expired tokens. Findings() converts that result into findings ordered by severity (reused by ty doctor).

  3. Agent guidance — agent system instructions now tell agents to prefer REST for PR reads (separate 5k bucket) and to use gh run watch / REST check-runs with backoff instead of busy-polling gh pr checks.
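
A minimal sketch of the classification and findings logic described in items 1 and 2, not the code in internal/github/auth.go. The type and function names (AuthStatus, Finding, isBot, findings) are hypothetical. The "[bot]" login suffix and "Bot" account type are assumed heuristics. The 500-point warning threshold is taken from the second commit message.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Severity orders findings; higher values are more serious.
type Severity int

const (
	Info Severity = iota
	Warn
	Error
)

// AuthStatus is a hypothetical summary of what the two probes returned.
type AuthStatus struct {
	LoggedIn         bool
	Login            string // "login" field from `gh api user`
	AccountType      string // "type" field from `gh api user`, e.g. "User" or "Bot"
	GraphQLRemaining int    // resources.graphql.remaining from `gh api rate_limit`
}

// Finding is one diagnostic line for the operator.
type Finding struct {
	Severity Severity
	Message  string
}

// lowHeadroom is the operator-warn threshold named in the commit message.
const lowHeadroom = 500

// isBot assumes a GitHub App/bot identity is recognizable by its account
// type or by the conventional "[bot]" login suffix.
func isBot(login, accountType string) bool {
	return accountType == "Bot" || strings.HasSuffix(login, "[bot]")
}

// findings converts an AuthStatus into findings, most severe first.
func findings(s AuthStatus) []Finding {
	if !s.LoggedIn {
		return []Finding{{Error, "gh is not logged in; GitHub operations will fail"}}
	}
	var out []Finding
	if isBot(s.Login, s.AccountType) {
		out = append(out, Finding{Info, fmt.Sprintf("authenticated as bot %q (independent bucket)", s.Login)})
	} else {
		out = append(out, Finding{Warn, fmt.Sprintf("authenticated as personal account %q (shared per-user bucket)", s.Login)})
	}
	if s.GraphQLRemaining < lowHeadroom {
		out = append(out, Finding{Warn, fmt.Sprintf("low GraphQL headroom: %d points remaining", s.GraphQLRemaining)})
	}
	// Stable sort keeps insertion order among findings of equal severity.
	sort.SliceStable(out, func(i, j int) bool { return out[i].Severity > out[j].Severity })
	return out
}

func main() {
	for _, f := range findings(AuthStatus{LoggedIn: true, Login: "bborn", AccountType: "User", GraphQLRemaining: 120}) {
		fmt.Println(f.Severity, f.Message)
	}
}
```

Keeping the findings function free of any gh invocation means each auth state can be tested from a plain struct value.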

Tests

  • internal/github/auth_test.go — account classification, 401/logged-out parsing, and findings/severity for each auth state.
  • internal/executor/executor_test.go — locks the new GitHub guidance into the agent system prompt.

All affected packages build, vet, and test clean.

Note

This PR was itself opened via the REST API (POST /repos/.../pulls) because the shared GraphQL bucket was exhausted at creation time — a live demonstration of the exact problem, and of the documented REST workaround.

Follow-up (out of this Go repo's scope)

Provision each agent server with its own GitHub App installation token during launch/setup (which lives in the launch plugin), mirroring the offerlab-devs[bot] pattern, so every server gets an independent rate-limit bucket.

🤖 Generated with Claude Code

bborn and others added 2 commits May 29, 2026 06:26
…stion

GitHub's GraphQL rate limit (5,000 pt/hr) is PER-USER, so multiple agent
servers authed as the same personal account (bborn) share one bucket and
drain it, causing intermittent "GraphQL bucket is exhausted" failures.

- internal/github/auth.go: CheckAuth() inspects the local gh identity and
  GraphQL headroom, classifying personal vs GitHub App/bot accounts and
  detecting logged-out / expired (401) tokens. Findings() turns that into
  ordered severity findings.
- cmd/task/main.go: new `ty doctor` command that renders those findings —
  warns on personal-account auth, expired tokens, gh not logged in, and low
  GraphQL headroom; exits non-zero on hard errors.
- internal/executor: agent system instructions now tell agents to prefer
  REST for PR reads and use `gh run watch` / REST check-runs with backoff
  instead of busy-polling `gh pr checks`.
- Tests for account classification, error parsing, findings, and guidance.

Follow-up (out of this Go repo's scope): provision each agent server with
its own GitHub App installation token during launch/setup, mirroring the
offerlab-devs[bot] pattern, for an independent rate-limit bucket.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Err tests

Addresses plan-exit-review findings:
- 1A: add `ty doctor --strict` so fleet sweeps can branch on exit code
  (warnings like personal-account auth now exit non-zero under --strict;
  default behavior unchanged — only hard errors exit non-zero).
- 2A: document why pr.go's batch-gate threshold (200) and auth.go's
  operator-warn threshold (500) intentionally differ, cross-referencing
  each other so a future tuner sees both.
- 3A: extract pure classifyUserErr() from CheckAuth and table-test the
  stderr->auth-state mapping (expired / logged-out / unknown), so a change
  to gh's wording is caught rather than silently mis-routed.
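
As a rough illustration of items 1A and 3A, the sketch below shows a pure stderr classifier with a table-driven check, plus the --strict exit-code rule. It is not the repository's code. The AuthState constants, the exitCode helper, and the matched substrings are assumptions; gh's actual wording may differ, which is exactly the risk a table test is meant to catch.

```go
package main

import (
	"fmt"
	"strings"
)

// AuthState is a hypothetical enum for what a failed `gh api user` call means.
type AuthState int

const (
	AuthUnknown AuthState = iota
	AuthExpired
	AuthLoggedOut
)

// classifyUserErr maps gh stderr text to an auth state. It does no I/O,
// so it can be table-tested. The substrings are assumed, not confirmed.
func classifyUserErr(stderr string) AuthState {
	s := strings.ToLower(stderr)
	switch {
	case strings.Contains(s, "bad credentials"), strings.Contains(s, "401"):
		return AuthExpired
	case strings.Contains(s, "gh auth login"), strings.Contains(s, "not logged in"):
		return AuthLoggedOut
	default:
		return AuthUnknown
	}
}

// exitCode applies the rule from 1A: hard errors always exit non-zero,
// warnings only do so under --strict.
func exitCode(hasError, hasWarning, strict bool) int {
	if hasError || (strict && hasWarning) {
		return 1
	}
	return 0
}

func main() {
	cases := []struct {
		stderr string
		want   AuthState
	}{
		{"gh: Bad credentials (HTTP 401)", AuthExpired},
		{"To get started with GitHub CLI, please run: gh auth login", AuthLoggedOut},
		{"dial tcp: lookup api.github.com: no such host", AuthUnknown},
	}
	for _, c := range cases {
		fmt.Println(classifyUserErr(c.stderr) == c.want, c.stderr)
	}
	fmt.Println(exitCode(false, true, false), exitCode(false, true, true))
}
```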

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bborn merged commit 5f1449a into main on May 29, 2026
3 of 4 checks passed