feat(buildrunner): add Buildkite BuildRunner implementation by JamyDev · Pull Request #190 · uber/submitqueue

JamyDev · 2026-06-04T04:05:22Z

Why?

The buildrunner extension defined the BuildRunner/Factory contract and shipped only a noop stub. To run real CI verification in the build stage of the orchestrator pipeline, we need a concrete backend. Buildkite is the first real one. This implementation also has to honor the constraints in doc/rfc/submitqueue/build-runner.md:

Trigger/Cancel must return promptly, no build state may live only in process memory (it must survive a restart), and transient failures must be handled internally rather than surfaced to the caller.

What?

Adds a Buildkite-backed BuildRunner + Factory under submitqueue/extension/buildrunner/buildkite/.

Async, prompt-returning Trigger/Cancel. Both enqueue work on a buffered channel; a background worker contacts Buildkite, keeping the orchestrator's queue loops decoupled from Buildkite
availability (per the RFC's async/sync table).
Cache is not the source of truth. The in-memory build-ID → Buildkite-ref map is a pure latency cache. The SQ build ID is stamped into the Buildkite build's meta_data at create time; Status and Cancel re-derive the ref via a metadata-filtered build lookup on cache miss, so they stay correct after a restart that empties the cache (no Accepted-forever orphans on the post-submit
path).
Retry with backoff. The worker retries transient create/cancel failures. If a submission exhausts its retries, the build is recorded as a submission failure and Status returns terminal Failed (reason in BuildMetadata["error"]) instead of polling Accepted forever.
Conservative state mapping. mapState collapses Buildkite states into BuildStatus; unrecognised states map to non-terminal Unknown (not Failed) so an unknown state never terminally fails a batch.
No head-of-line blocking. Each job runs in its own goroutine so a retrying submit doesn't stall other builds.
Errors are returned plain (classification left to the controller, per core/errs). Config exposes queue sizes, submit timeout, retry attempts/backoff, and test overrides (HTTP client, base
URL).

Known limitation (out of scope): a restart while a trigger job is still buffered (never submitted to Buildkite) leaves the build Accepted with nothing to recover (TBD on if we want the orchestrator to handle this)

Test Plan

Unit tests cover trigger/status/cancel, payload + metadata stamping, cache-miss recovery (Status and Cancel), retry-then-succeed, retry-exhaustion→Failed, and full state mapping. All local CI-parity checks pass:

$ go test ./submitqueue/extension/buildrunner/buildkite/...
ok github.com/uber/submitqueue/submitqueue/extension/buildrunner/buildkite 0.834s
$ go vet ./submitqueue/extension/buildrunner/buildkite/... # clean
$ gofmt -l submitqueue/extension/buildrunner/buildkite/ # clean
$ make check-gazelle
BUILD files are up to date.

$ bazel test //submitqueue/extension/buildrunner/buildkite:buildkite_test --test_output=errors
//submitqueue/extension/buildrunner/buildkite:buildkite_test PASSED in 0.4s
Executed 1 out of 1 test: 1 test passes.

The Buildkite REST calls are exercised against an httptest server (no live Buildkite dependency); the worker is driven synchronously in tests via drainTrigger/drainCancel to avoid goroutine timing races.

Implements buildrunner.BuildRunner and buildrunner.Factory backed by the Buildkite CI platform, under submitqueue/extension/buildrunner/buildkite. Design: - Trigger/Cancel return promptly (per the async/sync contract in the build-runner RFC): both enqueue work on a buffered channel and a background worker contacts Buildkite, keeping the orchestrator's queue loops decoupled from Buildkite availability. - The in-memory build-ID -> Buildkite-ref map is a pure latency cache, not the source of truth. The SQ build ID is stamped into the Buildkite build's metadata at create time; Status and Cancel re-derive the ref via a metadata-filtered build lookup on cache miss, so they remain correct after a process restart that empties the cache (no orphaned Accepted-forever builds). - The background worker retries transient create/cancel failures with backoff. If a submission exhausts its retries, the build is recorded as a submission failure and Status reports terminal Failed (with the reason in BuildMetadata["error"]) rather than polling Accepted forever. - mapState collapses Buildkite states into BuildStatus; unrecognised states map to the non-terminal Unknown (not Failed), so an unknown state does not terminally fail a batch. - Each job is dispatched to its own goroutine so a retrying submit does not head-of-line block other builds. Errors are returned plain (classification left to the controller, per core/errs). Includes config knobs (queue sizes, submit timeout, retry attempts/backoff, test overrides) and unit tests covering trigger/status/cancel, cache-miss recovery, retry-then-succeed, retry-exhaustion-fails, and state mapping. Known limitation: a process restart while a trigger job is still buffered (never submitted to Buildkite) leaves the build Accepted with nothing to recover; catching that belongs to an orchestrator-level Accepted deadline, which is out of scope for this change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

behinddwalls · 2026-06-04T05:03:56Z

+// constructed by Factory.For for each queue and bound to a single pipeline.
+type Config struct {
+	// APIToken is the Buildkite personal or agent API token. Required.
+	APIToken string


should this be left to be injected through http transport? similar to changeprovider & mergechecker model?

behinddwalls · 2026-06-04T05:05:43Z

+	QueueName string
+
+	// Branch is the target branch builds run against (e.g. "main"). Required.
+	Branch string


do we need this when have QueueName as the main thing being passed around? should underlying imp infer things based on that queue however they configure or store..rather than we passing something as queueName uniquely represents the repo+target

behinddwalls · 2026-06-04T05:06:05Z

+	TriggerQueueSize int
+
+	// CancelQueueSize is the buffer capacity of the async cancel channel.
+	// Defaults to 256.
+	CancelQueueSize int


i think we can get rid of buffers

behinddwalls · 2026-06-04T05:06:48Z

+	// MaxSubmitAttempts is the number of times the background worker tries a
+	// Buildkite create/cancel call before giving up, to ride out transient
+	// failures. Defaults to 5.
+	MaxSubmitAttempts int
+
+	// SubmitBackoff is the base delay between background-worker retries; the
+	// delay grows linearly with the attempt number. Defaults to 1s.
+	SubmitBackoff time.Duration


should we just accept a retry-policy instead?

behinddwalls · 2026-06-04T05:07:17Z

+
+	// BaseURL overrides the Buildkite API base URL
+	// (default: "https://api.buildkite.com/v2"). Intended for testing.
+	BaseURL string


can this be encoded into http transporter as well?

behinddwalls · 2026-06-04T05:08:00Z

+	// OrgSlug is the Buildkite organisation slug (URL segment after
+	// buildkite.com/). Required.
+	OrgSlug string
+
+	// PipelineSlug is the Buildkite pipeline that runs builds for this queue.
+	// Resolved by Factory from its Pipelines map. Required.
+	PipelineSlug string


can these be in the transporter itself?

behinddwalls · 2026-06-04T05:44:27Z

+
+// For returns a BuildRunner bound to the Buildkite pipeline configured for
+// cfg.QueueName. Returns an error if no pipeline is configured for the queue.
+func (f Factory) For(cfg buildrunner.Config) (buildrunner.BuildRunner, error) {


I was hoping it's implemented outside of the extension, basically in example server we would define this and let implementer decide however they want to wire the implementation per queue, so say one queue could use buildkite, another one can use gh-actions

JamyDev requested review from a team, behinddwalls and sbalabanov as code owners June 4, 2026 04:05

JamyDev added the ChangesPlanned label Jun 4, 2026

behinddwalls reviewed Jun 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(buildrunner): add Buildkite BuildRunner implementation#190

feat(buildrunner): add Buildkite BuildRunner implementation#190
JamyDev wants to merge 1 commit into
mainfrom
feat/buildkite-buildrunner

JamyDev commented Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

behinddwalls Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JamyDev commented Jun 4, 2026

Why?

What?

Test Plan

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

behinddwalls Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants