Skip to content

feat(cancel): add Cancel RPC + orchestrator cancel pipeline#157

Merged
behinddwalls merged 1 commit into
mainfrom
sergeyb/request-cancellation
Jun 3, 2026
Merged

feat(cancel): add Cancel RPC + orchestrator cancel pipeline#157
behinddwalls merged 1 commit into
mainfrom
sergeyb/request-cancellation

Conversation

@sbalabanov
Copy link
Copy Markdown
Contributor

Summary

  • New gateway Cancel(sqid) RPC that publishes a CancelRequest to TopicKeyCancel; processing is asynchronous and cancellation is not guaranteed (a request that has already merged, or that races to completion, may still land or error — callers should check the actual outcome via the separate status / request-log API).
  • New orchestrator cancel controller with a two-step transition: RequestStateCancelling (non-terminal intent) → either RequestStateCancelled directly (un-batched) or via batch cancel + conclude fan-out (batched). The Cancelling intent is non-terminal so concurrent merge / failure may still win and prevail with Landed / Error.
  • Every stage controller (validate, batch, score, build, buildsignal, merge, conclude, speculate) is now cancellation-aware via a single IsRequestStateHalted / batch-terminal short-circuit. Speculate treats cancelled deps as out-of-the-way (drop and continue) instead of cascade-failing; conclude maps BatchStateCancelledRequestStateCancelled.
  • Cancel controller does not classify storage errors itself — storage.ErrNotFound and storage.ErrVersionMismatch propagate as-is so the base controller layer can map them to retryable, matching the pattern in every other orchestrator controller.

Test plan

  • make fmt && make lint && make check-tidy && make check-gazelle && make check-mocks clean
  • bazel test //... --test_tag_filters=-integration,-e2e (32 unit-test targets) all pass
  • Manual e2e via make local-start: Land → Cancel before batch → confirm RequestStateCancelled and no batch created
  • Manual e2e: Land → wait for batch → Cancel → confirm batch goes Cancelled, contained requests reconcile to Cancelled, dependent batches respeculate
  • Integration test exercising both paths through real MySQL extension

Out of scope (deferred)

  • Re-enqueuing non-cancelled requests from a cancelled batch (TODO in cancel controller).
  • Cancelling a real CI build via external API (no-op today; hook is the existing BuildStatusCancelled write).

🤖 Generated with Claude Code

sbalabanov pushed a commit that referenced this pull request May 22, 2026
…c log

Two CI failures on PR #157:

- TestCancelRequest_InvalidSqid expected codes.InvalidArgument but the
  gateway returned a plain fmt.Errorf that gRPC defaults to codes.Unknown.
  Added a unary interceptor in example/server/gateway/main.go that
  translates any error matching controller.IsInvalidRequest into
  codes.InvalidArgument. Benefits Land too without touching the
  controllers' return shapes.

- TestCancelRequest_BeforeBatch read request_log exactly once after
  observing state=cancelled, racing the orchestrator's async log
  publish. Wrapped the assertion in require.Eventually so the check
  waits for the log consumer to persist the terminal entry.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sbalabanov sbalabanov force-pushed the sergeyb/request-cancellation branch 10 times, most recently from 72bdf6e to 164917b Compare June 1, 2026 23:52
@sbalabanov sbalabanov marked this pull request as ready for review June 1, 2026 23:52
@sbalabanov sbalabanov requested review from a team and behinddwalls as code owners June 1, 2026 23:52
@albertywu
Copy link
Copy Markdown
Contributor

Cancel controller does not classify storage errors itself — storage.ErrNotFound and storage.ErrVersionMismatch propagate as-is so the base controller layer can map them to retryable, matching the pattern in every other orchestrator controller.

What base controller layer? Is this referring to core/errs? If so, it looks like the only retryable error types are context-cancellation and infra-errors -- I don't see any special handing of these specific storage errors.

@behinddwalls
Copy link
Copy Markdown
Collaborator

Cancel controller does not classify storage errors itself — storage.ErrNotFound and storage.ErrVersionMismatch propagate as-is so the base controller layer can map them to retryable, matching the pattern in every other orchestrator controller.

What base controller layer? Is this referring to core/errs? If so, it looks like the only retryable error types are context-cancellation and infra-errors -- I don't see any special handing of these specific storage errors.

It's referring to the ConsumerController aka BaseController for all queue consumers

@albertywu
Copy link
Copy Markdown
Contributor

Cancel controller does not classify storage errors itself — storage.ErrNotFound and storage.ErrVersionMismatch propagate as-is so the base controller layer can map them to retryable, matching the pattern in every other orchestrator controller.

What base controller layer? Is this referring to core/errs? If so, it looks like the only retryable error types are context-cancellation and infra-errors -- I don't see any special handing of these specific storage errors.

It's referring to the ConsumerController aka BaseController for all queue consumers

I see -- it looks like there is no base controller implemented. So as things stand today (unless I'm missing something), there is no code that maps storage.ErrNotFound and storage.ErrVersionMismatch to retry-able errors as the PR comment states.

// kicked off, and buildsignal's own halted short-circuit takes over. Both
// truly-terminal and mid-cancellation batches reach the same observable
// behaviour (no build performed).
if entity.IsBatchStateHalted(batch.State) {
Copy link
Copy Markdown
Collaborator

@behinddwalls behinddwalls Jun 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just thinking if this can be wired up into the Consumer Controller in some way,

Basically if each controller provides a way to get the deserialized payload, then consumer can just make a callback using the same entity before calling the process and short-circuit for all the consumer controllers.

// push-already-committed race is acknowledged elsewhere) and do not fan out
// (speculate owns the terminal write to Cancelled and the downstream
// dependent / conclude publishes).
if batch.State == entity.BatchStateCancelling {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this use the IsHalted utility?

@behinddwalls
Copy link
Copy Markdown
Collaborator

behinddwalls commented Jun 2, 2026

Cancel controller does not classify storage errors itself — storage.ErrNotFound and storage.ErrVersionMismatch propagate as-is so the base controller layer can map them to retryable, matching the pattern in every other orchestrator controller.

What base controller layer? Is this referring to core/errs? If so, it looks like the only retryable error types are context-cancellation and infra-errors -- I don't see any special handing of these specific storage errors.

It's referring to the ConsumerController aka BaseController for all queue consumers

I see -- it looks like there is no base controller implemented. So as things stand today (unless I'm missing something), there is no code that maps storage.ErrNotFound and storage.ErrVersionMismatch to retry-able errors as the PR comment states.

https://github.com/uber/submitqueue/blob/main/core/consumer/controller.go
https://github.com/uber/submitqueue/blob/main/core/consumer/consumer.go

@albertywu
Copy link
Copy Markdown
Contributor

Cancel controller does not classify storage errors itself — storage.ErrNotFound and storage.ErrVersionMismatch propagate as-is so the base controller layer can map them to retryable, matching the pattern in every other orchestrator controller.

What base controller layer? Is this referring to core/errs? If so, it looks like the only retryable error types are context-cancellation and infra-errors -- I don't see any special handing of these specific storage errors.

It's referring to the ConsumerController aka BaseController for all queue consumers

I see -- it looks like there is no base controller implemented. So as things stand today (unless I'm missing something), there is no code that maps storage.ErrNotFound and storage.ErrVersionMismatch to retry-able errors as the PR comment states.

https://github.com/uber/submitqueue/blob/main/core/consumer/controller.go https://github.com/uber/submitqueue/blob/main/core/consumer/consumer.go

Should we make sure these error types are retriable? Otherwise we fall into the DLQ code-path here instead of the retry code-path: https://github.com/uber/submitqueue/blob/main/core/consumer/consumer.go#L374

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Adds a gateway `Cancel(sqid)` RPC that publishes to a new TopicKeyCancel
and an orchestrator cancel controller that performs the actual state
transitions. Cancellation uses a two-step pattern: first the request is
marked RequestStateCancelling (non-terminal intent), then either
transitioned directly to RequestStateCancelled (un-batched path) or
handed off via batch-cancel + conclude fan-out (batched path). The
Cancelling intent is non-terminal so a concurrent merge or failure may
still win the race and prevail with Landed or Error — conclude
reconciles to the actual terminal outcome.

Every stage controller (validate, batch, score, build, buildsignal,
merge, conclude, speculate) is now cancellation-aware. Speculate treats
cancelled deps as out-of-the-way (drop and continue) instead of
cascade-failing. Conclude maps BatchStateCancelled → RequestStateCancelled.

Cancellation is asynchronous and not guaranteed; gateway.proto comments
direct callers to check the actual outcome via the separate status API.

Infrastructure changes that fell out of building this:

- gRPC error mapping: a unary interceptor in example/server/gateway/main.go
  translates any error matching controller.IsInvalidRequest into
  codes.InvalidArgument so invalid-sqid (and Land's invalid inputs) no
  longer surface as codes.Unknown.

- PublishLog message ID scoping: log entries are now keyed by
  `<requestID>/<status>` so distinct statuses for the same request
  coexist on the queue's `(topic, partition_key, id)` unique index
  (start writes `started`, cancel writes `cancelling` then `cancelled`).
  Same-status retries still dedupe.

Regression coverage: unit tests for every new controller and the
two-step transition, plus e2e tests for cancel-before-batch (including
idempotent re-cancel) and the InvalidArgument mapping. The idempotent
re-cancel e2e test depends on the MySQL queue `Insert` becoming
idempotent on (topic, partition_key, id) — extracted to a separate PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@behinddwalls behinddwalls force-pushed the sergeyb/request-cancellation branch from 164917b to 4e01daf Compare June 3, 2026 06:22
@behinddwalls behinddwalls merged commit 92cef12 into main Jun 3, 2026
14 of 15 checks passed
@behinddwalls behinddwalls deleted the sergeyb/request-cancellation branch June 3, 2026 06:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants