Skip to content

Conversation

@msbutler
Copy link
Collaborator

@msbutler msbutler commented Dec 23, 2025

The claim query should take less than a second, but we have seen it hang for
multiple hours because the underlying transaction continuously retries and
deadlocks. To prevent this, this patch causes the claim query to timeout after
a minute by default, set by the private cluster setting
jobs.registry.claim_query.timeout. The per node claim loop will try again in
the next iteration.

Informs #158976

Release note: none

@msbutler msbutler self-assigned this Dec 23, 2025
@blathers-crl
Copy link

blathers-crl bot commented Dec 23, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@msbutler
Copy link
Collaborator Author

When testing this patch on a 70 node draft roachtest which hammers the job system, the claim loops hobble along instead of completely flatlining:
image

In this run, at 16:50, i bumped the claim loop from 15s to 5s, to further stress things. With a 15s interval, 25.2.6 deadlocks while this patch, the job system didn't have any multi minute stalls. At 5s intervals, the job system can hobble along with this patch instead of flatlining.

@msbutler msbutler marked this pull request as ready for review December 23, 2025 20:08
@msbutler msbutler requested review from a team as code owners December 23, 2025 20:08
@msbutler msbutler requested review from dt and kev-cao and removed request for a team and kev-cao December 23, 2025 20:08
@msbutler msbutler added backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 backport-26.1.x Flags PRs that need to be backported to 26.1 labels Dec 23, 2025
// available.
func (r *Registry) claimJobs(ctx context.Context, s sqlliveness.Session) error {

ctx, cancel := context.WithDeadlineCause(ctx, timeutil.Now().Add(time.Minute), errors.New("claim jobs transaction took too long"))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i lean towards hard coding this, but if folks think this should be a cluster setting, that's fine too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backport I lean toward setting, in case we regress someone and need to fix it quickly

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added cluster setting

@msbutler msbutler force-pushed the butler-jobs-query-timeout branch from 021f5c5 to 6de0c57 Compare December 29, 2025 20:59
@msbutler msbutler changed the title jobs: time out claim jobs query after 1 minute jobs: time out claim jobs query after 1 minute by default Dec 29, 2025
The claim query should take less than a second, but we have seen it hang for
multiple hours because the underlying transaction continuously retries and
deadlocks. To prevent this, this patch causes the claim query to timeout after
a minute by default, set by the private cluster setting
jobs.registry.claim_query.timeout. The per node claim loop will try again in
the next iteration.

Informs cockroachdb#158976

Release note: none
@msbutler msbutler force-pushed the butler-jobs-query-timeout branch from 6de0c57 to f36dc98 Compare December 29, 2025 21:41
@msbutler
Copy link
Collaborator Author

TFTR!

bors r=dt

@craig
Copy link
Contributor

craig bot commented Dec 29, 2025

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 backport-26.1.x Flags PRs that need to be backported to 26.1 target-release-26.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants