-
Notifications
You must be signed in to change notification settings - Fork 4.1k
jobs: time out claim jobs query after 1 minute by default #160084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
When testing this patch on a 70 node draft roachtest which hammers the job system, the claim loops hobble along instead of completely flatlining: In this run, at 16:50, i bumped the claim loop from 15s to 5s, to further stress things. With a 15s interval, 25.2.6 deadlocks while this patch, the job system didn't have any multi minute stalls. At 5s intervals, the job system can hobble along with this patch instead of flatlining. |
pkg/jobs/adopt.go
Outdated
| // available. | ||
| func (r *Registry) claimJobs(ctx context.Context, s sqlliveness.Session) error { | ||
|
|
||
| ctx, cancel := context.WithDeadlineCause(ctx, timeutil.Now().Add(time.Minute), errors.New("claim jobs transaction took too long")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i lean towards hard coding this, but if folks think this should be a cluster setting, that's fine too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For backport I lean toward setting, in case we regress someone and need to fix it quickly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added cluster setting
021f5c5 to
6de0c57
Compare
The claim query should take less than a second, but we have seen it hang for multiple hours because the underlying transaction continuously retries and deadlocks. To prevent this, this patch causes the claim query to timeout after a minute by default, set by the private cluster setting jobs.registry.claim_query.timeout. The per node claim loop will try again in the next iteration. Informs cockroachdb#158976 Release note: none
6de0c57 to
f36dc98
Compare
|
TFTR! bors r=dt |

The claim query should take less than a second, but we have seen it hang for
multiple hours because the underlying transaction continuously retries and
deadlocks. To prevent this, this patch causes the claim query to timeout after
a minute by default, set by the private cluster setting
jobs.registry.claim_query.timeout. The per node claim loop will try again in
the next iteration.
Informs #158976
Release note: none