Skip to content

scheduler: pauseJob() is silently no-op on in-flight jobs (executor caches status at pickup) #148

@truffle-dev

Description

@truffle-dev

Summary

SchedulerService.pauseJob() does nothing for an in-flight job. The
executor reads job.status at pickup and writes that cached value back
to the row on completion, stomping the paused flag that pauseJob
just wrote.

The pause sticks for jobs that are not currently running — those are
the only cases where the documented behavior holds.

Fault site

src/scheduler/executor.ts:59

let newStatus = job.status; // captured from the job object passed at pickup

The happy-path (runStatus === "ok" + cron schedule) flows through to
the UPDATE at line 108–132 without touching newStatus, so the value
written back is whatever the row was at pickup time. If pauseJob ran
during the job's runtime work (ctx.runtime.handleMessage on line 43,
which is the long-running call), the SQL update on line 118 stomps the
paused row back to active.

Reproduction

  1. Create a cron job whose handler sleeps for ~30 s (any long-running
    task works).
  2. Wait for the job to fire — confirm it is in handleMessage.
  3. Call pauseJob(id). The SQL update succeeds (status = 'paused').
  4. Let the job finish. The executor's UPDATE on line 108 runs and
    writes status = 'active' back. Pause is gone, job fires on next
    cron tick.

A unit test can reproduce this without runtime sleep by stubbing
runtime.handleMessage with a promise that the test resolves manually
after pauseJob has been called.

Workaround

Set enabled = 0 via direct SQL instead of calling pauseJob on a
running job. enabled is checked at fire time, not cached at pickup,
so the next scheduled fire is suppressed. This is what
feedback_scheduler_pause_use_enabled_zero documents as the current
operator workaround.

Proposed fix

One of two shapes:

  1. Re-read row status before the UPDATE. Right before the
    ctx.db.run(UPDATE ...) on line 108, re-query the current row's
    status. If it is paused, write paused back instead of the
    cached newStatus. One extra read per execution.

  2. Compose with WHERE on the UPDATE. Add AND status != 'paused'
    to the WHERE on line 120, then run a follow-up UPDATE that writes
    only the run telemetry (last_run_at, last_run_status, run_count,
    consecutive_errors, last_delivery_status) regardless of pause
    state. Run telemetry should always reflect that the job ran; only
    the lifecycle status field needs to honor the operator's pause.

Shape 2 is the more correct one: the row's status field is the
operator's lifecycle intent (active / paused / failed / completed),
and the executor should not be in the business of overwriting that
intent for happy-path runs. The executor only owns lifecycle changes
when it has cause to (success + at-kind → completed, MAX errors
failed, etc.); for a happy-path cron run, the executor has no
lifecycle decision to make and should not touch status at all.

Why this matters

Operators who hit "pause" on a job in the dashboard expect the job to
stop. If it's mid-flight, the pause silently fails. The
enabled = 0 workaround works but is undocumented surface for an
operator who reasonably trusts the pauseJob API.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions