Add tdbg schedule audit command by fretz12 · Pull Request #10376 · temporalio/temporal

fretz12 · 2026-05-24T19:41:16Z

What changed?

Adds a new tdbg schedule audit subcommand under tdbg schedule. The command lists every schedule in the target namespace(s), computes the nominal fire times each spec should have produced inside a user-specified window, queries visibility for the workflows that actually ran, and emits a per-schedule classification of each expected fire:

real_miss — expected fire with no matching workflow and nothing else from the schedule running to justify a skip
skip_overlap — fire correctly skipped because a prior workflow was still running
inconclusive_schedule_changed — schedule spec was modified during the audit window; historical spec can't be recovered
unsupported_policy — schedule uses a policy this audit doesn't fully model (BUFFER_ALL, ALLOW_ALL, CANCEL_OTHER, TERMINATE_OTHER, KeepOriginalWorkflowId); surfaced rather than miscounted

Output is either a CSV bundle (summary.csv + per-namespace files) when --output-dir is set, or a single flat CSV stream to stdout when it isn't.

Why?

We've had several incidents where schedules silently missed fires, and answering "did the scheduler fire when it should have, during this window?" required ad-hoc visibility queries per schedule. This tool makes that question answerable in a single command across all schedules in many namespaces in parallel, and produces a machine-readable artifact that can be diffed across days to spot regressions.

How did you test it?

New `tdbg schedule audit` command that detects missed schedule fires by comparing expected fires from each schedule's spec against actual workflow executions in visibility. Reports per-schedule classification (real_miss / skip_overlap / inconclusive_schedule_changed / unsupported_policy) as CSV, either bundled per-namespace to a directory or streamed flat to stdout.

chaptersix · 2026-05-28T14:29:27Z

+//   - UnsupportedReason != "" -> reclassify real_miss to unsupported_policy and stamp the reason. These are
+//     corner-case policy/state configs the algorithm does not model correctly today, so we move the count out of
+//     the trusted real_miss bucket and surface the row for manual review. Reasons currently detected:
+//   - keep_original_workflow_id  -- all fires share one WorkflowID, collapsing the chain-by-WorkflowID model.


https://github.com/temporalio/api/blob/0a2f0c3aff1f58e9cec5877d2896bdff5985431d/temporal/api/schedule/v1/message.proto#L213

this is in the API but not currently supported. It won't be for quite some time.

Gotcha, changed the comment.

chaptersix · 2026-05-28T14:31:39Z

I think input and output should be JSON/JSONL. We have jq available in our environments to process input and output.

chaptersix · 2026-05-28T14:36:25Z

+	NamespaceConcurrency int
+}
+
+func parseAuditInputs(c *cli.Context) (*auditInputs, error) {


Could we support Unix pipes instead of specifying an input file? The piped input could contain the namespace and, optionally, the schedule ID.

That way someone can stream the input from another process without writing a file, or cat a file into stdin.

yup, that was on my todo list.

FILE FORMAT (for --file / stdin)
  One audit target per line as 'namespace[,schedule_id]'. Examples:
    checker-ses-northwest-prod.90d0d
    datastore-northwest-prod.90d0d,DeleteExpiredSecretsScheduledWorkflow--one
    synthetics-northwest-prod.90d0d,my-schedule
  Lines starting with '#' and blank lines are ignored. Schedule IDs must not contain commas.

EXAMPLES
  Single namespace, 1-day window, write CSV bundle:
    tdbg schedule audit --namespace my-ns --start 2026-05-19T00:00:00Z --end 2026-05-20T00:00:00Z --output-dir ./audit-out

  Many targets from a file:
    tdbg schedule audit -f ./targets.csv --start 2026-05-01T19:30:00Z --end 2026-05-02T10:00:00Z \
      --output-dir ./audit-out

  Pipe targets from stdin (cat, psql, awk, etc.):
    cat ./targets.csv | tdbg schedule audit -f - --start 2026-05-01T00:00:00Z --end 2026-05-02T00:00:00Z

chaptersix · 2026-05-28T14:39:38Z

+	}
+}
+
+// expectedFireTimes returns the nominal (pre-jitter) fire times the spec would produce in (start, end]. Uses the


post jitter?

Actually, pre-jitter is intentional: workflows started by the scheduler carry TemporalScheduledStartTime set to the nominal (pre-jitter) time, so the audit needs nominal times to match fired workflows against expected fires. I found this to be the most reliable approach. I also removed the jitter seed, since it's not really needed and is misleading. Added a comment as well.

chaptersix · 2026-05-28T14:42:42Z

+
+// maxAuditWindow caps how wide a single audit window can be. Catches typos (e.g. wrong month in --end) and discourages
+// expensive multi-day runs that should be chunked into separate invocations.
+const maxAuditWindow = 7 * 24 * time.Hour


I think this should be overridable. Some namespaces may have schedules that run once a year.

Ah yeah, good point. Turned it into a flag, and added a warning if the window exceeds 7 days.

fretz12 added 2 commits May 24, 2026 12:36

fix lint

67f70fd

fretz12 marked this pull request as ready for review May 26, 2026 14:35

fretz12 requested a review from a team as a code owner May 26, 2026 14:35

fretz12 requested review from chaptersix, davidporter-id-au and lina-temporal May 26, 2026 14:39

chaptersix reviewed May 28, 2026


addressed PR comments

d13ec41

fretz12 requested a review from chaptersix May 29, 2026 23:21

fretz12 wants to merge 3 commits into temporalio:main from fretz12:fredtzeng/tdbg-schedule-audit
