[Scheduler] Add new scheduler metrics #10402

Open
lina-temporal wants to merge 2 commits into sched-task-rework from sched-task-rework-metrics

@lina-temporal
Contributor

New metrics

All counters are tagged with namespace and schedule_backend via newTaggedMetricsHandler. Reason-tagged counters use a reason tag with limited cardinality.
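As a rough illustration of the tagging scheme described above, the sketch below attaches `namespace` and `schedule_backend` once and adds a bounded `reason` tag per call. It is not the PR's implementation: `Recorder`, `TaggedHandler`, `newTagged`, and the `Reason` type are hypothetical stand-ins, and only the tag names and reason values come from this description.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Tag is a key/value label attached to a counter sample.
type Tag struct{ Key, Value string }

// Recorder is a hypothetical in-memory sink standing in for a real metrics backend.
type Recorder struct{ counts map[string]int64 }

func NewRecorder() *Recorder { return &Recorder{counts: map[string]int64{}} }

// seriesKey builds a stable key from the metric name and its sorted tags.
func seriesKey(name string, tags []Tag) string {
	parts := make([]string, 0, len(tags))
	for _, t := range tags {
		parts = append(parts, t.Key+"="+t.Value)
	}
	sort.Strings(parts)
	return name + "{" + strings.Join(parts, ",") + "}"
}

// TaggedHandler adds a fixed set of base tags to every counter it records.
type TaggedHandler struct {
	rec  *Recorder
	base []Tag
}

// newTagged is a hypothetical analogue of newTaggedMetricsHandler: it binds
// the namespace and schedule_backend tags once so call sites cannot omit them.
func newTagged(rec *Recorder, namespace, backend string) TaggedHandler {
	return TaggedHandler{rec: rec, base: []Tag{
		{"namespace", namespace},
		{"schedule_backend", backend},
	}}
}

// Reason is a closed set of values, which keeps the reason tag's cardinality bounded.
type Reason string

const (
	ReasonStaleHWM        Reason = "stale_hwm"
	ReasonNoWork          Reason = "no_work"
	ReasonAlreadyRecorded Reason = "already_recorded"
)

// ReasonTag converts a Reason constant into a tag.
func ReasonTag(r Reason) Tag { return Tag{"reason", string(r)} }

// Count increments the named counter by n with the base tags plus any extras.
func (h TaggedHandler) Count(name string, n int64, extra ...Tag) {
	tags := append(append([]Tag{}, h.base...), extra...)
	h.rec.counts[seriesKey(name, tags)] += n
}

func main() {
	rec := NewRecorder()
	// "default" and "example_backend" are placeholder tag values.
	h := newTagged(rec, "default", "example_backend")
	h.Count("schedule_invoker_execute_invalidated", 1, ReasonTag(ReasonAlreadyRecorded))
	for key, value := range rec.counts {
		fmt.Println(key, value)
	}
}
```

Restricting `reason` to typed constants is one way to keep free-form strings out of the tag; whether the PR enforces this the same way is not shown here.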

Generator

| Metric | Purpose |
| --- | --- |
| `schedule_generator_ticks` | Every Generator fire (baseline for paused-vs-active attribution) |
| `schedule_generator_paused_ticks` | Generator fires while paused (HWM advanced, no actions buffered) |
| `scheduler_generator_loop_completed` | Generator stopped rescheduling without arming idle; held open for an external trigger |

Idle

| Metric | Purpose |
| --- | --- |
| `schedule_idle_task_fired` | Idle task fired and closed the schedule |
| `schedule_idle_task_invalidated` | Idle task dropped, tagged reason: `held_open` / `expiration_shift` / `closed` |

Invoker

| Metric | Purpose |
| --- | --- |
| `schedule_invoker_process_buffer_fired` | Each ProcessBuffer execute |
| `schedule_invoker_process_buffer_invalidated` | ProcessBuffer dropped by Validate, reason: `stale_hwm` |
| `schedule_invoker_execute_fired` | Each Execute side-effect task |
| `schedule_invoker_execute_invalidated` | Execute work dropped, reason: `no_work` (Validate) or `already_recorded` (concurrent ExecuteTask already wrote RunId) |
| `schedule_buffered_start_dropped` | Buffered start dropped, reason: `missed_catchup_window` or `paused_or_limited` |

Backfiller

| Metric | Purpose |
| --- | --- |
| `schedule_backfiller_fired` | Each Backfiller execute |
| `schedule_backfiller_invalidated` | Backfiller dropped by Validate, reason: `stale_hwm` |
| `schedule_backfiller_completed` | Backfiller drained and deleted itself (end-to-end lifecycle signal) |

newTaggedMetricsHandler(h.metricsHandler, scheduler).
Counter(metrics.ScheduleInvokerExecuteInvalidated.Name()).
Record(int64(count), metrics.ReasonTag(invokerExecuteInvalidatedAlreadyRecorded))
newTaggedLogger(h.baseLogger, scheduler).Debug(
Contributor Author

This'll be moved to the event log shortly; I just don't want too many PRs stacked.

@chaptersix
Contributor

Couple suggestions:

1. Use an outcome tag to collapse fired/invalidated/completed variants.

Several of these new metrics follow a <component>_fired / <component>_invalidated / <component>_completed naming pattern. We could use an outcome tag (which already exists in common/metrics/tags.go as OutcomeTag) instead of encoding the outcome in the metric name.

2. Take it further with a component tag to get down to a single metric.

Rather than one metric per scheduler component, a single schedule_task metric with a component tag (generator, idle, invoker_execute, invoker_process_buffer, backfiller) would let us query across the entire scheduler in one shot.

Instead of 12 separate task-lifecycle metrics:

schedule_generator_ticks
schedule_generator_paused_ticks
scheduler_generator_loop_completed
schedule_idle_task_fired
schedule_idle_task_invalidated
schedule_invoker_execute_fired
schedule_invoker_execute_invalidated
schedule_invoker_process_buffer_fired
schedule_invoker_process_buffer_invalidated
schedule_backfiller_fired
schedule_backfiller_invalidated
schedule_backfiller_completed

One metric:

schedule_task{component="generator",    outcome="fired"}
schedule_task{component="generator",    outcome="paused"}
schedule_task{component="generator",    outcome="loop_completed"}
schedule_task{component="idle",         outcome="fired"}
schedule_task{component="idle",         outcome="invalidated"}
schedule_task{component="invoker_execute",        outcome="fired"}
schedule_task{component="invoker_execute",        outcome="invalidated"}
schedule_task{component="invoker_process_buffer", outcome="fired"}
schedule_task{component="invoker_process_buffer", outcome="invalidated"}
schedule_task{component="backfiller",   outcome="fired"}
schedule_task{component="backfiller",   outcome="invalidated"}
schedule_task{component="backfiller",   outcome="completed"}
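To make the proposed consolidation concrete, here is a minimal sketch of the name-to-label mapping implied by the two lists above. `TaskLabels`, `legacyToLabels`, and `LabelsFor` are hypothetical names used only for illustration; the metric names and label values are copied from the lists.

```go
package main

import "fmt"

// TaskLabels holds the label pair that would replace a dedicated metric name.
type TaskLabels struct{ Component, Outcome string }

// legacyToLabels maps each per-component metric name onto the labels it would
// carry on the single schedule_task counter.
var legacyToLabels = map[string]TaskLabels{
	"schedule_generator_ticks":                    {"generator", "fired"},
	"schedule_generator_paused_ticks":             {"generator", "paused"},
	"scheduler_generator_loop_completed":          {"generator", "loop_completed"},
	"schedule_idle_task_fired":                    {"idle", "fired"},
	"schedule_idle_task_invalidated":              {"idle", "invalidated"},
	"schedule_invoker_execute_fired":              {"invoker_execute", "fired"},
	"schedule_invoker_execute_invalidated":        {"invoker_execute", "invalidated"},
	"schedule_invoker_process_buffer_fired":       {"invoker_process_buffer", "fired"},
	"schedule_invoker_process_buffer_invalidated": {"invoker_process_buffer", "invalidated"},
	"schedule_backfiller_fired":                   {"backfiller", "fired"},
	"schedule_backfiller_invalidated":             {"backfiller", "invalidated"},
	"schedule_backfiller_completed":               {"backfiller", "completed"},
}

// LabelsFor returns the schedule_task labels for a legacy metric name.
// The second result is false for names that are not task-lifecycle metrics.
func LabelsFor(name string) (TaskLabels, bool) {
	labels, ok := legacyToLabels[name]
	return labels, ok
}

func main() {
	labels, ok := LabelsFor("schedule_backfiller_completed")
	fmt.Println(labels.Component, labels.Outcome, ok)

	// schedule_buffered_start_dropped is intentionally absent from the map.
	_, ok = LabelsFor("schedule_buffered_start_dropped")
	fmt.Println(ok)
}
```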

This makes common queries much simpler:

# total fire rate across all components
sum(rate(schedule_task{outcome="fired"}[5m]))

# invalidation rate across all components
sum(rate(schedule_task{outcome="invalidated"}[5m]))

# fraction of task events that fired, for a specific component
sum(rate(schedule_task{component="invoker_execute", outcome="fired"}[5m]))
/ sum(rate(schedule_task{component="invoker_execute"}[5m]))

# full lifecycle view for backfiller -- single query, auto-fans by outcome
schedule_task{component="backfiller"}

Cardinality impact is negligible -- component is a small fixed enum (5 values), outcome is 3-5 values, and newTaggedMetricsHandler only adds namespace + schedule_backend. The existing reason tag nests naturally under outcome="invalidated" for drill-down.
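The cardinality claim can be checked by enumerating the label combinations. The sketch below assumes the outcomes and reasons listed in the PR description are exhaustive; `Series`, `outcomes`, `invalidationReasons`, and `EnumerateSeries` are hypothetical names, not part of the codebase.

```go
package main

import "fmt"

// Series is one distinct label combination on the proposed schedule_task counter.
type Series struct{ Component, Outcome, Reason string }

// outcomes lists, per component, the outcome values proposed in this comment.
var outcomes = map[string][]string{
	"generator":              {"fired", "paused", "loop_completed"},
	"idle":                   {"fired", "invalidated"},
	"invoker_execute":        {"fired", "invalidated"},
	"invoker_process_buffer": {"fired", "invalidated"},
	"backfiller":             {"fired", "invalidated", "completed"},
}

// invalidationReasons lists the reason values from the PR description that
// would nest under outcome="invalidated".
var invalidationReasons = map[string][]string{
	"idle":                   {"held_open", "expiration_shift", "closed"},
	"invoker_execute":        {"no_work", "already_recorded"},
	"invoker_process_buffer": {"stale_hwm"},
	"backfiller":             {"stale_hwm"},
}

// EnumerateSeries expands every (component, outcome) pair, fanning out
// invalidated outcomes by their reasons. The result order is unspecified.
func EnumerateSeries() []Series {
	var out []Series
	for component, outs := range outcomes {
		for _, outcome := range outs {
			if outcome != "invalidated" {
				out = append(out, Series{component, outcome, ""})
				continue
			}
			for _, reason := range invalidationReasons[component] {
				out = append(out, Series{component, outcome, reason})
			}
		}
	}
	return out
}

func main() {
	// Series count per (namespace, schedule_backend) pair.
	fmt.Println(len(EnumerateSeries()))
}
```

Under these assumptions the single metric yields fifteen series per (namespace, schedule_backend) pair, the same number the twelve separate metrics produce once their reason tags are counted, so consolidating would not add series.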

schedule_buffered_start_dropped is a different concept (not a task lifecycle event) so it makes sense to keep that as its own metric with reason tags as-is.
