PAYMENTS-11567 Resque latency metrics#30
Closed
WillemHoman wants to merge 1 commit into
Closed
Conversation
fe8acb7 to
12b01d8
Compare
Author
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 12b01d8. Configure here.
There was a problem hiding this comment.
Pull request overview
Adds an opt-in Resque integration to emit per-job Prometheus histograms (queue latency and perform duration) from the parent worker process, avoiding the per-job synchronous flush overhead that previously regressed throughput in fork-per-job children.
Changes:
- Add
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLEDconfiguration and parent-side Resque worker instrumentation to emitresque_job_*envelopes. - Introduce a dedicated
resque_jobtype collector (TypeCollectors::ResqueJob) to expose the new per-job histograms. - Add/adjust specs and documentation (README + changelog) for the new functionality.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| spec/bigcommerce/prometheus/type_collectors/resque_spec.rb | Updates spec to exercise TypeCollectors::Base#collect label-merging behavior. |
| spec/bigcommerce/prometheus/type_collectors/resque_job_spec.rb | Adds spec coverage for new ResqueJob type collector routing + histogram observation behavior. |
| spec/bigcommerce/prometheus/integrations/resque/job_payload_spec.rb | Adds spec coverage for payload parsing (job_class + anchor timestamp selection/parsing). |
| spec/bigcommerce/prometheus/integrations/resque/job_metrics_spec.rb | Adds spec coverage for envelope shape and error-rescue behavior without requiring Resque. |
| README.md | Documents opt-in per-job metrics and adds configuration table row. |
| lib/bigcommerce/prometheus/type_collectors/resque.rb | Clarifies responsibility boundaries between aggregate Resque metrics vs per-job metrics. |
| lib/bigcommerce/prometheus/type_collectors/resque_job.rb | New type collector for per-job histograms with explicit type: 'resque_job' routing. |
| lib/bigcommerce/prometheus/integrations/resque/job_payload.rb | New payload parser used by per-job metrics logic. |
| lib/bigcommerce/prometheus/integrations/resque/job_metrics.rb | New parent-side Resque worker instrumentation + metric envelope emission. |
| lib/bigcommerce/prometheus/integrations/resque.rb | Wires JobMetrics.start into the Resque integration startup path. |
| lib/bigcommerce/prometheus/instrumentors/resque.rb | Registers the new ResqueJob type collector with the exporter server. |
| lib/bigcommerce/prometheus/configuration.rb | Adds resque_per_job_metrics_enabled config key (env var gated). |
| lib/bigcommerce/prometheus.rb | Requires new integration/type collector files. |
| CHANGELOG.md | Adds pending-release entry for the new per-job Resque metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
30d8f7c to
599ca5c
Compare
fba1fde to
4f67902
Compare
… on worker processing time
141bfc9 to
7be85b6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What? Why?
Adds opt-in latency metrics for Resque jobs.
resque_job_queue_latency_seconds{job_class}: how long the job spent on the queue prior to being picked up for processingresque_job_perform_duration_seconds{job_class}: how long it takes to process the jobOff by default. Opt in per service by setting
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED=1.Supports both ActiveJob and vanilla Resque jobs.
resque_job_perform_duration_secondscovers every job.resque_job_queue_latency_secondsis supported for ActiveJob-enqueued jobs only, because it reads fields from the ActiveJob payload — see below.The metrics introduced
resque_job_queue_latency_seconds
How long the job spent on the queue: time from
scheduled_at(orenqueued_at) until the worker picks the job up.Buckets:
[0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300].Latency is clamped to
0.0minimum so that clock skew between the enqueuer and the worker host cannot produce a negative observation.This metric relies on the job having the ActiveJob payload shape.
i.e.
payload['args'][0](a Hash) must containjob_class(string) — used as the metric label.enqueued_at(ISO 8601 string) — anchor whenscheduled_atis absent.scheduled_at(ISO 8601 string, optional) — preferred anchor when present.For vanilla Resque jobs without these fields, the metric silently no-ops for them.
Sample payloads for both shapes are in Payload parsing below.
The gem detects the payload by the fields above rather than by matching the
JobWrapperclass name so that non ActiveJob Resque tasks can also opt in by populating the necessary fields.Converting vanilla jobs to support resque_job_queue_latency_seconds
Vanilla Resque jobs that need queue latency should be either
Details for both routes are at the end of this doc: Converting to ActiveJob and Wrapper class alternative.
Retries
ActiveJob re-stamps both
enqueued_atandscheduled_aton every retry (including each re-queue).This means that
resque_job_queue_latency_secondsonly reflects the latency for each individual retry.To measure the entire duration from initial enqueue until dequeue of the final attempt, a separate userland metric is needed.
For example, domain events should have an
occurred_attimestamp as per the docs.resque_job_perform_duration_seconds
How long does the processing of the job take.
This is the total child-process lifetime from fork to
Process.waitpidreturn.Buckets:
[0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60].This works for both vanilla Resque jobs and ActiveJobs and doesn't rely on any payload fields.
Why this is implemented in bc-prometheus-ruby instead of each service
In-child metric collection requires a synchronous flush before
exit!which is Resque's default child-exit.The flush is bounded by the
bc-prometheus-rubyworker thread's sleep cadence (0.5 s by default) — which makes it slow relative to fast publishes.For example this metric was initially introduced in https://github.com/bigcommerce/bigpay/pull/10597 however had to be reverted as it increased Resque job run-time from 20ms to 500ms.
In contrast, the parent process is long-lived, so the bc-prom worker thread drains naturally between jobs without any synchronous wait.
Moving the instrumentation to the parent eliminates the per-job latency tax entirely.
This also ensures consistent metric names and labels.
Opt-in env var
This is opt-in because per-job series can be high cardinality when a service has many Resque job classes.
This is enabled by
PROMETHEUS_RESQUE_PER_JOB_METRICS_ENABLED(default0).It is read once at boot via the existing
Bigcommerce::Prometheus.configure.This mirrors the existing
PROMETHEUS_ENABLEDopt-in pattern.Implementation
bc-prometheus-rubyfollows a producer/aggregator split.Worker processes push metric observations as JSON envelopes to a long-running aggregator server, which accumulates them and exposes
/metricsfor Prometheus to scrape.The new per-job metrics follow the same pattern:
type: 'resque_job'envelopes.TypeCollectors::ResqueJob) on the server side receives those envelopes and records them into two Prometheus histograms.Producer —
Bigcommerce::Prometheus::Integrations::Resque::JobMetricsRecords the metrics and pushes them to the aggregator.
Payload parsing
Resque payloads arrive in one of two shapes: ActiveJob-enqueued or vanilla.
An ActiveJob enqueued via
.perform_lateris stored with ActiveJob'sJobWrapperas the Resque class, and the job's own details serialized intoargs[0]:{ "class": "ActiveJob::QueueAdapters::ResqueAdapter::JobWrapper", "args": [ { "job_class": "BigPay::SomePublishJob", "job_id": "07b08b09-8a1e-4f4d-9e0e-1f2a3b4c5d6e", "queue_name": "scheduled_action", "arguments": [12345], "enqueued_at": "2026-06-11T01:23:45.123456789Z", "scheduled_at": null } ] }A vanilla job enqueued via
Resque.enqueue(BigPay::Kount::ProcessNotificationJob, 12345)is stored with its own class at the top level and raw positional args:{ "class": "BigPay::Kount::ProcessNotificationJob", "args": [12345] }Payload parsing models each shape as its own DTO behind a common interface of
#job_classand#anchor_time:ActiveJobPayload— built from the innerargs[0]hash. From the first example:job_classis"BigPay::SomePublishJob", andanchor_timeparses fromscheduled_at, falling back toenqueued_at.VanillaResquePayload— built from the whole payload. From the second example:job_classis"BigPay::Kount::ProcessNotificationJob".anchor_timeis always nil, because the payload carries no timestamps. Malformed payloads yield'unknown'.JobPayload.for(resque_job)decides which to build, once per job.The criterion: a payload is ActiveJob-shaped when
args[0]is a Hash carrying a truthyjob_class; everything else is treated as vanilla.In the examples above, the first payload meets the criterion and the second does not.
Both DTOs eagerly extract everything in
initializeand hold no reference to theResque::Jobafterwards.The recording methods take the prebuilt payload rather than reparsing the job on each call.
Consequences:
resque_job_perform_duration_secondsemits a meaningfuljob_classlabel for vanilla Resque jobs too.However
resque_job_queue_latency_secondsno-ops for vanilla Resque jobs, since there is no#anchor_time.How the metrics are recorded
WorkerInstrumentationis a module prepended ontoResque::Workerthat wrapsperform_with_fork.It builds a payload object once per job via
JobPayload.for, records queue latency beforesuper, and records perform duration inensure:Class-method API
JobMetrics.start(client:)— no-op unless the env var is on. PrependsWorkerInstrumentationontoResque::Worker. Idempotent.JobMetrics.record_queue_latency(payload)— records how long the job sat on the queue: the seconds from the payload's#anchor_timeto now. Feedsresque_job_queue_latency_seconds, labelled by the payload's#job_class. Does nothing when the payload has no anchor.JobMetrics.record_perform_duration(payload, duration)— records how long the job took to process; the caller measures and supplies the duration. Feedsresque_job_perform_duration_seconds, labelled by the payload's#job_class.On the wire, each recording is sent to the aggregator with an envelope of
type: 'resque_job'.Both
record_*methods rescueStandardErrorand log a warning — metric push failures never propagate into the publish/perform path.Aggregator — type collectors
Everything above is the producer side, running in the worker process. The aggregator is the other half: the long-running server process that receives the pushed envelopes and exposes the accumulated metrics at
/metricsfor Prometheus to scrape.Two type collectors, one per envelope shape, are registered side-by-side in
Instrumentors::Resque#start.The upstream
PrometheusExporter::Server::Collectorroutes each envelope to whichever collector'stypematchesenvelope['type']— no in-collector dispatch needed.Bigcommerce::Prometheus::TypeCollectors::Resque: is the pre-existing aggregator and continues to own the aggregate worker/queue gauges (resque_workers_total,jobs_failed_total,jobs_pending_total,jobs_processed_total,queues_total,queue_sizes) fed byCollectors::Resque#collect.Bigcommerce::Prometheus::TypeCollectors::ResqueJob: the new aggregator owns the two new histograms:resque_job_queue_latency_secondswith buckets tuned for queue dwell ([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 30, 60, 120, 300]).resque_job_perform_duration_secondswith buckets tuned for per-job work ([0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]).Wiring
Integrations::Resque.start(client:)now also callsJobMetrics.start(client:).If the env var is off, that call returns immediately and nothing is hooked. If on, the hooks install and the metrics flow.
Integrations::Resque.startis itself invoked inside aResque.before_first_forkblock (registered byInstrumentors::Resque#setup_middleware).Converting to ActiveJob
Converting a vanilla job is mostly mechanical, with three things to plan for:
self.performshim.Wrapper class alternative
An alternative to converting a vanilla job to ActiveJob: a small service-local wrapper class that produces the ActiveJob payload shape at enqueue time.
The wrapper stamps the fields the gem reads into
args[0]and delegatesperformto the target job. The target job class is untouched. Callers opt in by enqueueing through the wrapper.This produces a payload that meets the classification criterion in Payload parsing:
args[0]is a Hash with a truthyjob_class.queue_latencyanchors on the stampedenqueued_at, and both metrics label with the target job's class name.Trade-offs versus converting to ActiveJob:
job_class,enqueued_at,scheduled_at.Testing