Batch BigQuery label fetching and skip GCS for cached job runs#3685

Open
mstaeble wants to merge 1 commit into
openshift:mainfrom
mstaeble:worktree-batch-bq-labels
@mstaeble mstaeble commented Jun 24, 2026

Summary

  • Replace per-job BigQuery label queries with a single bulk prefetch before the worker loop, eliminating thousands of individual BQ round-trips per load cycle
  • Move the prowJobRunCache check before the GCS FindAllMatches listing so already-processed runs skip the expensive object listing entirely
  • Keep createOrUpdateProwJob before the cache check to ensure job definitions (variants, release) stay current

Evidence from production logs

Production fetchdata runs (hourly, June 24 2026) show that the prow loader spends about 24-29 minutes processing 16-18K jobs from BigQuery, but only 2-3% of those jobs actually need GCS processing. The rest are duplicates from the 12-hour lookback overlap.

The current code flow in prowJobToJobRun fetches GCS artifacts (path resolution, bucket client, JUnit file matching) before checking the in-memory cache. This means roughly 3,300-3,500 jobs per run go through full GCS I/O only to be discarded as already processed.

| Metric | 14:00 run | 15:03 run | 16:08 run |
| --- | --- | --- | --- |
| Jobs from BigQuery | 16,368 | 17,292 | 18,644 |
| Skipped by in-memory cache (before prowJobToJobRun) | ~12,604 (77%) | ~13,444 (78%) | ~14,699 (79%) |
| GCS fetched then found already processed | 3,297 (20%) | 3,327 (19%) | 3,537 (19%) |
| GCS fetched and actually needed | 467 (3%) | 521 (3%) | 408 (2%) |
| Prow loader time | 24m 31s | 25m 18s | 28m 49s |

Moving the cache check before the GCS fetch eliminates ~3,500 unnecessary GCS round-trips per run (~19% of total jobs).
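
The reordering can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `loader` type, its counters, `updateJobDefinition`, `listArtifacts`, and the artifact path are hypothetical stand-ins for createOrUpdateProwJob, the prowJobRunCache check, and the GCS FindAllMatches listing, whose real signatures are not shown in this description.

```go
package main

import "fmt"

// loader is a hypothetical stand-in for the prow loader state.
type loader struct {
	cache       map[string]bool // build IDs of runs already processed
	jobUpdates  int             // how often the job definition was refreshed
	gcsListings int             // how often the expensive listing ran
}

// updateJobDefinition stands in for createOrUpdateProwJob.
func (l *loader) updateJobDefinition(jobName string) {
	l.jobUpdates++
}

// listArtifacts stands in for the GCS object listing (the expensive step).
// The returned path is made up for the example.
func (l *loader) listArtifacts(buildID string) []string {
	l.gcsListings++
	return []string{"artifacts/" + buildID + "/junit.xml"}
}

// processRun reports whether the run needed GCS processing.
func (l *loader) processRun(jobName, buildID string) bool {
	// 1. Always refresh the job definition so variants and release stay current.
	l.updateJobDefinition(jobName)

	// 2. Check the cache before touching GCS; cached runs return immediately.
	if l.cache[buildID] {
		return false
	}

	// 3. Only uncached runs pay for the listing.
	_ = l.listArtifacts(buildID)
	l.cache[buildID] = true
	return true
}

func main() {
	l := &loader{cache: map[string]bool{"1001": true}}
	for _, id := range []string{"1001", "1002", "1001"} {
		fmt.Printf("build %s imported=%v\n", id, l.processRun("example-job", id))
	}
	fmt.Printf("job updates=%d gcs listings=%d\n", l.jobUpdates, l.gcsListings)
}
```

In this sketch the job definition is refreshed for all three runs, while the listing runs only once, for the single uncached build.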

Evidence from staging

Deployed the PR image to staging and ran two load cycles against the staging database.

Run 1 (cold start, empty staging DB):

  • 205,482 jobs from BigQuery
  • Bulk label prefetch: 205K build IDs returned 4,636 labels in 12.7 seconds (single BQ query)
  • Processed 29,705 of 205K jobs in ~43 minutes before being stopped (all jobs required full GCS processing since the DB was empty)
  • 7,291 new job runs inserted

Run 2 (warm start, 7K runs already cached):

  • 21,353 jobs from BigQuery
  • Bulk label prefetch: 21K build IDs returned 363 labels in 2.75 seconds
  • Of 17,041 jobs processed (at time of observation), only 2,977 (17%) needed GCS processing
  • 14,064 jobs (83%) skipped GCS listing entirely via the early cache check
  • Cache-hit jobs processed at the rate of thousands per second (no I/O)

Label verification:

  • Staging database shows 55,667 of 544,340 job runs have labels applied
  • Most recent labeled runs (June 24) show correct values (e.g., ImagePullNeverCompletes, TestFailureDuringHighCPUEvents)

Test plan

  • go vet ./pkg/dataloader/prowloader/... passes
  • go test ./pkg/dataloader/prowloader/... passes
  • go build ./cmd/sippy/... compiles
  • Verify in staging that bulk label prefetch logs show a single BQ query with count and duration
  • Verify cached job runs skip GCS listing (83% of jobs skipped in warm run)
  • Verify newly imported job runs still have correct labels in the database

🤖 Generated with Claude Code

The prow loader was making an individual BigQuery query per job run to
fetch labels, resulting in thousands of round-trips during each load
cycle. Replace with a single bulk query before the worker loop.

Also move the prowJobRunCache check before the GCS FindAllMatches
listing so already-processed runs skip the expensive object listing
entirely. The createOrUpdateProwJob call remains before the cache
check to keep job definitions current.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Warning

Review limit reached

@mstaeble, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 52 minutes and 53 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: d621747b-c714-49ce-916c-01c6d94f2341

📥 Commits

Reviewing files that changed from the base of the PR and between 09af781 and cb42fa3.

📒 Files selected for processing (1)
  • pkg/dataloader/prowloader/prow.go
@openshift-ci openshift-ci Bot requested review from deads2k and xueqzhan June 24, 2026 17:13
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mstaeble
Once this PR has been reviewed and has the lgtm label, please assign dgoodwin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot

Scheduling required tests:
/test e2e

@openshift-ci

openshift-ci Bot commented Jun 24, 2026

@mstaeble: all tests passed!
