Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example)#17749
Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example)#17749timothy-e wants to merge 7 commits intoapache:masterfrom
Conversation
…stamps (long.min for example) (apache#524) ### Notify cc stripe-private-oss-forks/pinot-reviewers ### Summary When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0 and e2e lag showing as ~56 years (1970). This fix adds an early return of 0 for invalid/missing timestamps, restoring the original behavior from upstream commit [bea67d04](apache@bea67d04). I'm going to OSS this also, but for the sake of not slowing down pinot 1.5; need to get this in here faster investigated more in https://docs.google.com/document/d/19EUPSq2xjEBiGHynGgZmTMZOIB7zmm2FwdzdC8FjqHg/edit?tab=t.0 but after deployment in rad-canary, we can see the fix in action:  i've made it show the dedup table doesnt use the header, but the non-dedup ones do, and our upstream code sets a value of 1 hour ago, hence the 1 hour lag being shown we don't see the 56 years anymore prior:  ### Motivation https://jira.corp.stripe.com/browse/STREAMANALYTICS-4191 ### Testing deployed on rad-canary QA and metric looks correct now also updated tests ``` ``` $ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false [INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s [INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0 [INFO] BUILD SUCCESS ``` ``` ### Rollout/monitoring/revert plan s:pinot-rad-canary Stripe-Original-Repo: stripe-private-oss-forks/pinot Stripe-Monotonic-Timestamp: v2/2026-01-30T21:51:39Z/0 Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524
cc stripe-private-oss-forks/pinot-reviewers https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524 fixed getPartitionEndToEndIngestionDelayMs, but getPartitionIngestionDelayMs has the same code pattern. We saw the [ingestion lag for a table hit 56.2 years](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/d/DltsoWuVk/pinot-components?orgId=1&var-pinot_cluster=rad-noir&from=1770735477256&to=1770757067256&viewPanel=88). Follow the same pattern to avoid returning the timestamp if the _firstStreamIngestionTimeMs is too low. ``` mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false ``` ``` [INFO] Results: [INFO] [INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0 [INFO] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 15.716 s [INFO] Finished at: 2026-02-10T17:31:51-05:00 [INFO] ------------------------------------------------------------------------ [INFO] 12 goals, 12 executed ``` Stripe-Original-Repo: stripe-private-oss-forks/pinot Stripe-Monotonic-Timestamp: v2/2026-02-11T14:37:30Z/0 Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/528
|
@KKcorps @swaminathanmanish @9aman can you please TAL at this change? Thanks! |
|
cc @noob-se7en |
There was a problem hiding this comment.
Pull request overview
Fixes incorrect “time since epoch” ingestion lag values for partitions without valid timestamps by making ingestion delay getters return 0 when ingestion info is missing or timestamps are invalid (e.g., Long.MIN_VALUE), restoring prior intended behavior.
Changes:
- Return
0fromgetPartitionEndToEndIngestionDelayMs()wheningestionInfois missing or the first-stream ingestion timestamp is invalid. - Return
0fromgetPartitionIngestionDelayMs()wheningestionInfois missing or the ingestion timestamp is invalid. - Update unit tests to assert
0delay for untracked/stopped partitions and for partitions with noupdateMetrics()calls.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java |
Adds early returns to prevent computing delays against an implicit 0 timestamp when ingestion info is missing/invalid. |
pinot-core/src/test/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTrackerTest.java |
Adjusts assertions to validate 0 delay behavior for untracked partitions and metrics sampling without updateMetrics(). |
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #17749 +/- ##
============================================
+ Coverage 63.22% 63.24% +0.01%
- Complexity 1454 1456 +2
============================================
Files 3176 3186 +10
Lines 191002 191621 +619
Branches 29204 29316 +112
============================================
+ Hits 120763 121191 +428
- Misses 60818 60951 +133
- Partials 9421 9479 +58
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
pinot-common/src/main/java/org/apache/pinot/common/metrics/ServerGauge.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
Outdated
Show resolved
Hide resolved
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java
Show resolved
Hide resolved
|
@KKcorps @swaminathanmanish can one of you review as well? I need a review from someone with write access. |
@KKcorps @swaminathanmanish will you have a chance to look at this soon? If not, can you recommend anyone else to review? |
When
ingestionInfois null orfirstStreamIngestionTimeMsis negative (e.g.,Long.MIN_VALUE), the method incorrectly computedclock.millis() - 0, returning ~1.7 trillion ms instead of 0 and e2e lag showing as ~56 years (1970).This fix adds an early null return for invalid/missing timestamps. The upstream commit bea67d04 previously returned 0 in such cases, but that's also not desirable because cases where nothing is published can appear in the metrics to be fine.
A new metric is added to report if ingestion data is valid, to handle cases where firstStreamIngestionTimeMs is never set properly.
A version of this (returning 0, not null) has been running internally on Stripe's Pinot clusters for a few weeks with no issues - prior to this bug fix, we saw 56 year ingestion lag in several cases (sometimes intermittently).