Skip to content

Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example)#17749

Open
timothy-e wants to merge 7 commits intoapache:masterfrom
timothy-e:fix-ingestion-delay
Open

Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example)#17749
timothy-e wants to merge 7 commits intoapache:masterfrom
timothy-e:fix-ingestion-delay

Conversation

@timothy-e
Copy link

@timothy-e timothy-e commented Feb 23, 2026

When ingestionInfo is null or firstStreamIngestionTimeMs is negative (e.g., Long.MIN_VALUE), the method incorrectly computed clock.millis() - 0, returning ~1.7 trillion ms instead of 0 and e2e lag showing as ~56 years (1970).

This fix adds an early null return for invalid/missing timestamps. The upstream commit bea67d04 previously returned 0 in such cases, but that's also not desirable because cases where nothing is published can appear in the metrics to be fine.

A new metric is added to report if ingestion data is valid, to handle cases where firstStreamIngestionTimeMs is never set properly.

$ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false

[INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS

A version of this (returning 0, not null) has been running internally on Stripe's Pinot clusters for a few weeks with no issues - prior to this bug fix, we saw 56 year ingestion lag in several cases (sometimes intermittently).

priyen-stripe and others added 2 commits February 23, 2026 15:26
…stamps (long.min for example) (apache#524)

### Notify
cc stripe-private-oss-forks/pinot-reviewers

### Summary
When `ingestionInfo` is null or `firstStreamIngestionTimeMs` is negative (e.g., `Long.MIN_VALUE`), the method incorrectly computed `clock.millis() - 0`, returning ~1.7 trillion ms instead of 0 and e2e lag showing as ~56 years (1970).

This fix adds an early return of 0 for invalid/missing timestamps, restoring the original behavior from upstream commit [bea67d04](apache@bea67d04).

I'm going to OSS this also, but for the sake of not slowing down pinot 1.5; need to get this in here faster

investigated more in https://docs.google.com/document/d/19EUPSq2xjEBiGHynGgZmTMZOIB7zmm2FwdzdC8FjqHg/edit?tab=t.0

but after deployment in rad-canary, we can see the fix in action:
![image](https://git.corp.stripe.com/user-attachments/assets/f60b44d9-4379-42a0-8cb0-a359c6276b86)

i've made it show the dedup table doesnt use the header, but the non-dedup ones do, and our upstream code sets a value of 1 hour ago, hence the 1 hour lag being shown

we don't see the 56 years anymore prior:
![image](https://git.corp.stripe.com/user-attachments/assets/b0db590b-d1b3-4033-b54e-68290aa2ee3c)



### Motivation
https://jira.corp.stripe.com/browse/STREAMANALYTICS-4191

### Testing
deployed on rad-canary QA and metric looks correct now

also updated tests
```
```
$ mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false

[INFO] Running org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.797 s -- in org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testIngestionDelay -- Time elapsed: 0.147 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayOffset -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithAging -- Time elapsed: 0 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testRecordIngestionDelayWithNoAging -- Time elapsed: 0.016 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testShutdown -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelay -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testStopTrackingIngestionDelayWithSegment -- Time elapsed: 0.003 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testTrackerConstructors -- Time elapsed: 0.001 s
[INFO] org.apache.pinot.core.data.manager.realtime.IngestionDelayTrackerTest.testUpdateLatestStreamOffset -- Time elapsed: 0.003 s
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO] BUILD SUCCESS
```
```

### Rollout/monitoring/revert plan
s:pinot-rad-canary


Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-01-30T21:51:39Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524
cc stripe-private-oss-forks/pinot-reviewers

https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/524 fixed getPartitionEndToEndIngestionDelayMs, but getPartitionIngestionDelayMs has the same code pattern.

We saw the [ingestion lag for a table hit 56.2 years](https://g-8916660cfe.grafana-workspace.us-west-2.amazonaws.com/d/DltsoWuVk/pinot-components?orgId=1&var-pinot_cluster=rad-noir&from=1770735477256&to=1770757067256&viewPanel=88).

Follow the same pattern to avoid returning the timestamp if the _firstStreamIngestionTimeMs is too low.

```
mvn test -pl pinot-core -Dtest=IngestionDelayTrackerTest -DfailIfNoTests=false
```
```
[INFO] Results:
[INFO]
[INFO] Tests run: 9, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  15.716 s
[INFO] Finished at: 2026-02-10T17:31:51-05:00
[INFO] ------------------------------------------------------------------------
[INFO] 12 goals, 12 executed
```

Stripe-Original-Repo: stripe-private-oss-forks/pinot
Stripe-Monotonic-Timestamp: v2/2026-02-11T14:37:30Z/0
Stripe-Original-PR: https://git.corp.stripe.com/stripe-private-oss-forks/pinot/pull/528
@timothy-e
Copy link
Author

@KKcorps @swaminathanmanish @9aman can you please TAL at this change? Thanks!

@yashmayya
Copy link
Contributor

cc @noob-se7en

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes incorrect “time since epoch” ingestion lag values for partitions without valid timestamps by making ingestion delay getters return 0 when ingestion info is missing or timestamps are invalid (e.g., Long.MIN_VALUE), restoring prior intended behavior.

Changes:

  • Return 0 from getPartitionEndToEndIngestionDelayMs() when ingestionInfo is missing or the first-stream ingestion timestamp is invalid.
  • Return 0 from getPartitionIngestionDelayMs() when ingestionInfo is missing or the ingestion timestamp is invalid.
  • Update unit tests to assert 0 delay for untracked/stopped partitions and for partitions with no updateMetrics() calls.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
pinot-core/src/main/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTracker.java Adds early returns to prevent computing delays against an implicit 0 timestamp when ingestion info is missing/invalid.
pinot-core/src/test/java/org/apache/pinot/core/data/manager/realtime/IngestionDelayTrackerTest.java Adjusts assertions to validate 0 delay behavior for untracked partitions and metrics sampling without updateMetrics().

@codecov-commenter
Copy link

codecov-commenter commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.24%. Comparing base (739043c) to head (0bb0f8e).
⚠️ Report is 57 commits behind head on master.

Files with missing lines Patch % Lines
...e/data/manager/realtime/IngestionDelayTracker.java 69.23% 2 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17749      +/-   ##
============================================
+ Coverage     63.22%   63.24%   +0.01%     
- Complexity     1454     1456       +2     
============================================
  Files          3176     3186      +10     
  Lines        191002   191621     +619     
  Branches      29204    29316     +112     
============================================
+ Hits         120763   121191     +428     
- Misses        60818    60951     +133     
- Partials       9421     9479      +58     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 55.57% <71.42%> (-7.61%) ⬇️
java-21 63.22% <71.42%> (+0.03%) ⬆️
temurin 63.24% <71.42%> (+0.01%) ⬆️
unittests 63.24% <71.42%> (+0.01%) ⬆️
unittests1 55.63% <71.42%> (+0.07%) ⬆️
unittests2 34.13% <7.14%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

timothy-e and others added 2 commits February 24, 2026 10:00
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@timothy-e timothy-e changed the title Fix getPartition(EndToEnd)IngestionDelayMs to return 0 for invalid timestamps (long.min for example) Fix getPartition(EndToEnd)IngestionDelayMs to return null for invalid timestamps (long.min for example) Mar 3, 2026
Copy link
Contributor

@noob-se7en noob-se7en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@timothy-e
Copy link
Author

@KKcorps @swaminathanmanish can one of you review as well? I need a review from someone with write access.

@timothy-e
Copy link
Author

@KKcorps @swaminathanmanish can one of you review as well? I need a review from someone with write access.

@KKcorps @swaminathanmanish will you have a chance to look at this soon? If not, can you recommend anyone else to review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants