Skip to content

fix: harden ingesting autoscalers around task-count boundaries#19269

Open
Fly-Style wants to merge 4 commits intoapache:masterfrom
Fly-Style:lag-based-autoscaler
Open

fix: harden ingesting autoscalers around task-count boundaries#19269
Fly-Style wants to merge 4 commits intoapache:masterfrom
Fly-Style:lag-based-autoscaler

Conversation

@Fly-Style
Copy link
Copy Markdown
Contributor

@Fly-Style Fly-Style commented Apr 7, 2026

This PR:

  • fix lag-based autoscaler by using taskCount from ioConfig for scale action calculations instead of activeTaskGroups;
  • hardens seekable-stream autoscalers when a supervisor is configured with a handwritten taskCount outside the allowed bounds. For both cost-based and lag-based autoscalers, if the current taskCount is below taskCountMin or above taskCountMax, the scaler now returns the nearest valid boundary instead of using the out-of-range value as the scaling baseline. This keeps supervisors within configured limits and avoids inconsistent scaling decisions.

This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@Fly-Style Fly-Style changed the title bug: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups bug: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups Apr 7, 2026
@Fly-Style Fly-Style changed the title bug: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups fix: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups Apr 7, 2026
@Fly-Style Fly-Style force-pushed the lag-based-autoscaler branch from 4977473 to 746c760 Compare April 7, 2026 12:36
@Fly-Style Fly-Style requested a review from jtuglu1 April 7, 2026 13:16
Copy link
Copy Markdown
Contributor

@amaechler amaechler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐻

@Fly-Style

This comment was marked as outdated.

@Fly-Style Fly-Style changed the title fix: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups fix: harden ingesting autoscalers around task-count boundaries Apr 8, 2026
@Fly-Style
Copy link
Copy Markdown
Contributor Author

cc @zhangyue19921010


// If task count is out of bounds, scale to the configured boundary
// regardless of optimal task count, to get back to a safe state.
if (isScaleActionAllowed() && isTaskCountOutOfBounds) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to respect this isScaleActionAllowed if we're violating min/max task count bounds?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a tricky thing, but my take here is -- eventually we will scale (and by eventually I mean - within minTriggerScaleActionFrequencyMillis ms), and it might be harmful to scale immediately.

I don't have a strong opinion here, I am open to remove isScaleActionAllowed() from the condition.

|| currentTaskCount > config.getTaskCountMax();
if (isTaskCountOutOfBounds) {
currentTaskCount = Math.min(config.getTaskCountMax(),
Math.max(config.getTaskCountMin(), supervisor.getIoConfig().getTaskCount()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use currentTaskCount

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! It is cleaner.


final int result = autoScaler.computeTaskCountForScaleAction();

Assert.assertEquals(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Either mock computeOptimalTaskCount to return a value different from the clamped value (e.g. mock it to return -1 or taskCountMin - 1) and assert the boundary is returned, or use verify(autoScaler, never()).computeOptimalTaskCount(any()) to confirm the early-return path was taken.

* @return Integer, target number of tasksCount. -1 means skip scale action.
*/
private int computeDesiredTaskCount(List<Long> lags)
int computeDesiredTaskCount(List<Long> lags)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's mark with the proper @VisibleForTests annotation

// regardless of optimal task count, to get back to a safe state.
if (isScaleActionAllowed() && isTaskCountOutOfBounds) {
taskCount = currentTaskCount;
log.info("Task count for supervisor[%s] was out of bounds [%d,%d], scaling.", supervisorId, config.getTaskCountMin(), config.getTaskCountMax());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to set: lastScaleActionTimeMillis = DateTimes.nowUtc().getMillis();?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

);

int currentActiveTaskCount = supervisor.getActiveTaskGroupsCount();
int currentActiveTaskCount = supervisor.getIoConfig().getTaskCount();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We donot need to change this no?
We can still reference the activeTaskGroupCount and start clamping things below ?

Copy link
Copy Markdown
Contributor Author

@Fly-Style Fly-Style Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am still sure that is the correct approach - autoscaler explicitly changes taskCount and operates with different task counts configurations.

Anyway, my aim of changing this was not only the change itself, but also to born discussion.
@jtuglu1 WDYT regarding changing that?

@Fly-Style Fly-Style requested a review from jtuglu1 April 10, 2026 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants