fix: harden ingesting autoscalers around task-count boundaries by Fly-Style · Pull Request #19269 · apache/druid

Fly-Style · 2026-04-07T12:28:30Z

This PR:

fixes the lag-based autoscaler by using taskCount from ioConfig for scale action calculations instead of activeTaskGroups;
hardens seekable-stream autoscalers when a supervisor is configured with a manually set taskCount outside the allowed bounds. For both cost-based and lag-based autoscalers, if the current taskCount is below taskCountMin or above taskCountMax, the scaler now returns the nearest valid boundary instead of using the out-of-range value as the scaling baseline. This keeps supervisors within configured limits and avoids inconsistent scaling decisions.
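
As a rough illustration of the "nearest valid boundary" behaviour described above, the following is a minimal standalone sketch, not the code from this PR. The class name `TaskCountClampSketch` and the method `clampToBounds` are hypothetical; only the names taskCount, taskCountMin, and taskCountMax come from the description.

```java
// Hypothetical helper, not part of Druid: illustrates clamping a configured
// task count to the inclusive range [taskCountMin, taskCountMax].
final class TaskCountClampSketch
{
  private TaskCountClampSketch() {}

  static int clampToBounds(int taskCount, int taskCountMin, int taskCountMax)
  {
    if (taskCountMin > taskCountMax) {
      throw new IllegalArgumentException("taskCountMin must not exceed taskCountMax");
    }
    // Values below the minimum are raised to it, values above the maximum
    // are lowered to it, and in-range values pass through unchanged.
    return Math.min(taskCountMax, Math.max(taskCountMin, taskCount));
  }

  public static void main(String[] args)
  {
    System.out.println(clampToBounds(1, 2, 8));   // prints 2 (below min)
    System.out.println(clampToBounds(12, 2, 8));  // prints 8 (above max)
    System.out.println(clampToBounds(5, 2, 8));   // prints 5 (already in range)
  }
}
```

With this shape, an out-of-range value never becomes the baseline for later scaling arithmetic, which is the inconsistency the description refers to.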

This PR has:

been self-reviewed.
a release note entry in the PR description.
added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

amaechler

🐻

Fly-Style · 2026-04-08T10:36:16Z

cc @zhangyue19921010

jtuglu1 · 2026-04-09T17:42:08Z

...java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java

+
+    // If task count is out of bounds, scale to the configured boundary
+    // regardless of optimal task count, to get back to a safe state.
+    if (isScaleActionAllowed() && isTaskCountOutOfBounds) {


Do we want to respect this isScaleActionAllowed if we're violating min/max task count bounds?

That's a tricky one, but my take is that we will eventually scale (that is, within minTriggerScaleActionFrequencyMillis ms), and it might be harmful to scale immediately.

I don't have a strong opinion here; I am open to removing isScaleActionAllowed() from the condition.
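
To make the trade-off in this thread concrete, here is a small sketch of a cooldown gate, written under stated assumptions rather than copied from the PR. The class `ScaleCooldownSketch`, the method `correctOutOfBounds`, and the injected `nowMillis` parameter are hypothetical; the names minTriggerScaleActionFrequencyMillis, isScaleActionAllowed, and lastScaleActionTimeMillis are borrowed from the review comments, and -1 is used as the "skip" value as in the lag-based autoscaler's Javadoc.

```java
// Hypothetical sketch, not Druid code: an out-of-bounds correction that still
// respects the minimum interval between scale actions.
final class ScaleCooldownSketch
{
  private final long minTriggerScaleActionFrequencyMillis;
  // Assumption: a negative value means no scale action has happened yet.
  private long lastScaleActionTimeMillis = -1;

  ScaleCooldownSketch(long minTriggerScaleActionFrequencyMillis)
  {
    this.minTriggerScaleActionFrequencyMillis = minTriggerScaleActionFrequencyMillis;
  }

  boolean isScaleActionAllowed(long nowMillis)
  {
    return lastScaleActionTimeMillis < 0
           || nowMillis - lastScaleActionTimeMillis >= minTriggerScaleActionFrequencyMillis;
  }

  /** Returns the boundary to scale to, or -1 to skip the scale action. */
  int correctOutOfBounds(int currentTaskCount, int taskCountMin, int taskCountMax, long nowMillis)
  {
    final boolean outOfBounds = currentTaskCount < taskCountMin || currentTaskCount > taskCountMax;
    if (!outOfBounds || !isScaleActionAllowed(nowMillis)) {
      return -1;
    }
    // Record the action so the next one waits for a full cooldown interval.
    lastScaleActionTimeMillis = nowMillis;
    return Math.min(taskCountMax, Math.max(taskCountMin, currentTaskCount));
  }
}
```

In this sketch a bounds violation detected during the cooldown is deferred, not dropped: the next evaluation after the interval elapses performs the correction. Dropping the `isScaleActionAllowed` check from the condition would instead correct immediately.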

jtuglu1 · 2026-04-09T17:42:37Z

...java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java

+                                     || currentTaskCount > config.getTaskCountMax();
+    if (isTaskCountOutOfBounds) {
+      currentTaskCount = Math.min(config.getTaskCountMax(),
+                                   Math.max(config.getTaskCountMin(), supervisor.getIoConfig().getTaskCount()));


nit: use currentTaskCount

Good catch! It is cleaner.

jtuglu1 · 2026-04-09T17:43:20Z

.../apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScalerMockTest.java

+
+    final int result = autoScaler.computeTaskCountForScaleAction();
+
+    Assert.assertEquals(


nit: Either mock computeOptimalTaskCount to return a value different from the clamped value (e.g. mock it to return -1 or taskCountMin - 1) and assert the boundary is returned, or use verify(autoScaler, never()).computeOptimalTaskCount(any()) to confirm the early-return path was taken.
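
The second option in this suggestion (confirming the early-return path) can be sketched without a mocking library by using a counting subclass. This is an illustrative stand-in, not the test from the PR: `BoundedScalerSketch` and `CountingScalerSketch` are hypothetical classes, and the method signatures here are simplified assumptions, not the real CostBasedAutoScaler API.

```java
// Hypothetical stand-in for an autoscaler with a boundary early-return path.
class BoundedScalerSketch
{
  private final int taskCountMin;
  private final int taskCountMax;

  BoundedScalerSketch(int taskCountMin, int taskCountMax)
  {
    this.taskCountMin = taskCountMin;
    this.taskCountMax = taskCountMax;
  }

  // Placeholder for the real cost-based computation.
  int computeOptimalTaskCount(int currentTaskCount)
  {
    return currentTaskCount;
  }

  int computeTaskCountForScaleAction(int currentTaskCount)
  {
    if (currentTaskCount < taskCountMin || currentTaskCount > taskCountMax) {
      // Early return: go straight to the nearest boundary.
      return Math.min(taskCountMax, Math.max(taskCountMin, currentTaskCount));
    }
    return computeOptimalTaskCount(currentTaskCount);
  }
}

// Hand-written spy: counts calls and returns a value that can never be
// mistaken for a clamped boundary.
class CountingScalerSketch extends BoundedScalerSketch
{
  int optimalCalls = 0;

  CountingScalerSketch(int taskCountMin, int taskCountMax)
  {
    super(taskCountMin, taskCountMax);
  }

  @Override
  int computeOptimalTaskCount(int currentTaskCount)
  {
    optimalCalls++;
    return -1;
  }
}
```

A test built on this would assert both that the boundary is returned and that `optimalCalls` stays at zero, so it fails if the boundary value only happened to match the optimal count.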

jtuglu1 · 2026-04-09T17:43:46Z

.../java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/LagBasedAutoScaler.java

   * @return Integer, target number of tasksCount. -1 means skip scale action.
   */
-  private int computeDesiredTaskCount(List<Long> lags)
+  int computeDesiredTaskCount(List<Long> lags)


Let's mark this with the proper @VisibleForTesting annotation.

jtuglu1 · 2026-04-09T17:44:52Z

...java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/CostBasedAutoScaler.java

+    // regardless of optimal task count, to get back to a safe state.
+    if (isScaleActionAllowed() && isTaskCountOutOfBounds) {
+      taskCount = currentTaskCount;
+      log.info("Task count for supervisor[%s] was out of bounds [%d,%d], scaling.", supervisorId, config.getTaskCountMin(), config.getTaskCountMax());


Don't we want to set: lastScaleActionTimeMillis = DateTimes.nowUtc().getMillis();?

cryptoe · 2026-04-09T14:26:19Z

.../java/org/apache/druid/indexing/seekablestream/supervisor/autoscaler/LagBasedAutoScaler.java

    );

-    int currentActiveTaskCount = supervisor.getActiveTaskGroupsCount();
+    int currentActiveTaskCount = supervisor.getIoConfig().getTaskCount();


We do not need to change this, no?
Can we still reference the activeTaskGroupCount and start clamping things below?

I am still sure this is the correct approach: the autoscaler explicitly changes taskCount and operates with different task count configurations.

Anyway, my aim with this change was not only the change itself, but also to spark discussion.
@jtuglu1 WDYT about changing that?

github-actions bot added the Area - Ingestion label Apr 7, 2026

Fly-Style changed the title ~~bug: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups~~ bug: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups Apr 7, 2026

bug: use taskCount from ioConfig for scale action instead of activeTa…

746c760

…skGroups

Fly-Style changed the title ~~bug: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups~~ fix: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups Apr 7, 2026

Fly-Style force-pushed the lag-based-autoscaler branch from 4977473 to 746c760 Compare April 7, 2026 12:36

Fly-Style requested a review from jtuglu1 April 7, 2026 13:16

Increase timeout for the test

b385bf7

amaechler approved these changes Apr 7, 2026

In case of boundaries breach, scale to bound limit

b88fe7b

Fly-Style changed the title ~~fix: lag-based autoscaler: use taskCount from ioConfig for scale action instead of activeTaskGroups~~ fix: harden ingesting autoscalers around task-count boundaries Apr 8, 2026

Fly-Style requested a review from zhangyue19921010 April 8, 2026 10:36

jtuglu1 reviewed Apr 9, 2026

cryptoe reviewed Apr 9, 2026

Address review comments

b65ca37

Fly-Style requested a review from jtuglu1 April 10, 2026 08:31

Fly-Style wants to merge 4 commits into apache:master from Fly-Style:lag-based-autoscaler


4 participants

