fix: harden ingesting autoscalers around task-count boundaries#19269
fix: harden ingesting autoscalers around task-count boundaries#19269Fly-Style wants to merge 4 commits intoapache:masterfrom
Conversation
taskCount from ioConfig for scale action instead of activeTaskGroups
taskCount from ioConfig for scale action instead of activeTaskGroupstaskCount from ioConfig for scale action instead of activeTaskGroups
4977473 to
746c760
Compare
This comment was marked as outdated.
This comment was marked as outdated.
taskCount from ioConfig for scale action instead of activeTaskGroups|
|
||
| // If task count is out of bounds, scale to the configured boundary | ||
| // regardless of optimal task count, to get back to a safe state. | ||
| if (isScaleActionAllowed() && isTaskCountOutOfBounds) { |
There was a problem hiding this comment.
Do we want to respect this isScaleActionAllowed if we're violating min/max task count bounds?
There was a problem hiding this comment.
That's a tricky thing, but my take here is -- eventually we will scale (and by eventually I mean - within minTriggerScaleActionFrequencyMillis ms), and it might be harmful to scale immediately.
I don't have a strong opinion here, I am open to remove isScaleActionAllowed() from the condition.
| || currentTaskCount > config.getTaskCountMax(); | ||
| if (isTaskCountOutOfBounds) { | ||
| currentTaskCount = Math.min(config.getTaskCountMax(), | ||
| Math.max(config.getTaskCountMin(), supervisor.getIoConfig().getTaskCount())); |
There was a problem hiding this comment.
Good catch! It is cleaner.
|
|
||
| final int result = autoScaler.computeTaskCountForScaleAction(); | ||
|
|
||
| Assert.assertEquals( |
There was a problem hiding this comment.
nit: Either mock computeOptimalTaskCount to return a value different from the clamped value (e.g. mock it to return -1 or taskCountMin - 1) and assert the boundary is returned, or use verify(autoScaler, never()).computeOptimalTaskCount(any()) to confirm the early-return path was taken.
| * @return Integer, target number of tasksCount. -1 means skip scale action. | ||
| */ | ||
| private int computeDesiredTaskCount(List<Long> lags) | ||
| int computeDesiredTaskCount(List<Long> lags) |
There was a problem hiding this comment.
Let's mark with the proper @VisibleForTests annotation
| // regardless of optimal task count, to get back to a safe state. | ||
| if (isScaleActionAllowed() && isTaskCountOutOfBounds) { | ||
| taskCount = currentTaskCount; | ||
| log.info("Task count for supervisor[%s] was out of bounds [%d,%d], scaling.", supervisorId, config.getTaskCountMin(), config.getTaskCountMax()); |
There was a problem hiding this comment.
Don't we want to set: lastScaleActionTimeMillis = DateTimes.nowUtc().getMillis();?
| ); | ||
|
|
||
| int currentActiveTaskCount = supervisor.getActiveTaskGroupsCount(); | ||
| int currentActiveTaskCount = supervisor.getIoConfig().getTaskCount(); |
There was a problem hiding this comment.
We donot need to change this no?
We can still reference the activeTaskGroupCount and start clamping things below ?
There was a problem hiding this comment.
I am still sure that is the correct approach - autoscaler explicitly changes taskCount and operates with different task counts configurations.
Anyway, my aim of changing this was not only the change itself, but also to born discussion.
@jtuglu1 WDYT regarding changing that?
This PR:
taskCountfromioConfigfor scale action calculations instead ofactiveTaskGroups;taskCountoutside the allowed bounds. For both cost-based and lag-based autoscalers, if the currenttaskCountis belowtaskCountMinor abovetaskCountMax, the scaler now returns the nearest valid boundary instead of using the out-of-range value as the scaling baseline. This keeps supervisors within configured limits and avoids inconsistent scaling decisions.This PR has: