feat(balancer): add traffic warm-up (slow start) support for upstream nodes #12991

Closed
ChuanFF wants to merge 10 commits into apache:master from ChuanFF:feat-upstream-warm-up

Conversation

Contributor

@ChuanFF ChuanFF commented Feb 9, 2026

Background

In production environments, when new upstream nodes join a service cluster, a sudden surge of traffic can degrade performance or even crash the service because caches are cold, database connection pools are not yet established, or JIT compilation is incomplete. APISIX currently has no mechanism to protect new nodes by shifting traffic to them gradually.

Solution

When warm_up_conf is configured in an upstream, APISIX automatically detects newly added nodes and calculates temporary weights for them. These weights start from a configured minimum percentage and gradually increase to the original weight over a specified warm-up period, enabling smooth traffic control.

Key Design Points

  1. Warm-up Algorithm Reference: Inspired by Envoy's Slow Start implementation (Envoy Slow Start Documentation)
  2. Node Identification: Automatically identifies new nodes via the update_time field while preserving timestamps for existing nodes
  3. Weight Calculation: Uses an exponential growth curve controlled by the aggression parameter
  4. Format Limitation: Only supports array-format node definitions ([{"host": "127.0.0.1", "port": 8080}]), not map format ({"127.0.0.1:8080": 100})
  5. Configuration Inheritance: Supports warm-up configuration at Route, Service, and Upstream levels
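
For reference, the weight formula given in Envoy's slow start documentation can be written as below. Whether this PR implements it term for term is an assumption, since the description only says the design is inspired by Envoy. Here $w$ is the configured weight, $t$ the number of seconds since the node's update_time, $T$ is slow_start_time_seconds, $p$ is min_weight_percent expressed as a fraction, and $a$ is aggression.

```latex
w_{\text{eff}}(t) = w \cdot \max\!\left(p,\ \left(\frac{\max(t,\,1)}{T}\right)^{1/a}\right),
\qquad 0 \le t < T,
\qquad w_{\text{eff}}(t) = w \ \text{ for } t \ge T
```

Under this formula, $a = 1$ gives a linear ramp, and larger values of $a$ raise the weight faster early in the window.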

Configuration Example

{
  "type": "roundrobin",
  "nodes": [
    {"host": "127.0.0.1", "port": 1980, "weight": 100},
    {"host": "127.0.0.1", "port": 1981, "weight": 100}
  ],
  "warm_up_conf": {
    "slow_start_time_seconds": 30,
    "min_weight_percent": 10,
    "aggression": 1,
    "interval": 1
  }
}
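
To make the effect of these settings concrete, here is a small Python sketch (not APISIX's Lua code) that evaluates a slow-start ramp for the values above. It assumes the weight formula from Envoy's slow start documentation, which this PR cites as its inspiration. `effective_weight` is a hypothetical helper, and both the rounding and the floor of 1 are assumptions.

```python
def effective_weight(weight, elapsed, slow_start_time_seconds,
                     min_weight_percent, aggression=1.0):
    """Hypothetical helper: weight of a node `elapsed` seconds after it joined."""
    if elapsed >= slow_start_time_seconds:
        return weight  # warm-up finished, use the configured weight
    # Envoy-style ramp: fraction of the window elapsed, shaped by aggression.
    time_factor = max(elapsed, 1) / slow_start_time_seconds
    factor = max(min_weight_percent / 100.0, time_factor ** (1.0 / aggression))
    # Assumption: round to an integer and never drop below 1.
    return max(1, round(weight * factor))


# Values from the configuration example: 30 s window, 10 % floor, aggression 1.
timeline = {t: effective_weight(100, t, 30, 10, 1) for t in (0, 3, 15, 30)}
print(timeline)
```

Under these assumptions the new node starts at the 10% floor, reaches half of its configured weight midway through the window, and returns to the full weight of 100 once 30 seconds have passed.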

Technical Implementation

  • Added warm_up_conf configuration item with warm-up duration, minimum weight percentage, and growth curve parameters
  • Dynamically adjusts actual node weights based on update_time during load balancer weight calculation
  • Ensures weight update consistency through configuration version management
  • Automatically maintains update timestamps during node addition/removal
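
The timestamp bookkeeping in the last bullet can be sketched roughly as follows. This is a Python illustration, not the PR's Lua code: `stamp_nodes` is a hypothetical helper, and identifying a node by its (host, port) pair is an assumption based on the PR's description of node comparison.

```python
import time


def stamp_nodes(old_nodes, new_nodes, now=None):
    """Hypothetical helper: carry update_time over for nodes that already exist."""
    now = int(time.time()) if now is None else now
    known = {(n["host"], n["port"]): n.get("update_time") for n in old_nodes}
    stamped = []
    for node in new_nodes:
        previous = known.get((node["host"], node["port"]))
        # Existing node: keep its timestamp so its warm-up is not restarted.
        # New node: stamp it with the current time so warm-up starts now.
        stamped.append({**node,
                        "update_time": previous if previous is not None else now})
    return stamped


old = [{"host": "127.0.0.1", "port": 1980, "weight": 100, "update_time": 1000}]
new = [
    {"host": "127.0.0.1", "port": 1980, "weight": 100},
    {"host": "127.0.0.1", "port": 1981, "weight": 100},
]
result = stamp_nodes(old, new, now=2000)
```

Only the node on port 1981 receives the new timestamp, so under this scheme only that node would enter warm-up.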

Testing

Includes comprehensive test cases verifying:

  1. Traffic skew when new nodes join
  2. Traffic growth curve during warm-up
  3. Normal load balancing after warm-up completion
  4. Timestamp preservation during configuration updates
  5. Compatibility across different configuration levels

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@dosubot (bot) added the size:XL (This PR changes 500-999 lines, ignoring generated files.) and enhancement (New feature or request) labels on Feb 9, 2026
@Baoyuantop requested a review from Copilot on February 13, 2026 06:17

Copilot AI left a comment

Pull request overview

This pull request adds traffic warm-up (slow start) support for upstream nodes in APISIX, addressing the problem of new nodes experiencing performance degradation or crashes when they suddenly receive full traffic loads without time to warm up caches, connection pools, or complete JIT compilation.

Changes:

  • Implements automatic detection and gradual weight ramping for newly added upstream nodes
  • Adds warm_up_conf configuration with slow_start_time_seconds, min_weight_percent, aggression, and interval parameters
  • Preserves update timestamps for existing nodes during configuration updates to maintain their warm-up state
  • Supports warm-up configuration at Upstream, Service, and Route levels

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| t/node/warm-up.t | Comprehensive test suite covering warm-up behavior at different configuration levels, traffic distribution during warm-up, and timestamp preservation |
| docs/zh/latest/admin-api.md | Chinese documentation for warm_up_conf parameters and behavior explanation |
| docs/en/latest/admin-api.md | English documentation for warm_up_conf parameters and behavior explanation |
| apisix/schema_def.lua | Schema definition for warm_up_conf and update_time field for nodes |
| apisix/balancer.lua | Core warm-up logic including weight calculation algorithm and version management for cache invalidation |
| apisix/admin/upstreams.lua | Timestamp management logic to preserve update_time for existing nodes and assign timestamps to new nodes |
| apisix/admin/services.lua | Integration of warm-up timestamp management for service-level upstream configuration |
| apisix/admin/routes.lua | Integration of warm-up timestamp management for route-level upstream configuration |
| apisix/admin/resource.lua | Addition of initialize_conf hook to enable timestamp management before configuration storage |
| apisix/utils/upstream.lua | Updated node comparison sort function to handle both host and port for proper node identification |
| apisix/upstream.lua | Clock skew protection to prevent future timestamps from causing issues |

Comment thread t/node/warm-up.t
Comment thread t/node/warm-up.t
Comment thread t/node/warm-up.t Outdated
Comment thread apisix/balancer.lua
Comment thread apisix/balancer.lua
Comment thread apisix/admin/upstreams.lua
Comment thread docs/en/latest/admin-api.md Outdated

To allow Upstream to have a separate connection pool, use `keepalive_pool`. It can be configured by modifying its child fields.

Here is the polished English translation of your text:

Copilot AI Feb 13, 2026

The line "Here is the polished English translation of your text:" appears to be editorial text that was accidentally left in the documentation. This should be removed as it doesn't belong in the final documentation.

Comment thread apisix/balancer.lua
end

if warm_up_end_time and ngx_time() < warm_up_end_time then
    version = version .. math_floor(ngx_time() / warm_up_conf.interval)

Copilot AI Feb 13, 2026

The version calculation uses string concatenation with math_floor(ngx_time() / warm_up_conf.interval). The default interval is 1 second according to the schema. However, if the interval is set to a very large value (e.g., 3600 seconds), the version would only update once per hour, meaning weight calculations would be cached for an hour even though weights should be recalculated based on elapsed time. This could cause nodes to remain at reduced weights longer than intended during the warm-up period. Consider adding documentation or validation to ensure interval values are appropriate for the warm-up duration, or ensure the cache invalidation logic accounts for this scenario.

Suggested change
- version = version .. math_floor(ngx_time() / warm_up_conf.interval)
+ local interval = warm_up_conf.interval or 1
+ if interval <= 0 then
+     interval = 1
+ end
+ if warm_up_conf.slow_start_time_seconds
+    and interval > warm_up_conf.slow_start_time_seconds
+ then
+     interval = warm_up_conf.slow_start_time_seconds
+ end
+ version = version .. math_floor(ngx_time() / interval)
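
The caching concern can be illustrated with a short Python sketch (the PR itself is Lua). `versioned_key` is a hypothetical stand-in for the version string built in apisix/balancer.lua. The clamp mirrors the suggested change and is not confirmed to be part of the PR.

```python
import math


def versioned_key(version, now, interval, slow_start_time_seconds):
    """Hypothetical stand-in: append a time bucket so cached weights expire."""
    if interval is None or interval <= 0:
        interval = 1
    # Clamp so the bucket changes at least once per warm-up window.
    interval = min(interval, slow_start_time_seconds)
    return f"{version}{math.floor(now / interval)}"


# With interval=1 the key changes every second. With interval=3600 and a 30 s
# window, the clamp makes it change every 30 s instead of once per hour.
keys_fast = {versioned_key("v1#", t, 1, 30) for t in range(100, 103)}
keys_slow = {versioned_key("v1#", t, 3600, 30) for t in (90, 119, 120)}
```

Each distinct key corresponds to one recalculation of the weights, so the bucket width bounds how stale a warm-up weight can become.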

Contributor Author

Adding a schema constraint that enforces interval < slow_start_time_seconds is difficult, and it does not seem very necessary, right?

Comment thread apisix/admin/resource.lua
Comment on lines +128 to +130
if self.initialize_conf then
    self.initialize_conf(id, conf)
end

Copilot AI Feb 13, 2026

The initialize_conf hook is called after schema validation but before encrypt_conf. This ordering means that initialize_conf modifies the configuration by adding update_time fields to nodes, but this modification happens after the schema check has already passed. If the schema validation made a deep copy and validated that, the actual conf being stored might have different fields than what was validated. While this appears intentional for this feature, it's worth verifying that modifying conf after validation doesn't violate any assumptions in the etcd storage layer or cause issues with configuration comparison logic elsewhere in the codebase.
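
A minimal Python sketch of the ordering described here, using hypothetical function names (the real code is Lua in apisix/admin/resource.lua). It only shows how a field added after validation ends up in what would be stored without having been checked. It makes no claim about how APISIX's schema check or etcd layer actually behave.

```python
def check_schema(conf):
    """Hypothetical validator: every node needs a host and a port."""
    return all("host" in n and "port" in n for n in conf["nodes"])


def initialize_conf(conf, now):
    """Hypothetical hook: stamp each node after validation has passed."""
    for node in conf["nodes"]:
        node.setdefault("update_time", now)


def put(conf, now):
    validated_keys = [set(n) for n in conf["nodes"]]  # what the schema check saw
    if not check_schema(conf):
        raise ValueError("invalid conf")
    initialize_conf(conf, now)                        # mutates conf in place
    stored_keys = [set(n) for n in conf["nodes"]]     # what would be stored
    return validated_keys, stored_keys


seen, stored = put({"nodes": [{"host": "127.0.0.1", "port": 1980}]}, now=2000)
```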

Comment thread docs/zh/latest/admin-api.md
Comment thread t/node/warm-up.t Outdated
--- config
    location /t {
        content_by_lua_block {
            local upstream = require("apisix.upstream").get_by_id(1)
Member

@membphis membphis Feb 27, 2026

I think we can call ngx.sleep(10) directly, which is much easier to understand.

Contributor Author

done

Comment thread t/node/warm-up.t
--- response_body
passed


Member

Ditto: please add a test case that checks node.update_time from a Lua script.

Member

This is to confirm that the Admin API works as expected.

Comment thread t/node/warm-up.t
--- response_body
passed


Member

We should add a test case that checks node.update_time and confirms that update_time is preserved.

It is very important to verify and demonstrate this behavior.

Contributor Author

done

@Baoyuantop added the wait for update label (wait for the author's response in this issue/PR) on Feb 27, 2026
@dosubot (bot) added the size:XXL label (This PR changes 1000+ lines, ignoring generated files.) and removed the size:XL label (This PR changes 500-999 lines, ignoring generated files.) on Feb 27, 2026
@ChuanFF ChuanFF closed this Mar 2, 2026
5 participants