feat(balancer): add traffic warm-up (slow start) support for upstream nodes #12991
ChuanFF wants to merge 10 commits into apache:master
Conversation
Pull request overview
This pull request adds traffic warm-up (slow start) support for upstream nodes in APISIX, addressing the problem of new nodes experiencing performance degradation or crashes when they suddenly receive full traffic loads without time to warm up caches, connection pools, or complete JIT compilation.
Changes:
- Implements automatic detection and gradual weight ramping for newly added upstream nodes
- Adds `warm_up_conf` configuration with `slow_start_time_seconds`, `min_weight_percent`, `aggression`, and `interval` parameters
- Preserves update timestamps for existing nodes during configuration updates to maintain their warm-up state
- Supports warm-up configuration at Upstream, Service, and Route levels
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| t/node/warm-up.t | Comprehensive test suite covering warm-up behavior at different configuration levels, traffic distribution during warm-up, and timestamp preservation |
| docs/zh/latest/admin-api.md | Chinese documentation for warm_up_conf parameters and behavior explanation |
| docs/en/latest/admin-api.md | English documentation for warm_up_conf parameters and behavior explanation |
| apisix/schema_def.lua | Schema definition for warm_up_conf and update_time field for nodes |
| apisix/balancer.lua | Core warm-up logic including weight calculation algorithm and version management for cache invalidation |
| apisix/admin/upstreams.lua | Timestamp management logic to preserve update_time for existing nodes and assign timestamps to new nodes |
| apisix/admin/services.lua | Integration of warm-up timestamp management for service-level upstream configuration |
| apisix/admin/routes.lua | Integration of warm-up timestamp management for route-level upstream configuration |
| apisix/admin/resource.lua | Addition of initialize_conf hook to enable timestamp management before configuration storage |
| apisix/utils/upstream.lua | Updated node comparison sort function to handle both host and port for proper node identification |
| apisix/upstream.lua | Clock skew protection to prevent future timestamps from causing issues |
```
To allow Upstream to have a separate connection pool, use `keepalive_pool`. It can be configured by modifying its child fields.

Here is the polished English translation of your text:
```
The line "Here is the polished English translation of your text:" appears to be editorial text that was accidentally left in the documentation. This should be removed as it doesn't belong in the final documentation.
```lua
end

if warm_up_end_time and ngx_time() < warm_up_end_time then
    version = version .. math_floor(ngx_time() / warm_up_conf.interval)
```
The version calculation uses string concatenation with math_floor(ngx_time() / warm_up_conf.interval). The default interval is 1 second according to the schema. However, if the interval is set to a very large value (e.g., 3600 seconds), the version would only update once per hour, meaning weight calculations would be cached for an hour even though weights should be recalculated based on elapsed time. This could cause nodes to remain at reduced weights longer than intended during the warm-up period. Consider adding documentation or validation to ensure interval values are appropriate for the warm-up duration, or ensure the cache invalidation logic accounts for this scenario.
```diff
-version = version .. math_floor(ngx_time() / warm_up_conf.interval)
+local interval = warm_up_conf.interval or 1
+if interval <= 0 then
+    interval = 1
+end
+if warm_up_conf.slow_start_time_seconds
+        and interval > warm_up_conf.slow_start_time_seconds
+then
+    interval = warm_up_conf.slow_start_time_seconds
+end
+version = version .. math_floor(ngx_time() / interval)
```
Adding a schema constraint for `interval < slow_start_time_seconds` is difficult, and the necessity seems rather low, right?
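To make the cache-invalidation concern concrete, here is a small self-contained Lua demo of the time bucketing. The helper name `version_suffix` is hypothetical; only the `math.floor(now / interval)` bucketing mirrors the line being discussed.

```lua
local math_floor = math.floor

-- Hypothetical helper mirroring the patch's bucketing: the cached
-- weights are recomputed only when this suffix changes.
local function version_suffix(now, interval)
    return tostring(math_floor(now / interval))
end

-- With interval = 1, the suffix changes every second:
print(version_suffix(1000, 1))    --> "1000"
print(version_suffix(1001, 1))    --> "1001"

-- With interval = 3600, a 30-second warm-up runs entirely inside
-- one bucket, so the reduced weights would stay cached for an hour:
print(version_suffix(1000, 3600)) --> "0"
print(version_suffix(4600, 3600)) --> "1"
```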
```lua
if self.initialize_conf then
    self.initialize_conf(id, conf)
end
```
The initialize_conf hook is called after schema validation but before encrypt_conf. This ordering means that initialize_conf modifies the configuration by adding update_time fields to nodes, but this modification happens after the schema check has already passed. If the schema validation made a deep copy and validated that, the actual conf being stored might have different fields than what was validated. While this appears intentional for this feature, it's worth verifying that modifying conf after validation doesn't violate any assumptions in the etcd storage layer or cause issues with configuration comparison logic elsewhere in the codebase.
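For context, a condensed sketch of the call order this comment describes. The `put` wrapper and the `check_conf`/`encrypt_conf` call sites are assumptions standing in for the existing validation and encryption steps; only the `initialize_conf` hook shape comes from the diff above.

```lua
-- Condensed sketch (assumed structure) of the admin resource flow:
-- schema validation first, then the new initialize_conf hook
-- mutates conf, then encryption, then storage.
local function put(self, id, conf)
    local ok, err = self.check_conf(id, conf)   -- schema validation
    if not ok then
        return 400, err
    end

    if self.initialize_conf then
        -- runs after validation: stamps update_time onto conf.nodes
        self.initialize_conf(id, conf)
    end

    if self.encrypt_conf then
        -- runs last, before the conf is written to etcd
        self.encrypt_conf(id, conf)
    end

    return 200, conf
end
```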
```
--- config
location /t {
    content_by_lua_block {
        local upstream = require("apisix.upstream").get_by_id(1)
```
I think we can call `ngx.sleep(10)` directly, which is much easier to understand.
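A minimal sketch of the suggested simplification, assuming the surrounding Test::Nginx block; the final assertion is illustrative:

```
--- config
location /t {
    content_by_lua_block {
        -- sleep past the warm-up window instead of faking
        -- node timestamps
        ngx.sleep(10)
        local upstream = require("apisix.upstream").get_by_id(1)
        ngx.say(upstream and "passed" or "failed")
    }
}
```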
```
--- response_body
passed
```
Ditto. Add a test case to check `node.update_time` via a Lua script.
Confirm the Admin API works as expected.
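A sketch of such a check; `lib.test_admin` and `toolkit.json` follow the patterns used elsewhere in the APISIX test suite, and the exact response shape is an assumption:

```lua
content_by_lua_block {
    local t = require("lib.test_admin").test
    local json = require("toolkit.json")

    -- fetch the stored upstream back through the Admin API
    local code, _, res = t('/apisix/admin/upstreams/1', ngx.HTTP_GET)
    assert(code < 300)

    -- response shape assumed: {"key": ..., "value": {...}}
    local nodes = json.decode(res).value.nodes
    for _, node in ipairs(nodes) do
        -- every node should have been stamped on creation
        assert(type(node.update_time) == "number")
    end
    ngx.say("passed")
}
```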
```
--- response_body
passed
```
We'd better add a test case to check `node.update_time` and confirm that the `update_time` of existing nodes is preserved. It is very important to confirm and demonstrate this behavior.
Background
In production environments, when new upstream nodes join a service cluster, sudden traffic spikes can cause performance degradation or even service crashes due to un-warmed caches, unestablished database connection pools, or incomplete JIT compilation. Currently, APISIX lacks a traffic protection mechanism that lets new nodes take on traffic gradually and smoothly.
Solution
When `warm_up_conf` is configured in an upstream, APISIX automatically detects newly added nodes and calculates temporary weights for them. These weights start from a configured minimum percentage and gradually increase to the original weight over a specified warm-up period, enabling smooth traffic control.
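To make the ramp concrete, here is a minimal sketch under stated assumptions: the power-curve formula below is an illustration consistent with this description (and with `aggression` as a curve-shaping exponent), not the literal patch code.

```lua
-- Illustrative warm-up weight curve (assumed formula): ramps from
-- min_weight_percent of the configured weight up to the full weight
-- over the slow-start window.
local function warm_up_weight(weight, elapsed, conf)
    if elapsed >= conf.slow_start_time_seconds then
        return weight -- warm-up finished, restore original weight
    end
    local min_ratio = conf.min_weight_percent / 100
    -- aggression > 1 ramps quickly at first, < 1 ramps slowly
    local ratio = (elapsed / conf.slow_start_time_seconds) ^ (1 / conf.aggression)
    ratio = math.max(min_ratio, ratio)
    return math.max(1, math.floor(weight * ratio))
end

-- With the PR's example config (weight 100, 30s window, min 10%,
-- aggression 1): ~10 at t=0..3s, ~50 at t=15s, 100 at t>=30s.
```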
Key Design Points
- Automatically assigns an `update_time` field to newly added nodes while preserving timestamps for existing nodes
- The growth curve is controlled by the `aggression` parameter
- Only applies to list-format nodes (`[{"host": "127.0.0.1", "port": 8080}]`), not the map format (`{"127.0.0.1:8080": 100}`)

Configuration Example
{ "type": "roundrobin", "nodes": [ {"host": "127.0.0.1", "port": 1980, "weight": 100}, {"host": "127.0.0.1", "port": 1981, "weight": 100} ], "warm_up_conf": { "slow_start_time_seconds": 30, "min_weight_percent": 10, "aggression": 1, "interval": 1 } }Technical Implementation
- Adds the `warm_up_conf` configuration item with warm-up duration, minimum weight percentage, and growth curve parameters
- Calculates temporary weights based on each node's `update_time` during load balancer weight calculation
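A condensed sketch of the timestamp bookkeeping this implies; the helper name and the `host:port` matching key are hypothetical, but the preserve-or-stamp behavior follows the PR description:

```lua
local ngx_time = ngx.time

-- Hypothetical helper illustrating the described behavior: keep
-- update_time for nodes that already exist, stamp new ones.
local function manage_node_timestamps(new_nodes, old_nodes)
    local old_times = {}
    for _, node in ipairs(old_nodes or {}) do
        old_times[node.host .. ":" .. node.port] = node.update_time
    end

    local now = ngx_time()
    for _, node in ipairs(new_nodes or {}) do
        local key = node.host .. ":" .. node.port
        -- preserve the timestamp of an existing node so its
        -- warm-up state survives unrelated config updates
        node.update_time = old_times[key] or now
    end
end
```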
Testing
Includes comprehensive test cases (t/node/warm-up.t) verifying warm-up behavior at different configuration levels, traffic distribution during warm-up, and timestamp preservation.

Checklist