feat(balancer): add traffic warm-up (slow start) support for upstream nodes #12991
ChuanFF wants to merge 10 commits into apache:master
Conversation
Pull request overview
This pull request adds traffic warm-up (slow start) support for upstream nodes in APISIX, addressing the problem of new nodes experiencing performance degradation or crashes when they suddenly receive full traffic loads without time to warm up caches, connection pools, or complete JIT compilation.
Changes:
- Implements automatic detection and gradual weight ramping for newly added upstream nodes
- Adds `warm_up_conf` configuration with `slow_start_time_seconds`, `min_weight_percent`, `aggression`, and `interval` parameters
- Preserves update timestamps for existing nodes during configuration updates to maintain their warm-up state
- Supports warm-up configuration at Upstream, Service, and Route levels
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| t/node/warm-up.t | Comprehensive test suite covering warm-up behavior at different configuration levels, traffic distribution during warm-up, and timestamp preservation |
| docs/zh/latest/admin-api.md | Chinese documentation for warm_up_conf parameters and behavior explanation |
| docs/en/latest/admin-api.md | English documentation for warm_up_conf parameters and behavior explanation |
| apisix/schema_def.lua | Schema definition for warm_up_conf and update_time field for nodes |
| apisix/balancer.lua | Core warm-up logic including weight calculation algorithm and version management for cache invalidation |
| apisix/admin/upstreams.lua | Timestamp management logic to preserve update_time for existing nodes and assign timestamps to new nodes |
| apisix/admin/services.lua | Integration of warm-up timestamp management for service-level upstream configuration |
| apisix/admin/routes.lua | Integration of warm-up timestamp management for route-level upstream configuration |
| apisix/admin/resource.lua | Addition of initialize_conf hook to enable timestamp management before configuration storage |
| apisix/utils/upstream.lua | Updated node comparison sort function to handle both host and port for proper node identification |
| apisix/upstream.lua | Clock skew protection to prevent future timestamps from causing issues |
```
To allow Upstream to have a separate connection pool, use `keepalive_pool`. It can be configured by modifying its child fields.

Here is the polished English translation of your text:
```
The line "Here is the polished English translation of your text:" appears to be editorial text that was accidentally left in the documentation. This should be removed as it doesn't belong in the final documentation.
```lua
end

if warm_up_end_time and ngx_time() < warm_up_end_time then
    version = version .. math_floor(ngx_time() / warm_up_conf.interval)
```
The version calculation uses string concatenation with math_floor(ngx_time() / warm_up_conf.interval). The default interval is 1 second according to the schema. However, if the interval is set to a very large value (e.g., 3600 seconds), the version would only update once per hour, meaning weight calculations would be cached for an hour even though weights should be recalculated based on elapsed time. This could cause nodes to remain at reduced weights longer than intended during the warm-up period. Consider adding documentation or validation to ensure interval values are appropriate for the warm-up duration, or ensure the cache invalidation logic accounts for this scenario.
```diff
-version = version .. math_floor(ngx_time() / warm_up_conf.interval)
+local interval = warm_up_conf.interval or 1
+if interval <= 0 then
+    interval = 1
+end
+if warm_up_conf.slow_start_time_seconds
+        and interval > warm_up_conf.slow_start_time_seconds
+then
+    interval = warm_up_conf.slow_start_time_seconds
+end
+version = version .. math_floor(ngx_time() / interval)
```
Adding a schema constraint for `interval < slow_start_time_seconds` is difficult, and the necessity seems rather low, right?
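To make the cache-invalidation concern concrete, here is a small self-contained Lua demo of the time bucketing. The helper name `version_suffix` is hypothetical; only the `math.floor(now / interval)` bucketing mirrors the line being discussed.

```lua
local math_floor = math.floor

-- Hypothetical helper mirroring the patch's bucketing: the cached
-- weights are recomputed only when this suffix changes.
local function version_suffix(now, interval)
    return tostring(math_floor(now / interval))
end

-- With interval = 1, the suffix changes every second:
print(version_suffix(1000, 1))    --> "1000"
print(version_suffix(1001, 1))    --> "1001"

-- With interval = 3600, a 30-second warm-up runs entirely inside
-- one bucket, so the reduced weights would stay cached for an hour:
print(version_suffix(1000, 3600)) --> "0"
print(version_suffix(4600, 3600)) --> "1"
```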
```lua
if self.initialize_conf then
    self.initialize_conf(id, conf)
end
```
The initialize_conf hook is called after schema validation but before encrypt_conf. This ordering means that initialize_conf modifies the configuration by adding update_time fields to nodes, but this modification happens after the schema check has already passed. If the schema validation made a deep copy and validated that, the actual conf being stored might have different fields than what was validated. While this appears intentional for this feature, it's worth verifying that modifying conf after validation doesn't violate any assumptions in the etcd storage layer or cause issues with configuration comparison logic elsewhere in the codebase.
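For context, a condensed sketch of the call order this comment describes. The `put` wrapper and the `check_conf`/`encrypt_conf` call sites are assumptions standing in for the existing validation and encryption steps; only the `initialize_conf` hook shape comes from the diff above.

```lua
-- Condensed sketch (assumed structure) of the admin resource flow:
-- schema validation first, then the new initialize_conf hook
-- mutates conf, then encryption, then storage.
local function put(self, id, conf)
    local ok, err = self.check_conf(id, conf)   -- schema validation
    if not ok then
        return 400, err
    end

    if self.initialize_conf then
        -- runs after validation: stamps update_time onto conf.nodes
        self.initialize_conf(id, conf)
    end

    if self.encrypt_conf then
        -- runs last, before the conf is written to etcd
        self.encrypt_conf(id, conf)
    end

    return 200, conf
end
```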
```
--- config
location /t {
    content_by_lua_block {
        local upstream = require("apisix.upstream").get_by_id(1)
```
I think we can call `ngx.sleep(10)` directly, which is much easier to understand.
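A minimal sketch of the suggested simplification, assuming the surrounding Test::Nginx block; the final assertion is illustrative:

```
--- config
location /t {
    content_by_lua_block {
        -- sleep past the warm-up window instead of faking
        -- node timestamps
        ngx.sleep(10)
        local upstream = require("apisix.upstream").get_by_id(1)
        ngx.say(upstream and "passed" or "failed")
    }
}
```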
```
--- response_body
passed
```
Ditto. Add a test case to check `node.update_time` via a Lua script.
Confirm the Admin API works as expected.
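A sketch of such a check; `lib.test_admin` and `toolkit.json` follow the patterns used elsewhere in the APISIX test suite, and the exact response shape is an assumption:

```lua
content_by_lua_block {
    local t = require("lib.test_admin").test
    local json = require("toolkit.json")

    -- fetch the stored upstream back through the Admin API
    local code, _, res = t('/apisix/admin/upstreams/1', ngx.HTTP_GET)
    assert(code < 300)

    -- response shape assumed: {"key": ..., "value": {...}}
    local nodes = json.decode(res).value.nodes
    for _, node in ipairs(nodes) do
        -- every node should have been stamped on creation
        assert(type(node.update_time) == "number")
    end
    ngx.say("passed")
}
```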
```
--- response_body
passed
```
We'd better add a test case to check `node.update_time` and confirm that the `update_time` of existing nodes is preserved. It is very important to confirm and demonstrate this behavior.
Background
In production environments, when new upstream nodes join a service cluster, sudden traffic spikes can cause performance degradation or even service crashes due to un-warmed caches, unestablished database connection pools, or incomplete JIT compilation. Currently, APISIX lacks a traffic protection mechanism that lets new nodes take on traffic gradually and smoothly.
Solution
When `warm_up_conf` is configured in an upstream, APISIX automatically detects newly added nodes and calculates temporary weights for them. These weights start from a configured minimum percentage and gradually increase to the original weight over a specified warm-up period, enabling smooth traffic control.
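To make the ramp concrete, here is a minimal sketch under stated assumptions: the power-curve formula below is an illustration consistent with this description (and with `aggression` as a curve-shaping exponent), not the literal patch code.

```lua
-- Illustrative warm-up weight curve (assumed formula): ramps from
-- min_weight_percent of the configured weight up to the full weight
-- over the slow-start window.
local function warm_up_weight(weight, elapsed, conf)
    if elapsed >= conf.slow_start_time_seconds then
        return weight -- warm-up finished, restore original weight
    end
    local min_ratio = conf.min_weight_percent / 100
    -- aggression > 1 ramps quickly at first, < 1 ramps slowly
    local ratio = (elapsed / conf.slow_start_time_seconds) ^ (1 / conf.aggression)
    ratio = math.max(min_ratio, ratio)
    return math.max(1, math.floor(weight * ratio))
end

-- With the PR's example config (weight 100, 30s window, min 10%,
-- aggression 1): ~10 at t=0..3s, ~50 at t=15s, 100 at t>=30s.
```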
Key Design Points
- Automatically assigns an `update_time` field to newly added nodes while preserving timestamps for existing nodes
- The growth curve is controlled by the `aggression` parameter
- Only applies to list-format nodes (`[{"host": "127.0.0.1", "port": 8080}]`), not the map format (`{"127.0.0.1:8080": 100}`)

Configuration Example
{ "type": "roundrobin", "nodes": [ {"host": "127.0.0.1", "port": 1980, "weight": 100}, {"host": "127.0.0.1", "port": 1981, "weight": 100} ], "warm_up_conf": { "slow_start_time_seconds": 30, "min_weight_percent": 10, "aggression": 1, "interval": 1 } }Technical Implementation
- Adds the `warm_up_conf` configuration item with warm-up duration, minimum weight percentage, and growth curve parameters
- Calculates temporary weights based on each node's `update_time` during load balancer weight calculation
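A condensed sketch of the timestamp bookkeeping this implies; the helper name and the `host:port` matching key are hypothetical, but the preserve-or-stamp behavior follows the PR description:

```lua
local ngx_time = ngx.time

-- Hypothetical helper illustrating the described behavior: keep
-- update_time for nodes that already exist, stamp new ones.
local function manage_node_timestamps(new_nodes, old_nodes)
    local old_times = {}
    for _, node in ipairs(old_nodes or {}) do
        old_times[node.host .. ":" .. node.port] = node.update_time
    end

    local now = ngx_time()
    for _, node in ipairs(new_nodes or {}) do
        local key = node.host .. ":" .. node.port
        -- preserve the timestamp of an existing node so its
        -- warm-up state survives unrelated config updates
        node.update_time = old_times[key] or now
    end
end
```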
Testing
Includes comprehensive test cases (t/node/warm-up.t) verifying warm-up behavior at different configuration levels, traffic distribution during warm-up, and timestamp preservation.

Checklist