Add UvicornMonitor for zero-downtime API server worker recycling#60919
kaxil wants to merge 1 commit into apache:main
Conversation
Implement rolling worker restarts for the API server using uvicorn's SIGTTIN/SIGTTOU signal-based worker management. This provides feature parity with Airflow 2's GunicornMonitor while maintaining zero downtime.

Key features:

- Rolling restart pattern: spawn new workers before killing old ones
- Health check verification before killing old workers
- Automatic rollback if new workers fail health checks
- Configurable refresh interval and batch size

New configuration options:

- worker_refresh_interval: Seconds between refresh cycles (default: 0/disabled)
- worker_refresh_batch_size: Workers to refresh at a time (default: 1)

This addresses memory accumulation in long-running API server workers by periodically recycling them while ensuring continuous availability.
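A minimal sketch of the "health check verification" step described above - hypothetical code, not the PR's implementation. The endpoint path and the use of `httpx` are assumptions for illustration:

```python
import time

import httpx


def wait_until_healthy(base_url: str, timeout: float = 30.0) -> bool:
    """Poll a health endpoint until it returns 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # "/health" is a placeholder; the real monitor would hit whatever
            # health endpoint the API server actually exposes.
            if httpx.get(f"{base_url}/health", timeout=2.0).status_code == 200:
                return True
        except httpx.HTTPError:
            pass  # new workers may still be booting; retry
        time.sleep(1.0)
    return False
```

Only after this gate passes would the monitor terminate the old workers; if it never passes, the new workers are rolled back instead.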
From the config definition under review:

```yaml
version_added: 3.2.0
type: integer
example: ~
default: "0"
```
Should we use some more robust default here?
I guess we could have a minimum of 2 workers by default, have rolling restarts enabled by default, and set the refresh interval to some reasonable default value (say 1 hr).

While it increases memory usage, as explained in the PR description, it is far more robust for an "out-of-the-box" installation, and if users would like to save memory by limiting it to 1 worker, they should still be able to do so (and then we could disable the uvicorn restarts altogether). I think it's more important to have "robust, non-leaking, gently restarting" by default rather than "potentially leaking, smaller initial memory".

Since FastAPI's recommendation is 1 worker per Pod, we could also recommend to users - in the same docs - to use rolling restarts of the deployment instead of the restarts handled by the Uvicorn monitor. And we could add a description to the number-of-workers parameter saying that setting it to 1 is fine as long as you have more than one instance of the api-server and implement a refreshing rolling restart outside of Airflow. This way, users who would like to change it to 1 (and who read the docs, that is) would know that they also need to take care of potential leaks.
There are even some ways to do it automatically, which we could potentially implement in our Helm Chart as an option (see the sketch after the link below):
https://stackoverflow.com/questions/67423919/k8s-cronjob-rolling-restart-every-day
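The linked answer boils down to a CronJob that periodically triggers `kubectl rollout restart` on the Deployment. A hypothetical sketch - the names (`airflow-api-server`, `restart-sa`) and the schedule are placeholders, and `restart-sa` needs RBAC permission to patch the Deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: api-server-rolling-restart
spec:
  schedule: "0 4 * * *"  # once a day at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restart-sa
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command: ["kubectl", "rollout", "restart", "deployment/airflow-api-server"]
```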
Closing this in favor of #60940
Alternate/complement to #60804
Why
API server workers accumulate memory over time. Airflow 2 had `GunicornMonitor` with rolling restarts - spawn new workers, health check them, then kill old ones. This was lost when we switched to uvicorn.

Uvicorn's built-in `limit_max_requests` doesn't help here - it kills old workers before spawning new ones, causing downtime.

What
New `UvicornMonitor` class that does rolling restarts.

Enabled via config:
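For example, something like this in `airflow.cfg` - a sketch only: the two `worker_refresh_*` options and their defaults come from the PR description, while the `[api]` section name and the `workers` option are assumptions:

```ini
[api]
# Must be >1 for signal-based recycling to work
# (see the single-worker gotcha below).
workers = 2
# Seconds between refresh cycles; 0 (the default) disables recycling.
worker_refresh_interval = 3600
# How many workers to recycle at a time (default: 1).
worker_refresh_batch_size = 1
```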
Gotchas (vs GunicornMonitor)
Single-worker mode: Gunicorn always uses a master/worker architecture even with `workers=1` (master + 1 worker), so signals always work. Uvicorn with `workers=1` runs in single-process mode with no supervisor, so SIGTTIN/SIGTTOU are ignored. Workaround: start with 2 workers and scale down after startup. Works, but briefly has 2 workers during refresh cycles.

Uvicorn kills the newest worker, not the oldest: Gunicorn kills the oldest worker on SIGTTOU (`workers.pop(0)`, sorted by age). Uvicorn kills the newest (`processes.pop()` - LIFO). This would kill our fresh, healthy workers instead of the old ones. Fixed by sending SIGTERM directly to the specific old PIDs instead of using SIGTTOU (see the sketch after this list).

No memory sharing between workers: Unlike Gunicorn (which uses preload + fork for copy-on-write memory sharing), uvicorn's multiprocess mode spawns independent workers that each load everything separately, so each worker carries its own full memory footprint.
This is an inherent limitation of uvicorn's multiprocessing approach vs gunicorn's preload+fork model.
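To make the second gotcha concrete, here is a minimal sketch of one refresh cycle under those constraints - hypothetical code, not the PR's implementation. It grows the pool with SIGTTIN, gates on a health check, and terminates the specific old PIDs with SIGTERM; SIGTTOU is used only for rollback, where its LIFO behavior conveniently removes exactly the new workers:

```python
import os
import signal
from collections.abc import Callable


def refresh_batch(
    supervisor_pid: int,
    old_worker_pids: list[int],
    workers_healthy: Callable[[], bool],
) -> None:
    """Recycle a batch of old workers without dropping below capacity."""
    # 1. Ask the uvicorn supervisor to spawn one replacement per old worker.
    for _ in old_worker_pids:
        os.kill(supervisor_pid, signal.SIGTTIN)

    # 2. Verify the new workers come up healthy; if not, roll back.
    if not workers_healthy():
        # SIGTTOU removes workers LIFO, i.e. exactly the ones just spawned.
        for _ in old_worker_pids:
            os.kill(supervisor_pid, signal.SIGTTOU)
        return

    # 3. Retire the old workers by PID - never via SIGTTOU, which would
    #    kill the fresh workers instead.
    for pid in old_worker_pids:
        os.kill(pid, signal.SIGTERM)
```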
Testing
Related