Add UvicornMonitor for zero-downtime API server worker recycling#60919
kaxil wants to merge 1 commit into apache:main
Conversation
Implement rolling worker restarts for the API server using uvicorn's SIGTTIN/SIGTTOU signal-based worker management. This provides feature parity with Airflow 2's GunicornMonitor while maintaining zero downtime.

Key features:

- Rolling restart pattern: spawn new workers before killing old ones
- Health check verification before killing old workers
- Automatic rollback if new workers fail health checks
- Configurable refresh interval and batch size

New configuration options:

- worker_refresh_interval: Seconds between refresh cycles (default: 0/disabled)
- worker_refresh_batch_size: Workers to refresh at a time (default: 1)

This addresses memory accumulation in long-running API server workers by periodically recycling them while ensuring continuous availability.
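A minimal sketch of the "health check verification" step described above - hypothetical code, not the PR's implementation. The endpoint path and the use of `httpx` are assumptions for illustration:

```python
import time

import httpx


def wait_until_healthy(base_url: str, timeout: float = 30.0) -> bool:
    """Poll a health endpoint until it returns 200, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # "/health" is a placeholder; the real monitor would hit whatever
            # health endpoint the API server actually exposes.
            if httpx.get(f"{base_url}/health", timeout=2.0).status_code == 200:
                return True
        except httpx.HTTPError:
            pass  # new workers may still be booting; retry
        time.sleep(1.0)
    return False
```

Only after this gate passes would the monitor terminate the old workers; if it never passes, the new workers are rolled back instead.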
From the config definition under review:

```yaml
version_added: 3.2.0
type: integer
example: ~
default: "0"
```
Should we use some more robust default here?
I guess we could have a minimum of 2 workers by default, have rolling restarts enabled by default, and set the refresh interval to some reasonable default value (say 1 hr).

While it increases memory usage, as explained in the PR description, it is far more robust for an "out-of-the-box" installation, and if users would like to save memory by limiting it to 1 worker, they should still be able to do so (and then we could disable the uvicorn restarts altogether). I think it's more important to have "robust, non-leaking, gently restarting" by default rather than "potentially leaking, smaller initial memory".

Since FastAPI's recommendation is 1 worker per Pod, we could also recommend to users - in the same docs - to use rolling restarts of the deployment instead of the restarts handled by the Uvicorn monitor. And we could add a description to the number-of-workers parameter saying that setting it to 1 is fine as long as you have more than one instance of the api-server and implement a refreshing rolling restart outside of Airflow. This way, users who would like to change it to 1 (and who read the docs, that is) would know that they also need to take care of potential leaks.
There are even some ways to do it automatically, which we could potentially implement in our Helm Chart as an option (see the sketch after the link below):
https://stackoverflow.com/questions/67423919/k8s-cronjob-rolling-restart-every-day
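The linked answer boils down to a CronJob that periodically triggers `kubectl rollout restart` on the Deployment. A hypothetical sketch - the names (`airflow-api-server`, `restart-sa`) and the schedule are placeholders, and `restart-sa` needs RBAC permission to patch the Deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: api-server-rolling-restart
spec:
  schedule: "0 4 * * *"  # once a day at 04:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restart-sa
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command: ["kubectl", "rollout", "restart", "deployment/airflow-api-server"]
```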
Closing this in favor of #60940
Alternate/complement to #60804
Why
API server workers accumulate memory over time. Airflow 2 had `GunicornMonitor` with rolling restarts - spawn new workers, health check them, then kill old ones. This was lost when we switched to uvicorn.

Uvicorn's built-in `limit_max_requests` doesn't help here - it kills old workers before spawning new ones, causing downtime.

What
New `UvicornMonitor` class that does rolling restarts.

Enabled via config:
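For example, something like this in `airflow.cfg` - a sketch only: the two `worker_refresh_*` options and their defaults come from the PR description, while the `[api]` section name and the `workers` option are assumptions:

```ini
[api]
# Must be >1 for signal-based recycling to work
# (see the single-worker gotcha below).
workers = 2
# Seconds between refresh cycles; 0 (the default) disables recycling.
worker_refresh_interval = 3600
# How many workers to recycle at a time (default: 1).
worker_refresh_batch_size = 1
```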
Gotchas (vs GunicornMonitor)
Single-worker mode: Gunicorn always uses a master/worker architecture even with `workers=1` (master + 1 worker), so signals always work. Uvicorn with `workers=1` runs in single-process mode with no supervisor, so SIGTTIN/SIGTTOU are ignored. Workaround: start with 2 workers and scale down after startup. Works, but briefly has 2 workers during refresh cycles.

Uvicorn kills the newest worker, not the oldest: Gunicorn kills the oldest worker on SIGTTOU (`workers.pop(0)`, sorted by age). Uvicorn kills the newest (`processes.pop()` - LIFO). This would kill our fresh, healthy workers instead of the old ones. Fixed by sending SIGTERM directly to the specific old PIDs instead of using SIGTTOU (see the sketch after this list).

No memory sharing between workers: Unlike Gunicorn (which uses preload + fork for copy-on-write memory sharing), uvicorn's multiprocess mode spawns independent workers that each load everything separately, so each worker carries its own full memory footprint.
This is an inherent limitation of uvicorn's multiprocessing approach vs gunicorn's preload+fork model.
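To make the second gotcha concrete, here is a minimal sketch of one refresh cycle under those constraints - hypothetical code, not the PR's implementation. It grows the pool with SIGTTIN, gates on a health check, and terminates the specific old PIDs with SIGTERM; SIGTTOU is used only for rollback, where its LIFO behavior conveniently removes exactly the new workers:

```python
import os
import signal
from collections.abc import Callable


def refresh_batch(
    supervisor_pid: int,
    old_worker_pids: list[int],
    workers_healthy: Callable[[], bool],
) -> None:
    """Recycle a batch of old workers without dropping below capacity."""
    # 1. Ask the uvicorn supervisor to spawn one replacement per old worker.
    for _ in old_worker_pids:
        os.kill(supervisor_pid, signal.SIGTTIN)

    # 2. Verify the new workers come up healthy; if not, roll back.
    if not workers_healthy():
        # SIGTTOU removes workers LIFO, i.e. exactly the ones just spawned.
        for _ in old_worker_pids:
            os.kill(supervisor_pid, signal.SIGTTOU)
        return

    # 3. Retire the old workers by PID - never via SIGTTOU, which would
    #    kill the fresh workers instead.
    for pid in old_worker_pids:
        os.kill(pid, signal.SIGTERM)
```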
Testing
Related