
fix: Graceful shutdown for all UDFs #337

Merged
BulkBeing merged 15 commits into main from graceful-shutdown on Mar 18, 2026

Conversation

@BulkBeing BulkBeing (Contributor) commented Mar 16, 2026

Tested all vertex types by running it in a pipeline

NOTE: multiproc map was not tested since numaflow Rust version doesn't support it. I only checked the behaviour on pod deletion.

[Two screenshots attached, Mar 17, 2026]

@codecov codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 77.00258% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.50%. Comparing base (bcdc4c3) to head (e0e2624).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
...flow/pynumaflow/sourcer/servicer/async_servicer.py 74.24% 6 Missing and 11 partials ⚠️
...flow/pynumaflow/reducer/servicer/async_servicer.py 75.00% 5 Missing and 2 partials ⚠️
...ow/pynumaflow/accumulator/servicer/task_manager.py 16.66% 5 Missing ⚠️
.../pynumaflow/accumulator/servicer/async_servicer.py 88.23% 0 Missing and 4 partials ⚠️
...flow/pynumaflow/mapper/_servicer/_sync_servicer.py 87.09% 2 Missing and 2 partials ⚠️
...s/pynumaflow/pynumaflow/mapper/multiproc_server.py 50.00% 4 Missing ⚠️
...w/pynumaflow/sourcetransformer/multiproc_server.py 50.00% 4 Missing ⚠️
...pynumaflow/sourcetransformer/servicer/_servicer.py 87.09% 2 Missing and 2 partials ⚠️
.../pynumaflow/pynumaflow/accumulator/async_server.py 66.66% 2 Missing and 1 partial ⚠️
.../pynumaflow/pynumaflow/batchmapper/async_server.py 66.66% 3 Missing ⚠️
... and 14 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
- Coverage   94.34%   92.50%   -1.84%     
==========================================
  Files          67       67              
  Lines        3182     3509     +327     
  Branches      179      229      +50     
==========================================
+ Hits         3002     3246     +244     
- Misses        145      199      +54     
- Partials       35       64      +29     


Signed-off-by: Sreekanth <prsreekanth920@gmail.com>
@BulkBeing BulkBeing marked this pull request as ready for review March 17, 2026 06:02
@BulkBeing BulkBeing requested a review from yhl25 March 17, 2026 06:07
# event loop explicitly here, the python process will not exit.
# It remains stuck for 5 minutes until the liveness and readiness probes
# fail enough times and k8s sends a SIGTERM.
asyncio.get_event_loop().stop()
Member

Suggested change:
- asyncio.get_event_loop().stop()
+ asyncio.get_running_loop().stop()

asyncio.get_event_loop() is deprecated when called outside a running loop. Use asyncio.get_running_loop() instead?
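The difference can be shown with a minimal, self-contained sketch (shutdown_handler is an illustrative name, not pynumaflow code): inside a coroutine a loop is always running, so get_running_loop() is the non-deprecated way to reach it, and stopping it lets run_forever() return instead of the process hanging until k8s kills the pod.

```python
import asyncio

async def shutdown_handler() -> None:
    # Inside a coroutine a loop is guaranteed to be running, so
    # get_running_loop() is safe; get_event_loop() here is the
    # deprecated spelling of the same thing.
    loop = asyncio.get_running_loop()
    loop.call_soon(loop.stop)  # lets run_forever() below return

loop = asyncio.new_event_loop()
loop.create_task(shutdown_handler())
loop.run_forever()  # exits once loop.stop() takes effect
loop.close()
```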

@BulkBeing BulkBeing requested a review from vigith March 18, 2026 00:13
@vigith vigith left a comment (Member)

I am doing a cursory review; please let me know if this is valid.

shutdown_task = asyncio.create_task(_watch_for_shutdown())
try:
await server.wait_for_termination()
except asyncio.CancelledError:
Member

if this happens, then rest of the code after except won't be invoked, correct?

Contributor Author

The asyncio.CancelledError will be raised when the event loop shuts down or the task is cancelled explicitly. This causes the block of code under except asyncio.CancelledError to execute. We want to ignore this exception.
All other exceptions will be caught in the BaseException catch blocks, which are categorized as critical and mostly indicate a UDF error, which we should propagate to numa.
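The control flow being described can be sketched with plain asyncio (serve() and the events list are illustrative stand-ins, not the actual pynumaflow code): cancelling the task raises CancelledError inside the coroutine, the except block swallows it, and the code after the except still runs.

```python
import asyncio

events = []

async def serve() -> None:
    try:
        # Stands in for `await server.wait_for_termination()`.
        await asyncio.Event().wait()
    except asyncio.CancelledError:
        # Expected on shutdown: swallow it so the cleanup below
        # still runs instead of propagating as a UDF failure.
        events.append("cancelled")
    events.append("cleanup ran")

async def main() -> None:
    task = asyncio.create_task(serve())
    await asyncio.sleep(0)  # let serve() start waiting
    task.cancel()           # simulates SIGTERM-driven cancellation
    await task

asyncio.run(main())
```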

Member

can you add a comment why we are doing so for posterity?

Contributor Author

I checked this part of the code again. There is no BaseException catch here. The reason is added under the CancelledError exception block.

# SIGTERM received — aiorun cancels all tasks. We must stop
# the gRPC server explicitly so its __del__ doesn't try to
# schedule a coroutine on the already-closed event loop. 

I was seeing something like the following, caused by Python's GC during server shutdown:

  Exception ignored in: <function _Server.__del__ at 0x...>
  Traceback (most recent call last):
    File ".../grpc/aio/_server.py", line ..., in __del__
      self._loop.call_soon_threadsafe(...)
  RuntimeError: Event loop is closed

RuntimeError: cannot schedule new futures after shutdown

I will update the comment with more details.
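The fix can be sketched end-to-end with a stand-in server (FakeServer is illustrative; the real grpc.aio server exposes a similar wait_for_termination()/stop(grace) shape): stopping the server explicitly inside the CancelledError handler means cleanup is never left to __del__ running against an already-closed loop.

```python
import asyncio

class FakeServer:
    # Illustrative stand-in for grpc.aio's server object.
    def __init__(self) -> None:
        self.stopped = False

    async def wait_for_termination(self) -> None:
        await asyncio.Event().wait()  # blocks until cancelled

    async def stop(self, grace: float) -> None:
        self.stopped = True

async def serve(server: FakeServer) -> None:
    try:
        await server.wait_for_termination()
    except asyncio.CancelledError:
        # SIGTERM received -- aiorun cancels all tasks. Stop the
        # server explicitly so its __del__ doesn't try to schedule
        # work on the already-closed event loop.
        await server.stop(grace=0.5)

async def main() -> FakeServer:
    server = FakeServer()
    task = asyncio.create_task(serve(server))
    await asyncio.sleep(0)
    task.cancel()  # simulates shutdown-driven cancellation
    await task
    return server

server = asyncio.run(main())
```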

return

except BaseException as e:
_LOGGER.critical("Reduce Error", exc_info=True)
Member

do we need both critical logs? The one with err_msg seems to contain more info; you might want to prefix it with "Reduce Error: "

Member

in packages/pynumaflow/pynumaflow/accumulator/servicer/async_servicer.py I see a bigger diff with calls to .stop() while this file doesn't seem to have it. Is that ok? :)

Contributor Author

The two take different approaches.
The reducer handles exceptions directly: the servicer itself iterates requests, manages tasks, and yields responses. Errors in __invoke_reduce propagate as exceptions up to the servicer's ReduceFn, where _shutdown_event.set() is called.

In accumulator, errors from the task manager flow through the queue to the servicer, which then triggers shutdown. So AccumulateFn has 4 shutdown points: consumer loop error, consumer CancelledError, producer await error, producer CancelledError. Meanwhile the task manager also has its own error handling that puts errors on the result queue (but does not set _shutdown_event).

This approach felt easier for accumulator.
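The reducer-style flow described above — a UDF error sets a shutdown event, and a separate watcher task reacts by stopping the server — can be sketched like this (all names are illustrative, not the actual pynumaflow code):

```python
import asyncio

shutdown_event = asyncio.Event()
log = []

async def reduce_fn() -> None:
    # Stands in for the servicer's ReduceFn: a UDF error is logged
    # and signalled via the shutdown event rather than crashing.
    try:
        raise RuntimeError("UDF error")
    except RuntimeError:
        log.append("error logged")
        shutdown_event.set()

async def watch_for_shutdown() -> None:
    await shutdown_event.wait()
    # The real code would call `await server.stop(grace)` here.
    log.append("server stopped")

async def main() -> None:
    watcher = asyncio.create_task(watch_for_shutdown())
    await reduce_fn()
    await watcher

asyncio.run(main())
```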

@vigith vigith left a comment (Member)

please update with comments

@BulkBeing BulkBeing merged commit 7a908f2 into main Mar 18, 2026
10 of 11 checks passed
@BulkBeing BulkBeing deleted the graceful-shutdown branch March 18, 2026 04:52
