[runtime] Handle notifyCheckpointAborted to stop leaking checkpoint entries by weiqingy · Pull Request #667 · apache/flink-agents

weiqingy · 2026-05-13T05:45:05Z

Closes #665.

What this PR does

DurableExecutionManager.checkpointIdToSeqNums leaked entries on aborted checkpoints. Flink calls notifyCheckpointAborted(...) when a checkpoint is aborted (timeout, alignment failure, backend pressure), but the existing code handled only the complete path, so under sustained abort pressure the map grew without bound.

Production changes

DurableExecutionManager.notifyCheckpointAborted(long) (new, package-private): removes the entry from checkpointIdToSeqNums. It does not call pruneState, because durable writes for an aborted checkpoint were never committed, so the prior committed checkpoint's recovery state is still load-bearing and must not be pruned. Guarded by actionStateStore != null, mirroring the symmetric guard from #645 ([Bug][runtime] Fix memory leak in DurableExecutionManager.checkpointIdToSeqNums).
ActionExecutionOperator.notifyCheckpointAborted(long) (new @Override): thin delegate to the manager, followed by super.notifyCheckpointAborted(...). Mirrors the existing notifyCheckpointComplete override exactly.
Javadoc invariant statement on snapshotLastCompletedSequenceNumbers and notifyCheckpointComplete: strengthened to name both release paths (complete or abort). The actionStateStore != null guard now lives on three methods; the Javadoc makes the three-way symmetry explicit and cross-links #645 ([Bug][runtime] Fix memory leak in DurableExecutionManager.checkpointIdToSeqNums) and #665 ([Bug][runtime] DurableExecutionManager leaks checkpointIdToSeqNums entries on aborted checkpoints).
@VisibleForTesting getCheckpointIdToSeqNums() accessor: mirrors the getActionStateStore() precedent. (Same addition as in #666, [runtime] Lock null-store symmetry invariant in DurableExecutionManager; whichever PR lands second drops the duplicate on rebase.)
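
For readers without the diff open, the bookkeeping described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the ActionStateStore interface shape, the Map<Object, Long> value type, and the trackedCheckpoints() helper are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the durable store; the real interface may differ. */
interface ActionStateStore {
    void pruneState(Object key, long sequenceNum);
}

/** Simplified sketch of the bookkeeping, not the real DurableExecutionManager. */
class DurableExecutionManagerSketch {
    // null means durable execution is disabled (the "no-store" mode)
    private final ActionStateStore actionStateStore;
    // assumed shape: checkpoint id -> (key -> last completed sequence number)
    private final Map<Long, Map<Object, Long>> checkpointIdToSeqNums = new HashMap<>();

    DurableExecutionManagerSketch(ActionStateStore actionStateStore) {
        this.actionStateStore = actionStateStore;
    }

    /** Records the per-key sequence numbers captured for a checkpoint. */
    void snapshotLastCompletedSequenceNumbers(long checkpointId, Map<Object, Long> seqNums) {
        if (actionStateStore != null) {
            checkpointIdToSeqNums.put(checkpointId, new HashMap<>(seqNums));
        }
    }

    /** Complete path: release the entry and prune the state it covered. */
    void notifyCheckpointComplete(long checkpointId) {
        if (actionStateStore != null) {
            Map<Object, Long> seqNums = checkpointIdToSeqNums.remove(checkpointId);
            if (seqNums != null) {
                seqNums.forEach(actionStateStore::pruneState);
            }
        }
    }

    /** Abort path: release the entry only; deliberately no pruneState call. */
    void notifyCheckpointAborted(long checkpointId) {
        if (actionStateStore != null) {
            checkpointIdToSeqNums.remove(checkpointId);
        }
    }

    /** Illustration-only helper: number of checkpoints still tracked. */
    int trackedCheckpoints() {
        return checkpointIdToSeqNums.size();
    }
}
```

The same actionStateStore != null guard appears on all three methods, so nothing is ever recorded in no-store mode and both release paths are no-ops there.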

Tests (DEM-level, three new)

notifyAbortedRemovesEntryWithoutPruning — entry released, durable state untouched. Uses a real InMemoryActionStateStore (not a mock) so wrongful pruning would be observable.
completedAndAbortedInterleavedKeepsInFlightEntries — three in-flight checkpoints; one completes (state pruned), one aborts (state preserved), one remains.
noStoreModeNotifyCheckpointAbortedIsNoOp — symmetric null-store no-op coverage.
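
The interleaved scenario can be walked through in plain Java as a rough sketch. It uses no JUnit and none of the project's classes: InMemoryStoreSketch is a hypothetical stand-in for InMemoryActionStateStore, and the "prune everything up to and including the sequence number" semantics are an assumption made only so that a wrongful prune would change the result.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory store: per key, sequence number -> recorded state. */
class InMemoryStoreSketch {
    final Map<Object, TreeMap<Long, String>> states = new HashMap<>();

    void put(Object key, long seqNum, String state) {
        states.computeIfAbsent(key, k -> new TreeMap<>()).put(seqNum, state);
    }

    /** Assumed semantics: drop every entry for key with sequence number <= seqNum. */
    void pruneState(Object key, long seqNum) {
        TreeMap<Long, String> perKey = states.get(key);
        if (perKey != null) {
            perKey.headMap(seqNum, true).clear();
        }
    }

    int size(Object key) {
        TreeMap<Long, String> perKey = states.get(key);
        return perKey == null ? 0 : perKey.size();
    }
}

class InterleavedScenarioSketch {
    /** Returns {tracked checkpoint count, remaining states for key "k"}. */
    static int[] run() {
        InMemoryStoreSketch store = new InMemoryStoreSketch();
        Map<Long, Map<Object, Long>> checkpointIdToSeqNums = new HashMap<>();

        // Three in-flight checkpoints, each snapshotting a higher sequence number.
        for (long cp = 1; cp <= 3; cp++) {
            store.put("k", cp, "state-" + cp);
            Map<Object, Long> seqNums = new HashMap<>();
            seqNums.put("k", cp);
            checkpointIdToSeqNums.put(cp, seqNums);
        }

        // Checkpoint 1 completes: release its entry and prune what it covered.
        checkpointIdToSeqNums.remove(1L).forEach(store::pruneState);
        // Checkpoint 2 aborts: release its entry, leave the store alone.
        checkpointIdToSeqNums.remove(2L);
        // Checkpoint 3 stays in flight and keeps its entry.

        return new int[] {checkpointIdToSeqNums.size(), store.size("k")};
    }
}
```

Under these assumptions run() leaves one tracked checkpoint (the in-flight one) and two stored states for "k". If the abort step also pruned, only one state would remain, which is the kind of difference a real store makes observable and a mock would hide.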

Sanity-mutation verified locally:

Emptying the new method's body → 2 of 3 new tests fail.
Adding a wrongful actionStateStore.pruneState(...) call → 2 of 3 new tests fail (state was incorrectly pruned).

Operator-level harness test deferred to #646 — the new operator override is a one-line delegate; the logic is in the manager.
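
As a rough picture of the delegate-then-super shape, here is a minimal sketch. BaseOperatorSketch and ManagerSketch are hypothetical stand-ins, not the Flink or flink-agents classes, and the calls list exists only to make the call order visible. Flink's checkpoint listener methods declare throws Exception; that is omitted here for brevity.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Flink operator base class. */
class BaseOperatorSketch {
    final List<String> calls = new ArrayList<>();

    public void notifyCheckpointAborted(long checkpointId) {
        calls.add("super.aborted:" + checkpointId);
    }
}

/** Hypothetical stand-in for the manager; it only records that it was called. */
class ManagerSketch {
    private final List<String> calls;

    ManagerSketch(List<String> calls) {
        this.calls = calls;
    }

    void notifyCheckpointAborted(long checkpointId) {
        calls.add("manager.aborted:" + checkpointId);
    }
}

class OperatorSketch extends BaseOperatorSketch {
    private final ManagerSketch manager = new ManagerSketch(calls);

    @Override
    public void notifyCheckpointAborted(long checkpointId) {
        manager.notifyCheckpointAborted(checkpointId); // release the manager's entry first
        super.notifyCheckpointAborted(checkpointId);   // then let the base class run
    }
}
```

Because the override holds no logic of its own, the manager-level tests cover the behavior that matters.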

Test plan

mvn test -Dtest=DurableExecutionManagerTest -pl runtime — 5/5 pass (2 existing + 3 new)
mvn test -Dtest=ActionExecutionOperatorTest -pl runtime — 28/28 pass
mvn test -pl runtime — 307/307 pass (no regressions)
./tools/lint.sh -c — 0 violations
./tools/check-license.sh — clean (no new tracked files)
Sanity mutation: empty new method body → expected tests fail
Sanity mutation: wrongful pruneState call on abort → expected tests fail

Documentation

doc-needed
doc-not-needed
doc-included

wenjin272 · 2026-05-14T12:40:17Z

Hi, @weiqingy. It appears that after the merge of #659, both #666 and #667 have some conflicts.

weiqingy · 2026-05-15T06:16:04Z

Hi @wenjin272, the PR has been updated to resolve the conflicts.

weiqingy · 2026-05-15T06:33:46Z

The one CI failure (it-python [java-17] [python-3.12] [flink-2.1]) is the known test_react_agent_on_local_runner LLM flake against Ollama qwen3:1.7b, not caused by this PR:

FAILED flink_agents/e2e_tests/e2e_tests_integration/react_agent_test.py::test_react_agent_on_local_runner
  - assert 432596736 == 1386528

The test expects 4444 × 312 = 1386528, but the LLM made an extra unnecessary multiply(1386528, 312) call and returned 432596736. The test source has a comment right next to the assertion: "This may be caused by the LLM response does not match the output schema, you can rerun this case."
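
The two numbers in the assertion are consistent with one redundant tool call, which can be checked with a few lines of Java. The multiply helper below is a hypothetical stand-in for the agent's tool, used only to reproduce the arithmetic.

```java
class MultiplyTraceSketch {
    /** Hypothetical stand-in for the agent's multiply tool. */
    static long multiply(long a, long b) {
        return a * b;
    }

    /** The value the test expects: 4444 x 312. */
    static long expected() {
        return multiply(4444L, 312L);
    }

    /** The value after one extra, unnecessary multiply by 312. */
    static long withExtraCall() {
        return multiply(expected(), 312L);
    }
}
```

expected() evaluates to 1386528 and withExtraCall() to 432596736, matching the two sides of the failing assertion.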

This same failure (same exact numbers, 432596736 == 1386528) is currently failing on main at b38ae21, the commit this PR is rebased onto, and on several other recent main-branch runs. The failure runs through the Python local_runner, which logs "Local runner does not support durable execution; recovery is not available.", so the Java DurableExecutionManager / ActionExecutionOperator paths changed by this PR are never exercised.

Will re-run CI.

wenjin272 · 2026-05-15T06:41:25Z

I believe we need to polish the stability and observability of CI in version 0.4. If you encounter any unstable cases, please contact me to rerun them. I now have the permission to rerun failed CI jobs.

@VisibleForTesting

…ntries

Issue apache#665.

When Flink aborts a checkpoint, it calls notifyCheckpointAborted instead of notifyCheckpointComplete. The DurableExecutionManager only handled the complete path, so the per-checkpoint sequence-number entry recorded by snapshotLastCompletedSequenceNumbers was never released for aborted checkpoints. Under sustained abort pressure (timeouts, alignment failures, backend pressure), checkpointIdToSeqNums grew unboundedly.

Changes:

- Add DurableExecutionManager.notifyCheckpointAborted(long): removes the entry from checkpointIdToSeqNums, guarded by the same actionStateStore != null check as notifyCheckpointComplete. Does NOT prune durable action state: the aborted checkpoint's writes were never committed, so the prior committed checkpoint's recovery state is still load-bearing and must not be pruned.
- Add ActionExecutionOperator.notifyCheckpointAborted(long): thin override that delegates to the manager and then calls super, mirroring the existing notifyCheckpointComplete override.
- Extend the symmetric-guard invariant javadoc on snapshotLastCompletedSequenceNumbers and notifyCheckpointComplete to name both release paths (complete OR abort). The actionStateStore != null guard now lives on three methods; the cross-linked javadoc makes that explicit and cites issues apache#645 and apache#665.
- Three new DurableExecutionManagerTest cases (using the existing getCheckpointIdToSeqNums() @VisibleForTesting accessor introduced in apache#659):
  * notifyAbortedRemovesEntryWithoutPruning: entry released, durable state untouched (verified against a real InMemoryActionStateStore so wrongful pruning would be observable).
  * completedAndAbortedInterleavedKeepsInFlightEntries: three in-flight checkpoints, one completes (state pruned), one aborts (state preserved), one remains.
  * noStoreModeNotifyCheckpointAbortedIsNoOp: symmetric null-store no-op coverage matching the existing notifyCheckpointComplete null-store case.

wenjin272

Hi, @weiqingy, LGTM. I left a comment that may need to be confirmed.

wenjin272 · 2026-05-15T08:20:12Z

        }
    }

    void maybePruneState(Object key, long sequenceNum) throws Exception {


It appears that only tests are calling this method. We may need to verify whether this is an unnecessary interface or if the call was accidentally removed during a previous conflict resolution.

weiqingy mentioned this pull request May 13, 2026

[Bug][runtime] Fix memory leak in DurableExecutionManager.checkpointIdToSeqNums #645

weiqingy force-pushed the 665-impl branch from 06d81db to 05dd58c on May 15, 2026 06:13

weiqingy force-pushed the 665-impl branch from cc1b248 to b93199e on May 15, 2026 06:50

weiqingy wants to merge 1 commit into apache:main from weiqingy:665-impl
