[SPARK-55661][CORE] Ensure TaskRunner.run() sends StatusUpdate on setup failures #54522

Open
AMRUTH-ASHOK wants to merge 2 commits into apache:master from AMRUTH-ASHOK:SPARK-55661

@AMRUTH-ASHOK

What changes were proposed in this pull request?

Move TaskRunner.run() setup code (classloader isolation, thread naming, serializer creation) inside the existing try/catch/finally block so that exceptions during setup are caught and reported to the driver via StatusUpdate.

  • Changed val isolatedSession to var isolatedSession: IsolatedSessionState = defaultSessionState (declared before try with safe default so finally can access it for cleanup)
  • Changed val ser to var ser: SerializerInstance = null (declared before try so catch can check whether setup completed)
  • Moved all setup code inside the existing try block
  • Added a setup-failure branch in the catch-all case t: Throwable handler: when ser == null (serializer was never created), sends StatusUpdate(FAILED) or StatusUpdate(KILLED) so the driver releases resources
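
The shape of the change listed above can be sketched in isolation. This is a minimal sketch, not the actual Executor code: `SketchTaskRunner`, `SketchState`, `SketchStatusUpdate`, `SketchSerializer` and the constructor parameters are hypothetical stand-ins, and the real handler builds an ExceptionFailure or TaskKilled reason rather than a plain string. It only assumes what the description states: `ser` is declared as a null `var` before `try`, setup runs inside `try`, and `ser == null` in the handler means setup never finished.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration only; these are not Spark's classes.
object SketchState extends Enumeration {
  val FINISHED, FAILED, KILLED = Value
}

final case class SketchStatusUpdate(taskId: Long, state: SketchState.Value, reason: String)

trait SketchSerializer

final class SketchTaskRunner(
    taskId: Long,
    newSerializer: () => SketchSerializer,        // setup step that may throw
    killedBeforeRun: Boolean,                     // models a pre-set kill mark
    sent: mutable.ArrayBuffer[SketchStatusUpdate], // models messages to the driver
    runningTasks: mutable.Set[Long]) {

  def run(): Unit = {
    // Declared before `try` so the handler can tell whether setup completed.
    var ser: SketchSerializer = null
    try {
      ser = newSerializer() // setup now runs inside the try block
      // ... the task body would run here ...
      sent += SketchStatusUpdate(taskId, SketchState.FINISHED, "")
    } catch {
      case t: Throwable if ser == null =>
        // Setup-failure branch: the serializer was never created.
        val state = if (killedBeforeRun) SketchState.KILLED else SketchState.FAILED
        sent += SketchStatusUpdate(taskId, state, t.toString)
      case t: Throwable =>
        // Failure after setup completed (the pre-existing path, heavily simplified).
        sent += SketchStatusUpdate(taskId, SketchState.FAILED, t.toString)
    } finally {
      // Cleanup runs whether or not setup succeeded.
      runningTasks -= taskId
    }
  }
}
```

In this model a throwing `newSerializer` still produces exactly one status update and still removes the task from `runningTasks`, which is the property the PR is after.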

Closes: https://issues.apache.org/jira/browse/SPARK-55661

Why are the changes needed?

Previously, the setup code (lines 806-833) ran outside the try block. If any setup line threw an exception (e.g., an InterruptedException from a concurrent AQE stage cancellation), execution left run() without ever reaching the catch or finally blocks. As a result, no StatusUpdate was sent to the driver, runningTasks was never cleaned up, and allocated resources (GPU/CPU) were leaked on the driver side.
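
The failure mode can be reproduced with a few lines of plain Scala. This is a simplified model of the pre-fix control flow, not the Executor source: `SetupOutsideTryDemo`, `runBeforeFix`, and the string-based status updates are all hypothetical names chosen for illustration.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the pre-fix control flow; not Spark's code.
object SetupOutsideTryDemo {
  val statusUpdates = mutable.ArrayBuffer.empty[String] // models messages to the driver
  val runningTasks = mutable.Set.empty[Long]

  def runBeforeFix(taskId: Long, setup: () => Unit): Unit = {
    runningTasks += taskId
    setup() // runs outside `try`: a throw here skips both `catch` and `finally`
    try {
      statusUpdates += s"task $taskId FINISHED"
    } catch {
      case t: Throwable => statusUpdates += s"task $taskId FAILED: $t"
    } finally {
      runningTasks -= taskId
    }
  }
}
```

Calling `runBeforeFix` with a `setup` that throws propagates the exception to the caller, leaves `statusUpdates` empty, and leaves the task id in `runningTasks`, which mirrors the leak described above.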

Does this PR introduce any user-facing change?

No. This is an internal bug fix.

How was this patch tested?

Two new unit tests were added to ExecutorSuite:

  1. "SPARK-55661: TaskRunner.run() setup failure should send StatusUpdate to prevent driver resource leak": mocks env.closureSerializer.newInstance() to throw during setup, then verifies that StatusUpdate(FAILED) is sent with the correct ExceptionFailure reason and that runningTasks is cleaned up.

  2. "SPARK-55661: TaskRunner.run() setup failure on killed task should send StatusUpdate(KILLED) to prevent driver resource leak": same as above, but pre-sets a killMark (simulating an AQE cancellation arriving before run() starts).

Was this patch authored or co-authored using generative AI tooling?

Yes, Generated-by: Cursor 2.5.17

…up failure

Move TaskRunner.run() setup code (classloader isolation, thread naming,
serializer creation) inside the existing try/catch/finally block so that
exceptions during setup are caught and reported to the driver via
StatusUpdate. Previously, setup code ran outside the try block, causing
silent failures that leaked GPU/CPU resources on the driver.

The fix changes 'val isolatedSession' and 'val ser' to 'var' declarations
before the try block (with safe defaults), and adds a setup-failure branch
in the catch-all handler that sends StatusUpdate(FAILED) or
StatusUpdate(KILLED) when the serializer was never initialized (ser == null).

Closes: https://issues.apache.org/jira/browse/SPARK-55661