[SPARK-55661][CORE] Ensure TaskRunner.run() sends StatusUpdate on setup failures #54522
Open
AMRUTH-ASHOK wants to merge 2 commits into apache:master
…up failure Move TaskRunner.run() setup code (classloader isolation, thread naming, serializer creation) inside the existing try/catch/finally block so that exceptions during setup are caught and reported to the driver via StatusUpdate. Previously, setup code ran outside the try block, causing silent failures that leaked GPU/CPU resources on the driver. The fix changes 'val isolatedSession' and 'val ser' to 'var' declarations before the try block (with safe defaults), and adds a setup-failure branch in the catch-all handler that sends StatusUpdate(FAILED) or StatusUpdate(KILLED) when the serializer was never initialized (ser == null). Closes: https://issues.apache.org/jira/browse/SPARK-55661
What changes were proposed in this pull request?
Move `TaskRunner.run()` setup code (classloader isolation, thread naming, serializer creation) inside the existing `try`/`catch`/`finally` block so that exceptions during setup are caught and reported to the driver via `StatusUpdate`.
- `val isolatedSession` to `var isolatedSession: IsolatedSessionState = defaultSessionState` (declared before `try` with a safe default so `finally` can access it for cleanup)
- `val ser` to `var ser: SerializerInstance = null` (declared before `try` so `catch` can check whether setup completed)
- Setup code moved inside the `try` block
- `case t: Throwable` handler: when `ser == null` (serializer was never created), sends `StatusUpdate(FAILED)` or `StatusUpdate(KILLED)` so the driver releases resources

Closes: https://issues.apache.org/jira/browse/SPARK-55661
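
The control flow described above can be sketched in isolation. This is a simplified, self-contained model and not the actual `Executor.scala` code: `Backend`, `Serializer`, the `TaskState` objects, and the `newSerializer` and `killed` parameters are hypothetical stand-ins chosen for illustration, and the real handler also builds and serializes a failure reason.

```scala
import scala.collection.mutable

// Simplified model of the pattern; all names here are illustrative stand-ins,
// not Spark's real classes.
object SetupFailureSketch {
  sealed trait TaskState
  case object FINISHED extends TaskState
  case object FAILED extends TaskState
  case object KILLED extends TaskState

  // Stand-in for SerializerInstance.
  final class Serializer

  // Stand-in for the executor backend; records every status update it receives.
  final class Backend {
    val updates = mutable.ArrayBuffer.empty[(Long, TaskState)]
    def statusUpdate(taskId: Long, state: TaskState): Unit =
      updates.append((taskId, state))
  }

  final class TaskRunner(
      taskId: Long,
      backend: Backend,
      runningTasks: mutable.Map[Long, TaskRunner],
      newSerializer: () => Serializer,
      killed: Boolean = false) {

    def run(): Unit = {
      // Declared before `try` so the handler can tell whether setup completed.
      var ser: Serializer = null
      try {
        ser = newSerializer() // setup step that may throw
        // ... the task body would run here ...
        backend.statusUpdate(taskId, FINISHED)
      } catch {
        case _: Throwable =>
          if (ser == null) {
            // Setup never finished: still report a terminal state to the driver.
            backend.statusUpdate(taskId, if (killed) KILLED else FAILED)
          } else {
            // Regular failure path (the real code serializes a reason with `ser`).
            backend.statusUpdate(taskId, FAILED)
          }
      } finally {
        // Runs on every path, so the task is always deregistered.
        runningTasks.remove(taskId)
      }
    }
  }
}
```

In this sketch, a runner whose `newSerializer` throws still produces exactly one terminal status update and is removed from `runningTasks`, which is the property the PR aims for.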
Why are the changes needed?
Previously, setup code (lines 806-833) ran outside the `try` block. If any setup line threw an exception (e.g., `InterruptedException` from a concurrent AQE stage cancellation), execution left `run()` entirely: no `StatusUpdate` was sent to the driver, `runningTasks` was never cleaned up, and allocated resources (GPU/CPU) were leaked on the driver side.
Does this PR introduce any user-facing change?
No. This is an internal bug fix.
How was this patch tested?
Two new unit tests added to `ExecutorSuite`:
- `SPARK-55661: TaskRunner.run() setup failure should send StatusUpdate to prevent driver resource leak`: Mocks `env.closureSerializer.newInstance()` to throw during setup, verifies that `StatusUpdate(FAILED)` is sent with the correct `ExceptionFailure` reason, and that `runningTasks` is cleaned up.
- `SPARK-55661: TaskRunner.run() setup failure on killed task should send StatusUpdate(KILLED) to prevent driver resource leak`: Same as above, but pre-sets a `killMark` (simulating an AQE cancellation arriving before `run()` starts).

Was this patch authored or co-authored using generative AI tooling?
Yes, Generated-by: Cursor 2.5.17