[SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason#55994

Open
sunchao wants to merge 2 commits into apache:master from sunchao:dev/chao/codex/heartbeat-timeout-loss-reason-oss

sunchao commented May 19, 2026

What changes were proposed in this pull request?

This PR preserves the original loss reason when Spark replaces an executor after a heartbeat timeout.

Current flow:

  1. HeartbeatReceiver detects that an executor has stopped heartbeating.
  2. Spark creates ExecutorProcessLost("Executor heartbeat timed out ...").
  3. Spark requests executor replacement.
  4. The backend may later report the removal as generic ExecutorKilled.

Step 4 drops the more useful heartbeat-timeout diagnosis.

This PR keeps that timeout reason through the replacement flow, with two safeguards:

  • use it only when the backend reports generic ExecutorKilled,
  • do not override a more specific backend reason such as ExecutorExited.

It also clears the pending preserved reason if the kill request is rejected or fails.
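
The safeguards above can be sketched as a small state holder. This is only an illustration of the described behavior, not the code in this PR: `LossReason` and its cases are simplified stand-ins for Spark's executor loss reasons, and `PendingTimeoutReasons` with its `record`, `clear`, and `resolve` methods is a hypothetical helper invented for this example. The sketch is also not thread-safe.

```scala
import scala.collection.mutable

// Simplified stand-ins for Spark's executor loss reasons (assumption: the
// real classes carry more fields than shown here).
sealed trait LossReason
case object ExecutorKilled extends LossReason
final case class ExecutorExited(exitCode: Int, message: String) extends LossReason
final case class ExecutorProcessLost(message: String) extends LossReason

// Hypothetical helper: remembers the heartbeat-timeout reason while the
// executor replacement is in flight.
final class PendingTimeoutReasons {
  private val pending = mutable.Map.empty[String, LossReason]

  // Called when a heartbeat timeout triggers a kill-and-replace request.
  def record(executorId: String, reason: LossReason): Unit = {
    pending(executorId) = reason
  }

  // Called when the kill request is rejected or fails, so a stale reason
  // cannot be attached to a later, unrelated removal.
  def clear(executorId: String): Unit = {
    pending -= executorId
  }

  // Called when the backend reports the removal. The pending entry is
  // always consumed; it is used only if the backend reason is generic.
  def resolve(executorId: String, reported: LossReason): LossReason = {
    val preserved = pending.remove(executorId)
    reported match {
      case ExecutorKilled => preserved.getOrElse(reported)
      case specific       => specific
    }
  }
}
```

In this sketch, `resolve` removes the pending entry even when the backend reason wins, so each preserved reason is applied at most once per executor.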

Why are the changes needed?

Spark already knows that the executor was replaced because of a heartbeat timeout, but that information can be lost before the scheduler records the final executor loss reason.

Keeping the original reason makes executor-loss reporting more accurate and avoids collapsing a timeout-driven replacement into a generic kill.

This fixes SPARK-56952.

Does this PR introduce any user-facing change?

Yes.

Executor loss reporting is more specific for heartbeat-timeout removals. Cases that previously appeared as generic ExecutorKilled can now retain:

ExecutorProcessLost("Executor heartbeat timed out ...")

If the backend provides a concrete loss reason, Spark still keeps that backend reason instead.

How was this patch tested?

Unit tests cover:

  • preserving the heartbeat-timeout reason when the backend reports ExecutorKilled,
  • preserving a concrete backend-provided reason instead of overriding it,
  • clearing the pending timeout reason when executor kill is rejected.

The full test suite has not yet been run locally against this OSS port.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex

(cherry picked from commit 81dae0f3fdedb15c232adc34ccdd7bbd468d18d2)
sunchao changed the title from "[SPARK-56952] Preserve heartbeat timeout executor loss reason" to "[SPARK-56952][CORE] Preserve heartbeat timeout executor loss reason" on May 19, 2026