Propagate CPU errors to events by zcbenz · Pull Request #3742 · ml-explore/mlx

zcbenz · 2026-06-22T10:25:53Z

This PR implements exception handling for errors that happen in eval_cpu. Similar to #3523, the CPU scheduler poisons all pending events in the stream whenever an error happens, and an exception is thrown when a poisoned event is synchronized.
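
The poisoning idea can be illustrated with a small stand-alone model. This is only a sketch under assumptions: `ToyEvent` and its methods are hypothetical names, not the MLX `Event` API. It assumes that copies of an event share one state and that the error is carried as a shared string, as the `set_error` signature quoted later in this thread suggests.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for an event; NOT the MLX Event class.
class ToyEvent {
 public:
  ToyEvent() : state_(std::make_shared<State>()) {}

  // Poison the event: every copy sharing this state sees the error.
  void set_error(std::shared_ptr<std::string> error) {
    assert(error != nullptr);
    state_->error = std::move(error);
  }

  void signal() { state_->signaled = true; }

  // Synchronizing on a poisoned event throws instead of returning normally.
  void wait() const {
    if (state_->error) {
      throw std::runtime_error(*state_->error);
    }
    // A real implementation would block here until the event is signaled.
  }

 private:
  struct State {
    bool signaled = false;
    std::shared_ptr<std::string> error;
  };
  std::shared_ptr<State> state_;
};

// Returns true if waiting on a copy of a poisoned event throws.
bool poisoned_wait_throws() {
  ToyEvent e;
  ToyEvent copy = e;  // shares state with e
  e.set_error(std::make_shared<std::string>("load failed"));
  e.signal();
  try {
    copy.wait();
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```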

Most of this PR is doing refactoring:

Move the error handling from metal::EventImpl to the public Event class.
Add methods to Scheduler to make it capable of setting errors in events.
Refactor platform event implementations to use the new Scheduler methods to signal/wait events.

Note that most errors that happen in eval_cpu are fatal and not recoverable, so this PR does not catch all errors. Instead, expected errors have to be caught and passed to the scheduler explicitly. As an example, this PR handles the IO error in Load::eval_cpu.
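
A minimal sketch of "catch only the expected error and hand it over explicitly" might look as follows. All names here (`ToyIoError`, `ToyErrorSink`, `run_io_task`) are hypothetical and chosen for illustration; the sink merely stands in for whatever the scheduler exposes, and the actual code in Load::eval_cpu may be structured differently.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical "expected" error type for I/O failures.
struct ToyIoError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical error sink standing in for the scheduler; NOT an MLX API.
struct ToyErrorSink {
  std::shared_ptr<std::string> error;

  void set_error(std::shared_ptr<std::string> e) {
    assert(e != nullptr);
    if (!error) {
      error = std::move(e);  // keep the first error
    }
  }
};

// Runs a task and forwards only the expected error type to the sink.
// Any other exception is treated as fatal and keeps propagating.
void run_io_task(const std::function<void()>& task, ToyErrorSink& sink) {
  try {
    task();
  } catch (const ToyIoError& e) {
    sink.set_error(std::make_shared<std::string>(e.what()));
  }
}
```

Catching a narrow type rather than `std::exception` keeps unexpected failures loud, which matches the stated intent of not catching everything.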

aleroot · 2026-06-22T12:22:00Z

+  });
+}
+
+void Scheduler::set_error(Stream s, std::shared_ptr<std::string> error) {


I think this needs to remember stream errors, not only poison events that are pending at this exact moment.

One real interleaving for mx.eval(mx.load(..., stream=mx.cpu)) is:

Load::eval_cpu enqueues the future join/error task with plain scheduler::enqueue.

The CPU stream worker runs that task quickly; the IO future is already failed, so it calls scheduler::set_error(s).

At this point the synchronizer/completion event has not necessarily been inserted into events_[s.index] yet. That insertion happens later when eval_impl reaches Event::signal(s), which calls enqueue_event.

set_error() sees an empty list and drops the error.

The later synchronizer event is inserted and signaled cleanly, so the eval can complete without throwing.
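
The interleaving above can be replayed deterministically in a toy model. This is a sketch with hypothetical names (`ToyStream`, `PendingEvent`), not the MLX Scheduler; the `remember_errors` flag is an assumption added here to contrast "poison only what is pending now" with "remember the error for events enqueued later".

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using ErrorPtr = std::shared_ptr<std::string>;

struct PendingEvent {
  ErrorPtr error;
};

// Hypothetical per-stream bookkeeping; NOT the MLX Scheduler.
class ToyStream {
 public:
  explicit ToyStream(bool remember_errors) : remember_(remember_errors) {}

  // Poisons the currently pending events and optionally remembers the error.
  void set_error(ErrorPtr error) {
    assert(error != nullptr);
    for (auto& e : pending_) {
      e->error = error;
    }
    if (remember_) {
      stream_error_ = std::move(error);
    }
  }

  // Events enqueued after a remembered error are poisoned on insertion.
  void enqueue_event(std::shared_ptr<PendingEvent> e) {
    if (stream_error_) {
      e->error = stream_error_;
    }
    pending_.push_back(std::move(e));
  }

 private:
  bool remember_;
  ErrorPtr stream_error_;
  std::vector<std::shared_ptr<PendingEvent>> pending_;
};

// Replays the order described above: the error is reported first and the
// completion event is enqueued afterwards. Returns whether the event ended
// up poisoned.
bool error_reaches_late_event(bool remember_errors) {
  ToyStream s(remember_errors);
  s.set_error(std::make_shared<std::string>("read failed"));
  auto completion = std::make_shared<PendingEvent>();
  s.enqueue_event(completion);
  return completion->error != nullptr;
}
```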

aleroot · 2026-06-22T12:23:56Z

+    }
+  }
+
+  Error& error();


Exposing Error& error() makes the new error state subject to potential data races.
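
One way to avoid handing out a reference to shared mutable state is to copy the pointer under a lock. The sketch below is an assumption about how such an accessor could look; `ToyErrorSlot`, `store_error`, and `load_error` are hypothetical names in this example and do not describe the PR's actual implementation.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

using ErrorPtr = std::shared_ptr<std::string>;

// Hypothetical error slot; NOT the MLX Event internals.
class ToyErrorSlot {
 public:
  // Stores the error unless one is already set, so the first error wins.
  void store_error(ErrorPtr error) {
    assert(error != nullptr);
    std::lock_guard<std::mutex> lock(mu_);
    if (!error_) {
      error_ = std::move(error);
    }
  }

  // Returns a copy of the pointer taken under the lock. Callers never hold
  // a reference into the slot, so a concurrent store cannot invalidate it.
  ErrorPtr load_error() const {
    std::lock_guard<std::mutex> lock(mu_);
    return error_;
  }

 private:
  mutable std::mutex mu_;
  ErrorPtr error_;
};
```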

aleroot · 2026-06-22T12:29:46Z

This PR adds errors to Event, but array::is_available() can now silently discard them.

If a CPU load fails, the event may have both error != nullptr and is_signaled() == true by the time the caller reaches array::wait(). In that case is_available() takes this branch, detaches the event, marks the array available, and never calls Event::wait() / check_error().

A concrete fast-failure interleaving is: the event is inserted, set_error() poisons it, the stream later signals it, and all of that finishes before the main thread calls eval_impl(...).wait(). The final wait then sees a signaled event and swallows the error.

I think either array::is_available() must check/take the event error before detaching a signaled event, or Event::is_signaled() needs to surface poisoned events somehow.
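
To make the first option concrete, here is a toy contrast between detaching a signaled event blindly and checking its error first. `ToyArray` and `ToySignaledEvent` are hypothetical and only mimic the shape of the logic described above; they are not mlx::core::array or Event.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

using ErrorPtr = std::shared_ptr<std::string>;

struct ToySignaledEvent {
  bool signaled = false;
  ErrorPtr error;
};

// Builds an event that was poisoned and then signaled.
ToySignaledEvent make_failed_event(const std::string& message) {
  assert(!message.empty());
  ToySignaledEvent e;
  e.signaled = true;
  e.error = std::make_shared<std::string>(message);
  return e;
}

// Hypothetical array wrapper; NOT mlx::core::array.
class ToyArray {
 public:
  explicit ToyArray(ToySignaledEvent e)
      : event_(std::move(e)), attached_(true) {}

  // Problematic variant: a signaled event is detached without looking at
  // its error, so the failure silently disappears.
  bool is_available_unchecked() {
    if (attached_ && event_.signaled) {
      attached_ = false;
      return true;
    }
    return !attached_;
  }

  // Checked variant: the error is surfaced when the event is detached.
  bool is_available_checked() {
    if (attached_ && event_.signaled) {
      attached_ = false;
      if (event_.error) {
        throw std::runtime_error(*event_.error);
      }
      return true;
    }
    return !attached_;
  }

 private:
  ToySignaledEvent event_;
  bool attached_;
};
```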

For context, these issues would prevent me from reliably landing ml-explore/mlx-swift#427, which is why I opened my original MLX PR.

That Swift PR depends on CPU lazy-load read failures propagating deterministically to eval. If those errors can be dropped or swallowed, the progress API can work for the happy path but still cannot safely handle truncated or failed safetensors reads.

aleroot · 2026-06-22T14:50:05Z

+      // Poison all pending events if there was an error.
+      if (err) {
+        for (auto& event : list) {
+          event.set_error(err);


events_[s.index] contains both events this stream is waiting on and events this stream will signal, so poisoning every remaining entry can propagate an error backwards into unrelated producer events.

For example, if stream B waits on producer events A and C, and A is poisoned, completion of A's wait can call set_error(A_error) on C. Since copies of Event share the same implementation, C then becomes poisoned for all of its consumers even though C itself succeeded.
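
The backward propagation can be shown with a toy list that tags each entry with its role. The `Role` tag and both functions are assumptions made for this sketch; the diff quoted above does not show how (or whether) the real list distinguishes waited events from signaled ones.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

using ErrorPtr = std::shared_ptr<std::string>;

// Copies of a handle share one error cell, so poisoning one copy is
// visible through all of them.
struct ToyHandle {
  std::shared_ptr<ErrorPtr> error = std::make_shared<ErrorPtr>();
  bool poisoned() const { return *error != nullptr; }
};

enum class Role { Wait, Signal };

struct Entry {
  Role role;
  ToyHandle event;
};

// Poisons every remaining entry regardless of role. A waited producer
// event that succeeded becomes poisoned for all of its consumers.
void poison_all(std::vector<Entry>& list, const ErrorPtr& err) {
  assert(err != nullptr);
  for (auto& entry : list) {
    *entry.event.error = err;
  }
}

// Poisons only the events this stream will signal itself.
void poison_signals_only(std::vector<Entry>& list, const ErrorPtr& err) {
  assert(err != nullptr);
  for (auto& entry : list) {
    if (entry.role == Role::Signal) {
      *entry.event.error = err;
    }
  }
}
```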

@zcbenz you were right, CPU exceptions are hard ...

zcbenz · 2026-06-23T06:14:49Z

Thanks a lot for reviewing this!

I updated the PR with a different strategy: an error that happens in eval_cpu now persists in the scheduler per stream until the eval ends. All events signaled in the stream are poisoned by the stream's error, and any waited event that carries an error poisons the stream.

On the race condition of error(), I made the method private and added a thread-safe load_error() to replace it.

On array::is_available() swallowing the error, I made array::detach_event check the error before detaching.
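
The per-stream strategy can be summarized in a small model. This is a sketch under assumptions: `ToyCpuStream` and `ToyEv` are hypothetical names, the model is single-threaded, and waits complete immediately, none of which is true of the real scheduler.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

using ErrorPtr = std::shared_ptr<std::string>;

struct ToyEv {
  bool signaled = false;
  ErrorPtr error;
};

// Hypothetical per-stream error state; NOT the MLX Scheduler.
class ToyCpuStream {
 public:
  // Records an error raised by a task on this stream; the first one wins.
  void set_error(ErrorPtr e) {
    assert(e != nullptr);
    if (!error_) {
      error_ = std::move(e);
    }
  }

  // Signaling copies the stream's error (if any) into the event.
  void signal(ToyEv& ev) const {
    if (error_) {
      ev.error = error_;
    }
    ev.signaled = true;
  }

  // Waiting on a poisoned event imports its error into this stream.
  void wait(const ToyEv& ev) {
    if (ev.error && !error_) {
      error_ = ev.error;
    }
  }

  // Called when the eval ends: the stream forgets its error, while
  // already-poisoned events keep theirs.
  void finalize() { error_.reset(); }

  ErrorPtr error() const { return error_; }

 private:
  ErrorPtr error_;
};
```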

aleroot · 2026-06-23T10:38:55Z

      e->second.signal(s);
    }
+    if (s.device == Device::cpu) {
+      scheduler::finalize(s);


I think clearing the per-stream error here is still too early for two async evaluations queued on the same CPU stream.

For example:

async_eval(bad_load) queues the failed-load join, A’s event signal, and this finalize task.

Before the read completes, async_eval(add(bad_load, 1, stream=s)) queues evaluation B on the same stream.

Since B’s input event belongs to the same stream, eval_impl does not enqueue wait_event, so B has no task that imports A’s event error.

A’s join stores errors_[s]; A’s signal poisons A’s event; this finalize erases errors_[s].

B then executes, and B’s signal sees a clean stream. Waiting on B can therefore succeed even though its input load failed.

This matters for my lazy safetensors loading in ml-explore/mlx-swift#427 as an upper layer can queue dependent work or asyncEval on the same CPU stream before the lazy read finishes.

A deterministic regression test could use a blocking failing reader, queue async_eval(bad), then queue async_eval(dependent) on the same stream, release the reader, and verify that waiting on dependent throws.

zcbenz · 2026-06-23

As you can see in mlx/mlx/transforms.cpp, lines 254 to 261 in a871934:

    } else if (in.event().valid()) {
      if (in.event().is_signaled()) {
        in.detach_event();
      } else if (in.event().stream() != stream) {
        // Use event to wait across async eval
        in.event().wait(stream);
      }
    }

When B's input event is not yet signaled, it is waited on. Although the stream's error will have been cleared by the time the wait happens, the event itself still carries the error and poisons the stream again.

When the input event has already been signaled, detach_event is called and the carried error is thrown.
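
The branch quoted from transforms.cpp can be modeled to see which path an input takes. This is a toy with hypothetical names (`ToyInputEvent`, `handle_input`, `InputAction`), streams are plain integers, and the wait is assumed to complete immediately. It mirrors only the three outcomes visible in the quoted snippet and makes no claim about the rest of eval_impl.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

using ErrorPtr = std::shared_ptr<std::string>;

struct ToyInputEvent {
  int stream = 0;
  bool signaled = false;
  ErrorPtr error;
};

enum class InputAction { Detached, Waited, None };

// Mirrors the shape of the quoted branch for a single input event.
InputAction handle_input(
    const ToyInputEvent& ev, int stream, ErrorPtr& stream_error) {
  assert(stream >= 0);
  if (ev.signaled) {
    // Stands in for detach_event(): the carried error is checked first.
    if (ev.error) {
      throw std::runtime_error(*ev.error);
    }
    return InputAction::Detached;
  } else if (ev.stream != stream) {
    // Stands in for wait(stream): a poisoned event re-poisons the stream,
    // even if the stream's own error was cleared earlier.
    if (ev.error && !stream_error) {
      stream_error = ev.error;
    }
    return InputAction::Waited;
  }
  // Unsignaled event on the same stream: neither branch is taken.
  return InputAction::None;
}
```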

aleroot

Thank you for this work, once released I will definitely make use of it in my apps.

zcbenz mentioned this pull request Jun 22, 2026

Propagate CPU load task exceptions #3734

Closed

4 tasks

aleroot reviewed Jun 22, 2026


Add load tests

713f2b4

zcbenz force-pushed the cpu-error branch from 7992768 to 84a9b7a Compare June 23, 2026 02:35

Propagate CPU errors to events

b71f0ec

zcbenz force-pushed the cpu-error branch from 84a9b7a to b71f0ec Compare June 23, 2026 03:55

aleroot reviewed Jun 23, 2026


aleroot approved these changes Jun 23, 2026

zcbenz wants to merge 2 commits into ml-explore:main from zcbenz:cpu-error


2 participants
