Propagate CPU errors to events#3742
Conversation
| }); | ||
| } | ||
|
|
||
| void Scheduler::set_error(Stream s, std::shared_ptr<std::string> error) { |
There was a problem hiding this comment.
I think this needs to remember stream errors, not only poison events that are pending at this exact moment.
One real interleaving for mx.eval(mx.load(..., stream=mx.cpu)) is:
Load::eval_cpuenqueues the future join/error task with plainscheduler::enqueue.- The CPU stream worker runs that task quickly; the IO future is already failed, so it calls
scheduler::set_error(s). - At this point the synchronizer/completion event has not necessarily been inserted into
events_[s.index]yet. That insertion happens later wheneval_implreachesEvent::signal(s), which callsenqueue_event. set_error()sees an empty list and drops the error.- The later synchronizer event is inserted and signaled cleanly, so the eval can complete without throwing.
| } | ||
| } | ||
|
|
||
| Error& error(); |
There was a problem hiding this comment.
Exposing Error& error() makes the new error state potential race subject
|
This PR adds errors to If a CPU load fails, the event may have both A concrete fast-failure interleaving is: the event is inserted, I think either For context, these issues would prevent me from reliably landing ml-explore/mlx-swift#427, which is why I opened my original MLX PR. That Swift PR depends on CPU lazy-load read failures propagating deterministically to |
| // Poison all pending events if there was an error. | ||
| if (err) { | ||
| for (auto& event : list) { | ||
| event.set_error(err); |
There was a problem hiding this comment.
events_[s.index] contains both events this stream is waiting on and events this stream will signal, so poisoning every remaining entry can propagate an error backwards into unrelated producer events.
For example, if stream B waits on producer events A and C, and A is poisoned, completion of A's wait can call set_error(A_error) on C. Since copies of Event share the same implementation, C then becomes poisoned for all of its consumers even though C itself succeeded.
@zcbenz you were right CPU exceptions are hard ...
|
Thanks a lot for reviewing this! I updated the PR with a different strategy: the error happened in On the race condition of On |
| e->second.signal(s); | ||
| } | ||
| if (s.device == Device::cpu) { | ||
| scheduler::finalize(s); |
There was a problem hiding this comment.
I think clearing the per-stream error here is still too early for two async evaluations queued on the same CPU stream.
For example:
async_eval(bad_load)queues the failed-load join, A’s event signal, and this finalize task.- Before the read completes,
async_eval(add(bad_load, 1, stream=s))queues evaluation B on the same stream. - Since B’s input event belongs to the same stream,
eval_impldoes not enqueuewait_event, so B has no task that imports A’s event error. - A’s join stores
errors_[s]; A’s signal poisons A’s event; this finalize eraseserrors_[s]. - B then executes, and B’s signal sees a clean stream. Waiting on B can therefore succeed even though its input load failed.
This matters for my lazy safetensors loading in ml-explore/mlx-swift#427 as an upper layer can queue dependent work or asyncEval on the same CPU stream before the lazy read finishes.
A deterministic regression test could use a blocking failing reader, queue async_eval(bad), then queue async_eval(dependent) on the same stream, release the reader, and verify that waiting on dependent throws.
There was a problem hiding this comment.
As you can see in:
Lines 254 to 261 in a871934
B's input event will wait when it is not signaled, so while the stream's error would have been cleared when the waiting happens, the event itself still carries the error and would poison the stream again.
When the input event has already been signaled, detach_event would be called and the carried error would throw.
aleroot
left a comment
There was a problem hiding this comment.
Thank you for this work, once released I will definitely make use of it in my apps.
This PR implements exception handling for errors happened in
eval_cpu. Similar to #3523, the cpu scheduler would poison all pending events in the stream whenever an error happened, and an exception would throw when the poisoned event is synchronized.Most of this PR is doing refactoring:
metal::EventImplto the publicEventclass.Schedulerto make it capable of setting errors in events.Schedulermethods to signal/wait events.Note that most of the errors happened in
eval_cpuwould be fatal and not recoverable, so this PR does not catch all errors, instead we have to catch the expected errors and pass to the scheduler explicitly, this PR handles the IO error inLoad::eval_cpuas example.