Skip to content

Propagate CPU load task exceptions#3734

Open
aleroot wants to merge 1 commit into
ml-explore:mainfrom
aleroot:propagate-load-read-errors
Open

Propagate CPU load task exceptions#3734
aleroot wants to merge 1 commit into
ml-explore:mainfrom
aleroot:propagate-load-read-errors

Conversation

@aleroot

@aleroot aleroot commented Jun 21, 2026

Copy link
Copy Markdown

Proposed changes

Propagate exceptions from CPU scheduler tasks, including lazy-load read failures.

CPU lazy loading dispatches file reads through an IO future, but the scheduler task previously used future::wait(), which does not rethrow exceptions from the read task. This could silently lose read failures during lazy evaluation. This change uses future::get() and adds per-stream scheduler exception propagation so CPU task failures are rethrown on synchronize(stream).

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@zcbenz zcbenz left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagating CPU exceptions is hard to implement and this change is too wrong and I probably need to spend an hour explaining it and I don't think it would worth the time as AI is quite bad at handling this.

In short the cpu command encoder should propagate the error to the event, like what #3523 does.

@aleroot aleroot force-pushed the propagate-load-read-errors branch from 6d7e292 to cdb1df4 Compare June 21, 2026 10:33
@aleroot

aleroot commented Jun 21, 2026

Copy link
Copy Markdown
Author

Propagating CPU exceptions is hard to implement and this change is too wrong and I probably need to spend an hour explaining it and I don't think it would worth the time as AI is quite bad at handling this.

In short the cpu command encoder should propagate the error to the event, like what #3523 does.

@zcbenz thank you for the insulting comment, at the end this is the real thrill of trying to contribute to some open source projects.

See if the last changes are less wrong.
In any case feel free to scrap this PR and implement this yourself, the fact that remains is that not propagating those exceptions is simply simplistic and too wrong as well.

Route CPU read and task failures through eval events so waiters observe errors consistently instead of losing exceptions in scheduler tasks.

Skip dependent CPU work once the active event is poisoned, while keeping event signals and fence updates on an unchecked dispatch path so cross-stream waiters are still released.

Add sync, async, cross-stream, same-stream, and pending-dependent load regressions for the propagated error path.
@aleroot aleroot force-pushed the propagate-load-read-errors branch from cdb1df4 to 0ce0380 Compare June 21, 2026 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants