Clarify worker function load failures #195

Open
JacobZuliani wants to merge 1 commit into main from agent/15/clarify-function-load-errors

@JacobZuliani

Contributor

Problem

When a worker failed while deserializing `function_pkl`, node assignment raised a 500 from the node service. The client then reported only `Failed to assign <node>: 500`, which hid the real cause, for example `mpq.__new__() missing 1 required positional argument: 'p'` raised during `cloudpickle.loads()`.

Solution

  • Catch worker load_function() errors during node assignment.
  • Roll the node back to READY and reset workers before returning.
  • Return a structured 422 function_load_failed response containing the original traceback.
  • Teach the client to surface that message as FunctionLoadError instead of a generic assignment failure.
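
The node-side steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `Worker`, `Node`, `assign_job`, the `ASSIGNING`/`RUNNING` state names, and the response keys `error` and `traceback` are all hypothetical names chosen for the example. Only `load_function()`, `READY`, the 422 status, and the `function_load_failed` identifier come from the description.

```python
import traceback


class Worker:
    """Hypothetical stand-in for a worker handle on the node."""

    def __init__(self, load_function):
        # load_function(function_pkl) deserializes the user function
        # and raises if deserialization fails.
        self.load_function = load_function
        self.was_reset = False

    def reset(self):
        self.was_reset = True


class Node:
    """Hypothetical stand-in for the node's assignment state."""

    def __init__(self, workers):
        self.workers = workers
        self.state = "READY"


def assign_job(node, function_pkl):
    """Return (status_code, body) rather than letting a load error become a 500."""
    node.state = "ASSIGNING"  # assumed intermediate state name
    try:
        for worker in node.workers:
            worker.load_function(function_pkl)
    except Exception:
        # Capture the worker-side traceback before any cleanup runs.
        tb = traceback.format_exc()
        # Roll back: reset every worker and leave the node ready for later jobs.
        for worker in node.workers:
            worker.reset()
        node.state = "READY"
        # Assumed response shape; the key names are illustrative.
        return 422, {"error": "function_load_failed", "traceback": tb}
    node.state = "RUNNING"  # assumed state name
    return 200, {"status": "assigned"}
```

The point of returning a response from inside the `except` branch is that the error never reaches generic 500 handling, so the node is not treated as broken.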

How This Fixes It

The failure is now treated as a user-function load/deserialization problem that occurs before any inputs run, not as a node infrastructure failure. Users get the actionable traceback from the worker, and the node is left READY for later jobs instead of being rebooted by the middleware's 500 handling.
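
On the client side, the mapping from response to exception might look like the sketch below. `FunctionLoadError` is the exception name from this PR; the helper `raise_for_assignment`, its parameters, the fallback `RuntimeError`, and the response keys `error` and `traceback` are assumptions made for illustration.

```python
class FunctionLoadError(Exception):
    """Raised when a worker could not load (deserialize) the user function."""


def raise_for_assignment(node_name, status, body):
    """Translate a node-assignment response into a client-side exception.

    Hypothetical helper: the real client may structure this differently.
    """
    if status == 200:
        return
    if status == 422 and body.get("error") == "function_load_failed":
        # Surface the worker's traceback so the user sees the real cause.
        raise FunctionLoadError(
            f"Function failed to load on {node_name}:\n{body.get('traceback', '')}"
        )
    # Anything else is still reported as a generic assignment failure.
    raise RuntimeError(f"Failed to assign {node_name}: {status}")
```

Keeping the generic branch means real infrastructure failures still surface as before; only the recognized 422 body is reclassified as a user-function problem.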

Test plan

  • uv run --project client pytest client/tests/test_rpm_exceptions.py -m unit
  • uv run --project node_service python -m py_compile node_service/src/node_service/job_endpoints.py
