node_service: fix race between concurrent POST /jobs/{id} by JacobZuliani · Pull Request #183 · Burla-Cloud/burla

JacobZuliani · 2026-04-24T07:31:22Z

What

Two concurrent POST /jobs/{id} requests to the same node could race past the CallHookOnJobStartMiddleware 409 guard. The middleware checked SELF["RUNNING"] synchronously, but the mutation that flipped RUNNING=True lived inside a wrapped_receive callback that ran later — on the first http.request event. In the window between "guard reads RUNNING=False" and "body arrives and flips RUNNING=True", a second request could enter, re-check, and also pass. Both reached execute, both drove the single TCP worker stream, and the second tripped asyncio's readexactly() called while another coroutine is already waiting for incoming data with a 500.

UX impact

Any user running two remote_parallel_map calls at once against the same cluster can hit this — especially when max_parallelism is high enough that main_service selects both onto the same node. The second rpm returns 500 Failed to assign <node>, and the first job is usually corrupted too (worker stream is now in an undefined state).

Surfaced during test suite work in PR #179; the concurrent-jobs scenario deliberately staggers by 2s to avoid triggering it.

Fix

Move the SELF state flip from the wrapped_receive callback into the middleware entry itself, right after the guard checks pass. Python's asyncio event loop guarantees no other coroutine runs between a sync check and a sync assignment, so a second concurrent request now correctly sees RUNNING=True and 409s.

Details:

on_job_start becomes a regular def (it had no awaits, just one asyncio.create_task for the fire-and-forget Firestore write).
The wrapped_receive indirection is gone — along with the started single-shot flag, since we fire exactly once on middleware entry.
Existing cleanup is unchanged: handle_errors's disconnected_mid_assign branch still resets RUNNING=False and current_job=None if the client drops before job_watcher_task starts.
.cursor/skills/burla-deep-dive/job-lifecycle.md updated to reflect the simpler flow.

Two-file, 34-line diff. No new behavior — tightens an existing race.

Two concurrent POST /jobs/{id} to the same node could race past the `CallHookOnJobStartMiddleware` 409 guard. The guard checks `SELF["RUNNING"]`, but the mutation that flipped RUNNING=True lived inside a `wrapped_receive` callback that ran later — on the first `http.request` event. In the window between "guard checks RUNNING" and "body arrives and flips RUNNING", a second request could enter, re-check, and also pass. Both reached `execute`, both drove the single TCP worker stream, and the second tripped asyncio's `readexactly() called while another coroutine is already waiting for incoming data` with a 500 response. Fix: call `on_job_start(scope)` synchronously at middleware entry, right after the guard checks. Python's asyncio event loop guarantees that between a sync `if` check and a sync assignment no other coroutine runs, so a second request now sees RUNNING=True and 409s. Removed the `wrapped_receive` indirection, dropped the `started` single-shot flag (no longer needed since we fire exactly once), and made `on_job_start` a regular def (it had no awaits). Existing cleanup still covers the early-client-disconnect case: `handle_errors`'s `disconnected_mid_assign` branch resets RUNNING and `current_job` if the client drops before `job_watcher_task` starts, same as before. Also updates .cursor/skills/burla-deep-dive/job-lifecycle.md to reflect the new (simpler) flow.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

node_service: fix race between concurrent POST /jobs/{id}#183

node_service: fix race between concurrent POST /jobs/{id}#183
JacobZuliani wants to merge 1 commit into
mainfrom
agent/01/fix-concurrent-job-assignment-race

JacobZuliani commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

JacobZuliani commented Apr 24, 2026

What

UX impact

Fix

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant