Skip to content

node_service: fix race between concurrent POST /jobs/{id}#183

Open
JacobZuliani wants to merge 1 commit into
mainfrom
agent/01/fix-concurrent-job-assignment-race
Open

node_service: fix race between concurrent POST /jobs/{id}#183
JacobZuliani wants to merge 1 commit into
mainfrom
agent/01/fix-concurrent-job-assignment-race

Conversation

@JacobZuliani

Copy link
Copy Markdown
Contributor

What

Two concurrent POST /jobs/{id} requests to the same node could race past the CallHookOnJobStartMiddleware 409 guard. The middleware checked SELF["RUNNING"] synchronously, but the mutation that flipped RUNNING=True lived inside a wrapped_receive callback that ran later — on the first http.request event. In the window between "guard reads RUNNING=False" and "body arrives and flips RUNNING=True", a second request could enter, re-check, and also pass. Both reached execute, both drove the single TCP worker stream, and the second tripped asyncio's readexactly() called while another coroutine is already waiting for incoming data with a 500.

UX impact

Any user running two remote_parallel_map calls at once against the same cluster can hit this — especially when max_parallelism is high enough that main_service selects both onto the same node. The second rpm returns 500 Failed to assign <node>, and the first job is usually corrupted too (worker stream is now in an undefined state).

Surfaced during test suite work in PR #179; the concurrent-jobs scenario deliberately staggers by 2s to avoid triggering it.

Fix

Move the SELF state flip from the wrapped_receive callback into the middleware entry itself, right after the guard checks pass. Python's asyncio event loop guarantees no other coroutine runs between a sync check and a sync assignment, so a second concurrent request now correctly sees RUNNING=True and 409s.

Details:

  • on_job_start becomes a regular def (it had no awaits, just one asyncio.create_task for the fire-and-forget Firestore write).
  • The wrapped_receive indirection is gone — along with the started single-shot flag, since we fire exactly once on middleware entry.
  • Existing cleanup is unchanged: handle_errors's disconnected_mid_assign branch still resets RUNNING=False and current_job=None if the client drops before job_watcher_task starts.
  • .cursor/skills/burla-deep-dive/job-lifecycle.md updated to reflect the simpler flow.

Two-file, 34-line diff. No new behavior — tightens an existing race.

Two concurrent POST /jobs/{id} to the same node could race past the
`CallHookOnJobStartMiddleware` 409 guard. The guard checks
`SELF["RUNNING"]`, but the mutation that flipped RUNNING=True lived
inside a `wrapped_receive` callback that ran later — on the first
`http.request` event. In the window between "guard checks RUNNING"
and "body arrives and flips RUNNING", a second request could enter,
re-check, and also pass. Both reached `execute`, both drove the
single TCP worker stream, and the second tripped asyncio's
`readexactly() called while another coroutine is already waiting
for incoming data` with a 500 response.

Fix: call `on_job_start(scope)` synchronously at middleware entry,
right after the guard checks. Python's asyncio event loop guarantees
that between a sync `if` check and a sync assignment no other
coroutine runs, so a second request now sees RUNNING=True and 409s.
Removed the `wrapped_receive` indirection, dropped the `started`
single-shot flag (no longer needed since we fire exactly once),
and made `on_job_start` a regular def (it had no awaits).

Existing cleanup still covers the early-client-disconnect case:
`handle_errors`'s `disconnected_mid_assign` branch resets RUNNING
and `current_job` if the client drops before `job_watcher_task`
starts, same as before.

Also updates .cursor/skills/burla-deep-dive/job-lifecycle.md to
reflect the new (simpler) flow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant