fix(mooncake): use separate magic_recv buffer to prevent weight corruption by KunWuLuan · Pull Request #6813 · verl-project/verl

KunWuLuan · 2026-06-22T10:56:07Z

What does this PR do?

Fix weight corruption in MooncakeCheckpointEngine's daisy-chain weight sync where the magic completion marker [0xAB, 0xDC, 0xEF, 0x88] overwrites the first 4 bytes of the data buffer, causing degenerate inference output (e.g. !!!! repeated to max_response_length). Introduces a dedicated magic_recv buffer for completion signals to isolate them from the data path.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: N/A
Format the PR title as [ckpt] fix: use separate magic_recv buffer to prevent weight corruption in Mooncake daisy-chain sync

Test

This change cannot be tested by CI because it requires a multi-GPU Mooncake RDMA environment. It was validated by a full training run:

| Metric | With bug | With fix |
| --- | --- | --- |
| `response_length/mean` | 4096 (maxed out) | 506.25 |
| `num_turns/mean` | N/A (degenerate) | 4.0 |
| Agent trajectories | `!!!!` repeated | Coherent English text |
| Checkpoint | Corrupted `embed_tokens` | Saved successfully |

API and Usage Example

No API changes. The fix is internal to MooncakeCheckpointEngine — all existing config and usage remains the same.

Design & Code Changes

Problem: After receiving a bucket via RDMA, the receiver writes a 4-byte magic marker to the sender's data buffer as a completion signal. Instrumentation shows the data buffer gets corrupted with magic bytes, and this corruption propagates through the daisy chain to downstream ranks. In Qwen3.6-27B, this overwrites embed_tokens.weight[0:2] with values like -3.85e+17, saturating all attention computations.
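To see why four stray bytes are enough to produce a value of this magnitude, the sketch below decodes the magic marker as two consecutive bfloat16 numbers. It assumes the weights are stored as little-endian bfloat16, which the description above does not state explicitly; the helper names are illustrative and not part of verl. Under that assumption the first decoded value comes out at roughly -3.85e+17, in line with the number reported above.

```python
import struct

# Marker bytes quoted in the PR description.
MAGIC = bytes([0xAB, 0xDC, 0xEF, 0x88])


def bf16_bits_to_float(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def decode_bf16_pair(raw: bytes) -> tuple[float, float]:
    """Interpret 4 little-endian bytes as two consecutive bfloat16 values.

    Assumption: the tensor dtype is bfloat16 and the host is little-endian.
    """
    first_bits, second_bits = struct.unpack("<HH", raw)
    return bf16_bits_to_float(first_bits), bf16_bits_to_float(second_bits)


first, second = decode_bf16_pair(MAGIC)
print(f"{first:.3e} {second:.3e}")
```

Only the first element is astronomically large; the second decodes to a tiny negative number. A single outlier of that size in the embedding table is still enough to dominate whatever consumes that row.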

The root-cause mechanism (whether transfer_sync_write has a local GPU side effect on intra-node RDMA, or another pathway) is still under investigation.

Fix: Introduce a separate RDMA-registered buffer (magic_recv, 8 bytes) for completion signals:

- `__init__`: Add `self.magic_recv = torch.zeros(8, ...)` and register it with `batch_register_memory`
- `send_weights`: Include `magic_ptr` (from `magic_slots`) in the info dict; `wait_for_complete` checks `magic_slots[idx]` instead of the data buffer
- `receive_weights`: Extract `magic_ptr` from the received info; write the magic to `magic_ptr` (dedicated slot) instead of `ptr` (data buffer); forward `magic_ptr` to the next rank
- `wait_for_complete`: Reset the magic slot to 0 after detection for reuse

Regardless of the corruption mechanism, writing to a dedicated buffer ensures any side effects only modify magic_recv (harmless) instead of the data buffer (fatal).
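The isolation argument can be illustrated with a small host-only simulation. This is a sketch under stated assumptions, not the code in this PR: plain `bytearray` objects stand in for RDMA-registered GPU buffers, and `fake_rdma_write`, `signal_old`, `signal_new`, and `is_complete` are hypothetical names invented for the example. Only the 8-byte size, the two 4-byte slots, and the reset-after-detection step come from the description above.

```python
MAGIC = bytes([0xAB, 0xDC, 0xEF, 0x88])
SLOT_SIZE = len(MAGIC)  # 4 bytes per slot, two slots for double buffering


def fake_rdma_write(target: bytearray, offset: int, payload: bytes) -> None:
    """Hypothetical stand-in for a remote write: copy payload into target."""
    target[offset:offset + len(payload)] = payload


def signal_old(data_buf: bytearray) -> None:
    # Old scheme: the completion marker lands on the first 4 bytes of the data.
    fake_rdma_write(data_buf, 0, MAGIC)


def signal_new(magic_recv: bytearray, slot: int) -> None:
    # New scheme: the marker lands in a dedicated slot, away from the data.
    fake_rdma_write(magic_recv, slot * SLOT_SIZE, MAGIC)


def is_complete(magic_recv: bytearray, slot: int) -> bool:
    """Check one slot for the marker and clear it so the slot can be reused."""
    start = slot * SLOT_SIZE
    if bytes(magic_recv[start:start + SLOT_SIZE]) == MAGIC:
        magic_recv[start:start + SLOT_SIZE] = bytes(SLOT_SIZE)
        return True
    return False


weights = bytearray(range(16))  # pretend these are weight bytes

old_copy = bytearray(weights)
signal_old(old_copy)  # first 4 weight bytes are now the marker

new_copy = bytearray(weights)
magic_recv = bytearray(2 * SLOT_SIZE)
signal_new(magic_recv, slot=1)  # weights untouched
done = is_complete(magic_recv, slot=1)
```

After running this, `old_copy` starts with the marker bytes while `new_copy` still equals `weights`, which is the property the fix relies on: whatever side effect the completion write has, it can only touch the scratch slot.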

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation. — N/A, internal fix with no user-facing changes.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: Requires multi-GPU Mooncake RDMA hardware, not feasible in CI.
Once your PR is ready for CI, send a message in the ci-request channel.
If your PR is related to the recipe submodule. — N/A, no recipe changes.

…ption

Problem: MooncakeCheckpointEngine's daisy-chain weight sync produces degenerate inference output (e.g. '!!!!' repeated to max_response_length) when using multi-rank rollout on the same node. In Qwen3.6-27B, embed_tokens.weight first 4 bytes get overwritten with the magic completion marker [0xAB, 0xDC, 0xEF, 0x88], causing astronomical embedding values that saturate all downstream attention computations.

Observation: After receiving a bucket, the receiver writes a 4-byte magic marker to the sender's DATA buffer as a completion signal via transfer_sync_write. Instrumentation shows the data buffer gets corrupted with magic bytes between the RDMA read and the subsequent usage, and the corruption propagates through the daisy chain to downstream ranks. The exact mechanism (whether transfer_sync_write has a local GPU side effect on intra-node RDMA, or another pathway) is still under investigation and not yet confirmed.

Fix:
- Add dedicated magic_recv buffer (8 bytes, one 4-byte slot per double-buffer)
- Register magic_recv with TransferEngine for RDMA access
- Send magic_ptr in info dict alongside data ptr
- Write magic completion signal to magic_ptr instead of data buffer ptr
- Regardless of the corruption mechanism, writing to a dedicated buffer ensures any side effects only modify magic_recv (harmless) instead of the data buffer (fatal)
- Reset magic slot to 0 after detection for reuse

Verification:
- With fix, full training completes successfully
- response_length/mean: 506.25 (vs 4096 with bug)
- Agent trajectories: coherent English text
- Checkpoint saved, no crashes

CLAassistant · 2026-06-22T10:56:15Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

gemini-code-assist

Code Review

This pull request introduces a dedicated magic_recv buffer to handle magic completion signals separately from the data buffers, preventing potential data corruption during weight synchronization. However, a critical race condition was identified in wait_for_complete where the asynchronous zeroing of the magic buffer on the GPU can race with incoming RDMA writes from the next rank, potentially leading to a deadlock. Synchronizing the device after resetting the buffer is recommended to resolve this issue.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-22T10:57:50Z

            if torch.equal(buf[:4], magic):
+                buf[:4] = 0  # reset for next use
                break


Critical Race Condition / Deadlock Risk

There is a critical race condition here that can lead to a permanent deadlock during weight synchronization.

Mechanism of the Bug:

1. buf[:4] = 0 is a GPU operation that is queued asynchronously on the local GPU's default CUDA stream.

2. Immediately after queuing this operation, wait_for_complete returns, and the receiver rank sends the info metadata (containing the pointer to this magic slot) to the next rank via self.store.send_obj(info, self.rank + 1).

3. The receiver rank then yields the tensors to the caller, suspending the coroutine. During this suspension, the next rank receives the info metadata, reads the weights, and performs a transfer_sync_write (RDMA write) to write the magic bytes back to this rank's magic slot.

4. Because transfer_sync_write is an RDMA operation, it writes directly to the GPU memory via PCIe/NIC, completely bypassing the GPU's command processor and CUDA streams.

5. If the next rank's RDMA write completes before the local GPU has finished executing the queued buf[:4] = 0 kernel on the default stream (which is highly likely if the local GPU is busy or delayed), the local GPU's zeroing kernel will run after the RDMA write, overwriting the newly received magic bytes with 0.

6. In the next iteration, wait_for_complete will poll this slot forever because the magic bytes were overwritten and will never be sent again, causing a permanent hang.

Solution:

We must call get_torch_device().synchronize() immediately after zeroing the buffer to block the CPU until the zeroing operation is fully completed on the GPU, ensuring the slot is clean before the next rank is notified.

Suggested change

             if torch.equal(buf[:4], magic):
-                buf[:4] = 0  # reset for next use
+                buf[:4] = 0
+                get_torch_device().synchronize()
                 break
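The ordering problem described in this comment can be reproduced deterministically with a toy model. The sketch below is an illustration only: `FakeStream` is a hypothetical stand-in for an asynchronous device queue, not a real CUDA stream, and the direct assignment to `slot` models a remote write that bypasses that queue. It deliberately replays the worst-case interleaving, where the remote write arrives before the queued reset has executed.

```python
from collections import deque

MAGIC = bytes([0xAB, 0xDC, 0xEF, 0x88])


class FakeStream:
    """Toy model of an asynchronous device queue (not a real CUDA stream)."""

    def __init__(self) -> None:
        self.pending = deque()

    def enqueue(self, op) -> None:
        # Work is recorded now and executed later, like an async kernel launch.
        self.pending.append(op)

    def synchronize(self) -> None:
        # Block until every queued operation has run.
        while self.pending:
            self.pending.popleft()()


def zero_slot(slot: bytearray) -> None:
    slot[:] = bytes(len(slot))


def run(sync_after_reset: bool) -> bytes:
    slot = bytearray(MAGIC)  # marker from the previous round was just detected
    stream = FakeStream()
    stream.enqueue(lambda: zero_slot(slot))  # queued reset of the slot
    if sync_after_reset:
        stream.synchronize()  # reset is guaranteed to have finished
    slot[:] = MAGIC  # next rank's remote write, bypassing the queue
    stream.synchronize()  # the device eventually drains its queue
    return bytes(slot)


lost = run(sync_after_reset=False)  # late reset erases the new marker
kept = run(sync_after_reset=True)   # marker survives
```

Without the synchronization step the slot ends up zeroed after the new marker was written, so a later poll would never see it. With synchronization the reset completes before the next rank can be told about the slot, and the marker is preserved.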

…ve_weights

Add get_torch_device().synchronize() before writing the magic completion signal to prevent the sender from reusing the buffer while the consumer still has pending GPU ops.

Use info.get("magic_ptr") with fallback to the data ptr for backward compatibility with older senders.

Add sglang import fallback for StatelessProcessGroup.

gemini-code-assist Bot reviewed Jun 22, 2026

KunWuLuan changed the title ~~fix(mooncake): use separate magic_recv buffer to prevent weight corruption~~ WIP fix(mooncake): use separate magic_recv buffer to prevent weight corruption Jun 22, 2026

wuxibin89 marked this pull request as draft June 23, 2026 02:45

KunWuLuan changed the title ~~WIP fix(mooncake): use separate magic_recv buffer to prevent weight corruption~~ fix(mooncake): use separate magic_recv buffer to prevent weight corruption Jun 23, 2026

KunWuLuan wants to merge 2 commits into verl-project:main from KunWuLuan:main