[bugfix] fix process_weights_after_loading & non_thinking_prefix#9519
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the vLLM weight reloading process to run process_weights_after_loading once after all weight buckets or groups have been loaded, rather than after each individual load. It introduces a new /process_weights_after_loading/ endpoint to trigger this on all workers and updates finish_vllm_weight_reload to delegate to vLLM's built-in processing when model_config and target_device are available. The review feedback suggests improving performance by using asyncio.gather to concurrently await worker responses instead of a sequential loop, and simplifying a ternary hasattr check to a more Pythonic getattr call.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the weight reloading mechanism for vLLM. Instead of calling weight reloading immediately after each weight chunk or bucket load inside try-finally blocks, it defers the call to process_weights_after_loading until after all weights have been loaded. Additionally, finish_vllm_weight_reload is updated to delegate to vLLM's built-in process_weights_after_loading when model_config and target_device are available, falling back to the FusedMoE-only path otherwise. New endpoints and client methods are introduced to coordinate this process across workers in server mode. I have no feedback to provide as there are no review comments.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for injecting a non-thinking prefix during token-in-token-out re-encoding in GKD and GRPO trainers when thinking is disabled, and refactors vLLM weight reloading to run once after all weight buckets are loaded. It also updates the padding of routed experts in GRPO to expand the last entry. However, several critical issues were identified: a NameError in swift/ray/megatron/gkd_trainer.py where non_thinking_prefix_ids is undefined, a TypeError in swift/rlhf_trainers/gkd_trainer.py due to an unsupported keyword argument in build_teacher_infer_request, and shape mismatch bugs in both grpo_trainer.py and megatron_worker.py where left-padding incorrectly applies both left and right padding simultaneously.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
I am having trouble creating individual review comments. Click here to see my feedback.
swift/ray/megatron/gkd_trainer.py (288-293)
NameError at Runtime
non_thinking_prefix_ids is referenced here but is never defined within the _fetch_teacher_from_replicas method. This will raise a NameError at runtime when OPSD is enabled with teacher replicas.
To fix this, define non_thinking_prefix_ids by calling get_non_thinking_prefix_ids(self.template) before using it.
opsd_item = build_opsd_teacher_data([item])[0]
if opsd_item.get('response_token_ids'):
non_thinking_prefix_ids = get_non_thinking_prefix_ids(self.template)
opsd_item['messages'] = replace_assistant_response_with_ids(
copy.deepcopy(opsd_item['messages']),
opsd_item['response_token_ids'],
non_thinking_prefix_ids=non_thinking_prefix_ids)
swift/rlhf_trainers/gkd_trainer.py (623-625)
TypeError at Runtime
build_teacher_infer_request is defined in swift/rlhf_trainers/utils.py as def build_teacher_infer_request(data: Dict) -> 'RolloutInferRequest': and does not accept the non_thinking_prefix_ids keyword argument. Passing it here will cause a TypeError at runtime.
To fix this, you should update the signature of build_teacher_infer_request in swift/rlhf_trainers/utils.py to accept non_thinking_prefix_ids and pass it to replace_assistant_response_with_ids inside it.
swift/megatron/trainers/grpo_trainer.py (370-377)
Shape Mismatch Bug in Left Padding
When padding_right is False (left padding), the code prepends left_pad (of length padding_len) to padded_tail. However, padded_tail already has last_entry (of length padding_len) appended to its right. This results in a total sequence length of padding_to + padding_len instead of the expected padding_to, which will cause a shape mismatch error downstream.
To fix this, avoid appending last_entry to the right when left-padding. The padding should only be prepended to the original routed_experts tensor.
if padding_right:
last_entry = routed_experts[-1:].expand(padding_len, -1, -1)
padding_routed_experts = torch.cat([routed_experts, last_entry], dim=0)
else:
left_pad = torch.zeros(padding_len, *routed_experts.shape[1:], dtype=routed_experts.dtype)
padding_routed_experts = torch.cat([left_pad, routed_experts], dim=0)
swift/ray/megatron/megatron_worker.py (600-607)
Shape Mismatch Bug in Left Padding
When padding_right is False (left padding), the code prepends left_pad (of length pad_len) to padded. However, padded already has last_entry (of length pad_len) appended to its right. This results in a total sequence length of target_len + pad_len instead of the expected target_len, causing a shape mismatch error.
To fix this, only prepend left_pad to the original routed tensor when left-padding, without appending last_entry to the right.
if padding_right:
last_entry = routed[-1:].expand(pad_len, *routed.shape[1:])
return torch.cat([routed, last_entry], dim=0)
else:
left_pad = torch.zeros(pad_len, *routed.shape[1:], dtype=routed.dtype)
return torch.cat([left_pad, routed], dim=0)
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces support for injecting non-thinking prefixes during token-in-token-out re-encoding when thinking is disabled, and refactors the vLLM weight reloading process to trigger process_weights_after_loading once after all weight buckets are loaded. The review feedback highlights several improvement opportunities: ensuring finish_vllm_weight_reload falls back to the FusedMoE-only path if the built-in vLLM method fails or lacks configuration, adding defensive checks when accessing template attributes in get_non_thinking_prefix_ids, and replacing the fragile next(llm_model.parameters()).device call with more robust device references (self.accelerator.device or self.device).
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| def get_non_thinking_prefix_ids(template) -> Optional[List[int]]: | ||
| """Return the token ids of the non-thinking prefix (e.g. '<think>\n\n</think>\n\n'). | ||
|
|
||
| When enable_thinking=False, the rollout engine injects this prefix into the prompt, so it | ||
| is part of the forwarded sequence (and routed_experts), but the generated response_token_ids | ||
| do NOT contain it. Token-in-token-out re-encoding must re-add it (masked out of the loss) to | ||
| keep the trainer/teacher sequence aligned with the rollout sequence. Returns None when the | ||
| prefix is not applicable (thinking enabled, or template has no non_thinking_prefix). | ||
| """ | ||
| non_thinking_prefix = template.template_meta.non_thinking_prefix | ||
| if template.enable_thinking is False and non_thinking_prefix: | ||
| return template.tokenizer.encode(non_thinking_prefix, add_special_tokens=False) | ||
| return None |
There was a problem hiding this comment.
To prevent potential AttributeError or TypeError if template is None or if certain attributes like template_meta or enable_thinking are missing, use defensive getattr checks and early returns.
def get_non_thinking_prefix_ids(template) -> Optional[List[int]]:
"""Return the token ids of the non-thinking prefix (e.g. '<think>\\n\\n</think>\\n\\n').
When enable_thinking=False, the rollout engine injects this prefix into the prompt, so it
is part of the forwarded sequence (and routed_experts), but the generated response_token_ids
do NOT contain it. Token-in-token-out re-encoding must re-add it (masked out of the loss) to
keep the trainer/teacher sequence aligned with the rollout sequence. Returns None when the
prefix is not applicable (thinking enabled, or template has no non_thinking_prefix).
"""
if template is None:
return None
template_meta = getattr(template, "template_meta", None)
if template_meta is None:
return None
non_thinking_prefix = getattr(template_meta, "non_thinking_prefix", None)
if getattr(template, "enable_thinking", None) is False and non_thinking_prefix:
return template.tokenizer.encode(non_thinking_prefix, add_special_tokens=False)
return None
Uh oh!
There was an error while loading. Please reload this page.