fix: handle empty colocated weight buckets#2134

Open
EazyReal wants to merge 1 commit into THUDM:main from EazyReal:codex/empty-colocated-weight-bucket-20260626

@EazyReal

Contributor

Summary

This fixes colocated raw weight sync when a tensor-parallel rank has no HF tensors for a chunk.

Today _send_to_colocated_engine still tries to build a FlattenedTensorBucket(named_tensors=[]) for the empty local chunk. That path raises before the rank can participate in the Gloo gather_object, so a valid uneven bucket layout can crash raw weight updates.

The fix keeps the collective behavior unchanged:

  • empty local chunks still participate in gather_object
  • all-empty gathered chunks become no-ops
  • source ranks pad empty gathered entries with an empty flattened bucket only when another rank has real tensors to send
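The three rules above can be sketched in plain Python. This is only an illustration of the decision logic, not the code in this PR: `flatten_chunk`, `empty_bucket`, and `resolve_gathered` are hypothetical names, a dict stands in for the real flattened bucket, and a plain list stands in for the result of `gather_object`.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in types: a named tensor is (name, raw bytes), and a
# "bucket" is a plain dict. The real code uses FlattenedTensorBucket instead.
NamedTensor = Tuple[str, bytes]
Bucket = dict


def empty_bucket() -> Bucket:
    """Placeholder bucket for a rank that contributed nothing."""
    return {"names": [], "payload": b""}


def flatten_chunk(named_tensors: List[NamedTensor]) -> Optional[Bucket]:
    """Pack one rank's local chunk; an empty chunk yields None instead of raising."""
    if not named_tensors:
        return None
    return {
        "names": [name for name, _ in named_tensors],
        "payload": b"".join(data for _, data in named_tensors),
    }


def resolve_gathered(gathered: List[Optional[Bucket]]) -> Optional[List[Bucket]]:
    """Decide what the source rank sends once every rank has contributed.

    `gathered` models the per-rank list a gather would produce, with None
    marking ranks whose local chunk was empty.
    """
    if all(entry is None for entry in gathered):
        return None  # all-empty chunk: nothing to send, treat as a no-op
    # At least one rank has real tensors, so pad the empty slots.
    return [entry if entry is not None else empty_bucket() for entry in gathered]


# Rank 0 has one tensor, rank 1 has none for this chunk.
mixed = resolve_gathered([flatten_chunk([("w", b"\x01\x02")]), flatten_chunk([])])
all_empty = resolve_gathered([flatten_chunk([]), flatten_chunk([])])
```

The point of the sketch is that every rank still produces an entry for the gather, so the collective is never skipped; only the handling of the gathered result changes.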

Why this matters

This can show up when HF tensor chunks are uneven across TP ranks, especially with PP/EP/MoE layouts where some ranks legitimately have no tensors for a given chunk. The update path should treat that as an empty contribution, not as an invalid bucket construction.

Validation

  • uv run --with pytest python tests/test_empty_colocated_weight_bucket.py
  • uv run --with pytest python -m pytest tests/test_empty_colocated_weight_bucket.py -q
  • uv run --with ruff ruff check slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py
  • uv run --with black black --check slime/backends/megatron_utils/update_weight/update_weight_from_tensor.py tests/test_empty_colocated_weight_bucket.py
  • EazyReal/slime#3 (fix: handle empty colocated weight buckets): CodeRabbit review completed, pre-commit passed
