[Megatron-FSDP] MaxPoolAllocator for double-buffering hybrid architectures.#5462
cspades wants to merge 9 commits into
Conversation
af4ad72 to
b81af2c
Compare
| # Do not release the buckets that are being all-gathered. | ||
| no_fsdp_units = True | ||
| for bucket_id in ag_buckets: | ||
| self.bucket_can_be_released[self.get_bucket_key(bucket_id, bwd)] = False | ||
| fsdp_unit_id = parameter_groups[bucket_id].fsdp_unit_id | ||
| if fsdp_unit_id is not None and fsdp_unit_id >= 0: | ||
| no_fsdp_units = False | ||
|
|
||
| # If prefetch is enabled, we will add prefetch buckets to ag_buckets. | ||
| if prefetch: | ||
| # If there are no FSDP units associated with params, we should not prefetch. | ||
| if prefetch and not no_fsdp_units: | ||
|
|
||
| def next_bucket_id(ag_buckets): | ||
| """ |
@shjwudp Please take a look at this code. It is a behavior change in how we pre-fetch.
Without this change, Nemotron hits an error: a parameter owned by a module that is not an FSDP unit pre-fetches buckets that will not be used for a long time. In this case, I believe the LanguageModelEmbedding pre-fetches Layer 2 during the last MTP layer of the model.
This pre-fetch of an irrelevant FSDP unit occupies both buffers in the double-buffer pool, so our code cannot allocate the MTP layer because no free buckets remain to support it.
Can we add a precondition for using a double buffer allocator to avoid this modification broadcast to all use cases?
if prefetch and not (
# When double buffering, if parameters are not members of FSDP units,
# we skip pre-fetch so that the pool can supply buffers efficiently.
# Non-unit module pre-fetch can run inside other FSDP unit modules and
# un-shard irrelevant model components, which pointlessly steals buffer
# allocations from the expected FSDP unit and violates the maximum
# limit of 2 buffers allocated at any point in time.
self.buffer.ddp_config.fsdp_double_buffer
and no_fsdp_units
):
So basically, we'll still do a "naive" pre-fetch if we are not using double buffers.
That being said, I feel like sometimes this can increase memory overhead. I think it is a trade-off. If we "naively" pre-fetch the next bucket even though the current bucket is not an FSDP unit, then it means:
- If the next layer is relevant and is computed right after the current layer, we get better overlap and performance. Example: LanguageModelEmbedding -> first Transformer layer.
- If the next layer is not relevant and is not computed after the current layer, we un-shard some extra bucket(s), and memory must hold the current layer, the actual next compute layer, and the unused extra layer. Example: LanguageModelEmbedding tied to MultiTokenPrediction, so we pre-fetch maybe 1 Transformer, 1 MoETransformer, and some Layer 2 Transformer.
I think both cases are fairly common. One thing we can do is suggest that users enable double buffering when the output layer is weight-tied; otherwise, if we do not skip this pre-fetch, memory overhead is higher at the end of the model.
If this can be customized by the user at a per-module level, the user can decide the pre-fetch graph instead of us.
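The guard proposed above can be sketched in isolation. This is a minimal illustration, not the code in the PR: the helper names `has_no_fsdp_units` and `should_prefetch` are hypothetical, and only the boolean logic is taken from the diff and the suggested condition.

```python
from typing import Iterable, Optional


def has_no_fsdp_units(unit_ids: Iterable[Optional[int]]) -> bool:
    """True when none of the buckets being all-gathered belongs to an FSDP unit.

    As in the diff above, both None and negative IDs mean "not in a unit".
    """
    return all(uid is None or uid < 0 for uid in unit_ids)


def should_prefetch(prefetch: bool, fsdp_double_buffer: bool, no_fsdp_units: bool) -> bool:
    """Skip pre-fetch only when double buffering AND no FSDP unit is involved."""
    return prefetch and not (fsdp_double_buffer and no_fsdp_units)


# A bucket set owned only by non-unit modules (e.g. an embedding).
embedding_only = has_no_fsdp_units([None, -1])
print(should_prefetch(True, True, embedding_only))   # False: pool buffers are protected
print(should_prefetch(True, False, embedding_only))  # True: the "naive" pre-fetch is kept
```

Without double buffering the behavior is unchanged, which is the precondition requested earlier in this thread.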
| dtype_attr=( | ||
| self.mp_policy.grad_comm_dtype | ||
| if isinstance(self.mp_policy.grad_comm_dtype, torch.dtype) | ||
| else "grad_dtype" | ||
| ), |
Translation: if the gradient communication data type is set, that is what this allocator will allocate, and the MaxPoolAllocator needs to know the correct dtype to plan the bucket assignments properly. Otherwise, it checks the main gradient data type of the ParameterGroup.
Before this change there was no data-type argument; the FixedPoolAllocator only used the dtype to find symmetric buckets. So this changes behavior only when a custom gradient communication data type is used.
Translation looks good. Even better if it's a code comment. The comment can go either here or to the allocator's constructor.
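As a rough sketch of how the translation could read as a code comment, the selection can be isolated like this. `DType` is a stand-in for torch.dtype so the example runs without PyTorch, and `ParamGroup`, `resolve_dtype_attr`, and `planned_dtype` are hypothetical names, not part of Megatron-FSDP.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class DType(Enum):
    """Stand-in for torch.dtype (assumption: lets the sketch run without PyTorch)."""
    FP32 = "float32"
    BF16 = "bfloat16"


@dataclass
class ParamGroup:
    """Hypothetical stand-in for a ParameterGroup with a main-gradient dtype."""
    grad_dtype: DType


def resolve_dtype_attr(grad_comm_dtype: Union[DType, str, None]) -> Union[DType, str]:
    # If an explicit gradient communication dtype is configured, the allocator
    # allocates buffers of that dtype, so bucket planning must use it too.
    # Otherwise, fall back to the name of the per-group attribute to look up.
    return grad_comm_dtype if isinstance(grad_comm_dtype, DType) else "grad_dtype"


def planned_dtype(group: ParamGroup, dtype_attr: Union[DType, str]) -> DType:
    """Dtype the allocator should plan for when sizing this group's bucket."""
    return dtype_attr if isinstance(dtype_attr, DType) else getattr(group, dtype_attr)
```

With a group whose main gradients are FP32, an explicit BF16 communication dtype plans BF16 buckets, while any non-dtype setting plans from the group's own `grad_dtype`.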
| def _build_fixed_max_pool(self): | ||
| """ | ||
| Compute the maximum double-buffer pool required to support all FSDP units. | ||
| """ |
The max pooling algorithm is here. The rest of the code is similar to FixedPoolAllocator.
Do max pooling decisions depend on prefetching/overlapping? Conceptually, more aggressive prefetching needs more memory and therefore affects the max pooling algorithm?
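To make the question concrete, here is one way a "max pool" could be computed. This is an assumption about the general idea, not the algorithm in `_build_fixed_max_pool`: every bucket size is made up, `build_max_pool` is a hypothetical name, and the number of units alive at once is modeled as an explicit `buffers_per_slot` parameter.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (unit_id, dtype_name, numel) for each bucket; all values are illustrative.
Bucket = Tuple[int, str, int]


def build_max_pool(buckets: List[Bucket], buffers_per_slot: int = 2) -> Dict[str, List[int]]:
    """Per dtype, return buffer sizes large enough to hold any single unit,
    replicated buffers_per_slot times."""
    # Group bucket sizes by unit and dtype.
    per_unit: Dict[int, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for unit_id, dtype, numel in buckets:
        per_unit[unit_id][dtype].append(numel)

    # Slot i of a dtype must fit the i-th largest bucket of every unit.
    slots: Dict[str, List[int]] = defaultdict(list)
    for by_dtype in per_unit.values():
        for dtype, sizes in by_dtype.items():
            for i, numel in enumerate(sorted(sizes, reverse=True)):
                if i == len(slots[dtype]):
                    slots[dtype].append(numel)
                else:
                    slots[dtype][i] = max(slots[dtype][i], numel)

    # Double buffering: one copy for the unit in use, one for the unit in flight.
    return {
        dtype: sorted(sizes * buffers_per_slot, reverse=True)
        for dtype, sizes in slots.items()
    }


hybrid = [(0, "bf16", 100), (0, "bf16", 40), (1, "bf16", 300), (2, "bf16", 80), (2, "bf16", 60)]
print(build_max_pool(hybrid))  # {'bf16': [300, 300, 60, 60]}
```

Under this reading, the pool size depends on pre-fetch depth only through `buffers_per_slot`: a more aggressive pre-fetch would need a larger value, while the per-slot maxima stay the same.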
| else: | ||
| if self.ddp_config.data_parallel_sharding_strategy == "optim_grads_params": | ||
| self.fsdp_unit_modules = [TransformerLayer] | ||
| self.fsdp_unit_modules = [TransformerLayer, MoETransformerLayer, MambaLayer] |
Before, we did not shard MambaLayer at all.
| if hasattr(torch.autograd.graph, 'set_override_stale_capture_stream'): | ||
| torch.autograd.graph.set_override_stale_capture_stream(True) | ||
| else: | ||
| logger.warning( | ||
| 'torch.autograd.graph.set_override_stale_capture_stream is not ' | ||
| 'available in this PyTorch version; CUDA graph capture may fail ' | ||
| 'if autograd nodes hold stale references to non-capturing streams. ' | ||
| 'Upgrade to a PyTorch build that includes pytorch/pytorch#180090.' | ||
| ) |
We should simply call this whenever the PyTorch version is new enough: pytorch/pytorch#180090
It is harmless and makes stragglers on the Autograd / accumulate stream much easier to handle. cc @nanz-nv
Signed-off-by: Cory Ye <cye@nvidia.com>
… later, and grad_comm_dtype not respected during FixedPool/MaxPool bucket planning. Signed-off-by: Cory Ye <cye@nvidia.com>
…ction. Signed-off-by: Cory Ye <cye@nvidia.com>
5b8512b to
8abf9db
Compare
wujingyue
left a comment
Deprecates --grad-reduce-in-bf16 / reduce_grad_in_fp32 for Megatron-FSDP, which has been incredibly confusing to use. Default arguments (auto) assume BF16 for both, so will not OOM any existing user's configs.
Adds a call to torch.autograd.graph.set_override_stale_capture_stream(True) (only supported on new PyTorch versions since pytorch/pytorch#180090) to prevent full-iteration CG errors like this:
Thanks for the PR and the figures!
While I'm still reviewing the rest, can these two changes go to a separate PR(s)? https://google.github.io/eng-practices/review/developer/small-cls.html
| dtype: torch.dtype, | ||
| device: torch.device, | ||
| mem_alloc_context: Optional[Callable] = None, | ||
| strict_assignments: bool = True, |
Adds the strict_assignments state to attempt to assign the same bucket previously assigned to an FSDP unit before warning the user and assigning a different bucket to the unit.
What's the downside of strict_assignments=True? I wonder if it should be always on, or always on/off for certain allocators, so we have fewer knobs to worry about.
It should always be on. It first tries to allocate the bucket it allocated before (the scope of which is defined by this boolean being set to True), and falls back to the original behavior if that fails. So it fixes CG issues while making the double-buffer assignment strategy more rigorous in general.
In V2, we should do this with context managers or something for TracePoolAlloc, but here it is far too messy to implement that when we don't have a use case for multiple model call patterns.
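The "try the previous bucket first, then warn and fall back" behavior described above can be sketched with a toy pool. The class and its attributes are hypothetical and much simpler than the real allocators. Only the policy is taken from this thread.

```python
import warnings
from typing import Dict, Optional, Set


class StrictPool:
    """Toy pool of numbered buffers; all names here are hypothetical."""

    def __init__(self, num_buffers: int, strict_assignments: bool = True):
        self.free: Set[int] = set(range(num_buffers))
        self.last_assignment: Dict[int, int] = {}  # unit_id -> buffer index
        self.strict_assignments = strict_assignments

    def allocate(self, unit_id: int) -> int:
        if not self.free:
            raise RuntimeError("no free buffer in the pool")
        previous: Optional[int] = self.last_assignment.get(unit_id)
        if self.strict_assignments and previous is not None:
            if previous in self.free:
                # Reuse the buffer this unit was given before.
                self.free.remove(previous)
                return previous
            warnings.warn(f"unit {unit_id}: buffer {previous} is busy, assigning another")
        # Original behavior: take any free buffer and remember the choice.
        chosen = min(self.free)
        self.free.remove(chosen)
        self.last_assignment[unit_id] = chosen
        return chosen

    def release(self, buffer_index: int) -> None:
        self.free.add(buffer_index)
```

With two buffers, units 0 and 1 get buffers 0 and 1 on the first pass. After both are released, allocating unit 1 first still returns buffer 1, whereas a non-strict pool would hand out buffer 0.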
@wujingyue Considering this exact commit needs to be merged for the NeMo release code freeze in a few days, could we make an exception in this case? These three features are all needed for Nemotron benchmarks. I'm concerned that getting three PRs merged in a few working days is not feasible.
| --record-memory-history | ||
| --memory-snapshot-path "${NSYS_PROFILE_PATH}/torch_memprof_node${SLURM_NODEID}_rank${SLURM_PROCID}.pickle" |
| --eval-interval 100 | ||
| --save-interval 1000 | ||
| --log-throughput | ||
| --logging-level 20 |
|
|
||
| if self.megatron_fsdp_max_pool_double_buffer: | ||
| # MaxPoolAllocator is a type of double-buffer allocator. | ||
| self.fsdp_double_buffer = True |
Instead of quietly overriding fsdp_double_buffer, we may want to assert fsdp_double_buffer to make sure the user understands the contract.
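A minimal sketch of the stricter contract suggested here, assuming a dataclass-style config. `DoubleBufferConfig` is a hypothetical slice containing only the two flags under discussion, and it raises ValueError rather than using a bare assert so the check survives `python -O`.

```python
from dataclasses import dataclass


@dataclass
class DoubleBufferConfig:
    """Hypothetical slice of the config; only the two flags discussed here."""
    fsdp_double_buffer: bool = False
    megatron_fsdp_max_pool_double_buffer: bool = False

    def __post_init__(self) -> None:
        # MaxPoolAllocator is a type of double-buffer allocator, so require the
        # user to opt in to double buffering explicitly instead of flipping it.
        if self.megatron_fsdp_max_pool_double_buffer and not self.fsdp_double_buffer:
            raise ValueError(
                "megatron_fsdp_max_pool_double_buffer requires fsdp_double_buffer=True"
            )
```

The trade-off is one extra flag for the user to set in exchange for a configuration that never changes silently.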
In my experience, reviewing three stacked PRs is usually faster than reviewing a single large PR. Stacked PRs can also be reviewed in parallel, though I may be missing something about how the review process works in Megatron-LM. As a less ideal alternative, you could keep everything in a single PR but split it into three well-structured commits. GitHub's UI supports reviewing commits individually, which provides a similar incremental review experience.
| assert ( | ||
| len(self.fsdp_double_buffer_units) > 0 | ||
| ), "Found no FSDP units to use max-sized buffering." | ||
| if torch.distributed.get_rank() == 0: |
Consider log_single_rank to reduce indentation
| ), "Found no FSDP units to use max-sized buffering." | ||
| if torch.distributed.get_rank() == 0: | ||
| if any( | ||
| pg.fsdp_unit_id == -1 or pg.fsdp_unit_id is None for pg in self.fsdp_param_groups |
It seems weird to have both -1 and None to represent the same meaning. But I guess this is likely a pre-existing problem.
| self.bucket_alloc_index[bucket_id] = (-1, bucket_offset) | ||
|
|
||
| # Log the max pool bucket sizes and bucket IDs responsible. | ||
| if torch.distributed.get_rank() == 0: |
Still needed given log_single_rank below?
| def __init__( | ||
| self, | ||
| name: str, | ||
| fsdp_param_groups: List["ParameterGroup"], |
Note to myself: these parameter groups may span across FSDP units.
What does this PR do?
- Adds the strict_assignments state to attempt to assign the same bucket previously assigned to an FSDP unit before warning the user and assigning a different bucket to the unit.
- Deprecates --grad-reduce-in-bf16 / reduce_grad_in_fp32 for Megatron-FSDP, which has been incredibly confusing to use. Default arguments (auto) assume BF16 for both, so will not OOM any existing user's configs.
- Adds a call to torch.autograd.graph.set_override_stale_capture_stream(True) (only supported on new PyTorch versions since pytorch/pytorch#180090, "Detect and fix stale stream references in autograd during CUDA graph capture") to prevent full-iteration CG errors like this:
^ (a) is annoying to implement, (b) is dirty, and (c) is EZ-PZ and recommended.