Code for the paper "MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training"
The following two steps show how to prepare the working environment and train a 7B GPT model with a 128K sequence length and an offload fraction of 0.25, using a pre-generated memory plan.
- Clone the official PyTorch codebase (v2.0.0) and replace the corresponding files with the files in patch/pytorch_2.0.0_patch. These files include:
  - c10/cuda/CUDACachingAllocator.h
  - c10/cuda/CUDACachingAllocator.cpp
  - c10/cuda/CUDAMallocAsyncAllocator.cpp
  - torch/csrc/cuda/Module.cpp
  - torch/csrc/cuda/CUDAPluggableAllocator.cpp
  - torch/csrc/cuda/CUDAPluggableAllocator.h
  - torch/cuda/memory.py

  We modified the PyTorch CUDA memory allocator and provide the utilities needed for static memory planning.
- Rebuild PyTorch, following the official README.
- Install TransformerEngine_torch2.0 from patch/TransformerEngine_torch2.0, using patch/TransformerEngine_torch2.0/install.sh.
- Apply for a Gurobi optimizer license and install gurobipy using pip.
cd Megatron-LM_cp_plan_128k_tp4_cp2_0.25/examples/gpt3
bash cp_multi.sh

The following steps show how to run the full pipeline for training a 7B GPT model with a 128K sequence length and an offload fraction of 0.25 from scratch. For other model sizes and sequence lengths, the workflow is the same: just modify the sequence length and model configs in memplan/profile/Megatron-LM_cp_profile_128k_tp4_cp2_0.25/examples/gpt3/cp_multi.sh and memplan/plan/Megatron-LM_cp_plan_128k_tp4_cp2_0.25/examples/gpt3/cp_multi.sh. To change the offload fraction, also change the OFFLOAD_FRACTION, OFFLOAD_DENOMINATOR and OFFLOAD_NUMERATOR environment variables in these two bash scripts. For example, if the offload fraction is 0.25 (which is 1/4), set
export OFFLOAD_FRACTION=0.25
export OFFLOAD_NUMERATOR=1
export OFFLOAD_DENOMINATOR=4

The full workflow takes two steps: profiling and planning.
- As described in Section 4 of the paper, in the profiling stage we run a few training steps of a smaller model (with only 1 transformer layer, to avoid GPU OOM errors) and record the memory requests directed to the PyTorch CUDA memory allocator. We then expand the memory requests to full scale (e.g. the 7B model has 32 transformer layers).
- In the planning stage, we take the memory request sequences from the previous step and use a bi-level MIP algorithm to solve for the optimal address of each tensor. We use the Gurobi optimizer in this codebase.
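The three OFFLOAD_* variables described above have to encode the same ratio, which is easy to get wrong by hand for values such as 0.375. The following is a minimal sketch of how the export lines could be derived from a single decimal string. The helper name `offload_env` is hypothetical and not part of this codebase, and it assumes the scripts accept the numerator and denominator in lowest terms.

```python
from fractions import Fraction


def offload_env(fraction: str) -> dict[str, str]:
    """Return mutually consistent values for the three OFFLOAD_* variables.

    `fraction` is a decimal string such as "0.25". Passing a string rather
    than a float lets Fraction convert it exactly.
    Assumption: the training scripts accept the ratio in lowest terms.
    """
    ratio = Fraction(fraction)  # Fraction("0.25") == Fraction(1, 4)
    if not 0 <= ratio <= 1:
        raise ValueError(f"offload fraction must lie in [0, 1], got {fraction}")
    return {
        "OFFLOAD_FRACTION": fraction,
        "OFFLOAD_NUMERATOR": str(ratio.numerator),
        "OFFLOAD_DENOMINATOR": str(ratio.denominator),
    }


if __name__ == "__main__":
    # Print lines that can be pasted into both cp_multi.sh scripts.
    for name, value in offload_env("0.25").items():
        print(f"export {name}={value}")
```

For "0.25" this prints the same three export lines as the example above. The values still have to be copied into both cp_multi.sh scripts by hand.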
- We provide a script to launch profiling:
cd memplan/profile/Megatron-LM_cp_profile_128k_tp4_cp2_0.25/examples/gpt3
- Set PROFILE_RANK=0 in cp_multi.sh. Because the memory request sequences of different CP (i.e. context parallel) groups differ, each CP group has to be profiled separately. Here we use a parallel training strategy with TP size = 4 and CP size = 2, so ranks 0, 1, 2, 3 belong to CP group 0 and ranks 4, 5, 6, 7 belong to CP group 1. Setting the environment variable PROFILE_RANK to 0 and then to 4 profiles the memory request sequences of both CP groups.
- Run the profiling script:
bash cp_multi.sh > log
- Copy the profiled memory trace for iter_0 from log into a file named iter_0. An example memory trace can be found in memplan/profile/cp_0/iter_0.
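- Move the trace file to the profiler working directory for the corresponding CP group:
mv iter_0 memplan/profile/cp_0/
- Repeat steps 2-5 (from setting PROFILE_RANK through moving the trace file) for the other CP group by setting PROFILE_RANK=4 in cp_multi.sh and moving the resulting trace to memplan/profile/cp_1/.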
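The choice of PROFILE_RANK values follows from the rank layout described in the profiling steps above. The following is a minimal sketch of that mapping, assuming the ranks of one TP group are contiguous, which matches the 8-rank layout stated here (ranks 0-3 in CP group 0, ranks 4-7 in CP group 1). The function names are hypothetical and not part of this codebase. For other parallel configurations, check the layout against the actual launch before relying on it.

```python
def cp_group_of(rank: int, tp_size: int = 4, cp_size: int = 2) -> int:
    """CP group index of a global rank.

    Assumption: ranks of one TP group are contiguous, so every block of
    `tp_size` consecutive ranks shares a CP group index.
    """
    return (rank // tp_size) % cp_size


def profile_ranks(tp_size: int = 4, cp_size: int = 2) -> list[int]:
    """First rank of each CP group: one candidate PROFILE_RANK per group."""
    return [group * tp_size for group in range(cp_size)]


if __name__ == "__main__":
    print([cp_group_of(rank) for rank in range(8)])
    print(profile_ranks())
```

With the defaults (TP size 4, CP size 2), `profile_ranks()` yields the two values used in this walkthrough, 0 and 4.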
- Make sure you have iter_0 in memplan/profile/cp_0 (or cp_1).
- In memplan/profile/cp_0, run the following script. This step expands the profiled memory request sequence of the smaller model (with only 1 layer) to the memory request sequence of the full 7B model (32 layers).
bash profile_cp_0.sh
- Repeat the previous step for CP group 1:
cd memplan/profile/cp_1
bash profile_cp_1.sh
- Run the bi-level MIP algorithm. This step generates the final memory plan, which determines the address of each tensor.
cd memplan/plan/cp_0
bash plan_cp_0.sh
- Repeat the previous step for CP group 1. This step generates the final memory plan for CP group 1.
cd memplan/plan/cp_1
bash plan_cp_1.sh

After the previous steps, final_0.pk (or final_1.pk) will have been generated in the memplan/plan/cp_0 (or cp_1) directory.
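To make concrete what "determining the address of each tensor" means, the sketch below assigns offsets to a handful of memory requests so that tensors that are alive at the same time never share bytes. It is not the bi-level MIP algorithm used in this codebase and does not read final_0.pk. It is a plain greedy first-fit heuristic, used here only because it needs no solver. The `Request` fields and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One memory request (hypothetical format, not the iter_0 trace format)."""
    name: str
    start: int  # step at which the tensor is allocated
    end: int    # step at which it is freed (exclusive)
    size: int   # bytes


def first_fit_plan(requests: list[Request]) -> dict[str, int]:
    """Assign each request an offset so that live tensors never overlap."""
    placed: list[tuple[Request, int]] = []
    plan: dict[str, int] = {}
    # Placing large tensors first is a common heuristic, not an optimality guarantee.
    for req in sorted(requests, key=lambda r: -r.size):
        # Address ranges held by placed tensors whose lifetime overlaps req's.
        busy = sorted(
            (off, off + other.size)
            for other, off in placed
            if other.start < req.end and req.start < other.end
        )
        offset = 0
        for lo, hi in busy:
            if offset + req.size <= lo:
                break  # req fits in the gap before this busy range
            offset = max(offset, hi)
        placed.append((req, offset))
        plan[req.name] = offset
    return plan


def peak_memory(requests: list[Request], plan: dict[str, int]) -> int:
    """Highest address touched by any tensor under the given plan."""
    return max((plan[r.name] + r.size for r in requests), default=0)


if __name__ == "__main__":
    reqs = [
        Request("a", start=0, end=2, size=4),
        Request("b", start=1, end=3, size=2),
        Request("c", start=2, end=4, size=4),
    ]
    plan = first_fit_plan(reqs)
    print(plan, peak_memory(reqs, plan))
```

In this toy input, `a` and `c` are never alive together, so they can share the same address range, and the peak stays below the sum of all tensor sizes. An exact solver such as the MIP formulation searches for the placement that minimizes this peak instead of accepting the first one that fits.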
- Prepare the memory plans:
mv memplan/plan/cp_0/final_0.pk memplan/plan/Megatron-LM_cp_plan_128k_tp4_cp2_0.25/examples/gpt3/
mv memplan/plan/cp_1/final_1.pk memplan/plan/Megatron-LM_cp_plan_128k_tp4_cp2_0.25/examples/gpt3/
- Launch training (from memplan/plan/Megatron-LM_cp_plan_128k_tp4_cp2_0.25/examples/gpt3):
bash cp_multi.sh
Here we explain the functionality of each subdirectory and how it corresponds to our manuscript.
- patch
  - pytorch2.0.0_patch
    In this directory, we modified the PyTorch CUDA memory allocator to provide utilities for static memory planning and for profiling. Specifically, these modifications allow users to specify which chunk of GPU memory is allocated for a given memory allocation request, and allow PyTorch to log the memory requests received by the CUDA memory allocator.
  - TransformerEngine_torch2.0
    In this directory, we added utilities for token-wise activation recomputation to the original TransformerEngine (v1.3).
- memplan
  - profile
    - Megatron-LM_cp_profile_128k_tp4_cp2_0.25:
      This directory modifies the official Megatron-LM codebase to implement memory profiling. Specifically, when the profiling script (i.e. examples/gpt3/cp_multi.sh) runs, the memory request sequence is printed. For example, suppose we are using TP_SIZE=4 and CP_SIZE=2 to train a 7B GPT model with a 128K sequence length. The memory request sequence during training differs between CP groups, so to get the memory request sequences of both CP groups, we need to modify the PROFILE_RANK environment variable in the script before each run. After profiling is done, we have a memory request sequence log file for each CP group, which can be found in profile/cp_0/ and profile/cp_1/.
    - cp_0:
      After the previous step, we have the profiled memory request sequence of CP group 0, which is written in iter_0. However, this memory request sequence cannot be used directly for memory planning, because it is the memory request sequence of a much smaller model (with only 1 transformer layer). As mentioned in Section 4 of our manuscript, we profile a smaller model in the profiling step to avoid out-of-memory errors. We therefore need to expand this memory request sequence to that of a 32-layer model (7B). We can do this by running profile/cp_0/profile_cp_0.sh.
    - cp_1:
      This directory expands the memory request sequence of CP group 1. We can do this by running profile/cp_1/profile_cp_1.sh.
  - plan
    - cp_0:
      This directory implements the bi-level MIP algorithm described in the manuscript. We begin with the memory request sequence generated in the profile/cp_0 (or cp_1) directory. First, we use the Gurobi optimizer to solve for the peak memory request and the address of each transient activation tensor within a transformer layer (generate_model_fwd.py and generate_model_bwd.py). Then we combine the results with the remaining memory requests and use the Gurobi optimizer again to solve for the address of each activation tensor (generate_model_final.py). All of this is done by running plan/cp_0/plan_cp_0.sh. This script dumps the final memory plan to a pickle file (final_0.pk).
    - cp_1:
      This directory is similar to cp_0. By running plan/cp_1/plan_cp_1.sh, we get the final memory plan for CP group 1 (final_1.pk).
    - Megatron-LM_cp_plan_128k_tp4_cp2_0.25:
      In this directory, we added the scheduling logic to the official Megatron-LM codebase. Specifically, we use user-defined hook functions to schedule the offloading of activations right after the computation of a transformer layer, and to schedule the fetching and partial recomputation before the backward pass of a transformer layer (detailed in Section 4 of our manuscript). We can launch training by running examples/gpt3/cp_multi.sh.