Describe the bug
CI test tests/unit_tests/transformer/test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph failed in job tests/unit_tests/transformer/**/*.py - latest.
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.
This is an environment-triggered failure, not a regression in the triggering PR (which only touches DeepSeek-v4 attention code). The guard at megatron/core/transformer/cuda_graphs.py:1506 (in place since Jan 2026) asserts when the first two conditions below hold at runtime and the third does not:
| Condition | This run |
| --- | --- |
| GPU compute capability < 10 (sub-Blackwell, e.g. H100/A100) | ✅ |
| PYTORCH_CUDA_ALLOC_CONF contains expandable_segments:True | ✅ (set in the runner/container env) |
| NCCL_GRAPH_REGISTER == "0" | ❌ not set |
The runner exports expandable_segments:True without NCCL_GRAPH_REGISTER=0, and the test was scheduled on a sub-Blackwell GPU, so the guard asserts. The failure is deterministic, not flaky: all three retries (15:35 / 15:47 / 15:59) failed identically.
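The guard's decision can be restated as a small pure function, which makes it easy to check any runner environment without a GPU. This is a minimal sketch that mirrors the assertion quoted under Error; the helper name `guard_would_fire` and the `ci_env` dictionary are hypothetical and are not part of megatron.

```python
from typing import Mapping


def guard_would_fire(capability_major: int, env: Mapping[str, str]) -> bool:
    """Return True when the assertion quoted in this issue would fail.

    Hypothetical helper: capability_major stands in for
    torch.cuda.get_device_capability()[0], env for os.environ.
    """
    # The guard only applies to GPUs with compute capability below 10.
    if capability_major >= 10:
        return False
    uses_expandable = "expandable_segments:True" in env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    graph_register_off = env.get("NCCL_GRAPH_REGISTER", "") == "0"
    # The assertion passes if expandable segments are off OR NCCL_GRAPH_REGISTER is "0".
    return uses_expandable and not graph_register_off


# Environment as described for the failing run (assumed from the table above).
ci_env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
print(guard_would_fire(9, ci_env))                                  # True
print(guard_would_fire(9, {**ci_env, "NCCL_GRAPH_REGISTER": "0"}))  # False
print(guard_would_fire(10, ci_env))                                 # False
```

A capability major version of 9 corresponds to an H100-class GPU and 10 to Blackwell, matching the first row of the table.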
Failing run
Error
        if torch.cuda.get_device_capability()[0] < 10:
            assert (
>               "expandable_segments:True" not in os.getenv("PYTORCH_CUDA_ALLOC_CONF", "")
                or os.getenv("NCCL_GRAPH_REGISTER", "") == "0"
            ), (
                "Setting NCCL_GRAPH_REGISTER=0 to avoid illegal memory access when using "
                "CUDA Graph with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True."
            )
E           AssertionError: Setting NCCL_GRAPH_REGISTER=0 to avoid illegal memory access when using CUDA Graph with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

megatron/core/transformer/cuda_graphs.py:1506: AssertionError
Steps/Code to reproduce bug
Re-run the failing CI job linked above, or run the test locally inside the dev container on a sub-Blackwell GPU:
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
pytest tests/unit_tests/transformer/test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph
Suggested fix
Either export NCCL_GRAPH_REGISTER=0 alongside PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for the unit-test jobs, or have the test itself set NCCL_GRAPH_REGISTER=0 (or skip) on sub-Blackwell GPUs when expandable_segments:True is active.
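One way the test-side option could look is a helper that sets the variable only when the guard would otherwise assert. This is a sketch under stated assumptions: the name `ensure_nccl_graph_register_disabled` is hypothetical, and it assumes the variable is set before NCCL is initialized in the process, which this issue does not confirm.

```python
from typing import MutableMapping


def ensure_nccl_graph_register_disabled(
    capability_major: int,
    env: MutableMapping[str, str],
) -> bool:
    """Set NCCL_GRAPH_REGISTER=0 in env if the guard would otherwise assert.

    Hypothetical helper. Returns True when env was modified, False when no
    change was needed.
    """
    # Compute capability 10 and above is not covered by the guard.
    if capability_major >= 10:
        return False
    # Without expandable segments the assertion already passes.
    if "expandable_segments:True" not in env.get("PYTORCH_CUDA_ALLOC_CONF", ""):
        return False
    # Already configured as the guard requires.
    if env.get("NCCL_GRAPH_REGISTER", "") == "0":
        return False
    env["NCCL_GRAPH_REGISTER"] = "0"
    return True


# Stand-in for os.environ on the failing runner.
env = {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}
changed = ensure_nccl_graph_register_disabled(9, env)
print(changed, env["NCCL_GRAPH_REGISTER"])  # True 0
```

In a real test this would be called with torch.cuda.get_device_capability()[0] and os.environ. With pytest, monkeypatch.setenv would keep the change scoped to the test. Fixing the runner environment instead has the advantage of covering every test that reaches the guard, not just this one.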
Additional context
Triaged automatically via /triage-issue.