🐛 CI failure: test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph (NCCL_GRAPH_REGISTER/expandable_segments env guard) #5474

Description

@ko3n1g

Describe the bug

CI test tests/unit_tests/transformer/test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph failed in job tests/unit_tests/transformer/**/*.py - latest.
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

This is an environment-triggered failure, not a regression in the triggering PR (which only touches DeepSeek-v4 attention code). The guard at megatron/core/transformer/cuda_graphs.py:1506 (in place since Jan 2026) asserts when the first two conditions below hold at runtime and the third does not:

| Condition | This run |
| --- | --- |
| GPU compute capability < 10 (sub-Blackwell, e.g. H100/A100) | ✅ (test scheduled on a sub-Blackwell GPU) |
| PYTORCH_CUDA_ALLOC_CONF contains expandable_segments:True | ✅ (set in the runner/container env) |
| NCCL_GRAPH_REGISTER == "0" | ❌ not set |
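
For triage, the three conditions can be checked without launching the test. The following is a minimal sketch that mirrors the assertion quoted in the Error section; `guard_would_fire` is a hypothetical helper name, not something that exists in Megatron-LM, and it takes the compute-capability major version as a plain integer so it can run on a machine without a GPU.

```python
from typing import Mapping


def guard_would_fire(capability_major: int, env: Mapping[str, str]) -> bool:
    """Return True if the cuda_graphs.py assertion would fail for this setup.

    Hypothetical helper: it re-implements the quoted guard logic for triage
    and is not part of Megatron-LM.
    """
    # The guard only applies below compute capability 10 (sub-Blackwell).
    if capability_major >= 10:
        return False
    # Same substring check the guard performs on the allocator config.
    expandable = "expandable_segments:True" in env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    # The guard accepts expandable segments only if this is exactly "0".
    registration_disabled = env.get("NCCL_GRAPH_REGISTER", "") == "0"
    return expandable and not registration_disabled
```

On a runner this could be called as `guard_would_fire(torch.cuda.get_device_capability()[0], os.environ)`.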

The runner exports expandable_segments:True without NCCL_GRAPH_REGISTER=0, and the test was scheduled on a sub-Blackwell GPU, so the guard asserts. The failure is deterministic, not flaky: all three retries (15:35 / 15:47 / 15:59) failed identically.

Failing run

| Field | Value |
| --- | --- |
| PR | #5011: [dev] [DeepSeek-v4] Packed Sequence (THD) support for DSv4 Hybrid Attention |
| Run | 28032794934 |
| Job | tests/unit_tests/transformer/**/*.py - latest |

Error

    if torch.cuda.get_device_capability()[0] < 10:
        assert (
>           "expandable_segments:True" not in os.getenv("PYTORCH_CUDA_ALLOC_CONF", "")
            or os.getenv("NCCL_GRAPH_REGISTER", "") == "0"
        ), (
            "Setting NCCL_GRAPH_REGISTER=0 to avoid illegal memory access when using "
            "CUDA Graph with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True."
        )
E       AssertionError: Setting NCCL_GRAPH_REGISTER=0 to avoid illegal memory access when using CUDA Graph with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

megatron/core/transformer/cuda_graphs.py:1506: AssertionError

Steps/Code to reproduce bug

Re-run the failing CI job linked above, or run the test locally inside the dev container on a sub-Blackwell GPU with NCCL_GRAPH_REGISTER unset:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  pytest tests/unit_tests/transformer/test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph

Suggested fix

Two options: (1) export NCCL_GRAPH_REGISTER=0 alongside PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True for the unit-test jobs, or (2) handle it in the test itself, either by setting NCCL_GRAPH_REGISTER=0 before the guard runs or by skipping on sub-Blackwell GPUs when expandable_segments:True is active.
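
The test-side variant could look roughly like the sketch below, which uses only the standard library. `nccl_graph_register_disabled` is a hypothetical name, and how it would be wired into test_cuda_graphs.py (fixture, setup method, or conftest) is left open. One assumption to verify: this satisfies the assertion, but whether NCCL honors a value set this late in the process depends on when NCCL reads its environment, so exporting the variable in the job environment may still be the safer fix.

```python
import contextlib
import os


@contextlib.contextmanager
def nccl_graph_register_disabled():
    """Temporarily set NCCL_GRAPH_REGISTER=0 when expandable segments are on.

    Hypothetical helper for illustration; not part of Megatron-LM.
    """
    # Only intervene when the allocator config would trip the guard.
    needs_fix = "expandable_segments:True" in os.getenv("PYTORCH_CUDA_ALLOC_CONF", "")
    previous = os.environ.get("NCCL_GRAPH_REGISTER")
    if needs_fix:
        os.environ["NCCL_GRAPH_REGISTER"] = "0"
    try:
        yield
    finally:
        # Restore the caller's environment exactly as it was.
        if needs_fix:
            if previous is None:
                os.environ.pop("NCCL_GRAPH_REGISTER", None)
            else:
                os.environ["NCCL_GRAPH_REGISTER"] = previous
```

Wrapping the body of the test in `with nccl_graph_register_disabled():` leaves the environment untouched on runners that do not enable expandable segments.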

Additional context

Triaged automatically via /triage-issue.
