🐛 CI failure: test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph (NCCL_GRAPH_REGISTER/expandable_segments env guard) #5474

**Describe the bug**

CI test `tests/unit_tests/transformer/test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph` failed in job [`tests/unit_tests/transformer/**/*.py - latest`](https://github.com/NVIDIA/Megatron-LM/actions/runs/28032794934/job/82991762705).
Tag @NVIDIA/mcore-oncall to get oncall's attention to this issue.

This is an **environment-triggered** failure, not a regression in the triggering PR (which only touches DeepSeek-v4 attention code). The guard at `megatron/core/transformer/cuda_graphs.py:1506` (in place since Jan 2026) fires when **all three** hold at runtime:

| Condition | This run |
|-----------|----------|
| GPU compute capability `< 10` (sub-Blackwell, e.g. H100/A100) | ✅ |
| `PYTORCH_CUDA_ALLOC_CONF` contains `expandable_segments:True` | ✅ (set in the runner/container env) |
| `NCCL_GRAPH_REGISTER` is not `"0"` | ✅ (not set) |

The runner exports `expandable_segments:True` without `NCCL_GRAPH_REGISTER=0`, and the test was scheduled on a sub-Blackwell GPU, so the guard asserts. The failure is deterministic, not flaky: all three retries (15:35 / 15:47 / 15:59) failed identically.
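The conditions can be checked without running the test. The following is a minimal sketch that mirrors the logic of the assertion quoted under **Error**. `guard_would_fire` is a hypothetical helper name, not part of Megatron-LM, and the compute-capability major version is passed in as an argument (an assumption made so the check runs on a machine without a GPU).

```python
import os


def guard_would_fire(capability_major, env=None):
    """Return True when the cuda_graphs.py assertion would fail.

    Hypothetical helper that mirrors the guard's logic; not part of Megatron-LM.
    """
    env = os.environ if env is None else env
    if capability_major >= 10:
        # Blackwell and newer: the guard is not evaluated at all.
        return False
    uses_expandable = "expandable_segments:True" in env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    graph_register_off = env.get("NCCL_GRAPH_REGISTER", "") == "0"
    # The assertion fails only if expandable segments are on and
    # NCCL_GRAPH_REGISTER is anything other than "0" (including unset).
    return uses_expandable and not graph_register_off


# Example: a sub-Blackwell GPU (major version 9 is used here for illustration),
# expandable segments enabled, NCCL_GRAPH_REGISTER unset.
print(guard_would_fire(9, {"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True"}))
```

On a real machine the first argument would come from `torch.cuda.get_device_capability()[0]`, as in the guard itself.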

**Failing run**

| Field | Value |
|-------|-------|
| PR    | [#5011: [dev] [DeepSeek-v4] Packed Sequence (THD) support for DSv4 Hybrid Attention](https://github.com/NVIDIA/Megatron-LM/pull/5011) |
| Run   | [28032794934](https://github.com/NVIDIA/Megatron-LM/actions/runs/28032794934) |
| Job   | [tests/unit_tests/transformer/**/*.py - latest](https://github.com/NVIDIA/Megatron-LM/actions/runs/28032794934/job/82991762705) |

**Error**

```
    if torch.cuda.get_device_capability()[0] < 10:
        assert (
>           "expandable_segments:True" not in os.getenv("PYTORCH_CUDA_ALLOC_CONF", "")
            or os.getenv("NCCL_GRAPH_REGISTER", "") == "0"
        ), (
            "Setting NCCL_GRAPH_REGISTER=0 to avoid illegal memory access when using "
            "CUDA Graph with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True."
        )
E       AssertionError: Setting NCCL_GRAPH_REGISTER=0 to avoid illegal memory access when using CUDA Graph with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

megatron/core/transformer/cuda_graphs.py:1506: AssertionError
```

**Steps/Code to reproduce bug**

Re-run the failing CI job linked above, or run the test locally inside the dev container on a sub-Blackwell GPU with `NCCL_GRAPH_REGISTER` unset:

```bash
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  pytest tests/unit_tests/transformer/test_cuda_graphs.py::TestParallelTransformerBlockCudagraphs::test_gpu_cudagraph
```

**Suggested fix**

Either export `NCCL_GRAPH_REGISTER=0` alongside `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` for the unit-test jobs, or have the test set `NCCL_GRAPH_REGISTER=0` itself (or skip) on sub-Blackwell GPUs when `expandable_segments:True` is active.
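If the test-side option is chosen, the environment adjustment could look roughly like the sketch below. `apply_nccl_graph_register_workaround` is a hypothetical helper, not existing Megatron-LM code; it only makes the `os.getenv` checks in the guard pass. Whether NCCL itself honors the variable when it is set after process start has not been verified here, so exporting it in the job environment remains the more conservative option.

```python
import os


def apply_nccl_graph_register_workaround(env=None):
    """Set NCCL_GRAPH_REGISTER=0 when expandable segments are enabled.

    Hypothetical helper; returns True if the environment mapping was modified.
    """
    env = os.environ if env is None else env
    if "expandable_segments:True" not in env.get("PYTORCH_CUDA_ALLOC_CONF", ""):
        # Expandable segments are off, so the guard cannot fire.
        return False
    if env.get("NCCL_GRAPH_REGISTER", "") == "0":
        # Already configured as the guard expects.
        return False
    env["NCCL_GRAPH_REGISTER"] = "0"
    return True
```

In a pytest suite this would typically be called from a fixture (for example with `monkeypatch.setenv`, so the change is undone after the test) before any CUDA graph code runs.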

**Additional context**

Triaged automatically via `/triage-issue`.

