104 changes: 104 additions & 0 deletions gpudirect-tcpx/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# GPUDirect-TCPX Release Notes
These release notes track updates to the following GPUDirect-TCPX components: GKE version, NCCL plugin installer, and TCPX-daemon.

New users: refer to [Maximize GPU network bandwidth in Standard mode clusters](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx) to set up GPUDirect-TCPX-enabled GKE clusters. That guide always installs the latest versions of the GPUDirect-TCPX components.

Existing users: use these release notes to update your cluster to the latest versions of the GPUDirect-TCPX components.

For best practices, refer to [Best practice to run workload with GPUDirect-TCPX(O)](../gpudirect-tcpxo/best-practice.md).

## How to upgrade to a new release
#### Recommended GKE versions:
- Upgrading the NCCL plugin installer image and the TCPX-daemon image does not strictly require upgrading your GKE cluster and nodes to the recommended GKE version. However, the recommended GKE versions offer the best compatibility guarantee.
- To upgrade GKE versions, refer to [Manually upgrading a cluster or node pool](https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster) for general guidance.
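
As a rough sketch of the manual GKE upgrade mentioned above, the control plane is usually upgraded first and the node pool second. The function name `upgrade_gke` and its three arguments are hypothetical, and the sketch assumes your `gcloud` configuration already sets a default zone or region for the cluster.

```shell
#!/bin/sh
# Sketch only: upgrade the control plane, then one node pool, to a target GKE version.
# Assumption: the cluster location comes from your gcloud config defaults.
upgrade_gke() {
  cluster="$1"; node_pool="$2"; version="$3"
  # Upgrade the control plane first; node pools cannot run a newer version than it.
  gcloud container clusters upgrade "$cluster" --master --cluster-version "$version" --quiet &&
  # Then upgrade the node pool that hosts the A3 High nodes.
  gcloud container clusters upgrade "$cluster" --node-pool "$node_pool" --cluster-version "$version" --quiet
}
```

A call such as `upgrade_gke my-cluster my-a3-pool <GKE_VERSION>` would run both steps in order and stop if the control plane upgrade fails.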
#### NCCL plugin installer image:
- Run `kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml` to update your nccl-tcpx-installer daemonset. This manifest always references the latest NCCL plugin installer. By default the daemonset uses a rolling update strategy, so the upgrade can be slow on a large nodepool. To speed it up, consider deleting the old daemonset and creating a new one.
- Upgrading the NCCL plugin installer does **NOT** require recreating or rebooting any VM. However, all pods in the same workload must use the same NCCL plugin version. Make sure no workloads are running or being scheduled while you apply this upgrade; otherwise, pods within the same workload may end up with different NCCL plugin versions installed.
- This upgrade will upgrade the NCCL plugin version for **ALL** A3 High nodes in the cluster. If you only want to upgrade a specific nodepool, please update the [nodeSelector](https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/gpudirect-tcpx/nccl-tcpx-installer.yaml#L25-L29) before deploying the NCCL plugin installer manifest.
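
The delete-and-recreate approach described above could look roughly like the following. This is a minimal sketch: the function name `recreate_installer` is hypothetical, and the `kube-system` namespace is an assumption that you should check against the manifest you deploy.

```shell
#!/bin/sh
# Sketch only: recreate the installer daemonset instead of waiting for a rolling update.
# Assumption: the daemonset is named nccl-tcpx-installer and lives in kube-system.
MANIFEST_URL="https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/gpudirect-tcpx/nccl-tcpx-installer.yaml"
DS_NAME="nccl-tcpx-installer"
DS_NAMESPACE="${DS_NAMESPACE:-kube-system}"

recreate_installer() {
  # Delete the old daemonset; --ignore-not-found keeps a first-time install from failing.
  kubectl delete daemonset "$DS_NAME" -n "$DS_NAMESPACE" --ignore-not-found &&
  # Re-create it from the manifest, which references the latest installer image.
  kubectl apply -f "$MANIFEST_URL" &&
  # Wait until the installer pods are ready, or give up after the timeout.
  kubectl rollout status daemonset/"$DS_NAME" -n "$DS_NAMESPACE" --timeout=10m
}
```

Run `recreate_installer` only while no workloads are running or being scheduled, for the reason given in the list above.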
#### TCPX-daemon image:
- Update your tcpx-daemon with the new image when deploying your application.
- The tcpx-daemon version is coupled with the NCCL plugin installer version. Please ensure your NCCL plugin installer version is upgraded before applying this tcpx-daemon version upgrade to your applications.
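
Because the two versions are coupled, a small lookup can help catch mismatched pairs before you deploy. The sketch below hard-codes the pairs listed in the Releases section of this README; the helper names `daemon_for_plugin` and `check_pair` are hypothetical.

```shell
#!/bin/sh
# Sketch only: map an NCCL plugin installer version to its matching tcpx-daemon version.
# The pairs below are copied from the Releases section of this README.
daemon_for_plugin() {
  case "$1" in
    v3.1.12) echo "v2.0.15" ;;
    v3.1.9)  echo "v2.0.12" ;;
    *) echo "unknown NCCL plugin installer version: $1" >&2; return 1 ;;
  esac
}

# Succeeds only when the given plugin and daemon versions belong to the same release.
check_pair() {
  expected="$(daemon_for_plugin "$1")" || return 1
  [ "$expected" = "$2" ]
}
```

For example, `check_pair v3.1.12 v2.0.15` succeeds, while `check_pair v3.1.12 v2.0.12` fails because that daemon version belongs to the earlier release.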
#### Compatible NCCL version:
- The NCCL plugin installer also bundles NCCL core, and using this bundled NCCL core is recommended.
- If you need to use the open-source NCCL core, please use the compatible NCCL version for best performance.

## Releases
- [Jun 22, 2026](./README.md#jun-22-2026)
- [Feb 15, 2024](./README.md#feb-15-2024)

## Jun 22, 2026
#### NCCL plugin installer image:
```
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
```
#### TCPX-daemon image:
```
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
```
#### Compatible NCCL version:
```
Default NCCL version: nccl-2.19, which is provided by the NCCL plugin installer
qualified and supported: NCCL 2.19.4
```
#### NCCL configs:
```
"LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/usr/local/tcpx/lib64\"",
"NCCL_SOCKET_IFNAME=eth0",
"NCCL_ALGO=Ring,Tree",
"NCCL_PROTO=Simple",
"NCCL_CROSS_NIC=0",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_P2P_PXN_LEVEL=0",
"NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4",
"NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_BUFFSIZE=4194304",
"NCCL_NSOCKS_PERTHREAD=4",
"NCCL_SOCKET_NTHREADS=1",
"NCCL_GPUDIRECTTCPX_TX_BINDINGS=eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177",
"NCCL_GPUDIRECTTCPX_RX_BINDINGS=eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191",
"NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000"
```
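
The entries above are quoted `KEY=VALUE` strings. For a workload launched from a shell entrypoint, the same settings could be exported as environment variables, roughly as follows. This is a sketch of one way to apply the configs, not a required format; the values are copied from the list above.

```shell
#!/bin/sh
# Sketch only: export the NCCL configs for this release before launching the workload.
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:-}:/usr/local/tcpx/lib64"
export NCCL_SOCKET_IFNAME=eth0
export NCCL_ALGO=Ring,Tree
export NCCL_PROTO=Simple
export NCCL_CROSS_NIC=0
export NCCL_NET_GDR_LEVEL=PIX
export NCCL_P2P_PXN_LEVEL=0
export NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4
export NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0
export NCCL_DYNAMIC_CHUNK_SIZE=524288
export NCCL_P2P_NET_CHUNKSIZE=524288
export NCCL_P2P_PCI_CHUNKSIZE=524288
export NCCL_P2P_NVL_CHUNKSIZE=1048576
export NCCL_BUFFSIZE=4194304
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=1
# The binding lists contain semicolons, so they must be quoted in a shell.
export NCCL_GPUDIRECTTCPX_TX_BINDINGS="eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177"
export NCCL_GPUDIRECTTCPX_RX_BINDINGS="eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191"
export NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000
```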
#### What is new in this release:
* Updated GPUDirect-TCPX components to NCCL plugin v3.1.12 and TCPX-daemon (RxDM) v2.0.15.
* Improved stability and performance for A3 High GKE clusters.

## Feb 15, 2024
#### NCCL plugin installer image:
```
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
```
#### TCPX-daemon image:
```
us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
```
#### Compatible NCCL version:
```
Default NCCL version: nccl-2.19, which is provided by the NCCL plugin installer
qualified and supported: NCCL 2.19.3
```
#### NCCL configs:
```
"LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:/usr/local/tcpx/lib64\"",
"NCCL_SOCKET_IFNAME=eth0",
"NCCL_ALGO=Ring",
"NCCL_PROTO=Simple",
"NCCL_CROSS_NIC=0",
"NCCL_NET_GDR_LEVEL=PIX",
"NCCL_P2P_PXN_LEVEL=0",
"NCCL_GPUDIRECTTCPX_SOCKET_IFNAME=eth1,eth2,eth3,eth4",
"NCCL_GPUDIRECTTCPX_CTRL_DEV=eth0",
"NCCL_DYNAMIC_CHUNK_SIZE=524288",
"NCCL_P2P_NET_CHUNKSIZE=524288",
"NCCL_P2P_PCI_CHUNKSIZE=524288",
"NCCL_P2P_NVL_CHUNKSIZE=1048576",
"NCCL_BUFFSIZE=4194304",
"NCCL_NSOCKS_PERTHREAD=4",
"NCCL_SOCKET_NTHREADS=1",
"NCCL_GPUDIRECTTCPX_TX_BINDINGS=eth1:8-21,112-125;eth2:8-21,112-125;eth3:60-73,164-177;eth4:60-73,164-177",
"NCCL_GPUDIRECTTCPX_RX_BINDINGS=eth1:22-35,126-139;eth2:22-35,126-139;eth3:74-87,178-191;eth4:74-87,178-191",
"NCCL_GPUDIRECTTCPX_PROGRAM_FLOW_STEERING_WAIT_MICROS=500000"
```
2 changes: 1 addition & 1 deletion gpudirect-tcpx/nccl-tcpx-installer-autopilot.yaml
@@ -41,7 +41,7 @@ spec:
hostPath:
path: /home/kubernetes/bin
initContainers:
- image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
- image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
name: nccl-tcpx-installer
resources:
limits:
2 changes: 1 addition & 1 deletion gpudirect-tcpx/nccl-tcpx-installer.yaml
@@ -42,7 +42,7 @@ spec:
hostPath:
path: /home/kubernetes/bin
initContainers:
- image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
- image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
name: nccl-tcpx-installer
resources:
requests:
8 changes: 4 additions & 4 deletions gpudirect-tcpx/nccl-test-latest-autopilot.yaml
@@ -70,7 +70,7 @@ spec:
cloud.google.com/gke-gpu-driver-version: latest
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -101,7 +101,7 @@ spec:
- name: proc
mountPath: /hostprocsysfs
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
@@ -177,7 +177,7 @@ spec:
topology.kubernetes.io/zone: us-central1-a
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -208,7 +208,7 @@ spec:
- name: proc-sys
mountPath: /hostprocsysfs
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
8 changes: 4 additions & 4 deletions gpudirect-tcpx/nccl-test-latest.yaml
@@ -60,7 +60,7 @@ metadata:
spec:
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -89,7 +89,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
@@ -162,7 +162,7 @@ metadata:
spec:
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -191,7 +191,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
Original file line number Diff line number Diff line change
@@ -60,7 +60,7 @@ metadata:
spec:
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -89,7 +89,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
@@ -162,7 +162,7 @@ metadata:
spec:
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -191,7 +191,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
8 changes: 4 additions & 4 deletions gpudirect-tcpx/nccl-test-without-hostnetwork.yaml
@@ -35,7 +35,7 @@ metadata:
spec:
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -63,7 +63,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
@@ -122,7 +122,7 @@ metadata:
spec:
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.12
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -150,7 +150,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
8 changes: 4 additions & 4 deletions gpudirect-tcpx/nccl-test.yaml
@@ -27,7 +27,7 @@ spec:
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -51,7 +51,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh
@@ -97,7 +97,7 @@ spec:
dnsPolicy: ClusterFirstWithHostNet
containers:
- name: tcpx-daemon
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/tcpgpudmarxd-dev:v2.0.15
imagePullPolicy: Always
command:
- /tcpgpudmarxd/build/app/tcpgpudmarxd
@@ -121,7 +121,7 @@ spec:
- name: LD_LIBRARY_PATH
value: /usr/local/nvidia/lib64
- name: nccl-test
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.9
image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpx/nccl-plugin-gpudirecttcpx-dev:v3.1.12
imagePullPolicy: Always
command:
- /bin/sh