580 changes: 580 additions & 0 deletions docs/superpowers/plans/2026-06-18-udma-alltoall-demo.md

Large diffs are not rendered by default.

124 changes: 124 additions & 0 deletions docs/superpowers/specs/2026-06-18-udma-alltoall-demo-design.md
@@ -0,0 +1,124 @@
# UDMA All-to-All Demo Design

## Goal

Add an all-to-all UDMA operator demo under `tests/udma/demo`.

The demo validates the common all-to-all layout:

- Each rank owns `rank_size` equal input slices.
- Input slice `dst_rank` from rank `src_rank` is sent to rank `dst_rank`.
- Each destination rank's output holds the received slices ordered by source rank.

For rank `i`, input is laid out as `[to0, to1, ..., toN-1]`. For rank `j`, output is laid out as `[from0, from1, ..., fromN-1]`.
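
As a host-only illustration of this layout, the sketch below applies the same slice permutation to plain vectors. It involves no UDMA or device code, and `ReferenceAllToAll` is a hypothetical helper name that is not part of the demo.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-only reference for the all-to-all layout (hypothetical helper).
// inputs[src] holds rank_size slices: [to0, to1, ..., toN-1].
// The returned outputs[dst] holds rank_size slices: [from0, from1, ..., fromN-1].
std::vector<std::vector<int32_t>> ReferenceAllToAll(
    const std::vector<std::vector<int32_t>> &inputs, std::size_t elementsPerPeer)
{
    const std::size_t rankSize = inputs.size();
    std::vector<std::vector<int32_t>> outputs(
        rankSize, std::vector<int32_t>(rankSize * elementsPerPeer, 0));
    for (std::size_t src = 0; src < rankSize; ++src) {
        assert(inputs[src].size() == rankSize * elementsPerPeer);
        for (std::size_t dst = 0; dst < rankSize; ++dst) {
            for (std::size_t e = 0; e < elementsPerPeer; ++e) {
                // Slice `dst` of rank `src` lands in slice `src` of rank `dst`.
                outputs[dst][src * elementsPerPeer + e] =
                    inputs[src][dst * elementsPerPeer + e];
            }
        }
    }
    return outputs;
}
```

With two ranks and two elements per peer, inputs `{1, 2, 3, 4}` and `{5, 6, 7, 8}` produce outputs `{1, 2, 5, 6}` on rank 0 and `{3, 4, 7, 8}` on rank 1.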

## Approach

Extend the existing `tilexr_udma_demo` binary instead of creating a separate demo.

The current demo already handles:

- multi-process local rank launch
- TileXR communicator initialization
- UDMA capability checks
- ordinary `aclrtMalloc` memory registration through `TileXRUDMARegister`
- local TCP barriers for demo synchronization
- per-rank logs and result validation

The all-to-all path will be selected with `test_type=2`. Existing `test_type=0` all-gather and `test_type=1` put-signal behavior must remain unchanged.

## Host Data Layout

For `test_type=2`, the host allocates one registered device-memory payload containing:

- input buffer: `rank_size * elements_per_peer` `int32_t` values
- output buffer: `rank_size * elements_per_peer` `int32_t` values
- signal/debug space if needed by the shared demo structure

The registered allocation remains rounded up to the existing 2 MiB UDMA registration alignment.
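
The size arithmetic can be sketched as follows. The 2 MiB alignment and the two `rank_size * elements_per_peer` buffers come from the text above. The struct, the function names, and the ordering of input, output, and extra space inside the payload are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 2 MiB registration alignment stated in the design.
constexpr std::size_t kUdmaRegisterAlign = 2u * 1024u * 1024u;

// Rounds `bytes` up to the next multiple of `align` (align must be non-zero).
inline std::size_t RoundUp(std::size_t bytes, std::size_t align)
{
    assert(align != 0);
    return (bytes + align - 1) / align * align;
}

// Hypothetical layout record; the order input -> output -> extra is an assumption.
struct PayloadLayout {
    std::size_t inputOffset;
    std::size_t outputOffset;
    std::size_t extraOffset;
    std::size_t registeredBytes;
};

inline PayloadLayout MakePayloadLayout(std::size_t rankSize, std::size_t elementsPerPeer,
                                       std::size_t extraBytes)
{
    // Input and output each hold rank_size * elements_per_peer int32_t values.
    const std::size_t bufferBytes = rankSize * elementsPerPeer * sizeof(int32_t);
    PayloadLayout layout;
    layout.inputOffset = 0;
    layout.outputOffset = bufferBytes;
    layout.extraOffset = 2 * bufferBytes;
    layout.registeredBytes = RoundUp(2 * bufferBytes + extraBytes, kUdmaRegisterAlign);
    return layout;
}
```

For `rank_size=2` and `elements_per_peer=16`, each buffer is 128 bytes, so under these assumptions the output would start at byte 128 and the registered size would round up to one 2 MiB unit.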

Input initialization for rank `src`:

```text
input[dst][elem] = 100000 + src * 1000 + dst
```

Expected output for rank `dst`:

```text
output[src][elem] = 100000 + src * 1000 + dst
```

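Because each value encodes both `src` and `dst`, a slice delivered to the wrong rank or written to the wrong output slot appears as a mismatched value in the validation logs. Every element of a slice carries the same value, so `elem` does not affect the pattern.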
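
A minimal host-side sketch of the initialization and the check is shown below. The formula is taken from the two blocks above. `PatternValue`, `MakeInput`, and `CountMismatches` are hypothetical names and do not reflect the demo's actual validation code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Value carried by every element of the slice sent from `src` to `dst`.
inline int32_t PatternValue(int src, int dst)
{
    return 100000 + src * 1000 + dst;
}

// Builds rank `src`'s input: rank_size slices, slice `dst` filled with PatternValue(src, dst).
inline std::vector<int32_t> MakeInput(int src, int rankSize, std::size_t elementsPerPeer)
{
    std::vector<int32_t> input;
    input.reserve(static_cast<std::size_t>(rankSize) * elementsPerPeer);
    for (int dst = 0; dst < rankSize; ++dst) {
        input.insert(input.end(), elementsPerPeer, PatternValue(src, dst));
    }
    return input;
}

// Counts elements of rank `dst`'s output that differ from the expected pattern.
inline std::size_t CountMismatches(const std::vector<int32_t> &output, int dst, int rankSize,
                                   std::size_t elementsPerPeer)
{
    assert(output.size() == static_cast<std::size_t>(rankSize) * elementsPerPeer);
    std::size_t mismatches = 0;
    for (int src = 0; src < rankSize; ++src) {
        for (std::size_t e = 0; e < elementsPerPeer; ++e) {
            if (output[static_cast<std::size_t>(src) * elementsPerPeer + e] != PatternValue(src, dst)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Returning a mismatch count rather than a boolean is a design choice for the sketch: it lets a log line report how much of a slice is wrong.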

## Kernel Behavior

Add a new AICore kernel and launch wrapper in `tilexr_udma_demo_kernel.cpp`.

The kernel reads `rank`, `rankSize`, and UDMA registry state from `CommArgs`.

For each peer:

- If `peer == rank`, copy the local input slice `input[rank]` into local output slice `output[rank]`.
- Otherwise, issue `TileXR::UDMAPutNbi<int32_t>` to write local `input[peer]` into the remote rank's registered output slice for this source rank.
- Call `TileXR::UDMAQuiet(args, peer)` after posting to each remote peer.

The remote byte offset is computed against the peer rank's registered base:

```text
output_offset + rank * elements_per_peer * sizeof(int32_t)
```

The local source pointer is:

```text
input + peer * elements_per_peer
```
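
The per-peer loop and the two address computations can be modeled on the host as a rough sketch. Here `std::memcpy` stands in for both the local copy and `TileXR::UDMAPutNbi`, `bases[r]` models rank `r`'s registered region, and there is no equivalent of `UDMAQuiet` because the copies complete synchronously. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Byte offset, relative to the peer's registered base, of the output slice
// that source rank `rank` writes into.
inline std::size_t RemoteOutputByteOffset(std::size_t outputOffset, std::size_t rank,
                                          std::size_t elementsPerPeer)
{
    return outputOffset + rank * elementsPerPeer * sizeof(int32_t);
}

// Host stand-in for one rank's kernel (not the real AICore kernel).
// std::memcpy replaces both the local self-copy and the UDMA put.
inline void SimulateRank(std::size_t rank, std::vector<std::vector<unsigned char>> &bases,
                         std::size_t inputOffset, std::size_t outputOffset,
                         std::size_t elementsPerPeer)
{
    const std::size_t sliceBytes = elementsPerPeer * sizeof(int32_t);
    for (std::size_t peer = 0; peer < bases.size(); ++peer) {
        // Local source: input + peer * elements_per_peer, expressed in bytes.
        const std::size_t srcOffset = inputOffset + peer * sliceBytes;
        // Destination: slice `rank` of the peer's output (peer may equal rank).
        const std::size_t dstOffset = RemoteOutputByteOffset(outputOffset, rank, elementsPerPeer);
        assert(srcOffset + sliceBytes <= bases[rank].size());
        assert(dstOffset + sliceBytes <= bases[peer].size());
        std::memcpy(bases[peer].data() + dstOffset, bases[rank].data() + srcOffset, sliceBytes);
    }
}
```

Running `SimulateRank` once per rank over a shared `bases` vector reproduces the expected output ordering, which makes the offset formula easy to check without hardware.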

## Synchronization

The demo keeps the existing host-side synchronization:

1. Each rank initializes and registers its buffers.
2. Host barrier ensures every rank's registered-memory metadata is visible.
3. Each rank launches the all-to-all kernel and synchronizes its stream.
4. Host barrier ensures all ranks have completed UDMA writes.
5. Host copies output back and validates.

The kernel does not add device-side inter-rank polling beyond `UDMAQuiet`.

## Build And Run

The existing `tests/udma/CMakeLists.txt` continues to build one demo kernel shared object and one `tilexr_udma_demo` executable.

Update the run script and README so:

```bash
bash demo/run_tilexr_udma_demo.sh 2 2 16 2 0
```

runs the all-to-all path.

## Verification

Local checks:

- Build metadata remains scoped to `tests/udma`.
- Existing all-gather and put-signal source paths remain intact.

Remote hardware validation:

- Create a new directory under `/home/aiv-perf/` on `root@141.61.95.18`.
- Copy or sync the repository into that directory.
- Build TileXR core and `tests/udma`.
- Run `bash demo/run_tilexr_udma_demo.sh 2 2 16 2 0`.
- If resources permit, also run a wider case such as `rank_size=4`.

Success requires every rank to print `TileXR UDMA demo success` and all output segments to match the expected all-to-all pattern.

## Out Of Scope

- Refactoring the demo runtime into shared helper classes.
- Adding a production all-to-all collective API.
- Optimizing multi-peer posting or batching.
- Supporting non-`int32_t` element types in this demo.
59 changes: 48 additions & 11 deletions src/comm/tilexr_comm.cpp
@@ -346,6 +346,14 @@ int TileXRComm::RegisterUDMAMemory(GM_ADDR localPtr, size_t bytes, TileXRUDMAMem
         TILEXR_LOG(ERROR) << "TileXR UDMA memory registration failed: " << ret;
         return TILEXR_ERROR_INTERNAL;
     }
+    udmaInfoDev_ = udmaTransport_->GetUDMAInfoDev();
+    commArgs_.udmaInfoPtr = udmaInfoDev_;
+    ret = UpdateCommArgsDev();
+    if (ret != TILEXR_SUCCESS) {
+        TILEXR_LOG(ERROR) << "TileXRUDMARegister failed to refresh CommArgs after UDMA info update: " << ret;
+        udmaTransport_->UnregisterMemory(localPtr);
+        return ret;
+    }

     if (socketExchange_ == nullptr) {
         TILEXR_LOG(ERROR) << "TileXRUDMARegister requires live socket exchange";
@@ -646,10 +654,6 @@ int TileXRComm::EnablePeerAccess()
     } else if (physicalInfo_.physicalLink == PhysicalLink::RESERVED) {
         physicalInfo_.physicalLink = PhysicalLink::PCIE;
         commArgs_.extraFlag |= ExtraFlag::TOPO_PCIE;
-        if (rankSize_ > PING_PONG_SIZE) {
-            TILEXR_LOG(ERROR) << "do not support pcie > 2 rank! rankSize_ = " << rankSize_;
-            return TILEXR_ERROR_INTERNAL;
-        }
     }

     physicalInfo_.coreNum = GetCoreNum(physicalInfo_.chipName);
@@ -864,6 +868,30 @@ int TileXRComm::InitCommMem()
     }

     if (OpenIpcMem(names) != TILEXR_SUCCESS) {
+        const char *modeEnv = std::getenv("TILEXR_IPC_PID_MODE");
+        const bool forceSdid = modeEnv != nullptr && std::string(modeEnv) == "sdid";
+        if (forceSdid) {
+            TILEXR_LOG(WARN) << "OpenIpcMem failed after sdid setup, retry with pid setup";
+            string retryName;
+            if (setenv("TILEXR_IPC_PID_MODE", "pid_retry", 1) != 0 ||
+                SetMemoryName(retryName) != TILEXR_SUCCESS ||
+                SetIpcPidSdid(retryName, pids, sdids) != TILEXR_SUCCESS) {
+                TILEXR_LOG(ERROR) << "SetIpcPidSdid pid retry failed!";
+                setenv("TILEXR_IPC_PID_MODE", "sdid", 1);
+                return TILEXR_ERROR_INTERNAL;
+            }
+            retryName.resize(IPC_NAME_SIZE);
+            ret = GetName(retryName, names);
+            if (ret != TILEXR_SUCCESS) {
+                TILEXR_LOG(ERROR) << "GetName pid retry error! ret: " << ret;
+                setenv("TILEXR_IPC_PID_MODE", "sdid", 1);
+                return ret;
+            }
+            setenv("TILEXR_IPC_PID_MODE", "sdid", 1);
+            if (OpenIpcMem(names) == TILEXR_SUCCESS) {
+                return TILEXR_SUCCESS;
+            }
+        }
         TILEXR_LOG(ERROR) << "rank: " << rank_ << " OpenIpcMem failed!";
         return TILEXR_ERROR_INTERNAL;
     }
@@ -909,26 +937,35 @@ int TileXRComm::SetMemoryName(string &name)

 int TileXRComm::SetIpcPidSdid(string &name, const uint32_t *pids, const int64_t *sdids) const
 {
+    const char *modeEnv = std::getenv("TILEXR_IPC_PID_MODE");
+    bool forcePid = modeEnv != nullptr && std::string(modeEnv) == "pid";
+    bool forceSdid = modeEnv != nullptr && std::string(modeEnv) == "sdid";
+    bool defaultSdid =
+        physicalInfo_.chipName >= ChipName::CHIP_910_9391 && physicalInfo_.chipName < ChipName::CHIP_950;
+    bool useSdid = forceSdid || (!forcePid && defaultSdid);
+    TILEXR_LOG(INFO) << "SetIpcPidSdid mode=" << (useSdid ? "sdid" : "pid");
     for (int i = 0; i < rankSize_; ++i) {
         if (i == rank_) {
             continue;
         }

-        if (physicalInfo_.chipName < ChipName::CHIP_910_9391) {
-            // 910B
-            int32_t pidInt32 = pids[i];
+        int32_t pidInt32 = pids[i];
+        if (!useSdid) {
             int rtRet = rtSetIpcMemPid(name.c_str(), &pidInt32, HCCL_IPC_PID_ARRAY_SIZE);
             if (rtRet != RT_ERROR_NONE) {
                 TILEXR_LOG(ERROR) << "err " << rtRet;
                 return TILEXR_ERROR_INTERNAL;
             }
         } else {
-            // 910A3
-            int32_t pidInt32 = pids[i];
             int rtRet = rtSetIpcMemorySuperPodPid(name.c_str(), sdids[i], &pidInt32, HCCL_IPC_PID_ARRAY_SIZE);
             if (rtRet != RT_ERROR_NONE) {
-                TILEXR_LOG(ERROR) << "err " << rtRet;
-                return TILEXR_ERROR_INTERNAL;
+                TILEXR_LOG(WARN) << "rtSetIpcMemorySuperPodPid err " << rtRet
+                                 << ", fallback to rtSetIpcMemPid";
+                rtRet = rtSetIpcMemPid(name.c_str(), &pidInt32, HCCL_IPC_PID_ARRAY_SIZE);
+                if (rtRet != RT_ERROR_NONE) {
+                    TILEXR_LOG(ERROR) << "err " << rtRet;
+                    return TILEXR_ERROR_INTERNAL;
+                }
             }
         }
     }
4 changes: 3 additions & 1 deletion src/comm/tilexr_internal.cpp
@@ -39,7 +39,9 @@ const std::unordered_map<std::string, ChipName> CHIP_MAP = {
     {"Ascend950DT", ChipName::CHIP_950},
     {"Ascend950DT_9581", ChipName::CHIP_950},
     {"Ascend950DT_9584", ChipName::CHIP_950},
-    {"Ascend950PR", ChipName::CHIP_950}
+    {"Ascend950DT_9592", ChipName::CHIP_950},
+    {"Ascend950PR", ChipName::CHIP_950},
+    {"Ascend950PR_9599", ChipName::CHIP_950}
 };

 /**