MSCCL all-to-all performance did not improve compared with NCCL #48

Hi, I ran the `alltoall_perf` test from nccl-tests on 1/2/8 nodes with 8xA100 GPUs each and found that MSCCL (in-place) performance did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated by `python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml`. I also tried `alltoall_a100_three_step.py` and `alltoall_allpairs.py`; they all behaved similarly.
The test command is `nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100`, and I ran it on 8/16/64 GPUs, corresponding to 1/2/8 nodes.
The `alltoall_perf` result on 8 nodes is:
```
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576          4096     float    none      -1   9012.2    0.12    0.11      0    561.5    1.87    1.84    N/A
     2097152          8192     float    none      -1   1067.7    1.96    1.93      0   1046.3    2.00    1.97    N/A
     4194304         16384     float    none      -1   2010.8    2.09    2.05      0   2023.0    2.07    2.04    N/A
     8388608         32768     float    none      -1   5698.5    1.47    1.45      0   4261.4    1.97    1.94    N/A
    16777216         65536     float    none      -1   8339.5    2.01    1.98      0   8211.3    2.04    2.01    N/A
    33554432        131072     float    none      -1    16235    2.07    2.03      0    16281    2.06    2.03    N/A
    67108864        262144     float    none      -1    32252    2.08    2.05      0    51440    1.30    1.28    N/A
   134217728        524288     float    none      -1    63877    2.10    2.07      0    83221    1.61    1.59    N/A
   268435456       1048576     float    none      -1   147334    1.82    1.79      0   142747    1.88    1.85    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.77934 
```
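For reference, the `algbw` and `busbw` columns above can be reproduced from the `size` and `time` columns. The sketch below assumes the nccl-tests convention for all-to-all, where `algbw = size / time` and `busbw = algbw * (n - 1) / n`, and assumes `n = 64` ranks (8 nodes x 8 GPUs). The function name `alltoall_bandwidth` is only illustrative.

```python
def alltoall_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-to-all measurement.

    Assumes the nccl-tests convention: busbw = algbw * (n - 1) / n.
    """
    # bytes per microsecond equals MB/s; divide by 1e3 to get GB/s.
    algbw = size_bytes / time_us / 1e3
    busbw = algbw * (n_ranks - 1) / n_ranks
    return algbw, busbw


# First row of the table, in-place columns: 1048576 B in 561.5 us on 64 ranks.
algbw, busbw = alltoall_bandwidth(1048576, 561.5, 64)
print(f"algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")
```

With 64 ranks the correction factor is 63/64, so `busbw` stays within about 2% of `algbw` in every row.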
I also found that the average bus bandwidth drops sharply on multiple nodes (2/8) compared with a single node. I have attached the logs for 8/16/64 GPUs below. Thank you!
[gpu8-two_step.log](https://github.com/microsoft/msccl/files/10031767/gpu8-two_step.log)
[gpu16-two_step.log](https://github.com/microsoft/msccl/files/10031769/gpu16-two_step.log)
[gpu64-two_step.log](https://github.com/microsoft/msccl/files/10031771/gpu64-two_step.log)
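To compare the two code paths row by row across these logs, a small parser along the following lines might help. It is a minimal sketch that assumes the 13-column `alltoall_perf` row layout shown in the table above; `parse_rows` is a hypothetical helper, not part of nccl-tests or MSCCL, and the sample text is copied from two rows of the 8-node result.

```python
def parse_rows(text):
    """Return a list of (size_bytes, out_of_place_us, in_place_us) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 13 columns and start with the message size;
        # header and summary lines start with '#' and are skipped.
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        rows.append((int(fields[0]), float(fields[5]), float(fields[9])))
    return rows


sample = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    33554432        131072     float    none      -1    16235    2.07    2.03      0    16281    2.06    2.03    N/A
    67108864        262144     float    none      -1    32252    2.08    2.05      0    51440    1.30    1.28    N/A
"""

for size, oop_us, ip_us in parse_rows(sample):
    # A ratio above 1 means the in-place run was faster than the out-of-place run.
    print(f"{size:>12} B  out-of-place/in-place time ratio: {oop_us / ip_us:.2f}")
```

Running the same loop over each attached log would show at which message sizes, if any, the in-place timings beat the out-of-place ones.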
