MSCCL all-to-all performance did not improve compared with NCCL #48

@Musisoul

Hi, I ran the alltoall_perf benchmark from nccl-tests on 1/2/8 nodes, each with 8xA100 GPUs, and found that the performance of MSCCL (in-place) did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated with python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml. I also tried alltoall_a100_three_step.py and alltoall_allpairs.py; they all behaved similarly.
The test command is nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100, and I ran it on 8/16/64 GPUs, corresponding to 1/2/8 nodes.
The alltoall test result on 8 nodes looks like this:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576          4096     float    none      -1   9012.2    0.12    0.11      0    561.5    1.87    1.84    N/A
     2097152          8192     float    none      -1   1067.7    1.96    1.93      0   1046.3    2.00    1.97    N/A
     4194304         16384     float    none      -1   2010.8    2.09    2.05      0   2023.0    2.07    2.04    N/A
     8388608         32768     float    none      -1   5698.5    1.47    1.45      0   4261.4    1.97    1.94    N/A
    16777216         65536     float    none      -1   8339.5    2.01    1.98      0   8211.3    2.04    2.01    N/A
    33554432        131072     float    none      -1    16235    2.07    2.03      0    16281    2.06    2.03    N/A
    67108864        262144     float    none      -1    32252    2.08    2.05      0    51440    1.30    1.28    N/A
   134217728        524288     float    none      -1    63877    2.10    2.07      0    83221    1.61    1.59    N/A
   268435456       1048576     float    none      -1   147334    1.82    1.79      0   142747    1.88    1.85    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.77934 
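
To make the comparison in the table easier to read, here is a minimal sketch that recomputes algbw and busbw from the reported sizes and times, and the out-of-place/in-place time ratio per message size. It assumes 64 ranks (8 nodes x 8 GPUs) and that the all-to-all bus bandwidth is algbw scaled by (n-1)/n, which is consistent with the numbers printed above; the function names are hypothetical helpers, not part of nccl-tests.

```python
# Rows copied from the 8-node table above:
# (size in bytes, out-of-place time in us, in-place time in us)
ROWS = [
    (1048576, 9012.2, 561.5),
    (2097152, 1067.7, 1046.3),
    (4194304, 2010.8, 2023.0),
    (8388608, 5698.5, 4261.4),
    (16777216, 8339.5, 8211.3),
    (33554432, 16235.0, 16281.0),
    (67108864, 32252.0, 51440.0),
    (134217728, 63877.0, 83221.0),
    (268435456, 147334.0, 142747.0),
]
NRANKS = 64  # assumption: 8 nodes x 8 GPUs, one rank per GPU


def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return size_bytes / (time_us * 1e-6) / 1e9


def busbw_gbps(size_bytes, time_us, nranks=NRANKS):
    """Bus bandwidth, assuming the all-to-all factor (n-1)/n."""
    return algbw_gbps(size_bytes, time_us) * (nranks - 1) / nranks


def compare(rows, nranks=NRANKS):
    """Return (size, out-of-place busbw, in-place busbw, time ratio) per row.

    A ratio above 1 means the in-place run was faster at that size.
    """
    return [
        (size, busbw_gbps(size, t_oop, nranks), busbw_gbps(size, t_ip, nranks), t_oop / t_ip)
        for size, t_oop, t_ip in rows
    ]


def average_busbw(rows, nranks=NRANKS):
    """Mean bus bandwidth over both columns of every row."""
    values = []
    for size, t_oop, t_ip in rows:
        values.append(busbw_gbps(size, t_oop, nranks))
        values.append(busbw_gbps(size, t_ip, nranks))
    return sum(values) / len(values)


if __name__ == "__main__":
    for size, bw_oop, bw_ip, ratio in compare(ROWS):
        print(f"{size:>10d} B  out-of-place {bw_oop:5.2f} GB/s  in-place {bw_ip:5.2f} GB/s  ratio {ratio:5.2f}")
    print(f"avg busbw: {average_busbw(ROWS):.3f} GB/s")
```

Apart from the first row, where the out-of-place time looks like a warm-up outlier, the ratio stays close to 1 up to 32 MB and falls below 1 at 64 MB and 128 MB, which is the behavior described in this report.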

I also found that the average bus bandwidth drops sharply on multiple nodes (2/8) compared with a single node. I have attached the logs for 8/16/64 GPUs below. Thank you!
gpu8-two_step.log
gpu16-two_step.log
gpu64-two_step.log
