How to run nccl-tests with MSCCL on 8 GPUs (2 servers x 4 P40 GPUs)? #62

@JiangboHe

Hi! I have two servers, each with 4 P40 GPUs. How can I run nccl-tests with MSCCL across both of them?
server1 (10.0.0.13) <---> server2 (10.0.0.15)
The test runs successfully on server1 or server2 alone with its 4 GPUs, but it fails with errors when I run it across server1 and server2 together (2 nodes).

  1. hostfile:
    10.0.0.13 slots=4
    10.0.0.15 slots=4

  2. command used to generate my XML file:
    python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 1 > test-reduce-8-1.xml

  3. command:
    mpirun --allow-run-as-root -np 8 \
      -hostfile hostfile \
      --prefix /home/nccl-tool/dependency/openmpi \
      -x LD_LIBRARY_PATH=executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH \
      -x NCCL_DEBUG=INFO \
      -x MSCCL_XML_FILES=test-reduce-8-1.xml \
      -x NCCL_ALGO=MSCCL,RING,TREE \
      -x NCCL_MSCCL_ENABLE=1 \
      tests/msccl-tests-nccl/build/all_reduce_perf -b 1K -e 1K -f 2 -g 1

  4. error:

ubuntu2004-113:210748:210793 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net_ib.cc:579 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net_ib.cc:786 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net.cc:730 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO proxy.cc:1310 -> 2
ubuntu2004-113:210748:210793 [0] NCCL INFO proxy.cc:1381 -> 2

ubuntu2004-113:210748:210793 [0] proxy.cc:1523 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2

ubuntu2004-113:210748:210781 [0] misc/socket.cc:29 NCCL WARN socketProgressOpt: Call to recv from 192.168.16.113<47337> failed : Broken pipe
ubuntu2004-113:210748:210781 [0] NCCL INFO misc/socket.cc:46 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO misc/socket.cc:57 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO misc/socket.cc:772 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO proxy.cc:1111 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO transport/net.cc:358 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO transport.cc:174 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO init.cc:1089 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO init.cc:1378 -> 6
ubuntu2004-113:210748:210781 [0] NCCL INFO group.cc:68 -> 6 [Async thread]
ubuntu2004-113:210748:210748 [0] NCCL INFO group.cc:429 -> 6
ubuntu2004-113:210748:210748 [0] NCCL INFO group.cc:115 -> 6
ubuntu2004-113: Test NCCL failure common.cu:973 'remote process exited or there was a network error / '
.. ubuntu2004-113 pid 210748: Test failure common.cu:857
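
When triaging output like this, it can help to separate the `NCCL WARN` lines from the `NCCL INFO` call-stack lines, since the earliest warning is often the root cause and later ones may only be follow-on failures. The sketch below is a hypothetical helper, not part of NCCL or MSCCL; the regular expression assumes the `host:pid:tid [dev] file:line NCCL WARN message` layout seen in the lines above, and `LOG` holds a subset of those lines copied verbatim.

```python
import re

# A subset of the lines from the error output above, copied verbatim.
LOG = """\
ubuntu2004-113:210748:210793 [0] misc/ibvwrap.cc:187 NCCL WARN Call to ibv_modify_qp failed with error Network is unreachable
ubuntu2004-113:210748:210793 [0] NCCL INFO transport/net_ib.cc:579 -> 2
ubuntu2004-113:210748:210793 [0] proxy.cc:1523 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2
ubuntu2004-113:210748:210781 [0] misc/socket.cc:29 NCCL WARN socketProgressOpt: Call to recv from 192.168.16.113<47337> failed : Broken pipe
"""

# Assumed layout: host:pid:tid [dev] source_file:line NCCL WARN message
WARN_RE = re.compile(r"^(\S+):(\d+):(\d+) \[(\d+)\] (\S+):(\d+) NCCL WARN (.*)$")


def nccl_warnings(text):
    """Return (source_file, line_number, message) for each NCCL WARN line, in log order."""
    found = []
    for line in text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            found.append((match.group(5), int(match.group(6)), match.group(7)))
    return found


warnings = nccl_warnings(LOG)
first_file, first_line, first_msg = warnings[0]
print(f"first warning: {first_file}:{first_line}: {first_msg}")
```

On the lines shown, the first warning comes from `misc/ibvwrap.cc` (the `ibv_modify_qp` call reporting "Network is unreachable"), which suggests the InfiniBand/RoCE connection setup fails first and the later socket "Broken pipe" is probably a consequence rather than a separate problem.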

Full error log file:
msccl-2nodes-faillog.txt
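
As a sanity check on the setup in steps 1 to 3, the rank count can be cross-checked against the hostfile. The sketch below is a hypothetical helper that parses the `host slots=N` lines and compares the slot total with the `-np` value. It only handles the explicit `slots=N` form used above, and it assumes the first positional argument (`8`) given to `allreduce_a100_allpairs.py` is the total GPU count.

```python
# Hostfile contents from step 1, copied verbatim.
HOSTFILE = """\
10.0.0.13 slots=4
10.0.0.15 slots=4
"""


def parse_hostfile(text):
    """Map each host to its slots=N value; blank lines and '#' comments are skipped."""
    slots = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, *options = line.split()
        for option in options:
            key, _, value = option.partition("=")
            if key == "slots":
                slots[host] = int(value)
    return slots


slots = parse_hostfile(HOSTFILE)
np_requested = 8   # -np value from the mpirun command in step 3
gpus_per_rank = 1  # -g value passed to all_reduce_perf
xml_gpus = 8       # assumed meaning of the first argument to allreduce_a100_allpairs.py

print("slots per host:", slots)
print("slots match -np:", sum(slots.values()) == np_requested)
print("ranks x GPUs match XML:", np_requested * gpus_per_rank == xml_gpus)
```

Under these assumptions the counts are consistent (2 hosts x 4 slots = 8 ranks, 1 GPU per rank), so a rank or slot mismatch is unlikely to be the cause of the failure.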
