NPKit with nccl-tests: Generated trace file is empty #37

@DevHSA

I am trying to profile an nccl-tests allReduce call: all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 1 -c 1
NCCL version: 2.17.1
NCCL-tests version: 2.13.3

TLDR:

  1. The generated trace file is empty for -g 1. I do not see any errors, and I have also tried enabling all of the NPKIT_FLAGS. The NCCL version in use is the one the NPKit documentation specifies. Is there a particular nccl-tests version that needs to be used, or is this a bug in NPKit's NCCL profiling?

  2. Values other than -g 1 (e.g., -g 2) throw errors. How can I observe the collective communication patterns between multiple GPUs/ranks if I cannot specify more than one GPU in my tests? Why does this limitation exist?

Edit: Before applying the NPKit patch, the tests run for all values of -g.

---------------------------Details----------------------------

First, I wanted to confirm that it works with -g 1. I have set the run parameters in the nccl_test() function in npkit_runner.sh:

npkit_runner.sh

function nccl_test() {
  $1 -b $3 -e $3 -n $7 -w $6 -g 1 -c 1 | tee $9/log.txt
}

The test runs successfully:

root@c11efb988616:/home/NPKit/nccl_samples# bash npkit_launcher.sh 
+ export NCCL_SRC_DIR=/home/ncclold
+ NCCL_SRC_DIR=/home/ncclold
+ export NPKIT_SRC_DIR=/home/NPKit
+ NPKIT_SRC_DIR=/home/NPKit
+ export NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ export NPKIT_RUN_DIR=/home/NPKit/results
+ NPKIT_RUN_DIR=/home/NPKit/results
+ export NCCL_MSG_SIZE=16M
+ NCCL_MSG_SIZE=16M
+ export NCCL_ALGO=Ring
+ NCCL_ALGO=Ring
+ export NCCL_PROTO=Simple
+ NCCL_PROTO=Simple
+ export NCCL_NUM_WARMUPS=0
+ NCCL_NUM_WARMUPS=0
+ export NCCL_NUM_ITERS=10
+ NCCL_NUM_ITERS=10
+ NPKIT_FLAGS_CPU_PREFIX=-DENABLE_NPKIT
+ NPKIT_FLAGS_GPU_PREFIX='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ bash npkit_runner.sh
++ basename /home/nccl-testsold/build/all_reduce_perf
+ npkit_run_tag=all_reduce_perf//Ring/Simple
+ npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ npkit_trace_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ npkit_result_dir=/home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ rm -rf /home/NPKit/results
+ mkdir -p /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ nccl_test /home/nccl-testsold/build/all_reduce_perf /home/ncclold 16M Ring Simple 0 10 /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ /home/nccl-testsold/build/all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 1 -c 1
+ tee /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple/log.txt
# nThread 1 nGpus 1 minBytes 16777216 maxBytes 16777216 step: 1048576(bytes) warmup iters: 0 iters: 10 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 1062079 on c11efb988616 device  0 [0x17] NVIDIA A100 80GB PCIe
NCCL version 2.17.1+cuda12.0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    16777216       4194304     float     sum      -1    266.8   62.88    0.00      0     0.42  40031.53    0.00      0

c11efb988616:1062079:1062079 [0] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

+ cd /home/NPKit/nccl_samples
+ python3 npkit_trace_generator.py --npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple --npkit_event_header_path=/home/ncclold/src/include/npkit/npkit_event.h --output_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ cd /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ tar cvzf npkit_result.tar.gz npkit_event_trace.json
npkit_event_trace.json
+ mv npkit_result.tar.gz /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple

However, the npkit_event_trace.json is empty. Does it have something to do with the init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty warning shown above?

{"traceEvents": [], "displayTimeUnit": "ns"}
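A possible lead, sketched under assumptions: in the trace above, nccl_test receives the dump directory as its eighth argument, but the function body shown earlier never uses $8. Assuming the patched NCCL reads the dump location from an environment variable named NPKIT_DUMP_DIR (inferred only from the warning text, not confirmed against the NPKit sources), a variant of the function that sets it for the test process might look like this:

```shell
# Sketch only. Assumption: the NPKit-patched NCCL reads NPKIT_DUMP_DIR from
# the environment of the test process (inferred from the warning message).
# $1 = test binary, $3 = message size, $6 = warmup iterations,
# $7 = iterations, $8 = NPKit dump directory, $9 = result directory
nccl_test() {
  NPKIT_DUMP_DIR="$8" "$1" -b "$3" -e "$3" -n "$7" -w "$6" -g 1 -c 1 | tee "$9/log.txt"
}
```

The variable is set only for the single command, so it does not leak into the rest of the runner script. Whether this removes the warning and produces dump files has not been verified here.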

When I increase the number of GPUs, for example to -g 4:

npkit_runner.sh

function nccl_test() {
  $1 -b $3 -e $3 -n $7 -w $6 -g 4 -c 1 | tee $9/log.txt
}

The test fails:

root@c11efb988616:/home/NPKit/nccl_samples# bash npkit_launcher.sh 
+ export NCCL_SRC_DIR=/home/ncclold
+ NCCL_SRC_DIR=/home/ncclold
+ export NPKIT_SRC_DIR=/home/NPKit
+ NPKIT_SRC_DIR=/home/NPKit
+ export NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ NCCL_TEST_BIN=/home/nccl-testsold/build/all_reduce_perf
+ export NPKIT_RUN_DIR=/home/NPKit/results
+ NPKIT_RUN_DIR=/home/NPKit/results
+ export NCCL_MSG_SIZE=16M
+ NCCL_MSG_SIZE=16M
+ export NCCL_ALGO=Ring
+ NCCL_ALGO=Ring
+ export NCCL_PROTO=Simple
+ NCCL_PROTO=Simple
+ export NCCL_NUM_WARMUPS=0
+ NCCL_NUM_WARMUPS=0
+ export NCCL_NUM_ITERS=10
+ NCCL_NUM_ITERS=10
+ NPKIT_FLAGS_CPU_PREFIX=-DENABLE_NPKIT
+ NPKIT_FLAGS_GPU_PREFIX='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_ENTRY -DENABLE_NPKIT_EVENT_ALL_REDUCE_RING_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_RECV_REDUCE_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_DIRECT_SEND_FROM_OUTPUT_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_ENTRY -DENABLE_NPKIT_EVENT_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_ENTRY -DENABLE_NPKIT_EVENT_RECV_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_COPY_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_ENTRY -DENABLE_NPKIT_EVENT_RECV_REDUCE_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_ENTRY -DENABLE_NPKIT_EVENT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_ENTRY -DENABLE_NPKIT_EVENT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_ENTRY -DENABLE_NPKIT_EVENT_SEND_FROM_OUTPUT_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_WAIT_PEER_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_ENTRY -DENABLE_NPKIT_EVENT_PRIM_SIMPLE_REDUCE_OR_COPY_MULTI_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_WAIT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL_DATA_PROCESS_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_WAIT_SEND_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_TIME_SYNC_CPU -DENABLE_NPKIT_EVENT_TIME_SYNC_GPU -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_ENTRY -DENABLE_NPKIT_EVENT_PRIM_LL128_DATA_PROCESS_EXIT'
+ export 'NPKIT_FLAGS=-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT'
+ NPKIT_FLAGS='-DENABLE_NPKIT -DENABLE_NPKIT_EVENT_NET_SEND_ENTRY -DENABLE_NPKIT_EVENT_NET_SEND_EXIT -DENABLE_NPKIT_EVENT_NET_RECV_ENTRY -DENABLE_NPKIT_EVENT_NET_RECV_EXIT'
+ bash npkit_runner.sh
++ basename /home/nccl-testsold/build/all_reduce_perf
+ npkit_run_tag=all_reduce_perf//Ring/Simple
+ npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ npkit_trace_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ npkit_result_dir=/home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ rm -rf /home/NPKit/results
+ mkdir -p /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ mkdir -p /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ nccl_test /home/nccl-testsold/build/all_reduce_perf /home/ncclold 16M Ring Simple 0 10 /home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
+ /home/nccl-testsold/build/all_reduce_perf -b 16M -e 16M -n 10 -w 0 -g 4 -c 1
+ tee /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple/log.txt
# nThread 1 nGpus 4 minBytes 16777216 maxBytes 16777216 step: 1048576(bytes) warmup iters: 0 iters: 10 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 1346195 on c11efb988616 device  0 [0x17] NVIDIA A100 80GB PCIe
#  Rank  1 Group  0 Pid 1346195 on c11efb988616 device  1 [0x65] NVIDIA A100 80GB PCIe
#  Rank  2 Group  0 Pid 1346195 on c11efb988616 device  2 [0xca] NVIDIA A100 80GB PCIe
#  Rank  3 Group  0 Pid 1346195 on c11efb988616 device  3 [0xe3] NVIDIA A100 80GB PCIe
NCCL version 2.17.1+cuda12.0

c11efb988616:1346195:1346234 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346234 [2] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346235 [3] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes

c11efb988616:1346195:1346232 [0] include/alloc.h:105 NCCL WARN Failed to CUDA calloc 1048576 bytes
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    16777216       4194304     float     sum      -1   5738.2    2.92    4.39      0   9352.9    1.79    2.69      0

c11efb988616:1346195:1346195 [0] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty

c11efb988616:1346195:1346195 [1] init.cc:1481 NCCL WARN NPKIT_DUMP_DIR is empty
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
+ cd /home/NPKit/nccl_samples
+ python3 npkit_trace_generator.py --npkit_dump_dir=/home/NPKit/results/npkit_dump/all_reduce_perf//Ring/Simple --npkit_event_header_path=/home/ncclold/src/include/npkit/npkit_event.h --output_dir=/home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ cd /home/NPKit/results/npkit_trace/all_reduce_perf//Ring/Simple
+ tar cvzf npkit_result.tar.gz npkit_event_trace.json
npkit_event_trace.json
+ mv npkit_result.tar.gz /home/NPKit/results/npkit_result/all_reduce_perf//Ring/Simple
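In both runs the trace generator is invoked even though NCCL warned that NPKIT_DUMP_DIR is empty, so an empty trace is archived without any error. A small guard before the npkit_trace_generator.py step could make that failure visible. The helper name dump_dir_has_files is hypothetical and not part of NPKit:

```shell
# Hypothetical helper (not part of NPKit): succeeds only if the given
# directory exists and contains at least one entry.
dump_dir_has_files() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}
```

In npkit_runner.sh it could be called as `dump_dir_has_files "$npkit_dump_dir" || echo "no NPKit dump files found" >&2` before the trace is generated, assuming the dump directory variable keeps the name shown in the trace above.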

Any pointers would be helpful.
