GEQRF (and derivatives) use too many workspaces on GPU #110

Description

@abouteiller

Describe the bug

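GEQRF and its derivatives (LQ, SORMQR, etc.) request more than 2 GPU workspaces per GPU stream, which exceeds the limit currently hard-coded in PaRSEC, so the run aborts in parsec_device_pop_workspace.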
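
To make the failure mode concrete, here is a minimal sketch of a per-stream pool with a fixed cap of 2. It is not PaRSEC's code: `stream_pool_t`, `pool_pop`, `pool_push` and `third_pop_fails` are hypothetical names, host `malloc` stands in for device memory, and the count of three simultaneous buffers is only an example of "more than 2".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the limit quoted in the error message; the value is the only
 * thing taken from the report, everything else here is hypothetical. */
#define MAX_WORKSPACES 2

typedef struct {
    void  *slots[MAX_WORKSPACES]; /* lazily allocated scratch buffers */
    size_t slot_size;             /* size of each buffer in bytes */
    int    in_use;                /* buffers currently handed out */
} stream_pool_t;

void pool_init(stream_pool_t *p, size_t slot_size) {
    for (int i = 0; i < MAX_WORKSPACES; i++) p->slots[i] = NULL;
    p->slot_size = slot_size;
    p->in_use = 0;
}

/* Hands out the next buffer, or NULL once the fixed cap is reached. */
void *pool_pop(stream_pool_t *p) {
    if (p->in_use >= MAX_WORKSPACES) return NULL;
    if (p->slots[p->in_use] == NULL)
        p->slots[p->in_use] = malloc(p->slot_size); /* stand-in for device memory */
    return p->slots[p->in_use++];
}

/* Returns the most recently popped buffer to the pool. */
void pool_push(stream_pool_t *p) {
    assert(p->in_use > 0);
    p->in_use--;
}

void pool_fini(stream_pool_t *p) {
    for (int i = 0; i < MAX_WORKSPACES; i++) free(p->slots[i]);
}

/* A kernel that wants three scratch buffers at once: the third request
 * cannot be served. Returns 1 when exactly that happens. */
int third_pop_fails(void) {
    stream_pool_t pool;
    pool_init(&pool, 56 * 56 * sizeof(float)); /* one 56 x 56 float tile */
    void *a = pool_pop(&pool);
    void *b = pool_pop(&pool);
    void *c = pool_pop(&pool);
    int failed = (a != NULL && b != NULL && c == NULL);
    pool_push(&pool); /* release b */
    pool_push(&pool); /* release a */
    pool_fini(&pool);
    return failed;
}
```

In this sketch the third request simply returns NULL; in the run reported below, the equivalent condition is treated as fatal and ends in MPI_ABORT.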

Important note

After #114, this error no longer manifests in normal ctest/CI runs (because the test is forced to run on CPU only), but it can still be reproduced by hand. The fix PR should add a dedicated QR+GPU test that explicitly covers this case.

To Reproduce

Run ctest on Leconte:

SLURM_TIMELIMIT=2 PARSEC_MCA_device_cuda_memory_use=20 OMPI_MCA_rmaps_base_oversubscribe=true salloc -N1 -wleconte ctest --rerun-failed

125/437 Test: dplasma_sgeqrf_shm
Command: "/usr/bin/srun" "./testing_sgeqrf" "-M" "487" "-N" "283" "-K" "97" "-t" "56" "-x" "-v=5"
Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
"dplasma_sgeqrf_shm" start time: Jan 31 19:38 EST
Output:
----------------------------------------------------------
srun: Job 4994 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 4994
[1706747884.458034] [leconte:2566339:0]     ucp_context.c:1081 UCX  WARN  network device 'mlx5_0:1' is not available, please use one or more of: 'docker0'(tcp), 'enp1s0f0'(tcp), 'enp1s0f1'(tcp), 'lo'(tcp)
W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
#+++++ cores detected       : 40
#+++++ nodes x cores + gpu  : 1 x 40 + 0 (40+0)
#+++++ thread mode          : THREAD_SERIALIZED
#+++++ P x Q                : 1 x 1 (1/1)
#+++++ M x N x K|NRHS       : 487 x 283 x 97
#+++++ LDA , LDB            : 487 , 487
#+++++ MB x NB , IB         : 56 x 56 , 32
#+++++ KP x KQ              : 4 x 1
x@00000 parsec_device_pop_workspace: user requested more than 2 GPU workspaces which is the current hard-coded limit per GPU stream
 @parsec_device_pop_workspace:206   (leconte:2566339)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
slurmstepd: error: *** STEP 4994.4 ON leconte CANCELLED AT 2024-02-01T00:38:06 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: leconte: task 0: Exited with exit code 250
<end of output>
Test time =   3.17 sec
----------------------------------------------------------
Test Failed.

Proposed fix

  • Deprecate workspaces in parsec
  • Use the gpu info handles to provide more than 2 workspaces per stream
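
As a rough illustration of the second option, the sketch below shows a per-stream pool that grows on demand instead of stopping at 2. It does not use or model PaRSEC's gpu info handles; `growable_pool_t`, the `gpool_*` functions and `count_successful_pops` are hypothetical, and host `malloc`/`realloc` stand in for device allocations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-stream pool without a fixed cap. */
typedef struct {
    void  **slots;     /* lazily allocated scratch buffers */
    size_t  slot_size; /* size of each buffer in bytes */
    int     capacity;  /* number of entries in slots */
    int     in_use;    /* buffers currently handed out */
} growable_pool_t;

void gpool_init(growable_pool_t *p, size_t slot_size) {
    p->slots = NULL;
    p->slot_size = slot_size;
    p->capacity = 0;
    p->in_use = 0;
}

/* Hands out the next buffer, doubling the slot table when it is full.
 * Returns NULL only if an allocation fails. */
void *gpool_pop(growable_pool_t *p) {
    if (p->in_use == p->capacity) {
        int new_cap = p->capacity ? 2 * p->capacity : 2;
        void **grown = realloc(p->slots, (size_t)new_cap * sizeof(void *));
        if (grown == NULL) return NULL;
        for (int i = p->capacity; i < new_cap; i++) grown[i] = NULL;
        p->slots = grown;
        p->capacity = new_cap;
    }
    if (p->slots[p->in_use] == NULL) {
        p->slots[p->in_use] = malloc(p->slot_size); /* stand-in for device memory */
        if (p->slots[p->in_use] == NULL) return NULL;
    }
    return p->slots[p->in_use++];
}

/* Returns the most recently popped buffer to the pool for reuse. */
void gpool_push(growable_pool_t *p) {
    assert(p->in_use > 0);
    p->in_use--;
}

void gpool_fini(growable_pool_t *p) {
    for (int i = 0; i < p->capacity; i++) free(p->slots[i]);
    free(p->slots);
}

/* Requests n buffers at once and reports how many were served. */
int count_successful_pops(int n) {
    growable_pool_t pool;
    gpool_init(&pool, 56 * 56 * sizeof(float)); /* one 56 x 56 float tile */
    int served = 0;
    for (int i = 0; i < n; i++)
        if (gpool_pop(&pool) != NULL) served++;
    for (int i = 0; i < served; i++) gpool_push(&pool);
    gpool_fini(&pool);
    return served;
}
```

Buffers stay allocated after `gpool_push`, so repeated kernels on the same stream reuse them. A real fix would also have to bound the total against the device memory budget, which this sketch ignores.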

Environment (please complete the following information):

  • Dplasma: 416aec9 (origin/master, origin/HEAD, master) Merge pull request #109 (bugfix: we must count the actual number of cuda devices) from abouteiller/bugfix/dtd_gpu, Aurelien Bouteiller, 22 hours ago
  • Parsec: adabbd4d (origin/master, origin/HEAD, master) Merge pull request #620 from bosilca/fix/osx_warning Thomas Herault 7 days ago
  • config.log: ../configure --prefix=/home/bouteill/parsec/dplasma/build.cuda --with-cuda --without-hip --enable-debug=noisier\,paranoid
Currently Loaded Modulefiles:
 1) ncurses/6.4/gcc-11.3.1-6rvznd           25) berkeley-db/18.1.40/gcc-11.3.1-yl6wjj                49) libvterm/0.3.1/gcc-11.3.1-we43r4
 2) htop/3.2.2/gcc-11.3.1-xm6i3t            26) readline/8.2/gcc-11.3.1-b26lae                       50) lua-lpeg/1.0.2-1/gcc-11.3.1-6e6xv6
 3) nghttp2/1.52.0/gcc-11.3.1-yzhzx5        27) gdbm/1.23/gcc-11.3.1-6u5vme                          51) msgpack-c/3.1.1/gcc-11.3.1-pzscaq
 4) zlib/1.2.13/gcc-11.3.1-uhneca           28) perl/5.38.0/gcc-11.3.1-r63sx3                        52) lua-mpack/1.0.9/gcc-11.3.1-z26msa
 5) openssl/3.1.2/gcc-11.3.1-w3u2b2         29) git/2.41.0/gcc-11.3.1-tx4xbg                         53) tree-sitter/0.20.8/gcc-11.3.1-pgy6wn
 6) curl/8.1.2/gcc-11.3.1-dhcq4d            30) cuda/11.8.0/gcc-11.3.1-vltbfy                        54) neovim/0.9.1/gcc-11.3.1-aro6rp
 7) libmd/1.0.4/gcc-11.3.1-yl2qth           31) libpciaccess/0.17/gcc-11.3.1-qp6jxc                  55) cmake/3.26.3/gcc-11.3.1-6bgawm
 8) libbsd/0.11.7/gcc-11.3.1-rxtb5h         32) hwloc/2.9.1/gcc-11.3.1-hvnu6p                        56) ninja/1.11.1/gcc-11.3.1-qf72ao
 9) expat/2.5.0/gcc-11.3.1-z3mywy           33) numactl/2.0.14/gcc-11.3.1-x35xlq                     57) gmp/6.2.1/gcc-11.3.1-c5vz5h
10) bzip2/1.0.8/gcc-11.3.1-g7buii           34) pmix/3.2.3/gcc-11.3.1-b6ek7p                         58) libffi/3.4.4/gcc-11.3.1-suq3vd
11) libiconv/1.17/gcc-11.3.1-h5tewp         35) slurm/22.05.9/gcc-11.3.1-yqiafz                      59) sqlite/3.42.0/gcc-11.3.1-trzf26
12) xz/5.4.1/gcc-11.3.1-ybherp              36) gdrcopy/2.3/gcc-11.3.1-zm6nhb                        60) util-linux-uuid/2.38.1/gcc-11.3.1-h4vnny
13) libxml2/2.10.3/gcc-11.3.1-jijod2        37) libnl/3.3.0/gcc-11.3.1-s2rfpt                        61) python/3.10.12/gcc-11.3.1-msankb
14) pigz/2.7/gcc-11.3.1-2ysjo2              38) rdma-core/41.0/gcc-11.3.1-zlh7l5                     62) gdb/13.1/gcc-11.3.1-awps3c
15) zstd/1.5.5/gcc-11.3.1-maqtnh            39) ucx/1.14.0/gcc-11.3.1-6ffd5t                         63) libevent/2.1.12/gcc-11.3.1-iqf4hw
16) tar/1.34/gcc-11.3.1-jl543d              40) openmpi/4.1.5/gcc-11.3.1-2rgaqk                      64) tmux/3.3a/gcc-11.3.1-nt2vwg
17) gettext/0.21.1/gcc-11.3.1-sgm6rr        41) gperf/3.1/gcc-11.3.1-lq7yw2                          65) cscope/15.9/gcc-11.3.1-4duk6k
18) libunistring/1.1/gcc-11.3.1-mswbrm      42) jemalloc/5.3.0/gcc-11.3.1-gnjgyl                     66) exuberant-ctags/5.8/gcc-11.3.1-f56ide
19) libidn2/2.3.4/gcc-11.3.1-kp77oe         43) libuv/1.44.1/gcc-11.3.1-ikknoi                       67) intel-oneapi-tbb/2021.10.0/gcc-11.3.1-ptv4p2
20) krb5/1.20.1/gcc-11.3.1-hb7cxy           44) unzip/6.0/gcc-11.3.1-xm5nhk                          68) intel-oneapi-mkl/2023.2.0/gcc-11.3.1-d5uffv
21) libedit/3.1-20210216/gcc-11.3.1-b2res4  45) lua-luajit-openresty/2.1-20230410/gcc-11.3.1-lgkuf6  69) mpfr/4.2.0/gcc-11.3.1-n3mu53
22) libxcrypt/4.4.35/gcc-11.3.1-v7ot4t      46) libluv/1.44.2-1/gcc-11.3.1-pyqvat                    70) mpc/1.3.1/gcc-11.3.1-2x6jci
23) openssh/9.3p1/gcc-11.3.1-jo2led         47) unibilium/2.0.0/gcc-11.3.1-az5pko                    71) gcc/13.2.0/gcc-11.3.1-ir6jns
24) pcre2/10.42/gcc-11.3.1-bk6jhf           48) libtermkey/0.22/gcc-11.3.1-gwvd67

Labels: bug