operator_mpi_example.py results in DISTRIBUTED_FAILURE (27) #211

@SzbKrisztian

Hi,

I'm trying to run the operator_mpi_example.py example. It fails with a DISTRIBUTED_FAILURE (27) error at the point where the communicator of the WorkStream is set.

`Rank 0: ===== device info ======
Rank 0: GPU-local-id: 0
Rank 0: GPU-name: NVIDIA A100-SXM4-40GB
Rank 0: GPU-clock: 1410000
Rank 0: GPU-memoryClock: 1215000
Rank 0: GPU-nSM: 108
Rank 0: GPU-major: 8
Rank 0: GPU-minor: 0
Rank 0: ========================
Rank 1: ===== device info ======
Rank 1: GPU-local-id: 1
Rank 1: GPU-name: NVIDIA A100-SXM4-40GB
Rank 1: GPU-clock: 1410000
Rank 1: GPU-memoryClock: 1215000
Rank 1: GPU-nSM: 108
Rank 1: GPU-major: 8
Rank 1: GPU-minor: 0
Rank 1: ========================
Rank 0: Created WorkStream (execution context) on current device.
Rank 1: Created WorkStream (execution context) on current device.
Traceback (most recent call last):
Traceback (most recent call last):
File "/project/home/pr_1sk/operator_mpi_example.py", line 50, in
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/work_stream.py", line 195, in set_communicator
self._handle.set_communicator(comm, provider)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/_internal/library_handle.py", line 108, in set_communicator
cudm.reset_distributed_configuration(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self._validated_ptr, _comm_provider_map[provider], _comm_ptr, _size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "cuquantum/bindings/cudensitymat.pyx", line 180, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
File "cuquantum/bindings/cudensitymat.pyx", line 193, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
check_status(status)
File "cuquantum/bindings/cudensitymat.pyx", line 135, in cuquantum.bindings.cudensitymat.check_status
raise cuDensityMatError(status)
cuquantum.bindings.cudensitymat.cuDensityMatError: DISTRIBUTED_FAILURE (27):
File "/project/home/pr_1sk/operator_mpi_example.py", line 50, in
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/work_stream.py", line 195, in set_communicator
self._handle.set_communicator(comm, provider)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/_internal/library_handle.py", line 108, in set_communicator
cudm.reset_distributed_configuration(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self._validated_ptr, _comm_provider_map[provider], _comm_ptr, _size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "cuquantum/bindings/cudensitymat.pyx", line 180, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
File "cuquantum/bindings/cudensitymat.pyx", line 193, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
check_status(status)
File "cuquantum/bindings/cudensitymat.pyx", line 135, in cuquantum.bindings.cudensitymat.check_status
raise cuDensityMatError(status)
cuquantum.bindings.cudensitymat.cuDensityMatError: DISTRIBUTED_FAILURE (27):

`

I installed the cuquantum package with conda, together with openmpi. I tried to look up the meaning of DISTRIBUTED_FAILURE (27) but found essentially nothing.

Any suggestions on where to start?
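
One possible first check, offered as an assumption rather than a confirmed cause: the traceback ends in `reset_distributed_configuration`, which is where the library has to attach to an MPI backend. The sketch below assumes that cuDensityMat finds its MPI wrapper library through an environment variable named `CUDENSITYMAT_COMM_LIB`. That name is recalled from the cuQuantum documentation and should be verified there. The helper `check_comm_lib` is hypothetical and uses only the standard library.

```python
import os

# Assumption: cuDensityMat locates its MPI wrapper library through this
# environment variable. Verify the name against the cuQuantum documentation.
COMM_LIB_VAR = "CUDENSITYMAT_COMM_LIB"


def check_comm_lib(env=None):
    """Return (ok, message) saying whether the wrapper library path is usable.

    `env` is any mapping of environment variables; it defaults to os.environ.
    """
    env = os.environ if env is None else env
    path = env.get(COMM_LIB_VAR)
    if not path:
        return False, f"{COMM_LIB_VAR} is not set"
    if not os.path.isfile(path):
        return False, f"{COMM_LIB_VAR} points to a missing file: {path}"
    return True, f"{COMM_LIB_VAR} -> {path}"


if __name__ == "__main__":
    ok, message = check_comm_lib()
    print(("OK: " if ok else "PROBLEM: ") + message)
```

Running this under the same launcher as the example (for instance once per rank) would show whether every rank sees the same setting. A passing check does not rule out other causes, such as an MPI build that differs from the one the wrapper library was compiled against.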
