NVMe layer: --hostnqn passes node name, conflicts with existing NVMe-oF connections #482

@Tydus

Version info:

  • LINSTOR server: 1.33.1
  • LINSTOR client: 1.27.1
  • Kernel: 6.8.0 (Debian 12)
  • Transport: NVMe-oF over RDMA (RoCEv2)

Description:

The NVMe layer in NvmeUtils.java passes --hostnqn=<node-name> when running nvme discover and nvme connect. This causes two problems:

  1. Conflict with existing NVMe-oF connections: On hosts that already have NVMe-oF connections (e.g., to a manually configured nvmet target), the kernel enforces a single hostnqn per hostid. Since the existing connections use the system hostnqn from /etc/nvme/hostnqn, LINSTOR's attempt to use the node name as hostnqn is rejected with Invalid argument.

  2. Non-standard NQN format: LINSTOR node names (e.g., mynode) do not follow the NVMe NQN format (nqn.yyyy-mm.reverse.domain:identifier). While some kernel drivers are lenient about this, it violates the NVMe specification.
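To illustrate the second point, here is a minimal sketch of a loose NQN shape check. The function name `looks_like_nqn` and the regular expression are assumptions made for illustration, not LINSTOR or nvme-cli code; the check only covers the general `nqn.yyyy-mm.reverse.domain:identifier` layout and the 223-byte length limit, not every rule in the NVMe specification.

```python
import re

# Loose shape check only: "nqn." + yyyy-mm + "." + reverse domain,
# optionally followed by ":identifier". Hypothetical helper, not LINSTOR code.
_NQN_PATTERN = re.compile(
    r"nqn\.\d{4}-(0[1-9]|1[0-2])\.[A-Za-z0-9][A-Za-z0-9.-]*(:.+)?"
)
NQN_MAX_BYTES = 223  # maximum NQN length in bytes


def looks_like_nqn(value: str) -> bool:
    """Return True if value has the general shape of an NVMe Qualified Name."""
    if len(value.encode("utf-8")) > NQN_MAX_BYTES:
        return False
    return _NQN_PATTERN.fullmatch(value) is not None
```

A bare node name such as `mynode` fails this check, while a UUID-based value of the form `nqn.2014-08.org.nvmexpress:uuid:<uuid>` passes it.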

Reproduction steps:

  1. On host A, establish an NVMe-oF connection to any target (manually or via another subsystem). This registers the hostnqn from /etc/nvme/hostnqn paired with the hostid from /etc/nvme/hostid.
  2. On host B, create a LINSTOR resource group with --layer-list nvme,storage and spawn a resource so that the NVMe target lands on host B:
    linstor resource-group spawn-resources <rg> test-nvme 1G
    
  3. Create an NVMe initiator on host A:
    linstor resource create --nvme-initiator <hostA> test-nvme
    
  4. LINSTOR runs nvme discover --hostnqn=<hostA-node-name> on host A, which fails because the kernel already has a different hostnqn registered for the same hostid.

LINSTOR output:

# linstor resource create --nvme-initiator <hostA> test-nvme
SUCCESS:
Description:
    New resource 'test-nvme' on node '<hostA>' registered.
SUCCESS:
Description:
    Volume with number '0' on resource 'test-nvme' on node '<hostA>' successfully registered
SUCCESS:
    Added peer(s) '<hostA>' to resource 'test-nvme' on '<hostB>'
ERROR:
Description:
    (<hostA>) Failed to discover NVMe subsystems!
Details:
    Command 'nvme discover --transport=rdma --traddr=<ip> --trsvcid=4420 --hostnqn=<hostA>' returned with exitcode 1.

    Standard out:


    Error message:
    Failed to write to /dev/nvme-fabrics: Invalid argument
    failed to add controller, error invalid arguments/configuration

dmesg on host A:

nvme_fabrics: found same hostid <uuid> but different hostnqn <hostA>
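The conflict reported in dmesg can be checked before LINSTOR runs. The sketch below lists the hostnqn of each existing controller and reports those that differ from a candidate value. It assumes that fabrics controllers expose a `hostnqn` attribute under `/sys/class/nvme/<ctrl>/`, and it simplifies by assuming all controllers share the hostid from /etc/nvme/hostid. The function names are hypothetical.

```python
from pathlib import Path
from typing import Dict


def existing_hostnqns(sysfs_root: str = "/sys/class/nvme") -> Dict[str, str]:
    """Map controller name -> hostnqn for controllers that expose one."""
    found: Dict[str, str] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return found
    for ctrl in sorted(root.iterdir()):
        try:
            # Assumption: only NVMe-oF controllers carry this attribute;
            # local PCIe controllers are skipped via the OSError branch.
            found[ctrl.name] = (ctrl / "hostnqn").read_text().strip()
        except OSError:
            continue
    return found


def conflicting_controllers(candidate: str,
                            sysfs_root: str = "/sys/class/nvme") -> Dict[str, str]:
    """Return controllers whose hostnqn differs from the candidate hostnqn."""
    return {
        name: nqn
        for name, nqn in existing_hostnqns(sysfs_root).items()
        if nqn != candidate
    }
```

On a host in the state described above, calling `conflicting_controllers("<hostA>")` with the LINSTOR node name would be expected to return the manually connected controllers, since they carry the system hostnqn.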

Suggestion:

Reading the hostnqn from /etc/nvme/hostnqn instead, or omitting --hostnqn altogether (nvme-cli automatically reads from /etc/nvme/hostnqn when not specified), would resolve both issues.
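A rough sketch of both options, written in Python for brevity rather than as a patch to NvmeUtils.java. The helper names `read_system_hostnqn` and `build_discover_cmd` are hypothetical; only the nvme-cli flags already shown in the error output above are used.

```python
from pathlib import Path
from typing import List, Optional


def read_system_hostnqn(path: str = "/etc/nvme/hostnqn") -> Optional[str]:
    """Return the system hostnqn, or None if the file is missing or empty."""
    try:
        text = Path(path).read_text(encoding="utf-8").strip()
    except OSError:
        return None
    return text or None


def build_discover_cmd(transport: str, traddr: str, trsvcid: str,
                       hostnqn: Optional[str] = None) -> List[str]:
    """Build the nvme discover argument list.

    --hostnqn is appended only when a value is supplied; otherwise the
    flag is left out so that nvme-cli applies its own default.
    """
    cmd = [
        "nvme", "discover",
        f"--transport={transport}",
        f"--traddr={traddr}",
        f"--trsvcid={trsvcid}",
    ]
    if hostnqn:
        cmd.append(f"--hostnqn={hostnqn}")
    return cmd
```

Passing `hostnqn=read_system_hostnqn()` corresponds to the first option, and leaving `hostnqn` unset corresponds to the second. Neither variant passes the node name.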
