Skip to content

DO-NOT-MERGE: RFC: NVMe-oF orchestrator coexistence — nvme-discoverd, ownership registry, and exclusion list#3442

Draft
martin-belanger wants to merge 11 commits into
linux-nvme:masterfrom
martin-belanger:discoverd-rfcs
Draft

DO-NOT-MERGE: RFC: NVMe-oF orchestrator coexistence — nvme-discoverd, ownership registry, and exclusion list#3442
martin-belanger wants to merge 11 commits into
linux-nvme:masterfrom
martin-belanger:discoverd-rfcs

Conversation

@martin-belanger

@martin-belanger martin-belanger commented Jun 11, 2026

Copy link
Copy Markdown

This PR is for review only and will not be merged. It contains four RFC documents proposing a coordinated set of features for NVMe-oF orchestrator coexistence in nvme-cli 3.0. Please use inline comments to provide feedback on specific sections.

Tip: Markdown files are shown as raw text by default in the diff view. Click the "Display the rich diff" button (the document icon at the top-right of each file) to render them as formatted text — much easier to read.


Background

As NVMe-oF deployments grow, hosts increasingly run multiple tools that manage NVMe-oF connections: nvme-stas, dracut/initramfs scripts, and soon nvme-discoverd. Without coordination, these orchestrators can conflict — one may disconnect a controller that another is actively managing, or connect to a controller that another has deliberately excluded.

This RFC set proposes two libnvme building blocks to solve the coexistence problem (ownership registry and exclusion list), and a new daemon orchestrator (nvme-discoverd) that puts them to use.


The four RFCs

rfc-nvme-orchestrator-coexistence.md — Start here

The top-level document. Frames the two distinct conflict scenarios (accidental disconnect, accidental connect), introduces the two prevention mechanisms (ownership registry and exclusion list), defines a three-tier orchestrator hierarchy (raw commands → manual orchestrators → daemon orchestrators), and shows how nvme-discoverd and nvme-stas naturally partition work without requiring IPC coupling between them.

rfc-nvme-registry.md — Ownership registry

A lightweight, cooperative registry under /run/nvme/registry/ that lets orchestrators declare ownership of connected controllers. nvme disconnect-all consults the registry before acting so it never disconnects a controller managed by a running daemon. The registry is a libnvme building block: a C API (libnvmf_registry_*), a Python binding, and a nvme registry CLI command family. An implementation PR already exists: #3425.

rfc-nvme-exclusion.md — Exclusion list

A human-administered exclusion list at /etc/nvme/exclusions/ that prevents orchestrators from auto-connecting to controllers the administrator wants excluded. Its design center is auto-discovered controllers (mDNS, CDC DLP) where there is no configuration entry to remove. Managed via nvme exclusion CRUD commands. Enforcement is cooperative — each orchestrator reads the list and skips matching controllers; libnvme does not enforce it.

rfc-nvme-discoverd.md — nvme-discoverd

A new daemon proposed for nvme-cli 3.0. It manages NVMe-oF connections for statically configured controllers, NBFT boot controllers, FC-discovered controllers, and (in a future release) mDNS-discovered controllers. Key design choices:

  • Connect-only by design — never issues disconnects; eliminates the entire connect/disconnect ordering complexity (CtrlTerminator problem). TP8010 fabric zoning is out of scope; that belongs to nvme-stas.
  • One systemd transient unit per controller — nvme-discoverd never calls libnvmf_connect() directly; blocking /dev/nvme-fabrics writes happen in child processes managed by systemd. Achieves Daniel's no-threading goal.
  • Registers ownership via --owner discoverd on each generated unit; NBFT reconnects use --owner nbft to preserve the lifetime invariant set at boot.
  • Respects the exclusion list before each connect and reconnect; NBFT controllers bypass the check unconditionally.
  • Coexists with nvme-stas through natural discovery partitioning: nvme-stas owns mDNS/TCP, nvme-discoverd owns NBFT/FC/RDMA and manual config. The registry-check rule in nvme-stas eliminates bounce loops without any IPC coupling.

What feedback is sought

  • Architecture: Does the three-tier orchestrator model match how you see these tools being used in practice?
  • Ownership registry: Any concerns with the directory layout, API shape, or cooperative-only enforcement model?
  • Exclusion list: File format, use cases, nvme exclusion command design?
  • nvme-discoverd: Transient unit design, state file layout, startup reconciliation logic, retry policy, NBFT handling?
  • Coexistence rules: The owner=nbft lifetime invariant, NBFT exclusion bypass, unowned-connections-are-fair-game semantics — do these cover the cases you care about?

Related

Martin Belanger added 2 commits June 8, 2026 14:13
RFCs submitted for review only. Not intended to be merged.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Martin Belanger added 6 commits June 11, 2026 18:17
Registry:
- Show only the guarded udev rule in §4; remove the naive unguarded
  version that preceded it — a skimming reader would copy the wrong one
- Aperiodic audit is now bidirectional: re-assert ownership for live
  controllers with missing entries, not only remove stale ones
- Document the residual TOCTOU race (dangerous direction) explicitly

Exclusion list:
- Fix Python match semantics: None/NULL caller params skip the
  comparison (same rule as C); they do not block a match
- Clarify host-iface= scope: only matches interface-pinned connections;
  manual controller= entries without host-iface= are not matched
- Add address normalization note (inet_pton/inet_ntop, not strcmp)
- connect-all --nbft is exempt from the exclusion list

Orchestrator coexistence:
- Scope "no D-Bus signal" to between orchestrators
- "No special-case logic required" scoped to disconnect logic in nvme-stas
- connect-all --nbft exemption noted in Tier 2 summary
- Add versioning note: these behaviors co-ship in nvme-cli 3.0 /
  nvme-stas 3.0; earlier pairings do not provide the guarantees

nvme-discoverd:
- Add RuntimeDirectoryPreserve=yes; without it systemd removes
  /run/nvme/discoverd on every stop/crash, destroying devid files
  needed by active ExecStop= lines and state files for crash recovery
- Add registry ownership check before adopting pre-existing connections:
  skip any controller owned by neither discoverd nor nbft
- Correct varlink -> D-Bus throughout: StartTransientUnit, StopUnit,
  RestartUnit, and JobRemoved are all org.freedesktop.systemd1.Manager
- Scope "never issues disconnects" to steady-state operation
- Clarify referral DC fizzle-out is intentional; add open questions:
  stale-cache aging (#2), shutdown vs. mounts (#3), NBFT root (#4)
- Minor: "three lines" -> "four lines" for --devid-file specifier;
  GPL-2.0-only (not -or-later); inline unit comments moved to their
  own lines (systemd unit syntax has no end-of-line comments)

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Add §3.9 explaining the varlink/D-Bus situation for the systemd
interface. Not depending on D-Bus was a design wish — systemd is
migrating to varlink and new code should follow. However, varlink
is not yet sufficient: io.systemd.Unit.StartTransient was only added
in systemd v260 (March 2026), and StopUnit, RestartUnit, and
ResetFailedUnit have no varlink equivalent at all.

Since discoverd already depends on libsystemd for sd_event, using
sd-bus (systemd's own D-Bus implementation, part of libsystemd)
adds no new library dependency. The D-Bus usage is scoped to the
systemd interface only; discoverd's own client-facing socket (§3.7)
remains varlink-only.

Add open question #5 to track the future migration once the varlink
unit lifecycle API is complete.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Content fixes, verified against the kernel, systemd, libnvme, and
nvme-stas sources where applicable:

- io.systemd.Unit.StartTransient is in systemd v261 (still unreleased),
  not v260 as previously stated; v261 still lacks StopUnit, RestartUnit,
  and ResetFailedUnit, strengthening the D-Bus rationale (§3.9)
- a unique-NQN DC gets the kernel's 5 s NVME_DEFAULT_KATO, not a dead
  session; --keep-alive-tmo=30 normalizes the DC kato regardless of NQN
- FC WWN traddr comparison is case-insensitive string equality; nothing
  strips colon separators
- TID hash field order now matches nvme-stas staslib/trid.py exactly
- atomic write protocol documented as mkstemp() with random suffix,
  matching the implementation
- disconnect-all confirmation: prompt on TTY for both --force and
  --owner; non-interactive invocations proceed without prompting

Design additions from review:

- transient units carry Before=nvme-discoverd.service so discoverd
  stops before shutdown-time disconnects, preventing reconnect attempts
  (and FC kickstart re-issue) against a shutting-down system
- NBFT entries can be Discovery Controllers, not just IOCs; the
  --owner nbft substitution applies to both unit types
- host-traddr= exclusion entries provide interface exclusion for RDMA
  and FC, where host-iface does not apply
- registry §4.4 covers both initramfs boot paths: NBFT and the FC
  kickstart (unowned until discoverd adopts them at startup)
- the recycled-devid stale-entry edge case requires the udev rule race
  and a libnvme bypass to coincide; normally the rule has already
  cleaned up

Readability: split large sections into topical subsections (registry
§1/§4, exclusion §3/§7, discoverd §3.2, coexistence §5) and break up
long paragraphs throughout. No content changes from restructuring.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-fable-5 [Claude Code]
Add §11 "Future Release: DC Retention Policy" documenting how
discovery-derived configuration is retained after a discovery source
becomes unavailable.

Introduce `discovery-retention-time` (replacing the mDNS-only
`zeroconf-stale-timeout`) as a unified parameter covering all dynamic
discovery sources: mDNS, referral DCs, and FC kickstart DCs. Statically
configured and NBFT-derived DCs are retried indefinitely — they represent
explicit intent and are not subject to the retention timer.

Introduce `fc-kickstart-interval-minutes` for periodic FC fabric probing,
motivated by the equipment replacement scenario where a new DC may have a
different address/NQN and can only be discovered via an active kickstart.

Update §4 and §7.1 to note that FC kickstart has no TID cache in the
initial release, but the retention policy will require one. Update §5
retry policy to reflect the static vs. dynamic DC distinction. Align
§11.1 timer-start semantics with §11.2 (timer starts on source
disappearance regardless of connection state).

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
Periodic FC kickstart is opt-in, not on by default. nvme-discoverd ships
on all Linux systems — enabling periodic fabric probing by default on
laptops and desktops with no FC infrastructure would be wasteful and
surprising.

Use 0 to disable (consistent with systemd's convention for interval
parameters, e.g. WatchdogSec=0). 0 was previously invalid; infinity was
the disable value. Any value >= 1 is a valid interval in minutes.

Fix the consistency issue introduced by the default change: soften "not
acceptable in a production environment" to "may not be acceptable in an
FC production environment", and reframe the orthogonality paragraph to
read naturally with the default-is-0 framing.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
The entry-ID hash was unnecessary complexity: exclusion list management
is human-only, so a sequential number scoped to a single interactive
`nvme exclusion remove` invocation suffices, and `libnvmf_exclusion_remove()`
now matches by exact entry-string content instead. The Python bindings
no longer expose exclusion_add/remove/create/delete at all — providing a
programmatic management API would contradict the human-administered
design; only the read-only exclusion_match/lists/entries() remain.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-sonnet-4-6 [Claude Code]
@igaw igaw added the rfc For tracking discussions new features etc. label Jun 16, 2026
@igaw igaw marked this pull request as draft June 16, 2026 18:54
Martin Belanger added 3 commits June 16, 2026 22:05
The RFC referenced the legacy NVMe-oF autoconnect components only in
passing (§1, §7.1), with no single place describing what nvme-cli
installs today, when each piece fires, and how nvme-discoverd subsumes
it. Add §12 with a full inventory and a clear split between components
that establish connections (replaced) and those doing orthogonal tuning,
key provisioning, or interface naming (kept).

Document the NBFT late-connect path in particular: nvmf-connect-nbft.service
has no [Install] section and is driven only by a NetworkManager dispatcher
script, so it never fires under systemd-networkd or other managers.
discoverd's retry loop covers the same case manager-agnostically, making
the late-connect a latency optimization rather than a correctness
requirement.

Renumber Open Questions to §13 and Glossary to §14.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Expand the RFC with reviewer-driven clarifications:

- §6.1: explain why discoverd uses its own discoverd.conf rather than
  reusing config.json (json-c dependency; JSON has no comments) or
  discovery.conf (DC-only; cannot express IOCs or global toggles).
- §12.1/§12.2: nvmf-connect.target was only a collective handle for the
  daemon-less design; the discoverd daemon is that coordinator, so no
  target equivalent is needed.
- §12.2: scope the autoconnect-rule replacement to the running system.
  Initramfs (Phase 1) connect is dracut's job (74nvmf, formerly 95nvmf),
  which has never used 70-nvmf-autoconnect.rules. Note that
  StartTransientUnit works over systemd's private socket without the
  D-Bus daemon, so discoverd is kept out of Phase 1 by design, not
  impossibility. Recommend removing the dead 70-nvmf-autoconnect.conf
  snippet from nvme-cli.
- §12.5: emphasize systemd-networkd as the common NetworkManager
  alternative that the legacy NBFT dispatcher path does not cover.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Reconcile the exclusion RFC with what was actually built (review item L13):

- §6.1: libnvmf_exclusion_match() takes a struct libnvmf_tid * rather than
  seven positional string arguments — fewer call-site mistakes.  Document the
  new libnvmf_exclusion_entry_valid() helper used to pre-validate hand-edited
  files without filesystem side effects.
- §6.3: mark the proposed key_value_list_parse()/_free() utilities as
  intentionally not provided; libnvmf_tid_parse() already covers parsing a
  semicolon-separated key=value string, and the matcher parses entries
  internally.
- §5: show the actual --name/--entry option form instead of positional
  <name>/<entry> arguments.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

discoverd provides the unit name to `StartTransientUnit`; systemd does not generate one. The name is derived from the connection's TID — the tuple `(transport, traddr, trsvcid, subsysnqn, host_traddr, host_iface, host_nqn)` — by concatenating the fields in that canonical order and computing an MD5 hash over the result, encoded as 32 lower-case hexadecimal characters. The unit name is `nvme-discoverd-<32hexchars>.service`.

This is the same approach (and the same field order) as nvme-stas's TID hashing in `staslib/trid.py`, and produces a stable, deterministic name with no coordinator state needed. MD5 here is not a security primitive — collision resistance against an adversary is not required — so discoverd can embed a small standalone MD5 implementation with no crypto-library dependency and no FIPS-mode concern.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see having our own implementation of MD5 as necessary since we already depend on openssl for fabrics support. Also, why not use SHA1 for example?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things.

First, nvme-cli does not unconditionally depend on openssl — it's an optional build. openssl is a meson option (auto, and can be explicitly disabled), and fabrics builds fine without it; you only lose TLS and DH-HMAC-CHAP authentication, not the transport. discoverd has to work in those builds too, so I deliberately did not tie a core naming operation to openssl being present.

Second — and more to the point — the hash here is not cryptographic. It only turns a long, variable-length identifier into a short, stable, filesystem- and systemd-safe token for a unit/file name. There's no adversary, no security property, and no collision-resistance requirement in the cryptographic sense — I just need something fast, reliable, and dependency-free. So MD5 vs SHA1 vs anything else is immaterial, and I'm not attached to MD5 at all.

Given that, the cleanest answer is probably neither MD5 nor openssl: nvme-cli already ships a self-contained non-cryptographic hash (util/crc32, plus util/base64), so the token can be derived from that — no new dependency, and no hand-rolled crypto. Happy to use whichever in-tree option you prefer, as long as it doesn't make the unit naming depend on openssl.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, as long the length of the created hash value is relatively small :)

Pure md5 creates 16 byte hashes which seems to be the shortest hashing algorithm. So we might even want to cut off some of the bytes.

Or we just try it with crc32 as this gives us 4 bytes and see how this goes.

This is relatively easy to change later.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on dropping MD5 for a small in-tree non-cryptographic hash; the only refinement is width. Pure crc32 is 32 bits → 8 hex chars, but at our lab scalability reference (~200 concurrently-connected controllers) the birthday-bound collision probability is ~1 in 215,000 — closer than I'd like for something that names a system unit. Widening to 48 bits → 12 hex chars costs only 4 more characters and drops that to ~1 in 14 billion at 200 controllers. So I'll use a 48-bit FNV-1a token (truncated) rather than 32-bit crc32 — same idea (small in-tree hash, no crypto dependency, easy to change later), just a touch wider. I'll also add an at-creation collision check: discoverd already holds the TID→unit-name map in memory, so any collision is detected and logged rather than silently leaving a controller unmanaged. I'll update §3.2.1 from MD5/32-hex to a 48-bit/12-hex hash with that guard.


#### 3.2.1 Unit naming

discoverd provides the unit name to `StartTransientUnit`; systemd does not generate one. The name is derived from the connection's TID — the tuple `(transport, traddr, trsvcid, subsysnqn, host_traddr, host_iface, host_nqn)` — by concatenating the fields in that canonical order and computing an MD5 hash over the result, encoded as 32 lower-case hexadecimal characters. The unit name is `nvme-discoverd-<32hexchars>.service`.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just personal preference, but I generally don't like seeing things on my system named with cryptic names like UUIDs and/or hashes. I would like to get a clue what subsystems am I connected to immediately when running systemctl status.

How about we use nvme-discoverd-<NQN>-<hash>.service unit name format, where we'd only use a couple of digits from the hash in a similar way to git log --oneline?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, nqn are potentially very long. I've played with a simple mapping which included the subsysnqn and was not too happy with the result either because the service name was well above 120 chars.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building on Daniel's point — putting the NQN in the unit name has two problems:

  • Length. NQNs are spec-limited to 223 bytes, so nvme-discoverd-<NQN>-<hash>.service can run well past the 255-byte filename limit, and is unwieldy long before that (Daniel's >120-char experience).
  • Escaping. NQNs are full of characters systemd must escape in a unit name — : and . (e.g. nqn.2014-08.org.nvmexpress:uuid:…) become \x3a etc. So the "readable" NQN isn't actually readable once it's a unit name; it ends up both long and mangled, which defeats the purpose.

But I do want to solve the real need — seeing what you're connected to from systemctl status. The clean way is to keep the unit name short and stable (hash-based) and put the human-readable identity in the unit's Description=, which is exactly what systemctl status prints. You'd get something like:

nvme-discoverd-<short-hash>.service - Discovery: nqn.2014-08.org.nvmexpress.discovery @ 192.168.1.3:8009

i.e. the clue you want, in the place you'd look for it, without paying the length/escaping cost in the unit name. Would that address the concern?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the escaping of the nqn was also not nice. I think if we knew the ctrl name before hand we would just name it nvme-discoverd-nvme$x and the rest in the description. In this sense the hashing of the tuple is generating a stable id for the service which makes sense to me. As long the hash is relative short.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on both counts. The hash is precisely there to get a stable id before the device name exists (we can't name the unit nvme-discoverd-nvmeX because X isn't assigned until after the connect completes). And yes, short: I'll use a 48-bit hash rendered as 12 hex characters (nvme-discoverd-<12hex>.service), which stays compact while keeping collisions negligible even at a few hundred controllers (width rationale in my reply on the hash-algorithm thread).


The ownership ambiguity also undermines CDC fabric zoning (TP8010): nvme-stas can only enforce zoning on controllers it owns — if nvme-discoverd wins the connection race for an mDNS-discovered controller, that controller carries `owner=discoverd` and nvme-stas cannot disconnect it even if the CDC's zone policy demands it.

nvme-stas actively detects this misconfiguration: at startup it reads `/etc/nvme/discoverd.conf` and emits an error-level journal entry if `zeroconf=true` is set, prompting the administrator to disable mDNS in one of the two daemons. The defaults already avoid the conflict in most deployments: nvme-stas enables mDNS by default; nvme-discoverd disables it by default (`zeroconf=false`). The administrator must make a deliberate choice to enable mDNS in nvme-discoverd, and should do so only when nvme-stas is not installed on the same host.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think nvme-stas should be reading /etc/nvme/discoverd.conf to get such information. Configuration files of other programs should not be of concern to nvme-stas. This would create a not so pretty dependency in nvme-stas on the format of the configuration file.

Wouldn't it be better if nvme-stas got this information from nvme-discoverd through some IPC primitive?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Reading /etc/nvme/discoverd.conf from nvme-stas was a shortcut chosen purely for implementation simplicity, and you've put your finger on its real flaw: it couples nvme-stas to discoverd's config-file format, so any change to that format silently breaks the check — no error, just a mismatch that no longer gets detected. That's brittle.

The robust answer is the IPC one you suggest. nvme-discoverd already exposes a varlink interface, so nvme-stas should query the state it cares about (whether discoverd has mDNS/zeroconf enabled) over that socket rather than parsing a file it doesn't own. It costs more code on both sides — a varlink method in discoverd and a client call in stas — but it's the correct boundary, and it doesn't break when either side changes its config syntax. I'll update the RFC to specify the IPC approach and drop the config-file read.


1. The existing udev rules are complex and hard to maintain.
2. There is no retry on connection failure.
3. Support for mDNS/DNS-SD discovery (TP8009) — planned for a future release (see §10. Future Release).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the first point needs to be in the documentation.
IMO,, the main points are

  • Static configuration, no filtering, exclusion
  • manages global resource without 3rd party to hook into

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — I'll reframe around what discoverd is rather than editorializing about the udev rules. The positioning you list is the right spine:

  • it is a static-configuration connector — no DLP-delta reconciliation, no per-controller filtering/exclusion logic of its own (exclusion is a separate, system-wide mechanism it merely consults);
  • it manages a global resource (the host's fabric connections) directly, without requiring a third party to register callbacks or hook into it.

I'll keep retry-on-failure and the planned mDNS/TP8009 support as concrete forward-looking motivators, and drop the "udev rules are complex" point as a stated motivation.

2. There is no retry on connection failure.
3. Support for mDNS/DNS-SD discovery (TP8009) — planned for a future release (see §10. Future Release).

It ships inside `nvme-cli` (not a separate package), is default-on, and targets the nvme-cli 3.0 / SUSE sl16.2 release cycle.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not mention the downstream packaging plans.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This is an upstream RFC; the SUSE sl16.2 release-cycle reference doesn't belong here. I'll remove it and keep the targeting generic (nvme-cli 3.0).


**Some design choices in the initial release:** the exclusion list mechanism, the `zeroconf` config knob, and the `zeroconf=false` default — are deliberately shaped by the anticipated future mDNS support, even though mDNS itself is not yet implemented.

nvme-discoverd is **connect-only by design**. It never disconnects a live controller during steady-state operation; disconnects occur only via systemd unit stop at shutdown (`ExecStop=`), and `StopUnit` is invoked only for devices already removed by the kernel. This eliminates the entire connect/disconnect ordering complexity at the cost of TP8010 Fabric Zoning support, which belongs to nvme-stas.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvme-discovered does the discovery and creates service files which handle the connection. Thus I am not sure what the 'connect-only by design'.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair — the phrase is doing too much work. What I mean is narrower than it reads: discoverd never disconnects a live controller in steady state. It has no stateful DLPE-delta path that decides "this controller dropped out of the log page, tear it down" — that's TP8010 reconciliation, which belongs to nvme-stas. Disconnects happen only as a side effect of a systemd unit stop (shutdown / manual stop), never as a discovery-driven decision.

So "connect-only" is about the disconnect policy, not about discoverd being passive — you're right that it discovers and drives connections. I'll reword it to something like "non-disruptive: discoverd never tears down a live controller in response to discovery state" so the meaning is unambiguous.

Comment thread RFCs/rfc-nvme-discoverd.md

Every write to `/dev/nvme-fabrics` blocks the calling process in uninterruptible D state until the kernel completes or times out the connection. To prevent this from stalling the event loop, nvme-discoverd never writes to `/dev/nvme-fabrics` directly. Instead, it delegates all connections — both DC and IOC — to child processes managed by systemd transient units.

A transient unit is created via the D-Bus method `org.freedesktop.systemd1.Manager.StartTransientUnit`. No unit file is written to disk and no daemon-reload is required; the unit exists only in systemd's memory. `Type=oneshot` is used because the connection process exits immediately after writing to `/dev/nvme-fabrics` — the kernel maintains the connection afterwards. `RemainAfterExit=yes` keeps the unit active after the process exits so that systemd continues to consider it part of the ordering graph at shutdown time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice! Let's see how this goes. A quick documentation on StartTransientUnit gave me also an AI answer which claimed these service types can't be controlled with 'systemctl start/stop'. Which kind makes sense. Anyway, let's give this a try and see how it goes.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AI answer is wrong, or at least imprecise. A transient unit created via StartTransientUnit is a normal unit in systemd's table once created — systemctl stop nvme-discoverd-<hash>.service works and is exactly what runs the ExecStop= disconnect, both at shutdown (via the Before=/After= ordering) and on demand. RemainAfterExit=yes is what makes stop meaningful for a Type=oneshot unit: without it the unit would already be inactive after ExecStart= exits and there'd be nothing to stop. What you can't do is systemctl start it again after it's been garbage-collected (it has no on-disk unit file) — that's discoverd's job via StartTransient/RestartUnit. So: stop yes, restart-from-scratch via systemctl no. Agreed it's worth a real try — I'll note this distinction in §3.2.


#### 3.2.3 Disconnect at unit stop

The actual disconnect is performed by `ExecStop=`, which runs `nvme disconnect -d <device>` before systemd proceeds to deactivate `network.target`. Without `ExecStop=`, stopping the unit would be a no-op — there is no process to kill in a `Type=oneshot` unit, and the kernel connection would remain alive when the network interface tears down, causing I/O errors on in-flight NVMe requests.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will not work. When the service file is created with nvme connect, we don't know the device name yet. Thus service file can't have a nvme disconnect -d /dev/nvme$x in it. This was the reason why I made the nvme connect a deamon.

I think what we could/should do first, is to extend nvme disconnect to accept the exact same arguments nvme connect accepts which then are used to lookup the controller to shutdown. This would avoid the daemonizing of the nvme connect

We can still implement a nvme-connectd service if needed. But if we keep it simple with nvme-cli, I would like to try the nvme disconnect accepting all the connect args.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, the nvme-connectd might make even sense for retry/error logic. nvme connect does not have any retry loop at this point and we don't handle transient errors. For my plans to make FC writes to '/dev/nvme-fabrics' blocking it is a necessity to handle transient errors.

OR another idea is to use the registry to store some ID in there which we use to identify the device to disconnect. Not so sure about this though as it only solves half the problem.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is already solved in the RFC — it's the point of §3.2.4 ("Capturing the device name") and §3.2.5, which come right after this line, so it's easy to have read past. The unit does not bake a device name into ExecStop=. Instead:

  • ExecStart= carries a new nvme connect --devid-file=%t/nvme/discoverd/units/%N.devid option. On success nvme connect writes the kernel-assigned nvmeX name into that file. For an already existing controller (idempotent) the name comes from libnvme's sysfs scan, so the file is populated in that case too.

  • ExecStop= reads it back and disconnects by that name:

    ExecStop=-/bin/sh -c 'DEV=$(cat %t/nvme/discoverd/units/%N.devid 2>/dev/null); \
        [ -n "$DEV" ] && nvme disconnect -d $DEV'

The %t (runtime dir) and %N (unit name) specifiers make the same literal path valid in ExecStart=, ExecStartPost=, ExecStop= and ExecStopPost=, so the device name is handed from connect to disconnect through a per-unit file with no value embedded at unit-creation time and no daemon in the loop. RuntimeDirectoryPreserve=yes keeps the file alive across a discoverd restart (§3.2.2, §3.8).

So the device name is known by the time ExecStop= runs — it's captured at connect time and persisted per-unit. This is exactly the problem that pushed you toward daemonizing nvme connect, and the --devid-file handoff avoids needing the daemon for this purpose.

On your alternative — teaching nvme disconnect to take the full nvme connect argument set and re-derive the controller by matching the TID in sysfs — I think that's a genuinely good addition on its own merits (it's the natural symmetric CLI, useful to humans and to any caller that didn't capture the device name). I'd be happy to see it land. For discoverd specifically I'd still prefer --devid-file, because it disconnects the exact device this unit connected rather than re-resolving a TID→device match at stop time (which has to handle "what if two controllers share a TID view in sysfs" and "what if the match is ambiguous"). The two aren't mutually exclusive: --devid-file is the precise path for the unit; arg-based nvme disconnect is a nice general capability. If you'd rather not add --devid-file at all, arg-based disconnect is a workable fallback for discoverd — it just trades a captured-name lookup for a re-derivation.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding nvme-connectd:

Two separate things:

  • Retry / transient-error handling. For discoverd's purposes the retry loop lives in the orchestrator, not in nvme connect: a failed connect leaves the unit failed, and discoverd's reconnect logic (RestartUnit / re-StartTransient on the next trigger, §3.2.6, §3.5) drives the retry. So discoverd doesn't need nvme connect to grow a retry loop. That said, your FC-blocking-write work is a different motivation — making /dev/nvme-fabrics writes block and handling transient FC errors at the connect layer is valuable independent of discoverd, and nvme-connectd may well be the right home for it. I'm not arguing against nvme-connectd in general; I'm arguing discoverd doesn't require it just to know the device name.

  • Registry-stored ID. The ownership registry is already keyed by the nvmeX device name (§3.6, same convention), so the "ID to identify the device to disconnect" would be the device name we're already capturing in the .devid file — storing it a second time in the registry would be redundant, and as you note it only solves half the problem (you still need to capture it at connect time, which is the --devid-file step). So I'd keep the device-name handoff in the per-unit .devid file rather than route it through the registry.


The exception is device removal by the kernel: when a `KOBJ_REMOVE` event fires for a controller, discoverd's event handler removes the state directory from `controllers/` and the corresponding `units/%N.devid` file (§3.5). All other creates and deletes are performed by the unit itself.

`TimeoutStopSec=10` bounds the shutdown wait: `nvme disconnect` on an unresponsive target typically takes 1–3 s; without a bound, systemd's 90 s default would apply. `CollectMode=inactive-or-failed` tells systemd to garbage-collect failed transient units automatically once no job is pending — without it, units whose initial `nvme connect` failed would linger in the unit table until discoverd explicitly called `ResetFailedUnit`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When testing with larger setup (>= 200 namespaces, 8 paths, 80 CPUs) , 'time nvme disconnect' gives values in the range of 30-40 seconds without any error. So this could easily run way longer. So we really need to figure out how to handle this scenario here.

Obviously, we should also look into the making the disconnect path faster. I think the main reason why this is so slow is the extremly heavy use of rcu_sync in the shutdown path. IIRC there are 5 rcu_sync in this path per namespace...

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good data — and it directly invalidates the TimeoutStopSec=10 I put in §3.2.3/§3.4. If a healthy disconnect can take 30–40 s (and more on larger configs), a 10 s stop bound would SIGTERM the disconnect mid-flight and leave the controller half-torn-down — worse than no bound. I'll fix this:

  • Raise the default substantially (well above your observed worst case) rather than the 1–3 s assumption the draft was written against, and
  • make it configurable (a discoverd.conf knob, or just inherit a generous systemd default) so large deployments can tune it, instead of hard-coding a single number.

The point of the bound was only to avoid systemd's 90 s default hanging shutdown on an unresponsive target — but a slow-but-progressing disconnect is the common case at scale, so the bound has to sit above realistic completion time, not below it. I'll rework §3.2.3 to say that and drop the "1–3 s typical" claim.

The kernel-side rcu_sync cost in the disconnect path is a separate (and very worthwhile) optimization target, but it's orthogonal to the RFC — discoverd has to tolerate whatever the disconnect latency is, and the fix here is the timeout policy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

rfc For tracking discussions new features etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants