122 changes: 122 additions & 0 deletions libnvme/REGISTRY.md
@@ -0,0 +1,122 @@
# NVMe Controller Ownership Registry

The registry records **which orchestrator owns each connected NVMe-oF controller**. It is a small, cooperative coordination layer that lets independent tools share a host without stepping on each other's controllers — most importantly, so that a sweeping command like `nvme disconnect-all` does not tear down a connection some other component depends on.

> The code is the source of truth. This document summarizes behavior and intent; for exact signatures see the header kdoc in `src/nvme/registry.h` and the `nvme-registry-*` man pages.

## Why it exists

NVMe-oF connections on a host are rarely managed by a single actor. Independent **orchestrators** — agents that decide on their own which controllers to connect and disconnect — coexist on the same machine and share one flat device namespace (`/dev/nvmeX`), with no record of who created or manages each controller.

This holds even on a plain system with no daemons installed. There are already at least two orchestrators: the **initramfs** (NBFT and FC-kickstart connections made during early boot) and a **human** running `nvme connect-all` / `nvme disconnect-all`. Installing `nvme-discoverd` or `nvme-stas` only adds more.

Not every command is an orchestrator:

- `nvme connect` / `nvme disconnect` — single, targeted actions; here the *human* is the orchestrator, deciding what to connect or disconnect.
- `nvme connect-all` / `nvme disconnect-all` — orchestrating commands that, once invoked, decide on their own. `connect-all` reads a discovery controller's discovery log page (DLP), connects every DLP entry, and follows referrals into further discovery controllers, recursing through several layers; `disconnect-all` tears down across the whole host.
- `nvme discover` — in between: it reads a discovery controller's DLP and prints it, but does not connect DLP entries or follow referrals. It is read-mostly; its only state-changing option is `--persistent`, which keeps the discovery connection itself.

The trigger was [issue #2913](https://github.com/linux-nvme/nvme-cli/issues/2913): `disconnect-all` has host-wide scope but no way to tell a boot-critical NBFT connection — or one a daemon depends on — from a throwaway manual connect, so it would disconnect them all. To coordinate, an orchestrator must know who owns what.

**This is why the registry is a new kind of state for `nvme`.** `nvme` already reads configuration (host identity, saved connections) and kernel state from sysfs, but it kept no memory of what it had *done*: each invocation was independent. The registry is that memory. It is runtime state that `nvme`/libnvme writes itself, automatically, as each controller is connected. It lives under `/run`, so it lasts for the current boot only and is not carried across a reboot. It records which orchestrator owns each controller, and a later invocation or any other tool on the host can read it back.

The registry is a **cooperative tool, not an enforcement mechanism**. Every orchestrator runs as root and could disconnect anything regardless. The registry simply lets well-behaved tools avoid doing so by accident.

## What it is

The registry lives under `/run/nvme/registry/` — runtime state that is tied to controller lifecycle and does not survive a reboot. It mirrors the sysfs convention: one directory per live controller, one plain-text file per attribute.

```
/run/nvme/registry/
    nvme1/
        owner
    nvme3/
        owner
```

```sh
$ cat /run/nvme/registry/nvme3/owner
nbft
```

- **Absence means unowned.** A controller with no directory (or no `owner` file) is managed by nobody. There is no explicit "unowned" marker.
- The well-known attribute is **`owner`** — the orchestrator identity string (e.g. `stas`, `nbft`, `discoverd`). Orchestrators may write additional private attributes; unknown attributes are ignored by everyone else.
- Directories are `0755`, attribute files `0644`: **world-readable, root-writable**. Both the root and per-device directories are created on demand on first write.
- Writes are atomic (`mkstemp` → `fsync` → `rename`), so concurrent writers never corrupt an entry and readers never see a half-written value.
- Controllers the kernel manages directly — PCIe and other memory-based transports — are out of scope: they are not reached over a fabric, and `disconnect-all`'s transport-type check already excludes them.
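
The write and read rules in the list above can be sketched in plain POSIX C. This is a minimal illustration, not libnvme's implementation: the helper names `registry_write_attr` and `registry_read_owner` are hypothetical, and the `root` parameter exists only so the sketch can run against a scratch directory instead of `/run/nvme/registry`. The real signatures are in `src/nvme/registry.h`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not the libnvme API): atomically publish
 * <root>/<dev>/<attr> with the given value. Returns 0 or -1. */
static int registry_write_attr(const char *root, const char *dev,
                               const char *attr, const char *value)
{
    char dir[PATH_MAX], tmp[PATH_MAX], dst[PATH_MAX];
    int fd;

    assert(root && dev && attr && value);

    /* Create root and per-device directories on demand (one level each). */
    if (mkdir(root, 0755) < 0 && errno != EEXIST)
        return -1;
    if (snprintf(dir, sizeof(dir), "%s/%s", root, dev) >= (int)sizeof(dir))
        return -1;
    if (mkdir(dir, 0755) < 0 && errno != EEXIST)
        return -1;
    if (snprintf(tmp, sizeof(tmp), "%s/.%s.XXXXXX", dir, attr) >= (int)sizeof(tmp) ||
        snprintf(dst, sizeof(dst), "%s/%s", dir, attr) >= (int)sizeof(dst))
        return -1;

    fd = mkstemp(tmp);                      /* temp file in the same directory */
    if (fd < 0)
        return -1;
    if (dprintf(fd, "%s\n", value) < 0 ||   /* write the payload */
        fchmod(fd, 0644) < 0 ||             /* world-readable, owner-writable */
        fsync(fd) < 0) {                    /* flush before publishing */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* rename() swaps the file in one step: readers see old or new, never half. */
    if (rename(tmp, dst) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}

/* Returns 1 and fills buf when an owner is recorded, 0 when the controller is
 * unowned (no directory, no owner file, or an empty file in this sketch),
 * and -1 on any other error. */
static int registry_read_owner(const char *root, const char *dev,
                               char *buf, size_t len)
{
    char path[PATH_MAX];
    FILE *f;

    assert(root && dev && buf && len > 0);

    if (snprintf(path, sizeof(path), "%s/%s/owner", root, dev) >= (int)sizeof(path))
        return -1;
    f = fopen(path, "r");
    if (!f)
        return errno == ENOENT ? 0 : -1;    /* absence means unowned */
    if (!fgets(buf, (int)len, f))
        buf[0] = '\0';
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';         /* strip the trailing newline */
    return buf[0] != '\0';
}
```

The temporary file is created in the destination directory on purpose: `rename()` is only atomic within one filesystem, so the staging file must live next to the file it replaces.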

## Setting an owner

An orchestrator declares its identity once, then every controller it connects through that context is registered automatically on a successful connect:

```c
struct libnvme_global_ctx *ctx = libnvme_create_global_ctx(...);
libnvme_set_owner(ctx, "stas"); /* registers owner=stas on every connect */
```

A process that does not call `libnvme_set_owner()` produces **unowned** connections. On connect, libnvme writes the entry when an owner is set, and clears any stale entry for a recycled instance number when it is not.

From the CLI, pass `--owner NAME` to register ownership at connect time:

```sh
nvme connect --owner discoverd ...
nvme discover --owner discoverd ...
nvme connect-all --owner discoverd ...
nvme connect-all --nbft # NBFT controllers, owner=nbft
nvme connect-all --owner boot --nbft # NBFT controllers, owner=boot (overrides nbft)
```

`nvme connect-all --nbft` records `owner=nbft` automatically to protect firmware boot volumes. That `nbft` is only a default, though: an explicit `--owner NAME` given alongside `--nbft` overrides it, so `nvme connect-all --owner NAME --nbft` records the controllers as `NAME`. A plain `--nbft` (no `--owner`) keeps `owner=nbft`, which is what lets existing boot scripts get ownership for free without being changed. Without `--owner` and without `--nbft`, connections are unowned.
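
The precedence described above is small enough to write down as one function. The sketch below is an illustration only; `resolve_owner` and its parameters are hypothetical names, not part of nvme-cli or libnvme.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: choose the owner to record for a connect command.
 * owner_opt is the --owner argument (NULL when the flag is absent) and
 * nbft is true when --nbft was given. NULL means "unowned". */
static const char *resolve_owner(const char *owner_opt, bool nbft)
{
    if (owner_opt)
        return owner_opt;   /* an explicit --owner always wins */
    if (nbft)
        return "nbft";      /* default for --nbft */
    return NULL;            /* neither flag: connection is unowned */
}

/* The cases from the paragraph above, written as checks. */
static void resolve_owner_examples(void)
{
    assert(strcmp(resolve_owner("boot", true), "boot") == 0);           /* --owner boot --nbft */
    assert(strcmp(resolve_owner("discoverd", false), "discoverd") == 0); /* --owner discoverd */
    assert(strcmp(resolve_owner(NULL, true), "nbft") == 0);             /* --nbft alone */
    assert(resolve_owner(NULL, false) == NULL);                         /* no flags */
}
```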

## Automatic cleanup

A controller's entry must be removed once the controller goes away; otherwise stale entries accumulate and a recycled device name looks owned. **Creation and removal deliberately live in different places.** Entries are *written* by libnvme on connect (see *Setting an owner* above) because only the connecting process knows the owner. The owner lives in that process's libnvme context, never in sysfs or the uevent, so nothing else, a udev rule included, could supply it. *Removal* needs no owner, only the device name, and it must happen even when the kernel drops a controller on its own (connectivity loss, `ctrl-loss-tmo` expiry) with no orchestrator involved, and even on a host running no orchestrator daemon at all. So removal is driven by the device-removal event and performed by a single, always-present agent, udevd: exactly one party cleans up, and no orchestrator daemon is required.

libnvme ships a udev rule for exactly this. The udev daemon — present on essentially every system — removes a controller's registry entry when the kernel removes the controller:

```
ACTION=="remove", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", \
RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'"
```

Collaborator

IIRC, I asked this in the PR, shouldn't we use nvme-cli here to remove the entry? I would like to log everything in one place, so journalctl -u nvme-cli (or whatever the command is) shows what happened. Having this only in a udev rule distributes the information and makes it hard to debug.

Author

Fair point on debuggability, but there's a structural problem with routing it through nvme. The command would be nvme registry delete, which lives in the registry plugin, and that plugin is optional, at your own request (you asked for the registry to be a plugin rather than a built-in). So a udev rule that shells out to nvme registry delete would make the registry's stale-entry cleanup depend on a command that isn't present on any system that didn't build/install that optional plugin. The cleanup is a property of the registry itself, which lives in libnvme, so it has to work whenever libnvme is installed, with no dependency on nvme-cli or an optional plugin, and no daemon. That's exactly why it's a libnvme-shipped udev rule.

It also wouldn't change the race: a command launched from the rule races the connect path identically; the [ -e ] guard is still what makes it correct. So we'd take on an optional-plugin dependency and gain nothing on correctness.

The debuggability concern is legit on its own, though. Happy to solve "see what happened in one place" without that dependency, e.g. have the rule log via logger, or add a libnvme log sink.

Author

This also answers the natural follow-up: "if a udev rule removes entries, why doesn't one create them?" The two halves are placed where the information they need actually lives:

  • Create needs the owner, and the owner is purely in-process state of the connecting process, passed to libnvme via the context (ctx->owner). It is not in sysfs, not in the add uevent, nowhere a udev rule could read it (the kernel has no concept of "owner"; it's a libnvme-userspace notion). A rule firing for nvme4 has no way to know who connected it, so it cannot write a correct owner=. Only libnvme, in the connect path, has the owner, which is why the entry is written there. (It's also written synchronously inside connect(), so there is no window where a just-connected controller is briefly unowned and exposed to disconnect-all.)
  • Delete needs no owner: it just removes the entry for that device name (with the [ -e ] recycling guard). No per-process knowledge is required, so a udev rule can do it.

So the asymmetry is deliberate, not an oversight: each operation lives where its required input is available.

**Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it.

Collaborator

The race window still exists. The only way to ensure this works is by serializing the events, which udevd does. So I don't think this argument is correct.

Author

I think the disconnect is one specific point: the new controller's registry entry is not written by a udev rule — it's written by libnvme inside the connect() path, synchronously in the connecting process, the moment the kernel assigns the (recycled) instance number. udevd never sees that write and never orders it. So udevd serializing udev events doesn't help here, because the actor that races the remove rule is the connect path, which is outside udev entirely.

Concretely, the dangerous interleaving for the name nvme4:

  1. Kernel removes the old nvme4 → frees instance 4 → emits a remove uevent (now queued in udevd).
  2. Before udevd gets to that uevent, a new controller connects. The kernel allocates the lowest free instance = 4 again, creates /dev/nvme4 (devtmpfs, synchronous), and libnvme — in the connecting process, not via udev — writes /run/nvme/registry/nvme4/owner.
  3. udevd now runs the queued remove rule for the old nvme4. With no guard it does rm -rf /run/nvme/registry/nvme4 and deletes the new, live controller's entry.

udevd processing remove before add in order does not prevent step 3 from clobbering step 2, because step 2's write never came through udev. The only information available at step 3 that distinguishes "stale entry for the controller that's really gone" from "live entry for a controller that recycled the name" is whether the device node currently exists — which is exactly what [ -e /dev/%k ] tests. devtmpfs removes the node synchronously before KOBJ_REMOVE is emitted (device_del() calls devtmpfs_delete_node() before kobject_uevent(KOBJ_REMOVE)), so:

  • genuinely-gone controller → /dev/nvme4 absent → rm runs (correct);
  • recycled name → /dev/nvme4 exists again → rm skipped (correct).

So the guard isn't a weaker substitute for serialization — it's the only mechanism that works, because the competing actor (the connect path) is not a udev event udevd can serialize. The residual window is the microsecond TOCTOU between the [ -e ] test and the rm; serialization can't close that either, for the same reason (the connect write isn't a udev event), and hitting it needs a connect ioctl to complete at that exact instant — negligible.

On the discoverd worry: this is actually why the guard is robust to it. Correctness depends on kernel devtmpfs state, not on udevd event ordering or on who is observing events. Another observer (e.g. discoverd) reconnecting a dropped controller just produces another connect → another /dev/nvmeN + libnvme entry; the [ -e ] check reflects that regardless of event timing. So it does not rely on serialization at all — which is the property we want once there are multiple observers.


The guard breaks the race by asking whether the device still exists at the moment the rule runs. devtmpfs removes a controller's device node *synchronously*, before the kernel emits `KOBJ_REMOVE`, so for the controller actually being removed `/dev/nvmeN` is already gone. If `/dev/nvmeN` is absent, the entry is genuinely stale and `rm` runs. If `/dev/nvmeN` exists, the id has been recycled to a new controller: `[ -e /dev/%k ]` is true, the `||` short-circuits, and `rm` is skipped — preserving the new owner's entry.

Collaborator

As I said, udevd ensures the order of rule execution. That is, if everything runs through udevd, all is good. The problem starts if another instance observes udev events and acts on them. With nvme-discoverd on the horizon, this udev rule could get really nasty. Just thinking out loud, maybe all is good.


Note that udevd serializing udev *events* does not cover this race: the new controller's entry was written by libnvme during `connect()`, not by a udev rule, so udevd never ordered that write against the remove rule. Only the live device-node state, which the guard reads, can tell a stale entry apart from a recycled-and-live one.
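
To make the guard's two outcomes concrete, here is a rough C rendering of the shell one-liner `[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k`. It is a sketch under stated assumptions, not code shipped by libnvme: the function names are hypothetical, the `dev_root` and `reg_root` parameters stand in for `/dev` and `/run/nvme/registry` so the logic can be exercised on scratch directories, and the entry is assumed to be a flat directory (one file per attribute, no subdirectories).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Remove <reg_root>/<name> and the attribute files directly inside it.
 * Returns 0 on success or if the entry was already absent, -1 on error. */
static int remove_entry(const char *reg_root, const char *name)
{
    char dir[PATH_MAX], path[PATH_MAX];
    struct dirent *de;
    DIR *d;

    if (snprintf(dir, sizeof(dir), "%s/%s", reg_root, name) >= (int)sizeof(dir))
        return -1;
    d = opendir(dir);
    if (!d)
        return errno == ENOENT ? 0 : -1;    /* nothing to clean up */
    while ((de = readdir(d)) != NULL) {
        if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
            continue;
        if (snprintf(path, sizeof(path), "%s/%s", dir, de->d_name) < (int)sizeof(path))
            unlink(path);
    }
    closedir(d);
    return rmdir(dir);
}

/* Hypothetical equivalent of the udev rule's guard.
 * Returns 0 if the entry was kept because the device node exists (the name
 * was recycled), 1 if the entry was removed or already absent, -1 on error. */
static int cleanup_on_remove(const char *dev_root, const char *reg_root,
                             const char *name)
{
    char node[PATH_MAX];
    struct stat st;

    assert(dev_root && reg_root && name);

    if (snprintf(node, sizeof(node), "%s/%s", dev_root, name) >= (int)sizeof(node))
        return -1;
    if (stat(node, &st) == 0)
        return 0;       /* node present: a live controller owns this name */
    if (errno != ENOENT)
        return -1;
    return remove_entry(reg_root, name) == 0 ? 1 : -1;
}
```

Like the shell version, this still has the small check-then-remove window discussed in the review thread; the sketch does not try to close it.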

Because the rule lives in libnvme, cleanup works whenever libnvme is installed, independent of nvme-cli.

## `disconnect-all` behavior

This is the payoff. `disconnect-all` respects ownership by default:

| Invocation | Disconnects |
|---|---|
| `nvme disconnect-all` | only **unowned** controllers (safe default) |
| `nvme disconnect-all --owner NAME` | only controllers owned by `NAME` (confirmation required) |
| `nvme disconnect-all --force` | **all** fabric controllers regardless of ownership (confirmation required) |

`--owner` and `--force` are mutually exclusive. Both prompt for a typed `yes` when stdin is a terminal; non-interactive callers (scripts) proceed, since passing the flag is itself the statement of intent.

In every case, controllers the kernel manages directly (PCIe and other memory-based transports) are left alone — the transport-type check excludes non-fabric controllers before ownership (or `--force`) is even considered, so `--force` cannot reach them. This means `nvme disconnect-all --force` behaves exactly like the original `nvme disconnect-all` did: it tears down every fabric controller and never touches locally-attached devices. The ownership-aware default is the only new behavior layered on top.
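
The table and the transport rule above combine into a single per-controller decision. The following C sketch shows one way to express it; the enum, the function name `should_disconnect`, and its parameters are assumptions made for illustration and do not come from the nvme-cli source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* How disconnect-all was invoked (hypothetical names). */
enum disc_mode {
    DISC_DEFAULT,   /* nvme disconnect-all */
    DISC_OWNER,     /* nvme disconnect-all --owner NAME */
    DISC_FORCE      /* nvme disconnect-all --force */
};

/* is_fabric: result of the transport-type check for this controller.
 * owner:     owner read from the registry, or NULL when unowned.
 * wanted:    the NAME given to --owner; only used with DISC_OWNER. */
static bool should_disconnect(enum disc_mode mode, bool is_fabric,
                              const char *owner, const char *wanted)
{
    assert(mode != DISC_OWNER || wanted != NULL);

    /* The transport check comes first, so --force cannot reach PCIe. */
    if (!is_fabric)
        return false;

    switch (mode) {
    case DISC_FORCE:
        return true;                                    /* every fabric controller */
    case DISC_OWNER:
        return owner != NULL && strcmp(owner, wanted) == 0;
    case DISC_DEFAULT:
    default:
        return owner == NULL;                           /* unowned only */
    }
}
```

Putting the `is_fabric` test ahead of the `switch` mirrors the ordering the text describes: ownership and `--force` are only consulted for controllers that already passed the transport check.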

By contrast, `nvme disconnect <device>` targets one named controller and always disconnects it — the caller's intent is unambiguous, so no guardrail applies.

## Inspecting the registry

```sh
nvme registry list # all live entries
nvme registry retrieve <device> -a <attr>
nvme registry update <device> -a note -V "boot-path SAN"
nvme registry delete <device> [-a <attr>] # whole entry, or one attribute
nvme list -v # adds an Orchestrator column: owner,
# '-' (unowned), or 'kernel' (PCIe)
```
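
The three values of the Orchestrator column follow directly from the transport and the registry lookup. A minimal C sketch, with a hypothetical function name and the simplifying assumption that only the `pcie` transport string marks a kernel-managed controller (the document also mentions other memory-based transports, which this sketch does not cover):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* transport: the controller's transport string, e.g. "pcie", "tcp", "rdma", "fc".
 * owner:     owner read from the registry, or NULL when there is no entry. */
static const char *orchestrator_column(const char *transport, const char *owner)
{
    assert(transport != NULL);

    if (strcmp(transport, "pcie") == 0)
        return "kernel";            /* managed directly by the kernel */
    return owner ? owner : "-";     /* registered owner, or unowned */
}
```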

Changing an owner (`update -a owner`) or removing an entry (`delete`) can stop an orchestrator from protecting a controller, so when run interactively these prompt for a `[y/N]` confirmation (default no). Updates to other attributes, non-interactive callers, and the C and Python APIs proceed without prompting. This is a guard against accidental mistakes, not access control — anyone with root can edit the files under `/run/nvme/registry/` directly.

## Further reading

- `src/nvme/registry.h` — full API kdoc
- `Documentation/nvme-registry-*.txt` — man pages for the CLI commands