doc: add libnvme registry overview (REGISTRY.md) #3467
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| # NVMe Controller Ownership Registry | ||
|
|
||
| The registry records **which orchestrator owns each connected NVMe-oF controller**. It is a small, cooperative coordination layer that lets independent tools share a host without stepping on each other's controllers — most importantly, so that a sweeping command like `nvme disconnect-all` does not tear down a connection some other component depends on. | ||
|
|
||
| > The code is the source of truth. This document summarizes behavior and intent; for exact signatures see the header kdoc in `src/nvme/registry.h` and the `nvme-registry-*` man pages. | ||
|
|
||
| ## Why it exists | ||
|
|
||
| NVMe-oF connections on a host are rarely managed by a single actor. Independent **orchestrators** — agents that decide on their own which controllers to connect and disconnect — coexist on the same machine and share one flat device namespace (`/dev/nvmeX`), with no record of who created or manages each controller. | ||
|
|
||
| This holds even on a plain system with no daemons installed. There are already at least two orchestrators: the **initramfs** (NBFT and FC-kickstart connections made during early boot) and a **human** running `nvme connect-all` / `nvme disconnect-all`. Installing `nvme-discoverd` or `nvme-stas` only adds more. | ||
|
|
||
| Not every command is an orchestrator: | ||
|
|
||
| - `nvme connect` / `nvme disconnect` — single, targeted actions; here the *human* is the orchestrator, deciding what to connect or disconnect. | ||
| - `nvme connect-all` / `nvme disconnect-all` — orchestrating commands that, once invoked, decide on their own. `connect-all` reads a discovery controller's discovery log page (DLP), connects every DLP entry, and follows referrals into further discovery controllers, recursing through several layers; `disconnect-all` tears down across the whole host. | ||
| - `nvme discover` — in between: it reads a discovery controller's DLP and prints it, but does not connect DLP entries or follow referrals. It is read-mostly; its only state-changing option is `--persistent`, which keeps the discovery connection itself. | ||
|
|
||
| The trigger was [issue #2913](https://github.com/linux-nvme/nvme-cli/issues/2913): `disconnect-all` has host-wide scope but no way to tell a boot-critical NBFT connection — or one a daemon depends on — from a throwaway manual connect, so it would disconnect them all. To coordinate, an orchestrator must know who owns what. | ||
|
|
||
| **This is why the registry is a new kind of state for `nvme`.** `nvme` already reads configuration (host identity, saved connections) and kernel state from sysfs, but it kept no memory of what it had *done*: each invocation was independent. The registry is that memory. It is runtime state that `nvme`/libnvme writes itself, automatically, as a controller is connected. It lives under `/run`, so it lasts for the current boot only and does not survive a reboot. It records which orchestrator owns each controller, and a later invocation or any other tool on the host can read it back. | ||
|
|
||
| The registry is a **cooperative tool, not an enforcement mechanism**. Every orchestrator runs as root and could disconnect anything regardless. The registry simply lets well-behaved tools avoid doing so by accident. | ||
|
|
||
| ## What it is | ||
|
|
||
| The registry lives under `/run/nvme/registry/` — runtime state that is tied to controller lifecycle and does not survive a reboot. It mirrors the sysfs convention: one directory per live controller, one plain-text file per attribute. | ||
|
|
||
| ``` | ||
| /run/nvme/registry/ | ||
| nvme1/ | ||
| owner | ||
| nvme3/ | ||
| owner | ||
| ``` | ||
|
|
||
| ```sh | ||
| $ cat /run/nvme/registry/nvme3/owner | ||
| nbft | ||
| ``` | ||
|
|
||
| - **Absence means unowned.** A controller with no directory (or no `owner` file) is managed by nobody. There is no explicit "unowned" marker. | ||
| - The well-known attribute is **`owner`** — the orchestrator identity string (e.g. `stas`, `nbft`, `discoverd`). Orchestrators may write additional private attributes; unknown attributes are ignored by everyone else. | ||
| - Directories are `0755`, attribute files `0644`: **world-readable, root-writable**. Both the root and per-device directories are created on demand on first write. | ||
| - Writes are atomic (`mkstemp` → `fsync` → `rename`), so concurrent writers never corrupt an entry and readers never see a half-written value. | ||
| - Controllers the kernel manages directly — PCIe and other memory-based transports — are out of scope: they are not reached over a fabric, and `disconnect-all`'s transport-type check already excludes them. | ||
|
|
||
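The layout rules above (one directory per controller, one file per attribute, absence means unowned) can be exercised with ordinary shell tools. The following is only a sketch of a reader, not the libnvme implementation: the function names and the `REGISTRY_ROOT` override are hypothetical, introduced so the code can be pointed at a scratch directory. Printing `-` for an unowned controller borrows the convention of the `nvme list -v` Orchestrator column.

```sh
# Sketch only: read ownership straight from the registry files.
# REGISTRY_ROOT is a hypothetical override for experimentation; the
# documented location is /run/nvme/registry.
REGISTRY_ROOT="${REGISTRY_ROOT:-/run/nvme/registry}"

# Print the owner of controller $1 (e.g. nvme3), or '-' when unowned.
# A missing directory and a missing owner file both mean "unowned".
registry_owner() {
    reg_owner_file="$REGISTRY_ROOT/$1/owner"
    if [ -r "$reg_owner_file" ]; then
        # Attribute files hold a single plain-text value.
        head -n 1 "$reg_owner_file"
    else
        echo "-"
    fi
}

# List every registered controller with its owner, one per line.
registry_list() {
    for reg_dir in "$REGISTRY_ROOT"/*/; do
        [ -d "$reg_dir" ] || continue   # glob matched nothing
        reg_name=$(basename "$reg_dir")
        printf '%s %s\n' "$reg_name" "$(registry_owner "$reg_name")"
    done
}
```

For example, `registry_owner nvme3` prints the owner string when `nvme3/owner` exists and `-` otherwise. Because the files are world-readable, a reader like this does not need root.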
| ## Setting an owner | ||
|
|
||
| An orchestrator declares its identity once, then every controller it connects through that context is registered automatically on a successful connect: | ||
|
|
||
| ```c | ||
| struct libnvme_global_ctx *ctx = libnvme_create_global_ctx(...); | ||
| libnvme_set_owner(ctx, "stas"); /* registers owner=stas on every connect */ | ||
| ``` | ||
|
|
||
| A process that does not call `libnvme_set_owner()` produces **unowned** connections. On connect, libnvme writes the entry when an owner is set, and clears any stale entry for a recycled instance number when it is not. | ||
|
|
||
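The connect-time behavior described above (write the entry when an owner is set, clear a stale entry when it is not) can be approximated in shell to make the two branches concrete. This is a sketch under stated assumptions, not what libnvme does internally: `registry_write_attr`, `registry_on_connect`, and `REGISTRY_ROOT` are hypothetical names, and a plain `sync` stands in for the per-file `fsync` step because portable shell has no direct equivalent.

```sh
# Sketch only. REGISTRY_ROOT is a hypothetical override; the documented
# location is /run/nvme/registry.
REGISTRY_ROOT="${REGISTRY_ROOT:-/run/nvme/registry}"

# Atomically write attribute $2 with value $3 for controller $1.
# Pattern: temp file in the same directory, flush, then rename over
# the final name, so readers never observe a half-written value.
registry_write_attr() {
    reg_dir="$REGISTRY_ROOT/$1"
    mkdir -p "$reg_dir" && chmod 0755 "$reg_dir" || return 1
    reg_tmp=$(mktemp "$reg_dir/.$2.XXXXXX") || return 1
    printf '%s\n' "$3" > "$reg_tmp" \
        && chmod 0644 "$reg_tmp" \
        && sync \
        && mv -f "$reg_tmp" "$reg_dir/$2" \
        || { rm -f "$reg_tmp"; return 1; }
}

# Call after a successful connect of controller $1.
# $2 is the declared owner, or empty if the process declared none.
registry_on_connect() {
    [ -n "$1" ] || return 1
    if [ -n "$2" ]; then
        registry_write_attr "$1" owner "$2"
    else
        # Unowned connect: drop any stale entry left behind for a
        # recycled instance number.
        rm -rf "${REGISTRY_ROOT:?}/$1"
    fi
}
```

The rename happens within one directory, which is what makes the final step atomic; a temp file created elsewhere (for example under `/tmp`) could turn the `mv` into a copy and lose that property.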
| From the CLI, pass `--owner NAME` to register ownership at connect time: | ||
|
|
||
| ```sh | ||
| nvme connect --owner discoverd ... | ||
| nvme discover --owner discoverd ... | ||
| nvme connect-all --owner discoverd ... | ||
| nvme connect-all --nbft # NBFT controllers, owner=nbft | ||
| nvme connect-all --owner boot --nbft # NBFT controllers, owner=boot (overrides nbft) | ||
| ``` | ||
|
|
||
| `nvme connect-all --nbft` records `owner=nbft` automatically to protect firmware boot volumes. That `nbft` is only a default, though: an explicit `--owner NAME` given alongside `--nbft` overrides it, so `nvme connect-all --owner NAME --nbft` records the controllers as `NAME`. A plain `--nbft` (no `--owner`) keeps `owner=nbft`, which is what lets existing boot scripts get ownership for free without being changed. Without `--owner` and without `--nbft`, connections are unowned. | ||
|
|
||
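The precedence between `--owner` and `--nbft` amounts to a three-way decision. The sketch below restates it as a small shell function; `resolve_owner` and its argument convention are hypothetical and exist only to illustrate the rule, not to mirror nvme-cli's option parsing.

```sh
# resolve_owner OWNER_OPT NBFT_FLAG   (hypothetical helper)
#   OWNER_OPT: value given to --owner, or empty if the option is absent
#   NBFT_FLAG: 1 if --nbft was given, anything else otherwise
# Prints the owner to record; prints nothing for an unowned connection.
resolve_owner() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # an explicit --owner always wins
    elif [ "$2" = 1 ]; then
        printf '%s\n' nbft      # --nbft alone defaults to owner=nbft
    fi
    # Neither option: print nothing, the connection stays unowned.
}
```

With these inputs, `resolve_owner boot 1` yields `boot`, `resolve_owner "" 1` yields `nbft`, and `resolve_owner "" 0` yields an empty result.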
| ## Automatic cleanup | ||
|
|
||
| A controller's entry must be removed once the controller goes away, or stale entries accumulate and a recycled device name looks owned. **Creation and removal deliberately live in different places.** Entries are *written* by libnvme on connect (see *Setting an owner* above) because only the connecting process knows the owner — it lives in that process's libnvme context, never in sysfs or the uevent — so nothing else, a udev rule included, could supply it. *Removal* needs no owner, only the device name, and it must happen even when the kernel drops a controller on its own (connectivity loss, `ctrl-loss-tmo` expiry) with no orchestrator involved, and even on a host running no orchestrator daemon at all. So removal is driven by the device-removal event and performed by a single, always-present agent — udevd: exactly one party cleans up, and it works with no daemon required. | ||
|
|
||
| libnvme ships a udev rule for exactly this. The udev daemon — present on essentially every system — removes a controller's registry entry when the kernel removes the controller: | ||
|
|
||
| ``` | ||
| ACTION=="remove", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", \ | ||
| RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'" | ||
| ``` | ||
|
|
||
| **Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it. | ||
|
Collaborator
The race window still exists. The only way to ensure this works is by serializing the events, which udevd does. So I don't think this argument is correct.
Author
I think the disconnect is one specific point: the new controller's registry entry is not written by a udev rule; it's written by libnvme inside the Concretely, the dangerous interleaving for the name
udevd processing
So the guard isn't a weaker substitute for serialization — it's the only mechanism that works, because the competing actor (the connect path) is not a udev event udevd can serialize. The residual window is the microsecond TOCTOU between the On the discoverd worry: this is actually why the guard is robust to it. Correctness depends on kernel devtmpfs state, not on udevd event ordering or on who is observing events. Another observer (e.g. discoverd) reconnecting a dropped controller just produces another connect → another |
||
|
|
||
| The guard breaks the race by asking whether the device still exists at the moment the rule runs. devtmpfs removes a controller's device node *synchronously*, before the kernel emits `KOBJ_REMOVE`, so for the controller actually being removed `/dev/nvmeN` is already gone. If `/dev/nvmeN` is absent, the entry is genuinely stale and `rm` runs. If `/dev/nvmeN` exists, the id has been recycled to a new controller: `[ -e /dev/%k ]` is true, the `||` short-circuits, and `rm` is skipped — preserving the new owner's entry. | ||
|
Collaborator
As I said, udevd ensures the order of rule execution. That is, if everything is run through udevd, all is good. The problem starts if another instance is observing udev events and acting on them. With nvme-discoverd on the horizon, this udev rule could get really nasty. Just thinking out loud; maybe all is good. |
||
|
|
||
| Note that udevd serialising udev *events* does not cover this race: the new controller's entry was written by libnvme during `connect()` — not by a udev rule — so udevd never ordered that write against the remove rule. Only the live device-node state, which the guard reads, can tell a stale entry apart from a recycled-and-live one. | ||
|
|
||
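To see the guard's two outcomes without touching a real system, the rule's `RUN` command can be restated as a function that takes the kernel name as an argument. This is an illustrative sketch only: `registry_cleanup_on_remove`, `DEV_ROOT`, and `REGISTRY_ROOT` are hypothetical names added so the logic can run against a scratch directory; the shipped rule uses `/dev/%k` and `/run/nvme/registry/%k` directly.

```sh
# Sketch only. Both variables are hypothetical overrides for testing.
DEV_ROOT="${DEV_ROOT:-/dev}"
REGISTRY_ROOT="${REGISTRY_ROOT:-/run/nvme/registry}"

# Mirror of the rule's RUN command for kernel name $1 (udev's %k).
registry_cleanup_on_remove() {
    [ -n "$1" ] || return 1
    if [ -e "$DEV_ROOT/$1" ]; then
        # The device node exists, so the id was recycled to a live
        # controller: keep its entry.
        return 0
    fi
    # The device node is gone, so the entry is stale: remove it.
    rm -rf "${REGISTRY_ROOT:?}/$1"
}
```

Creating an empty file at `$DEV_ROOT/nvme2` before calling `registry_cleanup_on_remove nvme2` simulates the recycled-id case and leaves the entry in place; calling it for a name with no device node removes the entry.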
| Because the rule lives in libnvme, cleanup works whenever libnvme is installed, independent of nvme-cli. | ||
|
|
||
| ## `disconnect-all` behavior | ||
|
|
||
| This is the payoff. `disconnect-all` respects ownership by default: | ||
|
|
||
| | Invocation | Disconnects | | ||
| |---|---| | ||
| | `nvme disconnect-all` | only **unowned** controllers (safe default) | | ||
| | `nvme disconnect-all --owner NAME` | only controllers owned by `NAME` (confirmation required) | | ||
| | `nvme disconnect-all --force` | **all** fabric controllers regardless of ownership (confirmation required) | | ||
|
|
||
| `--owner` and `--force` are mutually exclusive. Both prompt for a typed `yes` when stdin is a terminal; non-interactive callers (scripts) proceed, since passing the flag is itself the statement of intent. | ||
|
|
||
| In every case, controllers the kernel manages directly (PCIe and other memory-based transports) are left alone — the transport-type check excludes non-fabric controllers before ownership (or `--force`) is even considered, so `--force` cannot reach them. This means `nvme disconnect-all --force` behaves exactly like the original `nvme disconnect-all` did: it tears down every fabric controller and never touches locally-attached devices. The ownership-aware default is the only new behavior layered on top. | ||
|
|
||
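The table and the transport rule combine into one per-controller decision. The sketch below encodes that decision as a shell predicate so the ordering is explicit: the transport check runs first, then the mode. `should_disconnect` and its argument convention are hypothetical, and treating only the `pcie` transport string as non-fabric is a simplifying assumption for illustration.

```sh
# should_disconnect MODE WANTED_OWNER TRANSPORT OWNER   (hypothetical)
#   MODE:         default | owner | force
#   WANTED_OWNER: NAME from --owner (used only in 'owner' mode)
#   TRANSPORT:    controller transport, e.g. tcp, rdma, fc, pcie
#   OWNER:        registry owner, empty when the controller is unowned
# Returns 0 if the controller would be disconnected, non-zero otherwise.
should_disconnect() {
    # Transport check comes first, so even 'force' cannot reach
    # kernel-managed controllers (assumption: only 'pcie' here).
    [ "$3" = pcie ] && return 1
    case "$1" in
        force)   return 0 ;;                        # every fabric controller
        owner)   [ -n "$2" ] && [ "$4" = "$2" ] ;;  # only NAME's controllers
        default) [ -z "$4" ] ;;                     # only unowned controllers
        *)       return 1 ;;                        # unknown mode: do nothing
    esac
}
```

For instance, `should_disconnect default "" tcp nbft` fails (the controller is owned), while `should_disconnect force "" tcp nbft` succeeds.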
| By contrast, `nvme disconnect <device>` targets one named controller and always disconnects it — the caller's intent is unambiguous, so no guardrail applies. | ||
|
|
||
| ## Inspecting the registry | ||
|
|
||
| ```sh | ||
| nvme registry list # all live entries | ||
| nvme registry retrieve <device> -a <attr> | ||
| nvme registry update <device> -a note -V "boot-path SAN" | ||
| nvme registry delete <device> [-a <attr>] # whole entry, or one attribute | ||
| nvme list -v # adds an Orchestrator column: owner, | ||
| # '-' (unowned), or 'kernel' (PCIe) | ||
| ``` | ||
|
|
||
| Changing an owner (`update -a owner`) or removing an entry (`delete`) can stop an orchestrator from protecting a controller, so when run interactively these prompt for a `[y/N]` confirmation (default no). Updates to other attributes, non-interactive callers, and the C and Python APIs proceed without prompting. This is a guard against accidental mistakes, not access control — anyone with root can edit the files under `/run/nvme/registry/` directly. | ||
|
|
||
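The prompting rule can be summarized as a predicate over the action, the attribute, and whether the caller is interactive. The following sketch is illustrative only: `needs_confirmation` is a hypothetical helper, and it assumes that every `delete` prompts, including deletion of a single attribute, which the text above does not spell out.

```sh
# needs_confirmation ACTION ATTR INTERACTIVE   (hypothetical helper)
#   ACTION:      update | delete
#   ATTR:        attribute name (may be empty for a whole-entry delete)
#   INTERACTIVE: 1 if stdin is a terminal, anything else otherwise
# Returns 0 if a [y/N] prompt would be shown.
needs_confirmation() {
    [ "$3" = 1 ] || return 1            # non-interactive: never prompt
    case "$1" in
        delete) return 0 ;;             # assumption: any delete prompts
        update) [ "$2" = owner ] ;;     # only owner changes are guarded
        *)      return 1 ;;
    esac
}
```

A caller could derive the third argument with `[ -t 0 ]`, which tests whether stdin is a terminal.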
| ## Further reading | ||
|
|
||
| - `src/nvme/registry.h` — full API kdoc | ||
| - `Documentation/nvme-registry-*.txt` — man pages for the CLI commands | ||
IIRC, I asked this in the PR: shouldn't we use nvme-cli here to remove the entry? I would like to log everything in one place, so `journalctl -u nvme-cli` (or whatever the command is) shows what happened. Having this only in a udev rule distributes the information and makes it hard to debug.

Fair point on debuggability, but there's a structural problem with routing it through `nvme`. The command would be `nvme registry delete`, which lives in the registry plugin, and that plugin is optional, at your own request (you asked for the registry to be a plugin rather than a built-in). So a udev rule that shells out to `nvme registry delete` would make the registry's stale-entry cleanup depend on a command that isn't present on any system that didn't build/install that optional plugin. The cleanup is a property of the registry itself, which lives in libnvme, so it has to work whenever libnvme is installed, with no dependency on nvme-cli or an optional plugin, and no daemon. That's exactly why it's a libnvme-shipped udev rule.

It also wouldn't change the race: a command launched from the rule races the connect path identically, and the `[ -e ]` guard is still what makes it correct. So we'd take on an optional-plugin dependency and gain nothing on correctness.

The debuggability concern is legit on its own, though. Happy to solve "see what happened in one place" without that dependency, e.g. have the rule log via `logger`, or add a libnvme log sink.

This also answers the natural follow-up: "if a udev rule removes entries, why doesn't one create them?" The two halves are placed where the information they need actually lives:

- Creation needs the `owner`, and the owner is purely in-process state of the connecting process, passed to libnvme via the context (`ctx->owner`). It is not in sysfs, not in the `add` uevent, nowhere a udev rule could read it (the kernel has no concept of "owner"; it's a libnvme-userspace notion). A rule firing for `nvme4` has no way to know who connected it, so it cannot write a correct `owner=`. Only libnvme, in the connect path, has the owner, which is why the entry is written there. (It's also written synchronously inside `connect()`, so there is no window where a just-connected controller is briefly unowned and exposed to `disconnect-all`.)
- Removal does not need the `owner`; it just removes the entry for that device name (with the `[ -e ]` recycling guard). No per-process knowledge is required, so a udev rule can do it.

So the asymmetry is deliberate, not an oversight: each operation lives where its required input is available.