Skip to content

doc: add libnvme registry overview (REGISTRY.md)#3467

Open
martin-belanger wants to merge 1 commit into
linux-nvme:masterfrom
martin-belanger:registry-docs
Open

doc: add libnvme registry overview (REGISTRY.md)#3467
martin-belanger wants to merge 1 commit into
linux-nvme:masterfrom
martin-belanger:registry-docs

Conversation

@martin-belanger

Copy link
Copy Markdown

Add a concise overview of the NVMe controller ownership registry: why it exists (orchestrator coexistence and the disconnect-all coordination problem from issue #2913), the /run/nvme/registry layout, how an owner is set, the udev-driven cleanup and its instance-recycling guard, and the disconnect-all guardrails. The header kdoc and man pages remain the authoritative reference.

Suggested-by: Daniel Wagner wagi@kernel.org

Comment thread libnvme/REGISTRY.md Outdated
Comment thread libnvme/REGISTRY.md

```
ACTION=="remove", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", \
RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I asked this in the PR, shouldn't we use nvme-cli here to remove the entry? I would like to log everything in one place, so journalctl -u nvme-cli (or whatever the command is) shows what happened. Having this only in a udev rule distributes the information and makes it hard to debug.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point on debuggability, but there's a structural problem with routing it through nvme. The command would be nvme registry delete, which lives in the registry plugin — and that plugin is optional, at your own request (you asked for the registry to be a plugin rather than a built-in). So a udev rule that shells out to nvme registry delete would make the registry's stale-entry cleanup depend on a command that isn't present on any system that didn't build/install that optional plugin. The cleanup is a property of the registry itself, which lives in libnvme — so it has to work whenever libnvme is installed, with no dependency on nvme-cli or an optional plugin, and no daemon. That's exactly why it's a libnvme-shipped udev rule.

It also wouldn't change the race: a command launched from the rule races the connect path identically — the [ -e ] guard is still what makes it correct. So we'd take on an optional-plugin dependency and gain nothing on correctness.

The debuggability concern is legit on its own, though — happy to solve "see what happened in one place" without that dependency, e.g. have the rule log via logger, or add a libnvme log sink.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also answers the natural follow-up — "if a udev rule removes entries, why doesn't one create them?" The two halves are placed where the information they need actually lives:

  • Create needs the owner — and the owner is purely in-process state of the connecting process, passed to libnvme via the context (ctx->owner). It is not in sysfs, not in the add uevent, nowhere a udev rule could read it (the kernel has no concept of "owner" — it's a libnvme-userspace notion). A rule firing for nvme4 has no way to know who connected it, so it cannot write a correct owner=. Only libnvme, in the connect path, has the owner — which is why the entry is written there. (It's also written synchronously inside connect(), so there is no window where a just-connected controller is briefly unowned and exposed to disconnect-all.)
  • Delete needs no owner — it just removes the entry for that device name (with the [ -e ] recycling guard). No per-process knowledge is required, so a udev rule can do it.

So the asymmetry is deliberate, not an oversight: each operation lives where its required input is available.

Comment thread libnvme/REGISTRY.md
RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'"
```

**Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race window still exists. The only way to ensure this works is by serializing the events, which udevd does. So I don't think this argument is correct.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the disconnect is one specific point: the new controller's registry entry is not written by a udev rule — it's written by libnvme inside the connect() path, synchronously in the connecting process, the moment the kernel assigns the (recycled) instance number. udevd never sees that write and never orders it. So udevd serializing udev events doesn't help here, because the actor that races the remove rule is the connect path, which is outside udev entirely.

Concretely, the dangerous interleaving for the name nvme4:

  1. Kernel removes the old nvme4 → frees instance 4 → emits a remove uevent (now queued in udevd).
  2. Before udevd gets to that uevent, a new controller connects. The kernel allocates the lowest free instance = 4 again, creates /dev/nvme4 (devtmpfs, synchronous), and libnvme — in the connecting process, not via udev — writes /run/nvme/registry/nvme4/owner.
  3. udevd now runs the queued remove rule for the old nvme4. With no guard it does rm -rf /run/nvme/registry/nvme4 and deletes the new, live controller's entry.

udevd processing remove before add in order does not prevent step 3 from clobbering step 2, because step 2's write never came through udev. The only information available at step 3 that distinguishes "stale entry for the controller that's really gone" from "live entry for a controller that recycled the name" is whether the device node currently exists — which is exactly what [ -e /dev/%k ] tests. devtmpfs removes the node synchronously before KOBJ_REMOVE is emitted (device_del() calls devtmpfs_delete_node() before kobject_uevent(KOBJ_REMOVE)), so:

  • genuinely-gone controller → /dev/nvme4 absent → rm runs (correct);
  • recycled name → /dev/nvme4 exists again → rm skipped (correct).

So the guard isn't a weaker substitute for serialization — it's the only mechanism that works, because the competing actor (the connect path) is not a udev event udevd can serialize. The residual window is the microsecond TOCTOU between the [ -e ] test and the rm; serialization can't close that either, for the same reason (the connect write isn't a udev event), and hitting it needs a connect ioctl to complete at that exact instant — negligible.

On the discoverd worry: this is actually why the guard is robust to it. Correctness depends on kernel devtmpfs state, not on udevd event ordering or on who is observing events. Another observer (e.g. discoverd) reconnecting a dropped controller just produces another connect → another /dev/nvmeN + libnvme entry; the [ -e ] check reflects that regardless of event timing. So it does not rely on serialization at all — which is the property we want once there are multiple observers.

Comment thread libnvme/REGISTRY.md

**Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it.

The guard breaks the race by asking whether the device still exists at the moment the rule runs. devtmpfs removes a controller's device node *synchronously*, before the kernel emits `KOBJ_REMOVE`, so for the controller actually being removed `/dev/nvmeN` is already gone. If `/dev/nvmeN` is absent, the entry is genuinely stale and `rm` runs. If `/dev/nvmeN` exists, the id has been recycled to a new controller: `[ -e /dev/%k ]` is true, the `||` short-circuits, and `rm` is skipped — preserving the new owner's entry.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said, udevd ensures order of rule execution. That is if everything is run through udevd all is good. The problem starts if there is another instance observing udev events and does something. With the nvme-discoverd on the horizon, this udev rule could get really nasty. Just thinking loud, maybe all is good.

Comment thread libnvme/REGISTRY.md Outdated
Add a concise overview of the NVMe controller ownership registry: why it
exists (orchestrator coexistence and the disconnect-all coordination
problem from issue linux-nvme#2913), the /run/nvme/registry layout, how an owner is
set, the udev-driven cleanup and its instance-recycling guard, and the
disconnect-all guardrails. The header kdoc and man pages remain the
authoritative reference.

Suggested-by: Daniel Wagner <wagi@kernel.org>
Signed-off-by: Martin Belanger <martin.belanger@dell.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants