doc: add libnvme registry overview (REGISTRY.md) by martin-belanger · Pull Request #3467 · linux-nvme/nvme-cli

martin-belanger · 2026-06-22T17:36:28Z

Add a concise overview of the NVMe controller ownership registry: why it exists (orchestrator coexistence and the disconnect-all coordination problem from issue #2913), the /run/nvme/registry layout, how an owner is set, the udev-driven cleanup and its instance-recycling guard, and the disconnect-all guardrails. The header kdoc and man pages remain the authoritative reference.

Suggested-by: Daniel Wagner wagi@kernel.org

igaw · 2026-06-22T19:50:01Z

+
+```
+ACTION=="remove", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", \
+    RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'"


IIRC, I asked this in the PR, shouldn't we use nvme-cli here to remove the entry? I would like to log everything in one place, so journalctl -u nvme-cli (or whatever the command is) shows what happened. Having this only in a udev rule distributes the information and makes it hard to debug.

Fair point on debuggability, but there's a structural problem with routing it through nvme. The command would be nvme registry delete, which lives in the registry plugin — and that plugin is optional, at your own request (you asked for the registry to be a plugin rather than a built-in). So a udev rule that shells out to nvme registry delete would make the registry's stale-entry cleanup depend on a command that isn't present on any system that didn't build/install that optional plugin. The cleanup is a property of the registry itself, which lives in libnvme — so it has to work whenever libnvme is installed, with no dependency on nvme-cli or an optional plugin, and no daemon. That's exactly why it's a libnvme-shipped udev rule.

It also wouldn't change the race: a command launched from the rule races the connect path identically — the [ -e ] guard is still what makes it correct. So we'd take on an optional-plugin dependency and gain nothing on correctness.

The debuggability concern is legit on its own, though — happy to solve "see what happened in one place" without that dependency, e.g. have the rule log via logger, or add a libnvme log sink.

This also answers the natural follow-up — "if a udev rule removes entries, why doesn't one create them?" The two halves are placed where the information they need actually lives:

Create needs the owner — and the owner is purely in-process state of the connecting process, passed to libnvme via the context (ctx->owner). It is not in sysfs, not in the add uevent, nowhere a udev rule could read it (the kernel has no concept of "owner" — it's a libnvme-userspace notion). A rule firing for nvme4 has no way to know who connected it, so it cannot write a correct owner=. Only libnvme, in the connect path, has the owner — which is why the entry is written there. (It's also written synchronously inside connect(), so there is no window where a just-connected controller is briefly unowned and exposed to disconnect-all.)

Delete needs no owner — it just removes the entry for that device name (with the [ -e ] recycling guard). No per-process knowledge is required, so a udev rule can do it.

So the asymmetry is deliberate, not an oversight: each operation lives where its required input is available.

igaw · 2026-06-22T19:52:54Z

+    RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'"
+```
+
+**Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it.


The race window still exists. The only way to ensure this works is by serializing the events, which udevd does. So I don't think this argument is correct.

I think the disconnect is one specific point: the new controller's registry entry is not written by a udev rule — it's written by libnvme inside the connect() path, synchronously in the connecting process, the moment the kernel assigns the (recycled) instance number. udevd never sees that write and never orders it. So udevd serializing udev events doesn't help here, because the actor that races the remove rule is the connect path, which is outside udev entirely.

Concretely, the dangerous interleaving for the name nvme4:

Kernel removes the old nvme4 → frees instance 4 → emits a remove uevent (now queued in udevd).

Before udevd gets to that uevent, a new controller connects. The kernel allocates the lowest free instance = 4 again, creates /dev/nvme4 (devtmpfs, synchronous), and libnvme — in the connecting process, not via udev — writes /run/nvme/registry/nvme4/owner.

udevd now runs the queued remove rule for the old nvme4. With no guard it does rm -rf /run/nvme/registry/nvme4 and deletes the new, live controller's entry.

udevd processing remove before add in order does not prevent step 3 from clobbering step 2, because step 2's write never came through udev. The only information available at step 3 that distinguishes "stale entry for the controller that's really gone" from "live entry for a controller that recycled the name" is whether the device node currently exists — which is exactly what [ -e /dev/%k ] tests. devtmpfs removes the node synchronously before KOBJ_REMOVE is emitted (device_del() calls devtmpfs_delete_node() before kobject_uevent(KOBJ_REMOVE)), so:

genuinely-gone controller → /dev/nvme4 absent → rm runs (correct);

recycled name → /dev/nvme4 exists again → rm skipped (correct).

So the guard isn't a weaker substitute for serialization — it's the only mechanism that works, because the competing actor (the connect path) is not a udev event udevd can serialize. The residual window is the microsecond TOCTOU between the [ -e ] test and the rm; serialization can't close that either, for the same reason (the connect write isn't a udev event), and hitting it needs a connect ioctl to complete at that exact instant — negligible.

On the discoverd worry: this is actually why the guard is robust to it. Correctness depends on kernel devtmpfs state, not on udevd event ordering or on who is observing events. Another observer (e.g. discoverd) reconnecting a dropped controller just produces another connect → another /dev/nvmeN + libnvme entry; the [ -e ] check reflects that regardless of event timing. So it does not rely on serialization at all — which is the property we want once there are multiple observers.

igaw · 2026-06-22T19:55:39Z

+
+**Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it.
+
+The guard breaks the race by asking whether the device still exists at the moment the rule runs. devtmpfs removes a controller's device node *synchronously*, before the kernel emits `KOBJ_REMOVE`, so for the controller actually being removed `/dev/nvmeN` is already gone. If `/dev/nvmeN` is absent, the entry is genuinely stale and `rm` runs. If `/dev/nvmeN` exists, the id has been recycled to a new controller: `[ -e /dev/%k ]` is true, the `||` short-circuits, and `rm` is skipped — preserving the new owner's entry.


As I said, udevd ensures order of rule execution. That is if everything is run through udevd all is good. The problem starts if there is another instance observing udev events and does something. With the nvme-discoverd on the horizon, this udev rule could get really nasty. Just thinking loud, maybe all is good.

Add a concise overview of the NVMe controller ownership registry: why it exists (orchestrator coexistence and the disconnect-all coordination problem from issue linux-nvme#2913), the /run/nvme/registry layout, how an owner is set, the udev-driven cleanup and its instance-recycling guard, and the disconnect-all guardrails. The header kdoc and man pages remain the authoritative reference. Suggested-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Martin Belanger <martin.belanger@dell.com>

martin-belanger force-pushed the registry-docs branch from efdb24e to b843f28 Compare June 22, 2026 17:40

igaw reviewed Jun 22, 2026

View reviewed changes

Comment thread libnvme/REGISTRY.md Outdated

igaw reviewed Jun 22, 2026

View reviewed changes

Comment thread libnvme/REGISTRY.md Outdated

martin-belanger force-pushed the registry-docs branch from b843f28 to 29b9f8e Compare June 22, 2026 20:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

doc: add libnvme registry overview (REGISTRY.md)#3467

doc: add libnvme registry overview (REGISTRY.md)#3467
martin-belanger wants to merge 1 commit into
linux-nvme:masterfrom
martin-belanger:registry-docs

martin-belanger commented Jun 22, 2026

Uh oh!

Uh oh!

igaw Jun 22, 2026

Uh oh!

martin-belanger Jun 22, 2026

Uh oh!

martin-belanger Jun 22, 2026

Uh oh!

igaw Jun 22, 2026

Uh oh!

martin-belanger Jun 22, 2026

Uh oh!

igaw Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		Why the `[ -e /dev/%k ]` guard. udev rules run asynchronously: udevd processes a `remove` uevent some time after the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a different, live controller that has written its own registry entry — and a blind `rm` would clobber it.

		The guard breaks the race by asking whether the device still exists at the moment the rule runs. devtmpfs removes a controller's device node synchronously, before the kernel emits `KOBJ_REMOVE`, so for the controller actually being removed `/dev/nvmeN` is already gone. If `/dev/nvmeN` is absent, the entry is genuinely stale and `rm` runs. If `/dev/nvmeN` exists, the id has been recycled to a new controller: `[ -e /dev/%k ]` is true, the `\|\|` short-circuits, and `rm` is skipped — preserving the new owner's entry.

Conversation

martin-belanger commented Jun 22, 2026

Uh oh!

Uh oh!

igaw Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

martin-belanger Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

martin-belanger Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

igaw Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

martin-belanger Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

igaw Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants