doc: add libnvme registry overview (REGISTRY.md)#3467
Conversation
efdb24e to
b843f28
Compare
|
|
||
| ``` | ||
| ACTION=="remove", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", \ | ||
| RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'" |
There was a problem hiding this comment.
IIRC, I asked this in the PR, shouldn't we use nvme-cli here to remove the entry? I would like to log everything in one place, so journalctl -u nvme-cli (or whatever the command is) shows what happened. Having this only in a udev rule distributes the information and makes it hard to debug.
There was a problem hiding this comment.
Fair point on debuggability, but there's a structural problem with routing it through nvme. The command would be nvme registry delete, which lives in the registry plugin — and that plugin is optional, at your own request (you asked for the registry to be a plugin rather than a built-in). So a udev rule that shells out to nvme registry delete would make the registry's stale-entry cleanup depend on a command that isn't present on any system that didn't build/install that optional plugin. The cleanup is a property of the registry itself, which lives in libnvme — so it has to work whenever libnvme is installed, with no dependency on nvme-cli or an optional plugin, and no daemon. That's exactly why it's a libnvme-shipped udev rule.
It also wouldn't change the race: a command launched from the rule races the connect path identically — the [ -e ] guard is still what makes it correct. So we'd take on an optional-plugin dependency and gain nothing on correctness.
The debuggability concern is legit on its own, though — happy to solve "see what happened in one place" without that dependency, e.g. have the rule log via logger, or add a libnvme log sink.
There was a problem hiding this comment.
This also answers the natural follow-up — "if a udev rule removes entries, why doesn't one create them?" The two halves are placed where the information they need actually lives:
- Create needs the
owner— and the owner is purely in-process state of the connecting process, passed to libnvme via the context (ctx->owner). It is not in sysfs, not in theadduevent, nowhere a udev rule could read it (the kernel has no concept of "owner" — it's a libnvme-userspace notion). A rule firing fornvme4has no way to know who connected it, so it cannot write a correctowner=. Only libnvme, in the connect path, has the owner — which is why the entry is written there. (It's also written synchronously insideconnect(), so there is no window where a just-connected controller is briefly unowned and exposed todisconnect-all.) - Delete needs no
owner— it just removes the entry for that device name (with the[ -e ]recycling guard). No per-process knowledge is required, so a udev rule can do it.
So the asymmetry is deliberate, not an oversight: each operation lives where its required input is available.
| RUN+="/bin/sh -c '[ -e /dev/%k ] || rm -rf /run/nvme/registry/%k'" | ||
| ``` | ||
|
|
||
| **Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it. |
There was a problem hiding this comment.
The race window still exists. The only way to ensure this works is by serializing the events, which udevd does. So I don't think this argument is correct.
There was a problem hiding this comment.
I think the disconnect is one specific point: the new controller's registry entry is not written by a udev rule — it's written by libnvme inside the connect() path, synchronously in the connecting process, the moment the kernel assigns the (recycled) instance number. udevd never sees that write and never orders it. So udevd serializing udev events doesn't help here, because the actor that races the remove rule is the connect path, which is outside udev entirely.
Concretely, the dangerous interleaving for the name nvme4:
- Kernel removes the old
nvme4→ frees instance 4 → emits aremoveuevent (now queued in udevd). - Before udevd gets to that uevent, a new controller connects. The kernel allocates the lowest free instance = 4 again, creates
/dev/nvme4(devtmpfs, synchronous), and libnvme — in the connecting process, not via udev — writes/run/nvme/registry/nvme4/owner. - udevd now runs the queued
removerule for the oldnvme4. With no guard it doesrm -rf /run/nvme/registry/nvme4and deletes the new, live controller's entry.
udevd processing remove before add in order does not prevent step 3 from clobbering step 2, because step 2's write never came through udev. The only information available at step 3 that distinguishes "stale entry for the controller that's really gone" from "live entry for a controller that recycled the name" is whether the device node currently exists — which is exactly what [ -e /dev/%k ] tests. devtmpfs removes the node synchronously before KOBJ_REMOVE is emitted (device_del() calls devtmpfs_delete_node() before kobject_uevent(KOBJ_REMOVE)), so:
- genuinely-gone controller →
/dev/nvme4absent →rmruns (correct); - recycled name →
/dev/nvme4exists again →rmskipped (correct).
So the guard isn't a weaker substitute for serialization — it's the only mechanism that works, because the competing actor (the connect path) is not a udev event udevd can serialize. The residual window is the microsecond TOCTOU between the [ -e ] test and the rm; serialization can't close that either, for the same reason (the connect write isn't a udev event), and hitting it needs a connect ioctl to complete at that exact instant — negligible.
On the discoverd worry: this is actually why the guard is robust to it. Correctness depends on kernel devtmpfs state, not on udevd event ordering or on who is observing events. Another observer (e.g. discoverd) reconnecting a dropped controller just produces another connect → another /dev/nvmeN + libnvme entry; the [ -e ] check reflects that regardless of event timing. So it does not rely on serialization at all — which is the property we want once there are multiple observers.
|
|
||
| **Why the `[ -e /dev/%k ]` guard.** udev rules run asynchronously: udevd processes a `remove` uevent some time *after* the kernel emits it. Meanwhile the kernel is free to recycle the just-freed instance number — it allocates the lowest available id, so the very next controller to connect can be handed the same `nvmeN`. This opens a rare race: a controller is removed and, at almost the same instant, a new one is connected and inherits its id, all before udevd gets around to running the remove rule for the old controller. By the time that rule finally fires, `nvmeN` already refers to a *different*, live controller that has written its own registry entry — and a blind `rm` would clobber it. | ||
|
|
||
| The guard breaks the race by asking whether the device still exists at the moment the rule runs. devtmpfs removes a controller's device node *synchronously*, before the kernel emits `KOBJ_REMOVE`, so for the controller actually being removed `/dev/nvmeN` is already gone. If `/dev/nvmeN` is absent, the entry is genuinely stale and `rm` runs. If `/dev/nvmeN` exists, the id has been recycled to a new controller: `[ -e /dev/%k ]` is true, the `||` short-circuits, and `rm` is skipped — preserving the new owner's entry. |
There was a problem hiding this comment.
As I said, udevd ensures order of rule execution. That is if everything is run through udevd all is good. The problem starts if there is another instance observing udev events and does something. With the nvme-discoverd on the horizon, this udev rule could get really nasty. Just thinking loud, maybe all is good.
Add a concise overview of the NVMe controller ownership registry: why it exists (orchestrator coexistence and the disconnect-all coordination problem from issue linux-nvme#2913), the /run/nvme/registry layout, how an owner is set, the udev-driven cleanup and its instance-recycling guard, and the disconnect-all guardrails. The header kdoc and man pages remain the authoritative reference. Suggested-by: Daniel Wagner <wagi@kernel.org> Signed-off-by: Martin Belanger <martin.belanger@dell.com>
b843f28 to
29b9f8e
Compare
Add a concise overview of the NVMe controller ownership registry: why it exists (orchestrator coexistence and the disconnect-all coordination problem from issue #2913), the /run/nvme/registry layout, how an owner is set, the udev-driven cleanup and its instance-recycling guard, and the disconnect-all guardrails. The header kdoc and man pages remain the authoritative reference.
Suggested-by: Daniel Wagner wagi@kernel.org