Skip to content

Fix nvme instance storage discovery#407

Merged
neddp merged 13 commits into
cloudfoundry:fix-nvme-instance-storage-discoveryfrom
Ivaylogi98:fix-nvme-instance-storage-discovery
Apr 30, 2026
Merged

Fix nvme instance storage discovery#407
neddp merged 13 commits into
cloudfoundry:fix-nvme-instance-storage-discoveryfrom
Ivaylogi98:fix-nvme-instance-storage-discovery

Conversation

@Ivaylogi98

@Ivaylogi98 Ivaylogi98 commented Feb 23, 2026

Copy link
Copy Markdown
Contributor

Refactor NVMe instance storage discovery to be IaaS-agnostic

Refactors #396 to address review feedback from @rkoster.

What changed from #396

IaaS-agnostic design — the AWS-specific awsNVMeInstanceStorageResolver and its hardcoded EBS pattern are replaced by a generic SymlinkDeviceResolver in infrastructure/devicepathresolver. The managed volume
pattern (/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_) and device pattern (/dev/nvmen1) are passed in entirely via agent config (InstanceStorageManagedVolumePattern, InstanceStorageDevicePattern), set by
the stemcell builder's bosh_aws_agent_settings stage — not hardcoded in the resolver.

udev race condition fixed — the original PR globbed /dev/disk/by-id/ symlinks without waiting for udev to finish creating them. On fast boots, this returned 0 EBS symlinks, causing all NVMe devices (including
the root EBS volume) to be misidentified as instance storage. udevadm trigger + udevadm settle are now called before globbing.

Unit tests fixed — all compilation errors and test failures from #396 are resolved.

Algorithm

  1. Glob all NVMe devices: /dev/nvme*n1
  2. Run udevadm trigger + udevadm settle
  3. Glob EBS symlinks via InstanceStorageManagedVolumePattern
  4. Resolve each symlink to its target device
  5. Subtract EBS devices from all NVMe → remainder is instance storage
  6. Partition only the discovered instance storage devices

Verified using m6id.large

Enabled raw_instance_storage: true

3 NVMe devices present:

  • /dev/nvme0n1 — root EBS (5G) → correctly excluded
  • /dev/nvme1n1 — ephemeral EBS (50G) → correctly excluded
  • /dev/nvme2n1 — instance storage (118G) → correctly discovered and set up as raw ephemeral

Used Claude Code

@Ivaylogi98 Ivaylogi98 force-pushed the fix-nvme-instance-storage-discovery branch from b79e5c1 to 6ab6cd7 Compare February 25, 2026 12:41
@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6e5322f0-a1b6-444a-b775-810c1a70fe82

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Ivaylogi98 Ivaylogi98 marked this pull request as ready for review April 30, 2026 12:15
@github-project-automation github-project-automation Bot moved this from Inbox to Pending Merge | Prioritized in Foundational Infrastructure Working Group Apr 30, 2026
@neddp neddp merged commit bb98869 into cloudfoundry:fix-nvme-instance-storage-discovery Apr 30, 2026
8 checks passed
@github-project-automation github-project-automation Bot moved this from Pending Merge | Prioritized to Done in Foundational Infrastructure Working Group Apr 30, 2026
neddp added a commit that referenced this pull request Jun 12, 2026
…nks (#396)

* Implement runtime NVMe instance storage discovery using EBS symlinks

* Fix nvme instance storage discovery (#400)

* Refactor instance storage discovery into configurable component

Implement auto-detection for instance storage disk type

* Fix windows tests

* Fix windows tests (but for real this time)

* Remove leftover path normalization (#401)

* Fix nvme instance storage discovery (#407)

* Make implementation iaas-agnostic

* Rename storage resolver files

* Fix tests

* Remove instance storage resolver

* Don't use the aws pattern as default

* Refactor NVMe instance storage discovery and remove unused symlink patterns

* Enhance NVMe instance storage discovery with managed volume pattern support

* Fix unit tests

* Don't run windows unit tests when not supported

* Simplify FakeDevicePathResolver by removing unused fields and methods

* Wait for udev to settle before resolving EBS symlinks

* Add debug logs

* Import udev and add comment about why it's needed

* Fix missing closing bracket

* Update infrastructure/devicepathresolver/symlink_device_resolver.go

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>

* Update infrastructure/devicepathresolver/symlink_device_resolver_test.go

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>

* Fix lint identation

* Fix: skip unresolvable symlinks instead of returning error

ResolveSymlinksToDevices now logs a warning and continues when a
symlink cannot be resolved (e.g. stale/broken symlinks in
/dev/disk/by-id/). This prevents unnecessary deploy failures while
the count validation in discoverNVMeInstanceStorage still catches
any real mismatches.

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>

* Use the already constructed udev instance

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix: use %+v instead of %s for DiskSettings in error message

DiskSettings does not implement fmt.Stringer, so using %s produced
malformed %!s(...) output. Switch to %+v for actionable error messages.

* fix: handle timedOut in discoverIdentityInstanceStorage

* resolver: fail hard on unresolvable symlinks in ResolveSymlinksToDevices

Silently skipping a symlink that cannot be resolved leaves the managed
device exclusion set incomplete. If an EBS volume's by-id symlink is
broken, FilterDevices would not exclude it and the device could be
misidentified as instance storage, potentially causing data loss.

Return a wrapped error instead of continuing, so callers can propagate
the failure rather than proceeding with a partial exclusion set.

* test: update linux_platform test for hard-fail on broken symlinks

The platform-level test still expected the old skip-and-continue behavior.
Updated to assert that a broken managed volume symlink returns an error.

* fix: prioritize timedOut over err; assert nil result on resolver error

- discoverIdentityInstanceStorage: check timedOut before err so the
  explicit timeout error is never shadowed when both are set
- symlink_device_resolver_test: assert result is nil on failure to
  verify the fail-closed contract

---------

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
neddp added a commit that referenced this pull request Jun 14, 2026
…nks (#396)

* Implement runtime NVMe instance storage discovery using EBS symlinks

* Fix nvme instance storage discovery (#400)

* Refactor instance storage discovery into configurable component

Implement auto-detection for instance storage disk type

* Fix windows tests

* Fix windows tests (but for real this time)

* Remove leftover path normalization (#401)

* Fix nvme instance storage discovery (#407)

* Make implementation iaas-agnostic

* Rename storage resolver files

* Fix tests

* Remove instance storage resolver

* Don't use the aws pattern as default

* Refactor NVMe instance storage discovery and remove unused symlink patterns

* Enhance NVMe instance storage discovery with managed volume pattern support

* Fix unit tests

* Don't run windows unit tests when not supported

* Simplify FakeDevicePathResolver by removing unused fields and methods

* Wait for udev to settle before resolving EBS symlinks

* Add debug logs

* Import udev and add comment about why it's needed

* Fix missing closing bracket

* Update infrastructure/devicepathresolver/symlink_device_resolver.go

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>

* Update infrastructure/devicepathresolver/symlink_device_resolver_test.go

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>

* Fix lint identation

* Fix: skip unresolvable symlinks instead of returning error

ResolveSymlinksToDevices now logs a warning and continues when a
symlink cannot be resolved (e.g. stale/broken symlinks in
/dev/disk/by-id/). This prevents unnecessary deploy failures while
the count validation in discoverNVMeInstanceStorage still catches
any real mismatches.

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>

* Use the already constructed udev instance

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

* fix: use %+v instead of %s for DiskSettings in error message

DiskSettings does not implement fmt.Stringer, so using %s produced
malformed %!s(...) output. Switch to %+v for actionable error messages.

* fix: handle timedOut in discoverIdentityInstanceStorage

* resolver: fail hard on unresolvable symlinks in ResolveSymlinksToDevices

Silently skipping a symlink that cannot be resolved leaves the managed
device exclusion set incomplete. If an EBS volume's by-id symlink is
broken, FilterDevices would not exclude it and the device could be
misidentified as instance storage, potentially causing data loss.

Return a wrapped error instead of continuing, so callers can propagate
the failure rather than proceeding with a partial exclusion set.

* test: update linux_platform test for hard-fail on broken symlinks

The platform-level test still expected the old skip-and-continue behavior.
Updated to assert that a broken managed volume symlink returns an error.

* fix: prioritize timedOut over err; assert nil result on resolver error

- discoverIdentityInstanceStorage: check timedOut before err so the
  explicit timeout error is never shadowed when both are set
- symlink_device_resolver_test: assert result is nil on failure to
  verify the fail-closed contract

---------

Co-authored-by: Ivaylo Ivanov <ivaylogi98@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants