feat: Gate Flow power/firmware ops on host assignment state #2117
Merged
jw-nvidia
approved these changes
Jun 2, 2026
33a4bce to bfa3ad3
Collaborator
@kunzhao-nv your commits are not signed, FYI
Power control, firmware update, and bring-up power-cycle the host or the rack's fabric/power feed, and are unsafe while any target host is still attached to a tenant Instance.

Add a per-Manager policy gate (ensureMachinesOperable for compute, ensureRackOperable for nvswitch/powershelf) that polls Core at a fixed interval and fails with a clear error after a 30-minute timeout if any relevant host stays in ManagedHostState::Assigned/*. The gate delegates to a new AssignmentChecker primitive in the nicoprovider package. The NSM-based NVSwitch manager now also requires the nico provider so the rack check can run.

Three nicoapi.Client methods are added for the rack-scope check: FindHostMachineIdsByRack, FindSwitchRackIDs, FindPowerShelfRackIDs - all using existing Core gRPC surface. Switches and power shelves not yet associated with a rack are logged and skipped rather than blocked.

An override_assignment_check bool is added to the five disruptive request messages (UpgradeFirmware, PowerOn/Off/Reset Rack, BringUpRack) and plumbed end-to-end. When set, the gate logs a warning and skips the AssignmentChecker. Defaults to false.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
bfa3ad3 to 9a9d8be
🔐 TruffleHog Secret Scan
✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉
🕐 Last updated: 2026-06-02 18:43:15 UTC | Commit: 9a9d8be
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Description
Adds a safety gate that refuses Flow's disruptive operations — power control, firmware update, and bring-up — while any affected host is still attached to a tenant Instance (ManagedHostState::Assigned/*). Why this matters: all three operations power-cycle the host.
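For rack-scoped targets, the set of affected hosts has to be resolved from the switch (or power shelf) to its rack, and then to that rack's hosts. The commit message says unassociated switches and power shelves are logged and skipped rather than blocked. A minimal sketch of that resolution follows; `rackLookup` and `hostsForSwitches` are hypothetical in-memory stand-ins, since the real nicoapi.Client signatures for FindSwitchRackIDs and FindHostMachineIdsByRack are not shown in this PR text.

```go
package main

import "log"

// rackLookup is a hypothetical, in-memory substitute for the Core lookups;
// it is not the nicoapi.Client API.
type rackLookup struct {
	switchRack map[string]string   // switch ID -> rack ID ("" or missing: no rack yet)
	rackHosts  map[string][]string // rack ID -> host machine IDs
}

// hostsForSwitches returns the deduplicated host IDs of every rack that the
// given switches belong to. Switches without a rack are logged and skipped.
func (r rackLookup) hostsForSwitches(switchIDs []string) []string {
	seenRack := map[string]bool{}
	seenHost := map[string]bool{}
	var hosts []string

	for _, sw := range switchIDs {
		rack := r.switchRack[sw]
		if rack == "" {
			log.Printf("switch %s has no rack association; skipping assignment check", sw)
			continue
		}
		if seenRack[rack] {
			continue // this rack's hosts were already collected
		}
		seenRack[rack] = true

		for _, h := range r.rackHosts[rack] {
			if !seenHost[h] {
				seenHost[h] = true
				hosts = append(hosts, h)
			}
		}
	}
	return hosts
}
```

The resulting host list is what an assignment check would then be run against.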
Adds an optional input flag override_assignment_check to bypass the check.
Type of Change
Related Issues (Optional)
Breaking Changes
Testing
Additional Notes
This is the first step for the feature. The followup: