feat: Gate Flow power/firmware ops on host assignment state #2117

Merged
kunzhao-nv merged 3 commits into NVIDIA:main from kunzhao-nv:feat/flow-check-assigned
Jun 2, 2026

@kunzhao-nv kunzhao-nv commented Jun 2, 2026

Contributor

Description

Adds a safety gate that refuses Flow's disruptive operations (power control, firmware update, and bring-up) while any affected host is still attached to a tenant Instance (ManagedHostState::Assigned/*). This matters because all three operations power-cycle the host.

Adds an optional input flag, override_assignment_check, to bypass the check.
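
For illustration only, the gate's decision can be sketched as below. This is not the code in this PR: the string form of the state ("Assigned" or "Assigned/..."), and the names isAssigned and checkOperable, are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isAssigned reports whether a managed-host state denotes a host attached to
// a tenant Instance. Representing the state as an "Assigned/..." string is an
// assumption of this sketch.
func isAssigned(state string) bool {
	return state == "Assigned" || strings.HasPrefix(state, "Assigned/")
}

// checkOperable (hypothetical name) returns an error listing every host that
// is still assigned. When override is true the check is skipped entirely.
func checkOperable(states map[string]string, override bool) error {
	if override {
		return nil
	}
	var blocked []string
	for id, s := range states {
		if isAssigned(s) {
			blocked = append(blocked, id)
		}
	}
	if len(blocked) == 0 {
		return nil
	}
	sort.Strings(blocked) // stable error text regardless of map order
	return fmt.Errorf("hosts still assigned to a tenant instance: %s",
		strings.Join(blocked, ", "))
}

func main() {
	states := map[string]string{"host-a": "Ready", "host-b": "Assigned/Ready"}
	fmt.Println(checkOperable(states, false)) // refused: host-b is assigned
	fmt.Println(checkOperable(states, true))  // override set: check skipped
}
```

With the override left at its default, a single assigned host is enough to refuse the whole operation.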

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

This is the first step of the feature. Follow-ups:

  1. Support it from the REST API interface.
  2. Potentially add checks for some predefined exceptional cases.

@kunzhao-nv kunzhao-nv requested a review from a team as a code owner June 2, 2026 18:09
copy-pr-bot Bot commented Jun 2, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kunzhao-nv kunzhao-nv force-pushed the feat/flow-check-assigned branch from 33a4bce to bfa3ad3 Compare June 2, 2026 18:36
ajf commented Jun 2, 2026

Collaborator

@kunzhao-nv your commits are not signed, FYI

Power control, firmware update, and bring-up power-cycle the host or
the rack's fabric/power feed, and are unsafe while any target host is
still attached to a tenant Instance.

Add a per-Manager policy gate (ensureMachinesOperable for compute,
ensureRackOperable for nvswitch/powershelf) that polls Core at a
fixed interval and fails with a clear error after a 30-minute timeout
if any relevant host stays in ManagedHostState::Assigned/*. The gate
delegates to a new AssignmentChecker primitive in the nicoprovider
package. The NSM-based NVSwitch manager now also requires the nico
provider so the rack check can run.

Three nicoapi.Client methods are added for the rack-scope check:
FindHostMachineIdsByRack, FindSwitchRackIDs, and FindPowerShelfRackIDs,
all using the existing Core gRPC surface. Switches and power shelves not
yet associated with a rack are logged and skipped rather than blocked.

An override_assignment_check bool is added to the five disruptive
request messages (UpgradeFirmware, PowerOn/Off/Reset Rack,
BringUpRack) and plumbed end-to-end. When set, the gate logs a
warning and skips the AssignmentChecker. Defaults to false.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
@kunzhao-nv kunzhao-nv force-pushed the feat/flow-check-assigned branch from bfa3ad3 to 9a9d8be Compare June 2, 2026 18:41
github-actions Bot commented Jun 2, 2026


🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-06-02 18:43:15 UTC | Commit: 9a9d8be

@kunzhao-nv kunzhao-nv enabled auto-merge (squash) June 2, 2026 18:45
github-actions Bot commented Jun 2, 2026


🔍 Container Scan Summary

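| Service | Total | Critical | High | Medium | Low | Other |
|---|---|---|---|---|---|---|
| nico-flow | 66 | 4 | 34 | 18 | 2 | 8 |
| nico-nsm | 82 | 2 | 28 | 43 | 9 | 0 |
| nico-psm | 67 | 4 | 35 | 18 | 2 | 8 |
| nico-rest-api | 100 | 6 | 53 | 30 | 3 | 8 |
| nico-rest-cert-manager | 65 | 4 | 34 | 18 | 1 | 8 |
| nico-rest-db | 66 | 4 | 34 | 18 | 2 | 8 |
| nico-rest-site-agent | 65 | 4 | 34 | 18 | 1 | 8 |
| nico-rest-site-manager | 65 | 4 | 34 | 18 | 1 | 8 |
| nico-rest-workflow | 67 | 4 | 35 | 18 | 2 | 8 |
| TOTAL | 643 | 36 | 321 | 199 | 23 | 64 |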

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@kunzhao-nv kunzhao-nv merged commit b9748c5 into NVIDIA:main Jun 2, 2026
88 checks passed
3 participants