This repository was archived by the owner on Jun 2, 2026. It is now read-only.

feat: Gate Flow power/firmware ops on host assignment state #581

Open
kunzhao-nv wants to merge 5 commits into main from feat/flow-check-assigned

Conversation

@kunzhao-nv kunzhao-nv commented May 28, 2026

Description

Adds a safety gate that refuses Flow's disruptive operations — power control, firmware update, and bring-up — while any affected host is still attached to a tenant Instance (ManagedHostState::Assigned/*). Why this matters: all three operations power-cycle the host.

Adds an optional input flag, override_assignment_check, to bypass the check.

Type of Change

  • Feature - New feature or functionality (feat:)
  • Fix - Bug fixes (fix:)
  • Chore - Modification or removal of existing functionality (chore:)
  • Refactor - Refactoring of existing functionality (refactor:)
  • Docs - Changes in documentation or OpenAPI schema (docs:)
  • CI - Changes in GitHub workflows. Requires additional scrutiny (ci:)
  • Version - Issuing a new release version (version:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • Flow - Flow service updated
  • Powershelf Manager - Powershelf Manager updated
  • NVSwitch Manager - NVSwitch Manager updated

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

@kunzhao-nv kunzhao-nv requested a review from a team as a code owner May 28, 2026 01:03

copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented May 28, 2026

Walkthrough

This PR implements rack-assignment safety gates for compute and fabric operations. It extends the NICo API with topology queries (host machines by rack, switch-to-rack, power-shelf-to-rack mappings), introduces an AssignmentChecker that polls for machines and racks to exit assigned lifecycle states, and integrates this checker across compute, NVSwitch, and power-shelf component managers to prevent disruptive operations during active tenant assignments.

Changes

Rack Assignment Safety Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| NICo topology query API and implementations: flow/internal/nicoapi/mod.go, flow/internal/nicoapi/grpc.go, flow/internal/nicoapi/mock.go | Extends the Client interface with FindHostMachineIdsByRack, FindSwitchRackIDs, FindPowerShelfRackIDs and adds mock state and setters. Implements GRPC helpers that filter/omit unmapped entries. |
| AssignmentChecker state polling implementation: flow/internal/task/componentmanager/providers/nico/assignment.go | Defines AssignmentChecker with configurable timeout/interval, IsAssignedState predicate, WaitForMachinesUnassigned polling (dedupe, deadline, transient-missing tolerant), WaitForRacksUnassigned resolving racks→machines, and helpers sleep/dedupSorted. |
| AssignmentChecker unit tests: flow/internal/task/componentmanager/providers/nico/assignment_test.go | Covers IsAssignedState, short-circuits (nil/empty inputs), success/timeouts, missing-state handling, rack-resolution, empty-rack pass-through, and context-cancellation behavior. |
| Compute NICo component host-level assignment gates: flow/internal/task/componentmanager/compute/nico/nico.go, flow/internal/task/componentmanager/compute/nico/nico_test.go | Manager adds AssignmentChecker. PowerControl, FirmwareControl, and BringUpControl call ensureMachinesOperable (WaitForMachinesUnassigned) and return refused: errors when targets remain assigned. Tests validate refusal/allow scenarios. |
| NVSwitch NICo component rack-level assignment gates: flow/internal/task/componentmanager/nvswitch/nico/nico.go, flow/internal/task/componentmanager/nvswitch/nico/nico_test.go | Manager adds AssignmentChecker and ensureRackOperable which maps switches→racks, logs orphan switches, and waits for racks to be unassigned before disruptive operations. Tests added for refusal and allowance cases. |
| NVSwitch Manager rack-level assignment gates with dual provider wiring: flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go | Manager now accepts NSM and NICo clients and initializes AssignmentChecker. Factory retrieves both providers and Descriptor lists both required providers. PowerControl and FirmwareControl invoke ensureRackOperable before NSM operations. |
| Power shelf NICo component rack-level assignment gates: flow/internal/task/componentmanager/powershelf/nico/nico.go, flow/internal/task/componentmanager/powershelf/nico/nico_test.go | Manager adds AssignmentChecker. PowerControl and FirmwareControl call ensureRackOperable resolving shelves→racks and block when racks/hosts remain assigned; tests cover refusal and allow scenarios with short-timeout helpers. |
| Test fixture update for dual-provider dependency: flow/internal/task/componentmanager/builtin/builtin_test.go | Updates TestServiceCatalog nvswitch/nvswitchmanager fixture to expect both nsmprovider and nicoprovider in RequiredProviders. |
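
The polling behaviour summarised for AssignmentChecker can be pictured roughly as follows. This is a minimal sketch, not the code in assignment.go: the `stateLookup` callback, the lowercase function names, the string form of the lifecycle state, the error wording, and the choice to treat an unknown machine as "not assigned" are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"strings"
	"time"
)

// stateLookup is a hypothetical stand-in for the NICo client call that returns
// a machine's lifecycle state; ok is false when the machine is unknown.
type stateLookup func(ctx context.Context, machineID string) (state string, ok bool, err error)

// isAssignedState reports whether state is "Assigned" or an "Assigned/..."
// sub-state (an assumed string form of ManagedHostState::Assigned/*).
func isAssignedState(state string) bool {
	return state == "Assigned" || strings.HasPrefix(state, "Assigned/")
}

// dedupSorted returns the unique, non-empty IDs in sorted order.
func dedupSorted(ids []string) []string {
	seen := make(map[string]struct{}, len(ids))
	out := make([]string, 0, len(ids))
	for _, id := range ids {
		if _, dup := seen[id]; dup || id == "" {
			continue
		}
		seen[id] = struct{}{}
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

// waitForMachinesUnassigned polls until no machine is assigned, the timeout
// elapses, or ctx is cancelled.
func waitForMachinesUnassigned(ctx context.Context, lookup stateLookup, ids []string, timeout, interval time.Duration) error {
	ids = dedupSorted(ids)
	if len(ids) == 0 {
		return nil // nothing to check
	}
	deadline := time.Now().Add(timeout)
	for {
		var assigned []string
		for _, id := range ids {
			state, ok, err := lookup(ctx, id)
			if err != nil {
				return err
			}
			// Assumption: a machine missing from the lookup is tolerated.
			if ok && isAssignedState(state) {
				assigned = append(assigned, id)
			}
		}
		if len(assigned) == 0 {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("refused: machines still in Assigned state: %s", strings.Join(assigned, ", "))
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// exampleWait runs the sketch against an in-memory state table with a short timeout.
func exampleWait(states map[string]string) error {
	ids := make([]string, 0, len(states))
	for id := range states {
		ids = append(ids, id)
	}
	lookup := func(_ context.Context, id string) (string, bool, error) {
		state, ok := states[id]
		return state, ok, nil
	}
	return waitForMachinesUnassigned(context.Background(), lookup, ids, 30*time.Millisecond, 5*time.Millisecond)
}
```

Calling `exampleWait` with a host in "Ready" returns nil immediately, while a host stuck in "Assigned/Provisioning" yields a refusal error once the short deadline passes.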

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 31.82% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title 'feat: Gate Flow power/firmware ops on host assignment state' accurately describes the primary change: introducing a safety gate for Flow's disruptive operations based on host assignment state. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly articulates the purpose: adding a safety gate to refuse disruptive operations while hosts remain assigned to tenant instances, with specific rationale and scope. |

github-actions Bot commented May 28, 2026

Test Results

9 867 tests  +34   9 867 ✅ +34   7m 34s ⏱️ +33s
  161 suites ± 0       0 💤 ± 0 
   14 files   ± 0       0 ❌ ± 0 

Results for commit 1fc3a0c. ± Comparison against base commit d3a07c6.

♻️ This comment has been updated with latest results.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
flow/internal/task/componentmanager/compute/nico/nico_test.go (1)

544-558: 💤 Low value

Consider adding a test for FirmwareControl refusal when assigned.

The test suite covers PowerControl and BringUpControl refusal scenarios, but FirmwareControl also invokes ensureMachinesOperable. Adding a parallel test would ensure complete coverage of the safety gate across all three operations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/compute/nico/nico_test.go` around lines
544 - 558, Add a parallel unit test that verifies FirmwareControl refuses
machines in the "Assigned/Provisioning" state: create a mock client via
nicoapi.NewMockClient(), AddMachine with MachineDetail{MachineID: "machine-1",
State: "Assigned/Provisioning"}, construct the manager using
newManagerForSafetyTest(t, client), build a common.Target with Type
devicetypes.ComponentTypeCompute and ComponentIDs ["machine-1"], call
m.FirmwareControl(context.Background(), target) and assert an error is returned
and its message contains both "refused" and "Assigned state" (mirroring the
existing TestBringUpControl_RefusesAssignedMachine pattern and exercising
ensureMachinesOperable).
flow/internal/task/componentmanager/nvswitch/nico/nico.go (1)

144-183: 💤 Low value

Minor inefficiency: rackIDs may contain duplicates when multiple switches belong to the same rack.

While WaitForRacksUnassigned internally deduplicates via dedupSorted, the slice passed to it could contain duplicates. This is a minor inefficiency—the downstream call handles it—but deduplicating here would improve clarity and reduce unnecessary iterations.

♻️ Proposed optimization
-	rackIDs := make([]string, 0, len(rackBySwitch))
-	for _, rid := range rackBySwitch {
-		rackIDs = append(rackIDs, rid)
-	}
+	seen := make(map[string]struct{}, len(rackBySwitch))
+	rackIDs := make([]string, 0, len(rackBySwitch))
+	for _, rid := range rackBySwitch {
+		if _, ok := seen[rid]; !ok {
+			seen[rid] = struct{}{}
+			rackIDs = append(rackIDs, rid)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico.go` around lines 144 -
183, The rackIDs slice built in ensureRackOperable may contain duplicate rack
IDs when multiple switches map to the same rack; before calling
m.assignment.WaitForRacksUnassigned(ctx, rackIDs) deduplicate rackIDs (e.g. use
a temporary map[string]struct{} to collect unique IDs or sort+unique) so you
pass only unique rack IDs. Update the logic around rackBySwitch and rackIDs
(keeping orphan handling unchanged) to populate a unique set and then convert it
back to a slice for the WaitForRacksUnassigned call.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flow/internal/nicoapi/mock.go`:
- Around line 146-149: The mock implementation of
mockClient.FindHostMachineIdsByRack currently returns (nil, nil) when rackID ==
"" which diverges from the real client that rejects empty rack IDs; update
mockClient.FindHostMachineIdsByRack to return the same error the real client
uses for invalid/empty rack IDs (e.g., ErrInvalidRackID or a descriptive
fmt.Errorf("empty rackID")) instead of nil, so tests mirror production
validation behavior.

In `@flow/internal/task/componentmanager/providers/nico/assignment.go`:
- Around line 123-125: Split the inline assignment in the if-statement: call
sleep(ctx, c.pollInterval) and assign its result to a named variable (e.g., err)
on its own line, then follow with a separate if err != nil { return err } check;
update the code around the sleep(ctx, c.pollInterval) invocation in
assignment.go so the assignment and condition are two statements (referencing
the sleep function, ctx, and c.pollInterval).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3ecf4648-0d82-4bd5-9b37-db4f9c8c3fa1

📥 Commits

Reviewing files that changed from the base of the PR and between 10d3187 and 675da63.

📒 Files selected for processing (13)
  • flow/internal/nicoapi/grpc.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/task/componentmanager/builtin/builtin_test.go
  • flow/internal/task/componentmanager/compute/nico/nico.go
  • flow/internal/task/componentmanager/compute/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go
  • flow/internal/task/componentmanager/powershelf/nico/nico.go
  • flow/internal/task/componentmanager/powershelf/nico/nico_test.go
  • flow/internal/task/componentmanager/providers/nico/assignment.go
  • flow/internal/task/componentmanager/providers/nico/assignment_test.go

Comment thread flow/internal/nicoapi/mock.go
Comment thread flow/internal/task/componentmanager/providers/nico/assignment.go Outdated
Disruptive operations on a Machine — power control, firmware update,
and bring-up — power-cycle the host, which is unsafe to run while
that host is still attached to a tenant Instance. The same applies, at
rack scope, to NVSwitch and PowerShelf operations: those reset the
rack's fabric/power feed, so any Assigned host on the rack would see
the disturbance.

This change adds a per-Manager policy gate (`ensureMachinesOperable`
for compute, `ensureRackOperable` for nvswitch/powershelf) that
blocks the operation until every relevant host has left
`ManagedHostState::Assigned/*`, polling Core at a fixed interval and
failing the task with a clear error after a 30-minute timeout. Today
the gate delegates to a new `AssignmentChecker` primitive in the
nicoprovider package; future operator-approved overrides will be
composed inside the same `ensure...Operable` helpers without touching
the call sites.

For the NSM-based NVSwitch manager (which previously had no Core
client), the `nico` provider is now also required so the rack check
can run; the existing `nico`-based managers already had it.

To support the rack-scope check, three nicoapi.Client methods are
added: `FindHostMachineIdsByRack`, `FindSwitchRackIDs`, and
`FindPowerShelfRackIDs`. Mock-only setters mirror them so tests can
configure rack topology without standing up a fake Core. The mock's
`FindHostMachineIdsByRack` rejects empty rack IDs to match the gRPC
client, so tests can't pass an empty input silently and miss the
production validation path.
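
A mock that mirrors the production validation could look roughly like the sketch below. The method names `FindHostMachineIdsByRack` and `SetRackHostMachineIDs` come from this PR; the struct layout, the exact signatures, and the `errEmptyRackID` sentinel are assumptions, since the real client's error value is not shown here.

```go
package main

import (
	"context"
	"errors"
	"sync"
)

// errEmptyRackID is a hypothetical sentinel; the error the real gRPC client
// returns for an empty rack ID is not shown in this PR.
var errEmptyRackID = errors.New("rack ID must not be empty")

// mockClient is an illustrative in-memory stand-in for the nicoapi mock.
type mockClient struct {
	mu        sync.RWMutex
	rackHosts map[string][]string
}

func newMockClient() *mockClient {
	return &mockClient{rackHosts: make(map[string][]string)}
}

// SetRackHostMachineIDs is the mock-only setter used to configure topology.
func (m *mockClient) SetRackHostMachineIDs(rackID string, machineIDs []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.rackHosts[rackID] = append([]string(nil), machineIDs...)
}

// FindHostMachineIdsByRack rejects an empty rack ID instead of returning
// (nil, nil), so tests hit the same validation path as production.
func (m *mockClient) FindHostMachineIdsByRack(_ context.Context, rackID string) ([]string, error) {
	if rackID == "" {
		return nil, errEmptyRackID
	}
	m.mu.RLock()
	defer m.mu.RUnlock()
	return append([]string(nil), m.rackHosts[rackID]...), nil
}

// exampleMock configures one rack and queries it with a valid and an empty ID.
func exampleMock() (hosts []string, emptyErr error) {
	c := newMockClient()
	c.SetRackHostMachineIDs("rack-A", []string{"host-1"})
	hosts, _ = c.FindHostMachineIdsByRack(context.Background(), "rack-A")
	_, emptyErr = c.FindHostMachineIdsByRack(context.Background(), "")
	return hosts, emptyErr
}
```

Returning copies of the stored slices keeps a test from mutating the mock's state by accident.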

Switches and power shelves that Core does not yet associate with a
rack (e.g. mid bring-up, pre-ingest) are logged and skipped rather
than blocked: failing closed on a missing rack association would
deadlock the very flow that's supposed to populate it.
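
The log-and-skip handling of unmapped components can be sketched as below. This is an illustration only: `racksForSwitches` is a hypothetical helper name, and the input map stands in for whatever `FindSwitchRackIDs` returns in the real code.

```go
package main

import (
	"log"
	"sort"
)

// racksForSwitches splits the requested switch IDs into the unique racks that
// own them and the "orphan" switches that have no rack association yet.
// Orphans are logged and skipped rather than treated as a failure.
func racksForSwitches(switchIDs []string, rackBySwitch map[string]string) (rackIDs, orphans []string) {
	seen := make(map[string]struct{})
	for _, sw := range switchIDs {
		rid, ok := rackBySwitch[sw]
		if !ok || rid == "" {
			orphans = append(orphans, sw)
			continue
		}
		// Several switches may share a rack; keep each rack once.
		if _, dup := seen[rid]; !dup {
			seen[rid] = struct{}{}
			rackIDs = append(rackIDs, rid)
		}
	}
	sort.Strings(rackIDs)
	sort.Strings(orphans)
	if len(orphans) > 0 {
		log.Printf("skipping assignment check for switches with no rack association: %v", orphans)
	}
	return rackIDs, orphans
}
```

With switches sw-1 and sw-2 mapped to rack-A and sw-3 unmapped, the helper returns a single rack ID plus one orphan, so only rack-A would be passed on to the rack-level wait.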

Tests cover the AssignmentChecker primitive (timeout, cancellation,
rack→machine resolution) and add focused integration-style tests on
each Manager that exercise both the refused (host Assigned) and
allowed (host Ready) paths via the mock client for PowerControl,
FirmwareControl, and BringUpControl. The NSM nvswitch manager
descriptor test is updated to expect the new `nico` required-provider
entry.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
@kunzhao-nv kunzhao-nv force-pushed the feat/flow-check-assigned branch from 675da63 to ca28792 Compare May 28, 2026 06:13
@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-05-28 06:14:14 UTC | Commit: ca28792

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
flow/internal/task/componentmanager/nvswitch/nico/nico_test.go (2)

221-230: 💤 Low value

Consider extracting timeout constants for reusability.

The timeout and poll interval values (50ms, 10ms) are reasonable but hard-coded. If additional safety tests are added in the future, extracting these as package-level test constants would improve maintainability.

♻️ Proposed refactor
+const (
+	testAssignmentTimeout  = 50 * time.Millisecond
+	testAssignmentInterval = 10 * time.Millisecond
+)
+
 // newManagerForSafetyTest swaps the long default 30-minute assignment
 // timeout for a tight one so the wait loop actually times out within the
 // test budget. Tests in this file use the same package, so they can reach
 // the unexported assignment field directly.
 func newManagerForSafetyTest(t *testing.T, client nicoapi.Client) *Manager {
 	t.Helper()
 	m := New(client)
-	m.assignment = nicoprovider.NewAssignmentChecker(client, 50*time.Millisecond, 10*time.Millisecond)
+	m.assignment = nicoprovider.NewAssignmentChecker(client, testAssignmentTimeout, testAssignmentInterval)
 	return m
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines
221 - 230, Extract the hard-coded timeouts in newManagerForSafetyTest into
package-level test constants (e.g., safetyAssignmentTimeout and
safetyPollInterval) and use them when calling nicoprovider.NewAssignmentChecker;
update the function newManagerForSafetyTest (which creates a Manager via New and
sets m.assignment) to reference those constants so future tests can reuse and
adjust the timeout/poll values centrally.

271-290: ⚡ Quick win

Add complementary test for FirmwareControl allowance.

PowerControl has both refusal (line 232) and allowance (line 253) tests, but FirmwareControl only tests the refusal path. Since both operations share the same ensureRackOperable safety gate, symmetric test coverage is warranted.

🧪 Recommended test addition
func TestFirmwareControl_AllowsWhenRackHostsReady(t *testing.T) {
	client := nicoapi.NewMockClient()
	client.SetSwitchRackID("sw-1", "rack-A")
	client.SetRackHostMachineIDs("rack-A", []string{"host-1"})
	client.AddMachine(nicoapi.MachineDetail{MachineID: "host-1", State: "Ready"})

	m := newManagerForSafetyTest(t, client)
	target := common.Target{
		Type:         devicetypes.ComponentTypeNVSwitch,
		ComponentIDs: []string{"sw-1"},
	}

	err := m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{
		Operation:     operations.FirmwareOperationUpgrade,
		TargetVersion: "1.0.0",
	})
	require.NoError(t, err)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go` around lines
271 - 290, Add a complementary positive test for FirmwareControl to mirror the
existing refusal test: create a test (e.g.,
TestFirmwareControl_AllowsWhenRackHostsReady) that uses nicoapi.NewMockClient(),
SetSwitchRackID("sw-1","rack-A"), SetRackHostMachineIDs("rack-A",
[]string{"host-1"}), and AddMachine with MachineDetail.State "Ready";
instantiate the manager with newManagerForSafetyTest and call
m.FirmwareControl(context.Background(), target,
operations.FirmwareControlTaskInfo{Operation:
operations.FirmwareOperationUpgrade, TargetVersion: "1.0.0"}) and assert
require.NoError(t, err). This ensures FirmwareControl (which uses
ensureRackOperable) allows the operation when rack hosts are Ready.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flow/internal/task/componentmanager/nvswitch/nico/nico_test.go`:
- Around line 271-290: The test TestFirmwareControl_RefusesWhenRackHostAssigned
should also assert that the machine ID is included in the error message; after
calling m.FirmwareControl and verifying err is non-nil and contains "refused"
and "Assigned state", add an assertion that err.Error() contains "host-1" to
match the behavior of the assignment checker (same pattern as the PowerControl
test), referencing the test name and the m.FirmwareControl call to locate the
change.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bbb1a347-1e55-4084-8dba-f389b39a37b0

📥 Commits

Reviewing files that changed from the base of the PR and between 675da63 and ca28792.

📒 Files selected for processing (13)
  • flow/internal/nicoapi/grpc.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/task/componentmanager/builtin/builtin_test.go
  • flow/internal/task/componentmanager/compute/nico/nico.go
  • flow/internal/task/componentmanager/compute/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go
  • flow/internal/task/componentmanager/powershelf/nico/nico.go
  • flow/internal/task/componentmanager/powershelf/nico/nico_test.go
  • flow/internal/task/componentmanager/providers/nico/assignment.go
  • flow/internal/task/componentmanager/providers/nico/assignment_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
  • flow/internal/task/componentmanager/builtin/builtin_test.go
  • flow/internal/nicoapi/mod.go
  • flow/internal/task/componentmanager/powershelf/nico/nico.go
  • flow/internal/nicoapi/mock.go
  • flow/internal/nicoapi/grpc.go
  • flow/internal/task/componentmanager/powershelf/nico/nico_test.go
  • flow/internal/task/componentmanager/compute/nico/nico.go
  • flow/internal/task/componentmanager/providers/nico/assignment_test.go
  • flow/internal/task/componentmanager/compute/nico/nico_test.go
  • flow/internal/task/componentmanager/nvswitch/nico/nico.go
  • flow/internal/task/componentmanager/providers/nico/assignment.go
  • flow/internal/task/componentmanager/nvswitch/nvswitchmanager/nvswitchmanager.go

Comment on lines +271 to +290
func TestFirmwareControl_RefusesWhenRackHostAssigned(t *testing.T) {
	client := nicoapi.NewMockClient()
	client.SetSwitchRackID("sw-1", "rack-A")
	client.SetRackHostMachineIDs("rack-A", []string{"host-1"})
	client.AddMachine(nicoapi.MachineDetail{MachineID: "host-1", State: "Assigned/Provisioning"})

	m := newManagerForSafetyTest(t, client)
	target := common.Target{
		Type:         devicetypes.ComponentTypeNVSwitch,
		ComponentIDs: []string{"sw-1"},
	}

	err := m.FirmwareControl(context.Background(), target, operations.FirmwareControlTaskInfo{
		Operation:     operations.FirmwareOperationUpgrade,
		TargetVersion: "1.0.0",
	})
	require.Error(t, err)
	assert.Contains(t, err.Error(), "refused")
	assert.Contains(t, err.Error(), "Assigned state")
}

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add assertion for machine ID in error message.

The error message from the assignment checker includes the machine IDs that remain in Assigned state (as verified in the PowerControl test at line 250). For consistency and thoroughness, this test should also verify that "host-1" appears in the error message.

🔍 Proposed fix
 	require.Error(t, err)
 	assert.Contains(t, err.Error(), "refused")
 	assert.Contains(t, err.Error(), "Assigned state")
+	assert.Contains(t, err.Error(), "host-1")
 }

@kunzhao-nv kunzhao-nv marked this pull request as draft May 29, 2026 18:46
The assignment safety gate added in PR #581 blocks power, firmware, and
bring-up operations whenever any target host (or, for rack-scoped
components, any host on the owning rack) is still in the Assigned/*
lifecycle state. Operator-supervised maintenance windows need an
explicit, audited way around that gate when tenant impact has been
acknowledged out-of-band; otherwise the only options are detaching
tenants first or waiting them out, neither of which is appropriate for
a planned maintenance procedure.

Add an override_assignment_check bool to the five disruptive request
messages (UpgradeFirmware, PowerOn/Off/Reset Rack, BringUpRack) and
plumb it end-to-end inside flow/:

  proto -> TaskInfo -> gRPC server handler -> activity ->
  component manager -> ensure{Machines,Rack}Operable

When the flag is true the policy gate logs a warning naming the
operation and the affected machine / shelf / switch IDs, then returns
without consulting the AssignmentChecker. For rack-scoped components
that also skips the topology lookup (FindSwitchRackIDs /
FindPowerShelfRackIDs / FindHostMachineIdsByRack) so the bypass does
not depend on Core reachability.

The flag defaults to false, preserving the gate as the safe default.
Authorisation for setting the override lives upstream of flow/ and is
not part of this change.
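
The override path described above might be composed inside the gate roughly like this. It is a sketch under assumptions: the `waitFunc` type stands in for the AssignmentChecker call, and the parameter list, log text, and error wrapping are illustrative rather than the actual `ensureMachinesOperable` signature.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// waitFunc is a hypothetical stand-in for AssignmentChecker.WaitForMachinesUnassigned.
type waitFunc func(ctx context.Context, machineIDs []string) error

// ensureMachinesOperable refuses the operation while any machine is assigned,
// unless the caller sets override, in which case it only logs a warning and
// never consults the checker.
func ensureMachinesOperable(ctx context.Context, op string, machineIDs []string, override bool, wait waitFunc) error {
	if override {
		log.Printf("WARNING: assignment check overridden for %s on machines %v", op, machineIDs)
		return nil
	}
	if err := wait(ctx, machineIDs); err != nil {
		return fmt.Errorf("%s refused: %w", op, err)
	}
	return nil
}

// exampleGate runs the gate against a checker that always reports an assigned
// machine and records whether the checker was consulted at all.
func exampleGate(override bool) (checked bool, err error) {
	wait := func(context.Context, []string) error {
		checked = true
		return errors.New("machines still in Assigned state: machine-1")
	}
	err = ensureMachinesOperable(context.Background(), "PowerControl", []string{"machine-1"}, override, wait)
	return checked, err
}
```

Because the override branch returns before the checker runs, the bypass does not depend on the backing service being reachable, which matches the intent stated for rack-scoped components.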

BringUpController.BringUpControl gains an operations.BringUpTaskInfo
parameter so the flag can reach the component manager the same way it
already does for PowerControl and FirmwareControl. The corresponding
Temporal activity and mock implementations are updated to match.

In the workflow action layer, BringUp rules that expand into
synthesised PowerControl or FirmwareControl sub-actions previously
discarded the parent task's operationInfo, which would silently drop
the override on the synthesised child. extractOverrideAssignmentCheck
recovers the flag from a parent task's operationInfo regardless of
which concrete TaskInfo type it is, including the map[string]any form
produced by Temporal child-workflow argument serialisation. The
ScheduledOperation -> TaskInfo converter is updated to carry the flag
through scheduled fires of the same operations.
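
One way to recover the flag regardless of the concrete type is a JSON round-trip, sketched below. The actual extractOverrideAssignmentCheck may be written differently (for example with a type switch); the struct definitions and the `override_assignment_check` JSON tag are assumptions for illustration.

```go
package main

import "encoding/json"

// Hypothetical TaskInfo shapes; field names and JSON tags are assumptions.
type PowerControlTaskInfo struct {
	Operation               string `json:"operation"`
	OverrideAssignmentCheck bool   `json:"override_assignment_check"`
}

type BringUpTaskInfo struct {
	OverrideAssignmentCheck bool `json:"override_assignment_check"`
}

// extractOverrideAssignmentCheck reads the override flag from any value that
// serialises to a JSON object carrying it: struct values, pointers, or the
// map[string]any form left behind by argument serialisation. Anything else
// falls back to false, the safe default.
func extractOverrideAssignmentCheck(info any) bool {
	if info == nil {
		return false
	}
	raw, err := json.Marshal(info)
	if err != nil {
		return false // non-marshalable input: keep the gate enabled
	}
	var probe struct {
		Override bool `json:"override_assignment_check"`
	}
	if err := json.Unmarshal(raw, &probe); err != nil {
		return false // e.g. a bare string or number instead of an object
	}
	return probe.Override
}
```

A nil typed pointer marshals to JSON null, which leaves the probe untouched, so that case also resolves to false without special handling.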

Tests:

  * compute/nico: override bypass on PowerControl and BringUpControl
    against an Assigned host
  * nvswitch/nico: override bypass on PowerControl against a rack
    whose host is Assigned
  * powershelf/nico: override bypass on FirmwareControl against a rack
    whose host is Assigned
  * workflow: extractOverrideAssignmentCheck covers nil, nil typed
    pointers, value and pointer forms of the three TaskInfo types,
    map[string]any from JSON round-trip, and the non-marshalable
    fallback

Tests on the old two-arg BringUpControl signature (activity_test,
bringup_test, child_workflow_test, compute/nico nico_test) are updated
to pass an empty BringUpTaskInfo so the existing assertions still hold.

Out of scope: the workflow-schema/ proto sync (make flow-proto +
make flow-protogen) and any upstream carbide-rest API / cloud workflow
plumbing of this flag. Both are intentionally deferred to keep this
change confined to flow/.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>
@kunzhao-nv kunzhao-nv marked this pull request as ready for review June 1, 2026 22:32
The OnActivity matcher in TestBringUpProgress used two mock.Anything
matchers, matching the old BringUpControl(ctx, target) shape. After
adding info to the activity signature the third positional argument
was unmatched, surfacing as

  FAIL:  (operations.BringUpTaskInfo={  false}) != (Missing)

from the testify mock layer. Add a third mock.Anything matcher.

Signed-off-by: Kun Zhao <kunzhao@nvidia.com>

@jw-nvidia jw-nvidia left a comment

Looks good. Just one suggestion: for the APIs, we should consider passing in a flag in case we need more requirements like the override assignment check.
