
bug: [quickstart-automation] New site registration requires extensive manual patching #2245

@vipulagarwal

Version

v0.8.4

Describe the bug.

Creating a site via `carbidecli site create` does not produce a registered site. The REST call creates a carbide-rest DB row and a Site CR with a one-time password, but the status stays `Pending` until a non-trivial sequence of manual kubectl patches is applied. Neither the Helm charts nor the setup.sh script automates these steps or makes them idempotent.

Minimum reproducible example

* Run setup.sh as per the quickstart guide
* Create your first site as per https://docs.nvidia.com/infra-controller/documentation/getting-started/quick-start-guide#6-create-your-first-site
* Check the site status with `carbidecli site list`; it stays `Pending`
* carbide-admin-cli won't accept expected_machines

Other/Misc.

Root Causes

1. Helm deploys site-agent with a placeholder UUID

helm-prereqs/setup.sh installs carbide-rest-site-agent with a hardcoded CLUSTER_ID of a1b2c3d4e5f6-4000-8000-000000000001. This is never the UUID of any real site. The agent handshakes against this placeholder indefinitely until manually corrected.

2. Bootstrap Job will not re-run

The Helm bootstrap.yaml Job that writes the site-registration secret checks only whether the secret exists, not what it contains. Once the secret exists (even with the wrong UUID/OTP), the Job is a no-op. Nothing reconciles the secret automatically when a site is created or re-created.
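
To illustrate the difference, here is a minimal sketch contrasting an existence-only check with a content-aware reconcile. It is not the Job's actual code: the `Registration` type, its field names, and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Registration:
    """Hypothetical shape of the site-registration secret's payload."""
    site_uuid: str
    otp: str


def create_if_absent(existing: Optional[Registration], desired: Registration) -> str:
    """Behavior described in this report: any existing secret short-circuits the Job."""
    return "create" if existing is None else "noop"


def reconcile(existing: Optional[Registration], desired: Registration) -> str:
    """Alternative: compare contents, so a stale UUID/OTP gets rewritten."""
    if existing is None:
        return "create"
    if existing != desired:
        return "update"
    return "noop"


stale = Registration(site_uuid="placeholder-uuid", otp="old-otp")
wanted = Registration(site_uuid="real-site-uuid", otp="new-otp")

print(create_if_absent(stale, wanted))  # noop
print(reconcile(stale, wanted))         # update
```

With the existence-only check, a secret holding the placeholder UUID is never corrected; the content-aware variant would rewrite it on the next run.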

3. No registration endpoint exists

There is no API call to register a site. Pending --> Registered is driven exclusively by cloud-worker processing a non-failed Machine inventory upload (workflow/pkg/activity/machine/machine.go:161). An empty inventory suffices, but the upload only happens after the site-agent handshake completes with the correct UUID.
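
The transition rule described above can be modeled roughly as follows. This is a simplified sketch of the behavior as reported, not the code in machine.go; `InventoryUpload` and `next_status` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InventoryUpload:
    """Hypothetical stand-in for a Machine inventory upload from the site-agent."""
    failed: bool
    machines: List[str] = field(default_factory=list)


def next_status(current: str, upload: InventoryUpload) -> str:
    """Return the site status after cloud-worker processes one upload."""
    if current == "Pending" and not upload.failed:
        # The machine list may be empty; only the failure flag matters here.
        return "Registered"
    return current


print(next_status("Pending", InventoryUpload(failed=False)))  # Registered
print(next_status("Pending", InventoryUpload(failed=True)))   # Pending
```

Because no other input drives the transition in this model, a site whose agent never completes the handshake never uploads inventory and stays `Pending`.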

4. Temporal namespace must be pre-created manually

If the Temporal namespace matching $SITE_UUID does not exist before the site-agent starts, the agent panics. Namespace creation is not part of carbidecli site create or any automated post-create hook.

Impact

Every new site bring-up requires an operator to:

  1. Capture $SITE_UUID and $SITE_OTP from the site create response.
  2. Manually create the Temporal namespace via kubectl exec into temporal-admintools.
  3. Patch the site-registration secret with the correct UUID + OTP.
  4. Patch the carbide-rest-site-agent-config ConfigMap with the new CLUSTER_ID and TEMPORAL_SUBSCRIBE_NAMESPACE.
  5. Rollout-restart the site-agent StatefulSet and wait for it.
  6. Wait up to ~30 s for inventory upload to flip status to Registered.
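
For reference, steps 2 to 6 could be sketched as an ordered command plan. This is a dry-run sketch under several assumptions, not a tested fix: the secret keys (`site-uuid`, `otp`), the StatefulSet name, the `deploy/temporal-admintools` target, the use of `tctl` inside that pod, and the 120 s timeout are all guesses. Only the secret name, the ConfigMap name, and the `CLUSTER_ID` and `TEMPORAL_SUBSCRIBE_NAMESPACE` keys come from this report. Newer Temporal admin-tools images may ship the `temporal` CLI instead of `tctl`.

```python
import json
import shlex
from typing import List


def registration_plan(site_uuid: str, site_otp: str,
                      agent_ns: str, temporal_ns: str) -> List[List[str]]:
    """Build the manual registration steps as argv lists, in the required order."""
    agent = "statefulset/carbide-rest-site-agent"  # assumed StatefulSet name
    # Assumed secret keys; check the real secret before patching.
    secret_patch = json.dumps({"stringData": {"site-uuid": site_uuid, "otp": site_otp}})
    # The Temporal namespace is named after the site UUID, so both keys get the same value.
    config_patch = json.dumps({"data": {
        "CLUSTER_ID": site_uuid,
        "TEMPORAL_SUBSCRIBE_NAMESPACE": site_uuid,
    }})
    return [
        # Step 2: the Temporal namespace must exist before the agent restarts.
        ["kubectl", "-n", temporal_ns, "exec", "deploy/temporal-admintools", "--",
         "tctl", "--namespace", site_uuid, "namespace", "register"],
        # Step 3: overwrite the stale registration secret.
        ["kubectl", "-n", agent_ns, "patch", "secret", "site-registration",
         "--type", "merge", "-p", secret_patch],
        # Step 4: point the agent at the real site.
        ["kubectl", "-n", agent_ns, "patch", "configmap",
         "carbide-rest-site-agent-config", "--type", "merge", "-p", config_patch],
        # Step 5: restart the agent and wait for the rollout.
        ["kubectl", "-n", agent_ns, "rollout", "restart", agent],
        ["kubectl", "-n", agent_ns, "rollout", "status", agent, "--timeout=120s"],
        # Step 6: check whether the status has flipped to Registered.
        ["carbidecli", "site", "list"],
    ]


# Dry run: print the commands instead of executing them.
for argv in registration_plan("<SITE_UUID>", "<SITE_OTP>", "<agent-ns>", "<temporal-ns>"):
    print(shlex.join(argv))
```

Returning argv lists rather than running the commands keeps the ordering reviewable; each list could be passed to `subprocess.run(argv, check=True)` once the assumed names are confirmed.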

Skipping any of these steps, or running them out of order, leaves the site stuck in Pending with no clear error surfaced to the operator - only the site-agent pod logs reveal the cause.

AI-Assisted: Claude

Code of Conduct

  • I agree to follow NCX Infra Controller's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Metadata

Assignees

Labels

bug: A defect in existing software (deprecated - use issue type, but it's needed for reporting now)

Type

No fields configured for Bug.

Projects

Status
In Progress

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
