Version
v0.8.4
Describe the bug.
Creating a site via `carbidecli site create` does not produce a registered site. The REST call creates a `carbide-rest` DB row and a Site CR with a one-time password, but `status` stays `Pending` until a non-trivial sequence of manual kubectl patches is applied. None of this is automated or made idempotent by the helm charts or the `setup.sh` script.
Minimum reproducible example
* Run setup.sh as per the quickstart guide
* Create your first site as per https://docs.nvidia.com/infra-controller/documentation/getting-started/quick-start-guide#6-create-your-first-site
* Check the site status with `carbidecli site list`; it stays `Pending`
* carbide-admin-cli won't accept expected_machines
Relevant log output
Other/Misc.
Root Causes
1. Helm deploys site-agent with a placeholder UUID
helm-prereqs/setup.sh installs carbide-rest-site-agent with a hardcoded CLUSTER_ID of a1b2c3d4e5f6-4000-8000-000000000001. This is never the UUID of any real site. The agent handshakes against this placeholder indefinitely until manually corrected.
2. Bootstrap Job will not re-run
The helm bootstrap.yaml Job that writes the site-registration secret is idempotent on the secret's existence. Once the secret exists (even with the wrong UUID/OTP), the Job is a no-op. There is no automated reconciliation to update it when a site is (re)created.
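The difference between the current behaviour and a content-based reconcile can be sketched in a few lines. This is only an illustration of the logic described above, not the chart's actual code; the secret is modelled as a plain dict and the key names `site-uuid` and `otp` are hypothetical.

```python
from typing import Dict, Optional

Secret = Dict[str, str]


def bootstrap_if_absent(existing: Optional[Secret], desired: Secret) -> Secret:
    """Behaviour described above: write only when the secret is missing."""
    return desired if existing is None else existing


def reconcile(existing: Optional[Secret], desired: Secret) -> Secret:
    """Alternative: rewrite whenever the stored content differs from the desired content."""
    return desired if existing != desired else existing


# Hypothetical key names and values, for illustration only.
stale = {"site-uuid": "placeholder-uuid", "otp": "old-otp"}
fresh = {"site-uuid": "real-site-uuid", "otp": "new-otp"}

if __name__ == "__main__":
    print(bootstrap_if_absent(stale, fresh))  # stale secret is kept
    print(reconcile(stale, fresh))            # stale secret is replaced
```

With an existence check, a secret holding the wrong UUID/OTP survives every re-run; comparing content is what would let a re-created site propagate.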
3. No registration endpoint exists
There is no API call to register a site. Pending --> Registered is driven exclusively by cloud-worker processing a non-failed Machine inventory upload (workflow/pkg/activity/machine/machine.go:161). An empty inventory suffices, but the upload only happens after the site-agent handshake completes with the correct UUID.
4. Temporal namespace must be pre-created manually
If the Temporal namespace matching $SITE_UUID does not exist before the site-agent starts, the agent panics. Namespace creation is not part of carbidecli site create or any automated post-create hook.
Impact
Every new site bring-up requires an operator to:
- Capture `$SITE_UUID` and `$SITE_OTP` from the `site create` response.
- Manually create the Temporal namespace via `kubectl exec` into `temporal-admintools`.
- Patch the `site-registration` secret with the correct UUID + OTP.
- Patch the `carbide-rest-site-agent-config` ConfigMap with the new `CLUSTER_ID` and `TEMPORAL_SUBSCRIBE_NAMESPACE`.
- Rollout-restart the site-agent StatefulSet and wait for it.
- Wait up to ~30 s for inventory upload to flip status to `Registered`.
Skipping any of these steps, or performing them out of order, leaves the site stuck in `Pending` with no clear error surfaced to the operator; only the site-agent pod logs reveal the cause.
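For reference, the manual sequence above can be sketched as a small dry-run helper that builds the kubectl commands in order and prints them without executing anything. Several names are assumptions rather than facts from the deployment: the Kubernetes namespace `carbide`, the `deploy/temporal-admintools` target, the StatefulSet name `carbide-rest-site-agent`, the secret keys `site-uuid` and `otp`, the use of `tctl` inside the admintools image (newer images may ship the `temporal` CLI instead), and the assumption that `TEMPORAL_SUBSCRIBE_NAMESPACE` equals the site UUID.

```python
import json
import shlex
from typing import List


def bringup_commands(site_uuid: str, site_otp: str, namespace: str = "carbide") -> List[List[str]]:
    """Build the manual bring-up steps, in order, as argv lists (nothing is executed)."""
    kubectl = ["kubectl", "-n", namespace]
    # Hypothetical secret keys; check the real site-registration secret first.
    secret_patch = json.dumps({"stringData": {"site-uuid": site_uuid, "otp": site_otp}})
    # Assumes the Temporal namespace is named after the site UUID.
    config_patch = json.dumps(
        {"data": {"CLUSTER_ID": site_uuid, "TEMPORAL_SUBSCRIBE_NAMESPACE": site_uuid}}
    )
    agent = "statefulset/carbide-rest-site-agent"  # assumed resource name
    return [
        # 1. Create the Temporal namespace before the agent starts.
        kubectl + ["exec", "deploy/temporal-admintools", "--",
                   "tctl", "--namespace", site_uuid, "namespace", "register"],
        # 2. Point the registration secret at the real site.
        kubectl + ["patch", "secret", "site-registration", "--type", "merge", "-p", secret_patch],
        # 3. Replace the placeholder CLUSTER_ID in the agent config.
        kubectl + ["patch", "configmap", "carbide-rest-site-agent-config",
                   "--type", "merge", "-p", config_patch],
        # 4. Restart the agent and wait for the rollout to finish.
        kubectl + ["rollout", "restart", agent],
        kubectl + ["rollout", "status", agent, "--timeout=120s"],
    ]


if __name__ == "__main__":
    for argv in bringup_commands("<SITE_UUID>", "<SITE_OTP>"):
        print(shlex.join(argv))
```

Printing the commands rather than running them keeps the ordering explicit and lets an operator review each step against their own cluster before applying it.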
AI-Assisted: Claude
Code of Conduct