From bc9d186d7881b576c9d1655e7d0e9c4168f6e155 Mon Sep 17 00:00:00 2001 From: Kyle Felter Date: Wed, 10 Jun 2026 03:03:50 -0500 Subject: [PATCH 1/3] docs: Rename infra-controller-core references to infra-controller --- docs/README.md | 2 +- docs/architecture/health_aggregation.md | 4 +- docs/architecture/infiniband/nic_selection.md | 2 +- docs/architecture/overview.md | 20 +- docs/architecture/redfish_workflow.md | 2 +- docs/configuration/tenant_management.md | 2 +- docs/development.md | 6 +- docs/development/vm_pxe_client.md | 2 +- .../day0-ip-network-config.md | 10 +- .../installation-options/reference-install.md | 675 +----------------- docs/getting-started/quick-start.md | 45 +- docs/index.md | 2 +- docs/manuals/building_nico_containers.md | 2 +- docs/manuals/nico-admin-cli.md | 2 +- docs/manuals/nicocli-reference.md | 3 +- docs/manuals/nvlink_partitioning.md | 8 +- docs/manuals/rack_level_admin.md | 40 +- docs/manuals/repair/overview.md | 2 +- docs/openapi/getting_started.md | 2 +- docs/openapi/spec.yaml | 4 +- docs/release-notes.md | 4 +- helm-prereqs/README.md | 8 +- 22 files changed, 83 insertions(+), 764 deletions(-) diff --git a/docs/README.md b/docs/README.md index 756657bfb3..0bca21a06a 100644 --- a/docs/README.md +++ b/docs/README.md @@ -100,7 +100,7 @@ The REST layer can be deployed in the datacenter with Infra Controller Core, or in Cloud with Site Agent connecting from the datacenter. Multiple Infra Controller Cores running in different datacenters can also connect to Infra Controller REST through respective Site Agents. -For details on NICo REST, please refer to [NICo REST Github Repository](https://github.com/NVIDIA/infra-controller-rest) and [NICo REST API Schema](https://nvidia.github.io/infra-controller-rest/). +For details on NICo REST, please refer to the [infra-controller GitHub repository](https://github.com/NVIDIA/infra-controller) and the [REST API Reference](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference). 
### Managed Hosts diff --git a/docs/architecture/health_aggregation.md b/docs/architecture/health_aggregation.md index 76cb9baad1..d2bfd071ee 100644 --- a/docs/architecture/health_aggregation.md +++ b/docs/architecture/health_aggregation.md @@ -252,7 +252,7 @@ Details can be found in the [SKU Validation guide](../provisioning/sku-validatio ### BMC health monitoring -The [`nico-hw-health`](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/health) service periodically queries all Host and DPU BMCs in the system for health information. It emits the captured health datapoints as metrics on a metrics endpoint that can be scraped by a standard telemetry system (prometheus/otel). +The [`nico-hw-health`](https://github.com/NVIDIA/infra-controller/blob/main/crates/health) service periodically queries all Host and DPU BMCs in the system for health information. It emits the captured health datapoints as metrics on a metrics endpoint that can be scraped by a standard telemetry system (prometheus/otel). Health metrics fetched from BMCs include: - Fan speeds @@ -282,7 +282,7 @@ In certain conditions the scraping process will place a health alert on the host ### dpu-agent based health monitoring -[`dpu-agent`](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/agent) collects health information directly on the DPU and sends a health-**rollup** towards `nico-core`. The agent monitors a variety of health conditions, including +[`dpu-agent`](https://github.com/NVIDIA/infra-controller/blob/main/crates/agent) collects health information directly on the DPU and sends a health-**rollup** towards `nico-core`. 
The agent monitors a variety of health conditions, including - whether BGP sessions are established to peers according to the current configuration of the DPU - whether all required services on the DPU are running - whether the DPU is configured in restricted mode diff --git a/docs/architecture/infiniband/nic_selection.md b/docs/architecture/infiniband/nic_selection.md index 42351adb3b..6eb14b17c9 100644 --- a/docs/architecture/infiniband/nic_selection.md +++ b/docs/architecture/infiniband/nic_selection.md @@ -74,7 +74,7 @@ use the independent devices. ### NICo machine hardware enumeration When NICo discovers a machine that is intended to be managed by the NICo site controller, -it enumerates its hardware details using the [nico-scout](https://github.com/NVIDIA/infra-controller-core/tree/main/crates/scout) tool. +it enumerates its hardware details using the [nico-scout](https://github.com/NVIDIA/infra-controller/tree/main/crates/scout) tool. The tool reports all discovered hardware information (e.g. the number and type of CPUs, GPUs, and network interfaces), and this information gets persisted diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index 7732bc4745..c26799b88a 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -32,7 +32,7 @@ NICo deploys a set of binaries on these hosts during various points of their lif ### Scout -[scout](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/scout) is an agent that NICo runs on the host and DPU of managed hosts for a variety of tasks: +[scout](https://github.com/NVIDIA/infra-controller/blob/main/crates/scout) is an agent that NICo runs on the host and DPU of managed hosts for a variety of tasks: - "Inventory" collection: Scout collects and transmits hardware properties of the host to [NICo core](#nico-core) which can not be determined through out-of-band tooling. 
- Execution of cleanup tasks whenever the bare metal instance using the host is released by a user - Execution of machine validation tests @@ -40,7 +40,7 @@ NICo deploys a set of binaries on these hosts during various points of their lif ### DPU Agent -[dpu-agent](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/agent) is an agent that NICo runs exclusively on DPUS managed by NICo as a daemon. +[dpu-agent](https://github.com/NVIDIA/infra-controller/blob/main/crates/agent) is an agent that NICo runs exclusively on DPUS managed by NICo as a daemon. DPU agent performs the following tasks: - Configuring the DPU as required at any state during the hosts lifecycle. This process is described more in depth in [DPU configuration](dpu_configuration.md). @@ -51,24 +51,24 @@ DPU agent performs the following tasks: ### DHCP Server -NICo runs a [custom DHCP server](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/dhcp-server) on the DPU, which handles all DHCP requests of the actual host. This means DHCP requests on the hosts primary networking interfaces will never leave the DPU and show up on the underlay network - which provides enhanced security and reliability. +NICo runs a [custom DHCP server](https://github.com/NVIDIA/infra-controller/blob/main/crates/dhcp-server) on the DPU, which handles all DHCP requests of the actual host. This means DHCP requests on the hosts primary networking interfaces will never leave the DPU and show up on the underlay network - which provides enhanced security and reliability. The DHCP server is configured by dpu-agent. ## NICo Control plane services The NICo control plane consists of a number of services which work together to orchestrate the lifecycle of a managed host: -- [nico-core](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/api): The NICo core service is the entrypoint into the control plane. 
It provides a [gRPC](https://grpc.io) API that all other components as well as users (site providers/tenants/site administrators) interact with, as well as implements the lifecycle management of all NICo managed resources (VPCs, prefixes, Infiniband and NVLink partitions and bare metal instances). The [NICo Core](#nico_core_architecture) section describes it further in detail. -- [nico-dhcp (DHCP)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/dhcp): The DHCP server responds to DHCP requests for all +- [nico-core](https://github.com/NVIDIA/infra-controller/blob/main/crates/api): The NICo core service is the entrypoint into the control plane. It provides a [gRPC](https://grpc.io) API that all other components as well as users (site providers/tenants/site administrators) interact with, as well as implements the lifecycle management of all NICo managed resources (VPCs, prefixes, Infiniband and NVLink partitions and bare metal instances). The [NICo Core](#nico_core_architecture) section describes it further in detail. +- [nico-dhcp (DHCP)](https://github.com/NVIDIA/infra-controller/blob/main/crates/dhcp): The DHCP server responds to DHCP requests for all devices on underlay networks. This includes Host BMCs, DPU BMCs and DPU OOB addresses. nico-dhcp can be thought of as a stateless proxy: It does not actually perform any IP address management - it just converts DHCP requests into gRPC format and forwards the gRPC based DHCP requests to nico core. -- [nico-pxe (iPXE)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/pxe): The PXE server provides boot artifacts like iPXE scripts, iPXE user-data and OS images to managed hosts at boot time over HTTP. It determines which OS data to provide for a specific host by requesting the respective data from nico core - therefore the PXE server is also stateless. 
+- [nico-pxe (iPXE)](https://github.com/NVIDIA/infra-controller/blob/main/crates/pxe): The PXE server provides boot artifacts like iPXE scripts, iPXE user-data and OS images to managed hosts at boot time over HTTP. It determines which OS data to provide for a specific host by requesting the respective data from nico core - therefore the PXE server is also stateless. Currently, managed hosts are configured to always boot from PXE. If a local bootable device is found, the host will boot it. Hosts can also be configured to always boot from a particular image for stateless configurations. -- [nico-hw-health (Hardware health)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/health): This service scrapes all host and DPU BMCs known by NICo for system health information. It extracts measurements like fan speeds, temperatures and leak indicators. These measurements are emitted as prometheus metrics on a `/metrics` endpoint on port 9009. In addition to that, the service calls the nico-core API `RecordHardwareHealthReport` to set health alerts based on issues identified within the metrics. These alerts are merged within nico-core into the aggregated-host-health - which is emitted in overall health metrics and used to decide whether hosts are usable as bare metal instances for tenants. -- [ssh-console](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/ssh-console): The SSH console provides bare metal-tenants and site-administrators virtual serial console access to hosts managed by NICo. The ssh-console service also sends the output of each hosts serial console to +- [nico-hw-health (Hardware health)](https://github.com/NVIDIA/infra-controller/blob/main/crates/health): This service scrapes all host and DPU BMCs known by NICo for system health information. It extracts measurements like fan speeds, temperatures and leak indicators. These measurements are emitted as prometheus metrics on a `/metrics` endpoint on port 9009. 
In addition to that, the service calls the nico-core API `RecordHardwareHealthReport` to set health alerts based on issues identified within the metrics. These alerts are merged within nico-core into the aggregated-host-health - which is emitted in overall health metrics and used to decide whether hosts are usable as bare metal instances for tenants. +- [ssh-console](https://github.com/NVIDIA/infra-controller/blob/main/crates/ssh-console): The SSH console provides bare metal-tenants and site-administrators virtual serial console access to hosts managed by NICo. The ssh-console service also sends the output of each hosts serial console to the logging system (Loki), from where it can be queried using Grafana and logcli. In order to provide this functionality, the ssh-console service *continuously* connects to all host BMCs. The ssh-console service only forwards logs to users ("bare metal tenants") if they connect to the service and get authenticated. -- [nico-dns (DNS)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/dns): Domain name service (DNS) functionality +- [nico-dns (DNS)](https://github.com/NVIDIA/infra-controller/blob/main/crates/dns): Domain name service (DNS) functionality is handled by two services. The `nico-dns` service handles DNS queries from the site controller and managed nodes and is authoritative for delegated zones. ## NICo Core @@ -203,7 +203,7 @@ pods. There are three different K8s statefulsets that run on the controller node The point of having a site controller is to administer a site that has been populated with tenant managed hosts. Each managed host is a pairing of a Bluefield (BF) 2/3 DPUs and a host server (only two DPUs have been tested). -During initial deployment [scout](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/scout) runs and +During initial deployment [scout](https://github.com/NVIDIA/infra-controller/blob/main/crates/scout) runs and informs nico-api of any discovered DPUs. 
NICo completes the installation of services on the DPU and boots into regular operation mode. Thereafter the nico-dpu-agent starts as a daemon. diff --git a/docs/architecture/redfish_workflow.md b/docs/architecture/redfish_workflow.md index 0d5ba20625..1b822b54cb 100644 --- a/docs/architecture/redfish_workflow.md +++ b/docs/architecture/redfish_workflow.md @@ -120,7 +120,7 @@ Once all DPUs are matched and validated, the host enters an "ingestable" state a ## 4. DPU Provisioning -After pairing, the DPU must be provisioned with NICo software. This is orchestrated via Temporal workflows (in `nico-rest`) with Redfish power control (in `infra-controller-core`). +After pairing, the DPU must be provisioned with NICo software. This is orchestrated via Temporal workflows (in `nico-rest`) with Redfish power control (in `infra-controller`). ### Boot Configuration diff --git a/docs/configuration/tenant_management.md b/docs/configuration/tenant_management.md index 8bbb7beadf..02d7976c24 100644 --- a/docs/configuration/tenant_management.md +++ b/docs/configuration/tenant_management.md @@ -12,7 +12,7 @@ This guide assumes you have completed the [Quick Start Guide](../getting-started - A running NICo deployment with healthy REST API, database, Temporal workflow engine, and at least one site controller. - At least one site registered and in `Registered` status, with machines discovered and available for allocation. -- `nicocli` installed (`make nico-cli` from the infra-controller-rest repo) and reachable on `$PATH`. +- `nicocli` installed (`make nico-cli` from the `rest-api/` directory of the `infra-controller` repo) and reachable on `$PATH`. If you plan to enable SPIFFE JWT-SVID **machine identity**, complete [Day 0 Machine Identity](../getting-started/installation-options/day0-machine-identity.md) before provisioning instances, then configure per-org identity after tenants exist — see [Machine Identity](machine_identity.md). 
diff --git a/docs/development.md b/docs/development.md index c654387994..d74ba03992 100644 --- a/docs/development.md +++ b/docs/development.md @@ -76,7 +76,7 @@ environment. 8. Install `direnv` using your package manager It would be best to install `direnv` on your host. `direnv` requires a shell hook to work. See `man direnv` (after install) for - more information on setting it up. Once you clone the `infra-controller-core` repo, you need to run `direnv allow` the first time you cd into your local copy. + more information on setting it up. Once you clone the `infra-controller` repo, you need to run `direnv allow` the first time you cd into your local copy. Running `direnv allow` exports the necessary environmental variables while in the repo and cleans up when not in the repo. There are preset environment variables that are used throughout the repo. `${REPO_ROOT}` represents the top of the repo. @@ -202,13 +202,13 @@ One time build: RUN cd /usr/local/bin && curl -fL https://getcli.jfrog.io | sh ``` - `docker build -t myarm myarm` # give it a cooler name - - `docker run -it -v /home/user/src/infra-controller-core:/infra-controller-core myarm /bin/bash` + - `docker run -it -v /home/user/src/infra-controller:/infra-controller myarm /bin/bash` Daily usage: - `docker start ` - `docker attach ` -Now that you're in the container go into `/infra-controller-core` and work normally (`cargo build --release`). The binary rust produces will be aarch64. You can `scp` it to a DPU and run it. +Now that you're in the container go into `/infra-controller` and work normally (`cargo build --release`). The binary rust produces will be aarch64. You can `scp` it to a DPU and run it. The build may hang the first time. I don't know why. Ctrl-C and try again. You may want to `docker commit` after it succeeds to update the image. 
diff --git a/docs/development/vm_pxe_client.md b/docs/development/vm_pxe_client.md index 7a1c176aff..61e458b654 100644 --- a/docs/development/vm_pxe_client.md +++ b/docs/development/vm_pxe_client.md @@ -87,7 +87,7 @@ You can also use graphical interface `virt-manager`. The virtual machine should fail to PXE boot from IPv4 (but gets an IP address) and IPv6, and then succeed from "HTTP boot IPv4", getting both an IP address and a boot image. This should boot you into the pre-exec image. The user is `root` and password -is specified in the [mkosi.default](https://github.com/NVIDIA/infra-controller-core/tree/main/pxe) file. +is specified in the [mkosi.default](https://github.com/NVIDIA/infra-controller/tree/main/pxe) file. In order to exit out of console use `ctrl-a x` diff --git a/docs/getting-started/installation-options/day0-ip-network-config.md b/docs/getting-started/installation-options/day0-ip-network-config.md index c65739a184..4a0261fd5f 100644 --- a/docs/getting-started/installation-options/day0-ip-network-config.md +++ b/docs/getting-started/installation-options/day0-ip-network-config.md @@ -191,7 +191,7 @@ After deployment, validate the DHCP path end-to-end: kubectl get svc nico-dhcp -n nico-system ``` -Both EXTERNAL-IP and TYPE=`LoadBalancer` must be populated. A `` IP indicates a MetalLB issue — see [Reference Installation — MetalLB troubleshooting](reference-install.md#metallb-loadbalancer-services-stuck-in-pending). +Both EXTERNAL-IP and TYPE=`LoadBalancer` must be populated. A `` IP indicates a MetalLB issue — see the [Reference Installation](reference-install.md) guides for MetalLB troubleshooting. **Tail `nico-dhcp` logs while a BMC powers on:** @@ -278,7 +278,7 @@ A fixed set of NICo service hostnames are resolved by DPU agents, host PXE loade Two TLD conventions exist: - **`.nico`** is the compiled default in `crates/agent/src/util.rs` and the host PXE loader scripts. The agent resolves `nico-pxe.nico`, `nico-ntp.nico`, etc. at startup. 
This is the TLD used by deployments built from the current binaries. -- **`.nico`** is the rebranded TLD documented in [`deploy/DNS.md`](https://github.com/NVIDIA/infra-controller-core/blob/main/deploy/DNS.md). New deployments may use this convention, but only if the agent and PXE images have been rebuilt with the new TLD. +- **`.nico`** is the rebranded TLD documented in [`deploy/DNS.md`](https://github.com/NVIDIA/infra-controller/blob/main/deploy/DNS.md). New deployments may use this convention, but only if the agent and PXE images have been rebuilt with the new TLD. Choose the convention that matches your binaries — do not mix. Verify by checking what the agent actually resolves at startup (`kubectl exec -n nico-system -- getent hosts nico-pxe.nico` or the `.nico` equivalent). @@ -293,7 +293,7 @@ The required A records (shown for `.nico`; substitute `.nico` if your binaries u | `unbound.nico` | 53 | `unbound` LoadBalancer VIP | Recursive DNS resolver | Yes — the resolver address itself is distributed via DHCP option 6 | | `otel-receiver.nico` | 443 | OTel receiver VIP on the site controller | OTLP ingestion endpoint for DPU otel-collector sidecars | Yes — set in the otel-collector configuration YAML and re-deployed | -One additional `.nico` hostname, `socks.nico`, is hardcoded into the DPU agent as the SOCKS5 outbound proxy for DPU extension-service pods. Add a corresponding A record only if your environment runs a SOCKS5 proxy for that purpose; it is not part of every NICo deployment. For per-endpoint detail (consumers, in-cluster addresses, hardcode locations, and the `unbound`-vs-other-resolver guidance), see [`deploy/DNS.md`](https://github.com/NVIDIA/infra-controller-core/blob/main/deploy/DNS.md). That file is the canonical endpoint reference; the table above is the operator-facing summary. +One additional `.nico` hostname, `socks.nico`, is hardcoded into the DPU agent as the SOCKS5 outbound proxy for DPU extension-service pods. 
Add a corresponding A record only if your environment runs a SOCKS5 proxy for that purpose; it is not part of every NICo deployment. For per-endpoint detail (consumers, in-cluster addresses, hardcode locations, and the `unbound`-vs-other-resolver guidance), see [`deploy/DNS.md`](https://github.com/NVIDIA/infra-controller/blob/main/deploy/DNS.md). That file is the canonical endpoint reference; the table above is the operator-facing summary. > **Note:** Neither `.nico` nor `.nico` is a publicly registered TLD. Both are used exclusively on the isolated OOB management network. Configure the recursive resolver to treat the chosen TLD as locally authoritative and **not** forward queries to upstream public resolvers. @@ -361,6 +361,6 @@ When every item is checked, proceed to [Ingesting Hosts](../../provisioning/inge - [BMC and Out-of-Band Setup](../prerequisites/bmc-oob-setup.md) — OOB physical network, DHCP relay setup, BMC credentials. - [IP Resource Pools](../../manuals/networking/ip_resource_pools.md) — `lo-ip` / `vpc-dpu-lo` semantics, sizing, `admin-cli resource-pool grow`. - [Quick Start Guide](../quick-start.md) — the install flow that consumes the configuration described here. -- [Reference Installation](reference-install.md) — phase-by-phase install with troubleshooting for MetalLB, DNS, and `nico-api`. +- [Reference Installation](reference-install.md) — pointers to the manual, manifest-level install and troubleshooting references. - [Ingesting Hosts](../../provisioning/ingesting-hosts.md) — `expected_machines.json` schema and upload commands. -- [`deploy/DNS.md`](https://github.com/NVIDIA/infra-controller-core/blob/main/deploy/DNS.md) — canonical reference for NICo service hostnames, ports, and hardcoded-vs-configurable status. +- [`deploy/DNS.md`](https://github.com/NVIDIA/infra-controller/blob/main/deploy/DNS.md) — canonical reference for NICo service hostnames, ports, and hardcoded-vs-configurable status. 
diff --git a/docs/getting-started/installation-options/reference-install.md b/docs/getting-started/installation-options/reference-install.md index 2bd27dd7de..99ee1e3d2b 100644 --- a/docs/getting-started/installation-options/reference-install.md +++ b/docs/getting-started/installation-options/reference-install.md @@ -1,673 +1,18 @@ -# Reference Installation — Manual Phase-by-Phase +# Reference Installation -This guide breaks down every phase of NICo's `setup.sh` installation with the exact commands being run. Use this if you need to re-run a single phase, debug a failure, or understand what the script does before running it. +NICo is deployed by the `setup.sh` orchestrator in `helm-prereqs/`, which installs every prerequisite and NICo component (Core, REST, and Flow) in dependency order. The [Quick Start Guide](../quick-start.md) is the end-to-end walkthrough: building images, preparing the cluster, configuring the site, and running `setup.sh`. -For the automated end-to-end installation using `setup.sh`, see the [Quick Start Guide](../quick-start.md). +This page collects the maintained, manifest-level references for operators who need to run a phase by hand, re-run a single step, or debug a failure. -**Prerequisites:** complete all configuration steps in [Step 3 of the Quick Start Guide](../quick-start.md#step-3--configure-the-site) before running any phase manually. +## Automated installation -All commands below assume you are in the `helm-prereqs/` directory with the required environment variables set: +- [Quick Start Guide](../quick-start.md) — end-to-end deployment driven by `setup.sh`. 
-```bash -cd helm-prereqs/ -export KUBECONFIG=/path/to/kubeconfig -export REGISTRY_PULL_SECRET= -export NCX_IMAGE_REGISTRY= -export NCX_CORE_IMAGE_TAG= -export NCX_REST_IMAGE_TAG= -export NCX_REPO=/path/to/ncx-infra-controller-rest # or let preflight auto-detect -``` +## Manual installation references ---- +`setup.sh` is a thin wrapper over the Helm charts and kustomize manifests in the repository. The source-of-truth guides document each step, the order of operations, the PKI and secrets model, and troubleshooting: -## Phase 0 — DNS check +- **Prerequisites and NICo Core** — [`helm-prereqs/README.md`](https://github.com/NVIDIA/infra-controller/blob/main/helm-prereqs/README.md) covers the prerequisite stack (local-path-provisioner, postgres-operator, MetalLB, cert-manager, Vault, External Secrets), the NICo Core deployment, the per-site values files, the `.forge` compatibility DNS, and the `health-check.sh` verification. +- **NICo REST** — [`rest-api/deploy/INSTALLATION.md`](https://github.com/NVIDIA/infra-controller/blob/main/rest-api/deploy/INSTALLATION.md) is the prescriptive, manifest-by-manifest bring-up for the REST control plane (PostgreSQL, Keycloak, Temporal, the internal cert-manager, site-manager, API, workflow workers, and site-agent). -Detects cluster type and verifies DNS is ready before any workloads are deployed. 
- -- **Kubespray clusters** — checks if the `nodelocaldns` DaemonSet is ready; deploys `operators/nodelocaldns-daemonset.yaml` if missing and waits for rollout -- **kubeadm / other** — checks CoreDNS readyReplicas >= 1; warns but does not fail if not ready - -```bash -# Kubespray: deploy NodeLocal DNSCache if missing -if kubectl get configmap nodelocaldns -n kube-system &>/dev/null; then - kubectl apply -f operators/nodelocaldns-daemonset.yaml 2>/dev/null || true - kubectl rollout status daemonset/nodelocaldns -n kube-system --timeout=120s -else - # kubeadm: just verify CoreDNS is up - kubectl get deployment coredns -n kube-system -fi -``` - ---- - -## Phase 1 — local-path-provisioner - -Deploys StorageClasses for Vault and PostgreSQL PVCs. The `local-path-persistent` StorageClass uses `reclaimPolicy: Retain` so data survives pod deletion and node restarts. - -```bash -kubectl apply -f operators/local-path-provisioner.yaml -# Delete before re-apply - the provisioner field is immutable -kubectl delete -f operators/storageclass-local-path-persistent.yaml --ignore-not-found 2>/dev/null || true -kubectl apply -f operators/storageclass-local-path-persistent.yaml -kubectl rollout status deployment/local-path-provisioner -n local-path-storage --timeout=120s -# Mark local-path as the cluster default StorageClass -kubectl annotate storageclass local-path \ - storageclass.kubernetes.io/is-default-class=true --overwrite -``` - ---- - -## Phase 1b — postgres-operator - -Installs the Zalando PostgreSQL Operator. Must be up before Phase 5 creates the `nico-pg-cluster` resource — the `postgresql.acid.zalan.do` CRD must be registered first. - -```bash -helmfile sync -l name=postgres-operator -``` - ---- - -## Phase 1c — MetalLB - -Installs MetalLB 0.14.5 with the FRR BGP speaker, then applies your site-specific IP pool and BGP configuration. 
- -```bash -helmfile sync -l name=metallb -kubectl wait --for=condition=Available deployment/metallb-controller \ - -n metallb-system --timeout=120s -kubectl apply -f values/metallb-config.yaml -``` - -Expected result: MetalLB controller and speaker pods running in `metallb-system`. BGPPeer sessions established with your TOR switches. - ---- - -## Phase 2 — cert-manager + Vault TLS bootstrap - -Three sub-steps — all must complete before Phase 3 (Vault). - -### 2a — cert-manager - -```bash -helmfile sync -l name=cert-manager -``` - -### 2b — Vault TLS bootstrap - -Vault requires TLS to start — but the Vault-backed issuer can't exist before Vault is running. This step breaks the chicken-and-egg problem by using `site-issuer` (backed by `site-root` CA) to issue Vault's own TLS certs before Vault starts. - -```bash -kubectl create namespace vault --dry-run=client -o yaml | kubectl apply -f - -# Run from the helm-prereqs/ directory (the chart root) -helm template nico-prereqs . \ - --show-only templates/site-root-certificate.yaml \ - --show-only templates/vault-tls-certs.yaml \ - | kubectl apply --server-side --field-manager=helm -f - -# Wait for all three certs to be issued -kubectl wait --for=condition=Ready certificate/site-root -n cert-manager --timeout=120s -kubectl wait --for=condition=Ready certificate/nicoca-vault-client -n vault --timeout=120s -kubectl wait --for=condition=Ready certificate/vault-raft-tls -n vault --timeout=120s -``` - ---- - -## Phase 3 — Vault - -Installs HashiCorp Vault 0.25.0 in 3-replica HA Raft mode. TLS secrets exist in the `vault` namespace by this point so pods start immediately. 
- -```bash -helmfile sync -l name=vault -``` - ---- - -## Phase 4 — Initialize and unseal Vault - -```bash -./unseal_vault.sh -./bootstrap_ssh_host_key.sh -``` - -`unseal_vault.sh` handles both first-run init and re-unseal on subsequent runs: -- First run: `vault operator init -key-shares=5 -key-threshold=3`, stores init JSON as `vault-cluster-keys` secret, unseals all three pods -- Creates the `nico-system` namespace with Helm ownership labels -- Copies root token to `nico-vault-token` in `nico-system` for the `vault-pki-config` Job - -`bootstrap_ssh_host_key.sh` pre-creates the `ssh-host-key` Secret in OpenSSH PEM format (idempotent — skips if the secret already exists). - -To verify Vault is unsealed: - -```bash -kubectl exec -n vault vault-0 -c vault -- vault status -``` - ---- - -## Phase 5 — external-secrets + nico-prereqs - -```bash -helmfile sync -l name=external-secrets -helmfile sync -l name=nico-prereqs -``` - -After `nico-prereqs` installs, wait for the PostgreSQL cluster to provision and for ESO to sync credentials: - -```bash -# Wait for the Patroni cluster to reach Running state (can take 3-5 minutes) -kubectl wait --for=jsonpath='{.status.PostgresClusterStatus}'=Running \ - postgresql/nico-pg-cluster -n postgres --timeout=600s - -# Verify ESO synced the DB credentials into nico-system -kubectl get secret nico-system.nico.nico-pg-cluster.credentials -n nico-system -``` - ---- - -## Phase 6 — NCX Core - -Deploys the main NCX Core application chart. Run from the **repo root** (`ncx-infra-controller-core/`), not from `helm-prereqs/`. - -```bash -cd .. 
# repo root (ncx-infra-controller-core/) -helm upgrade --install nico ./helm \ - --namespace nico-system \ - -f helm-prereqs/values/ncx-core.yaml \ - --set global.image.repository="${NCX_IMAGE_REGISTRY}/nvmetal-nico" \ - --set global.image.tag="${NCX_CORE_IMAGE_TAG}" \ - --timeout 600s --wait -``` - -Verify LoadBalancer IPs were assigned from your MetalLB pool: - -```bash -kubectl get svc -n nico-system | grep LoadBalancer -``` - ---- - -## Phase 7 — NCX REST (nico-rest) - -All sub-steps run from the NCX REST repo directory (`$NCX_REPO`). - -### 7a — CA signing secret - -Generates the `ca-signing-secret` used by the `nico-rest-ca-issuer` ClusterIssuer for Temporal mTLS. Idempotent — skips if the secret already exists. - -```bash -(cd "${NCX_REPO}" && bash scripts/gen-site-ca.sh) -``` - -### 7b — nico-rest-ca-issuer - -```bash -(cd "${NCX_REPO}" && kubectl apply -k deploy/kustomize/base/cert-manager-io) -``` - -### 7c — NCX REST postgres - -```bash -(cd "${NCX_REPO}" && kubectl apply -k deploy/kustomize/base/postgres) -kubectl rollout status statefulset/postgres -n postgres --timeout=300s -``` - -### 7d — Keycloak - -```bash -(cd "${NCX_REPO}" && kubectl apply -k deploy/kustomize/base/keycloak -n nico-rest) -kubectl rollout status deployment/keycloak -n nico-rest --timeout=300s -``` - -### 7e — Temporal TLS bootstrap - -```bash -(cd "${NCX_REPO}" && kubectl apply -f deploy/kustomize/base/temporal-helm/namespace.yaml) -(cd "${NCX_REPO}" && kubectl apply -f deploy/kustomize/base/temporal-helm/db-creds.yaml) -(cd "${NCX_REPO}" && kubectl apply -f deploy/kustomize/base/temporal-helm/certificates.yaml) -# Wait for the three mTLS certs to be issued by nico-rest-ca-issuer -kubectl wait --for=condition=Ready certificate/server-interservice-cert -n temporal --timeout=120s -kubectl wait --for=condition=Ready certificate/server-cloud-cert -n temporal --timeout=120s -kubectl wait --for=condition=Ready certificate/server-site-cert -n temporal --timeout=120s -``` - -### 7f — 
Temporal - -```bash -helm upgrade --install temporal "${NCX_REPO}/temporal-helm/temporal" \ - --namespace temporal \ - -f "${NCX_REPO}/temporal-helm/temporal/values-kind.yaml" \ - --timeout 600s --wait - -# Create the Temporal namespaces for NCX REST workers -_TEMPORAL_ADDR="temporal-frontend.temporal:7233" -_TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ - --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ - --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ - --tls-server-name interservice.server.temporal.local" -kubectl exec -n temporal deploy/temporal-admintools -- \ - sh -c "temporal operator namespace create -n cloud --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true -kubectl exec -n temporal deploy/temporal-admintools -- \ - sh -c "temporal operator namespace create -n site --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true -``` - -### 7g — NCX REST helm chart - -```bash -# Build the image pull secret dockerconfigjson -_ncx_docker_cfg="$(printf '{"auths":{"nvcr.io":{"username":"$oauthtoken","password":"%s"}}}' \ - "${REGISTRY_PULL_SECRET}" | base64 | tr -d '\n')" - -helm upgrade --install nico-rest "${NCX_REPO}/helm/charts/nico-rest" \ - --namespace nico-rest \ - -f values/ncx-rest.yaml \ - --set global.image.repository="${NCX_IMAGE_REGISTRY}" \ - --set global.image.tag="${NCX_REST_IMAGE_TAG}" \ - --set "nico-rest-common.secrets.imagePullSecret.dockerconfigjson=${_ncx_docker_cfg}" \ - --timeout 600s --wait -``` - -### 7h — NCX REST site-agent - -The deployment order is critical — do not skip steps. 
- -```bash -NCX_SITE_UUID="${NCX_SITE_UUID:-a1b2c3d4-e5f6-4000-8000-000000000001}" -NCX_SITE_AGENT_CHART="${NCX_REPO}/helm/charts/nico-rest-site-agent" - -# Step 1 - pre-apply the gRPC client cert so it exists before the pod starts -helm template nico-rest-site-agent "${NCX_SITE_AGENT_CHART}" \ - --namespace nico-rest \ - -f values/ncx-site-agent.yaml \ - --set global.image.repository="${NCX_IMAGE_REGISTRY}" \ - --set global.image.tag="${NCX_REST_IMAGE_TAG}" \ - --show-only templates/certificate.yaml | kubectl apply -f - -kubectl annotate certificate/core-grpc-client-site-agent-certs -n nico-rest \ - "meta.helm.sh/release-name=nico-rest-site-agent" \ - "meta.helm.sh/release-namespace=nico-rest" --overwrite -kubectl label certificate/core-grpc-client-site-agent-certs -n nico-rest \ - "app.kubernetes.io/managed-by=Helm" --overwrite -kubectl wait --for=condition=Ready certificate/core-grpc-client-site-agent-certs \ - -n nico-rest --timeout=120s - -# Step 2 - create per-site Temporal namespace (site-agent panics without it) -_TEMPORAL_ADDR="temporal-frontend.temporal:7233" -_TEMPORAL_TLS="--tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ - --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ - --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ - --tls-server-name interservice.server.temporal.local" -kubectl exec -n temporal deploy/temporal-admintools -- \ - sh -c "temporal operator namespace create -n '${NCX_SITE_UUID}' --address ${_TEMPORAL_ADDR} ${_TEMPORAL_TLS}" 2>/dev/null || true - -# Step 3 - install site-agent (pre-install hook registers site and creates site-registration secret) -helm upgrade --install nico-rest-site-agent "${NCX_SITE_AGENT_CHART}" \ - --namespace nico-rest \ - -f values/ncx-site-agent.yaml \ - --set global.image.repository="${NCX_IMAGE_REGISTRY}" \ - --set global.image.tag="${NCX_REST_IMAGE_TAG}" \ - --set "envConfig.CLUSTER_ID=${NCX_SITE_UUID}" \ - --set 
"envConfig.TEMPORAL_SUBSCRIBE_NAMESPACE=${NCX_SITE_UUID}" \ - --set "envConfig.TEMPORAL_SUBSCRIBE_QUEUE=site" \ - --timeout 300s --wait - -# Step 4 - verify gRPC connection to nico-api -kubectl logs -n nico-rest -l app.kubernetes.io/name=nico-rest-site-agent --prefix \ - | grep "NicoClient:" -``` - ---- - -## PKI architecture - -The PKI has three layers, built bottom-up: - -``` -selfsigned-bootstrap ClusterIssuer - └── site-root CA Certificate (10-year self-signed CA, Secret "site-root" in cert-manager ns) - └── site-issuer ClusterIssuer (issues Vault's own TLS certs - no Vault dependency) - ├── nicoca-vault-client (Vault port 8200 listener TLS, Secret in vault ns) - └── vault-raft-tls (Vault Raft port 8201 peer TLS, Secret in vault ns) - -vault (running, unsealed) - └── vault-pki-config Job (imports site-root CA into Vault PKI engine "nicoca") - └── vault-nico-issuer ClusterIssuer (issues all workload SPIFFE certs via Vault PKI) -``` - -NCX REST has its own parallel PKI chain for internal services: - -``` -nico-rest-ca-issuer ClusterIssuer (backed by ca-signing-secret in nico-rest ns) - └── Temporal mTLS certificates (server-interservice-cert, server-cloud-cert, server-site-cert) - -vault-nico-issuer ClusterIssuer (same Vault PKI CA as NCX Core) - └── site-agent gRPC client cert (core-grpc-client-site-agent-certs in nico-rest ns) - SPIFFE URI: spiffe://nico.local/nico-system/sa/elektra-site-agent -``` - -The site-agent uses the Vault PKI CA for both directions of mTLS with nico-api: -- Site-agent presents its client cert (Vault-signed) — nico-api trusts it via the same CA. -- Site-agent verifies nico-api's server cert using `ca.crt` from the issued secret (Vault PKI CA). - -### Layer 1 — Bootstrap (no external dependencies) - -`selfsigned-bootstrap` is a cert-manager `selfSigned` ClusterIssuer with no dependencies. It issues `site-root`: a 10-year CA certificate stored as Secret `site-root` in the `cert-manager` namespace. 
This is the trust anchor for the entire cluster. - -### Layer 2 — site-issuer (Vault TLS bootstrap) - -`site-issuer` is a `ca` ClusterIssuer backed by `site-root`. It can issue certificates without Vault being up. - -**This solves the Vault TLS chicken-and-egg problem.** Vault requires TLS to start — but `vault-nico-issuer` (the Vault-backed issuer) can't exist before Vault is running. `site-issuer` breaks the cycle by issuing Vault's own TLS secrets before Vault starts: - -| Secret | Namespace | Purpose | -|--------|-----------|---------| -| `nicoca-vault-client` | `vault` | Port 8200 listener cert (mounted at `/vault/userconfig/nicoca-vault/`) | -| `vault-raft-tls` | `vault` | Raft port 8201 peer cert (mounted at `/vault/userconfig/vault-raft-tls/`) | - -These secrets must exist **before** `helmfile sync -l name=vault` — setup.sh creates them explicitly in Phase 2 using `helm template | kubectl apply`. - -### Layer 3 — vault-nico-issuer (workload PKI) - -Once Vault is running and unsealed, the `vault-pki-config` Job (Helm post-install hook) configures Vault as a PKI backend: - -1. Enables the `nicoca` PKI secrets engine, tunes it to a 10-year max TTL. -2. Imports `site-root` (cert + key) into Vault PKI — Vault becomes an intermediate CA under the same trust root. -3. Creates PKI role `nico-cluster` — allows any name, allows SPIFFE URI SANs, 720h max TTL, EC P-256. -4. Enables Kubernetes auth and writes two policies: `cert-manager-nico-policy` (sign via PKI) and `nico-vault-policy` (read KV secrets). -5. Enables KV v2 at `secrets/` and AppRole auth for the `nico` role. - -`vault-nico-issuer` is then created as a cert-manager ClusterIssuer authenticating to Vault via Kubernetes auth. All NCX Core workload SPIFFE certificates and the site-agent's gRPC client certificate are issued through this issuer. 
- -### nico-roots — CA distribution - -The `nico-roots` Secret (containing `site-root`'s `ca.crt`) must be present in every namespace where NCX workloads run so pods can verify each other's SPIFFE certificates. - -``` -site-root Secret (cert-manager ns) - → ClusterSecretStore "cert-manager-ns-secretstore" (Kubernetes provider) - → ClusterExternalSecret "nico-roots-eso" - → ExternalSecret in every namespace labeled nico.nvidia.com/managed=true - → Secret "nico-roots" (ca.crt) -``` - -`creationPolicy: Orphan` prevents Kubernetes GC from cascading a delete to `nico-roots` if the ExternalSecret is recreated on helm upgrade. - ---- - -## PostgreSQL architecture - -PostgreSQL is deployed as a production-grade 3-node HA cluster managed by the **Zalando PostgreSQL Operator** (`acid.zalan.do`). NCX REST also deploys its own simpler postgres StatefulSet in the same `postgres` namespace for temporal, keycloak, and NCX REST databases. - -``` -postgres-operator (postgres ns) - └── nico-pg-cluster postgresql CRD (postgres ns) ← NCX Core - ├── nico-pg-cluster-0 (Patroni leader) - ├── nico-pg-cluster-1 (Patroni replica) - └── nico-pg-cluster-2 (Patroni replica) - each pod: postgres + postgres-exporter sidecar - -postgres StatefulSet (postgres ns, service: postgres) ← NCX REST - └── Databases: nico, temporal, temporal_visibility, keycloak, elektratest -``` - -### Credential flow (NCX Core) - -The operator automatically creates a per-user credential Secret in the `postgres` namespace: -``` -nico-system.nico.nico-pg-cluster.credentials.postgresql.acid.zalan.do - username: nico-system.nico - password: -``` - -ESO's `nico-db-eso` ClusterExternalSecret mirrors this into `nico-system` as: -``` -nico-system.nico.nico-pg-cluster.credentials - username: nico-system.nico - password: -``` - -### nico-pg-cluster-env ConfigMap - -The operator injects the `nico-pg-cluster-env` ConfigMap (in the `postgres` namespace) into every postgres pod as environment variables. 
Currently provides: - -``` -TMP_SITE = -``` - -The ConfigMap is rendered by the `nico-prereqs` chart (from `Values.siteName`) so it flows in at install time and can be overridden per-site with `--set siteName=`. - -### ssh-host-key format - -`ssh-console-rs` requires the SSH host key in **OpenSSH PEM format** (`-----BEGIN OPENSSH PRIVATE KEY-----`). Helm's `genPrivateKey "ed25519"` produces PKCS8 format which the binary rejects at startup. `bootstrap_ssh_host_key.sh` pre-creates the secret using `ssh-keygen` before `helmfile sync -l name=nico-prereqs` runs. The `lookup` in `templates/_helpers.tpl` detects the existing secret and reuses it, so Helm never overwrites it. - ---- - -## Secrets reference - -All secrets created by setup. The Vault unseal keys (`vault-cluster-keys`) are the most sensitive — back them up to a secure location after first install. - -| Secret | Namespace | Created by | Purpose | -|--------|-----------|------------|---------| -| `site-root` | `cert-manager` | cert-manager (selfsigned-bootstrap) | Self-signed root CA cert + key. Trust anchor for all PKI. | -| `nicoca-vault-client` | `vault` | cert-manager (site-issuer) | Vault port 8200 TLS listener cert | -| `vault-raft-tls` | `vault` | cert-manager (site-issuer) | Vault Raft port 8201 TLS peer cert | -| `vault-cluster-keys` | `vault` | `unseal_vault.sh` | Full Vault init JSON (5 unseal keys + root token). **Back this up.** | -| `vaultunsealkeys` | `vault` | `unseal_vault.sh` | Individual unseal keys (0-4) for automated re-unseal | -| `vaultroottoken` | `vault` | `unseal_vault.sh` | Vault root token. Limit use after setup. 
| -| `nico-system.nico.nico-pg-cluster.credentials.postgresql.acid.zalan.do` | `postgres` | Zalando operator | Operator-generated DB credentials (source of truth) | -| `nico-vault-token` | `nico-system` | `unseal_vault.sh` | Root token copy for `vault-pki-config` Job | -| `nico-vault-approle-tokens` | `nico-system` | `vault-pki-config` Job | AppRole role-id and secret-id for NCX Core services | -| `nvcr-nico-dev` | `nico-system` | `nico-prereqs` chart | Image pull secret for NCX Core registry | -| `ssh-host-key` | `nico-system` | `bootstrap_ssh_host_key.sh` | ed25519 host key for `nico-ssh-console-rs` in OpenSSH format | -| `nico-roots` | `nico-system` | ESO (nico-roots-eso) | Site-root CA cert (`ca.crt`) for SPIFFE cert verification | -| `nico-system.nico.nico-pg-cluster.credentials` | `nico-system` | ESO (nico-db-eso) | DB credentials mirrored from `postgres` ns for `nico-api` | -| `ca-signing-secret` | `nico-rest` | `gen-site-ca.sh` | NCX REST internal CA for Temporal mTLS | -| `core-grpc-client-site-agent-certs` | `nico-rest` | cert-manager (vault-nico-issuer) | Site-agent mTLS client cert for nico-api gRPC | - -### ClusterIssuers - -| Name | Backed by | Issues | -|------|-----------|--------| -| `selfsigned-bootstrap` | cert-manager selfSigned | `site-root` CA only | -| `site-issuer` | `site-root` CA Secret | Vault TLS certs (`nicoca-vault-client`, `vault-raft-tls`) | -| `vault-nico-issuer` | Vault PKI engine (`nicoca/sign/nico-cluster`) | All NCX Core SPIFFE certs + site-agent gRPC client cert | -| `nico-rest-ca-issuer` | `ca-signing-secret` | Temporal mTLS certs | - -### ClusterSecretStores - -| Name | Reads from | Used for | -|------|------------|---------| -| `cert-manager-ns-secretstore` | `cert-manager` namespace | Syncing `site-root` CA to `nico-roots` | -| `postgres-ns-secretstore` | `postgres` namespace | Syncing operator DB credentials to `nico-system` | - -## Troubleshooting - -### nico-api CrashLoopBackOff — siteConfig parse error - -If `nico-api` 
crashes immediately after Phase 6 with a config parse error, the most common cause is empty required fields in the `nicoApiSiteConfig` TOML block. Fields that must be non-empty: - -- `[networks.admin]` — `prefix` and `gateway` (empty string crashes the binary) -- `[pools.lo-ip]`, `[pools.vlan-id]`, `[pools.vni]` — `ranges` must have at least one entry - -Check the pod logs for the specific field: -```bash -kubectl logs -n nico-system -l app.kubernetes.io/name=nico-api --previous -``` - -Fix the value in `values/ncx-core.yaml` and re-run: -```bash -helm upgrade nico ./helm --namespace nico-system -f helm-prereqs/values/ncx-core.yaml \ - --set global.image.repository="${NCX_IMAGE_REGISTRY}/nvmetal-nico" \ - --set global.image.tag="${NCX_CORE_IMAGE_TAG}" -``` - -### DNS resolution failing in pods - -On **Kubespray clusters**, setup.sh deploys the NodeLocal DNSCache DaemonSet automatically. If it is not ready: -```bash -kubectl get daemonset nodelocaldns -n kube-system -kubectl apply -f operators/nodelocaldns-daemonset.yaml -kubectl rollout status daemonset/nodelocaldns -n kube-system -``` - -On **kubeadm clusters**, NodeLocal DNSCache is not used — setup.sh checks CoreDNS readyReplicas instead: -```bash -kubectl get pods -n kube-system -l k8s-app=kube-dns -kubectl rollout restart deployment/coredns -n kube-system -``` - -### Vault TLS bootstrap certificates not Ready - -```bash -kubectl get certificate -n cert-manager -kubectl get certificate -n vault -kubectl describe certificate nicoca-vault-client -n vault -``` - -Common cause: cert-manager webhook not ready yet. Wait 30 seconds and re-run Phase 2. 
- -### Vault pods stuck in Init or CrashLoop - -```bash -kubectl get secret nicoca-vault-client vault-raft-tls -n vault -kubectl logs vault-0 -n vault -c vault -``` - -### vault-pki-config Job failing - -```bash -kubectl logs -n nico-system job/vault-pki-config -c wait-vault -kubectl logs -n nico-system job/vault-pki-config -c configure -``` - -Common causes: -- Vault still sealed — `kubectl exec -n vault vault-0 -c vault -- vault status` -- `nico-vault-token` missing — re-run `./unseal_vault.sh` -- `site-root` Secret not readable by the Job's service account - -### nico-pg-cluster not reaching Running state - -```bash -kubectl get postgresql nico-pg-cluster -n postgres -kubectl describe postgresql nico-pg-cluster -n postgres -kubectl get pods -n postgres -kubectl logs -n postgres nico-pg-cluster-0 -c postgres -``` - -Common causes: -- `local-path-persistent` StorageClass missing — re-run Phase 1 -- `nico-pg-cluster-env` ConfigMap missing in `postgres` namespace — re-run Phase 5 -- Insufficient node resources — tune `postgresql.resources` in `values.yaml` - -### DB credentials not appearing in nico-system - -```bash -kubectl get clustersecretstore postgres-ns-secretstore -kubectl get clusterexternalsecret nico-db-eso -kubectl describe externalsecret -n nico-system -``` - -The source secret (`nico-system.nico.nico-pg-cluster.credentials.postgresql.acid.zalan.do`) is created by the operator only after the cluster reaches `Running` state. 
If the ClusterSecretStore shows `Invalid`, check that the `eso-postgres-ns` ServiceAccount token exists in the `postgres` namespace: -```bash -kubectl get secret eso-postgres-ns-token -n postgres -``` - -### nico-roots Secret not appearing - -```bash -kubectl get clustersecretstore cert-manager-ns-secretstore -kubectl get clusterexternalsecret nico-roots-eso -kubectl get namespace nico-system --show-labels -# Should include: nico.nvidia.com/managed=true -``` - -If the label is missing: -```bash -kubectl label namespace nico-system nico.nvidia.com/managed=true -``` - -### Site-agent gRPC connection to nico-api failing (nil NicoClient) - -The site-agent connects to nico-api at startup with a 5-second deadline. If the connection fails, the `NicoClient` stays nil permanently and all inventory activities panic with a nil-pointer dereference. setup.sh detects this and restarts the StatefulSet automatically, but you can also diagnose manually: - -```bash -# Check which pods connected successfully -kubectl logs -n nico-rest -l app.kubernetes.io/name=nico-rest-site-agent --prefix \ - | grep -E "NicoClient: (successfully connected|failed to get version)" - -# Check mTLS cert was issued -kubectl get certificate core-grpc-client-site-agent-certs -n nico-rest - -# Check the cert was projected into the pod -kubectl exec -n nico-rest nico-rest-site-agent-0 -- ls /etc/nico-certs/ - -# Check DNS resolution of nico-api from the pod -kubectl exec -n nico-rest nico-rest-site-agent-0 -- \ - nslookup nico-api.nico-system.svc.cluster.local -``` - -Common causes and fixes: - -| Symptom | Cause | Fix | -|---------|-------|-----| -| `DeadlineExceeded` in pod logs | DNS cold cache on the node at startup | `kubectl rollout restart statefulset/nico-rest-site-agent -n nico-rest` | -| `certificate signed by unknown authority` | Site-agent cert issued by wrong CA | Check `values/ncx-site-agent.yaml` — `global.certificate.issuerRef.name` must be `vault-nico-issuer` | -| `Unauthenticated` from 
nico-api | SPIFFE URI does not match `InternalRBACRules` | Check `values/ncx-site-agent.yaml` — `certificate.uris` must be `spiffe://nico.local/nico-system/sa/elektra-site-agent` | -| `transport: error while dialing` | Wrong `NICO_SEC_OPT` | Check `envConfig.NICO_SEC_OPT: "2"` in `ncx-site-agent.yaml` (2 = MutualTLS) | -| cert secret missing at pod start | Race: StatefulSet started before cert was issued | Re-run Phase 7h — pre-apply Certificate step ensures cert exists first | - -### Temporal namespace not found (site-agent startup panic) - -If the site-agent panics on startup with a nil pointer in `RegisterCron`: -```bash -kubectl exec -n temporal deploy/temporal-admintools -- \ - sh -c "temporal operator namespace list --address temporal-frontend.temporal:7233 \ - --tls-cert-path /var/secrets/temporal/certs/server-interservice/tls.crt \ - --tls-key-path /var/secrets/temporal/certs/server-interservice/tls.key \ - --tls-ca-path /var/secrets/temporal/certs/server-interservice/ca.crt \ - --tls-server-name interservice.server.temporal.local" -``` - -If the namespace for the site UUID is missing, create it manually: -```bash -kubectl exec -n temporal deploy/temporal-admintools -- \ - sh -c "temporal operator namespace create -n '' \ - --address temporal-frontend.temporal:7233 ..." -``` -Then restart the site-agent. 
- -### MetalLB LoadBalancer services stuck in `` - -If NCX Core services never get an external IP: - -```bash -kubectl get pods -n metallb-system -kubectl get ipaddresspool -n metallb-system -kubectl get bgppeer -n metallb-system -kubectl describe bgppeer -n metallb-system -kubectl logs -n metallb-system -l app=metallb,component=speaker --tail=50 -kubectl get svc -n nico-system -l app.kubernetes.io/name=nico-api -``` - -Common causes: - -| Symptom | Cause | Fix | -|---------|-------|-----| -| `IPAddressPool` not found | `values/metallb-config.yaml` was not applied | Re-run `kubectl apply -f values/metallb-config.yaml` | -| BGP session `Idle` / never establishes | Wrong `peerAddress` or ASN, or firewall blocking TCP 179 | Verify with your network team | -| BGP session up but no IP assigned | IP pool addresses exhausted or CIDR is wrong | Check `kubectl describe ipaddresspool -n metallb-system` | -| All services pending after MetalLB looks healthy | FRR speaker not running | Set `speaker.frr.enabled: true` in `operators/values/metallb.yaml` and re-run Phase 1c | - -### Checking overall health after setup - -```bash -kubectl get clusterissuer -kubectl get clustersecretstore -kubectl get pods -n metallb-system -kubectl get ipaddresspool,bgppeer -n metallb-system -kubectl get pods -n postgres -kubectl get pods -n nico-system -kubectl get jobs -n nico-system -kubectl get secret nico-roots -n nico-system -kubectl get secret nico-system.nico.nico-pg-cluster.credentials -n nico-system -kubectl get pods -n nico-rest -kubectl get pods -n temporal -kubectl get certificate core-grpc-client-site-agent-certs -n nico-rest -``` +> **NICo REST is in-tree.** The REST stack lives in this repository under `rest-api/`; it is no longer a separate repository. `setup.sh` resolves it automatically, so no `NCX_REPO` clone is required. 
diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md index 09fb147bae..206f911b48 100644 --- a/docs/getting-started/quick-start.md +++ b/docs/getting-started/quick-start.md @@ -72,10 +72,9 @@ Obtain an NGC API key at [ngc.nvidia.com](https://ngc.nvidia.com) → **API Keys |----------|----------|-------------| | `REGISTRY_PULL_SECRET` | **Yes** | Pull secret and API key for your image registry. Used to create the image pull secret for both Infra Controller Core and Infra Controller REST. | | `NCX_IMAGE_REGISTRY` | **Yes** | Base image registry for all Infra Controller images (e.g. `my-registry.example.com/ncx`). Used for Infra Controller Core (`/nvmetal-nico`) and Infra Controller REST (`/nico-rest-*`). | -| `NCX_CORE_IMAGE_TAG` | **Yes** | Infra Controller Core (infra-controller-core) image tag (e.g. `v2025.12.30`). | -| `NCX_REST_IMAGE_TAG` | **Yes** | Infra Controller REST (infra-controller-rest) image tag (e.g. `v1.0.4`). | +| `NCX_CORE_IMAGE_TAG` | **Yes** | Infra Controller Core image tag (e.g. `v2025.12.30`). | +| `NCX_REST_IMAGE_TAG` | **Yes** | Infra Controller REST image tag (e.g. `v1.0.4`). | | `KUBECONFIG` | **Yes** | Path to your cluster kubeconfig. | -| `NCX_REPO` | No | Path to a local clone of `infra-controller-rest`. Auto-detected from sibling directories; `preflight.sh` offers to clone it if not found. | | `NCX_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. | ### 3b. Set your Site Name @@ -134,35 +133,9 @@ All fields are documented with inline comments in the file. These fields are safe to leave as empty arrays: `dhcp_servers`, `site_fabric_prefixes`, `deny_prefixes`. Do not delete any field from the TOML block; missing keys cause a different crash than empty ones. -### 3d. Get the NCX REST Repository +### 3d. 
NICo REST source tree -NCX REST (`infra-controller-rest`) is a separate repository that contains the Helm chart, kustomize bases, and helper scripts that `setup.sh` uses for [Phase 7](#setup-script-phases). It is *not* bundled inside this repo--you need a local clone before running setup. - -**Option 1: Let `setup.sh` handle it automatically (recommended)** - -`setup.sh` looks for the repo in these locations in order: - -1. `NCX_REPO` env var (explicit path--use this if you cloned it somewhere non-standard) -2. Sibling directories next to this repo: `../nico-rest`, `../ncx-infra-controller-rest`, `../ncx` -3. If not found anywhere, `preflight.sh` offers to clone it for you before setup proceeds - -If you place the clone next to this repo (the recommended layout), no env var is needed: - -``` -your-workspace/ - ncx-infra-controller-core/ ← this repo - ncx-infra-controller-rest/ ← NCX REST repo (clone here) -``` - -**Option 2: Clone it manually** - -Use the following commands to clone the repository: - -```bash -git clone https://github.com/NVIDIA/infra-controller-rest.git -# Then either place it as a sibling, or: -export NCX_REPO=/path/to/infra-controller-rest -``` +NICo REST lives in this repository under `rest-api/`. The Helm charts, kustomize bases, and helper scripts that `setup.sh` uses for [Phase 7](#setup-script-phases) are resolved in-tree automatically--there is no separate repository to clone and no `NCX_REPO` to set. `preflight.sh` errors out only if `rest-api/` is missing from the checkout. ### 3e. Configure NCX REST Authentication @@ -289,7 +262,7 @@ The `preflight.sh` script checks the following: | Per-node: kernel parameters | `net.bridge.bridge-nf-call-iptables=1` and `net.ipv4.ip_forward=1` on every node | | Per-node: DNS | `kubernetes.default.svc.cluster.local` resolves on every node. | | Registry connectivity | The registry host responds to an HTTPS probe. 
| -| NCX REST repo | Resolves the repo from `NCX_REPO` env var, sibling directories, or offers to clone from GitHub | +| NICo REST source tree | Verifies `rest-api/` is present in the checkout (REST is in-tree; no separate clone) | For air-gapped clusters, the per-node checks pull `busybox:1.36` by default. If your cluster cannot reach Docker Hub, set `PREFLIGHT_CHECK_IMAGE` to a local mirror: @@ -335,7 +308,7 @@ vault (hashicorp/vault 0.25.0, 3-node HA Raft, TLS) external-secrets (external-secrets/external-secrets 0.14.3) nico-prereqs (this Helm chart - nico-system namespace) NCX Core (../helm - ncx-core.yaml values) -NCX REST (ncx-infra-controller-rest/helm/charts/nico-rest) +NCX REST (rest-api/helm/charts/nico-rest) ├── nico-rest-ca-issuer ClusterIssuer (cert-manager.io) ├── postgres StatefulSet (temporal + keycloak + NCX databases) ├── keycloak (dev OIDC IdP, nico-dev realm) @@ -422,12 +395,12 @@ NICo has two CLIs that serve different purposes: | `nicocli` | NICo REST (REST API) | Site management, org bootstrap, instance operations | | `nico-admin-cli` | NICo Core (gRPC API) | Host ingestion, credentials, expected machines, TPM approval | -`nicocli` is built from the NCX REST repo. `nico-admin-cli` is built from the NCX Core repo (`crates/admin-cli`). +`nicocli` is built from the `rest-api/` directory. `nico-admin-cli` is built from `crates/admin-cli`. #### 1. Build and Install the CLI ```bash -cd "$NCX_REPO" +cd rest-api make nico-cli # installs to $(go env GOPATH)/bin/nicocli ``` @@ -520,7 +493,7 @@ For detailed OOB network requirements, refer to the [BMC and Out-of-Band Setup]( This step uses `nico-admin-cli`, the gRPC CLI for NICo Core. 
Build it from the NCX Core repo: ```bash -cd ncx-infra-controller-core/ +cd infra-controller/ cargo build --release -p nico-admin-cli # Binary: target/release/nico-admin-cli ``` diff --git a/docs/index.md b/docs/index.md index b633db1b5d..149aa09416 100644 --- a/docs/index.md +++ b/docs/index.md @@ -19,4 +19,4 @@ NICo is open source under the Apache 2.0 license. - [Hardware Compatibility List](hcl.md) — Supported servers and DPUs - [Release Notes](release-notes.md) — What's new in each version - [FAQs](faq.md) — Common questions answered -- [GitHub: NICo Core](https://github.com/NVIDIA/ncx-infra-controller-core) | [NICo REST](https://github.com/NVIDIA/ncx-infra-controller-rest) +- [GitHub](https://github.com/NVIDIA/infra-controller) diff --git a/docs/manuals/building_nico_containers.md b/docs/manuals/building_nico_containers.md index f5b2fb6cc2..cef6ba460e 100644 --- a/docs/manuals/building_nico_containers.md +++ b/docs/manuals/building_nico_containers.md @@ -15,7 +15,7 @@ assume an `apt`-based distribution such as Ubuntu 24.04. 2. [Add the correct hook for your shell](https://direnv.net/docs/hook.html) 3. Install rustup: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh` (select Option 1) 4. Start a new shell to pick up changes made from direnv and rustup. -5. Clone NICo - `git clone git@github.com:NVIDIA/infra-controller-core.git infra-controller` +5. Clone NICo - `git clone git@github.com:NVIDIA/infra-controller.git infra-controller` 6. `cd infra-controller` 7. `direnv allow` 8. 
`cd $REPO_ROOT/pxe` diff --git a/docs/manuals/nico-admin-cli.md b/docs/manuals/nico-admin-cli.md index d41e584ae1..c59c442d22 100644 --- a/docs/manuals/nico-admin-cli.md +++ b/docs/manuals/nico-admin-cli.md @@ -314,7 +314,7 @@ With this configuration, a client certificate with subject You can see an example of a complete nico-api configuration file -[here](https://github.com/NVIDIA/ncx-infra-controller-core/blob/main/crates/api/src/cfg/test_data/full_config.toml) +[here](https://github.com/NVIDIA/infra-controller/blob/main/crates/api-core/src/cfg/test_data/full_config.toml) ### Permissive mode diff --git a/docs/manuals/nicocli-reference.md b/docs/manuals/nicocli-reference.md index 107422b3bb..d4afcf4fb1 100644 --- a/docs/manuals/nicocli-reference.md +++ b/docs/manuals/nicocli-reference.md @@ -6,9 +6,10 @@ The [Day One Operations](day_one_operations.md) guide uses this reference for al ## Installation -Build and install from the `infra-controller-rest` repo: +Build and install from the `rest-api/` directory of the `infra-controller` repo: ``` +cd rest-api make nico-cli # installs to $(go env GOPATH)/bin/nicocli make nico-cli INSTALL_DIR=/usr/local/bin # install elsewhere ``` diff --git a/docs/manuals/nvlink_partitioning.md b/docs/manuals/nvlink_partitioning.md index e7706eeda4..70790ab989 100644 --- a/docs/manuals/nvlink_partitioning.md +++ b/docs/manuals/nvlink_partitioning.md @@ -43,13 +43,13 @@ NICo users can create NVLink Logical Partitions and plan GPU assignments using N In general, the steps are: -1. The user creates a NVLink Logical Partition using the `POST /v2/org/{org}/nico/nvlink-logical-partition` [REST API endpoint](https://nvidia.github.io/infra-controller-rest/#tag/NVLink-Logical-Partition/operation/create-nvlink-logical-partition). NICo creates an entry in the database and returns an NVLink Logical Partition ID. At this point, there is no underlying NVLink Partition associated with the NVLink Logical Partition. +1. 
The user creates an NVLink Logical Partition using the `POST /v2/org/{org}/nico/nvlink-logical-partition` [REST API endpoint](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/nvlink-logical-partition/create-nvlink-logical-partition). NICo creates an entry in the database and returns an NVLink Logical Partition ID. At this point, there is no underlying NVLink Partition associated with the NVLink Logical Partition. -2. When creating an Instance, the user specifies NVLink Interface configuration for each GPU by referencing their preferred NVLink Logical Partition ID in the `POST /v2/org/{org}/nico/instance` [REST API endpoint request](https://nvidia.github.io/infra-controller-rest/#tag/Instance/operation/create-instance). +2. When creating an Instance, the user specifies NVLink Interface configuration for each GPU by referencing their preferred NVLink Logical Partition ID in the `POST /v2/org/{org}/nico/instance` [REST API endpoint request](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/instance/create-instance). a. If this is the first Instance to be added to specified NVLink Logical Partitions, NICo Core will create and assign NVLink Partitions for them and add the Instance GPUs to the NVLink Partitions. -> **Note**: To ensure that machines in the same Rack are assigned to the same NVLink Partition, an Instance Type can be created for the Rack and all Machines in the Rack assigned to the same Instance Type. Alternatively users can use the [Batch Instance creation REST API endpoint](https://nvidia.github.io/infra-controller-rest/#tag/Instance/operation/batch-create-instances) and set `topologyOptimized` to `true`. +> **Note**: To ensure that machines in the same Rack are assigned to the same NVLink Partition, an Instance Type can be created for the Rack and all Machines in the Rack assigned to the same Instance Type.
Alternatively users can use the [Batch Instance creation REST API endpoint](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/instance/batch-create-instances) and set `topologyOptimized` to `true`. 3. If the user does not want to specify NVLink Interfaces for each GPU when creating an Instance, they can: @@ -65,7 +65,7 @@ In general, the steps are: ### Updating an Instance to change NVLink Logical Partition assignment for its GPUs -If a NICo user wants to update an Instance to change NVLink Logical Partition assignment for its GPUs, they can do so by calling the `PATCH /v2/org/{org}/nico/instance/{instance-id}` [REST API endpoint](https://nvidia.github.io/infra-controller-rest/#tag/Instance/operation/update-instance) +If a NICo user wants to update an Instance to change NVLink Logical Partition assignment for its GPUs, they can do so by calling the `PATCH /v2/org/{org}/nico/instance/{instance-id}` [REST API endpoint](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/instance/update-instance) The user can specify the NVLink Logical Partition ID for each GPU in the Instance by passing the `nvLinkInterfaces` list. diff --git a/docs/manuals/rack_level_admin.md b/docs/manuals/rack_level_admin.md index 94ce797c52..d32d6c6d5a 100644 --- a/docs/manuals/rack_level_admin.md +++ b/docs/manuals/rack_level_admin.md @@ -125,28 +125,28 @@ Currently, NICo only supports GB200 NVL72 racks, where a rack and a NVL domain o ### Rack Endpoints -- [GET /v2/org/{org}/carbide/rack](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/get-all-rack): Retrieve all racks in the specified site. -- [GET /v2/org/{org}/carbide/rack/{rack_id}](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/get-rack): Retrieve a rack with the specified ID. 
-- [GET /v2/org/{org}/carbide/rack/validation](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/validate-racks): Validate components of all racks in the specified site by comparing the expected inventory data to the actual inventory data.
-- [GET /v2/org/{org}/carbide/rack/{rack_id}/validation](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/validate-rack): Validate components of the specified rack by comparing the expected inventory data to the actual inventory data.
-- [PATCH /v2/org/{org}/carbide/rack/power](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/power-control-racks): Control power of all or selected racks in the site. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
-- [PATCH /v2/org/{org}/carbide/rack/{id}/power](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/firmware-update-rack): Control power of the specified rack. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
-- [PATCH /v2/org/{org}/carbide/rack/firmware](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/firmware-update-racks): Update firmware on all or selected racks in the site.
-- [PATCH /v2/org/{org}/carbide/rack/{id}/firmware](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/firmware-update-rack): Update firmware on the specified rack.
-- [POST /v2/org/{org}/carbide/rack/bringup](hhttps://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/bringup-racks): Bring up all or selected racks in the site.
-- [POST /v2/org/{org}/carbide/rack/{id}/bringup](hhttps://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/bringup-racks): Bring up the specified rack.
-- [GET /v2/org/{org}/carbide/rack/task/{id}](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/get-rack-task): Retrieve the status of the specified rack task.
-- [GET /v2/org/{org}/carbide/rack/task/{id}/cancel](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Rack/operation/get-rack-task): Cancel the specified rack task.
+- [GET /v2/org/{org}/carbide/rack](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/get-all-rack): Retrieve all racks in the specified site.
+- [GET /v2/org/{org}/carbide/rack/{rack_id}](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/get-rack): Retrieve a rack with the specified ID.
+- [GET /v2/org/{org}/carbide/rack/validation](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/validate-racks): Validate components of all racks in the specified site by comparing the expected inventory data to the actual inventory data.
+- [GET /v2/org/{org}/carbide/rack/{rack_id}/validation](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/validate-rack): Validate components of the specified rack by comparing the expected inventory data to the actual inventory data.
+- [PATCH /v2/org/{org}/carbide/rack/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/power-control-racks): Control power of all or selected racks in the site. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
+- [PATCH /v2/org/{org}/carbide/rack/{id}/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/firmware-update-rack): Control power of the specified rack. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
+- [PATCH /v2/org/{org}/carbide/rack/firmware](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/firmware-update-racks): Update firmware on all or selected racks in the site.
+- [PATCH /v2/org/{org}/carbide/rack/{id}/firmware](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/firmware-update-rack): Update firmware on the specified rack.
+- [POST /v2/org/{org}/carbide/rack/bringup](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/bringup-racks): Bring up all or selected racks in the site.
+- [POST /v2/org/{org}/carbide/rack/{id}/bringup](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/bringup-racks): Bring up the specified rack.
+- [GET /v2/org/{org}/carbide/rack/task/{id}](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/get-rack-task): Retrieve the status of the specified rack task.
+- [GET /v2/org/{org}/carbide/rack/task/{id}/cancel](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/get-rack-task): Cancel the specified rack task.
 ### Tray (Rack Component) Endpoints
-- [GET /v2/org/{org}/carbide/tray](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/get-all-trays): Retrieve all trays in the specified site.
-- [GET /v2/org/{org}/carbide/tray/{id}](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/get-tray): Retrieve a tray with the specified id.
-- [GET /v2/org/{org}/carbide/tray/validation](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/validate-trays): Validate all or selected trays in the site by comparing the expected inventory data to the actual inventory data.
-- [GET /v2/org/{org}/carbide/tray/{id}/validation](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/validate-tray): Validate the specified tray by comparing the expected inventory data to the actual inventory data.
-- [PATCH /v2/org/{org}/carbide/tray/power](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/power-control-trays): Control the power of all or selected trays in the site. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
-- [PATCH /v2/org/{org}/carbide/tray/{id}/power](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/power-control-tray): Control the power of the specified tray. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
-- [PATCH /v2/org/{org}/carbide/tray/firmware](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/firmware-update-trays): Update the firmware on all or selected trays in the site.
-- [PATCH /v2/org/{org}/carbide/tray/{id}/firmware](https://nvidia.github.io/ncx-infra-controller-rest/#tag/Tray/operation/firmware-update-tray): Update the firmware on the specified tray.
+- [GET /v2/org/{org}/carbide/tray](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/get-all-trays): Retrieve all trays in the specified site.
+- [GET /v2/org/{org}/carbide/tray/{id}](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/get-tray): Retrieve a tray with the specified id.
+- [GET /v2/org/{org}/carbide/tray/validation](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/validate-trays): Validate all or selected trays in the site by comparing the expected inventory data to the actual inventory data.
+- [GET /v2/org/{org}/carbide/tray/{id}/validation](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/validate-tray): Validate the specified tray by comparing the expected inventory data to the actual inventory data.
+- [PATCH /v2/org/{org}/carbide/tray/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/power-control-trays): Control the power of all or selected trays in the site. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
+- [PATCH /v2/org/{org}/carbide/tray/{id}/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/power-control-tray): Control the power of the specified tray. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`.
+- [PATCH /v2/org/{org}/carbide/tray/firmware](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/firmware-update-trays): Update the firmware on all or selected trays in the site.
+- [PATCH /v2/org/{org}/carbide/tray/{id}/firmware](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/tray/firmware-update-tray): Update the firmware on the specified tray.
diff --git a/docs/manuals/repair/overview.md b/docs/manuals/repair/overview.md
index dcccf66049..e1fb15f3bf 100644
--- a/docs/manuals/repair/overview.md
+++ b/docs/manuals/repair/overview.md
@@ -39,7 +39,7 @@ Tenant privileges are described in terms of Capabilities. The capabilities of a
 In service account mode, the `targetedInstanceCreation` capability is granted to Service Account Tenant when [`GET /v2/org/{org}/nico/service-account/current` REST API endpoint](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/service-account/get-current-service-account) is called.
-At present turning this capability on for regular Tenants (who are not part of a Service Account org) is not supported via the REST API. However the feature is in active development, relevant issue can be tracked [here](https://github.com/NVIDIA/infra-controller-rest/issues/304).
+At present turning this capability on for regular Tenants (who are not part of a Service Account org) is not supported via the REST API. However the feature is in active development, relevant issue can be tracked [here](https://github.com/NVIDIA/infra-controller/issues/2104).
 NOTE: Privileged Tenants still need Network Allocations from Provider in order to create Instances.
diff --git a/docs/openapi/getting_started.md b/docs/openapi/getting_started.md
index 71ac5c7a87..61cd0ad998 100644
--- a/docs/openapi/getting_started.md
+++ b/docs/openapi/getting_started.md
@@ -2,7 +2,7 @@ This section provides a quick overview of the API and how to get started.
 ### Authentication
 The first step is to authenticate using a JWT bearer token. Organization structures and roles depend on the authentication configuration used. For details on authentication, please consult the
-NICo REST `auth` module [README](https://github.com/NVIDIA/infra-controller-rest/tree/main/auth).
+NICo REST `auth` module [README](https://github.com/NVIDIA/infra-controller/tree/main/rest-api/auth).
 ### API Version
 The next step is to be aware of the API version being used. The API version can be retrieved by calling the [Retrieve Metadata endpoint](/infra-controller/rest-api-reference/api-reference/metadata/get-metadata). In general, the API maintains backward
diff --git a/docs/openapi/spec.yaml b/docs/openapi/spec.yaml
index 29d1428e5c..4017fd6125 100644
--- a/docs/openapi/spec.yaml
+++ b/docs/openapi/spec.yaml
@@ -34,7 +34,7 @@ tags:
 ### Authentication
 The first step is to authenticate using a JWT bearer token. Organization structures and roles depend on the authentication configuration used. For details on authentication, please consult the
- NICo REST `auth` module [README](https://github.com/NVIDIA/ncx-infra-controller-rest/tree/main/auth).
+ NICo REST `auth` module [README](https://github.com/NVIDIA/infra-controller/tree/main/rest-api/auth).
 ### API Version
 The next step is to be aware of the API version being used. The API version can be retrieved by calling the [Retrieve Metadata endpoint](#tag/Metadata/operation/get-metadata). In general, the API maintains backward
@@ -54,7 +54,7 @@ tags:
 Once the Provider and the Tenant are initialized, the user can create resources by making calls to the appropriate endpoints.
 ### Creating Site Level IP Blocks
- To utilize a NICo Site, the Provider or Service Account holder must create IP Blocks for each network overlay defined in NICo Core [Site configuration toml file](https://github.com/NVIDIA/ncx-infra-controller-core/blob/v0.5.8/deploy/files/nico-api/nico-api-site-config.toml#L30).
+ To utilize a NICo Site, the Provider or Service Account holder must create IP Blocks for each network overlay defined in NICo Core [Site configuration toml file](https://github.com/NVIDIA/infra-controller/blob/main/deploy/files/nico-api/nico-api-site-config.toml#L30).
 To create an IP Block, the user must make a call to the [Create IP Block endpoint](#tag/IP-Block/operation/create-ipblock).
diff --git a/docs/release-notes.md b/docs/release-notes.md
index f8c556f944..887905855e 100644
--- a/docs/release-notes.md
+++ b/docs/release-notes.md
@@ -7,7 +7,7 @@ This document contains release notes for the NVIDIA Infra Controller (NICo) proj
 ### Highlights
 - **Documentation refresh + unified REST API docs**: Updated the docs look and feel at [https://docs.nvidia.com/infra-controller/documentation/introduction](https://docs.nvidia.com/infra-controller/documentation/introduction), and consolidated REST API information into the same documentation set.
-- **Simplified deployment**: Added NICo deployment [prerequisite tool](https://github.com/NVIDIA/ncx-infra-controller-core/tree/main/helm-prereqs) `helm-prereqs` to install required dependencies and enable easy NICo deployment.
+- **Simplified deployment**: Added NICo deployment [prerequisite tool](https://github.com/NVIDIA/infra-controller/tree/main/helm-prereqs) `helm-prereqs` to install required dependencies and enable easy NICo deployment.
 - **Rack Level Administration (RLA)**: Significantly expanded rack/tray operations via REST APIs (validation, power, firmware, bring-up).
 ### Compatibility Matrix
@@ -35,7 +35,7 @@ The following dependencies have been validated for this release:
 - Helm/Helmfile-driven installation of NICo prerequisites--including MetalLB, Zalando PostgreSQL Operator, cert-manager, HashiCorp Vault, and external-secrets--along with the main NICo components--NICo Core and NICo REST.
 - Includes orchestration and automation scripts such as `helmfile.yaml`, `setup.sh`, `preflight.sh`, and `clean.sh`.
 - This tool significantly reduces installation time compared to manual installation.
- - Location: [https://github.com/NVIDIA/ncx-infra-controller-core/tree/main/helm-prereqs](https://github.com/NVIDIA/ncx-infra-controller-core/tree/main/helm-prereqs)
+ - Location: [https://github.com/NVIDIA/infra-controller/tree/main/helm-prereqs](https://github.com/NVIDIA/infra-controller/tree/main/helm-prereqs)
 #### Rack Level Administration (RLA)
diff --git a/helm-prereqs/README.md b/helm-prereqs/README.md
index e0f1dfdcbd..fc1361d9b7 100644
--- a/helm-prereqs/README.md
+++ b/helm-prereqs/README.md
@@ -13,7 +13,7 @@ export NICO_REST_IMAGE_TAG= # unless using --skip-rest
 ## Documentation
-For complete step-by-step deployment instructions, see the **[Quick Start Guide](https://nvidia.github.io/ncx-infra-controller-core/documentation/getting-started/quick-start-guide)** in the NICo documentation site. The Quick Start Guide covers:
+For complete step-by-step deployment instructions, see the **[Quick Start Guide](https://docs.nvidia.com/infra-controller/documentation/getting-started/quick-start-guide)** in the NICo documentation site. The Quick Start Guide covers:
 1. Building NICo containers
 2. Preparing the Kubernetes cluster
@@ -23,7 +23,7 @@ For complete step-by-step deployment instructions, see the **[Quick Start Guide]
 6. Discovering your first host
 7. Verifying the deployment
-For manual phase-by-phase installation (re-running individual phases, debugging failures), see the **[Reference Installation](https://nvidia.github.io/ncx-infra-controller-core/documentation/getting-started/installation-options/reference-installation)** guide.
+For manual phase-by-phase installation (re-running individual phases, debugging failures), see the **[Reference Installation](https://docs.nvidia.com/infra-controller/documentation/getting-started/installation-options/reference-installation)** guide.
 ## Directory structure
@@ -88,7 +88,7 @@ Once the above is done, run `./setup.sh -y`.
 ## Configuration reference
 Detailed field-by-field instructions for each values file live in the
-[Quick Start Guide — Step 3](https://nvidia.github.io/ncx-infra-controller-core/documentation/getting-started/quick-start-guide#step-3--configure-the-site).
+[Quick Start Guide — Step 3](https://docs.nvidia.com/infra-controller/documentation/getting-started/quick-start-guide#step-3--configure-the-site).
 The tables below summarize the keys that must be set per site.
 ### Environment variables
@@ -212,7 +212,7 @@ NICo Core (../helm - nico-core.yaml values)
 ├── nico-pxe (Deployment - HTTP PXE boot)
 ├── nico-ssh-console-rs (Deployment - SSH console proxy)
 └── unbound (Deployment - .forge zone DNS, opt-in)
-NICo REST (infra-controller-rest/helm/charts/nico-rest)
+NICo REST (rest-api/helm/charts/nico-rest)
 ├── nico-rest-ca-issuer ClusterIssuer (cert-manager.io)
 ├── postgres StatefulSet (temporal + keycloak + NICo databases)
 ├── keycloak (dev OIDC IdP, nico-dev realm)
From 3aa7e8f4b8972ff688662b62d9b5caf43c658520 Mon Sep 17 00:00:00 2001
From: Kyle Felter
Date: Wed, 10 Jun 2026 03:33:04 -0500
Subject: [PATCH 2/3] docs: Rename infra-controller-core/-rest references in repo markdown
---
 AGENTS.md | 4 ++--
 CONTRIBUTING.md | 4 ++--
 README.md | 7 ++-----
 STYLE_GUIDE.md | 2 +-
 book/src/configuration/configurability.md | 8 ++++----
 crates/bmc-proxy/README.md | 2 +-
 dev/deployment/devspace/README.md | 2 +-
 rest-api/AGENTS.md | 4 ++--
 rest-api/CONTRIBUTING.md | 4 ++--
 rest-api/README.md | 4 ++--
 rest-api/cli/INSTALL.md | 2 +-
 rest-api/openapi/README.md | 2 +-
 rest-api/sdk/simple/README.md | 2 +-
 13 files changed, 22 insertions(+), 25 deletions(-)
diff --git a/AGENTS.md b/AGENTS.md
index 2b3b42b4c2..60492e84f2 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -1,7 +1,7 @@
 # AGENTS.md
 This file provides guidance for AI coding agents working in the
-`ncx-infra-controller-core` repository.
+`infra-controller` repository.
 ## Project Overview
@@ -26,7 +26,7 @@ to fast-track building next-generation AI Cloud offerings.
 ## Repository Structure
 ```
-ncx-infra-controller-core/
+infra-controller/
 ├── crates/ # Rust crate implementations. To discover all crates
 │ # and their purpose, run `ls crates/` or see the
 │ # [workspace] members list in `Cargo.toml` — each
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 779be266c5..6967b418a1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -110,11 +110,11 @@ All pull requests are automatically checked for DCO compliance via DCO bot. Pull
 ## Fork and Setup
-Developers must first fork the upstream [NCX Infra Controller repository](https://github.com/NVIDIA/ncx-infra-controller-core).
+Developers must first fork the upstream [Infra Controller repository](https://github.com/NVIDIA/infra-controller).
 ### 1. Fork the Repository
-1. Navigate to the [NCX Infra Controller repository](https://github.com/NVIDIA/ncx-infra-controller-core) on GitHub.
+1. Navigate to the [Infra Controller repository](https://github.com/NVIDIA/infra-controller) on GitHub.
 2. Click the **Fork** button in the upper right corner.
 3. Select your GitHub account as the destination.
diff --git a/README.md b/README.md
index ed50d9f32b..a615a8adbe 100644
--- a/README.md
+++ b/README.md
@@ -34,7 +34,7 @@ of the bare-metal lifecycle to fast-track building next generation AI Cloud offe
 ```bash
 # 1. Build and push images to your registry
 # NICo Core image: /nvmetal-nico: (this repo)
-# NICo REST images: /nico-rest-api:, etc. (infra-controller-rest)
+# NICo REST images: /nico-rest-api:, etc.
 # 2. Set environment variables
 export KUBECONFIG=/path/to/kubeconfig
@@ -52,10 +52,7 @@ export NICO_REST_IMAGE_TAG= # e.g. 2.0.0-pr-58-g38a54a3
 # Edit helm-prereqs/values.yaml:
 # siteName — short site identifier
-# 4. Point NICO_REST_REPO at infra-controller-rest (auto-detected if a sibling directory)
-export NICO_REST_REPO=/path/to/infra-controller-rest # optional
-
-# 5. Run setup — installs common services, NICo Core, and NICo REST in order
+# 4. Run setup — installs common services, NICo Core, and NICo REST in order
 cd helm-prereqs
 ./setup.sh # interactive — prompts before deploying Core and REST
 ./setup.sh -y # non-interactive — deploys everything (CI/CD)
diff --git a/STYLE_GUIDE.md b/STYLE_GUIDE.md
index f09a399c3a..5e3e44a2b9 100644
--- a/STYLE_GUIDE.md
+++ b/STYLE_GUIDE.md
@@ -1,4 +1,4 @@
-# How to write Rust in ncx-infra-controller-core
+# How to write Rust in infra-controller
 The goal of this document is to help keep our codebase consistent and maintainable by outlining best-practices we've learned through experience. It is currently a mix of best practices for _this codebase_ (ie. how we expect code to
diff --git a/book/src/configuration/configurability.md b/book/src/configuration/configurability.md
index 3389ab10da..9189397d2f 100644
--- a/book/src/configuration/configurability.md
+++ b/book/src/configuration/configurability.md
@@ -689,8 +689,8 @@ it with the rack-level state machine. See
 The NICo REST stack (separate helm release named `nico-rest`, in the `nico-rest` namespace) sits on top of NICo Core and provides the public REST API, workflow orchestration, optional Keycloak IdP, and the
-per-site agent. Its source repo is
-[`infra-controller-rest`](https://github.com/NVIDIA/ncx-infra-controller-rest);
+per-site agent. Its source lives in the
+[`rest-api/`](https://github.com/NVIDIA/infra-controller/tree/main/rest-api) tree;
 this guide covers only the *site-side* configuration knobs.
 ### nico-rest helm release — `helm-prereqs/values/nico-rest.yaml`
@@ -732,7 +732,7 @@ Temporal is deployed by `setup.sh` Phase 7f using the upstream Temporal helm chart with mTLS enabled. The mTLS issuer (`nico-rest-ca-issuer`) is installed in Phase 7b. Operators usually don't touch Temporal config directly; see the temporal subchart values in
Operators usually don't touch Temporal config directly; see the temporal subchart values in -[`infra-controller-rest/helm/charts/temporal/values.yaml`](https://github.com/NVIDIA/ncx-infra-controller-rest) +[`rest-api/temporal-helm/temporal/values.yaml`](https://github.com/NVIDIA/infra-controller/tree/main/rest-api/temporal-helm/temporal) if you need to tune retention or task queue counts. ### Keycloak (dev IdP) @@ -814,7 +814,7 @@ also re-applies operator-chart defaults that may not match your production tuning. For the REST stack the equivalent is `helm upgrade nico-rest …` against -`infra-controller-rest/helm/charts/nico-rest`. +`rest-api/helm/charts/nico-rest`. See [`helm/README.md` → Upgrading](../../../helm/README.md#upgrading) for the diff-then-apply pattern. diff --git a/crates/bmc-proxy/README.md b/crates/bmc-proxy/README.md index 28912271de..c779f111d8 100644 --- a/crates/bmc-proxy/README.md +++ b/crates/bmc-proxy/README.md @@ -223,7 +223,7 @@ sequenceDiagram This crate is meant to implement a clean architectural boundary, but the implementation still couples to nico in slightly uncomfortable ways: -1. It's still a component of the ncx-infra-controller-core repo, so it's not fully independent +1. It's still a component of the infra-controller repo, so it's not fully independent 2. It expects nico-api to resolve proxied BMC IPs through `FindMacAddressByBmcIp`. 3. It expects nico-api to return credentials from `GetBmcCredentials` for every proxied BMC. 
diff --git a/dev/deployment/devspace/README.md b/dev/deployment/devspace/README.md index 6eaa8e7313..57793b8c7a 100644 --- a/dev/deployment/devspace/README.md +++ b/dev/deployment/devspace/README.md @@ -119,7 +119,7 @@ docker build -t "machine-a-tron:" -f dev/deployment/devs DevSpace then deploys the Helm chart with the built `nico-api` image wired into `global.image.repository` and `global.image.tag`, the built `nico-bmc-proxy` image wired into the `nico-bmc-proxy` chart values, and applies the local-only `machine-a-tron` manifest with its image wired into the `Deployment` spec. -## Re-initializing ncx-infra-controller-core to a clean slate +## Re-initializing infra-controller to a clean slate Once deployed, the `nico-api` container will run and initialize its database, and the `machine-a-tron` container will run a set of mock machines, which will be discovered and ingested into the database, and run through the state machine until they reach a Ready state. diff --git a/rest-api/AGENTS.md b/rest-api/AGENTS.md index d37da41a49..6b77dce378 100644 --- a/rest-api/AGENTS.md +++ b/rest-api/AGENTS.md @@ -1,7 +1,7 @@ # AGENTS.md This file provides guidance for AI coding agents working in the -`infra-controller-rest` repository. +`rest-api/` tree of the `infra-controller` repository. ## Project Overview @@ -27,7 +27,7 @@ concert with Core services for on-site hardware operations. ## Repository Structure ```text -infra-controller-rest/ +rest-api/ ├── api/ # Main REST API server (Echo-based) ├── auth/ # Authentication (Keycloak, JWT, service accounts) ├── cert-manager/ # Native PKI certificate management (credsmgr) diff --git a/rest-api/CONTRIBUTING.md b/rest-api/CONTRIBUTING.md index ed0b13ad76..5aa6e87b51 100644 --- a/rest-api/CONTRIBUTING.md +++ b/rest-api/CONTRIBUTING.md @@ -116,8 +116,8 @@ Developers must first fork the upstream [NVIDIA Infrastructure Controller REST r ### 2. 
Clone Your Fork ```bash -git clone https://github.com//infra-controller-rest.git -cd infra-controller-rest +git clone https://github.com//infra-controller.git +cd infra-controller ``` ### 3. Add Upstream Remote diff --git a/rest-api/README.md b/rest-api/README.md index c0f5392ab8..512bf51396 100644 --- a/rest-api/README.md +++ b/rest-api/README.md @@ -11,7 +11,7 @@ In deployments, NVIDIA Infrastructure Controller REST requires Core services to The REST layer can be deployed in the datacenter with NVIDIA Infrastructure Controller Core, or deployed anywhere in Cloud and allow Site Agent to connect from the datacenter. Multiple NVIDIA Infrastructure Controller Cores running in different datacenters can also connect to NVIDIA Infrastructure Controller REST through respective Site Agents. -View latest OpenAPI schema on [GitHub pages](https://nvidia.github.io/infra-controller-rest/). +View the latest OpenAPI schema in the [REST API Reference](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference). ## Prerequisites @@ -227,7 +227,7 @@ az acr login --name myregistry 2. Build and push: ```bash -REGISTRY=my-registry.example.com/infra-controller-rest +REGISTRY=my-registry.example.com/infra-controller TAG=v1.0.0 make docker-build IMAGE_REGISTRY=$REGISTRY IMAGE_TAG=$TAG diff --git a/rest-api/cli/INSTALL.md b/rest-api/cli/INSTALL.md index 295cffd18a..1e7941cde0 100644 --- a/rest-api/cli/INSTALL.md +++ b/rest-api/cli/INSTALL.md @@ -125,7 +125,7 @@ If any of those are not true, the task is not complete. - **`fatal: unable to access 'https://github.com/...': ...` or `go mod download` network errors** — confirm the user has internet access and (if behind a corporate proxy) that `HTTPS_PROXY` / `GOPROXY` are set. Surface the exact error to the user; do not retry indefinitely. -- **Repo not found at `NVIDIA/infra-controller-rest`** — the repo may have been renamed. Check for a repo whose name contains `infra-controller` or `nico`. 
If you cannot find it, stop and ask the user for the current location. +- **Repo not found at `NVIDIA/infra-controller`** — the repo may have been renamed. Check for a repo whose name contains `infra-controller` or `nico`. If you cannot find it, stop and ask the user for the current location. - **`go: cannot find main module`** — you are not inside the cloned repo. Re-run `cd` into the cloned directory. diff --git a/rest-api/openapi/README.md b/rest-api/openapi/README.md index 48d0d3f14a..4eae15ba17 100644 --- a/rest-api/openapi/README.md +++ b/rest-api/openapi/README.md @@ -5,7 +5,7 @@ SPDX-License-Identifier: Apache-2.0 # NVIDIA Infra Controller REST OpenAPI Schema -This repo contains OpenAPI schema for NVIDIA Infra Controller REST endpoints. The latest Redoc-rendered version is available at https://nvidia.github.io/infra-controller-rest/ +This repo contains OpenAPI schema for NVIDIA Infra Controller REST endpoints. The latest rendered version is available at https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference # Development diff --git a/rest-api/sdk/simple/README.md b/rest-api/sdk/simple/README.md index 963db93ceb..d35928871d 100644 --- a/rest-api/sdk/simple/README.md +++ b/rest-api/sdk/simple/README.md @@ -42,7 +42,7 @@ go get github.com/NVIDIA/infra-controller/rest-api/sdk/simple For local development, use a `replace` directive: ```go -replace github.com/NVIDIA/infra-controller/rest-api => /path/to/infra-controller-rest +replace github.com/NVIDIA/infra-controller/rest-api => /path/to/infra-controller/rest-api ``` ### Local development (kind) From 9a00f3857b32e49c715b9ca6de83a80658675226 Mon Sep 17 00:00:00 2001 From: Kyle Felter Date: Wed, 10 Jun 2026 14:22:10 -0500 Subject: [PATCH 3/3] docs: Address review feedback on rename cleanup --- docs/getting-started/quick-start.md | 64 ++++++++++++++--------------- docs/manuals/rack_level_admin.md | 2 +- 2 files changed, 33 insertions(+), 33 deletions(-) diff --git 
a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md index 206f911b48..8d8d18b094 100644 --- a/docs/getting-started/quick-start.md +++ b/docs/getting-started/quick-start.md @@ -59,23 +59,23 @@ Everything in this step must be done **before** running `setup.sh`. Skipping any ```bash export KUBECONFIG=/path/to/kubeconfig # your cluster kubeconfig export REGISTRY_PULL_SECRET= # your registry pull credential -export NCX_IMAGE_REGISTRY=my-registry.example.com/ncx # base registry for all NCX images -export NCX_CORE_IMAGE_TAG= # e.g. v2025.12.30-rc1 -export NCX_REST_IMAGE_TAG= # e.g. v1.0.4 +export NICO_IMAGE_REGISTRY=my-registry.example.com/nico # base registry for all NICo images +export NICO_CORE_IMAGE_TAG= # e.g. v2025.12.30-rc1 +export NICO_REST_IMAGE_TAG= # e.g. v1.0.4 ``` -`NCX_IMAGE_REGISTRY` is used for both NCX Core (`/nvmetal-nico`) and NCX REST (`/nico-rest-*`). Push all images to this registry before running setup. +`NICO_IMAGE_REGISTRY` is used for both NICo Core (`/nvmetal-carbide`) and NICo REST (`/nico-rest-*`). Push all images to this registry before running setup. Obtain an NGC API key at [ngc.nvidia.com](https://ngc.nvidia.com) → **API Keys** → **Generate Personal Key**. | Variable | Required | Description | |----------|----------|-------------| -| `REGISTRY_PULL_SECRET` | **Yes** | Pull secret and API key for your image registry. Used to create the image pull secret for both Infra Controller Core and Infra Controller REST. | -| `NCX_IMAGE_REGISTRY` | **Yes** | Base image registry for all Infra Controller images (e.g. `my-registry.example.com/ncx`). Used for Infra Controller Core (`/nvmetal-nico`) and Infra Controller REST (`/nico-rest-*`). | -| `NCX_CORE_IMAGE_TAG` | **Yes** | Infra Controller Core image tag (e.g. `v2025.12.30`). | -| `NCX_REST_IMAGE_TAG` | **Yes** | Infra Controller REST image tag (e.g. `v1.0.4`). | +| `REGISTRY_PULL_SECRET` | **Yes** | Pull secret and API key for your image registry. 
Used to create the image pull secret for both NICo Core and NICo REST. | +| `NICO_IMAGE_REGISTRY` | **Yes** | Base image registry for all NICo images (e.g. `my-registry.example.com/nico`). Used for NICo Core (`/nvmetal-carbide`) and NICo REST (`/nico-rest-*`). | +| `NICO_CORE_IMAGE_TAG` | **Yes** | NICo Core image tag (e.g. `v2025.12.30`). | +| `NICO_REST_IMAGE_TAG` | **Yes** | NICo REST image tag (e.g. `v1.0.4`). | | `KUBECONFIG` | **Yes** | Path to your cluster kubeconfig. | -| `NCX_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. | +| `NICO_SITE_UUID` | No | Stable UUID for this site. Defaults to `a1b2c3d4-e5f6-4000-8000-000000000001`. | ### 3b. Set your Site Name @@ -85,7 +85,7 @@ Open `helm-prereqs/values.yaml` and change `siteName` from the placeholder to yo siteName: "mysite" # ← replace "TMP_SITE" with your site name (e.g. "examplesite", "prod-us-east") ``` -This value is injected into every postgres pod as the `TMP_SITE` environment variable. It must match the `sitename` in the NCX Core `siteConfig` block below. +This value is injected into every postgres pod as the `TMP_SITE` environment variable. It must match the `sitename` in the NICo Core `siteConfig` block below. To tune PostgreSQL resources for your node capacity (the defaults are conservative for dev), edit the following values: ```yaml @@ -101,9 +101,9 @@ postgresql: memory: "1Gi" ``` -### 3c. Configure NCX Core Site Deployment +### 3c. Configure NICo Core Site Deployment -Open `helm-prereqs/values/ncx-core.yaml` and update the following values: +Open `helm-prereqs/values/nico-core.yaml` and update the following values: - **API hostname**: The external DNS name for the Infra Controller Core API: @@ -137,11 +137,11 @@ All fields are documented with inline comments in the file. NICo REST lives in this repository under `rest-api/`. 
The Helm charts, kustomize bases, and helper scripts that `setup.sh` uses for [Phase 7](#setup-script-phases) are resolved in-tree automatically--there is no separate repository to clone and no `NCX_REPO` to set. `preflight.sh` errors out only if `rest-api/` is missing from the checkout. -### 3e. Configure NCX REST Authentication +### 3e. Configure NICo REST Authentication The default configuration uses the *dev Keycloak instance* that `setup.sh` deploys automatically. No changes are needed if you're running a dev/test environment. -For *production*, or if you are using your own IdP, edit the `helm-prereqs/values/ncx-rest.yaml` file as follows: +For *production*, or if you are using your own IdP, edit the `helm-prereqs/values/nico-rest.yaml` file as follows: **Option 1: Use your own Keycloak or OIDC-compatible IdP** @@ -172,7 +172,7 @@ When `keycloak.enabled: false`, the Keycloak deployment is still created by `set ### 3f. Review site-agent Config -The defaults in `helm-prereqs/values/ncx-site-agent.yaml` should match the dev postgres instance deployed by `setup.sh`. +The defaults in `helm-prereqs/values/nico-site-agent.yaml` should match the dev postgres instance deployed by `setup.sh`. `DB_USER` and `DB_PASSWORD` are injected at runtime from the `db-creds` Kubernetes Secret (created by the `nico-rest-common` sub-chart during Phase 7g). The Secret is referenced via `secrets.dbCreds` in the site-agent values. @@ -189,7 +189,7 @@ envConfig: ### 3g. Configure MetalLB -MetalLB provides LoadBalancer IPs for NCX Core services (nico-api, DHCP, DNS, PXE, SSH console). Without it, those services stay in `<pending>` state and the site is unreachable. +MetalLB provides LoadBalancer IPs for NICo Core services (nico-api, DHCP, DNS, PXE, SSH console). Without it, those services stay in `<pending>` state and the site is unreachable. > **NTP note:** NICo does not run a standalone NTP service. 
Instead, NTP server addresses are provided to managed hosts via DHCP option 42--configured in the `nico-dhcp` chart Kea hook parameters (`nico-ntpserver`). Point this to your enterprise NTP servers. @@ -210,9 +210,9 @@ Add or remove `BGPPeer` blocks to match your node count, with one block per work ### 3h. Assign Service VIPs -Each NCX Core service that exposes a LoadBalancer needs a **specific, stable IP** from your MetalLB pool. Without explicit assignments, MetalLB picks IPs randomly on each install, which means your DHCP relay, DNS records, PXE config, and API hostname cannot be pre-configured and will break on redeploy. +Each NICo Core service that exposes a LoadBalancer needs a **specific, stable IP** from your MetalLB pool. Without explicit assignments, MetalLB picks IPs randomly on each install, which means your DHCP relay, DNS records, PXE config, and API hostname cannot be pre-configured and will break on redeploy. -Open `helm-prereqs/values/ncx-core.yaml` and update the VIP for each service: +Open `helm-prereqs/values/nico-core.yaml` and update the VIP for each service: | Service | Values key | Pool to use | |---------|-----------|-------------| @@ -231,10 +231,10 @@ All IPs must be within the `IPAddressPool` ranges you defined in `values/metallb ### 3i. (Optional) Set a Stable Site UUID -If you want a specific site UUID instead of the default placeholder, set the `NCX_SITE_UUID` environment variable: +If you want a specific site UUID instead of the default placeholder, set the `NICO_SITE_UUID` environment variable: ```bash -export NCX_SITE_UUID= # must be a valid UUID v4 +export NICO_SITE_UUID= # must be a valid UUID v4 ``` This UUID is used as the Temporal namespace for the site and as the `CLUSTER_ID` passed to the site-agent. Once set and deployed, changing it requires redeploying the site-agent and re-registering the site. 
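
Because changing the site UUID after deployment means redeploying the site-agent and re-registering the site, it may be worth checking the value before running `setup.sh`. The following is a minimal sketch, not part of the repository: `is_uuid_v4` is a hypothetical helper, and it is not known whether `setup.sh` performs a similar check itself. It only verifies the textual layout of a version-4 UUID (version nibble `4`, variant nibble `8`, `9`, `a`, or `b`).

```bash
# Hypothetical helper (an assumption, not shipped with setup.sh):
# returns 0 when the argument has the textual layout of a version-4 UUID.
is_uuid_v4() {
  printf '%s\n' "$1" | grep -Eq \
    '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}$'
}

# Fall back to the documented default when NICO_SITE_UUID is unset or empty.
site_uuid="${NICO_SITE_UUID:-a1b2c3d4-e5f6-4000-8000-000000000001}"

if is_uuid_v4 "$site_uuid"; then
  echo "site UUID looks valid: $site_uuid"
else
  echo "not a valid UUID v4: $site_uuid" >&2
fi
```

The check uses `grep -E` rather than a bash-only `[[ =~ ]]` test so that the sketch also works under a plain POSIX shell.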
@@ -276,7 +276,7 @@ Run the `setup.sh` script as follows: ```bash cd helm-prereqs/ -./setup.sh # interactive — prompts before deploying NCX Core and NCX REST +./setup.sh # interactive — prompts before deploying NICo Core and NICo REST ./setup.sh -y # non-interactive — deploys everything ``` @@ -294,8 +294,8 @@ The `setup.sh` script installs all prerequisites and NICo components in sequenti | 3 | HashiCorp Vault (3-node HA Raft) | | 4 | Vault init + unseal + SSH host key | | 5 | external-secrets + nico-prereqs + nico-pg-cluster | -| 6 | **NCX Core** (nico helm release) | -| 7a-7h | **NCX REST** full stack (postgres, Keycloak, Temporal, nico-rest, site-agent) | +| 6 | **NICo Core** (nico helm release) | +| 7a-7h | **NICo REST** full stack (postgres, Keycloak, Temporal, nico-rest, site-agent) | The following components are deployed: @@ -307,10 +307,10 @@ cert-manager (jetstack/cert-manager v1.17.1) vault (hashicorp/vault 0.25.0, 3-node HA Raft, TLS) external-secrets (external-secrets/external-secrets 0.14.3) nico-prereqs (this Helm chart - nico-system namespace) -NCX Core (../helm - ncx-core.yaml values) -NCX REST (rest-api/helm/charts/nico-rest) +NICo Core (../helm - nico-core.yaml values) +NICo REST (rest-api/helm/charts/nico-rest) ├── nico-rest-ca-issuer ClusterIssuer (cert-manager.io) - ├── postgres StatefulSet (temporal + keycloak + NCX databases) + ├── postgres StatefulSet (temporal + keycloak + NICo databases) ├── keycloak (dev OIDC IdP, nico-dev realm) ├── temporal (temporal-helm/temporal, mTLS) ├── nico-rest (API, cert-manager, workflow, site-manager) @@ -326,8 +326,8 @@ Before ingesting hosts, verify that all site controller components are healthy. 
### Check That All Pods Are Running ```bash -kubectl get pods -n nico-system # NCX Core -kubectl get pods -n nico-rest # NCX REST +kubectl get pods -n nico-system # NICo Core +kubectl get pods -n nico-rest # NICo REST kubectl get pods -n temporal # Temporal ``` @@ -358,7 +358,7 @@ Both external IPs should be within your internal VIP pool range. ### Acquire a Keycloak Access Token -This section only applies if `keycloak.enabled: true` in `values/ncx-rest.yaml` (the default). If you disabled the bundled Keycloak and pointed `nico-rest-api` at your own IdP, obtain tokens from that IdP instead. +This section only applies if `keycloak.enabled: true` in `values/nico-rest.yaml` (the default). If you disabled the bundled Keycloak and pointed `nico-rest-api` at your own IdP, obtain tokens from that IdP instead. The `setup.sh` script deploys a dev Keycloak instance with a `nico` realm pre-loaded with the `ncx-service` client (M2M / `client_credentials`). @@ -490,7 +490,7 @@ For detailed OOB network requirements, refer to the [BMC and Out-of-Band Setup]( ## Step 7 — Discover Your First Host -This step uses `nico-admin-cli`, the gRPC CLI for NICo Core. Build it from the NCX Core repo: +This step uses `nico-admin-cli`, the gRPC CLI for NICo Core. Build it from the `infra-controller` repo: ```bash cd infra-controller/ @@ -500,7 +500,7 @@ cargo build --release -p nico-admin-cli Alternatively, use the containerized version bundled in the `nico-api` pod (available at `/opt/nico/nico-admin-cli` inside the container). -The `` in the commands below is the NICo Core gRPC API endpoint. This is the `nico-api` hostname configured in [Step 3c](#3c-configure-ncx-core-site-deployment), not the REST API used in Step 5. The format is typically `https://api-.`. You can also retrieve it from the LoadBalancer VIP: +The `` in the commands below is the NICo Core gRPC API endpoint. 
This is the `nico-api` hostname configured in [Step 3c](#3c-configure-nico-core-site-deployment), not the REST API used in Step 5. The format is typically `https://api-.`. You can also retrieve it from the LoadBalancer VIP: ```bash kubectl get svc nico-api -n nico-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}' @@ -565,4 +565,4 @@ cd helm-prereqs/ ./clean.sh ``` -This removes NCX REST, NCX Core, all helmfile releases, cluster-scoped resources, namespaces, and released PersistentVolumes. For details on what `clean.sh` does and the removal order, refer to the [Reference Installation](installation-options/reference-install.md) guide. +This removes NICo REST, NICo Core, all helmfile releases, cluster-scoped resources, namespaces, and released PersistentVolumes. For details on what `clean.sh` does and the removal order, refer to the [Reference Installation](installation-options/reference-install.md) guide. diff --git a/docs/manuals/rack_level_admin.md b/docs/manuals/rack_level_admin.md index d32d6c6d5a..a0b54708b3 100644 --- a/docs/manuals/rack_level_admin.md +++ b/docs/manuals/rack_level_admin.md @@ -130,7 +130,7 @@ Currently, NICo only supports GB200 NVL72 racks, where a rack and a NVL domain o - [GET /v2/org/{org}/carbide/rack/validation](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/validate-racks): Validate components of all racks in the specified site by comparing the expected inventory data to the actual inventory data. - [GET /v2/org/{org}/carbide/rack/{rack_id}/validation](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/validate-rack): Validate components of the specified rack by comparing the expected inventory data to the actual inventory data. - [PATCH /v2/org/{org}/carbide/rack/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/power-control-racks): Control power of all or selected racks in the site. 
Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`. -- [PATCH /v2/org/{org}/carbide/rack/{id}/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/firmware-update-rack): Control power of the specified rack. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`. +- [PATCH /v2/org/{org}/carbide/rack/{id}/power](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/power-control-rack): Control power of the specified rack. Supported power states are `on`, `off`, `cycle`, `forceoff`, `forcecycle`. - [PATCH /v2/org/{org}/carbide/rack/firmware](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/firmware-update-racks): Update firmware on all or selected racks in the site. - [PATCH /v2/org/{org}/carbide/rack/{id}/firmware](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/firmware-update-rack): Update firmware on the specified rack. - [POST /v2/org/{org}/carbide/rack/bringup](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference/rack/bringup-racks): Bring up all or selected racks in the site.