4 changes: 2 additions & 2 deletions AGENTS.md
@@ -1,7 +1,7 @@
# AGENTS.md

This file provides guidance for AI coding agents working in the
-`ncx-infra-controller-core` repository.
+`infra-controller` repository.

## Project Overview

@@ -26,7 +26,7 @@ to fast-track building next-generation AI Cloud offerings.
## Repository Structure

```
-ncx-infra-controller-core/
+infra-controller/
├── crates/ # Rust crate implementations. To discover all crates
│ # and their purpose, run `ls crates/` or see the
│ # [workspace] members list in `Cargo.toml` — each
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -110,11 +110,11 @@ All pull requests are automatically checked for DCO compliance via DCO bot. Pull

## Fork and Setup

-Developers must first fork the upstream [NCX Infra Controller repository](https://github.com/NVIDIA/ncx-infra-controller-core).
+Developers must first fork the upstream [Infra Controller repository](https://github.com/NVIDIA/infra-controller).

### 1. Fork the Repository

-1. Navigate to the [NCX Infra Controller repository](https://github.com/NVIDIA/ncx-infra-controller-core) on GitHub.
+1. Navigate to the [Infra Controller repository](https://github.com/NVIDIA/infra-controller) on GitHub.
2. Click the **Fork** button in the upper right corner.
3. Select your GitHub account as the destination.

7 changes: 2 additions & 5 deletions README.md
@@ -34,7 +34,7 @@ of the bare-metal lifecycle to fast-track building next generation AI Cloud offe
```bash
# 1. Build and push images to your registry
# NICo Core image: <your-registry>/nvmetal-nico:<tag> (this repo)
-# NICo REST images: <your-registry>/nico-rest-api:<tag>, etc. (infra-controller-rest)
+# NICo REST images: <your-registry>/nico-rest-api:<tag>, etc.

# 2. Set environment variables
export KUBECONFIG=/path/to/kubeconfig
@@ -52,10 +52,7 @@ export NICO_REST_IMAGE_TAG=<nico-rest-tag> # e.g. 2.0.0-pr-58-g38a54a3
# Edit helm-prereqs/values.yaml:
# siteName — short site identifier

-# 4. Point NICO_REST_REPO at infra-controller-rest (auto-detected if a sibling directory)
-export NICO_REST_REPO=/path/to/infra-controller-rest # optional
-
-# 5. Run setup — installs common services, NICo Core, and NICo REST in order
+# 4. Run setup — installs common services, NICo Core, and NICo REST in order
cd helm-prereqs
./setup.sh # interactive — prompts before deploying Core and REST
./setup.sh -y # non-interactive — deploys everything (CI/CD)
2 changes: 1 addition & 1 deletion STYLE_GUIDE.md
@@ -1,4 +1,4 @@
-# How to write Rust in ncx-infra-controller-core
+# How to write Rust in infra-controller

The goal of this document is to help keep our codebase consistent and maintainable by outlining best-practices we've
learned through experience. It is currently a mix of best practices for _this codebase_ (ie. how we expect code to
8 changes: 4 additions & 4 deletions book/src/configuration/configurability.md
@@ -689,8 +689,8 @@ it with the rack-level state machine. See
The NICo REST stack (separate helm release named `nico-rest`, in the
`nico-rest` namespace) sits on top of NICo Core and provides the public
REST API, workflow orchestration, optional Keycloak IdP, and the
-per-site agent. Its source repo is
-[`infra-controller-rest`](https://github.com/NVIDIA/ncx-infra-controller-rest);
+per-site agent. Its source lives in the
+[`rest-api/`](https://github.com/NVIDIA/infra-controller/tree/main/rest-api) tree;
this guide covers only the *site-side* configuration knobs.

### nico-rest helm release — `helm-prereqs/values/nico-rest.yaml`
@@ -732,7 +732,7 @@ Temporal is deployed by `setup.sh` Phase 7f using the upstream Temporal
helm chart with mTLS enabled. The mTLS issuer (`nico-rest-ca-issuer`) is
installed in Phase 7b. Operators usually don't touch Temporal config
directly; see the temporal subchart values in
-[`infra-controller-rest/helm/charts/temporal/values.yaml`](https://github.com/NVIDIA/ncx-infra-controller-rest)
+[`rest-api/temporal-helm/temporal/values.yaml`](https://github.com/NVIDIA/infra-controller/tree/main/rest-api/temporal-helm/temporal)
if you need to tune retention or task queue counts.

### Keycloak (dev IdP)
@@ -814,7 +814,7 @@ also re-applies operator-chart defaults that may not match your
production tuning.

For the REST stack the equivalent is `helm upgrade nico-rest …` against
-`infra-controller-rest/helm/charts/nico-rest`.
+`rest-api/helm/charts/nico-rest`.

See [`helm/README.md` → Upgrading](../../../helm/README.md#upgrading) for
the diff-then-apply pattern.
2 changes: 1 addition & 1 deletion crates/bmc-proxy/README.md
@@ -223,7 +223,7 @@ sequenceDiagram

This crate is meant to implement a clean architectural boundary, but the implementation still couples to nico in slightly uncomfortable ways:

-1. It's still a component of the ncx-infra-controller-core repo, so it's not fully independent
+1. It's still a component of the infra-controller repo, so it's not fully independent
2. It expects nico-api to resolve proxied BMC IPs through `FindMacAddressByBmcIp`.
3. It expects nico-api to return credentials from `GetBmcCredentials` for every proxied BMC.

2 changes: 1 addition & 1 deletion dev/deployment/devspace/README.md
@@ -119,7 +119,7 @@ docker build -t "machine-a-tron:<devspace-generated-tag>" -f dev/deployment/devs

DevSpace then deploys the Helm chart with the built `nico-api` image wired into `global.image.repository` and `global.image.tag`, the built `nico-bmc-proxy` image wired into the `nico-bmc-proxy` chart values, and applies the local-only `machine-a-tron` manifest with its image wired into the `Deployment` spec.

-## Re-initializing ncx-infra-controller-core to a clean slate
+## Re-initializing infra-controller to a clean slate

Once deployed, the `nico-api` container will run and initialize its database, and the `machine-a-tron` container will run a set of mock machines, which will be discovered and ingested into the database, and run through the state machine until they reach a Ready state.

2 changes: 1 addition & 1 deletion docs/README.md
@@ -100,7 +100,7 @@ The REST layer can be deployed in the datacenter with Infra Controller Core, or
in Cloud with Site Agent connecting from the datacenter. Multiple Infra Controller Cores running
in different datacenters can also connect to Infra Controller REST through respective Site Agents.

-For details on NICo REST, please refer to [NICo REST Github Repository](https://github.com/NVIDIA/infra-controller-rest) and [NICo REST API Schema](https://nvidia.github.io/infra-controller-rest/).
+For details on NICo REST, please refer to the [infra-controller GitHub repository](https://github.com/NVIDIA/infra-controller) and the [REST API Reference](https://docs.nvidia.com/infra-controller/rest-api-reference/api-reference).

### Managed Hosts

4 changes: 2 additions & 2 deletions docs/architecture/health_aggregation.md
@@ -252,7 +252,7 @@ Details can be found in the [SKU Validation guide](../provisioning/sku-validatio

### BMC health monitoring

-The [`nico-hw-health`](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/health) service periodically queries all Host and DPU BMCs in the system for health information. It emits the captured health datapoints as metrics on a metrics endpoint that can be scraped by a standard telemetry system (prometheus/otel).
+The [`nico-hw-health`](https://github.com/NVIDIA/infra-controller/blob/main/crates/health) service periodically queries all Host and DPU BMCs in the system for health information. It emits the captured health datapoints as metrics on a metrics endpoint that can be scraped by a standard telemetry system (prometheus/otel).

Health metrics fetched from BMCs include:
- Fan speeds
@@ -282,7 +282,7 @@ In certain conditions the scraping process will place a health alert on the host

### dpu-agent based health monitoring

-[`dpu-agent`](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/agent) collects health information directly on the DPU and sends a health-**rollup** towards `nico-core`. The agent monitors a variety of health conditions, including
+[`dpu-agent`](https://github.com/NVIDIA/infra-controller/blob/main/crates/agent) collects health information directly on the DPU and sends a health-**rollup** towards `nico-core`. The agent monitors a variety of health conditions, including
- whether BGP sessions are established to peers according to the current configuration of the DPU
- whether all required services on the DPU are running
- whether the DPU is configured in restricted mode
2 changes: 1 addition & 1 deletion docs/architecture/infiniband/nic_selection.md
@@ -74,7 +74,7 @@ use the independent devices.
### NICo machine hardware enumeration

When NICo discovers a machine that is intended to be managed by the NICo site controller,
-it enumerates its hardware details using the [nico-scout](https://github.com/NVIDIA/infra-controller-core/tree/main/crates/scout) tool.
+it enumerates its hardware details using the [nico-scout](https://github.com/NVIDIA/infra-controller/tree/main/crates/scout) tool.

The tool reports all discovered hardware information (e.g. the number and type
of CPUs, GPUs, and network interfaces), and this information gets persisted
20 changes: 10 additions & 10 deletions docs/architecture/overview.md
@@ -32,15 +32,15 @@ NICo deploys a set of binaries on these hosts during various points of their lif

### Scout

-[scout](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/scout) is an agent that NICo runs on the host and DPU of managed hosts for a variety of tasks:
+[scout](https://github.com/NVIDIA/infra-controller/blob/main/crates/scout) is an agent that NICo runs on the host and DPU of managed hosts for a variety of tasks:
- "Inventory" collection: Scout collects and transmits hardware properties of the host to [NICo core](#nico-core) which can not be determined through out-of-band tooling.
- Execution of cleanup tasks whenever the bare metal instance using the host is released by a user
- Execution of machine validation tests
- Periodic Health checks

### DPU Agent

-[dpu-agent](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/agent) is an agent that NICo runs exclusively on DPUS managed by NICo as a daemon.
+[dpu-agent](https://github.com/NVIDIA/infra-controller/blob/main/crates/agent) is an agent that NICo runs exclusively on DPUS managed by NICo as a daemon.

DPU agent performs the following tasks:
- Configuring the DPU as required at any state during the hosts lifecycle. This process is described more in depth in [DPU configuration](dpu_configuration.md).
@@ -51,24 +51,24 @@ DPU agent performs the following tasks:

### DHCP Server

-NICo runs a [custom DHCP server](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/dhcp-server) on the DPU, which handles all DHCP requests of the actual host. This means DHCP requests on the hosts primary networking interfaces will never leave the DPU and show up on the underlay network - which provides enhanced security and reliability.
+NICo runs a [custom DHCP server](https://github.com/NVIDIA/infra-controller/blob/main/crates/dhcp-server) on the DPU, which handles all DHCP requests of the actual host. This means DHCP requests on the hosts primary networking interfaces will never leave the DPU and show up on the underlay network - which provides enhanced security and reliability.
The DHCP server is configured by dpu-agent.

## NICo Control plane services

The NICo control plane consists of a number of services which work together to orchestrate the lifecycle of a managed host:

-- [nico-core](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/api): The NICo core service is the entrypoint into the control plane. It provides a [gRPC](https://grpc.io) API that all other components as well as users (site providers/tenants/site administrators) interact with, as well as implements the lifecycle management of all NICo managed resources (VPCs, prefixes, Infiniband and NVLink partitions and bare metal instances). The [NICo Core](#nico_core_architecture) section describes it further in detail.
-- [nico-dhcp (DHCP)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/dhcp): The DHCP server responds to DHCP requests for all
+- [nico-core](https://github.com/NVIDIA/infra-controller/blob/main/crates/api): The NICo core service is the entrypoint into the control plane. It provides a [gRPC](https://grpc.io) API that all other components as well as users (site providers/tenants/site administrators) interact with, as well as implements the lifecycle management of all NICo managed resources (VPCs, prefixes, Infiniband and NVLink partitions and bare metal instances). The [NICo Core](#nico_core_architecture) section describes it further in detail.
+- [nico-dhcp (DHCP)](https://github.com/NVIDIA/infra-controller/blob/main/crates/dhcp): The DHCP server responds to DHCP requests for all
devices on underlay networks. This includes Host BMCs, DPU BMCs and DPU OOB addresses. nico-dhcp can be thought of as a stateless proxy: It does not actually perform any IP address management - it just converts DHCP requests into gRPC format and forwards the gRPC based DHCP requests to nico core.
-- [nico-pxe (iPXE)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/pxe): The PXE server provides boot artifacts like iPXE scripts, iPXE user-data and OS images to managed hosts at boot time over HTTP. It determines which OS data to provide for a specific host by requesting the respective data from nico core - therefore the PXE server is also stateless.
+- [nico-pxe (iPXE)](https://github.com/NVIDIA/infra-controller/blob/main/crates/pxe): The PXE server provides boot artifacts like iPXE scripts, iPXE user-data and OS images to managed hosts at boot time over HTTP. It determines which OS data to provide for a specific host by requesting the respective data from nico core - therefore the PXE server is also stateless.
Currently, managed hosts are configured to always boot from PXE. If a local
bootable device is found, the host will boot it. Hosts can also be configured to always boot from a
particular image for stateless configurations.
-- [nico-hw-health (Hardware health)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/health): This service scrapes all host and DPU BMCs known by NICo for system health information. It extracts measurements like fan speeds, temperatures and leak indicators. These measurements are emitted as prometheus metrics on a `/metrics` endpoint on port 9009. In addition to that, the service calls the nico-core API `RecordHardwareHealthReport` to set health alerts based on issues identified within the metrics. These alerts are merged within nico-core into the aggregated-host-health - which is emitted in overall health metrics and used to decide whether hosts are usable as bare metal instances for tenants.
-- [ssh-console](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/ssh-console): The SSH console provides bare metal-tenants and site-administrators virtual serial console access to hosts managed by NICo. The ssh-console service also sends the output of each hosts serial console to
+- [nico-hw-health (Hardware health)](https://github.com/NVIDIA/infra-controller/blob/main/crates/health): This service scrapes all host and DPU BMCs known by NICo for system health information. It extracts measurements like fan speeds, temperatures and leak indicators. These measurements are emitted as prometheus metrics on a `/metrics` endpoint on port 9009. In addition to that, the service calls the nico-core API `RecordHardwareHealthReport` to set health alerts based on issues identified within the metrics. These alerts are merged within nico-core into the aggregated-host-health - which is emitted in overall health metrics and used to decide whether hosts are usable as bare metal instances for tenants.
+- [ssh-console](https://github.com/NVIDIA/infra-controller/blob/main/crates/ssh-console): The SSH console provides bare metal-tenants and site-administrators virtual serial console access to hosts managed by NICo. The ssh-console service also sends the output of each hosts serial console to
the logging system (Loki), from where it can be queried using Grafana and logcli. In order to provide this functionality, the ssh-console service *continuously* connects to all host BMCs. The ssh-console service only forwards logs to users ("bare metal tenants") if they connect to the service and get authenticated.
-- [nico-dns (DNS)](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/dns): Domain name service (DNS) functionality
+- [nico-dns (DNS)](https://github.com/NVIDIA/infra-controller/blob/main/crates/dns): Domain name service (DNS) functionality
is handled by two services. The `nico-dns` service handles DNS queries from the site controller and managed nodes and is authoritative for delegated zones.

## <a name="nico_core_architecture"></a> NICo Core
@@ -203,7 +203,7 @@ pods. There are three different K8s statefulsets that run on the controller node

The point of having a site controller is to administer a site that has been populated with tenant managed hosts.
Each managed host is a pairing of a Bluefield (BF) 2/3 DPUs and a host server (only two DPUs have been tested).
-During initial deployment [scout](https://github.com/NVIDIA/infra-controller-core/blob/main/crates/scout) runs and
+During initial deployment [scout](https://github.com/NVIDIA/infra-controller/blob/main/crates/scout) runs and
informs nico-api of any discovered DPUs. NICo completes the installation of services on the DPU and boots
into regular operation mode. Thereafter the nico-dpu-agent starts as a daemon.

2 changes: 1 addition & 1 deletion docs/architecture/redfish_workflow.md
@@ -120,7 +120,7 @@ Once all DPUs are matched and validated, the host enters an "ingestable" state a

## 4. DPU Provisioning

-After pairing, the DPU must be provisioned with NICo software. This is orchestrated via Temporal workflows (in `nico-rest`) with Redfish power control (in `infra-controller-core`).
+After pairing, the DPU must be provisioned with NICo software. This is orchestrated via Temporal workflows (in `nico-rest`) with Redfish power control (in `infra-controller`).

### Boot Configuration

2 changes: 1 addition & 1 deletion docs/configuration/tenant_management.md
@@ -12,7 +12,7 @@ This guide assumes you have completed the [Quick Start Guide](../getting-started

- A running NICo deployment with healthy REST API, database, Temporal workflow engine, and at least one site controller.
- At least one site registered and in `Registered` status, with machines discovered and available for allocation.
-- `nicocli` installed (`make nico-cli` from the infra-controller-rest repo) and reachable on `$PATH`.
+- `nicocli` installed (`make nico-cli` from the `rest-api/` directory of the `infra-controller` repo) and reachable on `$PATH`.

If you plan to enable SPIFFE JWT-SVID **machine identity**, complete [Day 0 Machine Identity](../getting-started/installation-options/day0-machine-identity.md) before provisioning instances, then configure per-org identity after tenants exist — see [Machine Identity](machine_identity.md).
