Merged
2 changes: 1 addition & 1 deletion docs/concepts/compute-and-clusters.md
@@ -31,7 +31,7 @@ Providers are the physical infrastructure Vantage provisions clusters on.
|---|---|
| Public clouds (AWS, Azure, GCP) | Elastic capacity, global regions, spot pricing |
| Cudo Compute | Cost-efficient GPU cloud |
| On-premises / LXD | Your own hardware, maximum control |
| On-premises / LXD / Multipass / Juju | Your own hardware or local VMs: agent-based, Multipass, or Charmed HPC |
| Vantage partners (atNorth, BuzzHPC, RCI) | Pre-integrated managed colocation and HPC |

## Regions and availability
134 changes: 53 additions & 81 deletions docs/get-started/create-cluster-intro.md
@@ -5,7 +5,7 @@ description: Deploy Slurm, Kubernetes, or Slurm on Kubernetes clusters using the

## Overview

Clusters are the compute environments where jobs run in Vantage. This guide walks you through creating a cluster using the Vantage web UI. Three cluster types are supported: **Slurm** (traditional HPC), **Kubernetes** (managed platform cluster), and **Slurm on Kubernetes** (Slurm deployed on an existing K8s cluster).
Clusters are the compute environments where jobs run in Vantage. This guide walks you through creating a cluster using the Vantage web UI. Three cluster types are supported: **Slurm** (traditional HPC), **Kubernetes** (managed platform cluster), and **Slurm on Kubernetes** (Slurm deployed on an existing K8s cluster). On-premises clusters can also be created using **Multipass** or **Juju (Charmed HPC)**. See [On-Premises clusters](/platform/clusters/On-Premises).

:::note Alternative Methods

@@ -17,73 +17,68 @@ Clusters can also be created via the [Vantage CLI](https://docs.vantagecompute.a

- How to navigate to the Clusters dashboard
- How to create a Slurm, Kubernetes, or Slurm on Kubernetes cluster
- How to create on-premises clusters using Multipass, Juju, or the Vantage Agent

## Prerequisites

- A Vantage account and organization ([Sign Up](./sign-up.md))
- A configured [Cloud Account](./create-cloud.md), which is required before creating a cluster

## Step 1: Access the Cluster Dashboard
:::note On-premises clusters
On-premises clusters (Multipass, Juju, and agent-based) do not require a cloud account. See [On-Premises clusters](/platform/clusters/On-Premises) for setup guides.
:::

Click **Clusters** in the left navigation sidebar, then select the **Slurm** or **Kubernetes** tab to view clusters of that type. The cluster list shows columns for **Name**, **Type**, **Status**, **Provider**, **Owner**, and **Actions**.
## Step 1: Access the Cluster Dashboard

![Cluster dashboard](./img/create-cluster-intro/create-cluster-intro-00.png)
Click **Clusters** in the left navigation sidebar. A cluster type selector appears with **Slurm** and **Kubernetes** options; click the type you want to work with. The **Kubernetes** view is shown by default. Each view lists existing clusters with columns for **Name**, **Type**, **Status**, **Provider**, **Owner**, and **Actions**, and refreshes periodically to reflect status changes.

## Step 2: Prepare a Cluster

Click the **+ Prepare Cluster** button in the top-right corner. A multi-step wizard opens titled **"Choose Cluster Type"**.
Make sure you are on the correct cluster type view (Slurm or Kubernetes), then click **+ Prepare Cluster** in the top-right corner. A modal opens where you configure the cluster.

![Prepare cluster button](./img/create-cluster-intro/create-cluster-intro-01.png)
## Step 3: Configure the Cluster

## Step 3: Choose a Cluster Type and Configure

Select the type of cluster you want to create:
The configuration steps depend on the cluster type and cloud provider:

<Tabs>
<TabItem value="slurm" label="Slurm" default>

Traditional HPC workload manager. Configure compute partitions, submit batch jobs, and manage node pools.

Click the **Slurm** card and then click **Continue**.

### Configure Cluster Details
Click <kbd>Prepare Cluster</kbd> from the **Slurm** cluster type view. A modal opens with the **Configure** step.

| Field | Required | Notes |
|---|---|---|
| Cluster Name | Yes | Max 27 characters, must be unique |
| Cluster Description | No | Max 255 characters |
| Description | No | Max 255 characters |
| Cloud Account | Yes | Select from your configured cloud accounts |
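
The field limits in this table can be expressed as a small pre-submission check. This is a hypothetical sketch, not Vantage's own validation code: the function name, the `existing_names` parameter, and the assumption that uniqueness is checked against a known set of names are all illustrative.

```python
from typing import Optional

# Hypothetical client-side check mirroring the Configure-step table:
# name required (max 27 chars, unique), description optional (max 255 chars),
# cloud account required.
MAX_NAME_LEN = 27
MAX_DESCRIPTION_LEN = 255


def validate_cluster_details(
    name: str,
    cloud_account: Optional[str],
    existing_names: set[str],
    description: str = "",
) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: list[str] = []
    if not name:
        errors.append("Cluster Name is required")
    elif len(name) > MAX_NAME_LEN:
        errors.append(f"Cluster Name exceeds {MAX_NAME_LEN} characters")
    elif name in existing_names:
        errors.append("Cluster Name must be unique")
    if len(description) > MAX_DESCRIPTION_LEN:
        errors.append(f"Description exceeds {MAX_DESCRIPTION_LEN} characters")
    if not cloud_account:
        errors.append("Cloud Account is required")
    return errors
```

For example, `validate_cluster_details("demo-slurm", "aws-prod", existing_names={"prod"})` returns an empty list.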

The remaining steps depend on the **Cloud Account** type selected:

**Non-AWS accounts** (Azure, GCP, Cudo Compute, on-premises, LXD): No additional fields appear. Click **Create Cluster** to finish. The wizard completes in two steps; partitions and networking are managed post-creation from the cluster detail page.
**Non-AWS accounts** (Azure, GCP): Click <kbd>Create Cluster</kbd> to finish. Partitions and networking are managed post-creation from the cluster detail page. These providers use backend defaults for provisioning.

**Cloud provider accounts (e.g., AWS)**: A notice appears, *"Cloud clusters are deployed in AWS and scale automatically to the size of the workloads submitted to them."* Additional fields appear:
**AWS**: Click <kbd>Continue</kbd>. The **Provider** step opens with additional fields:

| Field | Required | Notes |
|---|---|---|
| Region | Yes | Select your cloud region |
| Head Node Machine Type | Yes | Select a region first, then click **Select Head Node** to choose a machine type |
| SSH Key Name | Yes | Select a cloud account and region first |
| Region | Yes | The dropdown loads after you select the cloud account |
| Head Node Machine Type | Yes | Click <kbd>Select Head Node</kbd> to browse instance types by vCPU, GPU, and price |
| SSH Key Name | Yes | The list loads after you pick a region |
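
The load order in this table (regions after the cloud account, SSH keys after the region) can be modelled as a chain of dependent selections. The class below is a hypothetical sketch of that dependency, not the Vantage UI's implementation; the class and method names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AwsProviderStep:
    """Hypothetical model of the AWS Provider step fields."""

    cloud_account: Optional[str] = None
    region: Optional[str] = None
    head_node_type: Optional[str] = None
    ssh_key_name: Optional[str] = None

    def loadable_dropdowns(self) -> list[str]:
        """Dropdowns whose options can be fetched given current selections."""
        ready: list[str] = []
        if self.cloud_account:
            ready.append("region")  # regions load once an account is chosen
            if self.region:
                ready.append("ssh_key_name")  # SSH keys load once a region is chosen
        return ready

    def missing_required(self) -> list[str]:
        """Required Provider-step fields that are still empty."""
        required = ("region", "head_node_type", "ssh_key_name")
        return [name for name in required if not getattr(self, name)]
```

Modelling the chain explicitly shows why changing the region should also refresh the SSH key list: the key list is only meaningful for the selected region.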

**Advanced Options** (expand to configure custom networking β€” leave empty to use cloud defaults):
Click **Advanced Options** to pin the cluster to a custom **VPC**, **Head Node Subnet**, and **Compute Node Subnet**. Leave these empty to use AWS-managed defaults.

| Field | Required | Notes |
|---|---|---|
| VPC ID | No | Select a Cloud Account and region first |
| Head Node Subnet ID | Yes, if VPC selected | Select a VPC first |
| Compute Node Subnet ID | No | Select a VPC first |
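
The Advanced Options rules in this table (VPC optional, subnets chosen only after a VPC, Head Node Subnet required once a VPC is set) reduce to a short check. This is a hypothetical sketch; the function name and error strings are illustrative, not Vantage messages.

```python
from typing import Optional


def validate_networking(
    vpc_id: Optional[str],
    head_subnet_id: Optional[str],
    compute_subnet_id: Optional[str],
) -> list[str]:
    """Hypothetical check of the Advanced Options networking rules."""
    errors: list[str] = []
    if vpc_id is None:
        # Subnets are selected after a VPC, so neither may be set on its own.
        if head_subnet_id or compute_subnet_id:
            errors.append("Select a VPC before choosing subnets")
    elif not head_subnet_id:
        errors.append("Head Node Subnet ID is required when a VPC is selected")
    # Leaving all three empty is valid and means cloud-default networking.
    return errors
```
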
Click <kbd>Proceed to Select Partitions</kbd>. The **Partitions** step opens. A default partition named `compute` is pre-filled. Set the **Maximum node count** and add more partitions as needed, then click <kbd>Prepare Cluster</kbd> to submit.

Click **Proceed to Select Partitions** to continue. Configure your Slurm partitions, then click **Create Cluster**.
:::tip
In the **Configure** step, you can also select a **Kubernetes Cluster** as the deployment target, which creates a Slurm-on-Kubernetes cluster instead. See [Slurm on Kubernetes](/platform/clusters/Kubernetes/create#slurm-on-kubernetes).
:::

</TabItem>
<TabItem value="k8s" label="Kubernetes">

Managed platform cluster for Workbench sessions, ML training, and containerized workloads.

Click the **Kubernetes** card and then click **Continue**.

### Configure Cluster Details
Click <kbd>Prepare Cluster</kbd> from the **Kubernetes** cluster type view (shown by default). A modal opens with the **Configure** step.

| Field | Required | Notes |
|---|---|---|
@@ -93,62 +88,52 @@ Click the **Kubernetes** card and then click **Continue**.

The remaining configuration depends on the provider:

**AWS** β€” Additional fields appear:
**Non-AWS providers** (Azure, GCP, Cudo Compute): Click <kbd>Create Cluster</kbd> to submit. These providers use backend defaults for provisioning.

**AWS**: Click <kbd>Continue</kbd>. The **Provider** step opens with additional fields:

| Field | Required | Notes |
|---|---|---|
| Region | Yes | Select your AWS region |
| Control Plane Instance Type | Yes | Click **Select Control Plane** to choose an EC2 instance type |
| SSH Key Name | Yes | Select a cloud account and region first |

Non-AWS providers (Azure, GCP, Cudo Compute, on-premises, LXD) use Vantage-managed defaults and require only the fields above.
| Region | Yes | The dropdown loads after you select the cloud account |
| Control Plane Machine Type | Yes | Click <kbd>Select Machine</kbd> to browse EC2 instance types by vCPU, GPU, and price |
| SSH Key Name | Yes | The list loads after you pick a region |

**Platform Integrations** (configured after submission, in the final wizard step):
Click <kbd>Prepare Cluster</kbd> to submit.

| Integration | Purpose | Default |
|---|---|---|
| Notebook | JupyterHub for interactive sessions | Enabled |
| Grafana + Prometheus | Cluster monitoring and observability | Enabled |
| Ray | Distributed ML training framework | Disabled |
| MLflow | ML experiment tracking | Disabled |
| Slurm on Kubernetes | Deploy Slurm on this cluster later | Disabled |

Click **Create Cluster** to submit. Provisioning time varies by provider: AWS typically takes 10–15 minutes, while other providers connect more quickly.
JupyterHub and Grafana + Prometheus are enabled by default. See [Integrations](/platform/clusters/Kubernetes/integrations) for details.
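
The integration defaults in the table above can be represented as a defaults map with user overrides layered on top. This is a hypothetical sketch: the dictionary keys are illustrative names, not Vantage API fields.

```python
# Hypothetical settings model; keys are illustrative, not Vantage API names.
DEFAULT_INTEGRATIONS: dict[str, bool] = {
    "notebook": True,              # JupyterHub for interactive sessions
    "grafana_prometheus": True,    # cluster monitoring and observability
    "ray": False,                  # distributed ML training framework
    "mlflow": False,               # ML experiment tracking
    "slurm_on_kubernetes": False,  # deploy Slurm on this cluster later
}


def resolve_integrations(overrides: dict[str, bool]) -> dict[str, bool]:
    """Merge user choices over the defaults, rejecting unknown keys."""
    unknown = sorted(set(overrides) - set(DEFAULT_INTEGRATIONS))
    if unknown:
        raise ValueError(f"Unknown integrations: {unknown}")
    return {**DEFAULT_INTEGRATIONS, **overrides}
```

Rejecting unknown keys catches a misspelled toggle instead of silently ignoring it.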

</TabItem>
<TabItem value="slurm-on-k8s" label="Slurm on Kubernetes">

Deploy a Slurm HPC cluster on top of an existing Kubernetes cluster. Manage node groups and partitions via VDeployer.

Click the **Slurm on Kubernetes** card and then click **Continue**. This path has 4 steps: Choose Type → Select K8s Cluster → Configure → Creating.
Deploy a Slurm HPC cluster on top of an existing Kubernetes cluster. Manage node groups and partitions.

### Step 2: Select K8s Cluster
### From the Slurm list

A grid of available Kubernetes clusters is shown, with each cluster's name and cloud provider type. Click a cluster card to select it (it will show a highlighted border), then click **Configure Slurm Cluster**.
1. Click **Slurm** in the cluster type navigation, then click <kbd>Prepare Cluster</kbd>.
1. In the **Configure** step, select **Kubernetes Cluster** as the deployment target. A list of ready K8s clusters appears. Click the target cluster, then click <kbd>Configure Slurm Cluster</kbd>.

The selected parent cluster determines the available profiles for the node groups. AWS clusters unlock EC2 instance type selection; non-AWS clusters use pre-defined profiles.
### From the Kubernetes detail page

### Step 3: Configure
1. Click the target cluster name to open its detail page.
1. Click the **Slurm Clusters** tab, then click <kbd>Create Slurm Cluster</kbd>.

**Cluster Identity:**
### Configure compute pools and partitions

| Field | Required | Notes |
|---|---|---|
| Slurm Cluster Name | Yes | Must start with a lowercase letter and can only include lowercase letters, numbers, and dashes (no trailing dash) |
| Parent K8s Cluster | Yes | Pre-filled from the previous step (read-only) |
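
The Slurm Cluster Name rule (lowercase letter first, then lowercase letters, digits, or dashes, with no trailing dash) maps onto a single regular expression. This is a sketch of the stated rule, not Vantage's validator; the table gives no maximum length, so none is enforced here.

```python
import re

# Lowercase letter first, then lowercase letters, digits, or dashes;
# the optional group must end on a letter or digit, so no trailing dash.
SLURM_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")


def is_valid_slurm_name(name: str) -> bool:
    """True if `name` satisfies the Slurm Cluster Name rule."""
    return SLURM_NAME_RE.fullmatch(name) is not None
```

With this pattern, `hpc-1` is accepted, while `hpc-`, `1hpc`, `HPC`, and `hpc_1` are rejected.
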
From either entry point, the wizard opens with the **Compute & Partitions** step.

**Node Groups:**
**Node Groups**

Two node groups are pre-configured: **Control Plane** and **Compute Group**. Node group names are auto-generated based on the cluster name (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`).
Two node groups are pre-configured: **Slurm Controller** (control plane) and **Compute Workers**. Node group names are auto-generated (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`).

| Field | Default | Notes |
|---|---|---|
| Profile | None | Select a profile. There is no default, so a selection is required. |
| GPU | No | Toggle to enable GPU compute |
| Max Nodes | 1 (Control Plane) / 10 (Compute) | Minimum 1 |
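
The pre-configured node groups and their defaults (generated names, Max Nodes of 1 and 10, GPU off, no default profile) can be sketched as a small data model. The `NodeGroup` class and `default_node_groups` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeGroup:
    """Hypothetical model of one node group in the wizard."""

    name: str
    role: str                      # "control" or "compute"
    max_nodes: int
    gpu: bool = False              # GPU toggle defaults to off
    profile: Optional[str] = None  # no default; a selection is required

    def validate(self) -> list[str]:
        errors: list[str] = []
        if self.profile is None:
            errors.append(f"{self.name}: a profile must be selected")
        if self.max_nodes < 1:
            errors.append(f"{self.name}: Max Nodes must be at least 1")
        return errors


def default_node_groups(cluster_name: str) -> list[NodeGroup]:
    """Pre-configured groups, following the naming pattern in the text."""
    return [
        NodeGroup(f"slurm-control-{cluster_name}", "control", max_nodes=1),
        NodeGroup(f"slurm-compute-{cluster_name}-1", "compute", max_nodes=10),
    ]
```
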

The **Profile** field adapts based on the parent K8s cluster's provider:

- **AWS parent**: Opens an **instance type browser** dialog. Search and select any EC2 instance type (e.g., `t3.medium`, `c5n.4xlarge`). CPU and memory are managed by AWS; no profile presets are used.
- **AWS parent**: Opens an instance type browser dialog. Select any EC2 instance type (e.g., `t3.medium`, `c5n.4xlarge`).
- **Non-AWS parent** (Cudo Compute, on-premises, LXD): A dropdown with three pre-defined profiles:

| Profile | vCPU | Memory |
@@ -157,33 +142,19 @@ The **Profile** field adapts based on the parent K8s cluster's provider:
| Medium | 8 | 16 GiB |
| Large | 16 | 32 GiB |

Selecting a profile auto-fills the CPU and memory for that node group. If you change the parent K8s cluster after selecting profiles, all profile selections are reset.
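
Auto-filling from a profile and resetting on a parent change can be sketched as follows. The diff collapses the table's first rows, so this sketch includes only the Medium and Large presets that are visible rather than guessing the Small values. The function names and dictionary keys are hypothetical.

```python
# Presets visible in this diff; the collapsed Small row is deliberately
# left out rather than guessed.
PROFILES: dict[str, dict[str, int]] = {
    "Medium": {"vcpu": 8, "memory_gib": 16},
    "Large": {"vcpu": 16, "memory_gib": 32},
}


def apply_profile(group: dict, profile: str) -> None:
    """Selecting a profile auto-fills CPU and memory for the node group."""
    spec = PROFILES[profile]  # raises KeyError for an unknown profile
    group.update(profile=profile, **spec)


def reset_profiles(groups: list[dict]) -> None:
    """Changing the parent K8s cluster clears every profile selection."""
    for group in groups:
        group["profile"] = None
        group.pop("vcpu", None)
        group.pop("memory_gib", None)
```
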

Click **+ Add Compute Group** to add additional compute node groups. At least one control plane group and one compute group are required.
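
The minimum-group rule can be checked by counting roles. The helper that names an added compute group assumes the `-1` suffix continues as `-2`, `-3`, and so on. The text does not confirm this, so treat `next_compute_group_name` as hypothetical.

```python
from collections import Counter


def check_group_roles(roles: list[str]) -> list[str]:
    """Enforce at least one control plane group and one compute group."""
    counts = Counter(roles)  # missing roles count as 0
    errors: list[str] = []
    if counts["control"] < 1:
        errors.append("At least one control plane group is required")
    if counts["compute"] < 1:
        errors.append("At least one compute group is required")
    return errors


def next_compute_group_name(cluster_name: str, roles: list[str]) -> str:
    """Hypothetical: continue the -1, -2, ... suffix for each added group."""
    return f"slurm-compute-{cluster_name}-{roles.count('compute') + 1}"
```
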

**Partitions:**

A default partition named `compute` is pre-configured. Partitions route jobs to a specific node group.

| Field | Default | Notes |
|---|---|---|
| Partition Name | `compute` | Name for the Slurm partition |
| Node Group | None | Select from the compute groups defined above |
| Default | Enabled | Only one partition can be default at a time |

Click **+ Add Partition** to add additional partitions. At least one partition is required.

Click **Create Slurm Cluster** to begin provisioning. The wizard advances to a progress view showing each step as it completes.
**Partitions**

### Step 4: Creating (Progress)
A default partition named `partition-1` is pre-configured. Set the **Partition Name**, choose which **Compute Group** it routes to, and toggle **Default** status. Only one partition can be default at a time.
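
The partition rules (at least one partition, each routed to a defined compute group, and only one default at a time) are sketched below. Setting a new default clears the flag on the others. That mirrors "only one at a time", but the exact UI behavior is an assumption. `Partition`, `set_default`, and `validate_partitions` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    compute_group: str  # the compute node group this partition routes jobs to
    default: bool = False


def set_default(partitions: list[Partition], name: str) -> None:
    """Make `name` the default partition and clear the flag on all others."""
    if not any(p.name == name for p in partitions):
        raise KeyError(name)
    for p in partitions:
        p.default = p.name == name


def validate_partitions(partitions: list[Partition], compute_groups: set[str]) -> list[str]:
    errors: list[str] = []
    if not partitions:
        errors.append("At least one partition is required")
    if sum(p.default for p in partitions) > 1:
        errors.append("Only one partition can be default")
    for p in partitions:
        if p.compute_group not in compute_groups:
            errors.append(f"{p.name}: unknown compute group {p.compute_group!r}")
    return errors
```
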

The wizard shows a progress stepper with three sequential stages:
Click **Advanced Options** to configure TLS, NodePort exposure, job profiling, and the K8s scheduler bridge.

1. **Registering cluster**: Creates the Slurm cluster record and provisions a Keycloak client in the background
2. **Creating node groups**: Provisions each node group sequentially on the parent K8s cluster (control plane, then compute groups)
3. **Creating Slurm cluster**: Finalizes the Slurm deployment with your partition configuration
Click <kbd>Create Slurm Cluster</kbd> to submit. The wizard shows a progress stepper with three sequential stages:

Each completed stage shows a green checkmark. While provisioning, the progress bars remain visible so you can track which stage the cluster is in.
1. **Registering cluster**: Creates the Slurm cluster record and provisions a Keycloak client
2. **Creating node groups**: Provisions each node group on the parent K8s cluster (control plane, then compute groups)
3. **Creating Slurm cluster**: Finalizes the Slurm deployment
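
The three stages run strictly in order, and node groups are provisioned control plane first. The sketch below shows that sequencing with placeholder callables. Stopping at the first failure is an assumption about how a sequential stepper behaves. `run_stages` and `node_group_order` are hypothetical helpers, not Vantage code.

```python
from typing import Callable, Optional

Stage = tuple[str, Callable[[], None]]


def run_stages(stages: list[Stage]) -> tuple[list[str], Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns the labels of completed stages and an error message (or None).
    """
    completed: list[str] = []
    for label, action in stages:
        try:
            action()
        except Exception as exc:  # later stages never start after a failure
            return completed, f"{label} failed: {exc}"
        completed.append(label)
    return completed, None


def node_group_order(groups: list[tuple[str, str]]) -> list[str]:
    """Order (name, role) pairs: control plane first, then compute groups."""
    control = [name for name, role in groups if role == "control"]
    compute = [name for name, role in groups if role == "compute"]
    return control + compute
```
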

</TabItem>
</Tabs>
@@ -192,7 +163,7 @@ Each completed stage shows a green checkmark. While provisioning, the progress b

Return to the Clusters list page. The cluster status shows **"preparing"** while provisioning, then transitions to **"ready"** when complete.
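
When scripting against cluster status, the `preparing` to `ready` transition is usually handled with a polling loop. This is a hypothetical sketch. `get_status` stands in for whatever status lookup you use, since no Vantage API call is implied here. The timeout and interval values are illustrative, and treating any status other than `preparing` or `ready` as an error is an assumption.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout_s: float = 1800.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a hypothetical status callable until it reports "ready"."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "ready":
            return status
        if status != "preparing":
            # Assumption: only "preparing" means provisioning is still in progress.
            raise RuntimeError(f"Unexpected cluster status: {status!r}")
        if clock() >= deadline:
            raise TimeoutError("Cluster still preparing after timeout")
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.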

![Cluster connected successfully](./img/create-cluster-intro/create-cluster-intro-04.png)
A cluster with `ready` status shows a green badge in the **Status** column. Clicking the cluster row opens the cluster detail page.

## Summary

@@ -204,3 +175,4 @@ Your cluster is now ready for workloads. You can launch notebooks, submit jobs,
- [Create a Job Script](./create-job-script-intro.md)
- [Submit Your First Job](./create-job-submission-intro.md)
- [Invite Team Members](./invite-intro.md)
- [Create an on-premises cluster](/platform/clusters/On-Premises): Multipass, Juju, or agent-based