From 7a1a1512ffb3772cde368fdae539afb13e505c8e Mon Sep 17 00:00:00 2001 From: Bryan Ward Date: Wed, 27 May 2026 09:05:03 -0700 Subject: [PATCH] docs: update clusters, add multipass, juju, and vantage agent options. update platform/clusters section to reflect the new "setup" wizard/ or modal update sidebar-main.js to include new pages. clean up relevant sections. --- docs/concepts/compute-and-clusters.md | 2 +- docs/get-started/create-cluster-intro.md | 134 +++++------ docs/platform/clusters/Kubernetes/create.mdx | 113 ++++----- docs/platform/clusters/Kubernetes/index.mdx | 9 +- .../clusters/On-Premises/agent-based.mdx | 73 ++++++ docs/platform/clusters/On-Premises/index.mdx | 42 ++++ docs/platform/clusters/On-Premises/juju.mdx | 213 +++++++++++++++++ .../clusters/On-Premises/multipass.mdx | 224 ++++++++++++++++++ docs/platform/clusters/Slurm/create.mdx | 63 +++-- docs/platform/clusters/Slurm/index.mdx | 9 +- docs/platform/clusters/concepts.mdx | 4 +- docs/platform/clusters/get-started.mdx | 11 +- docs/platform/clusters/index.md | 14 +- docs/platform/clusters/troubleshooting.mdx | 2 +- .../compute-providers/on-premises/index.md | 20 ++ external/vantage-cli | 2 +- sidebars-main.js | 12 + 17 files changed, 747 insertions(+), 200 deletions(-) create mode 100644 docs/platform/clusters/On-Premises/agent-based.mdx create mode 100644 docs/platform/clusters/On-Premises/index.mdx create mode 100644 docs/platform/clusters/On-Premises/juju.mdx create mode 100644 docs/platform/clusters/On-Premises/multipass.mdx diff --git a/docs/concepts/compute-and-clusters.md b/docs/concepts/compute-and-clusters.md index b9871e7..817c322 100644 --- a/docs/concepts/compute-and-clusters.md +++ b/docs/concepts/compute-and-clusters.md @@ -31,7 +31,7 @@ Providers are the physical infrastructure Vantage provisions clusters on. 
|---|---| | Public clouds (AWS, Azure, GCP) | Elastic capacity, global regions, spot pricing | | Cudo Compute | Cost-efficient GPU cloud | -| On-premises / LXD | Your own hardware, maximum control | +| On-premises / LXD / Multipass / Juju | Your own hardware or local VMs — agent-based, Multipass, or Charmed HPC | | Vantage partners (atNorth, BuzzHPC, RCI) | Pre-integrated managed colocation and HPC | ## Regions and availability diff --git a/docs/get-started/create-cluster-intro.md b/docs/get-started/create-cluster-intro.md index 659963a..461e0a4 100644 --- a/docs/get-started/create-cluster-intro.md +++ b/docs/get-started/create-cluster-intro.md @@ -5,7 +5,7 @@ description: Deploy Slurm, Kubernetes, or Slurm on Kubernetes clusters using the ## Overview -Clusters are the compute environments where jobs run in Vantage. This guide walks you through creating a cluster using the Vantage web UI. Three cluster types are supported: **Slurm** (traditional HPC), **Kubernetes** (managed platform cluster), and **Slurm on Kubernetes** (Slurm deployed on an existing K8s cluster). +Clusters are the compute environments where jobs run in Vantage. This guide walks you through creating a cluster using the Vantage web UI. Three cluster types are supported: **Slurm** (traditional HPC), **Kubernetes** (managed platform cluster), and **Slurm on Kubernetes** (Slurm deployed on an existing K8s cluster). On-premises clusters can also be created using **Multipass** or **Juju (Charmed HPC)** — see [On-Premises clusters](/platform/clusters/On-Premises). 
:::note Alternative Methods @@ -17,73 +17,68 @@ Clusters can also be created via the [Vantage CLI](https://docs.vantagecompute.a - How to navigate to the Clusters dashboard - How to create a Slurm, Kubernetes, or Slurm on Kubernetes cluster +- How to create on-premises clusters using Multipass, Juju, or the Vantage Agent ## Prerequisites - A Vantage account and organization ([Sign Up](./sign-up.md)) - A configured [Cloud Account](./create-cloud.md) — required before creating a cluster -## Step 1: Access the Cluster Dashboard +:::note On-premises clusters +On-premises clusters (Multipass, Juju, and agent-based) do not require a cloud account. See [On-Premises clusters](/platform/clusters/On-Premises) for setup guides. +::: -Click **Clusters** in the left navigation sidebar, then select the **Slurm** or **Kubernetes** tab to view clusters of that type. The cluster list shows columns for **Name**, **Type**, **Status**, **Provider**, **Owner**, and **Actions**. +## Step 1: Access the Cluster Dashboard -![Cluster dashboard](./img/create-cluster-intro/create-cluster-intro-00.png) +Click **Clusters** in the left navigation sidebar. A cluster type navigation appears with **Slurm** and **Kubernetes** — click the type you want to work with. The **Kubernetes** view is shown by default. Each view lists existing clusters with columns for **Name**, **Type**, **Status**, **Provider**, **Owner**, and **Actions**, and refreshes periodically to reflect status changes. ## Step 2: Prepare a Cluster -Click the **+ Prepare Cluster** button in the top-right corner. A multi-step wizard opens titled **"Choose Cluster Type"**. +Make sure you are on the correct cluster type view (Slurm or Kubernetes), then click **+ Prepare Cluster** in the top-right corner. A modal opens where you configure the cluster. 
-![Prepare cluster button](./img/create-cluster-intro/create-cluster-intro-01.png) +## Step 3: Configure the Cluster -## Step 3: Choose a Cluster Type and Configure - -Select the type of cluster you want to create: +The configuration steps depend on the cluster type and cloud provider: Traditional HPC workload manager. Configure compute partitions, submit batch jobs, and manage node pools. -Click the **Slurm** card and then click **Continue**. - -### Configure Cluster Details +Click Prepare Cluster from the **Slurm** cluster type view. A modal opens with the **Configure** step. | Field | Required | Notes | |---|---|---| | Cluster Name | Yes | Max 27 characters, must be unique | -| Cluster Description | No | Max 255 characters | +| Description | No | Max 255 characters | | Cloud Account | Yes | Select from your configured cloud accounts | The remaining steps depend on the **Cloud Account** type selected: -**Non-AWS accounts** (Azure, GCP, Cudo Compute, on-premises, LXD) — No additional fields appear. Click **Create Cluster** to finish. The wizard completes in 2 steps — partitions and networking are managed post-creation from the cluster detail page. +**Non-AWS accounts** (Azure, GCP) — Click Create Cluster to finish. Partitions and networking are managed post-creation from the cluster detail page. These providers use backend defaults for provisioning. -**Cloud provider accounts (e.g., AWS)** — A notice appears: *"Cloud clusters are deployed in AWS and scale automatically to the size of the workloads submitted to them."* Additional fields appear: +**AWS** — Click Continue. 
The **Provider** step opens with additional fields: | Field | Required | Notes | |---|---|---| -| Region | Yes | Select your cloud region | -| Head Node Machine Type | Yes | Select a region first, then click **Select Head Node** to choose a machine type | -| SSH Key Name | Yes | Select a cloud account and region first | +| Region | Yes | The dropdown loads after you select the cloud account | +| Head Node Machine Type | Yes | Click Select Head Node to browse instance types by vCPU, GPU, and price | +| SSH Key Name | Yes | The list loads after you pick a region | -**Advanced Options** (expand to configure custom networking — leave empty to use cloud defaults): +Click **Advanced Options** to pin the cluster to a custom **VPC**, **Head Node Subnet**, and **Compute Node Subnet**. Leave these empty to use AWS-managed defaults. -| Field | Required | Notes | -|---|---|---| -| VPC ID | No | Select a Cloud Account and region first | -| Head Node Subnet ID | Yes, if VPC selected | Select a VPC first | -| Compute Node Subnet ID | No | Select a VPC first | +Click Proceed to Select Partitions. The **Partitions** step opens. A default partition named `compute` is pre-filled. Set the **Maximum node count** and add more partitions as needed, then click Prepare Cluster to submit. -Click **Proceed to Select Partitions** to continue. Configure your Slurm partitions, then click **Create Cluster**. +:::tip +In the **Configure** step you can also select a **Kubernetes Cluster** as the deployment target — this creates a Slurm-on-Kubernetes cluster instead. See [Slurm on Kubernetes](/platform/clusters/Kubernetes/create#slurm-on-kubernetes). +::: Managed platform cluster for Workbench sessions, ML training, and containerized workloads. -Click the **Kubernetes** card and then click **Continue**. - -### Configure Cluster Details +Click Prepare Cluster from the **Kubernetes** cluster type view (shown by default). A modal opens with the **Configure** step. 
| Field | Required | Notes | |---|---|---| @@ -93,62 +88,52 @@ Click the **Kubernetes** card and then click **Continue**. The remaining configuration depends on the provider: -**AWS** — Additional fields appear: +**Non-AWS providers** (Azure, GCP, Cudo Compute) — Click Create Cluster to submit. These providers use backend defaults for provisioning. + +**AWS** — Click Continue. The **Provider** step opens with additional fields: | Field | Required | Notes | |---|---|---| -| Region | Yes | Select your AWS region | -| Control Plane Instance Type | Yes | Click **Select Control Plane** to choose an EC2 instance type | -| SSH Key Name | Yes | Select a cloud account and region first | - -Non-AWS providers (Azure, GCP, Cudo Compute, on-premises, LXD) use Vantage-managed defaults and require only the fields above. +| Region | Yes | The dropdown loads after you select the cloud account | +| Control Plane Machine Type | Yes | Click Select Machine to browse EC2 instance types by vCPU, GPU, and price | +| SSH Key Name | Yes | The list loads after you pick a region | -**Platform Integrations** (configured after submission, in the final wizard step): +Click Prepare Cluster to submit. -| Integration | Purpose | Default | -|---|---|---| -| Notebook | JupyterHub for interactive sessions | Enabled | -| Grafana + Prometheus | Cluster monitoring and observability | Enabled | -| Ray | Distributed ML training framework | Disabled | -| MLflow | ML experiment tracking | Disabled | -| Slurm on Kubernetes | Deploy Slurm on this cluster later | Disabled | - -Click **Create Cluster** to submit. Provisioning time varies by provider — AWS typically takes 10–15 minutes, others connect more quickly. +JupyterHub and Grafana + Prometheus are enabled by default. See [Integrations](/platform/clusters/Kubernetes/integrations) for details. -Deploy a Slurm HPC cluster on top of an existing Kubernetes cluster. Manage node groups and partitions via VDeployer. 
- -Click the **Slurm on Kubernetes** card and then click **Continue**. This path has 4 steps: Choose Type → Select K8s Cluster → Configure → Creating. +Deploy a Slurm HPC cluster on top of an existing Kubernetes cluster. Manage node groups and partitions. -### Step 2 — Select K8s Cluster +### From the Slurm list -A grid of available Kubernetes clusters is shown, with each cluster's name and cloud provider type. Click a cluster card to select it (it will show a highlighted border), then click **Configure Slurm Cluster**. +1. Click **Slurm** in the cluster type navigation, then click Prepare Cluster. +1. In the **Configure** step, select **Kubernetes Cluster** as the deployment target. A list of ready K8s clusters appears. Click the target cluster, then click Configure Slurm Cluster. -The selected parent cluster determines the available profiles for the node groups. AWS clusters unlock EC2 instance type selection; non-AWS clusters use pre-defined profiles. +### From the Kubernetes detail page -### Step 3 — Configure +1. Click the target cluster name to open its detail page. +1. Click the **Slurm Clusters** tab, then click Create Slurm Cluster. -**Cluster Identity:** +### Configure compute pools and partitions -| Field | Required | Notes | -|---|---|---| -| Slurm Cluster Name | Yes | Must start with a lowercase letter and can only include lowercase letters, numbers, and dashes (no trailing dash) | -| Parent K8s Cluster | Yes | Pre-filled from the previous step (read-only) | +From either entry point, the wizard opens with the **Compute & Partitions** step. -**Node Groups:** +**Node Groups** -Two node groups are pre-configured — **Control Plane** and **Compute Group**. Node group names are auto-generated based on the cluster name (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`). +Two node groups are pre-configured — **Slurm Controller** (control plane) and **Compute Workers**. 
Node group names are auto-generated (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`). | Field | Default | Notes | |---|---|---| | Profile | — | Select a profile. No default — a selection is required. | +| GPU | No | Toggle to enable GPU compute | | Max Nodes | 1 (Control Plane) / 10 (Compute) | Minimum 1 | The **Profile** field adapts based on the parent K8s cluster's provider: -- **AWS parent** — Opens an **instance type browser** dialog. Search and select any EC2 instance type (e.g., `t3.medium`, `c5n.4xlarge`). CPU and memory are managed by AWS; no profile presets are used. +- **AWS parent** — Opens an instance type browser dialog. Select any EC2 instance type (e.g., `t3.medium`, `c5n.4xlarge`). - **Non-AWS parent** (Cudo Compute, on-premises, LXD) — A dropdown with three pre-defined profiles: | Profile | vCPU | Memory | @@ -157,33 +142,19 @@ The **Profile** field adapts based on the parent K8s cluster's provider: | Medium | 8 | 16 GiB | | Large | 16 | 32 GiB | -Selecting a profile auto-fills the CPU and memory for that node group. If you change the parent K8s cluster after selecting profiles, all profile selections are reset. - Click **+ Add Compute Group** to add additional compute node groups. At least one control plane group and one compute group are required. -**Partitions:** - -A default partition named `compute` is pre-configured. Partitions route jobs to a specific node group. - -| Field | Default | Notes | -|---|---|---| -| Partition Name | `compute` | Name for the Slurm partition | -| Node Group | — | Select from the compute groups defined above | -| Default | Enabled | Only one partition can be default at a time | - -Click **+ Add Partition** to add additional partitions. At least one partition is required. - -Click **Create Slurm Cluster** to begin provisioning. The wizard advances to a progress view showing each step as it completes. 
+**Partitions** -### Step 4 — Creating (Progress) +A default partition named `partition-1` is pre-configured. Set the **Partition Name**, choose which **Compute Group** it routes to, and toggle **Default** status. Only one partition can be default at a time. -The wizard shows a progress stepper with three sequential stages: +Click **Advanced Options** to configure TLS, NodePort exposure, job profiling, and the K8s scheduler bridge. -1. **Registering cluster** — Creates the Slurm cluster record and provisions a Keycloak client in the background -2. **Creating node groups** — Provisions each node group sequentially on the parent K8s cluster (control plane, then compute groups) -3. **Creating Slurm cluster** — Finalizes the Slurm deployment with your partition configuration +Click Create Slurm Cluster to submit. The wizard shows a progress stepper with three sequential stages: -Each completed stage shows a green checkmark. While provisioning, the progress bars remain visible so you can track which stage the cluster is in. +1. **Registering cluster** — Creates the Slurm cluster record and provisions a Keycloak client +2. **Creating node groups** — Provisions each node group on the parent K8s cluster (control plane, then compute groups) +3. **Creating Slurm cluster** — Finalizes the Slurm deployment @@ -192,7 +163,7 @@ Each completed stage shows a green checkmark. While provisioning, the progress b Return to the Clusters list page. The cluster status shows **"preparing"** while provisioning, then transitions to **"ready"** when complete. -![Cluster connected successfully](./img/create-cluster-intro/create-cluster-intro-04.png) +A cluster with `ready` status shows a green badge in the **Status** column. Clicking the cluster row opens the cluster detail page. ## Summary @@ -204,3 +175,4 @@ Your cluster is now ready for workloads. 
You can launch notebooks, submit jobs, - [Create a Job Script](./create-job-script-intro.md) - [Submit Your First Job](./create-job-submission-intro.md) - [Invite Team Members](./invite-intro.md) +- [Create an on-premises cluster](/platform/clusters/On-Premises) — Multipass, Juju, or agent-based diff --git a/docs/platform/clusters/Kubernetes/create.mdx b/docs/platform/clusters/Kubernetes/create.mdx index 608b5f8..ab06bb9 100644 --- a/docs/platform/clusters/Kubernetes/create.mdx +++ b/docs/platform/clusters/Kubernetes/create.mdx @@ -15,32 +15,13 @@ description: Step-by-step guides for creating Kubernetes clusters on every suppo AWS K8s clusters use direct boto3 API calls (not CloudFormation) to provision infrastructure. Vantage creates the VPC, IAM roles, security groups, and launches a control plane EC2 instance with MicroK8s pre-configured. -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. +1. **Open Clusters** — Click **Clusters** in the left sidebar (the **Kubernetes** view is shown by default), then click Prepare Cluster. A modal opens with the **Configure** step. -1. **Choose type** — Select **Kubernetes** and click **Continue**. +1. **Configure the cluster** — Enter a **Cluster Name** (max 27 characters, must be unique) and optional **Description**. Select your **AWS Cloud Account**, then click Continue. The **Provider** step opens. -1. **Configure the cluster:** - - Enter a **Cluster Name** (max 27 characters, must be unique) and optionally a **Description**. - - Select your **AWS Cloud Account**. - - Pick a **Region** — the dropdown loads after you select the cloud account. - - Click **Select Control Plane** to choose an **EC2 instance type** for the cluster management nodes. Browse by vCPU, GPU, and price. - - Select an **SSH Key Name** — the list loads after you pick a region. If it's empty, create a key pair in the AWS EC2 console first. - -1. 
**Advanced networking (optional)** — Click **Advanced Options** to specify: - - **VPC ID** — Deploy into an existing VPC. A new VPC is created if omitted. - - **Subnet ID** — Existing subnet in the VPC. Resolved automatically if omitted. - -1. **Select platform integrations** — Choose which tools to install on the cluster. See [Integrations](/platform/clusters/Kubernetes/integrations) for details. - - | Integration | Purpose | Default | - |---|---|---| - | Notebook | JupyterHub for interactive sessions | Enabled | - | Grafana + Prometheus | Cluster monitoring and observability | Enabled | - | Ray | Distributed ML training framework | Disabled | - | MLflow | ML experiment tracking | Disabled | - | Slurm on Kubernetes | Deploy a Slurm scheduler on this cluster later | Disabled | +1. **Configure AWS resources** — Set the **Region** (the dropdown loads after you select the cloud account). Click Select Machine to choose a **Control Plane Machine Type** — browse EC2 instance types by vCPU, GPU, and price. Select an **SSH Key Name** (the list loads after you pick a region; create a key pair in the AWS EC2 console first if empty). Click Prepare Cluster to submit. -1. **Submit** — Click Prepare Cluster. The cluster enters `preparing` status. AWS provisioning typically takes 10-15 minutes. + JupyterHub and Grafana + Prometheus are enabled by default. See [Integrations](/platform/clusters/Kubernetes/integrations) for details. ### What Vantage provisions on AWS @@ -61,16 +42,12 @@ AWS K8s clusters use direct boto3 API calls (not CloudFormation) to provision in Cudo Compute K8s clusters provision a control plane VM through the Cudo Compute REST API. Unlike AWS, compute is specified by raw resources (vcpus + memory_gib) rather than instance types, and each node group has its own data center. -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. - -1. **Choose type** — Select **Kubernetes** and click **Continue**. +1. 
**Open Clusters** — Click **Clusters** in the left sidebar (the **Kubernetes** view is shown by default), then click Prepare Cluster. 1. **Configure the cluster:** - Enter a **Cluster Name** (max 27 characters, must be unique) and optionally a **Description**. - Select your **Cudo Compute Cloud Account**. -1. **Select platform integrations** — Same options as AWS (JupyterHub, Grafana, Ray, MLflow). - 1. **Submit** — Click Prepare Cluster. Cudo provisioning typically takes 10-25 minutes (longer than AWS due to VM provisioning + cloud-init). ### What Vantage provisions on Cudo @@ -83,45 +60,41 @@ Cudo Compute K8s clusters provision a control plane VM through the Cudo Compute ## Azure -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. - -1. **Choose type** — Select **Kubernetes** and click **Continue**. +1. **Open Clusters** — Click **Clusters** in the left sidebar (the **Kubernetes** view is shown by default), then click Prepare Cluster. 1. **Configure:** - Enter a **Cluster Name** and optional **Description**. - Select your **Azure Cloud Account**. -1. **Select platform integrations** — Same options as other providers. + :::note + This provider uses backend defaults for provisioning. Review your cloud account configuration before submitting. + ::: -1. **Submit** — Azure Kubernetes clusters use Vantage-managed defaults for node sizing and networking. Review your cloud account's regional quota before submitting. +1. **Submit** — Click Create Cluster. Azure Kubernetes clusters use Vantage-managed defaults for node sizing and networking. Review your cloud account's regional quota before submitting. ## GCP -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. - -1. **Choose type** — Select **Kubernetes** and click **Continue**. +1. **Open Clusters** — Click **Clusters** in the left sidebar (the **Kubernetes** view is shown by default), then click Prepare Cluster. 1. **Configure:** - Enter a **Cluster Name** and optional **Description**. 
- Select your **GCP Cloud Account**. -1. **Select platform integrations** — Same options as other providers. + :::note + This provider uses backend defaults for provisioning. Review your cloud account configuration before submitting. + ::: -1. **Submit** — GCP Kubernetes clusters use Vantage-managed defaults. Verify your project's quota before submitting. +1. **Submit** — Click Create Cluster. GCP Kubernetes clusters use Vantage-managed defaults. Verify your project's quota before submitting. -## On-premises / LXD +## On-premises -On-premises Kubernetes clusters connect through a lightweight agent, same as on-premises Slurm clusters. Vantage does not provision cloud resources. +On-premises Kubernetes clusters connect through a lightweight agent deployed on your infrastructure. Vantage does not provision cloud resources — you provide the compute. -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. +For the full setup guide, see [Agent-based on-premises clusters](/platform/clusters/On-Premises/agent-based). -1. **Choose type** — Select **Kubernetes** and click **Continue**. - -1. **Configure:** - - Enter a **Cluster Name**. - - Select your **On-Premises** or **LXD** cloud account. - -1. **Get the agent command** — The wizard shows the agent installation command. Copy and run it on your cluster's head node. The agent establishes an outbound HTTPS connection to Vantage. +:::note +Multipass and Juju on-premises clusters only support Slurm, not Kubernetes. For Kubernetes on your own hardware, use the agent-based method. +::: ## Slurm on Kubernetes @@ -130,29 +103,24 @@ You can deploy a Slurm scheduler on top of an existing Kubernetes cluster (AWS o ### Prerequisites - An existing Kubernetes cluster with **Ready** status — see [AWS](#aws) or [Cudo Compute](#cudo-compute) above. -- The parent K8s cluster must have **Slurm on Kubernetes** enabled in its integrations. - -### Steps -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. 
+### From the Slurm list -1. **Choose type** — Select **Slurm on Kubernetes** and click **Continue**. +1. **Open Clusters** — Click **Clusters** in the left sidebar, then click **Slurm** in the cluster type navigation, then click Prepare Cluster. A modal opens with the **Configure** step. -1. **Select parent K8s cluster** — A grid shows your available Kubernetes clusters. Click the target cluster to select it, then click **Configure Slurm Cluster**. +1. **Select deployment target** — Under **Deployment Target**, choose **Kubernetes Cluster**. A list of ready K8s clusters appears. Click the target cluster to select it, then click Configure Slurm Cluster. The **Compute & Partitions** step opens. -1. **Configure the Slurm cluster:** +1. **Configure compute pools and partitions:** - **Cluster Identity:** - - **Slurm Cluster Name** — Must start with a lowercase letter and use only lowercase letters, numbers, and dashes (no trailing dash). - - **Parent K8s Cluster** — Pre-filled from the previous step (read-only). - - **Node Groups:** - Two node groups are pre-configured — **Control Plane** and **Compute Group**. Node group names are auto-generated based on the cluster name (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`). + **Node Groups** + Two node groups are pre-configured — **Slurm Controller** (control plane) and **Compute Workers**. Node group names are auto-generated (e.g., `slurm-control-{name}` and `slurm-compute-{name}-1`). | Field | Default | Notes | |---|---|---| | Profile | — | Select a profile. No default — a selection is required. | - | Max Nodes | 1 (Control Plane) / 10 (Compute) | Minimum 1 | + | GPU | No | Toggle to enable GPU compute | + | Min Nodes | 1 | Minimum 1 | + | Max Nodes | 1 (Control Plane) / 10 (Compute) | | The **Profile** field adapts based on the parent K8s cluster's provider: - **AWS parent** — Opens an instance type browser dialog. Select any EC2 instance type (e.g., `t3.medium`, `c5n.4xlarge`). 
@@ -166,14 +134,27 @@ You can deploy a Slurm scheduler on top of an existing Kubernetes cluster (AWS o Click **+ Add Compute Group** to add additional compute node groups. At least one control plane group and one compute group are required. - **Partitions:** - A default partition named `compute` is pre-configured. Partitions route jobs to a specific node group. Add more partitions as needed. + **Partitions** + A default partition named `partition-1` is pre-configured. Set the **Partition Name**, choose which **Compute Group** it routes to, and toggle **Default** status. Only one partition can be default at a time. + + Click **Advanced Options** to configure: + - **Expose Slurm services via NodePort** + - **TLS enabled** (recommended) — enabled by default + - **Job profiling (InfluxDB)** + - **K8s scheduler bridge** — enabled by default -1. **Submit** — Click **Create Slurm Cluster**. The wizard shows a progress stepper: +1. **Submit** — Click Create Slurm Cluster. The wizard shows a progress stepper: 1. **Registering cluster** — Creates the Slurm cluster record and provisions a Keycloak client. 2. **Creating node groups** — Provisions each node group sequentially on the parent K8s cluster (control plane, then compute groups) via vdeployer. - 3. **Creating Slurm cluster** — VDeployer triggers Helm chart installation: `slurmctld`, `slurmdbd`, `slurmrestd`, `slurmd`, and optionally `slurm-bridge`. + 3. **Creating Slurm cluster** — VDeployer triggers Helm chart installation. + +### From the Kubernetes detail page + +1. **Open the target Kubernetes cluster** — Click the cluster name to open its detail page. +1. **Open the Slurm Clusters tab** — Click **Slurm Clusters** in the cluster detail tabs. +1. **Create a Slurm cluster** — Click Create Slurm Cluster. A modal opens with the **Configure** step. +1. Follow steps 3-4 above to configure compute pools, partitions, and submit. 
The Slurm cluster enters `preparing` status and transitions to `ready` once all Slurm pods are running. diff --git a/docs/platform/clusters/Kubernetes/index.mdx b/docs/platform/clusters/Kubernetes/index.mdx index 32bc63b..42aa194 100644 --- a/docs/platform/clusters/Kubernetes/index.mdx +++ b/docs/platform/clusters/Kubernetes/index.mdx @@ -26,15 +26,17 @@ When you create a Kubernetes cluster, Vantage: ## Provider comparison -| Aspect | AWS | Cudo Compute | Azure / GCP | On-premises / LXD | +| Aspect | AWS | Cudo Compute | Azure / GCP | On-Premises | |---|---|---|---|---| -| Control plane | EC2 instance (boto3) | VM (Cudo API) | Vantage-managed | Agent-based | -| Instance selection | EC2 type browser | Resource profiles (vcpus + memory) | Vantage-managed defaults | Your hardware | +| Control plane | EC2 instance (boto3) | VM (Cudo API) | Vantage-managed | Agent-based (Multipass and Juju are Slurm-only) | +| Instance selection | EC2 type browser | Resource profiles (vcpus + memory) | Vantage-managed defaults | Your hardware or local VMs | | VPC / networking | VPC + subnets (auto or existing) | Data center + machine type | Vantage-managed | Your network | | GPU support | GPU instance types | Explicit gpus + gpu_model fields | GPU instance types | Your GPUs | | Custom networking | VPC, subnet, security group | Per-group data center | No | No | | Slurm on K8s supported | Yes | Yes | No | No | +:::note Multipass and Juju on-premises clusters only support Slurm, not Kubernetes. For on-premises Kubernetes, use the agent-based method. See [On-Premises clusters](/platform/clusters/On-Premises).::: + ## Slurm on Kubernetes You can deploy a Slurm scheduler on top of an existing Kubernetes cluster — combining HPC batch scheduling with cloud-native autoscaling. 
This is available for: @@ -53,3 +55,4 @@ For details, see [creating a Slurm-on-Kubernetes cluster](/platform/clusters/Kub - [Node groups](/platform/clusters/Kubernetes/node-groups) — Compute pools and autoscaling - [Integrations](/platform/clusters/Kubernetes/integrations) — JupyterHub, Grafana, Ray, MLflow - [Reference](/platform/clusters/Kubernetes/reference) — Fields, limits, error codes +- [On-Premises clusters](/platform/clusters/On-Premises) — Agent-based setup for on-premises Kubernetes diff --git a/docs/platform/clusters/On-Premises/agent-based.mdx b/docs/platform/clusters/On-Premises/agent-based.mdx new file mode 100644 index 0000000..536c75b --- /dev/null +++ b/docs/platform/clusters/On-Premises/agent-based.mdx @@ -0,0 +1,73 @@ +--- +title: Agent-based on-premises clusters +description: Connect your existing infrastructure to Vantage using the Vantage Agent. +sidebar_position: 1 +--- + +# Agent-based on-premises clusters + +Agent-based clusters connect your existing servers to Vantage through a lightweight agent. Vantage does not provision cloud resources — you provide the compute. The agent establishes an outbound HTTPS connection to Vantage, so no inbound firewall rules are required. + +Both Slurm and Kubernetes clusters support agent-based on-premises deployment. + +## Prerequisites + +- A Vantage account and organization ([Sign Up](/get-started/sign-up)) +- A configured [On-Premises or LXD Cloud Account](/platform/compute-providers/on-premises) +- Outbound HTTPS access (port 443) from your infrastructure to Vantage servers + +## Create a Slurm cluster + +1. **Open Clusters** — Click **Clusters** in the left sidebar, then click **Slurm** in the cluster type navigation, then click Prepare Cluster. + +1. **Configure the cluster:** + - Enter a **Cluster Name** (max 27 characters, must be unique). + - Select your **On-Premises** or **LXD** cloud account. + +1. **Submit** — Click Create Cluster. A success toast confirms the cluster was created. 
The modal closes and you are redirected to the cluster view. + +## Create a Kubernetes cluster + +1. **Open Clusters** — Click **Clusters** in the left sidebar (the **Kubernetes** view is shown by default), then click Prepare Cluster. + +1. **Configure:** + - Enter a **Cluster Name**. + - Select your **On-Premises** or **LXD** cloud account. + +1. **Submit** — Click Create Cluster. The modal closes and you are redirected to the cluster view. + +## Connect the agent + +After creating the cluster, the cluster detail page shows a message that the agent is not yet connected and provides a link to agent installation instructions. Follow those instructions to install the Vantage Agent on your cluster node. + +The agent only needs outbound HTTPS access to Vantage servers (port 443). If your cluster is behind a firewall, ensure outbound connectivity is not blocked. + +Once the agent connects: + +- The cluster transitions from `preparing` to `ready` +- Nodes appear in the cluster detail page as they register +- You can submit jobs and launch notebooks immediately + +:::tip +On-premises clusters report their location as configured by your admin. Partitions and node groups are configured post-creation from the cluster detail page. +::: + +## Troubleshooting + +### Cluster stays in "preparing" + +- Verify the agent was installed and is running on your head node. +- Check that port 443 outbound is not blocked by a firewall. +- Confirm the cloud account credentials are valid. + +### Nodes not appearing + +- Install the agent on each compute node, not just the head node. +- Check that each node can reach Vantage servers over HTTPS. 
+ +## Next steps + +- [Manage a Slurm cluster](/platform/clusters/Slurm/manage) +- [Manage a Kubernetes cluster](/platform/clusters/Kubernetes/manage) +- [Slurm partitions](/platform/clusters/Slurm/partitions) +- [Kubernetes node groups](/platform/clusters/Kubernetes/node-groups) \ No newline at end of file diff --git a/docs/platform/clusters/On-Premises/index.mdx b/docs/platform/clusters/On-Premises/index.mdx new file mode 100644 index 0000000..b73aa68 --- /dev/null +++ b/docs/platform/clusters/On-Premises/index.mdx @@ -0,0 +1,42 @@ +--- +title: On-Premises clusters +description: Connect your own infrastructure to Vantage using agent-based registration, Multipass, or Juju. +--- + +# On-Premises clusters + +On-premises clusters run on infrastructure you control — bare-metal servers, local VMs, or LXD containers. Vantage supports three methods for creating on-premises clusters, each suited to a different use case. + +| Method | Best for | Interface | Infrastructure | +|---|---|---|---| +| Agent-based | Existing servers and bare-metal | Web UI + terminal | Your hardware | +| Multipass | Local development and testing | Terminal only | Multipass VMs on your machine | +| Juju (Charmed HPC) | Production-like HPC on localhost | Terminal only | LXD containers on your machine | + +## Agent-based + +Connect existing servers to Vantage by running a lightweight agent on your infrastructure. The agent establishes an outbound HTTPS connection — no inbound firewall rules required. This is the most flexible option for production on-premises deployments and the only method that supports both Slurm and Kubernetes clusters. + +[Create an agent-based cluster](/platform/clusters/On-Premises/agent-based) + +## Multipass + +Deploy a single-node Slurm cluster in a Multipass VM on your local machine. Multipass provides lightweight Ubuntu VMs managed through a simple CLI — ideal for development, testing, and learning Vantage without cloud infrastructure. 
+ +[Create a Multipass cluster](/platform/clusters/On-Premises/multipass) + +## Juju (Charmed HPC) + +Deploy a multi-node Slurm cluster using Juju charms on LXD containers. Charmed HPC provides a production-like HPC environment with controller, compute, and login nodes — all on your local machine. + +[Create a Juju cluster](/platform/clusters/On-Premises/juju) + +## Which method should I use? + +- **Agent-based** — You have existing servers or bare-metal hardware and want to connect them to Vantage. Supports both Slurm and Kubernetes. +- **Multipass** — You want a quick, single-node Slurm environment on your laptop for development and testing. +- **Juju** — You want a multi-node Slurm environment that simulates a production HPC cluster, running locally in LXD containers. + +:::tip +Multipass and Juju clusters are created through the Vantage CLI. Agent-based clusters are created through the Vantage web UI. +::: \ No newline at end of file diff --git a/docs/platform/clusters/On-Premises/juju.mdx b/docs/platform/clusters/On-Premises/juju.mdx new file mode 100644 index 0000000..9e0bfe2 --- /dev/null +++ b/docs/platform/clusters/On-Premises/juju.mdx @@ -0,0 +1,213 @@ +--- +title: Juju (Charmed HPC) clusters +description: Deploy a Slurm + Vantage cluster with Juju on LXD. +sidebar_position: 3 +--- + +# Deploying a Slurm + Vantage Cluster with Juju on LXD + +Deploy a production-like Slurm cluster on your local machine using Juju charms and LXD containers. This method gives you a full HPC stack — controller, compute, database, REST API, login node, and Vantage agents — all running inside LXD containers on a single host. + +## Prerequisites + +- Ubuntu 22.04 or 24.04 (host machine) +- LXD installed and initialised (`lxd init`) +- Juju snap installed +- Internet access to download charms and container images + +## 1. Bootstrap a Juju controller + +```bash +juju bootstrap lxd +``` + +This creates a controller named `lxd-default` inside an LXD container. 
It manages all models and workloads. + +## 2. Add a model for your Slurm cluster + +```bash +juju add-model myslurmcluster +``` + +Models are logical environments. Name yours whatever you like — `myslurmcluster` is used throughout this guide. + +## 3. Deploy the required charms + +Use Ubuntu 24.04 as the base for Slurm charms. MySQL uses Ubuntu 22.04 (charm default). + +```bash +juju deploy slurmctld --base ubuntu@24.04 --channel 25.11/edge +juju deploy slurmdbd --base ubuntu@24.04 --channel 25.11/edge +juju deploy mysql --channel 8.0/stable +juju deploy slurmrestd --base ubuntu@24.04 --channel 25.11/edge +juju deploy slurmd --base ubuntu@24.04 --channel 25.11/edge +juju deploy sackd --base ubuntu@24.04 --channel 25.11/edge +juju deploy jobbergate-agent --channel latest/edge --base ubuntu@24.04 +juju deploy vantage-agent --channel latest/edge --base ubuntu@24.04 +``` + +## 4. Add relations (integrations) + +```bash +juju relate slurmctld slurmdbd +juju relate slurmctld slurmrestd +juju relate slurmctld slurmd +juju relate slurmdbd mysql +juju relate sackd slurmctld +juju relate jobbergate-agent sackd +juju relate vantage-agent sackd +``` + +These connections enable: + +- Accounting and configuration sharing +- REST API access +- Compute node registration +- Vantage agent communication with Slurm + +## 5. Configure the charms + +### Compute node initial state + +Set slurmd nodes to start in `idle` (ready to run jobs) or `down` (manual resume). Use lowercase only: + +```bash +juju config slurmd default-node-state=idle +``` + +### OIDC credentials (for agents) + +Replace the example values with your real OIDC client ID and secret. 
+ +```bash +juju config jobbergate-agent jobbergate-agent-oidc-client-id=your-client-id +juju config jobbergate-agent jobbergate-agent-oidc-client-secret=your-client-secret +juju config vantage-agent vantage-agent-oidc-client-id=your-client-id +juju config vantage-agent vantage-agent-oidc-client-secret=your-client-secret +``` + +### Vantage cluster name + +```bash +juju config vantage-agent vantage-agent-cluster-name=myslurmcluster +``` + +## 6. Monitor deployment + +```bash +juju status --watch 1s +``` + +Wait until all applications show `active` and all units are `idle`. + +Expected status: + +| Application | Expected status | +|---|---| +| `slurmctld`, `slurmdbd`, `mysql`, `slurmd`, `slurmrestd`, `sackd` | `active` | +| `jobbergate-agent`, `vantage-agent` | `active` (once OIDC and cluster name are set) | + +## 7. Verify the Slurm cluster + +SSH into the controller unit (machine 0): + +```bash +juju ssh 0 +``` + +Check the partition and nodes: + +```bash +sinfo +``` + +You should see the `slurmd` partition and at least one node in `idle` state. + +If the node is `down`, resume it: + +```bash +sudo scontrol update nodename=slurmd-0 state=resume +``` + +Submit a test job: + +```bash +sbatch -p slurmd --wrap="sleep 10; hostname" +squeue +``` + +After a few seconds, `squeue` should show the job running (`R`) then disappear. Check the output: + +```bash +cat slurm-*.out +``` + +Exit the controller: + +```bash +exit +``` + +## 8. (Optional) Run the full test suite + +Save the following script as `slurm_test.sh` on the controller unit: + +```bash +#!/bin/bash +echo "=== Slurm Cluster Test ===" +sinfo +sbatch --parsable --wrap="sleep 5; hostname" +echo "Job submitted. Waiting 10 seconds..." +sleep 10 +squeue +sacct -u $USER --format=JobID,State,ExitCode +``` + +Make it executable and run: + +```bash +chmod +x slurm_test.sh +./slurm_test.sh +``` + +All jobs should complete with `COMPLETED` state. 
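The final check can also be automated instead of read by eye. The following is a minimal sketch, not part of the charms: `all_completed` is a hypothetical helper that reads pipe-delimited `sacct` rows on standard input and succeeds only when at least one row is present and every `State` column is exactly `COMPLETED`.

```bash
#!/usr/bin/env bash
# Sketch only: all_completed is a hypothetical helper, not a Slurm command.

# all_completed
# stdin: rows from `sacct --noheader --parsable2 --format=JobID,State,ExitCode`
# Prints every job whose state is not COMPLETED and returns non-zero
# if any such job exists or if no rows were read at all.
all_completed() {
  awk -F'|' '
    NF >= 2 {
      rows++
      if ($2 != "COMPLETED") {
        print "not completed: " $1 " (" $2 ")"
        bad++
      }
    }
    END { exit (rows == 0 || bad > 0) }
  '
}
```

On the controller unit, this could be fed with `sacct -u "$USER" --noheader --parsable2 --format=JobID,State,ExitCode | all_completed`. Treating an empty result as a failure is a design choice here, so that a run with no accounting records is not mistaken for success.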
+ +## Troubleshooting + +### Deployment failures + +```bash +juju status +juju debug-log --include slurmctld +``` + +### Container issues + +```bash +lxc list +lxc info your-container-name --show-log +lxc restart your-container-name +``` + +### Network connectivity + +```bash +lxc network show lxdbr0 +lxc exec your-container-name -- ping google.com +``` + +### Slurm services not running + +```bash +juju ssh slurmctld/0 +sudo systemctl status slurmctld +sudo systemctl status slurmdbd +sudo systemctl restart slurmctld slurmdbd +sudo journalctl -u slurmctld -f +``` + +## Next steps + +- [Submit jobs to your cluster](/platform/jobs) +- [Create a Multipass cluster](./multipass) — for simpler single-node setups +- [Create an agent-based cluster](./agent-based) — for production on-premises deployments \ No newline at end of file diff --git a/docs/platform/clusters/On-Premises/multipass.mdx b/docs/platform/clusters/On-Premises/multipass.mdx new file mode 100644 index 0000000..d33a20a --- /dev/null +++ b/docs/platform/clusters/On-Premises/multipass.mdx @@ -0,0 +1,224 @@ +--- +title: Multipass clusters +description: Deploy a single-node Slurm cluster on your local machine using Multipass. +sidebar_position: 2 +--- + +# Multipass clusters + +Multipass provides a lightweight Ubuntu VM environment for running a single-node Slurm cluster on your local machine. It is the fastest way to get started with Vantage — no cloud account or credit card required. + +Multipass clusters are created and managed entirely through the terminal using the Vantage CLI. + +:::tip When to use Multipass + +Use Multipass when you want to: + +- Try Vantage without cloud infrastructure +- Develop and test Slurm job scripts locally +- Learn HPC workflows on your laptop + +For a multi-node HPC environment, consider [Juju (Charmed HPC)](./juju) instead. For connecting existing production hardware, use [Agent-based](./agent-based). 
+::: + +## Prerequisites + +- A Vantage account and organization ([Sign Up](/get-started/sign-up)) +- A Linux, macOS, or Windows machine with at least 8 GB RAM and 20 GB disk space +- [Snap](https://snapcraft.io/docs/installing-snapd) package manager (Linux) or Homebrew (macOS) + +## 1. Install the Vantage CLI + +```bash +pip install vantage-cli +``` + +Verify the installation: + +```bash +vantage version +``` + +Log in to your Vantage account: + +```bash +vantage login +``` + +## 2. Install Multipass + +```bash +# Linux (via snap) +sudo snap install multipass + +# macOS (via Homebrew) +brew install --cask multipass + +# Windows — download the installer from https://multipass.run/install +``` + +Verify the installation: + +```bash +multipass --version +``` + +## 3. Deploy the cluster + +Deploy a single-node Slurm cluster: + +```bash +vantage app deploy slurm-multipass-localhost +``` + +Monitor the deployment progress: + +```bash +vantage app status slurm-multipass-localhost +``` + +Wait until the VM is running and Slurm services are ready: + +```bash +multipass list +``` + +The output should show the VM in a `Running` state. + +### Custom resources + +By default, the cluster uses 4 CPUs, 8 GB RAM, and 50 GB disk. Customize resources during deployment: + +```bash +vantage app deploy slurm-multipass-localhost \ + --cpus=4 \ + --memory=8G \ + --disk=50G +``` + +## 4. Access the cluster + +Connect to the VM: + +```bash +multipass shell slurm-node +``` + +Or use SSH: + +```bash +ssh ubuntu@$(multipass info slurm-node --format json | jq -r '.info."slurm-node".ipv4[0]') +``` + +## 5. Verify Slurm + +Once connected to the VM, verify that Slurm is running: + +```bash +sinfo +squeue +``` + +Submit a test job: + +```bash +srun --nodes=1 --ntasks=1 hostname +``` + +Or submit a batch job: + +```bash +sbatch <Prepare Cluster. +1. **Open Clusters** — Click **Clusters** in the left sidebar, then click **Slurm** in the cluster type navigation, then click Prepare Cluster. 
A modal opens with the **Configure** step. -1. **Choose type** — Select **Slurm** and click **Continue**. +1. **Configure the cluster** — Enter a **Cluster Name** (max 27 characters, must be unique; used as the CloudFormation stack name) and optional **Description**. Select your **AWS Cloud Account**, then click Continue. The **Provider** step opens. -1. **Configure the cluster:** - - Enter a **Cluster Name** (max 27 characters, must be unique). The name is used as the CloudFormation stack name. - - Select your **AWS Cloud Account**. The provider is detected automatically. - - Pick a **Region** — the dropdown loads after you select the cloud account. - - The **Head Node Machine Type** auto-fills a default — click **Select Head Node** to browse by vCPU, GPU, and price. - - Select an **SSH Key Name** — the list loads after you pick a region. If it's empty, create a key pair in the AWS EC2 console first. + :::tip + In the **Configure** step you can also select a ready **Kubernetes Cluster** as the deployment target to create a Slurm-on-Kubernetes cluster instead. See [Slurm on Kubernetes](/platform/clusters/Kubernetes/create#slurm-on-kubernetes). + ::: + +1. **Configure AWS resources** — Set the **Region** (the dropdown loads after you select the cloud account). The **Head Node Machine Type** auto-fills a default — click Select Head Node to browse by vCPU, GPU, and price. Select an **SSH Key Name** (the list loads after you pick a region; create a key pair in the AWS EC2 console first if empty). Click Proceed to Select Partitions. The **Partitions** step opens. -1. **Networking (optional)** — Click **Advanced Options** to pin the cluster to a specific **VPC**, **Head Node Subnet**, and **Compute Node Subnet**. Leave these empty to use AWS-managed defaults (Vantage creates a VPC, public and private subnets, Internet Gateway, NAT Gateway, and security groups automatically). 
+ Click **Advanced Options** to pin the cluster to a specific **VPC**, **Head Node Subnet**, and **Compute Node Subnet**. Leave these empty to use AWS-managed defaults (Vantage creates a VPC, public and private subnets, Internet Gateway, NAT Gateway, and security groups automatically). 1. **Set partitions** — A default partition named `compute` is pre-filled. For each partition: - Give it a **Partition Name**. @@ -53,33 +52,35 @@ The most common Slurm path. Vantage uses CloudFormation to provision a VPC, Auto ## Azure -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. - -1. **Choose type** — Select **Slurm** and click **Continue**. +1. **Open Clusters** — Click **Clusters**, then click **Slurm** in the cluster type navigation, then click Prepare Cluster. 1. **Configure the cluster:** - Enter a **Cluster Name** (max 27 characters, must be unique). - Select your **Azure Cloud Account**. -1. **Submit** — Click Prepare Cluster. Azure Slurm clusters use Vantage-managed defaults for node configuration and networking. Partitions are configured post-creation from the **Partitions** tab on the cluster detail page. + :::note + This provider uses backend defaults for provisioning. Review your cloud account configuration before submitting. + ::: -## GCP +1. **Submit** — Click Create Cluster. Azure Slurm clusters use Vantage-managed defaults for node configuration and networking. Partitions are configured post-creation from the **Partitions** tab on the cluster detail page. -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. +## GCP -1. **Choose type** — Select **Slurm** and click **Continue**. +1. **Open Clusters** — Click **Clusters**, then click **Slurm** in the cluster type navigation, then click Prepare Cluster. 1. **Configure the cluster:** - Enter a **Cluster Name** (max 27 characters, must be unique). - Select your **GCP Cloud Account**. -1. **Submit** — Click Prepare Cluster. 
GCP Slurm clusters use Vantage-managed defaults for node configuration and networking. Partitions are configured post-creation from the cluster detail page. + :::note + This provider uses backend defaults for provisioning. Review your cloud account configuration before submitting. + ::: -## Cudo Compute +1. **Submit** — Click Create Cluster. GCP Slurm clusters use Vantage-managed defaults for node configuration and networking. Partitions are configured post-creation from the cluster detail page. -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. +## Cudo Compute -1. **Choose type** — Select **Slurm** and click **Continue**. +1. **Open Clusters** — Click **Clusters**, then click **Slurm** in the cluster type navigation, then click Prepare Cluster. 1. **Configure the cluster:** - Enter a **Cluster Name** (max 27 characters, must be unique). @@ -87,25 +88,15 @@ The most common Slurm path. Vantage uses CloudFormation to provision a VPC, Auto 1. **Submit** — Click Prepare Cluster. Cudo Slurm clusters use Vantage-managed defaults for node configuration and networking. Partitions are configured post-creation. -## On-premises / LXD +## On-premises -On-premises clusters connect through a lightweight agent deployed on your infrastructure. Vantage does not provision cloud resources — you provide the compute. +On-premises Slurm clusters connect through a lightweight agent deployed on your infrastructure. Vantage does not provision cloud resources — you provide the compute. -1. **Open Clusters** — Click **Clusters**, then Prepare Cluster. +For full setup guides covering agent-based, Multipass, and Juju (Charmed HPC) clusters, see [On-Premises clusters](/platform/clusters/On-Premises). -1. **Choose type** — Select **Slurm** and click **Continue**. - -1. **Configure:** - - Enter a **Cluster Name**. - - Select your **On-Premises** or **LXD** cloud account. - -1. **Get the agent command** — The wizard shows a Vantage Agent installation command. Copy it. - -1. 
**Install the agent** — Run the installation command on your cluster's head node (or multiple nodes). The agent establishes an outbound HTTPS connection to Vantage — no inbound firewall rules required. - -1. **Watch it connect** — The cluster flips to `ready` once the agent is reporting. Nodes appear in the detail page as they register. - -The agent only needs outbound HTTPS access to Vantage servers (port 443). If your cluster is behind a firewall, ensure outbound connectivity is not blocked. +:::tip +On-premises clusters created through the web UI use the agent-based method. For local development and testing, use Multipass or Juju via the Vantage CLI. +::: ## What happens after submission diff --git a/docs/platform/clusters/Slurm/index.mdx b/docs/platform/clusters/Slurm/index.mdx index 5fb14c9..b1de7df 100644 --- a/docs/platform/clusters/Slurm/index.mdx +++ b/docs/platform/clusters/Slurm/index.mdx @@ -24,17 +24,20 @@ Provisioning is asynchronous. The cluster enters `preparing` status immediately ## Provider comparison -| Aspect | AWS | Azure / GCP | Cudo Compute | On-premises / LXD | +| Aspect | AWS | Azure / GCP | Cudo Compute | On-Premises | |---|---|---|---|---| -| Provisioning | CloudFormation | Vantage-managed defaults | Vantage-managed defaults | Agent-based (you provide infrastructure) | -| Instance selection | EC2 instance type browser | Vantage-managed defaults | Vantage-managed defaults | Your existing hardware | +| Provisioning | CloudFormation | Vantage-managed defaults | Vantage-managed defaults | Agent-based, Multipass, or Juju (you provide infrastructure) | +| Instance selection | EC2 instance type browser | Vantage-managed defaults | Vantage-managed defaults | Your hardware or local VMs | | Partitions | Configured during creation | Configured post-creation | Configured post-creation | Configured post-creation | | SSH key required | Yes (EC2 key pair name) | No | No | No | | Custom networking | VPC, subnet, security group | No | No | No | +:::tip 
On-premises clusters can be created through the web UI (agent-based) or the Vantage CLI (Multipass and Juju). See [On-Premises clusters](/platform/clusters/On-Premises) for details.::: + ## Next steps - [Create a Slurm cluster](/platform/clusters/Slurm/create) — Provider-specific walkthroughs - [Manage a Slurm cluster](/platform/clusters/Slurm/manage) — Status lifecycle, detail page, monitoring - [Partitions](/platform/clusters/Slurm/partitions) — Job queues and node pools - [Reference](/platform/clusters/Slurm/reference) — Fields, limits, error codes +- [On-Premises clusters](/platform/clusters/On-Premises) — Agent-based, Multipass, and Juju setup diff --git a/docs/platform/clusters/concepts.mdx b/docs/platform/clusters/concepts.mdx index 7baa16c..46bffaf 100644 --- a/docs/platform/clusters/concepts.mdx +++ b/docs/platform/clusters/concepts.mdx @@ -16,6 +16,8 @@ Vantage supports three cluster types. Your choice determines the scheduler, the | Kubernetes | Kubernetes | Workbench sessions, ML training, containerized apps | | Slurm on Kubernetes | Slurm inside K8s | HPC workloads on cloud-native, auto-scaled infrastructure | +:::note On-premises clusters can use the Vantage Agent (both Slurm and Kubernetes), Multipass (Slurm only), or Juju/Charmed HPC (Slurm only). See [On-Premises clusters](/platform/clusters/On-Premises).::: + Slurm and Slurm-on-Kubernetes clusters appear under the **Slurm** list in the sidebar. Kubernetes clusters appear under **Kubernetes**. ## Status lifecycle @@ -59,7 +61,7 @@ Providers are the physical infrastructure Vantage provisions clusters on. 
|---|---| | Public clouds (AWS, Azure, GCP) | Elastic capacity, global regions, spot pricing | | Cudo Compute | Cost-efficient GPU cloud | -| On-premises / LXD | Your own hardware, maximum control | +| On-premises / LXD / Multipass / Juju | Your own hardware or local VMs — agent-based, Multipass, or Charmed HPC | | Vantage partners (atNorth, BuzzHPC, RCI) | Pre-integrated managed colocation and HPC | ## Regions and availability diff --git a/docs/platform/clusters/get-started.mdx b/docs/platform/clusters/get-started.mdx index 6503312..2a62924 100644 --- a/docs/platform/clusters/get-started.mdx +++ b/docs/platform/clusters/get-started.mdx @@ -8,15 +8,13 @@ sidebar_position: 1 From an empty workspace to a connected cluster in under five minutes. -1. **Open Clusters** — From the left sidebar, click **Clusters**. The sub-items show **Slurm** and **Kubernetes** lists. Both are empty until you connect or create a cluster. +1. **Open Clusters** — From the left sidebar, click **Clusters**. A cluster type navigation shows **Slurm** and **Kubernetes** — click **Slurm** to see Slurm clusters (the **Kubernetes** view is shown by default). -1. **Start the wizard** — Click Prepare Cluster in the top-right corner. A three-step wizard opens. +1. **Start the wizard** — Click Prepare Cluster in the top-right corner. A modal opens with the **Configure** step. -1. **Choose a cluster type** — Pick **Slurm** for traditional HPC batch workloads, **Kubernetes** for ML and containerized workloads, or **Slurm on Kubernetes** to run a Slurm scheduler inside an existing K8s cluster. If you're not sure, start with **Slurm**. +1. **Configure the cluster** — Enter a **Cluster Name** and select your cloud account. For non-AWS providers, click Create Cluster to submit. For AWS, click Continue to configure region, head node machine type, and partitions before submitting. -1. **Configure the cluster** — Enter a name, then pick a cloud provider and cloud account (for cloud-provisioned clusters). 
For on-premises clusters, the wizard shows an agent installation command — run it on your infrastructure and Vantage connects automatically. - -1. **Finish and explore** — Click Prepare Cluster to submit. Once the cluster connects, click its row to open the detail view. +1. **Finish and explore** — Click the submit button on the final step. Once the cluster connects, click its row to open the detail view. ## What's next @@ -28,6 +26,7 @@ After your first cluster is connected, dive deeper into your cluster type: | [Creating a Kubernetes cluster](/platform/clusters/Kubernetes/create) | Provider-specific steps for K8s | | [Slurm partitions](/platform/clusters/Slurm/partitions) | Managing job queues | | [Kubernetes node groups](/platform/clusters/Kubernetes/node-groups) | Managing compute pools | +| [On-Premises clusters](/platform/clusters/On-Premises) | Agent-based, Multipass, and Juju setup guides | :::tip Cloud cluster nodes bill while running — even idle ones. Set the maximum node count for each partition or node group conservatively during setup. You can raise it when you're ready to run jobs. diff --git a/docs/platform/clusters/index.md b/docs/platform/clusters/index.md index cc38d6d..8e4249a 100644 --- a/docs/platform/clusters/index.md +++ b/docs/platform/clusters/index.md @@ -12,6 +12,7 @@ Vantage supports three cluster types: - **[Slurm](/platform/clusters/Slurm)** — Traditional HPC batch scheduler. Best for simulations, MPI workloads, batch pipelines, and any workload that needs a queue-based scheduler with fine-grained partition control. - **[Kubernetes](/platform/clusters/Kubernetes)** — Managed platform cluster for Workbench sessions, ML training, model serving, and containerized workloads. Runs MicroK8s with Vantage-managed control plane, autoscaling, and observability. - **[Slurm on Kubernetes](/platform/clusters/Kubernetes#slurm-on-kubernetes)** — A Slurm scheduler deployed inside an existing Kubernetes cluster. 
Gives you HPC scheduling on cloud-native, auto-scaled infrastructure without managing a separate Slurm controller fleet. +- **[On-Premises](/platform/clusters/On-Premises)** — Connect your own infrastructure using the Vantage Agent, Multipass, or Juju (Charmed HPC). Agent-based clusters support both Slurm and Kubernetes. Multipass and Juju provide local Slurm environments for development and testing. ## Getting started @@ -22,6 +23,7 @@ Start with the [quickstart](/platform/clusters/get-started) to create your first | Batch HPC jobs with Slurm | [Slurm overview](/platform/clusters/Slurm) | | Interactive ML development | [Kubernetes overview](/platform/clusters/Kubernetes) | | HPC on cloud-native infra | [Slurm on Kubernetes](/platform/clusters/Kubernetes#slurm-on-kubernetes) | +| On-premises HPC without cloud | [On-Premises overview](/platform/clusters/On-Premises) | ## Supported providers @@ -33,11 +35,21 @@ Vantage provisions clusters on six infrastructure types: | Microsoft Azure | Yes | Yes | — | | Google Cloud Platform | Yes | Yes | — | | Cudo Compute | Yes | Yes | Yes | -| On-premises / LXD | Yes | Yes | — | +| On-premises / LXD / Multipass / Juju | Yes | Yes | — | | Vantage partners (atNorth, BuzzHPC, RCI) | Yes | Yes | — | Not every combination is available. See the provider-specific pages for details. +## On-premises clusters + +On-premises clusters run on infrastructure you control. Vantage supports three methods: + +- **Agent-based** — Connect existing servers via the Vantage Agent. Supports both Slurm and Kubernetes. +- **Multipass** — Single-node Slurm cluster in a local VM. Terminal only. +- **Juju (Charmed HPC)** — Multi-node Slurm cluster in LXD containers. Terminal only. + +See [On-Premises clusters](/platform/clusters/On-Premises) for setup guides. + ## How clusters relate to other Vantage concepts - **Cloud accounts** are the credential bindings that let Vantage provision infrastructure. One account backs multiple clusters. 
See [Compute Providers](/platform/compute-providers). diff --git a/docs/platform/clusters/troubleshooting.mdx b/docs/platform/clusters/troubleshooting.mdx index 92737d8..c166aa5 100644 --- a/docs/platform/clusters/troubleshooting.mdx +++ b/docs/platform/clusters/troubleshooting.mdx @@ -36,7 +36,7 @@ If the SSH key name dropdown is empty: - **Cloud clusters** — Verify the minimum size is set to at least 1 if you expect nodes to always be present. Autoscaling scales to zero when `min_size = 0`. - **Slurm on Kubernetes** — Ensure the parent K8s cluster has sufficient capacity and the autoscaler is enabled. The AWS autoscaler uses EC2 Fleet to provision instances. -- **On-premises clusters** — Nodes must be registered manually on your infrastructure. Run the agent installation command on each node. +- **On-premises clusters** — Nodes must be registered manually on your infrastructure. Install the Vantage Agent on each node. See [Agent-based on-premises clusters](/platform/clusters/On-Premises/agent-based) for details. ## Cluster not appearing in the list diff --git a/docs/platform/compute-providers/on-premises/index.md b/docs/platform/compute-providers/on-premises/index.md index fe89193..bb27d11 100644 --- a/docs/platform/compute-providers/on-premises/index.md +++ b/docs/platform/compute-providers/on-premises/index.md @@ -32,6 +32,26 @@ LXD requires three credentials: the server URL, a client certificate, and a clie 1. Enter an **Account name** and optional description. 1. Submit. +## Multipass + +Multipass provides lightweight Ubuntu VMs for local development and testing. Use it to create a single-node Slurm cluster on your machine without cloud infrastructure. + +1. Install Multipass: `sudo snap install multipass` +2. Deploy a Slurm cluster using the Vantage CLI: `vantage app deploy slurm-multipass-localhost` + +See [Multipass clusters](/platform/clusters/On-Premises/multipass) for the full setup guide. 
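After the deploy command in step 2, it can help to confirm that the VM actually came up before connecting to it. The sketch below is an illustration under stated assumptions: `vm_is_running` is a hypothetical helper, the VM name `slurm-node` is taken from the Multipass setup guide, and the parsing assumes that `multipass list --format csv` emits `Name` and `State` as its first two columns.

```bash
#!/usr/bin/env bash
# Sketch only: vm_is_running is a hypothetical helper, and the CSV column
# order (Name first, State second) is an assumption about multipass output.

# vm_is_running NAME
# stdin: CSV from `multipass list --format csv`
# Returns 0 only if a row for NAME exists with State equal to "Running".
vm_is_running() {
  awk -F',' -v name="$1" '
    $1 == name && $2 == "Running" { found = 1 }
    END { exit !found }
  '
}
```

A typical call would be `multipass list --format csv | vm_is_running slurm-node && echo "VM is up"`. Keeping the parser separate from the `multipass` call means the check can be exercised with canned output on a machine where Multipass is not installed.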
+ +## Juju (Charmed HPC) + +Juju deploys multi-node Slurm clusters using charms on LXD containers. Use it to simulate a production HPC environment on your local machine. + +1. Install Juju: `sudo snap install juju --channel 3/stable` +2. Bootstrap a Juju controller: `juju bootstrap lxd` +3. Add a model: `juju add-model myslurmcluster` +4. Deploy the Slurm and Vantage charms: see [Juju clusters](/platform/clusters/On-Premises/juju) for the full charm list and configuration steps. + +See [Juju clusters](/platform/clusters/On-Premises/juju) for the full setup guide. + :::tip On-premises accounts carry no cloud spend per se — but your hardware runs continuously. Idle nodes still consume power, cooling, and rack space. Plan capacity around your actual workload schedule. ::: diff --git a/external/vantage-cli b/external/vantage-cli index 9321f65..de51175 160000 --- a/external/vantage-cli +++ b/external/vantage-cli @@ -1 +1 @@ -Subproject commit 9321f65dc89fa91b47fcf2153ffb400c43ae6f58 +Subproject commit de511756d858001fe9e9bb65fd9620f05f98ec04 diff --git a/sidebars-main.js b/sidebars-main.js index 6df7a40..9e9ae0a 100644 --- a/sidebars-main.js +++ b/sidebars-main.js @@ -119,6 +119,18 @@ module.exports = { {type: 'doc', id: 'platform/clusters/Kubernetes/reference'}, ], }, + { + type: 'category', + label: 'On-Premises', + collapsible: true, + collapsed: true, + link: {type: 'doc', id: 'platform/clusters/On-Premises/index'}, + items: [ + {type: 'doc', id: 'platform/clusters/On-Premises/agent-based'}, + {type: 'doc', id: 'platform/clusters/On-Premises/multipass'}, + {type: 'doc', id: 'platform/clusters/On-Premises/juju'}, + ], + }, ], }, {