
Commit 4831f81

zhoward-1, claude, and austingreco authored
docs: restructure operator guide index into navigation hub (#1036)
## Summary

The current `operator-guides/index.md` is the full platform configuration reference, not a navigation hub. Operators arriving at the guide have no entry point — they land in a wall of YAML with no context for where to start.

Changes:

- Rename `index.md` → `platform-setup.md` (content unchanged, just renamed)
- Create new `index.md` as a navigation hub with:
  - A getting-started reading path for fresh deployments (platform setup → register cluster → serving → auth)
  - Organized tables linking to every operator guide by category (platform config, jobs, serving, UI, integrations, operations, ingester)

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Austin Greco <austingreco@gmail.com>
1 parent 4b7702f commit 4831f81

2 files changed

Lines changed: 304 additions & 242 deletions


docs/operator-guides/index.md

Lines changed: 44 additions & 242 deletions
@@ -1,260 +1,62 @@
-# Platform Setup Guide
+# Operator Guides
 
-This guide describes how to configure **Michelangelo server components** in Kubernetes cluster. It focuses on the **configuration surfaces** (ConfigMaps, fields, and key parameters).
+These guides cover deploying, configuring, and integrating Michelangelo in a Kubernetes environment. They target platform engineers and infrastructure operators who are responsible for running Michelangelo in production and for connecting it to the broader ML infrastructure their teams already use — experiment tracking, model registries, compute clusters, schedulers, and serving frameworks.
 
-# Overview
+## Getting Started
 
-Michelangelo consists of four core server components:
+For a fresh deployment, follow this recommended reading order:
 
-1. **API Server** – Central gRPC API
-2. **Controller Manager** Kubernetes controllers
-3. **Worker** – Workflow execution (Temporal workers + compute integration)
-4. **UI + Envoy** – Frontend and proxy
+1. **[Platform Setup](platform-setup.md)** — configure each component (API server, controller manager, worker, UI/Envoy) via ConfigMaps and Kustomize overlays
+2. **[Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md)** — connect an existing Kubernetes cluster so Michelangelo can dispatch Ray and Spark jobs to it
+3. **[Cluster Setup for Serving](serving/cluster-setup.md)** — enable model inference on a local or remote cluster
+4. **[Authentication](authentication.md)** — connect an identity provider and configure RBAC before opening to users
 
-Each component exposes server-side configuration through ConfigMaps and overlays.
+## Platform Configuration
 
-This document explains:
+| Guide | Description |
+|-------|-------------|
+| [Platform Setup](platform-setup.md) | ConfigMaps and key fields for API server, controller manager, worker, and UI/Envoy |
+| [API Framework](api-framework.md) | Architecture overview of the Michelangelo API and control plane |
 
-* Where each component's configuration lives
-* What fields can be customized
-* What each field means
-* How to apply changes using Kustomize overlays
+## Jobs & Compute
 
-# Michelangelo Service architecture diagram
+| Guide | Description |
+|-------|-------------|
+| [Jobs Overview](jobs/index.md) | Ray and Spark job lifecycle, compute selection, and observability |
+| [Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md) | Connect an existing Kubernetes cluster to the Michelangelo control plane |
+| [Run a Pipeline on a Compute Cluster](jobs/run-uniflow-pipeline-on-compute-cluster.md) | Submit and monitor a Uniflow pipeline on a registered cluster |
+| [Extend the Job Scheduler](jobs/extend-michelangelo-batch-job-scheduler-system.md) | Custom scheduling backends (Kueue, Volcano) and assignment strategies |
 
-The following diagram shows the relationship between each of the services in Michelangelo eco-system.
+## Model Serving
 
-![Michelangelo Service Architecture](./images/ma-service-architecture.png)
+| Guide | Description |
+|-------|-------------|
+| [Serving Overview](serving/index.md) | InferenceServer and Deployment lifecycle, architecture |
+| [Cluster Setup for Serving](serving/cluster-setup.md) | Configure a cluster for inference |
+| [Integrate a Custom Backend](serving/integrate-custom-backend.md) | Plugin interfaces for Triton, vLLM, TensorRT-LLM, and custom frameworks |
 
-# Server Configuration
+## UI
 
-## API Server Configuration
+| Guide | Description |
+|-------|-------------|
+| [Deploying the UI](ui/deploying-michelangelo-ui.md) | Deploy the Michelangelo web UI to Kubernetes |
+| [Local UI Development](ui/local-development-setup.md) | Run the UI locally for development |
 
-### **Key Fields**
+## Integrating with Your ML Stack
 
-```yaml
-apiserver:
-yarpc:
-host: 0.0.0.0
-port: 15566
-k8s:
-qps: 300
-burst: 600
-metadataStorage:
-enableMetadataStorage: false
-crdSync:
-enableCRDUpdate: true
-enableIncompatibleUpdate: false
-```
+Michelangelo is designed to be adopted alongside existing ML infrastructure. These guides cover how to connect Michelangelo to the systems your teams already use.
 
-### **Field Explanations**
+| Guide | Description |
+|-------|-------------|
+| [Custom Serving Backend](serving/integrate-custom-backend.md) | Add support for any inference framework — Triton, vLLM, TensorRT-LLM, or your own |
+| [Custom Job Scheduler](jobs/extend-michelangelo-batch-job-scheduler-system.md) | Replace or extend the job scheduler — integrate Kueue, Volcano, or a custom assignment strategy |
+| [Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md) | Connect an existing Kubernetes cluster so Michelangelo can dispatch jobs to it |
 
-| Field | Description |
-| ----- | ----- |
-| `yarpc.host/port` | gRPC bind address + port |
-| `k8s.qps/burst` | Throttling limits for Kubernetes API calls |
-| `enableMetadataStorage` | Enables metadata persistence |
-| `enableCRDUpdate` | Controls whether CRDs can be sync'd |
-| `enableIncompatibleUpdate` | Allows breaking CRD changes (use only during major migrations) |
+## Operations
 
-## Controller Manager Configuration
+| Guide | Description |
+|-------|-------------|
+| [Authentication](authentication.md) | OIDC identity provider setup, RBAC, session configuration, multi-tenant isolation |
+| [Compliance](compliance.md) | SOC 2, GDPR, and HIPAA configuration |
+| [Troubleshooting](troubleshooting.md) | Common failure modes and `kubectl` diagnostic commands |
 
-### **Key Fields**
-
-```yaml
-controllermgr:
-metricsBindAddress: 8091
-healthProbeBindAddress: 8081
-leaderElection: false
-leaderElectionID: michelangelo.your-organization.com
-port: 9443
-
-controllers:
-rayCluster:
-k8sQps: 300
-k8sBurst: 600
-
-minio:
-awsRegion: ap-southeast-1
-awsEndpointUrl: s3.ap-southeast-1.amazonaws.com
-useIam: true
-
-workflowClient:
-service: temporal-frontend
-host: temporal.your-domain.com:7233
-transport: grpc
-domain: uniflow
-```
-
-### **Field Explanations**
-
-| Field | Description |
-| ----- | ----- |
-| `metricsBindAddress` | Controller metrics port |
-| `healthProbeBindAddress` | Health check port |
-| `leaderElection` | Enable for production HA |
-| `minio.*` | S3 / MinIO backend configuration |
-| `workflowClient.*` | Temporal client configuration |
-| `controllers.*` | Each controller components' configuration |
-
-## Worker Configuration
-
-### **Key Fields**
-
-```yaml
-worker:
-address: michelangelo-apiserver.your-domain.com:443
-maApiServiceName: ma-apiserver
-useTLS: true
-
-logging:
-level: info
-development: true
-encoding: console
-
-workflow-engine:
-host: temporal.your-domain.com:7233
-transport: grpc
-provider: temporal
-workers:
-- domain: default
-taskList: production-uniflow
-client:
-domain: uniflow
-```
-
-### **Field Explanations**
-
-| Field | Description |
-| ----- | ----- |
-| `worker.address` | API server endpoint used by workers |
-| `workflow-engine.host` | Temporal endpoint |
-| `workers[].taskList` | Worker task list to poll |
-| `client.domain` | Temporal workflow domain |
-
-## UI & Envoy Configuration
-
-### **Envoy Proxy**
-
-**ConfigMap:**
-
-```yaml
-static_resources:
-listeners:
-- address:
-socket_address: { address: 0.0.0.0, port_value: 8081 }
-
-clusters:
-- name: michelangelo-apiserver
-load_assignment:
-endpoints:
-- lb_endpoints:
-- endpoint:
-address:
-socket_address:
-address: michelangelo-apiserver
-port_value: 15566
-```
-
-### **Public UI Config**
-
-**ConfigMap:**
-
-```json
-config.json: |
-{
-"apiBaseUrl": "https://michelangelo-envoy.<your-domain>"
-}
-```
-
-## Environment Overrides / Domain Settings
-
-You must customize domain-specific values in overlays:
-
-| Location | Fields to Update |
-| ----- | ----- |
-| Worker ConfigMap | API server domain, compute domain, Temporal host |
-| UI Public Config | `"apiBaseUrl"` |
-| Envoy Config | CORS allowed origins, API cluster hostname |
-| Ingress | Hostnames for API server & UI |
-| Controller Manager | S3 region, endpoint, Temporal host |
-
-# Object Store Configuration
-
-Object storage (MinIO / S3) is used by Michelangelo for artifacts and metadata.
-
-## Controller Manager Object Store Settings
-
-These live in the controller manager ConfigMap:
-
-```yaml
-minio:
-awsRegion: sample # AWS region
-awsEndpointUrl: sample.amazonaws.com
-useIam: true # Use IAM roles for authentication
-```
-
-### **Fields**
-
-* `awsRegion` – The AWS region of your S3 bucket.
-* `awsEndpointUrl` – S3 endpoint (`s3.amazonaws.com` or regional endpoint).
-* `useIam` – Set to `true` in production (do not hardcode keys in config).
-
-## **Storage Setup Checklist (from original guide)**
-
-* Configure **AWS credentials/IAM roles** for pods that need S3 access.
-* Verify **region and endpoint** in the ConfigMap match your S3 setup.
-* Test connectivity from worker/controller pods to the bucket.
-
-# Workflow Engine Configuration (Temporal/Cadence)
-
-Michelangelo uses a workflow engine (Temporal or Cadence) for orchestrating workflows. Most of your current guide examples use **Temporal**, and Cadence is used in sandbox/dev.
-
-## Controller Manager Workflow Client
-
-From `controllermgr-configmap.yaml`:
-
-```yaml
-workflowClient:
-service: temporal-frontend # Temporal service name
-host: temporal.your-domain.com:7233 # Temporal endpoint
-transport: grpc # Transport protocol
-domain: uniflow # Temporal domain
-```
-
-### **Fields**
-
-* `service` – Workflow engine frontend service name (`temporal-frontend` / `cadence-frontend`).
-* `host` – Full endpoint (host:port).
-* `transport` – Typically `grpc`.
-* `domain` – Temporal domain (or Cadence domain) to target.
-
-## Worker Workflow Engine Settings
-
-From `worker-configmap.yaml`:
-
-```yaml
-workflow-engine:
-host: temporal(/cadence).your-domain.com:7233
-transport: grpc
-provider: temporal/cadence
-workers:
-- domain: default
-taskList: production-uniflow
-client:
-domain: uniflow
-```
-
-### Fields
-
-* `provider` – `temporal` (or **cadence**); can be extended to `cadence` if needed.
-* `host` – Temporal/Cadence endpoint.
-* `workers[].domain` – Domain where worker polls for tasks.
-* `workers[].taskList` – Task list (queue) used for workflow tasks.
-* `client.domain` – Client domain for starting workflows.
-
-## Temporal Setup (from original external dependencies)
-
-* Ensure Temporal is accessible at the configured endpoint.
-* Create required domains (`uniflow`, `default`, `production-uniflow`).
-* Configure task lists such as `production-uniflow`.
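
The new `index.md` in this commit links to every guide by relative path, so a renamed or missing target would surface as a broken link. A minimal sketch of a checker for that case follows. It assumes inline Markdown links only (reference-style links are not handled), and the function names `relative_links` and `missing_links` are hypothetical helpers for illustration, not part of the Michelangelo repository.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
# Image links (![alt](target)) also match, which is fine for an existence check.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")

# A target that starts with a URL scheme such as "https:" or "mailto:".
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def relative_links(markdown: str) -> list[str]:
    """Return link targets that look like relative file paths."""
    targets = []
    for target in LINK_RE.findall(markdown):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue  # skip absolute URLs and in-page anchors
        targets.append(target.split("#", 1)[0])  # drop any trailing fragment
    return targets


def missing_links(index_path: Path) -> list[str]:
    """Return relative link targets in index_path that do not exist on disk."""
    base = index_path.parent
    text = index_path.read_text(encoding="utf-8")
    return [t for t in relative_links(text) if not (base / t).exists()]
```

Under these assumptions, calling `missing_links(Path("docs/operator-guides/index.md"))` from the repository root would return the link targets that do not resolve relative to `docs/operator-guides/`; an empty list means every relative link points at an existing file.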
