|
1 | | -# Platform Setup Guide |
| 1 | +# Operator Guides |
2 | 2 |
|
3 | | -This guide describes how to configure **Michelangelo server components** in Kubernetes cluster. It focuses on the **configuration surfaces** (ConfigMaps, fields, and key parameters). |
| 3 | +These guides cover deploying, configuring, and integrating Michelangelo in a Kubernetes environment. They target platform engineers and infrastructure operators who are responsible for running Michelangelo in production and for connecting it to the broader ML infrastructure their teams already use — experiment tracking, model registries, compute clusters, schedulers, and serving frameworks. |
4 | 4 |
|
5 | | -# Overview |
| 5 | +## Getting Started |
6 | 6 |
|
7 | | -Michelangelo consists of four core server components: |
| 7 | +For a fresh deployment, follow this recommended reading order: |
8 | 8 |
|
9 | | -1. **API Server** – Central gRPC API |
10 | | -2. **Controller Manager** – Kubernetes controllers |
11 | | -3. **Worker** – Workflow execution (Temporal workers + compute integration) |
12 | | -4. **UI + Envoy** – Frontend and proxy |
| 9 | +1. **[Platform Setup](platform-setup.md)** — configure each component (API server, controller manager, worker, UI/Envoy) via ConfigMaps and Kustomize overlays |
| 10 | +2. **[Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md)** — connect an existing Kubernetes cluster so Michelangelo can dispatch Ray and Spark jobs to it |
| 11 | +3. **[Cluster Setup for Serving](serving/cluster-setup.md)** — enable model inference on a local or remote cluster |
| 12 | +4. **[Authentication](authentication.md)** — connect an identity provider and configure RBAC before opening to users |
13 | 13 |
|
14 | | -Each component exposes server-side configuration through ConfigMaps and overlays. |
| 14 | +## Platform Configuration |
15 | 15 |
|
16 | | -This document explains: |
| 16 | +| Guide | Description | |
| 17 | +|-------|-------------| |
| 18 | +| [Platform Setup](platform-setup.md) | ConfigMaps and key fields for API server, controller manager, worker, and UI/Envoy | |
| 19 | +| [API Framework](api-framework.md) | Architecture overview of the Michelangelo API and control plane | |
17 | 20 |
|
18 | | -* Where each component's configuration lives |
19 | | -* What fields can be customized |
20 | | -* What each field means |
21 | | -* How to apply changes using Kustomize overlays |
| 21 | +## Jobs & Compute |
22 | 22 |
|
23 | | -# Michelangelo Service architecture diagram |
| 23 | +| Guide | Description | |
| 24 | +|-------|-------------| |
| 25 | +| [Jobs Overview](jobs/index.md) | Ray and Spark job lifecycle, compute selection, and observability | |
| 26 | +| [Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md) | Connect an existing Kubernetes cluster to the Michelangelo control plane | |
| 27 | +| [Run a Pipeline on a Compute Cluster](jobs/run-uniflow-pipeline-on-compute-cluster.md) | Submit and monitor a Uniflow pipeline on a registered cluster | |
| 28 | +| [Extend the Job Scheduler](jobs/extend-michelangelo-batch-job-scheduler-system.md) | Custom scheduling backends (Kueue, Volcano) and assignment strategies | |
24 | 29 |
|
25 | | -The following diagram shows the relationship between each of the services in Michelangelo eco-system. |
| 30 | +## Model Serving |
26 | 31 |
|
27 | | - |
| 32 | +| Guide | Description | |
| 33 | +|-------|-------------| |
| 34 | +| [Serving Overview](serving/index.md) | InferenceServer and Deployment lifecycle, architecture | |
| 35 | +| [Cluster Setup for Serving](serving/cluster-setup.md) | Configure a cluster for inference | |
| 36 | +| [Integrate a Custom Backend](serving/integrate-custom-backend.md) | Plugin interfaces for Triton, vLLM, TensorRT-LLM, and custom frameworks | |
28 | 37 |
|
29 | | -# Server Configuration |
| 38 | +## UI |
30 | 39 |
|
31 | | -## API Server Configuration |
| 40 | +| Guide | Description | |
| 41 | +|-------|-------------| |
| 42 | +| [Deploying the UI](ui/deploying-michelangelo-ui.md) | Deploy the Michelangelo web UI to Kubernetes | |
| 43 | +| [Local UI Development](ui/local-development-setup.md) | Run the UI locally for development | |
32 | 44 |
|
33 | | -### **Key Fields** |
| 45 | +## Integrating with Your ML Stack |
34 | 46 |
|
35 | | -```yaml |
36 | | -apiserver: |
37 | | - yarpc: |
38 | | - host: 0.0.0.0 |
39 | | - port: 15566 |
40 | | - k8s: |
41 | | - qps: 300 |
42 | | - burst: 600 |
43 | | - metadataStorage: |
44 | | - enableMetadataStorage: false |
45 | | - crdSync: |
46 | | - enableCRDUpdate: true |
47 | | - enableIncompatibleUpdate: false |
48 | | -``` |
| 47 | +Michelangelo is designed to be adopted alongside existing ML infrastructure. These guides cover how to connect Michelangelo to the systems your teams already use. |
49 | 48 |
|
50 | | -### **Field Explanations** |
| 49 | +| Guide | Description | |
| 50 | +|-------|-------------| |
| 51 | +| [Custom Serving Backend](serving/integrate-custom-backend.md) | Add support for any inference framework — Triton, vLLM, TensorRT-LLM, or your own | |
| 52 | +| [Custom Job Scheduler](jobs/extend-michelangelo-batch-job-scheduler-system.md) | Replace or extend the job scheduler — integrate Kueue, Volcano, or a custom assignment strategy | |
| 53 | +| [Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md) | Connect an existing Kubernetes cluster so Michelangelo can dispatch jobs to it | |
51 | 54 |
|
52 | | -| Field | Description | |
53 | | -| ----- | ----- | |
54 | | -| `yarpc.host/port` | gRPC bind address + port | |
55 | | -| `k8s.qps/burst` | Throttling limits for Kubernetes API calls | |
56 | | -| `enableMetadataStorage` | Enables metadata persistence | |
57 | | -| `enableCRDUpdate` | Controls whether CRDs can be sync'd | |
58 | | -| `enableIncompatibleUpdate` | Allows breaking CRD changes (use only during major migrations) | |
| 55 | +## Operations |
59 | 56 |
|
60 | | -## Controller Manager Configuration |
| 57 | +| Guide | Description | |
| 58 | +|-------|-------------| |
| 59 | +| [Authentication](authentication.md) | OIDC identity provider setup, RBAC, session configuration, multi-tenant isolation | |
| 60 | +| [Compliance](compliance.md) | SOC 2, GDPR, and HIPAA configuration | |
| 61 | +| [Troubleshooting](troubleshooting.md) | Common failure modes and `kubectl` diagnostic commands | |
61 | 62 |
|
62 | | -### **Key Fields** |
63 | | - |
64 | | -```yaml |
65 | | -controllermgr: |
66 | | - metricsBindAddress: 8091 |
67 | | - healthProbeBindAddress: 8081 |
68 | | - leaderElection: false |
69 | | - leaderElectionID: michelangelo.your-organization.com |
70 | | - port: 9443 |
71 | | -
|
72 | | -controllers: |
73 | | - rayCluster: |
74 | | - k8sQps: 300 |
75 | | - k8sBurst: 600 |
76 | | -
|
77 | | -minio: |
78 | | - awsRegion: ap-southeast-1 |
79 | | - awsEndpointUrl: s3.ap-southeast-1.amazonaws.com |
80 | | - useIam: true |
81 | | -
|
82 | | -workflowClient: |
83 | | - service: temporal-frontend |
84 | | - host: temporal.your-domain.com:7233 |
85 | | - transport: grpc |
86 | | - domain: uniflow |
87 | | -``` |
88 | | - |
89 | | -### **Field Explanations** |
90 | | - |
91 | | -| Field | Description | |
92 | | -| ----- | ----- | |
93 | | -| `metricsBindAddress` | Controller metrics port | |
94 | | -| `healthProbeBindAddress` | Health check port | |
95 | | -| `leaderElection` | Enable for production HA | |
96 | | -| `minio.*` | S3 / MinIO backend configuration | |
97 | | -| `workflowClient.*` | Temporal client configuration | |
98 | | -| `controllers.*` | Each controller components' configuration | |
99 | | - |
100 | | -## Worker Configuration |
101 | | - |
102 | | -### **Key Fields** |
103 | | - |
104 | | -```yaml |
105 | | -worker: |
106 | | - address: michelangelo-apiserver.your-domain.com:443 |
107 | | - maApiServiceName: ma-apiserver |
108 | | - useTLS: true |
109 | | -
|
110 | | -logging: |
111 | | - level: info |
112 | | - development: true |
113 | | - encoding: console |
114 | | -
|
115 | | -workflow-engine: |
116 | | - host: temporal.your-domain.com:7233 |
117 | | - transport: grpc |
118 | | - provider: temporal |
119 | | - workers: |
120 | | - - domain: default |
121 | | - taskList: production-uniflow |
122 | | - client: |
123 | | - domain: uniflow |
124 | | -``` |
125 | | - |
126 | | -### **Field Explanations** |
127 | | - |
128 | | -| Field | Description | |
129 | | -| ----- | ----- | |
130 | | -| `worker.address` | API server endpoint used by workers | |
131 | | -| `workflow-engine.host` | Temporal endpoint | |
132 | | -| `workers[].taskList` | Worker task list to poll | |
133 | | -| `client.domain` | Temporal workflow domain | |
134 | | - |
135 | | -## UI & Envoy Configuration |
136 | | - |
137 | | -### **Envoy Proxy** |
138 | | - |
139 | | -**ConfigMap:** |
140 | | - |
141 | | -```yaml |
142 | | -static_resources: |
143 | | - listeners: |
144 | | - - address: |
145 | | - socket_address: { address: 0.0.0.0, port_value: 8081 } |
146 | | -
|
147 | | - clusters: |
148 | | - - name: michelangelo-apiserver |
149 | | - load_assignment: |
150 | | - endpoints: |
151 | | - - lb_endpoints: |
152 | | - - endpoint: |
153 | | - address: |
154 | | - socket_address: |
155 | | - address: michelangelo-apiserver |
156 | | - port_value: 15566 |
157 | | -``` |
158 | | - |
159 | | -### **Public UI Config** |
160 | | - |
161 | | -**ConfigMap:** |
162 | | - |
163 | | -```json |
164 | | -config.json: | |
165 | | - { |
166 | | - "apiBaseUrl": "https://michelangelo-envoy.<your-domain>" |
167 | | - } |
168 | | -``` |
169 | | - |
170 | | -## Environment Overrides / Domain Settings |
171 | | - |
172 | | -You must customize domain-specific values in overlays: |
173 | | - |
174 | | -| Location | Fields to Update | |
175 | | -| ----- | ----- | |
176 | | -| Worker ConfigMap | API server domain, compute domain, Temporal host | |
177 | | -| UI Public Config | `"apiBaseUrl"` | |
178 | | -| Envoy Config | CORS allowed origins, API cluster hostname | |
179 | | -| Ingress | Hostnames for API server & UI | |
180 | | -| Controller Manager | S3 region, endpoint, Temporal host | |
181 | | - |
182 | | -# Object Store Configuration |
183 | | - |
184 | | -Object storage (MinIO / S3) is used by Michelangelo for artifacts and metadata. |
185 | | - |
186 | | -## Controller Manager Object Store Settings |
187 | | - |
188 | | -These live in the controller manager ConfigMap: |
189 | | - |
190 | | -```yaml |
191 | | -minio: |
192 | | - awsRegion: sample # AWS region |
193 | | - awsEndpointUrl: sample.amazonaws.com |
194 | | - useIam: true # Use IAM roles for authentication |
195 | | -``` |
196 | | - |
197 | | -### **Fields** |
198 | | - |
199 | | -* `awsRegion` – The AWS region of your S3 bucket. |
200 | | -* `awsEndpointUrl` – S3 endpoint (`s3.amazonaws.com` or regional endpoint). |
201 | | -* `useIam` – Set to `true` in production (do not hardcode keys in config). |
202 | | - |
203 | | -## **Storage Setup Checklist (from original guide)** |
204 | | - |
205 | | -* Configure **AWS credentials/IAM roles** for pods that need S3 access. |
206 | | -* Verify **region and endpoint** in the ConfigMap match your S3 setup. |
207 | | -* Test connectivity from worker/controller pods to the bucket. |
208 | | - |
209 | | -# Workflow Engine Configuration (Temporal/Cadence) |
210 | | - |
211 | | -Michelangelo uses a workflow engine (Temporal or Cadence) for orchestrating workflows. Most of your current guide examples use **Temporal**, and Cadence is used in sandbox/dev. |
212 | | - |
213 | | -## Controller Manager Workflow Client |
214 | | - |
215 | | -From `controllermgr-configmap.yaml`: |
216 | | - |
217 | | -```yaml |
218 | | -workflowClient: |
219 | | - service: temporal-frontend # Temporal service name |
220 | | - host: temporal.your-domain.com:7233 # Temporal endpoint |
221 | | - transport: grpc # Transport protocol |
222 | | - domain: uniflow # Temporal domain |
223 | | -``` |
224 | | - |
225 | | -### **Fields** |
226 | | - |
227 | | -* `service` – Workflow engine frontend service name (`temporal-frontend` / `cadence-frontend`). |
228 | | -* `host` – Full endpoint (host:port). |
229 | | -* `transport` – Typically `grpc`. |
230 | | -* `domain` – Temporal domain (or Cadence domain) to target. |
231 | | - |
232 | | -## Worker Workflow Engine Settings |
233 | | - |
234 | | -From `worker-configmap.yaml`: |
235 | | - |
236 | | -```yaml |
237 | | -workflow-engine: |
238 | | - host: temporal(/cadence).your-domain.com:7233 |
239 | | - transport: grpc |
240 | | - provider: temporal/cadence |
241 | | - workers: |
242 | | - - domain: default |
243 | | - taskList: production-uniflow |
244 | | - client: |
245 | | - domain: uniflow |
246 | | -``` |
247 | | - |
248 | | -### Fields |
249 | | - |
250 | | -* `provider` – `temporal` (or **cadence**); can be extended to `cadence` if needed. |
251 | | -* `host` – Temporal/Cadence endpoint. |
252 | | -* `workers[].domain` – Domain where worker polls for tasks. |
253 | | -* `workers[].taskList` – Task list (queue) used for workflow tasks. |
254 | | -* `client.domain` – Client domain for starting workflows. |
255 | | - |
256 | | -## Temporal Setup (from original external dependencies) |
257 | | - |
258 | | -* Ensure Temporal is accessible at the configured endpoint. |
259 | | -* Create required domains (`uniflow`, `default`, `production-uniflow`). |
260 | | -* Configure task lists such as `production-uniflow`. |
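
The removed "Environment Overrides / Domain Settings" table says that domain-specific values must be customized in overlays. A minimal sketch of how an operator might check that a rendered ConfigMap no longer carries the sample values follows. It assumes the placeholders are the ones shown in the removed examples (`your-domain.com`, `<your-domain>`, `your-organization.com`). The function name `find_placeholders` and the sample text are hypothetical and not part of Michelangelo.

```python
import re

# Sample values copied from the removed guide's examples; extend as needed.
PLACEHOLDER_PATTERNS = [
    r"your-domain\.com",
    r"<your-domain>",
    r"your-organization\.com",
]


def find_placeholders(config_text: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) pairs that still contain a sample value."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        if any(re.search(pattern, line) for pattern in PLACEHOLDER_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# Hypothetical rendered worker config: one value customized, one left as a sample.
sample = """\
worker:
  address: michelangelo-apiserver.your-domain.com:443
  useTLS: true
workflow-engine:
  host: temporal.internal.example.net:7233
"""

hits = find_placeholders(sample)
for number, line in hits:
    print(f"line {number}: {line}")
```

Running it against the output of a Kustomize build before applying would flag any overlay that still points at a sample hostname.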