Commit adcb692 (parent 4831f81)

Authored by zhoward-1, claude, austingreco

docs: add network and ingress configuration guide (#1044)

## Summary

- Adds `docs/operator-guides/network.md`
- Envoy proxy: full CORS configuration with annotated YAML, TLS termination at Envoy vs at Ingress
- Ingress: NGINX Ingress setup for both the gRPC API server (with `backend-protocol: GRPC` and HTTP/2 notes) and the UI/Envoy proxy
- TLS: cert-manager setup for both Let's Encrypt (ACME HTTP-01) and an internal CA
- Multi-cluster: topology diagram, connectivity requirements table (controller manager → compute K8s API, task pods → MA API server, S3), NetworkPolicy snippet
- Deployment checklist

## Why

The platform-setup doc has a table of "domain settings to update" but no guide on how to actually configure Ingress, TLS, or Envoy CORS — blocking non-Uber adopters who need to deploy into their own infrastructure. This was priority 7 in the original proposal doc.

## Test plan

- [ ] Verify all internal cross-links resolve
- [ ] Confirm NGINX Ingress annotation names are correct
- [ ] Confirm cert-manager YAML is valid against the current cert-manager API version
- [ ] Confirm Envoy config structure is valid Envoy v3 xDS

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Austin Greco <austingreco@gmail.com>

1 file changed: docs/operator-guides/network.md (366 additions, 0 deletions)
# Network & Ingress Configuration

This guide covers the network configuration required to deploy Michelangelo in a Kubernetes cluster: Envoy proxy settings (CORS, cluster hostnames), Ingress setup for the API server and UI, TLS with cert-manager, and connectivity requirements for multi-cluster deployments.

---

## Overview

Michelangelo's network surface has two external-facing entry points:

| Entry Point | Default Port | Purpose |
|-------------|--------------|---------|
| API Server Ingress | 443 (HTTPS) | gRPC API used by the `ma` CLI, workers, and SDK |
| UI + Envoy Ingress | 443 (HTTPS) | Browser-facing UI and REST/gRPC-Web proxy |

Traffic flow from the public internet to internal components:

```
Internet
│
├─ api.your-domain.com ──► Ingress ──► michelangelo-apiserver:15566 (gRPC)
│
└─ app.your-domain.com ──► Ingress ──► michelangelo-envoy:8081
                                         │
                                         └─► michelangelo-apiserver:15566 (gRPC-Web)
```

---

## Envoy Proxy Configuration

The Envoy proxy sits in front of the API server for browser clients. It handles HTTP/1.1 → gRPC transcoding and CORS.

### CORS Configuration

Add your UI domain to Envoy's CORS allowed origins; this is required for the browser-based UI to call the API. In the Envoy ConfigMap:

```yaml
static_resources:
  listeners:
  - address:
      socket_address: { address: 0.0.0.0, port_value: 8081 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          route_config:
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              cors:
                allow_origin_string_match:
                - safe_regex:
                    regex: "https://app\\.your-domain\\.com"
                allow_methods: "GET, POST, OPTIONS"
                allow_headers: "content-type, context-ttl-ms, grpc-timeout, rpc-caller, rpc-encoding, rpc-service, x-grpc-web, x-user-agent"
                expose_headers: "grpc-status, grpc-message"
                max_age: "1728000"
              routes:
              - match: { prefix: "/" }
                route:
                  cluster: michelangelo-apiserver
                  max_grpc_timeout: 0s
          http_filters:
          - name: envoy.filters.http.grpc_web
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb
          - name: envoy.filters.http.cors
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
  - name: michelangelo-apiserver
    connect_timeout: 30s
    type: LOGICAL_DNS
    http2_protocol_options: {}
    load_assignment:
      cluster_name: michelangelo-apiserver
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: michelangelo-apiserver  # Kubernetes service name
                port_value: 15566
```

**Fields to customize per environment:**

| Field | Description |
|-------|-------------|
| `allow_origin_string_match.regex` | Replace with your UI domain regex |
| `socket_address.address` | API server Kubernetes service name (default: `michelangelo-apiserver`) |
| `socket_address.port_value` | API server port (default: `15566`) |
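
To confirm the CORS policy is live after rolling out the ConfigMap, you can send a preflight request from outside the cluster and inspect the response headers. This is a sketch using the placeholder `app.your-domain.com` domain from this guide; substitute your real hosts.

```shell
# Send a CORS preflight (OPTIONS) through the UI/Envoy endpoint.
# A correctly configured Envoy echoes the origin back in
# access-control-allow-origin and lists the allowed methods and headers.
curl -sk -o /dev/null -D - \
  -X OPTIONS \
  -H "Origin: https://app.your-domain.com" \
  -H "Access-Control-Request-Method: POST" \
  -H "Access-Control-Request-Headers: content-type,x-grpc-web" \
  https://app.your-domain.com/ | grep -i "access-control"
```

If no `access-control-*` headers come back, the origin did not match `allow_origin_string_match` (check the regex escaping first).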

### Envoy TLS Termination

If you terminate TLS at the Envoy pod (rather than at the Ingress), add a `transport_socket` to the listener:

```yaml
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    common_tls_context:
      tls_certificates:
      - certificate_chain:
          filename: /etc/ssl/certs/tls.crt
        private_key:
          filename: /etc/ssl/certs/tls.key
```

Mount the certificate from a Kubernetes Secret:

```yaml
volumes:
- name: tls-cert
  secret:
    secretName: michelangelo-envoy-tls
volumeMounts:
- name: tls-cert
  mountPath: /etc/ssl/certs
  readOnly: true
```
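
The `michelangelo-envoy-tls` Secret referenced above can be created from existing certificate files. This is a sketch assuming you already have `tls.crt` and `tls.key` on disk; with cert-manager (below), the Secret is created and renewed for you instead.

```shell
# Create the TLS Secret consumed by the Envoy pod's volume mount.
kubectl create secret tls michelangelo-envoy-tls \
  --cert=tls.crt \
  --key=tls.key \
  -n michelangelo
```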

In most deployments, TLS is terminated at the Ingress layer instead — see [TLS with cert-manager](#tls-with-cert-manager) below.

---

## Ingress Setup

### API Server Ingress

The API server uses gRPC (HTTP/2). Your Ingress controller must support HTTP/2 backend connections. With the NGINX Ingress Controller:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: michelangelo-apiserver
  namespace: michelangelo
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.your-domain.com
    secretName: michelangelo-apiserver-tls
  rules:
  - host: api.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: michelangelo-apiserver
            port:
              number: 15566
```

> **HTTP/2 requirement:** gRPC requires HTTP/2 end-to-end. If your Ingress controller terminates TLS but connects to the backend over HTTP/1.1, gRPC calls will fail. Ensure `backend-protocol: GRPC` (or its equivalent) is set.
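
Once the Ingress is up, you can check that gRPC traffic traverses it end-to-end with `grpcurl`. This sketch assumes the API server has gRPC reflection enabled (required for `list`); if it does not, invoke a known RPC instead.

```shell
# List gRPC services over HTTP/2 + TLS through the Ingress.
# A protocol or HTTP/1.1 error here usually means the backend-protocol
# annotation is missing or the controller lacks HTTP/2 backend support.
grpcurl api.your-domain.com:443 list
```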

### UI + Envoy Ingress

The UI and gRPC-Web proxy share a single Ingress:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: michelangelo-ui
  namespace: michelangelo
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - app.your-domain.com
    secretName: michelangelo-ui-tls
  rules:
  - host: app.your-domain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: michelangelo-envoy
            port:
              number: 8081
```
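
A quick smoke test for this Ingress is to confirm the UI answers over HTTPS at the expected host:

```shell
# Print only the HTTP status; the UI should return 200.
curl -sk -o /dev/null -w "%{http_code}\n" https://app.your-domain.com/
```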

### Domain Names to Update in Overlays

After setting hostnames in the Ingress resources, propagate them through the ConfigMaps:

| Location | Field | Value |
|----------|-------|-------|
| Worker ConfigMap | `worker.address` | `api.your-domain.com:443` |
| UI Public Config | `apiBaseUrl` | `https://app.your-domain.com` |
| Envoy CORS config | `allow_origin_string_match.regex` | Your UI domain |

See [Platform Setup — Environment Overrides](index.md#environment-overrides--domain-settings) for the full list.

---

## TLS with cert-manager

Use cert-manager to automate TLS certificate provisioning and renewal. Install cert-manager if it is not already present:

```bash
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
```
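
Before creating issuers, confirm the cert-manager control plane is ready. This assumes the standard deployment names in the default `cert-manager` namespace; adjust if you installed it elsewhere.

```shell
# All three cert-manager deployments must be Available before
# ClusterIssuers and Certificates will be reconciled.
kubectl wait --for=condition=Available --timeout=120s \
  -n cert-manager \
  deployment/cert-manager \
  deployment/cert-manager-webhook \
  deployment/cert-manager-cainjector
```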

### ClusterIssuer (Let's Encrypt)

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@your-domain.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
```

### Referencing the Issuer in Ingress

Add the cert-manager annotation to your Ingress resources:

```yaml
metadata:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
```

cert-manager will automatically create and renew the TLS Secret referenced in `spec.tls[].secretName`.
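
To confirm issuance succeeded, check that the Certificate objects cert-manager created from the Ingress annotations have reached `Ready` (ingress-shim names each Certificate after its `secretName`):

```shell
# READY should be True for each host; if it stays False, inspect the
# events on the Certificate and its CertificateRequest.
kubectl get certificate -n michelangelo
kubectl describe certificate michelangelo-apiserver-tls -n michelangelo
```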
256+
257+
### Using an Internal CA
258+
259+
For private clusters that cannot use ACME, use a `ClusterIssuer` backed by an internal CA:
260+
261+
```yaml
262+
apiVersion: cert-manager.io/v1
263+
kind: ClusterIssuer
264+
metadata:
265+
name: internal-ca
266+
spec:
267+
ca:
268+
secretName: internal-ca-key-pair # Secret containing the CA cert and key
269+
```
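
The `internal-ca-key-pair` Secret must hold a CA certificate and key, and for a `ClusterIssuer` it must live in cert-manager's cluster-resource namespace (`cert-manager` by default). A minimal sketch using a freshly generated self-signed CA; the CN is an arbitrary example value.

```shell
# Generate a 10-year self-signed root CA (example subject).
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout ca.key -out ca.crt -subj "/CN=michelangelo-internal-ca"

# Store it where the ClusterIssuer expects it:
# kubectl create secret tls internal-ca-key-pair \
#   --cert=ca.crt --key=ca.key -n cert-manager
```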
270+
271+
---
272+
273+
## Multi-Cluster Network Topology
274+
275+
When Michelangelo's control plane dispatches jobs to registered compute clusters, the following connectivity is required:
276+
277+
```
278+
Control Plane Cluster Compute Cluster
279+
┌────────────────────────┐ ┌──────────────────────────────┐
280+
│ Controller Manager │──── HTTPS ───►│ Kubernetes API server │
281+
│ (kubeconfig for each │ │ (port 443) │
282+
│ compute cluster) │ └──────────────────────────────┘
283+
│ │
284+
│ Worker │◄──── gRPC ────┤ Task pods (report back │
285+
│ (port 15566) │ │ via worker.address) │
286+
└────────────────────────┘ └──────────────────────────────┘
287+
```
288+
289+
### Required Connectivity
290+
291+
| Direction | Source | Destination | Port | Purpose |
292+
|-----------|--------|-------------|------|---------|
293+
| Outbound from control plane | Controller Manager | Compute cluster K8s API | 443 | Dispatching RayCluster / SparkApplication CRDs |
294+
| Outbound from compute | Task pods | Michelangelo API server | 443 | Worker connectivity for result reporting |
295+
| Outbound from compute | Task pods | S3 / object store | 443 | Artifact reads and writes |
296+
297+
### NetworkPolicy for Control Plane → Compute Cluster
298+
299+
If your compute cluster enforces NetworkPolicy, ensure the control plane's egress IP range can reach the Kubernetes API server:
300+
301+
> **Managed Kubernetes (EKS, GKE, AKS):** The API server runs outside the cluster on managed platforms and is not a schedulable pod. This NetworkPolicy only applies to self-managed clusters. For managed clusters, use your cloud provider's security groups or authorized networks instead.
302+
303+
```yaml
304+
apiVersion: networking.k8s.io/v1
305+
kind: NetworkPolicy
306+
metadata:
307+
name: allow-michelangelo-controller
308+
namespace: kube-system
309+
spec:
310+
podSelector:
311+
matchLabels:
312+
component: kube-apiserver
313+
policyTypes:
314+
- Ingress
315+
ingress:
316+
- from:
317+
- ipBlock:
318+
cidr: <control-plane-egress-cidr>/32
319+
ports:
320+
- protocol: TCP
321+
port: 443
322+
```
323+
324+
### Verifying Cross-Cluster Connectivity
325+
326+
From the Controller Manager pod, verify it can reach each registered compute cluster:
327+
328+
```bash
329+
# Exec into the controller manager pod
330+
kubectl exec -it deploy/michelangelo-controllermgr -n michelangelo -- /bin/sh
331+
332+
# Check connectivity to a registered compute cluster's K8s API
333+
curl -sk https://<compute-cluster-api-server>:443/healthz
334+
```
335+
336+
From a task pod in the compute cluster, verify it can reach the Michelangelo API server:
337+
338+
```bash
339+
kubectl exec -it <task-pod> -n <compute-namespace> -- \
340+
curl -sk https://api.your-domain.com/healthz
341+
```

---

## Checklist

Use this checklist when deploying Michelangelo to a new environment:

- [ ] Ingress controller installed and supports HTTP/2 (required for gRPC)
- [ ] API server Ingress created with `backend-protocol: GRPC`
- [ ] UI + Envoy Ingress created
- [ ] TLS certificates provisioned (cert-manager or manual)
- [ ] Envoy CORS `allow_origin` updated to match the UI domain
- [ ] Worker ConfigMap `worker.address` updated to `api.your-domain.com:443`
- [ ] UI `config.json` `apiBaseUrl` updated to the UI domain
- [ ] Cross-cluster connectivity verified (controller manager → compute K8s API)
- [ ] Task pod → Michelangelo API server connectivity verified

---

## Related

- [Platform Setup — Environment Overrides](index.md#environment-overrides--domain-settings)
- [Authentication](authentication.md)
- [Register a Compute Cluster](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md)
- [Troubleshooting](troubleshooting.md)
