relay caddy front#240
Conversation
build_enroll_app now accepts relay_domain keyword arg (required) and mounts GET /tls-check?domain=<sni>; returns 200 for enroll and per-device subdomains under relay_domain, 403 otherwise. All callers updated.
…k route - Change default --host from 0.0.0.0 to 127.0.0.1 (Caddy fronts it) - Add --relay-domain option (default $RELAY_DOMAIN) fed to build_enroll_app - Extract module-level _run_uvicorn and _enroll_server_impl seams for testability - Add tests/cli/test_relay_enroll_server.py with loopback + env-var default tests
…ptional override
…oopback enroll-server)
… route Whole-branch review caught scripts/relay-bootstrap.sh still creating the removed acme/ volume and omitting the caddy-data/caddy-config dirs the docs promise, plus a troubleshooting curl pointing at a /health route the enroll-app doesn't have. Claude-Session: https://claude.ai/code/session_01BCH6QFgBdw9LiWBzV39Xnj
Greptile SummaryThis PR puts the relay's device-enrollment endpoint behind Caddy on port 443, replacing the raw
Confidence Score: 5/5Safe to merge. The ingress topology change is well-scoped and the security-critical allowlist predicate has a thorough test matrix covering all bypass vectors. All changed paths are covered by new tests (tls_ask allow/deny matrix, /tls-check route, enroll-server loopback wiring, enroll URL precedence). The two issues flagged in prior review rounds — the _SERVE_LABEL lone-hyphen base and the unmanaged enroll-server dependency — are both addressed: the regex now anchors on [a-z0-9], and the bootstrap script installs a systemd supervisor. The Caddyfile routing correctly blocks public access to /tls-check, and the network_mode: host + 127.0.0.1:8080 port binding topology is internally consistent. No files require special attention. The Caddyfile could not be machine-validated in CI (noted in the PR test plan), so a caddy validate run on the relay host before going live is worth doing. Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant D as New Device
participant C as Caddy (:443)
participant E as enroll-server (127.0.0.1:47180)
participant S as sish (127.0.0.1:8080)
participant O as Owner
Note over C,E: On-demand TLS ask (internal loopback)
D->>C: HTTPS SNI enroll.relay.example.com
C->>E: "GET /tls-check?domain=enroll.relay.example.com"
E-->>C: 200 OK (tls_ask_allowed True)
C-->>D: TLS cert issued (Let's Encrypt)
D->>C: POST /enroll (pubkey + hostname)
C->>E: reverse_proxy (only POST /enroll, 4KB cap)
E-->>C: "{id: a1c2, status: pending}"
C-->>D: "200 {id: a1c2}"
D->>C: GET /status/a1c2 (polling)
C->>E: reverse_proxy
E-->>C: "{status: pending}"
C-->>D: pending
O->>E: mship relay approve a1c2 (direct)
Note over E: writes pubkey to pubkeys/
D->>C: GET /status/a1c2
C->>E: reverse_proxy
E-->>C: "{status: approved}"
C-->>D: approved
Note over D,S: Enrolled device opens tunnel
D->>S: SSH tunnel :2222
C->>E: "GET /tls-check?domain=slug-abcdef.relay.example.com"
E-->>C: 200 OK (serve subdomain pattern matches)
C->>S: wildcard proxy with Host header preserved
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant D as New Device
participant C as Caddy (:443)
participant E as enroll-server (127.0.0.1:47180)
participant S as sish (127.0.0.1:8080)
participant O as Owner
Note over C,E: On-demand TLS ask (internal loopback)
D->>C: HTTPS SNI enroll.relay.example.com
C->>E: "GET /tls-check?domain=enroll.relay.example.com"
E-->>C: 200 OK (tls_ask_allowed True)
C-->>D: TLS cert issued (Let's Encrypt)
D->>C: POST /enroll (pubkey + hostname)
C->>E: reverse_proxy (only POST /enroll, 4KB cap)
E-->>C: "{id: a1c2, status: pending}"
C-->>D: "200 {id: a1c2}"
D->>C: GET /status/a1c2 (polling)
C->>E: reverse_proxy
E-->>C: "{status: pending}"
C-->>D: pending
O->>E: mship relay approve a1c2 (direct)
Note over E: writes pubkey to pubkeys/
D->>C: GET /status/a1c2
C->>E: reverse_proxy
E-->>C: "{status: approved}"
C-->>D: approved
Note over D,S: Enrolled device opens tunnel
D->>S: SSH tunnel :2222
C->>E: "GET /tls-check?domain=slug-abcdef.relay.example.com"
E-->>C: 200 OK (serve subdomain pattern matches)
C->>S: wildcard proxy with Host header preserved
Reviews (2): Last reviewed commit: "relay: supervise enroll-server (backs Ca..." | Re-trigger Greptile |
| # Enrollment — public but edge-hardened (only POST /enroll, GET /status/*). | ||
| enroll.{$RELAY_DOMAIN} { | ||
| tls { |
There was a problem hiding this comment.
Enroll-server is an unmanaged dependency for all on-demand TLS, not just enrollment
The global on_demand_tls ask endpoint at 127.0.0.1:47180 is called by Caddy every time it needs to issue or renew a TLS cert — for enroll.<relay> and every *.<relay> sish per-device subdomain. Because the enroll-server runs outside the compose stack (no restart: unless-stopped), a host reboot brings caddy and sish back automatically but leaves the enroll-server down. From that point, Caddy silently refuses cert issuance for any relay domain, so the first new sish tunnel from a device that hasn't yet received a cert will fail with a TLS error, and existing certs fail to renew near their 90-day boundary.
The compose file has no depends_on or health-check that could surface this gap. Options: add the enroll-server as a compose service (even a simple command: mship relay enroll-server …) with restart: unless-stopped, or ship a systemd unit in the bootstrap script so the process is supervised the same way caddy and sish are.
There was a problem hiding this comment.
Fixed in 0b9ee91. Agreed — the ask endpoint gates issuance and renewal for every relay subdomain, so an unsupervised host process is the wrong place for it. Rather than containerize mship (the enroll-server writes the host pubkeys/ that sish bind-mounts and Caddy reaches it on loopback, so it is inherently a host process), it is now supervised: relay-bootstrap.sh installs + enables a mship-relay-enroll.service systemd unit (Restart=always) when run as root, with a manual unit + the criticality note documented in relay-hosting.md Step 4, and a fallback "supervise this" notice when not root. So it now survives reboots the same way Compose restarts Caddy/sish.
…label regex (greptile P1/P2)
Summary
Puts the relay's device-enrollment endpoint behind the relay's existing 443 via a Caddy reverse proxy, replacing the raw
:47180port (and its manual firewall hole). Implements the approved specrelay-caddy-front. The request→approve enrollment model andRequestStoreinternals are unchanged — this is purely the ingress + device UX.Before:
mship relay enroll-serverran as a public HTTP service on:47180(open the port, plaintext, device types a fullhttp://<host>:47180URL). sish terminated TLS itself and would mint an on-demand cert for any requested hostname.After:
network_mode: host). sish moves behind it:--https=false, internal127.0.0.1:8080, still owns SSH:2222.enroll.<relay>→ the enroll-server (now bound127.0.0.1— no public port, firewall hole gone);*.<relay>→ sish (Host preserved, so every per-device serve subdomain still works).askendpoint — a pure predicatetls_ask_allowed(domain, relay_domain)issues certs only forenroll.<relay>and serve subdomains (<slug>-<6hex>.<relay>), closing sish's "cert for any host" surface. Adversarially reviewed; no bypass (the"." + relay_domainsuffix forces a label boundary,relay_domainis never compiled as regex).POST /enroll,GET /status/*) in the Caddyfile.mship relay enroll --relay-host <relay>deriveshttps://enroll.<relay>itself — operator supplies only the relay host.--enroll-urlremains an optional explicit override.New:
core/relay/tls_ask.py,docker/relay/Caddyfile. Changed:core/relay/enroll_app.py(the/tls-checkask route),cli/relay.py(enroll-server loopback +--relay-domain; enroll--relay-hostderivation),docker/relay/docker-compose.yml(Caddy service, sish reconfig),docs/relay-hosting.md,scripts/relay-bootstrap.sh. Plan:docs/plans/2026-06-26-relay-caddy-front.md.Built subagent-driven (implementer + spec/quality review per task). A final whole-branch review caught the one integration gap the per-task reviews couldn't:
scripts/relay-bootstrap.shstill created the removedacme/volume and omitted thecaddy-data//caddy-config/dirs the docs promise — fixed here, along with a troubleshootingcurlpointing at a/healthroute the enroll-app doesn't expose.Deferred (documented): a rate-limit plugin (custom Caddy image) and wildcard DNS-01 as an alternative TLS strategy.
Test plan
mship test --repos mothership— full suite green (1740 passed). New unit tests:tls_ask_allowedallow/deny matrix + bypass-critical regressions (suffix-boundary, trailing-dot FQDN, case, leading hyphen); the/tls-checkroute (allow/deny/missing-domain via TestClient); enroll-server loopback default +--relay-domainenv fallback (CliRunner);enroll_base_urlprecedence (--enroll-url>--relay-host> configrelay.host).docker compose configvalidates with the new Caddy service + reconfigured sish.caddybinary; docker daemon not accessible in the build env) — the Caddyfile was reviewed by hand; validate on the host withcaddy validate --config docker/relay/Caddyfile --adapter caddyfile.docker compose up -d(sish + caddy);RELAY_DOMAIN=<relay> mship relay enroll-server --pubkeys-dir … --store-dir …(binds127.0.0.1); from another devicemship relay enroll --relay-host <relay>reacheshttps://enroll.<relay>; ownermship relay approve <id>; confirm an existingmship serve --relaysubdomain still loads through Caddy; confirm:47180is not reachable externally.https://claude.ai/code/session_01BCH6QFgBdw9LiWBzV39Xnj