Skip to content

relay caddy front#240

Open
atomikpanda wants to merge 10 commits into
mainfrom
feat/relay-caddy-front
Open

relay caddy front#240
atomikpanda wants to merge 10 commits into
mainfrom
feat/relay-caddy-front

Conversation

@atomikpanda

Copy link
Copy Markdown
Owner

Summary

Puts the relay's device-enrollment endpoint behind the relay's existing 443 via a Caddy reverse proxy, replacing the raw :47180 port (and its manual firewall hole). Implements the approved spec relay-caddy-front. The request→approve enrollment model and RequestStore internals are unchanged — this is purely the ingress + device UX.

Before: mship relay enroll-server ran as a public HTTP service on :47180 (open the port, plaintext, device types a full http://<host>:47180 URL). sish terminated TLS itself and would mint an on-demand cert for any requested hostname.

After:

  • Caddy is the sole public web ingress (80/443, network_mode: host). sish moves behind it: --https=false, internal 127.0.0.1:8080, still owns SSH :2222.
  • Host routing: enroll.<relay> → the enroll-server (now bound 127.0.0.1 — no public port, firewall hole gone); *.<relay> → sish (Host preserved, so every per-device serve subdomain still works).
  • On-demand TLS gated by an ask endpoint — a pure predicate tls_ask_allowed(domain, relay_domain) issues certs only for enroll.<relay> and serve subdomains (<slug>-<6hex>.<relay>), closing sish's "cert for any host" surface. Adversarially reviewed; no bypass (the "." + relay_domain suffix forces a label boundary, relay_domain is never compiled as regex).
  • Edge-hardened enroll route: request-body size cap + method/path allowlist (only POST /enroll, GET /status/*) in the Caddyfile.
  • Device UX: mship relay enroll --relay-host <relay> derives https://enroll.<relay> itself — operator supplies only the relay host. --enroll-url remains an optional explicit override.

New: core/relay/tls_ask.py, docker/relay/Caddyfile. Changed: core/relay/enroll_app.py (the /tls-check ask route), cli/relay.py (enroll-server loopback + --relay-domain; enroll --relay-host derivation), docker/relay/docker-compose.yml (Caddy service, sish reconfig), docs/relay-hosting.md, scripts/relay-bootstrap.sh. Plan: docs/plans/2026-06-26-relay-caddy-front.md.

Built subagent-driven (implementer + spec/quality review per task). A final whole-branch review caught the one integration gap the per-task reviews couldn't: scripts/relay-bootstrap.sh still created the removed acme/ volume and omitted the caddy-data//caddy-config/ dirs the docs promise — fixed here, along with a troubleshooting curl pointing at a /health route the enroll-app doesn't expose.

Deferred (documented): a rate-limit plugin (custom Caddy image) and wildcard DNS-01 as an alternative TLS strategy.

Test plan

  • mship test --repos mothership — full suite green (1740 passed). New unit tests: tls_ask_allowed allow/deny matrix + bypass-critical regressions (suffix-boundary, trailing-dot FQDN, case, leading hyphen); the /tls-check route (allow/deny/missing-domain via TestClient); enroll-server loopback default + --relay-domain env fallback (CliRunner); enroll_base_url precedence (--enroll-url > --relay-host > config relay.host).
  • docker compose config validates with the new Caddy service + reconfigured sish.
  • Caddy config validation could not run in CI (no caddy binary; docker daemon not accessible in the build env) — the Caddyfile was reviewed by hand; validate on the host with caddy validate --config docker/relay/Caddyfile --adapter caddyfile.
  • Manual smoke (on the relay host): docker compose up -d (sish + caddy); RELAY_DOMAIN=<relay> mship relay enroll-server --pubkeys-dir … --store-dir … (binds 127.0.0.1); from another device mship relay enroll --relay-host <relay> reaches https://enroll.<relay>; owner mship relay approve <id>; confirm an existing mship serve --relay subdomain still loads through Caddy; confirm :47180 is not reachable externally.

https://claude.ai/code/session_01BCH6QFgBdw9LiWBzV39Xnj

build_enroll_app now accepts relay_domain keyword arg (required) and
mounts GET /tls-check?domain=<sni>; returns 200 for enroll and per-device
subdomains under relay_domain, 403 otherwise. All callers updated.
…k route

- Change default --host from 0.0.0.0 to 127.0.0.1 (Caddy fronts it)
- Add --relay-domain option (default $RELAY_DOMAIN) fed to build_enroll_app
- Extract module-level _run_uvicorn and _enroll_server_impl seams for testability
- Add tests/cli/test_relay_enroll_server.py with loopback + env-var default tests
… route

Whole-branch review caught scripts/relay-bootstrap.sh still creating the removed
acme/ volume and omitting the caddy-data/caddy-config dirs the docs promise, plus
a troubleshooting curl pointing at a /health route the enroll-app doesn't have.

Claude-Session: https://claude.ai/code/session_01BCH6QFgBdw9LiWBzV39Xnj
@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR puts the relay's device-enrollment endpoint behind Caddy on port 443, replacing the raw :47180 public port. Caddy becomes the sole public web ingress, terminating TLS via on-demand Let's Encrypt certs gated by a tight allowlist predicate (tls_ask_allowed), while sish moves to an internal loopback HTTP listener.

  • Security hardening: enroll-server binds 127.0.0.1 only; the Caddyfile allowlists only POST /enroll and GET /status/* with a 4 KB body cap; tls_ask_allowed issues certs only for enroll.<relay> and <slug>-<6hex>.<relay> shapes, closing sish's "cert for any host" surface. The _SERVE_LABEL regex was tightened to [a-z0-9][a-z0-9-]* (fixing the previously-flagged lone-hyphen base gap).
  • Operational resilience: the bootstrap script installs a systemd unit to supervise the enroll-server (which gates all cert issuance/renewal via the on-demand ask endpoint); the docs prominently warn operators to keep it supervised if systemd is unavailable.
  • Device UX: mship relay enroll --relay-host <host> now derives https://enroll.<host> automatically; --enroll-url remains as an explicit override.

Confidence Score: 5/5

Safe to merge. The ingress topology change is well-scoped and the security-critical allowlist predicate has a thorough test matrix covering all bypass vectors.

All changed paths are covered by new tests (tls_ask allow/deny matrix, /tls-check route, enroll-server loopback wiring, enroll URL precedence). The two issues flagged in prior review rounds — the _SERVE_LABEL lone-hyphen base and the unmanaged enroll-server dependency — are both addressed: the regex now anchors on [a-z0-9], and the bootstrap script installs a systemd supervisor. The Caddyfile routing correctly blocks public access to /tls-check, and the network_mode: host + 127.0.0.1:8080 port binding topology is internally consistent.

No files require special attention. The Caddyfile could not be machine-validated in CI (noted in the PR test plan), so a caddy validate run on the relay host before going live is worth doing.

Important Files Changed

Filename Overview
src/mship/core/relay/tls_ask.py New cert allowlist predicate; regex fixed to [a-z0-9][a-z0-9-]* to enforce alphanumeric-first label. Suffix check uses "." + relay_domain to enforce label boundary. All edge cases tested.
src/mship/core/relay/enroll_app.py Added relay_domain parameter to build_enroll_app and mounted the Caddy on-demand /tls-check route backed by tls_ask_allowed. /tls-check is blocked from public access by the Caddyfile; only Caddy itself calls it on loopback.
src/mship/cli/relay.py Extracted _enroll_server_impl / _run_uvicorn seams for testability; enroll-server default bind changed from 0.0.0.0 to 127.0.0.1; added --relay-domain with $RELAY_DOMAIN env fallback; enroll command now accepts --relay-host to derive the enroll URL.
docker/relay/Caddyfile New Caddyfile. On-demand TLS gated by http://127.0.0.1:47180/tls-check. enroll.<relay> site allows only POST /enroll and GET /status/* with 4 KB body cap; everything else returns 404. Wildcard *.<relay> proxies to sish with Host preserved.
docker/relay/docker-compose.yml sish demoted to internal HTTP on 127.0.0.1:8080 with --https=false; Caddy service added with network_mode: host. ACME cert storage migrated from sish-managed ./acme to ./caddy-data / ./caddy-config.
scripts/relay-bootstrap.sh Replaced acme/ dir creation with caddy-data/ and caddy-config/. Adds a systemd unit for the enroll-server when run as root with systemd and mship available; otherwise prints a prominent warning.
tests/core/relay/test_tls_ask.py Comprehensive allow/deny matrix covering enroll host, serve subdomains, bare apex, foreign domains, lookalikes, extra levels, blank input, trailing-dot FQDN, suffix boundary, case normalization, and leading-hyphen bypass cases.
tests/cli/test_relay_enroll_server.py Tests loopback default and $RELAY_DOMAIN env var fallback via monkeypatched seams; no server is started.
tests/cli/test_relay_enroll_url.py Tests enroll_base_url precedence (explicit URL > relay-host > config host) and trailing-slash normalisation.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant D as New Device
    participant C as Caddy (:443)
    participant E as enroll-server (127.0.0.1:47180)
    participant S as sish (127.0.0.1:8080)
    participant O as Owner

    Note over C,E: On-demand TLS ask (internal loopback)
    D->>C: HTTPS SNI enroll.relay.example.com
    C->>E: "GET /tls-check?domain=enroll.relay.example.com"
    E-->>C: 200 OK (tls_ask_allowed True)
    C-->>D: TLS cert issued (Let's Encrypt)

    D->>C: POST /enroll (pubkey + hostname)
    C->>E: reverse_proxy (only POST /enroll, 4KB cap)
    E-->>C: "{id: a1c2, status: pending}"
    C-->>D: "200 {id: a1c2}"

    D->>C: GET /status/a1c2 (polling)
    C->>E: reverse_proxy
    E-->>C: "{status: pending}"
    C-->>D: pending

    O->>E: mship relay approve a1c2 (direct)
    Note over E: writes pubkey to pubkeys/

    D->>C: GET /status/a1c2
    C->>E: reverse_proxy
    E-->>C: "{status: approved}"
    C-->>D: approved

    Note over D,S: Enrolled device opens tunnel
    D->>S: SSH tunnel :2222
    C->>E: "GET /tls-check?domain=slug-abcdef.relay.example.com"
    E-->>C: 200 OK (serve subdomain pattern matches)
    C->>S: wildcard proxy with Host header preserved
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant D as New Device
    participant C as Caddy (:443)
    participant E as enroll-server (127.0.0.1:47180)
    participant S as sish (127.0.0.1:8080)
    participant O as Owner

    Note over C,E: On-demand TLS ask (internal loopback)
    D->>C: HTTPS SNI enroll.relay.example.com
    C->>E: "GET /tls-check?domain=enroll.relay.example.com"
    E-->>C: 200 OK (tls_ask_allowed True)
    C-->>D: TLS cert issued (Let's Encrypt)

    D->>C: POST /enroll (pubkey + hostname)
    C->>E: reverse_proxy (only POST /enroll, 4KB cap)
    E-->>C: "{id: a1c2, status: pending}"
    C-->>D: "200 {id: a1c2}"

    D->>C: GET /status/a1c2 (polling)
    C->>E: reverse_proxy
    E-->>C: "{status: pending}"
    C-->>D: pending

    O->>E: mship relay approve a1c2 (direct)
    Note over E: writes pubkey to pubkeys/

    D->>C: GET /status/a1c2
    C->>E: reverse_proxy
    E-->>C: "{status: approved}"
    C-->>D: approved

    Note over D,S: Enrolled device opens tunnel
    D->>S: SSH tunnel :2222
    C->>E: "GET /tls-check?domain=slug-abcdef.relay.example.com"
    E-->>C: 200 OK (serve subdomain pattern matches)
    C->>S: wildcard proxy with Host header preserved
Loading

Reviews (2): Last reviewed commit: "relay: supervise enroll-server (backs Ca..." | Re-trigger Greptile

Comment thread docker/relay/Caddyfile
Comment on lines +8 to +10
# Enrollment — public but edge-hardened (only POST /enroll, GET /status/*).
enroll.{$RELAY_DOMAIN} {
tls {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Enroll-server is an unmanaged dependency for all on-demand TLS, not just enrollment

The global on_demand_tls ask endpoint at 127.0.0.1:47180 is called by Caddy every time it needs to issue or renew a TLS cert — for enroll.<relay> and every *.<relay> sish per-device subdomain. Because the enroll-server runs outside the compose stack (no restart: unless-stopped), a host reboot brings caddy and sish back automatically but leaves the enroll-server down. From that point, Caddy silently refuses cert issuance for any relay domain, so the first new sish tunnel from a device that hasn't yet received a cert will fail with a TLS error, and existing certs fail to renew near their 90-day boundary.

The compose file has no depends_on or health-check that could surface this gap. Options: add the enroll-server as a compose service (even a simple command: mship relay enroll-server …) with restart: unless-stopped, or ship a systemd unit in the bootstrap script so the process is supervised the same way caddy and sish are.

Fix in Claude Code Fix in Codex

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0b9ee91. Agreed — the ask endpoint gates issuance and renewal for every relay subdomain, so an unsupervised host process is the wrong place for it. Rather than containerize mship (the enroll-server writes the host pubkeys/ that sish bind-mounts and Caddy reaches it on loopback, so it is inherently a host process), it is now supervised: relay-bootstrap.sh installs + enables a mship-relay-enroll.service systemd unit (Restart=always) when run as root, with a manual unit + the criticality note documented in relay-hosting.md Step 4, and a fallback "supervise this" notice when not root. So it now survives reboots the same way Compose restarts Caddy/sish.

Comment thread src/mship/core/relay/tls_ask.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant