ci: guard the published page against stray external domains #49
docs/index.html publishes to GitHub Pages, so an external link to a client, vendor, or product domain reaching it is a confidentiality leak. The existing client-name defense lives only in a gitignored local file, so CI never sees it. This adds the committed, no-secret half:

- scripts/check_external_domains.py fails if the published page links to any external host outside a small allowlist (standards/repo/identity domains plus the author's current product domain), with a built-in --self-test.
- Wired into ci.yml (self-test + real check) and mirrored in the lefthook pre-commit hook (globbed to docs/index.html).

Catches the leak shape without the private name list, and unlike the local hook it covers fresh clones and fork PRs. The private client-name layer (a CI secret) is intentionally deferred.

Co-Authored-By: Claude <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ce54a46bfb
    )


    PAGE = pathlib.Path(__file__).resolve().parent.parent / "docs" / "index.html"
    _HOST_RE = re.compile(r"https?://([A-Za-z0-9._-]+)")
Recognize scheme-relative external links
The guard promises to fail any non-allowlisted external host in docs/index.html, but this regex only matches URLs that explicitly include http:// or https://. In the CI path inspected in .github/workflows/ci.yml, the script is run directly, so a valid published link such as <a href="//client.example/path"> is still an external browser navigation while violations(...) returns [], letting CI and the pre-commit guard pass. Include scheme-relative URLs, or parse link attributes with a URL parser before applying the allowlist.
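One way to act on this suggestion is to parse link attributes rather than scan for a scheme. The sketch below is illustrative only, not the script's actual code: the `ALLOWED_HOSTS` contents, the `URL_ATTRS` set, and the `_HostCollector` class are assumptions made for the example. It relies on `urllib.parse.urlsplit`, which treats a leading `//` as the start of a network location, so scheme-relative links yield a host while relative paths and fragments do not.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

# Hypothetical allowlist for illustration; the real ALLOWED_HOSTS lives in
# scripts/check_external_domains.py and is not shown in this review.
ALLOWED_HOSTS = {"github.com", "www.w3.org"}

# Attributes that commonly carry URLs; an assumption, not an exhaustive list.
URL_ATTRS = {"href", "src", "action", "poster"}


class _HostCollector(HTMLParser):
    """Collect the host of every URL-bearing attribute that names one."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name not in URL_ATTRS or not value:
                continue
            try:
                host = urlsplit(value.strip()).hostname
            except ValueError:
                # Fail closed: report an unparseable URL instead of skipping it.
                self.hosts.add(value)
                continue
            if host:  # None for relative paths, fragments, mailto:, etc.
                self.hosts.add(host)


def violations(html):
    """Return the sorted hosts referenced by html that are not allowlisted."""
    collector = _HostCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.hosts - ALLOWED_HOSTS)
```

With this shape, `violations('<a href="//client.example/path">x</a>')` reports `client.example`. The trade-off is that an attribute-based parser no longer sees URLs in plain text, inline CSS, or `srcset`, so keeping the existing regex scan alongside it may be the safer combination.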
Pull request overview
This PR adds a committed (no-secret) CI + pre-commit safeguard to prevent confidentiality leaks via the published GitHub Pages homepage (docs/index.html) by failing builds/commits when the page references external domains outside a small allowlist.
Changes:
- Add `scripts/check_external_domains.py` to extract external hosts from `docs/index.html` and fail on any host not in `ALLOWED_HOSTS` (with a `--self-test` mode).
- Wire the new check into GitHub Actions CI (`.github/workflows/ci.yml`) and mirror it in the `lefthook.yml` pre-commit hooks.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scripts/check_external_domains.py | Introduces the external-domain allowlist scanner + self-test used by CI/hooks. |
| lefthook.yml | Adds a pre-commit hook intended to block non-allowlisted external domains in docs/index.html. |
| .github/workflows/ci.yml | Adds CI steps to run the self-test and enforce the external-domain allowlist check. |
    if not PAGE.exists():
        print(f"::error::{PAGE} not found")
        return 1
    bad = violations(PAGE.read_text(encoding="utf-8"))
    external-domains:
      glob: "docs/index.html"
      run: |
        if [ -x .venv/bin/python ]; then PY=.venv/bin/python; else PY=python3; fi
        "$PY" scripts/check_external_domains.py
What
Adds the committed, no-secret half of the deny-scan defense (Option C in the proposal): a CI + pre-commit guard that fails if `docs/index.html` (published to GitHub Pages) links to any external host outside a small allowlist.
- `scripts/check_external_domains.py`: extracts external hosts from the published page and fails on any not in `ALLOWED_HOSTS` (standards / repo / identity domains + the author's current product domain), with a built-in `--self-test`.
- Wired into `.github/workflows/ci.yml` (self-test + real check) and mirrored in the `lefthook.yml` pre-commit hook (globbed to `docs/index.html`).

Why
A stray external link — a client, vendor, or product domain — reaching the published page is a confidentiality leak. The existing client-name defense lives only in a gitignored local file, so CI never sees it. This catches the leak shape with zero secrets, and (unlike the local hook) covers fresh clones and fork PRs.
Scope / deliberately deferred
`ALLOWED_HOSTS` so the guard re-tightens.

Verified locally: self-test passes, the current page passes clean (no red-lock), and a planted external domain is correctly rejected.
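For readers who want to picture the planted-domain check, here is a minimal sketch of what such a self-test could look like. It reuses the `_HOST_RE` pattern from the diff, but the allowlist entries, the sample HTML, and the `self_test` helper are assumptions for illustration, not the contents of the committed script.

```python
import re

# Pattern as it appears in the reviewed diff.
_HOST_RE = re.compile(r"https?://([A-Za-z0-9._-]+)")

# Hypothetical allowlist; the real entries are not shown in this PR text.
ALLOWED_HOSTS = frozenset({"github.com", "www.w3.org"})


def violations(html, allowed=ALLOWED_HOSTS):
    """Return the sorted hosts found in html that are not allowlisted."""
    # Lowercase and drop a trailing dot so "https://GitHub.com." in running
    # text compares equal to the allowlisted "github.com".
    hosts = {m.group(1).lower().rstrip(".") for m in _HOST_RE.finditer(html)}
    return sorted(hosts - allowed)


def self_test():
    """Check that a clean page passes and a planted domain is rejected."""
    clean = '<a href="https://github.com/example/repo">repo</a>'
    planted = clean + '<a href="https://client.example/portal">x</a>'
    assert violations(clean) == []
    assert violations(planted) == ["client.example"]
    return True
```

Running `self_test()` before the real scan, as the CI wiring described above does with `--self-test`, gives some assurance that an empty result means a clean page and not a scanner that silently stopped matching.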
🤖 Generated with Claude Code