Skip to content

feat: add AS215932 extmon and diagnostics ops support#213

Open
Svaag wants to merge 7 commits into
mainfrom
chore/agentic-diagnostics-pricing-knobs
Open

feat: add AS215932 extmon and diagnostics ops support#213
Svaag wants to merge 7 commits into
mainfrom
chore/agentic-diagnostics-pricing-knobs

Conversation

@Svaag

@Svaag Svaag commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Summary

This PR contains the Network Operations side of the AS215932 / Agentic ISP Support rollout.

It adds operational support for the new Hyrule Cloud BGP/network-intelligence platform and exposes pricing knobs for the new agentic diagnostics APIs.

Major pieces:

  • extmon role expansion for out-of-AS215932 monitoring
    • Routinator
    • BGPalerter
    • BGPStream/PyBGPStream bootstrap
    • extmon-bgp-agent
    • extmon-diag-agent
    • Prometheus scrape jobs and alert rules
    • direct Discord alerting plus best-effort NOC webhook
  • BGPalerter pinned to v2.0.1 with SHA256
  • extmon rendered service/config templates
  • NOC-side AS215932 router-table snapshot timer/service templates
  • Hyrule Cloud runtime/deployment support for BGP data dir and ingest token
  • Vault/static env template price knobs for:
    • BGP/IP/DNS/RDAP/WHOIS/MX/Mail
    • Web/TLS
    • Path diagnostics
    • Port checks
    • NAT/CGNAT
    • Threat reputation
    • VoIP/SIP
    • Speedtest
  • docs updates for extmon and network flows

Review notes

This is paired with:

  • hyrule-cloud PR: x402 Agentic ISP Support API/Skill implementation
  • hyrule-mcp PR: AS215932 router snapshot collector

The important design constraint remains: extmon is outside AS215932 and must not hold router SSH keys. Router snapshots stay NOC-side.

Validation

After rebasing on current origin/main:

cd network-operations && ANSIBLE_ROLES_PATH=ansible/roles ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/cloud.yml --syntax-check
playbook: ansible/playbooks/cloud.yml

Previous local validation before the branch split also covered extmon/noc/cloud syntax checks and extmon render/agent py_compile.

@Svaag Svaag requested a review from a team as a code owner June 14, 2026 00:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5a5d7daec4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +56 to +58
resolve_public(parsed.hostname or "")
req = urllib.request.Request(url, headers={"User-Agent": "AS215932-extmon-diag/1.0"})
with urllib.request.urlopen(req, timeout=timeout) as resp:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Block redirects to private addresses

When a diagnostics request probes a public URL that returns a 30x to loopback/link-local/private infrastructure (for example metadata or 127.0.0.1), resolve_public only validates the original hostname, but urllib.request.urlopen follows redirects and resolves the redirected target itself. That bypasses the non-public-address guard and can return a body sample from internal services; disable redirects or re-validate each redirect target before fetching.

Useful? React with 👍 / 👎.

summary: 'BGP data source stale: {% raw %}{{ $labels.source }}{% endraw %}'

- alert: BGPalerterCriticalEvent
expr: increase(bgpalerter_alerts_total{severity="critical"}[10m]) > 0

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve the first BGPalerter event

For the first critical BGPalerter webhook after the agent starts, or for any new channel/type labelset, the counter series first appears at 1; Prometheus cannot compute a positive increase() without a prior sample, and later samples remain at 1. This misses the one-shot hijack/withdrawal notification until a second matching event increments the same series, so initialize counters at 0 or alert from an event timestamp/gauge instead.

Useful? React with 👍 / 👎.


# BGP router table snapshots for paid Hyrule Cloud data products.
hyrule_mcp_bgp_snapshot_dir: "{{ hyrule_mcp_state_dir }}/bgp-snapshots"
hyrule_mcp_bgp_snapshot_ingest_url: "{{ lookup('ansible.builtin.env', 'HYRULE_BGP_INGEST_URL') | default('https://cloud.hyrule.host/v1/internal/bgp') }}"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the default ingest URL when env is unset

When HYRULE_BGP_INGEST_URL is not exported, Ansible's env lookup returns an empty string, and default(...) does not replace falsey values unless the second argument is true. The rendered bgp-router-snapshot.env therefore gets HYRULE_BGP_INGEST_URL= instead of the intended cloud endpoint, so the hourly snapshot job will not upload unless every deploy environment explicitly sets the URL.

Useful? React with 👍 / 👎.

@github-actions

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 5 🔵🔵🔵🔵🔵
🏅 Score: 72
🧪 No relevant tests
🔒 Security concerns

Yes. Multiple security concerns:
(1) The CAIDA BGPStream bootstrap script is fetched via unauthenticated HTTP and piped to bash, creating a supply-chain compromise vector. (2) The diag agent's resolve_public function has a logic flaw that can bypass private-IP blocking under crafted DNS responses. (3) The diag agent's ping/trace probes pass user-controlled addresses to subprocess without sanitization, enabling argument injection if the token is empty or leaked. (4) The EXTMON_DIAG_AGENT_TOKEN defaults to empty string in the env file, meaning the agent is unprotected by default until the operator explicitly sets a token. (5) The EXTMON_BGP_INGEST_TOKEN and EXTMON_BGP_CLOUDFLARE_API_TOKEN are rendered into a world-readable env file (mode 0600 is set in the install task, but the generated file in ansible/generated/extmon/bgp-agent.env has no mode restriction — it inherits the default from the render task which is 0644). (6) The HYRULE_BGP_INGEST_TOKEN is rendered into bgp-router-snapshot.env with mode 0640, which is group-readable; the group is hyrule_mcp_app_group which may include non-root users.

⚡ Recommended focus areas for review

Operational Footgun

The CAIDA BGPStream repository bootstrap (lines 52-62) runs a remote shell script piped into bash via curl -s "https://pkg.caida.org/os/$(lsb_release -si | awk '{print tolower($0)}')/bootstrap.sh" | bash. This is a classic supply-chain risk: the script is fetched over plain HTTP (no HTTPS pinned), and the URL is dynamically constructed from the OS release string. If the CAIDA bootstrap endpoint is compromised or the DNS is hijacked, arbitrary code can execute as root during the apply phase. This bypasses the SHA-pinned binary pattern used elsewhere in the PR (e.g., BGPalerter). The creates: /etc/apt/sources.list.d/caida.list guard only prevents re-execution after the first run, not the initial compromise.

- name: Apply — bootstrap CAIDA BGPStream repositories
  shell: |
    set -euo pipefail
    curl -s "https://pkg.caida.org/os/$(lsb_release -si | awk '{print tolower($0)}')/bootstrap.sh" | bash
  args:
    executable: /bin/bash
    creates: /etc/apt/sources.list.d/caida.list
  when:
    - extmon_apply | bool
    - extmon_install_bgpstream | bool
  tags: [apply]
Security Concern

The resolve_public function (lines 25-39) has a logic flaw: if blocked_ip(host) raises a ValueError, the bare except ValueError: pass on line 29-30 silently swallows the exception and continues to socket.getaddrinfo. This means a caller can bypass the blocked-IP check by passing a hostname that is itself a private IP string (e.g., "127.0.0.1" or "10.0.0.1"). The function will skip the check, resolve the hostname (which is already an IP), and then the subsequent blocked_ip(addr) check on line 35 will catch it — but only if the address resolves to the same IP. If the hostname resolves to a different private IP (e.g., via a crafted DNS response), the check is effectively bypassed. In practice, an attacker who can control the host parameter to the tcp or ping/trace probes could target internal RFC1918 or loopback addresses, bypassing the intended restriction to public targets only.

def resolve_public(host: str) -> list[str]:
    try:
        if blocked_ip(host):
            raise ValueError(f"blocked non-public target {host}")
    except ValueError:
        pass
    infos = socket.getaddrinfo(host, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    addrs = []
    for info in infos:
        addr = info[4][0]
        if blocked_ip(addr):
            raise ValueError(f"blocked resolved non-public target {addr}")
        if addr not in addrs:
            addrs.append(addr)
    return addrs
Security Concern

The run_probe function (lines 62-76) allows ping and trace probes that execute subprocess commands with user-controlled addr values (line 70). While resolve_public is called first, the addr variable is the first resolved address from getaddrinfo. If an attacker can control the host payload and the DNS resolves to a crafted address containing shell metacharacters (e.g., ; rm -rf /), the subprocess.run call on line 71 could execute arbitrary commands. Although subprocess.run with a list argument (not shell=True) prevents shell injection, the addr is passed as a positional argument to ping/traceroute, which could still allow argument injection (e.g., -c flag injection in ping). This is a realistic attack vector if the diag agent is exposed to untrusted clients (the token check on line 86 is optional — if TOKEN is empty, any POST is accepted).

if kind in {"ping", "trace"}:
    host = str(payload["host"])
    addr = resolve_public(host)[0]
    cmd = ["ping", "-c", "4", "-W", "2", addr] if kind == "ping" else ["traceroute", "-m", "20", addr]
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=20, check=False)
    return {"ok": proc.returncode == 0, "returncode": proc.returncode, "output": (proc.stdout or proc.stderr)[-4096:]}

@github-actions

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Security
Fix SSRF bypass in resolve_public

The try-except block silently swallows the ValueError when a blocked IP is detected,
completely bypassing the SSRF protection. Remove the outer try-except so that any
non-public target immediately raises an exception and prevents execution of the
probe.

ansible/roles/extmon/templates/diag-agent.py.j2 [25-30]

 def resolve_public(host: str) -> list[str]:
-    try:
-        if blocked_ip(host):
-            raise ValueError(f"blocked non-public target {host}")
-    except ValueError:
-        pass
+    if blocked_ip(host):
+        raise ValueError(f"blocked non-public target {host}")
Suggestion importance[1-10]: 9

__

Why: The try-except silently swallows the ValueError when a blocked IP is detected, completely bypassing SSRF protection. Removing it ensures non-public targets are rejected immediately, which is a critical security fix.

High
Possible issue
Handle empty DNS resolution result

resolve_public may return an empty list if no addresses are resolved, causing an
unhandled IndexError that crashes the probe handler. Check the length of the list
before indexing and raise a clear error if no addresses are available.

ansible/roles/extmon/templates/diag-agent.py.j2 [42-45]

 def tcp(host: str, port: int, timeout: float = 10.0):
     ...
-    addr = resolve_public(host)[0]
+    addrs = resolve_public(host)
+    if not addrs:
+        raise ValueError(f"no public addresses resolved for {host}")
+    addr = addrs[0]
Suggestion importance[1-10]: 6

__

Why: resolve_public may return an empty list, causing an unhandled IndexError. Adding a length check prevents a crash and provides a clear error message, improving robustness.

Low
General
Prevent polling loop from crashing

An unexpected exception in poll_once() (e.g., a bug in the shared state lock) would
terminate the polling thread entirely, stopping all metric collection. Wrap the call
in a try-except to log any such error and continue the loop, ensuring the agent
remains responsive.

ansible/roles/extmon/templates/bgp-agent.py.j2 [139-142]

 def poll_loop() -> None:
     while True:
-        poll_once()
+        try:
+            poll_once()
+        except Exception:
+            pass  # Prevent loop exit on unexpected error
         time.sleep(POLL_SECONDS)
Suggestion importance[1-10]: 5

__

Why: An unexpected exception in poll_once() would terminate the polling thread, stopping all metric collection. Wrapping it in a try-except ensures the loop continues, though the individual poll functions already have error handling.

Low

@Svaag

Svaag commented Jun 14, 2026

Copy link
Copy Markdown
Contributor Author

Live extmon VPS has been bootstrapped and this PR branch now points inventory at it.

Provisioned host:

  • IPv4: 45.32.179.134
  • IPv6: 2001:19f0:7402:0cd5:5400:06ff:fe40:7112
  • OS: Debian 13 / trixie
  • RAM: ~3.8GiB
  • Swap: ~7.7GiB
  • Hostname: extmon
  • SSH: key-based root access works with the AS215932 operator key
  • Personal account: svag created with AS215932 operator key and passwordless sudo
  • SSH hardening: password/kbd-interactive auth disabled; root login is key-only

Validated after inventory update:

ANSIBLE_CONFIG=ansible/ansible.cfg ansible -i ansible/inventory/hosts.yml extmon -m ping
extmon | SUCCESS

ANSIBLE_CONFIG=ansible/ansible.cfg ANSIBLE_ROLES_PATH=ansible/roles ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/extmon.yml --syntax-check
OK

ANSIBLE_CONFIG=ansible/ansible.cfg ANSIBLE_ROLES_PATH=ansible/roles ansible-playbook -i ansible/inventory/hosts.yml ansible/playbooks/extmon.yml --tags validate --connection=local --limit extmon
OK

Still blocked before extmon_apply=true full service deployment:

  • EXTMON_DISCORD_WEBHOOK_URL
  • EXTMON_BGP_CLOUDFLARE_API_TOKEN
  • EXTMON_BGP_INGEST_TOKEN
  • EXTMON_DIAG_AGENT_TOKEN
  • OVH expiry collector credentials if that collector should be live on first apply

@Svaag Svaag force-pushed the chore/agentic-diagnostics-pricing-knobs branch from eed8734 to e47057f Compare June 14, 2026 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant