noc-agent is the operator-facing investigation service for Hyrule Networks
(AS215932). It accepts monitoring events, runs structured incident analysis,
records human-review proposals, and keeps a fallback local control plane available
even when chat tooling is unreachable.
- FastAPI receives Alertmanager and Icinga webhooks.
- A LangGraph investigation runtime normalizes the alert, correlates repeated incidents, routes to a specialist profile, validates confidence, checks golden-state drift, and produces a reviewable proposal.
- Redis stores graph checkpoint state plus short-lived incident memory.
- Discord can act as the interactive operator console.
- A loopback-only local control API plus `nocctl` gives operators an SSH/VPN fallback for review and decision recording.
- Hyrule MCP provides live diagnostic telemetry; NOC Agent consumes it through the configured daemon URL or the legacy stdio path.
The current tranche is intentionally diagnostic-only. Approval records and resumes operator state, but it does not execute infrastructure changes.
Existing interfaces preserved:
- `POST /webhook/alertmanager`
- `POST /webhook/icinga`
- `POST /task`
- `POST /mail/poll`
- `GET /health`
- `GET /health/mcp`
- `GET /health/config`
- `GET /health/model`
- `GET /health/mail`
- `GET /health/cases`
- `GET /metrics`
New control-plane interfaces:
- `GET /control/incidents/pending`
- `GET /control/incidents/{incident_id}`
- `POST /control/incidents/{incident_id}/decision`
- `POST /approval/resume`
- `GET /control/proactive/status`
- `POST /control/proactive/pause`
- `POST /control/proactive/resume`
- `POST /control/proactive/run-once`
- `GET /control/proactive/suppressions`
- `POST /control/proactive/ack` / `POST /control/proactive/unack`
- `GET /control/case-service/cases`
- `GET /control/case-service/cases/{case_id}`
- `GET /control/case-service/outbox`
The /control/... endpoints require X-NOC-Control-Token. The signed resume
endpoint requires an HMAC signature using NOC_APPROVAL_SIGNING_SECRET.
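
As a rough illustration of the signed resume check, here is a minimal sketch built on Python's standard `hmac` module. The message layout (sorted-key JSON), the SHA-256 digest, and the helper names `sign_payload` and `verify_payload` are assumptions made for this example; the service's actual scheme may differ.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, payload: dict) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding (assumed layout)."""
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()


def verify_payload(secret: str, payload: dict, signature: str) -> bool:
    """Compare in constant time so timing does not reveal partial matches."""
    return hmac.compare_digest(sign_payload(secret, payload), signature)
```

In a deployment the secret would come from `NOC_APPROVAL_SIGNING_SECRET` rather than being passed around as a literal.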
nocctl is the local fallback interface:

    nocctl pending
    nocctl show <incident-id>
    nocctl decide <incident-id> approved --operator svag --comment "reviewed"

In production this is intended to run on noc over existing SSH/VPN access,
with NOC_CONTROL_URL=http://127.0.0.1:8000.
Approved remediation execution is off by default. An approved proposal only
runs when NOC_ENABLE_APPROVED_EXECUTION=1, and even then the first phase is
inert: with NOC_ENABLE_NOOP_ROLLBACK_GUARDS=1, execution routes through the
hyrule-mcp no-op rollback guards (prepare_commit_confirm, then on verify either
confirm_change or rollback_change) with action_class="noop_rollback_guard".
Nothing is restarted and no FRR/PF/nftables/WireGuard state is mutated. Each
execution record carries execution_mode, guard_id, an authorization_fingerprint
(the raw HMAC signature is never persisted), and an execution_audit trail.
| State | Meaning |
|---|---|
| `execution_disabled` | approved but `NOC_ENABLE_APPROVED_EXECUTION` is off |
| `noop_guards_prepared` | no-op guard installed for each action |
| `verified` | guard confirmed (no-op execution finalized) |
| `verification_failed` | guard could not confirm; rolled back |
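
The states above can be read as a small decision function. The following is a sketch under stated assumptions: `execution_state` and `authorization_fingerprint` are hypothetical helpers, the fingerprint is assumed to be a truncated SHA-256 of the signature, and the case where execution is enabled without the no-op guards is left unimplemented because this document does not describe it.

```python
import hashlib
from typing import Mapping, Optional


def authorization_fingerprint(signature: str) -> str:
    """One-way digest of the signature, so the raw value is never stored."""
    return hashlib.sha256(signature.encode()).hexdigest()[:16]


def execution_state(env: Mapping[str, str], verify_ok: Optional[bool]) -> str:
    """Map the feature flags and the guard verification result to a state."""
    if env.get("NOC_ENABLE_APPROVED_EXECUTION", "0") != "1":
        return "execution_disabled"
    if env.get("NOC_ENABLE_NOOP_ROLLBACK_GUARDS", "0") != "1":
        # Behaviour without the no-op guards is not described in this document.
        raise NotImplementedError("only the no-op guard phase is sketched")
    if verify_ok is None:
        return "noop_guards_prepared"  # guards installed, not yet verified
    return "verified" if verify_ok else "verification_failed"
```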
Operators can mint a signed authorization offline (matching the MCP HMAC scheme) for manual/smoke use:

    nocctl approvals sign --proposal-id <case> --action-id <id> --operator svag --action-class noop_rollback_guard --ttl 300

Signing requires HYRULE_MCP_ACTION_SIGNING_SECRET (or NOC_APPROVAL_SIGNING_SECRET).
Beyond the reactive webhook path, noc-agent runs an active operator loop
(app/proactive/) that continuously looks for trouble before an alert fires.
It is enabled in production (the deployment env sets NOC_PROACTIVE_ENABLED=1);
the in-code default stays 0 so local dev and the test suite never spin it up.
It is read-only, budgeted, and modelled on the production engineering-loop
(run-lock + per-day ledger + budgets + Icinga heartbeat) and
hyperliquid-trading-agent (continuous observe → propose → evaluate → learn)
loops. The conservative budgets — 1 investigation/cycle, 12/day, $10/day,
MEDIUM severity floor, heavy probes proposed not auto-run — are the live
safety rails.
Each cycle:
- Scan: cheap read-only sweep of Prometheus/Icinga (`app/proactive/scanner.py`) for precursors that the reactive tripwires fire on only after they harden, such as a BGP peer that just left Established or is flapping, a filesystem projected to fill within 24h, a scrape target flapping, restart churn / failed units, intermittent ICMP on a mesh link, or certs near expiry.
- Rank & gate: score by severity, snapshot a decision context, and gate expensive investigation against the daily ledger and severity floor (`app/proactive/governance.py`, `app/proactive/ledger.py`).
- Investigate: attach the top hotspot to a CaseService case, render it as a synthetic alert, and run it through the same investigation graph the webhooks use with CaseService-backed graph memory. Heavy read-only probes (`tcpdump_capture`, `dns_probe_burst`, `multi_source_probe`) are stripped unless `NOC_PROACTIVE_AUTO_HEAVY_PROBES=1` (`app/proactive/investigate.py`).
- Report: a Discord digest plus an optional Icinga passive heartbeat.
- Hand off: when a finding needs a config/docs change, open an idempotent `loop:candidate` GitHub issue (`app/proactive/handoff.py`); a human promotes it to `loop:approved` and the engineering-loop drafts the PR.
- Learn: record predictions, correlate them with later real alerts, and propose candidate lessons under `NOC_PROACTIVE_MEMORY_DIR` (`app/proactive/memory.py`); humans merge proposals into `lessons/`.
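
The "rank & gate" step can be pictured as a budget check run before each investigation. This is a minimal sketch, not the code in `app/proactive/governance.py`: the `Budget` dataclass, the `may_investigate` helper, and the severity ordering are assumptions for illustration, while the default numbers are the documented budgets.

```python
from dataclasses import dataclass

# Assumed severity scale; only the MEDIUM floor is documented.
SEVERITY_ORDER = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


@dataclass
class Budget:
    max_per_cycle: int = 1
    max_per_day: int = 12
    max_cost_usd_per_day: float = 10.0
    cost_usd_per_investigation: float = 0.05
    severity_floor: str = "MEDIUM"


def may_investigate(
    budget: Budget,
    severity: str,
    run_this_cycle: int,
    run_today: int,
    spent_today_usd: float,
) -> bool:
    """Return True only if every budget and the severity floor allow a run."""
    if SEVERITY_ORDER[severity] < SEVERITY_ORDER[budget.severity_floor]:
        return False
    if run_this_cycle >= budget.max_per_cycle:
        return False
    if run_today >= budget.max_per_day:
        return False
    projected = spent_today_usd + budget.cost_usd_per_investigation
    return projected <= budget.max_cost_usd_per_day
```

With the flat per-investigation estimate, the count caps are reached long before the dollar cap, which matches the note that the count cap is the primary budget.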
To canary cheaply, set NOC_PROACTIVE_SHADOW=1 (scan-and-report only, no
autonomous investigation) and flip it back to 0 once the scanners look right.
To pause production at any time, set NOC_PROACTIVE_ENABLED=0 and re-apply, or
hit POST /control/proactive/pause.
Closing the loop — case links + ack/snooze: each digest hotspot carries an
ack id (short fingerprint) and, once investigated, a link to its NOC case
(set NOC_CONTROL_PUBLIC_URL for a clickable link; else the case number shows).
Mute a known issue you've decided to handle:

    curl -XPOST -H "x-noc-control-token: $TOK" http://127.0.0.1:8000/control/proactive/ack \
      -d '{"fingerprint":"<ack id>","reason":"tracked in network-operations#268","ttl_hours":168}'

From the web dashboard (GET /control): a "Proactive loop" panel lists the
current hotspots (status, ack id, summary) with inline Ack buttons + a
reason/hours field, and the muted list with Unack, backed by
/control/proactive/status (which now returns hotspots + suppressions).
From the Discord bot (the ack id is right there in the digest):
/noc_ack ack_id:<id> [reason:…] [hours:…], /noc_unack ack_id:<id>,
/noc_acks. The ack id is matched as a prefix (git short-SHA style), so the
short id shown in the digest works.
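
Prefix matching of this kind can be sketched in a few lines. `resolve_ack_id` is a hypothetical helper, and rejecting an ambiguous prefix (as git does for short SHAs) is an assumption; the document does not say how the service handles that case.

```python
from typing import Iterable


def resolve_ack_id(prefix: str, fingerprints: Iterable[str]) -> str:
    """Resolve a short ack id to the single full fingerprint it prefixes."""
    if not prefix:
        raise KeyError("empty ack id")
    matches = [fp for fp in fingerprints if fp.startswith(prefix)]
    if not matches:
        raise KeyError(f"no hotspot matches ack id {prefix!r}")
    if len(matches) > 1:
        # Assumed behaviour: refuse to guess between several candidates.
        raise ValueError(f"ack id {prefix!r} is ambiguous")
    return matches[0]
```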
An acked hotspot drops from the digest and from autonomous investigation
until it resolves — when it stops firing the suppression auto-prunes, so a
recurrence re-alerts. GET /control/proactive/suppressions lists active acks;
POST /control/proactive/unack clears one.
Proactive config (in-code defaults are conservative; the deployment env enables it):
- `NOC_PROACTIVE_ENABLED` (in-code default `0`, deployed `1`; loop not mounted unless `1`)
- `NOC_PROACTIVE_SHADOW` (in-code default `1`, deployed `0`; `1` = report hotspots only)
- `NOC_PROACTIVE_INTERVAL_S` (default `120`) / `NOC_PROACTIVE_DEEP_SCAN_S` (default `900`)
- `NOC_PROACTIVE_MAX_INVESTIGATIONS_PER_CYCLE` (default `1`)
- `NOC_PROACTIVE_MAX_INVESTIGATIONS_PER_DAY` (default `12`)
- `NOC_PROACTIVE_MAX_COST_USD_PER_DAY` (default `10`)
- `NOC_PROACTIVE_COST_USD_PER_INVESTIGATION` (default `0.05`; flat estimate charged to the daily $ cap until per-run token→USD metering lands; the count cap is the primary budget)
- `NOC_PROACTIVE_INVESTIGATION_COOLDOWN_S` (default `21600`, six hours): minimum interval before the same successfully investigated hotspot fingerprint is re-investigated
- `NOC_PROACTIVE_INVESTIGATION_FAILURE_RETRY_S` / `NOC_CASE_INVESTIGATION_FAILURE_RETRY_S` (default `21600`, six hours): minimum interval before retrying a failed CaseService-grounded investigation; dependency failures must not re-page every scan cycle
- `NOC_PROACTIVE_AUTO_HEAVY_PROBES` (default `0`; propose heavy probes instead of auto-running)
- `NOC_PROACTIVE_HANDOFF_ENABLED` (default `0`) + `NOC_PROACTIVE_HANDOFF_REPO` (default `AS215932/network-operations`)
- `NOC_PROACTIVE_SEVERITY_FLOOR` (default `MEDIUM`)
- `NOC_PROACTIVE_MEMORY_DIR` (default `/var/lib/noc-agent/memory`) / `NOC_PROACTIVE_STATE_DIR` (default `/var/lib/noc-agent/proactive`)
- Handoff auth (only needed when `NOC_PROACTIVE_HANDOFF_ENABLED=1`), in preference order:
  - GitHub App (recommended): `NOC_GITHUB_APP_ID` + the private key via `NOC_GITHUB_APP_PRIVATE_KEY_PATH` (a PEM file; Vault Agent renders it on `noc`) or inline `NOC_GITHUB_APP_PRIVATE_KEY`. The installation is auto-resolved from the repo; override with `NOC_GITHUB_APP_INSTALLATION_ID`. The app mints short-lived installation tokens at call time (cached). App needs Issues: read/write + Metadata: read on the handoff repo.
  - PAT fallback: `NOC_GITHUB_TOKEN` (fine-grained, issues-scoped).
- `NOC_PROACTIVE_ICINGA_URL` / `NOC_PROACTIVE_ICINGA_CHECK` (optional passive heartbeat; reuses `ICINGA_API_USER` / `ICINGA_API_PASSWORD`)
Settings can also be set under a [proactive] table in the model-config TOML;
environment variables take precedence.
The case-grounded state machine (app/cases/) is available as a strangler
foundation for both reactive webhooks and the proactive loop. It stores typed
observations, atomic cases, meta-cases, append-only events, aliases, traces,
operator feedback, and idempotent side-effect outbox intents.
It is off by default. Enable best-effort shadow writes with:

    NOC_CASESERVICE_SHADOW=1

If NOC_DATABASE_URL or DATABASE_URL is set, the runtime uses the Postgres
case store and creates the current schema on startup. Install the optional
Postgres support in deployments that enable this path:

    uv sync --extra postgres

Without a DSN, shadow mode uses in-memory storage for local/dev canaries. Set
NOC_REQUIRE_POSTGRES=1 in production to fail loud instead of falling back to
in-memory state; this also prevents LangGraph checkpointing from silently using
the in-memory saver when a Postgres checkpoint backend was requested.
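
The fallback rule can be summarized as a small selection function. This sketch only covers the "no DSN configured" case; `select_case_backend` is a hypothetical name, and preferring `NOC_DATABASE_URL` over `DATABASE_URL` when both are set is an assumption.

```python
from typing import Mapping


def select_case_backend(environ: Mapping[str, str]) -> str:
    """Pick the case store backend, failing loudly when Postgres is required."""
    dsn = environ.get("NOC_DATABASE_URL") or environ.get("DATABASE_URL")
    if dsn:
        return "postgres"
    if environ.get("NOC_REQUIRE_POSTGRES", "0") == "1":
        # Refuse to start rather than silently keeping state in memory.
        raise RuntimeError("NOC_REQUIRE_POSTGRES=1 but no Postgres DSN is set")
    return "memory"
```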
When NOC_PROACTIVE_ENABLED=1, the CaseService runtime starts even if shadow or
reactive/control primary flags are not set. In the application runtime, the
proactive loop uses CaseService as its case owner: case state gates investigation
cooldown and report stamping instead of investigations.json, and graph runs use
CaseService-backed memory. The /control/cases surface also reads these
CaseService cases so digest links do not point at disabled legacy routes. Fresh
scans record only freshly scanned raw hotspots, never carried-forward
deep/degraded-cycle hotspots. The older
NOC_CASESERVICE_CONTROL=1 can still force this behavior in isolated loop tests
or custom embeddings.
Reactive report enqueueing can be canaried separately:

    NOC_CASESERVICE_REACTIVE_REPORT=1

With this flag, reactive Alertmanager/Icinga observations still flow through the
non-primary investigation path, but CaseService owns the report idempotency
decision and enqueues report outbox intents for firing cases that should be
surfaced.
The enqueued payload uses a bounded reactive_case_report_v1 schema, marks
monitor text as untrusted/not-for-model-consumption, and strips control/markdown
delimiters from webhook-derived text.
Reactive investigation cooldown is owned by CaseService when reactive-primary webhook intake is enabled. Successful graph investigations are stamped back onto CaseService case state so repeated identical signals can be skipped until the policy cooldown or a signal change.
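
The cooldown rule can be sketched as a pure function over case state. The field names, the idea of comparing an opaque signal value, and the `should_investigate` helper are all assumptions for illustration; the actual policy lives in CaseService.

```python
from typing import Optional


def should_investigate(
    now_s: float,
    last_investigated_s: Optional[float],
    last_signal: Optional[str],
    current_signal: str,
    cooldown_s: float,
) -> bool:
    """Skip repeated identical signals until the cooldown elapses."""
    if last_investigated_s is None:
        return True  # never investigated successfully
    if current_signal != last_signal:
        return True  # the signal changed, so the earlier result is stale
    return now_s - last_investigated_s >= cooldown_s
```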
Reactive webhook intake can be cut over to CaseService with:

    NOC_CASESERVICE_REACTIVE_PRIMARY=1

With this flag, Alertmanager/Icinga webhooks create/update CaseService cases.
Firing cases schedule the graph investigation with a CaseService-derived graph
case, use CaseService-backed graph context/summaries, and CaseService owns
duplicate investigation gating. The flag starts the CaseService runtime even if
NOC_CASESERVICE_SHADOW is not set.
The control case surface can be cut over to CaseService with:

    NOC_CASESERVICE_CONTROL_PRIMARY=1

With this flag, /control/cases, /control/cases/{id}, case events, comments,
decisions, and manual investigations use CaseService as the primary case store.
It also starts the CaseService runtime even if NOC_CASESERVICE_SHADOW is not
set. This path intentionally does not fall back to legacy case storage for old
or unmapped cases. NOC_CASESERVICE_REACTIVE_PRIMARY=1 and
NOC_PROACTIVE_ENABLED=1 also route the /control/cases surface to CaseService
so reactive/proactive primary cases remain operator-visible without legacy
fallback.
Legacy reactive/control fallbacks have been removed: reactive webhooks require
NOC_CASESERVICE_REACTIVE_PRIMARY=1, and control case routes require a
CaseService primary route. See LEGACY_DEPRECATION.md for the removal audit.
Outbox side effects are also opt-in:

    NOC_CASE_OUTBOX_ENABLED=1

The worker processes pending and retry-due failed intents. The default handlers
send report intents to Discord and stamp the case as reported. If
NOC_KNOWLEDGE_CANDIDATE_DIR is set, knowledge_candidate intents write
review-gated learning events there. If NOC_CASE_HANDOFF_REPO (or
NOC_PROACTIVE_HANDOFF_REPO) plus GitHub auth are configured, handoff intents
open or refresh idempotent loop:candidate issues and stamp issue_url on the
case.
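
One pass of such a worker might look like the sketch below. The `Intent` dataclass, its status names, and the fixed backoff are assumptions for illustration; the real worker reads its interval, limit, and backoff from the `NOC_CASE_OUTBOX_*` settings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Intent:
    kind: str                # e.g. "report", "knowledge_candidate", "handoff"
    status: str = "pending"  # assumed states: pending, failed, done
    retry_at: float = 0.0


def due_intents(intents: List[Intent], now: float, limit: int) -> List[Intent]:
    """Select pending intents plus failed ones whose retry time has passed."""
    due = [
        i for i in intents
        if i.status == "pending" or (i.status == "failed" and i.retry_at <= now)
    ]
    return due[:limit]


def process_outbox(
    intents: List[Intent],
    handlers: Dict[str, Callable[[Intent], None]],
    now: float,
    limit: int = 10,
    backoff_s: float = 60.0,
) -> int:
    """Run one worker pass and return how many intents succeeded."""
    processed = 0
    for intent in due_intents(intents, now, limit):
        try:
            handlers[intent.kind](intent)
            intent.status = "done"
            processed += 1
        except Exception:
            # Leave the intent for a later pass instead of dropping it.
            intent.status = "failed"
            intent.retry_at = now + backoff_s
    return processed
```

Because a failed intent is retried, handlers need to be idempotent, which is why the handoff handler refreshes an existing issue instead of opening a duplicate.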
Offline replay is available for sanitized fixtures:

    nocctl replay path/to/observations.json

Fixtures may be either a JSON list of ObservationRecord objects or an object
with an observations list. Replay reports deterministic metrics without live
network access or production credentials.
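
Accepting both fixture shapes takes only a few lines. This is a sketch with a hypothetical `load_observations` helper; it returns plain dicts and does not validate them against the real `ObservationRecord` type.

```python
import json
from typing import Any, List


def load_observations(text: str) -> List[Any]:
    """Accept a bare JSON list or an object with an "observations" list."""
    data = json.loads(text)
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and isinstance(data.get("observations"), list):
        return data["observations"]
    raise ValueError("fixture must be a list or an object with 'observations'")
```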
GET /health/cases reports whether the optional runtime is disabled, healthy,
or degraded, including backend name and pending/failed outbox counts when the
runtime is enabled. The control plane also exposes read-only CaseService views
under /control/case-service/... for projections, events, traces, feedback, and
outbox rows. Reactive-primary, proactive, Discord, and control-primary routes
read and write CaseService case projections. Graph execution requires an
explicit CaseService graph case and graph memory adapter;
there is no implicit graph-runtime legacy case fallback.
Prometheus exports noc_agent_case_service_runtime_enabled,
noc_agent_case_service_shadow_observations_total,
noc_agent_case_service_shadow_failures_total, and
noc_agent_case_service_outbox_processed_total so canaries can watch shadow
parity/failures before any control flag is enabled.
When DISCORD_BOT_TOKEN is present, the service starts a discord.py bot that
supports:
- slash-command investigations
- pending/status lookups
- approve/reject/acknowledge decisions
- mention-driven investigations
Guild, channel, and role allowlists are configured with:
- `DISCORD_ALLOWED_GUILD_IDS`
- `DISCORD_ALLOWED_CHANNEL_IDS`
- `DISCORD_ALLOWED_ROLE_IDS`
The supervisor prompt is assembled from:
- `app/prompts/supervisor_context.md`
- `app/prompts/golden_state_manifest.json`
The manifest is the machine-readable intended-state anchor. Live MCP telemetry is compared against it during investigation so proposals can call out drift instead of inventing a configuration story.
This repo includes read-only AS215932 knowledge context-pack and learning-event
fixtures under evals/knowledge_shadow/. They validate the future
AS215932/knowledge integration in shadow mode only: fixture shape,
citations, policy result, required NOC sections, null vector-score placeholders,
and sanitized learning_ledger_v1 summaries. The production NOC runtime does
not consume these fixtures and this tranche adds no live knowledge service calls.
Run the fixture evals with:

    uv run pytest tests/test_knowledge_shadow.py

Fixtures must never contain MCP responses, Prometheus/Icinga raw output, logs, packet data, commands, credentials, authorization headers, or secrets. Learning events remain A4 fixtures/proposals until human review promotes them elsewhere.
- `NOC_REDIS_URL`
- `NOC_CASESERVICE_SHADOW` (default `0`; best-effort case-service shadow writes)
- `NOC_CASESERVICE_CONTROL` (deprecated for app runtime; forces proactive case-owned cooldown/report state in custom embeddings)
- `NOC_CASESERVICE_REACTIVE_REPORT` (default `0`; enqueue reactive report intents from case state)
- `NOC_CASESERVICE_REACTIVE_PRIMARY` (default `0`; make reactive webhooks use CaseService)
- `NOC_CASESERVICE_CONTROL_PRIMARY` (default `0`; make `/control/cases` use CaseService)
- `NOC_CASE_POLICY_VERSION` (default `case_policy_v1`)
- `NOC_CASE_OUTBOX_ENABLED` (default `0`; process case side-effect outbox intents)
- `NOC_CASE_OUTBOX_INTERVAL_S` / `NOC_CASE_OUTBOX_LIMIT` / `NOC_CASE_OUTBOX_RETRY_BACKOFF_S`
- `NOC_KNOWLEDGE_CANDIDATE_DIR` (optional output dir for review-gated learning events)
- `NOC_CASE_HANDOFF_REPO` (optional handoff repo; falls back to `NOC_PROACTIVE_HANDOFF_REPO`)
- LHP-v1 dormant cross-loop flags (all behavior-changing flags default off):
  - `NOC_LHP_ENABLED`
  - `NOC_ENGINEERING_HANDOFF_DELIVERY_ENABLED` / `NOC_ENGINEERING_HANDOFF_TRANSPORT` / `NOC_ENGINEERING_HANDOFF_REPO`
  - `NOC_KNOWLEDGE_CONTEXT_ENABLED` / `NOC_KNOWLEDGE_EXPORT_SQLITE` / `NOC_KNOWLEDGE_EXPORT_MANIFEST`
  - `NOC_KNOWLEDGE_CONTEXT_MAX_ARTIFACTS` / `NOC_KNOWLEDGE_CONTEXT_MAX_TOKENS_EQUIVALENT` / `NOC_KNOWLEDGE_CONTEXT_TIMEOUT_S`
  - `NOC_CASE_VERIFICATION_ENABLED` / `NOC_CASE_VERIFICATION_DRY_RUN` / `NOC_CASE_AUTO_RESOLVE_ENABLED`
  - `NOC_CASE_VERIFICATION_INTERVAL_S` / `NOC_CASE_VERIFICATION_REQUIRED_CONSECUTIVE_PASSES`
  - `NOC_DISK_ALERT_HANDOFF_ENABLED`
  - `NOC_LHP_CALLBACK_MAX_BYTES` / `NOC_LHP_ENGINEERING_SECRET` (secret value is not exposed through settings)
  - rollout validation/eval documentation: `docs/lhp-v1-rollout-validation.md` and `evals/lhp_v1/`
  - LHP metrics: `noc_agent_lhp_*` counters on `GET /metrics`
- `NOC_DATABASE_URL` / `DATABASE_URL` (optional Postgres case/checkpoint backend DSN)
- `NOC_REQUIRE_POSTGRES` (default `0`; fail loud if Postgres is required but unavailable)
- `NOC_DATABASE_POOL_MIN_SIZE` / `NOC_DATABASE_POOL_MAX_SIZE`
- `NOC_DATABASE_COMMAND_TIMEOUT_S` / `NOC_DATABASE_STATEMENT_TIMEOUT_MS` / `NOC_DATABASE_LOCK_TIMEOUT_MS`
- `HYRULE_MCP_URL`
- `NOC_CONTROL_TOKEN`
- `NOC_APPROVAL_SIGNING_SECRET` (also accepts `HYRULE_MCP_ACTION_SIGNING_SECRET`)
- `NOC_ENABLE_APPROVED_EXECUTION` (default `0`; master switch for any execution)
- `NOC_ENABLE_NOOP_ROLLBACK_GUARDS` (default `0`; route execution through inert no-op guards)
- `NOC_ACTION_AUTH_TTL_SECONDS` (default `300`)
- `DISCORD_BOT_TOKEN`
- `DISCORD_ALLOWED_GUILD_IDS`
- `DISCORD_ALLOWED_CHANNEL_IDS`
- `DISCORD_ALLOWED_ROLE_IDS`
- `OPENROUTER_API_KEY` for the default model backend
- Optional `OPENROUTER_MANAGEMENT_API_KEY` for account-wide credit monitoring
- Optional `OPENROUTER_APP_TITLE` and `OPENROUTER_APP_URL` for OpenRouter attribution
The legacy HYRULE_MCP_CMD path remains accepted for compatibility.
Model selection is configurable via TOML. Lookup order is:
- `NOC_AGENT_CONFIG`
- `/etc/noc-agent/config.toml`
- `config/noc-agent.toml`
- built-in defaults
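
The lookup order can be sketched as a first-match search. `find_config` is a hypothetical helper, and falling through to the next candidate when `NOC_AGENT_CONFIG` points at a missing file is an assumption; the service may treat that as an error instead.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def find_config(
    environ: Mapping[str, str],
    exists: Callable[[Path], bool] = Path.exists,
) -> Optional[Path]:
    """Return the first existing config file, or None for built-in defaults."""
    candidates = []
    if environ.get("NOC_AGENT_CONFIG"):
        candidates.append(Path(environ["NOC_AGENT_CONFIG"]))
    candidates.append(Path("/etc/noc-agent/config.toml"))
    candidates.append(Path("config/noc-agent.toml"))
    for path in candidates:
        if exists(path):
            return path
    return None  # caller falls back to built-in defaults
```

The `exists` parameter is injectable only so the search order can be exercised without touching the filesystem.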
The default chain is OpenRouter DeepSeek V4 Pro with Claude Sonnet 4.6 as a fallback:

    [model]
    primary = "openrouter:deepseek/deepseek-v4-pro"
    fallbacks = ["openrouter:anthropic/claude-sonnet-4.6"]

Any OpenRouter model can be selected with openrouter:<model-slug>. Secrets stay in environment variables, not in the config file. AGENT_MODEL and AGENT_FALLBACK_MODELS still override the config file for emergency changes.
Google/Gemini remains supported for future use:

    [model]
    primary = "google-gla:gemini-3.1-pro"
    fallbacks = []

For the new default backend, migrate from GEMINI_API_KEY to OPENROUTER_API_KEY. /health/model reports active models plus OpenRouter key-limit and usage status. /metrics exports OpenRouter credit/usage gauges.
See TESTING.md.
Part of Hyrule Networks (AS215932).