Merged
17 changes: 16 additions & 1 deletion README.md
@@ -32,7 +32,8 @@ production.
- `docs/agent-loops/`, `docs/agentic-development-loop.md`,
`docs/engineering-loop/` — role cards, runtime reference, and v2 design.
- `integrations/pi/` — the Pi `/loop` extension.
-- `configs/loop/` — systemd service + timer for the operations lane.
+- `configs/loop/` — systemd service + timer config and the Reliability Governor
+capability registry.
- `model-policy.yml`, `engineering-loop-policy.yml` — model/backend routing
and the mutation/publication policy guards.
- Optional AS215932 knowledge context-pack integration is default-off and read-only.
@@ -49,6 +50,11 @@ uvx ruff check src tests

```bash
uv run hyrule-engineering-loop --help
# route intake/candidate issues through deterministic Reliability Governor policy:
uv run hyrule-engineering-loop reliability-governor --once \
--registry configs/loop/capability-registry.yml \
--knowledge-context \
--knowledge-repo /home/svag/Dev/knowledge
# one operations-lane cycle over the core AS215932 loop:approved queues:
uv run hyrule-engineering-loop daemon --once
```
@@ -101,6 +107,15 @@ uv run hyrule-engineering-loop feature CHANGE_ID \
--knowledge-learning-dir .engineering-loop-state/learning-events
```

The Reliability Governor is the Staff SRE control plane for autonomous
operations. It posts a Reliability Decision Record before it changes labels. It
can route issues to `loop:needs-context`, `loop:knowledge-gap`,
`loop:needs-human`, `loop:candidate`, or `loop:approved`; the Engineering daemon
still consumes only `loop:approved`. Production v1 runs it as a timer-driven
reconciler; the later callback-driven shape uses normalized wake events that
trigger reconciliation, never direct approval. See
`docs/engineering-loop/reliability-governor-production.md`.

The daemon's default production scope is the eight core repos:
`engineering-loop`, `network-operations`, `hyrule-cloud`, `hyrule-web`,
`hyrule-mcp`, `noc-agent`, `hyrule-network-proxy`, and `as215932.net`. It runs low-and-slow by
132 changes: 132 additions & 0 deletions configs/loop/capability-registry.yml
@@ -0,0 +1,132 @@
version: 1
capabilities:
- id: tier0.docs-runbooks-tests
domains:
- docs
- runbook
- tests
- dashboard
allowed_repos:
- "*"
allowed_paths:
- docs/
- README.md
- tests/
- .github/
- dashboards/
forbidden_paths:
- secrets/
- "**/secrets/"
- .env
- .env.
target_loops:
- engineering
source_loops:
- human
- noc
- knowledge
- scheduled_miner
max_risk_tier: 0
auto_approve_max_risk_tier: 0
required_evidence:
- knowledge_context
- verification_method
required_checks:
- targeted_tests_or_docs_review
rollback_required: true
verification_required: true
handoff_contract: github_issue_labels
verification_owner: engineering
learning_required: false
success_count: 0
failure_count: 0

- id: tier1.monitoring-alert-tuning
domains:
- monitoring
- alert_tuning
allowed_repos:
- AS215932/network-operations
- AS215932/noc-agent
- AS215932/engineering-loop
allowed_paths:
- docs/
- tests/
- monitoring/
- alerts/
- config/
- app/knowledge/
forbidden_paths:
- secrets/
- "**/secrets/"
- .env
- .env.
target_loops:
- engineering
source_loops:
- human
- noc
- knowledge
- scheduled_miner
max_risk_tier: 1
auto_approve_max_risk_tier: 1
required_evidence:
- knowledge_context
- verification_method
- rollback_plan
required_checks:
- targeted_tests_or_alert_fixture
rollback_required: true
verification_required: true
handoff_contract: github_issue_labels
verification_owner: noc
learning_required: true
success_count: 0
failure_count: 0

- id: tier2.internal-service-low-risk
domains:
- internal_service_code
- provisioning_helper
- non_prod_tooling
allowed_repos:
- AS215932/hyrule-cloud
- AS215932/hyrule-web
- AS215932/hyrule-mcp
- AS215932/noc-agent
- AS215932/engineering-loop
allowed_paths:
- docs/
- tests/
- hyrule_cloud/
- hyrule_web/
- src/
- app/
- scripts/
forbidden_paths:
- secrets/
- "**/secrets/"
- .env
- .env.
target_loops:
- engineering
source_loops:
- human
- noc
- knowledge
- scheduled_miner
max_risk_tier: 2
auto_approve_max_risk_tier: 2
required_evidence:
- knowledge_context
- verification_method
- rollback_plan
required_checks:
- pytest
rollback_required: true
verification_required: true
handoff_contract: github_issue_labels
verification_owner: engineering
learning_required: true
success_count: 0
failure_count: 0
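
One way to read the registry entries above: a change is auto-approvable under a capability only if the repo is allowed, every touched path is allowed and not forbidden, the assessed risk tier does not exceed `auto_approve_max_risk_tier`, and all `required_evidence` is present. The Python sketch below illustrates that reading. It is not the project's implementation: `path_matches`, `auto_approvable`, and the matching rule (trailing `/` or `.` means prefix, `*` means glob, anything else is an exact name) are assumptions made for the example.

```python
from fnmatch import fnmatchcase

# Trimmed copy of the tier1.monitoring-alert-tuning entry, as a plain dict.
CAPABILITY = {
    "allowed_repos": ["AS215932/network-operations", "AS215932/noc-agent"],
    "allowed_paths": ["docs/", "tests/", "monitoring/", "alerts/"],
    "forbidden_paths": ["secrets/", "**/secrets/", ".env", ".env."],
    "auto_approve_max_risk_tier": 1,
    "required_evidence": ["knowledge_context", "verification_method", "rollback_plan"],
}


def path_matches(path: str, pattern: str) -> bool:
    """ASSUMED matching rule; the registry itself does not define one."""
    if "*" in pattern:
        # Glob pattern; a trailing "/" is taken to cover everything beneath it.
        glob = pattern + "*" if pattern.endswith("/") else pattern
        return fnmatchcase(path, glob)
    if pattern.endswith(("/", ".")):
        return path.startswith(pattern)  # directory or dotted prefix
    return path == pattern  # exact file name


def auto_approvable(cap, repo, paths, risk_tier, evidence) -> bool:
    """Hypothetical check: True only if every registry constraint holds."""
    if not any(fnmatchcase(repo, allowed) for allowed in cap["allowed_repos"]):
        return False
    for path in paths:
        # Forbidden paths win over allowed paths.
        if any(path_matches(path, p) for p in cap["forbidden_paths"]):
            return False
        if not any(path_matches(path, p) for p in cap["allowed_paths"]):
            return False
    if risk_tier > cap["auto_approve_max_risk_tier"]:
        return False
    return set(cap["required_evidence"]) <= set(evidence)
```

Under these assumptions, a tier-1 change to `alerts/bgp.yml` in `AS215932/noc-agent` with all three evidence items passes, while the same change touching `monitoring/secrets/token` or lacking `rollback_plan` does not.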
72 changes: 72 additions & 0 deletions configs/loop/hyrule-engineering-loop.service
@@ -34,10 +34,82 @@ ExecStart=/opt/engineering-loop/.venv/bin/hyrule-engineering-loop daemon --once
--repo AS215932/hyrule-mcp \
--repo AS215932/noc-agent \
--repo AS215932/hyrule-network-proxy \
--repo AS215932/as215932.net \
--require-reliability-decision \
--reliability-decision-author Svaag \
--workspace-root /var/lib/engineering-loop/workspace \
--output-root /var/lib/engineering-loop/runs \
--state-dir-path /var/lib/engineering-loop/state \
--memory-dir /var/lib/engineering-loop/workspace/hyrule-infra/memory \
--allow engineering-loop=docs \
--allow engineering-loop=tests \
--allow engineering-loop=.github \
--allow engineering-loop=README.md \
--allow engineering-loop=dashboards \
--allow engineering-loop=monitoring \
--allow engineering-loop=alerts \
--allow engineering-loop=config \
--allow engineering-loop=app/knowledge \
--allow engineering-loop=src \
--allow engineering-loop=scripts \
--allow engineering-loop=app \
--allow hyrule-infra=docs \
--allow hyrule-infra=tests \
--allow hyrule-infra=.github \
--allow hyrule-infra=README.md \
--allow hyrule-infra=dashboards \
--allow hyrule-infra=monitoring \
--allow hyrule-infra=alerts \
--allow hyrule-infra=config \
--allow hyrule-infra=app/knowledge \
--allow hyrule-noc-agent=docs \
--allow hyrule-noc-agent=tests \
--allow hyrule-noc-agent=.github \
--allow hyrule-noc-agent=README.md \
--allow hyrule-noc-agent=dashboards \
--allow hyrule-noc-agent=monitoring \
--allow hyrule-noc-agent=alerts \
--allow hyrule-noc-agent=config \
--allow hyrule-noc-agent=app/knowledge \
--allow hyrule-noc-agent=src \
--allow hyrule-noc-agent=scripts \
--allow hyrule-noc-agent=app \
--allow hyrule-cloud=docs \
--allow hyrule-cloud=tests \
--allow hyrule-cloud=.github \
--allow hyrule-cloud=README.md \
--allow hyrule-cloud=dashboards \
--allow hyrule-cloud=hyrule_cloud \
--allow hyrule-cloud=src \
--allow hyrule-cloud=scripts \
--allow hyrule-cloud=app \
--allow hyrule-web=docs \
--allow hyrule-web=tests \
--allow hyrule-web=.github \
--allow hyrule-web=README.md \
--allow hyrule-web=dashboards \
--allow hyrule-web=hyrule_web \
--allow hyrule-web=src \
--allow hyrule-web=scripts \
--allow hyrule-web=app \
--allow hyrule-mcp=docs \
--allow hyrule-mcp=tests \
--allow hyrule-mcp=.github \
--allow hyrule-mcp=README.md \
--allow hyrule-mcp=dashboards \
--allow hyrule-mcp=src \
--allow hyrule-mcp=scripts \
--allow hyrule-mcp=app \
--allow hyrule-network-proxy=docs \
--allow hyrule-network-proxy=tests \
--allow hyrule-network-proxy=.github \
--allow hyrule-network-proxy=README.md \
--allow hyrule-network-proxy=dashboards \
--allow as215932.net=docs \
--allow as215932.net=tests \
--allow as215932.net=.github \
--allow as215932.net=README.md \
--allow as215932.net=dashboards \
--max-runs-per-day 2 \
--max-cost-usd-per-day 10
# A run never blocks the next timer fire indefinitely; the daemon enforces
54 changes: 54 additions & 0 deletions configs/loop/hyrule-reliability-governor.service
@@ -0,0 +1,54 @@
# /etc/systemd/system/hyrule-reliability-governor.service
# Deploy to: the same dedicated `loop` VM as the Engineering Loop daemon.
#
# Oneshot, timer-driven: one pass scans unlabeled / loop:intake /
# loop:candidate issues, fetches authoritative LHP-v1 payloads from NOC
# CaseService when present, loads Knowledge context, posts a Reliability
# Decision Record, and only then applies deterministic routing labels.

[Unit]
Description=AS215932 Reliability Governor issue-routing pass
After=network-online.target
Wants=network-online.target
# Production routing authority belongs on the dedicated loop VM, not CI.
ConditionEnvironment=!GITHUB_ACTIONS

[Service]
Type=oneshot
User=loop
Group=loop
WorkingDirectory=/opt/engineering-loop
EnvironmentFile=/opt/engineering-loop/.env
# .env provides the loop's GH token for `gh`,
# ENGINEERING_LOOP_NOC_LHP_BASE_URL, ENGINEERING_LOOP_NOC_LHP_SECRET,
# and optional Knowledge MCP overrides.
ExecStart=/opt/engineering-loop/.venv/bin/hyrule-engineering-loop reliability-governor --once \
--repo AS215932/engineering-loop \
--repo AS215932/network-operations \
--repo AS215932/hyrule-cloud \
--repo AS215932/hyrule-web \
--repo AS215932/hyrule-mcp \
--repo AS215932/noc-agent \
--repo AS215932/hyrule-network-proxy \
--repo AS215932/as215932.net \
--registry /opt/engineering-loop/configs/loop/capability-registry.yml \
--state-dir-path /var/lib/engineering-loop/reliability-governor \
--knowledge-context \
--knowledge-mcp-url http://127.0.0.1:8767/mcp \
--knowledge-mcp-transport streamable-http
TimeoutStartSec=600
StandardOutput=journal
StandardError=journal
SyslogIdentifier=reliability-governor

# Security hardening
NoNewPrivileges=yes
PrivateTmp=yes
ProtectHome=yes
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectSystem=strict
ReadWritePaths=/opt/engineering-loop /var/lib/engineering-loop

[Install]
WantedBy=multi-user.target
17 changes: 17 additions & 0 deletions configs/loop/hyrule-reliability-governor.timer
@@ -0,0 +1,17 @@
# /etc/systemd/system/hyrule-reliability-governor.timer
# Deploy to: the same dedicated `loop` VM as the Reliability Governor service.
#
# The Reliability Governor should run ahead of the hourly Engineering daemon so
# newly approved low-risk work is visible by the next daemon cycle.

[Unit]
Description=Schedule the AS215932 Reliability Governor issue-routing pass

[Timer]
OnCalendar=*:0/15
RandomizedDelaySec=120
Persistent=true
Unit=hyrule-reliability-governor.service

[Install]
WantedBy=timers.target
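
For reference, `OnCalendar=*:0/15` fires at minutes 0, 15, 30, and 45 of every hour, and `RandomizedDelaySec=120` then delays each firing by a random amount of up to 120 seconds. The Python sketch below computes the nominal next fire time before that jitter is applied. `next_quarter_hour` is a hypothetical helper written for illustration; it is not part of the project or of systemd.

```python
from datetime import datetime, timedelta


def next_quarter_hour(now: datetime) -> datetime:
    """Return the first :00/:15/:30/:45 boundary strictly after `now`."""
    base = now.replace(second=0, microsecond=0)
    minutes_to_add = 15 - (base.minute % 15)
    return base + timedelta(minutes=minutes_to_add)


# Worst case between two passes, ignoring run time: one 15-minute period
# plus the maximum randomized delay from the timer unit.
MAX_GAP = timedelta(minutes=15) + timedelta(seconds=120)
```

With this schedule the Governor gets about four passes per hour, which is what lets newly approved work become visible before the next hourly daemon cycle.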
31 changes: 26 additions & 5 deletions docs/agentic-development-loop.md
@@ -887,9 +887,10 @@ Phase 23 (v2 Phase E) adds intake and the label-gated triage inbox:

- the inbox is the GitHub issue tracker, gated by two labels:
`loop:candidate` (machine-proposed, awaiting human triage) and
-`loop:approved` (human-blessed, eligible for autonomous runs — the only
-thing the Phase F operations lane will consume). **Nothing in the loop
-can apply `loop:approved`**; a human relabels after review;
+`loop:approved` (eligible for autonomous runs — the only thing the Phase F
+operations lane will consume). Intake miners cannot apply `loop:approved`;
+Phase 29's Reliability Governor may apply it only after posting a
+Reliability Decision Record and passing deterministic capability policy;
- `src/hyrule_engineering_loop/intake/` holds the heartbeat:
`github_issues.py` (org-repo scan, deterministic scoring by label
weights + age + body completeness, fingerprint dedupe, candidate filing
@@ -906,14 +907,34 @@ Phase 23 (v2 Phase E) adds intake and the label-gated triage inbox:
explicit operator action, never implicit); `/loop triage` in Pi shows
the queue.

Phase 29 adds the Reliability Governor:

- product role: **Staff Site Reliability Engineer, Autonomous Operations**;
- loop job titles: Engineering Loop is the Platform/Software Engineer, NOC Loop
is the NOC Engineer / SRE on-call, Knowledge Loop is the Knowledge Engineer,
and Reliability Governor is the Staff SRE control plane;
- it reads unlabeled, `loop:intake`, and `loop:candidate` issues, including
NOC LHP-v1 handoff pointers;
- it treats GitHub prose as untrusted for NOC work and fetches the
authoritative handoff payload from CaseService;
- it loads authority-tiered Hyrule Knowledge context and denies stale,
contradictory, or missing context;
- it emits a Reliability Decision Record as both a GitHub comment and local JSON
before applying any label transition;
- it routes to `loop:needs-context`, `loop:knowledge-gap`, `loop:needs-human`,
`loop:candidate`, or `loop:approved` from deterministic policy;
- production v1 is a timer-driven reconciler; future callbacks are normalized
wake events that cause a fresh reconciliation against GitHub, CaseService,
Knowledge, and CI before any routing decision.
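
The ordering rule in the list above (emit the Reliability Decision Record first, apply the label transition only afterwards) can be sketched in Python as follows. This is a minimal illustration, not the project's code: `route_issue` and its three callables (`post_comment`, `write_json`, `set_label`) are hypothetical stand-ins for the GitHub and filesystem side effects, the record fields are invented for the example, and the relative order of the comment and the local JSON copy is an assumption.

```python
import json
from typing import Callable

# The five routing targets named in the documentation.
ROUTING_LABELS = frozenset({
    "loop:needs-context",
    "loop:knowledge-gap",
    "loop:needs-human",
    "loop:candidate",
    "loop:approved",
})


def route_issue(
    issue: int,
    label: str,
    reason: str,
    post_comment: Callable[[int, str], None],
    write_json: Callable[[str], None],
    set_label: Callable[[int, str], None],
) -> dict:
    """Record the decision, then (and only then) change the label."""
    if label not in ROUTING_LABELS:
        raise ValueError(f"not a routing label: {label}")
    record = {"issue": issue, "label": label, "reason": reason}
    body = json.dumps(record, sort_keys=True)
    # Both record sinks run first; if either raises, the exception
    # propagates and set_label is never reached.
    post_comment(issue, body)
    write_json(body)
    set_label(issue, label)
    return record
```

Because the label change is the last statement, a failure to publish the record leaves the issue's labels untouched, so a later reconciliation pass can retry from the same state.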

Phase 24 (v2 Phase F) adds the operations lane — scheduled, budgeted,
one-item-at-a-time autonomy that still ends at a draft PR:

- `hyrule-engineering-loop daemon --once` runs one cycle: acquire the run
lock, check the per-day budget ledger, pick the highest-scored
`loop:approved` issue, run the full graph, and either publish a **draft
-PR** (clean run — the human pre-authorized the work by applying the
-label; merge stays human-gated) or leave a journaled failure for triage,
+PR** (clean run — the work was pre-authorized through `loop:approved`;
+merge stays human-gated) or leave a journaled failure for triage,
then exit;
- safety rails (`src/hyrule_engineering_loop/daemon.py`): a pid run lock
with stale-lock detection (one cycle at a time); per-run budgets