Hanalyx · remyluslosius · Jun 16, 2026 · Jun 16, 2026
diff --git a/BACKLOG.md b/BACKLOG.md
@@ -96,7 +96,6 @@ Gaps identified comparing `docs/KENSA_OPENWATCH_BOUNDARY.md` against current Ope
 
 | Item | Priority | Status | Notes |
 |------|----------|--------|-------|
-| Wire SSH auth/sudo learning into discovery / intelligence / liveness | P2 | Not started | #566 landed the substrate (`internal/connprofile` + migration `0035`) and the dial-layer mechanism (`ssh.DialOptions.PreferAuth`/`ObservedAuth`), and wired the **scan** path (auth ordering + per-connection sudo-mode probe, recorded per host). The other three SSH paths — OS discovery, OS intelligence collector, liveness privilege probe — still re-probe each cycle instead of leading with the remembered choice. Their `SSHTransport.Dial` (`internal/intelligence/discovery`, shared by the collector) and the `internal/sshprivilege` dialer take no `hostID`, so thread it + a `connprofile` reader/recorder through. Mechanical — reuses everything from #566 |
 | Retention sweep for soft-deleted hosts | P3 | Not started | Soft-deleted hosts (`hosts.deleted_at` set by `internal/host/host.go:298 SoftDelete`) are retained **indefinitely** — no purge job exists anywhere, confirmed by a soft-deleted row from 2026-05-25 still present in the dev DB. The row stays for scan-history/audit integrity but is hidden from every query (`WHERE deleted_at IS NULL`). Add an optional retention sweep (a daemon-orchestration tick that hard-deletes `hosts WHERE deleted_at < now() - $window`, cascading scan history), with an operator-configurable window that defaults to disabled (keep forever). Low urgency — host volume is trivial — but closes unbounded soft-deleted-row growth and gives operators a real "forget this host" path |
 | `PATCH /api/v1/credentials/{id}` — in-place credential update | P2 | Deferred | Frontend uses replace-on-save (`<ReplaceCredentialModal>` runs `POST → DELETE`). Real PATCH would close the orphan-credential failure mode |
 | `POST /api/v1/bulk/hosts/analyze-csv` + `import-with-mapping` | P2 | Deferred | Today the wizard runs CSV analysis client-side and submits row-by-row — no atomic semantics, no "update existing", no row caps |
@@ -153,7 +152,7 @@ Gaps where working functionality can break without an automated test failing (id
 
 | Item | Priority | Notes |
 |------|----------|-------|
-| Live-host SSH/sudo integration test (gated) | P2 | CI has **no test that actually dials a host and runs sudo** — the dial, `sudo -n`, and `sudo -S` paths are only unit-tested at the command-construction level (the `wrap`/auth-ordering output), never against a real box. Add an opt-in, env-gated test (against `~/Documents/openwatch/test_hosts.csv`) that asserts: key auth, password auth, `sudo -n` (NOPASSWD), and `sudo -S` (password) each actually authenticate/escalate and return real results. Skips when the host env var is unset, so it never gates normal CI. Closes the biggest blind spot (live execution is manual today). |
+| Live-host SSH/sudo integration test (gated) | P2 | **Mostly done** | Delivered `internal/ssh/livehost_test.go` — opt-in (`OPENWATCH_LIVE_HOSTS` CSV + `OPENWATCH_LIVE_KEY`), self-skipping, drives the REAL `ssh.Dial` + `ssh.RunSudo`. Per host it asserts the machinery for whatever the host supports: key auth → `ObservedAuth=="key"`, password auth → `"password"`, sudo mode confirmed via `true`, and the real `sudo -S` password-on-stdin path. Validated against the dev fleet (5 key+NOPASSWD hosts pass; key-rejecting / unreachable hosts skip). **Gap:** the password-AUTH branch (`ObservedAuth=="password"`) is unverified live because the dev fleet has `PasswordAuthentication no` everywhere — needs one host with password SSH enabled to exercise it. Optional next: add a `connprofile.Store` round-trip (record→read-back) assertion against a gated DB. |
 | Frontend E2E (Playwright) for critical flows | P2 | **Zero E2E tests** — `0` Playwright files, no config (the CLAUDE.md Playwright note is Python-era). Component-level vitest only, so a wired-up page can be green in vitest and broken in the browser. Stand up Playwright + cover the critical flows: login, the activated Settings pages (Users, Notifications, Security/API tokens, SSO), and host CRUD (add / edit / delete). |
 | Negative-path ACs for security gates | P2 | The scan kill-switch bug this session passed all 988 tests + the specter gate because the scan path simply had **no AC requiring it to honor the switch** — the suite tests specced behavior, so gaps *between* specs slip through. Generalize the pattern of `system-connection-profile/AC-07` (asserts "kill-switch off / key-only cred → no `sudo -S`") across the other security gates: every gate should have a spec'd AC + test for the **disallowed** path, not just the happy path. |
 

diff --git a/internal/ssh/livehost_test.go b/internal/ssh/livehost_test.go
@@ -0,0 +1,322 @@
+// @spec system-connection-profile
+//
+// Live-host SSH/sudo integration test. Unlike every other test in this
+// package (which stubs the transport), this one opens REAL SSH connections
+// to real boxes and runs real sudo, closing the project's biggest test
+// blind spot: the dial, auth-ordering, and sudo `-n`/`-S` paths are
+// otherwise only exercised at the command-construction level, never
+// against a live host.
+//
+// It is OPT-IN and self-skipping: set both
+//
+//	OPENWATCH_LIVE_HOSTS=/path/to/test_hosts.csv   (hostname,ip,username,credential[=password])
+//	OPENWATCH_LIVE_KEY=/path/to/id_rsa             (an OpenSSH private key authorized for the user)
+//
+// to run it. With either unset it t.Skip()s, so it never gates normal CI.
+// The inventory + key live on the operator's workstation, never in the repo.
+//
+// What it asserts per reachable host:
+//
+//	AC-04  key auth dials and ObservedAuth reports "key"
+//	AC-04  password auth dials and ObservedAuth reports "password"
+//	AC-10  RunSudo confirms a sudo mode (nopasswd or password) via `true`
+//	AC-10  leading with SudoPassword exercises the real `sudo -S` wire path
+//
+// These are exactly the observations the per-host connprofile memo records,
+// so a green run is end-to-end proof that the learning inputs are real.
+
+package ssh
+
+import (
+	"bufio"
+	"context"
+	"errors"
+	"os"
+	"strings"
+	"testing"
+	"time"
+
+	cryptossh "golang.org/x/crypto/ssh"
+
+	"github.com/Hanalyx/openwatch/internal/credential"
+)
+
+type liveHost struct {
+	name     string
+	addr     string
+	user     string
+	password string
+}
+
+// loadLiveHosts reads the opt-in inventory + key, or skips the test.
+func loadLiveHosts(t *testing.T) ([]liveHost, string) {
+	t.Helper()
+	csvPath := os.Getenv("OPENWATCH_LIVE_HOSTS")
+	keyPath := os.Getenv("OPENWATCH_LIVE_KEY")
+	if csvPath == "" || keyPath == "" {
+		t.Skip("set OPENWATCH_LIVE_HOSTS (csv) and OPENWATCH_LIVE_KEY (private key) to run live-host SSH tests")
+	}
+
+	keyBytes, err := os.ReadFile(keyPath) // #nosec G304 -- operator-supplied test key path
+	if err != nil {
+		t.Fatalf("read OPENWATCH_LIVE_KEY %q: %v", keyPath, err)
+	}
+	// Sanity-check the key parses before we fan out to every host, so a bad
+	// key fails once with a clear message instead of N opaque dial errors.
+	if _, perr := cryptossh.ParsePrivateKey(keyBytes); perr != nil {
+		t.Fatalf("OPENWATCH_LIVE_KEY %q is not a usable private key: %v", keyPath, perr)
+	}
+
+	f, err := os.Open(csvPath) // #nosec G304 -- operator-supplied test inventory path
+	if err != nil {
+		t.Fatalf("open OPENWATCH_LIVE_HOSTS %q: %v", csvPath, err)
+	}
+	defer func() { _ = f.Close() }()
+
+	var hosts []liveHost
+	sc := bufio.NewScanner(f)
+	first := true
+	for sc.Scan() {
+		line := strings.TrimSpace(sc.Text())
+		if line == "" {
+			continue
+		}
+		cols := strings.Split(line, ",")
+		// Skip the header row (hostname,ip,username,credential).
+		if first {
+			first = false
+			if strings.EqualFold(strings.TrimSpace(cols[0]), "hostname") {
+				continue
+			}
+		}
+		if len(cols) < 3 {
+			continue
+		}
+		h := liveHost{
+			name: strings.TrimSpace(cols[0]),
+			addr: strings.TrimSpace(cols[1]),
+			user: strings.TrimSpace(cols[2]),
+		}
+		if len(cols) >= 4 {
+			h.password = strings.TrimSpace(cols[3])
+		}
+		if h.addr == "" || h.user == "" {
+			continue
+		}
+		hosts = append(hosts, h)
+	}
+	if err := sc.Err(); err != nil {
+		t.Fatalf("scan inventory: %v", err)
+	}
+	if len(hosts) == 0 {
+		t.Fatalf("OPENWATCH_LIVE_HOSTS %q yielded no usable rows", csvPath)
+	}
+	return hosts, string(keyBytes)
+}
+
+// liveSession adapts a real *ssh.Client to SudoSession so RunSudo (the
+// shared sudo primitive) can run against it exactly as production does.
+type liveSession struct{ client *cryptossh.Client }
+
+func (s liveSession) Run(_ context.Context, cmd string) ([]byte, int, error) {
+	sess, err := s.client.NewSession()
+	if err != nil {
+		return nil, -1, err
+	}
+	defer func() { _ = sess.Close() }()
+	out, runErr := sess.CombinedOutput(cmd)
+	return out, exitCodeOf(runErr), nonExitErr(runErr)
+}
+
+func (s liveSession) RunWithStdin(_ context.Context, cmd string, stdin []byte) ([]byte, int, error) {
+	sess, err := s.client.NewSession()
+	if err != nil {
+		return nil, -1, err
+	}
+	defer func() { _ = sess.Close() }()
+	pipe, perr := sess.StdinPipe()
+	if perr != nil {
+		return nil, -1, perr
+	}
+	go func() {
+		_, _ = pipe.Write(stdin)
+		_ = pipe.Close()
+	}()
+	out, runErr := sess.CombinedOutput(cmd)
+	return out, exitCodeOf(runErr), nonExitErr(runErr)
+}
+
+// exitCodeOf returns the remote exit status (0 on success), or -1 for a
+// transport-level error.
+func exitCodeOf(err error) int {
+	if err == nil {
+		return 0
+	}
+	var ee *cryptossh.ExitError
+	if errors.As(err, &ee) {
+		return ee.ExitStatus()
+	}
+	return -1
+}
+
+// nonExitErr passes through transport errors but swallows a clean non-zero
+// exit (that is a valid result, surfaced via the code).
+func nonExitErr(err error) error {
+	var ee *cryptossh.ExitError
+	if err == nil || errors.As(err, &ee) {
+		return nil
+	}
+	return err
+}
+
+// isAuthRejection reports a server-side auth REJECTION (the key isn't
+// authorized, or PasswordAuthentication is off) as opposed to a transport
+// fault. A real fleet is heterogeneous — not every host accepts every
+// method — so a rejection is a host-config fact to tolerate, not an
+// OpenWatch bug. (A handshake/protocol error is neither and stays fatal.)
+func isAuthRejection(err error) bool {
+	s := err.Error()
+	return strings.Contains(s, "unable to authenticate") ||
+		strings.Contains(s, "authentication failed") ||
+		strings.Contains(s, "no supported methods remain")
+}
+
+// isUnreachable reports a connection-level failure (host down / firewalled
+// / wrong subnet) — reachability is the operator's network, not our code,
+// so the host is skipped rather than failed.
+func isUnreachable(err error) bool {
+	s := err.Error()
+	return strings.Contains(s, "no route to host") ||
+		strings.Contains(s, "connection refused") ||
+		strings.Contains(s, "i/o timeout") ||
+		strings.Contains(s, "connect: ") ||
+		strings.Contains(s, "tcp connect failed")
+}
+
+// tryAuth performs one real dial with the given method. It returns a live
+// client on success (caller closes it), nil + ok=false on a tolerated auth
+// rejection, skips the whole subtest on an unreachable host, and fails the
+// test on any other (unexpected, protocol-level) error. On success it
+// asserts ObservedAuth reports the method that authenticated — the exact
+// value the connprofile memo records (spec AC-04).
+func tryAuth(t *testing.T, ctx context.Context, h liveHost, method, wantObserved string, cred *credential.Credential, store KnownHostsStore) (*cryptossh.Client, bool) {
+	t.Helper()
+	var observed string
+	client, err := Dial(ctx, h.addr, 22, cred, DialOptions{
+		Mode:         ModeTOFU,
+		Store:        store,
+		Timeout:      DefaultDialTimeout,
+		ObservedAuth: &observed,
+	})
+	if err != nil {
+		switch {
+		case isUnreachable(err):
+			t.Skipf("%s (%s) unreachable: %v", h.name, h.addr, err)
+		case isAuthRejection(err):
+			t.Logf("%s: %s auth not accepted by host (config): %v", h.name, method, err)
+			return nil, false
+		default:
+			t.Fatalf("%s: %s dial failed with an unexpected (non-auth) error: %v", h.name, method, err)
+		}
+	}
+	if observed != wantObserved {
+		t.Errorf("%s: %s auth ObservedAuth=%q, want %q", h.name, method, observed, wantObserved)
+	}
+	return client, true
+}
+
+// TestLiveHost_AuthAndSudoMatrix dials every inventory host and validates,
+// against the REAL box, the auth + sudo machinery the connprofile learning
+// rests on. The fleet is heterogeneous: a host may accept key auth, or
+// password auth, or both, and use NOPASSWD or password sudo. The test
+// discovers each host's capabilities and asserts the machinery is correct
+// for whatever the host supports — it does NOT demand every method on every
+// host. A host with no usable auth, or that is unreachable, is skipped.
+func TestLiveHost_AuthAndSudoMatrix(t *testing.T) {
+	hosts, key := loadLiveHosts(t)
+
+	for _, h := range hosts {
+		h := h
+		t.Run(h.name, func(t *testing.T) {
+			t.Parallel()
+			ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
+			defer cancel()
+
+			// One known-hosts store per host: the first dial records the
+			// host key (TOFU); any later dial verifies against it.
+			store := NewMemoryStore()
+
+			// --- AC-04: key auth ---
+			keyClient, keyOK := tryAuth(t, ctx, h, "key", PreferKey, &credential.Credential{
+				Username:   h.user,
+				AuthMethod: credential.AuthSSHKey,
+				PrivateKey: key,
+			}, store)
+			if keyClient != nil {
+				defer func() { _ = keyClient.Close() }()
+			}
+
+			// --- AC-04: password auth (only if the inventory carries one) ---
+			var pwClient *cryptossh.Client
+			pwOK := false
+			if h.password != "" {
+				pwClient, pwOK = tryAuth(t, ctx, h, "password", PreferPassword, &credential.Credential{
+					Username:   h.user,
+					AuthMethod: credential.AuthPassword,
+					Password:   h.password,
+				}, store)
+				if pwClient != nil {
+					defer func() { _ = pwClient.Close() }()
+				}
+			}
+
+			if !keyOK && !pwOK {
+				t.Skipf("%s: neither key nor password auth accepted — nothing to exercise", h.name)
+			}
+
+			// Run sudo over whichever auth succeeded.
+			sudoClient := keyClient
+			if sudoClient == nil {
+				sudoClient = pwClient
+			}
+			sess := liveSession{client: sudoClient}
+			bothCred := &credential.Credential{
+				Username:   h.user,
+				AuthMethod: credential.AuthBoth,
+				Password:   h.password,
+				PrivateKey: key,
+			}
+			policy := SudoPolicy{AllowCredentialPassword: true}
+
+			// --- AC-10: confirm the host's sudo mode via the `true` sentinel.
+			_, code, _, observed, serr := RunSudo(ctx, sess, bothCred, policy, "", "true")
+			if serr != nil {
+				t.Fatalf("sudo true on %s: transport error: %v", h.name, serr)
+			}
+			if code != 0 {
+				t.Fatalf("sudo true on %s: exit %d (sudo not usable with this credential)", h.name, code)
+			}
+			if observed != SudoNopasswd && observed != SudoPassword {
+				t.Fatalf("sudo mode not confirmed on %s: observed=%q", h.name, observed)
+			}
+			t.Logf("%s (%s): keyAuth=%v passwordAuth=%v sudo=%s", h.name, h.addr, keyOK, pwOK, observed)
+
+			// --- AC-10: exercise the real `sudo -S` wire path. Leading with
+			// SudoPassword forces `sudo -S -k -p '' true` with the password on
+			// stdin; a NOPASSWD host still accepts it, so this proves the
+			// password-on-stdin path works end to end regardless of mode.
+			// Only meaningful when the credential carries a password.
+			if h.password != "" {
+				_, codeS, usedS, _, sErr := RunSudo(ctx, sess, bothCred, policy, SudoPassword, "true")
+				switch {
+				case sErr != nil:
+					t.Errorf("sudo -S true on %s: %v", h.name, sErr)
+				case !usedS:
+					t.Errorf("sudo -S true on %s: leading with SudoPassword did not take the sudo -S path", h.name)
+				case codeS != 0 && observed == SudoNopasswd:
+					t.Errorf("sudo -S true on %s: exit %d on a NOPASSWD host (should accept regardless of password)", h.name, codeS)
+				}
+			}
+		})
+	}
+}