feat(ssh): wire per-host auth-method learning into discovery/intelligence/liveness#575
Merged
Merged
Conversation
…ence/liveness The connection-profile memo (PR #566) only led the dial with the host's known-good SSH auth method on the compliance-scan path. The other three paths that talk to a managed host -- OS discovery, OS intelligence (collector), and the liveness privilege probe -- still dialed key-first every cycle, re-offering an unauthorized public key to password-only hosts (a failed publickey attempt that counts against MaxAuthTries and can trip fail2ban) on a loop. Extend the shared connprofile store into those three paths: - connprofile.WithHostID / HostIDFrom: context helpers so a transport that only receives host:port+cred can still look up + record the host's profile, without churning the SSHTransport.Dial signature (and its test stubs across discovery + collector). - discovery.SSHTransportProd gains WithProfiles + a dial seam: when a store is wired and the ctx carries a host id, it sets PreferAuth from the recorded method and records ObservedAuth after a successful dial. This one transport is what BOTH discovery and the collector dial through (via collectorSSHAdapter), so both learn at once. - discovery.go / collector.go wrap the ctx with the host id at the dial site (both hostFacts already carry HostID). - sshprivilege.Probe gains WithProfiles: it leads the dial with the recorded method (reordering buildAuthMethods for AuthBoth) and records which method authenticated via a local single-factor observer. - cmd/openwatch wires one shared connprofile.NewStore(pool) across all four paths (the scan now reuses it too). Learning stays best-effort: a missing host id, absent profile row, or store error dials in the default order and never fails the connection (hint, not a lock -- a stale hint self-heals on the next dial). Scope: the SSH auth-method dimension. sudo-mode (NOPASSWD vs password) learning for these three paths stays a follow-up -- they already probe sudo mode correctly each cycle; only the scan learns both today. Spec system-connection-profile -> v1.1.0: C-06, AC-08 (discovery/ collector transport), AC-09 (liveness probe).
remyluslosius
added a commit
that referenced
this pull request
Jun 16, 2026
…ce/liveness (#576) Follow-up to PR #575 (auth-method learning). The same three paths that talk to a managed host still probed sudo mode from scratch every cycle — running a doomed `sudo -n` on a password-sudo host before retrying `sudo -S`. Extend the connprofile memo to the SUDO dimension so each path leads with the host's recorded mode and records the mode confirmed to work. Division of labour: - The liveness privilege probe is the AUTHORITATIVE sudo-mode learner: it runs an innocuous `true` sentinel every ~5 min, so it reliably confirms the mode regardless of any real command's exit code. - OS discovery (firewall probe) and OS intelligence (collector) learn OPPORTUNISTICALLY from their existing real sudo commands — no extra round-trip. To avoid misrecording, a mode is recorded ONLY on a confirmed exit-0 of a given sudo form, never inferred from a command that failed for its own reasons. Mechanics: - ssh.RunSudo (collector's shared primitive) gains a prefer (in) + observed (out): when prefer=SudoPassword and the password gate is satisfied it leads with `sudo -S`, skipping the doomed `sudo -n`; observed reports the confirmed form. Plain-string tokens (SudoNopasswd/SudoPassword) keep the ssh package decoupled from connprofile, matching PreferKey/PreferPassword. - discovery.runSudoWithFallback + probeFirewall thread prefer + return the learned mode; the discovery Service reads it (cfg.prefer) and records it. - sshprivilege.Probe extracts probeSudo: leads with `sudo -S` on a known password host, records the confirmed mode (preserving the AC-18/19/21 error shapes). - collector threads the recorded mode across the cycle's sudo commands and records once at the end. - cmd/openwatch wires the shared connprofile store into the collector and discovery services (the probe already had it from #575). The sudo password GATE (kill-switch + auth-method) is unchanged: leading with `sudo -S` is allowed only when a password may already be fed. Learning stays best-effort — a store miss/error escalates in the default order and never fails the connection; a stale mode self-heals (sudo -S miss falls back to sudo -n). Spec system-connection-profile -> v1.2.0: C-07, AC-10 (RunSudo primitive), AC-11 (discovery firewall probe), AC-12 (liveness probe).
remyluslosius
added a commit
that referenced
this pull request
Jun 16, 2026
Closes the project's biggest test blind spot: the dial, auth-ordering, and sudo -n/-S paths were only unit-tested at the command-construction level (stubbed transport), never against a real box. A wired-up host could regress and every test stay green. internal/ssh/livehost_test.go drives the REAL ssh.Dial + ssh.RunSudo — the primitives every host-talking path (scan, discovery, collector, liveness) shares — against an operator-supplied inventory: OPENWATCH_LIVE_HOSTS=/path/to/test_hosts.csv (hostname,ip,username,credential) OPENWATCH_LIVE_KEY=/path/to/id_rsa With either unset the test t.Skip()s, so it never gates normal CI; the inventory + key stay on the operator's workstation, never in the repo. The fleet is heterogeneous, so the test DISCOVERS each host's capabilities rather than demanding every method everywhere. Per host it asserts the machinery for whatever the host supports: - key auth dials -> ObservedAuth == "key" (the value the memo records) - password auth dials -> ObservedAuth == "password" - sudo mode confirmed via the `true` sentinel (nopasswd | password) - the real `sudo -S -k -p '' true` password-on-stdin path executes A server-side auth rejection (key not authorized, or PasswordAuthentication off) is a tolerated host-config fact; an unreachable host is skipped; only an unexpected protocol-level error or a wrong ObservedAuth/sudo result fails the test. A host with no usable auth is skipped. Validated against the dev fleet: 5 key+NOPASSWD hosts pass (real key dial, sudo -n, and sudo -S all exercised), key-rejecting and unreachable hosts skip. The password-AUTH assertion is live-unverified only because the dev fleet runs PasswordAuthentication=no everywhere (noted in BACKLOG); it runs as soon as one password-enabled host is in the inventory. Also drops the completed "wire SSH auth/sudo learning" backlog entry (shipped in #575 + #576).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Extends the per-host SSH connection memo from PR #566 (which only learned on the compliance-scan path) into the three remaining paths that talk to a managed host:
internal/intelligence/discovery)internal/intelligence/collector)internal/sshprivilege)Each now leads the SSH dial with the host's last known-good auth method and records which method actually authenticated, via the shared
connprofilestore.Why
Without the memo, those paths dialed key-first every cycle. On a password-only host that re-offers an unauthorized public key on a loop (discovery on add, the collector every intelligence cycle, the liveness probe every ~5 min) — a failed
publickeyattempt that counts againstMaxAuthTriesand can trip fail2back. This was the user's original "try SSH key next time / try password next time" ask, applied to the paths #566 left out.How
connprofile.WithHostID/HostIDFrom— context helpers so a transport that only receiveshost:port + credcan still look up + record the host's profile, without changing theSSHTransport.Dialsignature (which would churn the discovery + collector interfaces and 8 test stubs). Best-effort enrichment: no host id on the ctx ⇒ no learning, default dial order.discovery.SSHTransportProd.WithProfiles+ adialseam — reads the profile, setsDialOptions.PreferAuth, recordsObservedAuthafter a successful dial. This one prod transport is what both discovery and the collector dial through (collector viacollectorSSHAdapter), so both learn from a single wiring.discovery.go/collector.gowrap the ctx with the host id at the dial site (bothhostFactsalready carryHostID).sshprivilege.Probe.WithProfiles— leads the dial with the recorded method (reorderingbuildAuthMethodsforAuthBoth; both methods stay offered) and records what authenticated via a small local single-factor observer that mirrorsinternal/ssh's.cmd/openwatchwires one sharedconnprofile.NewStore(pool)across all four paths (the scan reuses it now too).Learning is a hint, not a lock: a missing host id, an absent profile row, or a store error dials in the default order and never fails the connection — a stale hint self-heals on the next dial.
Scope
This PR covers the SSH auth-method dimension. sudo-mode (NOPASSWD vs password) learning for these three paths stays a follow-up — they already probe sudo mode correctly each cycle; only the compliance scan learns both dimensions today. Documented in the spec's
excludes.Spec / tests
system-connection-profile→ v1.1.0: adds C-06 and AC-08 (discovery/collector transport) + AC-09 (liveness probe).TestSSHTransportProd_AuthLearning(4 cases: lead-with + record, no host id, no store, lookup-error non-fatal) andTestPrivilegeProbe_AuthLearning(lead-with + record / no-store). ExistingTestBuildAuthMethodsupdated to the new signature.gofmt/go vet/go buildclean; touched-package suites green against the isolated test DB;specter check0 errors;system-connection-profile9/9 ACs have results, PASSES tier 2.