perf(exec,workspace): reuse SSH connections + stop re-warming skipped submodules #70
Merged
…ed submodules

A fresh objective's checkout was taking 6-8 min in prod, blocking session start. Profiling showed the actual git work is only ~44s on the warm-cache box; the gap was orchestration overhead. orcha runs every prep git subcommand (~50-60 of them) through the SSH executor, which set no ControlMaster/ControlPersist, so each command paid a fresh TCP+crypto handshake plus the remote login shell. Measured quebec->box: 1.15s/connection cold vs 0.19s multiplexed (~6x), i.e. ~60-70s of pure handshake overhead per prep.

Two changes:

1. exec/ssh.go: enable SSH connection multiplexing (ControlMaster=auto, ControlPersist=60, ControlPath=<tmp>/%C) on the non-tty path only. The long-lived interactive -tt sessions are deliberately excluded: their cancellation relies on a per-connection process-group SIGHUP, so they must never share or outlive a persisted master. Fails open if the control dir can't be created. workspace/prepare.go's run() now uses the clean non-tty capture path (it was forcing a tty per command, which both blocked reuse and dragged the remote login shell in on every call).

2. workspace/prepare.go: stop re-warming a submodule the cache already shows is oversized. warmSubmoduleMirror looped over ALL submodules before partitionSubmodulesBySize decided to skip the WPT suite (159k files), so every prep refetched WPT objects (~18.5s) it never checks out. Now only submodules not already known to be oversized from the cache are warmed; a cold or bumped one is still warmed once so it can be sized.

Measured: ControlMaster=auto with the exact emitted options, 15 sequential connections 3.14s vs 17.24s cold. New tests cover both; full suite green.

Claude-Session: https://claude.ai/code/session_018CRsfX79dM42QC2BrW1vtZ
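The warm-time filter in change 2 can be sketched as below. This is a hypothetical illustration, not the code in workspace/prepare.go: the Submodule type, the path@commit cache key, the submodulesToWarm helper and the 50000-file threshold are all assumptions. What it mirrors from the description is the rule itself: skip only submodules the cache already knows are oversized, and keep anything unsized (cold, or a bumped pin that misses the cache) so it is fetched once and measured.

```go
package main

import "fmt"

// Submodule is a hypothetical stand-in for a submodule entry at a given pin.
type Submodule struct {
	Path   string
	Commit string
}

// key identifies a submodule at one pin (assumption), so a bumped commit
// misses the size cache and gets warmed once to be re-measured.
func (s Submodule) key() string { return s.Path + "@" + s.Commit }

// submodulesToWarm drops submodules the size cache already marks as
// oversized; anything unsized is kept so it can be fetched and sized.
// fileCounts and maxFiles are hypothetical parameters.
func submodulesToWarm(subs []Submodule, fileCounts map[string]int, maxFiles int) []Submodule {
	var warm []Submodule
	for _, s := range subs {
		if n, ok := fileCounts[s.key()]; ok && n > maxFiles {
			continue // known oversized: checkout skips it, so don't refetch
		}
		warm = append(warm, s)
	}
	return warm
}

func main() {
	cache := map[string]int{
		"third_party/wpt@aaa111":  159000, // sized on an earlier prep
		"third_party/zlib@bbb222": 300,
	}
	subs := []Submodule{
		{"third_party/wpt", "aaa111"},
		{"third_party/zlib", "bbb222"},
		{"third_party/fresh", "ccc333"}, // never sized: warm once
	}
	for _, s := range submodulesToWarm(subs, cache, 50000) {
		fmt.Println(s.Path)
	}
}
```

Keying the cache by pin rather than by path alone is what lets a freshly bumped submodule be measured again instead of inheriting a stale "oversized" verdict.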
Problem
A fresh objective's first checkout was taking 6–8 minutes in prod (e.g. session e574fde2 prep 6m22s, implementer 3dfac597 8m09s), and the agent only launches after prep, so it shows up as a session stuck "queued" for minutes. It is not the scheduler (placement is instant) and not inherent submodule cost.
Where the time actually goes (measured)
Replayed on the warm-cache box, prep's git steps add up to ~44s of real git work. The 6–8 min gap is orchestration overhead:
- SSH handshakes: orcha runs every prep git subcommand (~50–60 of them) through SSHExecutor, which set no ControlMaster/ControlPersist, so every command is a fresh TCP+crypto handshake plus remote login shell. Measured quebec→box: 1.15s/connection cold vs 0.19s multiplexed (~6×), i.e. ~60–70s of pure handshake overhead per prep.
- Re-warming a skipped submodule: every prep refetched WPT objects (159k files, ~18.5s) it never checks out, because warmSubmoduleMirror looped over all submodules before partitionSubmodulesBySize skipped it.
Changes
1. exec/ssh.go: SSH connection multiplexing, scoped to the non-tty path (ControlMaster=auto, ControlPersist=60, ControlPath=<tmpdir>/%C). The long-lived interactive -tt sessions are deliberately excluded: their cancellation relies on a per-connection process-group SIGHUP, so they must never share or outlive a persisted master. Fails open if the control dir can't be created. workspace/prepare.go's run() now uses the clean non-tty capture path. It was forcing a tty per command, which both blocked reuse and dragged the remote login shell in on every call (hence the "bind: warning" noise).
2. workspace/prepare.go: don't re-warm a known-oversized submodule. Warm only the submodules the cache doesn't already show as oversized. In steady state WPT is sized from the cache and skipped without a re-fetch; a cold or freshly bumped pin is still warmed once so it can be measured (the checkout skip is unchanged).
Validation
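The property TestSSHArgs_MultiplexesNonTTYOnly pins down (multiplex non-tty commands, never -tt sessions, fail open without a control dir) can be sketched as below. This is a hypothetical reconstruction, not the code in exec/ssh.go: sshArgs, controlDirOrEmpty, hasArg and the orcha-ssh directory name are illustrative assumptions. Only the three -o options come from this PR; they are standard OpenSSH client options.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// sshArgs builds ssh(1) arguments for a command on host (hypothetical helper).
// Non-tty commands share a persisted master connection; -tt sessions never do.
// An empty controlDir means multiplexing is unavailable, so the command falls
// back to a plain connection (fail open).
func sshArgs(host string, tty bool, controlDir string) []string {
	var args []string
	switch {
	case tty:
		// Interactive sessions keep their own connection: their cancellation
		// relies on a per-connection process-group SIGHUP.
		args = append(args, "-tt")
	case controlDir != "":
		args = append(args,
			"-o", "ControlMaster=auto", // reuse an existing master, else become one
			"-o", "ControlPersist=60", // keep the idle master alive for 60s
			"-o", "ControlPath="+filepath.Join(controlDir, "%C"), // %C: hash of connection params
		)
	}
	return append(args, host)
}

// controlDirOrEmpty creates the control-socket directory, returning "" on
// failure so callers degrade to unmultiplexed connections.
func controlDirOrEmpty() string {
	dir := filepath.Join(os.TempDir(), "orcha-ssh") // hypothetical directory name
	if err := os.MkdirAll(dir, 0o700); err != nil {
		return ""
	}
	return dir
}

// hasArg reports whether v appears verbatim in args.
func hasArg(args []string, v string) bool {
	for _, a := range args {
		if a == v {
			return true
		}
	}
	return false
}

func main() {
	dir := controlDirOrEmpty()
	fmt.Println(sshArgs("box", false, dir))
	fmt.Println(sshArgs("box", true, dir))
}
```

Keeping the tty branch entirely free of Control* options is the point of the design: a shared master would outlive the session and break the per-connection SIGHUP cancellation.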
- ControlMaster=auto with the exact emitted options, quebec→box: 15 sequential connections in 3.14s vs 17.24s cold.
- New tests: TestSSHArgs_MultiplexesNonTTYOnly (multiplexes non-tty, never -tt) and TestPrepareIsolated_SkipsOversizedSubmoduleWarmCache (an oversized submodule stays skipped on a warm-cache second prep).
- Full suite green; go vet clean.
Expected impact
Per prep: ~60–70s off (handshakes) + ~18s off (WPT warm). Multiplexing helps every non-tty SSH op orcha does, not just prep. Complements #68 (managers skip submodules entirely).
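A back-of-envelope check of the handshake figures, using only the measurements quoted in this PR (the constants below are those numbers; nothing else is assumed):

```go
package main

import "fmt"

// Measurements quoted in this PR (quebec→box, seconds per SSH connection).
const (
	coldPerConn = 1.15 // fresh TCP+crypto handshake plus remote login shell
	muxPerConn  = 0.19 // reusing a ControlMaster connection
)

// coldOverhead is the handshake cost of n unmultiplexed connections.
func coldOverhead(n int) float64 { return float64(n) * coldPerConn }

func main() {
	fmt.Printf("per-connection speedup: %.1fx\n", coldPerConn/muxPerConn)
	// Prep issues ~50–60 git subcommands, each previously a cold connection.
	for _, n := range []int{50, 60} {
		fmt.Printf("%d cold connections: %.1fs\n", n, coldOverhead(n))
	}
	// The 15-connection benchmark, expressed per connection.
	fmt.Printf("bench: %.2fs cold vs %.2fs multiplexed\n", 17.24/15, 3.14/15)
}
```

At ~50–60 commands the cold cost comes to roughly 57–69s, in line with the ~60–70s handshake estimate, and the 15-connection benchmark works out to about 1.15s vs 0.21s per connection, matching the single-connection measurements.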
https://claude.ai/code/session_018CRsfX79dM42QC2BrW1vtZ