
v0.5.1: SSH agent glibc regression hotfix + transport-close diagnostic logging #4

Merged: moreih29 merged 3 commits into main from develop on Jun 1, 2026
moreih29 commented Jun 1, 2026

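Summary

Fixes a regression where SSH connections to certain remote servers (for example, older Ubuntu releases) failed with "SSH workspace validation failed", and strengthens logging so that this class of failure can be diagnosed from main.log alone. Emergency patch (v0.5.1).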

Commits included

  • fix(agent): force CGO_ENABLED=0 so the Linux agent links statically
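    • The linux/amd64 agent is built natively on the CI runner with CGO=1, so net and os/user link dynamically against the build host's glibc, and the agent fails to start on older distros (e.g. Ubuntu 20.04, glibc 2.31) with GLIBC_x.xx not found. CGO_ENABLED=0 plus -tags=netgo,osusergo makes every target statically linked.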
  • fix(agent): preserve exit code/signal/stderr on SSH transport close
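    • On close, preserve the exit code, signal, and stderr tail and record them as the cause. Classify glibc/loader/architecture errors as server.spawn-failed (shown to the user as "Remote agent failed to start"). Add classifier regression tests. Diagnostics go to the local main.log only; raw stderr is not forwarded to the renderer.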
  • chore: bump version to 0.5.1

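Verification

  • Confirmed with a dev build (dev:fresh) that connecting to the previously failing server now succeeds (the static binary is re-uploaded automatically).
  • Local: typecheck 0 / lint 0 / 3014 pass · 0 fail.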

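⚠️ Protocol & Remote impact

  • Re-upload required on first SSH boot: the agent binary is rebuilt statically, so its SHA changes. It is re-uploaded automatically the first time an existing SSH workspace connects (no user action needed).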

🤖 Generated with Claude Code

moreih29 and others added 3 commits June 1, 2026 14:56
The linux/amd64 agent is a *native* build on the CI runner, where Go
defaults CGO_ENABLED to 1. Because the agent imports "net" (and os/user),
the binary then dynamically links the runner's glibc and requires that
exact version at runtime — so it failed on older distros (e.g. Ubuntu
20.04, glibc 2.31) with "version `GLIBC_x.xx' not found", surfacing in
the app as an empty-cause ssh.unknown ("SSH workspace validation failed").

Force CGO_ENABLED=0 and add -tags=netgo,osusergo so every target is a
fully static binary, independent of the build host's glibc. Update the
release.yml comment that incorrectly assumed CGO was already unused.
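
One way to confirm that a build produced with CGO_ENABLED=0 and -tags=netgo,osusergo really is static is to inspect the resulting ELF file. The following is a minimal sketch using Go's standard debug/elf package; the helper name isStaticELF and the command-line usage are assumptions made for illustration and are not part of this PR.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// isStaticELF (hypothetical helper) reports whether the ELF file at path
// declares no shared-library dependencies (no DT_NEEDED entries) and has no
// PT_INTERP program header, i.e. it does not need a dynamic loader or glibc.
func isStaticELF(path string) (bool, error) {
	f, err := elf.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	libs, err := f.ImportedLibraries()
	if err != nil {
		return false, err
	}
	if len(libs) > 0 {
		return false, nil // e.g. libc.so.6 listed as a dependency
	}
	for _, p := range f.Progs {
		if p.Type == elf.PT_INTERP {
			return false, nil // binary asks for a dynamic loader
		}
	}
	return true, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkstatic <elf-binary>")
		os.Exit(2)
	}
	static, err := isStaticELF(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("static:", static)
}
```

A check like this could run in CI after the agent build so that a dynamically linked binary is caught before release rather than on a user's server.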

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

A child-process close produced a bare createSshError("ssh.unknown") with
no cause, and unclassified stderr lines were dropped — so a fatal loader
error (e.g. glibc mismatch) left only an empty-cause ssh.unknown in the
log, making such failures undiagnosable from main.log alone.

- pipe: retain a bounded ring buffer of recent stderr lines; notifyClose
  now returns a stderrTail; classified stderr attaches the offending line
  as the SshError cause; raise the logged cause budget 300 -> 600 chars.
- reconnecting channel: capture {code, signal, stderrTail} on close and
  pass it to closeError.
- ssh channel: render that context into the ssh.unknown cause string.
- stderr-patterns: classify glibc/loader/arch errors as server.spawn-failed
  so the user sees "Remote agent failed to start" instead of the generic
  transport error. Add regression tests.

All diagnostics go to the local main.log only; raw stderr is still never
forwarded to the renderer.
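
The two ideas above, a bounded buffer of recent stderr lines and a pattern-based classifier, can be sketched as follows. The client code is not shown on this page, so this is an illustration in Go under stated assumptions: the type stderrRing, the function classify, and the exact patterns are hypothetical, and only the error codes server.spawn-failed and ssh.unknown come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stderrRing (hypothetical) keeps only the most recent max stderr lines.
type stderrRing struct {
	lines []string
	max   int
}

func newStderrRing(max int) *stderrRing {
	if max < 1 {
		max = 1
	}
	return &stderrRing{max: max}
}

// push appends a line, dropping the oldest one once the buffer is full.
func (r *stderrRing) push(line string) {
	if len(r.lines) < r.max {
		r.lines = append(r.lines, line)
		return
	}
	copy(r.lines, r.lines[1:])
	r.lines[len(r.lines)-1] = line
}

// tail returns the retained lines, oldest first.
func (r *stderrRing) tail() string { return strings.Join(r.lines, "\n") }

// spawnFailure holds assumed patterns for glibc, loader, and architecture
// errors; the project's real pattern list is not shown in this PR.
var spawnFailure = regexp.MustCompile(
	`(?i)GLIBC_[0-9.]+' not found|error while loading shared libraries|exec format error`)

// classify maps a stderr line to an error code.
func classify(line string) string {
	if spawnFailure.MatchString(line) {
		return "server.spawn-failed"
	}
	return "ssh.unknown"
}

func main() {
	ring := newStderrRing(2)
	for _, l := range []string{
		"starting agent",
		"./agent: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found",
	} {
		ring.push(l)
		fmt.Println(classify(l))
	}
	fmt.Println(ring.tail())
}
```

Keeping the buffer bounded means a chatty remote process cannot grow memory without limit, while the last few lines, usually the ones that explain a crash, remain available as the cause.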

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@moreih29 moreih29 merged commit c26dc2a into main Jun 1, 2026
1 check passed
moreih29 added a commit that referenced this pull request Jun 2, 2026
Follow-up to the idle watchdog (e528ccd). Review surfaced six issues
across correctness, scope, and recovery; this addresses all of them.

#1 Monotonic clock: lastInbound was stored as wall-clock UnixNano and
compared via time.Since on a time.Unix value, which silently falls back
to wall-clock arithmetic. A laptop waking from sleep (local agent) or an
NTP step (remote) made elapsed jump past the limit and reap a live
session. Now anchored to a monotonic startMono via stampInbound/
idleElapsed.

#2 Scope: the watchdog ran for local agents too, where parent death
already arrives as stdin EOF (plus Pdeathsig on Linux) — pure downside.
Now gated on a new --idle-watchdog flag the SSH launch sets and the local
launch omits.

#3 Threshold: 60s limit with 3-ping margin was tight enough that a
stalled Electron main thread (ping is event-loop bound; ssh ServerAlive
is not) could trip it. Widened to 90s limit / 15s ping (6 slots), with
the check interval decoupled to limit/6 so the kill window stays tight.

#4 Contract: client ping was gated on heartbeat advertisement, the agent
watchdog on nothing — drift-prone. The agent now advertises idleWatchdogMs
in the Ready frame; the client pings iff positive, at idleWatchdogMs/6.

#5 Orphans: drainAndExit reaches os.Exit, which skips the `defer
pty.Close()`. Linux survived via Pdeathsig; a darwin remote (supported,
shipped) had only SIGHUP-on-fd-close, so SIGHUP-ignoring children
orphaned. PTY cleanup is now a shutdown hook that SIGKILLs each process
group on every OS.

#6 Recovery: the watchdog exited 0, which the client's handleClose treats
as a clean terminal exit (no reconnect). On a false positive (client
alive but stalled) the session died permanently. Now exits 75
(EX_TEMPFAIL) so the client reconnects.
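
The exit-code convention can be sketched as follows. Only the value 75 (EX_TEMPFAIL) and the rule that exit 0 means a clean terminal exit come from the message above; the names onAgentExit and closeAction are hypothetical, and treating every non-zero code as reconnectable is an assumption.

```go
package main

import "fmt"

// exTempFail is EX_TEMPFAIL from sysexits.h: a temporary failure where the
// caller is invited to retry.
const exTempFail = 75

type closeAction string

const (
	actionDone      closeAction = "done"      // clean exit, do not reconnect
	actionReconnect closeAction = "reconnect" // transient failure, try again
)

// onAgentExit (hypothetical) decides what the client does when the agent
// process exits. Assumption: any non-zero code is treated as reconnectable.
func onAgentExit(code int) closeAction {
	if code == 0 {
		return actionDone
	}
	return actionReconnect
}

func main() {
	fmt.Println(onAgentExit(0))          // done
	fmt.Println(onAgentExit(exTempFail)) // reconnect
}
```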

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>