fix(mountsync): converge large workspaces — bootstrap timeout, safe fast-path, resumable pull#166
Conversation
Add additive, omitempty BootstrapComplete/BootstrapCursor/ BootstrapFilesSynced/BootstrapFilesTotal/BootstrapStartedAt to mountState; mirror cursor-free progress into publicState.Bootstrap and the CLI status surface (syncStateBootstrap). Legacy state files load with zero values (self-healing forced full reconcile later). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resolveLatestEventCursor now derives its own short deadline from rootCtx (RELAYFILE_CURSOR_TIMEOUT, default 20s) instead of riding the inbound per-cycle ctx, so a slow events feed can no longer wedge a healthy cycle. Adds cursorTimeout/bootstrapTimeout/bootstrapIdleTimeout/ forceFullReconcile resolution to NewSyncer (option -> env -> default) and the SyncerOptions surface. Signature/return contract unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-tree pull Add bootstrapContext() deriving the heavy full-pull deadline from rootCtx (NOT the tiny per-cycle ctx): hard-cap mode when RELAYFILE_BOOTSTRAP_TIMEOUT>0, else a progress-extension watchdog that cancels only after RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT (default 90s) with no applied-file progress. touch() is called per ListTree page and per applied file in both full-tree and export paths; the watchdog goroutine is torn down on every pullRemote exit (no leak). The periodic full pull and post-fast-path full pull now use this context; the surrounding ListEvents probes stay on the per-cycle ctx (no latency regression). pullRemoteFullTree is resumable: resumes from a persisted BootstrapCursor, persists cursor + filesSynced after each page, bumps ListTree page size 10->200, and on full completion calls markBootstrapComplete (clearing cursor/progress). A traversal resumed from a persisted cursor SKIPS the snapshot delete pass (partial listing) to preserve the #164/#165 mount-root-clobber invariants; the next full empty-cursor cycle does the authoritative delete. Export is atomic so it sets completion directly. markSyncSuccess never sets completion. New --bootstrap-timeout/--cursor-timeout/--full-reconcile flags + env on both daemons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d60b6) Replace the unsafe LastEventAt fast-path heuristic with the authoritative BootstrapComplete gate: skip the bootstrap full pull ONLY when the workspace has been fully mirrored at least once AND no --full-reconcile/RELAYFILE_FORCE_FULL_RECONCILE escape hatch is set. A non-empty state with BootstrapComplete=false (clobber remnant or interrupted prior bootstrap) now self-heals by forcing a one-shot full reconcile, with a diagnostic log line. The force flag is cleared after one successful full reconcile (one-shot, see markBootstrapComplete). Updates the two restart-fast-path tests in the same commit (seed BootstrapComplete:true) so the suite stays green; the LastEventAt-only seeds would otherwise (correctly) no longer short-circuit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
savePublicState sets status="bootstrapping" + a public Bootstrap block (phase/filesSynced/filesTotal/startedAt; cursor omitted) while a bootstrap is in progress. The CLI runMountLoop runCycle and 10-min stall detector suppress stallReason and log "bootstrapping N/M" when .relay/state.json shows an in-progress bootstrap; runStatus prints "bootstrapping: N/M files (started <humanized>)" and suppresses the generic stall line. The standalone relayfile-mount run() logs "bootstrapping N/M" on DeadlineExceeded mid-bootstrap instead of a bare cycle-failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New bootstrap_test.go: decoupled-from-cycle-timeout, progress-extension (alive + stall-cancel), resume-from-persisted-cursor, complete-gates- fast-path (rw_517d60b6 repro), force-full-reconcile env self-clear, independent cursor timeout, status surfaces phase, resumed traversal skips snapshot delete pass (#164/#165 guard), and a watchdog goroutine-leak assertion. integration_test.go gains a ~600-file intermittently-failing slow paginated chaos test asserting eventual full convergence under a tiny per-cycle timeout. Also scales the bootstrap watchdog poll interval to the idle window (min idle/3, 10ms floor) so short windows cancel promptly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cursor The BootstrapComplete gate alone let backends whose events feed is always-empty-but-not-404 (e.g. some mock/degraded feeds) seed an empty EventsCursor and short-circuit forever after the first bootstrap: the periodic full-pull cadence keys off a non-empty EventsCursor, so new remote files would never land (regressed TestProductizedCloudMountE2EProof). Require resolveLatestEventCursor to return a non-empty tip before skipping; otherwise fall through to the idempotent full pull. This restores the safety the old LastEventAt gate provided without reintroducing the rw_517d60b6 partial-mirror hazard (still gated on BootstrapComplete). Test client now provides a usable event tip where the fast-path is expected to engage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh-eyes review: the periodic full-pull path called bcancel() unconditionally instead of via defer, leaking the watchdog goroutine if pullRemoteFull panics. Match the deferred-cancel pattern already used on the post-fast-path full-pull sibling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… large bootstrap The decoupled bootstrap context (prior commits) was necessary but not sufficient. Every syncer request still flowed through an *http.Client whose .Timeout was hardwired to the per-cycle mount timeout (default 15s). net/http enforces http.Client.Timeout as a whole-request wall-clock INDEPENDENT of context, so it killed the bootstrap full-tree body read mid-stream regardless of the rootCtx-derived bootstrap context: mount sync cycle failed: net/http: request canceled (Client.Timeout or context cancellation while reading body) Fix: introduce mountsync.NewSyncTransport()/NewSyncHTTPClient() with granular timeouts (DialContext 10s, TLSHandshake 10s, ResponseHeaderTimeout 30s, ExpectContinue 1s, IdleConn 90s) and NO total-request cap. All four syncer client construction sites now use it, and NewHTTPClient normalizes any caller-supplied non-zero Timeout to 0 so the bootstrap path can never be whole-request-capped. Context-coverage audit: every request path remains bounded by a context deadline - incremental -> per-cycle ctx (RELAYFILE_MOUNT_TIMEOUT); bootstrap/periodic full-pull -> bootstrap ctx (progress-extending + idle watchdog); resolveLatestEventCursor -> cursorTimeout ctx; writeback/delete -> per-cycle ctx. No path relied solely on http.Client.Timeout for cancellation, so removing it cannot cause an infinite hang. The 'conflict deleting /digests/today.md; remote changed' loop is a downstream symptom, not a separate defect: applyRemoteFile already re-pulls and re-tracks the remote-changed file with its fresh revision, which converges on the next cycle - but only if the cycle is not aborted first by the whole-request timeout. Fixing the timeout gap removes the cycle abort that prevented convergence; no separate fix needed. Tests: assert NewSyncHTTPClient has Timeout==0 with the expected granular transport; assert NewHTTPClient normalizes an explicit Timeout; regression test streaming a slow-but-progressing body for ~18s (past the old 15s cap) that now succeeds. mount_root_clobber and invariants suites untouched and green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR introduces resumable bootstrap with persistent state tracking, granular timeout controls, and a watchdog-based idle-timeout mechanism for long-running operations. HTTP timeouts are refactored from whole-request caps to per-operation deadlines with context-based cancellation, enabling progress-driven streaming for large initial syncs. Bootstrap progress is now exposed in operator status output and CLI monitoring. ChangesBootstrap Resumption and Timeout Model
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes This PR introduces a substantial timeout and bootstrap state model with persistent resumption, affecting core reconciliation logic, HTTP client construction, status reporting, and deadline handling across CLI, mount main, and syncer internals. The changes involve high-density logic (watchdog goroutines, progress-touch callbacks, cursor persistence), multiple interacting components (timeouts, state fields, context derivation), and comprehensive test coverage. While most changes are contained to mountsync and CLI logic, the breadth of modifications to restart fast-path gating, full-pull streaming, stall suppression, and operator status surface requires careful review of interaction between layers. Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Relayfile Eval ReviewRun: Passed: 4 | Needs human: 0 | Reviewable: 0 | Missing output: 0 | Failed: 0 | Skipped: 0 Human Review CasesNo reviewable human-review cases captured Relayfile output. |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/relayfile-cli/main.go (1)
3633-3658:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDocument the new mount flags in the custom help output.
These flags are wired up here, but
relayfile mount --helpdoes not show them because this command discards the default flag help and printsprintMountHelpinstead. Right now--bootstrap-timeout,--cursor-timeout, and--full-reconcileare effectively undiscoverable from the CLI.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cmd/relayfile-cli/main.go` around lines 3633 - 3658, The new flags bootstrapTimeout, cursorTimeout and fullReconcile are registered with the FlagSet but never documented because mount uses a custom printMountHelp; update the custom help to include these flags by adding their descriptions to printMountHelp (or by emitting fs.PrintDefaults()/FlagSet.VisitAll output when printing help for the mount command) so --bootstrap-timeout, --cursor-timeout and --full-reconcile become discoverable; locate the flag registrations (bootstrapTimeout, cursorTimeout, fullReconcile) and modify printMountHelp to include matching lines or programmatically include the FlagSet's defaults when fs.Parse is bypassed.
🧹 Nitpick comments (1)
internal/mountsync/syncer.go (1)
259-271: 💤 Low valueMutating caller's
http.Clientcould surprise callers who share the object.This directly modifies
httpClient.Timeouton the caller's object. If a caller passes an*http.Clientthey also use elsewhere (e.g., for non-bootstrap operations where a timeout cap is desired), this mutation silently disables their timeout. Consider cloning the client or documenting this prominently.The existing comment explains the rationale, so this is a minor concern given the defensive intent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/mountsync/syncer.go` around lines 259 - 271, Don't mutate the caller's *http.Client; instead make a shallow copy, adjust the copy's Timeout and Transport, and use that copy. Concretely, replace direct writes to httpClient.Timeout/Transport with a local clone (e.g. c := *httpClient; client := &c), set client.Timeout = 0 and if client.Transport == nil set client.Transport = NewSyncTransport(), and continue using client in the surrounding code (references to httpClient should be switched to the cloned client).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.trajectories/index.json:
- Line 244: Replace the absolute filesystem paths used in the JSON "path" fields
with repository-relative paths so entries remain portable; locate the "path"
keys that contain values like
"/private/tmp/relayfile-bootstrap-fix/.trajectories/.../traj_nxbsptr6c5q0.json"
(and the similar entries mentioned) and change them to
".trajectories/.../traj_nxbsptr6c5q0.json" (i.e., drop the leading
"/private/tmp/relayfile-bootstrap-fix/") so all "path" values use the same
relative ".trajectories/..." format.
In `@cmd/relayfile-cli/main.go`:
- Around line 3009-3016: The retry path removed the http.Client.Timeout which
leaves requests running under context.Background(), allowing HandleLocalChange
in the relayfile writeback retry to hang indefinitely after headers succeed;
restore an end-to-end deadline by wiring a request-scoped timeout into the
client used by mountsync.NewHTTPClient (or by enforcing a per-request context
with timeout inside the retry loop that calls HandleLocalChange), e.g., create a
context.WithTimeout around the code path that performs the upload/retry and pass
that context into the transport/client so newWritebackFailureTransport /
mountsync.NewSyncTransport operations are bounded; ensure the timeout value is
configurable and applied consistently for the code paths referenced
(mountsync.NewHTTPClient, newWritebackFailureTransport, HandleLocalChange).
- Around line 6934-6943: The code currently clears stallReason for any reconcile
error when a bootstrap status exists; change the bootstrap handling (the block
using readBootstrapStatus(localDir), stallReason, writeSnapshot) to only
suppress and return early when the reconcile error is a deadline exceeded error
by wrapping the existing branch in a check like errors.Is(err,
context.DeadlineExceeded) (or equivalent); if the error is not a deadline
exceeded, do not clear stallReason or return early so real failures (auth/local
I/O/bad responses) continue to surface. Ensure you import/usage errors and
context as needed and reference the existing readBootstrapStatus, stallReason,
writeSnapshot identifiers.
In `@internal/mountsync/bootstrap_test.go`:
- Around line 252-258: The test sets s.bootstrapIdleTimeout to 80ms but gives
Reconcile a too-short inbound deadline (context.WithTimeout(..., 1ms)) which
lets the per-cycle deadline mask the idle-timeout behavior; change the test to
create a context deadline comfortably longer than s.bootstrapIdleTimeout (e.g.,
>80ms, like 200ms) when calling s.Reconcile, record the start time, call
s.Reconcile(ctx), assert that it returns an error, and then assert the elapsed
time is at least s.bootstrapIdleTimeout to prove the failure came from the idle
window; reference s.bootstrapIdleTimeout, Reconcile, and context.WithTimeout to
locate and update the test accordingly.
In `@internal/mountsync/integration_test.go`:
- Around line 369-375: The convergence check currently uses ">= fileCount" which
can hide stray files; update the condition in the loop that sets "converged" to
require an exact match by using "countLocalFiles(t, localDir) == fileCount"
(referencing countLocalFiles, localDir, fileCount, converged) so the test only
succeeds when the local directory has exactly the expected number of files;
adjust any related failure message if needed to reflect exact-match semantics.
---
Outside diff comments:
In `@cmd/relayfile-cli/main.go`:
- Around line 3633-3658: The new flags bootstrapTimeout, cursorTimeout and
fullReconcile are registered with the FlagSet but never documented because mount
uses a custom printMountHelp; update the custom help to include these flags by
adding their descriptions to printMountHelp (or by emitting
fs.PrintDefaults()/FlagSet.VisitAll output when printing help for the mount
command) so --bootstrap-timeout, --cursor-timeout and --full-reconcile become
discoverable; locate the flag registrations (bootstrapTimeout, cursorTimeout,
fullReconcile) and modify printMountHelp to include matching lines or
programmatically include the FlagSet's defaults when fs.Parse is bypassed.
---
Nitpick comments:
In `@internal/mountsync/syncer.go`:
- Around line 259-271: Don't mutate the caller's *http.Client; instead make a
shallow copy, adjust the copy's Timeout and Transport, and use that copy.
Concretely, replace direct writes to httpClient.Timeout/Transport with a local
clone (e.g. c := *httpClient; client := &c), set client.Timeout = 0 and if
client.Transport == nil set client.Transport = NewSyncTransport(), and continue
using client in the surrounding code (references to httpClient should be
switched to the cloned client).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: dea5bfd6-0c4f-43ce-8f87-5027dc8b8cde
📒 Files selected for processing (8)
.trajectories/index.jsoncmd/relayfile-cli/main.gocmd/relayfile-mount/main.gointernal/mountsync/bootstrap_test.gointernal/mountsync/integration_test.gointernal/mountsync/sync_transport_test.gointernal/mountsync/syncer.gointernal/mountsync/syncer_test.go
| return nil | ||
| } |
There was a problem hiding this comment.
🔴 snapshotDeleteUnsafe check shadows !startedFromEmpty, preventing resumed bootstraps from ever completing
In pullRemoteFullTree, the snapshotDeleteUnsafe check at line 2264 fires before the !startedFromEmpty check at line 2279. A resumed bootstrap traversal only populates remotePaths with the tail portion of the tree (from the persisted cursor onward). When the tail is less than 50% of the total tracked file count, snapshotDeleteUnsafe returns true and the function exits at line 2267 without calling markBootstrapComplete(). The !startedFromEmpty branch at line 2279 — which correctly calls markBootstrapComplete() — is never reached.
Full impact trace for a 400-file workspace interrupted at page 3/4
Cycle 1: Pages 1-3 applied (300 tracked files), page 4 fails. State: BootstrapCursor=cursor_p3, BootstrapComplete=false.
Cycle 2 (resume): Traversal resumes from cursor_p3, page 4 applied (~100 files). remotePaths=100, tracked=400. snapshotDeleteUnsafe(100): 100 < 400*0.5=200 → true. Returns nil WITHOUT markBootstrapComplete(). BootstrapCursor still points to cursor_p3.
Every subsequent periodic full-pull (every ~10 min): resumes from stale cursor_p3, reads only the last ~100 files, snapshotDeleteUnsafe fires again. The periodic full-pull safety net is permanently broken (never does a real full pull), the restart fast-path never engages (BootstrapComplete stays false), and the stale BootstrapCursor is never cleared.
(Refers to lines 2264-2268)
Prompt for agents
In pullRemoteFullTree (syncer.go), the snapshotDeleteUnsafe check at line 2264 fires before the !startedFromEmpty check at line 2279. When a resumed traversal completes, remotePaths only covers the tail of the tree, so snapshotDeleteUnsafe incorrectly flags the listing as degraded. This prevents markBootstrapComplete() from being called, creating an infinite loop where every periodic full-pull re-reads just the tail.
Fix: move the !startedFromEmpty check (lines 2279-2284) BEFORE the snapshotDeleteUnsafe check (lines 2264-2268). A resumed traversal is already known to be partial, so the snapshot-delete safety check is not applicable — it is designed for degraded cloud responses, not intentionally partial listings. Alternatively, skip the snapshotDeleteUnsafe check entirely when startedFromEmpty is false, since the delete pass is already skipped in that branch.
Was this helpful? React with 👍 or 👎 to provide feedback.
| } else if httpClient.Timeout != 0 { | ||
| // A caller passed an explicit whole-request Timeout. That cap is | ||
| // fatal to the bootstrap full-pull (net/http enforces it | ||
| // independent of context, killing a long-but-progressing body | ||
| // read). Normalize it away and ensure a granular sync transport | ||
| // is in place; cancellation remains fully covered by the | ||
| // per-cycle / bootstrap / cursor contexts. This guarantees the | ||
| // bootstrap path is never whole-request-capped no matter how the | ||
| // client was constructed. | ||
| httpClient.Timeout = 0 | ||
| if httpClient.Transport == nil { | ||
| httpClient.Transport = NewSyncTransport() | ||
| } |
There was a problem hiding this comment.
🟡 NewHTTPClient unconditionally zeroes caller's http.Client.Timeout, removing FUSE mount's per-request timeout
NewHTTPClient (syncer.go:259-268) mutates the caller's *http.Client by forcibly setting Timeout = 0 when it was non-zero. The FUSE mount path (cmd/relayfile-mount/fuse_mount.go:20) passes &http.Client{Timeout: cfg.timeout} — this timeout intentionally bounded individual FUSE API requests (ReadFile, ListTree, WriteFile, DeleteFile) that back synchronous kernel filesystem operations. After the mutation, FUSE API calls have no whole-request timeout, relying only on ResponseHeaderTimeout: 30s from NewSyncTransport. The FUSE mount does not use bootstrap and has no context-based cancellation for individual API calls, so the explicit Timeout was the primary safety mechanism against hung requests blocking kernel-level filesystem operations.
Prompt for agents
NewHTTPClient (syncer.go:259-271) unconditionally zeroes the caller's http.Client.Timeout. This is needed for the poll-mode bootstrap path but breaks the FUSE mount path (fuse_mount.go:20) which relies on the whole-request Timeout to bound individual FUSE API calls.
Possible approaches:
1. Instead of mutating the caller's http.Client, clone it internally: create a new http.Client with Timeout=0 and the original (or new) Transport, leaving the caller's object untouched.
2. Have the FUSE mount path construct its client differently (e.g., passing a flag or using a separate constructor) so it is not subject to the bootstrap normalization.
3. Add a parameter or option to NewHTTPClient that controls whether the Timeout should be preserved (for FUSE) or normalized away (for poll/bootstrap).
The key invariant: poll-mode bootstrap must have Timeout=0, FUSE-mode must retain its per-request Timeout.
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
A mount against a large workspace (live repro:
rw_517d60b6, 581 files / 419 dirs) never converges — it permanently stalls at a handful of files. This is not the cloud OOM (that was fixed in cloud#730 and is deployed/verified). It is a set of compounding client-side defects in the mount daemon, exposed once the cloud could actually serve large workspaces fast.Five compounding defects, all fixed here:
RELAYFILE_MOUNT_TIMEOUT(default 15s) wrapped every sync cycle; the one-time bootstrap full-tree fetch of hundreds of files exceeds 15s →context deadline exceededevery cycle → permanent stall. (Trap was already documented atsyncer.go:1711.)if len(Files) > 0 && LastEventAt != ""→ after a clobber-recovery or interrupted bootstrap left a partial local state (e.g. 8 files), the fast-path seeded the events cursor and skipped the full pull forever, so the missing ~573 files never reconciled. Verified live: state had 8 tracked files, remote 581, daemon reported "healthy", mirror stuck at 9.resolveLatestEventCursorshared the tiny per-cycle deadline → timed out on large/slow workspaces, defeating the very fast-path it enables, then falling through to the doomed full pull.relayfile statusshowed a genericstall: context deadline exceededwith no diagnosis.Plus a sixth, found by live-testing this very branch against the stuck workspace:
http.Client.Timeoutis a second, independent whole-request cap.net/httpenforceshttp.Client.Timeoutover the entire request including body read, regardless of context. The daemon hardwired it toRELAYFILE_MOUNT_TIMEOUT. Decoupling the context (defects 1–4) was necessary but not sufficient — the big bootstrap body read still died withrequest canceled (Client.Timeout while reading body). Fixed by removing the whole-request cap and using a granular transport + context-based cancellation (idiomatic Go).Commits
272cb1ddcd4f851509aa064061042d2e2d0ded455a22c4c260a3c81178287bdDesign
s.rootCtx(not the inbound per-cycle ctx). Default mode = unbounded while making progress: a watchdog cancels only after an idle window (RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT, default 90s) with no page/file progress. Optional hard cap viaRELAYFILE_BOOTSTRAP_TIMEOUT. SIGTERM/rootCtx cancellation still propagates; watchdog torn down on every exit path (incl. panic, via deferred/IIFE-scoped cancel).BootstrapCursor/BootstrapFilesSynced/BootstrapStartedAtpersisted per page; an interrupted pull resumes instead of restarting. ListTree page size 10→200 (cloud is fast now).BootstrapCompleteset only after a verified full traversal.BootstrapComplete && !forceFullReconcile && len(Files)>0(and a usable resolved cursor). A partial/clobber-remnant state hasBootstrapComplete=false→ forced full (resumable) pull. Legacy state files load withBootstrapComplete=false→ exactly one self-healing forced reconcile.RELAYFILE_CURSOR_TIMEOUT, default 20s) froms.rootCtx.--full-reconcile/RELAYFILE_FORCE_FULL_RECONCILE=1, one-shot (self-clears after a successful reconcile). Auto-triggers for the clobber-remnant case.status: "bootstrapping"+ abootstrap{filesSynced,filesTotal,startedAt}block in state/status instead of a false stall. All new JSON fieldsomitempty(back-compat).http.Client.Timeout; granularDialContext/TLSHandshakeTimeout/ResponseHeaderTimeoutbound connect & time-to-first-byte without capping a progressing body. Cancellation is the context's job (per-cycle / bootstrap / cursor). Centralized inmountsync.NewSyncTransport()/NewSyncHTTPClient(); callers passing an explicitTimeoutare normalized to 0.#164/#165 mount-root-clobber invariants are explicitly preserved: a resumed/partial traversal skips the snapshot-delete pass so
applyRemoteSnapshotDeletesRevnever runs on an incompleteremotePathsset.mount_root_clobber_test.goandinvariants_test.goare byte-unchanged and green.Validation
Live, against the actual stuck
rw_517d60b6:stall: no successful reconcile for 10m.detected non-empty state without completed bootstrap; forcing full reconcile (8 tracked files)— the new gate correctly rejects the partial clobber-remnant state. (Live end-to-end convergence run is pending a relayfile re-auth; the gating logic and the previously-missinghttp.Client.Timeoutwere both proven via this live testing.)Automated:
go build ./...,go vet ./...— cleango test ./internal/mountsync/... -count=1— green (~105s incl. chaos)go test ./... -short -count=1— all packages greenscripts/check-contract-surface.sh— passed (new fields are mount-local state, not wire contract)bootstrap_test.go(8 tests: decoupled timeout, progress extension, resume-from-cursor, complete-gates-fast-path = the rw_517d60b6 repro, force-reconcile self-clear, independent cursor timeout, status surfacing, resumed-traversal-skips-delete-pass) + watchdog goroutine-leak assertion.sync_transport_test.go: asserts no whole-request timeout; slow-body regression test trickles a large body ~18s past the old 15s cap and succeeds (-raceclean).integration_test.go:TestLargeWorkspaceChaosEventuallyConverges— 600 files, intermittent 500s + slow, 3ms per-cycle deadline, asserts eventual full convergence + mount root intact.New env knobs (optional, safe defaults)
RELAYFILE_BOOTSTRAP_TIMEOUT/--bootstrap-timeoutRELAYFILE_BOOTSTRAP_IDLE_TIMEOUTRELAYFILE_CURSOR_TIMEOUT/--cursor-timeoutRELAYFILE_FORCE_FULL_RECONCILE/--full-reconcileBuilds on merged #164/#165 (mount-root clobber protection). Companion to cloud#730 (WorkspaceDO OOM elimination) — that fixed the server; this fixes the client so large workspaces actually converge.
🤖 Generated with Claude Code