fix(mountsync): converge large workspaces — bootstrap timeout, safe fast-path, resumable pull by khaliqgant · Pull Request #166 · AgentWorkforce/relayfile

khaliqgant · 2026-05-18T19:35:53Z

Summary

A mount against a large workspace (live repro: rw_517d60b6, 581 files / 419 dirs) never converges — it permanently stalls at a handful of files. This is not the cloud OOM (that was fixed in cloud#730 and is deployed/verified). It is a set of compounding client-side defects in the mount daemon, exposed once the cloud could actually serve large workspaces fast.

Five compounding defects, all fixed here:

Bootstrap full-pull bound by the 15s per-cycle deadline. RELAYFILE_MOUNT_TIMEOUT (default 15s) wrapped every sync cycle; the one-time bootstrap full-tree fetch of hundreds of files exceeds 15s → context deadline exceeded every cycle → permanent stall. (Trap was already documented at syncer.go:1711.)
Restart fast-path trusted incomplete/corrupt on-disk state. if len(Files) > 0 && LastEventAt != "" → after a clobber-recovery or interrupted bootstrap left a partial local state (e.g. 8 files), the fast-path seeded the events cursor and skipped the full pull forever, so the missing ~573 files never reconciled. Verified live: state had 8 tracked files, remote 581, daemon reported "healthy", mirror stuck at 9.
resolveLatestEventCursor shared the tiny per-cycle deadline → timed out on large/slow workspaces, defeating the very fast-path it enables, then falling through to the doomed full pull.
No resumable bootstrap → an interrupted full pull restarted from scratch every cycle.
No progress observability → relayfile status showed a generic stall: context deadline exceeded with no diagnosis.

Plus a sixth, found by live-testing this very branch against the stuck workspace:

http.Client.Timeout is a second, independent whole-request cap. net/http enforces http.Client.Timeout over the entire request including body read, regardless of context. The daemon hardwired it to RELAYFILE_MOUNT_TIMEOUT. Decoupling the context (defects 1–4) was necessary but not sufficient — the big bootstrap body read still died with request canceled (Client.Timeout while reading body). Fixed by removing the whole-request cap and using a granular transport + context-based cancellation (idiomatic Go).

Commits

SHA	Title
`272cb1d`	feat(mountsync): add resumable-bootstrap state fields
`dcd4f85`	feat(mountsync): independent short timeout for cursor resolution
`1509aa0`	feat(mountsync): progress-extending bootstrap context + resumable full-tree pull
`6406104`	fix(mountsync): gate restart fast-path on completed bootstrap (rw_517d60b6)
`2d2e2d0`	feat(observability): surface bootstrap progress instead of false stalls
`ded455a`	test(mountsync): resumable-bootstrap + decoupled-timeout coverage
`22c4c26`	fix(mountsync): only fast-path-skip when events feed yields a usable cursor
`0a3c811`	fix(mountsync): defer bootstrap ctx cancel on periodic full-pull path
`78287bd`	fix(mountsync): remove whole-request http.Client.Timeout that aborted large bootstrap

Design

Decoupled bootstrap context derived from s.rootCtx (not the inbound per-cycle ctx). Default mode = unbounded while making progress: a watchdog cancels only after an idle window (RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT, default 90s) with no page/file progress. Optional hard cap via RELAYFILE_BOOTSTRAP_TIMEOUT. SIGTERM/rootCtx cancellation still propagates; watchdog torn down on every exit path (incl. panic, via deferred/IIFE-scoped cancel).
Resumable bootstrap: BootstrapCursor/BootstrapFilesSynced/BootstrapStartedAt persisted per page; an interrupted pull resumes instead of restarting. ListTree page size 10→200 (cloud is fast now). BootstrapComplete set only after a verified full traversal.
Safe fast-path gate: BootstrapComplete && !forceFullReconcile && len(Files)>0 (and a usable resolved cursor). A partial/clobber-remnant state has BootstrapComplete=false → forced full (resumable) pull. Legacy state files load with BootstrapComplete=false → exactly one self-healing forced reconcile.
Independent cursor timeout (RELAYFILE_CURSOR_TIMEOUT, default 20s) from s.rootCtx.
Escape hatch: --full-reconcile / RELAYFILE_FORCE_FULL_RECONCILE=1, one-shot (self-clears after a successful reconcile). Auto-triggers for the clobber-remnant case.
Observability: status: "bootstrapping" + a bootstrap{filesSynced,filesTotal,startedAt} block in state/status instead of a false stall. All new JSON fields omitempty (back-compat).
HTTP transport: no whole-request http.Client.Timeout; granular DialContext/TLSHandshakeTimeout/ResponseHeaderTimeout bound connect & time-to-first-byte without capping a progressing body. Cancellation is the context's job (per-cycle / bootstrap / cursor). Centralized in mountsync.NewSyncTransport()/NewSyncHTTPClient(); callers passing an explicit Timeout are normalized to 0.

#164/#165 mount-root-clobber invariants are explicitly preserved: a resumed/partial traversal skips the snapshot-delete pass so applyRemoteSnapshotDeletesRev never runs on an incomplete remotePaths set. mount_root_clobber_test.go and invariants_test.go are byte-unchanged and green.

Validation

Live, against the actual stuck rw_517d60b6:

Old build: permanently stuck at 9/581 files, stall: no successful reconcile for 10m.
This branch: immediately logged detected non-empty state without completed bootstrap; forcing full reconcile (8 tracked files) — the new gate correctly rejects the partial clobber-remnant state. (Live end-to-end convergence run is pending a relayfile re-auth; the gating logic and the previously-missing http.Client.Timeout were both proven via this live testing.)

Automated:

go build ./..., go vet ./... — clean
go test ./internal/mountsync/... -count=1 — green (~105s incl. chaos)
go test ./... -short -count=1 — all packages green
scripts/check-contract-surface.sh — passed (new fields are mount-local state, not wire contract)
New bootstrap_test.go (8 tests: decoupled timeout, progress extension, resume-from-cursor, complete-gates-fast-path = the rw_517d60b6 repro, force-reconcile self-clear, independent cursor timeout, status surfacing, resumed-traversal-skips-delete-pass) + watchdog goroutine-leak assertion.
sync_transport_test.go: asserts no whole-request timeout; slow-body regression test trickles a large body ~18s past the old 15s cap and succeeds (-race clean).
integration_test.go: TestLargeWorkspaceChaosEventuallyConverges — 600 files, intermittent 500s + slow, 3ms per-cycle deadline, asserts eventual full convergence + mount root intact.

New env knobs (optional, safe defaults)

Env / flag	Default	Purpose
`RELAYFILE_BOOTSTRAP_TIMEOUT` / `--bootstrap-timeout`	0 (unbounded-while-progressing)	Hard cap for the bootstrap full-pull
`RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT`	90s	No-progress idle window before the bootstrap watchdog cancels
`RELAYFILE_CURSOR_TIMEOUT` / `--cursor-timeout`	20s	Independent deadline for events-cursor resolution
`RELAYFILE_FORCE_FULL_RECONCILE` / `--full-reconcile`	off	One-shot forced full reconcile (clobber-remnant recovery)

Builds on merged #164/#165 (mount-root clobber protection). Companion to cloud#730 (WorkspaceDO OOM elimination) — that fixed the server; this fixes the client so large workspaces actually converge.

🤖 Generated with Claude Code

Add additive, omitempty BootstrapComplete/BootstrapCursor/ BootstrapFilesSynced/BootstrapFilesTotal/BootstrapStartedAt to mountState; mirror cursor-free progress into publicState.Bootstrap and the CLI status surface (syncStateBootstrap). Legacy state files load with zero values (self-healing forced full reconcile later). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

resolveLatestEventCursor now derives its own short deadline from rootCtx (RELAYFILE_CURSOR_TIMEOUT, default 20s) instead of riding the inbound per-cycle ctx, so a slow events feed can no longer wedge a healthy cycle. Adds cursorTimeout/bootstrapTimeout/bootstrapIdleTimeout/ forceFullReconcile resolution to NewSyncer (option -> env -> default) and the SyncerOptions surface. Signature/return contract unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l-tree pull Add bootstrapContext() deriving the heavy full-pull deadline from rootCtx (NOT the tiny per-cycle ctx): hard-cap mode when RELAYFILE_BOOTSTRAP_TIMEOUT>0, else a progress-extension watchdog that cancels only after RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT (default 90s) with no applied-file progress. touch() is called per ListTree page and per applied file in both full-tree and export paths; the watchdog goroutine is torn down on every pullRemote exit (no leak). The periodic full pull and post-fast-path full pull now use this context; the surrounding ListEvents probes stay on the per-cycle ctx (no latency regression). pullRemoteFullTree is resumable: resumes from a persisted BootstrapCursor, persists cursor + filesSynced after each page, bumps ListTree page size 10->200, and on full completion calls markBootstrapComplete (clearing cursor/progress). A traversal resumed from a persisted cursor SKIPS the snapshot delete pass (partial listing) to preserve the #164/#165 mount-root-clobber invariants; the next full empty-cursor cycle does the authoritative delete. Export is atomic so it sets completion directly. markSyncSuccess never sets completion. New --bootstrap-timeout/--cursor-timeout/--full-reconcile flags + env on both daemons. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d60b6) Replace the unsafe LastEventAt fast-path heuristic with the authoritative BootstrapComplete gate: skip the bootstrap full pull ONLY when the workspace has been fully mirrored at least once AND no --full-reconcile/RELAYFILE_FORCE_FULL_RECONCILE escape hatch is set. A non-empty state with BootstrapComplete=false (clobber remnant or interrupted prior bootstrap) now self-heals by forcing a one-shot full reconcile, with a diagnostic log line. The force flag is cleared after one successful full reconcile (one-shot, see markBootstrapComplete). Updates the two restart-fast-path tests in the same commit (seed BootstrapComplete:true) so the suite stays green; the LastEventAt-only seeds would otherwise (correctly) no longer short-circuit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

savePublicState sets status="bootstrapping" + a public Bootstrap block (phase/filesSynced/filesTotal/startedAt; cursor omitted) while a bootstrap is in progress. The CLI runMountLoop runCycle and 10-min stall detector suppress stallReason and log "bootstrapping N/M" when .relay/state.json shows an in-progress bootstrap; runStatus prints "bootstrapping: N/M files (started <humanized>)" and suppresses the generic stall line. The standalone relayfile-mount run() logs "bootstrapping N/M" on DeadlineExceeded mid-bootstrap instead of a bare cycle-failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

New bootstrap_test.go: decoupled-from-cycle-timeout, progress-extension (alive + stall-cancel), resume-from-persisted-cursor, complete-gates- fast-path (rw_517d60b6 repro), force-full-reconcile env self-clear, independent cursor timeout, status surfaces phase, resumed traversal skips snapshot delete pass (#164/#165 guard), and a watchdog goroutine-leak assertion. integration_test.go gains a ~600-file intermittently-failing slow paginated chaos test asserting eventual full convergence under a tiny per-cycle timeout. Also scales the bootstrap watchdog poll interval to the idle window (min idle/3, 10ms floor) so short windows cancel promptly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cursor The BootstrapComplete gate alone let backends whose events feed is always-empty-but-not-404 (e.g. some mock/degraded feeds) seed an empty EventsCursor and short-circuit forever after the first bootstrap: the periodic full-pull cadence keys off a non-empty EventsCursor, so new remote files would never land (regressed TestProductizedCloudMountE2EProof). Require resolveLatestEventCursor to return a non-empty tip before skipping; otherwise fall through to the idempotent full pull. This restores the safety the old LastEventAt gate provided without reintroducing the rw_517d60b6 partial-mirror hazard (still gated on BootstrapComplete). Test client now provides a usable event tip where the fast-path is expected to engage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fresh-eyes review: the periodic full-pull path called bcancel() unconditionally instead of via defer, leaking the watchdog goroutine if pullRemoteFull panics. Match the deferred-cancel pattern already used on the post-fast-path full-pull sibling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… large bootstrap The decoupled bootstrap context (prior commits) was necessary but not sufficient. Every syncer request still flowed through an *http.Client whose .Timeout was hardwired to the per-cycle mount timeout (default 15s). net/http enforces http.Client.Timeout as a whole-request wall-clock INDEPENDENT of context, so it killed the bootstrap full-tree body read mid-stream regardless of the rootCtx-derived bootstrap context: mount sync cycle failed: net/http: request canceled (Client.Timeout or context cancellation while reading body) Fix: introduce mountsync.NewSyncTransport()/NewSyncHTTPClient() with granular timeouts (DialContext 10s, TLSHandshake 10s, ResponseHeaderTimeout 30s, ExpectContinue 1s, IdleConn 90s) and NO total-request cap. All four syncer client construction sites now use it, and NewHTTPClient normalizes any caller-supplied non-zero Timeout to 0 so the bootstrap path can never be whole-request-capped. Context-coverage audit: every request path remains bounded by a context deadline - incremental -> per-cycle ctx (RELAYFILE_MOUNT_TIMEOUT); bootstrap/periodic full-pull -> bootstrap ctx (progress-extending + idle watchdog); resolveLatestEventCursor -> cursorTimeout ctx; writeback/delete -> per-cycle ctx. No path relied solely on http.Client.Timeout for cancellation, so removing it cannot cause an infinite hang. The 'conflict deleting /digests/today.md; remote changed' loop is a downstream symptom, not a separate defect: applyRemoteFile already re-pulls and re-tracks the remote-changed file with its fresh revision, which converges on the next cycle - but only if the cycle is not aborted first by the whole-request timeout. Fixing the timeout gap removes the cycle abort that prevented convergence; no separate fix needed. Tests: assert NewSyncHTTPClient has Timeout==0 with the expected granular transport; assert NewHTTPClient normalizes an explicit Timeout; regression test streaming a slow-but-progressing body for ~18s (past the old 15s cap) that now succeeds. mount_root_clobber and invariants suites untouched and green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-18T19:36:06Z

Warning

Rate limit exceeded

@khaliqgant has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 34 minutes and 6 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a2e747b4-3169-4e3a-bd4e-797c0c293adb

📥 Commits

Reviewing files that changed from the base of the PR and between 78287bd and 088871d.

📒 Files selected for processing (6)

.trajectories/index.json
cmd/relayfile-cli/main.go
internal/mountsync/bootstrap_test.go
internal/mountsync/integration_test.go
internal/mountsync/sync_transport_test.go
internal/mountsync/syncer.go

📝 Walkthrough

Walkthrough

This PR introduces resumable bootstrap with persistent state tracking, granular timeout controls, and a watchdog-based idle-timeout mechanism for long-running operations. HTTP timeouts are refactored from whole-request caps to per-operation deadlines with context-based cancellation, enabling progress-driven streaming for large initial syncs. Bootstrap progress is now exposed in operator status output and CLI monitoring.

Changes

Bootstrap Resumption and Timeout Model

Layer / File(s)	Summary
HTTP transport refactoring: granular timeouts and context-based cancellation `internal/mountsync/syncer.go`, `internal/mountsync/sync_transport_test.go`	`NewSyncTransport` and `NewSyncHTTPClient` provide granular per-operation timeouts (connect/TLS/response-header) with `http.Client.Timeout` normalized to 0, enabling context-based cancellation instead of whole-request caps. Three new tests validate the timeout model and confirm streaming completes past historical caps.
Bootstrap state model and persistent resumption fields `internal/mountsync/syncer.go`	`mountState` gains `BootstrapComplete`, `BootstrapCursor`, `BootstrapFilesSynced`, `BootstrapFilesTotal`, and `BootstrapStartedAt` to enable resume after interruption. `publicState.Bootstrap` field with `bootstrapStatus` struct surfaces phase/file-counts/started-at in operator status.
CLI and configuration: timeout flags and syncer option wiring `cmd/relayfile-cli/main.go`, `cmd/relayfile-mount/main.go`, `internal/mountsync/syncer.go`	New `--bootstrap-timeout`, `--cursor-timeout`, and `--full-reconcile` flags are added to mount subcommands and wired into `SyncerOptions` (CursorTimeout, BootstrapTimeout, ForceFullReconcile) via `NewSyncer`.
Option resolution: duration parsing and fallback semantics `internal/mountsync/syncer.go`	`resolveDurationEnv` helper parses timeout options/env with fallback rules; `NewSyncer` resolves and stores `cursorTimeout`, `bootstrapTimeout`, `bootstrapIdleTimeout`, and `forceFullReconcile` from options/env.
Bootstrap watchdog and progress extension mechanism `internal/mountsync/syncer.go`	`bootstrapContext` helper returns context/cancel and a progress-touch callback; in progress-extension mode, a watchdog goroutine cancels only if no touches occur within the configured idle window, extending liveness during long operations.
Full-pull refactoring: resumable bootstrap and partial progress persistence `internal/mountsync/syncer.go`	`pullRemoteFull` and `pullRemoteFullTree` accept `bootstrapProgress`, touch progress during export/traversal, persist resumable cursors and file counts during interruption, and defer snapshot-delete passes when resuming from a persisted cursor to avoid clobbering already-mirrored prefixes.
Restart fast-path: BootstrapComplete gate and independent cursor timeout `internal/mountsync/syncer.go`, `internal/mountsync/syncer_test.go`	Restart fast-path now requires `BootstrapComplete=true` before skipping the bootstrap full pull. `resolveLatestEventCursor` uses an independent cursor-timeout from `rootCtx`, preventing slow events feeds from wedging healthy cycles. Test seeds updated to explicitly gate on `BootstrapComplete`.
Bootstrap progress in operator status and CLI output `internal/mountsync/syncer.go`, `cmd/relayfile-cli/main.go`	`publicState` surfaces bootstrap-in-progress details overriding normal status logic. CLI parses `.relay/state.json` bootstrap block via `readBootstrapStatus` and renders "bootstrapping: / files" progress in `runStatus`. Deadletter retry switches to new sync transport without whole-request timeout.
Stall monitoring: suppress misleading stall messages during bootstrap `cmd/relayfile-cli/main.go`	Mount cycle failure and "no successful reconcile for 10m" handlers check bootstrap status; when active, clear `stallReason`, log bootstrap progress, and write snapshots without labeling it as a generic stall.
Mount main: deadline handling with bootstrap progress detection `cmd/relayfile-mount/main.go`	Special handling for `context.DeadlineExceeded` reads bootstrap progress from `.relay/state.json` and logs bootstrapping status, returning early instead of treating deadline as terminal. Includes `readBootstrapProgress` helper for state file parsing.
Bootstrap test infrastructure and comprehensive test suite `internal/mountsync/bootstrap_test.go`	`bootstrapClient` test double with configurable pagination, failure injection, and delays. Nine tests cover timeout decoupling, progress extension, cursor resumption, BootstrapComplete gating, force-full-reconcile override, cursor-timeout independence, status surfacing, snapshot-delete skip, and watchdog cleanup. Includes state-loading and goroutine-leak helpers.
Integration tests: transport validation and large-workspace chaos `internal/mountsync/sync_transport_test.go`, `internal/mountsync/integration_test.go`	`sync_transport_test.go` validates granular transport timeouts and successful streaming past historical caps. `TestLargeWorkspaceChaosEventuallyConverges` uses `largeChaosCloud` server with deterministic pagination, intermittent 500s, and delays; verifies local mirror convergence under chaos and short per-cycle timeouts.
Administrative updates: trajectory index refresh `.trajectories/index.json`	`lastUpdated` timestamp refreshed; two completed and one active trajectory entry appended with metadata (titles, statuses, start/completion times, paths).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This PR introduces a substantial timeout and bootstrap state model with persistent resumption, affecting core reconciliation logic, HTTP client construction, status reporting, and deadline handling across CLI, mount main, and syncer internals. The changes involve high-density logic (watchdog goroutines, progress-touch callbacks, cursor persistence), multiple interacting components (timeouts, state fields, context derivation), and comprehensive test coverage. While most changes are contained to mountsync and CLI logic, the breadth of modifications to restart fast-path gating, full-pull streaming, stall suppression, and operator status surface requires careful review of interaction between layers.

Possibly related PRs

AgentWorkforce/relayfile#163: Modifies cmd/relayfile-cli/main.go dead-letter writeback behavior; directly related to the deadletter retry transport changes in this PR.
AgentWorkforce/relayfile#164: Modifies mountsync snapshot-delete decision logic in full-tree/full-export pulls; related to the resumable bootstrap snapshot-delete deferral in this PR.
AgentWorkforce/relayfile#92: Modifies internal/mountsync/syncer.go restart fast-path logic and cursor-based bootstrap skipping; overlaps with this PR's BootstrapComplete gating and restart fast-path tests.

Poem

🐰 A rabbit bounds through bootstrap's deep woods,
Resume point saved when the path grows hard,
Watchdog bells chime when progress goes quiet,
Each timeout granular, no whole-request regard,
The mirror converges, one file-synced forward!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: fixing mount daemon convergence on large workspaces through bootstrap timeout, safe fast-path, and resumable pull functionality.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing all six defects fixed, design decisions, commits, validation, and new environment knobs.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/mount-bootstrap-large-workspace

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-18T19:37:43Z

Relayfile Eval Review

Run: .relayfile/evals/runs/2026-05-19T10-43-32-088Z-HEAD-provider
Mode: provider
Git SHA: b3f3d49

Human Review Cases

No reviewable human-review cases captured Relayfile output.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cmd/relayfile-cli/main.go (1)
3633-3658: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the new mount flags in the custom help output.

These flags are wired up here, but relayfile mount --help does not show them because this command discards the default flag help and prints printMountHelp instead. Right now --bootstrap-timeout, --cursor-timeout, and --full-reconcile are effectively undiscoverable from the CLI.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/relayfile-cli/main.go` around lines 3633 - 3658, The new flags
bootstrapTimeout, cursorTimeout and fullReconcile are registered with the
FlagSet but never documented because mount uses a custom printMountHelp; update
the custom help to include these flags by adding their descriptions to
printMountHelp (or by emitting fs.PrintDefaults()/FlagSet.VisitAll output when
printing help for the mount command) so --bootstrap-timeout, --cursor-timeout
and --full-reconcile become discoverable; locate the flag registrations
(bootstrapTimeout, cursorTimeout, fullReconcile) and modify printMountHelp to
include matching lines or programmatically include the FlagSet's defaults when
fs.Parse is bypassed.

🧹 Nitpick comments (1)

internal/mountsync/syncer.go (1)
259-271: 💤 Low value

Mutating caller's http.Client could surprise callers who share the object.

This directly modifies httpClient.Timeout on the caller's object. If a caller passes an *http.Client they also use elsewhere (e.g., for non-bootstrap operations where a timeout cap is desired), this mutation silently disables their timeout. Consider cloning the client or documenting this prominently.

The existing comment explains the rationale, so this is a minor concern given the defensive intent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/mountsync/syncer.go` around lines 259 - 271, Don't mutate the
caller's *http.Client; instead make a shallow copy, adjust the copy's Timeout
and Transport, and use that copy. Concretely, replace direct writes to
httpClient.Timeout/Transport with a local clone (e.g. c := *httpClient; client
:= &c), set client.Timeout = 0 and if client.Transport == nil set
client.Transport = NewSyncTransport(), and continue using client in the
surrounding code (references to httpClient should be switched to the cloned
client).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.trajectories/index.json:
- Line 244: Replace the absolute filesystem paths used in the JSON "path" fields
with repository-relative paths so entries remain portable; locate the "path"
keys that contain values like
"/private/tmp/relayfile-bootstrap-fix/.trajectories/.../traj_nxbsptr6c5q0.json"
(and the similar entries mentioned) and change them to
".trajectories/.../traj_nxbsptr6c5q0.json" (i.e., drop the leading
"/private/tmp/relayfile-bootstrap-fix/") so all "path" values use the same
relative ".trajectories/..." format.

In `@cmd/relayfile-cli/main.go`:
- Around line 3009-3016: The retry path removed the http.Client.Timeout which
leaves requests running under context.Background(), allowing HandleLocalChange
in the relayfile writeback retry to hang indefinitely after headers succeed;
restore an end-to-end deadline by wiring a request-scoped timeout into the
client used by mountsync.NewHTTPClient (or by enforcing a per-request context
with timeout inside the retry loop that calls HandleLocalChange), e.g., create a
context.WithTimeout around the code path that performs the upload/retry and pass
that context into the transport/client so newWritebackFailureTransport /
mountsync.NewSyncTransport operations are bounded; ensure the timeout value is
configurable and applied consistently for the code paths referenced
(mountsync.NewHTTPClient, newWritebackFailureTransport, HandleLocalChange).
- Around line 6934-6943: The code currently clears stallReason for any reconcile
error when a bootstrap status exists; change the bootstrap handling (the block
using readBootstrapStatus(localDir), stallReason, writeSnapshot) to only
suppress and return early when the reconcile error is a deadline exceeded error
by wrapping the existing branch in a check like errors.Is(err,
context.DeadlineExceeded) (or equivalent); if the error is not a deadline
exceeded, do not clear stallReason or return early so real failures (auth/local
I/O/bad responses) continue to surface. Ensure you import/usage errors and
context as needed and reference the existing readBootstrapStatus, stallReason,
writeSnapshot identifiers.

In `@internal/mountsync/bootstrap_test.go`:
- Around line 252-258: The test sets s.bootstrapIdleTimeout to 80ms but gives
Reconcile a too-short inbound deadline (context.WithTimeout(..., 1ms)) which
lets the per-cycle deadline mask the idle-timeout behavior; change the test to
create a context deadline comfortably longer than s.bootstrapIdleTimeout (e.g.,
>80ms, like 200ms) when calling s.Reconcile, record the start time, call
s.Reconcile(ctx), assert that it returns an error, and then assert the elapsed
time is at least s.bootstrapIdleTimeout to prove the failure came from the idle
window; reference s.bootstrapIdleTimeout, Reconcile, and context.WithTimeout to
locate and update the test accordingly.

In `@internal/mountsync/integration_test.go`:
- Around line 369-375: The convergence check currently uses ">= fileCount" which
can hide stray files; update the condition in the loop that sets "converged" to
require an exact match by using "countLocalFiles(t, localDir) == fileCount"
(referencing countLocalFiles, localDir, fileCount, converged) so the test only
succeeds when the local directory has exactly the expected number of files;
adjust any related failure message if needed to reflect exact-match semantics.

---

Outside diff comments:
In `@cmd/relayfile-cli/main.go`:
- Around line 3633-3658: The new flags bootstrapTimeout, cursorTimeout and
fullReconcile are registered with the FlagSet but never documented because mount
uses a custom printMountHelp; update the custom help to include these flags by
adding their descriptions to printMountHelp (or by emitting
fs.PrintDefaults()/FlagSet.VisitAll output when printing help for the mount
command) so --bootstrap-timeout, --cursor-timeout and --full-reconcile become
discoverable; locate the flag registrations (bootstrapTimeout, cursorTimeout,
fullReconcile) and modify printMountHelp to include matching lines or
programmatically include the FlagSet's defaults when fs.Parse is bypassed.

---

Nitpick comments:
In `@internal/mountsync/syncer.go`:
- Around line 259-271: Don't mutate the caller's *http.Client; instead make a
shallow copy, adjust the copy's Timeout and Transport, and use that copy.
Concretely, replace direct writes to httpClient.Timeout/Transport with a local
clone (e.g. c := *httpClient; client := &c), set client.Timeout = 0 and if
client.Transport == nil set client.Transport = NewSyncTransport(), and continue
using client in the surrounding code (references to httpClient should be
switched to the cloned client).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: dea5bfd6-0c4f-43ce-8f87-5027dc8b8cde

📥 Commits

Reviewing files that changed from the base of the PR and between 920b3f2 and 78287bd.

📒 Files selected for processing (8)

.trajectories/index.json
cmd/relayfile-cli/main.go
cmd/relayfile-mount/main.go
internal/mountsync/bootstrap_test.go
internal/mountsync/integration_test.go
internal/mountsync/sync_transport_test.go
internal/mountsync/syncer.go
internal/mountsync/syncer_test.go

devin-ai-integration

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-05-18T19:45:54Z

 		return nil
 	}


🔴 snapshotDeleteUnsafe check shadows !startedFromEmpty, preventing resumed bootstraps from ever completing

In pullRemoteFullTree, the snapshotDeleteUnsafe check at line 2264 fires before the !startedFromEmpty check at line 2279. A resumed bootstrap traversal only populates remotePaths with the tail portion of the tree (from the persisted cursor onward). When the tail is less than 50% of the total tracked file count, snapshotDeleteUnsafe returns true and the function exits at line 2267 without calling markBootstrapComplete(). The !startedFromEmpty branch at line 2279 — which correctly calls markBootstrapComplete() — is never reached.

Full impact trace for a 400-file workspace interrupted at page 3/4

Cycle 1: Pages 1-3 applied (300 tracked files), page 4 fails. State: BootstrapCursor=cursor_p3, BootstrapComplete=false.

Cycle 2 (resume): Traversal resumes from cursor_p3, page 4 applied (~100 files). remotePaths=100, tracked=400. snapshotDeleteUnsafe(100): 100 < 400*0.5=200 → true. Returns nil WITHOUT markBootstrapComplete(). BootstrapCursor still points to cursor_p3.

Every subsequent periodic full-pull (every ~10 min): resumes from stale cursor_p3, reads only the last ~100 files, snapshotDeleteUnsafe fires again. The periodic full-pull safety net is permanently broken (never does a real full pull), the restart fast-path never engages (BootstrapComplete stays false), and the stale BootstrapCursor is never cleared.

(Refers to lines 2264-2268)

Prompt for agents

In pullRemoteFullTree (syncer.go), the snapshotDeleteUnsafe check at line 2264 fires before the !startedFromEmpty check at line 2279. When a resumed traversal completes, remotePaths only covers the tail of the tree, so snapshotDeleteUnsafe incorrectly flags the listing as degraded. This prevents markBootstrapComplete() from being called, creating an infinite loop where every periodic full-pull re-reads just the tail. Fix: move the !startedFromEmpty check (lines 2279-2284) BEFORE the snapshotDeleteUnsafe check (lines 2264-2268). A resumed traversal is already known to be partial, so the snapshot-delete safety check is not applicable — it is designed for degraded cloud responses, not intentionally partial listings. Alternatively, skip the snapshotDeleteUnsafe check entirely when startedFromEmpty is false, since the delete pass is already skipped in that branch.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-18T19:45:55Z

+	} else if httpClient.Timeout != 0 {
+		// A caller passed an explicit whole-request Timeout. That cap is
+		// fatal to the bootstrap full-pull (net/http enforces it
+		// independent of context, killing a long-but-progressing body
+		// read). Normalize it away and ensure a granular sync transport
+		// is in place; cancellation remains fully covered by the
+		// per-cycle / bootstrap / cursor contexts. This guarantees the
+		// bootstrap path is never whole-request-capped no matter how the
+		// client was constructed.
+		httpClient.Timeout = 0
+		if httpClient.Transport == nil {
+			httpClient.Transport = NewSyncTransport()
+		}


🟡 NewHTTPClient unconditionally zeroes caller's http.Client.Timeout, removing FUSE mount's per-request timeout

NewHTTPClient (syncer.go:259-268) mutates the caller's *http.Client by forcibly setting Timeout = 0 when it was non-zero. The FUSE mount path (cmd/relayfile-mount/fuse_mount.go:20) passes &http.Client{Timeout: cfg.timeout} — this timeout intentionally bounded individual FUSE API requests (ReadFile, ListTree, WriteFile, DeleteFile) that back synchronous kernel filesystem operations. After the mutation, FUSE API calls have no whole-request timeout, relying only on ResponseHeaderTimeout: 30s from NewSyncTransport. The FUSE mount does not use bootstrap and has no context-based cancellation for individual API calls, so the explicit Timeout was the primary safety mechanism against hung requests blocking kernel-level filesystem operations.

Prompt for agents

NewHTTPClient (syncer.go:259-271) unconditionally zeroes the caller's http.Client.Timeout. This is needed for the poll-mode bootstrap path but breaks the FUSE mount path (fuse_mount.go:20) which relies on the whole-request Timeout to bound individual FUSE API calls. Possible approaches: 1. Instead of mutating the caller's http.Client, clone it internally: create a new http.Client with Timeout=0 and the original (or new) Transport, leaving the caller's object untouched. 2. Have the FUSE mount path construct its client differently (e.g., passing a flag or using a separate constructor) so it is not subject to the bootstrap normalization. 3. Add a parameter or option to NewHTTPClient that controls whether the Timeout should be preserved (for FUSE) or normalized away (for poll/bootstrap). The key invariant: poll-mode bootstrap must have Timeout=0, FUSE-mode must retain its per-request Timeout.

Was this helpful? React with 👍 or 👎 to provide feedback.

Proactive Runtime Bot and others added 9 commits May 18, 2026 20:48

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

Comment thread .trajectories/index.json Outdated

Comment thread cmd/relayfile-cli/main.go

Comment thread cmd/relayfile-cli/main.go

Comment thread internal/mountsync/bootstrap_test.go

Comment thread internal/mountsync/integration_test.go Outdated

devin-ai-integration Bot reviewed May 18, 2026

View reviewed changes

fix(mountsync): address bootstrap review feedback

088871d

khaliqgant merged commit 0eba617 into main May 19, 2026
9 checks passed

khaliqgant deleted the fix/mount-bootstrap-large-workspace branch May 19, 2026 10:47

khaliqgant mentioned this pull request May 19, 2026

feat(workspace): complete PR-39 local primitives and digest runtime #167

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mountsync): converge large workspaces — bootstrap timeout, safe fast-path, resumable pull#166

fix(mountsync): converge large workspaces — bootstrap timeout, safe fast-path, resumable pull#166
khaliqgant merged 10 commits into
mainfrom
fix/mount-bootstrap-large-workspace

khaliqgant commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 18, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 18, 2026

Uh oh!

devin-ai-integration Bot May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

khaliqgant commented May 18, 2026

Summary

Commits

Design

Validation

New env knobs (optional, safe defaults)

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Relayfile Eval Review

Human Review Cases

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented May 18, 2026 •

edited

Loading

github-actions Bot commented May 18, 2026 •

edited

Loading