Skip to content

fix(mountsync): converge large workspaces — bootstrap timeout, safe fast-path, resumable pull#166

Merged
khaliqgant merged 10 commits into
mainfrom
fix/mount-bootstrap-large-workspace
May 19, 2026
Merged

fix(mountsync): converge large workspaces — bootstrap timeout, safe fast-path, resumable pull#166
khaliqgant merged 10 commits into
mainfrom
fix/mount-bootstrap-large-workspace

Conversation

@khaliqgant
Copy link
Copy Markdown
Member

Summary

A mount against a large workspace (live repro: rw_517d60b6, 581 files / 419 dirs) never converges — it permanently stalls at a handful of files. This is not the cloud OOM (that was fixed in cloud#730 and is deployed/verified). It is a set of compounding client-side defects in the mount daemon, exposed once the cloud could actually serve large workspaces fast.

Five compounding defects, all fixed here:

  1. Bootstrap full-pull bound by the 15s per-cycle deadline. RELAYFILE_MOUNT_TIMEOUT (default 15s) wrapped every sync cycle; the one-time bootstrap full-tree fetch of hundreds of files exceeds 15s → context deadline exceeded every cycle → permanent stall. (Trap was already documented at syncer.go:1711.)
  2. Restart fast-path trusted incomplete/corrupt on-disk state. if len(Files) > 0 && LastEventAt != "" → after a clobber-recovery or interrupted bootstrap left a partial local state (e.g. 8 files), the fast-path seeded the events cursor and skipped the full pull forever, so the missing ~573 files never reconciled. Verified live: state had 8 tracked files, remote 581, daemon reported "healthy", mirror stuck at 9.
  3. resolveLatestEventCursor shared the tiny per-cycle deadline → timed out on large/slow workspaces, defeating the very fast-path it enables, then falling through to the doomed full pull.
  4. No resumable bootstrap → an interrupted full pull restarted from scratch every cycle.
  5. No progress observabilityrelayfile status showed a generic stall: context deadline exceeded with no diagnosis.

Plus a sixth, found by live-testing this very branch against the stuck workspace:

  1. http.Client.Timeout is a second, independent whole-request cap. net/http enforces http.Client.Timeout over the entire request including body read, regardless of context. The daemon hardwired it to RELAYFILE_MOUNT_TIMEOUT. Decoupling the context (defects 1–4) was necessary but not sufficient — the big bootstrap body read still died with request canceled (Client.Timeout while reading body). Fixed by removing the whole-request cap and using a granular transport + context-based cancellation (idiomatic Go).

Commits

SHA Title
272cb1d feat(mountsync): add resumable-bootstrap state fields
dcd4f85 feat(mountsync): independent short timeout for cursor resolution
1509aa0 feat(mountsync): progress-extending bootstrap context + resumable full-tree pull
6406104 fix(mountsync): gate restart fast-path on completed bootstrap (rw_517d60b6)
2d2e2d0 feat(observability): surface bootstrap progress instead of false stalls
ded455a test(mountsync): resumable-bootstrap + decoupled-timeout coverage
22c4c26 fix(mountsync): only fast-path-skip when events feed yields a usable cursor
0a3c811 fix(mountsync): defer bootstrap ctx cancel on periodic full-pull path
78287bd fix(mountsync): remove whole-request http.Client.Timeout that aborted large bootstrap

Design

  • Decoupled bootstrap context derived from s.rootCtx (not the inbound per-cycle ctx). Default mode = unbounded while making progress: a watchdog cancels only after an idle window (RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT, default 90s) with no page/file progress. Optional hard cap via RELAYFILE_BOOTSTRAP_TIMEOUT. SIGTERM/rootCtx cancellation still propagates; watchdog torn down on every exit path (incl. panic, via deferred/IIFE-scoped cancel).
  • Resumable bootstrap: BootstrapCursor/BootstrapFilesSynced/BootstrapStartedAt persisted per page; an interrupted pull resumes instead of restarting. ListTree page size 10→200 (cloud is fast now). BootstrapComplete set only after a verified full traversal.
  • Safe fast-path gate: BootstrapComplete && !forceFullReconcile && len(Files)>0 (and a usable resolved cursor). A partial/clobber-remnant state has BootstrapComplete=false → forced full (resumable) pull. Legacy state files load with BootstrapComplete=false → exactly one self-healing forced reconcile.
  • Independent cursor timeout (RELAYFILE_CURSOR_TIMEOUT, default 20s) from s.rootCtx.
  • Escape hatch: --full-reconcile / RELAYFILE_FORCE_FULL_RECONCILE=1, one-shot (self-clears after a successful reconcile). Auto-triggers for the clobber-remnant case.
  • Observability: status: "bootstrapping" + a bootstrap{filesSynced,filesTotal,startedAt} block in state/status instead of a false stall. All new JSON fields omitempty (back-compat).
  • HTTP transport: no whole-request http.Client.Timeout; granular DialContext/TLSHandshakeTimeout/ResponseHeaderTimeout bound connect & time-to-first-byte without capping a progressing body. Cancellation is the context's job (per-cycle / bootstrap / cursor). Centralized in mountsync.NewSyncTransport()/NewSyncHTTPClient(); callers passing an explicit Timeout are normalized to 0.

#164/#165 mount-root-clobber invariants are explicitly preserved: a resumed/partial traversal skips the snapshot-delete pass so applyRemoteSnapshotDeletesRev never runs on an incomplete remotePaths set. mount_root_clobber_test.go and invariants_test.go are byte-unchanged and green.

Validation

Live, against the actual stuck rw_517d60b6:

  • Old build: permanently stuck at 9/581 files, stall: no successful reconcile for 10m.
  • This branch: immediately logged detected non-empty state without completed bootstrap; forcing full reconcile (8 tracked files) — the new gate correctly rejects the partial clobber-remnant state. (Live end-to-end convergence run is pending a relayfile re-auth; the gating logic and the previously-missing http.Client.Timeout were both proven via this live testing.)

Automated:

  • go build ./..., go vet ./... — clean
  • go test ./internal/mountsync/... -count=1 — green (~105s incl. chaos)
  • go test ./... -short -count=1 — all packages green
  • scripts/check-contract-surface.sh — passed (new fields are mount-local state, not wire contract)
  • New bootstrap_test.go (8 tests: decoupled timeout, progress extension, resume-from-cursor, complete-gates-fast-path = the rw_517d60b6 repro, force-reconcile self-clear, independent cursor timeout, status surfacing, resumed-traversal-skips-delete-pass) + watchdog goroutine-leak assertion.
  • sync_transport_test.go: asserts no whole-request timeout; slow-body regression test trickles a large body ~18s past the old 15s cap and succeeds (-race clean).
  • integration_test.go: TestLargeWorkspaceChaosEventuallyConverges — 600 files, intermittent 500s + slow, 3ms per-cycle deadline, asserts eventual full convergence + mount root intact.

New env knobs (optional, safe defaults)

Env / flag Default Purpose
RELAYFILE_BOOTSTRAP_TIMEOUT / --bootstrap-timeout 0 (unbounded-while-progressing) Hard cap for the bootstrap full-pull
RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT 90s No-progress idle window before the bootstrap watchdog cancels
RELAYFILE_CURSOR_TIMEOUT / --cursor-timeout 20s Independent deadline for events-cursor resolution
RELAYFILE_FORCE_FULL_RECONCILE / --full-reconcile off One-shot forced full reconcile (clobber-remnant recovery)

Builds on merged #164/#165 (mount-root clobber protection). Companion to cloud#730 (WorkspaceDO OOM elimination) — that fixed the server; this fixes the client so large workspaces actually converge.

🤖 Generated with Claude Code

Proactive Runtime Bot and others added 9 commits May 18, 2026 20:48
Add additive, omitempty BootstrapComplete/BootstrapCursor/
BootstrapFilesSynced/BootstrapFilesTotal/BootstrapStartedAt to
mountState; mirror cursor-free progress into publicState.Bootstrap
and the CLI status surface (syncStateBootstrap). Legacy state files
load with zero values (self-healing forced full reconcile later).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resolveLatestEventCursor now derives its own short deadline from
rootCtx (RELAYFILE_CURSOR_TIMEOUT, default 20s) instead of riding the
inbound per-cycle ctx, so a slow events feed can no longer wedge a
healthy cycle. Adds cursorTimeout/bootstrapTimeout/bootstrapIdleTimeout/
forceFullReconcile resolution to NewSyncer (option -> env -> default)
and the SyncerOptions surface. Signature/return contract unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…l-tree pull

Add bootstrapContext() deriving the heavy full-pull deadline from
rootCtx (NOT the tiny per-cycle ctx): hard-cap mode when
RELAYFILE_BOOTSTRAP_TIMEOUT>0, else a progress-extension watchdog that
cancels only after RELAYFILE_BOOTSTRAP_IDLE_TIMEOUT (default 90s) with
no applied-file progress. touch() is called per ListTree page and per
applied file in both full-tree and export paths; the watchdog goroutine
is torn down on every pullRemote exit (no leak). The periodic full pull
and post-fast-path full pull now use this context; the surrounding
ListEvents probes stay on the per-cycle ctx (no latency regression).

pullRemoteFullTree is resumable: resumes from a persisted
BootstrapCursor, persists cursor + filesSynced after each page, bumps
ListTree page size 10->200, and on full completion calls
markBootstrapComplete (clearing cursor/progress). A traversal resumed
from a persisted cursor SKIPS the snapshot delete pass (partial
listing) to preserve the #164/#165 mount-root-clobber invariants; the
next full empty-cursor cycle does the authoritative delete. Export is
atomic so it sets completion directly. markSyncSuccess never sets
completion. New --bootstrap-timeout/--cursor-timeout/--full-reconcile
flags + env on both daemons.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d60b6)

Replace the unsafe LastEventAt fast-path heuristic with the
authoritative BootstrapComplete gate: skip the bootstrap full pull
ONLY when the workspace has been fully mirrored at least once AND no
--full-reconcile/RELAYFILE_FORCE_FULL_RECONCILE escape hatch is set.
A non-empty state with BootstrapComplete=false (clobber remnant or
interrupted prior bootstrap) now self-heals by forcing a one-shot full
reconcile, with a diagnostic log line. The force flag is cleared after
one successful full reconcile (one-shot, see markBootstrapComplete).

Updates the two restart-fast-path tests in the same commit (seed
BootstrapComplete:true) so the suite stays green; the LastEventAt-only
seeds would otherwise (correctly) no longer short-circuit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
savePublicState sets status="bootstrapping" + a public Bootstrap block
(phase/filesSynced/filesTotal/startedAt; cursor omitted) while a
bootstrap is in progress. The CLI runMountLoop runCycle and 10-min
stall detector suppress stallReason and log "bootstrapping N/M" when
.relay/state.json shows an in-progress bootstrap; runStatus prints
"bootstrapping: N/M files (started <humanized>)" and suppresses the
generic stall line. The standalone relayfile-mount run() logs
"bootstrapping N/M" on DeadlineExceeded mid-bootstrap instead of a
bare cycle-failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New bootstrap_test.go: decoupled-from-cycle-timeout, progress-extension
(alive + stall-cancel), resume-from-persisted-cursor, complete-gates-
fast-path (rw_517d60b6 repro), force-full-reconcile env self-clear,
independent cursor timeout, status surfaces phase, resumed traversal
skips snapshot delete pass (#164/#165 guard), and a watchdog
goroutine-leak assertion. integration_test.go gains a ~600-file
intermittently-failing slow paginated chaos test asserting eventual
full convergence under a tiny per-cycle timeout. Also scales the
bootstrap watchdog poll interval to the idle window (min idle/3, 10ms
floor) so short windows cancel promptly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cursor

The BootstrapComplete gate alone let backends whose events feed is
always-empty-but-not-404 (e.g. some mock/degraded feeds) seed an empty
EventsCursor and short-circuit forever after the first bootstrap: the
periodic full-pull cadence keys off a non-empty EventsCursor, so new
remote files would never land (regressed
TestProductizedCloudMountE2EProof). Require resolveLatestEventCursor to
return a non-empty tip before skipping; otherwise fall through to the
idempotent full pull. This restores the safety the old LastEventAt gate
provided without reintroducing the rw_517d60b6 partial-mirror hazard
(still gated on BootstrapComplete). Test client now provides a usable
event tip where the fast-path is expected to engage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fresh-eyes review: the periodic full-pull path called bcancel()
unconditionally instead of via defer, leaking the watchdog goroutine
if pullRemoteFull panics. Match the deferred-cancel pattern already
used on the post-fast-path full-pull sibling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… large bootstrap

The decoupled bootstrap context (prior commits) was necessary but not
sufficient. Every syncer request still flowed through an *http.Client
whose .Timeout was hardwired to the per-cycle mount timeout (default
15s). net/http enforces http.Client.Timeout as a whole-request
wall-clock INDEPENDENT of context, so it killed the bootstrap
full-tree body read mid-stream regardless of the rootCtx-derived
bootstrap context:

  mount sync cycle failed: net/http: request canceled
  (Client.Timeout or context cancellation while reading body)

Fix: introduce mountsync.NewSyncTransport()/NewSyncHTTPClient() with
granular timeouts (DialContext 10s, TLSHandshake 10s,
ResponseHeaderTimeout 30s, ExpectContinue 1s, IdleConn 90s) and NO
total-request cap. All four syncer client construction sites now use
it, and NewHTTPClient normalizes any caller-supplied non-zero Timeout
to 0 so the bootstrap path can never be whole-request-capped.

Context-coverage audit: every request path remains bounded by a
context deadline - incremental -> per-cycle ctx
(RELAYFILE_MOUNT_TIMEOUT); bootstrap/periodic full-pull -> bootstrap
ctx (progress-extending + idle watchdog); resolveLatestEventCursor ->
cursorTimeout ctx; writeback/delete -> per-cycle ctx. No path relied
solely on http.Client.Timeout for cancellation, so removing it cannot
cause an infinite hang.

The 'conflict deleting /digests/today.md; remote changed' loop is a
downstream symptom, not a separate defect: applyRemoteFile already
re-pulls and re-tracks the remote-changed file with its fresh
revision, which converges on the next cycle - but only if the cycle
is not aborted first by the whole-request timeout. Fixing the timeout
gap removes the cycle abort that prevented convergence; no separate
fix needed.

Tests: assert NewSyncHTTPClient has Timeout==0 with the expected
granular transport; assert NewHTTPClient normalizes an explicit
Timeout; regression test streaming a slow-but-progressing body for
~18s (past the old 15s cap) that now succeeds. mount_root_clobber and
invariants suites untouched and green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Review Change Stack

Warning

Rate limit exceeded

@khaliqgant has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 34 minutes and 6 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: a2e747b4-3169-4e3a-bd4e-797c0c293adb

📥 Commits

Reviewing files that changed from the base of the PR and between 78287bd and 088871d.

📒 Files selected for processing (6)
  • .trajectories/index.json
  • cmd/relayfile-cli/main.go
  • internal/mountsync/bootstrap_test.go
  • internal/mountsync/integration_test.go
  • internal/mountsync/sync_transport_test.go
  • internal/mountsync/syncer.go
📝 Walkthrough

Walkthrough

This PR introduces resumable bootstrap with persistent state tracking, granular timeout controls, and a watchdog-based idle-timeout mechanism for long-running operations. HTTP timeouts are refactored from whole-request caps to per-operation deadlines with context-based cancellation, enabling progress-driven streaming for large initial syncs. Bootstrap progress is now exposed in operator status output and CLI monitoring.

Changes

Bootstrap Resumption and Timeout Model

Layer / File(s) Summary
HTTP transport refactoring: granular timeouts and context-based cancellation
internal/mountsync/syncer.go, internal/mountsync/sync_transport_test.go
NewSyncTransport and NewSyncHTTPClient provide granular per-operation timeouts (connect/TLS/response-header) with http.Client.Timeout normalized to 0, enabling context-based cancellation instead of whole-request caps. Three new tests validate the timeout model and confirm streaming completes past historical caps.
Bootstrap state model and persistent resumption fields
internal/mountsync/syncer.go
mountState gains BootstrapComplete, BootstrapCursor, BootstrapFilesSynced, BootstrapFilesTotal, and BootstrapStartedAt to enable resume after interruption. publicState.Bootstrap field with bootstrapStatus struct surfaces phase/file-counts/started-at in operator status.
CLI and configuration: timeout flags and syncer option wiring
cmd/relayfile-cli/main.go, cmd/relayfile-mount/main.go, internal/mountsync/syncer.go
New --bootstrap-timeout, --cursor-timeout, and --full-reconcile flags are added to mount subcommands and wired into SyncerOptions (CursorTimeout, BootstrapTimeout, ForceFullReconcile) via NewSyncer.
Option resolution: duration parsing and fallback semantics
internal/mountsync/syncer.go
resolveDurationEnv helper parses timeout options/env with fallback rules; NewSyncer resolves and stores cursorTimeout, bootstrapTimeout, bootstrapIdleTimeout, and forceFullReconcile from options/env.
Bootstrap watchdog and progress extension mechanism
internal/mountsync/syncer.go
bootstrapContext helper returns context/cancel and a progress-touch callback; in progress-extension mode, a watchdog goroutine cancels only if no touches occur within the configured idle window, extending liveness during long operations.
Full-pull refactoring: resumable bootstrap and partial progress persistence
internal/mountsync/syncer.go
pullRemoteFull and pullRemoteFullTree accept bootstrapProgress, touch progress during export/traversal, persist resumable cursors and file counts during interruption, and defer snapshot-delete passes when resuming from a persisted cursor to avoid clobbering already-mirrored prefixes.
Restart fast-path: BootstrapComplete gate and independent cursor timeout
internal/mountsync/syncer.go, internal/mountsync/syncer_test.go
Restart fast-path now requires BootstrapComplete=true before skipping the bootstrap full pull. resolveLatestEventCursor uses an independent cursor-timeout from rootCtx, preventing slow events feeds from wedging healthy cycles. Test seeds updated to explicitly gate on BootstrapComplete.
Bootstrap progress in operator status and CLI output
internal/mountsync/syncer.go, cmd/relayfile-cli/main.go
publicState surfaces bootstrap-in-progress details overriding normal status logic. CLI parses .relay/state.json bootstrap block via readBootstrapStatus and renders "bootstrapping: / files" progress in runStatus. Deadletter retry switches to new sync transport without whole-request timeout.
Stall monitoring: suppress misleading stall messages during bootstrap
cmd/relayfile-cli/main.go
Mount cycle failure and "no successful reconcile for 10m" handlers check bootstrap status; when active, clear stallReason, log bootstrap progress, and write snapshots without labeling it as a generic stall.
Mount main: deadline handling with bootstrap progress detection
cmd/relayfile-mount/main.go
Special handling for context.DeadlineExceeded reads bootstrap progress from .relay/state.json and logs bootstrapping status, returning early instead of treating deadline as terminal. Includes readBootstrapProgress helper for state file parsing.
Bootstrap test infrastructure and comprehensive test suite
internal/mountsync/bootstrap_test.go
bootstrapClient test double with configurable pagination, failure injection, and delays. Nine tests cover timeout decoupling, progress extension, cursor resumption, BootstrapComplete gating, force-full-reconcile override, cursor-timeout independence, status surfacing, snapshot-delete skip, and watchdog cleanup. Includes state-loading and goroutine-leak helpers.
Integration tests: transport validation and large-workspace chaos
internal/mountsync/sync_transport_test.go, internal/mountsync/integration_test.go
sync_transport_test.go validates granular transport timeouts and successful streaming past historical caps. TestLargeWorkspaceChaosEventuallyConverges uses largeChaosCloud server with deterministic pagination, intermittent 500s, and delays; verifies local mirror convergence under chaos and short per-cycle timeouts.
Administrative updates: trajectory index refresh
.trajectories/index.json
lastUpdated timestamp refreshed; two completed and one active trajectory entry appended with metadata (titles, statuses, start/completion times, paths).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

This PR introduces a substantial timeout and bootstrap state model with persistent resumption, affecting core reconciliation logic, HTTP client construction, status reporting, and deadline handling across CLI, mount main, and syncer internals. The changes involve high-density logic (watchdog goroutines, progress-touch callbacks, cursor persistence), multiple interacting components (timeouts, state fields, context derivation), and comprehensive test coverage. While most changes are contained to mountsync and CLI logic, the breadth of modifications to restart fast-path gating, full-pull streaming, stall suppression, and operator status surface requires careful review of interaction between layers.

Possibly related PRs

  • AgentWorkforce/relayfile#163: Modifies cmd/relayfile-cli/main.go dead-letter writeback behavior; directly related to the deadletter retry transport changes in this PR.
  • AgentWorkforce/relayfile#164: Modifies mountsync snapshot-delete decision logic in full-tree/full-export pulls; related to the resumable bootstrap snapshot-delete deferral in this PR.
  • AgentWorkforce/relayfile#92: Modifies internal/mountsync/syncer.go restart fast-path logic and cursor-based bootstrap skipping; overlaps with this PR's BootstrapComplete gating and restart fast-path tests.

Poem

🐰 A rabbit bounds through bootstrap's deep woods,
Resume point saved when the path grows hard,
Watchdog bells chime when progress goes quiet,
Each timeout granular, no whole-request regard,
The mirror converges, one file-synced forward!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.54% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: fixing mount daemon convergence on large workspaces through bootstrap timeout, safe fast-path, and resumable pull functionality.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, detailing all six defects fixed, design decisions, commits, validation, and new environment knobs.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/mount-bootstrap-large-workspace

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 18, 2026

Relayfile Eval Review

Run: .relayfile/evals/runs/2026-05-19T10-43-32-088Z-HEAD-provider
Mode: provider
Git SHA: b3f3d49

Passed: 4 | Needs human: 0 | Reviewable: 0 | Missing output: 0 | Failed: 0 | Skipped: 0

Human Review Cases

No reviewable human-review cases captured Relayfile output.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/relayfile-cli/main.go (1)

3633-3658: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the new mount flags in the custom help output.

These flags are wired up here, but relayfile mount --help does not show them because this command discards the default flag help and prints printMountHelp instead. Right now --bootstrap-timeout, --cursor-timeout, and --full-reconcile are effectively undiscoverable from the CLI.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/relayfile-cli/main.go` around lines 3633 - 3658, The new flags
bootstrapTimeout, cursorTimeout and fullReconcile are registered with the
FlagSet but never documented because mount uses a custom printMountHelp; update
the custom help to include these flags by adding their descriptions to
printMountHelp (or by emitting fs.PrintDefaults()/FlagSet.VisitAll output when
printing help for the mount command) so --bootstrap-timeout, --cursor-timeout
and --full-reconcile become discoverable; locate the flag registrations
(bootstrapTimeout, cursorTimeout, fullReconcile) and modify printMountHelp to
include matching lines or programmatically include the FlagSet's defaults when
fs.Parse is bypassed.
🧹 Nitpick comments (1)
internal/mountsync/syncer.go (1)

259-271: 💤 Low value

Mutating caller's http.Client could surprise callers who share the object.

This directly modifies httpClient.Timeout on the caller's object. If a caller passes an *http.Client they also use elsewhere (e.g., for non-bootstrap operations where a timeout cap is desired), this mutation silently disables their timeout. Consider cloning the client or documenting this prominently.

The existing comment explains the rationale, so this is a minor concern given the defensive intent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/mountsync/syncer.go` around lines 259 - 271, Don't mutate the
caller's *http.Client; instead make a shallow copy, adjust the copy's Timeout
and Transport, and use that copy. Concretely, replace direct writes to
httpClient.Timeout/Transport with a local clone (e.g. c := *httpClient; client
:= &c), set client.Timeout = 0 and if client.Transport == nil set
client.Transport = NewSyncTransport(), and continue using client in the
surrounding code (references to httpClient should be switched to the cloned
client).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.trajectories/index.json:
- Line 244: Replace the absolute filesystem paths used in the JSON "path" fields
with repository-relative paths so entries remain portable; locate the "path"
keys that contain values like
"/private/tmp/relayfile-bootstrap-fix/.trajectories/.../traj_nxbsptr6c5q0.json"
(and the similar entries mentioned) and change them to
".trajectories/.../traj_nxbsptr6c5q0.json" (i.e., drop the leading
"/private/tmp/relayfile-bootstrap-fix/") so all "path" values use the same
relative ".trajectories/..." format.

In `@cmd/relayfile-cli/main.go`:
- Around line 3009-3016: The retry path removed the http.Client.Timeout which
leaves requests running under context.Background(), allowing HandleLocalChange
in the relayfile writeback retry to hang indefinitely after headers succeed;
restore an end-to-end deadline by wiring a request-scoped timeout into the
client used by mountsync.NewHTTPClient (or by enforcing a per-request context
with timeout inside the retry loop that calls HandleLocalChange), e.g., create a
context.WithTimeout around the code path that performs the upload/retry and pass
that context into the transport/client so newWritebackFailureTransport /
mountsync.NewSyncTransport operations are bounded; ensure the timeout value is
configurable and applied consistently for the code paths referenced
(mountsync.NewHTTPClient, newWritebackFailureTransport, HandleLocalChange).
- Around line 6934-6943: The code currently clears stallReason for any reconcile
error when a bootstrap status exists; change the bootstrap handling (the block
using readBootstrapStatus(localDir), stallReason, writeSnapshot) to only
suppress and return early when the reconcile error is a deadline exceeded error
by wrapping the existing branch in a check like errors.Is(err,
context.DeadlineExceeded) (or equivalent); if the error is not a deadline
exceeded, do not clear stallReason or return early so real failures (auth/local
I/O/bad responses) continue to surface. Ensure you import/usage errors and
context as needed and reference the existing readBootstrapStatus, stallReason,
writeSnapshot identifiers.

In `@internal/mountsync/bootstrap_test.go`:
- Around line 252-258: The test sets s.bootstrapIdleTimeout to 80ms but gives
Reconcile a too-short inbound deadline (context.WithTimeout(..., 1ms)) which
lets the per-cycle deadline mask the idle-timeout behavior; change the test to
create a context deadline comfortably longer than s.bootstrapIdleTimeout (e.g.,
>80ms, like 200ms) when calling s.Reconcile, record the start time, call
s.Reconcile(ctx), assert that it returns an error, and then assert the elapsed
time is at least s.bootstrapIdleTimeout to prove the failure came from the idle
window; reference s.bootstrapIdleTimeout, Reconcile, and context.WithTimeout to
locate and update the test accordingly.

In `@internal/mountsync/integration_test.go`:
- Around line 369-375: The convergence check currently uses ">= fileCount" which
can hide stray files; update the condition in the loop that sets "converged" to
require an exact match by using "countLocalFiles(t, localDir) == fileCount"
(referencing countLocalFiles, localDir, fileCount, converged) so the test only
succeeds when the local directory has exactly the expected number of files;
adjust any related failure message if needed to reflect exact-match semantics.

---

Outside diff comments:
In `@cmd/relayfile-cli/main.go`:
- Around line 3633-3658: The new flags bootstrapTimeout, cursorTimeout and
fullReconcile are registered with the FlagSet but never documented because mount
uses a custom printMountHelp; update the custom help to include these flags by
adding their descriptions to printMountHelp (or by emitting
fs.PrintDefaults()/FlagSet.VisitAll output when printing help for the mount
command) so --bootstrap-timeout, --cursor-timeout and --full-reconcile become
discoverable; locate the flag registrations (bootstrapTimeout, cursorTimeout,
fullReconcile) and modify printMountHelp to include matching lines or
programmatically include the FlagSet's defaults when fs.Parse is bypassed.

---

Nitpick comments:
In `@internal/mountsync/syncer.go`:
- Around line 259-271: Don't mutate the caller's *http.Client; instead make a
shallow copy, adjust the copy's Timeout and Transport, and use that copy.
Concretely, replace direct writes to httpClient.Timeout/Transport with a local
clone (e.g. c := *httpClient; client := &c), set client.Timeout = 0 and if
client.Transport == nil set client.Transport = NewSyncTransport(), and continue
using client in the surrounding code (references to httpClient should be
switched to the cloned client).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: dea5bfd6-0c4f-43ce-8f87-5027dc8b8cde

📥 Commits

Reviewing files that changed from the base of the PR and between 920b3f2 and 78287bd.

📒 Files selected for processing (8)
  • .trajectories/index.json
  • cmd/relayfile-cli/main.go
  • cmd/relayfile-mount/main.go
  • internal/mountsync/bootstrap_test.go
  • internal/mountsync/integration_test.go
  • internal/mountsync/sync_transport_test.go
  • internal/mountsync/syncer.go
  • internal/mountsync/syncer_test.go

Comment thread .trajectories/index.json Outdated
Comment thread cmd/relayfile-cli/main.go
Comment thread cmd/relayfile-cli/main.go
Comment thread internal/mountsync/bootstrap_test.go
Comment thread internal/mountsync/integration_test.go Outdated
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment thread internal/mountsync/syncer.go Outdated
Comment on lines 2267 to 2268
return nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 snapshotDeleteUnsafe check shadows !startedFromEmpty, preventing resumed bootstraps from ever completing

In pullRemoteFullTree, the snapshotDeleteUnsafe check at line 2264 fires before the !startedFromEmpty check at line 2279. A resumed bootstrap traversal only populates remotePaths with the tail portion of the tree (from the persisted cursor onward). When the tail is less than 50% of the total tracked file count, snapshotDeleteUnsafe returns true and the function exits at line 2267 without calling markBootstrapComplete(). The !startedFromEmpty branch at line 2279 — which correctly calls markBootstrapComplete() — is never reached.

Full impact trace for a 400-file workspace interrupted at page 3/4

Cycle 1: Pages 1-3 applied (300 tracked files), page 4 fails. State: BootstrapCursor=cursor_p3, BootstrapComplete=false.

Cycle 2 (resume): Traversal resumes from cursor_p3, page 4 applied (~100 files). remotePaths=100, tracked=400. snapshotDeleteUnsafe(100): 100 < 400*0.5=200 → true. Returns nil WITHOUT markBootstrapComplete(). BootstrapCursor still points to cursor_p3.

Every subsequent periodic full-pull (every ~10 min): resumes from stale cursor_p3, reads only the last ~100 files, snapshotDeleteUnsafe fires again. The periodic full-pull safety net is permanently broken (never does a real full pull), the restart fast-path never engages (BootstrapComplete stays false), and the stale BootstrapCursor is never cleared.

(Refers to lines 2264-2268)

Prompt for agents
In pullRemoteFullTree (syncer.go), the snapshotDeleteUnsafe check at line 2264 fires before the !startedFromEmpty check at line 2279. When a resumed traversal completes, remotePaths only covers the tail of the tree, so snapshotDeleteUnsafe incorrectly flags the listing as degraded. This prevents markBootstrapComplete() from being called, creating an infinite loop where every periodic full-pull re-reads just the tail.

Fix: move the !startedFromEmpty check (lines 2279-2284) BEFORE the snapshotDeleteUnsafe check (lines 2264-2268). A resumed traversal is already known to be partial, so the snapshot-delete safety check is not applicable — it is designed for degraded cloud responses, not intentionally partial listings. Alternatively, skip the snapshotDeleteUnsafe check entirely when startedFromEmpty is false, since the delete pass is already skipped in that branch.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment thread internal/mountsync/syncer.go Outdated
Comment on lines +259 to +271
} else if httpClient.Timeout != 0 {
// A caller passed an explicit whole-request Timeout. That cap is
// fatal to the bootstrap full-pull (net/http enforces it
// independent of context, killing a long-but-progressing body
// read). Normalize it away and ensure a granular sync transport
// is in place; cancellation remains fully covered by the
// per-cycle / bootstrap / cursor contexts. This guarantees the
// bootstrap path is never whole-request-capped no matter how the
// client was constructed.
httpClient.Timeout = 0
if httpClient.Transport == nil {
httpClient.Transport = NewSyncTransport()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 NewHTTPClient unconditionally zeroes caller's http.Client.Timeout, removing FUSE mount's per-request timeout

NewHTTPClient (syncer.go:259-268) mutates the caller's *http.Client by forcibly setting Timeout = 0 when it was non-zero. The FUSE mount path (cmd/relayfile-mount/fuse_mount.go:20) passes &http.Client{Timeout: cfg.timeout} — this timeout intentionally bounded individual FUSE API requests (ReadFile, ListTree, WriteFile, DeleteFile) that back synchronous kernel filesystem operations. After the mutation, FUSE API calls have no whole-request timeout, relying only on ResponseHeaderTimeout: 30s from NewSyncTransport. The FUSE mount does not use bootstrap and has no context-based cancellation for individual API calls, so the explicit Timeout was the primary safety mechanism against hung requests blocking kernel-level filesystem operations.

Prompt for agents
NewHTTPClient (syncer.go:259-271) unconditionally zeroes the caller's http.Client.Timeout. This is needed for the poll-mode bootstrap path but breaks the FUSE mount path (fuse_mount.go:20) which relies on the whole-request Timeout to bound individual FUSE API calls.

Possible approaches:
1. Instead of mutating the caller's http.Client, clone it internally: create a new http.Client with Timeout=0 and the original (or new) Transport, leaving the caller's object untouched.
2. Have the FUSE mount path construct its client differently (e.g., passing a flag or using a separate constructor) so it is not subject to the bootstrap normalization.
3. Add a parameter or option to NewHTTPClient that controls whether the Timeout should be preserved (for FUSE) or normalized away (for poll/bootstrap).

The key invariant: poll-mode bootstrap must have Timeout=0, FUSE-mode must retain its per-request Timeout.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@khaliqgant khaliqgant merged commit 0eba617 into main May 19, 2026
9 checks passed
@khaliqgant khaliqgant deleted the fix/mount-bootstrap-large-workspace branch May 19, 2026 10:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant