fix(creds): keep the job queue alive across transient token refreshes #955

Open: bernardgut wants to merge 2 commits into opencloud-eu:main from bernardgut:fix/transient-token-refresh-aborts-uploads

bernardgut commented Jun 21, 2026

Problem

The desktop client can silently drop files when an OAuth access token is refreshed mid-sync. Afterwards directories exist on the server but files inside are missing, and later syncs skip the already-recorded items — silent, persistent loss. This is the mechanism behind #900 ("Aborted after a single 401 on /me/drives") and #948 (a transient .well-known/openid-configuration response aborts all jobs).

Root cause

HttpCredentials::refreshAccessTokenInternal() splits refresh errors into transient (schedule a retry) and terminal (give up after TokenRefreshMaxRetries). The transient branch schedules the retry but then also runs Q_EMIT authenticationFailed();. Account connects that signal to JobQueue::clear() (src/libsync/account.cpp), which abort()s every queued upload. Jobs that are aborted before being sent are deleteLater()d without ever calling done(), so no error or retry flag is set; SyncEngine::finalize() then commits the journal as "All Finished" and the dropped files are never rediscovered.
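
To make the failure mode concrete, here is a minimal standalone model of it in plain C++. It is not the client's code: `ModelSyncRun`, `clearQueue()`, `finalizeLooksClean()` and `demonstratesSilentLoss()` are hypothetical names chosen for illustration, and the model only covers jobs that were still unsent when the queue was cleared.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one sync run; none of these names exist in the client.
struct ModelSyncRun {
    std::vector<std::string> pendingUploads; // queued, not yet sent
    std::vector<std::string> uploaded;       // actually reached the server
    int errorsRecorded = 0;                  // what a done() call would have bumped

    void enqueue(const std::string &path) { pendingUploads.push_back(path); }

    // Models the clear() path described above: unsent jobs are simply
    // discarded, and because done() never runs, no error is recorded.
    void clearQueue() { pendingUploads.clear(); }

    // Models the finalize step: with nothing pending and no error recorded,
    // the run looks successful even though nothing was uploaded.
    bool finalizeLooksClean() const {
        return pendingUploads.empty() && errorsRecorded == 0;
    }
};

// Walks through the failure: two files are queued, the queue is cleared,
// and the run still finalizes as clean with zero files uploaded.
inline bool demonstratesSilentLoss() {
    ModelSyncRun run;
    run.enqueue("docs/a.txt");
    run.enqueue("docs/b.txt");
    run.clearQueue();
    return run.finalizeLooksClean() && run.uploaded.empty();
}
```

In this model `demonstratesSilentLoss()` returns true: the run ends with an empty queue, no recorded error, and nothing uploaded, which is the combination that lets later syncs skip the files.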

So a transient refresh error, which the client intends to retry, is treated as a terminal failure for the job queue.

Fix

Emit authenticationFailed() only on the terminal branch. On a transient error the job queue stays blocked (it was blocked by authenticationStarted()) across the scheduled retry and resumes via fetched() -> JobQueueGuard::unblock() on success. The early-return is restructured into an explicit if (terminal) / else (transient) so the two outcomes are clear.
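
The intended control flow can be sketched as a small state model. This is an illustration under stated assumptions, not the patch itself: `ModelCredentials`, `QueueState` and `kModelMaxRetries` are hypothetical names, signals are replaced by plain counters, and the exact point at which the real code counts an attempt as terminal is assumed to be the third consecutive error.

```cpp
#include <cassert>

// Assumed to mirror TokenRefreshMaxRetries == 3 from the description.
constexpr int kModelMaxRetries = 3;

// Hypothetical queue states; the real queue is driven by signals instead.
enum class QueueState { Running, Blocked, Cleared };

struct ModelCredentials {
    int consecutiveErrors = 0;
    int retriesScheduled = 0;
    int authFailedEmitted = 0; // counts stand-in emissions of authenticationFailed()
    QueueState queue = QueueState::Running;

    // Stands in for authenticationStarted(): the queue is blocked, not cleared.
    void startRefresh() { queue = QueueState::Blocked; }

    void onRefreshError() {
        ++consecutiveErrors;
        if (consecutiveErrors >= kModelMaxRetries) {
            // Terminal branch: the only place the failure signal is emitted,
            // which in turn clears the queue.
            ++authFailedEmitted;
            queue = QueueState::Cleared;
        } else {
            // Transient branch: schedule a retry and leave the queue blocked.
            ++retriesScheduled;
        }
    }

    // Stands in for fetched() -> unblock(): queued jobs resume.
    void onRefreshSuccess() {
        consecutiveErrors = 0;
        queue = QueueState::Running;
    }
};
```

With this shape, one error followed by a successful refresh leaves the queue running with no failure emitted, while three consecutive errors emit the failure exactly once.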

Why it's safe

authenticationFailed() has two consumers — account.cpp (JobQueue::clear()) and cmd.cpp (qFatal() in opencloudcmd). Both should react only to a terminal failure; the CLI now waits out the bounded retry instead of dying on a transient blip. The retry is bounded by TokenRefreshMaxRetries == 3, and the terminal branch still emits the signal, so the queue cannot stay blocked forever.
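
As a rough illustration of why the emission point matters for both consumers, the sketch below models a signal with two connected handlers in plain C++. `ModelSignal` and `ModelConsumers` are hypothetical stand-ins, not Qt's signal/slot machinery, and the counters merely represent the account.cpp and cmd.cpp reactions described above.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Minimal stand-in for a signal: every connected handler runs on each fire().
struct ModelSignal {
    std::vector<std::function<void()>> handlers;
    void connect(std::function<void()> handler) { handlers.push_back(std::move(handler)); }
    void fire() const {
        for (const auto &handler : handlers) {
            handler();
        }
    }
};

// Hypothetical consumers; the counters replace the real side effects.
struct ModelConsumers {
    int queueClears = 0; // stands in for the account.cpp reaction
    int fatalExits = 0;  // stands in for the cmd.cpp reaction (qFatal() in the CLI)

    // The lambdas capture `this`, so the object must outlive the signal's use.
    void attach(ModelSignal &authenticationFailed) {
        authenticationFailed.connect([this] { ++queueClears; });
        authenticationFailed.connect([this] { ++fatalExits; });
    }
};
```

Each `fire()` reaches both handlers, so in this model a single emission on a transient error would both clear the queue and terminate the CLI; restricting emission to the terminal branch changes both reactions at once.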

Reproduction

OpenCloud + per-drive OAuth (Keycloak), mirall/3.0.3.2073, sync a folder large enough that the run outlives one access-token lifetime. At the expiry boundary the proxy logs a single 401 on /me/drives; the client goes "Aborted", directories are created but files are missing, and re-syncing does not recover them.

Testing

I don't have a Qt build environment to run the suite here, and there is currently no test that drives HttpCredentials::refreshAccessToken (test/testoauth covers AccountBasedOAuth; test/testjobqueue covers the queue primitives). The right regression test would deliver a transient refreshError and assert (a) authenticationFailed() is not emitted and the queue stays blocked, then (b) after TokenRefreshMaxRetries it is emitted. Glad to add it if you can point me at the preferred seam (extend the testoauth FakeAM/FakeErrorReply harness to instantiate HttpCredentials, or an integration test in testsyncengine).

Fixes #900
Refs #948

On a transient OAuth refresh error, HttpCredentials::refreshAccessTokenInternal
schedules a retry but then also emits authenticationFailed(). Account reacts to
that signal with JobQueue::clear() (account.cpp), aborting every in-flight upload.
The sync run then finalizes as "complete", so the aborted files are silently never
retried: directories exist but files are missing, and subsequent syncs skip the
already-recorded items (silent data loss).

Emit authenticationFailed() only on the terminal branch (TokenRefreshMaxRetries
exceeded, a real logout). On a transient error the queue stays blocked across the
scheduled retry and resumes via fetched()->unblock() once the refresh succeeds.

Fixes opencloud-eu#900
Refs opencloud-eu#948

Authored-By: Bernard Gütermann <bernard.gutermann@sekops.ch>

bernardgut force-pushed the fix/transient-token-refresh-aborts-uploads branch from 9bbc6ee to 19b0a75 on June 21, 2026 10:34

Verify that transient OAuth token-refresh errors do NOT emit
authenticationFailed(), which would clear the job queue and abort
in-flight uploads (opencloud-eu#900, opencloud-eu#948). Also verify that after
TokenRefreshMaxRetries consecutive errors, authenticationFailed()
IS emitted (terminal failure).

Authored-By: Bernard Gütermann <bernard.gutermann@sekops.ch>

@bernardgut (Author)

Regression test added (f44ea97)

Added two regression tests to test/testoauth/testoauth.cpp covering the fix's observable contract:

testTransientRefreshDoesNotEmitAuthFailed() — verifies that a transient refreshError (e.g. TimeoutError) does not emit authenticationFailed(), so the job queue stays blocked (not cleared) and in-flight uploads are preserved.

testTerminalRefreshEmitsAuthFailed() — verifies that after TokenRefreshMaxRetries (3) consecutive incrementing errors (ContentNotFoundError, timeout=0s), authenticationFailed() is emitted along with fetched(), triggering the terminal logout path.

Prove-the-test-bites result

| Scenario | testTransient | testTerminal |
| --- | --- | --- |
| Bug reintroduced (Q_EMIT authenticationFailed() in transient branch) | FAIL | FAIL |
| Fix applied (current) | PASS | PASS |

Full suite: 31/31 pass, 0 regressions.

Built and tested in opencloudeu/desktop-client-build:ubuntu-24.04-qt6.10 with -DBUILD_TESTING=ON.

Successfully merging this pull request may close these issues.

Per-drive sync stays in Aborted after a single 401 on /me/drives, even when subsequent requests succeed
