fix(creds): keep the job queue alive across transient token refreshes#955
Open
bernardgut wants to merge 2 commits into
Open
fix(creds): keep the job queue alive across transient token refreshes#955bernardgut wants to merge 2 commits into
bernardgut wants to merge 2 commits into
Conversation
On a transient OAuth refresh error, HttpCredentials::refreshAccessTokenInternal schedules a retry but then also emits authenticationFailed(). Account reacts to that signal with JobQueue::clear() (account.cpp), aborting every in-flight upload. The sync run then finalizes as "complete", so the aborted files are silently never retried: directories exist but files are missing, and subsequent syncs skip the already-recorded items (silent data loss). Emit authenticationFailed() only on the terminal branch (TokenRefreshMaxRetries exceeded, a real logout). On a transient error the queue stays blocked across the scheduled retry and resumes via fetched()->unblock() once the refresh succeeds. Fixes opencloud-eu#900 Refs opencloud-eu#948 Authored-By: Bernard Gütermann <bernard.gutermann@sekops.ch>
9bbc6ee to
19b0a75
Compare
Verify that transient OAuth token-refresh errors do NOT emit authenticationFailed(), which would clear the job queue and abort in-flight uploads (opencloud-eu#900, opencloud-eu#948). Also verify that after TokenRefreshMaxRetries consecutive errors, authenticationFailed() IS emitted (terminal failure). Authored-By: Bernard Gütermann <bernard.gutermann@sekops.ch>
Author
Regression test added (f44ea97)Added two regression tests to
Prove-the-test-bites result
Full suite: 31/31 pass, 0 regressions. Built and tested in |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The desktop client can silently drop files when an OAuth access token is refreshed mid-sync. Afterwards directories exist on the server but files inside are missing, and later syncs skip the already-recorded items — silent, persistent loss. This is the mechanism behind #900 ("Aborted after a single 401 on
/me/drives") and #948 (a transient.well-known/openid-configurationresponse aborts all jobs).Root cause
HttpCredentials::refreshAccessTokenInternal()splits refresh errors into transient (schedule a retry) and terminal (give up afterTokenRefreshMaxRetries). The transient branch schedules the retry but then alsoQ_EMIT authenticationFailed();.Accountconnects that signal toJobQueue::clear()(src/libsync/account.cpp), whichabort()s every queued upload. Aborted-but-unsent jobs aredeleteLater()d without ever callingdone(), so no error/retry flag is set;SyncEngine::finalize()then commits the journal as "All Finished" and the dropped files are never re-discovered.So a transient refresh — which the client intends to retry — is treated as a terminal failure for the job queue.
Fix
Emit
authenticationFailed()only on the terminal branch. On a transient error the job queue stays blocked (it was blocked byauthenticationStarted()) across the scheduled retry and resumes viafetched() -> JobQueueGuard::unblock()on success. The early-returnis restructured into an explicitif (terminal) / else (transient)so the two outcomes are clear.Why it's safe
authenticationFailed()has two consumers —account.cpp(JobQueue::clear()) andcmd.cpp(qFatal()inopencloudcmd). Both should react only to a terminal failure; the CLI now waits out the bounded retry instead of dying on a transient blip. The retry is bounded byTokenRefreshMaxRetries == 3, and the terminal branch still emits the signal, so the queue cannot stay blocked forever.Reproduction
OpenCloud + per-drive OAuth (Keycloak),
mirall/3.0.3.2073, sync a folder large enough that the run outlives one access-token lifetime. At the expiry boundary the proxy logs a single401on/me/drives; the client goes "Aborted", directories are created but files are missing, and re-syncing does not recover them.Testing
I don't have a Qt build environment to run the suite here, and there is currently no test that drives
HttpCredentials::refreshAccessToken(test/testoauthcoversAccountBasedOAuth;test/testjobqueuecovers the queue primitives). The right regression test would deliver a transientrefreshErrorand assert (a)authenticationFailed()is not emitted and the queue stays blocked, then (b) afterTokenRefreshMaxRetriesit is emitted. Glad to add it if you can point me at the preferred seam (extend thetestoauthFakeAM/FakeErrorReplyharness to instantiateHttpCredentials, or an integration test intestsyncengine).Fixes #900
Refs #948