fix: set _waitForDelete=false in moveChunk to prevent ExceededTimeLimit #65
Open
hyeonghwa wants to merge 1 commit into
When _waitForDelete=true, moveChunk blocks synchronously until orphaned
documents are deleted from the donor shard. If the orphan range is large
(e.g., up to { _id: MaxKey }), the deletion exceeds the server's operation
time limit, resulting in error code 96 (OperationFailed: ExceededTimeLimit).
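Why a range ending at MaxKey is expensive to clean up can be shown with a simplified model. The class and method names below are hypothetical. Shard-key values are modeled as long integers, although the real key here is the compound { w, i } value of _id, ordered by BSON comparison rules. MaxKey is modeled as a missing (null) upper bound because it sorts above every other value. Under these assumptions, a range ending at MaxKey contains every key at or above its lower bound, so it keeps growing while inserts land past that bound.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

public class OrphanRangeModel {
    // Counts keys in the half-open range [lower, upper).
    // upper == null stands in for MaxKey (assumption: it sorts above every key),
    // so the range then has no effective upper bound.
    static int countInRange(NavigableSet<Long> keys, long lower, Long upper) {
        if (upper == null) {
            return keys.tailSet(lower, true).size();
        }
        return keys.subSet(lower, true, upper, false).size();
    }

    public static void main(String[] args) {
        NavigableSet<Long> keys = new TreeSet<>();
        for (long k = 0; k < 1_000; k++) {
            keys.add(k);
        }
        System.out.println(countInRange(keys, 100, 200L)); // 100
        System.out.println(countInRange(keys, 100, null)); // 900

        // Keys inserted above the lower bound grow only the MaxKey-bounded range.
        for (long k = 1_000; k < 2_000; k++) {
            keys.add(k);
        }
        System.out.println(countInRange(keys, 100, 200L)); // 100
        System.out.println(countInRange(keys, 100, null)); // 1900
    }
}
```

A bounded range keeps the same size as data grows; only the range that ends at MaxKey absorbs the new keys, so the work a synchronous cleanup must finish before returning grows with it.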
Setting _waitForDelete=false makes orphan cleanup asynchronous, allowing
moveChunk to complete without timing out.
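The difference between the two settings can be sketched as a toy model with java.util.concurrent. A background task stands in for the donor shard's orphan cleanup, and a bounded Future.get stands in for the server's operation time limit. This is an illustration under those assumptions, not how mongod implements range deletion. The moveChunk method here is a hypothetical local method, and the durations are arbitrary.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class WaitForDeleteModel {
    // Stand-in for orphan cleanup on the donor shard: sleeps instead of deleting.
    static Runnable orphanCleanup(long millis) {
        return () -> {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }

    // Hypothetical model of moveChunk: cleanup is always scheduled in the background.
    // With waitForDelete=true the caller blocks on it and fails once the time limit
    // passes; with waitForDelete=false it returns right after scheduling.
    static String moveChunk(ExecutorService rangeDeleter, boolean waitForDelete,
                            long cleanupMillis, long timeLimitMillis) throws InterruptedException {
        Future<?> cleanup = rangeDeleter.submit(orphanCleanup(cleanupMillis));
        if (!waitForDelete) {
            return "ok";
        }
        try {
            cleanup.get(timeLimitMillis, TimeUnit.MILLISECONDS);
            return "ok";
        } catch (TimeoutException e) {
            return "ExceededTimeLimit";
        } catch (ExecutionException e) {
            return "error";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService rangeDeleter = Executors.newSingleThreadExecutor();
        // Cleanup takes 500 ms, but the caller may only wait 100 ms.
        System.out.println(moveChunk(rangeDeleter, true, 500, 100));  // ExceededTimeLimit
        System.out.println(moveChunk(rangeDeleter, false, 500, 100)); // ok
        rangeDeleter.shutdown();
        rangeDeleter.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

In the false case the cleanup still runs to completion on the executor; the caller simply stops waiting on it, which is the asynchronous behavior the commit message describes.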
Problem
During performance testing on the router (mongos), the following error occurred repeatedly:
Root Cause
In MongoWorker.java, the moveChunk command was issued with _waitForDelete: true. This option forces moveChunk to block synchronously until all orphaned documents are deleted from the donor shard after chunk migration.
When the orphan range is large, the deletion exceeds the server's operation time limit, resulting in error code 96 (OperationFailed: ExceededTimeLimit). In this case the range was [{ _id: { w: 1, i: 1 } }, { _id: MaxKey }), which extends through the last chunk.
Fix
Changed _waitForDelete from true to false so that orphan cleanup runs asynchronously in the background, allowing moveChunk to return without blocking on deletion.
Impact
moveChunk no longer times out during performance tests on sharded clusters with large orphan ranges.
Orphan cleanup now behaves as it does when _waitForDelete is omitted, since the server default is false.
Test Plan
ExceededTimeLimit error on router during POCDriver performance test