fix: set _waitForDelete=false in moveChunk to prevent ExceededTimeLimit #65

Open
hyeonghwa wants to merge 1 commit into johnlpage:master from hyeonghwa:fix/movechunk-wait-for-delete-timeout

@hyeonghwa

Problem

During performance testing on the router (mongos), the following error occurred repeatedly:

Command failed with error 96 (OperationFailed):
'Data transfer error: ExceededTimeLimit: Failed to delete orphaned POCDB.POCCOLL
range [{ _id: { w: 1, i: 1 } }, { _id: MaxKey }) :: caused by :: operation exceeded time limit'

Root Cause

In MongoWorker.java, the moveChunk command was issued with _waitForDelete: true:

admindb.runCommand(new Document("moveChunk", ...)
    .append("_secondaryThrottle", true)
    .append("_waitForDelete", true)   // <-- root cause
    .append("writeConcern", new Document("w", "majority")));

_waitForDelete: true forces moveChunk to block synchronously until all orphaned documents are deleted from the donor shard after chunk migration.

When the orphan range is large, as with [{ _id: { w: 1, i: 1 } }, { _id: MaxKey }) here, which extends to the end of the key space (the last chunk), the deletion exceeds the server's operation time limit. moveChunk then fails with error 96 (OperationFailed), with ExceededTimeLimit as the underlying cause.
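To tell this failure apart from other moveChunk errors in test logs, a check along the following lines may help. It is only a sketch: the class and method names are hypothetical and not part of POCDriver or the MongoDB Java driver, and it assumes the error code and message have already been extracted from whatever exception the driver raised.

```java
// Sketch only: recognizes the failure quoted in the Problem section.
// MoveChunkErrors and isOrphanDeleteTimeout are hypothetical names.
final class MoveChunkErrors {
    // Error code reported in the failure above (OperationFailed).
    static final int OPERATION_FAILED = 96;

    // True when the failure looks like an orphan-range deletion that timed out.
    static boolean isOrphanDeleteTimeout(int code, String message) {
        return code == OPERATION_FAILED
                && message != null
                && message.contains("ExceededTimeLimit")
                && message.contains("Failed to delete orphaned");
    }

    public static void main(String[] args) {
        String msg = "Data transfer error: ExceededTimeLimit: Failed to delete orphaned "
                + "POCDB.POCCOLL range [{ _id: { w: 1, i: 1 } }, { _id: MaxKey }) "
                + ":: caused by :: operation exceeded time limit";
        System.out.println(isOrphanDeleteTimeout(96, msg)); // prints true
    }
}
```

Matching on message text is brittle across server versions, so this is better suited to log triage than to production retry logic.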

Fix

Changed _waitForDelete from true to false so that orphan cleanup runs asynchronously in the background, allowing moveChunk to return immediately without blocking on deletion.

// Before
.append("_waitForDelete", true)

// After
.append("_waitForDelete", false)
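If both behaviors are worth keeping for comparison runs, the flag could be passed in rather than hard-coded. The sketch below is an assumption, not the code in MongoWorker.java: it uses a plain LinkedHashMap as a stand-in for org.bson.Document so it runs without the driver, and the namespace, find, and toShard parameters are hypothetical placeholders for the arguments elided in the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: builds the moveChunk command with _waitForDelete as a parameter.
// MoveChunkCommand is a hypothetical helper; a Map stands in for org.bson.Document.
final class MoveChunkCommand {
    static Map<String, Object> build(String namespace, Map<String, Object> find,
                                     String toShard, boolean waitForDelete) {
        Map<String, Object> writeConcern = new LinkedHashMap<>();
        writeConcern.put("w", "majority");

        Map<String, Object> cmd = new LinkedHashMap<>();
        cmd.put("moveChunk", namespace);           // must be the first key of the command
        cmd.put("find", find);                     // assumed: a document inside the chunk to move
        cmd.put("to", toShard);                    // assumed: name of the destination shard
        cmd.put("_secondaryThrottle", true);
        cmd.put("_waitForDelete", waitForDelete);  // false: orphan cleanup runs in the background
        cmd.put("writeConcern", writeConcern);
        return cmd;
    }

    public static void main(String[] args) {
        Map<String, Object> find = new LinkedHashMap<>();
        find.put("_id", 1); // hypothetical shard key value
        Map<String, Object> cmd = build("POCDB.POCCOLL", find, "shard0001", false);
        System.out.println(cmd.get("_waitForDelete")); // prints false
    }
}
```

With the real driver, the same fields would go into a Document and be passed to admindb.runCommand as in the original snippet.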

Impact

  • moveChunk no longer times out during performance tests on sharded clusters with large orphan ranges.
  • Orphaned documents are still cleaned up in the background by the donor shard's range deleter, so no data is lost.
  • This is consistent with MongoDB's default behavior when _waitForDelete is omitted.

Test Plan

  • Verified fix resolves ExceededTimeLimit error on router during POCDriver performance test
  • Run full performance test suite (Insert / Key Query / Range Query / Update) on sharded cluster

When _waitForDelete=true, moveChunk blocks synchronously until orphaned
documents are deleted from the donor shard. If the orphan range is large
(e.g., up to { _id: MaxKey }), the deletion exceeds the server's operation
time limit, resulting in error code 96 (OperationFailed: ExceededTimeLimit).

Setting _waitForDelete=false makes orphan cleanup asynchronous, allowing
moveChunk to complete without timing out.