fix(local): add threading.RLock to LocalCollection to prevent concurrent-upsert race (#1193)#1196
Conversation
…ent-upsert race (qdrant#1193) `LocalCollection` held no synchronization primitives. Two threads running `upsert` could interleave the mutations of `self.payload` (list append) and `self.deleted` (numpy concatenate), leaving them at mismatched lengths. A subsequent `scroll` / `search` then raised ValueError: operands could not be broadcast together with shapes (N,) (M,) Repro from issue qdrant#1193: 8 threads x 20 batches x 10 points, failed 2-3 trials out of 5 on qdrant-client 1.17.1 (and 1.12.2). Fix --- - `threading.RLock` created in `__init__`. Reentrant because the public write methods call private mutators that are also reachable from other write entry points. - `with self._lock:` wraps the full body of every public write entry point that touches the per-point parallel arrays (`payload`, `deleted`, `vectors`, `sparse_vectors`, `multivectors`, `ids`, `ids_inv`, `deleted_per_vector`): - `upsert` - `update_vectors` - `delete_vectors` - `delete` - `set_payload` - `overwrite_payload` - `delete_payload` - `clear_payload` Read paths (`search`, `scroll`, `retrieve`, ...) are NOT locked. The bug is that writes corrupted the arrays into a permanently inconsistent state — serializing writes removes the corruption. Reads that take a single `len(...)` at entry then slice will see a consistent snapshot. Leaving reads unlocked keeps the hot path lock-free and preserves the "in-memory is for testing / prototyping" performance tradeoff. Test ---- Adds `test_concurrent_upsert_does_not_corrupt_local_collection` (8 threads x 20 batches x 10 points = 1600 expected). Passes with the lock. Patched locally to a no-op lock: 3/5 trials reproduce the `operands could not be broadcast together` error, matching the issue reproducer output exactly. Ref: qdrant#1193
❌ Deploy Preview for poetic-froyo-8baba7 failed.
|
📝 WalkthroughWalkthroughA Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
qdrant_client/local/local_collection.py (1)
2553-2599:⚠️ Potential issue | 🟠 MajorWriter-reader race still reproduces the same broadcast error.
The lock only serializes writers, but
_add_pointmutatesself.payload(in-place append on line 2399) before it re-assignsself.deleted/self.deleted_per_vector[name]/self.vectors[name](lines 2403, 2421-2423, 2431-2434, etc.). Readers such asscroll,search,count,facet, andretrievego through_payload_and_non_deleted_mask, whosepayload_mask & ~self.deletedon line 523 will still raiseValueError: operands could not be broadcast together with shapes (N+1,) (N,)if a read runs while any writer is mid-_add_point. The issue#1193repro happened to join all writers before reading, so it is fixed, but the general "concurrent upsert + concurrent scroll/search" pattern is not.Given the PR's stated "read hot path lock-free" design choice this is intentional, but consider either:
- Documenting in the class/
__init__docstring that reads are not safe to overlap with writes, or- Reordering
_add_pointso per-point array growth is committed atomically from the reader's perspective (e.g., build the newdeleted/deleted_per_vector[name]/vectors[name]arrays first and only then append toself.payload/self.ids_inv/self.ids— readers then see either the old size or the new size, never a torn state). Option 2 preserves the lock-free read hot path.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@qdrant_client/local/local_collection.py` around lines 2553 - 2599, The upsert path can leave readers in a torn state because _add_point mutates self.payload in-place before committing related arrays, causing _payload_and_non_deleted_mask (used by scroll/search/count/retrieve/facet) to sometimes see mismatched shapes; fix by making per-point growth atomic: in _add_point (and any callers like upsert) construct new arrays/masks for self.deleted, self.deleted_per_vector[name], and self.vectors[name] (and ids/ids_inv) first (or extend copies), then perform the final append/assignment to self.payload and ids so readers see either the old full state or the new full state; ensure the lock around writers still surrounds the commit step and update any helper functions that assume in-place mutation to use the new-then-swap approach (references: upsert, _add_point, _payload_and_non_deleted_mask, scroll, search, count, retrieve, facet).
🧹 Nitpick comments (1)
tests/test_in_memory.py (1)
303-369: Nice, targeted regression test — optional: close the client.Sizing (8×20×10 = 1600) and the post-join
scroll(limit=expected_total + 1)precisely exercise the original failure mode. Two minor refinements if you feel like it:
- The
QdrantClientis never closed; for:memory:it's harmless in CI but leaves the sqlite connection dangling until GC. Atry/finallyorwith contextlib.closing(...)(or callingclient.close()) keeps the test tidy.- Consider asserting
len({p.id for p in points}) == expected_totalas well, to catch a hypothetical regression where the race silently drops or duplicates IDs without tripping the shape mismatch.💡 Optional tweak
- client = QdrantClient(":memory:") - client.create_collection( + client = QdrantClient(":memory:") + try: + client.create_collection( collection_name="race", ... - ) + ) ... - points, _ = client.scroll(collection_name="race", limit=expected_total + 1) - assert len(points) == expected_total, ( - f"expected {expected_total} points after concurrent upserts, got {len(points)}" - ) + points, _ = client.scroll(collection_name="race", limit=expected_total + 1) + assert len(points) == expected_total, ( + f"expected {expected_total} points after concurrent upserts, got {len(points)}" + ) + assert len({p.id for p in points}) == expected_total + finally: + client.close()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_in_memory.py` around lines 303 - 369, The test leaves the QdrantClient open and could be tightened to also assert uniqueness of returned IDs; wrap the QdrantClient usage in a try/finally (or use contextlib.closing/with) and call client.close() in the finally to ensure the in-memory sqlite connection is closed, and after the scroll call add an assertion like len({p.id for p in points}) == expected_total to ensure no duplicates/drops; update references in this test function (test_concurrent_upsert_does_not_corrupt_local_collection) and keep the existing scroll/expected_total logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@qdrant_client/local/local_collection.py`:
- Around line 2553-2599: The upsert path can leave readers in a torn state
because _add_point mutates self.payload in-place before committing related
arrays, causing _payload_and_non_deleted_mask (used by
scroll/search/count/retrieve/facet) to sometimes see mismatched shapes; fix by
making per-point growth atomic: in _add_point (and any callers like upsert)
construct new arrays/masks for self.deleted, self.deleted_per_vector[name], and
self.vectors[name] (and ids/ids_inv) first (or extend copies), then perform the
final append/assignment to self.payload and ids so readers see either the old
full state or the new full state; ensure the lock around writers still surrounds
the commit step and update any helper functions that assume in-place mutation to
use the new-then-swap approach (references: upsert, _add_point,
_payload_and_non_deleted_mask, scroll, search, count, retrieve, facet).
---
Nitpick comments:
In `@tests/test_in_memory.py`:
- Around line 303-369: The test leaves the QdrantClient open and could be
tightened to also assert uniqueness of returned IDs; wrap the QdrantClient usage
in a try/finally (or use contextlib.closing/with) and call client.close() in the
finally to ensure the in-memory sqlite connection is closed, and after the
scroll call add an assertion like len({p.id for p in points}) == expected_total
to ensure no duplicates/drops; update references in this test function
(test_concurrent_upsert_does_not_corrupt_local_collection) and keep the existing
scroll/expected_total logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 208aca99-7e6f-4c9d-9e70-2ff29123b1c4
📒 Files selected for processing (2)
qdrant_client/local/local_collection.pytests/test_in_memory.py
|
Hey @aurelienbran Local mode is not designed for the concurrent workloads. Trying to overcome this with |
Summary
Fixes #1193.
LocalCollectionheld no synchronization primitives, so two threads runningupsertconcurrently could interleave the mutations ofself.payload(Pythonlist.append) andself.deleted(numpynp.concatenate), leaving the parallel arrays at mismatched lengths. A subsequentscroll()/search()/retrieve()call then raised:The reproducer posted in #1193 (8 threads × 20 batches × 10 points) failed 2-3 trials out of 5 on both
qdrant-client==1.12.2andqdrant-client==1.17.1.Fix
threading.RLockcreated inLocalCollection.__init__. Reentrant because the public write methods call private mutators reachable from multiple entry points.with self._lock:wraps the full body of every public write entry point that mutates the per-point parallel arrays (payload,deleted,vectors,sparse_vectors,multivectors,ids,ids_inv,deleted_per_vector):upsertupdate_vectorsdelete_vectorsdeleteset_payloadoverwrite_payloaddelete_payloadclear_payloadRead paths (
search,scroll,retrieve, ...) are intentionally NOT locked. The bug is that writes corrupted the arrays into a permanently inconsistent state — serializing writes alone removes the corruption. Read calls that take a singlelen(...)at entry and then slice will observe a consistent snapshot. Leaving reads unlocked keeps the hot path lock-free.Test
Adds
test_concurrent_upsert_does_not_corrupt_local_collectionintests/test_in_memory.py, mirroring the issue reproducer (8 threads, 20 batches, 10 points = 1600 expected). Passes with the lock. When I monkey-patchedLocalCollection._lockto a no-op in a local script, 3/5 trials reproduced theoperands could not be broadcast togethererror exactly as the issue describes.Ran locally on
masterHEAD (cd5eb25, v1.17.1):Scope decisions
RLockto accommodate private mutators that are also reachable from other public methods (e.g. batch paths).Lockwould be faster but risks self-deadlock on internal calls.RLock.async_qdrant_local.py/AsyncQdrantLocal: the async wrapper dispatches to the syncLocalCollection, so this fix applies automatically whenasyncio.to_threadis used to run writes concurrently. No separate async lock needed.Happy to split the 8-method lock wrapping into a decorator or reshape the scope if preferred.