Skip to content

Derive in-memory index membership from the data map#1906

Merged
brharrington merged 1 commit into
Netflix:mainfrom
brharrington:fix-memdb-cleanup-revive-race
Jun 4, 2026
Merged

Derive in-memory index membership from the data map#1906
brharrington merged 1 commit into
Netflix:mainfrom
brharrington:fix-memdb-cleanup-revive-race

Conversation

@brharrington

Copy link
Copy Markdown
Contributor

The metadata index could retain a series no longer present in the data map ("orphaned series"): a tag query returned it long after its data had aged past the retention window. It happened when a concurrent update revived a block store in the small window between rebuild()'s emptiness check and removing it from the data map -- the store stayed reachable from the index (via its BlockStoreItem) but was dropped from data, so cleanup could never revisit it.

Fix the root cause -- two independent sources of liveness -- by making the data map authoritative. data now holds BlockStoreItem, and rebuildIndex() cleans up rolled-out/filtered stores and rebuilds the index from the surviving entries in a single pass, so a store removed during cleanup cannot linger in the index. The pending-queue feed (index.update) is no longer used by MemoryDatabase.

To avoid losing the datapoint from the residual sub-instruction race, update()/rollup() call ensurePresent() after writing: a lock-free get that re-inserts the store only if it was concurrently removed, adding no per-datapoint locking. hasData is made volatile so the rebuild thread reliably observes ingest-side revivals now that membership depends on it.

Adds a deterministic regression test for the revive-during-cleanup race and a test for the rebuild() throttle.

The metadata index could retain a series no longer present in the data
map ("orphaned series"): a tag query returned it long after its data had
aged past the retention window. It happened when a concurrent update
revived a block store in the small window between rebuild()'s emptiness
check and removing it from the data map -- the store stayed reachable
from the index (via its BlockStoreItem) but was dropped from data, so
cleanup could never revisit it.

Fix the root cause -- two independent sources of liveness -- by making
the data map authoritative. data now holds BlockStoreItem, and
rebuildIndex() cleans up rolled-out/filtered stores and rebuilds the
index from the surviving entries in a single pass, so a store removed
during cleanup cannot linger in the index. The pending-queue feed
(index.update) is no longer used by MemoryDatabase.

To avoid losing the datapoint from the residual sub-instruction race,
update()/rollup() call ensurePresent() after writing: a lock-free get
that re-inserts the store only if it was concurrently removed, adding no
per-datapoint locking. hasData is made volatile so the rebuild thread
reliably observes ingest-side revivals now that membership depends on it.

Adds a deterministic regression test for the revive-during-cleanup race
and a test for the rebuild() throttle.
@brharrington brharrington added this to the 1.9.0 milestone Jun 4, 2026
@brharrington brharrington merged commit 7f768fc into Netflix:main Jun 4, 2026
5 checks passed
@brharrington brharrington deleted the fix-memdb-cleanup-revive-race branch June 4, 2026 17:11
brharrington added a commit that referenced this pull request Jun 4, 2026
Deriving index membership from the data map (#1906) stopped update()
from feeding the index pending queue, so the synchronous data->index
sync now happens only in rebuildIndex(). Callers that write via
update() and previously forced a sync with index.rebuildIndex() no
longer see their data: the no-arg BatchUpdateTagIndex.rebuildIndex()
just drains an empty pending queue.

Expose the unthrottled MemoryDatabase.rebuildIndex() so callers can
force a synchronous rebuild after a bulk load or in tests. rebuild()
remains the throttled variant used by the background thread.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant