fix(llm-access-store): serialize duckdb tiered append against retention by acking-you · Pull Request #18 · acking-you/static_flow

acking-you · 2026-05-31T17:47:03Z

What

Fixes the cross-thread write race in the DuckDB tiered usage store surfaced during PR #16's review (gemini flagged the shape; this is the verified, corrected mechanism). Follow-up to the duckdb split (#16), branched from clean master so it edits the just-moved append.rs/retention.rs without conflict.

The bug

The tiered repo's append and retention paths run on two separate OS threads of the usage worker (llm-access-usage-worker spawns the import loop and the maintenance loop as independent current_thread runtimes), both holding Arc::clone of the same DuckDbUsageRepository.

append_usage_events_to_tiered takes the active PersistentUsageWriter out of the state, drops the std::sync::Mutex (its guard can't be held across .await), and awaits the insert. During that window a retention cycle can lock the state, and rollover_expired_active_segment / discard_expired_active_segment checkpoint + delete/roll the active segment and advance active_path. The append then restores its now-stale writer → active_path and active_writer diverge: subsequent current (non-expired) usage events are written through the stale writer into a deleted/rolled segment and silently lost (the divergence persists until a config change forces a writer reopen).
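The interleaving described above can be replayed deterministically in a few lines. This is only a sketch: `Writer`, `State`, and the helper functions are hypothetical stand-ins, and only the field names `active_path` and `active_writer` are taken from the description. It runs the three steps in order on one thread instead of racing two.

```rust
use std::sync::Mutex;

// Hypothetical stand-ins; only active_path / active_writer are named in the PR text.
struct Writer {
    path: String, // the segment file this writer was opened against
}

struct State {
    active_path: String,
    active_writer: Option<Writer>,
}

/// Append, step 1: take the writer out so the std guard is not held across the await.
fn take_writer(state: &Mutex<State>) -> Writer {
    state.lock().unwrap().active_writer.take().expect("active writer present")
}

/// Retention winning the state lock while the append is still awaiting its insert.
fn retention_rollover(state: &Mutex<State>, next_path: &str) {
    let mut guard = state.lock().unwrap();
    guard.active_path = next_path.to_string();
    guard.active_writer = None; // the old segment is rolled or deleted
}

/// Append, step 3: put the writer back unconditionally (the buggy part).
fn restore_writer(state: &Mutex<State>, writer: Writer) {
    state.lock().unwrap().active_writer = Some(writer);
}

/// Replays take -> rollover -> restore and reports whether path and writer disagree.
fn replay_race() -> bool {
    let state = Mutex::new(State {
        active_path: "segment-0001".to_string(),
        active_writer: Some(Writer { path: "segment-0001".to_string() }),
    });

    let writer = take_writer(&state); // append starts, state lock released
    retention_rollover(&state, "segment-0002"); // retention runs in the window
    restore_writer(&state, writer); // append finishes with a stale writer

    let guard = state.lock().unwrap();
    let diverged = guard.active_writer.as_ref().unwrap().path != guard.active_path;
    diverged
}

fn main() {
    println!("diverged: {}", replay_race()); // prints "diverged: true"
}
```

After the replay the state points at `segment-0002` while the restored writer still targets `segment-0001`, which is the divergence the description refers to.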

Verified the enabling fact with a probe (not assumed): two RW Connection::open to the same DuckDB file in one process both succeed (shared DatabaseInstance keyed by canonical path), so retention's checkpoint is not lock-rejected while an append is in flight — the race is real, not self-healing.

Severity (calibrated honestly): the precondition is narrow — the active segment's newest committed row must be older than the retention cutoff while it still receives appends (old-timestamp backfill / journal replay, or a very low-traffic node whose active segment aged out). But the data lost is current, billing-relevant usage, so it's worth fixing rather than leaving latent.

The fix

A per-tiered-repo async write gate (tokio::sync::Mutex<()> on TieredDuckDbUsageState) serializing the append write path against retention's active-segment rollover/discard:

append holds the gate for its whole body;
retention holds it only around rollover_expired_active_segment (the sole active-segment mutator) — archived/detail cleanup stays ungated so it never stalls appends.

Lock order is gate-before-std in both paths (the brief clone-lock is released before awaiting the gate → gate never awaited while holding std), so no inversion/deadlock. Reads/queries take only the std mutex and stay fully concurrent. Production cost: one Arc clone + an uncontended async-mutex acquire per append; in normal operation there's a single import loop so append-vs-append never contends, and append-vs-retention only contends during the hourly rollover window.
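The gate-before-std ordering can be sketched with the standard library alone. Assumptions to note: the PR's gate is a `tokio::sync::Mutex<()>` held across `.await`; here a `std::sync::Mutex<()>` and OS threads stand in for it so the sketch has no dependencies, and `Repo`, `State`, `committed`, and `lost` are hypothetical names used only for illustration.

```rust
use std::sync::Mutex;
use std::thread;

struct State {
    active_path: String,
    committed: u64, // rows that landed in the segment that was active at insert time
    lost: u64,      // rows whose segment was rolled while the insert was in flight
}

struct Repo {
    write_gate: Mutex<()>, // stand-in for the PR's tokio::sync::Mutex<()>
    state: Mutex<State>,
}

impl Repo {
    fn append(&self, rows: u64) {
        // Gate first, held for the whole body.
        let _gate = self.write_gate.lock().unwrap();
        // Brief state lock; the temporary guard is dropped at the end of this statement.
        let path = self.state.lock().unwrap().active_path.clone();
        // The real code awaits the insert here with the state lock released.
        thread::yield_now();
        let mut st = self.state.lock().unwrap();
        if st.active_path == path {
            st.committed += rows;
        } else {
            st.lost += rows;
        }
    }

    fn rollover(&self, next_path: &str) {
        // Same order as append: gate, then state.
        let _gate = self.write_gate.lock().unwrap();
        self.state.lock().unwrap().active_path = next_path.to_string();
    }
}

/// Runs appends and rollovers concurrently and returns (committed, lost).
fn run_demo() -> (u64, u64) {
    let repo = Repo {
        write_gate: Mutex::new(()),
        state: Mutex::new(State {
            active_path: "segment-0".to_string(),
            committed: 0,
            lost: 0,
        }),
    };
    thread::scope(|s| {
        s.spawn(|| {
            for _ in 0..1000 {
                repo.append(1);
            }
        });
        s.spawn(|| {
            for i in 1..=100 {
                repo.rollover(&format!("segment-{i}"));
            }
        });
    });
    let st = repo.state.lock().unwrap();
    (st.committed, st.lost)
}

fn main() {
    let (committed, lost) = run_demo();
    println!("committed={committed} lost={lost}");
}
```

Because both paths take the gate before the state lock and neither waits for the gate while holding the state lock, the two threads cannot deadlock, and no rollover can land between an append's path read and its commit.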

Considered and rejected a lighter "generation-check" (don't restore the writer if active_path changed): it fixes only the persistent divergence, not the in-flight loss of a current event being inserted into a segment retention discards mid-await (the segment's committed end_ms can be fully expired while a current row is in flight). Only serialization closes that.
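A small model shows why the generation check alone is not enough. Everything here is hypothetical: `Store`, the `generation` counter, and the helper names are illustrative and do not come from the PR; segments are modelled as in-memory vectors keyed by generation, and a writer is identified by the generation it was opened for.

```rust
use std::collections::HashMap;

struct Store {
    generation: u64,                           // bumped on every rollover/discard
    segments: HashMap<u64, Vec<&'static str>>, // live segments keyed by generation
    active_writer: Option<u64>,                // writer tagged with its generation
}

impl Store {
    fn new() -> Self {
        let mut segments = HashMap::new();
        segments.insert(0, Vec::new());
        Store { generation: 0, segments, active_writer: Some(0) }
    }

    /// Retention discards the expired active segment and starts a fresh one.
    fn discard_active(&mut self) {
        self.segments.remove(&self.generation);
        self.generation += 1;
        self.segments.insert(self.generation, Vec::new());
        self.active_writer = Some(self.generation);
    }

    /// The in-flight insert lands in the writer's own segment, if it still exists.
    fn insert_via(&mut self, writer: u64, row: &'static str) -> bool {
        match self.segments.get_mut(&writer) {
            Some(rows) => {
                rows.push(row);
                true
            }
            None => false, // segment was discarded mid-flight: the row is gone
        }
    }

    /// The rejected "generation check": only restore a writer that is still current.
    fn restore_if_current(&mut self, writer: u64) {
        if writer == self.generation {
            self.active_writer = Some(writer);
        }
    }
}

/// Returns (diverged, row_survived) for take -> discard -> insert -> guarded restore.
fn replay_with_generation_check() -> (bool, bool) {
    let mut store = Store::new();
    let writer = store.active_writer.take().unwrap(); // append takes the writer
    store.discard_active(); // retention runs during the await
    let row_survived = store.insert_via(writer, "current-event");
    store.restore_if_current(writer);
    let diverged = store.active_writer != Some(store.generation);
    (diverged, row_survived)
}

fn main() {
    let (diverged, row_survived) = replay_with_generation_check();
    println!("diverged={diverged} row_survived={row_survived}");
}
```

In this model the check prevents the stale writer from being reinstalled, so writer and generation stay in sync, yet the row that was in flight during the discard is still lost.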

Test

A deterministic #[cfg(test)] seam parks an in-flight append while it holds the gate; the test asserts that a concurrent retention call times out (blocked by the gate), then proceeds once the append is released (gate freed, no deadlock). Verified it has teeth: with the gate removed, it fails on exactly the "retention must block on the write gate" assertion.
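The shape of such a seam test can be sketched with threads and channels. This is an assumption-laden stand-in, not the PR's test: the real seam is async and uses a tokio timeout, whereas here `Seam`, `append_with_seam`, and `retention_rollover` are hypothetical names, the gate is a `std::sync::Mutex<()>`, and `recv_timeout` plays the role of the timeout.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

struct Repo {
    write_gate: Mutex<()>, // stand-in for the PR's async write gate
}

// Hypothetical seam: tells the test the append is parked, then waits for release.
struct Seam {
    parked: mpsc::Sender<()>,
    release: mpsc::Receiver<()>,
}

impl Repo {
    fn append_with_seam(&self, seam: &Seam) {
        let _gate = self.write_gate.lock().unwrap();
        seam.parked.send(()).unwrap(); // gate is held from here on
        seam.release.recv().unwrap(); // park until the test releases us
    }

    fn retention_rollover(&self) {
        let _gate = self.write_gate.lock().unwrap();
        // The active-segment rollover would happen here.
    }
}

/// Returns (retention was blocked while the append was parked, retention finished after release).
fn retention_blocks_until_append_finishes() -> (bool, bool) {
    let repo = Arc::new(Repo { write_gate: Mutex::new(()) });
    let (parked_tx, parked_rx) = mpsc::channel::<()>();
    let (release_tx, release_rx) = mpsc::channel::<()>();
    let (done_tx, done_rx) = mpsc::channel::<()>();

    let r = Arc::clone(&repo);
    let appender = thread::spawn(move || {
        let seam = Seam { parked: parked_tx, release: release_rx };
        r.append_with_seam(&seam);
    });
    parked_rx.recv().unwrap(); // the append now holds the gate

    let r = Arc::clone(&repo);
    let retention = thread::spawn(move || {
        r.retention_rollover();
        done_tx.send(()).unwrap();
    });

    // Retention must not finish while the append is parked.
    let blocked = done_rx.recv_timeout(Duration::from_millis(100)).is_err();
    release_tx.send(()).unwrap();
    // Once the append is released, retention must get the gate and finish.
    let completed = done_rx.recv_timeout(Duration::from_secs(5)).is_ok();

    appender.join().unwrap();
    retention.join().unwrap();
    (blocked, completed)
}

fn main() {
    let (blocked, completed) = retention_blocks_until_append_finishes();
    println!("blocked={blocked} completed={completed}");
}
```

Removing the gate acquisition from `retention_rollover` makes the first check fail, which mirrors the "has teeth" verification described above.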

Verification

cargo clippy -p llm-access-store --all-targets -- -D warnings → clean
cargo test -p llm-access-store → 68 passed (67 prior + the new seam test)
cargo build -p llm-access (reverse dep) → ok
rustfmt on the 5 changed files only; deps/lance/deps/lancedb untouched

🤖 Generated with Claude Code

The tiered usage repo's append and retention paths run on two separate OS threads of the usage worker, both holding `Arc::clone` of the same repo. `append_usage_events_to_tiered` takes the active `PersistentUsageWriter` out of the state, drops the `std::sync::Mutex` (can't hold its guard across `.await`), and awaits the insert. During that window a retention cycle can lock the state, and `rollover_expired_active_segment` / `discard_expired_active_segment` can checkpoint + delete / roll the active segment and advance `active_path`. The append then restores its now-stale writer, so `active_path` and `active_writer` diverge: subsequent current (non-expired) usage events are written through the stale writer into a deleted/rolled segment and silently lost (the divergence persists until a config change forces a writer reopen).

Verified the enabling fact with a probe: two RW `Connection::open` to the same DuckDB file in one process both succeed (shared DatabaseInstance), so retention's checkpoint is NOT lock-rejected while an append is in flight; the race is real, not self-healing. Precondition is narrow (the active segment's newest committed row must be older than the retention cutoff while it still receives appends, i.e. old-timestamp backfill / journal replay, or a very low-traffic node whose active segment aged out), but the data lost is current, billing-relevant usage, so it's worth fixing.

Fix: a per-tiered-repo async write gate (`tokio::sync::Mutex`) serializing the append write path against retention's active-segment rollover/discard. Append holds the gate for its whole body; retention holds it only around `rollover_expired_active_segment` (the sole active-segment mutator) so the archived/detail cleanup never stalls appends. Lock order is gate-before-std in both paths (the brief clone-lock is released before awaiting the gate), so no inversion/deadlock; reads/queries take only the std mutex and stay concurrent.

Test: a deterministic `#[cfg(test)]` seam parks an in-flight append while it holds the gate; the test asserts that a concurrent retention call times out (blocked), then proceeds once released. Verified it has teeth (fails on exactly the "retention must block" assertion when the gate is removed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

acking-you · 2026-05-31T17:47:28Z

/gemini review

acking-you · 2026-05-31T17:47:30Z

@codex review

gemini-code-assist

Code Review

This pull request introduces a write_gate mutex to serialize the append write path against active-segment rollover and discard during retention cycles in the DuckDB usage store. This prevents in-flight writers from being orphaned onto deleted or rolled segments. A test-only hook (append_seam) and a corresponding integration test have been added to verify this serialization behavior. There are no review comments, and I have no feedback to provide.

gemini-code-assist

Code Review

This pull request introduces a synchronization mechanism using a write_gate (an Arc<tokio::sync::Mutex<()>>) to serialize the append write path against active-segment rollover and discard during retention cycles. This prevents in-flight writers from being orphaned onto deleted or rolled segments, avoiding data loss. Additionally, a test-only append_seam hook and a corresponding integration test have been added to verify this serialization behavior. I have no feedback to provide as there are no review comments.

chatgpt-codex-connector · 2026-05-31T17:49:57Z

Codex Review: Didn't find any major issues. Swish!

gemini-code-assist Bot reviewed May 31, 2026

acking-you merged commit 5a10c52 into master May 31, 2026
3 checks passed

acking-you mentioned this pull request May 31, 2026

fix(llm-access): block DNS-rebinding SSRF in kiro remote-media fetch #21

Merged

acking-you merged 1 commit into master from fix/duckdb-tiered-write-race
