fix: drop transaction from station_battle flush to relieve DB contention #375

Closed
jfberry wants to merge 1 commit into main from fix/station-battle-flush-contention

jfberry commented Jun 7, 2026

Collaborator

Problem

Production on main is flooded with getOrCreateStationRecord: context deadline exceeded errors.

flushStationBattleBatch (added in #353) is the only write-behind flush wrapped in a transaction — every other flush (flushPokestopBatch, flushGymBatch, flushStationBatch, …) is a single autocommit INSERT … ON DUPLICATE KEY UPDATE. Wrapping the batch upsert and the obsolete-prune DELETE … WHERE station_id IN (…) AND bread_battle_seed NOT IN (…) in BeginTxx … Commit holds both statements' InnoDB locks (incl. next-key/gap locks on ix_station_battle_station_end) until commit.

Under concurrent overlapping batches (up to 50 write-behind workers × 50-station batches), those transactions block each other, either waiting on locks for up to innodb_lock_wait_timeout or retrying after error 1213 deadlocks, and pin connections on the shared GeneralDb pool (PokemonDb and GeneralDb are the same handle). That starves the synchronous station reads (loadStationFromDatabase plus the new battle hydrate), which run under the station entity lock with only the GMO's 5s context, and the starvation surfaces as the context deadline exceeded flood.

This is not a Go mutex deadlock (the battle cache is a lock-free xsync.MapOf; the enqueue is non-blocking) — it's DB-level lock/connection contention.

Fix

Drop the transaction. Run the upsert as a single autocommit NamedExecContext (matching every other flush) and the prune DELETE as a separate autocommit ExecContext, upsert-first. Each statement releases its locks immediately, the cross-statement deadlock shape is gone, and the pool connection is returned between statements.

No query text changed — only the BeginTxx/Commit/Rollback wrapper is removed.

Consistency tradeoff (acceptable, self-healing)

Losing atomicity between the two statements means:

  • A reader in the sub-ms gap can see the new battles plus a few about-to-be-pruned superseded ones — over-inclusive only (the reader filters battle_end > now, and the upsert runs first); the next hydrate corrects it.
  • A crash between the statements leaves a few obsolete rows, removed by the next flush or by expiry.

Test Plan

  • go build -tags go_json ./..., go vet ./..., go test ./... — all green (incl. all station-battle tests; buildDeleteObsoleteStationBattlesQuery query-building unchanged).
  • Deploy and confirm the getOrCreateStationRecord: context deadline exceeded flood stops under load — the contention relief is operational and can only be confirmed against a live DB. If it persists, the dominant factor may be raw write volume rather than lock-hold time (next levers: separate read pool, larger max_pool, move battle hydrate out from under the entity lock).

🤖 Generated with Claude Code

flushStationBattleBatch was the only write-behind flush wrapped in a
transaction (every other flush is a single autocommit upsert). Holding the
batch upsert's and the obsolete-prune DELETE's InnoDB locks together until
commit caused concurrent overlapping batches to block each other (lock-waits
/ 1213 deadlocks) and pin connections on the shared GeneralDb pool, starving
the synchronous station reads (loadStationFromDatabase / hydrate) that run
under the station entity lock with the 5s GMO context — surfacing as a flood
of "getOrCreateStationRecord: context deadline exceeded".

Run the upsert (matching the other flushes) and the prune DELETE as separate
autocommit statements, upsert-first. Each releases its locks immediately; the
brief gap is over-inclusive only and self-healing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jfberry commented Jun 11, 2026

Collaborator Author

Probably not needed

jfberry closed this Jun 11, 2026