fix: drop transaction from station_battle flush to relieve DB contention #375
Closed
jfberry wants to merge 1 commit into
flushStationBattleBatch was the only write-behind flush wrapped in a transaction (every other flush is a single autocommit upsert). Holding the batch upsert's and the obsolete-prune DELETE's InnoDB locks together until commit caused concurrent overlapping batches to block each other (lock-waits / 1213 deadlocks) and pin connections on the shared GeneralDb pool, starving the synchronous station reads (loadStationFromDatabase / hydrate) that run under the station entity lock with the 5s GMO context. This surfaced as a flood of "getOrCreateStationRecord: context deadline exceeded".

Run the upsert (matching the other flushes) and the prune DELETE as separate autocommit statements, upsert-first. Each releases its locks immediately; the brief gap is over-inclusive only and self-healing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Probably not needed
Problem

Production on main is flooding with "getOrCreateStationRecord: context deadline exceeded".

flushStationBattleBatch (added in #353) is the only write-behind flush wrapped in a transaction; every other flush (flushPokestopBatch, flushGymBatch, flushStationBatch, …) is a single autocommit INSERT … ON DUPLICATE KEY UPDATE. Wrapping the batch upsert and the obsolete-prune DELETE … WHERE station_id IN (…) AND bread_battle_seed NOT IN (…) in BeginTxx … Commit holds both statements' InnoDB locks (incl. next-key/gap locks on ix_station_battle_station_end) until commit.

Under concurrent overlapping batches (up to 50 write-behind workers × 50-station batches), those transactions block each other (lock-waits up to innodb_lock_wait_timeout, or 1213 deadlock-retries) and pin connections on the shared GeneralDb pool (PokemonDb and GeneralDb are the same handle). That starves the synchronous station reads (loadStationFromDatabase + the new battle hydrate), which run under the station entity lock with only the GMO's 5s context, surfacing as the "context deadline exceeded" flood.

This is not a Go mutex deadlock (the battle cache is a lock-free xsync.MapOf; the enqueue is non-blocking); it's DB-level lock/connection contention.

Fix
Drop the transaction. Run the upsert as a single autocommit NamedExecContext (matching every other flush) and the prune DELETE as a separate autocommit ExecContext, upsert-first. Each statement releases its locks immediately, the cross-statement deadlock shape is gone, and the pool connection is returned between statements.

No query text changed; only the BeginTxx/Commit/Rollback wrapper is removed.

Consistency tradeoff (acceptable, self-healing)
Losing atomicity between the two statements means:
battle_end > now, and the upsert runs first); the next hydrate corrects it.

Test Plan
go build -tags go_json ./..., go vet ./..., and go test ./... are all green (incl. all station-battle tests; buildDeleteObsoleteStationBattlesQuery query-building unchanged).
"getOrCreateStationRecord: context deadline exceeded" flood stops under load. The contention relief is operational and can only be confirmed against a live DB. If it persists, the dominant factor may be raw write volume rather than lock-hold time (next levers: separate read pool, larger max_pool, move battle hydrate out from under the entity lock).

🤖 Generated with Claude Code
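For reviewers, here is a minimal sketch of the post-change flush shape described under Fix: upsert first, then prune, each issued directly on the pool handle so it runs in autocommit mode. This is not the code from the diff. The names flushBattlesAutocommit, execer, recordingDB, and runDemo are hypothetical, the query strings are elided placeholders rather than the real statements, and the sketch uses plain ExecContext for both statements where the actual change uses sqlx's NamedExecContext for the upsert.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"strings"
)

// execer is the subset of *sql.DB this sketch needs. Calling ExecContext on
// the pool handle (not on a Tx) runs each statement in autocommit mode.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// flushBattlesAutocommit is a hypothetical stand-in for the reworked flush:
// two independent statements, upsert first, with no surrounding transaction.
func flushBattlesAutocommit(ctx context.Context, db execer, upsertSQL, pruneSQL string) error {
	if _, err := db.ExecContext(ctx, upsertSQL); err != nil {
		return fmt.Errorf("upsert station battles: %w", err)
	}
	// The upsert has already committed at this point, so its locks are
	// released before the DELETE acquires its own.
	if _, err := db.ExecContext(ctx, pruneSQL); err != nil {
		return fmt.Errorf("prune obsolete station battles: %w", err)
	}
	return nil
}

// recordingDB is a test double: it records the leading SQL verb of each
// statement instead of talking to MySQL, and can simulate a failure.
type recordingDB struct {
	seen   []string
	failOn string
}

func (r *recordingDB) ExecContext(_ context.Context, query string, _ ...any) (sql.Result, error) {
	verb := ""
	if fields := strings.Fields(query); len(fields) > 0 {
		verb = fields[0]
	}
	r.seen = append(r.seen, verb)
	if verb == r.failOn {
		return nil, errors.New("simulated failure")
	}
	return nil, nil
}

// runDemo runs the flush against the test double and returns the statement
// order that was observed. The query strings are placeholders only.
func runDemo(failOn string) ([]string, error) {
	db := &recordingDB{failOn: failOn}
	err := flushBattlesAutocommit(context.Background(), db,
		"INSERT ... ON DUPLICATE KEY UPDATE ...", // elided placeholder
		"DELETE ... WHERE station_id IN (...) AND bread_battle_seed NOT IN (...)", // elided placeholder
	)
	return db.seen, err
}

func main() {
	fmt.Println(runDemo(""))
	fmt.Println(runDemo("INSERT"))
}
```

In this sketch a failed upsert returns before the prune is attempted, so a failed batch never deletes rows. Whether the real flush handles errors the same way is an assumption, not something this PR description states.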