fix(repository): replace peek_result panic with InfrastructureError #326
SAY-5 wants to merge 1 commit
Conversation
peek_result panicked via unreachable!() when the database returned a row whose status is 'done' but whose result_success and result_error columns (and the merged-with task's columns) are all NULL. The incident in aixigo#307 shows this state does occur in production (the operator's database had ten such rows after a deployment failure), and the unreachable!() reduced PREvant to a panicking task fetcher.
Surface the corrupt row as an AppsError::InfrastructureError that identifies the affected status_id so the caller can return a 500 and recover, leaving the row in the database for an operator to inspect or repair instead of taking the whole API down.
Closes aixigo#307.
Signed-off-by: SAY-5 <say.apm35@gmail.com>
Thanks for the PR. However, this change might fix only a symptom that shows up when peeking into the database. The function Your change made me wonder whether there is an issue in the following places (which I already described in the issue description):
Please notice the Would you mind investigating this further? You could use the test
Agreed that this only catches the symptom; the rows shouldn't exist in the first place. I can extend the test
Reproduce it first and then let's see. BTW, I pushed bdee094 yesterday, which fixes the race condition and adds more test cases. See if you can reproduce the issue, and then let's discuss a solution.
Sounds good, I'll pull bdee094, run the new test cases, and try to reproduce the empty-row scenario before proposing a solution. Will report back with findings. |
Summary
Closes #307.
peek_result (api/src/apps/repository.rs) panics via unreachable!() when the database returns a row whose status = 'done' but whose result_success and result_error columns (and the merged-with task's columns) are all NULL. The incident in #307 shows this state does occur in production: the operator's database had ten such rows.
The unreachable!() therefore reduced PREvant to a panicking task fetcher whenever a corrupt row was peeked.
Fix
Replace the unreachable arm with Err(AppsError::InfrastructureError { error: ... }) so the caller can return a 500, the operator can inspect/repair the row, and the rest of the system keeps running.
The error variant is already exported and is the closest semantic match for "the task ran but its outcome wasn't persisted".
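The replacement described above can be sketched in isolation. This is a minimal illustration, not the code from api/src/apps/repository.rs: TaskRow, Peeked, and peek_row are hypothetical names, AppsError is a local stand-in that models only the variant named in this PR, the type of status_id is assumed to be a string, and the merged-with task's columns are left out for brevity.

```rust
use std::fmt;

/// Local stand-in for the crate's error type (assumption: only the
/// variant named in this PR is modelled here).
#[derive(Debug, PartialEq)]
pub enum AppsError {
    InfrastructureError { error: String },
}

impl fmt::Display for AppsError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppsError::InfrastructureError { error } => {
                write!(f, "infrastructure error: {error}")
            }
        }
    }
}

/// Hypothetical, simplified view of one task row.
pub struct TaskRow {
    pub status_id: String,
    pub status: String,
    pub result_success: Option<String>,
    pub result_error: Option<String>,
}

/// Hypothetical outcome of peeking at a row.
#[derive(Debug, PartialEq)]
pub enum Peeked {
    Pending,
    Success(String),
    Failure(String),
}

pub fn peek_row(row: &TaskRow) -> Result<Peeked, AppsError> {
    if row.status != "done" {
        return Ok(Peeked::Pending);
    }
    match (&row.result_success, &row.result_error) {
        (Some(ok), _) => Ok(Peeked::Success(ok.clone())),
        (None, Some(err)) => Ok(Peeked::Failure(err.clone())),
        // This arm used to be unreachable!(). A 'done' row with no result
        // is now reported as an error that names the affected row.
        (None, None) => Err(AppsError::InfrastructureError {
            error: format!(
                "task {} is marked done but has no persisted result",
                row.status_id
            ),
        }),
    }
}
```

Returning a Result keeps the decision with the caller: it can answer the request with an error and move on, and the row stays in the database untouched.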
Test plan
cargo build -p prevant, passes
app_task row) that the API now returns 500 with the new error message instead of crashing the worker.
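The behaviour the test plan checks, a 500 for the corrupt row while other requests are still served, could be modelled roughly as follows. This is a sketch under assumptions: http_status and serve_all are hypothetical helpers, AppsError is again a local stand-in, and the status-code mapping is inferred only from the PR's statement that the caller can return a 500.

```rust
/// Local stand-in for the error type named in this PR.
#[derive(Debug)]
pub enum AppsError {
    InfrastructureError { error: String },
}

/// Maps a peek outcome to an HTTP status code (assumed mapping).
pub fn http_status(outcome: &Result<String, AppsError>) -> u16 {
    match outcome {
        Ok(_) => 200,
        Err(AppsError::InfrastructureError { .. }) => 500,
    }
}

/// Answers several requests in turn. A corrupt row produces a 500 for
/// that request only; the requests after it are still answered.
pub fn serve_all(outcomes: &[Result<String, AppsError>]) -> Vec<u16> {
    outcomes.iter().map(http_status).collect()
}
```

With the earlier unreachable!() the second request in such a sequence would have panicked instead of producing a status code.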