fix garyx schedule_followup boundary fallback for deleted threads and dispatch retry#13
Merged
Merged
Conversation
7a842fc to
41064cf
Compare
… dispatch retry schedule_followup schedules a cron InternalDispatch job that, on fire, injects a synthetic user-turn into the originating thread. The happy path worked, but boundary cases failed silently — and a failed one-shot followup re-fired every tick forever (advance() only runs on Success, so a failed Once job kept enabled=true with next_run in the past, which is_due() treats as due every tick). This adds explicit boundary fallback on the InternalDispatch trigger path: - New JobRunStatus::FailedDropped (serde -> "failed_dropped"): a terminal drop, distinct from Failed. - Drop classification (FollowupAttemptError): thread deleted / missing thread_id / missing app_state are non-retryable drops; other dispatch errors are transient. - Bounded retry with exponential backoff (FOLLOWUP_MAX_RETRIES=3, base 200ms) for transient failures; exhausting the budget drops with the concrete error recorded in RunRecord.error. - CronJob::settle_after_run() unifies the run_now and tick post-run blocks and makes FailedDropped terminal: one-shot jobs are disabled so a dropped followup never re-fires, and delete_after_run is honored like Success. - Every drop path emits tracing::warn; the existing cron_job_completed broadcast already carries status + reason for telemetry. dispatch_internal_message_to_thread is unchanged (it already returns Result and already errors on thread-not-found), so the restart_wake / task_notifications / tasks callers are untouched. Tests: 3 unit tests on the retry orchestrator (drop-no-retry, retry-then-success, retry-exhausted) + 1 integration test asserting a deleted thread yields status=failed_dropped with a "thread not found" reason. cargo test -p garyx-gateway --lib green (510 passed).
c892e63 to
a28a264
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
schedule_followupschedules a cronInternalDispatchjob that, on fire,injects a synthetic user-turn into the originating thread. The happy path
worked, but boundary cases failed silently — and a failed one-shot followup
re-fired every tick forever (
advance()only runs onSuccess, so a failedOncejob keptenabled=truewithnext_runin the past, whichis_due()treats as due every tick).
This adds explicit boundary fallback on the
InternalDispatchtrigger path.Scope
JobRunStatus::FailedDropped(serde →"failed_dropped"): a terminal drop,distinct from
Failed.FollowupAttemptError): thread deleted / missingthread_id/ missingapp_stateare non-retryable drops; other dispatcherrors are transient.
FOLLOWUP_MAX_RETRIES=3, base200ms → 200/400/800ms) for transient failures; exhausting the budget drops
with the concrete error recorded in
RunRecord.error.CronJob::settle_after_run()unifies therun_nowandtickpost-runblocks (single source of truth) and makes
FailedDroppedterminal: one-shotjobs are disabled so a dropped followup never re-fires, and
delete_after_runis honored like
Success.tracing::warn; the existingcron_job_completedbroadcast already carries
status+ reason for telemetry.dispatch_internal_message_to_threadis unchanged (it already returnsResultand already errors on thread-not-found), so therestart_wake/task_notifications/taskscallers are untouched.Notes / limitations
time, so a stopped-but-still-present thread will still receive the injected
turn (which starts a fresh turn). The
FollowupAttemptErrorclassifier isextensible if such a signal is added later.
can delay other due jobs by ≤~1.4s — consistent with the pre-existing serial
dispatch model and bounded.
Test plan
retry-exhausted (carries concrete error + correct attempt count).
status=failed_droppedwith a"thread not found"reason (asserts the serialized wire form too).cargo test -p garyx-gateway --lib→ 510 passed, 0 failed (existingschedule_followup happy-path regression intact).