fix garyx schedule_followup boundary fallback for deleted threads and dispatch retry #13

Merged
Binlogo merged 2 commits into Pyiner:main from Binlogo:feat/followup-fallback_0b0f50 on Jun 1, 2026

@Binlogo (Collaborator) commented Jun 1, 2026

Summary

schedule_followup schedules a cron InternalDispatch job that, on fire,
injects a synthetic user turn into the originating thread. The happy path
worked, but boundary cases failed silently, and a failed one-shot followup
re-fired on every tick indefinitely: advance() only runs on Success, so a
failed Once job kept enabled=true with next_run in the past, which is_due()
treats as due on every tick.

This adds explicit boundary fallback on the InternalDispatch trigger path.
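
To make the re-fire loop concrete, here is a minimal, self-contained sketch of the old and new post-run behavior for a one-shot job. It is a simplified model, not the gateway's code: `OnceJob`, `settle_old`, `count_fires`, and the integer tick clock are hypothetical, and only the names `JobRunStatus`, `FailedDropped`, `is_due`, and `settle_after_run` come from this PR. How a plain `Failed` run is settled is not spelled out here, so the sketch leaves it unchanged.

```rust
// Simplified model of a one-shot cron job (assumed shape, for illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobRunStatus {
    Success,
    Failed,
    FailedDropped,
}

struct OnceJob {
    enabled: bool,
    next_run: u64, // scheduled fire time, in abstract ticks
}

impl OnceJob {
    // Due whenever the job is enabled and its fire time is not in the future.
    fn is_due(&self, now: u64) -> bool {
        self.enabled && self.next_run <= now
    }

    // Pre-fix behavior: only Success retires the one-shot job.
    fn settle_old(&mut self, status: JobRunStatus) {
        if status == JobRunStatus::Success {
            self.enabled = false;
        }
    }

    // Post-fix sketch: FailedDropped is terminal as well.
    fn settle_after_run(&mut self, status: JobRunStatus) {
        match status {
            JobRunStatus::Success | JobRunStatus::FailedDropped => self.enabled = false,
            // Assumption: plain Failed keeps its previous handling.
            JobRunStatus::Failed => {}
        }
    }
}

// Counts how often the job fires over `ticks` ticks when every run ends in `status`.
fn count_fires(status: JobRunStatus, ticks: u64, use_new_settle: bool) -> u64 {
    let mut job = OnceJob { enabled: true, next_run: 0 };
    let mut fires = 0;
    for now in 0..ticks {
        if job.is_due(now) {
            fires += 1;
            if use_new_settle {
                job.settle_after_run(status);
            } else {
                job.settle_old(status);
            }
        }
    }
    fires
}
```

In this model, a run that keeps ending in `Failed` under `settle_old` fires on every tick, while a run classified as `FailedDropped` under `settle_after_run` fires exactly once.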

Scope

  • JobRunStatus::FailedDropped (serde → "failed_dropped"): a terminal drop,
    distinct from Failed.
  • Drop classification (FollowupAttemptError): thread deleted / missing
    thread_id / missing app_state are non-retryable drops; other dispatch
    errors are transient.
  • Bounded retry with exponential backoff (FOLLOWUP_MAX_RETRIES=3, base
    200ms → 200/400/800ms) for transient failures; exhausting the budget drops
    with the concrete error recorded in RunRecord.error.
  • CronJob::settle_after_run() unifies the run_now and tick post-run
    blocks (single source of truth) and makes FailedDropped terminal: one-shot
    jobs are disabled so a dropped followup never re-fires, and delete_after_run
    is honored like Success.
  • Every drop path emits tracing::warn; the existing cron_job_completed
    broadcast already carries status + reason for telemetry.
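
The drop classification and bounded retry described above can be sketched as follows. This is a synchronous approximation under stated assumptions: the variant names `Drop` and `Transient`, the `Outcome` type, `run_with_retry`, `FOLLOWUP_BASE_BACKOFF_MS`, and the injected `pause` callback are all hypothetical, and the real orchestrator most likely awaits an async sleep instead. It also assumes one initial attempt followed by up to `FOLLOWUP_MAX_RETRIES` retries, which matches the 200/400/800ms schedule.

```rust
use std::time::Duration;

const FOLLOWUP_MAX_RETRIES: u32 = 3;
const FOLLOWUP_BASE_BACKOFF_MS: u64 = 200; // hypothetical constant name

#[derive(Debug, Clone, PartialEq)]
enum FollowupAttemptError {
    // Non-retryable: thread deleted, missing thread_id, missing app_state.
    Drop(String),
    // Any other dispatch error; worth retrying.
    Transient(String),
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Success { attempts: u32 },
    // `error` stands in for the text recorded in RunRecord.error.
    FailedDropped { attempts: u32, error: String },
}

// Delay before retry number `retry_index` (0-based): 200, 400, 800 ms.
fn backoff_ms(retry_index: u32) -> u64 {
    FOLLOWUP_BASE_BACKOFF_MS << retry_index
}

// Runs `attempt` until it succeeds, hits a non-retryable error, or the retry
// budget is exhausted. `pause` is injected so callers and tests choose how to wait.
fn run_with_retry<A, S>(mut attempt: A, mut pause: S) -> Outcome
where
    A: FnMut() -> Result<(), FollowupAttemptError>,
    S: FnMut(Duration),
{
    let mut attempts: u32 = 0;
    loop {
        attempts += 1;
        match attempt() {
            Ok(()) => return Outcome::Success { attempts },
            Err(FollowupAttemptError::Drop(error)) => {
                // Terminal drop: no retry, no backoff.
                return Outcome::FailedDropped { attempts, error };
            }
            Err(FollowupAttemptError::Transient(error)) => {
                let retries_used = attempts - 1;
                if retries_used >= FOLLOWUP_MAX_RETRIES {
                    // Budget exhausted: drop with the last concrete error.
                    return Outcome::FailedDropped { attempts, error };
                }
                pause(Duration::from_millis(backoff_ms(retries_used)));
            }
        }
    }
}
```

A blocking caller could pass `std::thread::sleep` as `pause`; a test can pass a closure that records the requested durations, so the three cases in the test plan run without real waiting.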

dispatch_internal_message_to_thread is unchanged (it already returns
Result and already errors on thread-not-found), so the restart_wake /
task_notifications / tasks callers are untouched.

Notes / limitations

  • A "thread stopped / user cancelled" state has no dedicated signal at dispatch
    time, so a stopped-but-still-present thread will still receive the injected
    turn (which starts a fresh turn). The FollowupAttemptError classifier is
    extensible if such a signal is added later.
  • The retry backoff runs inline in the serial cron tick, so a retrying followup
    can delay other due jobs by up to ~1.4s (the 200 + 400 + 800ms backoff total).
    This is bounded and consistent with the pre-existing serial dispatch model.
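
The ~1.4s figure follows directly from the backoff schedule. A small sketch of the arithmetic, assuming the delay doubles from the 200ms base on each of the `FOLLOWUP_MAX_RETRIES` retries (`BASE_BACKOFF_MS` and both function names are hypothetical):

```rust
const FOLLOWUP_MAX_RETRIES: u32 = 3;
const BASE_BACKOFF_MS: u64 = 200; // hypothetical constant name

// Delay before each retry: the base doubled once per retry already taken.
fn backoff_schedule_ms() -> Vec<u64> {
    (0..FOLLOWUP_MAX_RETRIES)
        .map(|i| BASE_BACKOFF_MS << i)
        .collect()
}

// Total time spent sleeping if every retry is used.
fn worst_case_backoff_ms() -> u64 {
    backoff_schedule_ms().iter().sum()
}
```

This covers only the time spent sleeping; the time taken by the dispatch attempts themselves comes on top of it.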

Test plan

  • 3 unit tests on the retry orchestrator: drop-without-retry, retry-then-success,
    retry-exhausted (carries concrete error + correct attempt count).
  • 1 integration test: a deleted thread yields status=failed_dropped with a
    "thread not found" reason (asserts the serialized wire form too).
  • cargo test -p garyx-gateway --lib → 510 passed, 0 failed (existing
    schedule_followup happy-path regression intact).

@Binlogo Binlogo force-pushed the feat/followup-fallback_0b0f50 branch from 7a842fc to 41064cf Compare June 1, 2026 08:34
Binlogo added 2 commits June 1, 2026 16:44
… dispatch retry

schedule_followup schedules a cron InternalDispatch job that, on fire, injects
a synthetic user-turn into the originating thread. The happy path worked, but
boundary cases failed silently — and a failed one-shot followup re-fired every
tick forever (advance() only runs on Success, so a failed Once job kept
enabled=true with next_run in the past, which is_due() treats as due every tick).

This adds explicit boundary fallback on the InternalDispatch trigger path:

- New JobRunStatus::FailedDropped (serde -> "failed_dropped"): a terminal drop,
  distinct from Failed.
- Drop classification (FollowupAttemptError): thread deleted / missing
  thread_id / missing app_state are non-retryable drops; other dispatch errors
  are transient.
- Bounded retry with exponential backoff (FOLLOWUP_MAX_RETRIES=3, base 200ms)
  for transient failures; exhausting the budget drops with the concrete error
  recorded in RunRecord.error.
- CronJob::settle_after_run() unifies the run_now and tick post-run blocks and
  makes FailedDropped terminal: one-shot jobs are disabled so a dropped followup
  never re-fires, and delete_after_run is honored like Success.
- Every drop path emits tracing::warn; the existing cron_job_completed broadcast
  already carries status + reason for telemetry.

dispatch_internal_message_to_thread is unchanged (it already returns Result and
already errors on thread-not-found), so the restart_wake / task_notifications /
tasks callers are untouched.

Tests: 3 unit tests on the retry orchestrator (drop-no-retry, retry-then-success,
retry-exhausted) + 1 integration test asserting a deleted thread yields
status=failed_dropped with a "thread not found" reason. cargo test
-p garyx-gateway --lib green (510 passed).
@Binlogo Binlogo force-pushed the feat/followup-fallback_0b0f50 branch from c892e63 to a28a264 Compare June 1, 2026 08:47
@Binlogo Binlogo merged commit 0e2a0c5 into Pyiner:main Jun 1, 2026
1 check passed
@Binlogo Binlogo deleted the feat/followup-fallback_0b0f50 branch June 1, 2026 08:57