Skip to content

[V1.3.4] Durable Execution for CraftBot #321

@ahmad-ajmal

Description

@ahmad-ajmal

1. Why

Issue #313 (one-time scheduled tasks delayed/duplicated after a restart) was the spark, but
the real subject is bigger: CraftBot is already a hand-rolled durable-execution engine, and
this doc names its primitives and closes the gaps where durability is partial or emergent.

We explored Temporal and parked it: we want the paradigm (durable execution — work that
survives crashes without loss or duplication), not the operational tax (running a Temporal
cluster). Everything here is Temporal-shaped on the SQLite we already run
(app/usage/session_storage.pysessions.db, WAL, idempotent upserts). No new infra.

Four pains, all in scope:

  1. Restart durability — produced-but-not-yet-executed triggers (scheduled, proactive,
    living-ui, resume) live only in an in-memory heap and are lost on restart.
  2. Too many ad-hoc producers — ~13 sites build Triggers with stringly-typed
    payload["type"] (several with none). No single model or entry API.
  3. Cost/latency of routingTriggerQueue.put() makes a synchronous LLM call for
    session routing on the hot path of every non-system trigger.
  4. Emergent-but-holey action idempotency — irreversible side effects (send email) are
    protected from re-execution only by the LLM re-reading the event stream, with a real
    crash-window hole (§4.2). This is the most agent-critical gap and the truest Temporal
    alignment.

2. The frame: durable execution as three primitives

Temporal's value isn't "a durable queue." It's three things made crash-proof together:

Temporal concept CraftBot equivalent Durable today? Target
Signals / timers Triggers ❌ volatile heap durable store (§6.2)
Workflows Tasks (+ event stream) ⚠️ partial (persist/restore exists) formalize contract (§6.3)
Activities (side effects) Irreversible actions ⚠️ emergent only idempotency keys (§6.4)

The divergence principle (state it explicitly)

Temporal reconstructs in-flight state by deterministically replaying workflow code against
an event history. CraftBot's "workflow body" is an LLM — non-deterministic by nature, so
true replay is impossible. CraftBot's durable-execution model is therefore deliberately
different:

Checkpoint observable state + make irreversible side effects idempotent — not
deterministic replay.

This is the correct model for an agent, and it sets the ceiling: we can guarantee "didn't lose
it" and "didn't do the irreversible thing twice," but not "reconstructed the exact reasoning."


3. Current architecture (as built)

3.1 The Trigger

agent_core/core/trigger.py@dataclass(order=True): fire_at, priority,
next_action_description, payload (untyped, compare=False), session_id,
waiting_for_reply. No id, no source type, no dedup key, no attempt count.

3.2 The queue

agent_core/core/impl/trigger/queue.py — in-memory heapq ordered by (fire_at, priority).
put() optionally calls the LLM to route to a session, then merges/replaces same-session
triggers ("prefer newest"). Volatile. Built in app/agent_base.py:242.

3.3 Producers (inventory)

~13 sites; representative:

Source payload["type"] Priority Durable today?
User / integration message (none) 3 message not persisted as a trigger
Scheduled (recurring + one-time) scheduled 20–40 schedule persists; trigger doesn't
Memory processing memory_processing 50–60 no
Proactive heartbeat / planner proactive_* 50 no
Restart notice / resume restart_notice / (none) 1 / 5–7 reconstructed from Tasks on boot
Task continuation (none) 5/7 implied by Task
Living UI dev / crash / import living_ui_* 30/50 no
Skill workflow / onboarding (none) 60 / 1 no

3.4 Consumer (single loop)

app/ui_layer/controller/ui_controller.py:_consume_triggers():
trigger = await agent.triggers.get(); await agent.react(trigger). react() classifies and
turns the trigger into Task creation/continuation.

3.5 Existing durability we build on

SQLite+WAL at {APP_DATA_PATH}/.usage/sessions.db; tables active_tasks, event_streams,
event_records, conversation_history; idempotent INSERT … ON CONFLICT DO UPDATE. Task
persist hooks (on_task_persist / on_task_remove_persist); boot restore
(_restore_sessions + _schedule_restored_task_triggers), filtering terminal and >24h tasks.


4. Failure modes today

4.1 Trigger layer

  1. Lost triggers — anything produced but not yet a Task vanishes on restart.
  2. Duplicated triggers — no idempotency; [V.1.3.3] Scheduled tasks are delayed and duplicated after restart of CraftBot #313's one-time double-fire is one instance.
  3. Delay/re-anchoring — fixed for scheduled tasks (Phase 0) but only locally.
  4. Untyped fan-inreact()/put() branch on payload["type"] strings across files.
  5. LLM on the hot path — routing cost/latency per message trigger.

4.2 Action layer — emergent idempotency and its hole

Idempotency for side effects is not explicit (no dedup key, verified). It is emergent
from event-stream replay, and it works for the common case:

  1. Agent runs send_email → side effect happens.
  2. action_end event recorded → event stream → persisted to sessions.db
    (agent_core/core/impl/action/manager.py:385, then on_task_persist).
  3. On resume, react() reloads the stream; the LLM reads "send_email completed" and picks
    the next action instead of resending.

The ordering, exactly:

1. side effect runs        ← irreversible (email actually sent)
2. action_end recorded + persisted to sessions.db

The hole: a crash between 1 and 2 leaves an email sent with no durable record of it. On
resume the LLM sees no completion → resends → duplicate. Two things make it sharper than a
normal queue:

  • The evidence is read by an LLM, not a checker. If action_start persisted but
    action_end didn't, the model sees "Running send_email…" with no result and must guess
    whether it already sent — a coin-flip on an irreversible action.
  • Emergent ≠ guaranteed. It depends on the model correctly inferring done-ness from prose.

Window is milliseconds and harmless for reversible actions (read_file, web_search), but
real and high-consequence for irreversible external actions (email, payment, post, outbound
message). This is the gap the activity layer (§6.4) closes.


5. Goals / non-goals

Goals

  • A trigger that is accepted is durably recorded before it can run, and is exactly-once
    w.r.t. execution
    (at-least-once delivery + idempotent claim).
  • One typed trigger model and one producer API; remove scattered payload["type"].
  • Crash/restart rehydrates pending + in-flight triggers; no loss, no dupes.
  • Routing (incl. the LLM call) becomes pluggable and off the hot path.
  • Irreversible side effects are exactly-once via idempotency keys carried to the provider.
  • Clear layering: when (triggers) · what/state (tasks) · did-it-happen (activities).

Non-goals

  • No external broker / Temporal / Redis. Reuse sessions.db.
  • No change to the LLM reasoning loop or action-selection logic.
  • Not distributed/multi-process — single consumer stays.
  • No deterministic replay — explicitly out (§2): the workflow body is an LLM.

6. Proposed design — three primitives

6.1 Layering

Producers ─emit(TriggerSpec)─▶ TriggerService ─▶ triggers table (sessions.db)
                                     │                    │  rehydrate on boot
                                     ▼                    ▼
                                Router (pluggable) ─▶ in-memory TriggerQueue (heap)
                                                          │
                                                  consumer: get() → react()
                                                          │
                                              Task (workflow) advances ── runs ──▶ Action
                                                          │                          │
                                                          │              irreversible? ─▶ Activity ledger
                                                          │                          │  (key before, result after)
                                                          └── ack trigger ◀───────────┘

6.2 Primitive A — Durable triggers (signals/timers) (Phase 1, fixes #313 class)

TriggerService is the single front door; producers call service.emit(spec) instead of
touching TriggerQueue. The queue stays as the in-memory ordering primitive but is fed
from the store, not from producers.

Typed record (stored form):

id           uuid — identity
source       enum: MESSAGE|SCHEDULED|IMMEDIATE|MEMORY|PROACTIVE|RESUME|LIVING_UI|
                   SKILL_WORKFLOW|ONBOARDING|SYSTEM        (replaces payload["type"])
dedup_key    idempotency, UNIQUE when present (nullable for one-off user messages)
fire_at      when due
not_before   retry backoff floor
priority     tiebreak
session_id   routing target (may be null → router decides)
description  next_action_description
payload      source-specific (still flexible; typed sub-objects over time)
status       PENDING | CLAIMED | DONE | FAILED | DEAD
attempts, lease_until, created_at, updated_at

Idempotency keys (the keystone; INSERT … ON CONFLICT(dedup_key) DO NOTHING):

  • scheduled recurring: scheduled:{schedule_id}:{fire_at_bucket}
  • scheduled one-time: scheduled-once:{schedule_id}
  • resume: resume:{task_id} · proactive: proactive:{frequency}:{slot}
  • living-ui: living_ui:{project_id}:{job_kind}

This retires the trigger-duplication class structurally; Phase 0's run_count>0 skip becomes
one instance of it.

Lifecycle (at-least-once + idempotent claim):

PENDING ─claim()→ CLAIMED ─ack(done)→ DONE
   ▲                 │
   │ lease expires   └─ack(fail)→ attempts<max ? PENDING(backoff) : DEAD(dead-letter)
   └─────────────────

Single consumer ⇒ leases guard crash recovery, not concurrency. A boot-time reclaim
scan
(CLAIMED → PENDING for orphans) is the core; a periodic sweep is optional polish.

Routing off the hot path: emit() returns after the durable INSERT, no LLM call.
Session routing becomes a Router strategy invoked by the queue feeder, only when needed
(multiple live sessions + routable source), and after the durable write so a crash mid-route
simply re-routes on retry. Decouples TriggerQueue from llm/prompt templates.

Overdue handling generalized: Phase 0's agent-judged catch-up note (fire it, but tell the
agent how late it is and to use judgment) moves to the store layer so any rehydrated
overdue trigger — not just scheduled — gets the same treatment.

6.3 Primitive B — Tasks-as-workflows (mostly formalizing what exists)

Tasks already persist (active_tasks) and restore. This primitive is naming + a contract,
not a rewrite:

  • A Task is the durable workflow: its state = (todos, status, event stream), already in
    sessions.db.
  • Define the resume contract explicitly: restore reads task + event stream, emits a RESUME
    trigger (now via emit() with dedup_key=resume:{task_id} — so double-boot can't
    double-resume), the LLM advances from the event history.
  • No deterministic replay (§2) — resume = "reload checkpoint + let the LLM read the stream."

6.4 Primitive C — Idempotent activities (the Temporal headline; scoped, not universal)

Closes §4.2. Applies only to irreversible external actions (email, payment, post, outbound
message) — flagged on the action definition (e.g. irreversible=True). Reversible actions are
untouched (wrapping read_file would be gold-plating).

For a flagged action:

  1. Mint a stable idempotency key (activity:{task_id}:{action_seq} or content-hash) and
    record intent durably to the event stream/ledger before execution.
  2. Carry the key to the action, and where the provider supports it, to the provider
    (email Message-ID / idempotency header, payment idempotency key) so a re-send is deduped
    at the destination — the only place exactly-once is actually enforceable.
  3. Record result durably after.
  4. On resume, the key answers "did this already run?" — a hard check, not LLM prose-reading.

This mirrors Temporal honestly: Temporal doesn't give free exactly-once either; it gives
at-least-once + an idempotency key you push to the external system. Same here.


7. Data model (additions to sessions.db)

CREATE TABLE IF NOT EXISTS triggers (
    id TEXT PRIMARY KEY, source TEXT NOT NULL, dedup_key TEXT UNIQUE,
    session_id TEXT, fire_at REAL NOT NULL, not_before REAL, priority INTEGER NOT NULL,
    description TEXT, payload_json TEXT NOT NULL, status TEXT NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0, lease_until REAL,
    created_at REAL NOT NULL, updated_at REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_triggers_due ON triggers(status, fire_at);

-- Activity ledger for irreversible actions (§6.4)
CREATE TABLE IF NOT EXISTS activity_log (
    idem_key TEXT PRIMARY KEY,      -- the stable key, also pushed to provider when possible
    task_id TEXT NOT NULL, action TEXT NOT NULL,
    status TEXT NOT NULL,           -- INTENT | DONE | FAILED
    provider_ref TEXT,              -- e.g. email Message-ID returned by the provider
    created_at REAL NOT NULL, updated_at REAL NOT NULL
);

Both reuse the existing WAL connection, upsert idioms, and stale-GC TTL (DONE/INTENT rows GC'd
on the same 24h pattern — don't keep message-content payloads indefinitely).


8. Boot / restore integration

Extend _restore_sessions():

  1. Reclaim CLAIMED triggers past lease_untilPENDING; reclaim INTENT activities
    (decide per-action: re-attempt with same key, or surface for confirmation).
  2. Load PENDING triggers → feed the queue (respecting fire_at/not_before); apply the
    overdue catch-up note (§6.2).
  3. Task resume becomes a RESUME-trigger producer via emit() (dedup'd).

9. Phasing

  • Phase 0 (done): scheduler-layer [V.1.3.3] Scheduled tasks are delayed and duplicated after restart of CraftBot #313 fix — persist absolute fire_at, skip already-fired
    one-time tasks, remove-before-enqueue, agent-judged catch-up.
  • Phase 1 — Durable trigger store + idempotency (MVP): triggers table + TriggerStore
    (claim/ack/reclaim) + TriggerService.emit() + boot rehydration. Migrate scheduled and
    resume first (natural dedup keys). Retires the [V.1.3.3] Scheduled tasks are delayed and duplicated after restart of CraftBot #313 class for all producers.
  • Phase 2 — Typed model + single producer API: TriggerSource enum; migrate remaining
    producers to emit(); delete scattered payload["type"] branching.
  • Phase 3 — Routing extraction: Router behind the queue feeder; remove the LLM call from
    put(); routing post-durable-write and retry-safe.
  • Phase 4 — Idempotent activities (Temporal headline): flag irreversible actions; intent +
    key before, result after; provider-level dedup; key-based resume check. Independent of
    Phases 1–3 — can be pulled forward if the double-send risk bites first.
  • Phase 5 — Lifecycle polish: retries w/ backoff, dead-letter surfacing, DONE/INTENT TTL GC.

Each phase ships and is observable on its own.


10. Risks / open questions

  • Dedup-key & activity-key granularity is load-bearing — wrong bucket either lets dupes
    through or suppresses legitimate re-fires. Needs a reviewed per-source / per-action key table.
  • Merge semantics — today put() "prefers newest" by dropping queued triggers. With a
    store, "drop" must mean mark superseded DONE/DEAD (not silent delete) so they don't
    rehydrate.
  • Which actions are "irreversible"? Mislabel → either gold-plated or unprotected. Start
    conservative (email/payment/post/outbound message) and require explicit opt-in per action.
  • Provider dedup support varies — not every side effect has an idempotency primitive; where
    absent, the ledger reduces (not eliminates) the window, and reclaim should prefer
    confirm-with-user over blind re-attempt.
  • Write amplification — every trigger + irreversible action now hits SQLite; low volume,
    WAL is cheap, but measure boot-rehydration + heartbeat cadence.
  • Lease scope — single consumer ⇒ boot reclaim scan likely sufficient; periodic sweep
    optional.
  • Retention vs privacy — payloads/ledger may hold message content; reuse the existing 24h
    TTL, consider stripping bodies on DONE.

11. TL;DR

CraftBot is already a durable-execution engine; this names its three primitives and closes the
gaps — on the SQLite we already run, no Temporal cluster.

  • Triggers = signals/timers → make them a durable, typed, idempotent store with one
    emit() API and routing off the hot path. Phase 1 alone retires the [V.1.3.3] Scheduled tasks are delayed and duplicated after restart of CraftBot #313 failure class for
    every producer.
  • Tasks = workflows → formalize the persist/restore that already exists into an explicit
    resume contract.
  • Irreversible actions = activities → today they're protected only by the LLM re-reading
    the event stream (emergent, with a real crash-window hole); add an idempotency key recorded
    before the side effect and pushed to the provider, so "did it already run?" is a hard check.

Principle, stated once: checkpoint state + idempotent activities, not deterministic replay —
because the workflow body is an LLM.

Metadata

Metadata

Assignees

Labels

ImprovementOptimization and improvement over existing featurebugSomething isn't workingpriority: medium

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions