[V1.3.4] Durable Execution for CraftBot

## 1. Why

Issue #313 (one-time scheduled tasks delayed/duplicated after a restart) was the spark, but
the real subject is bigger: **CraftBot is already a hand-rolled durable-execution engine, and
this doc names its primitives and closes the gaps where durability is partial or emergent.**

We explored Temporal and parked it: we want the *paradigm* (durable execution — work that
survives crashes without loss or duplication), not the *operational tax* (running a Temporal
cluster). Everything here is **Temporal-shaped on the SQLite we already run**
(`app/usage/session_storage.py` → `sessions.db`, WAL, idempotent upserts). No new infra.

Four pains, all in scope:

1. **Restart durability** — produced-but-not-yet-executed triggers (scheduled, proactive,
   living-ui, resume) live only in an in-memory heap and are lost on restart.
2. **Too many ad-hoc producers** — ~13 sites build `Trigger`s with stringly-typed
   `payload["type"]` (several with none). No single model or entry API.
3. **Cost/latency of routing** — `TriggerQueue.put()` makes a synchronous LLM call for
   session routing on the hot path of every non-system trigger.
4. **Emergent-but-holey action idempotency** — irreversible side effects (send email) are
   protected from re-execution only by the LLM re-reading the event stream, with a real
   crash-window hole (§4.2). This is the most agent-critical gap and the truest Temporal
   alignment.

---

## 2. The frame: durable execution as three primitives

Temporal's value isn't "a durable queue." It's three things made crash-proof *together*:

| Temporal concept | CraftBot equivalent | Durable today? | Target |
|---|---|---|---|
| **Signals / timers** | **Triggers** | ❌ volatile heap | durable store (§6.2) |
| **Workflows** | **Tasks** (+ event stream) | ⚠️ partial (persist/restore exists) | formalize contract (§6.3) |
| **Activities** (side effects) | **Irreversible actions** | ⚠️ emergent only | idempotency keys (§6.4) |

### The divergence principle (state it explicitly)

Temporal reconstructs in-flight state by **deterministically replaying workflow code** against
an event history. CraftBot's "workflow body" is an **LLM** — non-deterministic by nature, so
true replay is impossible. CraftBot's durable-execution model is therefore deliberately
different:

> **Checkpoint observable state + make irreversible side effects idempotent — *not*
> deterministic replay.**

This is the correct model for an agent, and it sets the ceiling: we can guarantee "didn't lose
it" and "didn't do the irreversible thing twice," but not "reconstructed the exact reasoning."

---

## 3. Current architecture (as built)

### 3.1 The Trigger
`agent_core/core/trigger.py` — `@dataclass(order=True)`: `fire_at`, `priority`,
`next_action_description`, `payload` (untyped, `compare=False`), `session_id`,
`waiting_for_reply`. No id, no source type, no dedup key, no attempt count.

### 3.2 The queue
`agent_core/core/impl/trigger/queue.py` — in-memory `heapq` ordered by `(fire_at, priority)`.
`put()` optionally calls the **LLM** to route to a session, then merges/replaces same-session
triggers ("prefer newest"). **Volatile.** Built in `app/agent_base.py:242`.

### 3.3 Producers (inventory)
~13 sites; representative:

| Source | `payload["type"]` | Priority | Durable today? |
|---|---|---|---|
| User / integration message | *(none)* | 3 | message not persisted as a trigger |
| Scheduled (recurring + one-time) | `scheduled` | 20–40 | schedule persists; **trigger doesn't** |
| Memory processing | `memory_processing` | 50–60 | **no** |
| Proactive heartbeat / planner | `proactive_*` | 50 | **no** |
| Restart notice / resume | `restart_notice` / *(none)* | 1 / 5–7 | reconstructed from Tasks on boot |
| Task continuation | *(none)* | 5/7 | implied by Task |
| Living UI dev / crash / import | `living_ui_*` | 30/50 | **no** |
| Skill workflow / onboarding | *(none)* | 60 / 1 | **no** |

### 3.4 Consumer (single loop)
`app/ui_layer/controller/ui_controller.py:_consume_triggers()`:
`trigger = await agent.triggers.get(); await agent.react(trigger)`. `react()` classifies and
turns the trigger into Task creation/continuation.

### 3.5 Existing durability we build on
SQLite+WAL at `{APP_DATA_PATH}/.usage/sessions.db`; tables `active_tasks`, `event_streams`,
`event_records`, `conversation_history`; idempotent `INSERT … ON CONFLICT DO UPDATE`. Task
persist hooks (`on_task_persist` / `on_task_remove_persist`); boot restore
(`_restore_sessions` + `_schedule_restored_task_triggers`), filtering terminal and >24h tasks.

---

## 4. Failure modes today

### 4.1 Trigger layer
1. **Lost triggers** — anything produced but not yet a Task vanishes on restart.
2. **Duplicated triggers** — no idempotency; #313's one-time double-fire is one instance.
3. **Delay/re-anchoring** — fixed for scheduled tasks (Phase 0) but only locally.
4. **Untyped fan-in** — `react()`/`put()` branch on `payload["type"]` strings across files.
5. **LLM on the hot path** — routing cost/latency per message trigger.

### 4.2 Action layer — emergent idempotency and its hole

Idempotency for side effects is **not** explicit (no dedup key, verified). It is **emergent**
from event-stream replay, and it *works for the common case*:

1. Agent runs `send_email` → side effect happens.
2. `action_end` event recorded → event stream → persisted to `sessions.db`
   (`agent_core/core/impl/action/manager.py:385`, then `on_task_persist`).
3. On resume, `react()` reloads the stream; the **LLM reads "send_email completed"** and picks
   the *next* action instead of resending.

The ordering, exactly:
```
1. side effect runs        ← irreversible (email actually sent)
2. action_end recorded + persisted to sessions.db
```
**The hole:** a crash **between 1 and 2** leaves an email sent with no durable record of it. On
resume the LLM sees no completion → resends → **duplicate**. Two things make it sharper than a
normal queue:
- **The evidence is read by an LLM, not a checker.** If `action_start` persisted but
  `action_end` didn't, the model sees "Running send_email…" with no result and must *guess*
  whether it already sent — a coin-flip on an irreversible action.
- **Emergent ≠ guaranteed.** It depends on the model correctly inferring done-ness from prose.

Window is milliseconds and harmless for reversible actions (`read_file`, `web_search`), but
real and high-consequence for irreversible external actions (email, payment, post, outbound
message). This is the gap the activity layer (§6.4) closes.

---

## 5. Goals / non-goals

**Goals**
- A trigger that is accepted is **durably recorded before it can run**, and is **exactly-once
  w.r.t. execution** (at-least-once delivery + idempotent claim).
- One **typed** trigger model and one **producer API**; remove scattered `payload["type"]`.
- Crash/restart **rehydrates** pending + in-flight triggers; no loss, no dupes.
- Routing (incl. the LLM call) becomes **pluggable and off the hot path**.
- **Irreversible side effects are exactly-once** via idempotency keys carried to the provider.
- Clear layering: *when* (triggers) · *what/state* (tasks) · *did-it-happen* (activities).

**Non-goals**
- No external broker / Temporal / Redis. Reuse `sessions.db`.
- No change to the LLM reasoning loop or action-selection logic.
- Not distributed/multi-process — single consumer stays.
- **No deterministic replay** — explicitly out (§2): the workflow body is an LLM.

---

## 6. Proposed design — three primitives

### 6.1 Layering
```
Producers ─emit(TriggerSpec)─▶ TriggerService ─▶ triggers table (sessions.db)
                                     │                    │  rehydrate on boot
                                     ▼                    ▼
                                Router (pluggable) ─▶ in-memory TriggerQueue (heap)
                                                          │
                                                  consumer: get() → react()
                                                          │
                                              Task (workflow) advances ── runs ──▶ Action
                                                          │                          │
                                                          │              irreversible? ─▶ Activity ledger
                                                          │                          │  (key before, result after)
                                                          └── ack trigger ◀───────────┘
```

### 6.2 Primitive A — Durable triggers (signals/timers)  *(Phase 1, fixes #313 class)*

**TriggerService** is the single front door; producers call `service.emit(spec)` instead of
touching `TriggerQueue`. The queue **stays** as the in-memory ordering primitive but is fed
from the store, not from producers.

**Typed record** (stored form):
```
id           uuid — identity
source       enum: MESSAGE|SCHEDULED|IMMEDIATE|MEMORY|PROACTIVE|RESUME|LIVING_UI|
                   SKILL_WORKFLOW|ONBOARDING|SYSTEM        (replaces payload["type"])
dedup_key    idempotency, UNIQUE when present (nullable for one-off user messages)
fire_at      when due
not_before   retry backoff floor
priority     tiebreak
session_id   routing target (may be null → router decides)
description  next_action_description
payload      source-specific (still flexible; typed sub-objects over time)
status       PENDING | CLAIMED | DONE | FAILED | DEAD
attempts, lease_until, created_at, updated_at
```

**Idempotency keys** (the keystone; `INSERT … ON CONFLICT(dedup_key) DO NOTHING`):
- scheduled recurring: `scheduled:{schedule_id}:{fire_at_bucket}`
- scheduled one-time: `scheduled-once:{schedule_id}`
- resume: `resume:{task_id}` · proactive: `proactive:{frequency}:{slot}`
- living-ui: `living_ui:{project_id}:{job_kind}`

This retires the trigger-duplication class structurally; Phase 0's `run_count>0` skip becomes
one instance of it.

**Lifecycle** (at-least-once + idempotent claim):
```
PENDING ─claim()→ CLAIMED ─ack(done)→ DONE
   ▲                 │
   │ lease expires   └─ack(fail)→ attempts<max ? PENDING(backoff) : DEAD(dead-letter)
   └─────────────────
```
Single consumer ⇒ leases guard **crash recovery, not concurrency**. A **boot-time reclaim
scan** (`CLAIMED → PENDING` for orphans) is the core; a periodic sweep is optional polish.

**Routing off the hot path:** `emit()` returns after the durable INSERT, **no LLM call**.
Session routing becomes a `Router` strategy invoked by the queue feeder, only when needed
(multiple live sessions + routable source), and *after* the durable write so a crash mid-route
simply re-routes on retry. Decouples `TriggerQueue` from `llm`/prompt templates.

**Overdue handling generalized:** Phase 0's agent-judged catch-up note (fire it, but tell the
agent how late it is and to use judgment) moves **to the store layer** so *any* rehydrated
overdue trigger — not just scheduled — gets the same treatment.

### 6.3 Primitive B — Tasks-as-workflows  *(mostly formalizing what exists)*

Tasks already persist (`active_tasks`) and restore. This primitive is **naming + a contract**,
not a rewrite:
- A Task is the durable **workflow**: its state = `(todos, status, event stream)`, already in
  `sessions.db`.
- Define the resume contract explicitly: restore reads task + event stream, emits a `RESUME`
  trigger (now via `emit()` with `dedup_key=resume:{task_id}` — so double-boot can't
  double-resume), the LLM advances from the event history.
- No deterministic replay (§2) — resume = "reload checkpoint + let the LLM read the stream."

### 6.4 Primitive C — Idempotent activities  *(the Temporal headline; scoped, not universal)*

Closes §4.2. Applies **only to irreversible external actions** (email, payment, post, outbound
message) — flagged on the action definition (e.g. `irreversible=True`). Reversible actions are
untouched (wrapping `read_file` would be gold-plating).

For a flagged action:
1. **Mint a stable idempotency key** (`activity:{task_id}:{action_seq}` or content-hash) and
   record **intent** durably to the event stream/ledger **before** execution.
2. **Carry the key to the action**, and where the provider supports it, to the **provider**
   (email `Message-ID` / idempotency header, payment idempotency key) so a re-send is deduped
   **at the destination** — the only place exactly-once is actually enforceable.
3. Record **result** durably after.
4. On resume, the **key** answers "did this already run?" — a hard check, not LLM prose-reading.

This mirrors Temporal honestly: Temporal doesn't give free exactly-once either; it gives
at-least-once + an idempotency key you push to the external system. Same here.

---

## 7. Data model (additions to sessions.db)

```sql
CREATE TABLE IF NOT EXISTS triggers (
    id TEXT PRIMARY KEY, source TEXT NOT NULL, dedup_key TEXT UNIQUE,
    session_id TEXT, fire_at REAL NOT NULL, not_before REAL, priority INTEGER NOT NULL,
    description TEXT, payload_json TEXT NOT NULL, status TEXT NOT NULL,
    attempts INTEGER NOT NULL DEFAULT 0, lease_until REAL,
    created_at REAL NOT NULL, updated_at REAL NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_triggers_due ON triggers(status, fire_at);

-- Activity ledger for irreversible actions (§6.4)
CREATE TABLE IF NOT EXISTS activity_log (
    idem_key TEXT PRIMARY KEY,      -- the stable key, also pushed to provider when possible
    task_id TEXT NOT NULL, action TEXT NOT NULL,
    status TEXT NOT NULL,           -- INTENT | DONE | FAILED
    provider_ref TEXT,              -- e.g. email Message-ID returned by the provider
    created_at REAL NOT NULL, updated_at REAL NOT NULL
);
```
Both reuse the existing WAL connection, upsert idioms, and stale-GC TTL (DONE/INTENT rows GC'd
on the same 24h pattern — don't keep message-content payloads indefinitely).

---

## 8. Boot / restore integration

Extend `_restore_sessions()`:
1. Reclaim `CLAIMED` triggers past `lease_until` → `PENDING`; reclaim `INTENT` activities
   (decide per-action: re-attempt with same key, or surface for confirmation).
2. Load `PENDING` triggers → feed the queue (respecting `fire_at`/`not_before`); apply the
   overdue catch-up note (§6.2).
3. Task resume becomes a `RESUME`-trigger producer via `emit()` (dedup'd).

---

## 9. Phasing

- **Phase 0 (done):** scheduler-layer #313 fix — persist absolute `fire_at`, skip already-fired
  one-time tasks, remove-before-enqueue, agent-judged catch-up.
- **Phase 1 — Durable trigger store + idempotency (MVP):** `triggers` table + `TriggerStore`
  (claim/ack/reclaim) + `TriggerService.emit()` + boot rehydration. Migrate **scheduled** and
  **resume** first (natural dedup keys). *Retires the #313 class for all producers.*
- **Phase 2 — Typed model + single producer API:** `TriggerSource` enum; migrate remaining
  producers to `emit()`; delete scattered `payload["type"]` branching.
- **Phase 3 — Routing extraction:** `Router` behind the queue feeder; remove the LLM call from
  `put()`; routing post-durable-write and retry-safe.
- **Phase 4 — Idempotent activities (Temporal headline):** flag irreversible actions; intent +
  key before, result after; provider-level dedup; key-based resume check. *Independent of
  Phases 1–3 — can be pulled forward if the double-send risk bites first.*
- **Phase 5 — Lifecycle polish:** retries w/ backoff, dead-letter surfacing, DONE/INTENT TTL GC.

Each phase ships and is observable on its own.

---

## 10. Risks / open questions

- **Dedup-key & activity-key granularity is load-bearing** — wrong bucket either lets dupes
  through or suppresses legitimate re-fires. Needs a reviewed per-source / per-action key table.
- **Merge semantics** — today `put()` "prefers newest" by dropping queued triggers. With a
  store, "drop" must mean mark superseded `DONE`/`DEAD` (not silent delete) so they don't
  rehydrate.
- **Which actions are "irreversible"?** Mislabel → either gold-plated or unprotected. Start
  conservative (email/payment/post/outbound message) and require explicit opt-in per action.
- **Provider dedup support varies** — not every side effect has an idempotency primitive; where
  absent, the ledger reduces (not eliminates) the window, and reclaim should prefer
  *confirm-with-user* over blind re-attempt.
- **Write amplification** — every trigger + irreversible action now hits SQLite; low volume,
  WAL is cheap, but measure boot-rehydration + heartbeat cadence.
- **Lease scope** — single consumer ⇒ boot reclaim scan likely sufficient; periodic sweep
  optional.
- **Retention vs privacy** — payloads/ledger may hold message content; reuse the existing 24h
  TTL, consider stripping bodies on DONE.

---

## 11. TL;DR

CraftBot is already a durable-execution engine; this names its three primitives and closes the
gaps — **on the SQLite we already run, no Temporal cluster.**

- **Triggers = signals/timers** → make them a durable, typed, idempotent store with one
  `emit()` API and routing off the hot path. Phase 1 alone retires the #313 failure class for
  every producer.
- **Tasks = workflows** → formalize the persist/restore that already exists into an explicit
  resume contract.
- **Irreversible actions = activities** → today they're protected only by the LLM re-reading
  the event stream (emergent, with a real crash-window hole); add an idempotency key recorded
  before the side effect and pushed to the provider, so "did it already run?" is a hard check.

Principle, stated once: **checkpoint state + idempotent activities, not deterministic replay —
because the workflow body is an LLM.**


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[V1.3.4] Durable Execution for CraftBot #321

1. Why

2. The frame: durable execution as three primitives

The divergence principle (state it explicitly)

3. Current architecture (as built)

3.1 The Trigger

3.2 The queue

3.3 Producers (inventory)

3.4 Consumer (single loop)

3.5 Existing durability we build on

4. Failure modes today

4.1 Trigger layer

4.2 Action layer — emergent idempotency and its hole

5. Goals / non-goals

6. Proposed design — three primitives

6.1 Layering

6.2 Primitive A — Durable triggers (signals/timers) (Phase 1, fixes #313 class)

6.3 Primitive B — Tasks-as-workflows (mostly formalizing what exists)

6.4 Primitive C — Idempotent activities (the Temporal headline; scoped, not universal)

7. Data model (additions to sessions.db)

8. Boot / restore integration

9. Phasing

10. Risks / open questions

11. TL;DR

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Temporal concept	CraftBot equivalent	Durable today?	Target
Signals / timers	Triggers	❌ volatile heap	durable store (§6.2)
Workflows	Tasks (+ event stream)	⚠️ partial (persist/restore exists)	formalize contract (§6.3)
Activities (side effects)	Irreversible actions	⚠️ emergent only	idempotency keys (§6.4)

Source	`payload["type"]`	Priority	Durable today?
User / integration message	(none)	3	message not persisted as a trigger
Scheduled (recurring + one-time)	`scheduled`	20–40	schedule persists; trigger doesn't
Memory processing	`memory_processing`	50–60	no
Proactive heartbeat / planner	`proactive_*`	50	no
Restart notice / resume	`restart_notice` / (none)	1 / 5–7	reconstructed from Tasks on boot
Task continuation	(none)	5/7	implied by Task
Living UI dev / crash / import	`living_ui_*`	30/50	no
Skill workflow / onboarding	(none)	60 / 1	no

[V1.3.4] Durable Execution for CraftBot #321

Description

1. Why

2. The frame: durable execution as three primitives

The divergence principle (state it explicitly)

3. Current architecture (as built)

3.1 The Trigger

3.2 The queue

3.3 Producers (inventory)

3.4 Consumer (single loop)

3.5 Existing durability we build on

4. Failure modes today

4.1 Trigger layer

4.2 Action layer — emergent idempotency and its hole

5. Goals / non-goals

6. Proposed design — three primitives

6.1 Layering

6.2 Primitive A — Durable triggers (signals/timers) (Phase 1, fixes #313 class)

6.3 Primitive B — Tasks-as-workflows (mostly formalizing what exists)

6.4 Primitive C — Idempotent activities (the Temporal headline; scoped, not universal)

7. Data model (additions to sessions.db)

8. Boot / restore integration

9. Phasing

10. Risks / open questions

11. TL;DR

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions