From c440481e564cbfbf9766c6f4839d0607ba216e5e Mon Sep 17 00:00:00 2001
From: Tyler Pate <tyler.graham.pate@gmail.com>
Date: Fri, 5 Jun 2026 20:53:27 -0700
Subject: [PATCH 1/3] tools: add outbound retry policy, DLQ, and per-host rate
 limits (#7, #8, #9)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements three tightly related outbound tool resilience features:

**#7 — Per-tool retry policy**
- `model.ToolRetryConfig` (max_attempts, base_delay, max_delay,
  honor_retry_after) wired into `model.ToolDefinition.Retry`
- `isTransient` classifies 429/5xx/network as transient; other 4xx permanent
- Exponential backoff with 10% jitter; `Retry-After` header honoured when set
- Zero `ToolRetryConfig` = single attempt (backward-compatible; no retry)
- `retry:` block added to `schemas/skill-1.schema.yaml`
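
For reviewers, the retry arithmetic above can be sketched as follows. This is a
minimal illustration, not the code in `internal/tool/executor.go`: the names
`isTransientStatus` and `backoff` are hypothetical, the jitter is assumed to be
symmetric (+/-10%) and applied after the `max_delay` cap, and the status check
uses the coarse "429 or any 5xx" rule from this message.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// isTransientStatus is a hypothetical stand-in for the patch's isTransient:
// 429 and 5xx are retryable, every other status is treated as permanent.
func isTransientStatus(code int) bool {
	return code == http.StatusTooManyRequests || (code >= 500 && code <= 599)
}

// backoff returns the delay before retry number `retry` (1-based).
// The delay starts at base, doubles per retry, is capped at max, and then
// receives up to +/-10% jitter. unit is a random value in [0, 1).
func backoff(base, max time.Duration, retry int, unit float64) time.Duration {
	d := base
	for i := 1; i < retry && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	jitter := (unit*2 - 1) * 0.10 * float64(d) // assumption: symmetric jitter
	return d + time.Duration(jitter)
}

func main() {
	for retry := 1; retry <= 6; retry++ {
		fmt.Println(retry, backoff(time.Second, 30*time.Second, retry, rand.Float64()))
	}
	fmt.Println(isTransientStatus(503), isTransientStatus(404))
}
```

Passing the random value in as `unit` keeps the function deterministic under
test; a caller would normally supply `rand.Float64()`.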

**#8 — Outbound DLQ and requeue**
- Exhausted/permanent tool failures enqueued to `outbound-dlq` when
  `Executor.QueueMgr != nil`; `QueueItem.ToolName`/`ToolTarget` carry context
- New `leather dlq inspect` / `leather dlq requeue` CLI subcommands
- `<item-id>` must be last (after all flags) due to Go flag.FlagSet behaviour
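
The flag-ordering constraint is plain standard-library behaviour and can be
reproduced in isolation. The sketch below uses a bare `flag.FlagSet` with one
flag named after `--state-dir`; `parseRequeue` is an illustrative helper, not a
function from this patch.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseRequeue mimics the argument shape of `leather dlq requeue` with a plain
// flag.FlagSet. It returns the parsed --state-dir value and the positional args.
func parseRequeue(args []string) (stateDir string, positional []string, err error) {
	fs := flag.NewFlagSet("dlq requeue", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the demo quiet on parse errors
	fs.StringVar(&stateDir, "state-dir", "", "state directory")
	err = fs.Parse(args)
	return stateDir, fs.Args(), err
}

func main() {
	// Flags first, item-id last: --state-dir is parsed as a flag.
	dir, rest, _ := parseRequeue([]string{"--state-dir", "/tmp/s", "odlq_1"})
	fmt.Printf("%q %q\n", dir, rest)

	// item-id first: parsing stops at it, so --state-dir stays unparsed
	// and ends up among the positional arguments.
	dir, rest, _ = parseRequeue([]string{"odlq_1", "--state-dir", "/tmp/s"})
	fmt.Printf("%q %q\n", dir, rest)
}
```

In the second call no error is reported; the flag simply keeps its default,
which is why the misordering is easy to miss.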

**#9 — Per-host rate limits and metrics**
- `internal/tool/ratelimit.go`: stdlib token-bucket `HostLimiter`; nil-safe
  `Wait`; rate spec format `"N/s"`, `"N/m"`, `"N/h"`
- `config.yaml tools.rate_limits:` parsed into `model.Config.ToolRateLimits`
- Package-level atomic counters (`retry_total`, `backoff_total`,
  `rate_limit_wait_total`) exposed via `tool.MetricSnapshot()`
- `/metrics` gains `leather_tool_retry_total`, `leather_tool_backoff_total`,
  `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth`
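
As a rough guide to the rate spec format, the sketch below turns `"N/s"`,
`"N/m"` and `"N/h"` into the interval between tokens. `parseRateSpec` is a
hypothetical name and the validation rules (positive integer `N`, no
whitespace trimming) are assumptions; the parser in
`internal/tool/ratelimit.go` may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRateSpec converts "N/s", "N/m" or "N/h" into the time between tokens,
// e.g. "60/m" means one token per second.
func parseRateSpec(spec string) (time.Duration, error) {
	num, unit, ok := strings.Cut(spec, "/")
	if !ok {
		return 0, fmt.Errorf("rate spec %q: want N/s, N/m or N/h", spec)
	}
	n, err := strconv.Atoi(num)
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("rate spec %q: N must be a positive integer", spec)
	}
	var per time.Duration
	switch unit {
	case "s":
		per = time.Second
	case "m":
		per = time.Minute
	case "h":
		per = time.Hour
	default:
		return 0, fmt.Errorf("rate spec %q: unknown unit %q", spec, unit)
	}
	return per / time.Duration(n), nil
}

func main() {
	for _, spec := range []string{"60/m", "10/s", "5/x"} {
		d, err := parseRateSpec(spec)
		fmt.Println(spec, d, err)
	}
}
```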

All new tests pass under `-race`; `make ci` green.
---
 .subagents/AGENTS-OBSERVABILITY.md         |   4 +
 .subagents/AGENTS-QUALITY.md               |  15 +
 .subagents/AGENTS-RUNTIME.md               |  58 ++-
 .subagents/AGENTS-SERVE.md                 |  31 ++
 .subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md |  23 ++
 .subagents/AGENTS-WORKER.md                |  29 ++
 internal/cli/cli.go                        |   2 +
 internal/cli/cmd_dlq.go                    | 189 ++++++++++
 internal/cli/cmd_dlq_test.go               | 188 ++++++++++
 internal/cli/cmd_run.go                    |  10 +
 internal/cli/cmd_serve.go                  |  27 ++
 internal/cli/help.go                       |   1 +
 internal/config/config.go                  |  70 ++++
 internal/model/model.go                    |  29 ++
 internal/runner/runner.go                  |   5 +-
 internal/tool/executor.go                  | 340 ++++++++++++++++--
 internal/tool/executor_test.go             | 390 ++++++++++++++++++---
 internal/tool/ratelimit.go                 | 142 ++++++++
 internal/tool/ratelimit_test.go            | 178 ++++++++++
 schemas/skill-1.schema.yaml                |  34 ++
 20 files changed, 1682 insertions(+), 83 deletions(-)
 create mode 100644 internal/cli/cmd_dlq.go
 create mode 100644 internal/cli/cmd_dlq_test.go
 create mode 100644 internal/tool/ratelimit.go
 create mode 100644 internal/tool/ratelimit_test.go
diff --git a/.subagents/AGENTS-OBSERVABILITY.md b/.subagents/AGENTS-OBSERVABILITY.md
index 1d27bfc..45ee29c 100644
--- a/.subagents/AGENTS-OBSERVABILITY.md
+++ b/.subagents/AGENTS-OBSERVABILITY.md
@@ -176,6 +176,10 @@ metric requires a dashboard update note in the PR description.
 | `leather_cache_hits_total` | counter | `kind` | Response-cache hits. |
 | `leather_cache_misses_total` | counter | `kind` | Response-cache misses. |
 | `leather_build_info` | gauge=1 | `version`, `commit` | Build metadata. |
+| `leather_tool_retry_total` | counter | _(none)_ | Tool call attempts beyond the first; tells the operator how often transient failures occur across all tools. |
+| `leather_tool_backoff_total` | counter | _(none)_ | Times a backoff sleep was applied (retry-after or exponential); indicates rate-limiting pressure from upstream services. |
+| `leather_tool_rate_limit_wait_total` | counter | _(none)_ | Times a tool call waited for a per-host token-bucket token; nonzero means the configured rate limits are actively throttling traffic. |
+| `leather_outbound_dlq_depth` | gauge | _(none)_ | Current item count in `outbound-dlq`; nonzero means tool failures need operator attention (`leather dlq inspect`). |
 
 Rules:
 
diff --git a/.subagents/AGENTS-QUALITY.md b/.subagents/AGENTS-QUALITY.md
index 5a398bb..49e39cb 100644
--- a/.subagents/AGENTS-QUALITY.md
+++ b/.subagents/AGENTS-QUALITY.md
@@ -271,6 +271,21 @@ Before opening a PR:
 - [ ] No real credentials or secrets in `testdata/`
 - [ ] CI workflow action refs are SHA-pinned with version comments
 
+### Outbound tool resilience (retry, DLQ, rate limits)
+
+PRs touching `internal/tool`, `internal/config`, or `internal/cli/cmd_dlq.go`:
+
+- [ ] `go test ./internal/tool/... ./internal/cli/...` passes
+- [ ] `go test -race ./internal/tool/... ./internal/cli/...` passes
+- [ ] New tools with `retry:` config have tests covering transient retry and permanent no-retry paths
+- [ ] `TestExecute_DLQEnqueueOnExhaustion` and `TestExecute_DLQEnqueueOnPermanent` pass
+- [ ] `TestHostLimiter_*` suite passes; no real network calls
+- [ ] `TestRunDLQ*` suite passes with `t.TempDir()` state dirs
+- [ ] `leather dlq inspect` output includes ID, tool name, agent, attempt, error
+- [ ] `leather dlq requeue --state-dir ... <item-id>` (item-id **last**) moves item
+- [ ] `tools.rate_limits` in config.yaml parses without error; bad spec is warn+disable, not panic
+- [ ] `/metrics` response contains `leather_tool_retry_total`, `leather_tool_backoff_total`, `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth`
+
 ---
 
 _Last reviewed: 2026-06-05_ 
diff --git a/.subagents/AGENTS-RUNTIME.md b/.subagents/AGENTS-RUNTIME.md
index b5392f1..992dcc2 100644
--- a/.subagents/AGENTS-RUNTIME.md
+++ b/.subagents/AGENTS-RUNTIME.md
@@ -116,13 +116,60 @@ func (r *Registry) GetTools(skillNames []string) []model.ToolDefinition
 func (r *Registry) ResolveTools(skillNames, toolsetNames, toolNames []string) []model.ToolDefinition
 
 // Executor dispatches HTTP- and MCP-backed tools.
-type Executor struct { MCP *mcp.Registry }
+// QueueMgr enables the outbound DLQ: permanent/exhausted failures are
+// enqueued to "outbound-dlq" when this field is non-nil.
+// Limiter enforces per-host token-bucket rate limits before each attempt.
+type Executor struct {
+    MCP       *mcp.Registry
+    QueueMgr  *queue.Manager  // nil = outbound DLQ disabled
+    AgentName string           // injected by Runner; stored in DLQ items
+    Limiter   *HostLimiter     // nil = no rate limiting
+}
 
 // Execute runs a single tool call. Always returns a ToolResult; never panics.
 // On failure, ToolResult.Error is set and Content may be empty.
+// With def.Retry.MaxAttempts > 1, transient failures (5xx, 429, network)
+// are retried with exponential backoff + jitter.
 func (e *Executor) Execute(ctx context.Context, def model.ToolDefinition, args map[string]any) model.ToolResult
+
+// MetricSnapshot returns a point-in-time snapshot of outbound tool counters.
+// Counters are process-lifetime atomics; they reset only on restart.
+func MetricSnapshot() (retryTotal, backoffTotal, rateLimitWaitTotal int64)
 ```
 
+#### Per-tool retry policy
+
+`ToolDefinition.Retry` (`model.ToolRetryConfig`) controls retry behaviour:
+
+| Field | Default | Meaning |
+|---|---|---|
+| `max_attempts` | 1 (no retry) | Total attempts including the initial one. |
+| `base_delay` | `1s` | Initial backoff; doubles each attempt. |
+| `max_delay` | `30s` | Backoff ceiling before jitter. |
+| `honor_retry_after` | `true` | Use `Retry-After` header value when present. |
+
+A zero `ToolRetryConfig` (all fields at zero value) means **single attempt, no
+retry** — preserving backward compatibility for tools that predate the policy.
+Only tools with an explicit `retry:` block in their skill YAML get the retry loop.
+
+`isTransient` classifies the failure to decide whether to retry:
+- **Transient** → retry: 429, 500, 502, 503, 504; network/timeout errors;
+  403 with `X-RateLimit-Remaining: 0`
+- **Permanent** → return immediately without retrying: all other 4xx
+
+#### Outbound DLQ
+
+When `Executor.QueueMgr != nil` and a tool call either:
+- exhausts its retry budget on transient errors, or
+- fails immediately with a permanent error,
+
+a `model.QueueItem` is enqueued to the well-known `"outbound-dlq"` queue.
+DLQ items carry `ToolName`, `ToolTarget`, and the last error in `Payload`.
+DLQ enqueue is a fire-and-forget side-effect; failure to enqueue is logged
+at `warn` and does not affect the returned `ToolResult`.
+
+Items in `outbound-dlq` can be inspected and requeued via `leather dlq`.
+
 #### Skill file format (`*.skill.yaml`)
 
 ```yaml
@@ -139,6 +186,11 @@ tools:
       headers:
         Authorization: "Bearer {{env:GITHUB_TOKEN}}"
         Accept: application/vnd.github+json
+    retry:
+      max_attempts: 3
+      base_delay: 1s
+      max_delay: 30s
+      honor_retry_after: true
 ```
 
 #### Toolset file format (`*.toolset.yaml`)
@@ -161,11 +213,15 @@ every toolset reference points at an already-known tool.
 - `mcp` → `execMCP` through a started `mcp.Registry`
 
 HTTP execution (`execHTTP`):
+- Calls `e.Limiter.Wait(ctx, host)` before each attempt; blocks until the
+  per-host token bucket allows the request or ctx is cancelled.
 - Expands URL template with tool call arguments (`{{.field}}`).
 - Expands `{{env:VAR}}` in header values; **never logs auth header values**.
 - Sends the request with the runner's context (inherits timeout).
 - Response body is capped at 1 MB.
 - Non-2xx responses populate `ToolResult.Error` with the status code message.
+- Transient failures are retried until `def.Retry.MaxAttempts` total attempts
+  have been made, with exponential backoff; permanent failures return immediately.
 
 MCP execution (`execMCP`):
 - Looks up the named server in the running registry.
diff --git a/.subagents/AGENTS-SERVE.md b/.subagents/AGENTS-SERVE.md
index 15caa33..9cef85a 100644
--- a/.subagents/AGENTS-SERVE.md
+++ b/.subagents/AGENTS-SERVE.md
@@ -46,6 +46,7 @@ Each subcommand has:
 | `run` | `RunOnce` | Load and execute a single agent, then exit |
 | `validate` | `RunValidate` | Parse and validate agent files; report errors |
 | `test-agent` | `RunTestAgent` | Execute an agent with `MockLLM` and print the turn transcript |
+| `dlq` | `RunDLQ` | Inspect and requeue outbound dead-letter queue items |
 | `status` | `RunStatus` | Print scheduler state, job history, token usage |
 | `ingest` | `RunIngest` | Store raw bytes as a hide and optionally enqueue for curing |
 | `snapshot` | `RunSnapshot` → `RunSnapshotSave` / `RunSnapshotRestore` | Save or restore a `tar.gz` point-in-time archive of runtime state |
@@ -212,8 +213,19 @@ cache_dir: ""               # empty = serve uses <state_dir>/cache
 mcp_servers_file: ""        # empty = run/serve/validate use ~/.leather/mcp-servers.yaml
 loop: 1                     # repeat leather run N times
 tannery: ""                 # path to tannery.yaml; empty = tannery disabled
+
+tools:
+  rate_limits:              # per-host token-bucket rate limits for outbound tool calls
+    api.github.com: "60/m"  # format: "N/s", "N/m", or "N/h"
+    api.example.com: "10/s"
 ```
 
+`tools.rate_limits` is a nested map. Each key is a hostname (no port, no
+scheme); the value is a rate spec in the form `N/<unit>` where unit is `s`
+(seconds), `m` (minutes), or `h` (hours). Calls that exceed a host's
+configured rate block until the next token is available. Unknown hosts
+pass through immediately with no limiting.
+
 YAML keys are the snake_case equivalents of the flag names (strip `--`,
 replace `-` with `_`).
 
@@ -255,6 +267,24 @@ All endpoints live in `internal/cli/api_tannery.go`.
 | `/curings` | GET | `handleCurings` | List curing definitions with queue depth |
 | `/intake` | POST | `handleIntake` | Direct-ingest endpoint; writes hide from request body |
 
+### `leather dlq` subcommand (`internal/cli/cmd_dlq.go`)
+
+`RunDLQ(args, stdout, stderr)` dispatches `inspect` and `requeue`:
+
+```
+leather dlq inspect [--queue outbound-dlq] [--state-dir ...]
+leather dlq requeue [--queue outbound-dlq] [--work-queue <name>] [--state-dir ...] <item-id>
+```
+
+- **`inspect`** — lists all items in the DLQ; prints `ID | tool | agent | attempt | enqueued_at | error`.
+- **`requeue`** — moves the named item from the DLQ to `--work-queue`, resetting
+  `AttemptCount` to 0 so it gets a fresh retry budget. Default `--work-queue` is
+  the DLQ name with the `-dlq` suffix stripped.
+
+**Important**: `<item-id>` must come **after** all flags. Go's `flag.FlagSet`
+stops parsing at the first non-flag token, so placing `<item-id>` before flags
+silently ignores the remaining flags.
+
 ### `leather ingest` subcommand (`internal/cli/cmd_ingest.go`)
 
 `RunIngest(args, stdin, stdout, stderr)` — reads body from `--file` or stdin,
@@ -378,6 +408,7 @@ internal/schema  →  internal/config
 | Flag name doesn't match env var | `--flag-name` → `LEATHER_FLAG_NAME`; check both |
 | Skipping graceful shutdown | Always call `scheduler.Drain` before returning from `RunServe` |
 | Flags after positional arg in `leather run` | Go's `flag.FlagSet` stops at the first non-flag token. The agent file path must come **last**: `leather run --config=... --var k=v agent.md` — not `leather run agent.md --config ...` |
+| `<item-id>` before flags in `leather dlq requeue` | Same issue: item-id must be **last** after all flags: `leather dlq requeue --state-dir ... <item-id>` |
 | `leather init` overwriting without `--overwrite` | `RunInit` fails closed: any pre-existing file causes a non-zero exit and reports `--overwrite` hint. Never silently clobber. |
 | Calling `RunValidate` from `RunInit` for post-write validation | `RunValidate` performs a full semantic check including model resolution (fails without `LEATHER_MODEL`). `RunInit` uses schema-only validation (`runInitValidate`) which is syntax-only and does not require a model to be set. |
 
diff --git a/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md b/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md
index 6490f2e..7578434 100644
--- a/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md
+++ b/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md
@@ -218,6 +218,29 @@ binary for shell commands; an HTTP MCP server for HTTP APIs).
 - The capability is a coherent verb (e.g. `release-write`,
   `inbox-triage`), not a grab bag.
 
+#### Adding retry policy to a skill tool
+
+Tools that call remote APIs should declare a `retry:` block to handle transient
+failures (5xx, 429, network errors) without relying on the caller to retry:
+
+```yaml
+tools:
+  - name: github_list_issues
+    type: http
+    http:
+      url: "https://api.github.com/repos/{{.repo}}/issues"
+    retry:
+      max_attempts: 3    # initial attempt + 2 retries
+      base_delay: 1s     # doubles each retry; capped at max_delay
+      max_delay: 30s
+      honor_retry_after: true  # use Retry-After header when present
+```
+
+Omitting `retry:` (or setting `max_attempts: 1`) preserves the legacy
+single-attempt behaviour — no retries, no backoff. Only transient errors
+(5xx, 429, network timeouts) trigger retries; permanent 4xx errors return
+immediately regardless of the retry config.
+
 ### When to write a per-turn declaration
 
 - The turn does something dangerous and the agent's base scope is too
diff --git a/.subagents/AGENTS-WORKER.md b/.subagents/AGENTS-WORKER.md
index 7922103..7c87e4a 100644
--- a/.subagents/AGENTS-WORKER.md
+++ b/.subagents/AGENTS-WORKER.md
@@ -132,6 +132,35 @@ re-serialize the full slice. Files are mode 0600; directories mode 0700.
 - `Payload` — `map[string]any`; template variables available in agent prompts
 - `EnqueuedAt` — Unix timestamp
 - `AttemptCount` — incremented by the runner on each dequeue-and-run attempt
+- `ToolName` — non-empty for outbound DLQ items; the failed tool's name
+- `ToolTarget` — non-empty for outbound DLQ items; the URL or `server/tool` string
+
+#### Outbound DLQ (`outbound-dlq`)
+
+Tool execution failures that are permanent or that exhaust their retry budget
+are enqueued to the well-known `outbound-dlq` queue by `tool.Executor`. These
+items are **not** processed by `internal/curing` workers (`CuringName == ""`).
+They are surfaced for operator inspection and manual requeue via `leather dlq`.
+
+Outbound DLQ item shape:
+
+| Field | Meaning |
+|---|---|
+| `ID` | `odlq_<date>_<time>_<hex>` |
+| `AgentName` | Agent that triggered the tool call |
+| `ToolName` | Name of the failed tool |
+| `ToolTarget` | URL (HTTP tools) or `server/tool` (MCP tools) |
+| `AttemptCount` | Number of attempts made before giving up |
+| `EnqueuedAt` | Unix timestamp |
+| `Payload["tool"]` | Tool name (duplicate for backward compat) |
+| `Payload["args"]` | Tool arguments at time of failure |
+| `Payload["error"]` | Last error message |
+| `Payload["attempt"]` | Attempt count at time of failure |
+
+**DLQ enqueue path**: `tool.Executor.Execute` → final attempt fails or
+`isTransient=false` → `QueueMgr.Enqueue("outbound-dlq", item)`. Enqueue
+failure is logged at `warn` and does not affect the `ToolResult` returned to
+the runner. DLQ is disabled when `Executor.QueueMgr` is nil.
 
 ---
 
diff --git a/internal/cli/cli.go b/internal/cli/cli.go
index f64f9fc..8b16601 100644
--- a/internal/cli/cli.go
+++ b/internal/cli/cli.go
@@ -46,6 +46,8 @@ func Run(args []string, stdout, stderr io.Writer, version, commit string) int {
 		return RunReplay(rest, stdout, stderr, version, commit)
 	case "snapshot":
 		return RunSnapshot(rest, stdout, stderr)
+	case "dlq":
+		return RunDLQ(rest, stdout, stderr)
 	case "attach":
 		return RunAttach(rest, stdout, stderr)
 	case "help", "--help", "-h":
diff --git a/internal/cli/cmd_dlq.go b/internal/cli/cmd_dlq.go
new file mode 100644
index 0000000..8a80f73
--- /dev/null
+++ b/internal/cli/cmd_dlq.go
@@ -0,0 +1,189 @@
+package cli
+
+import (
+	"fmt"
+	"io"
+	"path/filepath"
+	"strings"
+	"time"
+
+	"github.com/tgpski/leather/internal/config"
+	"github.com/tgpski/leather/internal/queue"
+)
+
+// RunDLQ is the entry point for the "leather dlq" sub-command.
+// It dispatches to inspect or requeue based on the first positional argument.
+func RunDLQ(args []string, stdout, stderr io.Writer) int {
+	if len(args) == 0 || args[0] == "--help" || args[0] == "-h" {
+		fmt.Fprint(stdout, dlqUsage)
+		return 0
+	}
+	sub, rest := args[0], args[1:]
+	switch sub {
+	case "inspect":
+		return runDLQInspect(rest, stdout, stderr)
+	case "requeue":
+		return runDLQRequeue(rest, stdout, stderr)
+	default:
+		fmt.Fprintf(stderr, "leather dlq: unknown sub-command %q\n\n", sub)
+		fmt.Fprint(stderr, dlqUsage)
+		return 2
+	}
+}
+
+const dlqUsage = `leather dlq — inspect and requeue outbound dead-letter queue items
+
+Usage:
+  leather dlq inspect  [--queue outbound-dlq] [flags]
+  leather dlq requeue  [--queue outbound-dlq] [--work-queue <name>] [flags] <item-id>
+
+Sub-commands:
+  inspect   list items currently in the DLQ
+  requeue   move a DLQ item back to a work queue for re-processing
+
+Use "leather dlq <sub-command> --help" for flag details.
+`
+
+// runDLQInspect lists items in the named DLQ queue.
+func runDLQInspect(args []string, stdout, stderr io.Writer) int {
+	fs := newFlagSet("dlq inspect", stderr)
+	config.BindFlags(fs)
+	queueName := fs.String("queue", "outbound-dlq", "DLQ queue name to inspect")
+	if !parseFlags(fs, args) {
+		return 2
+	}
+
+	cfg, err := config.Load(fs)
+	if err != nil {
+		fmt.Fprintf(stderr, "leather dlq inspect: %v\n", err)
+		return 1
+	}
+
+	queueDir := filepath.Join(cfg.StateDir, "queues")
+	mgr := queue.NewManager(queueDir)
+	q, err := mgr.Get(*queueName)
+	if err != nil {
+		fmt.Fprintf(stderr, "leather dlq inspect: open queue %q: %v\n", *queueName, err)
+		return 1
+	}
+
+	items := q.Scan()
+	if len(items) == 0 {
+		fmt.Fprintf(stdout, "leather dlq inspect: queue %q is empty\n", *queueName)
+		return 0
+	}
+
+	fmt.Fprintf(stdout, "%-26s  %-20s  %-20s  %-7s  %-30s  %s\n",
+		"ID", "tool", "agent", "attempt", "enqueued_at", "error")
+	fmt.Fprintln(stdout, strings.Repeat("-", 120))
+	for _, item := range items {
+		ts := time.Unix(item.EnqueuedAt, 0).Format("2006-01-02 15:04:05")
+		tool := item.ToolName
+		if tool == "" {
+			if t, ok := item.Payload["tool"].(string); ok {
+				tool = t
+			}
+		}
+		errStr := ""
+		if e, ok := item.Payload["error"].(string); ok {
+			errStr = e
+		}
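+		// Truncate on rune boundaries (not bytes) so a multi-byte character
+		// in the error message is never split mid-sequence.
+		errStr = truncate(errStr, 61)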
+		attempt := item.AttemptCount
+		if a, ok := item.Payload["attempt"].(float64); ok && attempt == 0 {
+			attempt = int(a)
+		}
+		fmt.Fprintf(stdout, "%-26s  %-20s  %-20s  %-7d  %-30s  %s\n",
+			item.ID, truncate(tool, 20), truncate(item.AgentName, 20), attempt, ts, errStr)
+	}
+	return 0
+}
+
+// runDLQRequeue moves a named item from the DLQ to a work queue.
+// Usage: leather dlq requeue [flags] <item-id>
+// The item-id must be the last argument (after all flags).
+func runDLQRequeue(args []string, stdout, stderr io.Writer) int {
+	fs := newFlagSet("dlq requeue", stderr)
+	config.BindFlags(fs)
+	queueName := fs.String("queue", "outbound-dlq", "DLQ queue name to read from")
+	workQueue := fs.String("work-queue", "", "destination work queue; defaults to <queue> with -dlq suffix removed")
+	if !parseFlags(fs, args) {
+		return 2
+	}
+
+	positional := fs.Args()
+	if len(positional) == 0 {
+		fmt.Fprintf(stderr, "leather dlq requeue: missing item-id argument\n")
+		fmt.Fprint(stderr, "Usage: leather dlq requeue [flags] <item-id>\n")
+		return 2
+	}
+	itemID := positional[0]
+
+	cfg, err := config.Load(fs)
+	if err != nil {
+		fmt.Fprintf(stderr, "leather dlq requeue: %v\n", err)
+		return 1
+	}
+
+	dest := *workQueue
+	if dest == "" {
+		dest = strings.TrimSuffix(*queueName, "-dlq")
+		if dest == *queueName {
+			// queue name didn't end in -dlq, so there is no default destination; require --work-queue
+			fmt.Fprintf(stderr, "leather dlq requeue: queue %q does not end in '-dlq'; use --work-queue to specify destination\n", *queueName)
+			return 2
+		}
+	}
+
+	queueDir := filepath.Join(cfg.StateDir, "queues")
+	mgr := queue.NewManager(queueDir)
+
+	dlqQ, err := mgr.Get(*queueName)
+	if err != nil {
+		fmt.Fprintf(stderr, "leather dlq requeue: open queue %q: %v\n", *queueName, err)
+		return 1
+	}
+
+	items := dlqQ.Scan()
+	found := false
+	for _, item := range items {
+		if item.ID != itemID {
+			continue
+		}
+		found = true
+		// Reset attempt count so the item gets a fresh retry budget.
+		item.AttemptCount = 0
+
+		// Dequeue from DLQ.
+		removed, deqErr := dlqQ.DequeueByIDs([]string{itemID})
+		if deqErr != nil || len(removed) == 0 {
+			fmt.Fprintf(stderr, "leather dlq requeue: dequeue item %q: %v\n", itemID, deqErr)
+			return 1
+		}
+
+		// Enqueue to work queue.
+		if enqErr := mgr.Enqueue(dest, item); enqErr != nil {
+			fmt.Fprintf(stderr, "leather dlq requeue: enqueue to %q: %v\n", dest, enqErr)
+			return 1
+		}
+		fmt.Fprintf(stdout, "requeued %s → %s\n", itemID, dest)
+		return 0
+	}
+
+	if !found {
+		fmt.Fprintf(stderr, "leather dlq requeue: item %q not found in queue %q\n", itemID, *queueName)
+		return 1
+	}
+	return 0
+}
+
+// truncate shortens s to max runes, appending "…" when truncated.
+func truncate(s string, max int) string {
+	runes := []rune(s)
+	if len(runes) <= max {
+		return s
+	}
+	return string(runes[:max-1]) + "…"
+}
diff --git a/internal/cli/cmd_dlq_test.go b/internal/cli/cmd_dlq_test.go
new file mode 100644
index 0000000..402f53d
--- /dev/null
+++ b/internal/cli/cmd_dlq_test.go
@@ -0,0 +1,188 @@
+package cli
+
+import (
+	"bytes"
+	"encoding/json"
+	"os"
+	"path/filepath"
+	"strings"
+	"testing"
+	"time"
+
+	"github.com/tgpski/leather/internal/model"
+)
+
+// writeDLQItem writes a QueueItem as a JSONL line to the given queue file.
+func writeDLQItem(t *testing.T, path string, item model.QueueItem) {
+	t.Helper()
+	if err := os.MkdirAll(filepath.Dir(path), 0700); err != nil {
+		t.Fatalf("mkdir: %v", err)
+	}
+	b, err := json.Marshal(item)
+	if err != nil {
+		t.Fatalf("marshal: %v", err)
+	}
+	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0600)
+	if err != nil {
+		t.Fatalf("open: %v", err)
+	}
+	defer f.Close()
+	_, _ = f.Write(append(b, '\n'))
+}
+
+func TestRunDLQInspect_Empty(t *testing.T) {
+	stateDir := t.TempDir()
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{"inspect", "--state-dir", stateDir}, &stdout, &stderr)
+	if code != 0 {
+		t.Fatalf("exit code = %d, want 0; stderr=%s", code, stderr.String())
+	}
+	if !strings.Contains(stdout.String(), "empty") {
+		t.Errorf("stdout = %q, want to contain 'empty'", stdout.String())
+	}
+}
+
+func TestRunDLQInspect_WithItems(t *testing.T) {
+	stateDir := t.TempDir()
+	queuePath := filepath.Join(stateDir, "queues", "outbound-dlq.jsonl")
+
+	item := model.QueueItem{
+		ID:         "odlq_20260101_1200_abcd",
+		AgentName:  "my-agent",
+		ToolName:   "github_list_issues",
+		ToolTarget: "https://api.github.com/repos/acme/repo/issues",
+		EnqueuedAt: time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC).Unix(),
+		Payload: map[string]any{
+			"tool":    "github_list_issues",
+			"error":   "status 503: service unavailable",
+			"attempt": float64(3),
+		},
+	}
+	writeDLQItem(t, queuePath, item)
+
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{"inspect", "--state-dir", stateDir}, &stdout, &stderr)
+	if code != 0 {
+		t.Fatalf("exit code = %d, want 0; stderr=%s", code, stderr.String())
+	}
+	out := stdout.String()
+	if !strings.Contains(out, "odlq_20260101_1200_abcd") {
+		t.Errorf("stdout missing item ID; got:\n%s", out)
+	}
+	if !strings.Contains(out, "github_list_issues") {
+		t.Errorf("stdout missing tool name; got:\n%s", out)
+	}
+	if !strings.Contains(out, "my-agent") {
+		t.Errorf("stdout missing agent name; got:\n%s", out)
+	}
+}
+
+func TestRunDLQRequeue_MovesItem(t *testing.T) {
+	stateDir := t.TempDir()
+	queueDir := filepath.Join(stateDir, "queues")
+	dlqPath := filepath.Join(queueDir, "outbound-dlq.jsonl")
+
+	item := model.QueueItem{
+		ID:         "odlq_20260101_1200_beef",
+		AgentName:  "requeue-agent",
+		ToolName:   "my_tool",
+		EnqueuedAt: time.Now().Unix(),
+		Payload:    map[string]any{"tool": "my_tool", "error": "timeout"},
+	}
+	writeDLQItem(t, dlqPath, item)
+
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{
+		"requeue",
+		"--queue", "outbound-dlq",
+		"--work-queue", "my-work-queue",
+		"--state-dir", stateDir,
+		"odlq_20260101_1200_beef",
+	}, &stdout, &stderr)
+	if code != 0 {
+		t.Fatalf("exit code = %d, want 0; stderr=%s", code, stderr.String())
+	}
+	if !strings.Contains(stdout.String(), "requeued") {
+		t.Errorf("stdout = %q, want to contain 'requeued'", stdout.String())
+	}
+
+	// DLQ should now be empty.
+	dlqData, err := os.ReadFile(dlqPath)
+	if err != nil && !os.IsNotExist(err) {
+		t.Fatalf("read dlq: %v", err)
+	}
+	if len(strings.TrimSpace(string(dlqData))) != 0 {
+		t.Errorf("DLQ not empty after requeue; contents: %s", dlqData)
+	}
+
+	// Work queue should contain the item.
+	workPath := filepath.Join(queueDir, "my-work-queue.jsonl")
+	workData, err := os.ReadFile(workPath)
+	if err != nil {
+		t.Fatalf("read work queue: %v", err)
+	}
+	var moved model.QueueItem
+	if err := json.Unmarshal(bytes.TrimSpace(workData), &moved); err != nil {
+		t.Fatalf("unmarshal work item: %v", err)
+	}
+	if moved.ID != item.ID {
+		t.Errorf("moved item ID = %q, want %q", moved.ID, item.ID)
+	}
+	if moved.AttemptCount != 0 {
+		t.Errorf("moved item AttemptCount = %d, want 0 (reset)", moved.AttemptCount)
+	}
+}
+
+func TestRunDLQRequeue_ItemNotFound(t *testing.T) {
+	stateDir := t.TempDir()
+	// Create empty dlq.
+	queueDir := filepath.Join(stateDir, "queues")
+	if err := os.MkdirAll(queueDir, 0700); err != nil {
+		t.Fatal(err)
+	}
+	dlqPath := filepath.Join(queueDir, "outbound-dlq.jsonl")
+	if err := os.WriteFile(dlqPath, nil, 0600); err != nil {
+		t.Fatal(err)
+	}
+
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{
+		"requeue",
+		"--work-queue", "dest",
+		"--state-dir", stateDir,
+		"nonexistent-id",
+	}, &stdout, &stderr)
+	if code != 1 {
+		t.Errorf("exit code = %d, want 1", code)
+	}
+	if !strings.Contains(stderr.String(), "not found") {
+		t.Errorf("stderr = %q, want to contain 'not found'", stderr.String())
+	}
+}
+
+func TestRunDLQRequeue_MissingItemID(t *testing.T) {
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{"requeue"}, &stdout, &stderr)
+	if code != 2 {
+		t.Errorf("exit code = %d, want 2", code)
+	}
+}
+
+func TestRunDLQUnknownSubcommand(t *testing.T) {
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{"bogus"}, &stdout, &stderr)
+	if code != 2 {
+		t.Errorf("exit code = %d, want 2", code)
+	}
+}
+
+func TestRunDLQHelp(t *testing.T) {
+	var stdout, stderr bytes.Buffer
+	code := RunDLQ([]string{"--help"}, &stdout, &stderr)
+	if code != 0 {
+		t.Errorf("exit code = %d, want 0", code)
+	}
+	if !strings.Contains(stdout.String(), "inspect") {
+		t.Errorf("help output missing 'inspect'; got: %s", stdout.String())
+	}
+}
diff --git a/internal/cli/cmd_run.go b/internal/cli/cmd_run.go
index 0b5fc0b..e36bdbf 100644
--- a/internal/cli/cmd_run.go
+++ b/internal/cli/cmd_run.go
@@ -131,6 +131,15 @@ func RunOnce(args []string, stdout, stderr io.Writer) int {
 		log.Warn("notify backend init failed", "error", e)
 	}
 
+	var toolLimiter *tool.HostLimiter
+	if len(cfg.ToolRateLimits) > 0 {
+		if lim, limErr := tool.NewHostLimiter(cfg.ToolRateLimits); limErr != nil {
+			log.Warn("tool rate limits: invalid config, rate limiting disabled", "error", limErr)
+		} else {
+			toolLimiter = lim
+		}
+	}
+
 	r := &runner.Runner{
 		Client:        session.NewHTTPClient(cfg.LLMEndpoint, cfg.LLMAPIKey, cfg.LLMTimeout),
 		Registry:      toolReg,
@@ -138,6 +147,7 @@ func RunOnce(args []string, stdout, stderr io.Writer) int {
 		Log:           log,
 		MaxToolRounds: cfg.MaxToolRounds,
 		Notifiers:     notifiers,
+		ToolLimiter:   toolLimiter,
 	}
 
 	// Cancel the run context on SIGINT/SIGTERM/SIGHUP so in-flight LLM calls
diff --git a/internal/cli/cmd_serve.go b/internal/cli/cmd_serve.go
index 1736d73..12f3b91 100644
--- a/internal/cli/cmd_serve.go
+++ b/internal/cli/cmd_serve.go
@@ -897,6 +897,16 @@ func RunServe(args []string, stdout, stderr io.Writer, version, commit string) i
 	devtoolsBus := bus.New(4096)
 	devtoolsSrc := sources.Wire(devtoolsBus, sources.Deps{})
 
+	var toolLimiter *tool.HostLimiter
+	if len(cfg.ToolRateLimits) > 0 {
+		var limErr error
+		toolLimiter, limErr = tool.NewHostLimiter(cfg.ToolRateLimits)
+		if limErr != nil {
+			log.Warn("tool rate limits: invalid config, rate limiting disabled", "error", limErr)
+			toolLimiter = nil
+		}
+	}
+
 	regDeps := agentRegDeps{
 		sched:          sched,
 		metrics:        metrics,
@@ -905,6 +915,7 @@ func RunServe(args []string, stdout, stderr io.Writer, version, commit string) i
 		queueMgr:       queueMgr,
 		notifiers:      notifiers,
 		mcpReg:         mcpReg,
+		toolLimiter:    toolLimiter,
 		cfg:            cfg,
 		log:            log,
 		stdout:         stdout,
@@ -1264,6 +1275,7 @@ type agentRegDeps struct {
 	queueMgr       *queue.Manager
 	notifiers      map[string]notify.Notifier
 	mcpReg         *mcp.Registry
+	toolLimiter    *tool.HostLimiter
 	cfg            model.Config
 	log            *logging.Logger
 	stdout         io.Writer
@@ -1297,6 +1309,7 @@ func registerAgentJob(deps agentRegDeps, a model.Agent) error {
 		QueueMgr:      deps.queueMgr,
 		Notifiers:     deps.notifiers,
 		MCPRegistry:   deps.mcpReg,
+		ToolLimiter:   deps.toolLimiter,
 	}
 	deps.metrics.registerAgent(agentCopy)
 	// Curing-driven agents have no schedule or queue input; the curing worker wakes
@@ -1651,6 +1664,11 @@ type configResponse struct {
 // metricsResponse is the JSON shape returned by GET /metrics.
 type metricsResponse struct {
 	Agents map[string]agentMetricSummary `json:"agents"`
+	// Outbound tool resilience metrics (issues #7–#9): three counters and a DLQ depth gauge.
+	ToolRetryTotal         int64 `json:"leather_tool_retry_total"`
+	ToolBackoffTotal       int64 `json:"leather_tool_backoff_total"`
+	ToolRateLimitWaitTotal int64 `json:"leather_tool_rate_limit_wait_total"`
+	OutboundDLQDepth       int   `json:"leather_outbound_dlq_depth"`
 }
 
 // snapshotResponse is the JSON shape of GET /snapshot and of snapshot files on disk.
@@ -1910,6 +1928,11 @@ func apiMux(deps apiDeps) http.Handler {
 	})
 
 	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
+		retryTotal, backoffTotal, rateLimitWaitTotal := tool.MetricSnapshot()
+		dlqDepth := 0
+		if deps.queueMgr != nil {
+			dlqDepth = deps.queueMgr.Depth("outbound-dlq")
+		}
 		var body metricsResponse
 		if deps.replay != nil {
 			body = metricsResponse{Agents: deps.replay.Metrics}
@@ -1918,6 +1941,10 @@ func apiMux(deps apiDeps) http.Handler {
 		} else {
 			body = metricsResponse{Agents: deps.metrics.summaries()}
 		}
+		body.ToolRetryTotal = retryTotal
+		body.ToolBackoffTotal = backoffTotal
+		body.ToolRateLimitWaitTotal = rateLimitWaitTotal
+		body.OutboundDLQDepth = dlqDepth
 		httpx.WriteJSON(w, http.StatusOK, body)
 	})
 
diff --git a/internal/cli/help.go b/internal/cli/help.go
index 8d8a26e..14f775f 100644
--- a/internal/cli/help.go
+++ b/internal/cli/help.go
@@ -14,6 +14,7 @@ Commands:
   validate    parse and validate agent definition files; report errors
   test-agent  run an agent with a mock LLM and print the turn transcript
   status      show scheduler state, job history, token budget usage
+  dlq         inspect and requeue outbound dead-letter queue items
   ingest      store raw bytes as a hide and optionally enqueue for curing
   replay      replay a captured snapshot or runs directory via the API
   snapshot    save or restore a point-in-time archive of runtime state
diff --git a/internal/config/config.go b/internal/config/config.go
index 75faa99..85fa945 100644
--- a/internal/config/config.go
+++ b/internal/config/config.go
@@ -224,6 +224,9 @@ func loadYAMLFile(path string, cfg *model.Config) error {
 		return err
 	}
 	cfg.NotifyBackends = parseNotifyBackends(string(b))
+	if limits := parseToolRateLimits(string(b)); len(limits) > 0 {
+		cfg.ToolRateLimits = limits
+	}
 	return nil
 }
 
@@ -671,6 +674,73 @@ func parseNotifyBackendItem(lines []string) model.NotifyBackendConfig {
 	return b
 }
 
+// parseToolRateLimits extracts the tools.rate_limits map from raw YAML source.
+// It looks for a block of the form:
+//
+//	tools:
+//	  rate_limits:
+//	    api.github.com: 5000/h
+//	    example.com: 10/s
+//
+// Returns nil when the block is absent.
+func parseToolRateLimits(src string) map[string]string {
+	lines := strings.Split(src, "\n")
+
+	// Find "tools:" top-level key.
+	toolsStart := -1
+	for i, line := range lines {
+		if strings.TrimRight(line, " \t\r") == "tools:" {
+			toolsStart = i + 1
+			break
+		}
+	}
+	if toolsStart < 0 {
+		return nil
+	}
+
+	// Collect indented lines inside the tools: block.
+	var toolsLines []string
+	for i := toolsStart; i < len(lines); i++ {
+		line := lines[i]
+		if strings.TrimSpace(line) == "" {
+			toolsLines = append(toolsLines, "")
+			continue
+		}
+		if line[0] != ' ' && line[0] != '\t' {
+			break
+		}
+		toolsLines = append(toolsLines, strings.TrimSpace(line))
+	}
+
+	// Find "rate_limits:"; later sibling keys under tools: are also read as limits.
+	rlStart := -1
+	for i, line := range toolsLines {
+		if line == "rate_limits:" {
+			rlStart = i + 1
+			break
+		}
+	}
+	if rlStart < 0 {
+		return nil
+	}
+
+	limits := make(map[string]string)
+	for _, line := range toolsLines[rlStart:] {
+		if line == "" {
+			continue
+		}
+		k, v, ok := yamlx.SplitKV(line)
+		if !ok || v == "" {
+			continue
+		}
+		limits[k] = v
+	}
+	if len(limits) == 0 {
+		return nil
+	}
+	return limits
+}
+
 // parseLLMAPIKeyBlock extracts the llm_api_key field from raw YAML source.
 //
 // Two forms are accepted:
diff --git a/internal/model/model.go b/internal/model/model.go
index b371cd7..25d594f 100644
--- a/internal/model/model.go
+++ b/internal/model/model.go
@@ -25,6 +25,22 @@ const (
 	JobStatusSkipped JobStatus = "skipped"
 )
 
+// ToolRetryConfig controls per-tool retry behaviour for transient failures.
+// The zero value disables retries: the executor makes a single attempt and
+// does not retry, even on a rate-limit response.
+type ToolRetryConfig struct {
+	// MaxAttempts is the total number of attempts (initial + retries).
+	// 0 (unset) or 1 means a single attempt with no retries.
+	MaxAttempts int `json:"max_attempts,omitempty"`
+	// BaseDelay is the initial backoff delay. 0 means use the default (1s).
+	BaseDelay time.Duration `json:"base_delay,omitempty"`
+	// MaxDelay caps the backoff delay. 0 means use the default (30s).
+	MaxDelay time.Duration `json:"max_delay,omitempty"`
+	// HonorRetryAfter, when true, uses the Retry-After response header to
+	// override the computed backoff delay. Forced to true when MaxAttempts > 0.
+	HonorRetryAfter bool `json:"honor_retry_after,omitempty"`
+}
+
 // ToolDefinition describes a callable tool available to an agent.
 type ToolDefinition struct {
 	// Name is the unique identifier of the tool, used by the model to invoke it.
@@ -52,6 +68,9 @@ type ToolDefinition struct {
 	// (the default) all env vars are permitted for backwards compatibility;
 	// set to an explicit list in new skill definitions.
 	AllowedEnv []string `json:"allowed_env,omitempty"`
+	// Retry configures the per-tool retry policy for transient failures.
+	// The zero value means a single attempt with no retries.
+	Retry ToolRetryConfig `json:"retry,omitempty"`
 }
 
 // MCPToolConfig holds the configuration for an mcp-type tool.
@@ -290,6 +309,12 @@ type QueueItem struct {
 	// CorrelationID (= the original webhook hide_id). Downstream curings
 	// with collect_by: correlation_id group by this field for correlated joins.
 	CorrelationID string `json:"correlation_id,omitempty"`
+	// ToolName, when non-empty, identifies this as an outbound-DLQ item produced
+	// by a failed tool execution. Set to the tool name that failed.
+	ToolName string `json:"tool_name,omitempty"`
+	// ToolTarget is the hostname (for HTTP tools) or "<server>/<tool>" (for MCP
+	// tools) that the failed tool was targeting. Used for DLQ inspection.
+	ToolTarget string `json:"tool_target,omitempty"`
 }
 
 // AgentHooks describes shell commands executed at agent lifecycle events.
@@ -531,6 +556,10 @@ type Config struct {
 	// to LLMEndpoint. Empty disables auth (suitable for local Ollama / vLLM).
 	// The raw key is never written to structured logs.
 	LLMAPIKey string
+	// ToolRateLimits maps a hostname (e.g. "api.github.com") to a rate spec
+	// expressed as "N/s", "N/m", or "N/h". An empty map means no limits.
+	// Populated from the tools.rate_limits block in config.yaml.
+	ToolRateLimits map[string]string
 }
 
 // SessionContext is a point-in-time snapshot of a session's conversation window.
diff --git a/internal/runner/runner.go b/internal/runner/runner.go
index 611a5b9..65861df 100644
--- a/internal/runner/runner.go
+++ b/internal/runner/runner.go
@@ -84,6 +84,9 @@ type Runner struct {
 	// turn. Use in reflection mode so the final structured output is plain text
 	// after all pages have been read.
 	NoToolsForLastTurn bool
+	// ToolLimiter, when non-nil, applies per-host token-bucket rate limiting to
+	// outbound HTTP and MCP tool calls. Constructed from Config.ToolRateLimits.
+	ToolLimiter *tool.HostLimiter
 	// Vars holds named values that replace {{key}} placeholders in the agent's
 	// system prompt and user prompt before the first LLM call. Populated by
 	// leather run when skills declare parameters.
@@ -430,7 +433,7 @@ func (r *Runner) Run(ctx context.Context, a model.Agent, budget model.TokenBudge
 						hideToolSucceeded = true
 					}
 				} else {
-					result = (&tool.Executor{MCP: r.MCPRegistry}).Execute(ctx, def, tc.Arguments)
+					result = (&tool.Executor{MCP: r.MCPRegistry, QueueMgr: r.QueueMgr, AgentName: a.Name, Limiter: r.ToolLimiter}).Execute(ctx, def, tc.Arguments)
 				}
 				if result.Error == "" && def.Buffer {
 					if r.HideBuffer == nil {
diff --git a/internal/tool/executor.go b/internal/tool/executor.go
index bbae42c..2f050b5 100644
--- a/internal/tool/executor.go
+++ b/internal/tool/executor.go
@@ -4,18 +4,24 @@ import (
 	"bytes"
 	"context"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"io"
+	"math/rand"
 	"net/http"
 	"net/url"
 	"os"
 	"path/filepath"
 	"strconv"
 	"strings"
+	"sync/atomic"
+	"syscall"
 	"time"
 
+	"github.com/tgpski/leather/internal/ids"
 	"github.com/tgpski/leather/internal/mcp"
 	"github.com/tgpski/leather/internal/model"
+	"github.com/tgpski/leather/internal/queue"
 )
 
 // toolClient is a shared HTTP client with a conservative timeout.
@@ -23,12 +29,41 @@ import (
 // tool calls from blocking indefinitely on unresponsive endpoints.
 var toolClient = &http.Client{Timeout: 30 * time.Second}
 
+// Package-level atomic metrics counters.
+var (
+	metricRetryTotal         int64 // incremented on each retry attempt
+	metricBackoffTotal       int64 // incremented when a non-zero backoff sleep occurs
+	metricRateLimitWaitTotal int64 // incremented when HostLimiter.Wait blocks
+)
+
+// MetricSnapshot returns a point-in-time copy of the outbound tool counters.
+func MetricSnapshot() (retryTotal, backoffTotal, rateLimitWaitTotal int64) {
+	return atomic.LoadInt64(&metricRetryTotal),
+		atomic.LoadInt64(&metricBackoffTotal),
+		atomic.LoadInt64(&metricRateLimitWaitTotal)
+}
+
+const (
+	outboundDLQName    = "outbound-dlq"
+	defaultMaxAttempts = 3
+	defaultBaseDelay   = 1 * time.Second
+	defaultMaxDelay    = 30 * time.Second
+)
+
 // Executor dispatches tool calls to the appropriate backend.
 // The zero value (all fields nil) handles http-type tools only.
 type Executor struct {
 	// MCP is the registry of running MCP server clients.
 	// Nil means mcp-type tools are unavailable.
 	MCP *mcp.Registry
+	// QueueMgr, when non-nil, enables the outbound DLQ: permanent failures and
+	// exhausted-retry failures are enqueued to the "outbound-dlq" queue.
+	QueueMgr *queue.Manager
+	// AgentName is propagated into outbound-DLQ items for traceability.
+	AgentName string
+	// Limiter, when non-nil, applies per-host token-bucket throttling before
+	// every outbound HTTP or MCP-backed call (including retries).
+	Limiter *HostLimiter
 }
 
 // Execute dispatches a ToolCall to the appropriate executor based on def.Type.
@@ -38,14 +73,16 @@ func (e *Executor) Execute(ctx context.Context, def model.ToolDefinition, args m
 	var content string
 	var execErr error
 
+	retrycfg := resolvedRetry(def.Retry)
+
 	switch def.Type {
 	case "http", "":
-		content, execErr = execHTTP(ctx, def.HTTP, args, def.AllowedEnv)
+		content, execErr = e.execHTTPWithRetry(ctx, def, args, retrycfg)
 	case "mcp":
 		if e.MCP == nil {
 			execErr = fmt.Errorf("mcp tool %q: no MCP registry configured", def.Name)
 		} else {
-			content, execErr = execMCP(ctx, e.MCP, def.MCP, args)
+			content, execErr = e.execMCPWithRetry(ctx, def, args, retrycfg)
 		}
 	default:
 		execErr = fmt.Errorf("unsupported tool type %q", def.Type)
@@ -76,6 +113,182 @@ func Execute(ctx context.Context, def model.ToolDefinition, args map[string]any)
 	return (&Executor{}).Execute(ctx, def, args)
 }
 
+// resolvedRetry returns a ToolRetryConfig ready for use.
+// When MaxAttempts is 0 (zero value / not configured), retries are disabled
+// and a single attempt is made. When MaxAttempts > 0, unset backoff fields
+// take their defaults and HonorRetryAfter is forced on.
+func resolvedRetry(r model.ToolRetryConfig) model.ToolRetryConfig {
+	if r.MaxAttempts == 0 {
+		// Not configured: single attempt, no retry.
+		return model.ToolRetryConfig{MaxAttempts: 1}
+	}
+	if r.BaseDelay == 0 {
+		r.BaseDelay = defaultBaseDelay
+	}
+	if r.MaxDelay == 0 {
+		r.MaxDelay = defaultMaxDelay
+	}
+	// A bool cannot distinguish unset from false, so HonorRetryAfter is always enabled here.
+	if !r.HonorRetryAfter {
+		r.HonorRetryAfter = true
+	}
+	return r
+}
+
+// execHTTPWithRetry wraps execHTTPInner with the configured retry policy.
+func (e *Executor) execHTTPWithRetry(ctx context.Context, def model.ToolDefinition, args map[string]any, retrycfg model.ToolRetryConfig) (string, error) {
+	cfg := def.HTTP
+	allowedEnv := def.AllowedEnv
+
+	// Extract host for rate limiting.
+	rawURL, err := expandTemplate(cfg.URL, args, allowedEnv)
+	if err != nil {
+		return "", fmt.Errorf("tool/execHTTP: expand url: %w", err)
+	}
+	host := ""
+	if u, parseErr := url.Parse(rawURL); parseErr == nil {
+		host = u.Hostname()
+	}
+
+	maxAttempts := retrycfg.MaxAttempts
+	if maxAttempts < 1 {
+		maxAttempts = 1
+	}
+
+	var lastErr error
+	for attempt := 1; attempt <= maxAttempts; attempt++ {
+		// Apply per-host rate limiting before the attempt.
+		if e.Limiter != nil {
+			waited, waitErr := e.Limiter.Wait(ctx, host)
+			if waitErr != nil {
+				return "", fmt.Errorf("tool/execHTTP: rate limit wait: %w", waitErr)
+			}
+			if waited {
+				atomic.AddInt64(&metricRateLimitWaitTotal, 1)
+			}
+		}
+
+		var content string
+		content, lastErr = execHTTPInner(ctx, cfg, args, allowedEnv)
+		if lastErr == nil {
+			return content, nil
+		}
+
+		// Check if the error is transient and we have retries left.
+		if attempt >= maxAttempts {
+			break
+		}
+
+		statusCode := httpStatusFromErr(lastErr)
+		if !isTransient(statusCode, lastErr) {
+			// Permanent failure — don't retry.
+			break
+		}
+
+		atomic.AddInt64(&metricRetryTotal, 1)
+
+		// Compute backoff delay.
+		delay := backoffDelay(attempt, retrycfg, lastErr)
+		if delay > 0 {
+			atomic.AddInt64(&metricBackoffTotal, 1)
+			select {
+			case <-time.After(delay):
+			case <-ctx.Done():
+				return "", fmt.Errorf("tool/execHTTP: retry wait cancelled: %w", ctx.Err())
+			}
+		}
+	}
+
+	// Enqueue to outbound-DLQ on failure; the item records maxAttempts, not attempts made.
+	e.enqueueDLQ(def, args, host, lastErr, maxAttempts)
+	return "", lastErr
+}
+
+// execMCPWithRetry wraps execMCP with the configured retry policy.
+func (e *Executor) execMCPWithRetry(ctx context.Context, def model.ToolDefinition, args map[string]any, retrycfg model.ToolRetryConfig) (string, error) {
+	target := def.MCP.Server + "/" + def.MCP.Tool
+
+	maxAttempts := retrycfg.MaxAttempts
+	if maxAttempts < 1 {
+		maxAttempts = 1
+	}
+
+	var lastErr error
+	for attempt := 1; attempt <= maxAttempts; attempt++ {
+		// Apply per-host rate limiting (keyed by MCP server name).
+		if e.Limiter != nil {
+			waited, waitErr := e.Limiter.Wait(ctx, def.MCP.Server)
+			if waitErr != nil {
+				return "", fmt.Errorf("tool/execMCP: rate limit wait: %w", waitErr)
+			}
+			if waited {
+				atomic.AddInt64(&metricRateLimitWaitTotal, 1)
+			}
+		}
+
+		var content string
+		content, lastErr = execMCP(ctx, e.MCP, def.MCP, args)
+		if lastErr == nil {
+			return content, nil
+		}
+
+		if attempt >= maxAttempts {
+			break
+		}
+
+		// MCP errors are treated as transient unless they indicate a missing server.
+		if strings.Contains(lastErr.Error(), "not found in MCP registry") {
+			break // permanent: server not configured
+		}
+
+		atomic.AddInt64(&metricRetryTotal, 1)
+
+		delay := backoffDelay(attempt, retrycfg, lastErr)
+		if delay > 0 {
+			atomic.AddInt64(&metricBackoffTotal, 1)
+			select {
+			case <-time.After(delay):
+			case <-ctx.Done():
+				return "", fmt.Errorf("tool/execMCP: retry wait cancelled: %w", ctx.Err())
+			}
+		}
+	}
+
+	e.enqueueDLQ(def, args, target, lastErr, maxAttempts)
+	return "", lastErr
+}
+
+// enqueueDLQ enqueues a failed tool call to the outbound-DLQ when QueueMgr is set.
+// This is a best-effort side-effect; it does not affect the ToolResult.
+func (e *Executor) enqueueDLQ(def model.ToolDefinition, args map[string]any, target string, lastErr error, attempts int) {
+	if e.QueueMgr == nil {
+		return
+	}
+	errStr := ""
+	if lastErr != nil {
+		errStr = lastErr.Error()
+	}
+	item := model.QueueItem{
+		ID:         ids.TimestampHex("odlq"),
+		AgentName:  e.AgentName,
+		ToolName:   def.Name,
+		ToolTarget: target,
+		EnqueuedAt: time.Now().Unix(),
+		Payload: map[string]any{
+			"tool":    def.Name,
+			"target":  target,
+			"agent":   e.AgentName,
+			"error":   errStr,
+			"attempt": attempts,
+			"args":    args,
+		},
+	}
+	if enqErr := e.QueueMgr.Enqueue(outboundDLQName, item); enqErr != nil {
+		// Non-fatal: the enqueue error is discarded, so a failed DLQ write is silent.
+		_ = enqErr
+	}
+}
+
 // execMCP calls a named tool on a running MCP server and returns the text result.
 func execMCP(ctx context.Context, reg *mcp.Registry, cfg model.MCPToolConfig, args map[string]any) (string, error) {
 	client, ok := reg.Get(cfg.Server)
@@ -89,17 +302,14 @@ func execMCP(ctx context.Context, reg *mcp.Registry, cfg model.MCPToolConfig, ar
 	return result, nil
 }
 
-// execHTTP performs an HTTP tool call by expanding templates, building the
-// request, and returning the response body as a string. On a rate-limit
-// response (429 or 403 with X-RateLimit-Remaining: 0) it waits up to 60 s
-// and retries exactly once.
+// execHTTP performs a single HTTP tool call with no retry logic.
+// Callers that want retry should use execHTTPWithRetry via the Executor.
 func execHTTP(ctx context.Context, cfg model.HTTPToolConfig, args map[string]any, allowedEnv []string) (string, error) {
-	return execHTTPInner(ctx, cfg, args, allowedEnv, true)
+	return execHTTPInner(ctx, cfg, args, allowedEnv)
 }
 
-// execHTTPInner is the implementation of execHTTP. allowRetry controls whether
-// a rate-limited response triggers a single retry attempt.
-func execHTTPInner(ctx context.Context, cfg model.HTTPToolConfig, args map[string]any, allowedEnv []string, allowRetry bool) (string, error) {
+// execHTTPInner is the single-attempt HTTP implementation.
+func execHTTPInner(ctx context.Context, cfg model.HTTPToolConfig, args map[string]any, allowedEnv []string) (string, error) {
 	// Expand the URL template.
 	rawURL, err := expandTemplate(cfg.URL, args, allowedEnv)
 	if err != nil {
@@ -175,27 +385,109 @@ func execHTTPInner(ctx context.Context, cfg model.HTTPToolConfig, args map[strin
 		return "", fmt.Errorf("tool/execHTTP: read response: %w", err)
 	}
 
-	if allowRetry && isRateLimited(resp) {
-		wait := retryWait(resp, 60*time.Second)
-		select {
-		case <-time.After(wait):
-		case <-ctx.Done():
-			return "", fmt.Errorf("tool/execHTTP: rate limit wait cancelled: %w", ctx.Err())
-		}
-		return execHTTPInner(ctx, cfg, args, allowedEnv, false)
-	}
-
 	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 		snippet := body
 		if len(snippet) > 256 {
 			snippet = snippet[:256]
 		}
-		return "", fmt.Errorf("tool/execHTTP: status %d: %s", resp.StatusCode, snippet)
+		return "", &httpError{status: resp.StatusCode, body: string(snippet), header: resp.Header}
 	}
 
 	return string(body), nil
 }
 
+// httpError carries the HTTP status code, body snippet, and response headers
+// so callers can inspect them without reparsing the error string.
+type httpError struct {
+	status int
+	body   string
+	header http.Header
+}
+
+func (e *httpError) Error() string {
+	return fmt.Sprintf("tool/execHTTP: status %d: %s", e.status, e.body)
+}
+
+// httpStatusFromErr returns the HTTP status code embedded in an httpError, or 0.
+func httpStatusFromErr(err error) int {
+	var he *httpError
+	if errors.As(err, &he) {
+		return he.status
+	}
+	return 0
+}
+
+// isTransient reports whether the error or HTTP status represents a condition
+// that may resolve on retry (server-side overload, network blip, rate limit).
+// Permanent failures (auth errors, bad requests) return false.
+// A 403 with X-RateLimit-Remaining: 0 (GitHub-style quota exhaustion) is also
+// treated as transient since it resolves once the quota resets.
+func isTransient(statusCode int, err error) bool {
+	if statusCode != 0 {
+		switch statusCode {
+		case 429, 500, 502, 503, 504:
+			return true
+		default:
+			// 403 + rate-limit header is transient (quota exhaustion, not auth failure).
+			var he *httpError
+			if errors.As(err, &he) && he.status == 403 &&
+				he.header != nil && he.header.Get("X-RateLimit-Remaining") == "0" {
+				return true
+			}
+			return false
+		}
+	}
+	if err == nil {
+		return false
+	}
+	// Network-level transient errors.
+	if os.IsTimeout(err) {
+		return true
+	}
+	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
+		return true
+	}
+	if errors.Is(err, syscall.ECONNRESET) || errors.Is(err, syscall.ECONNREFUSED) {
+		return true
+	}
+	// url.Error wraps net errors (includes timeouts).
+	var ue *url.Error
+	if errors.As(err, &ue) {
+		return os.IsTimeout(ue.Err) || errors.Is(ue.Err, io.EOF) ||
+			errors.Is(ue.Err, syscall.ECONNRESET) || errors.Is(ue.Err, syscall.ECONNREFUSED)
+	}
+	return false
+}
+
+// backoffDelay computes how long to wait before the next attempt.
+// attempt is 1-indexed (attempt=1 means the first failure; next attempt is #2).
+func backoffDelay(attempt int, cfg model.ToolRetryConfig, err error) time.Duration {
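+	// Use a server-provided wait only when a rate-limit header is present.
+	// retryWait falls back to its max when neither header is set, which would
+	// otherwise replace exponential backoff for every HTTP error.
+	var he *httpError
+	if cfg.HonorRetryAfter && errors.As(err, &he) && he.header != nil &&
+		(he.header.Get("Retry-After") != "" || he.header.Get("X-RateLimit-Reset") != "") {
+		return retryWait(he.header, cfg.MaxDelay)
+	}
+	// Exponential backoff: BaseDelay * 2^(attempt-1), capped at MaxDelay.
+	exp := attempt - 1
+	if exp > 30 {
+		exp = 30
+	}
+	delay := cfg.BaseDelay
+	for i := 0; i < exp && delay < cfg.MaxDelay; i++ {
+		delay *= 2
+	}
+	// Clamp after the loop so BaseDelay > MaxDelay is capped on the first retry too.
+	if delay > cfg.MaxDelay {
+		delay = cfg.MaxDelay
+	}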
+	// Add up to 10% jitter.
+	jitter := time.Duration(rand.Int63n(int64(delay/10) + 1)) //nolint:gosec
+	return delay + jitter
+}
+
 // isRateLimited reports whether resp indicates a rate-limit condition:
 // 429 (Too Many Requests) or 403 with X-RateLimit-Remaining: 0.
 func isRateLimited(resp *http.Response) bool {
@@ -208,8 +500,8 @@ func isRateLimited(resp *http.Response) bool {
 // retryWait returns how long to wait before retrying, capped at max.
 // It reads Retry-After (seconds) first, then X-RateLimit-Reset (Unix timestamp).
 // Falls back to max if neither header is present or parseable.
-func retryWait(resp *http.Response, max time.Duration) time.Duration {
-	if v := resp.Header.Get("Retry-After"); v != "" {
+func retryWait(header http.Header, max time.Duration) time.Duration {
+	if v := header.Get("Retry-After"); v != "" {
 		if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
 			d := time.Duration(secs) * time.Second
 			if d < max {
@@ -218,7 +510,7 @@ func retryWait(resp *http.Response, max time.Duration) time.Duration {
 			return max
 		}
 	}
-	if v := resp.Header.Get("X-RateLimit-Reset"); v != "" {
+	if v := header.Get("X-RateLimit-Reset"); v != "" {
 		if ts, err := strconv.ParseInt(v, 10, 64); err == nil {
 			d := time.Until(time.Unix(ts, 0))
 			if d > 0 && d < max {
diff --git a/internal/tool/executor_test.go b/internal/tool/executor_test.go
index ca5a141..e7ff01b 100644
--- a/internal/tool/executor_test.go
+++ b/internal/tool/executor_test.go
@@ -5,11 +5,14 @@ import (
 	"fmt"
 	"net/http"
 	"net/http/httptest"
+	"net/url"
 	"strings"
+	"sync/atomic"
 	"testing"
 	"time"
 
 	"github.com/tgpski/leather/internal/model"
+	"github.com/tgpski/leather/internal/queue"
 )
 
 func TestExecute_HTTPSuccess(t *testing.T) {
@@ -170,21 +173,26 @@ func TestExecHTTP_429RetrySucceeds(t *testing.T) {
 	}))
 	defer srv.Close()
 
-	cfg := model.HTTPToolConfig{Method: "GET", URL: srv.URL}
-	got, err := execHTTP(context.Background(), cfg, nil, nil)
-	if err != nil {
-		t.Fatalf("execHTTP: %v", err)
+	def := model.ToolDefinition{
+		Name:  "retry_tool",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 3, BaseDelay: 0, MaxDelay: time.Second, HonorRetryAfter: true},
+	}
+	result := (&Executor{}).Execute(context.Background(), def, nil)
+	if result.Error != "" {
+		t.Fatalf("Execute error: %s", result.Error)
 	}
-	if got != "ok after retry" {
-		t.Errorf("content = %q, want %q", got, "ok after retry")
+	if result.Content != "ok after retry" {
+		t.Errorf("content = %q, want %q", result.Content, "ok after retry")
 	}
 	if calls != 2 {
 		t.Errorf("server called %d times, want 2", calls)
 	}
 }
 
-// TestExecHTTP_429NoRetryReturnsError verifies that when allowRetry=false a
-// 429 response is returned as an error (no further retries).
+// TestExecHTTP_429NoRetryReturnsError verifies that when MaxAttempts=1 a
+// 429 response is returned as an error without retrying.
 func TestExecHTTP_429NoRetryReturnsError(t *testing.T) {
 	calls := 0
 	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -195,13 +203,18 @@ func TestExecHTTP_429NoRetryReturnsError(t *testing.T) {
 	}))
 	defer srv.Close()
 
-	cfg := model.HTTPToolConfig{Method: "GET", URL: srv.URL}
-	_, err := execHTTPInner(context.Background(), cfg, nil, nil, false)
-	if err == nil {
-		t.Fatal("expected error, got nil")
+	def := model.ToolDefinition{
+		Name:  "no_retry_tool",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 1}, // disable retries
+	}
+	result := (&Executor{}).Execute(context.Background(), def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error, got empty")
 	}
-	if !strings.Contains(err.Error(), "429") {
-		t.Errorf("error = %q, want to contain 429", err)
+	if !strings.Contains(result.Error, "429") {
+		t.Errorf("error = %q, want to contain 429", result.Error)
 	}
 	if calls != 1 {
 		t.Errorf("server called %d times, want 1", calls)
@@ -226,13 +239,18 @@ func TestExecHTTP_403RateLimitRetry(t *testing.T) {
 	}))
 	defer srv.Close()
 
-	cfg := model.HTTPToolConfig{Method: "GET", URL: srv.URL}
-	got, err := execHTTP(context.Background(), cfg, nil, nil)
-	if err != nil {
-		t.Fatalf("execHTTP: %v", err)
+	def := model.ToolDefinition{
+		Name:  "rate_limit_tool",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 3, BaseDelay: 0, MaxDelay: time.Second, HonorRetryAfter: true},
 	}
-	if got != "ok" {
-		t.Errorf("content = %q, want ok", got)
+	result := (&Executor{}).Execute(context.Background(), def, nil)
+	if result.Error != "" {
+		t.Fatalf("Execute error: %s", result.Error)
+	}
+	if result.Content != "ok" {
+		t.Errorf("content = %q, want ok", result.Content)
 	}
 	if calls != 2 {
 		t.Errorf("server called %d times, want 2", calls)
@@ -249,27 +267,21 @@ func TestExecHTTP_RetryContextCancelled(t *testing.T) {
 	defer srv.Close()
 
 	ctx, cancel := context.WithCancel(context.Background())
-
-	// Use a background context for the initial request so we reach the wait,
-	// then cancel while waiting.
-	waitCh := make(chan struct{})
-	srv2 := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-		w.Header().Set("Retry-After", "60")
-		w.WriteHeader(http.StatusTooManyRequests)
-		close(waitCh)
-	}))
-	defer srv2.Close()
-
 	cancel() // cancel before any request
-	cfg := model.HTTPToolConfig{Method: "GET", URL: srv.URL}
-	_, err := execHTTP(ctx, cfg, nil, nil)
-	if err == nil {
-		t.Fatal("expected error, got nil")
+
+	def := model.ToolDefinition{
+		Name: "cancel_tool",
+		Type: "http",
+		HTTP: model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+	}
+	result := (&Executor{}).Execute(ctx, def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error, got empty")
 	}
 	// The error may come from the HTTP client (context canceled before request)
 	// or from our wait select (rate limit wait cancelled). Both are acceptable.
-	if !strings.Contains(err.Error(), "cancel") {
-		t.Errorf("error = %q, want to contain 'cancel'", err)
+	if !strings.Contains(result.Error, "cancel") {
+		t.Errorf("error = %q, want to contain 'cancel'", result.Error)
 	}
 }
 
@@ -309,44 +321,44 @@ func TestRetryWait(t *testing.T) {
 	max := 60 * time.Second
 
 	t.Run("Retry-After within max", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
-		resp.Header.Set("Retry-After", "30")
-		if got := retryWait(resp, max); got != 30*time.Second {
+		h := make(http.Header)
+		h.Set("Retry-After", "30")
+		if got := retryWait(h, max); got != 30*time.Second {
 			t.Errorf("got %v, want 30s", got)
 		}
 	})
 
 	t.Run("Retry-After exceeds max", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
-		resp.Header.Set("Retry-After", "120")
-		if got := retryWait(resp, max); got != max {
+		h := make(http.Header)
+		h.Set("Retry-After", "120")
+		if got := retryWait(h, max); got != max {
 			t.Errorf("got %v, want %v", got, max)
 		}
 	})
 
 	t.Run("Retry-After zero", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
-		resp.Header.Set("Retry-After", "0")
+		h := make(http.Header)
+		h.Set("Retry-After", "0")
 		// 0 is a valid value meaning "retry immediately"
-		if got := retryWait(resp, max); got != 0 {
+		if got := retryWait(h, max); got != 0 {
 			t.Errorf("got %v, want 0", got)
 		}
 	})
 
 	t.Run("X-RateLimit-Reset in the past", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
+		h := make(http.Header)
 		past := fmt.Sprintf("%d", time.Now().Add(-10*time.Second).Unix())
-		resp.Header.Set("X-RateLimit-Reset", past)
-		if got := retryWait(resp, max); got != 0 {
+		h.Set("X-RateLimit-Reset", past)
+		if got := retryWait(h, max); got != 0 {
 			t.Errorf("got %v, want 0 (past timestamp)", got)
 		}
 	})
 
 	t.Run("X-RateLimit-Reset in future within max", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
+		h := make(http.Header)
 		future := fmt.Sprintf("%d", time.Now().Add(10*time.Second).Unix())
-		resp.Header.Set("X-RateLimit-Reset", future)
-		got := retryWait(resp, max)
+		h.Set("X-RateLimit-Reset", future)
+		got := retryWait(h, max)
 		// allow ±2s tolerance around 10s
 		if got < 8*time.Second || got > 12*time.Second {
 			t.Errorf("got %v, want ~10s", got)
@@ -354,18 +366,282 @@ func TestRetryWait(t *testing.T) {
 	})
 
 	t.Run("X-RateLimit-Reset exceeds max", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
+		h := make(http.Header)
 		far := fmt.Sprintf("%d", time.Now().Add(120*time.Second).Unix())
-		resp.Header.Set("X-RateLimit-Reset", far)
-		if got := retryWait(resp, max); got != max {
+		h.Set("X-RateLimit-Reset", far)
+		if got := retryWait(h, max); got != max {
 			t.Errorf("got %v, want %v", got, max)
 		}
 	})
 
 	t.Run("no headers fallback to max", func(t *testing.T) {
-		resp := &http.Response{Header: make(http.Header)}
-		if got := retryWait(resp, max); got != max {
+		h := make(http.Header)
+		if got := retryWait(h, max); got != max {
 			t.Errorf("got %v, want %v", got, max)
 		}
 	})
 }
+
+// --- Issue #7: retry policy ---
+
+// TestExecHTTP_TransientRetryExhausted verifies that 3 consecutive 500 responses
+// exhaust the retry budget and return an error after MaxAttempts calls.
+func TestExecHTTP_TransientRetryExhausted(t *testing.T) {
+	calls := 0
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		calls++
+		w.WriteHeader(http.StatusInternalServerError)
+		_, _ = w.Write([]byte("server error"))
+	}))
+	defer srv.Close()
+
+	def := model.ToolDefinition{
+		Name:  "retry_exhaust",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 3, BaseDelay: 0, MaxDelay: time.Millisecond},
+	}
+	result := (&Executor{}).Execute(context.Background(), def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error after exhausted retries, got empty")
+	}
+	if !strings.Contains(result.Error, "500") {
+		t.Errorf("error = %q, want to contain 500", result.Error)
+	}
+	if calls != 3 {
+		t.Errorf("server called %d times, want 3", calls)
+	}
+}
+
+// TestExecHTTP_TransientRetrySucceeds verifies that transient failures before
+// the last attempt still result in success.
+func TestExecHTTP_TransientRetrySucceeds(t *testing.T) {
+	calls := 0
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		calls++
+		if calls < 3 {
+			w.WriteHeader(http.StatusServiceUnavailable)
+			_, _ = w.Write([]byte("unavailable"))
+			return
+		}
+		w.WriteHeader(http.StatusOK)
+		_, _ = w.Write([]byte("success"))
+	}))
+	defer srv.Close()
+
+	def := model.ToolDefinition{
+		Name:  "retry_success",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 3, BaseDelay: 0, MaxDelay: time.Millisecond},
+	}
+	result := (&Executor{}).Execute(context.Background(), def, nil)
+	if result.Error != "" {
+		t.Fatalf("Execute error: %s", result.Error)
+	}
+	if result.Content != "success" {
+		t.Errorf("content = %q, want success", result.Content)
+	}
+	if calls != 3 {
+		t.Errorf("server called %d times, want 3", calls)
+	}
+}
+
+// TestExecHTTP_PermanentNoRetry verifies that a 400 (permanent) returns
+// immediately without retrying.
+func TestExecHTTP_PermanentNoRetry(t *testing.T) {
+	calls := 0
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		calls++
+		w.WriteHeader(http.StatusBadRequest)
+		_, _ = w.Write([]byte("bad request"))
+	}))
+	defer srv.Close()
+
+	def := model.ToolDefinition{
+		Name:  "perm_fail",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 3, BaseDelay: 0, MaxDelay: time.Millisecond},
+	}
+	result := (&Executor{}).Execute(context.Background(), def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error for 400, got empty")
+	}
+	if calls != 1 {
+		t.Errorf("server called %d times, want 1 (no retry on permanent failure)", calls)
+	}
+}
+
+// TestIsTransient covers the isTransient helper with a representative set of codes.
+func TestIsTransient(t *testing.T) {
+	cases := []struct {
+		name   string
+		status int
+		want   bool
+	}{
+		{"429 transient", 429, true},
+		{"500 transient", 500, true},
+		{"502 transient", 502, true},
+		{"503 transient", 503, true},
+		{"504 transient", 504, true},
+		{"400 permanent", 400, false},
+		{"401 permanent", 401, false},
+		{"404 permanent", 404, false},
+		{"403 no header permanent", 403, false},
+		{"200 not transient", 200, false},
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			err := &httpError{status: tc.status, header: make(http.Header)}
+			if got := isTransient(tc.status, err); got != tc.want {
+				t.Errorf("isTransient(%d) = %v, want %v", tc.status, got, tc.want)
+			}
+		})
+	}
+}
+
+// --- Issue #8: DLQ enqueue ---
+
+// TestExecute_DLQEnqueueOnExhaustion verifies that after MaxAttempts transient
+// failures the item is enqueued to outbound-dlq.
+func TestExecute_DLQEnqueueOnExhaustion(t *testing.T) {
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusInternalServerError)
+	}))
+	defer srv.Close()
+
+	queueDir := t.TempDir()
+	mgr := newTestQueueManager(t, queueDir)
+
+	def := model.ToolDefinition{
+		Name:  "dlq_exhaust",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 2, BaseDelay: 0, MaxDelay: time.Millisecond},
+	}
+	exec := &Executor{QueueMgr: mgr, AgentName: "test-agent"}
+	result := exec.Execute(context.Background(), def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error, got empty")
+	}
+
+	dlqQ, err := mgr.Get(outboundDLQName)
+	if err != nil {
+		t.Fatalf("get dlq: %v", err)
+	}
+	items := dlqQ.Scan()
+	if len(items) != 1 {
+		t.Fatalf("dlq depth = %d, want 1", len(items))
+	}
+	if items[0].ToolName != "dlq_exhaust" {
+		t.Errorf("ToolName = %q, want dlq_exhaust", items[0].ToolName)
+	}
+	if items[0].AgentName != "test-agent" {
+		t.Errorf("AgentName = %q, want test-agent", items[0].AgentName)
+	}
+}
+
+// TestExecute_DLQEnqueueOnPermanent verifies that a 400 (permanent failure)
+// immediately enqueues to outbound-dlq when QueueMgr is set.
+func TestExecute_DLQEnqueueOnPermanent(t *testing.T) {
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusBadRequest)
+	}))
+	defer srv.Close()
+
+	queueDir := t.TempDir()
+	mgr := newTestQueueManager(t, queueDir)
+
+	def := model.ToolDefinition{
+		Name:  "perm_dlq",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 3, BaseDelay: 0, MaxDelay: time.Millisecond},
+	}
+	exec := &Executor{QueueMgr: mgr}
+	result := exec.Execute(context.Background(), def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error, got empty")
+	}
+
+	dlqQ, err := mgr.Get(outboundDLQName)
+	if err != nil {
+		t.Fatalf("get dlq: %v", err)
+	}
+	items := dlqQ.Scan()
+	if len(items) != 1 {
+		t.Fatalf("dlq depth = %d, want 1", len(items))
+	}
+}
+
+// TestExecute_NoDLQWhenQueueMgrNil verifies that a nil QueueMgr does not panic
+// and the tool result error is still set.
+func TestExecute_NoDLQWhenQueueMgrNil(t *testing.T) {
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusInternalServerError)
+	}))
+	defer srv.Close()
+
+	def := model.ToolDefinition{
+		Name:  "no_dlq",
+		Type:  "http",
+		HTTP:  model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+		Retry: model.ToolRetryConfig{MaxAttempts: 2, BaseDelay: 0, MaxDelay: time.Millisecond},
+	}
+	result := (&Executor{QueueMgr: nil}).Execute(context.Background(), def, nil)
+	if result.Error == "" {
+		t.Fatal("expected error, got empty")
+	}
+}
+
+// --- Issue #9: rate limit counter ---
+
+// TestExecute_RateLimitWaitIncrementsCounter verifies that a throttled host
+// increments the metricRateLimitWaitTotal counter.
+func TestExecute_RateLimitWaitIncrementsCounter(t *testing.T) {
+	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusOK)
+		_, _ = w.Write([]byte("ok"))
+	}))
+	defer srv.Close()
+
+	// (*url.URL).Hostname() strips the port; use that value as the limiter key.
+	parsed, _ := url.Parse(srv.URL)
+	host := parsed.Hostname()
+	limiter, err := NewHostLimiter(map[string]string{host: "1/s"})
+	if err != nil {
+		t.Fatalf("NewHostLimiter: %v", err)
+	}
+
+	def := model.ToolDefinition{
+		Name: "rate_counter",
+		Type: "http",
+		HTTP: model.HTTPToolConfig{Method: "GET", URL: srv.URL},
+	}
+
+	before := atomic.LoadInt64(&metricRateLimitWaitTotal)
+
+	// First call: token available, no wait.
+	result := (&Executor{Limiter: limiter}).Execute(context.Background(), def, nil)
+	if result.Error != "" {
+		t.Fatalf("first Execute error: %s", result.Error)
+	}
+
+	// Second call: token not yet refilled, should wait and increment counter.
+	result = (&Executor{Limiter: limiter}).Execute(context.Background(), def, nil)
+	if result.Error != "" {
+		t.Fatalf("second Execute error: %s", result.Error)
+	}
+
+	after := atomic.LoadInt64(&metricRateLimitWaitTotal)
+	if after <= before {
+		t.Errorf("metricRateLimitWaitTotal = %d, want > %d (counter not incremented)", after, before)
+	}
+}
+
+// newTestQueueManager creates a queue.Manager backed by a temp directory.
+func newTestQueueManager(t *testing.T, dir string) *queue.Manager {
+	t.Helper()
+	return queue.NewManager(dir)
+}
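
Reviewer sketch for the retry-policy tests above. TestIsTransient pins down the classification table: 429 and 5xx retry, other 4xx do not, and 200 is not a failure. The block below is a minimal stand-alone restatement of that table, not the patch's `isTransient`. The name `isTransientSketch` and the `networkErr` flag are assumptions made for illustration, and the header-dependent 403 case hinted at by the "403 no header permanent" row is deliberately left out because this hunk does not show it.

```go
package main

import "fmt"

// isTransientSketch is a hypothetical stand-in for the patch's isTransient.
// networkErr marks a failure that produced no HTTP response at all; that
// encoding is an assumption of this sketch.
func isTransientSketch(status int, networkErr bool) bool {
	switch {
	case networkErr:
		return true // connection-level failures are worth retrying
	case status == 429:
		return true // rate limited; a later attempt may succeed
	case status >= 500 && status <= 599:
		return true // server-side failure
	default:
		return false // 2xx/3xx are not failures; other 4xx are permanent
	}
}

func main() {
	for _, s := range []int{200, 400, 404, 429, 500, 503} {
		fmt.Printf("%d transient=%v\n", s, isTransientSketch(s, false))
	}
}
```

Keeping the classifier a pure function of status and error kind is what lets the table-driven test cover it without starting a server.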
diff --git a/internal/tool/ratelimit.go b/internal/tool/ratelimit.go
new file mode 100644
index 0000000..45066c4
--- /dev/null
+++ b/internal/tool/ratelimit.go
@@ -0,0 +1,142 @@
+package tool
+
+import (
+	"context"
+	"fmt"
+	"strconv"
+	"strings"
+	"sync"
+	"time"
+)
+
+// HostLimiter applies per-host token-bucket rate limiting.
+// Each host bucket allows up to burst tokens immediately; thereafter one token
+// is added every interval. A nil *HostLimiter is safe to use: Wait returns
+// immediately without blocking when the receiver is nil.
+type HostLimiter struct {
+	mu      sync.Mutex
+	buckets map[string]*tokenBucket
+}
+
+// tokenBucket is a simple stdlib-only token bucket per host.
+type tokenBucket struct {
+	interval time.Duration // time between token refills
+	burst    int           // maximum tokens (also initial fill)
+	tokens   int
+	last     time.Time
+}
+
+// take attempts to consume one token. It returns (true, 0) if a token was
+// available immediately, or (false, wait) where wait is the duration until the
+// next token arrives. The caller is responsible for sleeping.
+func (b *tokenBucket) take(now time.Time) (immediate bool, wait time.Duration) {
+	// Refill tokens based on elapsed time.
+	elapsed := now.Sub(b.last)
+	if elapsed >= b.interval {
+		added := int(elapsed / b.interval)
+		b.tokens += added
+		if b.tokens > b.burst {
+			b.tokens = b.burst
+		}
+		b.last = b.last.Add(time.Duration(added) * b.interval)
+	}
+
+	if b.tokens > 0 {
+		b.tokens--
+		return true, 0
+	}
+	// Calculate wait until the next token.
+	wait = b.interval - now.Sub(b.last)
+	return false, wait
+}
+
+// NewHostLimiter builds a HostLimiter from a host→rateSpec map.
+// Each rateSpec is "N/s", "N/m", or "N/h" where N is the token count per period.
+// An empty map returns a limiter that never blocks.
+func NewHostLimiter(specs map[string]string) (*HostLimiter, error) {
+	l := &HostLimiter{buckets: make(map[string]*tokenBucket, len(specs))}
+	for host, spec := range specs {
+		interval, burst, err := parseRateSpec(spec)
+		if err != nil {
+			return nil, fmt.Errorf("host limiter %q: %w", host, err)
+		}
+		l.buckets[host] = &tokenBucket{
+			interval: interval,
+			burst:    burst,
+			tokens:   burst,
+			last:     time.Now(),
+		}
+	}
+	return l, nil
+}
+
+// Wait blocks until the token bucket for host allows one more request, or until
+// ctx is cancelled. It returns (true, nil) when it had to wait, (false, nil)
+// when the token was available immediately, and (false, err) on context cancel.
+// Hosts with no configured limit pass through immediately.
+func (l *HostLimiter) Wait(ctx context.Context, host string) (waited bool, err error) {
+	if l == nil {
+		return false, nil
+	}
+	l.mu.Lock()
+	b, ok := l.buckets[host]
+	if !ok {
+		l.mu.Unlock()
+		return false, nil
+	}
+	immediate, wait := b.take(time.Now())
+	l.mu.Unlock()
+
+	if immediate {
+		return false, nil
+	}
+
+	// Need to wait for a token.
+	select {
+	case <-time.After(wait):
+		// Consume the token that just became available.
+		l.mu.Lock()
+		b.take(time.Now())
+		l.mu.Unlock()
+		return true, nil
+	case <-ctx.Done():
+		return false, ctx.Err()
+	}
+}
+
+// parseRateSpec parses "N/s", "N/m", or "N/h" into an (interval, burst) pair.
+// interval is the time between individual token refills (period / N, floored
+// at 1ms). burst is N, which is both the bucket capacity and its initial fill.
+func parseRateSpec(spec string) (interval time.Duration, burst int, err error) {
+	spec = strings.TrimSpace(spec)
+	slash := strings.LastIndex(spec, "/")
+	if slash < 0 {
+		return 0, 0, fmt.Errorf("rate spec %q: missing '/' separator (want N/s, N/m, or N/h)", spec)
+	}
+	countStr := strings.TrimSpace(spec[:slash])
+	unit := strings.TrimSpace(spec[slash+1:])
+
+	n, convErr := strconv.Atoi(countStr)
+	if convErr != nil || n <= 0 {
+		return 0, 0, fmt.Errorf("rate spec %q: count must be a positive integer", spec)
+	}
+
+	var period time.Duration
+	switch strings.ToLower(unit) {
+	case "s":
+		period = time.Second
+	case "m":
+		period = time.Minute
+	case "h":
+		period = time.Hour
+	default:
+		return 0, 0, fmt.Errorf("rate spec %q: unknown unit %q (want s, m, or h)", spec, unit)
+	}
+
+	// interval = period / N so that N tokens are available per period.
+	interval = period / time.Duration(n)
+	if interval < time.Millisecond {
+		interval = time.Millisecond // floor to avoid spinning
+	}
+	return interval, n, nil
+}
diff --git a/internal/tool/ratelimit_test.go b/internal/tool/ratelimit_test.go
new file mode 100644
index 0000000..82d5df5
--- /dev/null
+++ b/internal/tool/ratelimit_test.go
@@ -0,0 +1,178 @@
+package tool
+
+import (
+	"context"
+	"testing"
+	"time"
+)
+
+// TestParseRateSpec covers valid and invalid rate spec strings.
+func TestParseRateSpec(t *testing.T) {
+	cases := []struct {
+		spec        string
+		wantBurst   int
+		wantErrText string
+	}{
+		{"1/s", 1, ""},
+		{"60/m", 60, ""},
+		{"5000/h", 5000, ""},
+		{"10/S", 10, ""}, // case-insensitive unit
+		{"10/M", 10, ""},
+		{"1/H", 1, ""},
+		{"0/s", 0, "positive integer"},   // zero count
+		{"-1/s", 0, "positive integer"},  // negative
+		{"abc/s", 0, "positive integer"}, // non-numeric count
+		{"10/x", 0, "unknown unit"},      // bad unit
+		{"10", 0, "missing '/'"},         // no slash
+		{"", 0, "missing '/'"},           // empty
+	}
+	for _, tc := range cases {
+		t.Run(tc.spec, func(t *testing.T) {
+			interval, burst, err := parseRateSpec(tc.spec)
+			if tc.wantErrText != "" {
+				if err == nil {
+					t.Fatalf("parseRateSpec(%q): want error containing %q, got nil", tc.spec, tc.wantErrText)
+				}
+				if !containsStr(err.Error(), tc.wantErrText) {
+					t.Errorf("error = %q, want to contain %q", err.Error(), tc.wantErrText)
+				}
+				return
+			}
+			if err != nil {
+				t.Fatalf("parseRateSpec(%q): unexpected error: %v", tc.spec, err)
+			}
+			if burst != tc.wantBurst {
+				t.Errorf("burst = %d, want %d", burst, tc.wantBurst)
+			}
+			if interval <= 0 {
+				t.Errorf("interval = %v, want > 0", interval)
+			}
+		})
+	}
+}
+
+// TestNewHostLimiter_InvalidSpec verifies that a bad rate spec returns an error.
+func TestNewHostLimiter_InvalidSpec(t *testing.T) {
+	_, err := NewHostLimiter(map[string]string{"example.com": "bad"})
+	if err == nil {
+		t.Fatal("expected error for invalid spec, got nil")
+	}
+}
+
+// TestHostLimiter_NilPassThrough verifies that a nil limiter's Wait returns
+// immediately without blocking or erroring.
+func TestHostLimiter_NilPassThrough(t *testing.T) {
+	var l *HostLimiter
+	waited, err := l.Wait(context.Background(), "any.host")
+	if err != nil {
+		t.Errorf("Wait on nil limiter returned error: %v", err)
+	}
+	if waited {
+		t.Error("Wait on nil limiter reported waited=true, want false")
+	}
+}
+
+// TestHostLimiter_UnknownHostPassThrough verifies that a host with no configured
+// limit passes through immediately.
+func TestHostLimiter_UnknownHostPassThrough(t *testing.T) {
+	l, err := NewHostLimiter(map[string]string{"other.host": "1/s"})
+	if err != nil {
+		t.Fatalf("NewHostLimiter: %v", err)
+	}
+	waited, err := l.Wait(context.Background(), "unknown.host")
+	if err != nil {
+		t.Errorf("Wait returned error: %v", err)
+	}
+	if waited {
+		t.Error("Wait for unconfigured host reported waited=true, want false")
+	}
+}
+
+// TestHostLimiter_FirstCallImmediate verifies that the first call on a fresh
+// bucket does not block (burst token available).
+func TestHostLimiter_FirstCallImmediate(t *testing.T) {
+	l, err := NewHostLimiter(map[string]string{"api.example.com": "1/s"})
+	if err != nil {
+		t.Fatalf("NewHostLimiter: %v", err)
+	}
+	start := time.Now()
+	waited, err := l.Wait(context.Background(), "api.example.com")
+	if err != nil {
+		t.Fatalf("Wait: %v", err)
+	}
+	elapsed := time.Since(start)
+	if waited {
+		t.Error("first call should not wait (burst token available)")
+	}
+	if elapsed > 50*time.Millisecond {
+		t.Errorf("first call took %v, want < 50ms", elapsed)
+	}
+}
+
+// TestHostLimiter_ThrottledSecondCall verifies that with a 2/s limit the third
+// back-to-back call blocks until the next token is available.
+func TestHostLimiter_ThrottledSecondCall(t *testing.T) {
+	l, err := NewHostLimiter(map[string]string{"slow.example.com": "2/s"})
+	if err != nil {
+		t.Fatalf("NewHostLimiter: %v", err)
+	}
+
+	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
+	defer cancel()
+
+	// Exhaust the burst (2 tokens).
+	for i := 0; i < 2; i++ {
+		if _, err := l.Wait(ctx, "slow.example.com"); err != nil {
+			t.Fatalf("Wait %d: %v", i, err)
+		}
+	}
+
+	// Third call should block briefly (waiting for next token at ~500ms interval).
+	start := time.Now()
+	waited, err := l.Wait(ctx, "slow.example.com")
+	elapsed := time.Since(start)
+	if err != nil {
+		t.Fatalf("Wait throttled: %v", err)
+	}
+	if !waited {
+		t.Error("third call should have waited, got waited=false")
+	}
+	// Interval for 2/s is 500ms; we just need some positive wait.
+	if elapsed < time.Millisecond {
+		t.Errorf("throttled call took %v, want >= 1ms", elapsed)
+	}
+}
+
+// TestHostLimiter_ContextCancel verifies that cancelling the context while
+// waiting for a token returns the context error.
+func TestHostLimiter_ContextCancel(t *testing.T) {
+	l, err := NewHostLimiter(map[string]string{"blocked.example.com": "1/h"})
+	if err != nil {
+		t.Fatalf("NewHostLimiter: %v", err)
+	}
+
+	// Consume the only burst token.
+	ctx := context.Background()
+	if _, err := l.Wait(ctx, "blocked.example.com"); err != nil {
+		t.Fatalf("first Wait: %v", err)
+	}
+
+	// Next call should block for ~1h; cancel it immediately.
+	cancelCtx, cancel := context.WithCancel(ctx)
+	cancel()
+	_, err = l.Wait(cancelCtx, "blocked.example.com")
+	if err == nil {
+		t.Fatal("expected context error, got nil")
+	}
+}
+
+func containsStr(s, sub string) bool {
+	return len(sub) == 0 || (len(s) >= len(sub) && func() bool {
+		for i := 0; i <= len(s)-len(sub); i++ {
+			if s[i:i+len(sub)] == sub {
+				return true
+			}
+		}
+		return false
+	}())
+}
diff --git a/schemas/skill-1.schema.yaml b/schemas/skill-1.schema.yaml
index e9bb22a..50a8071 100644
--- a/schemas/skill-1.schema.yaml
+++ b/schemas/skill-1.schema.yaml
@@ -142,3 +142,37 @@ properties:
             tool:
               type: string
               description: MCP tool name to invoke on the server.
+
+        retry:
+          type: object
+          additionalProperties: false
+          description: |
+            Optional per-tool retry policy for transient failures (5xx, 429, network errors).
+            When omitted, no retries are performed (single attempt, legacy behaviour).
+            Defaults apply when a field is omitted but max_attempts is set.
+          properties:
+            max_attempts:
+              type: integer
+              minimum: 1
+              description: |
+                Total number of attempts (initial + retries). 1 disables retries.
+                Default: 3 when the retry block is present and this field is omitted.
+              examples: [3]
+            base_delay:
+              type: string
+              description: |
+                Initial backoff duration (e.g. "1s", "500ms"). Doubles each retry up
+                to max_delay. Default: "1s".
+              examples: ["1s", "500ms"]
+            max_delay:
+              type: string
+              description: |
+                Backoff ceiling. Delays are capped at this value before jitter is added.
+                Default: "30s".
+              examples: ["30s", "60s"]
+            honor_retry_after:
+              type: boolean
+              description: |
+                When true, use the Retry-After response header value as the backoff
+                delay instead of the computed exponential value. Default: true.
+              default: true

From 71b66a874dff82b00fccbb2ff63c6a3109f73491 Mon Sep 17 00:00:00 2001
From: Tyler Pate <tyler.graham.pate@gmail.com>
Date: Fri, 5 Jun 2026 20:57:54 -0700
Subject: [PATCH 2/3] skills: release-prep adds ROADMAP.md and SECURITY.md
 update steps

Add Step 4 (strike shipped items in ROADMAP.md, update _Last reviewed_)
and Step 5 (update Supported Versions table in SECURITY.md) to the
release-prep skill. Renumber downstream steps; add both files to git add
and the pre-handoff checklist.
---
 .agents/skills/release-prep/SKILL.md | 45 ++++++++++++++++++++++++++--
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/.agents/skills/release-prep/SKILL.md b/.agents/skills/release-prep/SKILL.md
index 945c002..2060f9a 100644
--- a/.agents/skills/release-prep/SKILL.md
+++ b/.agents/skills/release-prep/SKILL.md
@@ -66,7 +66,44 @@ Use `grep -rn "LAST_VERSION"` to find any other version-pinned references.
 
 ---
 
-## Step 4 — Verify subcommand tables are current
+## Step 4 — Update ROADMAP.md
+
+Open `ROADMAP.md` and apply these changes:
+
+1. **Mark shipped items.** For each feature or item in the roadmap that was
+   delivered in the commits since LAST_TAG, strike through the bullet with
+   `~~text~~` and append `— shipped in NEXT_VERSION (issue #N if known)`.
+   Use the commit subjects and the CHANGELOG section you just wrote as the
+   source of truth.
+
+2. **Strike the outbound HTTP tool resilience block** if it is present
+   verbatim and has now shipped (it shipped in v0.3.0 as issues #7–#9).
+   Replace the prose block with a single struck-through line:
+   `- ~~Outbound HTTP tool resilience (retry, DLQ, per-host rate limits)~~ — shipped in NEXT_VERSION (#7, #8, #9).`
+
+3. **Add a new version section** for the _next_ planned minor if none
+   exists, or leave existing planned work in place if a forward-looking
+   section is already present.
+
+4. Update the `_Last reviewed:` footer to TODAY.
+
+---
+
+## Step 5 — Update SECURITY.md
+
+Open `SECURITY.md` and update the **Supported Versions** table:
+
+1. Add a new row for `NEXT_VERSION_MINOR.x` (the X.Y portion of
+   NEXT_VERSION) with `:white_check_mark:`.
+2. If LAST_VERSION was on a different minor line (e.g. LAST_VERSION on v0.2.x
+   and NEXT_VERSION = v0.3.0), mark the old minor line as `:x:` (end of
+   support); leather supports only the current minor line.
+3. If NEXT_VERSION is a patch on the same minor (e.g. v0.2.1 on v0.2.x),
+   no row change is needed; the existing row already covers it.
+
+---
+
+## Step 6 — Verify subcommand tables are current
 
 Confirm that every `Run*` function in `internal/cli/cli.go` has a corresponding
 row in each of these tables:
@@ -80,14 +117,14 @@ If any row is missing, add it before committing.
 
 ---
 
-## Step 5 — Commit and push
+## Step 7 — Commit and push
 
 Stay on the **current branch** — do not switch to or push directly to `main`.
 Stage all changed files and create one commit:
 
 ```
 CURRENT_BRANCH=$(git branch --show-current)
-git add CHANGELOG.md README.md docs/ .subagents/
+git add CHANGELOG.md README.md ROADMAP.md SECURITY.md docs/ .subagents/
 git commit -m "chore(release): prepare NEXT_VERSION"
 git push origin "$CURRENT_BRANCH"
 ```
@@ -108,6 +145,8 @@ Do not tag in this step. Tagging is the job of `leather-release-tag`.
 - [ ] NEXT_VERSION is set and justified
 - [ ] CHANGELOG has the new section with at least one bullet
 - [ ] No stale version string remains in docs (grep clean)
+- [ ] ROADMAP.md: shipped items struck through; `_Last reviewed:` updated
+- [ ] SECURITY.md: Supported Versions table reflects NEXT_VERSION
 - [ ] Subcommand tables are in sync
 - [ ] Commit is pushed to current branch (not directly to main)
 - [ ] PR is open targeting main (create one if it doesn't exist)

From 531ea9741b114ce4103cd2ca26c015ec6a676d34 Mon Sep 17 00:00:00 2001
From: Tyler Pate <tyler.graham.pate@gmail.com>
Date: Fri, 5 Jun 2026 21:06:24 -0700
Subject: [PATCH 3/3] add release and ci badges

---
 README.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 9a27416..bb131d2 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,10 @@
 [![Go Reference](https://pkg.go.dev/badge/github.com/tgpski/leather.svg)](https://pkg.go.dev/github.com/tgpski/leather)
 [![Go Version](https://img.shields.io/github/go-mod/go-version/TGPSKI/leather?v=2)](https://go.dev/)
 [![License](https://img.shields.io/github/license/TGPSKI/leather?v=2)](LICENSE)
+[![GitHub Release](https://img.shields.io/github/release/TGPSKI/leather.svg?style=flat)](https://github.com/TGPSKI/leather/releases)
+[![CI](https://github.com/TGPSKI/leather/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/TGPSKI/leather/actions/workflows/ci.yml)
+
+
 
 **Local agent infrastructure in one stdlib-only Go binary.**