TGPSKI · TGPSKI · Jun 6, 2026 · Jun 6, 2026 · Jun 6, 2026 · Jun 6, 2026
@@ -66,7 +66,44 @@ Use `grep -rn "LAST_VERSION"` to find any other version-pinned references.
 
 ---
 
-## Step 4 — Verify subcommand tables are current
+## Step 4 — Update ROADMAP.md
+
+Open `ROADMAP.md` and apply these changes:
+
+1. **Mark shipped items.** For each feature or item in the roadmap that was
+   delivered in the commits since LAST_TAG, strike through the bullet with
+   `~~text~~` and append `— shipped in NEXT_VERSION (issue #N if known)`.
+   Use the commit subjects and the CHANGELOG section you just wrote as the
+   source of truth.
+
+2. **Strike the outbound HTTP tool resilience block** if it is present
+   verbatim and has now shipped (it shipped in v0.3.0 as issues #7–#9).
+   Replace the prose block with a single struck-through line:
+   `- ~~Outbound HTTP tool resilience (retry, DLQ, per-host rate limits)~~ — shipped in NEXT_VERSION (#7, #8, #9).`
+
+3. **Add a new version section** for the _next_ planned minor if none
+   exists, or leave existing planned work in place if a forward-looking
+   section is already present.
+
+4. Update the `_Last reviewed:` footer to TODAY.
+
+---
+
+## Step 5 — Update SECURITY.md
+
+Open `SECURITY.md` and update the **Supported Versions** table:
+
+1. Add a new row for `NEXT_VERSION_MINOR.x` (the X.Y portion of
+   NEXT_VERSION) with `:white_check_mark:`.
+2. If LAST_VERSION was on a different minor line (e.g. LAST_TAG = v0.2.x
+   and NEXT_VERSION = v0.3.0), mark the old minor line as `:x:` (end of
+   support) — leather supports only the current minor line.
+3. If NEXT_VERSION is a patch on the same minor (e.g. v0.2.1 on v0.2.x),
+   no row change is needed; the existing row already covers it.
+
+---
+
+## Step 6 — Verify subcommand tables are current
 
 Confirm that every `Run*` function in `internal/cli/cli.go` has a corresponding
 row in each of these tables:
@@ -80,14 +117,14 @@ If any row is missing, add it before committing.
 
 ---
 
-## Step 5 — Commit and push
+## Step 7 — Commit and push
 
 Stay on the **current branch** — do not switch to or push directly to `main`.
 Stage all changed files and create one commit:
 
 ```
 CURRENT_BRANCH=$(git branch --show-current)
-git add CHANGELOG.md README.md docs/ .subagents/
+git add CHANGELOG.md README.md ROADMAP.md SECURITY.md docs/ .subagents/
 git commit -m "chore(release): prepare NEXT_VERSION"
 git push origin "$CURRENT_BRANCH"
 ```
@@ -108,6 +145,8 @@ Do not tag in this step. Tagging is the job of `leather-release-tag`.
 - [ ] NEXT_VERSION is set and justified
 - [ ] CHANGELOG has the new section with at least one bullet
 - [ ] No stale version string remains in docs (grep clean)
+- [ ] ROADMAP.md: shipped items struck through; `_Last reviewed:` updated
+- [ ] SECURITY.md: Supported Versions table reflects NEXT_VERSION
 - [ ] Subcommand tables are in sync
 - [ ] Commit is pushed to current branch (not directly to main)
 - [ ] PR is open targeting main (create one if it doesn't exist)

@@ -176,6 +176,10 @@ metric requires a dashboard update note in the PR description.
 | `leather_cache_hits_total` | counter | `kind` | Response-cache hits. |
 | `leather_cache_misses_total` | counter | `kind` | Response-cache misses. |
 | `leather_build_info` | gauge=1 | `version`, `commit` | Build metadata. |
+| `leather_tool_retry_total` | counter | _(none)_ | Tool call attempts beyond the first; tells the operator how often transient failures occur across all tools. |
+| `leather_tool_backoff_total` | counter | _(none)_ | Times a backoff sleep was applied (retry-after or exponential); indicates rate-limiting pressure from upstream services. |
+| `leather_tool_rate_limit_wait_total` | counter | _(none)_ | Times a tool call waited for a per-host token-bucket token; nonzero means the configured rate limits are actively throttling traffic. |
+| `leather_outbound_dlq_depth` | gauge | _(none)_ | Current item count in `outbound-dlq`; nonzero means tool failures need operator attention (`leather dlq inspect`). |
 
 Rules:
 

@@ -271,6 +271,21 @@ Before opening a PR:
 - [ ] No real credentials or secrets in `testdata/`
 - [ ] CI workflow action refs are SHA-pinned with version comments
 
+### Outbound tool resilience (retry, DLQ, rate limits)
+
+PRs touching `internal/tool`, `internal/config`, or `internal/cli/cmd_dlq.go`:
+
+- [ ] `go test ./internal/tool/... ./internal/cli/...` passes
+- [ ] `go test -race ./internal/tool/... ./internal/cli/...` passes
+- [ ] New tools with `retry:` config have tests covering transient retry and permanent no-retry paths
+- [ ] `TestExecute_DLQEnqueueOnExhaustion` and `TestExecute_DLQEnqueueOnPermanent` pass
+- [ ] `TestHostLimiter_*` suite passes; no real network calls
+- [ ] `TestRunDLQ*` suite passes with `t.TempDir()` state dirs
+- [ ] `leather dlq inspect` output includes ID, tool name, agent, attempt, error
+- [ ] `leather dlq requeue --state-dir ... <item-id>` (item-id **last**) moves item
+- [ ] `tools.rate_limits` in config.yaml parses without error; bad spec is warn+disable, not panic
+- [ ] `/metrics` response contains `leather_tool_retry_total`, `leather_tool_backoff_total`, `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth`
+
 ---
 
 _Last reviewed: 2026-06-05_ 
@@ -116,13 +116,60 @@ func (r *Registry) GetTools(skillNames []string) []model.ToolDefinition
 func (r *Registry) ResolveTools(skillNames, toolsetNames, toolNames []string) []model.ToolDefinition
 
 // Executor dispatches HTTP- and MCP-backed tools.
-type Executor struct { MCP *mcp.Registry }
+// QueueMgr enables the outbound DLQ: permanent/exhausted failures are
+// enqueued to "outbound-dlq" when this field is non-nil.
+// Limiter enforces per-host token-bucket rate limits before each attempt.
+type Executor struct {
+    MCP       *mcp.Registry
+    QueueMgr  *queue.Manager  // nil = outbound DLQ disabled
+    AgentName string           // injected by Runner; stored in DLQ items
+    Limiter   *HostLimiter     // nil = no rate limiting
+}
 
 // Execute runs a single tool call. Always returns a ToolResult; never panics.
 // On failure, ToolResult.Error is set and Content may be empty.
+// With def.Retry.MaxAttempts > 0, transient failures (5xx, 429, network)
+// are retried with exponential backoff + jitter.
 func (e *Executor) Execute(ctx context.Context, def model.ToolDefinition, args map[string]any) model.ToolResult
+
+// MetricSnapshot returns a point-in-time snapshot of outbound tool counters.
+// Counters are process-lifetime atomics; they reset only on restart.
+func MetricSnapshot() (retryTotal, backoffTotal, rateLimitWaitTotal int64)
 ```
 
+#### Per-tool retry policy
+
+`ToolDefinition.Retry` (`model.ToolRetryConfig`) controls retry behaviour:
+
+| Field | Default | Meaning |
+|---|---|---|
+| `max_attempts` | 1 (no retry) | Total attempts including the initial one. |
+| `base_delay` | `1s` | Initial backoff; doubles each attempt. |
+| `max_delay` | `30s` | Backoff ceiling before jitter. |
+| `honor_retry_after` | `true` | Use `Retry-After` header value when present. |
+
+A zero `ToolRetryConfig` (all fields at zero value) means **single attempt, no
+retry** — preserving backward compatibility for tools that predate the policy.
+Only tools with an explicit `retry:` block in their skill YAML get the retry loop.
+
+`isTransient` classifies the failure to decide whether to retry:
+- **Transient** → retry: 429, 500, 502, 503, 504; network/timeout errors;
+  403 with `X-RateLimit-Remaining: 0`
+- **Permanent** → return immediately without retrying: all other 4xx
+
+#### Outbound DLQ
+
+When `Executor.QueueMgr != nil` and a tool call either:
+- exhausts its retry budget on transient errors, or
+- fails immediately with a permanent error,
+
+a `model.QueueItem` is enqueued to the well-known `"outbound-dlq"` queue.
+DLQ items carry `ToolName`, `ToolTarget`, and the last error in `Payload`.
+DLQ enqueue is a fire-and-forget side-effect; failure to enqueue is logged
+at `warn` and does not affect the returned `ToolResult`.
+
+Items in `outbound-dlq` can be inspected and requeued via `leather dlq`.
+
 #### Skill file format (`*.skill.yaml`)
 
 ```yaml
@@ -139,6 +186,11 @@ tools:
       headers:
         Authorization: "Bearer {{env:GITHUB_TOKEN}}"
         Accept: application/vnd.github+json
+    retry:
+      max_attempts: 3
+      base_delay: 1s
+      max_delay: 30s
+      honor_retry_after: true
 ```
 
 #### Toolset file format (`*.toolset.yaml`)
@@ -161,11 +213,15 @@ every toolset reference points at an already-known tool.
 - `mcp` → `execMCP` through a started `mcp.Registry`
 
 HTTP execution (`execHTTP`):
+- Calls `e.Limiter.Wait(ctx, host)` before each attempt; blocks until the
+  per-host token bucket allows the request or ctx is cancelled.
 - Expands URL template with tool call arguments (`{{.field}}`).
 - Expands `{{env:VAR}}` in header values; **never logs auth header values**.
 - Sends the request with the runner's context (inherits timeout).
 - Response body is capped at 1 MB.
 - Non-2xx responses populate `ToolResult.Error` with the status code message.
+- Transient failures are retried up to `def.Retry.MaxAttempts` times with
+  exponential backoff; permanent failures return immediately.
 
 MCP execution (`execMCP`):
 - Looks up the named server in the running registry.

@@ -46,6 +46,7 @@ Each subcommand has:
 | `run` | `RunOnce` | Load and execute a single agent, then exit |
 | `validate` | `RunValidate` | Parse and validate agent files; report errors |
 | `test-agent` | `RunTestAgent` | Execute an agent with `MockLLM` and print the turn transcript |
+| `dlq` | `RunDLQ` | Inspect and requeue outbound dead-letter queue items |
 | `status` | `RunStatus` | Print scheduler state, job history, token usage |
 | `ingest` | `RunIngest` | Store raw bytes as a hide and optionally enqueue for curing |
 | `snapshot` | `RunSnapshot` → `RunSnapshotSave` / `RunSnapshotRestore` | Save or restore a `tar.gz` point-in-time archive of runtime state |
@@ -212,8 +213,19 @@ cache_dir: ""               # empty = serve uses <state_dir>/cache
 mcp_servers_file: ""        # empty = run/serve/validate use ~/.leather/mcp-servers.yaml
 loop: 1                     # repeat leather run N times
 tannery: ""                 # path to tannery.yaml; empty = tannery disabled
+
+tools:
+  rate_limits:              # per-host token-bucket rate limits for outbound tool calls
+    api.github.com: "60/m"  # format: "N/s", "N/m", or "N/h"
+    api.example.com: "10/s"
 ```
 
+`tools.rate_limits` is a nested map. Each key is a hostname (no port, no
+scheme); the value is a rate spec in the form `N/<unit>` where unit is `s`
+(seconds), `m` (minutes), or `h` (hours). The second call to the same host
+within the interval blocks until the next token is available. Unknown hosts
+pass through immediately with no limiting.
+
 YAML keys are the snake_case equivalents of the flag names (strip `--`,
 replace `-` with `_`).
 
@@ -255,6 +267,24 @@ All endpoints live in `internal/cli/api_tannery.go`.
 | `/curings` | GET | `handleCurings` | List curing definitions with queue depth |
 | `/intake` | POST | `handleIntake` | Direct-ingest endpoint; writes hide from request body |
 
+### `leather dlq` subcommand (`internal/cli/cmd_dlq.go`)
+
+`RunDLQ(args, stdout, stderr)` dispatches `inspect` and `requeue`:
+
+```
+leather dlq inspect [--queue outbound-dlq] [--state-dir ...]
+leather dlq requeue [--queue outbound-dlq] [--work-queue <name>] [--state-dir ...] <item-id>
+```
+
+- **`inspect`** — lists all items in the DLQ; prints `ID | tool | agent | attempt | enqueued_at | error`.
+- **`requeue`** — moves the named item from the DLQ to `--work-queue`, resetting
+  `AttemptCount` to 0 so it gets a fresh retry budget. Default `--work-queue` is
+  the DLQ name with the `-dlq` suffix stripped.
+
+**Important**: `<item-id>` must come **after** all flags. Go's `flag.FlagSet`
+stops parsing at the first non-flag token, so placing `<item-id>` before flags
+silently ignores the remaining flags.
+
 ### `leather ingest` subcommand (`internal/cli/cmd_ingest.go`)
 
 `RunIngest(args, stdin, stdout, stderr)` — reads body from `--file` or stdin,
@@ -378,6 +408,7 @@ internal/schema  →  internal/config
 | Flag name doesn't match env var | `--flag-name` → `LEATHER_FLAG_NAME`; check both |
 | Skipping graceful shutdown | Always call `scheduler.Drain` before returning from `RunServe` |
 | Flags after positional arg in `leather run` | Go's `flag.FlagSet` stops at the first non-flag token. The agent file path must come **last**: `leather run --config=... --var k=v agent.md` — not `leather run agent.md --config ...` |
+| `<item-id>` before flags in `leather dlq requeue` | Same issue: item-id must be **last** after all flags: `leather dlq requeue --state-dir ... <item-id>` |
 | `leather init` overwriting without `--overwrite` | `RunInit` fails closed: any pre-existing file causes a non-zero exit and reports `--overwrite` hint. Never silently clobber. |
 | Calling `RunValidate` from `RunInit` for post-write validation | `RunValidate` performs a full semantic check including model resolution (fails without `LEATHER_MODEL`). `RunInit` uses schema-only validation (`runInitValidate`) which is syntax-only and does not require a model to be set. |
 

@@ -218,6 +218,29 @@ binary for shell commands; an HTTP MCP server for HTTP APIs).
 - The capability is a coherent verb (e.g. `release-write`,
   `inbox-triage`), not a grab bag.
 
+#### Adding retry policy to a skill tool
+
+Tools that call remote APIs should declare a `retry:` block to handle transient
+failures (5xx, 429, network errors) without relying on the caller to retry:
+
+```yaml
+tools:
+  - name: github_list_issues
+    type: http
+    http:
+      url: "https://api.github.com/repos/{{.repo}}/issues"
+    retry:
+      max_attempts: 3    # initial attempt + 2 retries
+      base_delay: 1s     # doubles each retry; capped at max_delay
+      max_delay: 30s
+      honor_retry_after: true  # use Retry-After header when present
+```
+
+Omitting `retry:` (or setting `max_attempts: 1`) preserves the legacy
+single-attempt behaviour — no retries, no backoff. Only transient errors
+(5xx, 429, network timeouts) trigger retries; permanent 4xx errors return
+immediately regardless of the retry config.
+
 ### When to write a per-turn declaration
 
 - The turn does something dangerous and the agent's base scope is too

@@ -132,6 +132,35 @@ re-serialize the full slice. Files are mode 0600; directories mode 0700.
 - `Payload` — `map[string]any`; template variables available in agent prompts
 - `EnqueuedAt` — Unix timestamp
 - `AttemptCount` — incremented by the runner on each dequeue-and-run attempt
+- `ToolName` — non-empty for outbound DLQ items; the failed tool's name
+- `ToolTarget` — non-empty for outbound DLQ items; the URL or `server/tool` string
+
+#### Outbound DLQ (`outbound-dlq`)
+
+Tool execution failures that are permanent or that exhaust their retry budget
+are enqueued to the well-known `outbound-dlq` queue by `tool.Executor`. These
+items are **not** processed by `internal/curing` workers (`CuringName == ""`).
+They are surfaced for operator inspection and manual requeue via `leather dlq`.
+
+Outbound DLQ item shape:
+
+| Field | Meaning |
+|---|---|
+| `ID` | `odlq_<date>_<time>_<hex>` |
+| `AgentName` | Agent that triggered the tool call |
+| `ToolName` | Name of the failed tool |
+| `ToolTarget` | URL (HTTP tools) or `server/tool` (MCP tools) |
+| `AttemptCount` | Number of attempts made before giving up |
+| `EnqueuedAt` | Unix timestamp |
+| `Payload["tool"]` | Tool name (duplicate for backward compat) |
+| `Payload["args"]` | Tool arguments at time of failure |
+| `Payload["error"]` | Last error message |
+| `Payload["attempt"]` | Attempt count at time of failure |
+
+**DLQ enqueue path**: `tool.Executor.Execute` → final attempt fails or
+`isTransient=false` → `QueueMgr.Enqueue("outbound-dlq", item)`. Enqueue
+failure is logged at `warn` and does not affect the `ToolResult` returned to
+the runner. DLQ is disabled when `Executor.QueueMgr` is nil.
 
 ---
 

@@ -3,6 +3,10 @@
 [![Go Reference](https://pkg.go.dev/badge/github.com/tgpski/leather.svg)](https://pkg.go.dev/github.com/tgpski/leather)
 [![Go Version](https://img.shields.io/github/go-mod/go-version/TGPSKI/leather?v=2)](https://go.dev/)
 [![License](https://img.shields.io/github/license/TGPSKI/leather?v=2)](LICENSE)
+[![GitHub Release](https://img.shields.io/github/release/TGPSKI/leather.svg?style=flat)](https://github.com/TGPSKI/leather/releases)
+[![CI](https://github.com/TGPSKI/leather/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/TGPSKI/leather/actions/workflows/ci.yml)
+
+
 
 **Local agent infrastructure in one stdlib-only Go binary.**
 

@@ -46,6 +46,8 @@ func Run(args []string, stdout, stderr io.Writer, version, commit string) int {
 		return RunReplay(rest, stdout, stderr, version, commit)
 	case "snapshot":
 		return RunSnapshot(rest, stdout, stderr)
+	case "dlq":
+		return RunDLQ(rest, stdout, stderr)
 	case "attach":
 		return RunAttach(rest, stdout, stderr)
 	case "help", "--help", "-h":