From c440481e564cbfbf9766c6f4839d0607ba216e5e Mon Sep 17 00:00:00 2001 From: Tyler Pate Date: Fri, 5 Jun 2026 20:53:27 -0700 Subject: [PATCH 1/3] tools: add outbound retry policy, DLQ, and per-host rate limits (#7, #8, #9) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implements three tightly related outbound tool resilience features: **#7 — Per-tool retry policy** - `model.ToolRetryConfig` (max_attempts, base_delay, max_delay, honor_retry_after) wired into `model.ToolDefinition.Retry` - `isTransient` classifies 429/5xx/network as transient; other 4xx permanent - Exponential backoff with 10% jitter; `Retry-After` header honoured when set - Zero `ToolRetryConfig` = single attempt (backward-compatible; no retry) - `retry:` block added to `schemas/skill-1.schema.yaml` **#8 — Outbound DLQ and requeue** - Exhausted/permanent tool failures enqueued to `outbound-dlq` when `Executor.QueueMgr != nil`; `QueueItem.ToolName`/`ToolTarget` carry context - New `leather dlq inspect` / `leather dlq requeue` CLI subcommands - `` must be last (after all flags) due to Go flag.FlagSet behaviour **#9 — Per-host rate limits and metrics** - `internal/tool/ratelimit.go`: stdlib token-bucket `HostLimiter`; nil-safe `Wait`; rate spec format `"N/s"`, `"N/m"`, `"N/h"` - `config.yaml tools.rate_limits:` parsed into `model.Config.ToolRateLimits` - Package-level atomic counters (`retry_total`, `backoff_total`, `rate_limit_wait_total`) exposed via `tool.MetricSnapshot()` - `/metrics` gains `leather_tool_retry_total`, `leather_tool_backoff_total`, `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth` All new tests pass under `-race`; `make ci` green. --- .subagents/AGENTS-OBSERVABILITY.md | 4 + .subagents/AGENTS-QUALITY.md | 15 + .subagents/AGENTS-RUNTIME.md | 58 ++- .subagents/AGENTS-SERVE.md | 31 ++ .subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md | 23 ++ .subagents/AGENTS-WORKER.md | 29 ++ internal/cli/cli.go | 2 + internal/cli/cmd_dlq.go | 189 ++++++++++ internal/cli/cmd_dlq_test.go | 188 ++++++++++ internal/cli/cmd_run.go | 10 + internal/cli/cmd_serve.go | 27 ++ internal/cli/help.go | 1 + internal/config/config.go | 70 ++++ internal/model/model.go | 29 ++ internal/runner/runner.go | 5 +- internal/tool/executor.go | 340 ++++++++++++++++-- internal/tool/executor_test.go | 390 ++++++++++++++++++--- internal/tool/ratelimit.go | 142 ++++++++ internal/tool/ratelimit_test.go | 178 ++++++++++ schemas/skill-1.schema.yaml | 34 ++ 20 files changed, 1682 insertions(+), 83 deletions(-) create mode 100644 internal/cli/cmd_dlq.go create mode 100644 internal/cli/cmd_dlq_test.go create mode 100644 internal/tool/ratelimit.go create mode 100644 internal/tool/ratelimit_test.go diff --git a/.subagents/AGENTS-OBSERVABILITY.md b/.subagents/AGENTS-OBSERVABILITY.md index 1d27bfc..45ee29c 100644 --- a/.subagents/AGENTS-OBSERVABILITY.md +++ b/.subagents/AGENTS-OBSERVABILITY.md @@ -176,6 +176,10 @@ metric requires a dashboard update note in the PR description. | `leather_cache_hits_total` | counter | `kind` | Response-cache hits. | | `leather_cache_misses_total` | counter | `kind` | Response-cache misses. | | `leather_build_info` | gauge=1 | `version`, `commit` | Build metadata. | +| `leather_tool_retry_total` | counter | _(none)_ | Tool call attempts beyond the first; tells the operator how often transient failures occur across all tools. | +| `leather_tool_backoff_total` | counter | _(none)_ | Times a backoff sleep was applied (retry-after or exponential); indicates rate-limiting pressure from upstream services. | +| `leather_tool_rate_limit_wait_total` | counter | _(none)_ | Times a tool call waited for a per-host token-bucket token; nonzero means the configured rate limits are actively throttling traffic. | +| `leather_outbound_dlq_depth` | gauge | _(none)_ | Current item count in `outbound-dlq`; nonzero means tool failures need operator attention (`leather dlq inspect`). | Rules: diff --git a/.subagents/AGENTS-QUALITY.md b/.subagents/AGENTS-QUALITY.md index 5a398bb..49e39cb 100644 --- a/.subagents/AGENTS-QUALITY.md +++ b/.subagents/AGENTS-QUALITY.md @@ -271,6 +271,21 @@ Before opening a PR: - [ ] No real credentials or secrets in `testdata/` - [ ] CI workflow action refs are SHA-pinned with version comments +### Outbound tool resilience (retry, DLQ, rate limits) + +PRs touching `internal/tool`, `internal/config`, or `internal/cli/cmd_dlq.go`: + +- [ ] `go test ./internal/tool/... ./internal/cli/...` passes +- [ ] `go test -race ./internal/tool/... ./internal/cli/...` passes +- [ ] New tools with `retry:` config have tests covering transient retry and permanent no-retry paths +- [ ] `TestExecute_DLQEnqueueOnExhaustion` and `TestExecute_DLQEnqueueOnPermanent` pass +- [ ] `TestHostLimiter_*` suite passes; no real network calls +- [ ] `TestRunDLQ*` suite passes with `t.TempDir()` state dirs +- [ ] `leather dlq inspect` output includes ID, tool name, agent, attempt, error +- [ ] `leather dlq requeue --state-dir ... ` (item-id **last**) moves item +- [ ] `tools.rate_limits` in config.yaml parses without error; bad spec is warn+disable, not panic +- [ ] `/metrics` response contains `leather_tool_retry_total`, `leather_tool_backoff_total`, `leather_tool_rate_limit_wait_total`, `leather_outbound_dlq_depth` + --- _Last reviewed: 2026-06-05_ diff --git a/.subagents/AGENTS-RUNTIME.md b/.subagents/AGENTS-RUNTIME.md index b5392f1..992dcc2 100644 --- a/.subagents/AGENTS-RUNTIME.md +++ b/.subagents/AGENTS-RUNTIME.md @@ -116,13 +116,60 @@ func (r *Registry) GetTools(skillNames []string) []model.ToolDefinition func (r *Registry) ResolveTools(skillNames, toolsetNames, toolNames []string) []model.ToolDefinition // Executor dispatches HTTP- and MCP-backed tools. -type Executor struct { MCP *mcp.Registry } +// QueueMgr enables the outbound DLQ: permanent/exhausted failures are +// enqueued to "outbound-dlq" when this field is non-nil. +// Limiter enforces per-host token-bucket rate limits before each attempt. +type Executor struct { + MCP *mcp.Registry + QueueMgr *queue.Manager // nil = outbound DLQ disabled + AgentName string // injected by Runner; stored in DLQ items + Limiter *HostLimiter // nil = no rate limiting +} // Execute runs a single tool call. Always returns a ToolResult; never panics. // On failure, ToolResult.Error is set and Content may be empty. +// With def.Retry.MaxAttempts > 0, transient failures (5xx, 429, network) +// are retried with exponential backoff + jitter. func (e *Executor) Execute(ctx context.Context, def model.ToolDefinition, args map[string]any) model.ToolResult + +// MetricSnapshot returns a point-in-time snapshot of outbound tool counters. +// Counters are process-lifetime atomics; they reset only on restart. +func MetricSnapshot() (retryTotal, backoffTotal, rateLimitWaitTotal int64) ``` +#### Per-tool retry policy + +`ToolDefinition.Retry` (`model.ToolRetryConfig`) controls retry behaviour: + +| Field | Default | Meaning | +|---|---|---| +| `max_attempts` | 1 (no retry) | Total attempts including the initial one. | +| `base_delay` | `1s` | Initial backoff; doubles each attempt. | +| `max_delay` | `30s` | Backoff ceiling before jitter. | +| `honor_retry_after` | `true` | Use `Retry-After` header value when present. | + +A zero `ToolRetryConfig` (all fields at zero value) means **single attempt, no +retry** — preserving backward compatibility for tools that predate the policy. +Only tools with an explicit `retry:` block in their skill YAML get the retry loop. + +`isTransient` classifies the failure to decide whether to retry: +- **Transient** → retry: 429, 500, 502, 503, 504; network/timeout errors; + 403 with `X-RateLimit-Remaining: 0` +- **Permanent** → return immediately without retrying: all other 4xx + +#### Outbound DLQ + +When `Executor.QueueMgr != nil` and a tool call either: +- exhausts its retry budget on transient errors, or +- fails immediately with a permanent error, + +a `model.QueueItem` is enqueued to the well-known `"outbound-dlq"` queue. +DLQ items carry `ToolName`, `ToolTarget`, and the last error in `Payload`. +DLQ enqueue is a fire-and-forget side-effect; failure to enqueue is logged +at `warn` and does not affect the returned `ToolResult`. + +Items in `outbound-dlq` can be inspected and requeued via `leather dlq`. + #### Skill file format (`*.skill.yaml`) ```yaml @@ -139,6 +186,11 @@ tools: headers: Authorization: "Bearer {{env:GITHUB_TOKEN}}" Accept: application/vnd.github+json + retry: + max_attempts: 3 + base_delay: 1s + max_delay: 30s + honor_retry_after: true ``` #### Toolset file format (`*.toolset.yaml`) @@ -161,11 +213,15 @@ every toolset reference points at an already-known tool. - `mcp` → `execMCP` through a started `mcp.Registry` HTTP execution (`execHTTP`): +- Calls `e.Limiter.Wait(ctx, host)` before each attempt; blocks until the + per-host token bucket allows the request or ctx is cancelled. - Expands URL template with tool call arguments (`{{.field}}`). - Expands `{{env:VAR}}` in header values; **never logs auth header values**. - Sends the request with the runner's context (inherits timeout). - Response body is capped at 1 MB. - Non-2xx responses populate `ToolResult.Error` with the status code message. +- Transient failures are retried up to `def.Retry.MaxAttempts` times with + exponential backoff; permanent failures return immediately. MCP execution (`execMCP`): - Looks up the named server in the running registry. diff --git a/.subagents/AGENTS-SERVE.md b/.subagents/AGENTS-SERVE.md index 15caa33..9cef85a 100644 --- a/.subagents/AGENTS-SERVE.md +++ b/.subagents/AGENTS-SERVE.md @@ -46,6 +46,7 @@ Each subcommand has: | `run` | `RunOnce` | Load and execute a single agent, then exit | | `validate` | `RunValidate` | Parse and validate agent files; report errors | | `test-agent` | `RunTestAgent` | Execute an agent with `MockLLM` and print the turn transcript | +| `dlq` | `RunDLQ` | Inspect and requeue outbound dead-letter queue items | | `status` | `RunStatus` | Print scheduler state, job history, token usage | | `ingest` | `RunIngest` | Store raw bytes as a hide and optionally enqueue for curing | | `snapshot` | `RunSnapshot` → `RunSnapshotSave` / `RunSnapshotRestore` | Save or restore a `tar.gz` point-in-time archive of runtime state | @@ -212,8 +213,19 @@ cache_dir: "" # empty = serve uses /cache mcp_servers_file: "" # empty = run/serve/validate use ~/.leather/mcp-servers.yaml loop: 1 # repeat leather run N times tannery: "" # path to tannery.yaml; empty = tannery disabled + +tools: + rate_limits: # per-host token-bucket rate limits for outbound tool calls + api.github.com: "60/m" # format: "N/s", "N/m", or "N/h" + api.example.com: "10/s" ``` +`tools.rate_limits` is a nested map. Each key is a hostname (no port, no +scheme); the value is a rate spec in the form `N/` where unit is `s` +(seconds), `m` (minutes), or `h` (hours). The second call to the same host +within the interval blocks until the next token is available. Unknown hosts +pass through immediately with no limiting. + YAML keys are the snake_case equivalents of the flag names (strip `--`, replace `-` with `_`). @@ -255,6 +267,24 @@ All endpoints live in `internal/cli/api_tannery.go`. | `/curings` | GET | `handleCurings` | List curing definitions with queue depth | | `/intake` | POST | `handleIntake` | Direct-ingest endpoint; writes hide from request body | +### `leather dlq` subcommand (`internal/cli/cmd_dlq.go`) + +`RunDLQ(args, stdout, stderr)` dispatches `inspect` and `requeue`: + +``` +leather dlq inspect [--queue outbound-dlq] [--state-dir ...] +leather dlq requeue [--queue outbound-dlq] [--work-queue ] [--state-dir ...] +``` + +- **`inspect`** — lists all items in the DLQ; prints `ID | tool | agent | attempt | enqueued_at | error`. +- **`requeue`** — moves the named item from the DLQ to `--work-queue`, resetting + `AttemptCount` to 0 so it gets a fresh retry budget. Default `--work-queue` is + the DLQ name with the `-dlq` suffix stripped. + +**Important**: `` must come **after** all flags. Go's `flag.FlagSet` +stops parsing at the first non-flag token, so placing `` before flags +silently ignores the remaining flags. + ### `leather ingest` subcommand (`internal/cli/cmd_ingest.go`) `RunIngest(args, stdin, stdout, stderr)` — reads body from `--file` or stdin, @@ -378,6 +408,7 @@ internal/schema → internal/config | Flag name doesn't match env var | `--flag-name` → `LEATHER_FLAG_NAME`; check both | | Skipping graceful shutdown | Always call `scheduler.Drain` before returning from `RunServe` | | Flags after positional arg in `leather run` | Go's `flag.FlagSet` stops at the first non-flag token. The agent file path must come **last**: `leather run --config=... --var k=v agent.md` — not `leather run agent.md --config ...` | +| `` before flags in `leather dlq requeue` | Same issue: item-id must be **last** after all flags: `leather dlq requeue --state-dir ... ` | | `leather init` overwriting without `--overwrite` | `RunInit` fails closed: any pre-existing file causes a non-zero exit and reports `--overwrite` hint. Never silently clobber. | | Calling `RunValidate` from `RunInit` for post-write validation | `RunValidate` performs a full semantic check including model resolution (fails without `LEATHER_MODEL`). `RunInit` uses schema-only validation (`runInitValidate`) which is syntax-only and does not require a model to be set. | diff --git a/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md b/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md index 6490f2e..7578434 100644 --- a/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md +++ b/.subagents/AGENTS-TOOLS-SKILLS-TOOLSETS.md @@ -218,6 +218,29 @@ binary for shell commands; an HTTP MCP server for HTTP APIs). - The capability is a coherent verb (e.g. `release-write`, `inbox-triage`), not a grab bag. +#### Adding retry policy to a skill tool + +Tools that call remote APIs should declare a `retry:` block to handle transient +failures (5xx, 429, network errors) without relying on the caller to retry: + +```yaml +tools: + - name: github_list_issues + type: http + http: + url: "https://api.github.com/repos/{{.repo}}/issues" + retry: + max_attempts: 3 # initial attempt + 2 retries + base_delay: 1s # doubles each retry; capped at max_delay + max_delay: 30s + honor_retry_after: true # use Retry-After header when present +``` + +Omitting `retry:` (or setting `max_attempts: 1`) preserves the legacy +single-attempt behaviour — no retries, no backoff. Only transient errors +(5xx, 429, network timeouts) trigger retries; permanent 4xx errors return +immediately regardless of the retry config. + ### When to write a per-turn declaration - The turn does something dangerous and the agent's base scope is too diff --git a/.subagents/AGENTS-WORKER.md b/.subagents/AGENTS-WORKER.md index 7922103..7c87e4a 100644 --- a/.subagents/AGENTS-WORKER.md +++ b/.subagents/AGENTS-WORKER.md @@ -132,6 +132,35 @@ re-serialize the full slice. Files are mode 0600; directories mode 0700. - `Payload` — `map[string]any`; template variables available in agent prompts - `EnqueuedAt` — Unix timestamp - `AttemptCount` — incremented by the runner on each dequeue-and-run attempt +- `ToolName` — non-empty for outbound DLQ items; the failed tool's name +- `ToolTarget` — non-empty for outbound DLQ items; the URL or `server/tool` string + +#### Outbound DLQ (`outbound-dlq`) + +Tool execution failures that are permanent or that exhaust their retry budget +are enqueued to the well-known `outbound-dlq` queue by `tool.Executor`. These +items are **not** processed by `internal/curing` workers (`CuringName == ""`). +They are surfaced for operator inspection and manual requeue via `leather dlq`. + +Outbound DLQ item shape: + +| Field | Meaning | +|---|---| +| `ID` | `odlq__