Error taxonomy: shared vocabulary for agent failures #57

### Why

Every layer of ClawLoop — reflectors writing playbook entries, evolvers choosing what to mutate, operators reading dashboards — reasons about *why* an episode failed. Today each layer does this ad hoc, with free-text rationales that don't compose across runs or across agents. A shared, structured taxonomy of failure modes would:

- Let playbook entries reference categories ("handles tool-call argument-shape errors") instead of hoping a free-text match works.
- Let users track the *distribution* of failure modes across runs, not just the aggregate reward.
- Give the reflector a more precise target than "diagnose this episode."

### The open question

We don't yet have a good answer to what the right categories are, and we think the community probably has strong opinions worth hearing before we freeze anything. Some of the axes we're unsure about:

- **Granularity.** Five categories or fifty?
- **Shape.** Flat list, hierarchical tree, multi-label tags, or structured fields (`layer`, `surface`, `recoverable?`)?
- **Universality.** Can one taxonomy cover math, web navigation, code repair, and tool-use agents — or does each task family need its own leaves under a shared root?
- **Assignment.** How does an episode get labelled? LLM judge, rule-based extractor, learned classifier, human?
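
To make the shape options easier to compare, here is a minimal Python sketch of a structured-field label that can also be read as a hierarchical path and attached several at a time. Everything in it is an assumption for discussion: `FailureLabel`, `is_under`, the example `layer` / `surface` values, and the `layer/surface` path format are hypothetical and not an existing ClawLoop schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureLabel:
    """Hypothetical structured label; field names follow the 'Shape' bullet."""
    layer: str                          # illustrative value, e.g. "tool"
    surface: str                        # illustrative value, e.g. "argument-shape"
    recoverable: Optional[bool] = None  # None when recoverability is unknown

    def path(self) -> str:
        """Render the label as a hierarchical path such as 'tool/argument-shape'."""
        return f"{self.layer}/{self.surface}"


def is_under(label: FailureLabel, prefix: str) -> bool:
    """Return True if the label sits at or below the tree node named by prefix."""
    path = label.path()
    return path == prefix or path.startswith(prefix + "/")


# Multi-label tagging: one episode may carry several labels at once.
episode_labels = [
    FailureLabel("tool", "argument-shape", recoverable=True),
    FailureLabel("budget", "step-limit", recoverable=False),
]

# Query by subtree, as a hierarchical taxonomy would allow.
tool_failures = [label for label in episode_labels if is_under(label, "tool")]
```

One possible reading of this sketch is that the shape options are not mutually exclusive: structured fields can be projected onto a tree, and a flat list is the special case where every label has a single fixed field.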

### Starting-point categories (to be challenged)

This straw-man list is drawn from the failures we see most often across the existing environments. It is explicitly not a proposal, just a seed for discussion:

- **Tool-call errors** — wrong tool, wrong argument shape, missing argument.
- **Grounding errors** — references an entity or fact the env does not contain.
- **Reasoning errors** — correct setup, wrong conclusion.
- **Planning errors** — wrong ordering of otherwise-correct steps.
- **Formatting errors** — answer is right but in the wrong shape for the grader.
- **Budget / timeout** — ran out of steps or tokens before finishing.
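
As a rough illustration of how a flat version of the list above could feed per-run failure distributions, here is a minimal Python sketch. The `FailureCategory` enum, its string values, and `failure_distribution` are hypothetical names chosen for this example, and the labels in `run_labels` are made up; how episodes actually get labelled is left open.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class FailureCategory(Enum):
    """Hypothetical flat encoding of the straw-man categories."""
    TOOL_CALL = "tool_call"
    GROUNDING = "grounding"
    REASONING = "reasoning"
    PLANNING = "planning"
    FORMATTING = "formatting"
    BUDGET_TIMEOUT = "budget_timeout"


def failure_distribution(labels: Iterable[FailureCategory]) -> Dict[str, float]:
    """Return the fraction of labelled failures per category, omitting zeros."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        cat.value: counts[cat] / total
        for cat in FailureCategory
        if counts[cat] > 0
    }


# Made-up labels for four failed episodes in one run.
run_labels = [
    FailureCategory.TOOL_CALL,
    FailureCategory.TOOL_CALL,
    FailureCategory.FORMATTING,
    FailureCategory.BUDGET_TIMEOUT,
]
dist = failure_distribution(run_labels)
```

With a fixed set of category names, distributions from different runs or agents share the same keys and can be compared directly, which free-text rationales do not allow.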

### Prior art worth pointing at

If you know of failure taxonomies from other agent / RL / eval projects that are worth borrowing from, please link them in the comments. We'd rather stand on an existing vocabulary than invent a new one.

### Why open this up early

Once a taxonomy gets baked into the reflector prompt, playbook schema, and run records, it is painful to change. We want the community to shape it before it calcifies.

### Engage

Comment with proposed categories, references to existing taxonomies, critiques of the straw-man list, or thoughts on assignment strategy.
