Why
Every layer of ClawLoop — reflectors writing playbook entries, evolvers choosing what to mutate, operators reading dashboards — reasons about why an episode failed. Today each layer does this ad hoc, with free-text rationales that don't compose across runs or across agents. A shared, structured taxonomy of failure modes would:
- Let playbook entries reference categories ("handles tool-call argument-shape errors") instead of hoping a free-text match works.
- Let users track the distribution of failure modes across runs, not just the aggregate reward.
- Give the reflector a more precise target than "diagnose this episode."
The open question
We don't yet have a good answer to what the right categories are, and we think the community probably has strong opinions worth hearing before we freeze anything. Some of the axes we're unsure about:
- Granularity. Five categories or fifty?
- Shape. Flat list, hierarchical tree, multi-label tags, or structured fields (layer, surface, recoverable?)?
- Universality. Can one taxonomy cover math, web navigation, code repair, and tool-use agents — or does each task family need its own leaves under a shared root?
- Assignment. How does an episode get labelled? LLM judge, rule-based extractor, learned classifier, human?
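To make the "shape" axis concrete, here is a minimal sketch of how several of the options could coexist: a slash-separated category path gives a hierarchy, structured fields sit alongside it, and an episode can carry more than one label. Everything in it is an assumption for discussion, not an existing ClawLoop schema: the names FailureLabel and Layer, the Layer values, and the path convention are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Layer(Enum):
    # Hypothetical values: the post does not define what "layer" ranges over.
    MODEL = "model"
    HARNESS = "harness"
    ENVIRONMENT = "environment"


@dataclass(frozen=True)
class FailureLabel:
    # Slash-separated path, so a flat list is just the depth-1 case
    # and task-family leaves can hang under a shared root.
    category: str
    layer: Layer
    surface: str                        # e.g. which tool or page was involved
    recoverable: Optional[bool] = None  # None means "not assessed"

    def root(self) -> str:
        """Top-level category, e.g. 'tool_call' for 'tool_call/argument_shape'."""
        return self.category.split("/", 1)[0]

    def matches(self, prefix: str) -> bool:
        """True if this label is the given category or one of its descendants."""
        return self.category == prefix or self.category.startswith(prefix + "/")


# Multi-label: one episode may carry several labels.
episode_labels = (
    FailureLabel("tool_call/argument_shape", Layer.MODEL, "search_api", True),
    FailureLabel("budget_timeout", Layer.HARNESS, "step_limit", False),
)

# A playbook entry could then target a whole subtree rather than a free-text match.
tool_call_failures = [l for l in episode_labels if l.matches("tool_call")]
```

Whether the extra fields pay for themselves depends on the assignment question: each field is one more thing a judge, rule, or classifier has to get right.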
Starting-point categories (to be challenged)
Straw-man list from what we see most often across the existing environments — explicitly not a proposal, just a seed for discussion:
- Tool-call errors — wrong tool, wrong argument shape, missing argument.
- Grounding errors — references an entity or fact the env does not contain.
- Reasoning errors — correct setup, wrong conclusion.
- Planning errors — wrong ordering of otherwise-correct steps.
- Formatting errors — answer is right but in the wrong shape for the grader.
- Budget / timeout — ran out of steps or tokens before finishing.
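As an illustration only, the straw-man list could be written down as a flat enum, which is enough to report a per-run distribution of failure modes rather than just the aggregate reward. The identifiers FailureCategory and failure_distribution are hypothetical, and the sketch assumes one label per failed episode, which is itself one of the open questions.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class FailureCategory(Enum):
    # Mirrors the straw-man list; names and string values are placeholders.
    TOOL_CALL = "tool_call"            # wrong tool, wrong argument shape, missing argument
    GROUNDING = "grounding"            # references something the env does not contain
    REASONING = "reasoning"            # correct setup, wrong conclusion
    PLANNING = "planning"              # wrong ordering of otherwise-correct steps
    FORMATTING = "formatting"          # right answer, wrong shape for the grader
    BUDGET_TIMEOUT = "budget_timeout"  # ran out of steps or tokens


def failure_distribution(labels: Iterable[FailureCategory]) -> Dict[FailureCategory, float]:
    """Fraction of failed episodes per category, omitting categories with no episodes."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: counts[cat] / total for cat in FailureCategory if counts[cat]}


# Hypothetical labels for four failed episodes in one run.
run_labels = [
    FailureCategory.TOOL_CALL,
    FailureCategory.TOOL_CALL,
    FailureCategory.FORMATTING,
    FailureCategory.BUDGET_TIMEOUT,
]
distribution = failure_distribution(run_labels)
```

A flat enum like this is the cheapest thing to freeze into run records, and also the hardest to extend later without renaming, which is part of why the shape and granularity questions above matter.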
Prior art worth pointing at
If you know of failure taxonomies from other agent / RL / eval projects that are worth borrowing from, please link them in the comments. We'd rather stand on an existing vocabulary than invent a new one.
Why open this up early
Once a taxonomy gets baked into the reflector prompt, playbook schema, and run records, it is painful to change. We want the community to shape it before it calcifies.
Engage
Comment with proposed categories, references to existing taxonomies, critiques of the straw-man list, or thoughts on assignment strategy.