Add §11: Usage-driven improvement via OODA loop #29
Add a new principle for instrumenting AXI tools with usage logging and a summary command that feeds agent behavior data back into tuning decisions. Covers: what to log (JSONL per invocation), what to detect (discovery friction, retry cascades, verification follow-ups, unhinted transitions), what to summarize (organized around success/cost/duration/turns metrics), and the improvement cycle (OODA loop).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
As an example of usage, this is the Claude Opus 4.7 analysis of experience using the rail AXI app.
1. …
2. API errors disclose nothing actionable. 10 of 13 errors had no contextual hint.
3. Cross-board hints are missing.
4. Default …
5. …
|
hey @sjmurdoch this is a really thoughtful idea. I love it.
thoughts on your open questions: in general I don't think the AXI principles should prescribe the underlying implementation. the core idea of instrumentation + feedback loop to refine tool design is solid and agnostic of how it's implemented. we should leave the implementation to each tool.
as such we should significantly reduce the amount of content we write to describe this 11th principle. right now it's quite lengthy and has implementation details I think we can drop.
do you want to take a pass at it?
|
Thanks for your comments. I'm glad you like the idea. I'll take a shot at shortening the guidance and, in particular, removing the implementation details.
…hat should be achieved, not how
|
I've substantially revised the proposal, focusing on making it more general and removing anything that prescribes a particular implementation. It's shorter than before, but longer than the other principles, mostly due to the examples. The examples came from experience implementing and using the technique, so I think the next open question is which of them are sufficiently obvious that they don't need to be stated, versus which would be valuable to implementers.
Motivation
AXI §1–10 tell you what to optimize for, but not how to know whether you got it right. Every AXI tool ships with best-guess defaults — 3-4 default fields, a row limit, a truncation threshold — and those guesses stay frozen unless someone manually reviews agent behavior.
This PR proposes a new principle for discussion: instrument AXI tools to observe how agents actually use them, then feed that data back into tuning decisions.
What this adds
§11: Usage-driven improvement — a new section covering:
What to log — a JSONL record per invocation capturing AXI-relevant signals: schema overrides, list-length overrides, parameterized default overrides, truncation usage, aggregates produced, errors with/without hints, parse errors with full argv, and latency.
What to detect — four agent failure modes, three identified by the AXI benchmarks as primary drivers of wasted cost and turns, plus a disclosure-quality signal:
… --help (wasted turn pair)
What to summarize — a mytool usage subcommand that analyzes the log and reports insights organized around the four outcome metrics from AXI benchmarking (success rate, cost, duration, turns). Each analysis is annotated with which metrics it improves. Includes command sequence analysis that compares frequent A→B transitions against A's actual hints to identify missing, unused, and effective contextual disclosure.
The improvement cycle — framed as an OODA loop (Observe → Orient → Decide → Act) where logging is Observe, the summary command is Orient, recommendations are Decide, and implementing changes is Act.
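To make the log and the command sequence analysis concrete, here is a minimal sketch, not a prescribed implementation. The record fields (`command`, `argv`, `hints`, `error`, `latency_ms`), the function names, and the `min_count` threshold are all assumptions chosen for illustration. It treats the log as one ordered stream and does not separate sessions, which a real tool would probably need to do.

```python
import json
import time
from collections import Counter
from pathlib import Path


def log_invocation(log_path, command, argv, hints, error=None, latency_ms=0.0):
    """Append one JSON object per invocation to a JSONL file (Observe)."""
    record = {
        "ts": time.time(),
        "command": command,
        "argv": list(argv),
        "hints": list(hints),  # commands this output suggested as next steps
        "error": error,
        "latency_ms": latency_ms,
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_records(log_path):
    """Read the JSONL log back into a list of dicts, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def classify_transitions(records, min_count=2):
    """Compare observed A->B transitions against the hints A emitted (Orient)."""
    transitions = Counter()  # how often B directly followed A
    hinted = Counter()       # how often that B was among A's hints
    offered = Counter()      # every (A, hint) pair A ever emitted
    for prev, nxt in zip(records, records[1:]):
        pair = (prev["command"], nxt["command"])
        transitions[pair] += 1
        if nxt["command"] in prev["hints"]:
            hinted[pair] += 1
    for rec in records:
        for hint in rec["hints"]:
            offered[(rec["command"], hint)] += 1
    return {
        # frequent transitions that A never hinted at
        "missing": sorted(p for p, n in transitions.items()
                          if n >= min_count and hinted[p] == 0),
        # transitions that followed a hint at least once
        "effective": sorted(p for p in transitions if hinted[p] > 0),
        # hints that were offered but never followed
        "unused": sorted(p for p in offered if transitions[p] == 0),
    }
```

The output is only the Orient step: whether a "missing" transition deserves a hint is still a developer decision.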
Rationale
This came from building an AXI tool interfacing with the UK rail Live Departure Boards web service and running this cycle in practice. Concrete examples of what usage data revealed:
rsid was requested via --fields override in 2/3 of departure board calls → promoted to default schema (saves a follow-up turn every time the agent wants to drill into a service)
--offset was attempted twice on the journey command and rejected as "No such option" → flag was added (eliminated a discovery-friction turn pair)
… --help hint to error output
The OODA framing adds value beyond a numbered checklist because it separates Orient (mechanical analysis) from Decide (developer judgment). The log might show 75% empty results, but only the developer knows whether that's a default-parameter problem or an expected domain pattern.
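The rsid promotion described above can be pictured as a small Orient-step analysis. The sketch below is hypothetical: the `fields_override` record key, the function name, and the 0.5 threshold are assumptions for illustration, not part of the proposal. It only surfaces candidates and leaves the promotion decision to the developer.

```python
from collections import Counter


def promotion_candidates(records, command, default_fields, threshold=0.5):
    """Return non-default fields requested in at least `threshold` of calls.

    Each record is assumed to carry a `command` name and an optional
    `fields_override` list holding the fields passed via a --fields flag.
    """
    calls = [r for r in records if r["command"] == command]
    if not calls:
        return {}
    requested = Counter()
    for rec in calls:
        # Count each non-default field at most once per call.
        extra = set(rec.get("fields_override") or []) - set(default_fields)
        requested.update(extra)
    return {
        field: count / len(calls)
        for field, count in requested.items()
        if count / len(calls) >= threshold
    }
```

Returning the observed share instead of a yes/no answer keeps the judgment with the developer, who may know a domain reason to leave a frequently requested field out of the default schema.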
Open questions
Log format standardization: should AXI prescribe a specific JSONL schema, or just the signal categories? A standard schema would let generic analysis tools work across AXI tools, but risks being too rigid for tools with different parameter shapes.
Privacy and retention: the current guidance says "local cache file, clear with --clear". Should there be stronger guidance on retention limits, sensitive data in argv (e.g. API keys passed as flags), or opt-out mechanisms?
Cross-session analysis: the current design treats each mytool usage run as a point-in-time snapshot. Would it be valuable to track trends across OODA cycles (e.g. "empty rate dropped from 46% to 12% after widening --window")? This would require not clearing the log, or maintaining a separate metrics history.
Automated recommendations vs. developer judgment: the section recommends that the usage command output actionable suggestions ("Add 'status' to default schema"). How prescriptive should these be? There's a tension between making it easy to act on (concrete suggestions) and respecting that the developer may have domain reasons to reject them (as in the late-night journey example).
Scope: is this the right level of detail for the skill prompt, or should the skill contain just the principles and link to a longer reference document for implementation guidance?
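On the privacy question above, one possible mitigation for sensitive data in argv is to redact known secret-bearing flags before a record is written. This is a sketch under assumptions: the flag names in `SENSITIVE_FLAGS` are hypothetical, and each tool would need its own list. It does not catch secrets passed positionally.

```python
SENSITIVE_FLAGS = frozenset({"--api-key", "--token", "--password"})  # hypothetical names


def redact_argv(argv, sensitive=SENSITIVE_FLAGS, mask="[REDACTED]"):
    """Return a copy of argv with the values of sensitive flags masked.

    Handles both `--flag value` and `--flag=value` forms.
    """
    redacted = []
    mask_next = False
    for arg in argv:
        if mask_next:
            # This argument is the value of the preceding sensitive flag.
            redacted.append(mask)
            mask_next = False
            continue
        flag, sep, _value = arg.partition("=")
        if flag in sensitive:
            if sep:
                redacted.append(f"{flag}={mask}")
            else:
                redacted.append(flag)
                mask_next = True
        else:
            redacted.append(arg)
    return redacted
```

Redacting at write time, not at analysis time, means the secret never reaches the log file, so retention limits matter less for that class of data.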