
Add §11: Usage-driven improvement via OODA loop #29

Open
sjmurdoch wants to merge 2 commits into kunchenguid:main from sjmurdoch:usage-driven-improvement


@sjmurdoch


Motivation

AXI §1–10 tell you what to optimize for, but not how to know whether you got it right. Every AXI tool ships with best-guess defaults — 3-4 default fields, a row limit, a truncation threshold — and those guesses stay frozen unless someone manually reviews agent behavior.

This PR proposes a new principle for discussion: instrument AXI tools to observe how agents actually use them, then feed that data back into tuning decisions.

What this adds

§11: Usage-driven improvement — a new section covering:

  1. What to log — a JSONL record per invocation capturing AXI-relevant signals: schema overrides, list-length overrides, parameterized default overrides, truncation usage, aggregates produced, errors with/without hints, parse errors with full argv, and latency.

  2. What to detect — four agent failure modes, three identified by the AXI benchmarks as primary drivers of wasted cost and turns, plus a disclosure-quality signal:

    • Discovery friction — agent tries a nonexistent flag, falls back to --help (wasted turn pair)
    • Retry cascades — same command repeated within seconds with escalating parameters
    • Verification follow-ups — detail call after a list to read a field the default schema omitted
    • Unhinted transitions — agent frequently follows command A with command B, but A's output never suggested B (missing contextual disclosure)
  3. What to summarize — a mytool usage subcommand that analyzes the log and reports insights organized around the four outcome metrics from AXI benchmarking (success rate, cost, duration, turns). Each analysis is annotated with which metrics it improves. Includes command sequence analysis that compares frequent A→B transitions against A's actual hints to identify missing, unused, and effective contextual disclosure.

  4. The improvement cycle — framed as an OODA loop (Observe → Orient → Decide → Act) where logging is Observe, the summary command is Orient, recommendations are Decide, and implementing changes is Act.
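
The Observe and Orient steps above could be sketched roughly as follows. This is a minimal illustration, not the schema the PR proposes: every key name in the record (`fields_override`, `hinted`, `result_count`, and so on), the helper names, and the 10-second retry window are assumptions made for the example. The retry detector also simplifies "escalating parameters" to "same command, different argv".

```python
import json
import time


def make_record(command, argv, *, fields_override=None, error=None,
                hinted=False, result_count=None, latency_ms=0.0, ts=None):
    """Build one log record; every key name here is a hypothetical choice."""
    return {
        "ts": time.time() if ts is None else ts,
        "command": command,
        "argv": list(argv),
        "fields_override": fields_override,  # None means the default schema was used
        "error": error,                      # e.g. "parse_error" or "api_error"
        "hinted": hinted,                    # did the output carry a next-step hint?
        "result_count": result_count,
        "latency_ms": latency_ms,
    }


def append_record(path, record):
    """Observe: append one JSON object per line (JSONL)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_records(path):
    """Read the JSONL log back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def retry_cascades(records, window_s=10.0):
    """Orient: count same-command repeats with changed argv within window_s seconds."""
    counts = {}
    for prev, cur in zip(records, records[1:]):
        same_command = prev["command"] == cur["command"]
        close_in_time = cur["ts"] - prev["ts"] <= window_s
        if same_command and close_in_time and prev["argv"] != cur["argv"]:
            counts[cur["command"]] = counts.get(cur["command"], 0) + 1
    return counts
```

A tool would call `append_record` once per invocation, and its usage subcommand would call `load_records` and pass the result to analyses such as `retry_cascades`. Decide and Act stay with the developer.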

Rationale

This came from building an AXI tool interfacing with the UK rail Live Departure Boards web service and running this cycle in practice. Concrete examples of what usage data revealed:

  • rsid was requested via --fields override in 2/3 of departure board calls → promoted to default schema (saves a follow-up turn every time the agent wants to drill into a service)
  • --offset was attempted twice on the journey command and rejected as "No such option" → flag was added (eliminated a discovery-friction turn pair)
  • 12/16 journey calls returned 0 results → identified as late-night queries (domain knowledge said don't widen the default, contradicting what a naive heuristic would suggest — illustrating why Orient and Decide must be separate steps)
  • All 5 parse errors had zero contextual hints → added --help hint to error output
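
The first finding above (a field requested via override in most calls) is the kind of signal a summary command can compute mechanically. A minimal sketch, assuming each log record is a dict with hypothetical `command` and `fields_override` keys and that a 50% share is the reporting threshold:

```python
from collections import Counter, defaultdict


def override_candidates(records, min_share=0.5):
    """Return {command: [(field, share), ...]} for fields requested via override
    in at least min_share of that command's calls."""
    calls = Counter()
    requested = defaultdict(Counter)
    for rec in records:
        calls[rec["command"]] += 1
        # A missing or None override means the default schema was used.
        for field in set(rec.get("fields_override") or []):
            requested[rec["command"]][field] += 1
    report = {}
    for command, fields in requested.items():
        hits = [(field, n / calls[command]) for field, n in fields.most_common()
                if n / calls[command] >= min_share]
        if hits:
            report[command] = hits
    return report
```

The function only reports candidates. Whether a field is actually promoted to the default schema remains a developer decision, as the late-night journey example shows.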

The OODA framing adds value beyond a numbered checklist because it separates Orient (mechanical analysis) from Decide (developer judgment). The log might show 75% empty results, but only the developer knows whether that's a default-parameter problem or an expected domain pattern.

Open questions

  • Log format standardization: should AXI prescribe a specific JSONL schema, or just the signal categories? A standard schema would let generic analysis tools work across AXI tools, but risks being too rigid for tools with different parameter shapes.

  • Privacy and retention: the current guidance says "local cache file, clear with --clear". Should there be stronger guidance on retention limits, sensitive data in argv (e.g. API keys passed as flags), or opt-out mechanisms?

  • Cross-session analysis: the current design treats each mytool usage run as a point-in-time snapshot. Would it be valuable to track trends across OODA cycles (e.g. "empty rate dropped from 46% to 12% after widening --window")? This would require not clearing the log, or maintaining a separate metrics history.

  • Automated recommendations vs. developer judgment: the section recommends that the usage command output actionable suggestions ("Add 'status' to default schema"). How prescriptive should these be? There's a tension between making it easy to act on (concrete suggestions) and respecting that the developer may have domain reasons to reject them (as in the late-night journey example).

  • Scope: is this the right level of detail for the skill prompt, or should the skill contain just the principles and link to a longer reference document for implementation guidance?
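
On the privacy question above, one option is to redact sensitive argv values before they reach the log. The sketch below is only an illustration: the flag names in `SENSITIVE_FLAGS` and the mask string are hypothetical, and a real tool would need its own list.

```python
SENSITIVE_FLAGS = {"--api-key", "--token", "--password"}  # hypothetical flag names


def redact_argv(argv, sensitive=SENSITIVE_FLAGS, mask="[REDACTED]"):
    """Mask the value of any sensitive flag, in both '--flag value'
    and '--flag=value' forms. Other arguments pass through unchanged."""
    out = []
    mask_next = False
    for arg in argv:
        if mask_next:
            # This argument is the value of the preceding sensitive flag.
            out.append(mask)
            mask_next = False
            continue
        name, sep, _value = arg.partition("=")
        if name in sensitive:
            if sep:
                out.append(f"{name}={mask}")
            else:
                out.append(name)
                mask_next = True
        else:
            out.append(arg)
    return out
```

Redacting at write time means the log file never holds the secret, so clearing the log is not the only safeguard.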

Add a new principle for instrumenting AXI tools with usage logging
and a summary command that feeds agent behavior data back into
tuning decisions.

Covers: what to log (JSONL per invocation), what to detect (discovery
friction, retry cascades, verification follow-ups, unhinted
transitions), what to summarize (organized around success/cost/
duration/turns metrics), and the improvement cycle (OODA loop).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sjmurdoch

Author

As an example of usage, here is Claude Opus 4.7's analysis of its experience using the rail AXI app.


1. rail station shows an unused arrivals hint: drop it. unused_hints shows station → arrivals was offered 5 times and followed 0 times. Looking up a station almost always precedes departures, not arrivals. cli.py:70-73 currently emits both, so drop the arrivals line. (Recommendation #8 in the report is correct.)

2. API errors disclose nothing actionable. 10 of 13 errors had no contextual hint. error_categories shows 5 api_error entries. At cli.py:326-329, 426-428, and 511-540, the API-failure paths just call toon_error(str(e)) with no hints. Add hints like "Run rail station <name> to verify CRS" / "Run rail setup to verify API connectivity".

3. Cross-board hints are missing. unhinted_transitions shows the top sequences agents make without being prompted:

  • departures → arrivals (10) — need a "Run rail arrivals <crs> for arrivals at this station" hint on departures output (and the inverse on arrivals).
  • journey → departures (7) — journey output should suggest checking live departures from the origin/destination CRS.
  • arrivals → departures (6) — symmetric to above.

4. Default --window is too narrow for journey. 17 of 29 journey calls returned 0 results (59% empty rate). Only 1 of 29 used a --window override, so agents are mostly ignoring the "widen window" hint and abandoning the query. Either widen the default to 240 min, or auto-retry once with a 2× window before reporting empty. The retry data backs this up: the 10 retries recorded for journey are largely the "0 results → try again with different params" pattern.

5. journey empty-state still hides the search structure. When SRA→CBG returns 0, the only hint is "widen window" or "specify --via". The planner should tell the agent which interchanges it actually probed (interchanges searched: BHM,LDS,…) so the agent can pick a sensible --via instead of guessing. Useful aggregate, low cost — fits AXI §3 (pre-computed structure to prevent follow-ups).
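
The unused_hints and unhinted_transitions figures cited in items 1 and 3 could be derived from a command sequence along these lines. This is a sketch under assumptions, not the rail tool's actual code: the function name, the `hints` mapping, and the `min_count` threshold are hypothetical, and the whole log is treated as a single session.

```python
from collections import Counter


def classify_transitions(sequence, hints, min_count=2):
    """sequence: command names in call order.
    hints: {command: set of commands its output suggests}.
    Returns (unhinted, unused): frequent A->B moves that A never suggested,
    and hints that were offered but never followed."""
    transitions = Counter(zip(sequence, sequence[1:]))
    unhinted = {
        pair: n for pair, n in transitions.items()
        if n >= min_count
        and pair[0] != pair[1]  # same-command repeats are retries, not transitions
        and pair[1] not in hints.get(pair[0], set())
    }
    unused = [
        (a, b) for a, targets in hints.items() for b in sorted(targets)
        if transitions[(a, b)] == 0
    ]
    return unhinted, unused
```

For example, with `hints = {"station": {"departures", "arrivals"}}`, a log in which agents never go from station to arrivals puts `("station", "arrivals")` in `unused`, which matches the shape of the finding in item 1.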

@kunchenguid

Owner

hey @sjmurdoch this is a really thoughtful idea. I love it.

thoughts on your open questions - in general I don't think the AXI principles should prescribe the underlying implementation. the core idea of instrumentation + feedback loop to refine tool design is solid and agnostic of how it's implemented. we should leave the implementation to each tool.

as such we should significantly reduce the amount of content we write to describe this 11th principle. right now it's quite lengthy and has implementation details I think we can drop

do you want to take a pass at it?

@kunchenguid added the ezoss/triaged label (Managed by ezoss) on Apr 29, 2026
@sjmurdoch

Author

Thanks for your comments. I'm glad you like the idea. I'll take a shot at shortening the guidance and, in particular, removing the implementation details.

@sjmurdoch

Author

I've substantially revised the proposal, focusing on making it more general and removing anything that prescribes a particular implementation. It's shorter than before, but still longer than the other principles, mostly because of the examples. The examples came from experience implementing and using the technique, so I think the next open question is which of them are obvious enough to leave unstated, and which would be valuable to implementers.
