martin0ne/agent-flow

agent-flow

A minimal, transparent multi-step (ReAct) document-analyst agent — one Python file, pure stdlib, no framework. Point it at a folder of documents, give it a task, and it iterates think → tool → observe until it can write a report in which every claim carries file:line provenance.

Built by Marcin J. Ołdak to show what an agentic workflow looks like when you can read all of it in ten minutes: planning loop, tool dispatch, grounding rules, human-approval gate, and a test harness that runs the whole loop with a mock LLM — no API key needed.

            ┌────────────────────────────────────────────┐
            │  LLM (any CLI: `claude -p` by default)     │
            │  returns one JSON action per step          │
            └────────────┬───────────────────────────────┘
                         │ {"thought", "action", "args"}
            ┌────────────▼───────────────┐
            │  agent loop (agent.py)     │◄─── optional human
            │  parse → approve? → run    │     approval gate
            └────┬───────────┬───────────┘
                 │           │
  ┌──────────────▼────────┐ ┌▼───────────────────────────┐
  │ tools                 │ │ finish(report)             │
  │ list_files            │ │ → markdown with ## Sources │
  │ search_docs           │ └────────────────────────────┘
  │   BM25 + file:line    │
  │   provenance          │
  │ read_file             │
  │   sandboxed to the    │
  │   corpus folder       │
  └───────────────────────┘
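The loop in the diagram can be sketched in a few dozen lines. This is a minimal sketch, assuming the model replies with one JSON object per step carrying `thought`, `action` and `args`, and that a `finish` action carries the report in `args["report"]`. The names `run_loop`, `parse_action` and `MAX_STEPS` are hypothetical, not taken from agent.py; the step budget and one-shot JSON-repair retry follow the design decisions described in this README.

```python
import json
from typing import Callable, Dict, List

MAX_STEPS = 8  # hypothetical step budget


def parse_action(raw: str) -> dict:
    """Parse one {"thought", "action", "args"} object; raise ValueError if malformed."""
    step = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(step, dict) or "action" not in step:
        raise ValueError("expected a JSON object with an 'action' key")
    step.setdefault("args", {})
    return step


def run_loop(task: str, llm: Callable[[str], str],
             tools: Dict[str, Callable[..., str]],
             max_steps: int = MAX_STEPS) -> str:
    """Think -> tool -> observe until the model calls finish or the budget runs out."""
    history: List[str] = [f"Task: {task}"]
    for _ in range(max_steps):
        prompt = "\n".join(history)
        try:
            step = parse_action(llm(prompt))
        except ValueError as err:
            # One-shot repair: explain the error and ask again; a second failure propagates.
            step = parse_action(llm(f"{prompt}\nInvalid reply ({err}). Return one JSON action."))
        if step["action"] == "finish":
            return step["args"].get("report", "")
        tool = tools.get(step["action"])
        observation = tool(**step["args"]) if tool else f"Unknown tool: {step['action']}"
        # The observation is fed back into the next prompt.
        history.append(f"Action: {json.dumps(step)}")
        history.append(f"Observation: {observation}")
    # Budget exhausted: demand a final report instead of researching forever.
    history.append("Step budget exhausted. Reply with a finish action now.")
    return parse_action(llm("\n".join(history)))["args"].get("report", "")
```

Passing `llm` as a plain function is what makes a scripted mock possible: any callable that maps a prompt string to a reply string can stand in for the real CLI.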

Quickstart

Requires Python 3.9+ and an LLM CLI. The default backend is the Claude Code CLI (claude -p).

python3 agent.py \
  --docs examples/sample-docs \
  --task "What is the invoice approval process and who signs off on a 60,000 PLN invoice?" \
  --out report.md --verbose

The real output of this exact command is in examples/sample-report.md, with the full step trace in examples/sample-trace.json. The agent listed the corpus, ran one search, and wrote a report citing both policy files, combining the approval path (from invoicing-policy.md) with the threshold tiers (from delegation-matrix.md) to answer who signs off on a 60,000 PLN invoice.

Human-in-the-loop

python3 agent.py --docs <folder> --task "..." --approve

--approve pauses before every tool call so you can approve it or reject it with feedback; the feedback is injected into the agent's context as an observation, steering the next step. It is the same pattern (AI proposes, human signs off) that I use in production tools for accounting workflows, where unsupervised automation is a non-starter.
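A gate like this can sit between parsing and dispatch. This is a sketch, assuming an `approve` callback that returns `(ok, feedback)`; `gated_dispatch` and `console_approver` are hypothetical names, not agent.py's actual interface.

```python
from typing import Callable, Dict, Tuple

Approver = Callable[[str, dict], Tuple[bool, str]]


def console_approver(action: str, args: dict) -> Tuple[bool, str]:
    """Ask on the terminal; empty input or 'y' approves, anything else is feedback."""
    answer = input(f"Run {action}({args})? [y / feedback] ").strip()
    if answer.lower() in ("", "y", "yes"):
        return True, ""
    return False, answer


def gated_dispatch(action: str, args: dict,
                   tools: Dict[str, Callable[..., str]],
                   approve: Approver) -> str:
    """Run the tool only if approved; otherwise return the human's feedback."""
    ok, feedback = approve(action, args)
    if not ok:
        # The rejection becomes the next observation, steering the model's next step.
        return f"Human rejected {action}: {feedback}"
    return tools[action](**args)
```

Because the approver is injected, tests can pass a lambda instead of `console_approver` and never block on `input()`.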

Any LLM backend

The agent shells out to the command named in AGENTFLOW_LLM_CMD, which must read a prompt on stdin and print a completion to stdout:

AGENTFLOW_LLM_CMD="claude -p"                 # default
AGENTFLOW_LLM_CMD="ollama run llama3"         # local model
AGENTFLOW_LLM_CMD="python3 tests/mock_llm.py" # deterministic mock (tests)
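A minimal sketch of that shell-out, assuming the command string is split with `shlex.split` and the prompt is written to the child's stdin. `call_llm` and the 300-second timeout are hypothetical; agent.py's own error handling may differ.

```python
import os
import shlex
import subprocess

DEFAULT_CMD = "claude -p"


def call_llm(prompt: str, cmd: str = "", timeout: float = 300.0) -> str:
    """Send the prompt on stdin to the configured CLI and return its stdout."""
    cmd = cmd or os.environ.get("AGENTFLOW_LLM_CMD", DEFAULT_CMD)
    result = subprocess.run(
        shlex.split(cmd),    # "ollama run llama3" -> ["ollama", "run", "llama3"]
        input=prompt,        # the prompt goes to the child's stdin
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"LLM command failed ({result.returncode}): {result.stderr.strip()}")
    return result.stdout.strip()
```

Any program that follows the stdin/stdout contract works, which is why swapping a hosted model for a local one or a mock is a one-variable change.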

Design decisions

  • No framework. The point is to show the mechanics. A ReAct loop is ~150 lines; reading them beats trusting a black box.
  • Grounding is enforced by structure, not vibes. The only knowledge source is the tool output; search results carry file:start-end spans; the system prompt requires citations and an explicit "the corpus doesn't say" when retrieval comes up empty.
  • Lexical retrieval (BM25), not embeddings. For small project or client corpora, where queries are terminology-heavy, lexical search with zero dependencies and zero index maintenance wins. The scorer includes a cheap bidirectional-prefix match that handles inflection in morphology-rich languages (built with Polish in mind: faktura / faktury / fakturze).
  • Sandboxed tools. read_file resolves paths and refuses anything outside the corpus folder.
  • Bounded loop. A step budget forces a final report ("here's what I found and what remains unknown") instead of an infinite research spiral; a one-shot JSON-repair retry handles malformed model output.
  • Testable without an API. tests/mock_llm.py replays a fixed action sequence, so CI can exercise the full loop — parsing, dispatch, sandbox, report writing — deterministically.
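The retrieval idea can be sketched compactly. Files are split into fixed-size line windows so every hit maps back to a `file:start-end` span, and a query term matches a token when the two are equal or when either is a prefix of the other. That is one plausible reading of "bidirectional prefix", not necessarily the exact rule in agent.py. The window size, the 5-character minimum, `k1`/`b`, and the in-memory `{filename: text}` input are all assumptions.

```python
import math
import re
from typing import Dict, List, Tuple

WINDOW = 5          # hypothetical: lines per retrievable chunk
MIN_PREFIX = 5      # hypothetical: shortest token allowed in a prefix match
K1, B = 1.5, 0.75   # commonly used BM25 parameters

TOKEN = re.compile(r"\w+")  # \w is Unicode-aware for str patterns (ą, ę, ż...)


def tokenize(text: str) -> List[str]:
    return [t.lower() for t in TOKEN.findall(text)]


def matches(q: str, tok: str) -> bool:
    """Exact match, or either token is a prefix of the other (both long enough)."""
    if q == tok:
        return True
    if min(len(q), len(tok)) < MIN_PREFIX:
        return False
    return tok.startswith(q) or q.startswith(tok)


def chunk(files: Dict[str, str]) -> List[Tuple[str, int, int, List[str]]]:
    """Split each file into WINDOW-line chunks tagged with 1-based line spans."""
    out = []
    for name, text in files.items():
        lines = text.splitlines()
        for start in range(0, len(lines), WINDOW):
            window = lines[start:start + WINDOW]
            out.append((name, start + 1, start + len(window), tokenize("\n".join(window))))
    return out


def search(query: str, files: Dict[str, str], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return up to top_k (file:start-end, score) hits; [] means the corpus doesn't say."""
    chunks = chunk(files)
    if not chunks:
        return []
    n = len(chunks)
    avgdl = sum(len(c[3]) for c in chunks) / n or 1.0
    terms = set(tokenize(query))
    # Document frequency per query term, counting prefix matches too.
    df = {q: sum(1 for c in chunks if any(matches(q, t) for t in c[3])) for q in terms}
    hits = []
    for name, lo, hi, toks in chunks:
        score = 0.0
        for q in terms:
            tf = sum(1 for t in toks if matches(q, t))
            if tf:
                idf = math.log(1 + (n - df[q] + 0.5) / (df[q] + 0.5))
                norm = K1 * (1 - B + B * len(toks) / avgdl)
                score += idf * tf * (K1 + 1) / (tf + norm)
        if score > 0:
            hits.append((f"{name}:{lo}-{hi}", score))
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:top_k]
```

With the stem `faktur` as the query, faktura, faktury and fakturze all count as matches; the minimum length keeps short tokens from prefix-matching everything. An empty result is what lets the agent say "the corpus doesn't say" instead of guessing.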

Tests

python3 -m unittest discover tests -v

Five tests: retrieval provenance, no-match honesty, path-escape denial, in-corpus read, and the full agent loop with the mock LLM.
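The path-escape and in-corpus tests exercise a check along these lines. This is a sketch, assuming containment is verified with `Path.resolve()` followed by `relative_to()`; `safe_read` and its error strings are hypothetical, and the real read_file may report denials differently.

```python
from pathlib import Path


def safe_read(corpus: str, rel_path: str) -> str:
    """Read a file only if its resolved path stays inside the corpus folder."""
    root = Path(corpus).resolve()
    target = (root / rel_path).resolve()  # collapses "..", follows symlinks
    try:
        # relative_to raises ValueError when target is not under root
        target.relative_to(root)
    except ValueError:
        return f"Denied: {rel_path} is outside the corpus folder"
    if not target.is_file():
        return f"Not found: {rel_path}"
    return target.read_text(encoding="utf-8", errors="replace")
```

Resolving before comparing is the important part: a naive string check on `rel_path` would miss `../` sequences, absolute paths, and symlinks that point out of the folder.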

Limitations (honest ones)

  • BM25 is lexical — conceptual paraphrases may need a reworded query (the agent usually does this itself; the prompt tells it to).
  • One agent, sequential tools. Orchestration of multiple agents is intentionally out of scope — for that I use purpose-built orchestrators; this repo is about making one agent's reasoning legible.
  • Text formats only (.md, .txt, .rst, .csv, .org, .adoc). Convert PDFs first.
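A `list_files`-style tool can enforce that restriction directly. In this sketch the extension set comes from the list above, while the function shape and the corpus-relative, sorted output are assumptions.

```python
from pathlib import Path
from typing import List

TEXT_EXTS = {".md", ".txt", ".rst", ".csv", ".org", ".adoc"}


def list_files(corpus: str) -> List[str]:
    """Return corpus-relative paths of supported text files, sorted for stable output."""
    root = Path(corpus).resolve()
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in TEXT_EXTS  # case-insensitive extension check
    )
```

Sorting keeps the tool output deterministic, which matters when a mocked test run compares traces.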

License

MIT — see LICENSE.
