agent-flow

A minimal, transparent multi-step (ReAct) document-analyst agent — one Python file, pure stdlib, no framework. Point it at a folder of documents, give it a task, and it iterates think → tool → observe until it can write a report in which every claim carries file:line provenance.

Built by Marcin J. Ołdak to show what an agentic workflow looks like when you can read all of it in ten minutes: planning loop, tool dispatch, grounding rules, human-approval gate, and a test harness that runs the whole loop with a mock LLM — no API key needed.

            ┌────────────────────────────────────────────┐
            │  LLM (any CLI: `claude -p` by default)     │
            │  returns one JSON action per step          │
            └────────────┬───────────────────────────────┘
                         │ {"thought", "action", "args"}
            ┌────────────▼───────────────┐
            │  agent loop (agent.py)     │◄─── optional human
            │  parse → approve? → run    │     approval gate
            └────┬───────────┬───────────┘
                 │           │
        ┌────────▼──┐  ┌─────▼────────────────────────┐
        │ tools     │  │ finish(report)               │
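        └─────┬─────┘  │ → markdown with ## Sources   │
              │        └──────────────────────────────┘
        ┌─────▼───────────────────────────────────────┐
        │ list_files                                  │
        │ search_docs (BM25 + file:line provenance)   │
        │ read_file (sandboxed to the corpus folder)  │
        └─────────────────────────────────────────────┘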
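
The flow in the diagram can be sketched in a few lines. This is a simplified illustration, not the code in agent.py: the names `run_loop`, `llm`, `tools`, and `max_steps`, and the shape of the prompt, are assumptions made for the example. It shows one JSON action per step, dispatch to a tool, the observation fed back into the next prompt, and a step budget that ends the loop.

```python
import json

def run_loop(llm, tools, task, max_steps=8):
    """Iterate think -> tool -> observe until `finish` or the budget runs out.

    `llm` is any callable that maps a prompt string to a JSON string
    (hypothetical interface, chosen for this sketch).
    """
    history = []  # every action and its observation, replayed in each prompt
    for _ in range(max_steps):
        prompt = json.dumps({"task": task, "history": history})
        step = json.loads(llm(prompt))  # {"thought": ..., "action": ..., "args": {...}}
        action, args = step["action"], step.get("args", {})
        if action == "finish":
            return args.get("report", "")
        tool = tools.get(action)
        # An unknown action is reported back as an observation, not raised.
        observation = tool(**args) if tool else f"unknown tool: {action}"
        history.append({"action": action, "args": args, "observation": observation})
    return "Step budget exhausted; reporting what was found so far."

# A scripted stand-in for the model: two steps, then done.
script = iter([
    '{"thought": "look around", "action": "list_files", "args": {}}',
    '{"thought": "done", "action": "finish", "args": {"report": "## Sources\\n- a.md:1"}}',
])
report = run_loop(lambda prompt: next(script), {"list_files": lambda: ["a.md"]}, "demo")
```

The scripted callable plays the same role as a mock LLM: the loop cannot tell it apart from a real backend.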

Quickstart

Requires Python 3.9+ and an LLM CLI. The default backend is the Claude Code CLI (claude -p).

python3 agent.py \
  --docs examples/sample-docs \
  --task "What is the invoice approval process and who signs off on a 60,000 PLN invoice?" \
  --out report.md --verbose

Real output of exactly this command: examples/sample-report.md (and the full step trace in examples/sample-trace.json). The agent listed the corpus, ran one search, and wrote a report citing both policy files, combining the approval path (from invoicing-policy.md) with the threshold tiers (from delegation-matrix.md) to answer who signs off on a 60,000 PLN invoice.

Human-in-the-loop

python3 agent.py --docs <folder> --task "..." --approve

--approve stops before every tool call: approve it, or reject it with feedback. Rejection feedback is injected into the agent's context as an observation and steers the next step. This is the same pattern (AI proposes, human signs off) that I use in production tools for accounting workflows, where unsupervised automation is a non-starter.
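
A minimal sketch of such a gate, assuming a simple y/feedback prompt. The function names `approval_gate` and `gated_call`, the prompt wording, and the injectable `ask` parameter are hypothetical and not taken from agent.py. The point it illustrates is that a rejection is not an error: it becomes the observation the agent sees next.

```python
def approval_gate(action, args, ask=input):
    """Return (approved, feedback). `ask` is injectable so tests need no terminal."""
    answer = ask(f"Run {action}({args})? [y to approve, or type feedback to reject] ").strip()
    if answer.lower() in ("y", "yes"):
        return True, None
    return False, answer or "rejected without feedback"

def gated_call(tool, action, args, ask=input):
    """Run the tool only if a human approves; otherwise return the feedback."""
    approved, feedback = approval_gate(action, args, ask)
    if approved:
        return tool(**args)
    # The rejection text is handed back as the observation for the next step.
    return f"Human rejected {action}: {feedback}"

# Simulated reviewer answers instead of a live terminal.
ok = gated_call(lambda path: f"contents of {path}", "read_file",
                {"path": "policy.md"}, ask=lambda _: "y")
no = gated_call(lambda path: f"contents of {path}", "read_file",
                {"path": "policy.md"}, ask=lambda _: "search before reading")
```

Here `ok` holds the tool result, while `no` holds the reviewer's feedback wrapped as an observation.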

Any LLM backend

The agent shells out to a CLI that reads a prompt on stdin and prints a completion:

AGENTFLOW_LLM_CMD="claude -p"                 # default
AGENTFLOW_LLM_CMD="ollama run llama3"         # local model
AGENTFLOW_LLM_CMD="python3 tests/mock_llm.py" # deterministic mock (tests)
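
As an illustration of that contract (prompt on stdin, completion on stdout), here is a sketch built on the standard library. The function name `call_llm`, the timeout value, and the error handling are assumptions for this example; only the AGENTFLOW_LLM_CMD variable and its default come from this README.

```python
import os
import shlex
import subprocess

def call_llm(prompt, timeout=120):
    """Send `prompt` to the configured CLI on stdin and return its stdout."""
    cmd = os.environ.get("AGENTFLOW_LLM_CMD", "claude -p")
    result = subprocess.run(
        shlex.split(cmd),      # split like a shell would, without invoking one
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"LLM command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout.strip()
```

Because the command is split with `shlex` and run without a shell, any program that honors the stdin/stdout contract can stand in for the model.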

Design decisions

No framework. The point is to show the mechanics. A ReAct loop is ~150 lines; reading them beats trusting a black box.
Grounding is enforced by structure, not vibes. The only knowledge source is the tool output; search results carry file:start-end spans; the system prompt requires citations and an explicit "the corpus doesn't say" when retrieval comes up empty.
Lexical retrieval (BM25), not embeddings. For small project/client corpora, terminology-heavy search with zero dependencies and zero index maintenance wins. The scorer includes a cheap bidirectional-prefix match that handles inflection in morphology-rich languages (built with Polish in mind: faktura / faktury / fakturze).
Sandboxed tools. read_file resolves paths and refuses anything outside the corpus folder.
Bounded loop. A step budget forces a final report ("here's what I found and what remains unknown") instead of an infinite research spiral; a one-shot JSON-repair retry handles malformed model output.
Testable without an API. tests/mock_llm.py replays a fixed action sequence, so CI can exercise the full loop — parsing, dispatch, sandbox, report writing — deterministically.
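
To make the retrieval point above concrete, here is a compact BM25 scorer with a prefix-tolerant term match. It is a sketch under stated assumptions, not the code in retrieval.py: the matching rule (two tokens match when either starts with the other's stem, where the stem drops the last two characters), the half weight for non-exact matches, and all function names are hypothetical. Provenance spans are left out to keep it short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def prefix_match(a, b, min_stem=4, trim=2):
    """Hypothetical rule: match if either token starts with the other's stem."""
    if a == b:
        return True
    if min(len(a), len(b)) < min_stem:
        return False  # short words must match exactly
    stem_a = a[:max(min_stem, len(a) - trim)]
    stem_b = b[:max(min_stem, len(b) - trim)]
    return a.startswith(stem_b) or b.startswith(stem_a)

def bm25_scores(query, docs, k1=1.5, b=0.75, partial_weight=0.5):
    """Score each document against the query with BM25 over fuzzy term counts."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for q in set(tokenize(query)):
            # Exact hits count fully, prefix-only hits count partially.
            tf = sum(c * (1.0 if t == q else partial_weight)
                     for t, c in counts.items() if prefix_match(q, t))
            if tf == 0:
                continue
            df = sum(1 for other in tokenized
                     if any(prefix_match(q, t) for t in set(other)))
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Każda faktura wymaga akceptacji.",            # "Every invoice requires approval."
    "Urlop zgłasza się z wyprzedzeniem.",          # "Leave is requested in advance."
    "Kwota na fakturze musi być zgodna z umową.",  # "The invoice amount must match the contract."
]
scores = bm25_scores("faktury", docs)
```

With the query "faktury", the first and third sentences score above zero through the shared stem, and the leave-request sentence scores 0.0 because none of its tokens match.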

Tests

python3 -m unittest discover tests -v

Five tests: retrieval provenance, no-match honesty, path-escape denial, in-corpus read, and the full agent loop with the mock LLM.
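
The path-escape and in-corpus cases can be pictured with a small sandbox check. This is a sketch, assuming the check is done by resolving the path and comparing it to the corpus root; the function name `read_file_sandboxed` and the choice of `PermissionError` are hypothetical, not taken from agent.py or the test suite.

```python
import tempfile
from pathlib import Path

def read_file_sandboxed(corpus_dir, relative_path):
    """Read a file only if its resolved location is inside `corpus_dir`."""
    root = Path(corpus_dir).resolve()
    # resolve() collapses ".." and follows symlinks, so the check below
    # sees the real location rather than the string the model supplied.
    target = (root / relative_path).resolve()
    if root not in target.parents:
        raise PermissionError(f"{relative_path} is outside the corpus folder")
    return target.read_text(encoding="utf-8")

with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    corpus.mkdir()
    (corpus / "policy.md").write_text("Invoices need approval.\n", encoding="utf-8")
    (Path(tmp) / "secret.txt").write_text("not for the agent\n", encoding="utf-8")

    inside = read_file_sandboxed(corpus, "policy.md")  # allowed
    try:
        read_file_sandboxed(corpus, "../secret.txt")   # escapes the corpus
        denied = False
    except PermissionError:
        denied = True
```

An absolute path is rejected by the same check, because joining it onto the root replaces the root entirely.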

Limitations (honest ones)

BM25 is lexical — conceptual paraphrases may need a reworded query (the agent usually does this itself; the prompt tells it to).
One agent, sequential tools. Orchestration of multiple agents is intentionally out of scope — for that I use purpose-built orchestrators; this repo is about making one agent's reasoning legible.
Text formats only (.md, .txt, .rst, .csv, .org, .adoc). Convert PDFs first.

License

MIT — see LICENSE.

