🧪 Agentic Test Generator

Multi-agent system that converts an OpenAPI/Swagger spec into a comprehensive, ready-to-run pytest suite — automatically.

What it does

Paste an OpenAPI spec. Get a complete pytest suite — functional tests, boundary checks, edge cases, and robustness tests — with a conftest.py, a Markdown report, and live streaming progress in the browser.

No hand-written test templates. No one-size-fits-all prompts. Each test category uses a different prompting strategy, routed to the model best suited for that type of reasoning.

Architecture

Five specialised agents are orchestrated by LangGraph in a directed graph with a conditional retry loop:

Agent	Role
Scout	Parses the OpenAPI spec (JSON or YAML), resolves `$ref` chains, extracts every endpoint with its schemas, parameters, auth requirements, and response codes. Uses GPT-4o to score each endpoint's criticality (1–10) and identify inter-endpoint dependencies.
Strategist	Decides how many tests of each category to generate per endpoint. Accounts for HTTP method semantics (POSTs need more tests than GETs), criticality score, and whether the endpoint accepts a request body. Produces a structured test plan via a single batched LLM call.
Generator	Executes the test plan concurrently. Each test is generated by one of the prompt strategies registered for its category (up to five), with the test count distributed evenly across them. Uses GPT-4o for functional/boundary tests and Claude Sonnet for edge-case/robustness tests. Validates every output with `ast.parse` and retries on syntax errors.
Critic	Three-phase quality gate: (1) syntax check, (2) URL-target check — confirms the code actually exercises the right endpoint, (3) completeness check — enforces at least one `assert`. Near-duplicate tests are detected via SBERT cosine similarity (all-MiniLM-L6-v2) with Union-Find clustering and queued for regeneration with targeted feedback.
Assembler	Groups validated tests by endpoint into Python classes, generates a `conftest.py` with base-URL and session fixtures, detects auth requirements and adds auth-header fixtures, and produces a Markdown summary report.

Agent flow

graph LR
    A([Start]) --> B[Scout]
    B --> C[Strategist]
    C --> D[Generator]
    D --> E[Critic]
    E -->|retry_queue non-empty\nand under retry cap| D
    E -->|done| F[Assembler]
    F --> G([End])

    style D fill:#4C9BE8,color:#fff
    style E fill:#E76F51,color:#fff

The retry edge fires when the Critic places tests in retry_queue (because they failed validation or were detected as near-duplicates) and at least one slot is still below the per-slot retry cap of 2 retries. Each RetryRequest carries the Critic's targeted feedback message, which the Generator uses as the prompt for the replacement test.
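
The routing decision behind the retry edge can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the code in `workflow.py`: the `slot_id` and `attempts` fields, the `PipelineState` shape, and the node names `"generator"` and `"assembler"` are hypothetical; only `retry_queue`, `RetryRequest`, the feedback message, and the cap of 2 come from the description above.

```python
from typing import TypedDict

MAX_RETRIES_PER_SLOT = 2  # per-slot retry cap stated in the README


class RetryRequest(TypedDict):
    slot_id: str    # hypothetical: which planned test to regenerate
    feedback: str   # the Critic's targeted feedback message
    attempts: int   # hypothetical: retries already spent on this slot


class PipelineState(TypedDict):
    retry_queue: list[RetryRequest]


def route_after_critic(state: PipelineState) -> str:
    """Return the next node: 'generator' to retry, otherwise 'assembler'."""
    retryable = [
        request
        for request in state["retry_queue"]
        if request["attempts"] < MAX_RETRIES_PER_SLOT
    ]
    return "generator" if retryable else "assembler"
```

In LangGraph, a function of this shape is typically registered with `add_conditional_edges` on the Critic node, so the graph itself stays declarative and the retry policy lives in one place.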

Tech stack

Technology	Why
LangGraph	Stateful agent orchestration with typed shared state (`TypedDict`), conditional edges, and built-in recursion limiting. Avoids the glue code of rolling a custom DAG executor.
LiteLLM	Single unified API surface for OpenAI and Anthropic. Swapping models or adding providers requires changing one string — no SDK-specific code anywhere in the codebase.
GPT-4o	Fastest and most cost-effective for structured, schema-driven test generation (functional and boundary categories).
Claude Sonnet	Stronger adversarial and creative reasoning; significantly better at identifying subtle failure modes for edge-case and robustness tests.
FastAPI	Async-native Python API framework. Server-Sent Events for real-time streaming progress require no third-party library — just `StreamingResponse` with `text/event-stream`.
Streamlit	Rapid browser UI without JavaScript. `st.empty()` placeholder-based progressive rendering gives the SSE stream a live feel.
SBERT (all-MiniLM-L6-v2)	Semantic similarity for near-duplicate detection. Token-set (Jaccard) similarity misses paraphrased duplicates; SBERT catches them. The model is loaded lazily to avoid download-at-import cost.
Pydantic v2	Typed models for all data structures in shared state. Alias support (`alias="in"`) handles OpenAPI's reserved-keyword field names without schema hacks.
LangSmith	Zero-config observability — trace every LLM call, token count, and latency per run. Enabled by setting two environment variables.
ChromaDB	Vector store for the planned RAG layer (see Future Work). Already scaffolded in `src/rag/`.
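
To make the near-duplicate detection from the SBERT row more concrete, here is a dependency-free sketch of cosine similarity combined with Union-Find clustering. It assumes the embeddings have already been computed (the real system gets them from all-MiniLM-L6-v2); the function names and the 0.9 threshold are illustrative assumptions, not values taken from `similarity.py`.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_near_duplicates(
    vectors: list[list[float]], threshold: float = 0.9  # assumed threshold
) -> list[list[int]]:
    """Return index clusters (size > 1) of vectors linked by similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Link every pair whose similarity clears the threshold.
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                union(i, j)

    clusters: dict[int, list[int]] = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return [members for members in clusters.values() if len(members) > 1]
```

Union-Find makes the grouping transitive: if test A resembles B and B resembles C, all three land in one cluster even when A and C fall just below the threshold, so a single representative can be kept and the rest queued for regeneration.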

Research foundation

The prompt strategy registry in src/prompts/registry.py is grounded in empirical research on LLM test generation. The key findings that shaped this system's design:

Prompt strategy matters more than model size. Across functional test generation benchmarks, a well-chosen prompting strategy (e.g. chain-of-thought decomposition) applied to a mid-tier model consistently outperforms a naive single-shot prompt sent to a frontier model. This is why the system maintains a curated registry of five distinct strategies rather than relying on a single general-purpose prompt.

Self-refine harms robustness test diversity. The self-refine strategy (generate → critique → rewrite) is effective for functional and edge-case tests, where a second LLM pass sharpens assertion specificity. For robustness tests, however, the critique stage systematically steers the model away from unusual inputs toward "reasonable" ones — the exact opposite of what robustness testing requires. As a result, self_refine is explicitly excluded from the robustness strategy map.
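
A strategy map with this exclusion might look like the sketch below. Only `self_refine` and its removal from the robustness category come from the text above; the other four strategy names, the category keys, and the choice to give the remaining categories all five strategies are placeholders, and the real `STRATEGY_MAP` in src/prompts/registry.py may be organised differently.

```python
from itertools import cycle

# Hypothetical strategy names; only "self_refine" is named in the README.
ALL_STRATEGIES = ["zero_shot", "few_shot", "chain_of_thought", "self_refine", "persona"]

STRATEGY_MAP = {
    "functional": ALL_STRATEGIES,
    "boundary": ALL_STRATEGIES,   # assumption: all five apply here
    "edge_case": ALL_STRATEGIES,
    # self_refine is left out so the critique pass cannot normalise unusual inputs.
    "robustness": [s for s in ALL_STRATEGIES if s != "self_refine"],
}


def assign_strategies(category: str, count: int) -> list[str]:
    """Spread `count` tests across the category's strategies round-robin."""
    strategies = cycle(STRATEGY_MAP[category])
    return [next(strategies) for _ in range(count)]
```

Round-robin assignment keeps the per-strategy counts within one of each other, which is one simple way to distribute tests evenly across strategies.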

Per-category model routing improves coverage. Functional and boundary tests benefit from the structured, schema-faithful completions GPT-4o produces quickly and cheaply. Edge-case and robustness tests require a model that reasons about adversarial inputs and failure modes rather than happy-path schemas — Claude Sonnet fills this role significantly better on the same prompts.
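
Per-category routing can be as small as a lookup table. The sketch below is an assumption about its shape rather than the project's code: the category keys are hypothetical, and `"claude-sonnet"` is a placeholder because the README does not pin a specific Claude Sonnet version.

```python
# Category keys and the Claude identifier are placeholders for illustration.
MODEL_BY_CATEGORY = {
    "functional": "gpt-4o",
    "boundary": "gpt-4o",
    "edge_case": "claude-sonnet",   # placeholder, substitute the real model string
    "robustness": "claude-sonnet",  # placeholder, substitute the real model string
}


def model_for(category: str) -> str:
    """Return the model identifier to use for a test category."""
    try:
        return MODEL_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unknown test category: {category!r}") from None
```

With LiteLLM, the returned string would be passed as the `model` argument of `litellm.completion`, which is what makes swapping a model a one-string change.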

Setup

Prerequisites

Python 3.11+
An OpenAI API key
An Anthropic API key
(Optional) A LangSmith API key for tracing
Docker and Docker Compose (for containerised deployment)

1. Clone the repository

git clone https://github.com/your-username/agentic-test-generator.git
cd agentic-test-generator

2. Configure environment variables

cp .env.example .env

Edit .env:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional — enable LangSmith tracing
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agentic-test-generator
LANGSMITH_TRACING=true

3. Install dependencies

pip install -e ".[dev]"

4. Run locally

Start the backend in one terminal:

python -m src.api.main
# FastAPI running on http://localhost:8000
# Interactive docs at http://localhost:8000/docs

Start the frontend in another terminal:

streamlit run frontend/app.py
# Streamlit running on http://localhost:8501

5. Run with Docker

docker compose up --build

Service	URL
Streamlit UI	http://localhost:8501
FastAPI backend	http://localhost:8000
API docs (Swagger)	http://localhost:8000/docs

Usage

Streamlit UI

1. Open http://localhost:8501
2. Paste an OpenAPI spec into the Paste spec tab, or upload a .json/.yaml file via Upload file
3. Click ⚡ Generate Test Suite
4. Watch the five agents run in real time; each step updates as it completes
5. Results appear in three tabs: Test Suite, Conftest, and Report
6. Use the download buttons to save tests.py and conftest.py

API — curl examples

Synchronous generation:

# jq -Rs reads the file as one raw string and JSON-escapes it for the request body
jq -Rs '{spec: .}' data/petstore.yaml \
  | curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      --data-binary @- \
  | jq '{total: .stats.total, by_category: .stats.by_category}'

Streaming generation (SSE):

# Build the JSON body with jq so newlines and quotes in the YAML are escaped
jq -Rs '{spec: .}' data/petstore.yaml \
  | curl -N -X POST http://localhost:8000/generate/stream \
      -H "Content-Type: application/json" \
      --data-binary @-
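
For clients that consume the stream programmatically, the following sketch parses Server-Sent Events from an iterable of text lines. It follows the generic SSE wire format only; the helper name `iter_sse_events` is hypothetical, and the event names and payloads emitted by `/generate/stream` are not documented here, so the sample values in the usage note are made up.

```python
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {'event': ..., 'data': ...} for each event in an SSE line stream."""
    event, data = "message", []  # "message" is the SSE default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # The format allows one optional space after the colon.
            data.append(line[len("data:"):].removeprefix(" "))
```

Feeding it the lines `event: progress`, `data: scout finished`, and a blank line would yield one event of type `progress`. Any HTTP client that can iterate over response lines can drive it.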

Download generated files:

# After a successful /generate call:
curl http://localhost:8000/download/tests.py -o tests.py
curl http://localhost:8000/download/conftest.py -o conftest.py

Health check:

curl http://localhost:8000/health
# {"status":"ok","version":"0.1.0"}

Example output

Given a POST /users endpoint with an email + password request body, the system generates tests like:

class TestPostUsers:
    """Tests for POST /users."""

    def test_post_users_functional_happy_path(self, http_session, base_url):
        """Functional: valid user creation returns 201 with id field."""
        payload = {"email": "alice@example.com", "password": "S3cur3P@ss!"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 201
        body = response.json()
        assert "id" in body
        assert body["email"] == payload["email"]

    def test_post_users_boundary_max_length_email(self, http_session, base_url):
        """Boundary: email with a 64-character local part is accepted or rejected with 422."""
        long_local = "a" * 64
        domain = "example.com"
        payload = {"email": f"{long_local}@{domain}", "password": "ValidPass1!"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code in (201, 422)

    def test_post_users_edge_case_sql_injection_email(self, http_session, base_url):
        """Edge case: SQL injection attempt in email field is rejected."""
        payload = {"email": "' OR '1'='1", "password": "irrelevant"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 422
        assert "id" not in response.json()

    def test_post_users_robustness_missing_required_field(self, http_session, base_url):
        """Robustness: missing password field returns 422 Unprocessable Entity."""
        payload = {"email": "bob@example.com"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 422
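
The class name in the example follows a method-plus-path pattern. The sketch below shows one way the Assembler's grouping step could derive such names and bucket tests by endpoint; it is an illustration, not the code in `assembler.py`, and the helper names and the `method`/`path`/`name` dictionary keys are hypothetical.

```python
import re
from collections import defaultdict


def class_name_for(method: str, path: str) -> str:
    """Derive a pytest class name, e.g. 'POST' + '/users' gives 'TestPostUsers'."""
    words = re.findall(r"[A-Za-z0-9]+", path)
    return "Test" + method.capitalize() + "".join(w[:1].upper() + w[1:] for w in words)


def group_by_endpoint(tests: list[dict]) -> dict[str, list[str]]:
    """Map each class name to the test function names that belong to it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for test in tests:
        grouped[class_name_for(test["method"], test["path"])].append(test["name"])
    return dict(grouped)
```

Path parameters are folded into the name, so `GET /users/{id}` becomes `TestGetUsersId` and stays distinct from `TestGetUsers`.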

Project structure

agentic-test-generator/
├── src/
│   ├── agents/
│   │   ├── scout.py          # OpenAPI parser + criticality enrichment
│   │   ├── strategist.py     # Test plan generation
│   │   ├── generator.py      # Async parallel test code generation
│   │   ├── critic.py         # Validation, dedup, retry queue
│   │   └── assembler.py      # Suite assembly + conftest + report
│   ├── graph/
│   │   ├── state.py          # LangGraph TypedDict state + Pydantic models
│   │   └── workflow.py       # Graph definition, retry edge, run_pipeline()
│   ├── prompts/
│   │   ├── registry.py       # STRATEGY_MAP and get_prompt() API
│   │   ├── strategies/       # five prompt strategy modules
│   │   └── examples/         # few-shot example snippets
│   ├── utils/
│   │   ├── parser.py         # $ref-resolving OpenAPI parser
│   │   ├── cleaner.py        # LLM output cleaning + ast.parse validation
│   │   └── similarity.py     # SBERT cosine sim + Union-Find dedup
│   ├── rag/
│   │   ├── store.py          # ChromaDB vector store (scaffolded)
│   │   └── retriever.py      # Retrieval logic (scaffolded)
│   └── api/
│       └── main.py           # FastAPI app — /generate, /generate/stream, /download
├── frontend/
│   └── app.py                # Streamlit UI with live SSE progress
├── tests/                    # pytest test suite
├── data/                     # Sample OpenAPI specs
├── Dockerfile                # Backend container
├── Dockerfile.frontend       # Frontend container
├── docker-compose.yml        # Orchestrated deployment
└── pyproject.toml            # Dependencies and tooling config

Future work

RAG integration for historical test patterns — src/rag/ is already scaffolded with ChromaDB. The next step is to store every validated test in the vector store after each run, then retrieve the most semantically similar historical tests at prompt-construction time. This would let the system learn from past generations and avoid re-discovering the same test patterns from scratch.

A2A protocol support — The five agents currently communicate through LangGraph shared state. Exposing them as independent services speaking the Agent-to-Agent (A2A) protocol would allow third-party agents to invoke individual pipeline stages (e.g. just the Critic as a standalone code-review agent) and would enable heterogeneous multi-agent deployments.

CI/CD integration — A GitHub Actions workflow that runs the generator against a repo's OpenAPI spec on every PR, diffs the generated suite against the committed one, and comments a coverage delta directly on the pull request. Combined with a pytest --tb=short run against the target API in a staging environment, this would close the loop from spec change to test regression automatically.

License

MIT — see LICENSE for details.
