Multi-agent system that converts an OpenAPI/Swagger spec into a comprehensive, ready-to-run pytest suite — automatically.
Paste an OpenAPI spec. Get a complete pytest suite — functional tests, boundary checks, edge cases, and robustness tests — with a conftest.py, a Markdown report, and live streaming progress in the browser.
No hand-written test templates. No one-size-fits-all prompts. Each test category uses a different prompting strategy, routed to the model best suited for that type of reasoning.
Five specialised agents are orchestrated by LangGraph in a directed graph with a conditional retry loop:
| Agent | Role |
|---|---|
| Scout | Parses the OpenAPI spec (JSON or YAML), resolves $ref chains, extracts every endpoint with its schemas, parameters, auth requirements, and response codes. Uses GPT-4o to score each endpoint's criticality (1–10) and identify inter-endpoint dependencies. |
| Strategist | Decides how many tests of each category to generate per endpoint. Accounts for HTTP method semantics (POSTs need more tests than GETs), criticality score, and whether the endpoint accepts a request body. Produces a structured test plan via a single batched LLM call. |
| Generator | Executes the test plan concurrently. Each test is generated by one of five prompt strategies, with the test count distributed evenly across the strategies enabled for its category. Uses GPT-4o for functional/boundary tests and Claude Sonnet for edge-case/robustness tests. Validates every output with ast.parse and retries on syntax errors. |
| Critic | Three-phase quality gate: (1) syntax check, (2) URL-target check — confirms the code actually exercises the right endpoint, (3) completeness check — enforces at least one assert. Near-duplicate tests are detected via SBERT cosine similarity (all-MiniLM-L6-v2) with Union-Find clustering and queued for regeneration with targeted feedback. |
| Assembler | Groups validated tests by endpoint into Python classes, generates a conftest.py with base-URL and session fixtures, detects auth requirements and adds auth-header fixtures, and produces a Markdown summary report. |
graph LR
A([Start]) --> B[Scout]
B --> C[Strategist]
C --> D[Generator]
D --> E[Critic]
E -->|retry_queue non-empty\nand under retry cap| D
E -->|done| F[Assembler]
F --> G([End])
style D fill:#4C9BE8,color:#fff
style E fill:#E76F51,color:#fff
The retry edge fires when the Critic has placed tests in retry_queue (because they failed validation or were flagged as near-duplicates) and at least one slot is still below the per-slot retry cap of 2 retries. Each RetryRequest carries the Critic's targeted feedback message, which the Generator uses as the prompt for the replacement test.
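
The routing decision behind the retry edge can be sketched as a plain function of the shared state. This is an illustration under assumptions: the fields `slot_id` and `attempts` and the function name `route_after_critic` are hypothetical, and only `retry_queue`, `RetryRequest`, and the cap of 2 come from the description above.

```python
from dataclasses import dataclass

MAX_RETRIES = 2  # per-slot retry cap stated in the text


@dataclass
class RetryRequest:
    slot_id: str       # hypothetical identifier of the test slot being replaced
    feedback: str      # the Critic's targeted feedback message
    attempts: int = 0  # hypothetical counter of retries already used by this slot


def route_after_critic(state: dict) -> str:
    """Return the next node name: "generator" to retry, otherwise "assembler"."""
    queue = state.get("retry_queue", [])
    # Retry only if at least one queued slot is still under the cap.
    if any(request.attempts < MAX_RETRIES for request in queue):
        return "generator"
    return "assembler"
```

A function of this shape is what LangGraph's conditional edges expect: it inspects the state and returns the name of the next node.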
| Technology | Why |
|---|---|
| LangGraph | Stateful agent orchestration with typed shared state (TypedDict), conditional edges, and built-in recursion limiting. Avoids the glue code of rolling a custom DAG executor. |
| LiteLLM | Single unified API surface for OpenAI and Anthropic. Swapping models or adding providers requires changing one string — no SDK-specific code anywhere in the codebase. |
| GPT-4o | Fastest and most cost-effective for structured, schema-driven test generation (functional and boundary categories). |
| Claude Sonnet | Stronger adversarial and creative reasoning; significantly better at identifying subtle failure modes for edge-case and robustness tests. |
| FastAPI | Async-native Python API framework. Server-Sent Events for real-time streaming progress require no third-party library — just StreamingResponse with text/event-stream. |
| Streamlit | Rapid browser UI without JavaScript. st.empty() placeholder-based progressive rendering gives the SSE stream a live feel. |
| SBERT (all-MiniLM-L6-v2) | Semantic similarity for near-duplicate detection. Token-set (Jaccard) similarity misses paraphrased duplicates; SBERT catches them. The model is loaded lazily to avoid download-at-import cost. |
| Pydantic v2 | Typed models for all data structures in shared state. Alias support (alias="in") handles OpenAPI's reserved-keyword field names without schema hacks. |
| LangSmith | Zero-config observability: trace every LLM call, token count, and latency per run. Enabled by setting two environment variables, LANGSMITH_TRACING and LANGSMITH_API_KEY; LANGSMITH_PROJECT optionally names the project. |
| ChromaDB | Vector store for the planned RAG layer (see Future Work). Already scaffolded in src/rag/. |
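
The near-duplicate detection mentioned in the SBERT row combines pairwise cosine similarity with Union-Find clustering. The sketch below shows that combination on plain Python lists; in the real pipeline the vectors would be SBERT embeddings. The function names and the 0.9 threshold are assumptions chosen for illustration, not values taken from the project.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_duplicates(vectors, threshold=0.9):
    """Group indices whose vectors are transitively similar above `threshold`.

    Returns only clusters with more than one member, i.e. the duplicates.
    The threshold default is an assumed value.
    """
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    # Keep the smaller index as root so the earliest test survives.
                    parent[max(root_i, root_j)] = min(root_i, root_j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return [members for members in clusters.values() if len(members) > 1]
```

Within each cluster, one member could be kept and the rest queued for regeneration. Union-Find makes the grouping transitive: if A resembles B and B resembles C, all three land in one cluster even when A and C fall just under the threshold.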
The prompt strategy registry in src/prompts/registry.py is grounded in empirical research on LLM test generation. The key findings that shaped this system's design:
Prompt strategy matters more than model size. Across functional test generation benchmarks, a well-chosen prompting strategy (e.g. chain-of-thought decomposition) applied to a mid-tier model consistently outperforms a naive single-shot prompt sent to a frontier model. This is why the system maintains a curated registry of five distinct strategies rather than relying on a single general-purpose prompt.
Self-refine harms robustness test diversity. The self-refine strategy (generate → critique → rewrite) is effective for functional and edge-case tests, where a second LLM pass sharpens assertion specificity. For robustness tests, however, the critique stage systematically steers the model away from unusual inputs toward "reasonable" ones — the exact opposite of what robustness testing requires. As a result, self_refine is explicitly excluded from the robustness strategy map.
Per-category model routing improves coverage. Functional and boundary tests benefit from the structured, schema-faithful completions GPT-4o produces quickly and cheaply. Edge-case and robustness tests require a model that reasons about adversarial inputs and failure modes rather than happy-path schemas — Claude Sonnet fills this role significantly better on the same prompts.
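
The two routing decisions above (which strategies a category may use, and which model serves it) can be sketched as two lookup tables. Only `STRATEGY_MAP`, `self_refine`, chain-of-thought, and the GPT-4o/Claude Sonnet split come from this document. The other strategy names, the model identifier strings, the `plan_generation` helper, and the choice to enable all five strategies for the boundary category are placeholders assumed for illustration.

```python
from itertools import cycle, islice

# Hypothetical names for the five strategies; only self_refine and
# chain_of_thought are named in the text.
ALL_STRATEGIES = ["zero_shot", "few_shot", "chain_of_thought", "self_refine", "role_based"]

STRATEGY_MAP = {
    "functional": ALL_STRATEGIES,
    "boundary": ALL_STRATEGIES,
    "edge_case": ALL_STRATEGIES,
    # self_refine is excluded here: its critique pass narrows input diversity.
    "robustness": [s for s in ALL_STRATEGIES if s != "self_refine"],
}

# Placeholder model identifiers; the real strings depend on the provider setup.
MODEL_FOR_CATEGORY = {
    "functional": "gpt-4o",
    "boundary": "gpt-4o",
    "edge_case": "claude-sonnet",
    "robustness": "claude-sonnet",
}


def plan_generation(category: str, count: int) -> list[tuple[str, str]]:
    """Assign a (strategy, model) pair to each of `count` tests, round-robin."""
    strategies = STRATEGY_MAP[category]
    model = MODEL_FOR_CATEGORY[category]
    return [(strategy, model) for strategy in islice(cycle(strategies), count)]
```

Round-robin assignment keeps the distribution across strategies as even as the test count allows, and keeping the exclusion in the map means no caller can request `self_refine` for a robustness test by accident.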
- Python 3.11+
- An OpenAI API key
- An Anthropic API key
- (Optional) A LangSmith API key for tracing
- Docker and Docker Compose (for containerised deployment)
git clone https://github.com/your-username/agentic-test-generator.git
cd agentic-test-generator
cp .env.example .env

Edit .env:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
# Optional — enable LangSmith tracing
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agentic-test-generator
LANGSMITH_TRACING=true

pip install -e ".[dev]"

Start the backend in one terminal:
python -m src.api.main
# FastAPI running on http://localhost:8000
# Interactive docs at http://localhost:8000/docs

Start the frontend in another terminal:
streamlit run frontend/app.py
# Streamlit running on http://localhost:8501

docker compose up --build

| Service | URL |
|---|---|
| Streamlit UI | http://localhost:8501 |
| FastAPI backend | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |
- Open http://localhost:8501
- Paste an OpenAPI spec into the Paste spec tab, or upload a .json/.yaml file via Upload file
- Click ⚡ Generate Test Suite
- Watch the five agents run in real time; each step updates as it completes
- Results appear in three tabs: Test Suite, Conftest, and Report
- Use the download buttons to save tests.py and conftest.py
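
The live progress in the steps above is delivered as Server-Sent Events, so a script can follow a run without the browser. The sketch below parses a text/event-stream into events. It assumes each event carries a JSON object in its `data:` field; the `agent` and `status` keys in the usage example are hypothetical, and the actual payload shape should be checked against the backend.

```python
import json


def iter_sse_events(lines):
    """Yield the decoded JSON payload of each event in a text/event-stream.

    `lines` is any iterable of text lines, for example a streamed HTTP body.
    Assumes JSON payloads; this is not confirmed by the project documentation.
    """
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[5:]
            # The SSE format strips a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        elif line == "" and data_lines:
            # A blank line terminates the event; multi-line data joins with \n.
            yield json.loads("\n".join(data_lines))
            data_lines = []
        # Comment lines (starting with ":") and other fields are ignored.


# Usage with a canned stream; the keys shown are illustrative only.
sample = [
    'data: {"agent": "scout", "status": "done"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"agent": "critic", "status": "running"}\n',
    "\n",
]
for event in iter_sse_events(sample):
    print(event["agent"], event["status"])
```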
Synchronous generation:
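# jq -Rs reads the whole file as one raw string and JSON-escapes it,
# so newlines and quotes in the YAML cannot break the request body.
jq -Rs '{spec: .}' data/petstore.yaml \
  | curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      --data-binary @- \
  | jq '{total: .stats.total, by_category: .stats.by_category}'

Streaming generation (SSE):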
# jq -Rs JSON-escapes the spec so multi-line YAML yields a valid request body.
jq -Rs '{spec: .}' data/petstore.yaml \
  | curl -N -X POST http://localhost:8000/generate/stream \
      -H "Content-Type: application/json" \
      --data-binary @-

Download generated files:
# After a successful /generate call:
curl http://localhost:8000/download/tests.py -o tests.py
curl http://localhost:8000/download/conftest.py -o conftest.py

Health check:
curl http://localhost:8000/health
# {"status":"ok","version":"0.1.0"}

Given a POST /users endpoint with an email + password request body, the system generates tests like:
class TestPostUsers:
    """Tests for POST /users."""

    def test_post_users_functional_happy_path(self, http_session, base_url):
        """Functional: valid user creation returns 201 with id field."""
        payload = {"email": "alice@example.com", "password": "S3cur3P@ss!"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 201
        body = response.json()
        assert "id" in body
        assert body["email"] == payload["email"]

    def test_post_users_boundary_max_length_email(self, http_session, base_url):
        """Boundary: email at maximum allowed length is accepted."""
        long_local = "a" * 64
        domain = "example.com"
        payload = {"email": f"{long_local}@{domain}", "password": "ValidPass1!"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code in (201, 422)

    def test_post_users_edge_case_sql_injection_email(self, http_session, base_url):
        """Edge case: SQL injection attempt in email field is rejected."""
        payload = {"email": "' OR '1'='1", "password": "irrelevant"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 422
        assert "id" not in response.json()

    def test_post_users_robustness_missing_required_field(self, http_session, base_url):
        """Robustness: missing password field returns 422 Unprocessable Entity."""
        payload = {"email": "bob@example.com"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 422

agentic-test-generator/
├── src/
│   ├── agents/
│   │   ├── scout.py          # OpenAPI parser + criticality enrichment
│   │   ├── strategist.py     # Test plan generation
│   │   ├── generator.py      # Async parallel test code generation
│   │   ├── critic.py         # Validation, dedup, retry queue
│   │   └── assembler.py      # Suite assembly + conftest + report
│   ├── graph/
│   │   ├── state.py          # LangGraph TypedDict state + Pydantic models
│   │   └── workflow.py       # Graph definition, retry edge, run_pipeline()
│   ├── prompts/
│   │   ├── registry.py       # STRATEGY_MAP and get_prompt() API
│   │   ├── strategies/       # five prompt strategy modules
│   │   └── examples/         # few-shot example snippets
│   ├── utils/
│   │   ├── parser.py         # $ref-resolving OpenAPI parser
│   │   ├── cleaner.py        # LLM output cleaning + ast.parse validation
│   │   └── similarity.py     # SBERT cosine sim + Union-Find dedup
│   ├── rag/
│   │   ├── store.py          # ChromaDB vector store (scaffolded)
│   │   └── retriever.py      # Retrieval logic (scaffolded)
│   └── api/
│       └── main.py           # FastAPI app: /generate, /generate/stream, /download
├── frontend/
│   └── app.py                # Streamlit UI with live SSE progress
├── tests/                    # pytest test suite
├── data/                     # Sample OpenAPI specs
├── Dockerfile                # Backend container
├── Dockerfile.frontend       # Frontend container
├── docker-compose.yml        # Orchestrated deployment
└── pyproject.toml            # Dependencies and tooling config
RAG integration for historical test patterns — src/rag/ is already scaffolded with ChromaDB. The next step is to store every validated test in the vector store after each run, then retrieve the most semantically similar historical tests at prompt-construction time. This would let the system learn from past generations and avoid re-discovering the same test patterns from scratch.
A2A protocol support — The five agents currently communicate through LangGraph shared state. Exposing them as independent services speaking the Agent-to-Agent (A2A) protocol would allow third-party agents to invoke individual pipeline stages (e.g. just the Critic as a standalone code-review agent) and would enable heterogeneous multi-agent deployments.
CI/CD integration — A GitHub Actions workflow that runs the generator against a repo's OpenAPI spec on every PR, diffs the generated suite against the committed one, and comments a coverage delta directly on the pull request. Combined with a pytest --tb=short run against the target API in a staging environment, this would close the loop from spec change to test regression automatically.
MIT — see LICENSE for details.