Hrushitha12/agentic-test-generator

🧪 Agentic Test Generator

Multi-agent system that converts an OpenAPI/Swagger spec into a comprehensive, ready-to-run pytest suite — automatically.

Python LangGraph OpenAI Anthropic FastAPI Docker


What it does

Paste an OpenAPI spec. Get a complete pytest suite — functional tests, boundary checks, edge cases, and robustness tests — with a conftest.py, a Markdown report, and live streaming progress in the browser.

No hand-written test templates. No one-size-fits-all prompts. Each test category uses a different prompting strategy, routed to the model best suited for that type of reasoning.


Architecture

Five specialised agents are orchestrated by LangGraph in a directed graph with a conditional retry loop:

| Agent | Role |
| --- | --- |
| Scout | Parses the OpenAPI spec (JSON or YAML), resolves $ref chains, extracts every endpoint with its schemas, parameters, auth requirements, and response codes. Uses GPT-4o to score each endpoint's criticality (1–10) and identify inter-endpoint dependencies. |
| Strategist | Decides how many tests of each category to generate per endpoint. Accounts for HTTP method semantics (POSTs need more tests than GETs), criticality score, and whether the endpoint accepts a request body. Produces a structured test plan via a single batched LLM call. |
| Generator | Executes the test plan concurrently. Each test is generated by one of five prompt strategies, with test count distributed evenly across strategies. Uses GPT-4o for functional/boundary tests and Claude Sonnet for edge-case/robustness tests. Validates every output with ast.parse and retries on syntax errors. |
| Critic | Three-phase quality gate: (1) syntax check; (2) URL-target check, which confirms the code actually exercises the right endpoint; (3) completeness check, which enforces at least one assert. Near-duplicate tests are detected via SBERT cosine similarity (all-MiniLM-L6-v2) with Union-Find clustering and queued for regeneration with targeted feedback. |
| Assembler | Groups validated tests by endpoint into Python classes, generates a conftest.py with base-URL and session fixtures, detects auth requirements and adds auth-header fixtures, and produces a Markdown summary report. |

Agent flow

graph LR
    A([Start]) --> B[Scout]
    B --> C[Strategist]
    C --> D[Generator]
    D --> E[Critic]
    E -->|retry_queue non-empty\nand under retry cap| D
    E -->|done| F[Assembler]
    F --> G([End])

    style D fill:#4C9BE8,color:#fff
    style E fill:#E76F51,color:#fff

The retry edge fires when the Critic places tests in retry_queue (because they failed validation or were detected as near-duplicates) and at least one of those slots is still below the per-slot retry cap (2 retries). Each RetryRequest carries the Critic's targeted feedback message, which the Generator uses as the prompt for the replacement test.
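
The routing decision behind that edge can be sketched as a small pure function. The field names on `RetryRequest` (`slot_id`, `feedback`, `attempts`), the function name, and the node names it returns are assumptions for illustration; only retry_queue, RetryRequest, and the cap of 2 come from the description above.

```python
from dataclasses import dataclass

MAX_RETRIES_PER_SLOT = 2  # per-slot retry cap described above


@dataclass
class RetryRequest:
    slot_id: str       # hypothetical identifier for one planned test
    feedback: str      # the Critic's targeted feedback for the replacement
    attempts: int = 0  # hypothetical counter of retries already spent


def route_after_critic(state: dict) -> str:
    """Pick the next node: loop back to the Generator or finish in the Assembler."""
    queue = state.get("retry_queue", [])
    # Retry only if the queue is non-empty and some slot is still under the cap.
    if any(request.attempts < MAX_RETRIES_PER_SLOT for request in queue):
        return "generator"
    return "assembler"
```

In LangGraph, a function of this shape is the kind of callable passed to `add_conditional_edges` on the Critic node; how the repository wires it up is not shown here.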


Tech stack

| Technology | Why |
| --- | --- |
| LangGraph | Stateful agent orchestration with typed shared state (TypedDict), conditional edges, and built-in recursion limiting. Avoids the glue code of rolling a custom DAG executor. |
| LiteLLM | Single unified API surface for OpenAI and Anthropic. Swapping models or adding providers requires changing one string, with no SDK-specific code anywhere in the codebase. |
| GPT-4o | Fastest and most cost-effective for structured, schema-driven test generation (functional and boundary categories). |
| Claude Sonnet | Stronger adversarial and creative reasoning; significantly better at identifying subtle failure modes for edge-case and robustness tests. |
| FastAPI | Async-native Python API framework. Server-Sent Events for real-time streaming progress require no third-party library, just StreamingResponse with text/event-stream. |
| Streamlit | Rapid browser UI without JavaScript. st.empty() placeholder-based progressive rendering gives the SSE stream a live feel. |
| SBERT (all-MiniLM-L6-v2) | Semantic similarity for near-duplicate detection. Token-set (Jaccard) similarity misses paraphrased duplicates; SBERT catches them. The model is loaded lazily to avoid download-at-import cost. |
| Pydantic v2 | Typed models for all data structures in shared state. Alias support (alias="in") handles OpenAPI's reserved-keyword field names without schema hacks. |
| LangSmith | Zero-config observability: trace every LLM call, token count, and latency per run. Enabled by setting two environment variables. |
| ChromaDB | Vector store for the planned RAG layer (see Future Work). Already scaffolded in src/rag/. |
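
To make the near-duplicate detection concrete, here is a minimal sketch of cosine similarity combined with Union-Find clustering. It works on plain lists of floats so that it runs without downloading a model; in the real pipeline the vectors would be SBERT embeddings of the generated tests. The function names and the 0.9 threshold are assumptions, not values taken from src/utils/similarity.py.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_duplicates(vectors, threshold=0.9):
    """Group indices whose vectors are transitively linked by similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    # Link every pair that clears the threshold (quadratic, fine for small suites).
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                union(i, j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Any cluster with more than one member would then keep one test and send the rest to the retry queue, under the assumption that the first member is the one retained.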

Research foundation

The prompt strategy registry in src/prompts/registry.py is grounded in empirical research on LLM test generation. The key findings that shaped this system's design:

Prompt strategy matters more than model size. Across functional test generation benchmarks, a well-chosen prompting strategy (e.g. chain-of-thought decomposition) applied to a mid-tier model consistently outperforms a naive single-shot prompt sent to a frontier model. This is why the system maintains a curated registry of five distinct strategies rather than relying on a single general-purpose prompt.

Self-refine harms robustness test diversity. The self-refine strategy (generate → critique → rewrite) is effective for functional and edge-case tests, where a second LLM pass sharpens assertion specificity. For robustness tests, however, the critique stage systematically steers the model away from unusual inputs toward "reasonable" ones — the exact opposite of what robustness testing requires. As a result, self_refine is explicitly excluded from the robustness strategy map.
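
A strategy map with this exclusion, plus an even split of a test count across the remaining strategies, might look like the sketch below. Only self_refine and chain-of-thought are named in this README; the other three strategy names, the category keys, and the `distribute` helper are hypothetical placeholders, and the actual contents of STRATEGY_MAP in src/prompts/registry.py may differ.

```python
CATEGORIES = ("functional", "boundary", "edge_case", "robustness")

# Hypothetical names: only "chain_of_thought" and "self_refine" appear in the README.
ALL_STRATEGIES = ("zero_shot", "few_shot", "chain_of_thought", "role_based", "self_refine")

# Every category gets all strategies, except that robustness drops self_refine.
STRATEGY_MAP = {
    category: [
        strategy
        for strategy in ALL_STRATEGIES
        if not (category == "robustness" and strategy == "self_refine")
    ]
    for category in CATEGORIES
}


def distribute(count, strategies):
    """Spread `count` tests as evenly as possible over `strategies`."""
    base, extra = divmod(count, len(strategies))
    # The first `extra` strategies each take one additional test.
    return {s: base + (1 if i < extra else 0) for i, s in enumerate(strategies)}
```

For example, six robustness tests over the four remaining strategies would be split 2, 2, 1, 1 under this scheme.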

Per-category model routing improves coverage. Functional and boundary tests benefit from the structured, schema-faithful completions GPT-4o produces quickly and cheaply. Edge-case and robustness tests require a model that reasons about adversarial inputs and failure modes rather than happy-path schemas — Claude Sonnet fills this role significantly better on the same prompts.
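
Per-category routing reduces to a lookup from test category to a model string. The sketch below is illustrative only: the category keys and the `model_for` helper are assumptions, and the Claude entry is a placeholder because the README does not state which Sonnet version is used.

```python
# Placeholder: substitute the exact Claude Sonnet model identifier you use.
CLAUDE_SONNET = "claude-sonnet-placeholder"

MODEL_BY_CATEGORY = {
    "functional": "gpt-4o",
    "boundary": "gpt-4o",
    "edge_case": CLAUDE_SONNET,
    "robustness": CLAUDE_SONNET,
}


def model_for(category: str) -> str:
    """Return the model string for a test category, failing loudly on typos."""
    try:
        return MODEL_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unknown test category: {category!r}") from None
```

With LiteLLM, the returned string would be passed as the `model` argument of `litellm.completion`, which is what makes swapping providers a one-string change.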


Setup

Prerequisites

1. Clone the repository

git clone https://github.com/your-username/agentic-test-generator.git
cd agentic-test-generator

2. Configure environment variables

cp .env.example .env

Edit .env:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Optional — enable LangSmith tracing
LANGSMITH_API_KEY=lsv2_...
LANGSMITH_PROJECT=agentic-test-generator
LANGSMITH_TRACING=true

3. Install dependencies

pip install -e ".[dev]"

4. Run locally

Start the backend in one terminal:

python -m src.api.main
# FastAPI running on http://localhost:8000
# Interactive docs at http://localhost:8000/docs

Start the frontend in another terminal:

streamlit run frontend/app.py
# Streamlit running on http://localhost:8501

5. Run with Docker

docker compose up --build

| Service | URL |
| --- | --- |
| Streamlit UI | http://localhost:8501 |
| FastAPI backend | http://localhost:8000 |
| API docs (Swagger) | http://localhost:8000/docs |

Usage

Streamlit UI

  1. Open http://localhost:8501
  2. Paste an OpenAPI spec into the Paste spec tab, or upload a .json/.yaml file via Upload file
  3. Click ⚡ Generate Test Suite
  4. Watch the five agents run in real time — each step updates as it completes
  5. Results appear in three tabs: Test Suite, Conftest, and Report
  6. Use the download buttons to save tests.py and conftest.py

API — curl examples

Synchronous generation:

# jq -Rs reads the whole file as one string and JSON-escapes its quotes and newlines
jq -Rs '{spec: .}' data/petstore.yaml \
  | curl -X POST http://localhost:8000/generate \
      -H "Content-Type: application/json" \
      --data-binary @- \
  | jq '{total: .stats.total, by_category: .stats.by_category}'
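
The same request can be made from Python, where json.dumps takes care of escaping the quotes and newlines in a multi-line YAML spec. This is a sketch using only the standard library; the helper name `build_generate_request` is an assumption, while the /generate path, the "spec" field, and the stats fields come from the curl example.

```python
import json
import urllib.request


def build_generate_request(spec_text, base_url="http://localhost:8000"):
    """Build a POST /generate request whose body carries the spec as a JSON string."""
    body = json.dumps({"spec": spec_text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(spec_path, base_url="http://localhost:8000"):
    """Send the spec at `spec_path` to a running backend and return the parsed response."""
    with open(spec_path, encoding="utf-8") as handle:
        request = build_generate_request(handle.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage, with the backend running locally:
#   result = generate("data/petstore.yaml")
#   print(result["stats"]["total"], result["stats"]["by_category"])
```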

Streaming generation (SSE):

# -N disables curl's output buffering so events print as they arrive
jq -Rs '{spec: .}' data/petstore.yaml \
  | curl -N -X POST http://localhost:8000/generate/stream \
      -H "Content-Type: application/json" \
      --data-binary @-
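
A client consuming this stream needs to split it into events. The sketch below parses Server-Sent Events `data:` lines from any iterable of text lines. It assumes each event's data is a JSON object; the event fields used in the usage comment (`agent`, `status`) are hypothetical, since the README does not document the payload schema.

```python
import json


def iter_sse_data(lines):
    """Yield the decoded JSON payload of each SSE event in an iterable of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # Per the SSE format, one optional space after the colon is dropped.
            buffer.append(line[5:].removeprefix(" "))
        elif line == "" and buffer:
            # A blank line terminates the event; multi-line data joins with newlines.
            yield json.loads("\n".join(buffer))
            buffer = []
        # Comment lines (starting with ":") and other fields are ignored here.
    if buffer:
        yield json.loads("\n".join(buffer))


# Usage with hypothetical event fields:
#   for event in iter_sse_data(response_lines):
#       print(event.get("agent"), event.get("status"))
```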

Download generated files:

# After a successful /generate call:
curl http://localhost:8000/download/tests.py -o tests.py
curl http://localhost:8000/download/conftest.py -o conftest.py

Health check:

curl http://localhost:8000/health
# {"status":"ok","version":"0.1.0"}

Example output

Given a POST /users endpoint with an email + password request body, the system generates tests like:

class TestPostUsers:
    """Tests for POST /users."""

    def test_post_users_functional_happy_path(self, http_session, base_url):
        """Functional: valid user creation returns 201 with id field."""
        payload = {"email": "alice@example.com", "password": "S3cur3P@ss!"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 201
        body = response.json()
        assert "id" in body
        assert body["email"] == payload["email"]

    def test_post_users_boundary_max_length_email(self, http_session, base_url):
        """Boundary: email at maximum allowed length is accepted."""
        long_local = "a" * 64
        domain = "example.com"
        payload = {"email": f"{long_local}@{domain}", "password": "ValidPass1!"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code in (201, 422)

    def test_post_users_edge_case_sql_injection_email(self, http_session, base_url):
        """Edge case: SQL injection attempt in email field is rejected."""
        payload = {"email": "' OR '1'='1", "password": "irrelevant"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 422
        assert "id" not in response.json()

    def test_post_users_robustness_missing_required_field(self, http_session, base_url):
        """Robustness: missing password field returns 422 Unprocessable Entity."""
        payload = {"email": "bob@example.com"}
        response = http_session.post(f"{base_url}/users", json=payload)
        assert response.status_code == 422

Project structure

agentic-test-generator/
├── src/
│   ├── agents/
│   │   ├── scout.py          # OpenAPI parser + criticality enrichment
│   │   ├── strategist.py     # Test plan generation
│   │   ├── generator.py      # Async parallel test code generation
│   │   ├── critic.py         # Validation, dedup, retry queue
│   │   └── assembler.py      # Suite assembly + conftest + report
│   ├── graph/
│   │   ├── state.py          # LangGraph TypedDict state + Pydantic models
│   │   └── workflow.py       # Graph definition, retry edge, run_pipeline()
│   ├── prompts/
│   │   ├── registry.py       # STRATEGY_MAP and get_prompt() API
│   │   ├── strategies/       # five prompt strategy modules
│   │   └── examples/         # few-shot example snippets
│   ├── utils/
│   │   ├── parser.py         # $ref-resolving OpenAPI parser
│   │   ├── cleaner.py        # LLM output cleaning + ast.parse validation
│   │   └── similarity.py     # SBERT cosine sim + Union-Find dedup
│   ├── rag/
│   │   ├── store.py          # ChromaDB vector store (scaffolded)
│   │   └── retriever.py      # Retrieval logic (scaffolded)
│   └── api/
│       └── main.py           # FastAPI app — /generate, /generate/stream, /download
├── frontend/
│   └── app.py                # Streamlit UI with live SSE progress
├── tests/                    # pytest test suite
├── data/                     # Sample OpenAPI specs
├── Dockerfile                # Backend container
├── Dockerfile.frontend       # Frontend container
├── docker-compose.yml        # Orchestrated deployment
└── pyproject.toml            # Dependencies and tooling config

Future work

RAG integration for historical test patterns: src/rag/ is already scaffolded with ChromaDB. The next step is to store every validated test in the vector store after each run, then retrieve the most semantically similar historical tests at prompt-construction time. This would let the system learn from past generations and avoid re-discovering the same test patterns from scratch.

A2A protocol support — The five agents currently communicate through LangGraph shared state. Exposing them as independent services speaking the Agent-to-Agent (A2A) protocol would allow third-party agents to invoke individual pipeline stages (e.g. just the Critic as a standalone code-review agent) and would enable heterogeneous multi-agent deployments.

CI/CD integration — A GitHub Actions workflow that runs the generator against a repo's OpenAPI spec on every PR, diffs the generated suite against the committed one, and comments a coverage delta directly on the pull request. Combined with a pytest --tb=short run against the target API in a staging environment, this would close the loop from spec change to test regression automatically.


License

MIT — see LICENSE for details.
