Chat with any OpenAPI specification using RAG - no context window limits, no hallucinations, source-grounded answers.
The standard advice for understanding an API is "just paste the spec into ChatGPT." That breaks down fast.
| Problem | DocQuery's Solution |
|---|---|
| Stripe's spec is ~50,000 lines - won't fit in any context window | RAG retrieves only the relevant chunks per query |
| Internal/private APIs can't be pasted into public LLMs | Fully local vector store, nothing leaves your machine |
| LLM training data has a cutoff - new endpoints are unknown | Re-ingest any time, answers always reflect the current spec |
| Raw LLM answers can't cite where they got the answer | Every response includes the exact source endpoints |
| Can't query multiple APIs simultaneously | Multi-spec support with per-spec ChromaDB collections |
- RAG pipeline - OpenAPI spec is chunked, embedded with `all-mpnet-base-v2`, and stored in ChromaDB alongside a BM25 index; each query runs dense and keyword retrieval in parallel, fuses them with Reciprocal Rank Fusion (RRF), then re-ranks candidates with a cross-encoder before hitting the LLM.
- `$ref` resolution - request body schemas are fully dereferenced at ingest time, including `allOf`, `anyOf`, and `oneOf` composition patterns (up to 6 levels deep), so field names like `name` are semantically searchable.
- Confidence scoring - a hybrid of Vectara's HHEM hallucination model (70%) and cosine similarity (30%), with a JSON field grounding check blended into the HHEM score. Falls back to cosine-only if HHEM is unavailable.
- Source citations - every answer includes the exact HTTP method + path that grounded it
- Multi-spec support - ingest multiple APIs simultaneously and switch between them with one click
- Dual ingest modes - paste a public URL or upload a local `.json` file (for private/internal specs)
- Persistent vector store - ChromaDB and BM25 indexes persist to disk; specs survive restarts without re-ingestion.
- Voice input - record a question via the browser's MediaRecorder API; audio is transcribed by Whisper via Groq (~1s) and populated into the query box automatically.
- Pipeline observability - every query is traced end-to-end with Langfuse; the retrieval span logs which endpoints were retrieved, the LLM generation span logs the full prompt and answer, and confidence scores are tracked as metrics across all queries.
Summary-first chunking - Each chunk starts with {summary} - {METHOD} {path} before the technical details. The sentence transformer sees the human-readable description first, which dramatically improves retrieval for natural language queries over path-only chunking. Synonym expansion is also injected into each chunk so keyword queries hit the right endpoints.
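The chunk layout described above can be sketched roughly as follows. This is a minimal illustration, not the project's `extract_chunks()` code: the function name `build_chunk`, the `SYNONYMS` table, and the "Also known as" line are assumptions made for the example.

```python
def build_chunk(method: str, path: str, operation: dict,
                synonyms: dict[str, list[str]]) -> str:
    """Render one OpenAPI operation as a summary-first text chunk."""
    summary = operation.get("summary") or operation.get("operationId") or path
    # The human-readable summary comes first, then METHOD and path.
    lines = [f"{summary} - {method.upper()} {path}"]

    if operation.get("description"):
        lines.append(operation["description"])

    # Skip parameter entries without a name (e.g. unresolved references).
    params = [p["name"] for p in operation.get("parameters", []) if "name" in p]
    if params:
        lines.append("Parameters: " + ", ".join(params))

    # Hypothetical synonym expansion: add alternatives for words in the summary.
    words = summary.lower().split()
    extra = sorted({alt for word in words for alt in synonyms.get(word, [])})
    if extra:
        lines.append("Also known as: " + ", ".join(extra))
    return "\n".join(lines)


# Illustrative synonym table and operation; neither comes from the project.
SYNONYMS = {"update": ["modify", "edit", "change"], "delete": ["remove"]}
operation = {
    "summary": "Update a customer",
    "parameters": [{"name": "customer", "in": "path"}],
}
chunk = build_chunk("patch", "/v1/customers/{customer}", operation, SYNONYMS)
print(chunk)
```

With this layout, a query such as "modify a customer" can match the chunk on the injected synonym even though the spec itself only says "Update".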
$ref resolution at ingest time - OpenAPI specs use $ref pointers for all request/response schemas. Without resolving them, chunks contain #/components/schemas/User instead of the actual field names (name, password). The resolver handles allOf, anyOf, and oneOf composition up to 6 levels deep, so nested schemas are fully searchable.
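A depth-capped resolver along these lines might look like the sketch below. It assumes local references only (`#/...`), ignores JSON Pointer escape sequences, and uses hypothetical names (`resolve`, `field_names`, `SPEC`); the project's actual resolver may differ.

```python
from typing import Any

MAX_DEPTH = 6  # the cap described above


def resolve(node: Any, spec: dict, depth: int = 0) -> Any:
    """Replace local $ref pointers with the schemas they point to."""
    if depth >= MAX_DEPTH:
        return node  # beyond the cap, leave the remainder unresolved
    if isinstance(node, list):
        return [resolve(item, spec, depth) for item in node]
    if not isinstance(node, dict):
        return node
    if "$ref" in node:
        target: Any = spec
        for part in node["$ref"].removeprefix("#/").split("/"):
            target = target[part]
        # Each followed reference counts as one level of depth,
        # which also stops circular references from recursing forever.
        return resolve(target, spec, depth + 1)
    return {key: resolve(value, spec, depth) for key, value in node.items()}


def field_names(schema: dict) -> set[str]:
    """Collect property names, descending into allOf/anyOf/oneOf."""
    names = set(schema.get("properties", {}))
    for keyword in ("allOf", "anyOf", "oneOf"):
        for sub_schema in schema.get(keyword, []):
            names |= field_names(sub_schema)
    return names


# Illustrative spec fragment, not taken from a real API.
SPEC = {
    "components": {"schemas": {
        "Credentials": {"type": "object",
                        "properties": {"password": {"type": "string"}}},
        "User": {"allOf": [
            {"$ref": "#/components/schemas/Credentials"},
            {"type": "object", "properties": {"name": {"type": "string"}}},
        ]},
    }}
}

body = resolve({"$ref": "#/components/schemas/User"}, SPEC)
fields = sorted(field_names(body))
print(fields)  # ['name', 'password']
```

After resolution the chunk text can list `name` and `password` directly instead of the opaque pointer `#/components/schemas/User`.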
Hybrid BM25 + dense retrieval with RRF - Dense embeddings capture semantic similarity; BM25 catches exact keyword matches (endpoint paths, parameter names). Reciprocal Rank Fusion combines both ranked lists into a single candidate set without needing a tuned interpolation weight. A cross-encoder then re-ranks the fused top-N for precision before the prompt is built.
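The fusion step can be illustrated with a short sketch. The function name `rrf_fuse` and the example rankings are made up for the illustration, and `k = 60` is the constant conventionally used with RRF; the value DocQuery uses is not stated here.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60,
             top_n: int = 5) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; only the top_n go on to re-ranking.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Illustrative rankings from the two retrievers for one query.
dense = ["GET /pet/{petId}", "PUT /pet", "POST /pet"]
bm25 = ["PUT /pet", "POST /pet/{petId}", "GET /pet/{petId}"]

fused = rrf_fuse([dense, bm25])
for doc_id, score in fused:
    print(f"{score:.4f}  {doc_id}")
```

`PUT /pet` ends up first because it ranks near the top of both lists, even though neither retriever's raw scores are compared directly. That is why no interpolation weight has to be tuned.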
BM25 + RRF over HyDE - HyDE generates a hypothetical answer to use as a proxy query, which helps when queries are vague and documents are unstructured prose. API documentation is the opposite: queries are specific ("how do I update a customer") and documents are structured chunks with exact method+path identifiers. In that setting HyDE introduces hallucination risk at the retrieval stage - a fabricated answer may embed closer to the wrong endpoint than the right one. BM25 handles this domain more reliably: "update a customer" is a near-exact keyword match to Update a customer — PATCH /v1/customers/{customer}, whereas a dense-only retriever can rank it far lower because the embedding model doesn't bridge synonyms like "modify" → "update" at query time. Running both in parallel and fusing with RRF captures keyword precision and semantic recall without the hallucination risk HyDE introduces at retrieval.
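To make the keyword-match argument concrete, here is a small self-contained Okapi BM25 scorer run over three summary-first chunks. It is a teaching sketch with whitespace tokenization and common default parameters (`k1 = 1.5`, `b = 0.75`), not the BM25 implementation the project uses; the second and third chunks are invented for contrast.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "Update a customer - PATCH /v1/customers/{customer}",
    "Delete a customer - DELETE /v1/customers/{customer}",
    "List all charges - GET /v1/charges",
]
scores = bm25_scores("update a customer", docs)
best = docs[scores.index(max(scores))]
print(best)
```

The PATCH chunk wins because it is the only one containing all three query terms, and no generated text is involved at any point in the retrieval step.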
HHEM + cosine confidence scoring - Vectara's Hallucination Evaluation Model (HHEM) scores factual consistency between the answer prose and retrieved chunks. A separate JSON field grounding check verifies that any code examples in the answer use only field names present in the spec, and its result is blended into the HHEM score. That score is then weighted 70/30 against cosine similarity (cosine-only when HHEM is unavailable), giving a 0-100 score for flagging likely hallucinations.
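One way to read the scoring scheme is sketched below, assuming the 70/30 split is between the HHEM score and cosine similarity. How the grounding check is blended into the HHEM score is not specified, so the `grounding_weight` parameter and both function names are hypothetical.

```python
from typing import Optional


def grounding_score(answer_fields: set[str], spec_fields: set[str]) -> float:
    """Fraction of field names used in the answer that exist in the spec."""
    if not answer_fields:
        return 1.0  # nothing to check, so nothing is ungrounded
    return len(answer_fields & spec_fields) / len(answer_fields)


def confidence(cosine: float, hhem: Optional[float] = None,
               grounding: Optional[float] = None,
               grounding_weight: float = 0.3) -> int:
    """Blend signals in [0, 1] into a 0-100 confidence score."""
    if hhem is None:
        blended = cosine  # fallback: cosine-only when HHEM is unavailable
    else:
        if grounding is not None:
            # Assumed blending rule: weighted average of HHEM and grounding.
            hhem = (1 - grounding_weight) * hhem + grounding_weight * grounding
        blended = 0.7 * hhem + 0.3 * cosine
    return round(100 * max(0.0, min(1.0, blended)))


spec_fields = {"notify_on_view", "notify_email", "content"}
grounded = grounding_score({"notify_on_view", "notify_email"}, spec_fields)

print(confidence(cosine=0.8, hhem=0.9))                  # 87
print(confidence(cosine=0.8))                            # 80
print(confidence(cosine=0.8, hhem=0.9, grounding=0.5))   # 79
```

An answer that invents a field name lowers `grounding_score`, which pulls the final score down even when the prose itself looks consistent with the retrieved chunks.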
Per-spec ChromaDB collections + BM25 indexes - Each ingested API gets its own ChromaDB collection and a paired .pkl BM25 index on disk. This allows simultaneous multi-spec support with zero cross-contamination and O(1) collection switching.
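The pairing of one collection and one index file per spec can be sketched as a simple name mapping. The slug rule below is an assumption inferred from the example names in the API responses (the title "Swagger Petstore - OpenAPI 3.0" is listed as `swagger_petstore__openapi_30`), and the `<name>.pkl` file naming is hypothetical; only the `bm25_indexes/` directory name comes from the project layout.

```python
import re
from pathlib import Path


def collection_name(spec_title: str) -> str:
    """Turn a spec title into a collection-safe identifier."""
    slug = spec_title.lower().replace(" ", "_")
    return re.sub(r"[^a-z0-9_]", "", slug)


def index_paths(spec_title: str, root: Path = Path(".")) -> dict:
    """Map one spec to its collection name and its BM25 index file."""
    name = collection_name(spec_title)
    return {
        "collection": name,
        "bm25_index": root / "bm25_indexes" / f"{name}.pkl",
    }


# Switching specs is a dictionary lookup, independent of how many are ingested.
registry = {
    collection_name(title): index_paths(title)
    for title in ["Swagger Petstore - OpenAPI 3.0", "Phantom Share"]
}
print(sorted(registry))
```

Because every spec has its own collection and index file, a query against one spec never sees chunks from another.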
Dual ingest modes - URL ingest for public specs, file upload for private/internal APIs that can't be sent to external services. Both paths converge at the same extract_chunks() function.
Llama 3.1 8B over Zephyr 7B - Zephyr consistently ignored strict prompt rules and hallucinated parameters not in the spec. Llama 3.1 follows instruction constraints reliably enough for production-quality answers on API documentation tasks.
all-mpnet-base-v2 over all-MiniLM-L6-v2 - MiniLM-L6 is faster, but it is a distilled 6-layer model that produces 384-dimensional embeddings. API documentation chunks pack a summary, HTTP method, path, parameters, and schema fields into a single block; mpnet's larger 12-layer encoder and 768-dimensional embeddings represent these dense, structured inputs more accurately. The trade-off (a larger model and slower encoding) is negligible in practice, since chunks are embedded once at ingest and each query is a single short string.
- Python 3.11+
- React.js 18+
- A free HuggingFace token (for the LLM)
- A free Groq API key (for voice transcription)
Clone the repository:

```bash
git clone https://github.com/superb-striker/DocQuery
cd DocQuery
```

Create a virtual environment and install required libraries:

```bash
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Create `.env`:

```
HF_TOKEN=your_token_here
GROQ_API_KEY=your_groq_key_here
```

Start the backend server:

```bash
uvicorn main:app --reload
# API running at http://localhost:8000
# Swagger docs at http://localhost:8000/docs
```

Install required packages and start the frontend server:

```bash
cd frontend
npm install
npm run dev
# UI running at http://localhost:5173
```

Fetch and index an OpenAPI spec from a public URL.
Request

```json
{
  "specification_url": "https://petstore3.swagger.io/api/v3/openapi.json"
}
```

Response

```json
{
  "specification_name": "Swagger Petstore - OpenAPI 3.0",
  "chunks_stored": 19,
  "message": "Successfully ingested 19 endpoints."
}
```

Upload a local `.json` spec file (for private/internal APIs).

Request

```
file: <your-openapi-spec.json>
```

Response - same as /ingest
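As a rough illustration of calling the URL ingest endpoint from a script, the sketch below builds the request with the standard library only. The `/ingest` path, port 8000, and the `specification_url` field come from this README; the POST method and the helper name `build_ingest_request` are assumptions, so check the Swagger docs at `/docs` for the actual contract.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # backend address from the setup steps


def build_ingest_request(spec_url: str) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the /ingest endpoint."""
    body = json.dumps({"specification_url": spec_url}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # assumed; the README does not state the HTTP method
    )


req = build_ingest_request("https://petstore3.swagger.io/api/v3/openapi.json")
print(req.get_method(), req.full_url)

# To send it, the backend must be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["chunks_stored"])
```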
Ask a question against an ingested spec.
Request

```json
{
  "question": "How do I create a secret with a view notification?",
  "specification_name": "phantom_share"
}
```

Response

```json
{
  "answer": "Use POST /api/secrets with notify_on_view: true and provide a notify_email...",
  "confidence": 87,
  "sources": [
    {
      "endpoint": "POST /api/secrets",
      "summary": "Create Secret"
    }
  ]
}
```

Transcribe a browser audio recording to text via Groq Whisper, for use as a query.
Request

```
file: <recording.webm>
```

Response

```json
{
  "transcript": "How do I authenticate with the API?"
}
```

List all currently ingested specification names.
Response

```json
{
  "specifications": ["swagger_petstore__openapi_30", "phantom_share"]
}
```

```
DocQuery/
├── backend/
│   ├── main.py
│   ├── ingest.py
│   ├── vectorStore.py
│   ├── llm.py
│   ├── confidence.py
│   ├── loggerConfig.py
│   ├── chromaDB/
│   └── bm25_indexes/
├── .env
└── frontend/
    ├── DocQuery.jsx
    ├── main.jsx
    └── src/
        ├── theme.js
        ├── utils/
        │   └── classifyError.js
        ├── hooks/
        │   └── useTypewriter.js
        └── components/
            ├── ui/
            │   ├── ErrorBanner.jsx
            │   ├── Badge.jsx
            │   ├── NoiseFilter.jsx
            │   ├── Spinner.jsx
            │   └── ThemeToggle.jsx
            ├── AppHeader.jsx
            ├── ConfidenceRing.jsx
            ├── SourceChip.jsx
            ├── IngestPanel.jsx
            ├── QueryPanel.jsx
            └── ResultPanel.jsx
```

| Variable | Required | Description |
|---|---|---|
| `HF_TOKEN` | Yes | HuggingFace access token for the inference router |
| `GROQ_API_KEY` | Yes | Groq API key for Whisper voice transcription |
| `LANGFUSE_PUBLIC_KEY` | Yes | Langfuse project public key |
| `LANGFUSE_SECRET_KEY` | Yes | Langfuse project secret key |
| `LANGFUSE_HOST` | Yes | Langfuse host (e.g. https://jp.cloud.langfuse.com) |
The model is set in llm.py. Any model available on the HuggingFace router works as a drop-in replacement.
```bash
# Wipe all ingested specs and start fresh
rm -rf chromaDB/ bm25_indexes/ # Linux/Mac
Remove-Item -Recurse -Force chromaDB, bm25_indexes # Windows PowerShell
```

- HuggingFace free tier - LLM requests are rate limited; responses may take 5-15 seconds under load. For faster LLM responses, swap to Groq in `llm.py`.
- Schema depth - `$ref` resolution is capped at 6 levels. Deeply nested `$ref` chains beyond that are not resolved.
- YAML specs - only JSON OpenAPI specs are supported. For YAML specs, convert first: `python -c "import yaml,json; json.dump(yaml.safe_load(open('spec.yaml')), open('spec.json','w'))"`.
- HHEM model load time - the Vectara hallucination model is loaded at startup and adds a few seconds to cold start.