Documents → an LLM-extracted knowledge graph → multi-hop, cited answers with an interactive network view. The sophisticated cousin of vanilla RAG — it can connect facts across documents, not just retrieve isolated passages.
Upload-free, self-contained demo over a curated corpus spanning space exploration, biotechnology & genetics, and artificial intelligence. Ask a question in plain English and the system:
- seeds entities by semantic similarity + name match,
- traverses the knowledge graph to find the entities, relationships and shortest paths that connect your question,
- assembles that sub-graph (plus the original source passages) into context, and
- has the LLM answer with inline [S#] citations and an explicit reasoning path, while highlighting the exact sub-graph it used.
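The context-assembly step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `assemble_context` helper, its argument shapes and the exact text layout are assumptions; only the `[S#]` citation tags come from the description above.

```python
def assemble_context(relations, passages):
    """Format relations plus numbered source passages into one prompt context.

    relations: list of (source, relation, target) tuples (hypothetical shape)
    passages:  list of (title, text) tuples (hypothetical shape)
    """
    lines = ["Relations:"]
    lines += [f"- {src} --{rel}--> {dst}" for src, rel, dst in relations]
    lines.append("Sources:")
    # Number passages from 1 so the model can cite them as [S1], [S2], ...
    lines += [
        f"[S{i}] {title}: {text}"
        for i, (title, text) in enumerate(passages, start=1)
    ]
    return "\n".join(lines)


context = assemble_context(
    [("DeepMind", "developed", "AlphaFold")],
    [("AI doc", "DeepMind built AlphaFold.")],
)
print(context)
```

Numbering the passages at assembly time gives the model one stable tag per source, so an inline citation can be traced back to the passage it came from.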
Because the connections are explicit edges, it shines on multi-hop questions that bridge domains, e.g.:
- "Which AI lab developed a system that predicts protein structures, and who led it?"
- "Trace a path from deep learning to the Mars rover Perseverance."
- "What do AlphaFold and the mRNA COVID-19 vaccines have in common?"
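To make the multi-hop idea concrete, here is a small dependency-free sketch of a shortest-path search over explicit edges. The toy edge list and the `build_adjacency` / `shortest_path` helpers are illustrative assumptions, not the project's extracted graph or its API (the project itself builds on NetworkX).

```python
from collections import deque

# Hypothetical toy edges (subject, relation, object); not the project's real graph.
EDGES = [
    ("DeepMind", "developed", "AlphaFold"),
    ("AlphaFold", "predicts", "protein structures"),
    ("Demis Hassabis", "leads", "DeepMind"),
]


def build_adjacency(edges):
    """Undirected adjacency map: node -> list of (neighbour, relation)."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((dst, rel))
        adj.setdefault(dst, []).append((src, rel))
    return adj


def shortest_path(adj, start, goal):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    if start not in adj or goal not in adj:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour, _rel in adj[node]:
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


adj = build_adjacency(EDGES)
path = shortest_path(adj, "protein structures", "Demis Hassabis")
print(" -> ".join(path))
# protein structures -> AlphaFold -> DeepMind -> Demis Hassabis
```

Each hop in the result is an explicit edge, which is what lets an answer name both the lab and the person even when no single passage mentions both.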
The full knowledge graph extracted from the 8-document corpus (116 entities, 131 relationships), coloured by community. Space, biotech and AI form clusters joined by cross-domain bridges such as AlphaFold (AI ↔ biology) and AutoNav (AI ↔ space).
| | Vanilla RAG | GraphRAG (this project) |
|---|---|---|
| Retrieval unit | top-k isolated text chunks | entities + relationships + paths, then their source chunks |
| Multi-hop questions | weak — must hope all hops sit in one chunk | strong — follows edges across documents |
| Explainability | "here are some chunks" | the exact sub-graph + reasoning path + citations |
| Global structure | none | communities group related entities |
Inspired by Microsoft's GraphRAG (2024), implemented from scratch with LangChain, NetworkX and Streamlit.
flowchart LR
A[Corpus<br/>.txt documents] --> B[Chunk]
B --> C{{LLM extraction}}
C -->|entity, relation, entity<br/>+ descriptions| D[Merge into<br/>NetworkX graph]
D --> E[Louvain<br/>communities]
D --> F[(Prebuilt<br/>graph_cache)]
Q[Question] --> G[Seed entities<br/>vector + name match]
F --> G
G --> H[Traverse:<br/>k-hop + shortest paths]
H --> I[Assemble context<br/>entities · relations · sources]
I --> J{{LLM answer<br/>with citations}}
H --> K[Highlight<br/>sub-graph]
J --> L[Answer + reasoning path]
K --> L
Retrieval is hybrid: vector similarity over entity descriptions finds semantically relevant seeds, name-matching catches exact mentions, and graph traversal (k-hop neighbourhood + shortest paths between seeds) supplies the connective tissue that plain vector search misses.
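A rough sketch of the seeding and neighbourhood steps, under stated assumptions: the entity descriptions and edges are made up for illustration, a plain bag-of-words cosine stands in for the TF-IDF similarity, and `seed_entities` / `k_hop` are hypothetical helper names rather than the project's functions.

```python
import math
import re
from collections import Counter

# Hypothetical entity descriptions; the real ones come from LLM extraction.
ENTITIES = {
    "AlphaFold": "AI system that predicts protein structures",
    "DeepMind": "AI research lab that developed AlphaFold and AlphaGo",
    "Perseverance": "Mars rover that drives itself with AutoNav",
    "AutoNav": "autonomous navigation software used by Perseverance",
}
# Hypothetical undirected edges between those entities.
EDGES = [("DeepMind", "AlphaFold"), ("Perseverance", "AutoNav")]


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def seed_entities(question, top_k=1):
    """Union of exact name mentions and the top_k most similar descriptions."""
    q_lower = question.lower()
    by_name = {name for name in ENTITIES if name.lower() in q_lower}
    q_vec = Counter(tokens(question))
    ranked = sorted(
        ENTITIES,
        key=lambda name: cosine(q_vec, Counter(tokens(ENTITIES[name]))),
        reverse=True,
    )
    return by_name | set(ranked[:top_k])


def k_hop(seeds, k=1):
    """Expand the seed set by k hops along EDGES."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(k):
        reached = set()
        for a, b in EDGES:
            if a in frontier:
                reached.add(b)
            if b in frontier:
                reached.add(a)
        frontier = reached - seen
        seen |= frontier
    return seen


question = "Which lab built the system that predicts protein structures?"
seeds = seed_entities(question)
subgraph_nodes = k_hop(seeds, k=1)
print(seeds, subgraph_nodes)
```

In this toy run the question never names AlphaFold, so the similarity step supplies the seed, and the one-hop expansion then pulls in DeepMind, which the question's wording alone would not have ranked first.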
A retrieved sub-graph for "Trace a path from deep learning to the Mars rover Perseverance" — seeds in red, traversed neighbours in colour.
pip install -r requirements.txt
cp .env.example .env # paste a free Groq key from console.groq.com/keys
streamlit run app/app.py

The graph loads instantly from the prebuilt cache. No key? It still answers extractively from the graph; add a key for fully written answers. See STEPS.md for the full run/test/deploy checklist.
from graphrag import GraphRAG, Settings, get_llm
settings = Settings() # Groq + TF-IDF by default
rag = GraphRAG(settings, llm=get_llm(settings)).build() # extract graph with the LLM
rag.save() # cache it
ans = rag.query("How is the lab behind AlphaGo connected to biology?")
print(ans.text) # answer with [S#] citations
for s in ans.sources:
    print(s.cite, s.title)

Or load the shipped cache without a key and answer extractively:
rag = GraphRAG(Settings()).load()
print(rag.query("What gene-editing tool won a Nobel Prize?").text)

Asking a multi-hop question returns a cited answer, the reasoning path, and the exact sub-graph it used:
Exploring the full knowledge graph — drag, zoom and hover the interactive network:
- graphrag/: the library
  - config.py: one Settings dataclass, all env-overridable
  - llm.py: provider abstraction (Groq default, Gemini fallback) with lazy imports, plus an extractive fallback for the no-key path
  - chunking.py: corpus loading, chunking, entity-name normalisation
  - prompts.py: extraction / community-summary / answer prompts
  - extraction.py: LLM → tolerant JSON → (entity, relation, entity) triples
  - embeddings.py: pluggable tfidf / sentence-transformers / hashing
  - graph_build.py: KnowledgeGraph (NetworkX), entity merge, Louvain communities, JSON cache, traversal helpers
  - retrieval.py: hybrid seed + multi-hop traversal + context assembly
  - qa.py: answer synthesis with citations (LLM or extractive)
  - pipeline.py: GraphRAG orchestrator + sample questions
  - viz.py: interactive pyvis graph + static matplotlib figures
- app/app.py: Streamlit UI (ask, explore, rebuild)
- data/corpus/: 8 curated factual documents
- graph_cache/graph.json: prebuilt graph so the demo starts instantly
- tests/test_offline.py: 13 tests, fully offline (stubbed LLM, hashing embedder)
- scripts/: the offline graph-cache builder
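As an illustration of what "tolerant JSON" parsing in the extraction step might involve, here is a minimal sketch. The `parse_triples` helper and the `source` / `relation` / `target` keys are assumptions for this example; the project's actual schema and function names may differ.

```python
import json
import re


def parse_triples(raw):
    """Best-effort parse of an LLM reply into (source, relation, target) triples."""
    # Grab the outermost JSON array, ignoring any chatter around it.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if not match:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue  # skip stray strings or numbers inside the array
        src, rel, dst = (
            str(item.get(key) or "").strip()
            for key in ("source", "relation", "target")  # assumed key names
        )
        if src and rel and dst:
            triples.append((src, rel, dst))  # keep only complete triples
    return triples


reply = (
    "Sure! Here is the JSON:\n"
    '[{"source": "DeepMind", "relation": "developed", "target": "AlphaFold"},'
    ' {"source": "CRISPR"}]'
)
print(parse_triples(reply))
# [('DeepMind', 'developed', 'AlphaFold')]
```

Dropping incomplete items instead of raising keeps one malformed reply from aborting a whole graph build.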
langchain-groq · networkx · scikit-learn (TF-IDF) · pyvis · streamlit · optional sentence-transformers / langchain-google-genai.
pip install -r requirements-dev.txt
pytest -q # 13 passed; runs with no API key and no network

The suite stubs the LLM and uses a pure-numpy embedder, then exercises the real pipeline: chunking, tolerant JSON parsing, entity merging, graph build, multi-hop retrieval across documents, cited answering (LLM + extractive fallback), cache round-trip, and visualisation.
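For readers unfamiliar with hashing embedders, the idea behind an offline, model-free embedder can be sketched as below. This is a standard-library stand-in written for illustration only: the project's embedder is described as pure-numpy, and the `hashing_embed` name, the 64-dimension default and the use of MD5 as the bucket hash are all assumptions.

```python
import hashlib
import math
import re


def hashing_embed(text, dim=64):
    """Deterministic bag-of-words embedding via the hashing trick, L2-normalised."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # MD5 serves only as a stable hash; Python's built-in hash() is salted per process.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a = hashing_embed("CRISPR gene editing")
b = hashing_embed("gene editing with CRISPR")
print(round(dot(a, a), 6), dot(a, b) > 0)
```

Because the buckets come from a fixed hash, the same text always maps to the same vector with no model download or network call, which is what makes such an embedder convenient for offline tests. Texts that share tokens score above zero; unrelated texts score zero unless their tokens happen to collide in a bucket.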
- Lean deploy. requirements.txt avoids heavy deps (no torch) so Streamlit Cloud builds in ~1–2 minutes. Higher-quality sentence-transformers embeddings are opt-in.
- Provider-agnostic. One free Groq key runs everything; switch to Gemini with two env vars.
- Secrets are safe. The API key is read server-side only and is never placed in a widget, so it is never sent to a visitor's browser.
- Prebuilt cache. graph_cache/graph.json is shipped so the live demo loads instantly and doesn't spend API quota on every visit; "Rebuild" re-extracts it live with the LLM.
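The server-side key handling and the no-key fallback could look something like the sketch below. It is only an assumption about the shape of the logic: `answer_mode` is a hypothetical helper, and `GROQ_API_KEY` is assumed here as the variable name (check `.env.example` for the names the project actually reads).

```python
import os


def answer_mode(env=None):
    """Return "llm" when a key is configured on the server, else "extractive"."""
    env = os.environ if env is None else env
    # The key is only inspected here, server-side; it is never echoed into the UI.
    key = (env.get("GROQ_API_KEY") or "").strip()  # assumed variable name
    return "llm" if key else "extractive"


print(answer_mode({}))                               # extractive
print(answer_mode({"GROQ_API_KEY": "gsk_example"}))  # llm
```

Passing the environment in as a mapping keeps the check easy to test without touching real credentials.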
MIT — see LICENSE.
Built by Ankit Saxena as GenAI portfolio project #2 (knowledge-graph RAG / agentic retrieval).



