🕸️ GraphRAG Knowledge Explorer

Documents → an LLM-extracted knowledge graph → multi-hop, cited answers with an interactive network view. The sophisticated cousin of vanilla RAG — it can connect facts across documents, not just retrieve isolated passages.

What it does

Upload-free, self-contained demo over a curated corpus spanning space exploration, biotechnology & genetics, and artificial intelligence. Ask a question in plain English and the system:

- seeds entities by semantic similarity + name match,
- traverses the knowledge graph to find the entities, relationships and shortest paths that connect your question,
- assembles that sub-graph (plus the original source passages) into context, and
- has the LLM answer with inline [S#] citations and an explicit reasoning path, while highlighting the exact sub-graph it used.
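
The seeding step can be illustrated with a minimal, standard-library-only sketch. It is not the project's code: the entity descriptions are made up for illustration, the helper names (`tokens`, `cosine`, `seed_entities`) are hypothetical, and plain bag-of-words cosine similarity stands in for whichever embedder the library is configured with.

```python
import math
import re
from collections import Counter

# Hypothetical entity descriptions; the real graph stores its own.
ENTITIES = {
    "AlphaFold": "AI system from DeepMind that predicts protein structures",
    "Perseverance": "NASA Mars rover that landed in Jezero crater",
    "CRISPR": "gene-editing tool based on the Cas9 enzyme",
}


def tokens(text):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def seed_entities(question, entities, top_k=2):
    """Exact name mentions first, then the top_k most similar descriptions."""
    q_tokens = tokens(question)
    q_text = f" {' '.join(q_tokens)} "
    q_vec = Counter(q_tokens)

    # 1) Name match: the entity name appears as whole words in the question.
    named = [n for n in entities if f" {' '.join(tokens(n))} " in q_text]

    # 2) Similarity: rank the remaining entities by description overlap.
    scored = []
    for name, desc in entities.items():
        if name in named:
            continue
        score = cosine(q_vec, Counter(tokens(desc)))
        if score > 0:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return named + [name for _, name in scored[:top_k]]


print(seed_entities("Which lab built a system that predicts protein structures?", ENTITIES))
```

Name matches are listed first because an exact mention is a stronger signal than lexical overlap; the similarity pass then fills in entities the question describes without naming.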

Because the connections are explicit edges, it shines on multi-hop questions that bridge domains, e.g.:

"Which AI lab developed a system that predicts protein structures, and who led it?"
"Trace a path from deep learning to the Mars rover Perseverance."
"What do AlphaFold and the mRNA COVID-19 vaccines have in common?"

The full knowledge graph extracted from the 8-document corpus (116 entities, 131 relationships), coloured by community. Space, biotech and AI form clusters joined by cross-domain bridges such as AlphaFold (AI ↔ biology) and AutoNav (AI ↔ space).

Why a graph? (GraphRAG vs vanilla RAG)

| Aspect | Vanilla RAG | GraphRAG (this project) |
| --- | --- | --- |
| Retrieval unit | top-k isolated text chunks | entities + relationships + paths, then their source chunks |
| Multi-hop questions | weak: must hope all hops sit in one chunk | strong: follows edges across documents |
| Explainability | "here are some chunks" | the exact sub-graph + reasoning path + citations |
| Global structure | none | communities group related entities |

Inspired by Microsoft's GraphRAG (2024), implemented from scratch with LangChain, NetworkX and Streamlit.

Architecture

flowchart LR
    A[Corpus<br/>.txt documents] --> B[Chunk]
    B --> C{{LLM extraction}}
    C -->|entity, relation, entity<br/>+ descriptions| D[Merge into<br/>NetworkX graph]
    D --> E[Louvain<br/>communities]
    D --> F[(Prebuilt<br/>graph_cache)]

    Q[Question] --> G[Seed entities<br/>vector + name match]
    F --> G
    G --> H[Traverse:<br/>k-hop + shortest paths]
    H --> I[Assemble context<br/>entities · relations · sources]
    I --> J{{LLM answer<br/>with citations}}
    H --> K[Highlight<br/>sub-graph]
    J --> L[Answer + reasoning path]
    K --> L

Retrieval is hybrid: vector similarity over entity descriptions finds semantically relevant seeds, name-matching catches exact mentions, and graph traversal (k-hop neighbourhood + shortest paths between seeds) supplies the connective tissue that plain vector search misses.
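
The traversal half of that description (k-hop neighbourhood plus shortest paths between seeds) can be sketched as below. This is a standard-library illustration under stated assumptions, not the code in retrieval.py: the edge list is invented for the example, and all function names are hypothetical. With NetworkX, `nx.ego_graph` and `nx.shortest_path` cover the same two operations.

```python
from collections import deque

# Illustrative undirected edges; relation labels are omitted for brevity
# and these are NOT the edges of the shipped graph.
EDGES = [
    ("Deep learning", "Convolutional neural network"),
    ("Convolutional neural network", "AutoNav"),
    ("AutoNav", "Perseverance"),
    ("Deep learning", "AlphaFold"),
    ("AlphaFold", "DeepMind"),
]


def build_adjacency(edges):
    """Undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def k_hop(adj, seed, k):
    """All nodes within k edges of seed (depth-limited BFS)."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


def shortest_path(adj, start, goal):
    """Unweighted shortest path via BFS; returns [] if goal is unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return []


def retrieve_subgraph(adj, seeds, k=1):
    """Union of each seed's k-hop neighbourhood and all seed-to-seed paths."""
    nodes = set()
    for seed in seeds:
        nodes |= k_hop(adj, seed, k)
    for i, a in enumerate(seeds):
        for b in seeds[i + 1:]:
            nodes.update(shortest_path(adj, a, b))
    return nodes


adj = build_adjacency(EDGES)
print(shortest_path(adj, "Deep learning", "Perseverance"))
print(sorted(retrieve_subgraph(adj, ["Deep learning", "Perseverance"], k=1)))
```

The shortest-path pass is what pulls in intermediate nodes that lie more than k hops from every seed, which is how two distant seeds still end up in one connected sub-graph.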

A retrieved sub-graph for "Trace a path from deep learning to the Mars rover Perseverance" — seeds in red, traversed neighbours in colour.

Quickstart

pip install -r requirements.txt
cp .env.example .env          # paste a free Groq key from console.groq.com/keys
streamlit run app/app.py

The graph loads instantly from the prebuilt cache. No key? It still answers extractively from the graph — add a key for fully written answers. See STEPS.md for the full run/test/deploy checklist.

Use it from Python

from graphrag import GraphRAG, Settings, get_llm

settings = Settings()                       # Groq + TF-IDF by default
rag = GraphRAG(settings, llm=get_llm(settings)).build()   # extract graph with the LLM
rag.save()                                  # cache it

ans = rag.query("How is the lab behind AlphaGo connected to biology?")
print(ans.text)                             # answer with [S#] citations
for s in ans.sources:
    print(s.cite, s.title)

Or load the shipped cache without a key and answer extractively:

rag = GraphRAG(Settings()).load()
print(rag.query("What gene-editing tool won a Nobel Prize?").text)
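
For readers curious how [S#] markers like those in `ans.text` and `s.cite` might be produced, here is a small sketch of citation-aware context assembly. It assumes a shape for the data (relation triples tagged with a source title) that the README does not specify; `Source` and `assemble_context` are hypothetical names, not the library's API.

```python
from dataclasses import dataclass


@dataclass
class Source:
    cite: str   # citation label such as "S1"
    title: str  # document title
    text: str   # supporting passage


def assemble_context(triples, passages):
    """Build an LLM context string and a numbered source list.

    triples:  iterable of (head, relation, tail, source_title)
    passages: dict mapping source_title -> passage text
    """
    sources = []
    cite_for = {}  # source_title -> "S#", assigned in order of first use
    lines = ["Relationships:"]
    for head, rel, tail, title in triples:
        if title not in cite_for:
            cite_for[title] = f"S{len(sources) + 1}"
            sources.append(Source(cite_for[title], title, passages[title]))
        lines.append(f"- {head} --{rel}--> {tail} [{cite_for[title]}]")
    lines.append("Sources:")
    lines += [f"[{s.cite}] {s.title}: {s.text}" for s in sources]
    return "\n".join(lines), sources


# Made-up triples and passages, for illustration only.
triples = [
    ("DeepMind", "developed", "AlphaFold", "ai.txt"),
    ("AlphaFold", "predicts", "protein structures", "bio.txt"),
    ("DeepMind", "developed", "AlphaGo", "ai.txt"),
]
passages = {"ai.txt": "DeepMind built AlphaGo and AlphaFold.", "bio.txt": "AlphaFold predicts protein structures."}
context, sources = assemble_context(triples, passages)
print(context)
```

Numbering sources in order of first use keeps the labels stable between the context the LLM sees and the source list shown to the user, so a citation in the answer can be resolved back to a passage.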

Demo

Asking a multi-hop question — the app returns a cited answer, the reasoning path, and the exact sub-graph it used:

Exploring the full knowledge graph — drag, zoom and hover the interactive network:

How it's built

graphrag/ — the library
- config.py — one Settings dataclass, all env-overridable
- llm.py — provider abstraction (Groq default, Gemini fallback) with lazy imports, plus an extractive fallback for the no-key path
- chunking.py — corpus loading, chunking, entity-name normalisation
- prompts.py — extraction / community-summary / answer prompts
- extraction.py — LLM → tolerant JSON → (entity, relation, entity) triples
- embeddings.py — pluggable tfidf / sentence-transformers / hashing
- graph_build.py — KnowledgeGraph (NetworkX), entity merge, Louvain communities, JSON cache, traversal helpers
- retrieval.py — hybrid seed + multi-hop traversal + context assembly
- qa.py — answer synthesis with citations (LLM or extractive)
- pipeline.py — GraphRAG orchestrator + sample questions
- viz.py — interactive pyvis graph + static matplotlib figures
app/app.py — Streamlit UI (ask, explore, rebuild)
data/corpus/ — 8 curated factual documents
graph_cache/graph.json — prebuilt graph so the demo starts instantly
tests/test_offline.py — 13 tests, fully offline (stubbed LLM, hashing embedder)
scripts/ — the offline graph-cache builder
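
To make the "LLM → tolerant JSON → triples" and entity-merge steps concrete, here is a rough standard-library sketch. It is an illustration under assumptions, not the contents of extraction.py or graph_build.py: the JSON keys (`head`, `relation`, `tail`) and the function names are hypothetical, and the trailing-comma repair is deliberately naive (it would also rewrite a `,]` that appears inside a string value).

```python
import json
import re


def parse_tolerant(raw):
    """Pull the outermost JSON array out of an LLM reply; [] if none parses."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    candidate = raw[start:end + 1]
    # Drop trailing commas before a closing bracket or brace.
    candidate = re.sub(r",\s*([\]}])", r"\1", candidate)
    try:
        data = json.loads(candidate)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


def normalise(name):
    """Canonical key for an entity name: collapsed whitespace, case-folded."""
    return re.sub(r"\s+", " ", name).strip().casefold()


def merge_triples(items):
    """Merge parsed items into (entities, edges), skipping malformed ones."""
    entities = {}  # normalised key -> first spelling seen
    edges = set()  # (head_key, relation, tail_key)
    for item in items:
        if not isinstance(item, dict):
            continue
        head, rel, tail = item.get("head"), item.get("relation"), item.get("tail")
        if not all(isinstance(v, str) and v.strip() for v in (head, rel, tail)):
            continue
        h, t = normalise(head), normalise(tail)
        entities.setdefault(h, head.strip())
        entities.setdefault(t, tail.strip())
        edges.add((h, rel.strip().lower(), t))
    return entities, edges


reply = (
    "Sure, here you go:\n"
    '[{"head": "DeepMind", "relation": "developed", "tail": "AlphaFold"},\n'
    ' {"head": "deepmind ", "relation": "Developed", "tail": "alphafold"},\n'
    ' {"head": "AlphaFold"},]\n'
    "Let me know if you need more!"
)
entities, edges = merge_triples(parse_tolerant(reply))
print(entities, edges)
```

Keying entities on a normalised name is what lets mentions from different documents collapse into one node, which in turn is what makes cross-document edges possible.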

Tech stack

langchain-groq · networkx · scikit-learn (TF-IDF) · pyvis · streamlit · optional sentence-transformers / langchain-google-genai.

Testing

pip install -r requirements-dev.txt
pytest -q        # 13 passed — runs with no API key and no network

The suite stubs the LLM and uses a pure-numpy hashing embedder, then exercises the real pipeline end to end: chunking, tolerant JSON parsing, entity merging, graph build, multi-hop retrieval across documents, cited answering (LLM + extractive fallback), cache round-trip, and visualisation.
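
A hashing embedder is what makes the offline run possible: it needs no trained model and no downloads. The sketch below shows the general feature-hashing idea using only the standard library; it is not the project's embedder (which is described as pure numpy), and `hash_embed`, `dot` and the 64-dimension default are assumptions made for illustration.

```python
import hashlib
import math
import re


def hash_embed(text, dim=64):
    """Feature-hashing embedding: each token adds +1 or -1 to one bucket."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # L2-normalise so a dot product between two embeddings is their cosine.
    return [v / norm for v in vec] if norm else vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = hash_embed("Mars rover Perseverance")
b = hash_embed("mars ROVER perseverance")
print(a == b, round(dot(a, a), 6))
```

Using a cryptographic digest rather than Python's built-in `hash()` keeps the vectors identical across processes, since string hashing in Python is randomised per run unless `PYTHONHASHSEED` is fixed.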

Notes & design choices

- Lean deploy. requirements.txt avoids heavy deps (no torch) so Streamlit Cloud builds in ~1–2 minutes. Higher-quality sentence-transformers embeddings are opt-in.
- Provider-agnostic. One free Groq key runs everything; switch to Gemini with two env vars.
- Secrets are safe. The API key is read server-side only and is never placed in a widget, so it is never sent to a visitor's browser.
- Prebuilt cache. graph_cache/graph.json is shipped so the live demo loads instantly and doesn't spend API quota on every visit; "Rebuild" re-extracts it live with the LLM.
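
The env-overridable settings and the server-side key handling described above could look roughly like this. It is a sketch only: `DemoSettings` and the `GRAPHRAG_PROVIDER` / `GRAPHRAG_EMBEDDER` variable names are hypothetical, `GROQ_API_KEY` is an assumed key name, and the real names live in config.py and .env.example.

```python
import os
from dataclasses import dataclass, field


def _env(name, default):
    """Dataclass field whose default is read from the environment at init time."""
    return field(default_factory=lambda: os.environ.get(name, default))


@dataclass
class DemoSettings:
    # Variable names below are illustrative assumptions, not the project's.
    provider: str = _env("GRAPHRAG_PROVIDER", "groq")
    embedder: str = _env("GRAPHRAG_EMBEDDER", "tfidf")
    # repr=False keeps the key out of logs and debug output.
    api_key: str = field(
        default_factory=lambda: os.environ.get("GROQ_API_KEY", ""), repr=False
    )

    @property
    def has_key(self):
        """False means the app should fall back to extractive answers."""
        return bool(self.api_key)


settings = DemoSettings()
print(settings, "LLM answers" if settings.has_key else "extractive fallback")
```

Reading the key from the environment inside the server process, and never binding it to a UI widget, is what keeps it out of the page sent to the browser; excluding it from `repr` additionally guards against accidental logging.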

License

MIT — see LICENSE.

Built by Ankit Saxena as GenAI portfolio project #2 (knowledge-graph RAG / agentic retrieval).
