AnkitSaxena-AI/graphrag-explorer

🕸️ GraphRAG Knowledge Explorer

Documents → an LLM-extracted knowledge graph → multi-hop, cited answers with an interactive network view. The sophisticated cousin of vanilla RAG: it can connect facts across documents, not just retrieve isolated passages.


What it does

Upload-free, self-contained demo over a curated corpus spanning space exploration, biotechnology & genetics, and artificial intelligence. Ask a question in plain English and the system:

  1. seeds entities by semantic similarity + name match,
  2. traverses the knowledge graph to find the entities, relationships and shortest paths that connect your question,
  3. assembles that sub-graph (plus the original source passages) into context, and
  4. has the LLM answer with inline [S#] citations and an explicit reasoning path — while highlighting the exact sub-graph it used.
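
Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Source` class, the `assemble_context` helper and the toy relations are hypothetical names chosen for the example. It only shows how retrieved relationships and numbered passages might be rendered into one context string so the LLM can cite them as [S#].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A source passage (hypothetical structure for this sketch)."""
    title: str
    text: str


def assemble_context(relations, sources):
    """Render (head, relation, tail) triples and numbered passages as prompt context."""
    lines = ["Relationships:"]
    for head, rel, tail in relations:
        lines.append(f"- {head} --{rel}--> {tail}")
    lines.append("Sources:")
    # Number passages from 1 so the answer can cite them as [S1], [S2], ...
    for i, src in enumerate(sources, start=1):
        lines.append(f"[S{i}] {src.title}: {src.text}")
    return "\n".join(lines)


# Toy data for illustration only.
relations = [
    ("DeepMind", "developed", "AlphaFold"),
    ("AlphaFold", "predicts", "protein structures"),
]
sources = [
    Source("AI labs", "DeepMind developed AlphaFold."),
    Source("Biotech", "AlphaFold predicts protein structures."),
]
context = assemble_context(relations, sources)
print(context)
```

The answer prompt would then ask the model to cite only the [S#] labels that appear in this context.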

Because the connections are explicit edges, it shines on multi-hop questions that bridge domains, e.g.:

  • "Which AI lab developed a system that predicts protein structures, and who led it?"
  • "Trace a path from deep learning to the Mars rover Perseverance."
  • "What do AlphaFold and the mRNA COVID-19 vaccines have in common?"

Knowledge graph

The full knowledge graph extracted from the 8-document corpus (116 entities, 131 relationships), coloured by community. Space, biotech and AI form clusters joined by cross-domain bridges such as AlphaFold (AI ↔ biology) and AutoNav (AI ↔ space).


Why a graph? (GraphRAG vs vanilla RAG)

| | Vanilla RAG | GraphRAG (this project) |
|---|---|---|
| Retrieval unit | top-k isolated text chunks | entities + relationships + paths, then their source chunks |
| Multi-hop questions | weak: must hope all hops sit in one chunk | strong: follows edges across documents |
| Explainability | "here are some chunks" | the exact sub-graph + reasoning path + citations |
| Global structure | none | communities group related entities |

Inspired by Microsoft's GraphRAG (2024), implemented from scratch with LangChain, NetworkX and Streamlit.


Architecture

flowchart LR
    A[Corpus<br/>.txt documents] --> B[Chunk]
    B --> C{{LLM extraction}}
    C -->|entity, relation, entity<br/>+ descriptions| D[Merge into<br/>NetworkX graph]
    D --> E[Louvain<br/>communities]
    D --> F[(Prebuilt<br/>graph_cache)]

    Q[Question] --> G[Seed entities<br/>vector + name match]
    F --> G
    G --> H[Traverse:<br/>k-hop + shortest paths]
    H --> I[Assemble context<br/>entities · relations · sources]
    I --> J{{LLM answer<br/>with citations}}
    H --> K[Highlight<br/>sub-graph]
    J --> L[Answer + reasoning path]
    K --> L

Retrieval is hybrid: vector similarity over entity descriptions finds semantically relevant seeds, name-matching catches exact mentions, and graph traversal (k-hop neighbourhood + shortest paths between seeds) supplies the connective tissue that plain vector search misses.
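
The traversal half of that description can be sketched in plain Python. This is an illustration under stated assumptions, not the code in retrieval.py: the toy `GRAPH`, the function names and the intermediate "computer vision" node are all hypothetical, and the vector-similarity seeding is left out so the example stays dependency-free (only name matching is shown).

```python
from collections import deque
from itertools import combinations

# Toy undirected graph as an adjacency dict (illustrative, not the shipped graph).
GRAPH = {
    "deep learning": {"computer vision"},
    "computer vision": {"deep learning", "AutoNav"},
    "AutoNav": {"computer vision", "Perseverance"},
    "Perseverance": {"AutoNav", "NASA"},
    "NASA": {"Perseverance"},
}


def seed_by_name(question, graph):
    """Return entities whose name appears verbatim in the question (case-insensitive)."""
    q = question.lower()
    return [name for name in graph if name.lower() in q]


def k_hop(graph, seeds, k):
    """Breadth-first expansion: every node within k edges of any seed."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(k):
        frontier = {nbr for node in frontier for nbr in graph[node]} - seen
        seen |= frontier
    return seen


def shortest_path(graph, start, goal):
    """Unweighted shortest path via BFS; returns None if the nodes are disconnected."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nbr in graph[node]:
            if nbr not in parents:
                parents[nbr] = node
                queue.append(nbr)
    return None


def retrieve(question, graph, k=1):
    """Seeds, their k-hop neighbourhood, and the shortest paths between seed pairs."""
    seeds = seed_by_name(question, graph)
    nodes = k_hop(graph, seeds, k)
    paths = []
    for a, b in combinations(seeds, 2):
        path = shortest_path(graph, a, b)
        if path:
            paths.append(path)
            nodes.update(path)
    return seeds, nodes, paths


seeds, nodes, paths = retrieve(
    "Trace a path from deep learning to the Mars rover Perseverance.", GRAPH
)
print(seeds)
print(paths)
```

The shortest paths are what let the two seeds, which never co-occur in one chunk here, end up in the same context.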

Example sub-graph

A retrieved sub-graph for "Trace a path from deep learning to the Mars rover Perseverance" — seeds in red, traversed neighbours in colour.


Quickstart

pip install -r requirements.txt
cp .env.example .env          # paste a free Groq key from console.groq.com/keys
streamlit run app/app.py

The graph loads instantly from the prebuilt cache. No key? It still answers extractively from the graph — add a key for fully written answers. See STEPS.md for the full run/test/deploy checklist.
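
As a rough sketch of the no-key behaviour, the snippet below picks a mode from the environment and, without a key, answers by quoting the best-matching passage. Everything here is an assumption made for illustration: the README does not say which variable name .env uses (`GROQ_API_KEY` is assumed), and the word-overlap scoring is a stand-in, not necessarily how the project's extractive fallback ranks passages.

```python
import os
import re


def choose_mode(env=None):
    """Return "llm" when a key is configured, else "extractive".

    GROQ_API_KEY is an assumed variable name for this sketch.
    """
    env = os.environ if env is None else env
    return "llm" if env.get("GROQ_API_KEY", "").strip() else "extractive"


def tokens(text):
    """Lower-cased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extractive_answer(question, passages, top_n=1):
    """Quote the passages with the largest word overlap, each tagged with its [S#]."""
    q = tokens(question)
    ranked = sorted(
        enumerate(passages, start=1),
        key=lambda item: len(q & tokens(item[1])),
        reverse=True,
    )
    return " ".join(f"{text} [S{i}]" for i, text in ranked[:top_n])


# Toy passages for illustration only.
passages = [
    "CRISPR-Cas9 is a gene-editing tool; its developers won the 2020 Nobel Prize in Chemistry.",
    "Perseverance landed on Mars in February 2021.",
]
result = extractive_answer("What gene-editing tool won a Nobel Prize?", passages)
print(choose_mode({}), "->", result)
```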

Use it from Python

from graphrag import GraphRAG, Settings, get_llm

settings = Settings()                       # Groq + TF-IDF by default
rag = GraphRAG(settings, llm=get_llm(settings)).build()   # extract graph with the LLM
rag.save()                                  # cache it

ans = rag.query("How is the lab behind AlphaGo connected to biology?")
print(ans.text)                             # answer with [S#] citations
for s in ans.sources:
    print(s.cite, s.title)

Or load the shipped cache without a key and answer extractively:

rag = GraphRAG(Settings()).load()
print(rag.query("What gene-editing tool won a Nobel Prize?").text)

Demo

Asking a multi-hop question — the app returns a cited answer, the reasoning path, and the exact sub-graph it used:

App answering a multi-hop question

Exploring the full knowledge graph — drag, zoom and hover the interactive network:

Interactive knowledge graph


How it's built

  • graphrag/ — the library
    • config.py — one Settings dataclass, all env-overridable
    • llm.py — provider abstraction (Groq default, Gemini fallback) with lazy imports, plus an extractive fallback for the no-key path
    • chunking.py — corpus loading, chunking, entity-name normalisation
    • prompts.py — extraction / community-summary / answer prompts
    • extraction.py — LLM → tolerant JSON → (entity, relation, entity) triples
    • embeddings.py — pluggable tfidf / sentence-transformers / hashing
    • graph_build.py — KnowledgeGraph (NetworkX), entity merge, Louvain communities, JSON cache, traversal helpers
    • retrieval.py — hybrid seed + multi-hop traversal + context assembly
    • qa.py — answer synthesis with citations (LLM or extractive)
    • pipeline.py — GraphRAG orchestrator + sample questions
    • viz.py — interactive pyvis graph + static matplotlib figures
  • app/app.py — Streamlit UI (ask, explore, rebuild)
  • data/corpus/ — 8 curated factual documents
  • graph_cache/graph.json — prebuilt graph so the demo starts instantly
  • tests/test_offline.py — 13 tests, fully offline (stubbed LLM, hashing embedder)
  • scripts/ — the offline graph-cache builder
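
To make the "LLM → tolerant JSON → triples" step and the entity merge more concrete, here is a small sketch. It is not the code in extraction.py or chunking.py: the field names `head`, `relation` and `tail`, the helper names and the normalisation rule are all assumptions for illustration. The idea is simply to salvage a JSON array from a chatty reply, skip malformed items, and merge duplicates that differ only in case or spacing.

```python
import json
import re


def parse_tolerant(raw):
    """Pull the outermost JSON array out of an LLM reply that may contain extra prose."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    snippet = raw[start:end + 1]
    # Drop trailing commas before a closing bracket or brace, a common LLM slip.
    # (Naive: it would also touch such a sequence inside a string value.)
    snippet = re.sub(r",\s*([\]}])", r"\1", snippet)
    try:
        data = json.loads(snippet)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []


def normalise(name):
    """Canonical entity key: collapse whitespace and case-fold."""
    return " ".join(name.split()).casefold()


def to_triples(items):
    """Keep only well-formed items and return de-duplicated (head, relation, tail) triples."""
    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue
        head, rel, tail = item.get("head"), item.get("relation"), item.get("tail")
        if all(isinstance(v, str) and v.strip() for v in (head, rel, tail)):
            triples.append((normalise(head), rel.strip(), normalise(tail)))
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(triples))


reply = (
    "Sure, here is the JSON:\n"
    '[{"head": "DeepMind", "relation": "developed", "tail": "AlphaFold"},\n'
    ' {"head": "deepmind ", "relation": "developed", "tail": "AlphaFold"},\n'
    ' {"head": "AlphaFold"},]\n'
    "Hope that helps."
)
merged = to_triples(parse_tolerant(reply))
print(merged)
```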

Tech stack

langchain-groq · networkx · scikit-learn (TF-IDF) · pyvis · streamlit · optional sentence-transformers / langchain-google-genai.


Testing

pip install -r requirements-dev.txt
pytest -q        # 13 passed — runs with no API key and no network

The suite stubs the LLM and uses a pure-numpy embedder, then exercises the real pipeline: chunking, tolerant JSON parsing, entity merging, graph build, multi-hop retrieval across documents, cited answering (LLM + extractive fallback), cache round-trip, and visualisation.
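
The offline setup can be approximated like this. It is a sketch, not the test suite's code: the suite's embedder is described as pure numpy, whereas this version uses plain lists to stay dependency-free, and `hash_embed`, `StubLLM` and its `invoke` method are hypothetical names. A stable hash (SHA-256) is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib
import math
import re


def hash_embed(text, dim=64):
    """Feature-hashing bag-of-words embedding, L2-normalised."""
    vec = [0.0] * dim
    for tok in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a, b):
    """Dot product; equals cosine similarity because vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


class StubLLM:
    """Returns a canned reply and records every prompt it receives."""

    def __init__(self, reply):
        self.reply = reply
        self.prompts = []

    def invoke(self, prompt):
        self.prompts.append(prompt)
        return self.reply


llm = StubLLM('[{"head": "NASA", "relation": "operates", "tail": "Perseverance"}]')
reply = llm.invoke("Extract triples from: NASA operates Perseverance.")
vec = hash_embed("Mars rover")
print(len(vec), round(cosine(vec, vec), 6), len(llm.prompts))
```

Because neither piece touches the network, tests built on them behave the same on a laptop and in CI.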


Notes & design choices

  • Lean deploy. requirements.txt avoids heavy deps (no torch) so Streamlit Cloud builds in ~1–2 minutes. Higher-quality sentence-transformers embeddings are opt-in.
  • Provider-agnostic. One free Groq key runs everything; switch to Gemini with two env vars.
  • Secrets are safe. The API key is read server-side only and is never placed in a widget, so it is never sent to a visitor's browser.
  • Prebuilt cache. graph_cache/graph.json is shipped so the live demo loads instantly and doesn't spend API quota on every visit; "Rebuild" re-extracts it live with the LLM.
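
The prebuilt-cache idea boils down to "load if present, otherwise build once and save". The sketch below shows that round trip with a plain dict and the standard library; the JSON layout, the function names and the toy graph are assumptions for illustration and may not match the actual format of graph_cache/graph.json.

```python
import json
import tempfile
from pathlib import Path


def save_graph(graph, path):
    """Write nodes and edges as JSON (layout is hypothetical)."""
    payload = {
        "nodes": [{"id": n, **attrs} for n, attrs in graph["nodes"].items()],
        "edges": [
            {"source": s, "target": t, "relation": r} for s, t, r in graph["edges"]
        ],
    }
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_graph(path):
    """Inverse of save_graph."""
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    nodes = {}
    for node in payload["nodes"]:
        attrs = dict(node)
        nodes[attrs.pop("id")] = attrs
    edges = [(e["source"], e["target"], e["relation"]) for e in payload["edges"]]
    return {"nodes": nodes, "edges": edges}


def load_or_build(path, build):
    """Use the cache when it exists; otherwise call build() once and persist the result."""
    path = Path(path)
    if path.exists():
        return load_graph(path)
    graph = build()
    save_graph(graph, path)
    return graph


calls = []


def build():
    """Stand-in for the expensive LLM extraction; counts how often it runs."""
    calls.append(1)
    return {
        "nodes": {"DeepMind": {"community": 0}, "AlphaFold": {"community": 0}},
        "edges": [("DeepMind", "AlphaFold", "developed")],
    }


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "graph.json"
    first = load_or_build(cache, build)   # cache miss: builds and saves
    second = load_or_build(cache, build)  # cache hit: no rebuild
print(len(calls), first == second)
```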

License

MIT — see LICENSE.


Built by Ankit Saxena as GenAI portfolio project #2 (knowledge-graph RAG / agentic retrieval).
