New codebase? Own it in minutes.
CodeAtlas is an open-source, AI-powered repository intelligence platform. Paste any public GitHub URL and instantly get a guided onboarding path, visual dependency map, full API reference, change impact analysis, and an AI that answers architecture questions cited back to the exact file and line.
Instead of reading docs manually or asking teammates, CodeAtlas constructs a structured knowledge base from any public GitHub repository and lets you ask questions like:
- "How does authentication work?"
- "What breaks if I modify PaymentService?"
- "Where should a new developer start?"
- Auto Onboarding Guide — Prioritised reading path, key services, and core workflows generated on ingest
- API Endpoint Discovery — Every route and HTTP method extracted via Tree-sitter AST — no LLM, no annotations
- Interactive Dependency Map — Visual cluster graph of every import relationship; click any node to focus its connections
- Change Impact Analysis — Type any function, class, or file to instantly see every file that depends on it
- Architecture-Aware Q&A — Ask anything in plain English; every answer cites the exact file, function, and line range
- AI Answer Evaluation — Run a retrieval benchmark (Recall@5 + MRR) to check how accurately the AI finds the right code
- Hybrid Search — Dense vector + BM25 sparse search merged with manual Reciprocal Rank Fusion (k=60)
- Multi-Turn Chat — Persistent sessions with full conversation history
- Multi-Language Support — Full AST parsing for Python, JavaScript, TypeScript, Java, Go; fallback chunking for all others
| Layer | Technology |
|---|---|
| Frontend | React 18, TypeScript, Vite, Tailwind CSS |
| Graph Visualisation | @xyflow/react (React Flow), dagre (hierarchical layout) |
| Backend | FastAPI, Python 3.11, SQLAlchemy (async) |
| Auth | GitHub OAuth (Authlib) + JWT |
| Relational DB | PostgreSQL 16 |
| Vector DB | Qdrant (dense + sparse; manual RRF fusion) |
| Embeddings | gemini-embedding-001 (3072-dim) |
| LLM | Gemini 2.0 Flash (Google AI Studio) |
| Code Parsing | Tree-sitter (AST-level semantic chunking) |
| Dependency Analysis | NetworkX (directed import graph, hub detection) |
| Containerisation | Docker + Docker Compose |
Browser (React + TypeScript + Tailwind)
│
│ HTTPS (JWT in Authorization header)
▼
FastAPI Backend
│
┌────┴──────────────────────────────────────┐
│ │
▼ ▼
PostgreSQL Qdrant
(users, repos, (dense + sparse vectors,
chunks, chat, jobs) manual RRF retrieval)
│ │
└──────────────┬────────────────────────────┘
│
┌───────────┼───────────────┐
▼ ▼ ▼
Gemini API gemini- Analysis Services
(answers) embedding-001 (NetworkX dep graph,
(indexing) Tree-sitter endpoints,
Impact analysis,
Retrieval eval)
| Tool | Version | Notes |
|---|---|---|
| Docker | 24+ | Required for all services |
| Docker Compose | v2+ | Bundled with Docker Desktop |
| Git | Any | To clone the repo |
| GitHub OAuth App | — | For authentication |
| Google AI Studio API key | — | For Gemini + Embeddings (free tier) |
git clone https://github.com/CodeTirtho97/CodeAtlas.git
cd CodeAtlas- Go to GitHub Developer Settings → OAuth Apps → New OAuth App
- Fill in:
- Application name:
CodeAtlas - Homepage URL:
http://localhost:3000 - Authorization callback URL:
http://localhost:3000/callback
- Application name:
- Click Register application
- Copy the Client ID and generate a Client Secret
- Go to Google AI Studio
- Click Create API key
- Copy the key (used for both Gemini Flash and gemini-embedding-001 — same key)
# Backend
cp backend/.env.example backend/.envEdit backend/.env:
GOOGLE_API_KEY=your_google_api_key_here
GITHUB_CLIENT_ID=your_github_client_id
GITHUB_CLIENT_SECRET=your_github_client_secret
JWT_SECRET=a_random_string_at_least_32_characters_long
FRONTEND_URL=http://localhost:3000Generate a secure JWT_SECRET:
python -c "import secrets; print(secrets.token_hex(32))"# Frontend
cp frontend/.env.example frontend/.envEdit frontend/.env:
VITE_API_URL=http://localhost:8000
VITE_GITHUB_CLIENT_ID=your_github_client_iddocker-compose upThis starts 4 services:
| Service | URL | Notes |
|---|---|---|
| Frontend | http://localhost:3000 | React app with hot reload |
| Backend | http://localhost:8000 | FastAPI with hot reload |
| PostgreSQL | localhost:5432 | Database |
| Qdrant | http://localhost:6333 | Vector database |
# Backend health check
curl http://localhost:8000/health
# Qdrant health check
curl http://localhost:6333/health
# Open the app
open http://localhost:3000 # macOS
start http://localhost:3000 # WindowsClick Sign in with GitHub on the landing page. You'll be redirected to GitHub's OAuth consent screen and then back.
Paste a public GitHub URL and click Analyze:
https://github.com/fastapi/fastapi
The ingestion pipeline runs in the background. A progress bar tracks each stage:
- Cloning the repository
- Filtering files (excludes vendor, build artifacts, etc.)
- Parsing code (Tree-sitter AST)
- Generating embeddings (gemini-embedding-001)
- Storing vectors in Qdrant + metadata in PostgreSQL
- Building the dependency graph (NetworkX)
- Extracting API endpoints (Tree-sitter — no LLM)
- Generating repository summary (Gemini)
- Generating the onboarding guide (Gemini)
Once complete, the dashboard provides six tabs:
| Tab | What it does |
|---|---|
| Overview | Purpose, tech stack, architecture description, entry points, and quick navigation |
| Understand | Step-by-step onboarding guide — files to read first and core workflows |
| Explore | All discovered API endpoints + interactive dependency map with cluster tabs |
| Ask AI | Multi-turn chat with persistent history and cited answers |
| Impact Area | Type any function, class, or file — see every file that depends on it |
| Evaluate | Run a retrieval benchmark to check AI answer accuracy (Recall@5 + MRR) |
Start a chat session and type any question about the repository:
How does authentication work?
What is the request flow for a POST /login?
Which files interact with UserService?
Where should a new developer start?
Answers include cited sources with exact file, function, and line range:
Sources:
► auth/service.py — authenticate_user() [L42–67]
► middleware/jwt.py — verify_token() [L17–31]
► db/models.py — User [L5–28]
Go to the Impact Area tab and type any symbol or file path:
UserService
auth/service.py
POST /login
CodeAtlas traces every file that transitively depends on it — so you know exactly what will break before touching a line.
codeatlas/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI entry point + middleware
│ │ ├── api/routes/
│ │ │ ├── auth.py # GitHub OAuth login/callback/me
│ │ │ ├── repos.py # Ingest, list, get, delete repos
│ │ │ ├── chat.py # Multi-turn chat sessions + Q&A
│ │ │ ├── impact.py # Change impact analysis endpoint
│ │ │ └── eval.py # Retrieval evaluation endpoint
│ │ ├── core/
│ │ │ ├── config.py # Pydantic settings (validated env vars)
│ │ │ ├── database.py # Async SQLAlchemy engine + session
│ │ │ └── qdrant.py # Qdrant client + collection init
│ │ ├── models/
│ │ │ ├── db/ # SQLAlchemy ORM models
│ │ │ │ ├── base.py
│ │ │ │ ├── user.py # GitHub user + rate limiting
│ │ │ │ ├── repository.py # Repo metadata + summary/guide JSON
│ │ │ │ ├── chunk.py # Code chunks (semantic units)
│ │ │ │ ├── ingestion_job.py # Ingestion job progress tracking
│ │ │ │ ├── chat_session.py
│ │ │ │ └── chat_message.py
│ │ │ └── schemas/ # Pydantic request/response schemas
│ │ └── services/
│ │ ├── ingestion/ # Clone → parse → embed → store
│ │ │ ├── pipeline.py # Async orchestrator
│ │ │ ├── cloner.py # GitPython repo clone
│ │ │ ├── file_filter.py # Exclude vendor/build artifacts
│ │ │ ├── parser.py # AST parsing dispatcher
│ │ │ ├── parsers/ # Per-language Tree-sitter parsers
│ │ │ │ ├── python_parser.py
│ │ │ │ ├── js_parser.py
│ │ │ │ ├── ts_parser.py
│ │ │ │ ├── java_parser.py
│ │ │ │ ├── go_parser.py
│ │ │ │ └── fallback_parser.py
│ │ │ ├── embedder.py # gemini-embedding-001 calls
│ │ │ └── store.py # Qdrant + PostgreSQL persistence
│ │ ├── search/
│ │ │ ├── retriever.py # Hybrid search: 2× query_points + manual RRF
│ │ │ └── sparse.py # BM25-like feature-hash tokenizer
│ │ ├── generation/
│ │ │ ├── qa.py # LLM Q&A with context + chat history
│ │ │ ├── summarizer.py # Purpose, stack, architecture summary
│ │ │ └── onboarding.py # Step-by-step learning guide
│ │ ├── analysis/
│ │ │ ├── api_extractor.py # HTTP endpoint discovery (Tree-sitter)
│ │ │ ├── dependency_graph.py # Import graph via NetworkX
│ │ │ └── impact.py # Transitive dependency impact analysis
│ │ └── evaluation/
│ │ └── retrieval_eval.py # Recall@5 + MRR benchmark runner
│ ├── alembic/ # DB migration scripts
│ ├── pyproject.toml
│ ├── Dockerfile
│ ├── .dockerignore
│ └── .env.example
│
├── frontend/
│ ├── src/
│ │ ├── App.tsx # Router + ProtectedRoute wrapper
│ │ ├── api/
│ │ │ ├── client.ts # Axios instance with JWT injection
│ │ │ ├── repos.ts # Repository CRUD + ingestion
│ │ │ └── chat.ts # Chat sessions + multi-turn Q&A
│ │ ├── context/
│ │ │ └── AuthContext.tsx
│ │ ├── pages/
│ │ │ ├── LandingPage.tsx
│ │ │ ├── CallbackPage.tsx
│ │ │ ├── DashboardPage.tsx # 6-tab repository dashboard
│ │ │ ├── IngestionPage.tsx # Real-time ingestion progress
│ │ │ └── ArchitecturePage.tsx
│ │ └── components/
│ │ ├── Header.tsx
│ │ ├── AppFooter.tsx
│ │ ├── dashboard/
│ │ │ ├── OverviewPanel.tsx
│ │ │ ├── UnderstandPanel.tsx
│ │ │ ├── ExplorePanel.tsx # API map + dependency graph
│ │ │ ├── DependencyGraph.tsx # React Flow + dagre cluster graph
│ │ │ ├── AskAIWorkspace.tsx
│ │ │ ├── ImpactWorkbench.tsx # Change impact analysis UI
│ │ │ └── EvalDashboard.tsx # Retrieval evaluation UI
│ │ ├── landing/
│ │ │ ├── AnalyzeForm.tsx
│ │ │ ├── RepoCard.tsx
│ │ │ ├── RepoSection.tsx
│ │ │ └── FeaturesGrid.tsx
│ │ └── architecture/
│ │ ├── PipelineSection.tsx
│ │ ├── HybridSearchSection.tsx
│ │ ├── FeaturesUnlockedSection.tsx
│ │ ├── BenchmarksSection.tsx
│ │ └── TechStackSection.tsx
│ ├── package.json
│ ├── vite.config.ts
│ ├── tailwind.config.js
│ ├── Dockerfile
│ └── .env.example
│
├── docker-compose.yml
└── README.md
cd backend
# Install uv (fast Python package manager)
pip install uv
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
uv pip install -e ".[dev]"
# Run the development server
uvicorn app.main:app --reloadcd frontend
npm install
npm run dev# Backend unit tests
cd backend
pytest tests/unit -v
# With coverage
pytest --cov=app --cov-report=html
# Integration tests (requires PostgreSQL + Qdrant running)
pytest tests/integration -v -m integration
# Frontend
cd frontend
npm run testdocker-compose up # Start all services
docker-compose up -d # Start in background
docker-compose build --no-cache # Rebuild after dependency changes
docker-compose logs -f backend # Stream backend logs
docker-compose exec postgres psql -U codeatlas -d codeatlas_dev
docker-compose down # Stop all services
docker-compose down -v # Stop and wipe all volumesInteractive docs: http://localhost:8000/api/docs (Swagger UI)
| Method | Path | Description |
|---|---|---|
GET |
/health |
Service health check |
GET |
/auth/login |
Initiate GitHub OAuth flow |
POST |
/auth/callback |
Exchange OAuth code for JWT |
GET |
/auth/me |
Get current authenticated user |
| Method | Path | Description |
|---|---|---|
GET |
/repos |
List user's repositories |
GET |
/repos/{id} |
Get repository details + dashboard data |
POST |
/repos/ingest |
Submit repository for ingestion |
GET |
/repos/ingest/{job_id}/status |
Poll ingestion progress |
DELETE |
/repos/{id} |
Delete repository + vectors |
| Method | Path | Description |
|---|---|---|
POST |
/chat/sessions |
Create a new chat session |
GET |
/chat/sessions |
List sessions for a repository |
GET |
/chat/sessions/{id}/messages |
Get all messages in a session |
POST |
/chat/sessions/{id}/ask |
Ask a question in a session |
DELETE |
/chat/sessions/{id} |
Delete a chat session |
| Method | Path | Description |
|---|---|---|
POST |
/impact/{repo_id} |
Run change impact analysis for a symbol or file |
POST |
/eval/{repo_id} |
Run retrieval evaluation (Recall@5 + MRR) |
| Variable | Required | Description |
|---|---|---|
DATABASE_URL |
Yes | PostgreSQL async connection string |
QDRANT_URL |
Yes | Qdrant server URL |
QDRANT_API_KEY |
No | Qdrant Cloud API key (leave blank for local) |
GOOGLE_API_KEY |
Yes | Google AI Studio key (Gemini + gemini-embedding-001) |
GITHUB_CLIENT_ID |
Yes | GitHub OAuth App Client ID |
GITHUB_CLIENT_SECRET |
Yes | GitHub OAuth App Client Secret |
JWT_SECRET |
Yes | Random secret for signing JWTs (min 32 chars) |
FRONTEND_URL |
Yes | Frontend URL for CORS + OAuth redirect |
LOG_LEVEL |
No | Logging level (INFO, DEBUG, WARNING) |
| Variable | Required | Description |
|---|---|---|
VITE_API_URL |
Yes | Backend API base URL |
VITE_GITHUB_CLIENT_ID |
Yes | GitHub OAuth App Client ID |
| Limit | Value | Notes |
|---|---|---|
| Repositories stored | 3 | Delete one to add another |
| Repositories analyzed per day | 3 | Resets at midnight UTC |
| Max files per repository | 10,000 | Larger repos are rejected |
| Total chunks per user | 100,000 | Hard limit before ingestion |
| Chat questions per session | 15 | Create a new session to continue |
| Chat questions per day | 30 | Resets at midnight UTC |
Docker Compose infrastructure, FastAPI backend, GitHub OAuth, SQLAlchemy + Qdrant setup, React + TypeScript scaffold
Repository cloning, Tree-sitter AST parsing (Python, JS, TS, Java, Go), semantic chunking, gemini-embedding-001 embeddings, hybrid vector + BM25 storage, auto-generated summary and onboarding guide, API endpoint discovery, NetworkX dependency graph
Hybrid dense+sparse search with manual RRF fusion (k=60), multi-turn chat with persistent history, cited Q&A, Gemini 2.0 Flash integration
Interactive dependency graph with cluster tabs (React Flow + dagre), change impact analysis, AI answer quality evaluation (Recall@5 + MRR)
6-tab dashboard (Overview, Understand, Explore, Ask AI, Impact Area, Evaluate), real-time ingestion progress, chat interface with session history, repository management
- Vercel (frontend)
- Railway (backend)
- Neon (PostgreSQL)
- Qdrant Cloud
- Private repository support (GitHub App installation flow)
- Incremental re-indexing (diff-based — only changed files)
- Multi-turn chat history search
- Repository health insights
- Additional language parsers (Rust, Ruby, C#, C++)
| Language | Extensions | Endpoint Detection |
|---|---|---|
| Python | .py |
FastAPI / Flask route decorators |
| JavaScript | .js, .jsx |
Express router.get/post patterns |
| TypeScript | .ts, .tsx |
Express + typed exports |
| Java | .java |
@RestController, @GetMapping annotations |
| Go | .go |
http.HandleFunc, chi / gin router patterns |
All other languages chunked by line count (100 lines max), stored as chunk_type: raw.
node_modules/, vendor/, .git/, dist/, build/, __pycache__/, migrations/, *.min.js, *.min.css, *-lock.json, *.lock, binary files.
Ensure all required env vars are set in backend/.env. The app validates them at startup.
docker-compose logs backend
docker-compose down -v
docker-compose build --no-cache
docker-compose up- Verify callback URL in your GitHub OAuth App matches exactly:
http://localhost:3000/callback - Check
GITHUB_CLIENT_IDandGITHUB_CLIENT_SECRETinbackend/.env - Ensure
FRONTEND_URL=http://localhost:3000is set
# macOS/Linux
lsof -i :8000 && kill -9 <PID>
# Windows
netstat -ano | findstr :8000
taskkill /PID <PID> /Fdocker-compose logs -f backend
# Verify GOOGLE_API_KEY is valid and has available quotaEnsure you're running qdrant-client ≥ 1.18.0. CodeAtlas uses two separate query_points calls with manual RRF — the older native Prefetch+Fusion API is not used.
MIT License — see LICENSE for details.
- FastAPI — Async Python web framework
- Qdrant — Vector database with dense + sparse support
- Tree-sitter — Language-agnostic AST parser
- Google AI Studio — Gemini Flash + gemini-embedding-001
- React Flow (@xyflow/react) — Interactive graph visualisation
- dagre — Hierarchical graph layout
- NetworkX — Python graph analysis
- Vite — Frontend build tool
- uv — Fast Python package manager
Built with the goal of making every codebase as navigable as one you wrote yourself.