DanielDeshmukh/Hector

H.E.C.T.O.R.

Hierarchical Evaluation of Civil-Criminal Textual Orchestrator & Retrieval

HECTOR is a high-precision "Hard-RAG" legal intelligence system for Indian law. It specializes in mapping the transition from the Indian Penal Code (IPC) to the Bharatiya Nyaya Sanhita (BNS), and answers with authoritative citations from a curated library of Bare Acts and commentaries. Its zero-hallucination goal is enforced by citation grounding: claims that cannot be verified against a source are flagged rather than guessed.


Quick Start

Docker (Recommended)

git clone <repo-url> && cd Hector
cp .env.example .env          # Add your API keys
docker compose --profile full up -d
# Frontend: http://localhost:3000
# API:      http://localhost:8000
# Docs:     http://localhost:8000/docs

Local Development

# Backend
python -m venv venv && venv\Scripts\activate   # Windows
# macOS/Linux: python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env            # Add your API keys
uvicorn api.app:app --reload --port 8000

# Frontend (separate terminal)
cd frontend
npm install && npm run dev

CLI

pip install -e .
hector status                   # Verify system
hector ingest                   # Index books
hector search "Section 302 IPC"

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        USER LAYER                               │
│  ┌──────────┐  ┌──────────────┐  ┌─────────┐  ┌────────────┐  │
│  │ React UI │  │  REST API    │  │   CLI   │  │  Voice I/O │  │
│  │ (Vite)   │  │  (FastAPI)   │  │ (Typer) │  │  (Web API) │  │
│  └────┬─────┘  └──────┬───────┘  └────┬────┘  └─────┬──────┘  │
│       └────────────────┴───────────────┴─────────────┘         │
└───────────────────────────────┬─────────────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                      CORE ENGINE                                │
│                                                                 │
│  ┌─────────┐    ┌──────────────┐    ┌────────────┐             │
│  │ Router  │───▶│  Retriever   │───▶│  Verifier  │             │
│  │(Groq)   │    │(Hybrid RAG)  │    │(Chain-of-  │             │
│  │         │    │              │    │Verification)│             │
│  └─────────┘    └──────┬───────┘    └─────┬──────┘             │
│                        │                  │                     │
│  ┌─────────────────────▼──────────────────▼─────────────┐      │
│  │              RESPONSE GENERATOR                       │      │
│  │  (Citation grounding, IPC↔BNS comparison tables)     │      │
│  └──────────────────────────────────────────────────────┘      │
└───────────────────────────────┬─────────────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                     DATA LAYER                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌────────────────────┐    │
│  │   ChromaDB   │  │  BM25 Index  │  │  PDF Corpus        │    │
│  │  (Semantic)  │  │  (Keyword)   │  │  (24 Bare Acts +   │    │
│  │              │  │              │  │   13 Commentaries)  │    │
│  └──────────────┘  └──────────────┘  └────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Query Pipeline

  1. Intent Routing -- Taxonomy agent classifies the query's domain (Criminal/Civil/Procedural) so retrieval does not mix sources across domains
  2. Hybrid Retrieval -- Semantic search (sentence-transformers) + BM25 keyword search, fused via Reciprocal Rank Fusion, reranked by cross-encoder
  3. Hierarchical Contextualization -- Sub-clauses automatically pull parent Section, Chapter, and Act titles
  4. Citation Grounding -- Validator checks response against source; unverified claims flagged, never guessed
  5. IPC to BNS Mapping -- 495 cross-reference mappings with temporal validation (IPC repealed July 1, 2024)
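
Step 2 fuses the semantic and BM25 result lists with Reciprocal Rank Fusion. The following is a minimal sketch of that technique, not HECTOR's actual implementation: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1); scores are summed across lists. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs standing in for retriever output.
semantic = ["ipc_302", "bns_103", "ipc_304"]
keyword = ["ipc_302", "ipc_300", "bns_103"]

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['ipc_302', 'bns_103', 'ipc_300', 'ipc_304']
```

Documents that appear in both lists accumulate score from each, which is why bns_103 outranks ipc_300 even though ipc_300 sits higher in the keyword list. In the pipeline described above, the fused list would then go to the cross-encoder for reranking.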

Environment Variables

Copy .env.example to .env and configure:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| HECTOR_API_KEY | Yes | -- | API authentication key |
| HECTOR_JWT_SECRET | Yes | -- | JWT signing secret (min 32 chars) |
| HECTOR_JWT_EXPIRY_SECONDS | No | 3600 | Token lifetime |
| GROQ_API_KEY | Yes | -- | Groq API key for LLM routing |
| GEMINI_API_KEY | No | -- | Google Gemini API key |
| NVIDIA_API_KEY | No | -- | NVIDIA NIM API key |
| NIM_API_KEY | No | -- | NVIDIA NIM API key (alt) |
| NIM_BASE_URL | No | https://integrate.api.nvidia.com/v1 | NIM endpoint |
| HECTOR_ROUTER_MODEL | No | llama-3.3-70b-versatile | Groq model for routing |
| HECTOR_BOOKS_DIR | No | ./data/Books | PDF corpus directory |
| HECTOR_DB_PATH | No | ./hector_db | ChromaDB storage path |
| HECTOR_TESSERACT_CMD | No | tesseract | Tesseract OCR binary path |
| HECTOR_POPPLER_PATH | No | -- | Poppler bin/ directory (for pdf2image) |
| HECTOR_CORS_ORIGINS | No | http://localhost:3000 | Comma-separated CORS origins |
| HECTOR_LOG_LEVEL | No | INFO | Logging level |
| HECTOR_DEBUG | No | false | Debug mode |

Frontend (frontend/.env):

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| VITE_API_URL | No | http://localhost:8000 | Backend API URL |
| VITE_API_KEY | No | -- | Pre-configured API key for UI |
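
To check a backend .env before starting the server, the required variables and defaults listed above can be validated with a few lines of Python. This is a hypothetical sketch: load_settings and the returned dictionary keys are invented for illustration and are not part of HECTOR's code; only the variable names, defaults, and the 32-character minimum come from the table.

```python
import os

# Variables marked "Yes" in the Required column above.
REQUIRED = ("HECTOR_API_KEY", "HECTOR_JWT_SECRET", "GROQ_API_KEY")


def load_settings(env=None):
    """Validate required variables and apply the documented defaults.

    Hypothetical helper: HECTOR's real startup check may differ.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    if len(env["HECTOR_JWT_SECRET"]) < 32:
        raise RuntimeError("HECTOR_JWT_SECRET must be at least 32 characters")
    origins = env.get("HECTOR_CORS_ORIGINS", "http://localhost:3000")
    return {
        "jwt_expiry_seconds": int(env.get("HECTOR_JWT_EXPIRY_SECONDS", "3600")),
        # Comma-separated list; surrounding whitespace is tolerated here.
        "cors_origins": [o.strip() for o in origins.split(",") if o.strip()],
        "log_level": env.get("HECTOR_LOG_LEVEL", "INFO"),
        "debug": env.get("HECTOR_DEBUG", "false").lower() == "true",
    }


example = {
    "HECTOR_API_KEY": "dev-key",
    "HECTOR_JWT_SECRET": "x" * 32,
    "GROQ_API_KEY": "groq-key",
    "HECTOR_CORS_ORIGINS": "http://localhost:3000, http://localhost:5173",
}
settings = load_settings(example)
```

Passing a plain dictionary, as in the example, keeps the check testable without touching the real process environment; calling load_settings() with no argument reads os.environ instead.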

API Endpoints

| Method | Path | Auth | Description |
| --- | --- | --- | --- |
| POST | /search | API Key / JWT | Hybrid legal search |
| POST | /compare | API Key / JWT | IPC to BNS section comparison |
| POST | /route | API Key / JWT | Intent classification |
| POST | /ingest | API Key / JWT | PDF ingestion trigger |
| GET | /status | API Key / JWT | System health + ChromaDB status |
| GET | /healthz | None | Liveness probe (for orchestrators) |
| GET | /readyz | None | Readiness probe (ChromaDB + disk) |
| POST | /auth/token | API Key | Get JWT bearer token |
| WS | /ws/search | Query param | Streaming search events |

Authenticate with:

  • X-API-Key: <your-key> header, or
  • Authorization: Bearer <jwt-token> header
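
The two header styles can be exercised from Python's standard library. The sketch below only builds the requests and does not send them. The base URL matches the Quick Start; the JSON body {"query": ...} for /search is an assumption, since the request schema is not documented here (check http://localhost:8000/docs for the real one). The function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # API address from the Quick Start


def token_request(api_key):
    """Build the POST /auth/token request, authenticated by API key."""
    return urllib.request.Request(
        BASE_URL + "/auth/token",
        method="POST",
        headers={"X-API-Key": api_key},
    )


def search_request(query, *, api_key=None, jwt=None):
    """Build a POST /search request using exactly one auth style."""
    if (api_key is None) == (jwt is None):
        raise ValueError("pass exactly one of api_key or jwt")
    headers = {"Content-Type": "application/json"}
    if jwt is not None:
        headers["Authorization"] = "Bearer " + jwt
    else:
        headers["X-API-Key"] = api_key
    # Assumed body shape; the real schema lives in api/schemas.py.
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/search", data=body, method="POST", headers=headers
    )


req = search_request("Section 302 IPC", jwt="<jwt-token>")
```

With a running server, urllib.request.urlopen(req) would send the request. The response format of /auth/token is not specified in this README, so extracting the token from it is left out.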

Tech Stack

| Layer | Technology |
| --- | --- |
| Backend | FastAPI, Python 3.11+ |
| Vector DB | ChromaDB |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) |
| Reranker | cross-encoder (ms-marco-MiniLM-L-6-v2) |
| LLM Router | Groq (llama-3.3-70b-versatile) |
| Frontend | Vite 5, React 18, Tailwind CSS 4 |
| OCR | Tesseract 5, Poppler, pdf2image |
| CLI | Typer |
| Containerization | Docker Compose |

Project Structure

Hector/
├── api/                    # FastAPI application
│   ├── app.py              # Main app, middleware, routes
│   ├── security.py         # AuthManager, JWT, bcrypt
│   ├── rate_limit.py       # Token bucket rate limiting
│   ├── schemas.py          # Pydantic request/response models
│   └── services.py         # Business logic layer
├── core/                   # Core engine
│   ├── router.py           # Intent classification (Groq LLM)
│   ├── orchestrator.py     # Query pipeline coordinator
│   ├── hybrid_retriever.py # Semantic + BM25 + cross-encoder
│   ├── verifier.py         # Chain-of-Verification
│   ├── response_generator.py # Citation-grounded responses
│   ├── voice.py            # Voice I/O (Web Speech API)
│   ├── precedent.py        # Precedent analysis
│   ├── enterprise/         # Enterprise user management
│   └── mapping.json        # 495 IPC-BNS cross-references
├── data/Books/             # PDF corpus (24 bare acts + commentaries)
├── frontend/               # Vite + React frontend
│   ├── src/                # React components
│   ├── nginx.conf          # Production nginx config
│   └── Dockerfile          # Multi-stage build
├── tests/                  # Test suite
├── utils/                  # Ingestion pipeline
│   ├── enhanced_ingestor.py # PDF to ChromaDB pipeline
│   └── legal_structure_parser.py # Legal document parsing
├── docker-compose.yml      # Container orchestration
├── requirements.txt        # Python dependencies
└── main.py                 # CLI entry point

Prerequisites

  • Python 3.11+
  • Node.js 18+ (for frontend)
  • Tesseract OCR (for scanned PDFs): winget install UB-Mannheim.TesseractOCR
  • Poppler (for pdf2image): Download from github.com/oschwartz10612/poppler-windows
  • Docker (optional, for containerized deploy)

Troubleshooting

Server refuses to start -- missing environment variables

RuntimeError: HECTOR_API_KEY and HECTOR_JWT_SECRET must be set

Fix: Copy .env.example to .env and add your API keys. The server will not start without them.

Tesseract not found

TesseractNotFoundError: ...

Fix: Set HECTOR_TESSERACT_CMD in .env to the full path:

HECTOR_TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe

Poppler not found (PDF to image conversion fails)

Fix: Set HECTOR_POPPLER_PATH in .env to the Poppler bin/ directory:

HECTOR_POPPLER_PATH=C:\path\to\poppler-xx\Library\bin

CORS errors in browser

Fix: Ensure HECTOR_CORS_ORIGINS in .env includes your frontend URL:

HECTOR_CORS_ORIGINS=http://localhost:3000,http://localhost:5173

ChromaDB collection not found

Fix: Run ingestion first:

hector ingest           # via CLI
# or
python main.py ingest   # via main.py

Rate limited (429 responses)

The API enforces rate limiting. When you receive a 429, wait for the period given in the Retry-After response header before retrying.
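
A client can turn the Retry-After header into a wait time as sketched below. This README does not say whether HECTOR sends the value as a number of seconds or as an HTTP date, so the hypothetical helper retry_after_seconds accepts both forms allowed by HTTP and falls back to an assumed default of one second.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None, default=1.0):
    """Convert a Retry-After header value into seconds to wait.

    Accepts delay-seconds ("30") or an HTTP date; returns `default`
    (an assumed fallback) when the header is absent or unparseable.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        # Treat dates without an explicit offset as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


wait = retry_after_seconds("30")  # 30.0
```

A caller would typically pass the result to time.sleep() before repeating the request, ideally with a cap on the number of retries.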

Docker build fails

Fix: Ensure .env exists in the project root. Docker Compose reads it automatically:

cp .env.example .env
# Edit .env with your keys
docker compose --profile full up -d

License

See LICENSE for details.
