Skip to content

mabaeyens/RAG

Repository files navigation

RAG

Retrieval-Augmented Generation (RAG) pipeline for ingesting and querying PDF documents using ChromaDB and local language models.

Description

Processes PDF documents, generates local embeddings and stores them in ChromaDB. Supports natural language queries with result reranking via CrossEncoder.

Main flow:

  1. run_ingestor.pysrc/pdf_ingestor.py: ingests PDFs, generates embeddings and stores them in ChromaDB
  2. prompting-chromadb-reranker.py: queries the vector store with CrossEncoder reranking
  3. prompting-improved-direct-chromadb-context-collection-works-works.py: simplified query version

Requirements

  • Python 3.10+
  • ChromaDB installed locally

Setup

# 1. Copy and edit the config file
cp config.example.py config.py

# 2. Edit config.py with your local paths and LM Studio URL

# 3. Install dependencies
pip install -r requirements.txt

Usage

# Ingest PDFs
python run_ingestor.py

# Query with reranking
python prompting-chromadb-reranker.py

# Simplified query
python "prompting-improved-direct-chromadb-context-collection-works-works.py"

Structure

RAG/
├── src/
│   └── pdf_ingestor.py                                          # Core ingestion engine
├── run_ingestor.py                                              # Ingestion entry point
├── prompting-chromadb-reranker.py                               # Query with reranking
├── prompting-improved-direct-chromadb-context-collection-...py  # Simplified query
├── ingestion-multi-pdf-PyMuPDF-direct Chroma-testqueries-works.py  # Alternative ingestion
├── config.example.py                                            # Config template (copy to config.py)
└── requirements.txt

Status

  • PDF ingestion with batching and logging
  • Local embeddings with sentence-transformers
  • Query with CrossEncoder reranking
  • User interface
  • Support for other document formats

Notes

  • Embeddings generated with sentence-transformers/all-MiniLM-L6-v2
  • Vector store persisted in local ChromaDB (path configurable in config.py)
  • Local paths and LM Studio URL are configured in config.py (see config.example.py)
  • Iterative development scripts are preserved in mabaeyens/RAG-experiments (private)

🛠️ Development Workflow: Human-AI Collaboration

This project is the result of a strategic collaboration between human design and AI-assisted code generation.

  • Architecture & Logic: Fully defined by the author. This includes system structure, business rules, data flow, and implementation strategy.
  • Code Generation: The syntactic implementation and line-by-line code writing was performed by Claude Code, following precise and iterative instructions provided by the author.
  • Supervision & Refinement: All code was manually reviewed, tested, and adjusted to ensure quality, consistency, and compliance with project standards.

This approach demonstrates the ability to direct advanced AI tools to accelerate development without sacrificing creative control or technical quality.

📄 License

This project is licensed under the MIT License. You can find the full text in the LICENSE file.

Note on authorship: Although much of the source code was generated by an AI, the creative direction, architecture, and final integration are human work. Usage rights are granted under the terms of the MIT License.

🚀 Contributing

Feel free to fork this project!

  • If you find a bug, open an issue.
  • If you have an improvement, submit a Pull Request.
  • Feel free to use this code in your own projects!

About

RAG pipeline with ChromaDB and local embeddings for PDF ingestion and semantic search via LM Studio

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages