A comprehensive multi-modal RAG (Retrieval-Augmented Generation) system that supports multiple AI models and can process both text and images from documents.
- Multi-Model Support: Configurable AI models via JSON configuration file
- Vision Capabilities: Analyze images from PDFs and standalone image files
- Document Processing: Support for PDF, Word, PowerPoint, and image files
- Environment Configuration: Secure API key management via .env files
- Persistent Sessions: Maintain conversation history and vector store across sessions
- Easy Model Management: Add/remove models without code changes
- Smart Image Processing: Extract and analyze images from PDFs with context
- Comprehensive Error Handling: Robust fallback mechanisms and user feedback
The application uses a flexible JSON configuration system for managing AI models. See MODEL_CONFIG.md for detailed instructions on adding and configuring models.
- OpenAI: GPT-4 Vision, GPT-4 Turbo, GPT-4o, GPT-3.5 Turbo
- Google Gemini: Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 1.5 Pro
- Anthropic Claude: Claude 3.5 Sonnet, Claude 3.5 Haiku, Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku
To add a new model:
- Edit `models_config.json`
- Add your model configuration
- Run `python validate_config.py` to verify
- Restart the application
Install the dependencies:

    pip install -r requirements.txt

Copy the example environment file and add your API keys:

    cp .env.example .env

Edit `.env` and add your API keys:

    # Required for embeddings (always needed)
    COHERE_API_KEY=your_cohere_api_key_here
    # Choose one or more AI providers
    GENAI_API_KEY=your_gemini_api_key_here
    OPENAI_API_KEY=your_openai_api_key_here
    ANTHROPIC_API_KEY=your_anthropic_api_key_here

Validate the configuration:

    python validate_config.py

Run the application:

    streamlit run visionrag.py

- Select Model: Use the dropdown in the sidebar to choose your preferred AI model
- Upload Documents: Upload PDF, Word, PowerPoint, or image files
- Ask Questions: Type your questions in the text input field and click "Ask"
- View Results: Get comprehensive answers with source citations
- Clear Data: Use "Clear All" to reset the knowledge base and start fresh
- Vision Tasks: Use vision-capable models for PDFs with images or standalone images
- Text Only: Any model works for text-only documents
- Performance: Gemini models offer good performance and cost-effectiveness
- Quality: Claude models excel at detailed analysis and reasoning
- Speed: Gemini Flash is optimized for fast responses
- Cohere: Required for all embeddings (text and vision) - Always needed
- Google Gemini: Required for Gemini models
- OpenAI: Required for GPT models
- Anthropic: Required for Claude models
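
The key requirements above can be checked before launching the app. The following is a minimal sketch, not code from `visionrag.py`: the environment variable names come from the `.env` template, while the `check_api_keys` helper and the provider labels are assumptions made for illustration.

```python
import os

# Variable names are taken from the .env template in this README.
# The provider labels and this helper are illustrative assumptions.
PROVIDER_KEYS = {
    "Google Gemini": "GENAI_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
}
EMBEDDING_KEY = "COHERE_API_KEY"


def check_api_keys(env=None):
    """Return (embeddings_ok, available_providers) for the given environment."""
    env = os.environ if env is None else env
    # Cohere is needed for every embedding call, so it is checked separately.
    embeddings_ok = bool(env.get(EMBEDDING_KEY, "").strip())
    # A provider counts as available only if its key is set and non-blank.
    available = [
        name for name, var in PROVIDER_KEYS.items() if env.get(var, "").strip()
    ]
    return embeddings_ok, available


if __name__ == "__main__":
    ok, providers = check_api_keys()
    print("Cohere embeddings key set:", ok)
    print("Providers with keys:", providers or "none")
```

Passing a plain dictionary instead of relying on `os.environ` keeps the check easy to test without touching real credentials.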
- PDF: Text extraction + image analysis with vision models
- Word (.docx): Text extraction with full document structure
- PowerPoint (.pptx): Text and slide content extraction
- Images: PNG, JPG, JPEG, TIFF with vision analysis and context understanding
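
One way to picture how the formats above map to processing paths is a lookup by file extension. This is only a sketch: the pipeline labels and the `pick_pipeline` function are hypothetical names, not identifiers from the application.

```python
from pathlib import Path

# Extension-to-pipeline mapping based on the formats listed in this README.
# The pipeline labels are hypothetical, chosen only for this example.
PIPELINES = {
    ".pdf": "pdf_text_and_images",
    ".docx": "word_text",
    ".pptx": "powerpoint_text",
    ".png": "image_vision",
    ".jpg": "image_vision",
    ".jpeg": "image_vision",
    ".tiff": "image_vision",
}


def pick_pipeline(filename):
    """Return the pipeline label for a file, or raise if the type is unsupported."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on extension
    try:
        return PIPELINES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


if __name__ == "__main__":
    for name in ["report.PDF", "slides.pptx", "scan.tiff"]:
        print(name, "->", pick_pipeline(name))
```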
visionrag/
├── visionrag.py # Main Streamlit application
├── models_config.json # AI model configurations
├── validate_config.py # Configuration validator
├── requirements.txt # Python dependencies
├── README.md # This file
├── .env.example # Environment variables template
├── .gitignore # Git ignore rules
├── original_notebook.ipynb # Original development notebook (reference)
├── uploaded_images/ # Runtime image storage (auto-created)
└── .env # Your API keys (create from .env.example)
Vision-RAG combines traditional text-based Retrieval-Augmented Generation with multimodal vision capabilities to process and understand both text and images from documents.
The system uses a dual-pathway approach to handle both text and visual content:
- Text Extraction: Uses multiple fallback methods (PyMuPDF → pdfminer → UnstructuredLoader) for robust PDF text extraction
- Image Extraction: Extracts embedded images from PDFs with contextual information (page numbers, surrounding text)
- Vision Analysis: Processes images using Cohere's Embed v4.0 model for vision-based embeddings
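
The fallback chain for text extraction can be sketched as trying each extractor in order until one returns usable text. The example below is an assumption about the general pattern, not the project's actual code: the extractors are stand-in functions, where the real ones would wrap PyMuPDF, pdfminer, and UnstructuredLoader.

```python
def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order; return the first non-empty text."""
    errors = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # any failure moves on to the next extractor
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: returned no text")
    raise RuntimeError("All extractors failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Stand-in extractors for demonstration only.
    def broken(path):
        raise IOError("cannot open file")

    def empty(path):
        return "   "

    def working(path):
        return f"text extracted from {path}"

    chain = [("first", broken), ("second", empty), ("third", working)]
    print(extract_with_fallback("sample.pdf", chain))
```

Collecting the per-extractor errors makes the final failure message useful when every method fails on the same document.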
- Vector Database: FAISS (Facebook AI Similarity Search) for efficient similarity search
- Custom Vector Store: `VisionRAGVectorStore` class handles both text and image embeddings
- Unified Embedding Strategy:
  - Single Model: Cohere's `embed-v4.0` model for all content (text and images)
  - Consistency: Matches the original notebook approach exactly
  - Simplicity: One model for all embedding tasks ensures consistency
- Hybrid Search: Combines text and image similarity scores for comprehensive retrieval
- Query Analysis: Determines if the query requires text-only or multimodal search
- Parallel Search: Simultaneously searches both text and image vector spaces
- Context Assembly: Combines relevant text snippets and images with metadata
- Smart Ranking: Returns top-k results (default: 6) with diverse content coverage
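
The retrieval steps above can be illustrated with a small pure-Python sketch. It swaps FAISS for a brute-force cosine similarity scan and searches the two spaces one after the other rather than in parallel, and ranking purely by similarity score is an assumption. The function names are hypothetical; only the default of 6 results comes from the description above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, items, kind):
    """Score every (item_id, vector) pair in one space against the query."""
    return [(cosine(query_vec, vec), kind, item_id) for item_id, vec in items]


def retrieve(query_vec, text_items, image_items, k=6):
    """Merge text and image hits and keep the k highest-scoring ones."""
    hits = search(query_vec, text_items, "text")
    hits += search(query_vec, image_items, "image")
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:k]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real embeddings have far more dimensions.
    texts = [("chunk-1", [1.0, 0.0]), ("chunk-2", [0.0, 1.0])]
    images = [("figure-1", [1.0, 1.0])]
    for score, kind, item_id in retrieve([1.0, 0.0], texts, images, k=2):
        print(f"{kind:5s} {item_id:9s} {score:.3f}")
```

Because text and images share one embedding space, their scores are directly comparable, which is what makes a single merged ranking reasonable in this sketch.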
- Multimodal Prompts: Constructs prompts containing both text context and images
- Model Selection: Routes to appropriate AI model based on vision requirements
- Response Generation: AI models process combined text+image context to generate comprehensive answers
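
Model selection by vision requirement might look roughly like the sketch below. The model records, their `name` and `vision` fields, and the `pick_model` function are all hypothetical and do not reflect the actual schema of `models_config.json`, which is documented in MODEL_CONFIG.md.

```python
def pick_model(models, has_images, preferred=None):
    """Pick a model name, requiring vision support when images are in the context."""
    # Text-only contexts can use any model; image contexts need vision support.
    candidates = [m for m in models if m["vision"] or not has_images]
    if not candidates:
        raise LookupError("No vision-capable model is configured")
    # Honor the user's choice when it is still eligible.
    for model in candidates:
        if model["name"] == preferred:
            return model["name"]
    return candidates[0]["name"]


if __name__ == "__main__":
    # Placeholder records; the names and fields are assumptions for illustration.
    models = [
        {"name": "text-model", "vision": False},
        {"name": "vision-model", "vision": True},
    ]
    print(pick_model(models, has_images=False, preferred="text-model"))
    print(pick_model(models, has_images=True, preferred="text-model"))
```

In this sketch a preferred text-only model is silently replaced when images are present; an application could equally warn the user instead.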
- Vector Database: FAISS-CPU for local, efficient similarity search
- Embeddings: Cohere Embed v4.0 (vision) + multilingual-22-12 (text)
- Document Processing: PyMuPDF, pdfminer, Unstructured, python-docx, python-pptx
- AI Models: OpenAI GPT, Google Gemini, Anthropic Claude (configurable)
- Framework: Streamlit for the user interface
- Context-Aware Image Processing: Images are processed with surrounding text for better understanding
- Fallback Mechanisms: Multiple extraction methods ensure robust document processing
- Dynamic Model Routing: Automatically selects vision-capable models when images are present
- Image Storage: Images are stored temporarily and cleaned up automatically for security
- Scalability: FAISS enables efficient search across large document collections
- Speed: Local vector storage eliminates external database dependencies
- Accuracy: Dual embedding approach captures both semantic text meaning and visual content
- Flexibility: JSON-configurable models allow easy adaptation to new AI services
- Missing API Keys: Check your `.env` file and ensure keys are properly set
- Model Not Available: Verify the corresponding API key is configured
- Configuration Errors: Run `python validate_config.py` to check your setup
- Vision Errors: Ensure you're using a vision-capable model for image analysis
- Upload Issues: Check file formats and ensure files aren't corrupted
- Keep your `.env` file secure and never commit it to version control
- API keys are loaded from environment variables for security
- Images are temporarily stored and cleaned up automatically
- Processing logs can be cleared using the "Clear All" function
- Fork the repository
- Create a feature branch
- Make your changes
- Test with `python validate_config.py`
- Submit a pull request
This project is open source. See the license file for details.