Document-QA

A Retrieval-Augmented Generation (RAG) Question Answering System for Documents
Supports PDF, Word (.docx), PowerPoint (.pptx), and image files (PNG, JPG, JPEG), with OCR for images and crawling of reference links.

Features

Multi-format Support: Upload PDF, DOCX, PPTX, PNG, JPG, or JPEG files.
Image OCR: Extracts text from standalone images and images embedded in documents using Tesseract OCR.
Reference Link Crawling: Detects and fetches content from URLs and hyperlinks found in documents, making referenced web content searchable.
Semantic Search: Uses Cohere embeddings and a Chroma vector database to retrieve passages by meaning rather than by keyword matching alone.
RAG Chatbot: Ask questions about your uploaded documents and get context-aware answers.
Web Interface: Simple Flask-based web UI for uploading files and chatting.
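
As an illustration of the reference link detection feature, the sketch below pulls plain-text URLs out of extracted document text using only the standard library. The function name `extract_urls` and the regular expression are assumptions made for this example; the project's actual detection logic (including how it reads hyperlinks embedded in DOCX or PPTX files) is not shown in this README and may differ.

```python
import re
from typing import List

# Hypothetical pattern: matches http(s) URLs up to whitespace or a closing
# bracket. This is a simplification and will cut URLs that contain ")".
URL_PATTERN = re.compile(r"https?://[^\s<>)\]]+")


def extract_urls(text: str) -> List[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop sentence punctuation that often trails a URL in prose.
        url = match.rstrip(".,;:!?\"'")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Deduplicating before fetching keeps a crawler from requesting the same page twice when a document cites it repeatedly.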

How It Works

Upload: Submit a document or image via the web interface.
Text Extraction: The system extracts text from the file, running OCR on standalone images and on images embedded in documents.
Reference Crawling: Any URLs or hyperlinks in the document are fetched and their content is added to the searchable corpus.
Chunking & Embedding: The combined text is split into chunks and embedded using Cohere.
Semantic Retrieval: When you ask a question, the system retrieves the most relevant chunks using vector similarity.
Answer Generation: The chatbot uses an LLM to generate an answer based on the retrieved context.
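
The chunking and retrieval steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: the chunk size, the overlap, and character-based splitting are assumptions, and `toy_embed` is a word-count stand-in for the Cohere embeddings the system actually uses. In the real pipeline the vectors would come from Cohere and the similarity search would be handled by Chroma.

```python
import math
from collections import Counter
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Return the top_k (score, chunk) pairs most similar to the question."""
    query_vec = toy_embed(question)
    scored = [(cosine(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The retrieved chunks would then be placed in the LLM prompt as context for answer generation. Overlapping chunks reduce the chance that a sentence relevant to the question is split across a chunk boundary.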

Prerequisites

Python 3.8+
Tesseract OCR installed and added to your PATH
Cohere API key (set as COHERE_API_KEY in your environment)
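
A quick way to confirm these prerequisites before starting the app is a small check script. The sketch below is an assumption-laden example rather than part of the project: the function name `check_prerequisites` is hypothetical, and it assumes the Tesseract executable is named `tesseract` on your PATH.

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional


def check_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    if env is None:
        env = os.environ
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or newer is required")
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which("tesseract") is None:
        problems.append("Tesseract OCR was not found on PATH")
    if not env.get("COHERE_API_KEY"):
        problems.append("COHERE_API_KEY is not set in the environment")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```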
