AskDocs is a lightweight, production-ready Retrieval-Augmented Generation (RAG) application that lets users upload PDF documents and ask natural language questions about their content. The application uses semantic search to find relevant document sections and a large language model (LLM) to generate accurate, contextual answers.
| Feature | Description |
|---|---|
| PDF Document Upload | Drag-and-drop interface for easy file upload |
| Semantic Search | Finds relevant document context using FAISS vector store |
| AI-Powered Answers | Uses Groq's Llama 3.1 model for fast, intelligent responses |
| Source Citations | Shows exactly which pages and sections answers come from |
| Document Summarization | Quick document overview with one-click summary |
| Chat History | Maintains conversation context across interactions |
| Clean UI | Modern, user-friendly Streamlit interface |
| Real-time Processing | Spinner indicator while AI processes your question |
| Layer | Technologies |
|---|---|
| Frontend/UI | Streamlit |
| LLM Orchestration | LangChain, LangChain Classic |
| LLM Provider | Groq API |
| Model | Llama 3.1 8B Instant |
| Embeddings | Hugging Face (all-MiniLM-L6-v2) |
| Vector Store | FAISS |
| Document Loading | PyPDF |
| Text Processing | LangChain Text Splitters |
flowchart TD
A[User] -->|1. Upload PDF| B[Streamlit UI]
B -->|2. Save to temp file| C[tempfile.NamedTemporaryFile]
C --> D[PyPDFLoader]
D -->|Load & parse document| E[RecursiveCharacterTextSplitter]
E -->|Split into chunks| F[HuggingFaceEmbeddings]
F -->|Generate embeddings| G[FAISS Vector Store]
A -->|3. Ask Question| H[Streamlit UI]
H --> I[ConversationalRetrievalChain]
G -->|Retrieve relevant chunks| I
Memory[ConversationBufferMemory] -->|Provide chat history| I
I -->|Pass context + history + question| J[Groq LLM]
J -->|Generate answer| K[Streamlit UI]
K -->|Display answer + sources| A
AskDocs/
├── app.py # Main Streamlit application
├── requirements.txt # Project dependencies
├── .gitignore # Git ignore rules
├── README.md # This file
├── config/
│   └── settings.py # Configuration and environment variables
└── core/
    ├── loader.py # PDF loading and chunking
    ├── embeddings.py # Embedding model initialization
    ├── vectorstore.py # FAISS vector store creation
    └── chain.py # ConversationalRetrievalChain construction
Centralized configuration management:
- Environment loading: Loads variables from the .env file
- API key validation: Raises ValueError if GROQ_API_KEY is missing
- Model settings: Configurable chunk size, overlap, and embedding model
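As a rough illustration of the validation described above, the following sketch reads settings from an environment mapping and fails fast when the key is missing. The `Settings` dataclass, the `load_settings` function, and the `CHUNK_SIZE`, `CHUNK_OVERLAP`, and `EMBEDDING_MODEL` variable names are assumptions made for this example, not the actual contents of config/settings.py. The .env loading step is omitted so the sketch stays standard-library only.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are illustrative."""
    groq_api_key: str
    chunk_size: int = 500
    chunk_overlap: int = 50
    embedding_model: str = "all-MiniLM-L6-v2"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "").strip()
    if not api_key:
        # Fail at startup instead of at the first LLM call.
        raise ValueError("GROQ_API_KEY environment variable is required")
    return Settings(
        groq_api_key=api_key,
        # Assumed override names; the real module may use different ones.
        chunk_size=int(env.get("CHUNK_SIZE", 500)),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", 50)),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
    )
```

Passing the mapping as an argument keeps the function easy to test without touching the real process environment.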
Responsible for document processing:
- PDF loading: Uses PyPDFLoader to extract text
- Text splitting: RecursiveCharacterTextSplitter (500-character chunks, 50-character overlap)
- Summary extraction: get_summary_text() function extracts first N chunks for quick overview
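To make the chunk size and overlap concrete, here is a simplified, standard-library sketch of fixed-size chunking with overlap, plus joining the first N chunks for a summary. It is not RecursiveCharacterTextSplitter, which also tries to break on paragraph and sentence separators. The names `chunk_text` and `join_first_chunks` are hypothetical and exist only for this example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    chunks: List[str] = []
    step = chunk_size - overlap  # how far each window advances
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def join_first_chunks(chunks: List[str], max_chunks: int = 20) -> str:
    """Combine the first max_chunks chunks into one string for summarization."""
    return "\n".join(chunks[:max_chunks])
```

With the values above, a 1,200-character text yields windows starting at offsets 0, 450, and 900, so consecutive chunks share 50 characters.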
Embedding model initialization:
- Uses all-MiniLM-L6-v2 for fast, efficient embeddings
Vector store and retriever setup:
- Creates FAISS index from document chunks
- Returns retriever object for semantic search
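Conceptually, the retriever ranks stored chunk vectors by how close they are to the question's vector and returns the top matches. The toy sketch below shows that idea with brute-force cosine similarity over plain Python lists. It is a stand-in for illustration only: FAISS uses its own index structures and distance metric, and `cosine_similarity` and `top_k` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 2
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the real pipeline the vectors come from the embedding model and the returned indices map back to document chunks.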
QA chain construction:
- Uses ConversationalRetrievalChain with ConversationBufferMemory for chat history
- Strict prompt template to ensure answers only come from document context
- Returns source documents for citation
- Handles unanswerable questions gracefully
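The sketch below shows one way a strict prompt could combine retrieved context, chat history, and the question, with a fixed fallback reply for unanswerable questions. The prompt wording, the `FALLBACK` text, and the `build_prompt` function are assumptions for illustration; the actual template in core/chain.py may differ.

```python
from typing import Sequence, Tuple

# Hypothetical fallback sentence for questions the context cannot answer.
FALLBACK = "I could not find that in the document."


def build_prompt(
    context_chunks: Sequence[str],
    question: str,
    chat_history: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt that restricts the model to the supplied context."""
    context = "\n\n".join(context_chunks)
    history = "\n".join(f"{role}: {text}" for role, text in chat_history)
    return (
        "Answer using only the context below.\n"
        f'If the answer is not in the context, reply exactly: "{FALLBACK}"\n\n'
        f"Chat history:\n{history or '(none)'}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

An instruction like this narrows what the model is asked to do, but it does not guarantee that the model follows it.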
Main application:
- Streamlit UI and state management
- Chat history tracking
- Question answering workflow
- Source citation display
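For the source citation display, a small helper along these lines could turn the returned source documents into readable labels. This is a sketch under stated assumptions: `Doc` is a stand-in for a LangChain Document, `format_sources` is a hypothetical name, and the code assumes the loader stores a zero-based page index under the "page" metadata key.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (assumption for this sketch)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_sources(docs: List[Doc], max_chars: int = 80) -> List[str]:
    """Return one 'Page N: snippet' line per unique source chunk."""
    lines: List[str] = []
    seen = set()
    for doc in docs:
        page = doc.metadata.get("page")
        # Collapse whitespace so snippets render on a single line.
        snippet = " ".join(doc.page_content.split())[:max_chars]
        key = (page, snippet)
        if key in seen:
            continue  # skip duplicate chunks returned by the retriever
        seen.add(key)
        # Assumed zero-based page index, shown to the user as one-based.
        label = f"Page {page + 1}" if isinstance(page, int) else "Page unknown"
        lines.append(f"{label}: {snippet}")
    return lines
```

Each returned line can then be rendered inside the sources expander in the UI.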
- Python 3.9 or higher
- Groq API key (get one at console.groq.com)
cd c:\Users\yashk\Desktop\AskDocs
# Create virtual environment
python -m venv venv
# Activate virtual environment
.\venv\Scripts\activate
pip install -r requirements.txt
Create a .env file in the project root directory:
GROQ_API_KEY=your_groq_api_key_here
Important: Replace your_groq_api_key_here with your actual Groq API key.
streamlit run app.py
The application will start and automatically open in your default browser at http://localhost:8501.
- Upload a PDF document using the file uploader
- Wait for the "PDF loaded. Ask your question below." success message
- Optional: Click "📋 Summarize Document" to get a quick overview of the document
- Type your question in the chat input box
- Wait for the AI to process and respond
- View the answer and click "📄 View Sources" to see citations
This application is a Streamlit web app and doesn't expose a traditional REST API. However, below is documentation of the core internal modules:
Loads and chunks a PDF document.
Parameters:
file_path (str): Path to the PDF file
Returns:
List[Document]: List of LangChain Document objects
Example:
from core.loader import load_and_chunk_pdf
chunks = load_and_chunk_pdf("document.pdf")
Extracts combined text from the first N chunks for a document summary.
Parameters:
chunks (List[Document]): Document chunks from load_and_chunk_pdf
max_chunks (int): Maximum number of chunks to use for summary (default: 20)
Returns:
str: Combined text from selected chunks
Example:
from core.loader import load_and_chunk_pdf, get_summary_text
chunks = load_and_chunk_pdf("document.pdf")
summary_text = get_summary_text(chunks)
Returns the configured embedding model.
Returns:
HuggingFaceEmbeddings: Embedding model instance
Example:
from core.embeddings import get_embeddings
embeddings = get_embeddings()
Creates a FAISS vector store and returns a retriever.
Parameters:
chunks (List[Document]): Document chunks from load_and_chunk_pdf
embeddings (HuggingFaceEmbeddings): Embedding model from get_embeddings
Returns:
VectorStoreRetriever: Configured retriever object
Example:
from core.vectorstore import build_vectorstore
retriever = build_vectorstore(chunks, embeddings)
Builds the ConversationalRetrievalChain with conversation memory.
Parameters:
retriever (VectorStoreRetriever): Retriever from build_vectorstore
Returns:
ConversationalRetrievalChain: Configured QA chain that accepts {"question": "..."} and maintains conversation history
Example:
from core.chain import build_qa_chain
qa_chain = build_qa_chain(retriever)
response = qa_chain.invoke({"question": "What is this document about?"})
Response Format:
{
    "question": "What is this document about?",
    "answer": "This document discusses...",
    "source_documents": [Document(...), Document(...)]
}
Generates a structured summary of document text using the LLM.
Parameters:
text (str): Document text to summarize
Returns:
str: Structured summary of the document
Example:
from core.chain import summarize_document
summary = summarize_document("This is a document about...")
print(summary)
We welcome contributions to AskDocs! Here's how you can help:
- Fork the Repository
  - Create a personal fork of the project
- Create a Feature Branch
  git checkout -b feature/amazing-feature
- Make Your Changes
  - Follow the existing code style
  - Add comments where necessary
  - Test your changes thoroughly
- Commit Your Changes
  git commit -m "Add amazing feature"
- Push to Your Branch
  git push origin feature/amazing-feature
- Open a Pull Request
  - Describe your changes in detail
  - Link any relevant issues
- Follow PEP 8 guidelines
- Use meaningful variable and function names
- Keep functions focused and single-purpose
- Add docstrings for public functions
- Be respectful and inclusive
- Welcome constructive feedback
- Focus on what's best for the community
- Show empathy towards other contributors
This project is licensed under the MIT License. See the LICENSE file for details; if the repository does not yet contain a LICENSE file, add one.
Bug Fixes:
- Fixed NameError: 'summarize_document' is not defined in app.py
- Simplified summarization workflow to use get_summary_text directly
Features:
- Initial release of AskDocs
- PDF document upload and processing
- Semantic search with FAISS
- AI-powered answers using Groq Llama 3.1
- Source citations
- Document Summarization
- Chat history
- Clean Streamlit UI
Improvements:
- Refactored into modular structure (config/core separation)
- Added GROQ_API_KEY validation at startup
- Replaced hardcoded temp.pdf with tempfile.NamedTemporaryFile to prevent concurrent access conflicts
- Added graceful unanswerable question handling
- Improved UI with source expanders
Bug Fixes:
- Fixed indentation issues in app.py
- Removed duplicate chat history display code
- Added proper temp file cleanup
| Issue | Description | Workaround |
|---|---|---|
| Single document only | Currently supports only one uploaded PDF at a time | Reload app to upload a new document |
| No persistent storage | Vector store is in-memory only | No workaround yet (future improvement) |
| Large PDFs | Very large PDFs may take time to process | Consider splitting large PDFs into smaller files |
- Support for multiple file formats (DOCX, TXT, EPUB, etc.)
- Persistent vector storage (ChromaDB, Pinecone, etc.)
- Multiple document upload and querying
- Advanced chunking strategies (semantic, hierarchical)
- Custom prompt templates
- Export chat history
- Docker containerization
- Authentication and user accounts
- Better error handling and user feedback
- API Key Management: GROQ_API_KEY loaded from environment variable, never hardcoded
- Temporary File Cleanup: Uploaded files deleted after processing using try/finally
- Input Validation: File uploader restricted to PDF files only
- Prompt Injection Mitigation: A strict prompt template instructs the model to answer from document context only, which reduces but does not eliminate prompt injection risk
- Dependencies: All dependencies listed in requirements.txt with no known vulnerabilities
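The temporary file cleanup pattern mentioned in the list above can be sketched as follows. The `process_upload` function and its `handler` callback are hypothetical names for this example; in the app, the handler role would be played by the PDF loading step.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def process_upload(data: bytes, handler: Callable[[str], T]) -> T:
    """Write uploaded bytes to a unique temp file, process it, then delete it."""
    # delete=False lets the file be reopened by path after this block closes it,
    # which is required on Windows.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        return handler(tmp_path)
    finally:
        # Runs even if the handler raises, so no upload is left on disk.
        os.remove(tmp_path)
```

Because each upload gets its own uniquely named file, concurrent sessions do not overwrite each other the way a shared temp.pdf would.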
"GROQ_API_KEY environment variable is required"
- Make sure you created a .env file with your API key
- Restart the Streamlit app after setting the environment variable
"No module named '...'"
- Make sure you activated your virtual environment
- Run pip install -r requirements.txt
PDF won't upload or process
- Make sure the file is a valid PDF
- Try a different PDF file to rule out corruption
If you encounter issues:
- Check the Known Issues section above
- Review the Streamlit terminal output for error messages
- Open an issue in the project repository