A PDF document assistant with a FastAPI-based backend and a Streamlit-based frontend. The backend integrates with Google Gemini AI, Pinecone vector database, and MySQL. It allows users to upload PDF files (extracting text for vector search and tables for structured data querying) and ask questions about their content using natural language via the frontend interface.
- Features
- Logic Architecture
- Prerequisites
- Installation & Setup
- API Keys Setup
- Environment Configuration
- Running Locally
- API Endpoints
- Deploy to Render.com
- Development
- Running Tests
- Troubleshooting
- Monitoring
- Security
- License
- Contributing
- Support
- Comprehensive PDF Document Processing: Upload PDF files, extracting both text (for semantic search) and tables (for structured data storage in MySQL).
- Hybrid AI-Powered Q&A: Ask questions that can be answered by retrieving unstructured text (via RAG with Google Gemini & Pinecone) or querying structured table data.
- Intelligent Response Combination: An agentic system determines the best way to answer a query and combines information from different sources into a coherent response.
- Vector Search: Efficient document retrieval using Pinecone vector database for text segments.
- Relational Data Storage: Extracted tables from PDFs are stored in a MySQL database, enabling structured queries.
- Streamlit Frontend: User-friendly interface for uploading PDFs and interacting with the chatbot.
- RESTful API: Clean REST endpoints for the backend, consumed by the frontend.
- Health Monitoring: Built-in health checks and logging for the backend.
- Agentic System with LangGraph: Utilizes LangGraph for defining and running the multi-agent system (ManagerAgent, RAGAgent, CombinerAgent) for complex query processing.
- Modular Design: Clearly defined modules for agents, services, routing, and utilities.
- Python 3.8 or higher
- Google AI Studio account (for Gemini API key)
- Pinecone account (for Pinecone API key and index)
- MySQL compatible database (e.g., local MySQL, AWS RDS)
- Git
All installation, setup, and running instructions have been moved to docs/INSTALLATION.md.
Please refer to that document for:
- Cloning the repository
- Creating and activating a virtual environment
- Installing dependencies
- Setting up API keys and environment variables
- Running the backend and frontend
- Deployment instructions
- Troubleshooting and more
The API routes are primarily defined in src/backend/routes/chat.py. The root / endpoint is in app.py.
| Endpoint | Method | Description | Request Body (Format) | Success Response (JSON Example) |
|---|---|---|---|---|
| / | GET | Basic API information and available endpoints. | N/A | {"message": "PDF Assistant Chatbot API", "version": "1.0.0", "endpoints": {"/health": "GET - Health check", ...}} (from chat.py) |
| /health | GET | Detailed health check of backend services. | N/A | {"status": "healthy", "services": {"manager_agent_health": {...}}, "overall_health": true} |
| /uploadpdf | POST | Uploads a PDF file for processing, text vectorization (Pinecone), and table storage (MySQL). | FormData: file (PDF file) | {"success": true, "message": "PDF processed...", "filename": "name.pdf", "tables_stored": 1, "text_chunks_stored": 10} |
| /answer | POST | Asks a question about the processed PDF content. | JSON: {"query": "Your question?"} | {"answer": "AI generated answer.", "success": true, "error": null} |
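The /answer row of the table can be exercised from a small client. The following is a minimal sketch using only the Python standard library. It assumes the backend is reachable at http://localhost:8000 (the example value given for ENDPOINT in this guide); the helper names build_answer_request and ask are illustrative and not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local backend on the default port


def build_answer_request(query: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /answer request with a JSON body."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/answer",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, base_url: str = BASE_URL, timeout: float = 60.0) -> dict:
    """Send the question and return the decoded JSON response."""
    request = build_answer_request(query, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, ask("Your question?") should return a dictionary with the answer, success, and error keys listed in the table. Separating request construction from sending keeps the request shape easy to inspect without a live server.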
For deployment instructions, see the Detailed Deployment Guide.
HybridRAG/
├── .env # Local environment variables (gitignored)
├── .env.template # Template for .env file
├── .git/ # Git version control directory
├── .gitignore # Specifies intentionally untracked files for Git
├── README.md # This guide
├── Makefile # Defines common tasks like running, testing, linting
├── app.py # Main FastAPI application entry point
├── clear_data_script.py # Script for clearing data
├── docs/
│   ├── API.md
│   ├── ARCHITECTURE.md # (this file)
│   ├── DEPLOYMENT.md
│   └── INSTALLATION.md
├── logs/ # Log files
├── requirements.txt # Python dependencies
├── requirements-dev.txt # Dev dependencies
├── scripts/
│   └── start.sh # Script to start backend
├── src/
│   ├── backend/
│   │   ├── __init__.py # FastAPI app setup and service initialization
│   │   ├── agents/
│   │   │   ├── base.py # Abstract base class for chatbot agents
│   │   │   ├── combiner_agent.py # Combines responses from Table and RAG agents
│   │   │   ├── manager_agent.py # Orchestrates query processing (LangGraph)
│   │   │   ├── rag_agent.py # RAG-based chatbot logic (ChatbotAgent)
│   │   │   └── table_agent.py # Handles SQL generation and execution for table data
│   │   ├── config.py # Centralized backend configuration
│   │   ├── models.py # Pydantic models for API requests/responses
│   │   ├── routes/
│   │   │   ├── __init__.py # Router package initializer
│   │   │   └── chat.py # API endpoints for chat, upload, health, etc.
│   │   ├── services/
│   │   │   ├── __init__.py # Service package initializer
│   │   │   ├── clear_data_service.py # Service for clearing data from DB and Pinecone
│   │   │   ├── embedding_service.py # Handles text embeddings and Pinecone storage
│   │   │   └── orchestrator.py # Orchestrates interactions with ManagerAgent
│   │   ├── test_manager_agent.py # Example/test script for ManagerAgent
│   │   └── utils/
│   │       ├── __init__.py # Utilities package initializer
│   │       ├── helper.py # Miscellaneous helper functions (e.g., error handlers)
│   │       ├── pdf_processor.py # PDF parsing, MySQL table storage, schema saving
│   │       ├── schema_manager.py # Manages table_schema.json (schema CRUD, docs)
│   │       ├── table_schema.json # Stores inferred schemas for tables from PDFs
│   │       └── upload_pdf.py # PDF upload handling, triggers extraction/storage
│   └── frontend/
│       └── streamlit_app.py # Main Streamlit application file (UI)
├── tests/
│   ├── conftest.py
│   ├── test_agents/
│   │   └── test_rag_agent.py # Test for RAG agent
│   └── test_routes/
│       └── test_chat_routes.py # Test for chat routes
├── uploads/ # Uploaded files (if any)
└── venv/ # Python virtual environment
Note: src/backend/test_manager_agent.py is an example/test script and ideally tests should reside in the tests/ directory.
Backend:
- app.py: Initializes the FastAPI app, loads configuration, sets up the Orchestrator (which initializes agents), and includes API routers.
- src/backend/routes/chat.py: Defines API endpoints (/health, /uploadpdf, /answer) and delegates requests to appropriate handlers (e.g., Orchestrator for Q&A, upload_pdf util for uploads).
- src/backend/services/orchestrator.py: Central coordinator that uses ManagerAgent for processing queries.
- src/backend/agents/manager_agent.py: Core agent using LangGraph. Analyzes queries, routes them to TableAgent for structured data queries or ChatbotAgent (RAG) for unstructured information, and then combines results using CombinerAgent.
- src/backend/agents/table_agent.py: Specialized agent for querying structured data from MySQL tables based on the query and PDF context.
- src/backend/agents/rag_agent.py (class ChatbotAgent): Specialized agent for RAG, performing similarity search in Pinecone and generating answers with Gemini based on unstructured text.
- src/backend/agents/combiner_agent.py: Merges responses from different sources (e.g., TableAgent and ChatbotAgent) into a single, coherent answer using an LLM.
- src/backend/services/embedding_service.py: Manages text embedding generation (Gemini) and storage/retrieval in Pinecone.
- src/backend/utils/pdf_processor.py: Extracts text and tables from PDFs. Stores table data in MySQL and provides schema information.
- src/backend/utils/upload_pdf.py: Handles the PDF upload process, coordinating PDFProcessor and EmbeddingService.
- src/backend/config.py: Manages application configuration from environment variables.
- src/backend/models.py: Contains Pydantic models for API request/response validation.
Frontend:
- src/frontend/streamlit_app.py: A Streamlit application providing the user interface. It interacts with the backend API.
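The route-then-combine flow described for ManagerAgent can be illustrated with a plain-Python stand-in. This is only a sketch of the control flow, not the project's LangGraph graph: the keyword heuristic in TABLE_HINTS, the AgentResult type, and the callable-based agents are all hypothetical, and the string join at the end merely stands in for CombinerAgent, which the project implements with an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical heuristic: phrases that suggest the answer lives in table data.
TABLE_HINTS = ("how many", "total", "average", "in the table")


@dataclass
class AgentResult:
    source: str  # which agent produced the text ("table" or "rag")
    text: str


def choose_routes(query: str) -> List[str]:
    """Decide which agents should handle the query."""
    lowered = query.lower()
    routes = ["rag"]  # in this sketch, unstructured text is always consulted
    if any(hint in lowered for hint in TABLE_HINTS):
        routes.insert(0, "table")
    return routes


def answer(query: str, agents: Dict[str, Callable[[str], str]]) -> str:
    """Run the selected agents and merge their outputs into one response."""
    results = [AgentResult(name, agents[name](query)) for name in choose_routes(query)]
    if len(results) == 1:
        return results[0].text
    # Stand-in for CombinerAgent: label each partial answer and join them.
    return "\n".join(f"[{r.source}] {r.text}" for r in results)
```

Passing the agents in as plain callables keeps the routing decision testable without Gemini, Pinecone, or MySQL being available.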
The application uses environment variables for configuration. These are typically defined in a .env file in the project root for local development. See .env.template for a list of required variables.
Backend Variables:
- GEMINI_API_KEY: Your Google Gemini API key.
- PINECONE_API_KEY: Your Pinecone API key.
- PINECONE_INDEX_NAME: The name of your Pinecone index.
- PINECONE_DIMENSION: The dimension of vectors for Pinecone (e.g., 768 for models/embedding-001).
- PINECONE_CLOUD: The cloud provider for your Pinecone index (e.g., aws).
- PINECONE_REGION: The region of your Pinecone index (e.g., us-east-1).
- DATABASE_URL: Connection string for your MySQL database (e.g., mysql+mysqlconnector://user:password@host:port/database). Critical for table storage and querying.
- APP_ENV: Set to development or production.
- PORT: Port for the backend server (defaults to 8000).
- LOG_LEVEL: Optional, sets the logging level (e.g., DEBUG, INFO).
Frontend Variables:
- ENDPOINT: The URL of the backend API (e.g., http://localhost:8000).
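One way to read and validate these variables at startup is sketched below. It is not the contents of src/backend/config.py: the Settings class and load_settings function are hypothetical, the choice of which variables are mandatory is an assumption, and the defaults other than PORT=8000 are illustrative. PINECONE_CLOUD and PINECONE_REGION are omitted for brevity.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these four are treated as mandatory; check .env.template for the real list.
REQUIRED_VARS = ("GEMINI_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME", "DATABASE_URL")


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    pinecone_api_key: str
    pinecone_index_name: str
    database_url: str
    pinecone_dimension: int
    app_env: str
    port: int
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ), failing fast on gaps."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(
        gemini_api_key=source["GEMINI_API_KEY"],
        pinecone_api_key=source["PINECONE_API_KEY"],
        pinecone_index_name=source["PINECONE_INDEX_NAME"],
        database_url=source["DATABASE_URL"],
        pinecone_dimension=int(source.get("PINECONE_DIMENSION", "768")),  # illustrative default
        app_env=source.get("APP_ENV", "development"),  # illustrative default
        port=int(source.get("PORT", "8000")),  # documented default
        log_level=source.get("LOG_LEVEL", "INFO").upper(),  # illustrative default
    )
```

Failing at startup with the full list of missing names tends to be easier to diagnose than a later error from Gemini, Pinecone, or MySQL.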
(This section outlines general steps. Specific test setup might vary.)
- Install Test Dependencies: If you haven't already, install development dependencies:
pip install -r requirements-dev.txt
- Configure Environment for Tests: Ensure your .env file (or your environment variables) is set up correctly, as tests might interact with external services if not properly mocked.
- Run Tests: Navigate to the project root directory and execute:
pytest
Pytest will automatically discover and run tests (typically files named test_*.py or *_test.py in the tests/ directory).
Refer to the tests/ directory and any specific test documentation or configuration files for more detailed instructions on running tests.
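To keep tests from reaching Gemini, Pinecone, or MySQL, external collaborators can be replaced with mocks. The sketch below shows the pattern with unittest.mock from the standard library. The answer_question function and the orchestrator's process method are hypothetical stand-ins, not the project's actual handler or Orchestrator interface; only the response keys (answer, success, error) come from the API table in this guide.

```python
from unittest.mock import MagicMock


def answer_question(query: str, orchestrator) -> dict:
    """Hypothetical handler: delegates to an orchestrator and shapes the response."""
    try:
        return {"answer": orchestrator.process(query), "success": True, "error": None}
    except Exception as exc:  # report failures as structured errors, not tracebacks
        return {"answer": None, "success": False, "error": str(exc)}


def test_answer_success():
    orchestrator = MagicMock()
    orchestrator.process.return_value = "Mocked answer."
    result = answer_question("What is in the PDF?", orchestrator)
    assert result == {"answer": "Mocked answer.", "success": True, "error": None}
    orchestrator.process.assert_called_once_with("What is in the PDF?")


def test_answer_failure():
    orchestrator = MagicMock()
    orchestrator.process.side_effect = RuntimeError("Pinecone unavailable")
    result = answer_question("What is in the PDF?", orchestrator)
    assert result["success"] is False
    assert result["error"] == "Pinecone unavailable"
```

Saved as a test_*.py file under tests/, functions like these are picked up by pytest discovery and run without network access or credentials.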
For troubleshooting common installation and setup issues, refer to the Detailed Installation and Setup Guide.
For more verbose error output locally:
- Set APP_ENV=development in your .env file. This often enables FastAPI's debug mode.
- Optionally, set LOG_LEVEL=DEBUG in .env for more detailed application logs (our application uses this).
- Run the app (e.g., uvicorn app:app --reload).
The /health endpoint (see API Endpoints) provides detailed status of backend components, including the different agents. Regularly polling this endpoint can help ensure system availability.
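A polling script can interpret the /health response as sketched below. It relies only on the status and overall_health fields shown in the API table and does not inspect the per-service entries, whose shape is not documented here. The function names and the http://localhost:8000 default are assumptions for illustration.

```python
import json
import urllib.request
from typing import Any, Mapping


def is_healthy(payload: Mapping[str, Any]) -> bool:
    """True only when both top-level health indicators report success."""
    return payload.get("status") == "healthy" and payload.get("overall_health") is True


def check_health(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> bool:
    """Fetch /health and evaluate it; any network or parsing failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as response:
            return is_healthy(json.loads(response.read().decode("utf-8")))
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers invalid JSON.
        return False
```

A scheduler or uptime monitor could call check_health at a fixed interval and alert after several consecutive False results.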
- Local Development: Logs are output to the console where uvicorn app:app is running. Adjust LOG_LEVEL in .env for desired verbosity (e.g., INFO, DEBUG).
- Render Deployment: Access and monitor logs via the Render dashboard for your service. This is crucial for diagnosing issues in the production environment.
Key information to look for in logs:
- Successful/failed PDF uploads and processing durations (including table and text chunk counts).
- Question answering request details, including routing decisions by ManagerAgent and responses from TableAgent or ChatbotAgent.
- Errors from external services (Gemini, Pinecone, MySQL).
- Any unexpected application exceptions or tracebacks.
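For reference, mapping LOG_LEVEL to Python's logging configuration might look like the sketch below. The function names and the log format are assumptions, not the project's actual setup. Unknown or missing values fall back to INFO here, which is also an illustrative choice.

```python
import logging
import os
from typing import Optional

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def resolve_log_level(name: Optional[str], default: int = logging.INFO) -> int:
    """Translate a LOG_LEVEL string into a logging constant, tolerating case and gaps."""
    return LEVELS.get((name or "").strip().upper(), default)


def configure_logging() -> None:
    """Configure console logging once, using LOG_LEVEL from the environment."""
    logging.basicConfig(
        level=resolve_log_level(os.getenv("LOG_LEVEL")),
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
```

Including the logger name in the format makes it easier to tell routing decisions apart from errors raised by external services when scanning the console or the Render log view.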
- API Keys & Database Credentials: Handled via environment variables (.env locally, Render's environment settings). Never hardcode credentials. Ensure .env is in .gitignore.
- File Uploads:
  - werkzeug.utils.secure_filename is used to sanitize filenames.
  - File type and size are validated as per ALLOWED_EXTENSIONS and MAX_FILE_SIZE in the app configuration (src/backend/config.py).
- Input Validation: Pydantic models (src/backend/models.py) are used for request validation in API endpoints.
- SQL Injection:
  - TableAgent uses an LLM to generate SQL queries. While this is powerful, it requires careful prompt engineering to prevent SQL injection. The current implementation relies on the LLM's ability to generate safe SQL based on schema and query descriptions.
  - PDFProcessor, when storing tables, uses parameterized queries or ORM-like behavior if applicable, which is generally safer.
  - Recommendation: Implement strict validation and sanitization of any table/column names or values derived from LLM output before using them in SQL queries, or use query builders that parameterize inputs.
- CORS: FastAPI handles CORS through CORSMiddleware. Ensure it's configured securely, especially in production, by specifying allowed origins, methods, and headers. Example from src/backend/__init__.py:

    from fastapi.middleware.cors import CORSMiddleware

    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],  # Adjust for production
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )

- Error Handling: FastAPI has built-in support for returning structured JSON error responses (e.g., using HTTPException) and allows for custom exception handlers. This helps avoid exposing raw stack traces.
- Dependency Management: Keep requirements.txt and requirements-dev.txt up-to-date. Regularly audit dependencies for vulnerabilities using tools like pip-audit or GitHub's Dependabot.
- HTTPS: Render automatically provides HTTPS for deployed services. For local development, consider using a reverse proxy like Caddy or Nginx if HTTPS is needed.
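The recommendation to validate LLM-generated SQL before execution could start from a guard like the one sketched below. It is a conservative, regex-based check and not a SQL parser: it may reject harmless queries (for example, a string literal containing the word "update") and it does not recognise comma-separated table lists. The function name and the idea of passing an allow-list of table names are assumptions, not part of TableAgent.

```python
import re
from typing import Iterable

# Keywords that should never appear in a read-only query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke|replace)\b",
    re.IGNORECASE,
)
# Captures the identifier that follows FROM or JOIN (optionally backtick-quoted).
TABLE_REF = re.compile(r"\b(?:from|join)\s+`?([A-Za-z_][A-Za-z0-9_]*)`?", re.IGNORECASE)


def validate_generated_sql(sql: str, allowed_tables: Iterable[str]) -> str:
    """Return the cleaned statement, or raise ValueError if it looks unsafe."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("Multiple statements are not allowed")
    if "--" in statement or "/*" in statement or "#" in statement:
        raise ValueError("SQL comments are not allowed")
    if not re.match(r"select\b", statement, re.IGNORECASE):
        raise ValueError("Only SELECT statements are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("Statement contains a forbidden keyword")
    allowed = {name.lower() for name in allowed_tables}
    unknown = {name.lower() for name in TABLE_REF.findall(statement)} - allowed
    if unknown:
        raise ValueError("Unknown tables referenced: " + ", ".join(sorted(unknown)))
    return statement
```

A check like this works best as one layer among several. Running generated queries under a database user that only has SELECT privileges limits the damage if a query slips through.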
This project is licensed under the MIT License. It's good practice to include a LICENSE file in the repository root with the full text of the MIT License.
Contributions are welcome! Please adhere to the following process:
- Fork the Repository: Create your own fork on GitHub.
- Create a Branch: git checkout -b feature/your-new-feature or bugfix/issue-description.
- Develop: Make your changes.
- Test: Add and run tests for your changes using pytest. Ensure tests cover new functionality and don't break existing features.
- Commit: Write clear, concise commit messages.
- Push: Push your branch to your fork: git push origin your-branch-name.
- Pull Request: Open a PR against the main branch of the original repository. Clearly describe your changes and link any relevant issues. Ensure your PR passes any CI checks.
If you encounter issues or have questions:
- Check GitHub Issues: See if your question or problem has already been addressed.
- Review Troubleshooting Section: The Detailed Installation and Setup Guide might have a solution.
- Create a New Issue: If your issue is new, provide detailed information:
- Steps to reproduce.
- Expected vs. actual behavior.
- Error messages and relevant logs.
- Your environment (OS, Python version, relevant package versions).
- For Render-specific deployment issues, consult the Render documentation.
Happy coding!