Automated Pipeline for Cloud-Based Financial Research Retrieval and Interactive Exploration with Multi-modal RAG
This project automates the ingestion, processing, and retrieval of financial research publications from the CFA Institute Research Foundation, providing users with interactive document exploration and question-answering capabilities using a multi-modal Retrieval-Augmented Generation (RAG) model. It integrates cloud technologies for data storage and indexing, a client-facing application for exploration, and a robust search capability through Pinecone as the vector database.
WE ATTEST THAT WE HAVEN'T USED ANY OTHER STUDENTS' WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.
Contribution Breakdown:
- Chiu Meng Che: 34%
- Shraddha Bhandarkar: 33%
- Kefan Zhang: 33%
The workflow diagram outlines the integration of major components including Apache Airflow, Streamlit, FastAPI, Amazon S3, Docker, and Pinecone. Documents are initially retrieved from the CFA Institute and stored in Amazon S3. Apache Airflow orchestrates the automated retrieval and processing pipeline. FastAPI serves as the middleware to facilitate the interaction between the Streamlit interface and backend services such as document processing, embedding models, and Pinecone for indexing. Docker is utilized to containerize each module to ensure a consistent and reliable deployment.
- Data Scraping: Apache Airflow automates the extraction of document data from the CFA Institute, including metadata, PDF files, and summaries.
- Cloud Storage: Extracted documents are uploaded and stored in Amazon S3 for secure and scalable storage.
- Document Exploration: Provides REST API endpoints to explore documents and their content, including metadata, summaries, and links.
- Q/A Interface: Allows users to interact with documents by posing queries, which are processed using AI models to extract insightful answers.
- Embedding with Pinecone: Utilizes Pinecone for storing document embeddings, enabling fast and efficient similarity searches.
- Document Interaction: A clean, intuitive UI for users to register, upload, and interact with documents. Users can query documents, explore extracted data, and generate custom research notes.
- Authentication: Provides secure user login and registration managed via an integrated PostgreSQL database.
- Multi-modal RAG Integration: Supports Retrieval-Augmented Generation (RAG) for in-depth content analysis and dynamic answers to research questions.
- Document Summarization: NVIDIA-powered AI models generate concise and informative summaries to help users quickly understand document content.
- Docker Compose Setup: The entire project is containerized using Docker Compose, which ensures that all services (frontend, backend, pipeline) work seamlessly together and are easy to deploy.
- Scalable Architecture: Containerization allows for easy scaling and cloud deployment, providing reliability and flexibility.
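The similarity search that the embedding feature relies on can be illustrated with a minimal, dependency-free sketch. It is not the project's code and does not call the Pinecone client. The toy index, the chunk IDs, and the `cosine_similarity` and `top_k` helpers are all assumptions made for illustration. A vector database performs the same ranking at scale with approximate nearest-neighbour indexes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [(chunk_id, cosine_similarity(query, vec))
              for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional embeddings; real embedding models
# produce vectors with hundreds or thousands of dimensions.
toy_index = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
}

results = top_k([1.0, 0.0, 0.0], toy_index, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 4))
```

In a RAG flow, the text of the top-ranked chunks would then be passed to the language model as context for answering the user's question.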
│   .env
│   .gitignore
│   project_tree_structure
│
├── airflow
│   │   airflow.cfg
│   │   poetry.lock
│   │   pyproject.toml
│   └── dags
│       ├── pipeline.py
│       └── modules
│           ├── cfa_scrape_data.py
│           └── __init__.py
├── backend
│   │   delete_vector.py
│   │   document_processors.py
│   │   insert_vector.py
│   │   main.py
│   │   poetry.lock
│   └── pyproject.toml
├── frontend
│   │   poetry.lock
│   │   pyproject.toml
│   │   streamlit_app.py
└── images
    └── workflow_diagram.jpeg
Docker: Required to containerize and run the application services.
- Download and Install Docker
- Verify installation:
docker --version
Docker Compose: To manage the multi-container setup.
- Verify installation:
docker-compose --version
Poetry: A Python dependency management tool.
- Install Poetry by following the instructions: Poetry Installation Guide
- Verify installation:
poetry --version
Python 3.9+: The project requires Python 3.9 or above.
- Verify installation:
python3 --version
Ensure all prerequisites are installed before proceeding to deployment.
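As a convenience, the prerequisite checks above could be scripted. The following is a minimal sketch using only the Python standard library. The helper names `missing_tools` and `python_ok` are hypothetical and not part of the repository, and the script only checks that the executables are on `PATH`, not which versions they are.

```python
import shutil
import sys

# Command names taken from the verification commands listed above.
REQUIRED_TOOLS = ["docker", "docker-compose", "poetry"]
MIN_PYTHON = (3, 9)


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version=None, minimum=MIN_PYTHON):
    """Check that the (major, minor) Python version meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    if not python_ok():
        print("Python 3.9 or above is required")
    if not missing and python_ok():
        print("All prerequisites found")
```

Note that newer Docker installations may provide Compose only as the `docker compose` subcommand, in which case the `docker-compose` lookup would report it as missing.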
Clone the Repository
git clone https://github.com/your_repository/ai-driven-document-system.git
cd ai-driven-document-system
Environment Setup
- Use Poetry to install all dependencies:
poetry install
Running the Application with Docker Compose
- Start the entire system:
docker-compose up --build
- This command will spin up all required containers including Airflow, FastAPI, and Streamlit services.
Access the Application
- Streamlit frontend is accessible at
http://localhost:8501
- FastAPI backend documentation (Swagger UI) is available at
http://localhost:8000/docs
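Once the containers are running, a quick way to confirm that both services respond is a small reachability check. This is a sketch using only the standard library. The `is_up` helper is hypothetical, and the two URLs are the ones listed above, so they assume the default port mappings.

```python
import http.client
import urllib.error
import urllib.request

# URLs from the "Access the Application" step; they assume default ports.
SERVICES = {
    "Streamlit frontend": "http://localhost:8501",
    "FastAPI docs": "http://localhost:8000/docs",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        state = "reachable" if is_up(url) else "not reachable"
        print(f"{name} ({url}): {state}")
```

Containers can take a little while to start, so a "not reachable" result immediately after `docker-compose up --build` does not necessarily indicate a failure.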
Chiu Meng Che:
- Used Airflow to automate the transfer of CFA publications to Amazon S3 and Snowflake. (2.5 days)
- Combined Snowflake, Pinecone, the NVIDIA embedding model, and an LLM to implement the RAG process. (3.5 days)
- Created the project workflow diagram. (2 hours)
- Deployed the services using Docker. (2 hours)
Shraddha Bhandarkar:
- Improved search functionality to enhance user interaction. (2 days)
- Displayed saved research notes when revisiting a document. (1 day)
- Enabled search within research notes specific to a document or across the entire document. (1 day)
- Differentiated between searching through the document's full text and the research notes index. (1 day)
- Allowed derived research notes to be added to the original research note index for continuous learning. (1 day)
- Managed data ingestion for CFA publications. (2.5 days)
Kefan Zhang:
- Developed FastAPI endpoints to enable users to retrieve and explore stored documents, including metadata such as titles, summaries, and links to images and PDFs. (2 hours)
- Utilized NVIDIA’s embedding models to generate dense vector representations of documents, enhancing semantic search capabilities. (2 days)
- Stored the generated embeddings in Pinecone, facilitating efficient and accurate similarity-based retrieval. (5 hours)
- Incorporated inline references in the generated answers, linking directly to graphs, tables, and figures within the document. (3 hours)
- Enabled users to modify and personalize the generated research notes, providing a tailored experience that adds value to each interaction. (1.5 days)
- Codelab. (3 hours)
- LLAMA Multimodal Report Generation Example
- NVIDIA Multimodal RAG Example
- Multimodal RAG Slide Deck Example
