
Automated Pipeline for Cloud-Based Financial Research Retrieval and Interactive Exploration with Multi-modal RAG

Overview

This project automates the ingestion, processing, and retrieval of financial research publications from the CFA Institute Research Foundation. Users can explore documents interactively and ask questions that are answered by a multi-modal Retrieval-Augmented Generation (RAG) pipeline. The system combines cloud storage and indexing, a client-facing application for exploration, and similarity search backed by Pinecone as the vector database.

Attestation and Contribution Declaration

WE ATTEST THAT WE HAVEN'T USED ANY OTHER STUDENTS' WORK IN OUR ASSIGNMENT AND ABIDE BY THE POLICIES LISTED IN THE STUDENT HANDBOOK.

Contribution Breakdown:

  • Chiu Meng Che: 34%
  • Shraddha Bhandarkar: 33%
  • Kefan Zhang: 33%

Workflow Diagram

[Figure: workflow diagram (images/workflow_diagram.jpeg)]

The workflow diagram shows how the major components fit together: Apache Airflow, Streamlit, FastAPI, Amazon S3, Docker, and Pinecone. Documents are retrieved from the CFA Institute and stored in Amazon S3, with Apache Airflow orchestrating the automated retrieval and processing pipeline. FastAPI acts as middleware between the Streamlit interface and the backend services: document processing, the embedding models, and Pinecone for indexing. Each module runs in its own Docker container so that deployments are consistent across environments.

Key Features

Automated Document Processing and Storage

  • Data Scraping: Apache Airflow automates the extraction of document data from the CFA Institute, including metadata, PDF files, and summaries.
  • Cloud Storage: Extracted documents are uploaded and stored in Amazon S3 for secure and scalable storage.
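
The steps above form a small dependency chain: scrape, download, store, index. The sketch below is not the project's Airflow DAG. It uses only the Python standard library to show how an orchestrator orders tasks so that each one runs after the tasks it depends on. The task names (`scrape_metadata`, `download_pdfs`, `upload_to_s3`, `index_embeddings`) are hypothetical and chosen for illustration only.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each task maps to the set of tasks it depends on.
PIPELINE = {
    "scrape_metadata": set(),
    "download_pdfs": {"scrape_metadata"},
    "upload_to_s3": {"download_pdfs"},
    "index_embeddings": {"upload_to_s3"},
}


def execution_order(graph):
    """Return task names in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run_pipeline(graph, tasks):
    """Call each task in dependency order.

    Every task receives the results collected so far, so a downstream
    task can read what its upstream tasks produced.
    """
    results = {}
    for name in execution_order(graph):
        results[name] = tasks[name](results)
    return results
```

In a real Airflow deployment the same ordering would be declared in the DAG file, and Airflow would add scheduling, retries, and logging on top of it.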

Backend API with FastAPI

  • Document Exploration: Provides REST API endpoints to explore documents and their content, including metadata, summaries, and links.
  • Q/A Interface: Allows users to interact with documents by posing queries, which are processed using AI models to extract insightful answers.
  • Embedding with Pinecone: Utilizes Pinecone for storing document embeddings, enabling fast and efficient similarity searches.
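
To make the idea of similarity search concrete, here is a minimal in-memory sketch of what a vector database does conceptually. It assumes cosine similarity as the metric (the README does not state which metric the Pinecone index uses) and uses toy three-dimensional vectors in place of real embeddings. The function names are illustrative and are not part of the Pinecone client.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query, index, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: document id -> embedding (real embeddings have far more dimensions).
TOY_INDEX = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.0, 1.0, 0.0],
    "doc-c": [1.0, 1.0, 0.0],
}
```

This brute-force scan compares the query against every stored vector. A managed vector database serves the same kind of query with approximate nearest-neighbor indexes, which is what keeps lookups fast as the collection grows.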

User Interface with Streamlit

  • Document Interaction: A clean, intuitive UI for users to register, upload, and interact with documents. Users can query documents, explore extracted data, and generate custom research notes.
  • Authentication: Provides secure user login and registration managed via an integrated PostgreSQL database.

AI-Powered Insights

  • Multi-modal RAG Integration: Supports Retrieval-Augmented Generation (RAG) for in-depth content analysis and dynamic answers to research questions.
  • Document Summarization: NVIDIA-powered AI models generate concise and informative summaries to help users quickly understand document content.
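
The generation half of RAG amounts to placing the retrieved material in front of the model before asking the question. The sketch below shows one way to assemble such a prompt from mixed text, table, and figure chunks. The chunk fields (`kind`, `source`, `text`), the prompt wording, and the `llm` callable are all assumptions made for illustration and do not describe the project's actual prompt or model client.

```python
def build_prompt(question, chunks, max_chunks=3):
    """Assemble a grounded prompt, numbering each retrieved chunk."""
    selected = chunks[:max_chunks]
    context_lines = [
        f"[{i}] {chunk['kind']}, {chunk['source']}: {chunk['text']}"
        for i, chunk in enumerate(selected, start=1)
    ]
    parts = [
        "Answer the question using only the numbered context below.",
        "Cite sources as [n].",
        "",
        "Context:",
        *context_lines,
        "",
        f"Question: {question}",
        "Answer:",
    ]
    return "\n".join(parts)


def answer_question(question, chunks, llm):
    """Send the assembled prompt to any callable that maps prompt -> text."""
    return llm(build_prompt(question, chunks))
```

Numbering the chunks gives the model a way to cite the table or figure an answer draws on, and keeping `llm` as a plain callable lets the prompt logic be exercised with a stub before a real model is wired in.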

Containerized Deployment

  • Docker Compose Setup: Docker Compose defines the frontend, backend, and pipeline services so that they can be built and started together with a single command.
  • Scalable Architecture: Because each service runs in its own container, services can be scaled or moved to a cloud host independently of one another.

Project Structure

.
├── .env
├── .gitignore
├── project_tree_structure
├── airflow
│   ├── airflow.cfg
│   ├── poetry.lock
│   ├── pyproject.toml
│   └── dags
│       ├── pipeline.py
│       └── modules
│           ├── cfa_scrape_data.py
│           └── __init__.py
├── backend
│   ├── delete_vector.py
│   ├── document_processors.py
│   ├── insert_vector.py
│   ├── main.py
│   ├── poetry.lock
│   └── pyproject.toml
├── frontend
│   ├── poetry.lock
│   ├── pyproject.toml
│   └── streamlit_app.py
└── images
    └── workflow_diagram.jpeg

Prerequisites

Docker: Required to containerize and run the application services.

Docker Compose: To manage the multi-container setup.

  • Verify installation:
    docker-compose --version

Poetry: A Python dependency management tool.

Python 3.9+: The project requires Python 3.9 or above.

  • Verify installation:
    python3 --version

Ensure all prerequisites are installed before proceeding to deployment.
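
As a convenience, the checks above can be bundled into a small script. This is a sketch using only the standard library. It assumes the tools are on `PATH` under the names `docker`, `docker-compose`, and `poetry`, and the helper names are illustrative.

```python
import shutil
import sys

REQUIRED_TOOLS = ("docker", "docker-compose", "poetry")
MIN_PYTHON = (3, 9)


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the required command-line tools that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version=None, minimum=MIN_PYTHON):
    """Check that the interpreter's (major, minor) meets the minimum."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) >= minimum
```

Calling `missing_tools()` returns an empty list when everything is installed, and `python_ok()` returns `True` on Python 3.9 or newer. On systems where Compose is installed as the `docker compose` plugin rather than a standalone binary, `docker-compose` may be reported as missing even though Compose is available.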

Installation and Setup

  1. Clone the Repository

    git clone https://github.com/BigDataIA-Fall2024-TeamA1/Assignment3.git
    cd Assignment3
  2. Environment Setup

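    • Use Poetry to install dependencies. The pyproject.toml files live in the airflow, backend, and frontend directories, so run the command inside each of them:
      poetry install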
  3. Running the Application with Docker Compose

    • Start the entire system:
      docker-compose up --build
    • This command will spin up all required containers including Airflow, FastAPI, and Streamlit services.
  4. Access the Application

    • Streamlit frontend is accessible at http://localhost:8501
    • FastAPI backend documentation (Swagger UI) is available at http://localhost:8000/docs
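
Once the containers are running, a quick reachability check can confirm that both endpoints respond. The sketch below uses only the standard library and the two URLs listed above. The service labels and function names are illustrative, and the `opener` parameter exists only so the check can be exercised without a live server.

```python
from urllib.request import urlopen

SERVICES = {
    "streamlit": "http://localhost:8501",
    "fastapi-docs": "http://localhost:8000/docs",
}


def is_up(url, timeout=3.0, opener=urlopen):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so refused
        # connections, timeouts, and 4xx/5xx responses all land here.
        return False


def check_services(services=SERVICES, opener=urlopen):
    """Map each service label to whether its URL is reachable."""
    return {name: is_up(url, opener=opener) for name, url in services.items()}
```

Running `check_services()` after `docker-compose up --build` returns a dictionary such as `{"streamlit": True, "fastapi-docs": True}` when both services are reachable. Containers can take a while to finish starting, so a `False` immediately after startup does not necessarily indicate a failure.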

Contributions and Time Breakdown

Chiu Meng Che:

  1. Used Airflow to automate the transfer of CFA publications to Amazon S3 and Snowflake. (2.5 days)
  2. Combined Snowflake, Pinecone, an NVIDIA embedding model, and an LLM to implement the RAG process. (3.5 days)
  3. Created the project workflow diagram. (2 hours)
  4. Deployed the services using Docker. (2 hours)

Shraddha Bhandarkar:

  1. Improved search functionality to enhance user interaction. (2 days)
  2. Displayed saved research notes when revisiting a document. (1 day)
  3. Enabled search within research notes specific to a document or across the entire document. (1 day)
  4. Differentiated between searching through the document's full text and the research notes index. (1 day)
  5. Allowed derived research notes to be added to the original research note index for continuous learning. (1 day)
  6. Managed data ingestion for CFA publications. (2.5 days)

Kefan Zhang:

  1. Developed FastAPI endpoints to enable users to retrieve and explore stored documents, including metadata such as titles, summaries, and links to images and PDFs. (2 hours)
  2. Utilized NVIDIA’s embedding models to generate dense vector representations of documents, enhancing semantic search capabilities. (2 days)
  3. Stored the generated embeddings in Pinecone, facilitating efficient and accurate similarity-based retrieval. (5 hours)
  4. Incorporated inline references in the generated answers, linking directly to graphs, tables, and figures within the document. (3 hours)
  5. Enabled users to modify and personalize the generated research notes, providing a tailored experience that adds value to each interaction. (1.5 days)
  6. Prepared the Codelabs documentation. (3 hours)

Resources

  • LLAMA Multimodal Report Generation Example
  • NVIDIA Multimodal RAG Example
  • Multimodal RAG Slide Deck Example

Demonstration Video

Click here to watch the video demonstration

Codelabs Documentation

Click here to view the Codelabs documentation
