emmadionne1/Multimodal-Machine-Learning-Project

Multimodal Machine Learning Project: Robust and OCR-Enhanced Model for ChartQA

An implementation of a custom Vision-Language Model (VLM) that pairs a vision encoder with a language model for chart question answering on the ChartQA dataset. The pipeline includes LoRA-based fine-tuning and an OCR-enhanced variant that injects EasyOCR-extracted text directly into the prompt.

Submitted as a semester team project for 11-777 Multimodal Machine Learning at CMU.

Project Overview

This project builds and trains a multimodal model that can understand charts and images and answer questions about them. The architecture combines:

  • Vision Encoder: Google SigLIP2 (SO400M) for visual understanding
  • Language Model: Qwen3-4B with optional LoRA adapters for parameter-efficient fine-tuning
  • Projector: A custom learnable MLP that bridges vision and language modalities
  • OCR Augmentation: EasyOCR is used to extract chart text and inject it into prompts at training and inference time
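
As a rough illustration of the OCR augmentation step, the sketch below turns EasyOCR-style detections into a text prefix for the question. The function name `build_ocr_prompt`, the prompt template, and the `min_conf` and `max_items` parameters are hypothetical and not taken from this repository; only the `(bounding box, text, confidence)` detection shape follows EasyOCR's `readtext` output.

```python
from typing import List, Sequence, Tuple

# One EasyOCR detection: (bounding box, text, confidence), the shape returned by
# easyocr.Reader(["en"]).readtext(image).
Detection = Tuple[Sequence, str, float]


def build_ocr_prompt(question: str,
                     detections: List[Detection],
                     min_conf: float = 0.3,
                     max_items: int = 64) -> str:
    """Prepend OCR text to a chart question (hypothetical prompt template)."""
    # Keep confident, non-empty strings in detection order.
    texts = [text.strip() for _, text, conf in detections
             if conf >= min_conf and text.strip()]
    # Drop duplicates while keeping the first occurrence, then cap the length.
    unique = list(dict.fromkeys(texts))[:max_items]
    if not unique:
        # No usable OCR text: fall back to the plain question.
        return f"Question: {question}\nAnswer:"
    return f"Chart text (OCR): {' | '.join(unique)}\nQuestion: {question}\nAnswer:"
```

Filtering by confidence and capping the number of strings keeps noisy detections from crowding the prompt; the thresholds here are placeholders rather than tuned values.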

The model is trained and evaluated on the HuggingFace ChartQA dataset, with additional evaluation on ChartQAPro. Performance is measured using relaxed accuracy (semantic equivalence checking) and exact accuracy.
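
For readers unfamiliar with the two metrics, here is a minimal sketch. It assumes the definition of relaxed accuracy commonly used with ChartQA (numeric answers may deviate by up to 5% from the target, other answers must match after case and whitespace normalization) and treats exact accuracy as plain string equality. The scoring code in this repository may differ, and all function names below are hypothetical.

```python
def _to_float(text: str):
    """Parse a number, tolerating commas, a trailing '%', and spaces; None if not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; text is compared case-insensitively."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0.0:
            # Relative error is undefined for a zero target, so require equality.
            return pred_num == 0.0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def exact_match(prediction: str, target: str) -> bool:
    """Strict string equality after trimming surrounding whitespace."""
    return prediction.strip() == target.strip()


def accuracy(predictions, targets, match=relaxed_match) -> float:
    """Fraction of prediction/target pairs accepted by the given match function."""
    if not targets:
        return 0.0
    return sum(match(p, t) for p, t in zip(predictions, targets)) / len(targets)
```

Passing `match=exact_match` to `accuracy` gives the stricter score on the same predictions.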


Core Scripts

| File | Purpose |
| --- | --- |
| base_architecture.py | Core VLM architecture: vision-language projector, CustomVLM model class, dataset creation, and trainer callbacks. Foundation for all training scripts. |
| continued_pretraining_and_finetuning.py | Multi-GPU continued pretraining and LoRA fine-tuning pipeline. Supports OCR augmentation, multi-dataset evaluation (ChartQA + ChartQAPro), and saves full LoRA artifacts per epoch. |
| finetuning_without_ocr.py | LoRA fine-tuning on ChartQA without OCR-augmented prompts. Includes LoRACustomVLM, eval-only mode, and optional OCR evaluation at inference time. |
| finetune_ocr.py | Fine-tuning with EasyOCR-augmented prompts. Precomputes OCR for all splits before training to avoid VRAM contention. |
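
The idea of precomputing OCR before training can be sketched as a small on-disk cache. This is an illustration under assumptions, not the code in finetune_ocr.py: `precompute_ocr`, the JSON cache format, and the `run_ocr` callable are hypothetical. With EasyOCR, `run_ocr` could wrap `reader.readtext(image, detail=0)`, which returns only the recognized strings.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List, Tuple


def precompute_ocr(samples: Iterable[Tuple[str, object]],
                   run_ocr: Callable[[object], List[str]],
                   cache_path: str) -> Dict[str, List[str]]:
    """Run OCR once per sample and store the extracted text in a JSON cache.

    samples yields (sample_id, image) pairs; run_ocr maps an image to a list of strings.
    """
    path = Path(cache_path)
    # Reuse earlier results so an interrupted run does not redo finished samples.
    cache = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for sample_id, image in samples:
        if sample_id not in cache:
            cache[sample_id] = run_ocr(image)
    path.write_text(json.dumps(cache, ensure_ascii=False), encoding="utf-8")
    return cache
```

Because the OCR text is read back from the cache during training, the OCR model no longer needs to be resident on the GPU once this pass has finished.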

Getting Started

Prerequisites

  • Python 3.10+
  • CUDA 11.8+ (for GPU support)
  • 80GB+ VRAM for full dataset training (A100 or L40 recommended)
  • Access to HuggingFace models and datasets

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd Multimodal-Machine-Learning-Project
  2. Create a conda environment:

    conda create -n mmml python=3.10
    conda activate mmml
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure credentials: Create a config.py file with your WandB API key:

    WANDB_KEY = "your_wandb_api_key_here"
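
A training script could read this key along the following lines. This is a sketch under assumptions: the helper `resolve_wandb_key` and the fallback to the `WANDB_API_KEY` environment variable are illustrative and not part of the repository.

```python
import importlib
import os
from typing import Mapping, Optional


def resolve_wandb_key(module_name: str = "config",
                      env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return WANDB_KEY from config.py, else the WANDB_API_KEY environment variable."""
    try:
        key = getattr(importlib.import_module(module_name), "WANDB_KEY", None)
    except ImportError:
        # No config.py on the import path.
        key = None
    # Treat the template placeholder as "not configured".
    if key and key != "your_wandb_api_key_here":
        return key
    return env.get("WANDB_API_KEY")
```

The returned value can then be passed to `wandb.login(key=...)`. Since config.py holds a secret, it is usually best kept out of version control.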


Contact

For questions about this project, please contact the project maintainers.
