A comprehensive implementation of a custom Vision-Language Model (VLM) that combines a vision encoder with a language model for chart question-answering tasks on the ChartQA dataset. The pipeline includes LoRA-based fine-tuning and an OCR-enhanced variant that injects EasyOCR-extracted text directly into the prompt.
Submitted as a semester team project for 11-777 Multimodal Machine Learning at CMU.
This project builds and trains a multimodal model that can understand charts and images and answer questions about them. The architecture combines:
- Vision Encoder: Google SigLIP2 (SO400M) for visual understanding
- Language Model: Qwen3-4B with optional LoRA adapters for parameter-efficient fine-tuning
- Projector: A custom learnable MLP that bridges vision and language modalities
- OCR Augmentation: EasyOCR is used to extract chart text and inject it into prompts at training and inference time
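
As an illustration of the OCR augmentation step, the sketch below shows one way OCR output could be folded into a prompt. It assumes results in the `(bounding_box, text, confidence)` format returned by EasyOCR's `Reader.readtext`. The function name `build_ocr_prompt`, the confidence threshold, and the prompt template are hypothetical and may not match what the training scripts use.

```python
from typing import List, Sequence, Tuple

# EasyOCR's Reader.readtext returns a list of (bounding_box, text, confidence).
OcrResult = Tuple[Sequence[Sequence[float]], str, float]


def build_ocr_prompt(question: str, ocr_results: List[OcrResult],
                     min_conf: float = 0.3, max_items: int = 50) -> str:
    """Prepend OCR-extracted chart text to the question (hypothetical template)."""
    seen = set()
    texts = []
    for _bbox, text, conf in ocr_results:
        cleaned = text.strip()
        # Skip low-confidence detections, empty strings, and repeated strings.
        if conf < min_conf or not cleaned or cleaned in seen:
            continue
        seen.add(cleaned)
        texts.append(cleaned)
    if not texts:
        # No usable OCR text: fall back to the plain question prompt.
        return f"Question: {question}\nAnswer:"
    ocr_block = ", ".join(texts[:max_items])
    return f"Text detected in the chart: {ocr_block}\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    # In practice the results would come from something like:
    #   reader = easyocr.Reader(["en"]); results = reader.readtext("chart.png")
    fake_results = [
        ([[0, 0], [10, 0], [10, 5], [0, 5]], "Revenue 2020", 0.92),
        ([[0, 6], [10, 6], [10, 9], [0, 9]], "blurry", 0.10),
    ]
    print(build_ocr_prompt("What year is shown?", fake_results))
```

Filtering by confidence and capping the number of items keeps noisy detections from crowding the prompt; both limits are illustrative values, not tuned settings.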
The model is trained and evaluated on the ChartQA dataset from the Hugging Face Hub, with additional evaluation on ChartQAPro. Performance is reported as relaxed accuracy (semantic equivalence checking) and exact-match accuracy.
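
To make the two metrics concrete, here is a minimal sketch. It follows the convention commonly used for ChartQA, where a numeric answer within 5% of the target counts as correct and a non-numeric answer must match after normalization. The equivalence check in this repository may be implemented differently, and the function names below are hypothetical.

```python
from typing import Callable, Optional, Sequence


def _to_float(text: str) -> Optional[float]:
    """Parse a numeric answer, tolerating '%', ',' and surrounding spaces."""
    cleaned = text.strip().rstrip("%").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None


def exact_match(prediction: str, target: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return prediction.strip().lower() == target.strip().lower()


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers may deviate by `tolerance`; other answers must match exactly."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0.0:
            return pred_num == 0.0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return exact_match(prediction, target)


def accuracy(predictions: Sequence[str], targets: Sequence[str],
             match_fn: Callable[[str, str], bool]) -> float:
    """Fraction of predictions accepted by `match_fn`."""
    if not targets:
        return 0.0
    hits = sum(match_fn(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)
```

With this sketch, `accuracy(preds, golds, relaxed_match)` and `accuracy(preds, golds, exact_match)` give the two numbers side by side for the same predictions.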
| File | Purpose |
|---|---|
| `base_architecture.py` | Core VLM architecture: vision-language projector, `CustomVLM` model class, dataset creation, and trainer callbacks. Foundation for all training scripts. |
| `continued_pretraining_and_finetuning.py` | Multi-GPU continued pretraining and LoRA fine-tuning pipeline. Supports OCR augmentation, multi-dataset evaluation (ChartQA + ChartQAPro), and saves full LoRA artifacts per epoch. |
| `finetuning_without_ocr.py` | LoRA fine-tuning on ChartQA without OCR-augmented prompts. Includes `LoRACustomVLM`, eval-only mode, and optional OCR evaluation at inference time. |
| `finetune_ocr.py` | Fine-tuning with EasyOCR-augmented prompts. Precomputes OCR for all splits before training to avoid VRAM contention. |
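
The idea of precomputing OCR before training can be sketched as a small on-disk cache. This is an illustration only: `precompute_ocr`, the JSON cache layout, and the injected `ocr_fn` callable are assumptions, not the actual logic in `finetune_ocr.py`.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable


def precompute_ocr(image_paths: Iterable[str], ocr_fn: Callable[[str], str],
                   cache_file: str) -> Dict[str, str]:
    """Run OCR once per image and persist the text, so training only reads the cache."""
    cache_path = Path(cache_file)
    cache: Dict[str, str] = {}
    if cache_path.exists():
        # Reuse results from an earlier run instead of repeating OCR.
        cache = json.loads(cache_path.read_text(encoding="utf-8"))
    for path in image_paths:
        if path not in cache:
            cache[path] = ocr_fn(path)
    cache_path.write_text(
        json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return cache
```

Here `ocr_fn` would wrap the OCR engine, for example a function that calls an EasyOCR reader and joins the detected strings. Running this pass before the model is loaded means the OCR engine and the VLM never compete for GPU memory at the same time.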
- Python 3.10+
- CUDA 11.8+ (for GPU support)
- 80GB+ VRAM for full dataset training (A100 or L40 recommended)
- Access to HuggingFace models and datasets
1. Clone the repository:

       git clone <repository-url>
       cd Multimodal-Machine-Learning-Project

2. Create a conda environment:

       conda create -n mmml python=3.10
       conda activate mmml

3. Install dependencies:

       pip install -r requirements.txt

4. Configure credentials: create a `config.py` file with your WandB API key:

       WANDB_KEY = "your_wandb_api_key_here"
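
A script could read this key defensively, as in the sketch below. The helper `load_wandb_key` is hypothetical, and the training scripts may simply import `WANDB_KEY` from `config.py` directly. The fallback uses the `WANDB_API_KEY` environment variable, which the `wandb` client also recognizes.

```python
import importlib
import os
from typing import Optional


def load_wandb_key(module_name: str = "config") -> Optional[str]:
    """Return WANDB_KEY from config.py, else the WANDB_API_KEY env var, else None."""
    try:
        module = importlib.import_module(module_name)
        key = getattr(module, "WANDB_KEY", None)
    except ImportError:
        key = None
    # Treat the README placeholder as "not configured".
    if not key or key == "your_wandb_api_key_here":
        key = os.environ.get("WANDB_API_KEY")
    return key or None


if __name__ == "__main__":
    key = load_wandb_key()
    # A training script could then call wandb.login(key=key) when key is set.
    print("WandB key found" if key else "WandB key missing")
```

Since `config.py` holds a secret, it is worth keeping it out of version control, for example by listing it in `.gitignore`.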
- Vision Model: SigLIP2 (google/siglip2-so400m-patch16-512)
- Language Model: Qwen3-4B
- ChartQA Dataset: HuggingFaceM4/ChartQA
- ChartQAPro Dataset: ahmed-masry/ChartQAPro
- OCR: EasyOCR
For questions about this project, please contact the project maintainers.