Multimodal Machine Learning Project: Robust and OCR-Enhanced Model for ChartQA

This repository implements a custom Vision-Language Model (VLM) that pairs a vision encoder with a language model for chart question answering on the ChartQA dataset. The pipeline includes LoRA-based fine-tuning and an OCR-enhanced variant that injects EasyOCR-extracted text directly into the prompt.

Submitted as a semester team project for 11-777 Multimodal Machine Learning at CMU.

Project Overview

This project builds and trains a multimodal model that can understand charts and images and answer questions about them. The architecture combines:

Vision Encoder: Google SigLIP2 (SO400M) for visual understanding
Language Model: Qwen3-4B with optional LoRA adapters for parameter-efficient fine-tuning
Projector: A custom learnable MLP that bridges vision and language modalities
OCR Augmentation: EasyOCR is used to extract chart text and inject it into prompts at training and inference time
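
To make the role of the projector concrete, here is a minimal PyTorch sketch of an MLP that maps vision-encoder patch features into the language model's embedding space and prepends them to the text embeddings. It is an illustration under stated assumptions, not the code in `base_architecture.py`: the class name `VisionProjector`, the helper `merge_embeddings`, the two-layer GELU design, and the dimensions 1152 and 2560 are all hypothetical choices made for this example.

```python
import torch
from torch import nn


class VisionProjector(nn.Module):
    """Hypothetical two-layer MLP from vision features to LM embedding size."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.net(patch_features)


def merge_embeddings(image_embeds: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    # Prepend the projected image tokens to the text token embeddings
    # along the sequence dimension.
    return torch.cat([image_embeds, text_embeds], dim=1)


# Illustrative sizes only (assumed, not read from the model configs).
projector = VisionProjector(vision_dim=1152, lm_dim=2560)
patches = torch.randn(2, 16, 1152)   # stand-in for vision encoder output
text = torch.randn(2, 8, 2560)       # stand-in for LM token embeddings
inputs_embeds = merge_embeddings(projector(patches), text)
print(inputs_embeds.shape)
```

In a setup like this, only the projector (and optionally the LoRA adapters) needs gradients, which keeps the number of trainable parameters small relative to the frozen encoder and language model.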

The model is trained and evaluated on the Hugging Face ChartQA dataset, with additional evaluation on ChartQAPro. Performance is measured using relaxed accuracy (semantic equivalence checking) and exact accuracy.
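
The two metrics can be sketched as follows. This is a hedged example, not the project's evaluation code: it assumes the convention commonly used for ChartQA, where a numeric prediction counts as correct if it is within 5% of the target and non-numeric answers are compared case-insensitively. The function names and the `tolerance` parameter are hypothetical.

```python
import math


def _to_float(text: str):
    """Parse a finite number, tolerating commas and a trailing percent sign."""
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        value = float(cleaned)
    except ValueError:
        return None
    return value if math.isfinite(value) else None


def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers match within a relative tolerance; text must match ignoring case."""
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    return prediction.strip().lower() == target.strip().lower()


def exact_match(prediction: str, target: str) -> bool:
    """Strings must be identical after trimming surrounding whitespace."""
    return prediction.strip() == target.strip()


def accuracy(predictions, targets, match) -> float:
    """Fraction of (prediction, target) pairs accepted by `match`."""
    if not targets:
        return 0.0
    hits = sum(match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)


preds = ["104", "Canada", "7"]
golds = ["100", "canada", "9"]
print(accuracy(preds, golds, relaxed_match))
print(accuracy(preds, golds, exact_match))
```

Reporting both numbers side by side shows how much of the score depends on the tolerance rather than on exactly reproduced answers.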

Core Scripts

| File | Purpose |
| --- | --- |
| `base_architecture.py` | Core VLM architecture: vision-language projector, `CustomVLM` model class, dataset creation, and trainer callbacks. Foundation for all training scripts. |
| `continued_pretraining_and_finetuning.py` | Multi-GPU continued pretraining and LoRA fine-tuning pipeline. Supports OCR augmentation, multi-dataset evaluation (ChartQA + ChartQAPro), and saves full LoRA artifacts per epoch. |
| `finetuning_without_ocr.py` | LoRA fine-tuning on ChartQA without OCR-augmented prompts. Includes `LoRACustomVLM`, eval-only mode, and optional OCR evaluation at inference time. |
| `finetune_ocr.py` | Fine-tuning with EasyOCR-augmented prompts. Precomputes OCR for all splits before training to avoid VRAM contention. |
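
The idea of precomputing OCR once and then injecting the text into prompts can be sketched as below. This is an illustrative example rather than the logic in `finetune_ocr.py`: the prompt template, the JSON cache, and the names `build_ocr_prompt`, `load_or_compute_ocr`, and `run_ocr` are assumptions. With EasyOCR, `run_ocr` could wrap `easyocr.Reader(["en"]).readtext(image, detail=0)`, which returns a plain list of strings.

```python
import json
from pathlib import Path


def build_ocr_prompt(question: str, ocr_tokens: list[str], max_tokens: int = 64) -> str:
    """Prefix the question with deduplicated OCR strings (hypothetical template)."""
    seen = set()
    kept = []
    for token in ocr_tokens:
        token = token.strip()
        if token and token not in seen:
            seen.add(token)
            kept.append(token)
    kept = kept[:max_tokens]  # cap the OCR text so the prompt length stays bounded
    ocr_text = " | ".join(kept) if kept else "(none detected)"
    return f"Text detected in the chart: {ocr_text}\nQuestion: {question}\nAnswer:"


def load_or_compute_ocr(cache_path: Path, image_ids, run_ocr) -> dict:
    """Return {image_id: [strings]}, running OCR only for ids missing from the cache.

    image_ids are assumed to be strings so they survive the JSON round trip.
    """
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    missing = [image_id for image_id in image_ids if image_id not in cache]
    for image_id in missing:
        cache[image_id] = run_ocr(image_id)
    if missing:
        cache_path.write_text(json.dumps(cache))
    return cache


print(build_ocr_prompt("Which year had the highest sales?", ["Sales", "2019", "2020", "Sales"]))
```

Running OCR ahead of time means the OCR model and the VLM never have to share GPU memory during training, which matches the motivation given for `finetune_ocr.py`.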

Getting Started

Prerequisites

Python 3.10+
CUDA 11.8+ (for GPU support)
80 GB+ of GPU memory for full-dataset training (A100 or L40 recommended; a single L40 has 48 GB, so reaching 80 GB on L40s requires more than one GPU)
Access to Hugging Face models and datasets

Installation

Clone the repository:

```
git clone <repository-url>
cd Multimodal-Machine-Learning-Project
```

Create a conda environment:

```
conda create -n mmml python=3.10
conda activate mmml
```

Install dependencies:
```
pip install -r requirements.txt
```

Configure credentials by creating a `config.py` file that defines your WandB API key:
```
WANDB_KEY = "your_wandb_api_key_here"
```
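
One way a script might read this key is sketched below. It is an assumption about usage, not the repository's code: the helper `get_wandb_key` is hypothetical, and the fallback to the `WANDB_API_KEY` environment variable is added here for convenience on machines where `config.py` is absent.

```python
import importlib
import os


def get_wandb_key(module_name: str = "config"):
    """Return WANDB_KEY from config.py if present, else WANDB_API_KEY, else None."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError:
        module = None
    key = getattr(module, "WANDB_KEY", None)
    return key or os.environ.get("WANDB_API_KEY")


# Typical use (requires the wandb package):
#   import wandb
#   wandb.login(key=get_wandb_key())
```

Keeping `config.py` out of version control avoids committing the key by accident.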

References

Vision Model: SigLIP2 (google/siglip2-so400m-patch16-512)
Language Model: Qwen3-4B
ChartQA Dataset: HuggingFaceM4/ChartQA
ChartQAPro Dataset: ahmed-masry/ChartQAPro
OCR: EasyOCR

Contact

For questions about this project, please contact the project maintainers.
