helinatefera/VQAGen

Multimodal Understanding for Amharic Video Question Answering using Bidirectional Cross Modal Attention

Demo

Model Architecture

This project proposes the first Video Question Answering (VideoQA) system tailored for the Amharic language. It integrates multiple modalities—visual frames, object-level features, and text—through a bidirectional cross-modal attention mechanism.

In the model:

  • Text features: Amharic BERT encodes the question.
  • Visual features: Extracted with TimeSformer (CLS token), CLIP, and FastRCNN, then projected into a shared embedding space.
  • Cross-modal attention:
    • Text → Visual: Tokens attend to visual regions.
    • Visual → Text: Visual regions attend to token embeddings.
  • Fusion: Attention outputs are concatenated and passed to a classifier.

🎯 Objectives

  • Build a multimodal QA pipeline for Amharic videos.
  • Apply a novel M-CLIP-based frame selection step.
  • Extract video, object, and text features.
  • Train a cross-modal attention model.
  • Evaluate performance using classification metrics.
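The four classification metrics listed under Training Configuration can be computed as in the minimal sketch below. The README does not say how precision, recall, and F1 are averaged across answer classes, so macro averaging is an assumption here, and the function name `classification_metrics` is hypothetical rather than part of the repository.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]
    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }


# Toy labels for illustration only; these are not results from the project.
metrics = classification_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(metrics)
```

If scikit-learn is available, `sklearn.metrics.precision_recall_fscore_support` with `average="macro"` should give the same values for precision, recall, and F1.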

📁 Dataset Structure

  • videos/: Video frames organized by video_id/
  • CSV files with: video_id, question, answer (Amharic)
  • Feature folders:
    • TimeSformer CLS features
    • FastRCNN object features
    • CLIP frame embeddings
  • Splits: train, val, test
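As a rough illustration of how the QA CSV files might be consumed, the sketch below reads the three columns named above and maps each distinct answer to a class index, which is what a classifier trained with cross-entropy would need. It assumes the CSV has a header row with exactly these column names; the helper names and the sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io


def load_qa_rows(fileobj):
    """Read (video_id, question, answer) triples from an open CSV file with a header row."""
    return [(row["video_id"], row["question"], row["answer"])
            for row in csv.DictReader(fileobj)]


def build_answer_vocab(rows):
    """Map each distinct answer string to a class index (sorted for reproducibility)."""
    answers = sorted({answer for _, _, answer in rows})
    return {answer: index for index, answer in enumerate(answers)}


# Illustrative in-memory sample; a real run would pass open("datasets/qa/train.csv", encoding="utf-8").
sample = io.StringIO(
    "video_id,question,answer\n"
    "vid001,ሰውየው ምን እያደረገ ነው?,መሮጥ\n"
    "vid002,ሴትየዋ ምን እያደረገች ነው?,መዝፈን\n"
    "vid003,ሰውየው ምን እያደረገ ነው?,መሮጥ\n"
)
rows = load_qa_rows(sample)
vocab = build_answer_vocab(rows)
print(len(rows), "rows,", len(vocab), "answer classes")
```

Building the vocabulary from the training split only, and deciding how to handle unseen answers in val and test, is a design choice the README does not specify.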

🧠 Model Summary

  • Text Encoder: Amharic BERT
  • Visual Encoders:
    • TimeSformer (temporal)
    • CLIP (frame-level)
    • FastRCNN (object-level)
  • Fusion: 2-layer bidirectional cross-modal attention (8 heads)
  • Classifier: Linear layer over concatenated fusion output
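The summary above can be sketched in PyTorch roughly as follows. This is not the repository's implementation: the class names, the feature dimensions (768 for text, 512 for visual, 256 shared), the residual connections with layer normalization, the mean pooling before concatenation, and the number of answer classes are all assumptions made for illustration. The sketch also assumes the TimeSformer, CLIP, and FastRCNN features have already been stacked into one visual sequence.

```python
import torch
import torch.nn as nn


class BiCrossModalLayer(nn.Module):
    """One bidirectional cross-attention layer (hypothetical name and layout)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.text_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_visual = nn.LayerNorm(dim)

    def forward(self, text, visual):
        # Text -> Visual: text tokens are the queries, visual features the keys/values.
        text_ctx, _ = self.text_to_visual(text, visual, visual)
        # Visual -> Text: visual features are the queries, text tokens the keys/values.
        visual_ctx, _ = self.visual_to_text(visual, text, text)
        # Residual + LayerNorm is an assumption, not stated in the README.
        return self.norm_text(text + text_ctx), self.norm_visual(visual + visual_ctx)


class CrossModalClassifier(nn.Module):
    """2-layer, 8-head bidirectional fusion followed by a linear classifier."""

    def __init__(self, text_dim=768, visual_dim=512, dim=256,
                 heads=8, layers=2, num_answers=100):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)      # project both modalities
        self.visual_proj = nn.Linear(visual_dim, dim)  # into the same space
        self.layers = nn.ModuleList(BiCrossModalLayer(dim, heads) for _ in range(layers))
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, text, visual):
        t, v = self.text_proj(text), self.visual_proj(visual)
        for layer in self.layers:
            t, v = layer(t, v)
        # Mean-pool each stream (an assumption), then concatenate for the classifier.
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)


model = CrossModalClassifier()
# Batch of 4: 12 question tokens and 16 visual features per example (random stand-ins).
logits = model(torch.randn(4, 12, 768), torch.randn(4, 16, 512))
print(logits.shape)
```

The logits have one column per answer class, so they can be passed directly to `nn.CrossEntropyLoss` together with integer answer indices.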

🛠 Training Configuration

  • Loss: CrossEntropy
  • Optimizer: Adam
  • Batch Size: 32
  • Epochs: 50 (early stop at 5)
  • Metrics: Accuracy, Precision, Recall, F1-score
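One plausible reading of "early stop at 5" is a patience of five epochs without improvement in validation loss. The sketch below illustrates that reading only; the `EarlyStopping` class, the monitored quantity, and the loss values are assumptions, not code or numbers from the repository.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


MAX_EPOCHS = 50
# Made-up validation losses standing in for a real training loop.
fake_val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.6]

stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, val_loss in enumerate(fake_val_losses[:MAX_EPOCHS], start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "best loss", stopper.best)
```

In a real loop, the model checkpoint would typically be saved whenever `best` improves so the best weights can be restored after stopping.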

📦 Clone the Repository

git clone https://github.com/helinatefera/VQAGen
cd VQAGen

📥 Download the Dataset and Extracted Video Features

Download the dataset from Hugging Face:

👉 HuggingFace Dataset Link

After downloading, extract everything and place it into the datasets folder:

mkdir -p datasets
# Move all downloaded contents into datasets/

Expected structure:

datasets/
├── qa/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── obj_feat/
│   └── *.pkl
└── clip-rcnn-attn/
    └── *.pkl
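Before training, it can help to confirm that the download landed in the expected layout. The following is a small sketch that checks only the paths shown in the tree above; the function name `missing_dataset_parts` is hypothetical and the check says nothing about the contents of the files.

```python
from pathlib import Path


def missing_dataset_parts(root="datasets"):
    """Return the expected files or patterns that are absent under `root`."""
    root = Path(root)
    missing = []
    # The three QA splits are expected under qa/.
    for split in ("train", "val", "test"):
        if not (root / "qa" / f"{split}.csv").is_file():
            missing.append(f"qa/{split}.csv")
    # Each feature folder should hold at least one .pkl file.
    for folder in ("obj_feat", "clip-rcnn-attn"):
        directory = root / folder
        if not (directory.is_dir() and any(directory.glob("*.pkl"))):
            missing.append(f"{folder}/*.pkl")
    return missing


problems = missing_dataset_parts("datasets")
print("dataset layout OK" if not problems else f"missing: {problems}")
```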

📥 Single-Video Frame Sampling and Feature Extraction

To test the full pipeline on a single video, follow these steps:

  1. Download the Amharic caption dataset from Hugging Face:

👉 Download MSDV Amharic Captions

Save the file as:


video_captions/MSDV_amharic_caption.txt

  2. Run the following scripts, in order, on any video file:
    • Frame extraction
    • Best 16 frame selection (M-CLIP based)
    • Object detection and labeling
    • TimeSformer feature extraction
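The README does not describe how the best 16 frames are chosen. One common approach, shown here purely as an assumption, is to embed each frame and the caption in a shared space (for example with M-CLIP) and keep the frames whose embeddings are most similar to the caption. The sketch takes precomputed embeddings as plain lists, so no model is loaded; `select_frames` and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(frame_embeddings, text_embedding, k=16):
    """Return indices of the k frames most similar to the text, in temporal order."""
    scores = [cosine(frame, text_embedding) for frame in frame_embeddings]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return sorted(top)  # restore temporal order for the video encoder


# Toy 2-D embeddings for illustration; real embeddings would come from the encoder.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
caption = [1.0, 0.0]
chosen = select_frames(frames, caption, k=2)
print(chosen)
```

Returning the indices in temporal order keeps the selected frames usable by a temporal model such as TimeSformer; if the video has fewer than `k` frames, all of them are returned.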

For reference scripts, see:

👉 View Scripts on GitHub

🚀 Train the Model

python -m vqagen

🧼 License

MIT License © helinatefera

📞 Contact

👤 Helina Tefera
✉️ E-Mail
📱 Phone
