Multimodal Understanding for Amharic Video Question Answering using Bidirectional Cross Modal Attention

This project proposes the first Video Question Answering (VideoQA) system tailored to the Amharic language. It integrates three modalities (visual frames, object-level features, and text) through a bidirectional cross-modal attention mechanism.

In the model:

Text features: Amharic BERT encodes the question.
Visual features: Extracted with TimeSformer (CLS token), CLIP, and FastRCNN, then projected into a shared embedding space.
Cross-modal attention:
- Text → Visual: Tokens attend to visual regions.
- Visual → Text: Visual regions attend to token embeddings.
Fusion: Attention outputs are concatenated and passed to a classifier.
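
The attention flow listed above can be sketched in a few lines. This is a minimal NumPy illustration, not the repository's implementation. It assumes a single head with no learned query/key/value weights, random placeholder features, hypothetical input widths (768, 512, 2048), a hypothetical shared width of 64, and mean pooling before concatenation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention: each query row attends over keys_values."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values, weights

def bidirectional_fusion(text, visual):
    """Run both attention directions, mean-pool each result, and concatenate."""
    text_ctx, _ = cross_attention(text, visual)     # Text -> Visual
    visual_ctx, _ = cross_attention(visual, text)   # Visual -> Text
    return np.concatenate([text_ctx.mean(axis=0), visual_ctx.mean(axis=0)])

rng = np.random.default_rng(0)
d_model = 64  # hypothetical shared width

# Placeholder features; the widths 768/512/2048 are illustrative assumptions.
sources = {
    "timesformer": rng.normal(size=(1, 768)),
    "clip": rng.normal(size=(16, 512)),
    "rcnn": rng.normal(size=(10, 2048)),
}
# One random linear projection per source stands in for a learned layer.
visual = np.concatenate([
    feats @ (rng.normal(size=(feats.shape[1], d_model)) / np.sqrt(feats.shape[1]))
    for feats in sources.values()
])                                      # (27, 64)
text = rng.normal(size=(12, d_model))   # stand-in for 12 question-token embeddings

fused = bidirectional_fusion(text, visual)
print(fused.shape)  # (128,)
```

In a trained model the fused vector would feed the classifier; here it only demonstrates the shapes involved in each direction.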

🎯 Objectives

Build a multimodal QA pipeline for Amharic videos.
Apply a novel M-CLIP-based frame selection step.
Extract video, object, and text features.
Train a cross-modal attention model.
Evaluate performance using classification metrics.

📁 Dataset Structure

videos/: Video frames organized by video_id/
CSV files with: video_id, question, answer (Amharic)
Feature folders:
- TimeSformer CLS features
- FastRCNN object features
- CLIP frame embeddings
Splits: train, val, test
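
As a rough illustration of how the QA files could be consumed, the sketch below reads the three columns named above and maps each distinct answer to a class index, which is one plausible way to feed a classifier. The helper names and the sample rows are invented for this example; the repository's own loader may differ.

```python
import csv
import io

def load_qa_rows(csv_file):
    """Read QA rows, keeping only the three documented columns."""
    reader = csv.DictReader(csv_file)
    return [
        {"video_id": r["video_id"], "question": r["question"], "answer": r["answer"]}
        for r in reader
    ]

def build_answer_vocab(rows):
    """Map each distinct answer string to a class index, in first-seen order."""
    vocab = {}
    for row in rows:
        vocab.setdefault(row["answer"], len(vocab))
    return vocab

# Tiny in-memory stand-in for a split file (invented rows). A real file would
# be opened with open(path, encoding="utf-8", newline="").
sample = io.StringIO(
    "video_id,question,answer\n"
    "vid001,ሰውየው ምን እያደረገ ነው?,መሮጥ\n"
    "vid002,ምን እንስሳ ይታያል?,ውሻ\n"
    "vid003,ልጁ ምን እያደረገ ነው?,መሮጥ\n"
)
rows = load_qa_rows(sample)
answer_to_idx = build_answer_vocab(rows)
labels = [answer_to_idx[r["answer"]] for r in rows]
print(labels)  # [0, 1, 0]
```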

🧠 Model Summary

Text Encoder: Amharic BERT
Visual Encoders:
- TimeSformer (temporal)
- CLIP (frame-level)
- FastRCNN (object-level)
Fusion: 2-layer bidirectional cross-modal attention (8 heads)
Classifier: Linear layer over concatenated fusion output

🛠 Training Configuration

Loss: CrossEntropy
Optimizer: Adam
Batch Size: 32
Epochs: 50 (early stopping, patience 5)
Metrics: Accuracy, Precision, Recall, F1-score
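
The stopping rule and the metrics above might look like the following in plain Python. Two assumptions are made here: "early stop at 5" is read as a patience of 5 epochs without validation improvement, and precision, recall, and F1 are macro-averaged (the configuration does not state the averaging mode). The class and function names are hypothetical.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up validation accuracies, one per epoch, to exercise the stopping rule.
val_scores = [0.40, 0.52, 0.55, 0.54, 0.55, 0.53, 0.55, 0.54, 0.60]
stopper = EarlyStopping(patience=5)
stopped_at = None
for epoch, score in enumerate(val_scores, start=1):
    if stopper.step(score):
        stopped_at = epoch
        break
print(stopped_at)  # 8

metrics = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(metrics["accuracy"])  # 0.75
```

Note that a tie with the best score counts as no improvement in this sketch, which is why training stops at epoch 8 before the later 0.60 is seen.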

📦 Clone the Repository

git clone https://github.com/helinatefera/VQAGen
cd VQAGen

📥 Download the Dataset and Extracted Video Features

Download the dataset from Hugging Face:

👉 HuggingFace Dataset Link

After downloading, extract everything and place it into the datasets folder:

mkdir -p datasets
# Move all downloaded contents into datasets/

Expected structure:

datasets/
├── qa/
│   ├── train.csv
│   ├── val.csv
│   └── test.csv
├── obj_feat/
│   └── *.pkl
└── clip-rcnn-attn/
    └── *.pkl
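
Before training, it can help to confirm that the download landed in the expected layout. The helper below is a hypothetical convenience script, not part of the repository; it only checks for the files and folders shown in the tree above.

```python
from pathlib import Path

def check_dataset_layout(root):
    """Return a list of human-readable problems; an empty list means the layout matches."""
    root = Path(root)
    problems = []
    # The three QA split files.
    for split in ("train", "val", "test"):
        if not (root / "qa" / f"{split}.csv").is_file():
            problems.append(f"missing qa/{split}.csv")
    # Each feature folder should hold at least one pickle file.
    for folder in ("obj_feat", "clip-rcnn-attn"):
        folder_path = root / folder
        if not folder_path.is_dir() or not any(folder_path.glob("*.pkl")):
            problems.append(f"no .pkl files in {folder}/")
    return problems

if __name__ == "__main__":
    for problem in check_dataset_layout("datasets"):
        print(problem)
```

Running it from the repository root prints one line per missing item and nothing when the layout is complete.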

📥 One-Video Frame Sampling and Feature Extraction

To test the full pipeline on a single video, follow these steps:

Download the Amharic caption dataset from Hugging Face:

👉 Download MSDV Amharic Captions

Save the file as:

video_captions/MSDV_amharic_caption.txt

Then run the following scripts in order on any video file:
- Frame extraction
- Selection of the best 16 frames (M-CLIP-based)
- Object detection and labeling
- TimeSformer feature extraction
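
The frame selection step can be pictured as ranking frames by how well they match the caption. The sketch below is a guess at that idea, not the repository's script: it assumes each frame already has an image embedding and the caption a text embedding in the same space, scores frames by cosine similarity, and returns the top 16 in temporal order. The function names and the toy 2-D vectors are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_best_frames(frame_embeddings, caption_embedding, k=16):
    """Return indices of the k frames most similar to the caption, in temporal order."""
    scores = [cosine(f, caption_embedding) for f in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy embeddings; real ones would come from the image and text encoders.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
caption = [1.0, 0.0]
print(select_best_frames(frames, caption, k=2))  # [0, 3]
```

Returning the indices in temporal order keeps the selected frames usable as a short clip for a video encoder.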

For reference scripts, see:

👉 View Scripts on GitHub

🚀 Train the Model

python -m vqagen

🧼 License

📞 Contact

👤 Helina Tefera
✉️ E-Mail
📱 Phone

