Multimodal Understanding for Amharic Video Question Answering using Bidirectional Cross-Modal Attention
This project proposes the first Video Question Answering (VideoQA) system tailored for the Amharic language. It integrates multiple modalities—visual frames, object-level features, and text—through a bidirectional cross-modal attention mechanism.
In the model:
- Text features: Amharic BERT encodes the question.
- Visual features: Extracted with TimeSformer (CLS token), CLIP, and FastRCNN, then projected into the same embedding space as the text features.
- Cross-modal attention:
  - Text → Visual: Tokens attend to visual regions.
  - Visual → Text: Visual regions attend to token embeddings.
- Fusion: Attention outputs are concatenated and passed to a classifier.
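The attention scheme described above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's actual model: the class names, the residual connections with layer normalization, the mean pooling before concatenation, and the toy dimensions are all assumptions made for the example. Only the two attention directions, the 8 heads, the 2 layers, and the concatenate-then-classify fusion come from this README.

```python
import torch
import torch.nn as nn


class BiCrossModalLayer(nn.Module):
    """One bidirectional cross-modal attention layer (hypothetical sketch)."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.text_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Residual + LayerNorm is an assumption, not stated in the README.
        self.norm_text = nn.LayerNorm(dim)
        self.norm_visual = nn.LayerNorm(dim)

    def forward(self, text, visual):
        # Text -> Visual: tokens are queries, visual features are keys/values.
        t, _ = self.text_to_visual(text, visual, visual)
        # Visual -> Text: visual features are queries, tokens are keys/values.
        v, _ = self.visual_to_text(visual, text, text)
        return self.norm_text(text + t), self.norm_visual(visual + v)


class BiCrossModalClassifier(nn.Module):
    """Stacked layers, then concatenation and a linear classifier."""

    def __init__(self, dim=768, heads=8, layers=2, num_answers=1000):
        super().__init__()
        self.layers = nn.ModuleList(
            [BiCrossModalLayer(dim, heads) for _ in range(layers)]
        )
        self.classifier = nn.Linear(2 * dim, num_answers)

    def forward(self, text, visual):
        for layer in self.layers:
            text, visual = layer(text, visual)
        # Mean pooling over the sequence axis is an assumption.
        fused = torch.cat([text.mean(dim=1), visual.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Toy shapes: the real features would come from the encoders, already
# projected to a common dimension.
model = BiCrossModalClassifier(dim=64, heads=8, layers=2, num_answers=10)
text = torch.randn(4, 12, 64)    # (batch, question tokens, dim)
visual = torch.randn(4, 20, 64)  # (batch, visual features, dim)
logits = model(text, visual)
print(logits.shape)
```

Each direction uses its own attention module, so the two modalities do not share attention weights.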
- Build a multimodal QA pipeline for Amharic videos.
- Apply novel frame selection using MCLIP.
- Extract video, object, and text features.
- Train a cross-modal attention model.
- Evaluate performance using classification metrics.
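As a rough illustration of the classification metrics mentioned above, the sketch below computes accuracy together with macro-averaged precision, recall, and F1 in plain Python. The function name `macro_metrics`, the choice of macro averaging, and the sample labels are assumptions; the README does not say which averaging the project uses.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1 (sketch)."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Made-up answer labels, for illustration only.
y_true = ["a", "a", "b", "c"]
y_pred = ["a", "b", "b", "c"]
metrics = macro_metrics(y_true, y_pred)
print(metrics)
```

Macro averaging weights every answer class equally, which can matter when a few answers dominate the dataset; a weighted average would be an equally reasonable choice.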
- videos/: Video frames organized by video_id/
- CSV files with: video_id, question, answer (Amharic)
- Feature folders:
  - TimeSformer CLS features
  - FastRCNN object features
  - CLIP frame embeddings
- Splits: train, val, test
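A minimal sketch of reading one of the QA CSV files, assuming it has a header row with the three columns listed above. The helper names `load_qa` and `build_answer_vocab` are hypothetical, the sample rows are invented, and mapping each distinct answer to a class index is an assumption based on the classifier described in this README.

```python
import csv
import io

COLUMNS = ("video_id", "question", "answer")


def load_qa(fileobj):
    """Read QA rows, checking that the expected columns are present."""
    reader = csv.DictReader(fileobj)
    missing = set(COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [{name: row[name].strip() for name in COLUMNS} for row in reader]


def build_answer_vocab(rows):
    """Map each distinct answer string to a class index."""
    answers = sorted({row["answer"] for row in rows})
    return {answer: index for index, answer in enumerate(answers)}


# Invented sample standing in for a real split file.
sample = io.StringIO(
    "video_id,question,answer\n"
    "vid001,ሰውየው ምን እያደረገ ነው?,መሮጥ\n"
    "vid002,ስንት ሰዎች አሉ?,ሁለት\n"
)
rows = load_qa(sample)
vocab = build_answer_vocab(rows)
print(len(rows), len(vocab))
```

For a real file, open it with `open(path, encoding="utf-8", newline="")` so that the Amharic text is decoded correctly.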
- Text Encoder: Amharic BERT
- Visual Encoders:
  - TimeSformer (temporal)
  - CLIP (frame-level)
  - FastRCNN (object-level)
- Fusion: 2-layer bidirectional cross-modal attention (8 heads)
- Classifier: Linear layer over concatenated fusion output
- Loss: CrossEntropy
- Optimizer: Adam
- Batch Size: 32
- Epochs: 50 (early stop at 5)
- Metrics: Accuracy, Precision, Recall, F1-score
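The "early stop at 5" setting above is read here as early stopping with a patience of 5 epochs on validation loss; that reading is an assumption, as are the `EarlyStopping` class and the made-up loss values. A minimal sketch of how such a check could sit inside a 50-epoch loop:

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


MAX_EPOCHS = 50
stopper = EarlyStopping(patience=5)

# Made-up validation losses standing in for a real training run.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 10

stopped_at = None
for epoch, val_loss in enumerate(val_losses[:MAX_EPOCHS], start=1):
    # A real loop would train for one epoch and evaluate here.
    if stopper.step(val_loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In practice the model weights from the best epoch would also be saved whenever the validation loss improves.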
git clone https://github.com/helinatefera/VQAGen
cd VQAGen
Download the dataset from Hugging Face:
After downloading, extract everything and place it into the datasets folder:
mkdir -p datasets
# Move all downloaded contents into datasets/
Expected structure:
datasets/
├── qa/
│ ├── train.csv
│ ├── val.csv
│ └── test.csv
├── obj_feat/
│ ├── *.pkl
├── clip-rcnn-attn/
│ ├── *.pkl
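Before training, it may help to confirm that the extracted files match the structure shown above. The sketch below is a hypothetical helper (`check_layout` is not part of the repository) that checks only the files and folders named in this README.

```python
from pathlib import Path


def check_layout(root):
    """Return a list of problems found under the datasets folder."""
    root = Path(root)
    problems = []

    # The three QA split files.
    for split in ("train", "val", "test"):
        if not (root / "qa" / f"{split}.csv").is_file():
            problems.append(f"missing qa/{split}.csv")

    # Feature folders are expected to hold at least one .pkl file each.
    for folder in ("obj_feat", "clip-rcnn-attn"):
        directory = root / folder
        if not directory.is_dir() or not any(directory.glob("*.pkl")):
            problems.append(f"no .pkl files in {folder}/")

    return problems


problems = check_layout("datasets")
print(problems if problems else "datasets/ layout looks complete")
```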
To test the full pipeline on a single video, follow these steps:
- Download the Amharic caption dataset from Hugging Face:
👉 Download MSDV Amharic Captions
Save the file as:
video_captions/MSDV_amharic_caption.txt
- Use any video file to run the following step-by-step scripts:
  - Frame extraction
  - Best 16 frame selection (M-CLIP based)
  - Object detection and labeling
  - TimeSformer feature extraction
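The README does not describe how the M-CLIP based selection scores frames. One plausible reading, shown here purely as an assumption, is to embed each frame and a caption in the same space, rank frames by cosine similarity to the caption, and keep the top 16 in temporal order. The function names and the toy 2-D vectors below are illustrative; real embeddings would come from the M-CLIP model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_frames(frame_embeddings, text_embedding, k=16):
    """Return indices of the k frames most similar to the text.

    Indices are returned in ascending (temporal) order so that the
    selected frames still form a coherent sequence.
    """
    scores = [cosine(frame, text_embedding) for frame in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy embeddings, for illustration only.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
caption = [1.0, 0.0]
selected = select_frames(frames, caption, k=2)
print(selected)
```

If a video has fewer than `k` frames, this sketch simply returns all of them.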
For reference scripts, see:
python -m vqagen
MIT License © helinatefera
👤 Helina Tefera
