Source code for our paper: MemoryCard: Video Memory Augmentation for Long-Video Question Answering
Click the links below to view our paper and project resources:
If you find this work useful, please cite our paper and give us a shining star 🌟
@misc{yang2026memorycardtopicawaremultimodalclue,
title={MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering},
author={Qing Yang and Pengcheng Huang and Xinze Li and Zhenghao Liu and Yukun Yan and Yu Gu and Ge Yu and Gang Li and Maosong Sun},
year={2026},
eprint={2606.05917},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.05917},
}
MemoryCard is a video-memory-based augmentation framework for long-video question answering. Long videos usually contain sparse and temporally scattered evidence, while most frames are redundant. Instead of using isolated frames as evidence, MemoryCard organizes a video into semantically coherent Memory Cards. Each card contains event-level visual context, representative moments, and aligned speech cues, enabling VLMs to answer long-video questions with compact and high-density multimodal evidence.
The full pipeline contains three stages:
Long Video
│
├── Step 1: Extract ASR
│ └── OUT_ROOT/asr/{video_id}.json
│
├── Step 2: Build Memory Cards
│ ├── OUT_ROOT/cards/{video_id}/slot_XXXX_kf_YY.jpg
│ ├── OUT_ROOT/memory/{video_id}.json
│ └── OUT_ROOT/selfread_raw/{video_id}.txt
│
└── Step 3: Evaluate with lmms-eval
└── qwen3_vl_w_cards_videomme
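Before running Step 3, it can help to confirm that the per-video artifacts in the layout above are present. The following is a minimal sketch; `check_video_outputs` is a hypothetical helper that is not part of the repository, and it assumes only the directory layout shown in the tree.

```shell
# Hypothetical helper (not part of the repository): report which per-video
# artifacts of Steps 1 and 2 exist under OUT_ROOT.
# Usage: check_video_outputs OUT_ROOT VIDEO_ID
check_video_outputs() {
    local out_root="$1" video_id="$2" missing=0 path
    # Single-file artifacts: ASR transcript, memory JSON, raw self-read text.
    for path in \
        "$out_root/asr/$video_id.json" \
        "$out_root/memory/$video_id.json" \
        "$out_root/selfread_raw/$video_id.txt"; do
        if [ -f "$path" ]; then
            echo "ok       $path"
        else
            echo "missing  $path"
            missing=1
        fi
    done
    # Card images live in one directory per video.
    if [ -d "$out_root/cards/$video_id" ]; then
        echo "ok       $out_root/cards/$video_id/"
    else
        echo "missing  $out_root/cards/$video_id/"
        missing=1
    fi
    # Exit status is 0 only when every artifact was found.
    return "$missing"
}
```

For example, `check_video_outputs /path/to/output/videomme_memory some_video_id` prints one line per artifact and returns a non-zero status if any is missing.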
Note: This README uses Qwen3-VL as the backbone model and Video-MME as the example benchmark. The pipeline can be adapted to other VLMs or long-video QA benchmarks by modifying the corresponding model wrapper, dataset loader, and running scripts.
Use git clone to download this project
git clone https://github.com/NEUIR/MemoryCard.git
cd MemoryCard
Create the environment
We provide the conda environment file used in our experiments.
conda env create -f memorycard.yml
conda activate memorycard
You also need to prepare the following external dependencies and checkpoints:
- ffmpeg
- Qwen3-VL
- Qwen3-ASR
- Qwen3 ForcedAligner
- LongCLIP
Please make sure the corresponding model paths are correctly set at the top of the running scripts in scripts/launch/.
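A small preflight check can catch a missing dependency or a mistyped path before a long run starts. This is a sketch only: `require_cmd` and `require_path` are hypothetical helpers, not functions provided by the launch scripts.

```shell
# Hypothetical helpers (not part of the repository) for a preflight check.

# Succeed only if the given command is available on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || {
        echo "missing command: $1" >&2
        return 1
    }
}

# Succeed only if the given path exists.
# $1 = variable name (used in the message), $2 = its value.
require_path() {
    [ -e "$2" ] || {
        echo "$1 does not exist: $2" >&2
        return 1
    }
}
```

A typical use, placed after the path section of a launch script, would be `require_cmd ffmpeg && require_path VLM_MODEL_DIR "$VLM_MODEL_DIR" || exit 1`.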
For Step 1, first edit the path section at the top of:
scripts/launch/run_step1_extract_asr.sh
You need to set:
MEMORYCARD_ROOT=/path/to/MemoryCard
ASR_REPO_DIR=/path/to/Qwen3-ASR-main
QWEN3_ASR_MODEL_DIR=/path/to/Qwen3-ASR-1.7B
FORCED_ALIGNER_MODEL_DIR=/path/to/Qwen3-ForcedAligner-0.6B
DATA_JSONL=/path/to/Video-MME/test-00000-of-00001.jsonl
VIDEO_DATA_DIR=/path/to/Video-MME/data
OUT_ROOT=/path/to/output/videomme_memory
Then run:
conda activate memorycard
bash scripts/launch/run_step1_extract_asr.sh
The generated ASR files will be saved to:
OUT_ROOT/asr/{video_id}.json
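To see which videos still lack an ASR file after Step 1, one option is to compare the video directory against `OUT_ROOT/asr`. The sketch below assumes that each video is stored as `{video_id}.mp4` directly inside `VIDEO_DATA_DIR`; that naming is an assumption, and `list_missing_asr` is a hypothetical helper, so adjust both to your copy of the dataset.

```shell
# Hypothetical helper (not part of the repository): print the IDs of videos
# that have no ASR output yet.
# Assumption: videos are named {video_id}.mp4 directly under the video dir.
# Usage: list_missing_asr VIDEO_DATA_DIR OUT_ROOT
list_missing_asr() {
    local video_dir="$1" out_root="$2" video video_id
    for video in "$video_dir"/*.mp4; do
        # Skip the literal pattern when the glob matched nothing.
        [ -e "$video" ] || continue
        video_id=$(basename "$video" .mp4)
        [ -f "$out_root/asr/$video_id.json" ] || echo "$video_id"
    done
}
```

An empty output means every video found in the directory has a matching `{video_id}.json`.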
For Step 2, first edit the path section at the top of:
scripts/launch/run_step2_build_memory.sh
You need to set:
MEMORYCARD_ROOT=/path/to/MemoryCard
QWEN3_VL_REPO=/path/to/Qwen3-VL-main
VLM_MODEL_DIR=/path/to/Qwen3-VL-8B-Instruct
DATA_JSONL=/path/to/Video-MME/videomme/test-00000-of-00001.parquet
VIDEO_DATA_DIR=/path/to/Video-MME/data
OUT_ROOT=/path/to/output/videomme_memory
Then run:
conda activate memorycard
bash scripts/launch/run_step2_build_memory.sh
The generated Memory Cards will be saved to:
OUT_ROOT/cards/{video_id}/slot_XXXX_kf_YY.jpg
OUT_ROOT/memory/{video_id}.json
OUT_ROOT/selfread_raw/{video_id}.txt
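The card filenames can be summarized per video to spot empty or truncated outputs. The sketch below assumes that `XXXX` in `slot_XXXX_kf_YY.jpg` identifies a slot and `YY` a key frame within it, which is inferred from the filename pattern rather than stated here; `summarize_cards` is a hypothetical helper.

```shell
# Hypothetical helper (not part of the repository): count card images and
# distinct slots in one video's card directory.
# Assumption: files are named slot_XXXX_kf_YY.jpg, with XXXX the slot index.
# Usage: summarize_cards OUT_ROOT/cards/VIDEO_ID
summarize_cards() {
    local card_dir="$1" n_files n_slots
    # Total number of card images.
    n_files=$(find "$card_dir" -maxdepth 1 -type f -name 'slot_*_kf_*.jpg' \
        | wc -l | tr -d ' ')
    # Number of distinct slot indices extracted from the filenames.
    n_slots=$(find "$card_dir" -maxdepth 1 -type f -name 'slot_*_kf_*.jpg' \
        | sed -E 's#.*/slot_([^_]+)_kf_[^_]+\.jpg$#\1#' \
        | sort -u | wc -l | tr -d ' ')
    echo "slots=$n_slots images=$n_files"
}
```

Running it over every directory in `OUT_ROOT/cards` gives a quick view of how many cards each video produced.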
For Step 3, first edit the path section at the top of:
scripts/launch/run_step3_eval_cards.sh
You need to set:
MEMORYCARD_ROOT=/path/to/MemoryCard
PRETRAINED=/path/to/Qwen3-VL-8B-Instruct
LONGCLIP_REPO=/path/to/Long-CLIP-main
LONGCLIP_MODEL=/path/to/longclip-L.pt
CARDS_ROOT=/path/to/output/videomme_memory/cards
OUTPUT_PATH=/path/to/output/eval_logs
Then run:
conda activate memorycard
bash scripts/launch/run_step3_eval_cards.sh
By default, we use the following Memory Card retrieval budget:
max_num_frames = 128
high_frames = 4
mid_frames = 8
low_frames = 32
sample_frames = 8
You can modify these values directly in scripts/launch/run_step3_eval_cards.sh.
We provide two Qwen3-VL baselines.
For the image baseline, edit paths in:
scripts/launch/run_baseline_image.sh
Then run:
conda activate memorycard
bash scripts/launch/run_baseline_image.sh
For the video baseline, edit paths in:
scripts/launch/run_baseline_video.sh
Then run:
conda activate memorycard
bash scripts/launch/run_baseline_video.sh
The retrieval module uses LongCLIP to match questions with Memory Cards. Please make sure these two paths are correctly set in the Step 3 script:
LONGCLIP_REPO=/path/to/Long-CLIP-main
LONGCLIP_MODEL=/path/to/longclip-L.pt
Step 3 can save visualization images for retrieved cards. The selected high-, mid-, and low-resolution cards are saved to:
OUTPUT_PATH/debug_retrieval_cards/{LOG_SUFFIX}/
To disable the debug dump, set this variable in the Step 3 script:
DEBUG_DUMP_DIR=""
If you have questions, suggestions, or bug reports, please contact:
yangqing_neu@outlook.com
