This repository contains the source code for a deep learning-based video person re-identification (Re-ID) system. The model processes video tracklets to identify and track individuals across multiple non-overlapping camera views, and is designed to handle occlusion, background clutter, and temporal dynamics.
The spatial-temporal architecture consists of three main components:
- Visual Backbone (ResNet): Extracts high-level spatial features from individual video frames.
- Spatial Relation-Aware Attention (SRA): Highlights discriminative foreground regions while suppressing irrelevant background noise in each frame.
- Pyramid Spatial-Temporal Alignment (PSTA): Models temporal dependencies and aligns features across frames to build a cohesive representation of the person, even when the person is partially occluded in parts of the tracklet.
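
The data flow through the three components can be illustrated with a minimal, dependency-free sketch. It is not this repository's implementation: it starts from per-frame feature vectors that a backbone such as ResNet would produce, replaces SRA with a simple softmax attention over spatial locations, and replaces PSTA with a plain temporal average. The function names (`attend_frame`, `aggregate_tracklet`) and the toy input are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def attend_frame(frame_features):
    """Pool one frame's N spatial feature vectors (each of length C) into one C-vector.

    Assumed stand-in for SRA: each location is scored by its dot product
    with the frame's mean feature, and the scores are softmax-normalized.
    """
    context = mean_vector(frame_features)
    weights = softmax([dot(f, context) for f in frame_features])
    return [
        sum(w * f[c] for w, f in zip(weights, frame_features))
        for c in range(len(context))
    ]


def aggregate_tracklet(tracklet_features):
    """Fuse T frames into one tracklet-level embedding.

    Assumed stand-in for PSTA: a plain average over the attended frame vectors.
    """
    frame_vectors = [attend_frame(frame) for frame in tracklet_features]
    return mean_vector(frame_vectors)


# Toy tracklet: T=2 frames, N=3 spatial locations per frame, C=2 channels.
tracklet = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 0.2]],
    [[0.8, 0.0], [1.0, 0.2], [0.1, 0.1]],
]
embedding = aggregate_tracklet(tracklet)
print(len(embedding))  # one value per channel
```

In a Re-ID setting, tracklet-level embeddings of this kind are typically compared across cameras with a distance metric; the actual attention and alignment modules are considerably more involved than the averaging used here.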