This document is the complete system description for the Precision Sport Science Project (Year 3), split into System Development and Model Design.
-
Entry file:
app.py(Streamlit multipage) -
Main pages:
project_page/Home.pyproject_page/CoLab_project.pyproject_page/Project_2nd.pyproject_page/project_3rd.py(key workflow for this project)
pip install -r requirements.txtstreamlit run app.pyIf you want to use VQA / feature extraction / DVCA backend inference, start the worker services first:
bash lanuch_server.shAfter the UI starts, go to the
project_3rdpage to upload videos and run VQA and Highlight.
File location: project_page/project_3rd.py
-
Select built-in videos or upload a local mp4
-
Sidebar shows the current video
-
Main content tabs:
- Badminton Tactic Analyst
- Badminton Highlight Director
-
Built-in videos: selected from the
GAME_INFOdictionary -
Upload your own: use
st.file_uploaderto upload.mp4- Uploaded video is temporarily stored in
tmp/
- Uploaded video is temporarily stored in
- Video temp:
tmp/ - TrackNet CSV:
tmp_video/tracknet_csv/ - HitFrame results:
tmp_hitframe/hitframe_output/
VQA APIs
backend.load_VQA_enginebackend.VQA_stream(token-by-token streaming)backend.uploaded_VQA_stream
Diversity Highlight (DVCA)
backend.Diverse_Video_Clip_Retrievebackend.Diverse_Video_Clip_Samplingbackend.Diverse_Video_Clip_Concat
- After
backend.VQA_stream(...)assemblesclip_paths, it calls:engine.predict_stream(clip_paths, prompt, task_id="default")
engine.predict_stream(...)returns a token generator that yields(raw_token, token_id)step by step.- The UI does not wait for full output; it updates the screen as each token arrives.
- In
function_button.VQA():raw_generator = self.do_VQA_stream(vid_path, prompt)for raw_token, token_id in raw_generator:iterate token by token- Parse tags on the fly: remove
<thinking>,</thinking>,<answer>,</answer> - Update different blocks based on state (THINKING/ANSWER)
- After each token, immediately call
markdown(... + "▌")
engine.predict_streamstreams tokens instead of returning a full string at once.- Streamlit
st.write_stream/markdowncan refresh the UI inside the loop.
- The UI accumulates all
token_ids and usesengine.model.t5_tokenizer.decode(...)to decode again. parse_and_format_response(...)cleans and normalizes the output format.
This project uses "multiple worker processes + one GPU per worker", with HTTP load sharing in parallel.
lanuch_server.sh starts multiple FastAPI workers with different CUDA_VISIBLE_DEVICES (e.g., 8001–8005):
CUDA_VISIBLE_DEVICES=1 uvicorn worker_service:app --host 0.0.0.0 --port 8001 --workers 1 &
CUDA_VISIBLE_DEVICES=2 uvicorn worker_service:app --host 0.0.0.0 --port 8002 --workers 1 &
.../predict: VQA inference (ENGINE.predict(...))/encode_features: feature extraction (FE.encode_video(...))
In
worker_service.py,_DEVICE="cuda:0"; but since each process only sees one GPU (limited byCUDA_VISIBLE_DEVICES), it effectively binds to the assigned GPU.
Distributed flow (VQA / Temporal Grounding example):
- Discover available workers
- Probe each worker and collect reachable ones (defaults 8001–8007).
- Sharding
- Round-robin distribute questions/clip indices across workers to balance load.
- Bucketize by K
- Group samples per worker by K so each batch has similar K, reducing padding.
- Micro-batching
- Split each worker's samples into micro-batches by
batch_size_per_worker.
- Split each worker's samples into micro-batches by
- Async parallelism + flow control
- Send requests asynchronously; each worker caps concurrent batches with
max_in_flightto avoid GPU spikes.
- Send requests asynchronously; each worker caps concurrent batches with
- Retry and collect results
- Failed batches retry (with backoff); collect all results and restore original order by index.
Distributed flow:
- Discover available workers (same as VQA).
- Round-robin split video paths to each worker.
- Each worker forms micro-batches with
batch_size_per_workerand sends them. - Collect all features, restore order by index, and compose
[N, D].
The core goal is to align natural language queries to semantic events and time segments in badminton match videos, especially fine-grained actions like stroke / rally.
The model uses a three-stage training strategy (Stage 0–2), progressively building action-aware visual representations -> vision-language alignment -> semantic temporal localization reasoning.
The model consists of three main modules:
-
Visual Encoder:
- Converts video frame sequences into high-dimensional visual token representations.
- Serves as the visual backbone for all downstream modules.
-
Q-former:
- Uses learnable query tokens to extract key semantic information from visual features.
- Effectively compresses long videos before passing to the LLM.
-
LLM:
- Performs high-level semantic reasoning and natural language answer generation.
- Outputs event descriptions and implicit temporal localization results.





