This repository contains the codebase for the VILMA (VIsion Language MAnipulation) project, created by the CAIR (Cognitive Artificial Intelligence and Robotics) research group at the CYENS Centre of Excellence. This work was supported by the euROBIN project under the 3rd Open Call for Technology Exchange Programme (Project: "VILMA - Advancing Robotic Manipulation: A Handheld Gripper and Vision-Language Dataset").
The repository provides the hardware documentation and software framework for multimodal manipulation data collection using a head-mounted camera, a microphone, and two handheld grippers equipped with cameras and motion tracking systems. It includes tools for sensor synchronization, data acquisition, processing pipelines, and dataset generation.
The VILMA Dataset can be downloaded from our EuroCore Zenodo repository here: https://doi.org/10.5281/zenodo.19708163
Full hardware documentation is available in the hardware/ folder, which includes:
- Hardware guide
- Full CAD models (STEP, STL, and editable source references)
- Bill of Materials (BOM)
- 3D printing guidelines
- Assembly instructions with deployment setup guidance
- Interchangeable hard and soft finger configurations
- Integrated HTC VIVE tracker mounting for motion tracking
- GoPro mounting support for egocentric recording
- AprilTags
The design preserves the original UMI pistol-style interaction while introducing project-specific modifications for improved tracking integration, deployment readiness, and real-world data collection reliability.
Clone the repository and create a virtual environment:

```
git clone https://github.com/ctheoc/VILMA.git
cd VILMA
python -m venv venv
```

Activate it on Linux / macOS:

```
source venv/bin/activate
```

or on Windows:

```
venv\Scripts\activate
```

FFmpeg is also required. To check if it is installed, run:

```
which ffprobe
ffprobe -version
```

If it is not installed, download it from https://ffmpeg.org, or run:

Ubuntu/Debian:

```
sudo apt update
sudo apt install ffmpeg
```

macOS (Homebrew):

```
brew install ffmpeg
```

Finally, install the Python dependencies:

```
pip install -r requirements.txt
```

Data are collected using VIVE trackers, GoPro cameras, and a microphone.
To account for the severe distortion of the GoPro Hero 13 Black’s 'Ultra Wide HyperView' lens, perform a custom finger distance calibration. Map physical measurements to the pixel distances between the centers of the AprilTags on each finger.
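For intuition, a minimal sketch of such a mapping, assuming paired pixel/physical measurements and a low-order polynomial fit (the actual calibration script may use a different model):

```python
import numpy as np

# Hypothetical calibration samples: pixel distance between AprilTag
# centers vs. measured physical finger distance in millimetres.
pixel_dist = np.array([120.0, 210.0, 305.0, 390.0, 470.0])
physical_mm = np.array([10.0, 30.0, 50.0, 70.0, 90.0])

# Fit a low-order polynomial to absorb residual lens distortion.
coeffs = np.polyfit(pixel_dist, physical_mm, deg=2)

def pixels_to_mm(px: float) -> float:
    """Map a measured pixel distance to an estimated physical distance."""
    return float(np.polyval(coeffs, px))

print(pixels_to_mm(250.0))  # estimated finger opening in mm
```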
To run the calibration:

```
python data_collection/vilma_fingers_distance_calibration.py
```

VIVE trackers and base stations should be connected via SteamVR.

Prerequisites: Download and install SteamVR.

Connect base stations and trackers:
- Connect the base stations to power.
- Connect the trackers to the PC:
  - Connect one end of the USB Type-C cable to the dongle cradle, and then plug the dongle into the cradle.
  - Connect the other end of the USB Type-C cable to a USB port on your computer.
  - Note: Keep the dongle at least 45 cm away from the computer and place it where it won’t be moved.
- From your computer, open the SteamVR app.
- Click Menu > Devices > Pair Controller.
- Press the Power button for around 2 seconds. The status light will blink blue.
- Wait for the status light to turn green. This means pairing is successful.
- In the Controller Pairing window, click Done.
We use two GoPro HERO13 cameras, one mounted on each handheld gripper, and a head-mounted GoPro HERO9 camera. The HERO13 cameras can be controlled via this script, while the HERO9 is operated manually or via a mobile phone.

Prerequisites: Download the GoPro Quik app on the mobile phone.

Set up and connect the cameras:
- Camera settings:
  - GoPro HERO13: 16:9 | 4K | 60 | UHV
  - GoPro HERO9: 4K | 60 | L
- Connect the GoPro HERO13 cameras to the PC:
  - Turn on the cameras and reset wireless connections: Preferences > Wireless Connections > Reset Connections > Reset All.
  - Turn on the PC's Bluetooth and Wi-Fi, and delete GoPro cameras from the computer's previously connected Bluetooth devices, if any.
  - Select Pair Device on the cameras.
- Connect the GoPro HERO9 to a mobile phone:
  - Turn on the phone's Bluetooth and Wi-Fi, open GoPro Quik, and select the GoPro tab.
  - Turn on the camera and select Connections > Connect Device > Quik App to connect.
Set up RODE microphone:
- Connect RX (Receiver) to the PC via USB.
- Connect TX (Transmitter) to the RX.
- Select Wireless GO GX as the input device in the PC settings.
Ensure the PC volume is not muted so you can hear the 'beep' sound when a recording starts. This sound is later used to synchronize the tracking and video recordings.
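For intuition, the beep onset can be located in an extracted audio track with a simple energy threshold; a minimal sketch, assuming a mono WAV export (the project's actual synchronization code may differ):

```python
import numpy as np
from scipy.io import wavfile

# Load the audio track (assumed mono WAV, e.g. extracted with ffmpeg).
rate, samples = wavfile.read("recording_audio.wav")
samples = samples.astype(np.float64)

# Short-time energy over 10 ms windows.
win = int(0.010 * rate)
n_windows = len(samples) // win
energy = (samples[: n_windows * win].reshape(-1, win) ** 2).mean(axis=1)

# The first window whose energy exceeds a multiple of the noise floor
# is taken as the beep onset.
threshold = 10.0 * np.median(energy)
onset_window = int(np.argmax(energy > threshold))
print(f"beep at ~{onset_window * win / rate:.3f} s")
```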
To start collecting data, run:

```
python data_collection/vilma_collect_data.py
```

This script creates, in the repository root:
- a folder named recordings containing the recorded data, and
- a JSON file named sessions.json that stores the paths to the data files.
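A quick way to inspect the generated index (the top-level layout of sessions.json is an assumption; adapt to the actual file):

```python
import json

# Inspect the session index written by vilma_collect_data.py.
with open("sessions.json") as f:
    sessions = json.load(f)

# The schema is whatever the collection script wrote; pretty-print
# one entry to see the recorded paths.
print(f"{len(sessions)} session(s) recorded")
first = sessions[0] if isinstance(sessions, list) else next(iter(sessions.values()))
print(json.dumps(first, indent=2))
```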
For tracking we use https://github.com/TriadSemi/triad_openvr at commit d389aacf2a4caa392398613a9daddba15ee24f92.
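For reference, polling a tracker pose through triad_openvr looks roughly like the sketch below (device naming such as "tracker_1" follows that library's conventions; SteamVR must already be running):

```python
import time
import triad_openvr

# Connect to a running SteamVR instance and list discovered devices.
vr = triad_openvr.triad_openvr()
vr.print_discovered_objects()

# Poll one tracker's pose (x, y, z, yaw, pitch, roll) at ~50 Hz.
for _ in range(250):  # ~5 seconds
    pose = vr.devices["tracker_1"].get_pose_euler()
    if pose is not None:
        x, y, z, yaw, pitch, roll = pose
        print(f"x={x:.3f} y={y:.3f} z={z:.3f}")
    time.sleep(0.02)
```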
Make sure the videos exist in the recordings folder:
- Videos captured by the camera mounted on the left gripper under recordings/left
- Videos captured by the camera mounted on the right gripper under recordings/right
- Videos captured by the head-mounted camera under recordings/head
To associate these videos with the rest of the recorded data, run:
```
python data_processing/vilma_associate_videos.py --json sessions.json --left recordings/left --right recordings/right --head recordings/head
```

An OpenAI API key is also required.
Create a .env file in the repository root with your OpenAI API key:
```
echo 'openai_api_key=YOUR_OPENAI_API_KEY' > .env
```

The processing script (vilma_process_data.py, invoked below) prepares the data before saving it into the final dataset:
- Transcribes the recorded instruction (speech-to-text) using OpenAI
- Synchronizes tracking and videos by trimming the videos
- Optimizes the videos (re-encodes with the H.264 codec and removes audio)
- Computes the distance between the fingers of each gripper by detecting the AprilTags (see the sketch after this list)
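A minimal sketch of the AprilTag-based finger distance measurement on a single frame, using the pupil_apriltags detector (the tag family is an assumption; the repository's actual detection code may differ):

```python
import cv2
from pupil_apriltags import Detector

# Detect the two finger tags in a single gripper-camera frame.
detector = Detector(families="tag36h11")  # assumed tag family
frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

detections = detector.detect(gray)
if len(detections) >= 2:
    a, b = detections[:2]
    # Pixel distance between the two tag centers; the calibration
    # fitted earlier maps this to a physical finger distance.
    dx, dy = a.center - b.center
    pixel_dist = (dx ** 2 + dy ** 2) ** 0.5
    print(f"tags {a.tag_id}/{b.tag_id}: {pixel_dist:.1f} px apart")
```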
To process the data, run:

```
python data_processing/vilma_process_data.py --json sessions.json --recordings recordings
```

To extract depth maps from the videos using Depth-Anything-V2, run:
```
python data_processing/Depth-Anything-V2/run_video.py --encoder vits --sessions-path sessions.json --pred-only --grayscale
```

For depth extraction, we use https://github.com/DepthAnything/Depth-Anything-V2 at commit a561b849ebae10a6f5ef49e26c83cbbcd36c71bf. We modified run_video.py for VILMA's purposes.
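Under the hood, Depth-Anything-V2's single-image API (per its upstream README) looks roughly like the sketch below; the checkpoint path is an assumption:

```python
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

# Configuration for the 'vits' encoder, following the upstream README.
model = DepthAnythingV2(encoder="vits", features=64, out_channels=[48, 96, 192, 384])
model.load_state_dict(
    torch.load("checkpoints/depth_anything_v2_vits.pth", map_location="cpu")  # assumed path
)
model.eval()

# Infer a dense depth map (HxW, float) for one video frame.
frame = cv2.imread("frame.png")
depth = model.infer_image(frame)
```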
To convert the videos (including depth maps) to 720p at 30 FPS, run:

```
python data_processing/vilma_compress_videos.py --input-dir recordings
```

To generate task, participant, and location statistics, and to export tasks_info from the JSON file, run:
```
python dataset_creation/vilma_calculate_statistics.py sessions.json --tasks-info-output vilma_tasks_info.json
```

To create or append to the HDF5 structure from the JSON file, run:
```
python dataset_creation/vilma_create_hdf5_dataset.py --json sessions.json --tasks-info vilma_tasks_info.json --recordings-root recordings --output vilma_dataset.h5
```

To organize the video files according to the HDF5 hierarchy, run:
```
python dataset_creation/vilma_organize_videos_by_hdf5.py --h5 vilma_dataset.h5 --recordings-root recordings --output-root /path/to/dataset_root
```

In case you want to blur faces that may appear in the recorded videos, run the following command, which uses insightface:

```
python dataset_creation/vilma_blur_faces.py
```
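For reference, the core of such face blurring with insightface looks roughly like this minimal per-frame sketch (model selection and blur parameters are assumptions, not the script's actual settings):

```python
import cv2
from insightface.app import FaceAnalysis

# Initialize insightface's detection pipeline (downloads models on first run).
app = FaceAnalysis(allowed_modules=["detection"])
app.prepare(ctx_id=-1)  # CPU; use a GPU id >= 0 if available

frame = cv2.imread("frame.png")
for face in app.get(frame):
    x1, y1, x2, y2 = face.bbox.astype(int)
    # Clamp to the image bounds and blur the face region in place.
    x1, y1 = max(x1, 0), max(y1, 0)
    frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (51, 51), 0)
cv2.imwrite("frame_blurred.png", frame)
```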
To print the HDF5 contents (for array datasets, only the shape and the first element or line are printed), run:

```
python dataset_creation/vilma_print_hdf5_contents.py --h5 vilma_dataset.h5
```
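To explore the generated file programmatically beyond this helper, a short h5py walk works as well (it prints whatever hierarchy the file contains, so no schema is assumed):

```python
import h5py

# Recursively print every group/dataset in the generated file,
# with shapes and dtypes for array datasets.
with h5py.File("vilma_dataset.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape} dtype={obj.dtype}")
        else:
            print(f"{name}/")
    f.visititems(show)
```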