This repository contains the codebase for the VILMA (VIsion Language MAnipulation) project, created by the CAIR (Cognitive Artificial Intelligence and Robotics) research group at the CYENS Centre of Excellence. This work was supported by the euROBIN project under the 3rd Open Call for Technology Exchange Programme (Project: "VILMA - Advancing Robotic Manipulation: A Handheld Gripper and Vision-Language Dataset").
The repository provides the hardware documentation and software framework for multimodal manipulation data collection using a head-mounted camera, a microphone, and two handheld grippers equipped with cameras and motion tracking systems. It includes tools for sensor synchronization, data acquisition, processing pipelines, and dataset generation.
The VILMA Dataset can be downloaded from our EuroCore Zenodo repository here: https://doi.org/10.5281/zenodo.19708163
Full hardware documentation is available in the hardware/ folder, which includes:
- Hardware guide
- Full CAD models (STEP, STL, and editable source references)
- Bill of Materials (BOM)
- 3D printing guidelines
- Assembly instructions with deployment setup guidance
- Interchangeable hard and soft finger configurations
- Integrated HTC VIVE tracker mounting for motion tracking
- GoPro mounting support for egocentric recording
- AprilTags
The design preserves the original UMI pistol-style interaction while introducing project-specific modifications for improved tracking integration, deployment readiness, and real-world data collection reliability.
Clone the repository and create a virtual environment:

```
git clone https://github.com/ctheoc/VILMA.git
cd VILMA
python -m venv venv
```

Activate it on Linux / macOS:

```
source venv/bin/activate
```

or on Windows:

```
venv\Scripts\activate
```

FFmpeg is also required. To check if it is installed, run:

```
which ffprobe
ffprobe -version
```

If it is not installed, download it from https://ffmpeg.org, or run:

Ubuntu/Debian:

```
sudo apt update
sudo apt install ffmpeg
```

macOS (Homebrew):

```
brew install ffmpeg
```

Finally, install the Python dependencies:

```
pip install -r requirements.txt
```

Data are collected using VIVE trackers, GoPro cameras, and a microphone.
To account for the severe distortion of the GoPro Hero 13 Black’s 'Ultra Wide HyperView' lens, perform a custom finger distance calibration. Map physical measurements to the pixel distances between the centers of the AprilTags on each finger.
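For intuition, a minimal sketch of such a mapping, assuming paired pixel/physical measurements and a low-order polynomial fit (the actual calibration script may use a different model):

```python
import numpy as np

# Hypothetical calibration samples: pixel distance between AprilTag
# centers vs. measured physical finger distance in millimetres.
pixel_dist = np.array([120.0, 210.0, 305.0, 390.0, 470.0])
physical_mm = np.array([10.0, 30.0, 50.0, 70.0, 90.0])

# Fit a low-order polynomial to absorb residual lens distortion.
coeffs = np.polyfit(pixel_dist, physical_mm, deg=2)

def pixels_to_mm(px: float) -> float:
    """Map a measured pixel distance to an estimated physical distance."""
    return float(np.polyval(coeffs, px))

print(pixels_to_mm(250.0))  # estimated finger opening in mm
```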
To run the calibration:

```
python data_collection/vilma_fingers_distance_calibration.py
```

VIVE trackers and base stations should be connected via SteamVR.

Prerequisites: Download and install SteamVR.

Connect base stations and trackers:
- Connect the base stations to power.
- Connect the trackers to the PC:
  - Connect one end of the USB Type-C cable to the dongle cradle, and then plug the dongle into the cradle.
  - Connect the other end of the USB Type-C cable to a USB port on your computer.
  - Note: Keep the dongle at least 45 cm away from the computer and place it where it won’t be moved.
- From your computer, open the SteamVR app.
- Click Menu > Devices > Pair Controller.
- Press the Power button for around 2 seconds. The status light will blink blue.
- Wait for the status light to turn green. This means pairing is successful.
- In the Controller Pairing window, click Done.
We use two GoPro HERO13 cameras, one mounted on each handheld gripper, and a head-mounted GoPro HERO9 camera. The HERO13 cameras can be controlled via this script, while the HERO9 is operated manually or via a mobile phone.

Prerequisites: Download the GoPro Quik app on the mobile phone.

Set up and connect the cameras:
- Camera settings:
  - GoPro HERO13: 16:9 | 4K | 60 | UHV
  - GoPro HERO9: 4K | 60 | L
- Connect the GoPro HERO13 cameras to the PC:
  - Turn on the cameras and reset wireless connections: Preferences > Wireless Connections > Reset Connections > Reset All.
  - Turn on the PC's Bluetooth and Wi-Fi, and delete GoPro cameras from the computer's previously connected Bluetooth devices, if any.
  - Select Pair Device on the cameras.
- Connect the GoPro HERO9 to a mobile phone:
  - Turn on the phone's Bluetooth and Wi-Fi, open GoPro Quik, and select the GoPro tab.
  - Turn on the camera and select Connections > Connect Device > Quik App to connect.
Set up RODE microphone:
- Connect RX (Receiver) to the PC via USB.
- Connect TX (Transmitter) to the RX.
- Select Wireless GO GX as the input device in the PC settings.
Ensure the PC volume is not muted so you can hear the 'beep' sound when a recording starts. This sound is later used to synchronize the tracking and video recordings.
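For intuition, the beep onset can be located in an extracted audio track with a simple energy threshold; a minimal sketch, assuming a mono WAV export (the project's actual synchronization code may differ):

```python
import numpy as np
from scipy.io import wavfile

# Load the audio track (assumed mono WAV, e.g. extracted with ffmpeg).
rate, samples = wavfile.read("recording_audio.wav")
samples = samples.astype(np.float64)

# Short-time energy over 10 ms windows.
win = int(0.010 * rate)
n_windows = len(samples) // win
energy = (samples[: n_windows * win].reshape(-1, win) ** 2).mean(axis=1)

# The first window whose energy exceeds a multiple of the noise floor
# is taken as the beep onset.
threshold = 10.0 * np.median(energy)
onset_window = int(np.argmax(energy > threshold))
print(f"beep at ~{onset_window * win / rate:.3f} s")
```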
To start collecting data, run:

```
python data_collection/vilma_collect_data.py
```

This script creates, in the repository root:
- a folder named recordings containing the recorded data, and
- a JSON file named sessions.json that stores the paths to the data files.
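A quick way to inspect the generated index (the top-level layout of sessions.json is an assumption; adapt to the actual file):

```python
import json

# Inspect the session index written by vilma_collect_data.py.
with open("sessions.json") as f:
    sessions = json.load(f)

# The schema is whatever the collection script wrote; pretty-print
# one entry to see the recorded paths.
print(f"{len(sessions)} session(s) recorded")
first = sessions[0] if isinstance(sessions, list) else next(iter(sessions.values()))
print(json.dumps(first, indent=2))
```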
For tracking we use https://github.com/TriadSemi/triad_openvr at commit d389aacf2a4caa392398613a9daddba15ee24f92.
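For reference, polling a tracker pose through triad_openvr looks roughly like the sketch below (device naming such as "tracker_1" follows that library's conventions; SteamVR must already be running):

```python
import time
import triad_openvr

# Connect to a running SteamVR instance and list discovered devices.
vr = triad_openvr.triad_openvr()
vr.print_discovered_objects()

# Poll one tracker's pose (x, y, z, yaw, pitch, roll) at ~50 Hz.
for _ in range(250):  # ~5 seconds
    pose = vr.devices["tracker_1"].get_pose_euler()
    if pose is not None:
        x, y, z, yaw, pitch, roll = pose
        print(f"x={x:.3f} y={y:.3f} z={z:.3f}")
    time.sleep(0.02)
```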
Make sure the videos exist in the recordings folder:
- Videos captured by the camera mounted on the left gripper under recordings/left
- Videos captured by the camera mounted on the right gripper under recordings/right
- Videos captured by the head-mounted camera under recordings/head
To associate these videos with the rest of the recorded data, run:
```
python data_processing/vilma_associate_videos.py --json sessions.json --left recordings/left --right recordings/right --head recordings/head
```

An OpenAI API key is also required.
Create a .env file in the repository root with your OpenAI API key:
```
echo 'openai_api_key=YOUR_OPENAI_API_KEY' > .env
```

The processing script (vilma_process_data.py, invoked below) prepares the data before saving it into the final dataset:
- Transcribes the recorded instruction (speech-to-text) using OpenAI
- Synchronizes tracking and videos by trimming the videos
- Optimizes the videos (re-encodes with the H.264 codec and removes audio)
- Computes the distance between the fingers of each gripper by detecting the AprilTags (see the sketch after this list)
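A minimal sketch of the AprilTag-based finger distance measurement on a single frame, using the pupil_apriltags detector (the tag family is an assumption; the repository's actual detection code may differ):

```python
import cv2
from pupil_apriltags import Detector

# Detect the two finger tags in a single gripper-camera frame.
detector = Detector(families="tag36h11")  # assumed tag family
frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

detections = detector.detect(gray)
if len(detections) >= 2:
    a, b = detections[:2]
    # Pixel distance between the two tag centers; the calibration
    # fitted earlier maps this to a physical finger distance.
    dx, dy = a.center - b.center
    pixel_dist = (dx ** 2 + dy ** 2) ** 0.5
    print(f"tags {a.tag_id}/{b.tag_id}: {pixel_dist:.1f} px apart")
```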
To process the data, run:

```
python data_processing/vilma_process_data.py --json sessions.json --recordings recordings
```

To extract depth maps from the videos using Depth-Anything-V2, run:
```
python data_processing/Depth-Anything-V2/run_video.py --encoder vits --sessions-path sessions.json --pred-only --grayscale
```

For depth extraction, we use https://github.com/DepthAnything/Depth-Anything-V2 at commit a561b849ebae10a6f5ef49e26c83cbbcd36c71bf. We modified run_video.py for VILMA's purposes.
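Under the hood, Depth-Anything-V2's single-image API (per its upstream README) looks roughly like the sketch below; the checkpoint path is an assumption:

```python
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2

# Configuration for the 'vits' encoder, following the upstream README.
model = DepthAnythingV2(encoder="vits", features=64, out_channels=[48, 96, 192, 384])
model.load_state_dict(
    torch.load("checkpoints/depth_anything_v2_vits.pth", map_location="cpu")  # assumed path
)
model.eval()

# Infer a dense depth map (HxW, float) for one video frame.
frame = cv2.imread("frame.png")
depth = model.infer_image(frame)
```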
To convert the videos (including depth maps) to 720p at 30 FPS, run:

```
python data_processing/vilma_compress_videos.py --input-dir recordings
```

To generate task, participant, and location statistics, and to export tasks_info from the JSON file, run:
```
python dataset_creation/vilma_calculate_statistics.py sessions.json --tasks-info-output vilma_tasks_info.json
```

To create or append to the HDF5 structure from the JSON file, run:
```
python dataset_creation/vilma_create_hdf5_dataset.py --json sessions.json --tasks-info vilma_tasks_info.json --recordings-root recordings --output vilma_dataset.h5
```

To organize the video files according to the HDF5 hierarchy, run:
```
python dataset_creation/vilma_organize_videos_by_hdf5.py --h5 vilma_dataset.h5 --recordings-root recordings --output-root /path/to/dataset_root
```

In case you want to blur faces that may appear in the recorded videos, run the following command, which uses insightface:

```
python dataset_creation/vilma_blur_faces.py
```
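For reference, the core of such face blurring with insightface looks roughly like this minimal per-frame sketch (model selection and blur parameters are assumptions, not the script's actual settings):

```python
import cv2
from insightface.app import FaceAnalysis

# Initialize insightface's detection pipeline (downloads models on first run).
app = FaceAnalysis(allowed_modules=["detection"])
app.prepare(ctx_id=-1)  # CPU; use a GPU id >= 0 if available

frame = cv2.imread("frame.png")
for face in app.get(frame):
    x1, y1, x2, y2 = face.bbox.astype(int)
    # Clamp to the image bounds and blur the face region in place.
    x1, y1 = max(x1, 0), max(y1, 0)
    frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], (51, 51), 0)
cv2.imwrite("frame_blurred.png", frame)
```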
To print the HDF5 contents (for array datasets, only the shape and the first element or line are printed), run:

```
python dataset_creation/vilma_print_hdf5_contents.py --h5 vilma_dataset.h5
```
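To explore the generated file programmatically beyond this helper, a short h5py walk works as well (it prints whatever hierarchy the file contains, so no schema is assumed):

```python
import h5py

# Recursively print every group/dataset in the generated file,
# with shapes and dtypes for array datasets.
with h5py.File("vilma_dataset.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape} dtype={obj.dtype}")
        else:
            print(f"{name}/")
    f.visititems(show)
```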