This is the code repository for the following papers:
- Single Depth View Based Real-Time Reconstruction of Hand-Object Interactions (https://dl.acm.org/doi/abs/10.1145/3451341)
- Physical Interaction: Reconstructing Hand-object Interactions with Physics (https://dl.acm.org/doi/10.1145/3550469.3555421)
The system reconstructs hand-object interactions in real-time from a single depth camera through the following stages:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ COMPLETE INFERENCE PIPELINE │
└─────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────┐
│ RealSense │ Depth Image (320×240, 16-bit)
│ SR300 Camera │─────────────────────────────────────────┐
└─────────────────┘ │
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ STAGE 0: NEURAL NETWORK INFERENCE Port 8080 │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Depth Image ──▶ [JointLearningNeuralNetwork] ──▶ ┬─▶ 21 Joint Positions (u,v,z)
│ (Encoder-Decoder CNN) ├─▶ Segmentation Mask
│ └─▶ Hand/Object Depth
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ STAGE I: KINEMATIC HAND-OBJECT MOTION TRACKING │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ HAND TRACKING │ │ OBJECT TRACKING │ │
│ │ ┌───────────────────┐ │ │ ┌───────────────────┐ │ │
│ │ │ Sphere-Mesh Model │ │ ICP │ │ TSDF Fusion │ │ │
│ │ │ • 28 DOF │◀─┼────────┼─▶│ • No prior needed │ │ │
│ │ │ • MANO conversion │ │ │ │ • Built on-the-fly│ │ │
│ │ └───────────────────┘ │ │ └───────────────────┘ │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ θ_kin (Hand Pose) W (Object Motion) │
│ 28 joint angles S (Object Shape) │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
After object reconstructed │ Calculate: mass, inertia,
(user presses 'T') │ center of mass, velocity
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ STAGE II: PHYSICS-BASED CONTACT STATUS OPTIMIZATION │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ Key Insight: "Object motion must be explained by contact forces" │
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ For each fingertip i: │ │
│ │ • Extract contact candidate (p_i, n_i, d_i) │ │
│ │ • Optimize forces F_i using Newton-Euler equations │ │
│ │ • Recover missing contacts via force analysis │ │
│ │ │ │
│ │ Minimize: E = E_force + E_moment + E_regularize + E_contact │ │
│ │ │ │
│ │ ΣF_i + mg = ma (force balance) │ │
│ │ Στ_i = Iα (moment balance) │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Output: Refined contacts {d̃_i}, Estimated forces {F_i} │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ STAGE III: CONFIDENCE-BASED SLIDE PREVENTION │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ For each fingertip: │ │
│ │ │ │
│ │ 1. Compute kinematic confidence C_i (based on visible depth points) │ │
│ │ 2. Check pressure F_N_i from Stage II │ │
│ │ │ │
│ │ IF (low pressure) → Allow sliding (trust kinematics) │ │
│ │ IF (high pressure + low confidence) → Prevent sliding (trust physics)│ │
│ │ IF (high pressure + high confidence) → Smooth interpolation │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ Output: Final tip positions T_i^(s) │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ FINAL OUTPUT │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Hand Pose │ │ Object Shape │ │Object Motion │ │Contact Forces│ │
│ │ (28 DOF) │ │ (TSDF→Mesh) │ │ (6 DOF + │ │ (per tip) │ │
│ │ │ │ │ │ non-rigid) │ │ │ │
│ └──────┬───────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────┐ │
│ │ MANO Hand Mesh (778 verts, 1538 faces)│ │
│ └──────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────────┐
│ OPTIONAL: LSTMPose Temporal Smoothing Port 8081 │
├─────────────────────────────────────────────────────────────────────────────────┤
│ Joint sequence ──▶ [LSTM Encoder-Decoder] ──▶ Smoothed joints (less jitter) │
└─────────────────────────────────────────────────────────────────────────────────┘
TIMING (Real-time @ 25 FPS):
├─ Neural Network: ~5ms (GPU)
├─ Kinematic Tracking: ~32ms (GPU)
└─ Physics Refinement: ~8ms (CPU)
Total: ~40ms per frame
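The two balance equations in Stage II can be made concrete with a small sketch. The code below is not the repository's solver: it only evaluates how well a given set of fingertip forces explains an observed object motion, using the simplified forms ΣF_i + mg = ma and Στ_i = Iα from the diagram. The function names, the gravity direction, and all numbers in the example are hypothetical.

```python
# Hypothetical sketch: residuals of the Stage II force and moment balance.
GRAVITY = (0.0, -9.81, 0.0)  # assumed gravity vector in m/s^2 (y-up frame)


def cross(a, b):
    """3D cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def balance_residuals(contacts, forces, mass, com, inertia, lin_acc, ang_acc):
    """Return (force_residual, moment_residual) as 3-tuples.

    force residual:  sum(F_i) + m*g - m*a
    moment residual: sum((p_i - com) x F_i) - I*alpha
    """
    total_f = [0.0, 0.0, 0.0]
    total_t = [0.0, 0.0, 0.0]
    for p, f in zip(contacts, forces):
        arm = tuple(p[k] - com[k] for k in range(3))
        tau = cross(arm, f)
        for k in range(3):
            total_f[k] += f[k]
            total_t[k] += tau[k]
    i_alpha = tuple(sum(inertia[r][c] * ang_acc[c] for c in range(3))
                    for r in range(3))
    r_force = tuple(total_f[k] + mass * GRAVITY[k] - mass * lin_acc[k]
                    for k in range(3))
    r_moment = tuple(total_t[k] - i_alpha[k] for k in range(3))
    return r_force, r_moment


def energy(r_force, r_moment):
    """Sum of squared residuals (a stand-in for E_force + E_moment)."""
    return sum(x * x for x in r_force) + sum(x * x for x in r_moment)


# Example: a 0.2 kg object held still by two fingertips on opposite sides.
zero = (0.0, 0.0, 0.0)
inertia = [[1e-4, 0.0, 0.0], [0.0, 1e-4, 0.0], [0.0, 0.0, 1e-4]]
contacts = [(-0.03, 0.0, 0.0), (0.03, 0.0, 0.0)]
forces = [(1.0, 0.981, 0.0), (-1.0, 0.981, 0.0)]

r_f, r_m = balance_residuals(contacts, forces, 0.2, zero, inertia, zero, zero)
print("energy with both contacts:", energy(r_f, r_m))
```

In this toy setup the residual is close to zero when both contacts are present and grows when one is dropped, which is the intuition behind recovering occluded contacts from force analysis.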
| Feature | Description |
|---|---|
| No Object Template | Objects are reconstructed on-the-fly using TSDF fusion - works with any arbitrary object |
| MANO Hand Model | Uses the parametric MANO model (778 vertices) for realistic hand mesh |
| Physics-based Refinement | Recovers occluded contacts by enforcing Newton-Euler dynamics |
| Real-time Performance | Runs at ~25 FPS on dual NVIDIA Titan Xp GPUs |
| Single Depth Camera | Only requires Intel RealSense SR300 (no multi-view setup) |
| Component | Model | Template Required? |
|---|---|---|
| Hand | MANO / Sphere-Mesh | ✅ Yes (parametric model with shape/pose parameters) |
| Object | TSDF Voxel Grid | ❌ No (reconstructed from depth in real-time) |
- Windows 10
- CUDA 8.0 and CUDA 10.0
- Visual Studio 2015 (Later versions are not compatible with CUDA 8)
- 1-2 CUDA-capable graphics cards (total VRAM >= 20GB, with at least one card >= 12GB); two NVIDIA Titan Xp cards are used in our demo
- Intel RealSense SR300 Camera
- A solid-color wristband (blue is used in our demo)
- install CUDA 10.0 and the corresponding cuDNN
- install dependencies

      cd Network/JointLearningNeuralNetwork
      pip install -r requirements.txt

- download the pre-trained network from https://drive.google.com/file/d/1wDdBegEpRqFUs0x_9zV6Rm4-Mk_Yajs3/view?usp=sharing, then extract `Network.zip` under the root directory
- run `JointLearningNeuralNetwork` (need VRAM>=12GB)

      cd Network/JointLearningNeuralNetwork
      python inference_server.py --gpu <gpu_id>

- run `LSTMPose`

      cd Network/LSTMPose
      python inference_server.py --gpu <gpu_id>

Note: `JointLearningNeuralNetwork` and `LSTMPose` work as local servers on ports 8080 and 8081, respectively, so make sure that both ports are free before running.
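One way to confirm that both ports are free before starting the servers is to try binding to them. The helper below is a minimal sketch using only the Python standard library; `port_is_free` is a hypothetical name and is not part of this repository.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    servers = (("JointLearningNeuralNetwork", 8080), ("LSTMPose", 8081))
    for name, port in servers:
        state = "free" if port_is_free(port) else "IN USE"
        print(f"{name}: port {port} is {state}")
```

The check is only a snapshot: another process could still claim the port between this check and the server start.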
- download CMake with GUI (version 3.9 is recommended)
- make a new directory under the source root of the C++ code (please do not change the position of this new directory)

      cd InteractionReconstruction/InteractionReconstruction
      mkdir build

- general configurations for CMake
  - `Where is the source code` -> {path to this project}/InteractionReconstruction/InteractionReconstruction
  - `Where to build the binaries` -> {path to this project}/InteractionReconstruction/InteractionReconstruction/build
  - `Specify the generator for this project` -> Visual Studio 14 2015
  - `Optional platform for generator` -> x64
- configure third-party libraries
- boost-1.61.0
- ceres-windows & glog
- cuda 8.0
- curl 7.62.0
- Official: https://curl.se/
- Download: https://github.com/curl/curl/releases
- eigen 3.2.9
- flann
- glew 2.1.0
- Official: https://glew.sourceforge.net/
- Download: https://sourceforge.net/projects/glew/files/glew/2.1.0/
- glm 0.9.8.5
- jsoncpp 1.7.4
- libigl 2.3.0
- Official: https://libigl.github.io/
- Download: https://github.com/libigl/libigl/releases
- opencv 2.4.10
- opengp
- pcl 1.8.0
- Official: https://pointclouds.org/
- Download: https://github.com/PointCloudLibrary/pcl/releases
- qt 5.8
- Official: https://www.qt.io/
- Download: https://download.qt.io/new_archive/qt/5.8/5.8.0/
- realsense-sdk 2.43.0
- generate the Visual Studio project
- open the Visual Studio project, select the `Release` and `x64` configuration, and set `entry` as the startup project
This section describes how to run the neural networks and generate 3D hand meshes on Linux without the C++ code.
- Linux (tested on Ubuntu)
- Python 3.8+
- CUDA 10.0+ with cuDNN
- GPU with ≥12GB VRAM
cd Network/JointLearningNeuralNetwork
pip install -r requirements.txt
pip install scipy requests

Note: This fork includes TensorFlow 2.x compatibility fixes. The original code was written for TensorFlow 1.x.
Download the pre-trained networks from Google Drive, then extract:
# Extract Network.zip to get:
# - Network/JointLearningNeuralNetwork/model/
# - Network/LSTMPose/exp/
unzip Network.zip

cd Network/JointLearningNeuralNetwork
python inference_server.py --gpu 0

The server starts on port 8080 and provides:
- Input: Depth image (240×320, 16-bit PNG)
- Output: 21 hand joint positions + segmentation mask
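Before sending an image to the server, it can help to confirm that the file really is a 16-bit PNG of the expected size. The sketch below reads the PNG header with the standard library only. It assumes that "240×320" means 240 rows by 320 columns and that the depth image is single-channel grayscale; the function names are hypothetical.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data):
    """Parse (width, height, bit_depth, color_type) from the first PNG bytes."""
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR payload starts at byte 16: width, height (big-endian uint32),
    # then bit depth and color type (one byte each).
    return struct.unpack(">IIBB", data[16:26])


def is_valid_depth_png(data, width=320, height=240):
    """True for a 16-bit grayscale PNG with the expected dimensions."""
    w, h, bit_depth, color_type = read_png_header(data)
    return (w, h) == (width, height) and bit_depth == 16 and color_type == 0
```

A file can be checked by passing its first 26 bytes, for example `is_valid_depth_png(open("your_depth.png", "rb").read(26))`.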
In a new terminal, run:
cd Network/JointLearningNeuralNetwork
# Generate hand mesh from depth image
python generate_hand_mesh.py --depth_image your_depth.png --output hand_mesh.obj

Arguments:

| Argument | Description | Default |
|---|---|---|
| `--depth_image` | Input depth image path | `org_depth_img_init.png` |
| `--output` | Output OBJ file path | `hand_mesh.obj` |
| `--mano_path` | Path to MANO model JSON | Auto-detected |
| `--server_url` | Inference server URL | `http://127.0.0.1:8080` |
Output Files:
- `hand_mesh.obj` - 3D hand mesh (778 vertices, 1538 faces)
- `hand_mesh_joints.obj` - 21 joint positions as 3D points
- `hand_mesh_mask.png` - Hand/object segmentation mask
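A quick sanity check on the generated mesh is to count the vertex and face records in the OBJ file and compare them with the MANO figures quoted above. This is a minimal sketch that assumes a plain Wavefront OBJ with one `v` or `f` record per line; the function names are hypothetical.

```python
def count_obj_elements(lines):
    """Count vertex ('v') and face ('f') records in Wavefront OBJ text lines."""
    vertices = faces = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":
            vertices += 1
        elif parts[0] == "f":
            faces += 1
    return vertices, faces


def check_hand_mesh(path, expected=(778, 1538)):
    """True if the OBJ at `path` has the expected (vertex, face) counts."""
    with open(path, "r") as fh:
        return count_obj_elements(fh) == expected
```

For example, `check_hand_mesh("hand_mesh.obj")` should return `True` for a complete MANO mesh.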
cd Network/JointLearningNeuralNetwork
python test_server.py

Depth Image (240×320)
│
▼
┌─────────────────────────┐
│ JointLearningNN │ ← Neural network (inference_server.py)
│ (port 8080) │
└───────────┬─────────────┘
│ 21 joints (u, v, z)
▼
┌─────────────────────────┐
│ MANO Model Fitting │ ← Python MANO (mano_model.py)
│ (generate_hand_mesh.py)│
└───────────┬─────────────┘
│
▼
hand_mesh.obj ← 3D hand mesh (778 vertices)
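The network reports joints as (u, v, z), while mesh fitting works in 3D. A common way to bridge the two is pinhole back-projection, sketched below. Whether `generate_hand_mesh.py` does exactly this is an assumption, as is the reading of (u, v) as pixel coordinates and z as depth. The intrinsics are placeholders; real values come from the camera calibration.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z into camera space."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


# Placeholder intrinsics for a 320x240 image (NOT calibrated values).
FX, FY, CX, CY = 475.0, 475.0, 160.0, 120.0

# Two example joints as (u, v, z) with z in millimetres.
joints_uvz = [(160.0, 120.0, 400.0), (255.0, 120.0, 500.0)]
joints_xyz = [backproject(u, v, z, FX, FY, CX, CY) for u, v, z in joints_uvz]
print(joints_xyz)
```

A joint at the principal point maps onto the optical axis, so its x and y are zero and only the depth remains.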
For temporal smoothing of hand poses across video frames:
cd Network/LSTMPose
python inference_server.py --gpu 0

This runs on port 8081 and smooths joint predictions over time.
A web-based interactive viewer for exploring 3D hand meshes:
cd Network/JointLearningNeuralNetwork
# Start the viewer (default: port 5000)
python interactive_viewer.py --port 5000
# Or specify a custom mesh file
python interactive_viewer.py --mesh hand_mesh.obj --joints hand_mesh_joints.obj

Then open http://localhost:5000 in your browser.
Features:
| Control | Action |
|---|---|
| Left click + drag | Rotate the model |
| Scroll wheel | Zoom in/out |
| Right click + drag | Pan |
| Reset View | Reset camera position |
| Toggle Wireframe | Show mesh wireframe |
| Toggle Joints | Show/hide 21 joint spheres |
| Toggle Skeleton | Show/hide skeleton lines |
| Auto Rotate | Enable auto-rotation |
Arguments:

| Argument | Description | Default |
|---|---|---|
| `--port` | Server port | `5000` |
| `--mesh` | Input OBJ mesh file | `hand_from_depth.obj` |
| `--joints` | Joint positions OBJ file | `hand_from_depth_joints.obj` |
| `--host` | Host address | `0.0.0.0` |
To render a mesh to a static image (without browser):
cd Network/JointLearningNeuralNetwork
# Render to PNG
python visualize_mesh.py --mesh hand_mesh.obj --output render.png
# Multi-view render
python visualize_mesh.py --mesh hand_mesh.obj --output multiview.png --multiview
# Include joint positions
python visualize_mesh.py --mesh hand_mesh.obj --joints hand_mesh_joints.obj --output render.png

- set the wristband color in the file `InteractionReconstruction/InteractionReconstruction/build/wristband.txt`, and make sure the color is between `hsv_min` and `hsv_max`
- write the configuration file; an example is `InteractionReconstruction/InteractionReconstruction/json_manuscript/IO_Parameters_online.json`
- set the path to the configuration file as the command line argument in `Properties->Debugging->Command Arguments`
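To pick values for the wristband color, it may help to convert a sampled RGB color to HSV and test it against a candidate range. The sketch below assumes the OpenCV 8-bit HSV scale (H in 0-179, S and V in 0-255); the scale actually used in `wristband.txt` and the example bounds are assumptions, and the function names are hypothetical. The file itself is not parsed here.

```python
import colorsys


def rgb_to_hsv_opencv(r, g, b):
    """Convert 8-bit RGB to HSV on the OpenCV 8-bit scale (assumed here)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (int(round(h * 180)) % 180, int(round(s * 255)), int(round(v * 255)))


def in_hsv_range(hsv, hsv_min, hsv_max):
    """True if every HSV channel lies within [min, max] (no hue wrap-around)."""
    return all(lo <= x <= hi for x, lo, hi in zip(hsv, hsv_min, hsv_max))


# Hypothetical bounds for a blue wristband.
HSV_MIN = (100, 80, 80)
HSV_MAX = (130, 255, 255)

blue = rgb_to_hsv_opencv(0, 0, 255)
print(blue, in_hsv_range(blue, HSV_MIN, HSV_MAX))
```

Colors whose hue wraps around zero, such as red, would need two ranges; that case is not handled in this sketch.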
- wear the wristband and make sure that it is always in the view of the sensor
- run the code and start tracking the hand
- press `R` to start object reconstruction
- rotate the object to gradually reconstruct it
- press `T` when you think that the object is fully reconstructed
- close the `InteractionRecon` window when you want to stop tracking
- if you set `store_org_data` to true in the configuration file, the raw RGB-D data will be stored. The frame ids at which you pressed `R` and `T` will be shown later.
- set `benchmark` to false in the configuration file
- set `left_camera_file_mask` to the file prefix of the image sequence you have stored
- set `recon_frame` and `stop_recon_frame` to the frame ids at which you pressed `R` and `T`
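The three settings above could also be applied with a small script instead of editing the JSON by hand. This sketch assumes the four keys sit at the top level of the configuration file, which may not match the real layout of `IO_Parameters_online.json`; the function name, file prefix, and frame ids are hypothetical.

```python
import json


def make_offline_config(online_cfg, file_prefix, recon_frame, stop_recon_frame):
    """Return a copy of a config dict adjusted for replaying a stored sequence."""
    cfg = dict(online_cfg)
    cfg["benchmark"] = False
    cfg["left_camera_file_mask"] = file_prefix
    cfg["recon_frame"] = recon_frame
    cfg["stop_recon_frame"] = stop_recon_frame
    return cfg


# Hypothetical values: prefix of the stored frames and the ids shown for R and T.
online = {"store_org_data": True, "benchmark": True}
offline = make_offline_config(online, "recorded/seq01_", 120, 480)
print(json.dumps(offline, indent=2))
```

The original dict is left unchanged, so the online configuration can be kept alongside the offline one.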
Building...
@article{10.1145/3451341,
author = {Zhang, Hao and Zhou, Yuxiao and Tian, Yifei and Yong, Jun-Hai and Xu, Feng},
title = {Single Depth View Based Real-Time Reconstruction of Hand-Object Interactions},
year = {2021},
issue_date = {June 2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {40},
number = {3},
issn = {0730-0301},
url = {https://doi.org/10.1145/3451341},
doi = {10.1145/3451341},
month = {jul},
articleno = {29},
numpages = {12},
keywords = {hand-object interaction, hand tracking, Single depth camera, object reconstruction}
}
@inproceedings{10.1145/3550469.3555421,
author = {Hu, Haoyu and Yi, Xinyu and Zhang, Hao and Yong, Jun-Hai and Xu, Feng},
title = {Physical Interaction: Reconstructing Hand-Object Interactions with Physics},
year = {2022},
isbn = {9781450394703},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3550469.3555421},
doi = {10.1145/3550469.3555421},
booktitle = {SIGGRAPH Asia 2022 Conference Papers},
articleno = {43},
numpages = {9},
keywords = {single depth camera, physics-based interaction model, hand tracking, hand-object interaction},
location = {Daegu, Republic of Korea},
series = {SA '22 Conference Papers}
}

