CrystalNeuro/PhysInteraction

 
 

Content

Introduction

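This is the code repository for the following papers (BibTeX entries are given in the Bibtex section):

  • Single Depth View Based Real-Time Reconstruction of Hand-Object Interactions (2021)
  • Physical Interaction: Reconstructing Hand-Object Interactions with Physics (SIGGRAPH Asia 2022 Conference Papers)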

Method Overview

Inference Pipeline

The system reconstructs hand-object interactions in real-time from a single depth camera through the following stages:

┌─────────────────────────────────────────────────────────────────────────────────┐
│                        COMPLETE INFERENCE PIPELINE                               │
└─────────────────────────────────────────────────────────────────────────────────┘

   ┌─────────────────┐
   │  RealSense      │     Depth Image (320×240, 16-bit)
   │  SR300 Camera   │─────────────────────────────────────────┐
   └─────────────────┘                                         │
                                                               ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  STAGE 0: NEURAL NETWORK INFERENCE                                    Port 8080 │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   Depth Image ──▶ [JointLearningNeuralNetwork] ──▶ ┬─▶ 21 Joint Positions (u,v,z)
│                   (Encoder-Decoder CNN)            ├─▶ Segmentation Mask
│                                                    └─▶ Hand/Object Depth
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  STAGE I: KINEMATIC HAND-OBJECT MOTION TRACKING                                 │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   ┌─────────────────────────┐        ┌─────────────────────────┐               │
│   │     HAND TRACKING       │        │    OBJECT TRACKING      │               │
│   │  ┌───────────────────┐  │        │  ┌───────────────────┐  │               │
│   │  │ Sphere-Mesh Model │  │  ICP   │  │ TSDF Fusion       │  │               │
│   │  │ • 28 DOF          │◀─┼────────┼─▶│ • No prior needed │  │               │
│   │  │ • MANO conversion │  │        │  │ • Built on-the-fly│  │               │
│   │  └───────────────────┘  │        │  └───────────────────┘  │               │
│   └─────────────────────────┘        └─────────────────────────┘               │
│              │                                    │                             │
│              ▼                                    ▼                             │
│      θ_kin (Hand Pose)                   W (Object Motion)                      │
│      28 joint angles                     S (Object Shape)                       │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
          After object reconstructed     │     Calculate: mass, inertia,
          (user presses 'T')             │     center of mass, velocity
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  STAGE II: PHYSICS-BASED CONTACT STATUS OPTIMIZATION                            │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   Key Insight: "Object motion must be explained by contact forces"              │
│                                                                                 │
│   ┌─────────────────────────────────────────────────────────────────────────┐  │
│   │  For each fingertip i:                                                  │  │
│   │  • Extract contact candidate (p_i, n_i, d_i)                            │  │
│   │  • Optimize forces F_i using Newton-Euler equations                     │  │
│   │  • Recover missing contacts via force analysis                          │  │
│   │                                                                         │  │
│   │  Minimize: E = E_force + E_moment + E_regularize + E_contact            │  │
│   │                                                                         │  │
│   │            ΣF_i + mg = ma    (force balance)                            │  │
│   │            Στ_i = Iα         (moment balance)                           │  │
│   └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                                 │
│   Output: Refined contacts {d̃_i}, Estimated forces {F_i}                       │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  STAGE III: CONFIDENCE-BASED SLIDE PREVENTION                                   │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   ┌─────────────────────────────────────────────────────────────────────────┐  │
│   │  For each fingertip:                                                    │  │
│   │                                                                         │  │
│   │  1. Compute kinematic confidence C_i (based on visible depth points)    │  │
│   │  2. Check pressure F_N_i from Stage II                                  │  │
│   │                                                                         │  │
│   │     IF (low pressure)     → Allow sliding (trust kinematics)            │  │
│   │     IF (high pressure + low confidence) → Prevent sliding (trust physics)│  │
│   │     IF (high pressure + high confidence) → Smooth interpolation         │  │
│   └─────────────────────────────────────────────────────────────────────────┘  │
│                                                                                 │
│   Output: Final tip positions T_i^(s)                                          │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
                                         │
                                         ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│  FINAL OUTPUT                                                                   │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│   │  Hand Pose   │  │ Object Shape │  │Object Motion │  │Contact Forces│       │
│   │  (28 DOF)    │  │ (TSDF→Mesh)  │  │  (6 DOF +    │  │  (per tip)   │       │
│   │              │  │              │  │  non-rigid)  │  │              │       │
│   └──────┬───────┘  └──────────────┘  └──────────────┘  └──────────────┘       │
│          │                                                                      │
│          ▼                                                                      │
│   ┌──────────────────────────────────────┐                                     │
│   │  MANO Hand Mesh (778 verts, 1538 faces)│                                    │
│   └──────────────────────────────────────┘                                     │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────────┐
│  OPTIONAL: LSTMPose Temporal Smoothing                                Port 8081 │
├─────────────────────────────────────────────────────────────────────────────────┤
│   Joint sequence ──▶ [LSTM Encoder-Decoder] ──▶ Smoothed joints (less jitter)   │
└─────────────────────────────────────────────────────────────────────────────────┘

TIMING (Real-time @ 25 FPS):
├─ Neural Network:        ~5ms  (GPU)
├─ Kinematic Tracking:   ~32ms  (GPU)
└─ Physics Refinement:    ~8ms  (CPU)
   Total:                ~40ms per frame
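The force and moment balance in Stage II can be illustrated with a small residual check. This is a minimal sketch, not the repository's solver: it only evaluates how far a given set of fingertip forces is from satisfying ΣF_i + mg = ma and Στ_i = Iα, and it ignores the E_regularize and E_contact terms. The function names, the argument layout, and the unweighted sum of squares are all assumptions made for illustration.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as sequences."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def balance_residuals(forces, points, com, mass, gravity, accel, inertia, alpha):
    """Return (force_residual, moment_residual) of the Newton-Euler balance.

    force_residual  = sum(F_i) + m*g - m*a
    moment_residual = sum((p_i - c) x F_i) - I*alpha
    `inertia` is a 3x3 nested list; moments are taken about `com`.
    """
    r_f = [mass * (gravity[k] - accel[k]) for k in range(3)]
    i_alpha = [sum(inertia[r][c] * alpha[c] for c in range(3)) for r in range(3)]
    r_t = [-x for x in i_alpha]
    for f, p in zip(forces, points):
        arm = [p[k] - com[k] for k in range(3)]
        tau = cross(arm, f)
        for k in range(3):
            r_f[k] += f[k]
            r_t[k] += tau[k]
    return r_f, r_t


def balance_energy(r_f, r_t):
    """Unweighted squared residual (hypothetical stand-in for E_force + E_moment)."""
    return sum(x * x for x in r_f) + sum(x * x for x in r_t)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    zero = (0.0, 0.0, 0.0)
    # A 1 kg object held still by two fingertips on opposite sides.
    tips = [(0.05, 0.0, 0.0), (-0.05, 0.0, 0.0)]
    forces = [(0.0, 4.905, 0.0), (0.0, 4.905, 0.0)]
    r_f, r_t = balance_residuals(forces, tips, zero, 1.0,
                                 (0.0, -9.81, 0.0), zero, identity, zero)
    print(balance_energy(r_f, r_t))
```

If one of the two contacts is dropped, the residuals become clearly non-zero, which is the kind of signal that suggests an occluded contact is missing.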

Key Features

| Feature | Description |
| --- | --- |
| No Object Template | Objects are reconstructed on-the-fly using TSDF fusion; works with arbitrary objects |
| MANO Hand Model | Uses the parametric MANO model (778 vertices) for a realistic hand mesh |
| Physics-based Refinement | Recovers occluded contacts by enforcing Newton-Euler dynamics |
| Real-time Performance | Runs at ~25 FPS on dual NVIDIA Titan Xp GPUs |
| Single Depth Camera | Only requires an Intel RealSense SR300 (no multi-view setup) |

Models Used

| Component | Model | Template Required? |
| --- | --- | --- |
| Hand | MANO / Sphere-Mesh | ✅ Yes (parametric model with shape/pose parameters) |
| Object | TSDF Voxel Grid | ❌ No (reconstructed from depth in real-time) |

Setup

Preparation

  • Windows 10
  • CUDA 8.0 and CUDA 10.0
  • Visual Studio 2015 (Later versions are not compatible with CUDA 8)
  • 1~2 CUDA graphics cards (total VRAM >= 20GB, with at least one card >= 12GB; two NVIDIA Titan Xp cards are used in our demo)
  • Intel RealSense SR300 Camera
  • A solid-color wristband (blue is used in our demo)

Setup Python Code

  • install CUDA 10.0 and the corresponding cuDNN

  • install dependencies

    cd Network/JointLearningNeuralNetwork
    pip install -r requirements.txt
    
  • download pre-trained network from https://drive.google.com/file/d/1wDdBegEpRqFUs0x_9zV6Rm4-Mk_Yajs3/view?usp=sharing, then extract Network.zip under the root directory.

  • run JointLearningNeuralNetwork (needs VRAM >= 12GB)

    cd Network/JointLearningNeuralNetwork
    python inference_server.py --gpu <gpu_id>
    
  • run LSTMPose

    cd Network/LSTMPose
    python inference_server.py --gpu <gpu_id>
    
  • Note: JointLearningNeuralNetwork and LSTMPose run as local servers on ports 8080 and 8081, respectively, so make sure both ports are free before starting them.
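Whether ports 8080 and 8081 are free can be checked before launching the servers. The following is a small sketch using only the Python standard library; the helper name `port_is_free` is made up for this example and is not part of the repository.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
        return True


if __name__ == "__main__":
    # 8080: JointLearningNeuralNetwork, 8081: LSTMPose (as stated above).
    for port in (8080, 8081):
        state = "free" if port_is_free(port) else "in use"
        print(f"port {port}: {state}")
```

The check is only a snapshot: another process could still grab the port between the check and the server start.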

Setup C++ Code

Linux Setup (Python Only)

This section describes how to run the neural networks and generate 3D hand meshes on Linux without the C++ code.

Requirements

  • Linux (tested on Ubuntu)
  • Python 3.8+
  • CUDA 10.0+ with cuDNN
  • GPU with ≥12GB VRAM

Step 1: Install Dependencies

cd Network/JointLearningNeuralNetwork
pip install -r requirements.txt
pip install scipy requests

Note: This fork includes TensorFlow 2.x compatibility fixes. The original code was written for TensorFlow 1.x.

Step 2: Download Pre-trained Models

Download the pre-trained networks from Google Drive, then extract:

# Extract Network.zip to get:
# - Network/JointLearningNeuralNetwork/model/
# - Network/LSTMPose/exp/
unzip Network.zip

Step 3: Start the Inference Server

cd Network/JointLearningNeuralNetwork
python inference_server.py --gpu 0

The server starts on port 8080 and provides:

  • Input: Depth image (320×240 pixels, i.e. 240 rows × 320 columns, 16-bit PNG)
  • Output: 21 hand joint positions + segmentation mask
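Before sending a file to the server, it can help to confirm that it really is a 16-bit grayscale PNG of the expected size. The sketch below reads only the PNG header with the standard library. The function names are hypothetical, and the default of 320 pixels wide by 240 high is an assumption taken from the pipeline diagram; it does not describe how the server itself validates input.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data):
    """Return (width, height, bit_depth, color_type) from raw PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # The first chunk must be IHDR: 4-byte length, 4-byte type, 13 bytes of data.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing or malformed IHDR chunk")
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return width, height, bit_depth, color_type


def looks_like_depth_input(data, width=320, height=240):
    """True for a 16-bit grayscale (color type 0) PNG of the expected size."""
    w, h, bits, color = read_png_header(data)
    return (w, h) == (width, height) and bits == 16 and color == 0


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1], "rb") as fh:
            print(looks_like_depth_input(fh.read(64)))
```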

Step 4: Generate 3D Hand Mesh

In a new terminal, run:

cd Network/JointLearningNeuralNetwork

# Generate hand mesh from depth image
python generate_hand_mesh.py --depth_image your_depth.png --output hand_mesh.obj

Arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --depth_image | Input depth image path | org_depth_img_init.png |
| --output | Output OBJ file path | hand_mesh.obj |
| --mano_path | Path to MANO model JSON | Auto-detected |
| --server_url | Inference server URL | http://127.0.0.1:8080 |

Output Files:

  • hand_mesh.obj - 3D hand mesh (778 vertices, 1538 faces)
  • hand_mesh_joints.obj - 21 joint positions as 3D points
  • hand_mesh_mask.png - Hand/object segmentation mask
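A quick sanity check on the generated mesh is to count the vertex and face records in the OBJ file and compare them with the 778 vertices and 1538 faces listed above. This is a minimal sketch with a hypothetical helper name; it only counts `v` and `f` lines and does not validate the geometry.

```python
def count_obj_elements(obj_text):
    """Count vertex ('v') and face ('f') records in Wavefront OBJ text."""
    vertices = 0
    faces = 0
    for line in obj_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "v":      # geometric vertex; 'vn'/'vt' are not counted
            vertices += 1
        elif parts[0] == "f":    # polygonal face
            faces += 1
    return vertices, faces


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1], "r") as fh:
            n_v, n_f = count_obj_elements(fh.read())
        print(f"{n_v} vertices, {n_f} faces")
```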

Step 5: Test the Server (Optional)

cd Network/JointLearningNeuralNetwork
python test_server.py

Pipeline Overview

Depth Image (240×320)
       │
       ▼
┌─────────────────────────┐
│  JointLearningNN        │  ← Neural network (inference_server.py)
│  (port 8080)            │
└───────────┬─────────────┘
            │ 21 joints (u, v, z)
            ▼
┌─────────────────────────┐
│  MANO Model Fitting     │  ← Python MANO (mano_model.py)
│  (generate_hand_mesh.py)│
└───────────┬─────────────┘
            │
            ▼
     hand_mesh.obj         ← 3D hand mesh (778 vertices)
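The network outputs joints as (u, v, z), that is, pixel coordinates plus depth. Turning them into 3D camera-space points is usually done with a pinhole back-projection. The sketch below shows that standard formula. The intrinsics `fx`, `fy`, `cx`, `cy` are placeholders that must come from your own camera calibration, and the conversion actually used in generate_hand_mesh.py may differ.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth z to camera space.

    The result is in the same unit as z.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)


def backproject_joints(joints_uvz, fx, fy, cx, cy):
    """Apply backproject() to a list of (u, v, z) joints."""
    return [backproject(u, v, z, fx, fy, cx, cy) for (u, v, z) in joints_uvz]


if __name__ == "__main__":
    # Placeholder intrinsics for a 320x240 image; NOT calibrated SR300 values.
    fx = fy = 240.0
    cx, cy = 160.0, 120.0
    print(backproject_joints([(160.0, 120.0, 500.0), (184.0, 120.0, 500.0)],
                             fx, fy, cx, cy))
```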

LSTMPose (Optional)

For temporal smoothing of hand poses across video frames:

cd Network/LSTMPose
python inference_server.py --gpu 0

This runs on port 8081 and smooths joint predictions over time.

Interactive 3D Viewer

A web-based interactive viewer for exploring 3D hand meshes:

cd Network/JointLearningNeuralNetwork

# Start the viewer (default: port 5000)
python interactive_viewer.py --port 5000

# Or specify a custom mesh file
python interactive_viewer.py --mesh hand_mesh.obj --joints hand_mesh_joints.obj

Then open http://localhost:5000 in your browser.

Features:

| Control | Action |
| --- | --- |
| Left click + drag | Rotate the model |
| Scroll wheel | Zoom in/out |
| Right click + drag | Pan |
| Reset View | Reset camera position |
| Toggle Wireframe | Show mesh wireframe |
| Toggle Joints | Show/hide 21 joint spheres |
| Toggle Skeleton | Show/hide skeleton lines |
| Auto Rotate | Enable auto-rotation |

Arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --port | Server port | 5000 |
| --mesh | Input OBJ mesh file | hand_from_depth.obj |
| --joints | Joint positions OBJ file | hand_from_depth_joints.obj |
| --host | Host address | 0.0.0.0 |

Visualize Mesh to Image

To render a mesh to a static image (without browser):

cd Network/JointLearningNeuralNetwork

# Render to PNG
python visualize_mesh.py --mesh hand_mesh.obj --output render.png

# Multi-view render
python visualize_mesh.py --mesh hand_mesh.obj --output multiview.png --multiview

# Include joint positions
python visualize_mesh.py --mesh hand_mesh.obj --joints hand_mesh_joints.obj --output render.png

Runtime

Configuration

  • set the wristband color in the file InteractionReconstruction/InteractionReconstruction/build/wristband.txt, and make sure the color lies between hsv_min and hsv_max
  • write the configuration file; an example is InteractionReconstruction/InteractionReconstruction/json_manuscript/IO_Parameters_online.json
  • pass the path to the configuration file as the command line argument (Properties->Debugging->Command Arguments)
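To check that a wristband color falls between hsv_min and hsv_max, the color can be converted to HSV and compared component by component. The sketch below assumes OpenCV-style ranges (H in 0..179, S and V in 0..255); that convention, the helper names, and the example bounds are assumptions, and the layout of wristband.txt is not modeled here. Hue wraparound (a range crossing red) is not handled.

```python
import colorsys


def rgb_to_opencv_hsv(r, g, b):
    """Convert 8-bit RGB to (H, S, V) with H in 0..179 and S, V in 0..255."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return (round(h * 180) % 180, round(s * 255), round(v * 255))


def hsv_in_range(hsv, hsv_min, hsv_max):
    """True if every HSV component lies within [hsv_min, hsv_max]."""
    return all(lo <= x <= hi for x, lo, hi in zip(hsv, hsv_min, hsv_max))


if __name__ == "__main__":
    # Hypothetical bounds for a blue wristband; tune them for your lighting.
    hsv_min, hsv_max = (100, 100, 50), (130, 255, 255)
    blue = rgb_to_opencv_hsv(0, 0, 255)
    print(blue, hsv_in_range(blue, hsv_min, hsv_max))
```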

Running

Online
  • wear the wristband and make sure that the wristband is always in the view of the sensor
  • run the code and start tracking hand
  • press R to start object reconstruction
  • rotate the object to gradually reconstruct it
  • press T when you think that the object is fully reconstructed
  • close the InteractionRecon window when you want to stop tracking
  • if store_org_data is set to true in the configuration file, the raw RGB-D data will be stored. The frame ids at which you pressed R and T will be shown afterwards.
Offline
  • set benchmark to false in the configuration file
  • set left_camera_file_mask to the file prefix of the image sequence you have stored
  • set recon_frame and stop_recon_frame to the frame ids at which you pressed R and T
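The offline settings above can be derived from an online configuration with a few lines of Python. This sketch assumes that the four keys sit at the top level of the JSON file, which has not been verified against IO_Parameters_online.json; the helper name, the file prefix, and the frame ids are placeholders.

```python
import copy
import json


def make_offline_config(online_cfg, file_mask, recon_frame, stop_recon_frame):
    """Return a copy of the config with the offline-mode keys set.

    Assumes the keys are top-level entries; adapt if the real file nests them.
    """
    cfg = copy.deepcopy(online_cfg)
    cfg["benchmark"] = False
    cfg["left_camera_file_mask"] = file_mask
    cfg["recon_frame"] = recon_frame
    cfg["stop_recon_frame"] = stop_recon_frame
    return cfg


if __name__ == "__main__":
    # Placeholder values; use the prefix and frame ids from your own recording.
    online = {"store_org_data": True}
    offline = make_offline_config(online, "my_sequence/frame_", 120, 480)
    print(json.dumps(offline, indent=2))
```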

Hand Calibration

Building...

Bibtex

@article{10.1145/3451341,
author = {Zhang, Hao and Zhou, Yuxiao and Tian, Yifei and Yong, Jun-Hai and Xu, Feng},
title = {Single Depth View Based Real-Time Reconstruction of Hand-Object Interactions},
year = {2021},
issue_date = {June 2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {40},
number = {3},
issn = {0730-0301},
url = {https://doi.org/10.1145/3451341},
doi = {10.1145/3451341},
month = {jul},
articleno = {29},
numpages = {12},
keywords = {hand-object interaction, hand tracking, Single depth camera, object reconstruction}
}
@inproceedings{10.1145/3550469.3555421,
author = {Hu, Haoyu and Yi, Xinyu and Zhang, Hao and Yong, Jun-Hai and Xu, Feng},
title = {Physical Interaction: Reconstructing Hand-Object Interactions with Physics},
year = {2022},
isbn = {9781450394703},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3550469.3555421},
doi = {10.1145/3550469.3555421},
booktitle = {SIGGRAPH Asia 2022 Conference Papers},
articleno = {43},
numpages = {9},
keywords = {single depth camera, physics-based interaction model, hand tracking, hand-object interaction},
location = {Daegu, Republic of Korea},
series = {SA '22 Conference Papers}
}

About

Code for "Physical Interaction: Reconstructing Hand-object Interactions with Physics", SIGGRAPH Asia 2022 Conference Track
