This code was tested with Python 3.10.0 and PyTorch 2.1.0+cu121 on Ubuntu 22.04:
conda create -n raap python=3.10
conda activate raap
# PyTorch 2.1.0 with CUDA 12.1
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
# Note: if your CUDA version differs, install the PyTorch build that matches it instead
pip install -e vision/GroundedSAM/GroundingDINO
pip install -e vision/GroundedSAM/segment_anything
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -P assets/ckpts/
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -P assets/ckpts/
pip install -r requirements.txt
The default dataset is included in this repository: datasets/droid_masked_images/ (per-task images, splits, and similarity JSONs) and datasets/droid_masked_images_features.h5 (feature gallery for retrieval).
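Before training, it can help to confirm that the downloaded checkpoints and the default dataset are where the steps above put them. The following is a minimal sketch, not part of the repository: the helper name missing_assets is hypothetical, and it assumes the script is run from the repository root. The paths themselves are the ones listed above.

```python
from pathlib import Path

# Paths taken from the setup steps above, relative to the repository root.
EXPECTED_PATHS = [
    "assets/ckpts/sam_vit_h_4b8939.pth",
    "assets/ckpts/groundingdino_swint_ogc.pth",
    "datasets/droid_masked_images",
    "datasets/droid_masked_images_features.h5",
]


def missing_assets(repo_root="."):
    """Return the expected paths that do not exist under repo_root (hypothetical helper)."""
    root = Path(repo_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_assets()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All checkpoints and default dataset files found.")
```

An empty result only means the files exist; it does not verify that the downloads are complete or uncorrupted.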
To use your own data, set data_path in configs/config.yaml.
By default, the model is trained on the task “open the drawer.”
The task can be modified by adjusting task_filter in configs/config.yaml.
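To see what data_path and task_filter are currently set to before editing them, a small script along these lines may be useful. It is only a sketch under assumptions: the layout of configs/config.yaml is not described here, so the hypothetical helper find_key searches for a key at any nesting depth, and show_settings assumes PyYAML is installed in the environment.

```python
def find_key(node, key):
    """Return the first non-None value stored under `key` at any depth, else None."""
    if isinstance(node, dict):
        if node.get(key) is not None:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def show_settings(config_path="configs/config.yaml"):
    """Print the current data_path and task_filter values (hypothetical helper)."""
    import yaml  # PyYAML; assumed to be available, not confirmed by this README

    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    for key in ("data_path", "task_filter"):
        print(f"{key}: {find_key(cfg, key)!r}")


# Usage, from the repository root:
# show_settings("configs/config.yaml")
```

The script only reads the file. Editing the YAML by hand keeps any comments in it, which a load-and-dump round trip through PyYAML would discard.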
Training:
python train_transformer.py --config configs/config.yaml
Validation:
python val_transformer.py --config configs/config.yaml
Single image inference:
python inference.py --image demo/drawer.jpg --task open_the_drawer --enable-affordance --prompt "open drawer" --config configs/config.yaml
If config.model.K == 0, --task is optional and the model runs without retrieval:
python inference.py --image demo/drawer.jpg --config configs/config.yaml
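The relationship between config.model.K and --task can be summarized as a small check. This is an illustrative sketch rather than the logic in inference.py: the function name uses_retrieval is hypothetical, and it assumes that --task becomes required whenever K is greater than 0, which the rule above implies but does not state outright.

```python
def uses_retrieval(k, task=None):
    """Decide whether retrieval runs, given config.model.K and the --task value.

    Assumption: K == 0 disables retrieval, and K > 0 needs a task name.
    """
    if k == 0:
        # No retrieval; --task may be omitted.
        return False
    if task is None:
        raise ValueError("--task is required when config.model.K > 0")
    return True


print(uses_retrieval(0))                     # retrieval disabled
print(uses_retrieval(5, "open_the_drawer"))  # retrieval enabled
```

Under this reading, the second command above is valid only for a config with K set to 0, while the first works with any K.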