Official PyTorch implementation of ShapeWords (CVPR 2025)
ShapeWords incorporates target 3D shape information into text prompts for guided image synthesis. Given a 3D shape (encoded with PointBERT tokens) and a text prompt, our Shape2CLIP module predicts a shape-aware offset to the CLIP prompt embedding that guides Stable Diffusion towards images that comply with both the target 3D shape and the text prompt.
Create the conda environment:
conda env create -f environment.yml
conda activate shapewords_source
Download the pretrained Shape2CLIP checkpoint here, or via gdown:
gdown 1nvEXnwMpNkRts6rxVqMZt8i9FZ40KjP7 -O projection_model-0920192.pth
The checkpoint corresponds to Shape2CLIP(depth=6, heads=8, pb_dim=384) trained on PointBERT shape tokens (see geometry_guidance_models.py).
We provide a Gradio demo in the demo/ folder (also deployable as a Hugging Face ZeroGPU space). It downloads the Shape2CLIP checkpoint automatically on first run:
pip install -r demo/requirements.txt
cd demo && python app.py
The demo expects per-category PointBERT embedding files (<synset_id>_pb_embs.npz) in demo/embeddings/ — see the sample file in sample_data/shapenet_pointbert_tokens/ for the expected format (ids and embs arrays).
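As a rough illustration of the embedding file format described above, the sketch below writes a placeholder .npz with ids and embs arrays and loads it back into a lookup table. Only the two array names come from this README; the array shapes, the dtype, the shape_0-style ids and the helper names are assumptions made for the example, so check the sample file for the real values.

```python
import os
import tempfile

import numpy as np

# Assumed sizes for illustration only; the real token count and width may differ.
NUM_SHAPES, NUM_TOKENS, TOKEN_DIM = 3, 4, 384


def write_dummy_embeddings(path):
    """Write a placeholder file containing the two arrays named in the README."""
    ids = np.array([f"shape_{i}" for i in range(NUM_SHAPES)])  # hypothetical ids
    embs = np.zeros((NUM_SHAPES, NUM_TOKENS, TOKEN_DIM), dtype=np.float32)
    np.savez(path, ids=ids, embs=embs)


def load_embeddings(path):
    """Return a dict mapping each shape id to its embedding array."""
    with np.load(path) as data:
        ids, embs = data["ids"], data["embs"]
    if len(ids) != len(embs):
        raise ValueError("ids and embs must have the same length")
    return {str(shape_id): emb for shape_id, emb in zip(ids, embs)}


tmp_dir = tempfile.mkdtemp()
npz_path = os.path.join(tmp_dir, "02773838_pb_embs.npz")
write_dummy_embeddings(npz_path)
lookup = load_embeddings(npz_path)
print(len(lookup), lookup["shape_0"].shape)
```

Running load_embeddings on the file in sample_data/shapenet_pointbert_tokens/ and printing the shapes is a quick way to confirm the layout before preparing your own categories.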
Because the full dataset is large, we provide a command that trains on sample data: a few shapes from the 02773838 (bag) category are included in sample_data/. For full-scale training, download our data and replace the paths accordingly (see Data below).
To train the Shape2CLIP guidance model on the sample data, run:
bash ./train_on_sample_data.sh
Model checkpoints will be saved in sample_outputs/.
To run full-scale training on a SLURM cluster, fill in data_root, the SBATCH header fields and HF_HOME in train_on_cluster.sh and submit it:
sbatch ./train_on_cluster.sh
Training is roughly based on the Hugging Face diffusers textual inversion example: we freeze the Stable Diffusion 2.1 VAE, U-Net and text encoder, and optimize only the Shape2CLIP model with the denoising objective, with optional timestep-dependent loss weighting (--weight_loss_by_t).
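To make the idea of a timestep-weighted denoising loss concrete, here is a small NumPy sketch. It is not the training script's code: the linear weight t / T, the 1000-step schedule and both function names are assumptions chosen for illustration, and the weighting actually applied by --weight_loss_by_t may differ.

```python
import numpy as np

NUM_TRAIN_TIMESTEPS = 1000  # assumed scheduler length


def timestep_weight(t):
    """Hypothetical weighting: a linear ramp in t / T (not the repo's formula)."""
    return t.astype(np.float64) / NUM_TRAIN_TIMESTEPS


def denoising_loss(noise_pred, noise, t, weight_by_t=False):
    """Mean squared error between predicted and true noise, optionally weighted per sample."""
    # Average the squared error over every axis except the batch axis.
    axes = tuple(range(1, noise.ndim))
    per_sample = np.mean((noise_pred - noise) ** 2, axis=axes)
    if weight_by_t:
        per_sample = per_sample * timestep_weight(t)
    return float(per_sample.mean())


rng = np.random.default_rng(0)
noise = rng.standard_normal((2, 4, 8, 8))  # stand-in for latent noise
noise_pred = noise + 1.0                   # stand-in prediction with a constant error
t = np.array([250, 750])                   # one sampled timestep per batch element
print(denoising_loss(noise_pred, noise, t))
print(denoising_loss(noise_pred, noise, t, weight_by_t=True))
```

In the real setup, the gradient of this loss would flow only into Shape2CLIP, since the VAE, U-Net and text encoder are frozen.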
The full preprocessed training data is available here:
https://console.cloud.google.com/storage/browser/shapewords_data
The training data layout (see sample_data/ for a working example):
controlnet_images_offset_all/<synset_id>/<synset_id>_<shape_id>/combined/ — ControlNet-generated training images per shape (angle_<view>_prompt_<prompt_id>_combo_<k>.jpg)
controlnet_images_offset_all/<synset_id>/<synset_id>_<shape_id>/depth/ — depth renders per view (depth_<view>.jpg)
shapenet_pointbert_tokens/<synset_id>_pb_embs.npz — PointBERT shape tokens per category (ids and embs arrays)
foreground.txt — stylized text prompt templates used during training
categories.json — mapping from ShapeNet synset ids to category names
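The file naming patterns in the layout above can be parsed with a regular expression, sketched below. The sketch assumes that the view, prompt_id and k fields contain no underscores and that an image's view field matches the view in the corresponding depth file name; the function names and the example file name are made up for illustration.

```python
import re

# Pattern for angle_<view>_prompt_<prompt_id>_combo_<k>.jpg
# (assumes no field contains an underscore).
IMAGE_RE = re.compile(
    r"^angle_(?P<view>[^_]+)_prompt_(?P<prompt_id>[^_]+)_combo_(?P<k>[^_]+)\.jpg$"
)


def parse_image_name(name):
    """Return the fields of a training image name, or None if it does not match."""
    match = IMAGE_RE.match(name)
    return match.groupdict() if match else None


def depth_name_for(view):
    """Build the depth render name for a view (assumes the view ids line up)."""
    return f"depth_{view}.jpg"


info = parse_image_name("angle_3_prompt_12_combo_0.jpg")  # made-up file name
print(info, depth_name_for(info["view"]))
```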
Train/val/test splits are provided in stats/ (train.txt, val.txt, train_val.txt, test.txt), one <synset_id>_<shape_id> per line.
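A split file can be turned into per-shape directories with a few lines of Python, as sketched below. The helper names, the /data/shapewords root and the two split entries are hypothetical, and the sketch assumes the synset id itself contains no underscore, so the first underscore separates it from the shape id.

```python
import os
import tempfile
from pathlib import Path


def read_split(path):
    """Return the non-empty lines of a split file, one entry per shape."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def shape_dirs(data_root, entry):
    """Map a <synset_id>_<shape_id> entry to its image and depth folders."""
    synset_id, _, shape_id = entry.partition("_")  # split on the first underscore
    base = Path(data_root) / "controlnet_images_offset_all" / synset_id / entry
    return {
        "synset_id": synset_id,
        "shape_id": shape_id,
        "combined": base / "combined",
        "depth": base / "depth",
    }


# Write a tiny split file with made-up shape ids to exercise the helpers.
split_path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(split_path, "w") as f:
    f.write("02773838_abc123\n02773838_def456\n\n")

entries = read_split(split_path)
dirs = shape_dirs("/data/shapewords", entries[0])
print(entries, dirs["combined"])
```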
If you have any questions about the data, feel free to open an issue or email me (Dmitrii Petrov).
If you find our work useful, please cite the CVPR 2025 paper:
@InProceedings{Petrov_2025_CVPR,
author = {Petrov, Dmitry and Goyal, Pradyumn and Shivashok, Divyansh and Tao, Yuanming and Averkiou, Melinos and Kalogerakis, Evangelos},
title = {ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2025},
pages = {13305-13314}
}
The attention implementation in geometry_guidance_models.py is based on 3DShape2VecSet by Biao Zhang and co-authors. The training script is roughly based on the Hugging Face diffusers textual inversion example. Shape embeddings are produced with encoders from the PointBERT project.