ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts

Project Page | Paper (arXiv)

Official PyTorch implementation of ShapeWords (CVPR 2025)

ShapeWords incorporates target 3D shape information into text prompts for guided image synthesis. Given a 3D shape (encoded with PointBERT tokens) and a text prompt, our Shape2CLIP module predicts a shape-aware offset to the CLIP prompt embedding that guides Stable Diffusion towards images that comply with both the target 3D shape and the text prompt.

Setup

Create the conda environment:

conda env create -f environment.yml
conda activate shapewords_source

Quickstart

Pretrained checkpoint

Download the pretrained Shape2CLIP checkpoint here, or via gdown:

gdown 1nvEXnwMpNkRts6rxVqMZt8i9FZ40KjP7 -O projection_model-0920192.pth

The checkpoint corresponds to Shape2CLIP(depth=6, heads=8, pb_dim=384) trained on PointBERT shape tokens (see geometry_guidance_models.py).

Demo

We provide a Gradio demo in the demo/ folder (also deployable as a Hugging Face ZeroGPU space). It downloads the Shape2CLIP checkpoint automatically on first run:

pip install -r demo/requirements.txt
cd demo && python app.py

The demo expects per-category PointBERT embedding files (<synset_id>_pb_embs.npz) in demo/embeddings/ — see the sample file in sample_data/shapenet_pointbert_tokens/ for the expected format (ids and embs arrays).
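Before launching the demo, it can help to confirm that an embedding file actually contains the two arrays named above. The sketch below relies only on the fact that an .npz file is a zip archive with one `<name>.npy` member per array, so it needs no NumPy and loads no data. The helper names `npz_array_names` and `missing_arrays` are hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path

# Array names stated in this README; shapes and dtypes are not checked here.
EXPECTED_ARRAYS = {"ids", "embs"}


def npz_array_names(path):
    """Return the names of the arrays stored in an .npz archive.

    Each array is stored as a `<name>.npy` member of the zip archive,
    so listing the members is enough to recover the array names.
    """
    with zipfile.ZipFile(path) as archive:
        return {
            Path(member).stem
            for member in archive.namelist()
            if member.endswith(".npy")
        }


def missing_arrays(path):
    """Return the expected array names that are absent from the file."""
    return EXPECTED_ARRAYS - npz_array_names(path)
```

An empty set from `missing_arrays` means both `ids` and `embs` are present. To read the arrays themselves, `numpy.load(path)` returns a mapping keyed by the same names.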

Training

Since our dataset is fairly large, we provide a command to run training on sample data: a few shapes from the bag category (synset 02773838) are included in sample_data/. For full-scale training, download our data and replace the paths accordingly (see Data below).

To train the Shape2CLIP guidance model on the sample data, run:

bash ./train_on_sample_data.sh

Model checkpoints will be saved in sample_outputs/.

To run full-scale training on a SLURM cluster, fill in data_root, the SBATCH header fields and HF_HOME in train_on_cluster.sh and submit it:

sbatch ./train_on_cluster.sh

Training is roughly based on the Hugging Face diffusers textual inversion example: we freeze the Stable Diffusion 2.1 VAE, U-Net and text encoder, and optimize only the Shape2CLIP model with the denoising objective, with optional timestep-dependent loss weighting (--weight_loss_by_t).

Data

The full preprocessed training data is available here:

https://console.cloud.google.com/storage/browser/shapewords_data

The training data layout (see sample_data/ for a working example):

controlnet_images_offset_all/<synset_id>/<synset_id>_<shape_id>/combined/ — ControlNet-generated training images per shape (angle_<view>_prompt_<prompt_id>_combo_<k>.jpg)

controlnet_images_offset_all/<synset_id>/<synset_id>_<shape_id>/depth/ — depth renders per view (depth_<view>.jpg)

shapenet_pointbert_tokens/<synset_id>_pb_embs.npz — PointBERT shape tokens per category (ids and embs arrays)

foreground.txt — stylized text prompt templates used during training

categories.json — mapping from ShapeNet synset ids to category names
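The training image naming pattern above can be parsed with a regular expression, for example when grouping images by view or prompt. This is a minimal sketch, not code from the repository: `parse_image_name` is a hypothetical helper, and it assumes that the `<view>`, `<prompt_id>` and `<k>` fields contain no underscores. The fields are returned as strings because the layout description does not state their types.

```python
import re

# Pattern for angle_<view>_prompt_<prompt_id>_combo_<k>.jpg
# Assumption: none of the three fields contains an underscore.
IMAGE_NAME = re.compile(
    r"^angle_(?P<view>[^_]+)_prompt_(?P<prompt_id>[^_]+)_combo_(?P<k>[^_.]+)\.jpg$"
)


def parse_image_name(name):
    """Split a training image file name into its fields.

    Returns a dict with the keys 'view', 'prompt_id' and 'k',
    or None when the name does not follow the pattern.
    """
    match = IMAGE_NAME.match(name)
    return match.groupdict() if match else None
```

Files that do not match, such as the depth renders, yield `None`, so the same helper can be used to filter a directory listing.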

Train/val/test splits are provided in stats/ (train.txt, val.txt, train_val.txt, test.txt), one <synset_id>_<shape_id> per line.
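A split file in this format can be grouped by category with a few lines of Python. The sketch below is illustrative and not taken from the repository: `parse_split` is a hypothetical helper, and it assumes that the synset id contains no underscore, so each line is split at its first underscore.

```python
from collections import defaultdict


def parse_split(lines):
    """Group '<synset_id>_<shape_id>' entries by synset id.

    `lines` is any iterable of strings, for example an open file.
    Blank lines are skipped. Each entry is split at the first
    underscore, so shape ids may themselves contain underscores.
    """
    by_synset = defaultdict(list)
    for line in lines:
        entry = line.strip()
        if not entry:
            continue
        synset_id, shape_id = entry.split("_", 1)
        by_synset[synset_id].append(shape_id)
    return dict(by_synset)
```

For example, `with open("stats/train.txt") as f: shapes = parse_split(f)` gives a dict that maps each synset id to the list of its shape ids in the training split.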

If you have any questions about the data, feel free to open an issue or email me (Dmitrii Petrov).

Citation

If you find our work useful, please cite the CVPR 2025 paper:

@InProceedings{Petrov_2025_CVPR,
    author    = {Petrov, Dmitry and Goyal, Pradyumn and Shivashok, Divyansh and Tao, Yuanming and Averkiou, Melinos and Kalogerakis, Evangelos},
    title     = {ShapeWords: Guiding Text-to-Image Synthesis with 3D Shape-Aware Prompts},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {13305-13314}
}

Acknowledgements

The attention implementation in geometry_guidance_models.py is based on 3DShape2VecSet by Biao Zhang and co-authors. The training script is roughly based on the Hugging Face diffusers textual inversion example. Shape embeddings are produced with encoders from the PointBERT project.
