PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning (ICLR 2026)
The paper is available on OpenReview and arXiv.
PHyCLIP learns a single vision-language space that captures both:
- Taxonomic hierarchy (e.g., a dog is-a mammal) within each hyperbolic factor
- Cross-family compositionality (e.g., a dog in a car) via an $\ell_1$-product metric, like a Boolean algebra
- A unified geometric framework for hierarchy and compositionality.
- Drop-in CLIP-style training/evaluation pipeline.
- Extensive evaluation: zero-shot classification, retrieval, hierarchical classification, VL-Checklist, and SugarCrepe.
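The $\ell_1$-product metric mentioned above can be illustrated with a minimal sketch. It assumes each factor is a Lorentz (hyperboloid) model of curvature `-curv`, as in MERU, and that an embedding is a flat list split evenly into factors; the function names, the even split, and the shared curvature are illustrative assumptions, not the repository's actual API.

```python
import math


def lorentz_distance(u, v, curv=1.0):
    """Geodesic distance between two hyperboloid points given by their space components.

    Assumption: the time component is recovered as sqrt(1/curv + ||x||^2).
    """
    u_time = math.sqrt(1.0 / curv + sum(a * a for a in u))
    v_time = math.sqrt(1.0 / curv + sum(b * b for b in v))
    # Lorentzian inner product: space part minus time part.
    inner = sum(a * b for a, b in zip(u, v)) - u_time * v_time
    # Clamp to 1 to guard against rounding before acosh.
    return math.acosh(max(1.0, -curv * inner)) / math.sqrt(curv)


def l1_product_distance(x, y, num_factors, curv=1.0):
    """Sum of per-factor hyperbolic distances (the l1-product of the factor metrics)."""
    dim = len(x) // num_factors
    return sum(
        lorentz_distance(x[i * dim:(i + 1) * dim], y[i * dim:(i + 1) * dim], curv)
        for i in range(num_factors)
    )
```

Because the factor distances are added rather than combined in a Euclidean fashion, moving a point in one factor changes the total distance independently of the other factors.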
Conceptual diagram of hierarchical and compositional structures.
While all arrows represent entailments (
Overview of PHyCLIP.
Images and texts are encoded as points
@InProceedings{Yoshikawa_2026_ICLR,
author = {Daiki Yoshikawa and Takashi Matsubara},
title = {PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning},
booktitle = {Proceedings of the International Conference on Learning Representations (ICLR)},
month = {Apr.},
year = {2026},
}
Our code is adapted from MERU and HyCoCLIP.
All dependencies are listed in pyproject.toml.
To install them, run:
pip install -e .
See HyCoCLIP.
To train a PHyCLIP-ViT-B/16 model, run the following command:
python scripts/train.py --config configs/train_phyclip_vit_b.py --num-gpus 8
Hyperparameters can be modified in the config files or directly on the command line (e.g., append train.total_batch_size=768 to the command to change the batch size).
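To make the dotted override syntax concrete, here is a minimal sketch of how a `key.subkey=value` argument could be applied to a nested configuration. The helper `apply_overrides` and the default value of 512 are hypothetical and only illustrate the idea; the repository's own config system may parse overrides differently.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict (illustrative helper, not the repo's API)."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate levels if they do not exist yet.
            node = node.setdefault(name, {})
        try:
            # Interpret numbers, booleans, lists, etc. as Python literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Fall back to the raw string for anything that is not a literal.
            value = raw_value
        node[leaf] = value
    return config


# Hypothetical default; the real value lives in the config file.
cfg = {"train": {"total_batch_size": 512}}
apply_overrides(cfg, ["train.total_batch_size=768"])
```

After the call, `cfg["train"]["total_batch_size"]` holds the integer 768 rather than the string "768".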
The evaluation script auto-downloads and caches 18 (out of 20) datasets in ./datasets/eval. ImageNet and Stanford Cars must be set up manually as described below.
Download and symlink the ImageNet dataset (Torchvision ImageFolder style) at ./datasets/eval/imagenet. The Stanford Cars dataset also needs to be set up manually at ./datasets/eval/cars/stanford_cars, following the instructions in PyTorch issue 7545.
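Before launching an evaluation, it can help to confirm that the two manually prepared directories are in place. The following is a small sketch using only the paths named above; the helper name `missing_manual_datasets` is an assumption and is not part of the repository.

```python
from pathlib import Path

# Relative paths taken from the instructions above; both must be prepared by hand.
MANUAL_DATASET_DIRS = ("imagenet", "cars/stanford_cars")


def missing_manual_datasets(eval_root="./datasets/eval"):
    """Return the manually prepared dataset directories that do not exist yet."""
    root = Path(eval_root)
    # is_dir() follows symlinks, so a symlinked ImageNet folder counts as present.
    return [str(root / rel) for rel in MANUAL_DATASET_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    for path in missing_manual_datasets():
        print(f"missing: {path}")
```

This only checks that the directories exist; it does not validate their contents or layout.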
To evaluate a trained PHyCLIP-ViT-B/16 model, run the following command:
python scripts/evaluate.py --config configs/eval_zero_shot_classification.py \
--checkpoint-path checkpoints/phyclip_vit_b.pth \
--train-config configs/train_phyclip_vit_b.py
The following datasets are configured in the code: COCO captions and Flickr30k captions. Please refer to the documentation in phyclip/data/evaluation.py on how to arrange their files in ./datasets/coco and ./datasets/flickr30k. To evaluate PHyCLIP-ViT-B/16 on these two datasets, run the following command:
python scripts/evaluate.py --config configs/eval_zero_shot_retrieval.py \
--checkpoint-path checkpoints/phyclip_vit_b.pth \
--train-config configs/train_phyclip_vit_b.py
We use the WordNet hierarchy of the ImageNet class labels in ./assets/imagenet_synset for the hierarchical classification task. The ImageNet evaluation dataset needs to be set up as described for zero-shot classification above. To evaluate PHyCLIP-ViT-B/16 on this task, run the following command:
python scripts/evaluate.py --config configs/eval_hierarchical_metrics.py \
--checkpoint-path checkpoints/phyclip_vit_b.pth \
--train-config configs/train_phyclip_vit_b.py
Prepare the evaluation data according to the instructions in VL-Checklist.
Then run the evaluation:
python scripts/run_vl_checklist.py \
--checkpoint-path checkpoints/phyclip_vit_b.pth \
--train-config configs/train_phyclip_vit_b.py \
--vl-checklist-config configs/vl_checklist_config.yaml
The evaluation results will be saved in ./vl_checklist_results by default. You can modify the VL-Checklist configuration in configs/vl_checklist_config.yaml to customize the evaluation settings.
Ensure that the COCO val2017 images are available at ./datasets/eval/coco/val2017/.
Then run the evaluation:
python scripts/run_sugarcrepe.py \
--checkpoint-path checkpoints/phyclip_vit_b.pth \
--train-config configs/train_phyclip_vit_b.py
The results will be saved in {checkpoint_dir}/sugarcrepe_results/ as both JSON metrics and detailed scores.
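The saved JSON files can be collected for inspection with a few lines of Python. This is a sketch under assumptions: the file names and JSON schema are not documented here, so the helper simply loads every `*.json` file in the directory, and the name `load_json_results` is hypothetical.

```python
import json
from pathlib import Path


def load_json_results(results_dir):
    """Load every JSON file in results_dir into a dict keyed by file name.

    Assumption: results are stored as top-level *.json files; the actual
    file names and schema are determined by scripts/run_sugarcrepe.py.
    """
    results = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open() as f:
            results[path.name] = json.load(f)
    return results
```

With the checkpoint path used in the command above, the directory would be checkpoints/sugarcrepe_results, so `load_json_results("checkpoints/sugarcrepe_results")` would return whatever JSON files the script wrote there.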

