Official implementation of Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors (TPPT), accepted at TMLR.
Authors: Haodong Lu, Xinyu Zhang, Kristen Moore, Jason Xue, Lina Yao, Anton van den Hengel, Dong Gong
Continual learning (CL) enables deep neural networks to acquire new knowledge over time while mitigating catastrophic forgetting of previously learned information. The powerful generalization ability of pre-trained models (PTMs), such as the Contrastive Language-Image Pre-training (CLIP) model, has inspired a range of CL methods targeting new and specialized tasks, further bridging the gap between PTMs and continual adaptation.
Leveraging its multi-modal visual and textual representations, CLIP offers a natural paradigm for CL, where new tasks can be accommodated by incrementally learning lightweight parameters, particularly prompts. However, existing prompt-based CL methods for PTMs often rely on complex designs built upon specific assumptions, such as intricate regularization schemes for prompt pools, specialized routing mechanisms, or multi-stage incrementation processes. While these approaches improve performance, they frequently introduce additional—and possibly unnecessary—complexity, underutilizing CLIP's intrinsic capabilities.
We propose a concise CL approach for CLIP based on incremental prompt tuning that fully exploits its multi-modal structure and the stability of textual representations. Our method, Textual Prototype-guided Prompt Tuning (TPPT), introduces textual prototypes not merely as static classifiers, as in existing methods, but as stable anchors to guide the learning of visual prompts, thereby shaping the embedding space (i.e., TPPT-V). We show that our bidirectional supervision strategy enables more effective learning of new knowledge while reducing forgetting. To further close the vision-language gap during CL, we activate the language branch and extend our approach to jointly optimize both visual and textual prompts (i.e., TPPT-VT). We also introduce a relational diversity regularization on the textual anchors to prevent embedding space collapse and mitigate correlated forgetting. Extensive experiments and analyses demonstrate the effectiveness of our proposed approach, highlighting the benefits of leveraging CLIP's intrinsic guidance for continual adaptation.
Conceptual illustrations of (a) standard Cross-Entropy (CE), (b) our proposed TPPT-V, (c) a naïve multi-modal extension of TPPT-V, and (d) our proposed TPPT-VT.
- (a) Standard CE: Prior methods use CE loss to adapt PTMs, but suffer from representation drift, leading to forgetting.
- (b) TPPT-V: Introduces a textual prototypical contrastive loss to anchor visual features and mitigate drift.
- (c) Naïve Extension: Also tuning the textual prompts may improve textual prototype quality, but risks collapse to trivial solutions.
- (d) TPPT-VT: Addresses this by regularizing multi-modal prompt learning with diversity constraints on textual prototypes.
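The anchoring idea in (b) can be sketched in a few lines of plain Python. This is a minimal, framework-free illustration and not the repository's implementation. It assumes cosine similarity, a temperature of 0.07, and a symmetric image-to-text plus text-to-image cross-entropy as one possible reading of "bidirectional supervision". All function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_logits(rows, cols, temperature):
    """Temperature-scaled cosine similarity between every row and every column."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) / temperature for c in cols]
            for r in rows]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def image_to_text_loss(visual_feats, labels, text_protos, temperature=0.07):
    """Each image should select its own class prototype among all prototypes."""
    logits = cosine_logits(visual_feats, text_protos, temperature)
    losses = [-log_softmax(row)[y] for row, y in zip(logits, labels)]
    return sum(losses) / len(losses)


def text_to_image_loss(visual_feats, labels, text_protos, temperature=0.07):
    """Each prototype present in the batch should select its own images."""
    logits = cosine_logits(text_protos, visual_feats, temperature)
    losses = []
    for class_id, row in enumerate(logits):
        positives = [i for i, y in enumerate(labels) if y == class_id]
        if not positives:  # class absent from this batch
            continue
        log_probs = log_softmax(row)
        losses.append(-sum(log_probs[i] for i in positives) / len(positives))
    return sum(losses) / len(losses)


def bidirectional_loss(visual_feats, labels, text_protos, temperature=0.07):
    """Assumed symmetric combination of the two directions."""
    return (image_to_text_loss(visual_feats, labels, text_protos, temperature)
            + text_to_image_loss(visual_feats, labels, text_protos, temperature))


# Toy example: two class prototypes and two visual features in 2-D.
protos = [[1.0, 0.0], [0.0, 1.0]]
feats = [[0.9, 0.1], [0.1, 0.9]]
aligned = bidirectional_loss(feats, [0, 1], protos)
misaligned = bidirectional_loss(feats, [1, 0], protos)
```

In this toy setting the loss is close to zero when each feature sits near its own prototype and much larger when the labels are swapped, which is the pull toward fixed textual anchors that (b) describes.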
Overall framework of our two proposed methods.
- TPPT-V: Static textual prototypes guide the learned visual representations. Because the prototypes stay consistent, they prevent the representations from drifting in the embedding space, which alleviates forgetting.
- TPPT-VT: To improve upon the static textual prototypes, we learn textual prompts for the prototypes and regularize the learning process by encouraging diversity.
- Key Advantage: Benefiting from the textual prototype anchors, our methods remain simple yet effective, unlike previous methods that rely on delicate, complex designs.
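To make the diversity idea in TPPT-VT concrete, the sketch below penalizes textual prototypes that point in similar directions. It is a simple stand-in (mean squared off-diagonal cosine similarity) chosen for illustration; the relational diversity regularization used in the paper and repository may be formulated differently. The function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diversity_penalty(text_protos):
    """Mean squared cosine similarity between distinct prototypes.

    0.0 means all prototypes are mutually orthogonal; values near 1.0 mean
    they have collapsed onto (nearly) the same direction.
    """
    protos = [normalize(p) for p in text_protos]
    n = len(protos)
    if n < 2:
        return 0.0  # nothing to compare
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = sum(a * b for a, b in zip(protos[i], protos[j]))
                total += sim * sim
    return total / (n * (n - 1))


spread_out = diversity_penalty([[1.0, 0.0], [0.0, 1.0]])
collapsed = diversity_penalty([[1.0, 1.0], [2.0, 2.0]])
```

Adding a term like this to the training loss, with a weight chosen as a hyperparameter, discourages the trivial solution in which all learned textual prototypes become interchangeable.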
# Clone the repository
git clone https://github.com/jeff024/tppt.git
cd tppt
# Create conda environment
conda create -n tppt python=3.8
conda activate tppt
# Install other dependencies
pip install -r requirements.txt
We provide processed datasets for download:
- Aircraft:
- CIFAR-100: Automatically downloaded by the code
- Cars:
- CUB-200:
- ImageNet-R:
General usage:
python main.py --config=./configs/{method}/{dataset}.yaml
Where:
- {method} can be tppt_v or tppt_vt
- {dataset} can be aircraft, cars, cf100, cub, or inr
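For running several configurations, the command line above can be assembled programmatically. The helper below is a hypothetical convenience, not part of the repository. It only relies on the method and dataset names listed above and on the documented `--config` path pattern.

```python
METHODS = ("tppt_v", "tppt_vt")
DATASETS = ("aircraft", "cars", "cf100", "cub", "inr")


def build_command(method, dataset):
    """Return the argument list for one training run (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {DATASETS}")
    return ["python", "main.py", f"--config=./configs/{method}/{dataset}.yaml"]


# One command per (method, dataset) pair, in a fixed order.
all_commands = [build_command(m, d) for m in METHODS for d in DATASETS]
```

Each entry of `all_commands` can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the runs one after another.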
Example: Training TPPT-VT on fine-grained aircraft classification:
python main.py --config=./configs/tppt_vt/aircraft.yaml
Edit the YAML configuration files in configs/ to customize:
- Dataset settings: dataset name
- Model settings: backbone type, pretrained weights
- Training hyperparameters: batch size, learning rate, epochs
- Prompt settings: prompt depth, prompt length, number of prompts
@article{lu2025tppt,
title={Continual Learning on CLIP via Incremental Prompt Tuning with Intrinsic Textual Anchors},
author={Lu, Haodong and Zhang, Xinyu and Moore, Kristen and Xue, Jason and Yao, Lina and van den Hengel, Anton and Gong, Dong},
journal={Transactions on Machine Learning Research},
year={2025},
url={https://openreview.net/forum?id=YJnjkzKq5Y}
}
TPPT is released under the Apache License 2.0. See LICENSE for details.
Our repository benefits from LAMDA-PILOT and open-clip. We thank them for their wonderful work.

