I found that the pretraining phase in this code differs somewhat from my understanding of the paper. According to the two images below, only the image modality is trained with the intra-modality contrastive loss, with the aid of the semantic module.


However, the recommended pretraining command in the README differs: it activates both `--separate_text` and `--separate_image`. If I understand the paper correctly, only `--separate_image` should be used.
python -m main.run --logs="path/to/logs" --save-frequency 2 --report-to wandb --wandb-project-name="sample_project" --train-data="path/to/cc12m" --train-num-samples 10030127 --warmup 10000 --batch-size=512 --lr=1e-3 --wd=0.1 --epochs=30 --workers=2 --model "ViT-B-16" --precision amp --dataset-type webdataset --clip-inModality-loss --clip-loss --alpha=1 --beta=0.5 --nl_semantic_supervision --train-num-samples 10030127 --dataset-type webdataset --separate_text --separate_image
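
Under that reading, the command would presumably be the README command with `--separate_text` dropped and everything else unchanged. This is only a sketch based on my interpretation of the paper, not a command confirmed by the README or the authors:

```shell
# Assumption: identical to the README command, minus --separate_text.
# All other flags are copied verbatim, including the ones the README repeats.
python -m main.run --logs="path/to/logs" --save-frequency 2 \
  --report-to wandb --wandb-project-name="sample_project" \
  --train-data="path/to/cc12m" --train-num-samples 10030127 \
  --warmup 10000 --batch-size=512 --lr=1e-3 --wd=0.1 --epochs=30 \
  --workers=2 --model "ViT-B-16" --precision amp \
  --dataset-type webdataset --clip-inModality-loss --clip-loss \
  --alpha=1 --beta=0.5 --nl_semantic_supervision \
  --train-num-samples 10030127 --dataset-type webdataset \
  --separate_image
```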
The command above is quoted from AlignCLIP/README.md, line 26 in a18e805.