Session 7: Learning Transferable Visual Models From Natural Language Supervision
The topic for this session is CLIP: “Learning Transferable Visual Models From Natural Language Supervision” by Radford et al., 2021.
Some notes
- Supervision: matching text descriptions with images.
- contrastive representation learning (motivation - section 2.3)
- Model: simplified version of ConVIRT trained from scratch
- removed the non-linear projection between the representation and the contrastive embedding space
- We use only a linear projection to map from each encoder’s representation to the multi-modal embedding space.
- removed the text transformation function, which uniformly samples a single sentence from the text, since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence.
- simplified the image transformation function t_v
- Augmentation: a random square crop from resized images
- the temperature parameter τ, which controls the range of the logits in the softmax, is directly optimized during training
- Image Encoders:
- Based on ResNet-50 + ResNet-D + antialiased rect-2 blur pooling + attention pooling (justification?)
- Vision Transformer with minor modifications (justification?)
- Text encoder: Transformer + modifications. Page 5. Some justification is “left as future work”
- New dataset “WIT” (for WebImageText): 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.
- Model scaling (page 5).
For the (ResNet) image encoders they scale depth, width and resolution, as EfficientNet does. For the text encoder they scale only width.
- ResNet-50, ResNet-101, RN50x4, RN50x16, RN50x64.
- ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
- How do sizes of those models compare?
- RN50x64: trained for 18 days on 592 GPUs, 18 × 592 ≈ 10k GPU-days.
- ViT-L/14@336px: trained for 12 days on 256 GPUs, 12 × 256 ≈ 3k GPU-days. Also trained for one more epoch. :-D This is the model reported as “CLIP”.
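The contrastive setup from the model notes (linear projections only, a learned temperature τ) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the names `clip_style_loss`, `w_image`, `w_text` and `log_temperature` are assumptions made for illustration, and the two encoders are replaced by precomputed feature matrices. The temperature is kept as a log-scale scalar and exponentiated, so the logit scale stays positive.

```python
import numpy as np

def l2_normalize(x, axis=1):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_features, text_features, w_image, w_text, log_temperature):
    # Linear projections only (no non-linear head) into the shared embedding space.
    image_emb = l2_normalize(image_features @ w_image)
    text_emb = l2_normalize(text_features @ w_text)
    # Pairwise cosine similarities, scaled by the learned temperature.
    logits = image_emb @ text_emb.T * np.exp(log_temperature)
    # The i-th image matches the i-th text, so the targets lie on the diagonal.
    diag = np.arange(logits.shape[0])
    loss_images = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_texts = -log_softmax(logits, axis=0)[diag, diag].mean()
    # Symmetric cross-entropy over both directions.
    return (loss_images + loss_texts) / 2

# Toy usage with random stand-in features (batch of 8, hypothetical sizes).
rng = np.random.default_rng(0)
image_features = rng.normal(size=(8, 32))
text_features = rng.normal(size=(8, 16))
w_image = rng.normal(size=(32, 4))
w_text = rng.normal(size=(16, 4))
print(clip_style_loss(image_features, text_features, w_image, w_text, np.log(10.0)))
```

With perfectly aligned pairs and a large logit scale the loss approaches 0; with a very small scale all logits become nearly equal and the loss approaches log(batch size).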
Experiments
- Visual N-Grams (2017) - outperformed by a large margin
- Prompt engineering (text query engineering) - significant impact on performance.
- Analysis of zero-shot performance. Compared with logistic regression trained on the features from ResNet-50 (is it a reasonable comparison?). Better on a little over half of the datasets (16/27). Is it good?
- Few shot classifiers
- Representation learning - evaluated with linear classifiers. Compared with other models. Were the other models trained on comparable datasets?
- 3.3 Robustness to Natural Distribution Shift
- What is “logit-transformed accuracy” (end of p. 13)?
- 7.1. Bias - mostly caused by the training dataset?
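The zero-shot and prompt-engineering notes can be made concrete with a small sketch of how class names become a classifier. This is illustrative only: `build_prompts` and `zero_shot_classify` are hypothetical names, the embeddings are toy vectors standing in for real encoder outputs, and the logit scale of 100 is an assumed value. The template "A photo of a {}." follows the default prompt described in the paper.

```python
import numpy as np

def build_prompts(class_names, template="A photo of a {}."):
    # Turn bare class names into natural-language prompts.
    return [template.format(name) for name in class_names]

def zero_shot_classify(image_emb, text_embs, scale=100.0):
    # Normalize so the dot product is a cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # One logit per class prompt; `scale` plays the role of the inverse temperature.
    logits = scale * (text_embs @ image_emb)
    logits = logits - logits.max()  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(probs.argmax()), probs

# Toy usage: in practice the prompts would go through the text encoder and
# the image through the image encoder; here the vectors are made up.
class_names = ["dog", "cat"]
prompts = build_prompts(class_names)
toy_text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])  # stand-in for encoded prompts
toy_image_emb = np.array([0.9, 0.1])                # stand-in for an encoded image
predicted, probs = zero_shot_classify(toy_image_emb, toy_text_embs)
print(prompts[predicted], probs)
```

Changing the template changes the text embeddings and therefore the classifier weights, which is why the wording of the prompt affects accuracy.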