Session 7: Learning Transferable Visual Models From Natural Language Supervision

The topic for this session is CLIP “Learning Transferable Visual Models From Natural Language Supervision” by Radford er al, 2021

https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf

https://openai.com/blog/clip/

Some notes

Supervision: text-description matching with image.
- contrastive representation learning (motivation - section 2.3)
Model: simplified version of ConVIRT trained from scratch
- removed non-linear projection between the representation and the contrastive embedding space*
- We use only a linear projection to map from each encoder’s representation to the multi-modal embedding space.
- removed the text transformation function which samples a single sentence at uniform from the text since many of the (image, text) pairs in CLIP’s pre-training dataset are only a single sentence.
- simplified image transformation function tv.
- Augmentation: A random square crop from resized images
- the temperature parameter controls the range of the logits in the softmax ,τ, directly optimized during training
Image Encoders:
- Based on Rsnet50 + Resnet-D + antialiased rect-2 blur pool + attention pooling (justification?)
- Vision Transformer with minor modifications (justification?)
Text encoder: Transformer + modifications. Page 5. Some justification is “left as future work”
New dataset “WIT for WebImageText”: 400 million (image,text) pairs collected form a variety of publicly avail ablesources on the Internet.
Model scaling (page5). For the image encoder they scale depth, width and resolution, like efficientnet does. For text encoder they scale only width.
- Resnet-50, 101, RN50x4, RN50x16, RN50x64.
- ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
- How do sizes of those models compare?
- RN50x64. trained for 18*592= 10k days
- ViT-L/14@336 trained for 12×256=3k days. Also trained for one more epoch. :-D - this is CLIP/

Expetiments

Visual N-Grams (2017) - outperformed by a large margin
Prompt (text query engineering) - significant impact on the performance.
Analysis of Zero-shot performance. Compare with logistic regression trained on the features from Resnet-50. (is it a reasonable comparison?). Better in approximately half of the datasets (16/27). Is it good?
Few shot classifiers
Representation learning - evaluated with linear classifiers. Compared with other models. Were other models trained on a comparable datasets?
3.3 Robustness to Natural Distribution Shift/
- What is “logit-transformed accuracy” (end of p 13)?
7.1. Bias - mostly caused by the training dataset?