Simple image captioning model using CLIP and GPT-2

Run time and cost

Predictions run on Nvidia T4 GPU hardware. Predictions typically complete within 1 second.

CLIP prefix captioning.


To get optimal results for most images, please choose "conceptual captions" as the model and use beam search.
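
For context on the beam search recommendation: beam search keeps several candidate captions alive at each decoding step instead of committing to the single most likely next word. The sketch below is only an illustration of that idea. The toy next-word table (`TOY_MODEL`) and the function names are hypothetical stand-ins for the language model, not this model's actual decoder.

```python
import math

# Hypothetical toy "language model": next-word probabilities keyed by the last word.
TOY_MODEL = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.4, "cat": 0.3, "bird": 0.3},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def next_log_probs(tokens):
    """Return {next_token: log probability} given the tokens generated so far."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tokens[-1]].items()}


def beam_search(start, end_token, beam_size=2, max_len=10):
    """Keep the `beam_size` best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)

        # Keep the top candidates; set aside the ones that have ended.
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    finished.extend(beams)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda c: c[1])[0]


greedy = beam_search("<s>", "</s>", beam_size=1)
beam = beam_search("<s>", "</s>", beam_size=2)
print(greedy)  # ['<s>', 'a', 'dog', '</s>']   (probability 0.6 * 0.4 = 0.24)
print(beam)    # ['<s>', 'the', 'dog', '</s>'] (probability 0.4 * 0.9 = 0.36)
```

With a beam size of 1 the search is greedy and settles on the lower-probability sequence. A beam size of 2 recovers the more likely one, which is the general reason beam search tends to give better captions.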


Image captioning is a complicated task. It usually relies on a pretrained detection network, which requires additional supervision in the form of object annotations. The features of the detected objects are then fed to an additional network that is trained to output the correct caption. We present a new approach that does not require additional information (i.e. it requires only images and captions), and thus can be applied to any data. In addition, our model trains much faster than similar methods while achieving close to state-of-the-art results, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the CLIP model, which was already trained on an extremely large number of images and is thus capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences, we fine-tune a pretrained language model, an approach that has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions: a simple Multi-Layer Perceptron (MLP) maps the raw encoding to the prefix, and the language model is then fine-tuned to generate a valid caption.
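
The prefix idea can be sketched as follows. This is a dependency-free illustration of the data flow and shapes only, not the project's implementation, which would use trained neural network layers. The sizes, the tanh activation, the hidden width, and all function names here are assumptions chosen for readability.

```python
import math
import random


def init_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def map_clip_to_prefix(clip_embedding, layers, prefix_length, embed_dim):
    """MLP (linear -> tanh -> linear), then split the output into
    `prefix_length` vectors that live in the language model's embedding space."""
    hidden = [math.tanh(h) for h in linear(clip_embedding, layers[0])]
    flat = linear(hidden, layers[1])
    return [flat[i * embed_dim:(i + 1) * embed_dim]
            for i in range(prefix_length)]


def build_lm_input(prefix, caption_token_embeddings):
    """Prepend the prefix vectors to the caption's token embeddings."""
    return prefix + caption_token_embeddings


# Toy sizes (assumed); real image and word embeddings are far larger.
CLIP_DIM, EMBED_DIM, PREFIX_LENGTH = 8, 4, 3
rng = random.Random(0)
hidden_dim = (EMBED_DIM * PREFIX_LENGTH) // 2
layers = [
    init_linear(CLIP_DIM, hidden_dim, rng),
    init_linear(hidden_dim, EMBED_DIM * PREFIX_LENGTH, rng),
]

clip_embedding = [rng.gauss(0, 1) for _ in range(CLIP_DIM)]  # stand-in for a CLIP image encoding
caption_embeddings = [[0.0] * EMBED_DIM for _ in range(5)]   # stand-in for 5 caption tokens

prefix = map_clip_to_prefix(clip_embedding, layers, PREFIX_LENGTH, EMBED_DIM)
lm_input = build_lm_input(prefix, caption_embeddings)
print(len(prefix), len(prefix[0]), len(lm_input))  # 3 4 8
```

During training, the language model would see `lm_input` and be optimized to predict the caption tokens that follow the prefix. At inference time only the prefix is supplied and the caption is generated from it.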


This project was created by Ron Mokady and Amir Hertz for the Advanced-NLP course by Omer Levy @ TAU.
This repository is heavily based on the CLIP and Hugging Face repositories.
For training, we used the COCO and Conceptual Captions datasets.
The project was also inspired by this paper.


For any inquiry please contact us at our email addresses: ron.mokady@gmail.com or amirhertz@mail.tau.ac.il.