yoadtew / zero-shot-image-to-text

Image-to-text generation

  • Public
  • 6.7K runs
  • T4

Input

image
file (required)

input image

string

conditional text

Default: "Image of a"

integer
(minimum: 1, maximum: 10)

Number of beams to use

Default: 5

number
(minimum: 1, maximum: 1.1)

Higher value for shorter captions

Default: 1.01

integer
(minimum: 1, maximum: 20)

Maximum number of tokens to generate

Default: 15

number
(minimum: 0, maximum: 0.6)

Scale of cross-entropy loss with un-shifted language model

Default: 0.2
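
The inputs listed above can be collected into a single payload and checked against their documented ranges before a prediction is requested. The following is a minimal sketch, not the model's own code: the types, bounds, and defaults are taken from the list above, but the parameter names (`cond_text`, `beam_size`, `end_factor`, `max_seq_length`, `ce_loss_scale`) and the helper `build_input` are assumptions made for illustration, since this page shows only the descriptions.

```python
# Bounds and defaults copied from the Input section.
# ASSUMPTION: the parameter names are illustrative; the page text lists only
# each input's description, type, range, and default.
PARAM_SPECS = {
    "cond_text": {"type": str, "default": "Image of a"},
    "beam_size": {"type": int, "min": 1, "max": 10, "default": 5},
    "end_factor": {"type": float, "min": 1.0, "max": 1.1, "default": 1.01},
    "max_seq_length": {"type": int, "min": 1, "max": 20, "default": 15},
    "ce_loss_scale": {"type": float, "min": 0.0, "max": 0.6, "default": 0.2},
}


def build_input(image, **overrides):
    """Return an input payload, filling defaults and enforcing the ranges.

    `image` is whatever the caller uses to reference the input image
    (a path or an open file); it is passed through unchanged.
    """
    unknown = set(overrides) - set(PARAM_SPECS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")

    payload = {"image": image}
    for name, spec in PARAM_SPECS.items():
        value = overrides.get(name, spec["default"])
        # Accept plain integers where a float is expected (e.g. 1 for 1.0).
        if spec["type"] is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec["type"]):
            raise TypeError(f"{name} must be {spec['type'].__name__}")
        if "min" in spec and not spec["min"] <= value <= spec["max"]:
            raise ValueError(
                f"{name}={value} is outside [{spec['min']}, {spec['max']}]"
            )
        payload[name] = value
    return payload


if __name__ == "__main__":
    # Shorter captions: raise the end factor and lower the token limit.
    print(build_input("photo.jpg", end_factor=1.05, max_seq_length=10))
```

With the Replicate Python client, a payload like this would typically be passed as the `input` argument of `replicate.run`, with the image supplied as an open file handle. The model version identifier to use is not given here and has to be taken from the model page.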

Output

Best CLIP: Image of a baby sleeping in a green flower.
Best fluency: Image of a baby sleeping in a green flower.
Best mixed: Image of a baby.
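
The output above carries three labelled captions in one string, so downstream code may want them as separate fields. Below is a minimal sketch of one way to split them, assuming the output always uses exactly the labels "Best CLIP", "Best fluency", and "Best mixed" seen in this example; the function name `parse_captions` is hypothetical and not part of the model.

```python
import re

# Labels as they appear in the example output on this page.
# ASSUMPTION: other runs use the same three labels in the same format.
_LABEL = r"Best (?:CLIP|fluency|mixed)"
_PATTERN = re.compile(
    rf"({_LABEL}):\s*(.*?)(?=\s*(?:{_LABEL}:|\Z))",
    flags=re.DOTALL,
)


def parse_captions(output_text):
    """Split the model's output string into a {label: caption} dictionary.

    Works whether the three captions are separated by spaces or by newlines.
    """
    return {m.group(1): m.group(2).strip() for m in _PATTERN.finditer(output_text)}


if __name__ == "__main__":
    example = (
        "Best CLIP: Image of a baby sleeping in a green flower. "
        "Best fluency: Image of a baby sleeping in a green flower. "
        "Best mixed: Image of a baby."
    )
    for label, caption in parse_captions(example).items():
        print(f"{label} -> {caption}")
```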

This output was created using a different version of the model, yoadtew/zero-shot-image-to-text:6d1ac11e.

Run time and cost

This model costs approximately $0.060 to run on Replicate, or 16 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 5 minutes. The predict time for this model varies significantly based on the inputs.
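
The "16 runs per $1" figure follows from the per-run price: 1 / 0.060 is about 16.67, which rounds down to 16 whole runs. A short sketch of that arithmetic, useful for budgeting a batch of predictions, is given below. The helper names are hypothetical, and the price is the approximate value quoted above, which varies with the inputs.

```python
import math

# Approximate per-run price quoted on this page (USD); actual cost varies.
COST_PER_RUN_USD = 0.060


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Number of whole runs that fit in one dollar."""
    return math.floor(1.0 / cost_per_run)


def estimate_cost(num_runs, cost_per_run=COST_PER_RUN_USD):
    """Rough total cost in USD for a batch of runs at a flat per-run price."""
    return num_runs * cost_per_run


if __name__ == "__main__":
    print(f"runs per dollar: {runs_per_dollar()}")
    print(f"estimated cost of 250 runs: ${estimate_cost(250):.2f}")
```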

Readme

PyTorch Implementation of Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Approach

Example

Citation

Please cite our work if you use it in your research:

@article{tewel2021zero,
  title={Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic},
  author={Tewel, Yoad and Shalev, Yoav and Schwartz, Idan and Wolf, Lior},
  journal={arXiv preprint arXiv:2111.14447},
  year={2021}
}