daanelson / imagebind

A model for text, audio, and image embeddings in one space

Run time and cost

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 1 second, though the predict time varies significantly with the inputs.


Note: This model is released under a non-commercial license and should only be used for research and experimentation.

Model description

ImageBind is a model from Meta AI that learns a joint embedding across six modalities: images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications ‘out-of-the-box’, including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation.

This implementation supports the image, text, and audio modalities.
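
To make "cross-modal retrieval" and "composing modalities with arithmetic" concrete, here is a minimal sketch of what a shared embedding space allows. It does not call the model: the 3-dimensional vectors, file names, and helper functions (`normalize`, `cosine`, `retrieve`) are hypothetical stand-ins for the much larger embeddings the model would return, chosen only to illustrate the idea.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Toy vectors standing in for real image embeddings (hypothetical values).
image_gallery = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "dog_on_beach.jpg": [0.6, 0.1, 0.6],
}
text_dog = [1.0, 0.0, 0.1]     # stand-in for the text embedding of "a dog"
audio_waves = [0.1, 0.1, 1.0]  # stand-in for the audio embedding of wave sounds

# Cross-modal retrieval: a text query ranks images directly,
# because both modalities live in the same space.
best_for_text = retrieve(text_dog, image_gallery)

# Embedding arithmetic: add unit-length text and audio embeddings,
# then retrieve the image closest to the combined query.
combined = [t + a for t, a in zip(normalize(text_dog), normalize(audio_waves))]
best_for_combined = retrieve(combined, image_gallery)

print(best_for_text, best_for_combined)
```

With these toy values the text query alone selects `dog.jpg`, while the text-plus-audio query selects `dog_on_beach.jpg`. Real embeddings would be compared the same way, typically after the same unit-length normalization.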


  title={ImageBind: One Embedding Space To Bind Them All},
  author={Girdhar, Rohit and El-Nouby, Alaaeldin and Liu, Zhuang
and Singh, Mannat and Alwala, Kalyan Vasudev and Joulin, Armand and Misra, Ishan},