meta / omnivore

A Single Model for Many Visual Modalities

Demo API Examples Versions (79fa47c0)

😵 Uh oh! This model can't be run on Replicate because it was built with a version of Cog that is no longer supported. Consider opening an issue on the model's GitHub repository to see if it can be updated to use a recent version of Cog. If you need any help, please hop into our Discord channel or email us about it.

Run time and cost

Predictions run on Nvidia T4 GPU hardware.

Omnivore: A Single Model for Many Visual Modalities


OMNIVORE is a single vision model for many different visual modalities. It learns to construct representations that are aligned across visual modalities, without requiring training data that specifies correspondences between those modalities. Using OMNIVORE’s shared visual representation, we successfully identify nearest neighbors of left: an image (ImageNet-1K validation set) in vision datasets that contain right: depth maps (ImageNet-1K training set), single-view 3D images (ImageNet-1K training set), and videos (Kinetics-400 validation set).

This repo contains the code to run inference with a pretrained model on an image, video or RGBD image.


If this work is helpful in your research, please consider starring :star: us and citing:

  title={{Omnivore: A Single Model for Many Visual Modalities}},
  author={Girdhar, Rohit and Singh, Mannat and Ravi, Nikhila and van der Maaten, Laurens and Joulin, Armand and Misra, Ishan},
  journal={arXiv preprint arXiv:2201.08377},


We welcome your pull requests! Please see CONTRIBUTING and CODE_OF_CONDUCT for more information.


Omnivore is released under the CC-BY-NC 4.0 license. See LICENSE for additional details. However the Swin Transformer implementation is additionally licensed under the Apache 2.0 license (see NOTICE for additional details).