cjwbw / unival

Unified Model for Image, Video, Audio and Language Tasks

Run time and cost

This model costs approximately $0.0026 to run on Replicate, or 384 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
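The relationship between the per-run price and the runs-per-dollar figure above can be checked with a little arithmetic. The sketch below is illustrative only: the helper names `runs_per_dollar` and `estimated_cost` are hypothetical and not part of any Replicate API, and the $0.0026 price is the approximate figure quoted above, which varies with your inputs.

```python
import math

# Approximate per-run price quoted on this page (assumption: it varies by input).
COST_PER_RUN_USD = 0.0026


def runs_per_dollar(cost_per_run: float) -> int:
    """Whole number of runs that fit into one US dollar."""
    return math.floor(1.0 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float) -> float:
    """Rough total cost in USD for a given number of runs."""
    return num_runs * cost_per_run


if __name__ == "__main__":
    print(runs_per_dollar(COST_PER_RUN_USD))
    print(round(estimated_cost(1000, COST_PER_RUN_USD), 2))
```

Treat the result as a ballpark budget figure, since actual billing depends on the prediction time of each run.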

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 4 seconds. The predict time for this model varies significantly based on the inputs.

Readme

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

UnIVAL is a 0.25B-parameter unified model that is multitask pretrained on image-text and video-text data and targets image-text, video-text, and audio-text downstream tasks.

Citation

If you find this work helpful, you can cite it as follows:

@article{shukor2023unified,
  title={Unified Model for Image, Video, Audio and Language Tasks},
  author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
  journal={arXiv preprint arXiv:2307.16184},
  year={2023}
}

Acknowledgment

This code is based mainly on the following repos:

We thank the authors for releasing their code.