cjwbw / unival

Unified Model for Image, Video, Audio and Language Tasks


Run time and cost

This model costs approximately $0.0034 to run on Replicate, or 294 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 4 seconds, but the prediction time varies significantly with the inputs.
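
As a rough illustration of the pricing arithmetic above, the sketch below converts the quoted per-run cost into runs per dollar and estimates the cost of a batch of runs. The function names and the 1,000-run batch are hypothetical, chosen only for this example; the $0.0034 figure is the approximate value quoted above, and actual costs vary with the inputs.

```python
# Approximate per-run cost quoted on this page (USD). Actual cost varies by input.
COST_PER_RUN_USD = 0.0034


def runs_per_dollar(cost_per_run: float) -> int:
    """Return the whole number of runs that fit in one dollar."""
    return int(1.0 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Return a rough total cost in USD for num_runs at a fixed per-run cost."""
    return num_runs * cost_per_run


if __name__ == "__main__":
    # Hypothetical batch size, used only for illustration.
    batch = 1000
    print(f"Runs per $1: {runs_per_dollar(COST_PER_RUN_USD)}")
    print(f"Estimated cost of {batch} runs: ${estimated_cost(batch):.2f}")
```

Because the per-run cost is itself an average, treat the result as an estimate rather than a billing guarantee.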

Readme

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

UnIVAL is a 0.25B-parameter unified model that is multitask pretrained on image-text and video-text data and targets image-text, video-text and audio-text downstream tasks.

Citation

If you find this work helpful, please cite it as follows:

@article{shukor2023unified,
  title={Unified Model for Image, Video, Audio and Language Tasks},
  author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
  journal={arXiv preprint arXiv:2307.16184},
  year={2023}
}

Acknowledgment

This code is based mainly on the following repos:

We thank the authors for releasing their code.