cjwbw / unival

Unified Model for Image, Video, Audio and Language Tasks

  • Public
  • 784 runs

Run time and cost

This model runs on NVIDIA A40 (Large) GPU hardware. Predictions typically complete within 4 seconds, but prediction time varies significantly with the inputs.

Readme

UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

UnIVAL is a 0.25B-parameter unified model that is multitask pretrained on image-text and video-text data and targets image-text, video-text and audio-text downstream tasks.

Citation

If you find this work helpful, you can cite it as follows:

@article{shukor2023unified,
  title={Unified Model for Image, Video, Audio and Language Tasks},
  author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
  journal={arXiv preprint arXiv:2307.16184},
  year={2023}
}

Acknowledgment

This code is based mainly on existing repositories. We thank the authors for releasing their code.