UnIVAL: Unified Model for Image, Video, Audio and Language Tasks

UnIVAL is a 0.25B-parameter unified model that is multitask-pretrained on image-text and video-text data and targets image-text, video-text, and audio-text downstream tasks.

Citation

If you find this work helpful, you can cite it as follows:

@article{shukor2023unified,
  title={Unified Model for Image, Video, Audio and Language Tasks},
  author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
  journal={arXiv preprint arXiv:2307.16184},
  year={2023}
}

Acknowledgment

This code is mainly based on the following repos:

OFA
Fairseq
taming-transformers

We thank the authors for releasing their code.
