UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
UnIVAL is a 0.25B-parameter unified model that is multitask pretrained on image-text and video-text data and targets image-text, video-text and audio-text downstream tasks.
Citation
If you find this work helpful, please cite it as follows:
@article{shukor2023unified,
title={Unified Model for Image, Video, Audio and Language Tasks},
author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
journal={arXiv preprint arXiv:2307.16184},
year={2023}
}
Acknowledgment
This code is based mainly on the following repos:
We thank the authors for releasing their code.