cjwbw / unival

Unified Model for Image, Video, Audio and Language Tasks

  • Public
  • 990 runs
  • L40S
  • GitHub
  • Paper
  • License

Input

input_image (file)
Input image.

input_audio (file)
Input audio.

input_video (file)
Input video.

task (string)
Choose a task. Default: "Image Captioning"

instruction (string)
Provide a question for the VQA task, a region for the Visual Grounding task, or an instruction for general tasks. The default instruction for the captioning task is "What does the image/video/audio describe?"
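As a minimal sketch, the inputs above could be passed to the model through Replicate's Python client. The parameter names (`input_image`, `task`, `instruction`) and the image URL are assumptions inferred from the descriptions above, not confirmed API identifiers:

```python
# Sketch of calling this model via Replicate's Python client.
# Assumptions: the input field names match the parameter descriptions
# above, and REPLICATE_API_TOKEN is set in your environment.
import os

inputs = {
    "input_image": "https://example.com/dog.jpg",  # hypothetical image URL
    "task": "Image Captioning",
    "instruction": "What does the image describe?",
}

# Only call the API when a token is available.
if os.environ.get("REPLICATE_API_TOKEN"):
    import replicate

    output = replicate.run("cjwbw/unival", input=inputs)
    print(output)
```

File inputs can also be passed as open file handles instead of URLs; the client uploads them for you.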

Output


This output was created using a different version of the model, cjwbw/unival:b2c88629.

Run time and cost

This model costs approximately $0.0034 per run on Replicate (about 294 runs per $1), though the cost varies depending on your inputs. It is also open source, and you can run it on your own computer with Docker.
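The two quoted figures are consistent with each other; a quick arithmetic check:

```python
# Check that $0.0034 per run corresponds to roughly 294 runs per $1.
price_per_run = 0.0034
runs_per_dollar = 1 / price_per_run
print(round(runs_per_dollar))  # prints 294
```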

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 4 seconds, but predict time varies significantly based on the inputs.

Readme

UniVAL: Unified Model for Image, Video, Audio and Language Tasks

UnIVAL is a 0.25B-parameter unified model that is multitask-pretrained on image- and video-text data and targets image-, video-, and audio-text downstream tasks.

Citation

If you find this work helpful, you can cite it as follows:

@article{shukor2023unified,
  title={Unified Model for Image, Video, Audio and Language Tasks},
  author={Shukor, Mustafa and Dancette, Corentin and Rame, Alexandre and Cord, Matthieu},
  journal={arXiv preprint arXiv:2307.16184},
  year={2023}
}

Acknowledgment

This code is mainly based on the following repos:

We thank the authors for releasing their code.