aodianyun / qwen2-vl-7b

  • Public
  • 1.6K runs
  • L40S

Input

string

Prompt to use for the video

Default: "Describe the video."

file (required)

Video to process

integer
(minimum: 128, maximum: 2048)

Width for the video

Default: 128

integer
(minimum: 128, maximum: 2048)

Height for the video

Default: 128

number
(minimum: 1, maximum: 768)

Maximum duration of the video in seconds (values above 360 may run out of VRAM).

Default: 60

integer
(minimum: 1, maximum: 8192)

Maximum number of tokens to generate

Default: 128

number
(minimum: 0.01, maximum: 1)

Temperature for the model (0.7 is a good default).

Default: 0.7

number
(minimum: 0.01, maximum: 1.5)

Repetition penalty for the model (1.1 is a good default).

Default: 1.1
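
The ranges and defaults listed above can be checked on the client side before a prediction is submitted. The following is a minimal sketch, not the model's own schema: the page lists only types, ranges, and descriptions, so every key name here (`prompt`, `video`, `width`, `height`, `max_duration`, `max_new_tokens`, `temperature`, `repetition_penalty`) and the helper `build_input` are hypothetical and should be replaced with the names from the model's actual API schema.

```python
import warnings

# (minimum, maximum, default) for each numeric input, as listed on the page.
# The key names are hypothetical; the page does not show the real field names.
NUMERIC_LIMITS = {
    "width": (128, 2048, 128),
    "height": (128, 2048, 128),
    "max_duration": (1, 768, 60),
    "max_new_tokens": (1, 8192, 128),
    "temperature": (0.01, 1, 0.7),
    "repetition_penalty": (0.01, 1.5, 1.1),
}
INTEGER_FIELDS = {"width", "height", "max_new_tokens"}
VRAM_WARNING_SECONDS = 360  # the page warns that longer videos may run out of VRAM


def build_input(video, prompt="Describe the video.", **overrides):
    """Return an input dictionary with defaults filled in and ranges checked."""
    unknown = set(overrides) - set(NUMERIC_LIMITS)
    if unknown:
        raise ValueError(f"unknown inputs: {sorted(unknown)}")

    payload = {"prompt": prompt, "video": video}
    for name, (low, high, default) in NUMERIC_LIMITS.items():
        value = overrides.get(name, default)
        if name in INTEGER_FIELDS and (isinstance(value, bool) or not isinstance(value, int)):
            raise TypeError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
        payload[name] = value

    if payload["max_duration"] > VRAM_WARNING_SECONDS:
        warnings.warn("max_duration above 360 seconds may run out of VRAM")
    return payload
```

Calling `build_input("clip.mp4", width=512, height=512)` returns a dictionary with the remaining fields set to the documented defaults, which could then be supplied as the input of a prediction request once the key names are aligned with the real schema.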

Output

[ "The video features a woman standing behind a podium, speaking to an audience while displaying slides on a screen in front of her. The slides contain text and images related to the topic being discussed by the speaker. The woman appears to be giving a presentation or lecture on a specific subject matter. The slides provide additional information and visual aids to support the speaker's points. The setting suggests that this is likely taking place in a formal environment such as a conference room or auditorium. Overall, the video captures a professional presentation with a focus on delivering informative content through both verbal communication and visual aids." ]

Run time and cost

This model costs approximately $0.0030 per run on Replicate (about 333 runs per $1), though the cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.
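
The relationship between the two quoted figures is simple division, sketched below for budgeting purposes. The constant comes from the page; the function names `runs_per_dollar` and `estimate_cost` are illustrative, and since the page notes that cost varies with inputs, the results are rough estimates only.

```python
COST_PER_RUN_USD = 0.0030  # approximate per-run cost quoted on the page


def runs_per_dollar(cost_per_run=COST_PER_RUN_USD):
    """Whole number of runs that one dollar covers at the given per-run cost."""
    return int(1 / cost_per_run)


def estimate_cost(num_runs, cost_per_run=COST_PER_RUN_USD):
    """Approximate total cost in USD for a number of runs."""
    return round(num_runs * cost_per_run, 4)


print(runs_per_dollar())     # whole runs per dollar at the quoted rate
print(estimate_cost(1000))   # approximate cost of 1,000 runs in USD
```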

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 4 seconds.

Readme

This model doesn't have a readme.