A model that generates text in response to an input image and a text prompt.