Vision models
Multimodal large language models with vision capabilities like object detection and optical character recognition (OCR)

yorickvp / llava-13b
Visual instruction tuning towards large language and vision models with GPT-4-level capabilities

daanelson / minigpt-4
A model that generates text in response to an input image and prompt

cjwbw / internlm-xcomposer
Advanced text-image comprehension and composition based on InternLM

joehoover / mplug-owl
An instruction-tuned multimodal large language model that generates text based on user-provided prompts and images

lucataco / qwen-vl-chat
A multimodal LLM-based AI assistant trained with alignment techniques. Qwen-VL-Chat supports flexible interaction, such as multi-round question answering, as well as creative capabilities.

alaradirik / owlvit-base-patch32
Zero-shot / open vocabulary object detection

cjwbw / unidiffuser
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

cjwbw / idefics
Open-access reproduction of Flamingo, a large visual language model

alaradirik / nougat
Nougat: Neural Optical Understanding for Academic Documents

cjwbw / unival
Unified Model for Image, Video, Audio and Language Tasks