Lightweight multimodal model for visual question answering, reasoning and captioning