Visual instruction tuning towards large language and vision models with GPT-4 level capabilities