lucataco / llama-3-vision-alpha

Projection module trained to add vision capabilities to Llama 3 using SigLIP

Run time and cost

This model runs on Nvidia A40 GPU hardware. Predictions typically complete within 6 seconds, but prediction time varies significantly with the inputs.



Projection module trained to add vision capabilities to Llama 3 using SigLIP. Built by @yeswondwerr and @qtnx_.
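To make the idea of a projection module concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation: the layer layout (a two-layer MLP with GELU, in the style of LLaVA-like projectors), the toy dimensions, the random stand-in features, and all function names are assumptions chosen for illustration. The sketch maps per-patch features from a vision encoder such as SigLIP into the language model's embedding space, then joins the projected image tokens with the text token embeddings.

```python
import math
import random


def gelu(x: float) -> float:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def init_linear(in_dim, out_dim, rng):
    """Random weights of shape [in_dim][out_dim] and a zero bias (illustrative only)."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(rows, weight, bias):
    """Apply y = x @ W + b to every row of `rows`."""
    columns = list(zip(*weight))  # out_dim columns, each of length in_dim
    return [
        [sum(x * w for x, w in zip(row, col)) + b for col, b in zip(columns, bias)]
        for row in rows
    ]


def project(patch_features, params):
    """Assumed projector layout: Linear -> GELU -> Linear."""
    (w1, b1), (w2, b2) = params
    hidden = [[gelu(v) for v in row] for row in linear(patch_features, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes so the sketch runs instantly; real encoders and LLMs are far wider.
VISION_DIM, TEXT_DIM = 8, 16
NUM_PATCHES, NUM_TEXT_TOKENS = 4, 3

rng = random.Random(0)
params = [init_linear(VISION_DIM, TEXT_DIM, rng), init_linear(TEXT_DIM, TEXT_DIM, rng)]

# Random stand-ins for the vision encoder's patch features and the LLM's token embeddings.
patch_features = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)]

image_tokens = project(patch_features, params)

# Image tokens are prepended to the text tokens here for simplicity (an assumption);
# the language model would consume this combined sequence of embeddings.
llm_input = image_tokens + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

In a setup like this, only the projector's weights would typically need training, while the vision encoder and the language model can stay frozen; whether that matches this model's actual training recipe is not stated here.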


Image Examples
What is the title of this book? answer briefly
The title of the book is “The Little Book of Deep Learning”.

Where is the person standing? answer briefly
The person is standing on the balcony.

Describe the image
The image shows a person holding a book with a cityscape visible through the window behind them. The book has a cover with a title that reads “The Little Book of Deep Learning” in bold letters.

What type of food is the girl holding? answer briefly
A hamburger!

What color is the woman’s hair? answer briefly
It’s white!

Describe the image
The image is of a young girl with short, curly hair and a sweet smile, holding a giant hamburger in her hand. She’s sitting at a table with a festive dinner setting, surrounded by candles and a warm glow. Her eyes are shining with excitement and contentment as she takes a big bite of the burger.


  • Liu et al.: LLaVA
  • Moon et al.: AnyMAL
  • vikhyatk: moondream, test images
