Jina CLIP v2
A powerful multilingual multimodal embedding model that can understand both text and images across 89 languages.
What it does
- Creates embeddings (numerical representations) from text, images, or both
- Supports 89 languages for text input
- Handles high-resolution images (512x512)
- Can adjust embedding dimensions from 64 to 1024 (Matryoshka feature)
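The Matryoshka feature means a full 1024-dimensional embedding can simply be truncated to a prefix and re-normalized, with no re-encoding. A minimal sketch with NumPy (the vector here is random stand-in data, not real model output):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a full 1024-d embedding produced by the model
full = np.random.default_rng(0).normal(size=1024)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 64)
print(small.shape)  # (64,)
```

The truncated vector stays usable for cosine-similarity search because Matryoshka training packs the most informative components into the leading dimensions.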
Why it’s cool
- Multilingual: Works with 89 languages, making it truly global
- High Resolution: Better at understanding image details (512x512 vs typical 224x224)
- Flexible Size: Can reduce embedding size while maintaining strong performance
- Dual Purpose: Can handle text-only, image-only, or both together
Behavior
- When given text only: Returns a single text embedding
- When given image only: Returns a single image embedding
- When given both: Returns two embeddings [text_embedding, image_embedding]
Performance
- 3% improvement over v1 in text-image and text-text retrieval
- Matches state-of-the-art performance on many benchmarks
- Even at 64 dimensions (94% smaller), maintains over 90% of full performance
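The "94% smaller" figure follows from the dimension ratio alone: 64/1024 = 6.25% of the original size. A quick check of what that saves for one million float32 embeddings:

```python
n, bytes_per_float = 1_000_000, 4

full_bytes = n * 1024 * bytes_per_float  # ~4.1 GB at 1024 dims
small_bytes = n * 64 * bytes_per_float   # ~0.26 GB at 64 dims

savings = 1 - small_bytes / full_bytes
print(f"{savings:.2%}")  # 93.75%
```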
Best For
- Cross-lingual image search
- Semantic image understanding
- Text-image similarity matching
- Multilingual document retrieval
- Building multimodal AI applications
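Most of these use cases reduce to cosine similarity between embeddings in the shared text–image space. A sketch of ranking images against a text query, using random stand-in vectors where real model embeddings would go:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(42)
text_emb = rng.normal(size=1024)          # stand-in for a text query embedding
image_embs = rng.normal(size=(3, 1024))   # stand-ins for 3 image embeddings

# Rank candidate images by similarity to the text query
scores = [cosine_similarity(text_emb, img) for img in image_embs]
best = int(np.argmax(scores))
print(best, scores[best])
```

Because text and images live in the same embedding space, the same ranking loop works for text→image, image→text, and text→text retrieval.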
Model Size
- Total Size: 0.9B parameters
- Text Encoder: 561M parameters
- Image Encoder: 304M parameters
Outputs
Returns embeddings in your chosen format and dimension:
- Text input → Text embedding
- Image input → Image embedding
Performance & Specifications
- Model Size: 0.9B parameters (561M text + 304M image)
- Quality:
  - 3% better than the previous version in retrieval tasks
  - Retains over 90% of full performance even at 64 dimensions
  - State-of-the-art on multilingual benchmarks
Ideal Use Cases
- Global image search systems
- Cross-language document matching
- Visual-semantic product search
- Content recommendation
- AI training datasets
License
Available under CC BY-NC 4.0 for non-commercial use. Commercial licensing available through Jina AI.
Citation
@misc{2405.20204,
  author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
  title  = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
  year   = {2024},
  eprint = {arXiv:2405.20204},
}