Jina CLIP v2
A powerful multilingual multimodal embedding model that can understand both text and images across 89 languages.
What it does
- Creates embeddings (numerical representations) from text, images, or both
- Supports 89 languages for text input
- Handles high-resolution images (512x512)
- Can adjust embedding dimensions from 64 to 1024 (Matryoshka feature)
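Because of the Matryoshka property, the leading dimensions of each embedding carry most of the information, so you can shrink a full 1024-dimensional vector yourself by truncating and re-normalizing. A minimal sketch in plain NumPy, assuming you already have a full-size embedding from the model:

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int = 64) -> np.ndarray:
    """Keep the first `dim` components of a Matryoshka embedding and re-normalize.

    Assumes `embedding` is a full 1024-dimensional vector produced by the model.
    """
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Example: shrink a 1024-d vector to 64 dimensions (94% smaller).
full = np.random.randn(1024)          # stand-in for a real embedding
small = truncate_embedding(full, 64)  # unit-length 64-d vector
```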
Why it’s cool
- Multilingual: Works with 89 languages, making it truly global
- High Resolution: Better at understanding image details (512x512 vs typical 224x224)
- Flexible Size: Can reduce embedding size while maintaining strong performance
- Dual Purpose: Can handle text-only, image-only, or both together
Behavior
- When given text only: Returns a single text embedding
- When given image only: Returns a single image embedding
- When given both: Returns two embeddings [text_embedding, image_embedding]
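A rough usage sketch of that behavior via Hugging Face `transformers` with `trust_remote_code`. The `encode_text` / `encode_image` helpers and the `truncate_dim` keyword follow the model card's remote code and are assumptions here; check the model card if your version differs:

```python
from transformers import AutoModel

# Load the model; the repo's custom code provides the encode helpers.
model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["A blue cat", "Una gata azul"]   # any of the 89 supported languages
images = ["photo_of_a_beach.jpg"]         # local paths, URLs, or PIL images (assumed accepted)

text_emb = model.encode_text(texts)       # text only -> text embeddings
image_emb = model.encode_image(images)    # image only -> image embeddings

# Matryoshka: request smaller vectors directly (assumed keyword).
small_text_emb = model.encode_text(texts, truncate_dim=64)
```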
Performance
- 3% improvement over v1 in text-image and text-text retrieval
- Matches state-of-the-art performance on many benchmarks
- Even at 64 dimensions (94% smaller), maintains over 90% of full performance
Best For
- Cross-lingual image search
- Semantic image understanding
- Text-image similarity matching
- Multilingual document retrieval
- Building multimodal AI applications
Model Size
- Total Size: 0.9B parameters
- Text Encoder: 561M parameters
- Image Encoder: 304M parameters
Outputs
Returns embeddings in your chosen format and dimension:
- Text input → Text embedding
- Image input → Image embedding
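Because text and image embeddings live in the same space, text-image similarity matching reduces to a cosine similarity between the two vectors. A minimal sketch, using the (hypothetical) `text_emb` and `image_emb` arrays from the usage sketch above:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With real embeddings from the usage sketch above:
#   score = cosine_similarity(text_emb[0], image_emb[0])
# Higher scores mean the text and the image describe the same content.
```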
Ideal Use Cases
- Global image search systems
- Cross-language document matching
- Visual-semantic product search
- Content recommendation
- AI training datasets
License
Available under CC BY-NC 4.0 for non-commercial use. Commercial licensing available through Jina AI.
Citation
```bibtex
@misc{2405.20204,
  Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
  Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
  Year = {2024},
  Eprint = {arXiv:2405.20204},
}
```