zsxkib/jina-clip-v2

Jina-CLIP v2: 0.9B multimodal embedding model with 89-language multilingual support, 512x512 image resolution, and Matryoshka representations

Run time and cost

This model costs approximately $0.00022 per run on Replicate (about 4,545 runs per $1), though the exact cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 1 second.
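
For rough budgeting, the per-run price quoted above can be turned into a quick estimate. This is only a sketch: the constant is the approximate figure from this page, the function names are hypothetical, and real costs depend on your inputs.

```python
# Approximate per-run price quoted on this page (USD); real cost varies with inputs.
COST_PER_RUN_USD = 0.00022


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that fit into one dollar at the given price."""
    return int(1 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated total cost in USD for `num_runs` predictions."""
    return num_runs * cost_per_run


print(runs_per_dollar())                    # 4545
print(round(estimated_cost(1_000_000), 2))  # 220.0
```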

Readme

Jina CLIP v2

A powerful multilingual multimodal embedding model that can understand both text and images across 89 languages.

What it does

  • Creates embeddings (numerical representations) from text, images, or both
  • Supports 89 languages for text input
  • Handles high-resolution images (512x512)
  • Can adjust embedding dimensions from 64 to 1024 (Matryoshka feature)
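
The general idea behind Matryoshka representations is that the leading components of an embedding remain useful on their own. The sketch below illustrates this on the client side, assuming the usual approach of keeping the first `dim` components and re-normalizing to unit length. The helper name `truncate_embedding` is hypothetical, and this is not necessarily how the hosted model produces its smaller embeddings.

```python
import math


def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 1 <= dim <= len(embedding):
        raise ValueError("dim must be between 1 and len(embedding)")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


full = [0.5] * 1024  # placeholder standing in for a real 1024-dim embedding
small = truncate_embedding(full, 64)
print(len(small))  # 64
```

Re-normalizing matters if you later compare vectors with a dot product, since truncation alone changes their length.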

Why it’s cool

  • Multilingual: Works with 89 languages, making it truly global
  • High Resolution: Better at understanding image details (512x512 vs typical 224x224)
  • Flexible Size: Can reduce embedding size while maintaining strong performance
  • Dual Purpose: Can handle text-only, image-only, or both together

Behavior

  • When given text only: Returns a single text embedding
  • When given image only: Returns a single image embedding
  • When given both: Returns two embeddings [text_embedding, image_embedding]
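
Because the output shape depends on which inputs you send, it can help to label the result right away. The following is a minimal sketch with a hypothetical helper, `label_embeddings`. It assumes a single embedding may arrive either as a flat list of numbers or wrapped in a one-element list, which this page does not specify.

```python
def _as_vector(output):
    # Assumption: a single embedding may be a flat list of floats
    # or a one-element list containing that list; accept both.
    if output and isinstance(output[0], (list, tuple)):
        if len(output) != 1:
            raise ValueError("expected exactly one embedding")
        return list(output[0])
    return list(output)


def label_embeddings(output, has_text, has_image):
    """Map the model output to named embeddings based on the inputs sent."""
    if has_text and has_image:
        # Both inputs: the order is [text_embedding, image_embedding].
        if len(output) != 2:
            raise ValueError("expected [text_embedding, image_embedding]")
        text_emb, image_emb = output
        return {"text": list(text_emb), "image": list(image_emb)}
    if has_text:
        return {"text": _as_vector(output)}
    if has_image:
        return {"image": _as_vector(output)}
    raise ValueError("at least one of text or image is required")


print(label_embeddings([[0.1, 0.2], [0.3, 0.4]], has_text=True, has_image=True))
```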

Performance

  • 3% improvement over v1 in text-image and text-text retrieval
  • Matches state-of-the-art performance on many benchmarks
  • Even at 64 dimensions (94% smaller), maintains over 90% of full performance
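
The "94% smaller" figure follows from the ratio of dimensions (64 out of 1024). The short sketch below reproduces that arithmetic and estimates index storage, assuming each value is stored as a 4-byte float32. The storage format is an assumption, not something this page states.

```python
FULL_DIM = 1024
BYTES_PER_FLOAT32 = 4  # assumption: embeddings stored as float32


def size_reduction(dim, full_dim=FULL_DIM):
    """Fraction by which a `dim`-sized embedding is smaller than the full one."""
    return 1 - dim / full_dim


def storage_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage needed for `num_vectors` embeddings of size `dim`."""
    return num_vectors * dim * bytes_per_value


print(f"{size_reduction(64):.2%}")      # 93.75%
print(storage_bytes(1_000_000, 1024))   # 4096000000
print(storage_bytes(1_000_000, 64))     # 256000000
```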

Best For

  • Cross-lingual image search
  • Semantic image understanding
  • Text-image similarity matching
  • Multilingual document retrieval
  • Building multimodal AI applications
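
Tasks like cross-lingual image search typically come down to comparing a text embedding against stored image embeddings. Here is a minimal sketch using cosine similarity. The function names and the toy 3-dimensional vectors are illustrative only; real embeddings would come from the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (image_id, score) pairs, most similar first."""
    scored = [
        (image_id, cosine_similarity(text_embedding, emb))
        for image_id, emb in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real text and image embeddings.
query = [1.0, 0.0, 0.0]
images = {"cat.jpg": [0.9, 0.1, 0.0], "car.jpg": [0.0, 1.0, 0.0]}
print(rank_images(query, images)[0][0])  # cat.jpg
```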

Model Size

  • Total Size: 0.9B parameters
  • Text Encoder: 561M parameters
  • Image Encoder: 304M parameters

Outputs

Returns embeddings in your chosen format and dimension:

  • Text input → Text embedding
  • Image input → Image embedding

Performance & Specifications

  • Model Size: 0.9B parameters (561M text + 304M image)
  • Quality:
      • 3% better than v1 in text-image and text-text retrieval
      • Retains over 90% of full performance even at 64 dimensions
      • State-of-the-art on multilingual benchmarks

Ideal Use Cases

  • Global image search systems
  • Cross-language document matching
  • Visual-semantic product search
  • Content recommendation
  • AI training datasets

License

Available under CC BY-NC 4.0 for non-commercial use. Commercial licensing available through Jina AI.

Citation

@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}
