Readme
The text embedding set trained by Jina AI, Finetuner team.
Intented Usage & Model Info
jina-embedding-l-en-v1
is a language model that has been trained using Jina AI’s Linnaeus-Clean dataset.
This dataset consists of 380 million pairs of sentences, which include both query-document pairs.
These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which the Linnaeus-Clean dataset is derived, originally contained 1.6 billion sentence pairs.
The model has a range of use cases, including information retrieval, semantic textual similarity, text reranking, and more.
With a size of 330 million parameters, the model enables single-gpu inference while delivering better performance than our small and base model. Additionally, we provide the following options:
jina-embedding-t-en-v1
: 14 million parameters.jina-embedding-s-en-v1
: 35 million parametersjina-embedding-b-en-v1
: 110 million parameters.jina-embedding-l-en-v1
: 330 million parameters.jina-embedding-1b-en-v1
: 1.2 billion parameters, 10 times bert-base (soon).jina-embedding-6b-en-v1
: 6 billion parameters, 30 times bert-base (soon).
Data & Parameters
Please checkout our technical blog.
Metrics
We compared the model against all-minilm-l6-v2
/all-mpnet-base-v2
from sbert and text-embeddings-ada-002
from OpenAI:
Name | param | dimension |
---|---|---|
all-minilm-l6-v2 | 23m | 384 |
all-mpnet-base-v2 | 110m | 768 |
ada-embedding-002 | Unknown/OpenAI API | 1536 |
jina-embedding-t-en-v1 | 14m | 312 |
jina-embedding-s-en-v1 | 35m | 512 |
jina-embedding-b-en-v1 | 110m | 768 |
jina-embedding-l-en-v1 | 330m | 1024 |
Name | STS12 | STS13 | STS14 | STS15 | STS16 | STS17 | TRECOVID | Quora | SciFact |
---|---|---|---|---|---|---|---|---|---|
all-minilm-l6-v2 | 0.724 | 0.806 | 0.756 | 0.854 | 0.79 | 0.876 | 0.473 | 0.876 | 0.645 |
all-mpnet-base-v2 | 0.726 | 0.835 | 0.78 | 0.857 | 0.8 | 0.906 | 0.513 | 0.875 | 0.656 |
ada-embedding-002 | 0.698 | 0.833 | 0.761 | 0.861 | 0.86 | 0.903 | 0.685 | 0.876 | 0.726 |
jina-embedding-t-en-v1 | 0.717 | 0.773 | 0.731 | 0.829 | 0.777 | 0.860 | 0.482 | 0.840 | 0.522 |
jina-embedding-s-en-v1 | 0.743 | 0.786 | 0.738 | 0.837 | 0.80 | 0.875 | 0.523 | 0.857 | 0.524 |
jina-embedding-b-en-v1 | 0.751 | 0.809 | 0.761 | 0.856 | 0.812 | 0.890 | 0.606 | 0.876 | 0.594 |
jina-embedding-l-en-v1 | 0.745 | 0.832 | 0.781 | 0.869 | 0.837 | 0.902 | 0.573 | 0.881 | 0.598 |
Usage
Use with Jina AI Finetuner
!pip install finetuner
import finetuner
model = finetuner.build_model('jinaai/jina-embedding-l-en-v1')
embeddings = finetuner.encode(
model=model,
data=['how is the weather today', 'What is the current weather like today?']
)
print(finetuner.cos_sim(embeddings[0], embeddings[1]))
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
sentences = ['how is the weather today', 'What is the current weather like today?']
model = SentenceTransformer('jinaai/jina-embedding-b-en-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
Fine-tuning
Please consider Finetuner.
Plans
- The development of
jina-embedding-s-en-v2
is currently underway with two main objectives: improving performance and increasing the maximum sequence length. - We are currently working on a bilingual embedding model that combines English and X language. The upcoming model will be called
jina-embedding-s/b/l-de-v1
.
Contact
Join our Discord community and chat with other community members about ideas.
Citation
If you find Jina Embeddings useful in your research, please cite the following paper:
@misc{günther2023jina,
title={Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models},
author={Michael Günther and Louis Milliken and Jonathan Geuter and Georgios Mastrapas and Bo Wang and Han Xiao},
year={2023},
eprint={2307.11224},
archivePrefix={arXiv},
primaryClass={cs.CL}
}