lucataco / snowflake-arctic-embed-l

snowflake-arctic-embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance

Run time and cost

This model runs on CPU hardware. Predictions typically complete within 1 second.

Readme

Models

snowflake-arctic-embed is a suite of text embedding models that focuses on creating high-quality retrieval models optimized for performance.

The snowflake-arctic-embed models achieve state-of-the-art performance on the MTEB/BEIR leaderboard for each of their size variants. Evaluation is performed using these scripts. As shown below, each model size class achieves state-of-the-art retrieval accuracy compared to other top models.

The models are built on existing open-source text representation models, such as bert-base-uncased, and are trained in a multi-stage pipeline to optimize their retrieval performance. First, the models are pretrained with large batches of query-document pairs, where negatives are derived in-batch. Pretraining uses about 400 million samples drawn from a mix of public datasets and proprietary web search data. Following pretraining, the models are further optimized with long training on a smaller dataset (about 1 million samples) of triplets, each consisting of a query, a positive document, and a negative document derived from hard negative mining. Mining of the negatives and data curation are crucial to retrieval accuracy. A detailed technical report will be available shortly.
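The two training stages above can be illustrated with a small contrastive-loss sketch. This is not the authors' training code: the source does not state the exact loss, so the example assumes a common InfoNCE-style formulation, and the function names and the temperature of 0.05 are illustrative choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def in_batch_negatives_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean InfoNCE-style loss over a batch (assumed formulation).

    doc_vecs[i] is the positive for query_vecs[i]; every other document
    in the batch is treated as a negative for that query.
    """
    queries = [normalize(q) for q in query_vecs]
    docs = [normalize(d) for d in doc_vecs]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

def triplet_loss(query, positive, hard_negative, temperature=0.05):
    """Same loss for one (query, positive, mined hard negative) triplet."""
    q, p, n = normalize(query), normalize(positive), normalize(hard_negative)
    pos = dot(q, p) / temperature
    neg = dot(q, n) / temperature
    m = max(pos, neg)
    return m + math.log(math.exp(pos - m) + math.exp(neg - m)) - pos

# Toy 2-dimensional "embeddings": matched pairs give a near-zero loss,
# mismatched pairs give a large one.
aligned = in_batch_negatives_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = in_batch_negatives_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

In this sketch a hard negative (one that scores close to the positive) yields a much larger loss than an easy one, which is one way to see why the mining step matters for the second stage.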

Name                            MTEB Retrieval Score (NDCG @ 10)   Parameters (Millions)   Embedding Dimension
snowflake-arctic-embed-xs       50.15                              22                      384
snowflake-arctic-embed-s        51.98                              33                      384
snowflake-arctic-embed-m        54.90                              110                     768
snowflake-arctic-embed-m-long   54.83                              137                     768
snowflake-arctic-embed-l        55.98                              335                     1024
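The parameter counts and embedding dimensions in this table translate into rough memory figures. The following back-of-the-envelope sketch assumes 32-bit floats for both weights and stored vectors (an assumption; the source does not state a precision), and the helper names are illustrative.

```python
# (parameters in millions, embedding dimension), copied from the table.
MODELS = {
    "snowflake-arctic-embed-xs": (22, 384),
    "snowflake-arctic-embed-s": (33, 384),
    "snowflake-arctic-embed-m": (110, 768),
    "snowflake-arctic-embed-m-long": (137, 768),
    "snowflake-arctic-embed-l": (335, 1024),
}

BYTES_PER_FLOAT32 = 4  # assumed storage precision

def weights_megabytes(params_millions, bytes_per_param=BYTES_PER_FLOAT32):
    """Approximate size of the model weights in MB (10^6 bytes)."""
    return params_millions * bytes_per_param

def index_gigabytes(num_vectors, dimension, bytes_per_value=BYTES_PER_FLOAT32):
    """Approximate size of a flat, uncompressed vector index in GB (10^9 bytes)."""
    return num_vectors * dimension * bytes_per_value / 1e9

for name, (params_millions, dimension) in MODELS.items():
    print(
        f"{name}: ~{weights_megabytes(params_millions):.0f} MB of weights, "
        f"~{index_gigabytes(1_000_000, dimension):.2f} GB per million vectors"
    )
```

Under these assumptions, the largest model needs roughly 2.7 times the index storage of the xs and s models for the same corpus, which is the trade-off to weigh against its higher retrieval score.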

Aside from being great open-source models, the largest model, snowflake-arctic-embed-l, can serve as a natural replacement for closed-source embedding models, as shown below.

Model Name                     MTEB Retrieval Score (NDCG @ 10)
snowflake-arctic-embed-l       55.98
Google-gecko-text-embedding    55.7
text-embedding-3-large         55.44
Cohere-embed-english-v3.0      55.00
bge-large-en-v1.5              54.29
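For readers unfamiliar with the metric in these tables, here is a minimal sketch of NDCG @ 10 for a single query. It assumes the linear-gain form of DCG (relevance divided by log2 of rank + 1); the function names are illustrative, and this is not the evaluation script referenced earlier. Leaderboard figures such as 55.98 appear to be this value averaged over queries and datasets and scaled by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance label of each retrieved document, in the
        order the system returned them.
    all_relevances: labels of every judged document for the query, used to
        build the ideal ranking; defaults to ranked_relevances.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal

# The single relevant document was returned in second place.
score = ndcg_at_k([0, 1, 0])
```

A perfect ranking scores 1.0, and pushing a relevant document further down the list lowers the score logarithmically rather than linearly.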

snowflake-arctic-embed-l

Based on the intfloat/e5-large-unsupervised model, this large model is a direct drop-in for closed APIs and delivers the most accurate retrieval experience.

Model Name                  MTEB Retrieval Score (NDCG @ 10)
snowflake-arctic-embed-l    55.98
UAE-Large-V1                54.66
bge-large-en-v1.5           54.29
mxbai-embed-large-v1        54.39
e5-Large-v2                 50.56
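As a rough illustration of how the embeddings are used for retrieval, the sketch below ranks documents by cosine similarity to a query vector. The ranking helpers are plain Python and work with any embedding source. `load_arctic_encoder` is an assumption for illustration only: it presumes the third-party `sentence-transformers` package and the Hugging Face model ID `Snowflake/snowflake-arctic-embed-l`, neither of which is described on this page, so check the upstream model card for the recommended usage, including any query prefix.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs, top_k=10):
    """Return (doc_index, score) pairs sorted by descending cosine similarity."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def load_arctic_encoder():
    """Assumption: load the model through sentence-transformers.

    Imported lazily so the ranking helpers above work without the package.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")
    return lambda texts: model.encode(texts, normalize_embeddings=True).tolist()

# Toy vectors standing in for real 1024-dimensional embeddings.
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ranking = rank_documents(toy_query, toy_docs)
```

With real embeddings, `encode = load_arctic_encoder()` followed by `rank_documents(encode([query])[0], encode(documents))` would apply the same ranking, under the assumptions stated above.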

Contact

Feel free to open an issue or pull request if you have any questions or suggestions about this project. You can also email Daniel Campos (daniel.campos@snowflake.com).

License

Arctic is licensed under the Apache-2.0 license. The released models can be used for commercial purposes free of charge.

Acknowledgement

We want to thank the open-source community, which has provided the great building blocks and models upon which we could build our own and which made these releases possible. We thank our modeling engineers, Danmei Xu, Luke Merrick, Gaurav Nuti, and Daniel Campos, for making these great models possible. We thank our leadership, Himabindu Pucha, Kelvin So, Vivek Raghunathan, and Sridhar Ramaswamy, for supporting this work. Finally, we thank the researchers who created the BEIR and MTEB benchmarks. It is largely thanks to their tireless work to define what better looks like that we could improve model performance.