replicate/all-mpnet-base-v2

This model is a sentence-transformer based on MPNet, an encoder-style language model introduced by Microsoft.

You can use this model to support downstream tasks like document clustering and semantic search.
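
As a rough illustration of the semantic search use case, the sketch below ranks documents by cosine similarity to a query. It assumes the embeddings have already been computed. The function names `cosine_similarity` and `rank_documents` and the 3-dimensional toy vectors are hypothetical stand-ins (the real embeddings have 768 dimensions), and cosine similarity is a common choice here rather than something this card prescribes.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(
    query_embedding: Sequence[float],
    document_embeddings: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scores = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(document_embeddings)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real model output (assumption: illustration only).
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # nearly parallel to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]
ranking = rank_documents(query, documents)
for index, score in ranking:
    print(f"document {index}: {score:.3f}")
```

The same similarity scores could feed a clustering step, since both tasks rely on nearby embeddings representing semantically related text.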

Model description

This model obtains token-level embeddings and then aggregates them with mean-pooling to produce a single 768-dimensional document embedding.
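
The mean-pooling step described above can be sketched in a few lines. This is a minimal illustration, not the model's actual code: `mean_pool` is a hypothetical helper, the vectors are 4-dimensional toys instead of 768-dimensional, and the use of an attention mask to skip padding positions is an assumption based on common practice.

```python
from typing import List


def mean_pool(
    token_embeddings: List[List[float]],
    attention_mask: List[int],
) -> List[float]:
    """Average the embeddings of the unmasked tokens into a single vector."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:  # assumption: 1 marks a real token, 0 marks padding
            count += 1
            for i, value in enumerate(vector):
                totals[i] += value
    if count == 0:
        raise ValueError("attention mask excludes every token")
    return [total / count for total in totals]


# Three token slots with 4 dimensions each (the real model uses 768).
tokens = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, ignored by the mask
]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

Whatever the number of input tokens, the output has the same dimensionality as a single token embedding, which is what makes the result usable as a fixed-size document vector.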

The base model was fine-tuned with a contrastive learning objective on a dataset of 1 billion sentence pairs: given one sentence from a pair, the model learns to identify its true partner among a set of other sentences.
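
To make the contrastive objective more concrete, here is a minimal sketch of one common formulation: cross-entropy over similarity scores, with the other sentences in a batch serving as negatives. The exact loss, the similarity function, and the `scale` value used to train this model are not stated in this card, so treat all of them as assumptions. `in_batch_contrastive_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(
    anchors: Sequence[Sequence[float]],
    positives: Sequence[Sequence[float]],
    scale: float = 1.0,  # assumption: a multiplier applied to the similarities
) -> float:
    """Mean cross-entropy where positives[i] is the true partner of anchors[i]
    and every other entry of `positives` acts as a negative."""
    if not anchors or len(anchors) != len(positives):
        raise ValueError("anchors and positives must be non-empty and aligned")
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * dot(anchor, candidate) for candidate in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-dimensional embeddings (assumption: illustration only).
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]  # each anchor is closest to its own partner
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor is closest to the wrong partner
loss_matched = in_batch_contrastive_loss(anchors, matched, scale=5.0)
loss_swapped = in_batch_contrastive_loss(anchors, swapped, scale=5.0)
print(loss_matched, loss_swapped)
```

The loss is small when each sentence is most similar to its own partner and large when a different sentence scores higher, which is the pressure that pulls paired sentences together in embedding space.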

Intended use

The fine-tuning procedure makes this model particularly well suited to encoding the semantic content of short documents, such as sentences. The maximum sequence length of the fine-tuning data was 384 tokens, so embedding quality may degrade for inputs longer than that.

However, this issue can be mitigated with additional fine-tuning on relevant data.

Ethical considerations

Language models are widely known to encode social biases, and it should be assumed that this model is no exception.