bge-large-en-v1.5

Comprehensive information on the functionality and API usage of the bge-large-en-v1.5 model.

API Reference: Embeddings API

Model Reference: bge-large-en-v1.5

Paper: BGE Paper

Crafted by BAAI, the bge-large-en-v1.5 model is part of the baai-general-embedding series, which achieves state-of-the-art performance on both the MTEB and C-MTEB leaderboards. Built on advanced embedding techniques and trained on a diverse, extensive corpus, it is well suited to a wide array of downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.

Layers | Embedding Dimension | Recommended Sequence Length
24 | 1024 | 512
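
As a quick sanity check, the embedding dimension can be read directly off an API response. A minimal sketch, assuming the embaas endpoint used in the examples below and an API key in the EMBAAS_API_KEY environment variable:

import os
import requests

# Request a single embedding and confirm its dimensionality.
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.environ["EMBAAS_API_KEY"]}',
}
data = {'texts': ['dimension check'], 'model': 'bge-large-en-v1.5'}
response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
response.raise_for_status()
print(len(response.json()['data'][0]['embedding']))  # expected: 1024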

Suitable Score Functions

  • cosine-similarity
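
Cosine similarity compares the angle between two embedding vectors and ranges from -1 to 1, with higher values indicating greater semantic similarity. The examples below delegate the computation to scikit-learn; a minimal NumPy sketch of the underlying formula:

import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))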

Supported Languages

This model is exclusively trained on English texts, covering a wide range of domains and scenarios.

Examples

Calculate Sentence Similarities

Similarities

import os

import requests
from sklearn.metrics.pairwise import cosine_similarity

# Read the API key from the environment (assumes EMBAAS_API_KEY is set).
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.environ["EMBAAS_API_KEY"]}',
}

english_sentences = [
    "What is the capital of Australia?",
    "Canberra is the capital of Australia."
]

data = {
    'texts': english_sentences,
    'model': 'bge-large-en-v1.5',
}

response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
response.raise_for_status()
embeddings = response.json()["data"]

# Each entry in "data" holds one 1024-dimensional embedding.
similarities = cosine_similarity([embeddings[0]["embedding"]], [embeddings[1]["embedding"]])
print(similarities)
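
cosine_similarity returns a 1x1 matrix here; a score close to 1 means the question and its answer are semantically related, while scores near 0 indicate unrelated texts.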

Information Retrieval

Retrieval

import os

import numpy as np
import requests
from sklearn.metrics.pairwise import cosine_similarity

# Read the API key from the environment (assumes EMBAAS_API_KEY is set).
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Bearer {os.environ["EMBAAS_API_KEY"]}',
}

def get_embeddings(texts, model, instruction=None):
    # Embed a batch of texts, optionally passing a retrieval instruction.
    data = {'texts': texts, 'model': model}
    if instruction:
        data['instruction'] = instruction
    response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
    response.raise_for_status()
    embeddings = [entry['embedding'] for entry in response.json()['data']]
    return np.array(embeddings)

query_text = "What is the most populous country in the world?"

corpus_texts = [
    "China is the most populous country in the world, with over 1.4 billion inhabitants.",
    "The United States is one of the most culturally diverse countries in the world.",
    "Russia is the largest country by land area."
]

model_name = "bge-large-en-v1.5"

query_embeddings = get_embeddings([query_text], model_name)
corpus_embeddings = get_embeddings(corpus_texts, model_name)

# Rank corpus documents by cosine similarity to the query and return the best match.
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(corpus_texts[retrieved_doc_id])
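
The get_embeddings helper above already accepts an instruction parameter, which the request forwards to the API. For short-query-to-passage retrieval, the BGE authors recommend prefixing queries (but not corpus documents) with a retrieval instruction; whether the embaas service applies it the same way is an assumption here. A hedged sketch reusing the helper:

# Optional: the BGE authors suggest this instruction for short queries in
# retrieval settings; corpus texts are embedded without any instruction.
query_instruction = "Represent this sentence for searching relevant passages:"
query_embeddings = get_embeddings([query_text], model_name, instruction=query_instruction)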

Limitations

This model supports English texts only. Details on the API's truncation and token-limit behavior have yet to be confirmed; the recommended sequence length is 512 tokens.
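
Until the truncation behavior is confirmed, long documents can be split client-side before embedding. A rough sketch using a whitespace heuristic (a real tokenizer would be more accurate; the 300-word budget is an assumption chosen to stay under 512 tokens for typical English text):

def chunk_text(text: str, max_words: int = 300) -> list[str]:
    # Whitespace-based chunking; roughly 300 words keeps typical English
    # text under the recommended 512-token sequence length.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]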