GTE-large

Comprehensive information on the functionality and usage of the GTE-large model

API Reference: Embeddings API

Model Reference: GTE-large

Paper: -

The GTE models, developed by Alibaba DAMO Academy, come in three sizes: GTE-large, GTE-base, and GTE-small. Built on the BERT framework, they are trained on a large corpus of relevant text pairs spanning a wide range of domains and scenarios. This broad training data makes the GTE models well suited to a variety of downstream text-embedding tasks, including information retrieval, semantic textual similarity, and text reranking.

Layers: 24
Embedding Dimension: 1024
Recommended Sequence Length: 512
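
The embedding dimension can be checked directly from an API response. Below is a minimal sketch, assuming the same endpoint, request format, and placeholder API key used in the examples further down:

import requests

# Replace {EMBAAS_API_KEY} with your embaas API key
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer {EMBAAS_API_KEY}'}
data = {'texts': ["A short test sentence."], 'model': 'gte-large'}

response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # expected: 1024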

Suitable Score Functions

  • cosine-similarity
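
For reference, cosine similarity can also be computed directly with NumPy; the sketch below is equivalent to the sklearn cosine_similarity call used in the examples that follow:

import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: dot product of the vectors divided by the product of their norms
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))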

Supported Languages

This model is exclusively trained on English texts, covering a wide range of domains and scenarios.

Examples

Calculate Sentence Similarities

import requests
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your embaas API key
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer {EMBAAS_API_KEY}'}

english_sentences = [
    "What is the capital of Australia?",
    "Canberra is the capital of Australia."
]

data = {
    'texts': english_sentences,
    'model': 'gte-large',
}

# Request embeddings for both sentences in a single call
response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
embeddings = response.json()["data"]

# Compare the two sentence embeddings with cosine similarity
similarities = cosine_similarity([embeddings[0]["embedding"]], [embeddings[1]["embedding"]])
print(similarities)

Information Retrieval

import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your embaas API key
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer {EMBAAS_API_KEY}'}

def get_embeddings(texts, model, instruction=None):
    # Fetch embeddings for a list of texts from the embaas API
    data = {'texts': texts, 'model': model}
    if instruction:
        data['instruction'] = instruction
    response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
    embeddings = [entry['embedding'] for entry in response.json()['data']]
    return np.array(embeddings)

query_text = "What is the most populous country in the world?"

corpus_texts = [
    "China is the most populous country in the world, with over 1.4 billion inhabitants.",
    "The United States is one of the most culturally diverse countries in the world.",
    "Russia is the largest country by land area."
]

model_name = "gte-large"

query_embeddings = get_embeddings([query_text], model_name)
corpus_embeddings = get_embeddings(corpus_texts, model_name)

# Rank corpus documents by cosine similarity to the query and print the best match
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(corpus_texts[retrieved_doc_id])

Limitations

This model supports English texts only, and inputs longer than 512 tokens are truncated.
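
If your inputs can exceed this limit, a common workaround is to split long documents into smaller chunks and embed each chunk separately. The sketch below uses word count as a rough proxy for tokens, since exact token counts depend on the model's tokenizer:

def chunk_text(text, max_words=300, overlap=50):
    # Split a long document into overlapping word-based chunks that should
    # stay well under the 512-token limit (word count is only a proxy for tokens)
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += max_words - overlap
    return chunks

Each chunk can then be embedded with the same API call shown above, and the chunk-level results aggregated, for example by taking the maximum similarity across chunks.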