GTE-large
Comprehensive information on the functionality and usage of the GTE-large model
API Reference: Embeddings API
Model Reference: GTE-large
Paper: -
The GTE models, developed by Alibaba DAMO Academy, come in three sizes: GTE-large, GTE-base, and GTE-small. Built on the BERT framework, they are trained on a large corpus of relevant text pairs spanning a wide range of domains and scenarios. This broad training data makes the GTE models suitable for a variety of downstream text embedding tasks, including information retrieval, semantic textual similarity, and text reranking.
| Layers | Embedding Dimension | Recommended Sequence Length |
|---|---|---|
| 24 | 1024 | 512 |
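As a quick sanity check, the embedding dimension can be verified directly against the API response. The following is a minimal sketch built on the Embeddings API calls shown in the examples below; the test sentence is arbitrary and `{EMBAAS_API_KEY}` is a placeholder for your API key.

```python
import requests

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}
data = {'texts': ["Hello, world!"], 'model': 'gte-large'}

# Request a single embedding and confirm it has 1024 dimensions.
response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # expected: 1024
```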
Suitable Score Functions
- cosine-similarity
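Cosine similarity measures the angle between two embedding vectors and is bounded between -1 and 1, with higher values indicating more similar texts. The examples below use scikit-learn's `cosine_similarity`; purely for illustration, a minimal NumPy sketch of the same computation (with toy vectors, not real embeddings) looks like this:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors for illustration; GTE-large embeddings have 1024 dimensions.
print(cosine_sim([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]))  # 0.5
```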
Supported Languages
The model is trained exclusively on English texts covering a wide range of domains and scenarios.
Examples
Calculate Sentence Similarities
```python
import requests
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

english_sentences = [
    "What is the capital of Australia?",
    "Canberra is the capital of Australia."
]

data = {
    'texts': english_sentences,
    'model': 'gte-large',
}

# Request embeddings for both sentences and compare them with cosine similarity.
response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
embeddings = response.json()["data"]

similarities = cosine_similarity([embeddings[0]["embedding"]], [embeddings[1]["embedding"]])
print(similarities)
```
Information Retrieval
```python
import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

def get_embeddings(texts, model, instruction=None):
    # Fetch embeddings for a list of texts from the Embeddings API.
    data = {'texts': texts, 'model': model}
    if instruction:
        data['instruction'] = instruction
    response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
    embeddings = [entry['embedding'] for entry in response.json()['data']]
    return np.array(embeddings)

query_text = "What is the most populous country in the world?"
corpus_texts = [
    "China is the most populous country in the world, with over 1.4 billion inhabitants.",
    "The United States is one of the most culturally diverse countries in the world.",
    "Russia is the largest country by land area."
]

model_name = "gte-large"

query_embeddings = get_embeddings([query_text], model_name)
corpus_embeddings = get_embeddings(corpus_texts, model_name)

# Retrieve the corpus document most similar to the query.
similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(corpus_texts[retrieved_doc_id])
```
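Text Reranking
Text reranking is listed among the supported downstream tasks. One straightforward way to rerank with embeddings is to embed the query and each candidate document, then sort the candidates by cosine similarity; the sketch below follows that pattern, with illustrative query and candidate texts (not taken from the official examples).

```python
import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

def get_embeddings(texts, model):
    # Same helper pattern as in the Information Retrieval example.
    data = {'texts': texts, 'model': model}
    response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
    return np.array([entry['embedding'] for entry in response.json()['data']])

query_text = "What is the capital of Australia?"
candidate_texts = [
    "Canberra is the capital of Australia.",
    "Sydney is the largest city in Australia.",
    "Australia is a country in the Southern Hemisphere."
]

query_embeddings = get_embeddings([query_text], "gte-large")
candidate_embeddings = get_embeddings(candidate_texts, "gte-large")

# Rank candidates from most to least similar to the query.
scores = cosine_similarity(query_embeddings, candidate_embeddings)[0]
for text, score in sorted(zip(candidate_texts, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.4f}  {text}")
```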
Limitations
The model supports English texts only, and any input longer than 512 tokens is truncated to that limit.
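When a document exceeds this limit, one common workaround is to split it into shorter chunks and embed each chunk separately. The sketch below uses a rough word-based split as a proxy for the token limit; it is an approximation, since the API counts tokens rather than words, and the chunk size and overlap are illustrative values.

```python
def chunk_text(text, max_words=300, overlap=30):
    # Rough word-based chunking; 300 words is a conservative proxy for the
    # 512-token limit, since the actual limit is measured in tokens.
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

long_text = "..."  # a document longer than the model's sequence limit
chunks = chunk_text(long_text)
# Each chunk can then be sent to the Embeddings API with model 'gte-large'.
```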