Stella-large-zh-v2

Comprehensive information on the functionality and API usage of the Stella-large-zh-v2 model

API Reference: Embeddings API

Model Reference: Stella-large-zh-v2

Paper: Stella Paper

The Stella-large-zh-v2 model, developed by InfGrad, delivers state-of-the-art performance on Chinese natural language understanding tasks. Trained on a broad corpus of Chinese text, it is well suited to downstream tasks such as text embedding, information retrieval, and semantic textual similarity.

| Layers | Embedding Dimension | Recommended Sequence Length |
|--------|---------------------|-----------------------------|
| 24     | 1024                | 1024                        |

Suitable Score Functions

  • cosine-similarity
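As a reminder, cosine similarity between two embedding vectors can be computed directly with NumPy. The vectors below are toy values for illustration, not real model output:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of the
    # vectors' Euclidean norms; the result lies in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(cosine_similarity(a, b))  # 0.5 for these toy vectors
```

Higher scores indicate more semantically similar texts; because the score depends only on the angle between vectors, it is insensitive to embedding magnitude.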

Supported Languages

This model is exclusively trained on Chinese texts and is optimized for a variety of domains and scenarios.

Examples

Calculate Sentence Similarities


import requests
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer {EMBAAS_API_KEY}'}

chinese_sentences = [
    "你好吗?",
    "我很好。"
]

data = {
    'texts': chinese_sentences,
    'model': 'stella-large-zh-v2',
}

response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
embeddings = response.json()["data"]

similarities = cosine_similarity([embeddings[0]["embedding"]], [embeddings[1]["embedding"]])
print(similarities)

Information Retrieval

import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key
headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer {EMBAAS_API_KEY}'}

def get_embeddings(texts, model, instruction=None):
    # Request embeddings for a batch of texts; an optional instruction
    # can be passed through to the API for instruction-tuned usage.
    data = {'texts': texts, 'model': model}
    if instruction:
        data['instruction'] = instruction
    response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
    embeddings = [entry['embedding'] for entry in response.json()['data']]
    return np.array(embeddings)

query_text = "世界上人口最多的国家是哪里?"

corpus_texts = [
    "中国是世界上人口最多的国家,拥有超过14亿居民。",
    "美国是世界上文化最多样的国家之一。",
    "俄罗斯是世界上面积最大的国家。"
]

model_name = "stella-large-zh-v2"

query_embeddings = get_embeddings([query_text], model_name)
corpus_embeddings = get_embeddings(corpus_texts, model_name)

similarities = cosine_similarity(query_embeddings, corpus_embeddings)
# Pick the corpus document whose embedding is most similar to the query
retrieved_doc_id = np.argmax(similarities)
print(corpus_texts[retrieved_doc_id])

Limitations

This model is trained exclusively on Chinese texts, so performance on other languages is not guaranteed. The recommended sequence length is 1024 tokens; longer inputs may be truncated or embedded less accurately.