Stella-large-zh-v2
Comprehensive information on the functionality and API usage of the Stella-large-zh-v2 model
API Reference: Embeddings API
Model Reference: Stella-large-zh-v2
Paper: Stella Paper
The Stella-large-zh-v2 model, developed by InfGrad, delivers strong performance on Chinese natural language understanding tasks. Trained on a broad corpus of Chinese text, it is well suited to downstream tasks such as text embedding, information retrieval, and semantic textual similarity.
| Layers | Embedding Dimension | Recommended Sequence Length (tokens) |
|---|---|---|
| 24 | 1024 | 1024 |
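As a quick sanity check of the dimensions listed above, the sketch below requests a single embedding and prints its length. The endpoint and response shape follow the examples further down; the sample sentence is arbitrary.

```python
import requests

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

data = {'texts': ["今天天气很好。"], 'model': 'stella-large-zh-v2'}
response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)

embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # expected: 1024, matching the table above
```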
Suitable Score Functions
- cosine-similarity
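Cosine similarity scores a pair of embeddings by the cosine of the angle between them, ranging from -1 to 1. The examples below use `sklearn.metrics.pairwise.cosine_similarity`; as a minimal sketch, the same score can be computed directly with NumPy:

```python
import numpy as np

def cosine_score(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```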
Supported Languages
This model is trained exclusively on Chinese text and is optimized for a variety of domains and scenarios.
Examples
Calculate Sentence Similarities
```python
import requests
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

chinese_sentences = [
    "你好吗?",
    "我很好。"
]

data = {
    'texts': chinese_sentences,
    'model': 'stella-large-zh-v2',
}

response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
embeddings = response.json()["data"]

similarities = cosine_similarity([embeddings[0]["embedding"]], [embeddings[1]["embedding"]])
print(similarities)
```
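Because one embedding is compared against one other, `cosine_similarity` returns a 1×1 matrix whose single entry is the similarity score for the two sentences; values closer to 1 indicate more similar meaning.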
Information Retrieval
```python
import requests
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Replace {EMBAAS_API_KEY} with your API key.
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

def get_embeddings(texts, model, instruction=None):
    data = {'texts': texts, 'model': model}
    if instruction:
        data['instruction'] = instruction
    response = requests.post("https://api.embaas.io/v1/embeddings/", json=data, headers=headers)
    embeddings = [entry['embedding'] for entry in response.json()['data']]
    return np.array(embeddings)

query_text = "世界上人口最多的国家是哪里?"
corpus_texts = [
    "中国是世界上人口最多的国家,拥有超过14亿居民。",
    "美国是世界上文化最多样的国家之一。",
    "俄罗斯是世界上面积最大的国家。"
]

model_name = "stella-large-zh-v2"

query_embeddings = get_embeddings([query_text], model_name)
corpus_embeddings = get_embeddings(corpus_texts, model_name)

similarities = cosine_similarity(query_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)
print(corpus_texts[retrieved_doc_id])
```
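The example above returns only the single best match. To rank the whole corpus instead, the same similarity matrix can be sorted; the snippet below is a small, optional extension that reuses the `similarities` and `corpus_texts` variables from the script above:

```python
# Rank all corpus documents by similarity to the query, highest first.
top_k = 2  # arbitrary cutoff for illustration
ranked_ids = np.argsort(similarities[0])[::-1][:top_k]
for doc_id in ranked_ids:
    print(f"{similarities[0][doc_id]:.4f}\t{corpus_texts[doc_id]}")
```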
Limitations
This model is optimized for Chinese-language text and has a recommended maximum sequence length of 1024 tokens.
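If you need to embed documents that might exceed the 1024-token recommendation, one common client-side pattern (not specific to this API, and shown here only as a rough sketch) is to split the text into chunks, embed each chunk, and average the resulting vectors. The character-based split below is a crude stand-in for proper tokenization, and `embed_fn` stands for any batch embedding helper such as `get_embeddings` from the Information Retrieval example:

```python
import numpy as np

def embed_long_text(text, embed_fn, max_chars=400):
    """Split a long text into chunks, embed each chunk, and return the mean vector."""
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [text]
    chunk_embeddings = np.asarray(embed_fn(chunks))
    return chunk_embeddings.mean(axis=0)

# Example usage with the helper from the Information Retrieval example:
# embedding = embed_long_text(long_text, lambda ts: get_embeddings(ts, "stella-large-zh-v2"))
```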