API Reference
Detailed guide to the Embaas API endpoints, usage, and authentication.
We don't store any of your data! Check out our privacy policy for more information.
Welcome to the Embaas API Reference documentation! Here, you'll find extensive details about our API's capabilities, from available endpoints to the steps needed to authenticate your requests. Our aim is to make your integration with Embaas as smooth as possible.
Pre-requisites
Before diving into the API, ensure you have the following ready:
- An active Embaas account. If you haven't signed up already, please register.
- An API key, which can be obtained from your user dashboard.
Base URL
Embaas API endpoints are available at the following base URL:
https://api.embaas.io
Authentication
Important: Please keep your API keys secure and avoid exposing them in publicly accessible areas like client-side code, public repositories, etc.
Embaas prioritizes the security of your data by using API keys for authentication. These keys are included in the headers of your API requests, providing secure access to our services. Please handle your API keys with utmost care to maintain the confidentiality of your data.
Include your API key in the header of your API request as follows:
Authorization: Bearer EMBAAS_API_KEY
This inclusion allows Embaas to authenticate your requests securely.
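As a minimal sketch, the header can be built in Python as shown here. The helper name `build_auth_headers` is hypothetical and not part of any Embaas library, and reading the key from an environment variable named `EMBAAS_API_KEY` is an assumption made to keep the key out of source code.

```python
import os


def build_auth_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token."""
    # Assumption: when no key is passed explicitly, it is read from the
    # EMBAAS_API_KEY environment variable.
    key = api_key or os.environ.get("EMBAAS_API_KEY")
    if not key:
        raise ValueError("No API key provided and EMBAAS_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}
```

The returned dictionary can be passed as the `headers` argument of an HTTP client such as `requests`.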
LangChain Integration
Embaas is integrated with the Python version of LangChain, a framework for developing applications powered by language models. This integration allows you to use the Embaas API endpoints directly from your LangChain application.
- Embeddings: LangChain Docs
- Document Extraction: LangChain Docs
Python Client
We have developed a Python client for the Embaas API. It is available on PyPI, and the source code is available on GitHub. The client simplifies calling the Embaas API from your Python applications. You can install it using pip:
pip install embaas
And then use it as follows:
    from embaas import EmbaasClient

    client = EmbaasClient(api_key=EMBAAS_API_KEY)
    res = client.get_embeddings(
        model='all-MiniLM-L6-v2',
        texts=['This is an example sentence.', 'Here is another sentence.'],
    )
    embeddings = res.data
EMBAAS_API_KEY can also be set as an environment variable.
Rate Limiting
The Embaas API employs rate limits to ensure fair usage and maintain system stability. The limits are applied on a per-user basis. If you exceed the rate limit, you will receive a 429 Too Many Requests response.
API | Rate Limit |
---|---|
Embeddings API | 100 requests per minute |
Document Extraction API | 100 requests per minute |
Reduce API | 1000 requests per minute |
Increasing Rate Limits
If your application requires higher rate limits, you can request an increase by contacting us. You can reach out to us through our contact page or by sending an email to info@embaas.io.
Make sure to take into account the rate limit while designing your application's logic.
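One common way to account for the limits is to retry with exponential backoff whenever a 429 arrives. The sketch below is an assumption about client-side handling, not a utility provided by Embaas: `call_with_backoff` and its parameters are hypothetical names, and `send` stands for any function that performs the actual request.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry with exponential backoff while it returns 429.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute, for example lambda: requests.post(url, ...).
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    # Still rate limited after all attempts; hand the last response back.
    return response
```

The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use the default `time.sleep` applies.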
Error and Status Codes
This section provides information about HTTP status codes and error responses that you might encounter while using the Document Extraction API. Embaas uses FastAPI's standard error handling, with pydantic for data validation.
HTTP Status Codes
The API responds with different HTTP status codes to indicate the success or failure of a request. Here are some of the most common status codes you might encounter:
HTTP Status Code | Description |
---|---|
200 | OK: The request was successful. The requested resources will be included in the response body. |
400 | Bad Request: The request could not be understood or was missing required parameters. |
401 | Unauthorized: Authentication failed or the user does not have permissions for the requested operation. |
402 | Payment Required: The request was valid, but you have insufficient balance to complete the operation. You need to top up your account. |
404 | Not Found: The requested resource could not be found. |
422 | Validation Error: The request could not be validated or was missing required parameters. |
429 | Too Many Requests: You have exceeded the rate limit for your account. |
500 | Internal Server Error: An error occurred on the server. |
Error Responses (Code: 422)
When the API returns a validation error (status code 422), it also includes an error response in the body to help you understand what went wrong. Error responses follow this structure:
    {
      "detail": [
        {
          "loc": [
            "string"
          ],
          "msg": "string",
          "type": "string"
        }
      ]
    }
Here's an explanation of each field in the error response:
- detail: This is a list of all errors that occurred during the processing of the request.
- loc: This is a list indicating the location of the error in the request. For example, for a missing field in the body, loc might be ["body", "fieldName"].
- msg: A readable message explaining the error.
- type: The type of error that occurred.
For example, if you are missing the file field in the body of your request, you might get an error response like this:
    {
      "detail": [
        {
          "loc": [
            "file"
          ],
          "msg": "field required",
          "type": "value_error.missing"
        }
      ]
    }
This tells you that you need to include the file field in your request body.
Remember, good error handling is crucial for building a robust application. Always check the HTTP status code and error messages to understand the results of your API calls.
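As an illustration, a 422 body with the structure described above can be flattened into readable messages. This is a sketch only: `format_validation_errors` is a hypothetical helper, and the fallback for a non-list `detail` is an assumption about how other error statuses might be shaped.

```python
def format_validation_errors(body):
    """Turn an error body into one readable line per reported error."""
    detail = body.get("detail", [])
    if not isinstance(detail, list):
        # Assumption: some non-422 errors may carry a plain message here.
        return [str(detail)]
    lines = []
    for err in detail:
        # `loc` is a list such as ["body", "fieldName"]; join it into a path.
        location = ".".join(str(part) for part in err.get("loc", []))
        lines.append(f"{location}: {err.get('msg', '')} ({err.get('type', '')})")
    return lines
```

Calling it on the missing-file example yields a single line naming the `file` field, its message, and its error type.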
Document Extraction API
The Document Extraction API is a powerful tool designed to extract text from a wide range of document file types. This API Reference will guide you on how to effectively use this API to extract, chunk, and create embeddings for the extracted text.
Introduction
This API Endpoint is still in beta and might be subject to changes in the future.
The Document Extraction API is designed to cater to the diverse requirements of your applications. At present, we offer support for metadata extraction specifically from PDF files. However, we are actively exploring opportunities to extend our services to other file formats.
We strongly encourage user feedback to enhance our services and better align them with your needs. Feel free to reach out to us on our Discord or contact us with your suggestions and requirements. We are committed to helping you acquire the metadata that will add the most value to your use case.
Document Text Extraction
The Document Text Extraction endpoint enables you to extract text from a given document (file). Additionally, it offers options to chunk the extracted text and create embeddings for each chunk. We currently support over 160 file types and are working towards including mp4 and images.
Endpoint: POST https://api.embaas.io/v1/document/extract-text/
Request Headers
- Content-Type: multipart/form-data - This indicates that the request body is of type multipart form data.
- Authorization: Bearer {EMBAAS_API_KEY} - This is required to authenticate your request with your API key. You can create an API key on your dashboard.
Request Body
The request body must be of type multipart/form-data and you can include the following fields:
Field | Type | Description | Default |
---|---|---|---|
file (required) | string | The file from which the text should be extracted | |
should_chunk | boolean | Determines whether the extracted text should be divided into chunks | True |
chunk_size | integer | Defines the maximum size of the text chunks | 256 |
chunk_overlap | integer | Specifies the maximum overlap allowed between chunks | 20 |
chunk_splitter | enum | Indicates the Text-Splitter for creating chunks. Refer to LangChain Text Splitters for valid values. | CharacterTextSplitter |
separators | array | Defines the separators for chunks. Only the first element is used for single separators. | ["\n"] |
should_embed | boolean | Specifies whether embeddings should be created for the extracted text | False |
model | enum | Required when should_embed is true. Refer to the embeddings API docs for details. | |
instruction | string | Can be used when should_embed is true. Refer to the embeddings API docs for details. | |
Set should_embed to true if you want to create embeddings for your document. If you do so, the model field is required. Refer to the embeddings API docs for details.
Example Request
Below is an example of how to use this API endpoint in Python:
    import requests

    url = "https://api.embaas.io/v1/document/extract-text/"

    # Replace {EMBAAS_API_KEY} with your actual API key.
    headers = {
        'Authorization': 'Bearer {EMBAAS_API_KEY}'
    }

    # Form fields are sent as strings in a multipart/form-data request.
    data = {
        'chunk_size': '256',
        'chunk_splitter': 'CharacterTextSplitter',
        'should_embed': 'true',
        'model': 'instructor-large',
        'separators': '\n',
        'chunk_overlap': '20',
        'should_chunk': 'true',
        'instruction': 'Represent the document statement'
    }

    # Replace {FILE_NAME} and {FILE_PATH}; the file is closed when the block exits.
    with open('{FILE_PATH}', 'rb') as f:
        files = [
            ('file', ('{FILE_NAME}', f, 'application/pdf'))
        ]
        response = requests.post(url, headers=headers, data=data, files=files)

    print(response.json())
Example Response
The response from the Document Text Extraction endpoint will vary based on your request parameters. Below is an example of a typical response.
    {
      "usage": {
        "total_mb": 69,
        // Only set if 'should_embed' is set to 'true'
        "total_tokens": 420,
        "prompt_tokens": 420
      },
      "data": {
        "chunks": [
          {
            "text": "This is text in some pdf document!",
            "metadata": {
              "start_page": 1,
              "end_page": 2
            },
            // Only set if 'should_embed' is set to 'true'
            "embedding": [
              1.2323,
              0.12,
              ...
            ],
            "index": 0
          }
        ],
        "metadata": {
          "content_type": "application/pdf",
          "mime_type": "application/pdf",
          "file_extension": "pdf",
          "file_name": "Example.pdf",
          "file_size": 860079,
          "char_count": 27313
        }
      },
      ...
    }
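A response of this shape can be unpacked into parallel lists of texts and embeddings. The following is a sketch under the assumption that the parsed JSON matches the example above; `split_chunks` is a hypothetical helper name.

```python
def split_chunks(response_json):
    """Return (texts, embeddings) from a document extraction response."""
    chunks = response_json.get("data", {}).get("chunks", [])
    # Order chunks by their index so the output follows document order.
    chunks = sorted(chunks, key=lambda c: c.get("index", 0))
    texts = [c["text"] for c in chunks]
    # `embedding` is only present when should_embed was set to true.
    embeddings = [c["embedding"] for c in chunks if "embedding" in c]
    return texts, embeddings
```

When `should_embed` was false, the second list is simply empty.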
Document Text Extraction (Bytes)
The Document Text Extraction Bytes endpoint enables you to extract text from the raw bytes of a document. This is particularly useful when you already have the file content loaded in memory and don't want to write it to a physical file before processing. Like the Document Text Extraction endpoint, it also offers options to chunk the extracted text and create embeddings for each chunk.
Endpoint: POST https://api.embaas.io/v1/document/extract-text/bytes/
Request Headers
- Content-Type: application/json - This indicates that the request body is of type JSON.
- Authorization: Bearer {EMBAAS_API_KEY} - This is required to authenticate your request with your API key. You can create an API key on your dashboard.
Request Body
The request body must be of type application/json and you can include the following fields:
Field | Type | Description | Default |
---|---|---|---|
bytes (required) | string | Base64 encoded string of the file bytes from which the text should be extracted | |
mime_type | string | Mime type of the document | |
file_extension | string | File extension of the document | |
file_name | string | File name of the document | |
should_chunk | boolean | Determines whether the extracted text should be divided into chunks | True |
chunk_size | integer | Defines the maximum size of the text chunks | 256 |
chunk_overlap | integer | Specifies the maximum overlap allowed between chunks | 20 |
chunk_splitter | enum | Indicates the Text-Splitter for creating chunks. Refer to LangChain Text Splitters for valid values. | CharacterTextSplitter |
separators | array | Defines the separators for chunks. Only the first element is used for single separators. | ["\n"] |
should_embed | boolean | Specifies whether embeddings should be created for the extracted text | False |
model | enum | Required when should_embed is true. Refer to the embeddings API docs for details. | |
instruction | string | Can be used when should_embed is true. Refer to the embeddings API docs for details. | |
Set should_embed to true if you want to create embeddings for your document. If you do so, the model field is required. Refer to the embeddings API docs for details.
Example Request
Below is an example of how to use this API endpoint in Python:
    import base64

    import requests

    url = "https://api.embaas.io/v1/document/extract-text/bytes/"

    # Replace {EMBAAS_API_KEY} with your actual API key.
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer {EMBAAS_API_KEY}'
    }

    # Read the file and encode its bytes as a Base64 string.
    with open('{FILE_PATH}', 'rb') as f:
        file_bytes = f.read()
    bytes_str = base64.b64encode(file_bytes).decode()

    data = {
        'bytes': bytes_str,
        'mime_type': 'application/pdf',
        'file_extension': 'pdf',
        'file_name': '{FILE_NAME}',
        'should_chunk': True,
        'chunk_size': 256,
        'chunk_overlap': 20,
        'chunk_splitter': 'CharacterTextSplitter',
        'separators': ['\n'],
        'should_embed': True,
        'model': 'instructor-large',
        'instruction': 'Represent the document statement'
    }

    response = requests.post(url, headers=headers, json=data)
    print(response.json())
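When the content is already in memory (for instance, downloaded or generated bytes), the payload can be assembled without touching the file system. This is a sketch; `build_bytes_payload` is a hypothetical helper, and only the field names come from the request body table.

```python
import base64


def build_bytes_payload(content, file_name=None, mime_type=None, **options):
    """Build the JSON body for the bytes endpoint from in-memory content."""
    # The API expects the raw bytes as a Base64 encoded string.
    payload = {"bytes": base64.b64encode(content).decode("ascii")}
    if file_name is not None:
        payload["file_name"] = file_name
    if mime_type is not None:
        payload["mime_type"] = mime_type
    # Remaining keyword arguments (e.g. should_chunk, chunk_size) pass through.
    payload.update(options)
    return payload
```

The result can be sent with the `json` argument of an HTTP client, which serializes it as application/json.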
Example Response
The response from the Document Text Extraction (Bytes) endpoint will vary based on your request parameters. Below is an example of a typical response.
    {
      "usage": {
        "total_mb": 69,
        // Only set if 'should_embed' is set to 'true'
        "total_tokens": 420,
        "prompt_tokens": 420
      },
      "data": {
        "chunks": [
          {
            "text": "This is text in some pdf document!",
            "metadata": {
              "start_page": 1,
              "end_page": 2
            },
            // Only set if 'should_embed' is set to 'true'
            "embedding": [
              1.2323,
              0.12,
              ...
            ],
            "index": 0
          },
          ...
        ],
        "metadata": {
          "content_type": "application/pdf",
          "mime_type": "application/pdf",
          "file_extension": "pdf",
          "file_name": "Example.pdf",
          "file_size": 860079,
          "char_count": 27313
        }
      }
    }
Metadata per Chunk
Metadata available for each text chunk is contingent on the MIME type of the document. This table provides an overview of the metadata fields currently available per text chunk for different MIME types:
Mime Type | Fields | Description |
---|---|---|
application/pdf | start_page, end_page | Specifies the pages on which the extracted text commences and concludes. |
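For PDF documents, the page metadata makes it possible to look up which chunks touch a given page. The sketch below assumes `start_page` and `end_page` are integers forming an inclusive range; `chunks_by_page` is a hypothetical helper name.

```python
from collections import defaultdict


def chunks_by_page(chunks):
    """Map each page number to the texts of the chunks that touch it."""
    pages = defaultdict(list)
    for chunk in chunks:
        meta = chunk.get("metadata") or {}
        start, end = meta.get("start_page"), meta.get("end_page")
        if start is None or end is None:
            # Chunks without page metadata (non-PDF input) are skipped.
            continue
        # Assumption: the page range is inclusive on both ends.
        for page in range(start, end + 1):
            pages[page].append(chunk["text"])
    return dict(pages)
```

A chunk that spans two pages appears under both page numbers.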
Embeddings API
API Reference for the Embeddings API.
Introduction
The Embeddings API is a powerful tool that allows you to generate numerical vector representations, or "embeddings", for a given list of input texts. This service utilizes pre-trained machine learning models to convert textual data into a form that can be processed efficiently by various data analysis tools. You simply provide your text data and specify the model, and the API will return the corresponding embeddings. This service is particularly beneficial for applications involving text similarity, clustering, and other NLP tasks that require understanding of semantic context.
Text Embeddings
The Text Embeddings API enables you to generate embeddings for a list of input texts using a specified model.
Endpoint POST
Request Headers
- Content-Type: application/json - This indicates that the request body is of type JSON.
- Authorization: Bearer ${YOUR_API_KEY} - This is required to authenticate your request with your API key.
Request Body
The request body must be of type application/json and you can include the following fields:
Field | Type | Description | Default | Constraints |
---|---|---|---|---|
texts (required) | array | A list of strings, each string is a sentence or a piece of text for which the embedding should be generated. | - | 1-256 items. Texts are truncated if they are longer than the maximum sequence length of the model. |
model (required) | string | Specifies the model to be used for generating embeddings. Refer to our supported models. | - | Has to be a valid model name. |
instruction | string | Allows domain-specific embeddings without training. Only possible for the "Instructor" and "e5"-based models. | Default Instruction of model | - |
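Because a single request accepts at most 256 texts, larger inputs have to be split across several requests. A minimal sketch of such batching follows; `batch_texts` is a hypothetical helper, and the default of 256 mirrors the constraint in the table above.

```python
def batch_texts(texts, max_items=256):
    """Yield consecutive slices of `texts` with at most `max_items` each."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    for start in range(0, len(texts), max_items):
        yield texts[start:start + max_items]
```

Each yielded batch can then be sent as the `texts` field of its own request, keeping the rate limits in mind.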
Example request
Below is an example of how to use this API endpoint with the Python client:
    from embaas import EmbaasClient

    client = EmbaasClient(api_key=EMBAAS_API_KEY)
    client.get_embeddings(
        model='all-MiniLM-L6-v2',
        texts=['This is an example sentence.', 'Here is another sentence.'],
    )
Example response
The response from the Text Embeddings API endpoint will vary based on your request parameters. Below is an example of a typical response.
    {
      "data": [
        {
          "embedding": [
            0.1,
            0.06069,
            ...
          ],
          "index": 0
        },
        {
          "embedding": [
            -0.1,
            0.3,
            ...
          ],
          "index": 1,
          "was_truncated": true
        }
      ],
      "model": "all-MiniLM-L6-v2",
      "usage": {
        "prompt_tokens": 420,
        "total_tokens": 420
      }
    }
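Two typical follow-up steps are comparing embeddings and checking which inputs were truncated. The sketch below assumes the parsed response matches the example above; both function names are hypothetical, and cosine similarity is shown as one common comparison measure, not as something the API prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def truncated_indices(response_json):
    """Return the indices of inputs flagged with was_truncated."""
    return [
        item["index"]
        for item in response_json.get("data", [])
        if item.get("was_truncated")
    ]
```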
Reduce API
API Reference for the Reduce API.
Introduction
This API Endpoint is still in beta and might be subject to changes in the future.
The Reduce API is designed to reduce the size of your existing embeddings. Embeddings are a great way to represent text in a vector space, but they can be quite large. The Reduce API allows you to reduce the size of your embeddings by up to 70% while maintaining similar accuracy. This is done by a supervised learning algorithm that learns to map your embeddings to a smaller space.
Please check our supported models for the Reduce API.
Reduce API
The Reduce API is designed to reduce the size of your existing embeddings.
Endpoint: POST https://api.embaas.io/v1/reduce/
Request Headers
- Content-Type: application/json - This indicates that the request body is of type JSON.
- Authorization: Bearer ${YOUR_API_KEY} - This is required to authenticate your request with your API key.
Request Body
The request body must be of type application/json and you can include the following fields:
Field | Type | Description | Default | Constraints |
---|---|---|---|---|
embeddings (required) | array | A list of embeddings you want to reduce. | - | 1-10,000 items. All embeddings must have the same size, matching the dimensions of the selected model. |
model (required) | string | Specifies the model which generated the embeddings. Refer to our supported models. | - | Has to be a valid model name. |
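These constraints can be checked on the client before a request is sent. The sketch below uses a hypothetical helper, `validate_embeddings`; it verifies only the item count and that all vectors share one size, since the expected dimension of each model is not listed here.

```python
def validate_embeddings(embeddings, max_items=10_000):
    """Check item count and consistent size; return the shared dimension."""
    if not 1 <= len(embeddings) <= max_items:
        raise ValueError(
            f"expected 1 to {max_items} embeddings, got {len(embeddings)}"
        )
    dim = len(embeddings[0])
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            raise ValueError(
                f"embedding {i} has {len(vec)} dimensions, expected {dim}"
            )
    return dim
```

The returned dimension can additionally be compared against the model's documented output size if that is known.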
Example request
Below is an example of how to use this API endpoint in Python:
    import requests

    # The '...' entries stand for the remaining values of each embedding.
    data = {
        "embeddings": [[0.1, 0.078, ...], [-0.342, 0.062, ...]],
        "model": "text-embedding-ada-002"
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer {EMBAAS_API_KEY}"  # replace {EMBAAS_API_KEY} with the actual API key
    }

    response = requests.post('https://api.embaas.io/v1/reduce/', json=data, headers=headers)
Example response
The response from the Reduce API endpoint will vary based on your request parameters. Below is an example of a typical response.
    {
      "data": [
        {
          "embedding": [
            0.1,
            0.06069,
            ...
          ],
          "index": 0
        },
        {
          "embedding": [
            -0.1,
            0.3,
            ...
          ],
          "index": 1,
          "was_truncated": true
        }
      ],
      "model": "text-embedding-ada-002",
      "usage": {
        "total_reduced_embeddings": 2
      }
    }
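To see how much smaller a reduced embedding is, its length can be compared with the original. This is a sketch that measures size as the number of dimensions, which is an assumption about what "size" means here; `size_reduction` is a hypothetical helper name.

```python
def size_reduction(original, reduced):
    """Fraction of dimensions removed; 0.5 means half the original size."""
    if not original:
        raise ValueError("original embedding must not be empty")
    return 1 - len(reduced) / len(original)
```

Pass the vector you sent as `original` and the matching `embedding` entry from the response as `reduced`.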