API Reference

Detailed guide to the Embaas API endpoints, usage, and authentication.

Welcome to the Embaas API Reference documentation! Here you'll find details about the API's capabilities, from the available endpoints to the steps required to authenticate your requests. Our aim is to make your integration with Embaas as smooth as possible.

Pre-requisites

Before diving into the API, ensure you have the following ready:

  1. An active Embaas account. If you haven't signed up already, please register.
  2. An API key, which can be obtained from your user dashboard.

Base URL

Embaas API endpoints are available at the following base URL:

https://api.embaas.io

Authentication

Embaas prioritizes the security of your data by using API keys for authentication. These keys are included in the headers of your API requests, providing secure access to our services. Please handle your API keys with utmost care to maintain the confidentiality of your data.

Include your API key in the header of your API request as follows:

Authorization: Bearer EMBAAS_API_KEY

This inclusion allows Embaas to authenticate your requests securely.
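As a minimal sketch, the header above could be built in Python as follows. The helper name `build_auth_headers` is hypothetical and not part of any Embaas library, and reading the key from an `EMBAAS_API_KEY` environment variable is an assumption made for illustration.

```python
import os


def build_auth_headers(api_key=None):
    """Return request headers carrying the API key as a Bearer token.

    Hypothetical helper: falls back to the EMBAAS_API_KEY environment
    variable when no key is passed explicitly.
    """
    key = api_key or os.environ.get("EMBAAS_API_KEY")
    if not key:
        raise ValueError("No API key given and EMBAAS_API_KEY is not set")
    return {"Authorization": f"Bearer {key}"}


# "my-secret-key" is a placeholder, not a real key.
headers = build_auth_headers("my-secret-key")
print(headers)
```

Keeping the key out of source code and loading it at runtime is one way to follow the advice above about handling API keys with care.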

LangChain Integration

Embaas is integrated with the Python version of LangChain, a framework for developing applications powered by language models. This integration allows you to use the Embaas API endpoints directly from your LangChain application.

Python Client

We have developed a Python client for the Embaas API. You can find it on PyPI, and the source code is available on GitHub. It simplifies using the Embaas API in your Python applications. You can install it using pip:

pip install embaas

And then use it as follows:

Example
from embaas import EmbaasClient

# Placeholder: replace {EMBAAS_API_KEY} with your actual API key.
EMBAAS_API_KEY = '{EMBAAS_API_KEY}'

client = EmbaasClient(api_key=EMBAAS_API_KEY)
res = client.get_embeddings(
    model='all-MiniLM-L6-v2',
    texts=['This is an example sentence.', 'Here is another sentence.'],
)
embeddings = res.data

EMBAAS_API_KEY can also be set as an environment variable.

Rate Limiting

The Embaas API employs rate limits to ensure fair usage and maintain system stability. The limits are applied on a per-user basis. If you exceed the rate limit, you will receive a 429 Too Many Requests response.

| API | Rate Limit |
| --- | --- |
| Embeddings API | 100 requests per minute |
| Document Extraction API | 100 requests per minute |
| Reduce API | 1000 requests per minute |

Increasing Rate Limits

If your application requires higher rate limits, you can request an increase by contacting us. You can reach out to us through our contact page or by sending an email to info@embaas.io.

Make sure to take into account the rate limit while designing your application's logic.
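One way to account for the rate limit is to retry after a 429 response with an increasing delay. The sketch below is an illustration only: the function name `call_with_retry`, the number of attempts, and the delay values are assumptions, not Embaas recommendations.

```python
import time


def call_with_retry(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute, for example a lambda wrapping requests.post.
    `sleep` is injectable so the delay can be replaced in tests.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ... (illustrative values)
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the last attempt; hand the 429 back to the caller.
    return response
```

With the requests library, a call might look like `call_with_retry(lambda: requests.post(url, headers=headers, json=data))`.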

Error and Status Codes

This section provides information about HTTP status codes and error responses that you might encounter while using the Document Extraction API. Embaas uses FastAPI's standard error handling, with pydantic for data validation.

HTTP Status Codes

The API responds with different HTTP status codes to indicate the success or failure of a request. Here are some of the most common status codes you might encounter:

| HTTP Status Code | Description |
| --- | --- |
| 200 | OK - The request was successful. The requested resources will be included in the response body. |
| 400 | Bad Request - The request could not be understood or was missing required parameters. |
| 401 | Unauthorized - Authentication failed or user does not have permissions for the requested operation. |
| 402 | Payment Required - The request was valid, but you have insufficient balance to complete the operation. You need to top up your account. |
| 404 | Not Found - The requested resource could not be found. |
| 422 | Validation Error - The request could not be validated or was missing required parameters. |
| 429 | Too Many Requests - You have exceeded the rate limit for your account. |
| 500 | Internal Server Error - An error occurred on the server. |

Error Responses (Code: 422)

When the API returns a 422 status code, it also includes an error response in the body to help you understand what went wrong. Error responses follow this structure:

Validation Error
{
  "detail": [
    {
      "loc": [
        "string"
      ],
      "msg": "string",
      "type": "string"
    }
  ]
}

Here's an explanation of each field in the error response:

  • detail: This is a list of all errors that occurred during the processing of the request.
  • loc: This is a list indicating the location of the error in the request. For example, for a missing field in the body, loc might be ["body", "fieldName"].
  • msg: A readable message explaining the error.
  • type: The type of error that occurred.

For example, if you are missing the file field in the body of your request, you might get an error response like this:

Validation Error example
{
  "detail": [
    {
      "loc": [
        "file"
      ],
      "msg": "field required",
      "type": "value_error.missing"
    }
  ]
}

This tells you that you need to include the file field in your request body.

Remember, good error handling is crucial for building a robust application. Always check the HTTP status code and error messages to understand the results of your API calls.
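The validation error structure described above can be turned into readable messages with a few lines of Python. This is a minimal sketch; the function name `format_validation_errors` is hypothetical, and the input is the parsed JSON body of a 422 response.

```python
def format_validation_errors(body):
    """Turn a 422 response body into 'location: message (type)' lines."""
    lines = []
    for error in body.get("detail", []):
        # loc is a list such as ["body", "fieldName"]; join it into a dotted path.
        location = ".".join(str(part) for part in error.get("loc", []))
        lines.append(f"{location}: {error.get('msg')} ({error.get('type')})")
    return lines


example_body = {
    "detail": [
        {"loc": ["file"], "msg": "field required", "type": "value_error.missing"}
    ]
}
for line in format_validation_errors(example_body):
    print(line)  # file: field required (value_error.missing)
```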

Document Extraction API

The Document Extraction API is a powerful tool designed to extract text from a wide range of document file types. This API Reference will guide you on how to effectively use this API to extract, chunk, and create embeddings for the extracted text.

Introduction

The Document Extraction API is designed to cater to the diverse requirements of your applications. At present, we offer support for metadata extraction specifically from PDF files. However, we are actively exploring opportunities to extend our services to other file formats.

We strongly encourage user feedback to enhance our services and better align them with your needs. Feel free to reach out to us on our Discord or contact us with your suggestions and requirements. We are committed to helping you acquire the metadata that will add the most value to your use case.

Document Text Extraction

The Document Text Extraction endpoint enables you to extract text from a given document (file). Additionally, it offers options to chunk the extracted text and create embeddings for each chunk. We currently support over 160 file types and are working towards including mp4 and images.

Endpoint POST

https://api.embaas.io/v1/document/extract-text/

Request Headers

  • Content-Type: multipart/form-data - This indicates that the request body is of type multipart form data.
  • Authorization: Bearer {EMBAAS_API_KEY} - This is required to authenticate your request with your API key. You can create an API key on your dashboard.

Request Body

The request body must be of type multipart/form-data and you can include the following fields:

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| file (required) | string | The file from which the text should be extracted | - |
| should_chunk | boolean | Determines whether the extracted text should be divided into chunks | True |
| chunk_size | integer | Defines the maximum size of the text chunks | 256 |
| chunk_overlap | integer | Specifies the maximum overlap allowed between chunks | 20 |
| chunk_splitter | enum | Indicates the Text-Splitter for creating chunks. Refer to LangChain Text Splitters for valid values. | CharacterTextSplitter |
| separators | array | Defines the separators for chunks. Only the first element is used for single separators. | ["\n"] |
| should_embed | boolean | Specifies whether embeddings should be created for the extracted text | False |
| model | enum | Required when should_embed is true. Refer to the embeddings API docs for details. | - |
| instruction | string | Can be used when should_embed is true. Refer to the embeddings API docs for details. | - |

Set should_embed to true if you want to create embeddings for your document. If you do so, the model field is required. Refer to the embeddings API docs for details.
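The dependency between should_embed and model can be checked on the client before a request is sent. The following sketch assembles the optional form fields using the documented defaults; the helper name `build_extract_form` is hypothetical, and encoding booleans as the strings "true" and "false" is an assumption based on how form values are usually sent.

```python
def build_extract_form(should_chunk=True, chunk_size=256, chunk_overlap=20,
                       should_embed=False, model=None, instruction=None):
    """Build the optional multipart form fields for a text extraction request.

    Hypothetical helper: form values are sent as strings, and the model
    field is enforced locally whenever should_embed is true.
    """
    if should_embed and not model:
        raise ValueError("model is required when should_embed is true")
    form = {
        "should_chunk": str(should_chunk).lower(),
        "chunk_size": str(chunk_size),
        "chunk_overlap": str(chunk_overlap),
        "should_embed": str(should_embed).lower(),
    }
    if should_embed:
        form["model"] = model
        if instruction:
            form["instruction"] = instruction
    return form


print(build_extract_form(should_embed=True, model="instructor-large"))
```

The returned dictionary can be passed as the `data` argument of `requests.post`, alongside the file itself.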

Example Request

The following Python example shows how to call this endpoint:

/v1/document/extract-text/
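import requests

url = "https://api.embaas.io/v1/document/extract-text/"

# Replace {EMBAAS_API_KEY} with your API key. Content-Type is not set here:
# requests generates the multipart/form-data header (including the boundary).
headers = {
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

# Form fields are sent as strings.
data = {
    'chunk_size': '256',
    'chunk_splitter': 'CharacterTextSplitter',
    'should_embed': 'true',
    'model': 'instructor-large',
    'separators': '\n',
    'chunk_overlap': '20',
    'should_chunk': 'true',
    'instruction': 'Represent the document statement'
}

# Replace {FILE_NAME} and {FILE_PATH}; the file is closed when the block exits.
with open('{FILE_PATH}', 'rb') as f:
    files = [
        ('file', ('{FILE_NAME}', f, 'application/pdf'))
    ]
    response = requests.post(url, headers=headers, data=data, files=files)

print(response.json())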

Example Response

The response from the Document Text Extraction endpoint will vary based on your request parameters. Below is an example of a typical response.

Response with media type: 'application/pdf'
{
  "usage": {
    "total_mb": 69,
    // Only set, if 'should_embed' is set to 'true'
    "total_tokens": 420,
    "prompt_tokens": 420
  },
  "data": {
    "chunks": [
      {
        "text": "This is text in some pdf document!",
        "metadata": {
          "start_page": 1,
          "end_page": 2
        },
        // Only set, if 'should_embed' is set to 'true'
        "embedding": [
          1.2323,
          0.12,
          ...
        ],
        "index": 0
      }
    ],
    "metadata": {
      "content_type": "application/pdf",
      "mime_type": "application/pdf",
      "file_extension": "pdf",
      "file_name": "Example.pdf",
      "file_size": 860079,
      "char_count": 27313
    }
  },
  ...
}
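A response shaped like the example above can be read with plain dictionary access. This is a minimal sketch; the function name `summarize_chunks` is hypothetical, and the sample body only mirrors the fields shown in the example response.

```python
def summarize_chunks(response_body):
    """Return (index, text, has_embedding) for each chunk in a response."""
    chunks = response_body.get("data", {}).get("chunks", [])
    return [
        # "embedding" is only present when should_embed was set to true.
        (chunk.get("index"), chunk.get("text"), "embedding" in chunk)
        for chunk in chunks
    ]


# Sample body mirroring the documented structure (values are illustrative).
sample = {
    "data": {
        "chunks": [
            {
                "text": "This is text in some pdf document!",
                "metadata": {"start_page": 1, "end_page": 2},
                "index": 0,
            }
        ]
    }
}
print(summarize_chunks(sample))
```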

Document Text Extraction (Bytes)

The Document Text Extraction Bytes endpoint enables you to extract text from the bytes of a document. This is particularly useful when you already have the file content loaded in memory and don't want to write it to a physical file before processing. Like the Document Text Extraction endpoint, it also offers options to chunk the extracted text and create embeddings for each chunk.

Endpoint POST

https://api.embaas.io/v1/document/extract-text/bytes/

Request Headers

  • Content-Type: application/json - This indicates that the request body is of type JSON.
  • Authorization: Bearer {EMBAAS_API_KEY} - This is required to authenticate your request with your API key. You can create an API key on your dashboard.

Request Body

The request body must be of type application/json and you can include the following fields:

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| bytes (required) | string | Base64 encoded string of the file bytes from which the text should be extracted | - |
| mime_type | string | Mime type of the document | - |
| file_extension | string | File extension of the document | - |
| file_name | string | File name of the document | - |
| should_chunk | boolean | Determines whether the extracted text should be divided into chunks | True |
| chunk_size | integer | Defines the maximum size of the text chunks | 256 |
| chunk_overlap | integer | Specifies the maximum overlap allowed between chunks | 20 |
| chunk_splitter | enum | Indicates the Text-Splitter for creating chunks. Refer to LangChain Text Splitters for valid values. | CharacterTextSplitter |
| separators | array | Defines the separators for chunks. Only the first element is used for single separators. | ["\n"] |
| should_embed | boolean | Specifies whether embeddings should be created for the extracted text | False |
| model | enum | Required when should_embed is true. Refer to the embeddings API docs for details. | - |
| instruction | string | Can be used when should_embed is true. Refer to the embeddings API docs for details. | - |

Set should_embed to true if you want to create embeddings for your document. If you do so, the model field is required. Refer to the embeddings API docs for details.

Example Request

The following Python example shows how to call this endpoint:

/v1/document/extract-text/bytes/
import requests
import base64

url = "https://api.embaas.io/v1/document/extract-text/bytes/"
headers = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer {EMBAAS_API_KEY}'
}

with open('{FILE_PATH}', 'rb') as f:
    file_bytes = f.read()

bytes_str = base64.b64encode(file_bytes).decode()

data = {
    'bytes': bytes_str,
    'mime_type': 'application/pdf',
    'file_extension': 'pdf',
    'file_name': '{FILE_NAME}',
    'should_chunk': True,
    'chunk_size': 256,
    'chunk_overlap': 20,
    'chunk_splitter': 'CharacterTextSplitter',
    'separators': ['\n'],
    'should_embed': True,
    'model': 'instructor-large',
    'instruction': 'Represent the document statement'
}

response = requests.post(url, headers=headers, json=data)
print(response.json())

Example Response

The response from the Document Text Extraction Bytes endpoint will vary based on your request parameters. Below is an example of a typical response.

Response with media type: 'application/pdf'
{
  "usage": {
    "total_mb": 69,
    // Only set, if 'should_embed' is set to 'true'
    "total_tokens": 420,
    "prompt_tokens": 420
  },
  "data": {
    "chunks": [
      {
        "text": "This is text in some pdf document!",
        "metadata": {
          "start_page": 1,
          "end_page": 2
        },
        // Only set, if 'should_embed' is set to 'true'
        "embedding": [
          1.2323,
          0.12,
          ...
        ],
        "index": 0
      },
      ...
    ],
    "metadata": {
      "content_type": "application/pdf",
      "mime_type": "application/pdf",
      "file_extension": "pdf",
      "file_name": "Example.pdf",
      "file_size": 860079,
      "char_count": 27313
    }
  }
}

Metadata per Chunk

Metadata available for each text chunk is contingent on the MIME type of the document. This table provides an overview of the metadata fields currently available per text chunk for different MIME types:

| Mime Type | Fields | Description |
| --- | --- | --- |
| application/pdf | start_page, end_page | Specifies the pages on which the extracted text commences and concludes. |
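For PDF documents, the start_page and end_page fields make it possible to look up which chunks touch a given page. The sketch below assumes that both page numbers are inclusive, which the table does not state explicitly; the function name `chunks_by_page` is hypothetical.

```python
from collections import defaultdict


def chunks_by_page(chunks):
    """Map each page number to the indexes of the chunks that cover it.

    Assumption: start_page and end_page are both inclusive.
    """
    pages = defaultdict(list)
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        start, end = meta.get("start_page"), meta.get("end_page")
        if start is None or end is None:
            continue  # no page metadata available for this chunk
        for page in range(start, end + 1):
            pages[page].append(chunk["index"])
    return dict(pages)


# Illustrative chunks shaped like the documented response.
example_chunks = [
    {"index": 0, "metadata": {"start_page": 1, "end_page": 2}},
    {"index": 1, "metadata": {"start_page": 2, "end_page": 2}},
]
print(chunks_by_page(example_chunks))  # {1: [0], 2: [0, 1]}
```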

Embeddings API

API Reference for the Embeddings API.

Introduction

The Embeddings API is a powerful tool that allows you to generate numerical vector representations, or "embeddings", for a given list of input texts. This service utilizes pre-trained machine learning models to convert textual data into a form that can be processed efficiently by various data analysis tools. You simply provide your text data and specify the model, and the API will return the corresponding embeddings. This service is particularly beneficial for applications involving text similarity, clustering, and other NLP tasks that require understanding of semantic context.
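Text similarity is commonly measured as the cosine similarity between two embedding vectors. The following is a minimal, dependency-free sketch of that calculation; it is a general technique and not something specific to the Embaas API, and the toy vectors are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

Values close to 1 indicate vectors pointing in nearly the same direction, while values close to 0 indicate unrelated directions.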

Text Embeddings

The Text Embeddings API enables you to generate embeddings for a list of input texts using a specified model.

Endpoint POST

https://api.embaas.io/v1/embeddings/

Request Headers

  • Content-Type: application/json - This indicates that the request body is of type JSON.
  • Authorization: Bearer {EMBAAS_API_KEY} - This is required to authenticate your request with your API key.

Request Body

The request body must be of type application/json and you can include the following fields:

| Field | Type | Description | Default | Constraints |
| --- | --- | --- | --- | --- |
| texts (required) | array | A list of strings; each string is a sentence or piece of text for which an embedding should be generated. | - | 1-256 items. Texts longer than the maximum sequence length of the model are truncated. |
| model (required) | string | Specifies the model to be used for generating embeddings. Refer to our supported models. | - | Has to be a valid model name (see supported models). |
| instruction | string | Allows domain-specific embeddings without training. Only possible for the "Instructor" and "e5"-based models. | Default instruction of the model | - |
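Because a single request accepts at most 256 texts, larger collections need to be split across several requests. A minimal sketch of that split follows; the function name `batch_texts` is hypothetical.

```python
def batch_texts(texts, max_items=256):
    """Split a list of texts into consecutive batches of at most max_items."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return [texts[i:i + max_items] for i in range(0, len(texts), max_items)]


# 600 illustrative texts are split into batches of 256, 256, and 88.
texts = [f"sentence {n}" for n in range(600)]
batches = batch_texts(texts)
print([len(batch) for batch in batches])  # [256, 256, 88]
```

Each batch can then be sent as the texts field of its own request, keeping the rate limit in mind.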

Example request

The following Python example shows how to call this endpoint:

/v1/embeddings/
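from embaas import EmbaasClient

# Placeholder: replace {EMBAAS_API_KEY} with your actual API key.
EMBAAS_API_KEY = '{EMBAAS_API_KEY}'

client = EmbaasClient(api_key=EMBAAS_API_KEY)
client.get_embeddings(
    model='all-MiniLM-L6-v2',
    texts=['This is an example sentence.', 'Here is another sentence.'],
)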

Example response

The response from the Text Embeddings API endpoint will vary based on your request parameters. Below is an example of a typical response.

Response for model: 'all-MiniLM-L6-v2'
{
  "data": [
    {
      "embedding": [
        0.1,
        0.06069,
        ...
      ],
      "index": 0
    },
    {
      "embedding": [
        -0.1,
        0.3,
        ...
      ],
      "index": 1,
      "was_truncated": true
    }
  ],
  "model": "all-MiniLM-L6-v2",
  "usage": {
    "prompt_tokens": 420,
    "total_tokens": 420
  }
}
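A response like the one above can be unpacked into a list of vectors plus a list of inputs that were truncated. The sketch assumes that each index refers to the position of the corresponding input in texts, which the example suggests but does not state; the function name `ordered_embeddings` is hypothetical.

```python
def ordered_embeddings(response_body):
    """Return (vectors sorted by index, indexes of truncated inputs)."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    vectors = [item["embedding"] for item in items]
    # was_truncated may be absent; treat a missing flag as "not truncated".
    truncated = [item["index"] for item in items if item.get("was_truncated")]
    return vectors, truncated


# Illustrative body with shortened vectors, deliberately out of order.
sample_response = {
    "data": [
        {"embedding": [-0.1, 0.3], "index": 1, "was_truncated": True},
        {"embedding": [0.1, 0.06069], "index": 0},
    ],
    "model": "all-MiniLM-L6-v2",
}
vectors, truncated = ordered_embeddings(sample_response)
print(vectors)    # [[0.1, 0.06069], [-0.1, 0.3]]
print(truncated)  # [1]
```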

Reduce API

API Reference for the Reduce API.

Introduction

The Reduce API is designed to reduce the size of your existing embeddings. Embeddings are a great way to represent text in a vector space, but they can be quite large. The Reduce API allows you to reduce the size of your embeddings by up to 70% while maintaining similar accuracy. This is done by a supervised learning algorithm that learns to map your embeddings to a smaller space.

Please check our supported models for the Reduce API.

Reduce API

The Reduce API is designed to reduce the size of your existing embeddings.

Endpoint POST

https://api.embaas.io/v1/reduce/

Request Headers

  • Content-Type: application/json - This indicates that the request body is of type JSON.
  • Authorization: Bearer {EMBAAS_API_KEY} - This is required to authenticate your request with your API key.

Request Body

The request body must be of type application/json and you can include the following fields:

| Field | Type | Description | Default | Constraints |
| --- | --- | --- | --- | --- |
| embeddings (required) | array | A list of embeddings you want to reduce. | - | 1-10,000 items. Every embedding must have the same size, matching the dimensions of the selected model. |
| model (required) | string | Specifies the model which generated the embeddings. Refer to our supported models. | - | Has to be a valid model name (see supported models). |
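The constraints on the embeddings field can be checked locally before sending a request. In the sketch below, the function name `check_reduce_input` is hypothetical, and `expected_dim` must be supplied by the caller from the documentation of the model that produced the embeddings; the 3-dimensional vectors are toy values.

```python
def check_reduce_input(embeddings, expected_dim, max_items=10_000):
    """Raise ValueError if the list violates the documented constraints."""
    if not 1 <= len(embeddings) <= max_items:
        raise ValueError(
            f"expected 1 to {max_items} embeddings, got {len(embeddings)}"
        )
    for position, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {position} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )


# Toy 3-dimensional vectors purely for illustration; passes silently.
check_reduce_input([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dim=3)
```

Failing early on the client avoids spending a request on input the API would reject.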

Example request

The following Python example shows how to call this endpoint:

/v1/reduce/
import requests

data = {
    # "..." marks values omitted for brevity; send complete vectors in practice.
    "embeddings": [[0.1, 0.078, ...], [-0.342, 0.062, ...]],
    "model": "text-embedding-ada-002"
}
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer {EMBAAS_API_KEY}"  # replace {EMBAAS_API_KEY} with the actual API key
}
response = requests.post('https://api.embaas.io/v1/reduce/', json=data, headers=headers)
print(response.json())

Example response

The response from the Reduce API endpoint will vary based on your request parameters. Below is an example of a typical response.

Response for model: 'text-embedding-ada-002'
{
  "data": [
    {
      "embedding": [
        0.1,
        0.06069,
        ...
      ],
      "index": 0
    },
    {
      "embedding": [
        -0.1,
        0.3,
        ...
      ],
      "index": 1,
      "was_truncated": true
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "total_reduced_embeddings": 2
  }
}