Tika Parser

Apache Tika is a content detection and analysis toolkit that extracts text and metadata from a wide range of document formats and detects their language.
It can parse PDF, Word, Excel, PowerPoint, and HTML documents, as well as various image and multimedia formats.

Endpoint

POST /api/v1/services/tika-extract?chunk_size=500&overlap=15

The chunk_size and overlap query parameters control how the extracted text is divided into chunks.

Request Format

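The file is sent in the body of the POST request. The cURL examples below upload it as raw binary (--data-binary) with a Content-Type header that matches the file type.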

Headers

X-Original-Filename: Original file name. Non-ASCII names (for example, Hebrew) are percent-encoded as UTF-8, as in the Hebrew example below.
Content-Type: Must match the MIME type of the uploaded file (e.g. application/pdf)
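
As a minimal sketch of how a client might prepare these two headers, the Python snippet below percent-encodes the file name as UTF-8 (the form shown in the Hebrew cURL example) and guesses the content type from the file extension. The helper name build_headers, the choice to drop the extension from the name (mirroring the cURL examples), and the fallback to application/octet-stream are assumptions for illustration, not part of the documented API.

```python
import mimetypes
from pathlib import Path
from urllib.parse import quote


def build_headers(file_path: str) -> dict[str, str]:
    """Build the two request headers for a file (hypothetical helper)."""
    path = Path(file_path)
    # Percent-encode the name without its extension, as in the cURL examples,
    # so that non-ASCII characters are safe to place in an HTTP header.
    encoded_name = quote(path.stem, safe="")
    # Guess the MIME type from the extension; the fallback is an assumption.
    content_type, _ = mimetypes.guess_type(path.name)
    return {
        "X-Original-Filename": encoded_name,
        "Content-Type": content_type or "application/octet-stream",
    }
```

Calling build_headers("/path/to/קובץ.pdf") yields the same X-Original-Filename value as the Hebrew cURL example, together with application/pdf as the content type.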

Response Format

The response contains the text extracted from the file, divided into chunks according to the chunk_size and overlap query parameters.
Each chunk holds a segment of the document's content, ready for further processing or embedding.
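
The service's exact chunking rules are not documented here. As a rough illustration of what chunk_size and overlap usually mean, the Python sketch below splits a string into fixed-size windows in which each chunk repeats the tail of the previous one. It assumes both values are counted in characters; the actual endpoint may count tokens or words, or interpret overlap differently.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 15) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

Under these assumptions, a 1000-character input with the default values produces three chunks, and the last 15 characters of each chunk reappear at the start of the next.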

Example cURL Usage

English filename:

curl --location 'https://your-domain.com/api/v1/services/tika-extract' \
--header 'X-Original-Filename: document' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/document.pdf'

Hebrew filename:

curl --location 'https://your-domain.com/api/v1/services/tika-extract' \
--header 'X-Original-Filename: %D7%A7%D7%95%D7%91%D7%A5' \
--header 'Content-Type: application/pdf' \
--data-binary '@/path/to/קובץ.pdf'
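
For clients that prefer code over cURL, the following is a minimal Python sketch of the same request using only the standard library. It mirrors the cURL examples by sending the file as the raw request body, and it adds the chunk_size and overlap query parameters from the Endpoint section. The host your-domain.com is the same placeholder used above, the function name build_tika_request is hypothetical, and the content type defaults to application/pdf for brevity.

```python
from pathlib import Path
from urllib.parse import quote, urlencode
from urllib.request import Request

# Placeholder host, as in the cURL examples.
BASE_URL = "https://your-domain.com/api/v1/services/tika-extract"


def build_tika_request(file_path: str, chunk_size: int = 500, overlap: int = 15,
                       content_type: str = "application/pdf") -> Request:
    """Prepare (but do not send) a POST request for one file."""
    path = Path(file_path)
    query = urlencode({"chunk_size": chunk_size, "overlap": overlap})
    headers = {
        # Percent-encoded so Hebrew and other non-ASCII names are header-safe.
        "X-Original-Filename": quote(path.stem, safe=""),
        "Content-Type": content_type,
    }
    # The file bytes become the raw request body, like curl --data-binary.
    return Request(f"{BASE_URL}?{query}", data=path.read_bytes(),
                   headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen performs the upload. Building the request separately from sending it keeps the header and URL logic easy to inspect without a live server.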

Supported File Types

.pdf
.doc, .docx
.xls, .xlsx
.ppt, .pptx
.html, .xml, .txt
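
A client can avoid a wasted upload by checking the extension before sending. The Python sketch below hard-codes the list above; the names SUPPORTED_EXTENSIONS and is_supported are hypothetical, and the service itself may accept or reject files by other criteria, such as detected content rather than extension.

```python
from pathlib import Path

# Extensions copied from the supported file types list.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".xls", ".xlsx",
    ".ppt", ".pptx", ".html", ".xml", ".txt",
}


def is_supported(file_path: str) -> bool:
    """Return True if the file extension appears in the supported list."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so report.PDF is treated the same as report.pdf.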

Notes

Tika focuses on accurate text and metadata extraction but does not preserve structure such as tables or layout formatting.
It is best suited to use cases that need raw text, such as document indexing.
