Parser Overview

Before a document can be embedded or chunked, it must first go through a parsing stage. In our system, parsers read the uploaded document (PDF or Word) and extract its textual and visual content into a structured format.

What is a Parser?

A parser is the first step in our platform's document processing pipeline. It takes raw documents and converts them into a clean, machine-readable format that downstream operations such as embedding, chunking, translation, and Q&A generation can work with.

Supported Formats

  • PDF (.pdf)
  • Word Documents (.doc, .docx)
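As a rough illustration, a client could check a file's extension before uploading. This is only a sketch: the extension set comes from the list above, while the `is_supported` helper and its name are assumptions, not part of the platform.

```python
from pathlib import Path

# Extensions taken from the supported-formats list; the helper is hypothetical.
SUPPORTED_EXTENSIONS = {".pdf", ".doc", ".docx"}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is a supported format."""
    # Compare case-insensitively so "REPORT.PDF" is accepted as well.
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported("report.PDF")` returns `True`, while `is_supported("image.png")` returns `False`.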

How It Works

  1. Upload your document – drag & drop or select from your file system.
  2. Choose a parser – select one of the available parsing strategies depending on your document type or goal.
  3. The parser extracts the content – the file is converted into structured text (and optionally metadata, tables, images, etc.).
  4. Post-processing options – after parsing, you can:
    • Generate embeddings
    • Split content into chunks
    • Translate to English
    • Generate questions and answers
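The flow above can be pictured as a parse step followed by optional post-processing steps. The sketch below is not the platform's API: `ParsedDocument`, `parse`, `split_into_chunks`, and `run_pipeline` are hypothetical names, and the parser and chunker are trivial stand-ins that only show how the stages connect.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ParsedDocument:
    """Hypothetical container for a parser's structured output."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def parse(raw: bytes, parser: str) -> ParsedDocument:
    # Stand-in for a real parser: decode the bytes and record which parser ran.
    text = raw.decode("utf-8", errors="replace")
    return ParsedDocument(text=text, metadata={"parser": parser})


def split_into_chunks(doc: ParsedDocument, size: int = 40) -> List[str]:
    # Naive fixed-width chunking, for illustration only.
    return [doc.text[i:i + size] for i in range(0, len(doc.text), size)]


def run_pipeline(
    raw: bytes,
    parser: str,
    post_steps: Dict[str, Callable[[ParsedDocument], object]],
) -> Dict[str, object]:
    """Parse once, then apply each selected post-processing step to the result."""
    doc = parse(raw, parser)
    return {name: step(doc) for name, step in post_steps.items()}
```

Calling `run_pipeline(data, "Text", {"chunks": split_into_chunks})` parses once and returns a dictionary keyed by step name, which reflects the idea that post-processing operates on the parser's output rather than on the raw file.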

Available Parsers

Parser         Description
Text           Basic raw text extraction from documents. Fast, minimal formatting.
Text & Image   Extracts text and image references for mixed content files.
Marker         Advanced parser that supports tables, forms, images, equations, and chunking-ready output.
Tika           Apache Tika-based parser for extracting text and metadata from a wide variety of file formats.
Semantic       Parser that focuses on semantic structure and logical separation of content.
Flex           Smart parser that auto-selects strategy based on file type and content complexity.
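To make the trade-offs concrete, the sketch below lists the parser names as an enum and applies a simple rule of thumb for picking one. The names come from the table above, but the `suggest_parser` function and its rules are assumptions made for illustration. They do not describe how Flex or any other parser actually selects a strategy.

```python
from enum import Enum


class Parser(Enum):
    """Parser names as they appear in the table above."""
    TEXT = "Text"
    TEXT_AND_IMAGE = "Text & Image"
    MARKER = "Marker"
    TIKA = "Tika"
    SEMANTIC = "Semantic"
    FLEX = "Flex"


def suggest_parser(has_images: bool, has_tables: bool) -> Parser:
    """Hypothetical rule of thumb derived from the parser descriptions."""
    if has_tables:
        # Marker is the only parser described as supporting tables.
        return Parser.MARKER
    if has_images:
        # Text & Image is described as handling mixed content files.
        return Parser.TEXT_AND_IMAGE
    # Plain documents only need fast, basic text extraction.
    return Parser.TEXT
```

When the document's characteristics are not known in advance, choosing Flex and letting it decide is likely the simpler option.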