
Parser Service

The Parser Service converts documents (PDF, DOCX, HTML, and others) to markdown using one of four parser backends. It exposes a synchronous HTTP API and runs an asynchronous RabbitMQ worker for queue-based processing.

  • Tech: Python, FastAPI, SQLAlchemy, RabbitMQ (aio-pika)
  • Port: 4004
  • Auth: None (internal service)
  • Database: Shared document database (parsing_jobs table, reads documents table)

Endpoints

| Method | Path | Rate Limit | Description |
|--------|------|------------|-------------|
| POST | /api/v1/parser/parse | 10/min | Synchronous parse -- upload file, get markdown back |
| POST | /api/v1/parser/parse-async | 30/min | Async parse -- enqueue job, returns job_id |
| GET | /api/v1/parser/parse-status/{job_id} | -- | Check async job status |
| GET | /api/v1/parser/health | -- | Health of parser backends |
| GET | /api/v1/parser/info | -- | Available parser information |
| GET | /health/ | -- | Basic health |
| GET | /health/liveness | -- | Kubernetes liveness probe |
| GET | /health/readiness | -- | Kubernetes readiness probe (checks memory, disk, parsers, RabbitMQ) |
| GET | /health/detailed | -- | Detailed health with all indicators |
| GET | /health/healthz | -- | Legacy alias for /health |
| GET | /health/readyz | -- | Legacy alias for /readiness |
| GET | /metrics | -- | Prometheus metrics |

POST /api/v1/parser/parse

Synchronous parsing. Send the file content as the raw request body with Content-Type application/octet-stream.

Headers:

  • X-Original-Filename (required) -- Original file name with extension

Query Parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| parser_method | string | None (auto-detect) | Parser backend to use. If not specified, the service auto-selects based on file type. |
| extract_tables | boolean | false | Extract tables from document |
| extract_images | boolean | false | Extract images from document |
| ocr_enabled | boolean | false | Enable OCR |
| ocr_mode | string | -- | OCR mode |
| caption_images | boolean | false | Generate image captions via Vision LLM |
| document_id | string | -- | Link to document record |
| user_id | string | -- | Requesting user |

Response: Parsed markdown text.
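As a rough illustration of the request shape described above, the following sketch builds (but does not send) a synchronous parse request using only the Python standard library. The host in `BASE_URL` and the helper name `build_parse_request` are assumptions made for this example; only the port, path, header name, and query parameter names come from this page.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the service is reachable on localhost. Only the port (4004)
# is documented on this page.
BASE_URL = "http://localhost:4004"


def build_parse_request(filename: str, content: bytes, **params) -> Request:
    """Build a POST request for /api/v1/parser/parse without sending it.

    `params` are the documented query parameters (parser_method,
    extract_tables, ocr_enabled, ...). This helper is hypothetical and
    not part of the service.
    """
    # Serialize booleans as lowercase "true"/"false" and drop unset values.
    query = urlencode(
        {
            key: str(value).lower() if isinstance(value, bool) else value
            for key, value in params.items()
            if value is not None
        }
    )
    url = f"{BASE_URL}/api/v1/parser/parse"
    if query:
        url += f"?{query}"
    return Request(
        url,
        data=content,  # raw file bytes as the request body
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "X-Original-Filename": filename,  # required header
        },
    )
```

To actually send the request, pass the result to `urllib.request.urlopen` and decode the response body, which should contain the parsed markdown.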

POST /api/v1/parser/parse-async

Accepts the same parameters as /parse. Returns a job_id that can be polled via GET /api/v1/parser/parse-status/{job_id}.

Supports an Idempotency-Key header to prevent duplicate jobs.
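A client-side sketch of the async flow might look like the following. It derives a stable Idempotency-Key from the file so that retries reuse the same key, then polls a status callback until the job finishes. This page does not document the status response schema, so the `status` field name and the terminal values `completed` and `failed` are assumptions, as are the helper names `idempotency_key` and `wait_for_job`.

```python
import hashlib
import time
from typing import Callable, Dict, Sequence


def idempotency_key(filename: str, content: bytes) -> str:
    """Derive a deterministic key so retried uploads reuse the same value.

    Hashing the file is one possible client-side choice, not something
    the service requires.
    """
    digest = hashlib.sha256()
    digest.update(filename.encode("utf-8"))
    digest.update(b"\0")  # separator between name and content
    digest.update(content)
    return digest.hexdigest()


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], Dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    # Assumption: terminal status names are not documented on this page.
    terminal: Sequence[str] = ("completed", "failed"),
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll `fetch_status(job_id)` until a terminal status or timeout.

    `fetch_status` is expected to call
    GET /api/v1/parser/parse-status/{job_id} and return the decoded body.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status.get("status") in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)
```

Injecting `fetch_status` and `sleep` keeps the polling loop independent of any particular HTTP client and makes it easy to exercise without a running service.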

Parser Backends

| Backend | Key | Supported Files | Notes |
|---------|-----|-----------------|-------|
| Azure Document Intelligence | document_intelligence | PDF, DOCX, DOC, TXT, HTML, MD, RTF, XLSX, XLS | Default. Uses Azure SDK. |
| PyMuPDF | pdf_pymupdf | PDF | Fallback. HTTP microservice at port 8001. |
| MinerU | mineru_parser | PDF | Optional. HTTP microservice at port 8002. |
| Marker | marker_parser | PDF | Optional. HTTP microservice at port 8003. |

The parser chain tries the selected backend first and falls back to alternatives on failure.
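The fallback behavior can be pictured with a minimal sketch like the one below. It is not the service's actual implementation: the function name `parse_with_fallback`, the callable-based backend registry, and the order in which alternatives are tried (registry order) are all assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# A backend is modeled here as a callable from file bytes to markdown text.
Parser = Callable[[bytes], str]


def parse_with_fallback(
    content: bytes,
    parsers: Dict[str, Parser],
    selected: Optional[str] = None,
) -> Tuple[str, str]:
    """Try the selected backend first, then the remaining ones in order.

    Returns (backend_key, markdown) from the first backend that succeeds.
    """
    order = list(parsers)
    if selected is not None:
        if selected not in parsers:
            raise KeyError(f"unknown parser backend: {selected}")
        # Move the selected backend to the front, keep the rest as fallbacks.
        order = [selected] + [key for key in order if key != selected]

    errors = {}
    for key in order:
        try:
            return key, parsers[key](content)
        except Exception as exc:  # any backend failure triggers fallback
            errors[key] = exc
    raise RuntimeError(f"all parser backends failed: {errors}")
```

Returning the backend key alongside the markdown lets a caller record which backend actually produced the output when a fallback occurred.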

Worker

The worker runs as a separate process (worker.py) that:

  1. Consumes from parsing_jobs RabbitMQ queue
  2. Parses the document using the configured backend
  3. Publishes results to parsing_results queue
  4. Publishes progress to parsing_progress queue

Messages are formatted for compatibility with NestJS microservices (each message includes a pattern field).
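For illustration, a NestJS-style event envelope can be sketched as a JSON object with `pattern` and `data` keys, which is the shape NestJS microservice transports conventionally expect for events. The helper name `nest_event`, the pattern string, and the payload fields shown here are hypothetical; this page only confirms that a pattern field is included.

```python
import json
from typing import Any, Dict


def nest_event(pattern: str, data: Dict[str, Any]) -> bytes:
    """Serialize a payload into a NestJS-style {pattern, data} envelope."""
    envelope = {"pattern": pattern, "data": data}
    return json.dumps(envelope).encode("utf-8")


# Hypothetical example: the pattern name and payload fields are assumptions.
body = nest_event(
    "parsing_results",
    {"job_id": "example-job", "status": "completed", "markdown": "# Title"},
)
```

With aio-pika, bytes like these would typically be wrapped in an `aio_pika.Message(body=...)` and published with the queue name as the routing key.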

Image Captioning

When caption_images=true, extracted images are sent to the completion service's Vision LLM for caption generation. The captions are embedded in the markdown output.

Inter-Service Communication

| Target | Protocol | Purpose |
|--------|----------|---------|
| document-service | RabbitMQ | Publish parsing_results and parsing_progress |
| completion-service | HTTP | Image captioning via Vision LLM |
| PyMuPDF microservice | HTTP (localhost:8001) | PDF parsing |
| MinerU microservice | HTTP (localhost:8002) | PDF parsing |
| Marker microservice | HTTP (localhost:8003) | PDF parsing |
| Azure Document Intelligence | Azure SDK | Document parsing |