Parser Service

Converts documents (PDF, DOCX, HTML, and others) to markdown using one of four parser backends. The service exposes a synchronous HTTP API and also runs an asynchronous RabbitMQ worker for queue-based processing.

Tech: Python, FastAPI, SQLAlchemy, RabbitMQ (aio-pika)
Port: 4004
Auth: None (internal service)
Database: Shared document database (parsing_jobs table, reads documents table)

Endpoints

Method	Path	Rate Limit	Description
POST	`/api/v1/parser/parse`	10/min	Synchronous parse -- upload file, get markdown back
POST	`/api/v1/parser/parse-async`	30/min	Async parse -- enqueue job, returns job_id
GET	`/api/v1/parser/parse-status/{job_id}`	--	Check async job status
GET	`/api/v1/parser/health`	--	Health of parser backends
GET	`/api/v1/parser/info`	--	Available parser information
GET	`/health/`	--	Basic health
GET	`/health/liveness`	--	Kubernetes liveness probe
GET	`/health/readiness`	--	Kubernetes readiness probe (checks memory, disk, parsers, RabbitMQ)
GET	`/health/detailed`	--	Detailed health with all indicators
GET	`/health/healthz`	--	Legacy alias for `/health/`
GET	`/health/readyz`	--	Legacy alias for `/health/readiness`
GET	`/metrics`	--	Prometheus metrics

POST /api/v1/parser/parse

Synchronous parsing. Send the file content as the raw request body with Content-Type application/octet-stream. The response is returned once parsing finishes.

Headers:

X-Original-Filename (required) -- Original file name with extension

Query Parameters:

Parameter	Type	Default	Description
parser_method	string	None (auto-detect)	Parser backend to use. If not specified, the service auto-selects based on file type.
extract_tables	boolean	false	Extract tables from document
extract_images	boolean	false	Extract images from document
ocr_enabled	boolean	false	Enable OCR
ocr_mode	string	--	OCR mode
caption_images	boolean	false	Generate image captions via Vision LLM
document_id	string	--	Link to document record
user_id	string	--	Requesting user

Response: Parsed markdown text.
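
As a rough illustration of the request shape described above, the following sketch builds the call with the Python standard library. The base URL, the helper names `build_parse_request` and `parse_document`, and the assumption that the response body can be read directly as UTF-8 text are illustrative and not taken from the service code.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the service is reachable locally on its documented port.
BASE_URL = "http://localhost:4004"


def build_parse_request(content: bytes, filename: str, **params) -> Request:
    """Build (but do not send) a POST request for /api/v1/parser/parse."""
    # Drop unset parameters and serialize booleans as lowercase strings.
    query = urlencode({
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in params.items()
        if value is not None
    })
    url = f"{BASE_URL}/api/v1/parser/parse"
    if query:
        url = f"{url}?{query}"
    return Request(
        url,
        data=content,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "X-Original-Filename": filename,  # required by the endpoint
        },
    )


def parse_document(path: str, **params) -> str:
    """Upload a file and return the response body decoded as text."""
    with open(path, "rb") as fh:
        request = build_parse_request(fh.read(), os.path.basename(path), **params)
    # Synchronous parsing can be slow, so allow a generous timeout.
    with urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")
```

A call such as `parse_document("report.pdf", extract_tables=True, parser_method="pdf_pymupdf")` would send the file with the two query parameters set and leave the rest at their defaults.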

POST /api/v1/parser/parse-async

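Accepts the same parameters as /api/v1/parser/parse. Instead of the parsed markdown, it returns a job_id that can be polled via /api/v1/parser/parse-status/{job_id}.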

Supports an Idempotency-Key header to prevent duplicate jobs when the same request is retried.
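
One way a client might use the async flow is sketched below. Deriving the Idempotency-Key from a hash of the upload is a client-side choice, not something the service prescribes. The status payload shape (a `status` field with `completed` and `failed` as terminal values) is an assumption, and `fetch_status` is a hypothetical callable the caller supplies, for example one that issues a GET to the parse-status endpoint.

```python
import hashlib
import time
from typing import Callable

# Assumption: these status values are illustrative; check the real payload.
TERMINAL_STATES = {"completed", "failed"}


def idempotency_key(content: bytes, filename: str) -> str:
    """Derive a stable key so retries of the same upload map to one job."""
    digest = hashlib.sha256()
    digest.update(filename.encode("utf-8"))
    digest.update(b"\0")  # separator so name and content cannot blur together
    digest.update(content)
    return digest.hexdigest()


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 30,
) -> dict:
    """Poll the job status until it reaches a terminal state or polls run out."""
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status.get("status") in TERMINAL_STATES:
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

The key would be sent as the `Idempotency-Key` header on the parse-async request, so that a retry after a network error does not enqueue a second job.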

Parser Backends

Backend	Key	Supported Files	Notes
Azure Document Intelligence	`document_intelligence`	PDF, DOCX, DOC, TXT, HTML, MD, RTF, XLSX, XLS	Default. Uses Azure SDK.
PyMuPDF	`pdf_pymupdf`	PDF	Fallback. HTTP microservice at port 8001.
MinerU	`mineru_parser`	PDF	Optional. HTTP microservice at port 8002.
Marker	`marker_parser`	PDF	Optional. HTTP microservice at port 8003.

The parser chain tries the selected backend first and falls back to alternatives on failure.
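
The fallback behaviour can be pictured with a minimal sketch like the one below. It is not the service's implementation: the fallback order, the `parse_with_fallback` name, and the idea of passing backends as plain callables are all assumptions made for illustration. Only the backend keys come from the table above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Parser = Callable[[bytes], str]

# Assumption: the real fallback order is not documented; this one is illustrative.
FALLBACK_ORDER: List[str] = [
    "document_intelligence",
    "pdf_pymupdf",
    "mineru_parser",
    "marker_parser",
]


def parse_with_fallback(
    content: bytes,
    parsers: Dict[str, Parser],
    selected: Optional[str] = None,
) -> Tuple[str, str]:
    """Try the selected backend first, then the rest; return (backend_key, markdown)."""
    order = ([selected] if selected else []) + [
        key for key in FALLBACK_ORDER if key != selected
    ]
    errors: List[str] = []
    for key in order:
        parser = parsers.get(key)
        if parser is None:
            continue  # backend not configured in this deployment
        try:
            return key, parser(content)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{key}: {exc}")
    raise RuntimeError("all parser backends failed: " + "; ".join(errors))
```

Returning the key of the backend that actually produced the output makes it possible to record which parser handled a given document.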

Worker

Separate process (worker.py) that:

Consumes from parsing_jobs RabbitMQ queue
Parses the document using the configured backend
Publishes results to parsing_results queue
Publishes progress to parsing_progress queue

Messages are formatted for NestJS microservice compatibility: each message includes a pattern field that the consumer uses to route it to a handler.
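
A sketch of such a message envelope follows. NestJS microservice transports conventionally wrap payloads as an object with `pattern` and `data` keys. The helper name `build_nest_message`, the use of the queue name as the pattern value, and the payload fields in the usage note are assumptions, not the worker's actual schema.

```python
import json
from typing import Any, Dict


def build_nest_message(pattern: str, payload: Dict[str, Any]) -> bytes:
    """Serialize a payload in the {pattern, data} envelope NestJS handlers read."""
    envelope = {
        "pattern": pattern,  # matched against the handler's pattern on the NestJS side
        "data": payload,     # hypothetical payload; the real fields may differ
    }
    return json.dumps(envelope).encode("utf-8")
```

For example, `build_nest_message("parsing_progress", {"job_id": "j1", "progress": 50})` yields a JSON body that could be wrapped in an `aio_pika.Message` and published with the queue name as the routing key.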

Image Captioning

When caption_images=true, extracted images are sent to the completion service's Vision LLM for caption generation. The captions are embedded in the markdown output.

Inter-Service Communication

Target	Protocol	Purpose
document-service	RabbitMQ	Publish `parsing_results` and `parsing_progress`
completion-service	HTTP	Image captioning via Vision LLM
PyMuPDF microservice	HTTP (localhost:8001)	PDF parsing
MinerU microservice	HTTP (localhost:8002)	PDF parsing
Marker microservice	HTTP (localhost:8003)	PDF parsing
Azure Document Intelligence	Azure SDK	Document parsing
