Workers

Two services, the parser service and the embedding service, run background workers that consume jobs from RabbitMQ queues. The workers run as separate processes from the HTTP API servers.

Parser Service Worker

  • Location: parser-service/worker.py
  • Runtime: Python (uvloop event loop)
  • Queue: Consumes from parsing_jobs
  • Publishes to: parsing_results, parsing_progress

What it does

  1. Picks up a parsing job from the queue
  2. Validates the document metadata
  3. Selects the appropriate parser (Azure Document Intelligence, PyMuPDF, MinerU, or Marker)
  4. Parses the document content to markdown
  5. Optionally extracts images and generates captions via the completion service
  6. Publishes the parsed result back to parsing_results
  7. Sends progress updates to parsing_progress during processing
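
The steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of worker.py: the metadata fields (document_id, parser), the stub parsers, and the publish callback are all assumptions made for the example, and image captioning (step 5) is left out.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the real parsers; each takes raw bytes
# and returns markdown.
def _stub_parser(name: str) -> Callable[[bytes], str]:
    def parse(content: bytes) -> str:
        text = content.decode("utf-8", errors="replace")
        return f"<!-- parsed by {name} -->\n{text}"
    return parse

PARSERS: Dict[str, Callable[[bytes], str]] = {
    name: _stub_parser(name)
    for name in ("azure_document_intelligence", "pymupdf", "mineru", "marker")
}

# Assumed metadata fields; the real job schema may differ.
REQUIRED_FIELDS = ("document_id", "parser")

def validate_metadata(job: dict) -> None:
    """Step 2: reject jobs with missing fields or an unknown parser."""
    missing = [f for f in REQUIRED_FIELDS if f not in job]
    if missing:
        raise ValueError(f"missing metadata fields: {missing}")
    if job["parser"] not in PARSERS:
        raise ValueError(f"unknown parser: {job['parser']}")

def handle_parsing_job(
    job: dict,
    content: bytes,
    publish: Callable[[str, dict], None],
) -> None:
    """Steps 2-4, 6 and 7 for one job; `publish(queue, message)` is assumed."""
    validate_metadata(job)
    doc_id = job["document_id"]
    publish("parsing_progress", {"document_id": doc_id, "stage": "started"})
    markdown = PARSERS[job["parser"]](content)  # steps 3 and 4
    publish("parsing_progress", {"document_id": doc_id, "stage": "parsed"})
    publish("parsing_results", {"document_id": doc_id, "markdown": markdown})
```

Passing publish in as a callback keeps the sketch independent of any particular RabbitMQ client; in a test it can simply append to a list.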

Scaling

  • Multiple worker instances can run in parallel
  • Each worker handles one job at a time
  • Horizontal scaling: add more worker containers
  • Idempotency keys prevent duplicate processing
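
One way idempotency keys can prevent duplicate processing is sketched below. The idempotency_key field name and the in-memory set are assumptions for illustration only; with several worker containers the set of seen keys would have to live in a store shared by all of them.

```python
from typing import Callable, Set

class IdempotencyGuard:
    """Tracks which idempotency keys have already been claimed.

    In-memory for illustration only; it is not shared across processes.
    """

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def claim(self, key: str) -> bool:
        """Return True the first time a key is seen, False afterwards."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

def process_once(
    guard: IdempotencyGuard,
    job: dict,
    handler: Callable[[dict], None],
) -> bool:
    """Run `handler` only if this job's key has not been claimed before."""
    if not guard.claim(job["idempotency_key"]):  # assumed field name
        return False  # duplicate delivery: skip the work
    handler(job)
    return True
```

If the same message is delivered twice, the second call returns False without invoking the handler.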

Embedding Service Worker

  • Location: embedding-service/worker.py
  • Runtime: Python (asyncio)
  • Queue: Consumes from embedding_jobs
  • Publishes to: embedding_results, embedding_progress

What it does

  1. Picks up an embedding job from the queue
  2. Chunks the parsed text using the configured method (recursive, semantic, or fixed_size)
  3. Optionally translates non-English chunks to English
  4. Generates vector embeddings via the OpenAI or Azure OpenAI embeddings API
  5. Stores chunks and embeddings in PostgreSQL (pgvector)
  6. Publishes the result back to embedding_results
  7. Sends progress updates to embedding_progress during processing
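
As an illustration of step 2, here is a minimal sketch of what a fixed_size chunker could look like. The chunk_size and overlap parameters, their defaults, and the choice to count characters rather than tokens are assumptions; the service's actual chunking options are not described here.

```python
from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split `text` into windows of `chunk_size` characters.

    Consecutive windows share `overlap` characters. Both parameters are
    hypothetical names chosen for this example.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so the last chunk is never
        # a fragment already fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

For example, fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1) returns ["abcd", "defg", "ghij"].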

Scaling

  • Multiple worker instances can run in parallel
  • Failed messages are routed to a dead-letter queue (DLQ)
  • Retry with exponential backoff (configurable attempts and delays)
  • Each worker prefetches a configurable number of messages
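
The retry and DLQ behaviour can be sketched as below. The parameter names, the defaults, and the assumption that a message is dead-lettered only after its final attempt fails are illustrative; the worker's real configuration keys are not listed here.

```python
import time
from typing import Callable, List, Optional

def backoff_delays(
    max_attempts: int,
    base_delay: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
) -> List[float]:
    """Delay for attempt n is base_delay * factor**n, capped at max_delay."""
    return [min(base_delay * factor ** n, max_delay) for n in range(max_attempts)]

def run_with_retry(
    handler: Callable[[dict], None],
    message: dict,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    factor: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    dead_letter: Optional[Callable[[dict], None]] = None,
) -> bool:
    """Call `handler` up to `max_attempts` times.

    Returns True on success. After the final failure the message is handed
    to `dead_letter` (a stand-in for publishing to the DLQ) and False is
    returned.
    """
    delays = backoff_delays(max_attempts, base_delay, factor, max_delay)
    for attempt in range(max_attempts):
        try:
            handler(message)
            return True
        except Exception:
            if attempt == max_attempts - 1:
                break  # no attempts left: fall through to the DLQ
            sleep(delays[attempt])  # wait before the next attempt
    if dead_letter is not None:
        dead_letter(message)
    return False
```

With the defaults, a handler that always fails is tried three times with waits of 1 and 2 seconds in between, and the message is then passed to dead_letter. Injecting sleep makes the loop testable without real delays.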

Deployment

Each worker is deployed in its own container, separate from its service's HTTP API. Within a service, the worker and the API share the same codebase but run different entry points:

  • Parser: python worker.py (worker) vs. uvicorn app.main:app (API)
  • Embedding: python worker.py (worker) vs. uvicorn app.main:app (API)