Document Processing Flow

End-to-end flow from file upload to searchable, embedded document.

Services Involved

document-service, parser-service (worker), embedding-service (worker), completion-service, Azure Blob / S3

Steps

1. Initialize Document

Client calls POST /api/v1/documents/init on document-service.

A document record is created in PostgreSQL with status PENDING_UPLOAD. The response returns the document ID.

2. Upload File

Client calls POST /api/v1/documents/upload on document-service.

The file is streamed to Azure Blob Storage (or S3). The document record is updated with file metadata (name, size, content type, storage path). Status changes to UPLOADED.
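The record changes in steps 1 and 2 can be sketched as a small data model. This is a minimal illustration, not the service's actual schema: the class name, field names, and the `mark_uploaded` helper are hypothetical, and only the status values and the four metadata fields come from the text above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DocumentRecord:
    """Hypothetical in-memory stand-in for the PostgreSQL document row."""
    id: str
    status: str = "PENDING_UPLOAD"  # initial status set by the init step
    name: Optional[str] = None
    size: Optional[int] = None
    content_type: Optional[str] = None
    storage_path: Optional[str] = None


def mark_uploaded(
    doc: DocumentRecord,
    name: str,
    size: int,
    content_type: str,
    storage_path: str,
) -> DocumentRecord:
    """Return a copy of the record with file metadata set and status UPLOADED."""
    if doc.status != "PENDING_UPLOAD":
        # Assumption: an upload is only accepted for a freshly initialized document.
        raise ValueError(f"cannot upload in status {doc.status}")
    return replace(
        doc,
        status="UPLOADED",
        name=name,
        size=size,
        content_type=content_type,
        storage_path=storage_path,
    )
```

The record is immutable here so that each step produces a new value. A real service would issue an UPDATE against the row instead.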

3. Trigger Processing

Client calls POST /api/v1/documents/process on document-service.

Document status changes to PROCESSING. The service publishes a message to the parsing_jobs RabbitMQ queue.

document-service --[parsing_jobs queue]--> parser-service worker

If RabbitMQ is unavailable, document-service falls back to calling POST /api/v1/parser/parse on parser-service over HTTP.
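The queue-first, HTTP-fallback dispatch could look roughly like the sketch below. The `publish` and `http_post` callables are hypothetical stand-ins for the real RabbitMQ and HTTP clients, and treating broker unavailability as a `ConnectionError` is an assumption.

```python
from typing import Callable, Dict

Job = Dict[str, str]


def dispatch_parse_job(
    job: Job,
    publish: Callable[[str, Job], None],
    http_post: Callable[[str, Job], None],
) -> str:
    """Try the parsing_jobs queue first; fall back to HTTP if publishing fails.

    Returns "queue" or "http" to indicate which path was taken.
    """
    try:
        publish("parsing_jobs", job)
        return "queue"
    except ConnectionError:
        # Assumption: an unavailable broker surfaces as ConnectionError.
        http_post("/api/v1/parser/parse", job)
        return "http"
```

Injecting the two transports as callables keeps the fallback decision easy to test without a running broker.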

4. Parse Document

Parser-service worker picks up the job from parsing_jobs.

  1. Selects the parser backend (Azure Document Intelligence, PyMuPDF, MinerU, or Marker)
  2. Downloads the file content
  3. Converts the document to markdown
  4. Optionally extracts images and generates captions via completion-service
  5. Publishes progress updates to parsing_progress queue
  6. Publishes the result to parsing_results queue

parser-service worker --[parsing_results]--> document-service
parser-service worker --[parsing_progress]--> document-service
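The worker steps above can be sketched as a single job handler. This is an illustration under stated assumptions: the job fields (`document_id`, `storage_path`, `backend`), the progress payload shape, and the backend registry are all hypothetical, and the optional image-captioning step is left out.

```python
from typing import Callable, Dict

# Hypothetical registry type: backend name -> function turning file bytes into markdown.
Backend = Callable[[bytes], str]


def handle_parse_job(
    job: dict,
    backends: Dict[str, Backend],
    download: Callable[[str], bytes],
    publish: Callable[[str, dict], None],
) -> None:
    """Process one parsing job, publishing progress and the final result."""
    doc_id = job["document_id"]
    # 1. Select the parser backend (defaulting to "pymupdf" is an assumption).
    backend = backends[job.get("backend", "pymupdf")]
    # 2. Download the file content.
    publish("parsing_progress", {"document_id": doc_id, "stage": "downloading"})
    content = download(job["storage_path"])
    # 3. Convert the document to markdown.
    publish("parsing_progress", {"document_id": doc_id, "stage": "converting"})
    markdown = backend(content)
    # 4. Image extraction and captioning via completion-service is omitted here.
    # 5-6. Progress went out above; publish the result last.
    publish("parsing_results", {"document_id": doc_id, "markdown": markdown})
```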

5. Receive Parsing Results

Document-service consumes from parsing_results.

  1. Stores the parsed chunks in the database
  2. Triggers auto-summarization via completion-service (POST /api/v1/completions with Map-Reduce pattern)
  3. Publishes an embedding job to embedding_jobs queue

document-service --[embedding_jobs queue]--> embedding-service worker
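The Map-Reduce summarization mentioned in step 2 of this list can be illustrated as follows. This is a generic sketch of the pattern, not the service's code: `summarize` stands in for a call to completion-service, and the `batch_size` parameter is an assumption.

```python
from typing import Callable, List


def map_reduce_summary(
    chunks: List[str],
    summarize: Callable[[str], str],
    batch_size: int = 4,
) -> str:
    """Summarize each chunk (map), then merge summaries in batches (reduce)."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so the reduce step shrinks")
    if not chunks:
        return ""
    # Map: one summary per chunk.
    summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: repeatedly summarize groups of summaries until one remains.
    while len(summaries) > 1:
        summaries = [
            summarize("\n".join(summaries[i:i + batch_size]))
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]
```

Reducing in batches keeps each completion request bounded in size, which is the usual reason for choosing this pattern on long documents.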

6. Generate Embeddings

Embedding-service worker picks up the job from embedding_jobs.

  1. Chunks the parsed text using the configured method (recursive, semantic, fixed_size)
  2. Optionally translates non-English chunks to English
  3. Generates vector embeddings via OpenAI/Azure OpenAI embedding API
  4. Stores chunks in the chunks table
  5. Stores embeddings in the embeddings table (pgvector, 1024-dim vectors with HNSW index)
  6. Publishes progress to embedding_progress queue
  7. Publishes result to embedding_results queue

embedding-service worker --[embedding_results]--> document-service
embedding-service worker --[embedding_progress]--> document-service
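Of the chunking methods listed in step 1, fixed_size is the simplest to illustrate. The sketch below measures size in characters and adds an overlap between neighbouring chunks. Both choices are assumptions, since the source does not say how the service measures chunk size or whether it overlaps chunks.

```python
from typing import List


def fixed_size_chunks(text: str, size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears intact in one of the two chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

For example, `fixed_size_chunks("abcdefghij", size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`.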

7. Mark Complete

Document-service consumes from embedding_results. Document status changes to PROCESSED.

8. Real-Time Updates

Throughout steps 4-7, the client can subscribe to GET /api/v1/documents/stream?documentIds= (SSE) on document-service. Progress updates from both parser and embedding workers are relayed to the client in real time.
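On the client side, the SSE stream can be consumed by reading lines and collecting `data:` fields until a blank line ends the event. The sketch below assumes each event carries a JSON payload. The source does not document the payload format, so the field names in the usage example are hypothetical.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[dict]:
    """Yield JSON-decoded `data:` payloads from a stream of SSE lines."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield json.loads("\n".join(buffer))
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            if value.startswith(" "):
                value = value[1:]
            buffer.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
```

Usage, with a hypothetical payload shape:

```python
stream = ['data: {"documentId": "d1", "progress": 50}', "", ": ping", ""]
for update in iter_sse_data(stream):
    print(update["documentId"], update["progress"])
```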

Status Transitions

PENDING_UPLOAD --> UPLOADED --> PROCESSING --> PROCESSED
                                    |
                                    v
                                  FAILED
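The transitions can be enforced with a small lookup table. This sketch assumes FAILED is reachable only from PROCESSING and that PROCESSED and FAILED are terminal. The source does not state either point explicitly, and the `transition` helper is hypothetical.

```python
from enum import Enum
from typing import Dict, Set


class DocumentStatus(str, Enum):
    PENDING_UPLOAD = "PENDING_UPLOAD"
    UPLOADED = "UPLOADED"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


# Assumption: FAILED branches off PROCESSING; terminal states allow nothing further.
ALLOWED: Dict[DocumentStatus, Set[DocumentStatus]] = {
    DocumentStatus.PENDING_UPLOAD: {DocumentStatus.UPLOADED},
    DocumentStatus.UPLOADED: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.PROCESSED, DocumentStatus.FAILED},
    DocumentStatus.PROCESSED: set(),
    DocumentStatus.FAILED: set(),
}


def transition(current: DocumentStatus, new: DocumentStatus) -> DocumentStatus:
    """Return `new` if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```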

Full Diagram

Client
  |
  | POST /documents/init
  | POST /documents/upload
  | POST /documents/process
  | GET /documents/stream (SSE)
  v
document-service (port 4000)
  |                                  ^
  | parsing_jobs (RabbitMQ)          | parsing_results
  v                                  |
parser-service worker ---------------+
  |                                  |
  | (optional) POST /completions     | parsing_progress
  v                                  |
completion-service                   |
                                     v
document-service
  |                                  ^
  | embedding_jobs (RabbitMQ)        | embedding_results
  v                                  |
embedding-service worker ------------+
  |                                  |
  | OpenAI Embedding API             | embedding_progress
  | PostgreSQL (chunks + embeddings) |
  v                                  v
document database             document-service
                                     |
                                     v
                             Status: PROCESSED