Document Processing Flow

End-to-end flow from file upload to searchable, embedded document.

Services Involved

document-service, parser-service (worker), embedding-service (worker), completion-service, Azure Blob / S3

Steps

1. Initialize Document

Client calls POST /api/v1/documents/init on document-service.

A document record is created in PostgreSQL with status PENDING_UPLOAD. Returns the document ID.

2. Upload File

Client calls POST /api/v1/documents/upload on document-service.

The file is streamed to Azure Blob Storage (or S3). The document record is updated with file metadata (name, size, content type, storage path). Status changes to UPLOADED.

3. Trigger Processing

Client calls POST /api/v1/documents/process on document-service.

Document status changes to PROCESSING. The service publishes a message to the parsing_jobs RabbitMQ queue.

document-service --[parsing_jobs queue]--> parser-service worker

If RabbitMQ is unavailable, document-service falls back to a direct HTTP call: POST /api/v1/parser/parse on parser-service.
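The queue-first, HTTP-fallback behavior described above could look roughly like the following. This is a minimal sketch, not the service's actual code: `dispatch_parsing_job`, the injected `publish` and `http_post` callables, the job payload shape, and the use of `ConnectionError` to signal an unavailable broker are all assumptions made for illustration.

```python
from typing import Callable

QUEUE_NAME = "parsing_jobs"
FALLBACK_PATH = "/api/v1/parser/parse"


def dispatch_parsing_job(
    job: dict,
    publish: Callable[[str, dict], None],
    http_post: Callable[[str, dict], None],
) -> str:
    """Try the queue first; fall back to HTTP if the broker is unreachable.

    Returns "queue" or "http" so the caller can log which path was used.
    """
    try:
        publish(QUEUE_NAME, job)
        return "queue"
    except ConnectionError:
        # Assumption: the broker client raises ConnectionError when
        # RabbitMQ is down. A real client may raise a library-specific error.
        http_post(FALLBACK_PATH, job)
        return "http"
```

Injecting the two transports keeps the fallback decision separate from any particular RabbitMQ or HTTP client library.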

4. Parse Document

The parser-service worker picks up the job from parsing_jobs, then:

Selects the parser backend (Azure Document Intelligence, PyMuPDF, MinerU, or Marker)
Downloads the file content
Converts the document to markdown
Optionally extracts images and generates captions via completion-service
Publishes progress updates to parsing_progress queue
Publishes the result to parsing_results queue

parser-service worker --[parsing_results]--> document-service
parser-service worker --[parsing_progress]--> document-service
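As a rough illustration of the worker steps above, the sketch below downloads a file, converts it, and reports to the two queues. The queue names come from this document; the function name, the message fields (`documentId`, `storagePath`, `stage`, `success`, `markdown`, `error`), and the injected callables are hypothetical. Backend selection and image captioning are omitted.

```python
from typing import Callable


def handle_parsing_job(
    job: dict,
    download: Callable[[str], bytes],
    to_markdown: Callable[[bytes], str],
    publish: Callable[[str, dict], None],
) -> None:
    """Process one parsing job and report progress and the final result."""
    doc_id = job["documentId"]  # hypothetical field name
    try:
        publish("parsing_progress", {"documentId": doc_id, "stage": "downloading"})
        content = download(job["storagePath"])  # hypothetical field name
        publish("parsing_progress", {"documentId": doc_id, "stage": "converting"})
        markdown = to_markdown(content)
    except Exception as exc:
        # Report the failure on the results queue so the consumer can
        # mark the document as FAILED.
        publish("parsing_results", {"documentId": doc_id, "success": False, "error": str(exc)})
        return
    publish("parsing_results", {"documentId": doc_id, "success": True, "markdown": markdown})
```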

5. Receive Parsing Results

Document-service consumes from parsing_results, then:

Stores the parsed chunks in the database
Triggers auto-summarization via completion-service (POST /api/v1/completions, using a Map-Reduce pattern)
Publishes an embedding job to embedding_jobs queue

document-service --[embedding_jobs queue]--> embedding-service worker
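The Map-Reduce summarization mentioned above generally works by summarizing each chunk independently (map) and then merging the partial summaries until one remains (reduce). The sketch below shows that general pattern only. The prompts, the `batch_size` parameter, and the `complete` callable (a stand-in for a call to completion-service) are assumptions, not the service's actual implementation.

```python
from typing import Callable, Sequence


def map_reduce_summary(
    chunks: Sequence[str],
    complete: Callable[[str], str],
    batch_size: int = 4,
) -> str:
    """Summarize chunks one by one, then merge summaries in batches."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each pass shrinks the list")
    if not chunks:
        return ""

    # Map: one summary per chunk.
    summaries = [complete(f"Summarize the following text:\n\n{chunk}") for chunk in chunks]

    # Reduce: merge groups of summaries until a single one is left.
    while len(summaries) > 1:
        merged = []
        for i in range(0, len(summaries), batch_size):
            group = summaries[i:i + batch_size]
            merged.append(complete("Combine these summaries into one:\n\n" + "\n\n".join(group)))
        summaries = merged
    return summaries[0]
```

Reducing in batches keeps every prompt bounded in size, which matters when a document has more chunks than fit in a single model context.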

6. Generate Embeddings

The embedding-service worker picks up the job from embedding_jobs, then:

Chunks the parsed text using the configured method (recursive, semantic, fixed_size)
Optionally translates non-English chunks to English
Generates vector embeddings via the OpenAI or Azure OpenAI embedding API
Stores chunks in the chunks table
Stores embeddings in the embeddings table (pgvector, 1024-dim vectors with HNSW index)
Publishes progress to embedding_progress queue
Publishes result to embedding_results queue

embedding-service worker --[embedding_results]--> document-service
embedding-service worker --[embedding_progress]--> document-service
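Of the chunking methods listed above, fixed_size is the simplest to illustrate. The sketch below splits text into character windows with an optional overlap. Measuring chunks in characters rather than tokens, the default sizes, and the function name are assumptions. The recursive and semantic methods are not shown.

```python
def chunk_fixed_size(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk
        # consists purely of already-covered overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_fixed_size("abcdefghij", chunk_size=4, overlap=1)` yields three windows, each sharing one character with its neighbor.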

7. Mark Complete

Document-service consumes from embedding_results. Document status changes to PROCESSED.

8. Real-Time Updates

During steps 4-7, the client can subscribe to GET /api/v1/documents/stream?documentIds= on document-service, a Server-Sent Events (SSE) endpoint. Progress updates from both the parser and embedding workers are relayed to the client in real time.
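On the client side, an SSE stream is a sequence of text lines in which `data:` lines carry the payload and a blank line ends each event. The sketch below parses such lines into events following the standard SSE framing. It does not open a connection, and the event name and JSON payload fields in the usage example are hypothetical, since this document does not specify them.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event", "data"} dicts from raw SSE lines."""
    event_type = "message"  # default event type in the SSE format
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if any.
            if data_lines:
                yield {"event": event_type, "data": "\n".join(data_lines)}
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)


# Usage with a hypothetical progress payload:
sample = ["event: progress\n", 'data: {"documentId": "d1", "percent": 40}\n', "\n"]
for event in iter_sse_events(sample):
    payload = json.loads(event["data"])
    print(event["event"], payload["documentId"], payload["percent"])
```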

Status Transitions

PENDING_UPLOAD --> UPLOADED --> PROCESSING --> PROCESSED
                                    |
                                    v
                                  FAILED
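The transitions in the diagram above can be encoded as a small lookup table that rejects any move the diagram does not show. This is a sketch only: the enum and function names are hypothetical, and treating PROCESSED and FAILED as terminal is an assumption based on the diagram having no outgoing arrows from them (retry behavior is not documented here).

```python
from enum import Enum


class DocumentStatus(str, Enum):
    PENDING_UPLOAD = "PENDING_UPLOAD"
    UPLOADED = "UPLOADED"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


# Allowed transitions, read directly from the diagram.
ALLOWED = {
    DocumentStatus.PENDING_UPLOAD: {DocumentStatus.UPLOADED},
    DocumentStatus.UPLOADED: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.PROCESSED, DocumentStatus.FAILED},
    DocumentStatus.PROCESSED: set(),  # assumed terminal
    DocumentStatus.FAILED: set(),     # assumed terminal
}


def transition(current: DocumentStatus, target: DocumentStatus) -> DocumentStatus:
    """Return the new status, or raise if the diagram does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```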

Full Diagram

Client
  |
  | POST /documents/init
  | POST /documents/upload
  | POST /documents/process
  | GET  /documents/stream (SSE)
  v
document-service (port 4000)
  |                                  ^
  | parsing_jobs (RabbitMQ)          | parsing_results
  v                                  |
parser-service worker ---------------+
  |                                  |
  | (optional) POST /completions     | parsing_progress
  v                                  |
completion-service                   |
                                     v
document-service
  |                                  ^
  | embedding_jobs (RabbitMQ)        | embedding_results
  v                                  |
embedding-service worker ------------+
  |                                  |
  | OpenAI Embedding API             | embedding_progress
  | PostgreSQL (chunks + embeddings) |
  v                                  v
document database                document-service
                                     |
                                     v
                              Status: PROCESSED
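
The client's part of the diagram is three POST calls made in order. The sketch below shows only that ordering. The endpoint paths come from this document, while the request and response fields (`fileName`, `id`, `documentId`, `file`) and the injected `post` callable are hypothetical, because the request schemas are not specified here.

```python
from typing import Callable

BASE = "/api/v1/documents"


def run_upload_flow(
    post: Callable[[str, dict], dict],
    file_name: str,
    file_bytes: bytes,
) -> str:
    """Initialize, upload, and start processing; return the document ID."""
    # Step 1: create the record (status PENDING_UPLOAD) and get its ID.
    init_response = post(f"{BASE}/init", {"fileName": file_name})  # hypothetical body
    document_id = init_response["id"]  # hypothetical response field

    # Step 2: upload the file (status UPLOADED).
    post(f"{BASE}/upload", {"documentId": document_id, "file": file_bytes})

    # Step 3: trigger processing (status PROCESSING).
    post(f"{BASE}/process", {"documentId": document_id})
    return document_id
```

After the third call returns, the client would typically open the SSE stream with the returned ID to follow progress.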
