Document Processing Flow
End-to-end flow from file upload to searchable, embedded document.
Services Involved
document-service, parser-service (worker), embedding-service (worker), completion-service, Azure Blob / S3
Steps
1. Initialize Document
Client calls POST /api/v1/documents/init on document-service.
A document record is created in PostgreSQL with status PENDING_UPLOAD. Returns the document ID.
2. Upload File
Client calls POST /api/v1/documents/upload on document-service.
The file is streamed to Azure Blob Storage (or S3). The document record is updated with file metadata (name, size, content type, storage path). Status changes to UPLOADED.
3. Trigger Processing
Client calls POST /api/v1/documents/process on document-service.
Document status changes to PROCESSING. The service publishes a message to the parsing_jobs RabbitMQ queue.
document-service --[parsing_jobs queue]--> parser-service worker
If RabbitMQ is unavailable, document-service falls back to a direct HTTP call: POST /api/v1/parser/parse on parser-service.
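The queue-first, HTTP-fallback dispatch described above can be sketched as follows. This is a minimal illustration, not the service's actual code: the BrokerUnavailable exception, the publish and http_post callables, and the {"documentId": ...} message shape are all assumptions made for the example.

```python
import json
from typing import Callable


class BrokerUnavailable(Exception):
    """Hypothetical error raised by `publish` when RabbitMQ cannot be reached."""


def dispatch_parse_job(
    document_id: str,
    publish: Callable[[str, bytes], None],
    http_post: Callable[[str, dict], None],
) -> str:
    """Send a parse job via the queue, falling back to HTTP.

    Returns "queue" or "http" to show which path was taken.
    """
    payload = {"documentId": document_id}  # assumed message shape
    try:
        # Preferred path: publish to the parsing_jobs queue.
        publish("parsing_jobs", json.dumps(payload).encode("utf-8"))
        return "queue"
    except BrokerUnavailable:
        # Fallback path: call parser-service directly over HTTP.
        http_post("/api/v1/parser/parse", payload)
        return "http"
```

Passing the transport functions in as arguments keeps the fallback decision testable without a running broker; a real implementation would wrap its RabbitMQ client and HTTP client behind these two callables.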
4. Parse Document
Parser-service worker picks up the job from parsing_jobs.
- Selects the parser backend (Azure Document Intelligence, PyMuPDF, MinerU, or Marker)
- Downloads the file content
- Converts the document to markdown
- Optionally extracts images and generates captions via completion-service
- Publishes progress updates to the parsing_progress queue
- Publishes the result to the parsing_results queue
parser-service worker --[parsing_results]--> document-service
parser-service worker --[parsing_progress]--> document-service
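As a rough sketch of the worker steps above (select backend, download, convert, report), the job handler might look like this. The job field names (documentId, backend, storagePath), the progress payload, and the callables passed in are hypothetical; image extraction and captioning are left out for brevity.

```python
from typing import Callable, Dict

# A backend turns raw file bytes into markdown (assumed interface).
Backend = Callable[[bytes], str]


def run_parse_job(
    job: Dict[str, str],
    backends: Dict[str, Backend],
    download: Callable[[str], bytes],
    publish: Callable[[str, dict], None],
) -> None:
    """Handle one parsing job and report progress and result via `publish`."""
    document_id = job["documentId"]      # assumed field name
    backend = backends[job["backend"]]   # assumed field name; KeyError if unknown
    publish("parsing_progress", {"documentId": document_id, "progress": 0})
    content = download(job["storagePath"])  # assumed field name
    markdown = backend(content)
    publish("parsing_progress", {"documentId": document_id, "progress": 100})
    publish("parsing_results", {"documentId": document_id, "markdown": markdown})
```

A registry dictionary keyed by backend name keeps the selection step to a single lookup, so adding another parser means adding one entry rather than another branch.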
5. Receive Parsing Results
Document-service consumes from parsing_results.
- Stores the parsed chunks in the database
- Triggers auto-summarization via completion-service (POST /api/v1/completions with Map-Reduce pattern)
- Publishes an embedding job to the embedding_jobs queue
document-service --[embedding_jobs queue]--> embedding-service worker
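The Map-Reduce summarization mentioned above can be illustrated with a short sketch. It assumes a summarize callable that stands in for a call to completion-service; the batch_size parameter and the way partial summaries are joined are illustrative choices, not documented behavior.

```python
from typing import Callable, List


def map_reduce_summary(
    chunks: List[str],
    summarize: Callable[[str], str],
    batch_size: int = 4,
) -> str:
    """Summarize each chunk (map), then merge summaries in batches (reduce)."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so the reduce step shrinks")
    if not chunks:
        return ""
    # Map: one summary per parsed chunk.
    summaries = [summarize(chunk) for chunk in chunks]
    # Reduce: summarize groups of summaries until a single one remains.
    while len(summaries) > 1:
        summaries = [
            summarize("\n".join(summaries[i:i + batch_size]))
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]
```

The point of the pattern is that no single completion request has to hold the whole document: each call sees either one chunk or a small batch of earlier summaries.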
6. Generate Embeddings
Embedding-service worker picks up the job from embedding_jobs.
- Chunks the parsed text using the configured method (recursive, semantic, fixed_size)
- Optionally translates non-English chunks to English
- Generates vector embeddings via OpenAI/Azure OpenAI embedding API
- Stores chunks in the chunks table
- Stores embeddings in the embeddings table (pgvector, 1024-dim vectors with HNSW index)
- Publishes progress to the embedding_progress queue
- Publishes the result to the embedding_results queue
embedding-service worker --[embedding_results]--> document-service
embedding-service worker --[embedding_progress]--> document-service
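Of the chunking methods listed above, fixed_size is the simplest to illustrate. The sketch below assumes character-based sizing with an optional overlap between neighboring chunks; the size and overlap defaults, and the overlap itself, are assumptions and may not match the service's configuration.

```python
from typing import List


def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that sentences cut at a
    boundary still appear whole in at least one chunk.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

For example, fixed_size_chunks("abcdefghij", size=4, overlap=1) returns ["abcd", "defg", "ghij"]. Each resulting chunk would then be sent to the embedding API and stored alongside its vector.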
7. Mark Complete
Document-service consumes from embedding_results. Document status changes to PROCESSED.
8. Real-Time Updates
Throughout steps 4-7, the client can subscribe to GET /api/v1/documents/stream?documentIds= (SSE) on document-service. Progress updates from both parser and embedding workers are relayed to the client in real time.
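On the client side, the SSE stream is a sequence of text lines in which blank lines separate events. The parser below is a minimal sketch of that wire format (data, event and id fields only); the JSON payload in the usage example is a hypothetical shape, since the actual progress message format is not specified here.

```python
from typing import Dict, Iterable, Iterator, List


def iter_sse_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per server-sent event from an iterable of text lines."""
    event: Dict[str, str] = {}
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                event["data"] = "\n".join(data_lines)
                yield event
            event, data_lines = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data_lines.append(value)
        elif field in ("event", "id"):
            event[field] = value


# Usage with a hypothetical progress payload:
sample = [
    ": keep-alive",
    "event: progress",
    'data: {"documentId": "doc-1", "progress": 40}',
    "",
]
events = list(iter_sse_events(sample))
```

In practice the lines would come from a streaming HTTP response to the stream endpoint rather than from a list.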
Status Transitions
PENDING_UPLOAD --> UPLOADED --> PROCESSING --> PROCESSED
                                    |
                                    v
                                  FAILED
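These transitions can be enforced with a small lookup table. The sketch below is illustrative only: it assumes FAILED is reached from PROCESSING (the steps above do not spell out which stages can fail) and treats PROCESSED and FAILED as terminal states.

```python
from enum import Enum
from typing import Dict, Set


class DocumentStatus(str, Enum):
    PENDING_UPLOAD = "PENDING_UPLOAD"
    UPLOADED = "UPLOADED"
    PROCESSING = "PROCESSING"
    PROCESSED = "PROCESSED"
    FAILED = "FAILED"


# Allowed next states for each status (assumption: only PROCESSING can fail).
ALLOWED: Dict[DocumentStatus, Set[DocumentStatus]] = {
    DocumentStatus.PENDING_UPLOAD: {DocumentStatus.UPLOADED},
    DocumentStatus.UPLOADED: {DocumentStatus.PROCESSING},
    DocumentStatus.PROCESSING: {DocumentStatus.PROCESSED, DocumentStatus.FAILED},
    DocumentStatus.PROCESSED: set(),
    DocumentStatus.FAILED: set(),
}


def transition(current: DocumentStatus, new: DocumentStatus) -> DocumentStatus:
    """Return `new` if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

Routing every status update through one such function means an out-of-order queue message, such as an embedding result arriving for a document that was never uploaded, is rejected instead of silently corrupting the record.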
Full Diagram
Client
  |
  | POST /documents/init
  | POST /documents/upload
  | POST /documents/process
  | GET /documents/stream (SSE)
  v
document-service (port 4000)
  |                                     ^
  | parsing_jobs (RabbitMQ)             | parsing_results
  v                                     |
parser-service worker ------------------+
  |                                     |
  | (optional) POST /completions        | parsing_progress
  v                                     |
completion-service                      |
                                        v
                                document-service
  |                                     ^
  | embedding_jobs (RabbitMQ)           | embedding_results
  v                                     |
embedding-service worker ---------------+
  |                                     |
  | OpenAI Embedding API                | embedding_progress
  | PostgreSQL (chunks + embeddings)    |
  v                                     v
document database               document-service
                                        |
                                        v
                                Status: PROCESSED