RAG Query Flow

End-to-end flow from a user query to relevant document chunks via vector similarity search and reranking.

Services Involved

rag-service, embedding-service, completion-service (optional, for LLM reranking), reranker-service (optional, for BGE reranking), PostgreSQL (pgvector)

Prerequisites

Documents must already be processed (status: PROCESSED). This means they have been parsed and embedded via the Document Processing Flow.

Steps (Advanced Engine -- Default)

1. Receive Query

The caller (typically the MCP RAG tool, though any HTTP client can call it directly) sends POST /api/v1/rag/retrieve to rag-service:

{
  "query": "How does kubernetes handle pod scheduling?",
  "grade": 0.7,
  "similarityTopK": 20,
  "docIds": ["uuid-1", "uuid-2"],
  "rerank": {
    "enabled": true,
    "strategy": "bge",
    "topK": 5,
    "score": 0.5
  }
}
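
For illustration, a minimal Python sketch of this call (the base URL and port are assumptions about the deployment, not part of the contract):

import requests

RAG_BASE_URL = "http://localhost:8080"  # hypothetical address for rag-service

payload = {
    "query": "How does kubernetes handle pod scheduling?",
    "grade": 0.7,
    "similarityTopK": 20,
    "docIds": ["uuid-1", "uuid-2"],
    "rerank": {"enabled": True, "strategy": "bge", "topK": 5, "score": 0.5},
}

resp = requests.post(f"{RAG_BASE_URL}/api/v1/rag/retrieve", json=payload, timeout=30)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["score"], result["metadata"]["fileName"])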

2. Embed the Query

rag-service calls embedding-service to convert the query text into a vector:

rag-service --POST /api/v1/embed/single--> embedding-service

embedding-service calls the OpenAI/Azure OpenAI embedding API and returns a 1024-dimension vector.
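
The request and response bodies of this internal hop are not shown in this flow; a sketch with assumed field names ("text", "embedding"):

import requests

EMBEDDING_BASE_URL = "http://localhost:8081"  # hypothetical address for embedding-service

def embed_query(text: str) -> list[float]:
    # "text" and "embedding" are assumed field names; only the path and the
    # 1024-dimension result are documented in this flow.
    resp = requests.post(f"{EMBEDDING_BASE_URL}/api/v1/embed/single",
                         json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["embedding"]

vector = embed_query("How does kubernetes handle pod scheduling?")
assert len(vector) == 1024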

3. Vector Similarity Search

rag-service queries PostgreSQL directly using pgvector:

SELECT c.*, e.embedding <=> :query_vector AS distance
FROM embeddings e
JOIN chunks c ON e.chunk_id = c.id
WHERE e.document_id IN (:docIds)  -- applied only when docIds is provided
ORDER BY distance ASC
LIMIT :similarityTopK

The <=> operator computes cosine distance (1 minus cosine similarity), so ordering by ascending distance returns the most similar chunks first. The HNSW index on the embedding column keeps this query fast.
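
For reference, a pgvector HNSW index that serves this cosine-distance query looks like the following (the index name is illustrative; vector_cosine_ops is the operator class <=> requires):

-- Illustrative index definition; the real name and tuning parameters may differ.
CREATE INDEX embeddings_embedding_hnsw_idx
ON embeddings
USING hnsw (embedding vector_cosine_ops);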

4. Rerank Results

If reranking is enabled, candidates are re-scored for higher precision.

BGE strategy: Calls an external reranker service:

rag-service --POST /v1/rerank--> reranker-service

The BGE cross-encoder model scores each candidate against the original query and re-orders them.
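
The rerank payload is not specified in this flow; a sketch with assumed field names ("query", "documents", "topK"):

import requests

RERANKER_BASE_URL = "http://localhost:8082"  # hypothetical address for reranker-service

candidates = [
    "Kubernetes uses a scheduler that watches...",
    "Pods are assigned to nodes based on resource requests...",
]
# Field names below are assumptions about the /v1/rerank contract; the
# documented facts are the path and that a BGE cross-encoder scores each
# candidate against the original query.
payload = {"query": "How does kubernetes handle pod scheduling?",
           "documents": candidates,
           "topK": 5}
ranked = requests.post(f"{RERANKER_BASE_URL}/v1/rerank", json=payload, timeout=30).json()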

LLM strategy: Calls completion-service:

rag-service --POST /api/v1/completions--> completion-service

The LLM is asked to score and rank the candidates. This is typically more accurate than BGE reranking, but slower and more expensive.
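
Conceptually, this is a completion call whose prompt embeds the query and the candidate texts; a rough sketch (the prompt wording and the payload field are assumptions):

import requests

COMPLETION_BASE_URL = "http://localhost:8083"  # hypothetical address for completion-service

prompt = (
    "Score each passage from 0 to 1 for relevance to the query, then rank them.\n"
    "Query: How does kubernetes handle pod scheduling?\n"
    "Passage 1: Kubernetes uses a scheduler that watches...\n"
)
# "prompt" is an assumed field name for the completions contract.
resp = requests.post(f"{COMPLETION_BASE_URL}/api/v1/completions",
                     json={"prompt": prompt}, timeout=60)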

5. Filter and Return

After reranking, results scoring below the rerank score threshold (rerank.score) are removed, and at most rerank.topK results are returned:

{
  "results": [
    {
      "content": "Kubernetes uses a scheduler that watches...",
      "score": 0.89,
      "metadata": {
        "documentId": "uuid-1",
        "fileName": "k8s-guide.pdf",
        "chunkIndex": 12,
        "pageNumber": 45,
        "contentType": "text"
      }
    }
  ],
  "totalResults": 3,
  "engineUsed": "advanced",
  "pipeline": {
    "reranked": true,
    "rerankStrategy": "bge",
    "candidatesConsidered": 20
  }
}
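
The filtering step itself is simple; conceptually (a sketch, not the service's actual code):

def filter_and_truncate(results: list[dict], score_threshold: float, top_k: int) -> list[dict]:
    # Drop candidates below the rerank score threshold, then keep the best
    # top_k -- mirrors the request's rerank.score and rerank.topK fields.
    kept = [r for r in results if r["score"] >= score_threshold]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:top_k]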

Steps (Naive Engine)

A simpler pipeline without reranking:

  1. Embed the query (same as step 2 above)
  2. Vector similarity search (same as step 3)
  3. Filter results by the grade threshold (cosine similarity score)
  4. Return results directly

Use the naive engine by setting "engineName": "naive" in the request.
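
For example (the remaining fields follow the same request shape as step 1):

{
  "engineName": "naive",
  "query": "How does kubernetes handle pod scheduling?",
  "grade": 0.7,
  "similarityTopK": 20
}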

Diagram

Caller (MCP RAG tool or direct)
|
| POST /api/v1/rag/retrieve
v
rag-service
|
| POST /api/v1/embed/single
v
embedding-service --> OpenAI Embedding API
|
| Returns 1024-dim vector
v
rag-service
|
| SQL: cosine similarity search (pgvector)
v
PostgreSQL (pgvector)
|
| Top-K candidates
v
rag-service
|
|--- BGE rerank? --> reranker-service (POST /v1/rerank)
|--- LLM rerank? --> completion-service (POST /api/v1/completions)
|--- No rerank? --> filter by grade
|
v
Filtered, ranked results --> Caller