# RAG Service
Retrieval-Augmented Generation query layer. Takes a user query, embeds it, performs vector similarity search against pgvector, optionally reranks results, and returns relevant document chunks.
This is the "read/query side" of the RAG system. The embedding-service handles the "write/indexing side."
- Tech: NestJS 11, TypeORM, PostgreSQL (pgvector)
- Port: 4000
- Auth: JWT, API Key, Public
- Database: Shared document database (reads documents, chunks, embeddings tables)
## Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/v1/rag/retrieve | Public | RAG retrieval with optional reranking |
| POST | /api/v1/chunks/similarity-search | Public | Direct vector similarity search |
| GET | /api/v1/health | Public | Health check |
### POST /api/v1/rag/retrieve
Main retrieval endpoint.
#### Request
```json
{
  "query": "How does kubernetes handle pod scheduling?",
  "grade": 0.7,
  "similarityTopK": 20,
  "docIds": ["uuid-1", "uuid-2"],
  "engineName": "default",
  "rerank": {
    "enabled": true,
    "strategy": "bge",
    "topK": 5,
    "score": 0.5
  }
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | Search query |
| grade | number | Yes | Similarity threshold (0-1) |
| similarityTopK | number | No | Number of candidates from vector search (5-50) |
| docIds | string[] | No | Filter results to specific documents |
| engineName | string | No | "default" (advanced) or "naive" |
| rerank | object | No | Reranking configuration |
| rerank.enabled | boolean | No | Enable reranking |
| rerank.strategy | string | No | "bge" (model-based) or "llm" (LLM-based) |
| rerank.topK | number | No | Number of results after reranking |
| rerank.score | number | No | Minimum rerank score threshold |
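A minimal TypeScript client sketch for this endpoint. The typed request shape is inferred from the field table above, and the base URL (`http://localhost:4000`) is an assumption based on the documented port:

```typescript
// Request shape inferred from the field table above; not the service's actual DTOs.
interface RerankOptions {
  enabled: boolean;
  strategy: "bge" | "llm";
  topK: number;
  score: number;
}

interface RetrieveRequest {
  query: string;
  grade: number;
  similarityTopK?: number;
  docIds?: string[];
  engineName?: "default" | "naive";
  rerank?: RerankOptions;
}

// Build a retrieve payload with BGE reranking enabled.
function buildRetrieveRequest(query: string, grade: number): RetrieveRequest {
  return {
    query,
    grade,
    similarityTopK: 20,
    rerank: { enabled: true, strategy: "bge", topK: 5, score: 0.5 },
  };
}

// Usage (not executed here; base URL assumed):
// const res = await fetch("http://localhost:4000/api/v1/rag/retrieve", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(
//     buildRetrieveRequest("How does kubernetes handle pod scheduling?", 0.7),
//   ),
// });
```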
#### Response
```json
{
  "results": [
    {
      "content": "Kubernetes uses a scheduler that...",
      "score": 0.89,
      "metadata": {
        "documentId": "uuid",
        "fileName": "k8s-docs.pdf",
        "chunkIndex": 5,
        "pageNumber": 12,
        "contentType": "text"
      }
    }
  ],
  "query": "How does kubernetes handle pod scheduling?",
  "totalResults": 3,
  "engineUsed": "advanced",
  "pipeline": {
    "reranked": true,
    "rerankStrategy": "bge",
    "candidatesConsidered": 20
  }
}
```
## Retrieval Engines
**Advanced (default):**
- Embed the query via embedding-service
- Run pgvector cosine similarity search
- Rerank candidates (BGE model or LLM-based)
- Filter by rerank score threshold
- Return top-K results
**Naive:**
- Embed the query via embedding-service
- Run pgvector cosine similarity search
- Filter by grade (similarity threshold)
- Return results directly (no reranking)
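The tail of the advanced pipeline (filter by rerank score, then take top-K) can be sketched as below; the `Candidate` shape and function name are illustrative, not the service's actual types:

```typescript
// Illustrative candidate shape; the real service attaches richer metadata.
interface Candidate {
  content: string;
  score: number; // rerank score (advanced engine) or similarity (naive engine)
}

// Drop candidates below the score threshold, sort by score descending,
// and keep at most topK results.
function selectResults(
  candidates: Candidate[],
  minScore: number,
  topK: number,
): Candidate[] {
  return candidates
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```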
### POST /api/v1/chunks/similarity-search
Low-level vector search. Accepts a pre-computed embedding vector and returns matching chunks.
#### Request
```json
{
  "queryEmbedding": [0.0123, -0.0456, ...],
  "topK": 10,
  "docIds": ["uuid-1"]
}
```
Uses the pgvector `<=>` cosine distance operator directly.
Note: the rag-service entity declares the embedding column as `vector(1536)`, but the vectors actually stored by embedding-service are 1024-dimensional. The declared length is stale and should be updated to 1024. It does not affect runtime behavior, because pgvector computes cosine distance on the stored vectors regardless of the declared column length.
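A sketch of what the underlying parameterized query might look like. The table name (`chunks`) and column names (`embedding`, `doc_id`) are assumptions for illustration, not the actual schema:

```typescript
// Build a parameterized similarity-search query using pgvector's <=> operator.
// $1 = query embedding, $2 = topK limit, $3 = optional doc ID filter.
// Table/column names are hypothetical.
function buildSimilaritySql(hasDocFilter: boolean): string {
  return [
    "SELECT id, content, 1 - (embedding <=> $1::vector) AS score",
    "FROM chunks",
    hasDocFilter ? "WHERE doc_id = ANY($3::uuid[])" : "",
    "ORDER BY embedding <=> $1::vector",
    "LIMIT $2",
  ]
    .filter((line) => line !== "")
    .join("\n");
}
```

Ordering by `embedding <=> $1::vector` (ascending distance) and converting distance to a score with `1 - distance` is the conventional pattern for cosine similarity in pgvector.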
## Reranking Strategies
### BGE Model Reranking
Calls an external reranker service (`POST /v1/rerank`) that runs a BGE cross-encoder model, scoring each candidate against the query and re-ordering by relevance.
- Timeout: 30s
- Retries: 2 attempts
- Configured via `RERANKER_SERVICE_URL` (optional -- RAG retrieval works without it)
### LLM Reranking
Uses the completion service to ask an LLM to score and rank candidates. More accurate but slower and more expensive.
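One way such an LLM rerank prompt could be constructed is sketched below; the actual prompt the service sends to the completion service is not documented here, so the format and function name are assumptions:

```typescript
// Hypothetical prompt builder for LLM-based reranking: number the candidate
// passages and ask the model to return per-passage relevance scores as JSON.
function buildRerankPrompt(query: string, candidates: string[]): string {
  const numbered = candidates.map((c, i) => `[${i + 1}] ${c}`).join("\n");
  return [
    "Score each passage from 0 to 1 for relevance to the query.",
    `Query: ${query}`,
    `Passages:\n${numbered}`,
    'Reply as JSON: [{"index": 1, "score": 0.8}, ...]',
  ].join("\n\n");
}
```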
## Inter-Service Communication
| Target | Protocol | Purpose |
|---|---|---|
| embedding-service | HTTP | POST /api/v1/embed/single -- Embed query text |
| completion-service | HTTP | POST /api/v1/completions -- LLM reranking |
| reranker-service | HTTP | POST /v1/rerank -- BGE model reranking (optional) |
| PostgreSQL (shared document database) | SQL | Cosine similarity search via pgvector |