# RAG Service
Retrieval-Augmented Generation query layer. Takes a user query, embeds it, performs vector similarity search against pgvector, optionally reranks results, and returns relevant document chunks.
This is the "read/query side" of the RAG system. The embedding-service handles the "write/indexing side."
- Tech: NestJS 11, TypeORM, PostgreSQL (pgvector)
- Port: 4000
- Auth: JWT, API Key, Public
- Database: Shared document database (reads documents, chunks, embeddings tables)
## Endpoints
| Method | Path | Auth | Description |
|---|---|---|---|
| POST | /api/v1/rag/retrieve | Public | RAG retrieval with optional reranking |
| POST | /api/v1/chunks/similarity-search | Public | Direct vector similarity search |
| GET | /api/v1/health | Public | Health check |
### POST /api/v1/rag/retrieve
Main retrieval endpoint.
#### Request
```json
{
  "query": "How does kubernetes handle pod scheduling?",
  "grade": 0.7,
  "similarityTopK": 20,
  "docIds": ["uuid-1", "uuid-2"],
  "engineName": "default",
  "rerank": {
    "enabled": true,
    "strategy": "bge",
    "topK": 5,
    "score": 0.5
  }
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| query | string | Yes | Search query |
| grade | number | Yes | Similarity threshold (0-1) |
| similarityTopK | number | No | Number of candidates from vector search (5-50) |
| docIds | string[] | No | Filter results to specific documents |
| engineName | string | No | "default" (advanced) or "naive" |
| rerank | object | No | Reranking configuration |
| rerank.enabled | boolean | No | Enable reranking |
| rerank.strategy | string | No | "bge" (model-based) or "llm" (LLM-based) |
| rerank.topK | number | No | Number of results after reranking |
| rerank.score | number | No | Minimum rerank score threshold |
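A minimal TypeScript client sketch for this endpoint. The typed request shape is inferred from the field table above, and the base URL (`http://localhost:4000`) is an assumption based on the documented port:

```typescript
// Request shape inferred from the field table above; not the service's actual DTOs.
interface RerankOptions {
  enabled: boolean;
  strategy: "bge" | "llm";
  topK: number;
  score: number;
}

interface RetrieveRequest {
  query: string;
  grade: number;
  similarityTopK?: number;
  docIds?: string[];
  engineName?: "default" | "naive";
  rerank?: RerankOptions;
}

// Build a retrieve payload with BGE reranking enabled.
function buildRetrieveRequest(query: string, grade: number): RetrieveRequest {
  return {
    query,
    grade,
    similarityTopK: 20,
    rerank: { enabled: true, strategy: "bge", topK: 5, score: 0.5 },
  };
}

// Usage (not executed here; base URL assumed):
// const res = await fetch("http://localhost:4000/api/v1/rag/retrieve", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(
//     buildRetrieveRequest("How does kubernetes handle pod scheduling?", 0.7),
//   ),
// });
```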
#### Response
```json
{
  "results": [
    {
      "content": "Kubernetes uses a scheduler that...",
      "score": 0.89,
      "metadata": {
        "documentId": "uuid",
        "fileName": "k8s-docs.pdf",
        "chunkIndex": 5,
        "pageNumber": 12,
        "contentType": "text"
      }
    }
  ],
  "query": "How does kubernetes handle pod scheduling?",
  "totalResults": 3,
  "engineUsed": "advanced",
  "pipeline": {
    "reranked": true,
    "rerankStrategy": "bge",
    "candidatesConsidered": 20
  }
}
```
## Retrieval Engines
**Advanced (default):**
- Embed the query via embedding-service
- Run pgvector cosine similarity search
- Rerank candidates (BGE model or LLM-based)
- Filter by rerank score threshold
- Return top-K results
**Naive:**
- Embed the query via embedding-service
- Run pgvector cosine similarity search
- Filter by grade (similarity threshold)
- Return results directly (no reranking)
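The tail of the advanced pipeline (filter by rerank score, then take top-K) can be sketched as below; the `Candidate` shape and function name are illustrative, not the service's actual types:

```typescript
// Illustrative candidate shape; the real service attaches richer metadata.
interface Candidate {
  content: string;
  score: number; // rerank score (advanced engine) or similarity (naive engine)
}

// Drop candidates below the score threshold, sort by score descending,
// and keep at most topK results.
function selectResults(
  candidates: Candidate[],
  minScore: number,
  topK: number,
): Candidate[] {
  return candidates
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```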
### POST /api/v1/chunks/similarity-search
Low-level vector search. Accepts a pre-computed embedding vector and returns matching chunks.
#### Request
```json
{
  "queryEmbedding": [0.0123, -0.0456, ...],
  "topK": 10,
  "docIds": ["uuid-1"]
}
```
Uses the pgvector `<=>` cosine distance operator directly.
Note: the rag-service entity declares the embedding column as `vector(1536)`, but the vectors actually stored by embedding-service are 1024-dimensional. The declared length is stale and should be updated to 1024. It does not affect runtime behavior, because pgvector computes cosine distance on the stored vectors regardless of the declared column length.
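A sketch of what the underlying parameterized query might look like. The table name (`chunks`) and column names (`embedding`, `doc_id`) are assumptions for illustration, not the actual schema:

```typescript
// Build a parameterized similarity-search query using pgvector's <=> operator.
// $1 = query embedding, $2 = topK limit, $3 = optional doc ID filter.
// Table/column names are hypothetical.
function buildSimilaritySql(hasDocFilter: boolean): string {
  return [
    "SELECT id, content, 1 - (embedding <=> $1::vector) AS score",
    "FROM chunks",
    hasDocFilter ? "WHERE doc_id = ANY($3::uuid[])" : "",
    "ORDER BY embedding <=> $1::vector",
    "LIMIT $2",
  ]
    .filter((line) => line !== "")
    .join("\n");
}
```

Ordering by `embedding <=> $1::vector` (ascending distance) and converting distance to a score with `1 - distance` is the conventional pattern for cosine similarity in pgvector.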
## Reranking Strategies
### BGE Model Reranking
Calls an external reranker service (`POST /v1/rerank`) that runs a BGE cross-encoder model, scoring each candidate against the query and re-ordering by relevance.
- Timeout: 30s
- Retries: 2 attempts
- Configured via `RERANKER_SERVICE_URL` (optional -- RAG retrieval works without it)
### LLM Reranking
Uses the completion service to ask an LLM to score and rank candidates. More accurate but slower and more expensive.
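One way such an LLM rerank prompt could be constructed is sketched below; the actual prompt the service sends to the completion service is not documented here, so the format and function name are assumptions:

```typescript
// Hypothetical prompt builder for LLM-based reranking: number the candidate
// passages and ask the model to return per-passage relevance scores as JSON.
function buildRerankPrompt(query: string, candidates: string[]): string {
  const numbered = candidates.map((c, i) => `[${i + 1}] ${c}`).join("\n");
  return [
    "Score each passage from 0 to 1 for relevance to the query.",
    `Query: ${query}`,
    `Passages:\n${numbered}`,
    'Reply as JSON: [{"index": 1, "score": 0.8}, ...]',
  ].join("\n\n");
}
```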
## Inter-Service Communication
| Target | Protocol | Purpose |
|---|---|---|
| embedding-service | HTTP | POST /api/v1/embed/single -- Embed query text |
| completion-service | HTTP | POST /api/v1/completions -- LLM reranking |
| reranker-service | HTTP | POST /v1/rerank -- BGE model reranking (optional) |
| PostgreSQL (shared document database) | SQL | Cosine similarity search via pgvector |