Completion Service

LLM gateway that routes completion requests to the correct provider. Handles streaming, normalizes responses across providers, and emits token usage events.

  • Tech: NestJS 11, RabbitMQ (producer)
  • Port: 4000 (may be overridden per deployment)
  • Auth: JWT, API Key, Public
  • Database: None (stateless)

Endpoints

Method  Path                  Auth     Description
POST    /api/v1/completions   Public   Execute an LLM completion
GET     /api/v1/health        Public   Health check
GET     /metrics              Public   Prometheus metrics

POST /api/v1/completions

The single core endpoint. It accepts a completion request, routes it to the correct LLM provider, and returns the response.

Request

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello" }
  ],
  "stream": true,
  "temperature": 0.7,
  "maxOutputTokens": 4096,
  "tools": [],
  "toolChoice": "auto"
}
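
As an illustration, a minimal TypeScript sketch of a non-streaming call follows; the base URL (localhost:4000) and the absence of auth headers are assumptions about a local setup, not part of the service contract:

// Sketch: non-streaming completion call against a local instance (assumed URL, no auth).
const response = await fetch("http://localhost:4000/api/v1/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello" },
    ],
    stream: false,
    temperature: 0.7,
    maxOutputTokens: 4096,
  }),
});

const completion = await response.json();
console.log(completion.outputText, completion.usage);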

Non-Streaming Response

{
  "outputText": "Hello! How can I help you?",
  "output": [],
  "usage": {
    "inputTokens": 25,
    "outputTokens": 8
  },
  "model": "gpt-4o",
  "finishReason": "stop"
}

Streaming Response (NDJSON)

When stream is true, the response body is newline-delimited JSON (NDJSON), one chunk per line:

{"type":"text_delta_start","content":"Hello"}
{"type":"text_delta","content":"! How"}
{"type":"text_delta","content":" can I help"}
{"type":"text_delta_end","content":""}
{"type":"usage","inputTokens":25,"outputTokens":8}
{"type":"finish","finishReason":"stop"}

Chunk types:

  • text_delta_start / text_delta / text_delta_end -- Text generation
  • tool_call_start / tool_call_delta / tool_call_end -- Tool/function calls
  • usage -- Token counts
  • finish -- Completion finished
  • error -- Error occurred
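
A consumer splits the body on newlines and parses each line independently. Below is a minimal TypeScript sketch under that assumption, for a request made with stream: true via fetch; the chunk types come from the list above, but the error chunk's message field is a guess:

// Sketch: read the NDJSON stream and act on each chunk type.
// Assumes a fetch() Response whose body carries the chunks shown above.
async function consumeStream(res: Response): Promise<void> {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;

      const chunk = JSON.parse(line);
      switch (chunk.type) {
        case "text_delta_start":
        case "text_delta":
          process.stdout.write(chunk.content);
          break;
        case "usage":
          console.log(`\nusage: ${chunk.inputTokens} in / ${chunk.outputTokens} out`);
          break;
        case "finish":
          console.log(`finished: ${chunk.finishReason}`);
          break;
        case "error":
          // Assumed field name; check the actual error chunk payload.
          throw new Error(chunk.message ?? "stream error");
      }
    }
  }
}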

Supported Providers

Provider                   Key                         Notes
OpenAI                     openai                      GPT-4, GPT-4o, GPT-3.5, etc.
Azure OpenAI               azure_openai                Azure-hosted OpenAI models
Azure OpenAI Completions   azure_openai_completions    Azure completions API variant
Anthropic                  anthropic                   Claude models
Anthropic Bedrock          anthropic_bedrock           Claude via AWS Bedrock
Google                     google                      Gemini models
Mistral                    mistral                     Mistral models
Jamba                      jamba                       AI21 Jamba models
Ollama                     ollama                      Local models via Ollama
vLLM                       vllm                        Self-hosted models via vLLM
Remote Custom              remote_completion           Any OpenAI-compatible endpoint

Model Configuration

At startup, the service fetches its model/provider configuration from llm-core:

GET /api/v1/services-config/models-providers

This returns the full mapping of which models are available from which providers, along with provider-specific configuration (API keys, endpoints, deployment names). Alternatively, a local JSON config file can be used via LOCAL_MODEL_PROVIDERS_CONFIG_PATH.
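
For orientation, a rough TypeScript sketch of that startup lookup is shown below; the LLM_CORE_URL variable, the type name, the config shape, and the precedence of the local file over the remote fetch are illustrative assumptions, not the actual implementation:

// Illustrative startup config loading; names, types, and fallback order are assumptions.
import { readFile } from "node:fs/promises";

interface ModelProvidersConfig {
  // Shape inferred from the description: models mapped to provider-specific settings.
  [modelName: string]: { provider: string; [key: string]: unknown };
}

async function loadModelProvidersConfig(): Promise<ModelProvidersConfig> {
  const localPath = process.env.LOCAL_MODEL_PROVIDERS_CONFIG_PATH;
  if (localPath) {
    // A local JSON file can stand in for the llm-core call.
    return JSON.parse(await readFile(localPath, "utf8"));
  }
  // Otherwise fetch the model/provider mapping from llm-core at startup.
  const res = await fetch(`${process.env.LLM_CORE_URL}/api/v1/services-config/models-providers`);
  if (!res.ok) throw new Error(`llm-core config fetch failed: ${res.status}`);
  return res.json();
}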

Token Usage Tracking

After every completion (streaming or non-streaming), the service publishes a transaction event to RabbitMQ:

{
  "model": "gpt-4o",
  "provider": "openai",
  "inputTokens": 250,
  "outputTokens": 150,
  "totalTokens": 400,
  "reasoningTokens": 0,
  "deploymentName": "gpt-4o",
  "stream": true,
  "temperature": 0.7,
  "maxOutputTokens": 4096,
  "toolsCount": 2,
  "timestamp": "2026-01-15T10:30:00.000Z",
  "messages": [{ "role": "user", "content": "..." }],
  "context": { "userId": "uuid", "conversationId": "uuid" }
}

The event includes all completion parameters (temperature, topP, topK, responseFormat, toolChoice, etc.) and the full context object passed in the original request. llm-core consumes these events and stores them in the model_transactions table.
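
A minimal producer sketch using amqplib is shown below; the queue name, connection URL variable, and per-call channel handling are assumptions (a NestJS service would typically wire this differently), and only the event payload shape follows the example above:

// Illustrative producer; "model-transactions" is an assumed queue name.
import * as amqp from "amqplib";

async function publishTransactionEvent(event: Record<string, unknown>): Promise<void> {
  const connection = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const channel = await connection.createChannel();
  const queue = "model-transactions"; // assumed; llm-core would consume from here

  await channel.assertQueue(queue, { durable: true });
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(event)), { persistent: true });

  await channel.close();
  await connection.close();
}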

Inter-Service Communication

Target                Protocol          Purpose
llm-core              HTTP (GET)        Fetch model/provider configuration at startup
LLM APIs (external)   HTTPS             Send completions to OpenAI, Anthropic, Google, etc.
RabbitMQ              AMQP (producer)   Emit transaction events