Completion Service

LLM gateway that routes completion requests to the correct provider. Handles streaming, normalizes responses across providers, and emits token usage events.

  • Tech: NestJS 11, RabbitMQ (producer)
  • Port: 4000 (may be overridden per deployment)
  • Auth: JWT, API Key, Public
  • Database: None (stateless)

Endpoints

Method  Path                  Auth     Description
POST    /api/v1/completions   Public   Execute an LLM completion
GET     /api/v1/health        Public   Health check
GET     /metrics              Public   Prometheus metrics

POST /api/v1/completions

The single core endpoint. It accepts a completion request, routes it to the correct LLM provider, and returns the response.

Request

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Hello" }
  ],
  "stream": true,
  "temperature": 0.7,
  "maxOutputTokens": 4096,
  "tools": [],
  "toolChoice": "auto"
}
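
As an illustration, a minimal TypeScript sketch of a non-streaming call follows; the base URL (localhost:4000) and the absence of auth headers are assumptions about a local setup, not part of the service contract:

// Sketch: non-streaming completion call against a local instance (assumed URL, no auth).
const response = await fetch("http://localhost:4000/api/v1/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gpt-4o",
    messages: [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Hello" },
    ],
    stream: false,
    temperature: 0.7,
    maxOutputTokens: 4096,
  }),
});

const completion = await response.json();
console.log(completion.outputText, completion.usage);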

Non-Streaming Response

{
  "outputText": "Hello! How can I help you?",
  "output": [],
  "usage": {
    "inputTokens": 25,
    "outputTokens": 8
  },
  "model": "gpt-4o",
  "finishReason": "stop"
}

Streaming Response (NDJSON)

When stream is true, the response body is newline-delimited JSON (NDJSON), one chunk per line:

{"type":"text_delta_start","content":"Hello"}
{"type":"text_delta","content":"! How"}
{"type":"text_delta","content":" can I help"}
{"type":"text_delta_end","content":""}
{"type":"usage","inputTokens":25,"outputTokens":8}
{"type":"finish","finishReason":"stop"}

Chunk types:

  • text_delta_start / text_delta / text_delta_end -- Text generation
  • tool_call_start / tool_call_delta / tool_call_end -- Tool/function calls
  • usage -- Token counts
  • finish -- Completion finished
  • error -- Error occurred
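
A consumer splits the body on newlines and parses each line independently. Below is a minimal TypeScript sketch under that assumption, for a request made with stream: true via fetch; the chunk types come from the list above, but the error chunk's message field is a guess:

// Sketch: read the NDJSON stream and act on each chunk type.
// Assumes a fetch() Response whose body carries the chunks shown above.
async function consumeStream(res: Response): Promise<void> {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    let newline: number;
    while ((newline = buffer.indexOf("\n")) >= 0) {
      const line = buffer.slice(0, newline).trim();
      buffer = buffer.slice(newline + 1);
      if (!line) continue;

      const chunk = JSON.parse(line);
      switch (chunk.type) {
        case "text_delta_start":
        case "text_delta":
          process.stdout.write(chunk.content);
          break;
        case "usage":
          console.log(`\nusage: ${chunk.inputTokens} in / ${chunk.outputTokens} out`);
          break;
        case "finish":
          console.log(`finished: ${chunk.finishReason}`);
          break;
        case "error":
          // Assumed field name; check the actual error chunk payload.
          throw new Error(chunk.message ?? "stream error");
      }
    }
  }
}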

Supported Providers

Provider                   Key                         Notes
OpenAI                     openai                      GPT-4, GPT-4o, GPT-3.5, etc.
Azure OpenAI               azure_openai                Azure-hosted OpenAI models
Azure OpenAI Completions   azure_openai_completions    Azure completions API variant
Anthropic                  anthropic                   Claude models
Anthropic Bedrock          anthropic_bedrock           Claude via AWS Bedrock
Google                     google                      Gemini models
Mistral                    mistral                     Mistral models
Jamba                      jamba                       AI21 Jamba models
Ollama                     ollama                      Local models via Ollama
vLLM                       vllm                        Self-hosted models via vLLM
Remote Custom              remote_completion           Any OpenAI-compatible endpoint

Model Configuration

At startup, the service fetches its model/provider configuration from llm-core:

GET /api/v1/services-config/models-providers

This returns the full mapping of which models are available from which providers, along with provider-specific configuration (API keys, endpoints, deployment names). Alternatively, a local JSON config file can be used via LOCAL_MODEL_PROVIDERS_CONFIG_PATH.
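
For orientation, a rough TypeScript sketch of that startup lookup is shown below; the LLM_CORE_URL variable, the type name, the config shape, and the precedence of the local file over the remote fetch are illustrative assumptions, not the actual implementation:

// Illustrative startup config loading; names, types, and fallback order are assumptions.
import { readFile } from "node:fs/promises";

interface ModelProvidersConfig {
  // Shape inferred from the description: models mapped to provider-specific settings.
  [modelName: string]: { provider: string; [key: string]: unknown };
}

async function loadModelProvidersConfig(): Promise<ModelProvidersConfig> {
  const localPath = process.env.LOCAL_MODEL_PROVIDERS_CONFIG_PATH;
  if (localPath) {
    // A local JSON file can stand in for the llm-core call.
    return JSON.parse(await readFile(localPath, "utf8"));
  }
  // Otherwise fetch the model/provider mapping from llm-core at startup.
  const res = await fetch(`${process.env.LLM_CORE_URL}/api/v1/services-config/models-providers`);
  if (!res.ok) throw new Error(`llm-core config fetch failed: ${res.status}`);
  return res.json();
}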

Token Usage Tracking

After every completion (streaming or non-streaming), the service publishes a transaction event to RabbitMQ:

{
  "model": "gpt-4o",
  "provider": "openai",
  "inputTokens": 250,
  "outputTokens": 150,
  "totalTokens": 400,
  "reasoningTokens": 0,
  "deploymentName": "gpt-4o",
  "stream": true,
  "temperature": 0.7,
  "maxOutputTokens": 4096,
  "toolsCount": 2,
  "timestamp": "2026-01-15T10:30:00.000Z",
  "messages": [{ "role": "user", "content": "..." }],
  "context": { "userId": "uuid", "conversationId": "uuid" }
}

The event includes all completion parameters (temperature, topP, topK, responseFormat, toolChoice, etc.) and the full context object passed in the original request. llm-core consumes these events and stores them in the model_transactions table.
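
A minimal producer sketch using amqplib is shown below; the queue name, connection URL variable, and per-call channel handling are assumptions (a NestJS service would typically wire this differently), and only the event payload shape follows the example above:

// Illustrative producer; "model-transactions" is an assumed queue name.
import * as amqp from "amqplib";

async function publishTransactionEvent(event: Record<string, unknown>): Promise<void> {
  const connection = await amqp.connect(process.env.RABBITMQ_URL ?? "amqp://localhost");
  const channel = await connection.createChannel();
  const queue = "model-transactions"; // assumed; llm-core would consume from here

  await channel.assertQueue(queue, { durable: true });
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(event)), { persistent: true });

  await channel.close();
  await connection.close();
}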

Inter-Service Communication

Target                Protocol          Purpose
llm-core              HTTP (GET)        Fetch model/provider configuration at startup
LLM APIs (external)   HTTPS             Send completions to OpenAI, Anthropic, Google, etc.
RabbitMQ              AMQP (producer)   Emit transaction events