# Inference Engine
Mnemo runs via llama.cpp’s llama-server, exposed as an OpenAI-compatible HTTP API inside a Docker container.
## Architecture

```
Chat UI / API
      │
      ▼
Model Router (selects model by task_type)
      │
      ▼
llama_router.chat_stream()
      │
      ▼
llama-server (Docker, CUDA GPU)
  ├── Port 8090 (chat model)
  └── Port 8091 (code model, dual-GPU only)
```

## Configuration
The llama-server is configured via environment variables in the Docker entrypoint:
| Variable | Default | Description |
|---|---|---|
| `LLAMA_MODEL_PATH` | `/models/mnemo-chat.gguf` | GGUF model file |
| `LLAMA_PORT` | `8090` | HTTP server port |
| `LLAMA_GPU_LAYERS` | `999` | Layers to offload to GPU (999 = all) |
| `LLAMA_CTX_SIZE` | `8192` | Context window size |
| `LLAMA_PARALLEL` | `4` | Concurrent request slots |
| `LLAMA_FLASH_ATTN` | `true` | Enable flash attention |
| `LLAMA_KV_CACHE_TYPE` | `q8_0` | KV cache quantization |
| `LLAMA_EXTRA_ARGS` | *(none)* | Additional llama-server flags |
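For illustration, these variables might be wired into a compose service like the following sketch. The image name and volume paths are assumptions, not the project's actual deployment; the container name matches the `docker exec kulvex-llama` health check below.

```yaml
# Hypothetical compose service for the llama-server container.
services:
  llama:
    image: kulvex/llama-server:latest   # assumed custom image with the entrypoint
    container_name: kulvex-llama
    ports:
      - "8090:8090"
    volumes:
      - ./models:/models
    environment:
      LLAMA_MODEL_PATH: /models/mnemo-chat.gguf
      LLAMA_PORT: "8090"
      LLAMA_GPU_LAYERS: "999"
      LLAMA_CTX_SIZE: "8192"
      LLAMA_PARALLEL: "4"
      LLAMA_FLASH_ATTN: "true"
      LLAMA_KV_CACHE_TYPE: q8_0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```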
## Streaming
All inference is streaming. The llama_router yields tokens as they’re generated:
```python
async for token in llama_router.chat_stream(model=model, messages=messages):
    await sio.emit("chat:token", {"token": token}, to=sid)
```

The frontend receives tokens via Socket.IO and renders them incrementally.
## Context Building
Before inference, KULVEX builds context with:
- System prompt — KULVEX identity, capabilities, current date/time
- RAG context — Relevant knowledge from ChromaDB vector store
- Conversation history — Recent messages for continuity
- User message — The current query
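The pieces above can be sketched as a builder that returns an OpenAI-style message list. The prompt text, history window, and message shapes here are illustrative assumptions, not the project's actual implementation:

```python
from datetime import datetime, timezone

def build_context(message: str, history: list[dict]) -> list[dict]:
    """Assemble the message list sent to llama-server.
    Sketch only: the real system prompt and history policy may differ."""
    system_prompt = (
        "You are KULVEX. "  # identity/capabilities text is an assumption
        f"Current date/time: {datetime.now(timezone.utc).isoformat()}"
    )
    return (
        [{"role": "system", "content": system_prompt}]
        + history[-10:]  # recent messages for continuity (window size assumed)
        + [{"role": "user", "content": message}]
    )

msgs = build_context("What is RAG?", [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello!"},
])
# First entry is the system prompt, last is the current user query.
```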
```python
if use_rag:
    messages = await build_context_with_rag(message, history, project=project)
else:
    messages = build_context(message, history)
```

## Health Checks
```shell
# Check if llama-server is healthy
curl http://localhost:9100/api/ai/status

# Direct llama-server health (internal Docker network)
docker exec kulvex-llama curl http://localhost:8090/health
```
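In automation (startup scripts, CI) it can help to block until the server reports healthy. A hedged sketch with the probe injected as a callable, so the polling logic is independent of the endpoint; the timeout and interval defaults are assumptions:

```python
import time

def wait_healthy(check, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `check()` until it returns True or the timeout elapses.
    `check` would typically issue an HTTP GET against /health and
    return True on a 200 response. Sketch only; a Docker healthcheck
    may serve the same purpose in the real deployment."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if check():
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False

# Usage against the assumed endpoint:
#   import urllib.request
#   ok = wait_healthy(lambda: urllib.request.urlopen(
#       "http://localhost:8090/health", timeout=2).status == 200)
```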