# Inference Engine
Mnemo runs via llama.cpp’s llama-server, exposed as an OpenAI-compatible HTTP API inside a Docker container.
## Architecture

```
Chat UI / API
      │
      ▼
Model Router (selects model by task_type)
      │
      ▼
llama_router.chat_stream()
      │
      ▼
llama-server (Docker, CUDA GPU)
  ├── Port 8090 (chat model)
  └── Port 8091 (code model, dual-GPU only)
```

## Configuration
The llama-server is configured via environment variables in the Docker entrypoint:
| Variable | Default | Description |
|---|---|---|
| `LLAMA_MODEL_PATH` | `/models/mnemo-chat.gguf` | GGUF model file |
| `LLAMA_PORT` | `8090` | HTTP server port |
| `LLAMA_GPU_LAYERS` | `999` | Layers to offload to GPU (999 = all) |
| `LLAMA_CTX_SIZE` | `8192` | Context window size |
| `LLAMA_PARALLEL` | `4` | Concurrent request slots |
| `LLAMA_FLASH_ATTN` | `true` | Enable flash attention |
| `LLAMA_KV_CACHE_TYPE` | `q8_0` | KV cache quantization |
| `LLAMA_EXTRA_ARGS` | *(none)* | Additional llama-server flags |
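For illustration, these variables might be wired into a compose service like the following sketch. The image name and volume paths are assumptions, not the project's actual deployment; the container name matches the `docker exec kulvex-llama` health check below.

```yaml
# Hypothetical compose service for the llama-server container.
services:
  llama:
    image: kulvex/llama-server:latest   # assumed custom image with the entrypoint
    container_name: kulvex-llama
    ports:
      - "8090:8090"
    volumes:
      - ./models:/models
    environment:
      LLAMA_MODEL_PATH: /models/mnemo-chat.gguf
      LLAMA_PORT: "8090"
      LLAMA_GPU_LAYERS: "999"
      LLAMA_CTX_SIZE: "8192"
      LLAMA_PARALLEL: "4"
      LLAMA_FLASH_ATTN: "true"
      LLAMA_KV_CACHE_TYPE: q8_0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```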
## Streaming
All inference is streaming. The llama_router yields tokens as they’re generated:
```python
async for token in llama_router.chat_stream(model=model, messages=messages):
    await sio.emit("chat:token", {"token": token}, to=sid)
```

The frontend receives tokens via Socket.IO and renders them incrementally.
## Context Building
Before inference, KULVEX builds context with:
- System prompt — KULVEX identity, capabilities, current date/time
- RAG context — Relevant knowledge from ChromaDB vector store
- Conversation history — Recent messages for continuity
- User message — The current query
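The pieces above can be sketched as a builder that returns an OpenAI-style message list. The prompt text, history window, and message shapes here are illustrative assumptions, not the project's actual implementation:

```python
from datetime import datetime, timezone

def build_context(message: str, history: list[dict]) -> list[dict]:
    """Assemble the message list sent to llama-server.
    Sketch only: the real system prompt and history policy may differ."""
    system_prompt = (
        "You are KULVEX. "  # identity/capabilities text is an assumption
        f"Current date/time: {datetime.now(timezone.utc).isoformat()}"
    )
    return (
        [{"role": "system", "content": system_prompt}]
        + history[-10:]  # recent messages for continuity (window size assumed)
        + [{"role": "user", "content": message}]
    )

msgs = build_context("What is RAG?", [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "Hello!"},
])
# First entry is the system prompt, last is the current user query.
```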
```python
if use_rag:
    messages = await build_context_with_rag(message, history, project=project)
else:
    messages = build_context(message, history)
```

## Health Checks
```shell
# Check if llama-server is healthy
curl http://localhost:9100/api/ai/status

# Direct llama-server health (internal Docker network)
docker exec kulvex-llama curl http://localhost:8090/health
```
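In automation (startup scripts, CI) it can help to block until the server reports healthy. A hedged sketch with the probe injected as a callable, so the polling logic is independent of the endpoint; the timeout and interval defaults are assumptions:

```python
import time

def wait_healthy(check, timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `check()` until it returns True or the timeout elapses.
    `check` would typically issue an HTTP GET against /health and
    return True on a 200 response. Sketch only; a Docker healthcheck
    may serve the same purpose in the real deployment."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if check():
                return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False

# Usage against the assumed endpoint:
#   import urllib.request
#   ok = wait_healthy(lambda: urllib.request.urlopen(
#       "http://localhost:8090/health", timeout=2).status == 200)
```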