Mnemo (AI Engine)

Inference Engine

Mnemo runs via llama.cpp’s llama-server, exposed as an OpenAI-compatible HTTP API inside a Docker container.
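Because the server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it directly. A minimal sketch (the port matches the default below; the `build_chat_request` helper is illustrative, not part of Mnemo):

```python
import json
import urllib.request

# llama-server's OpenAI-compatible endpoint (port 8090 is the default below).
LLAMA_URL = "http://localhost:8090/v1/chat/completions"

def build_chat_request(messages, stream=True, max_tokens=512):
    """Build the JSON body for an OpenAI-style chat completion call."""
    return {"messages": messages, "stream": stream, "max_tokens": max_tokens}

body = build_chat_request([{"role": "user", "content": "Hello, Mnemo!"}])

# Sending it (requires the container to be running):
# req = urllib.request.Request(LLAMA_URL, data=json.dumps(body).encode(),
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```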

Architecture

Chat UI / API
        ↓
Model Router (selects model by task_type)
        ↓
llama_router.chat_stream()
        ↓
llama-server (Docker, CUDA GPU)
  ├── Port 8090 (chat model)
  └── Port 8091 (code model, dual-GPU only)
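The routing step above can be sketched as a simple lookup. This is a hypothetical illustration; the actual router's names and fallback policy are not shown in this doc:

```python
# Map each task_type to the llama-server instance that should handle it
# (ports from the architecture diagram; names are assumptions).
MODEL_ENDPOINTS = {
    "chat": "http://localhost:8090",  # chat model
    "code": "http://localhost:8091",  # code model (dual-GPU only)
}

def select_endpoint(task_type: str) -> str:
    # Fall back to the chat model for unknown task types.
    return MODEL_ENDPOINTS.get(task_type, MODEL_ENDPOINTS["chat"])
```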

Configuration

The llama-server is configured via environment variables in the Docker entrypoint:

Variable              Default                    Description
LLAMA_MODEL_PATH      /models/mnemo-chat.gguf    GGUF model file
LLAMA_PORT            8090                       HTTP server port
LLAMA_GPU_LAYERS      999                        Layers to offload to GPU (999 = all)
LLAMA_CTX_SIZE        8192                       Context window size
LLAMA_PARALLEL        4                          Concurrent request slots
LLAMA_FLASH_ATTN      true                       Enable flash attention
LLAMA_KV_CACHE_TYPE   q8_0                       KV cache quantization
LLAMA_EXTRA_ARGS      (empty)                    Additional llama-server flags
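A sketch of how an entrypoint might translate these environment variables into llama-server flags. The flag names (`--model`, `--n-gpu-layers`, `--cache-type-k`, etc.) are llama.cpp's; the mapping function itself is an assumption about the entrypoint, not taken from this doc:

```python
import os

def build_llama_cmd(env=os.environ):
    """Translate the LLAMA_* environment variables into a llama-server command."""
    cmd = [
        "llama-server",
        "--model", env.get("LLAMA_MODEL_PATH", "/models/mnemo-chat.gguf"),
        "--port", env.get("LLAMA_PORT", "8090"),
        "--n-gpu-layers", env.get("LLAMA_GPU_LAYERS", "999"),
        "--ctx-size", env.get("LLAMA_CTX_SIZE", "8192"),
        "--parallel", env.get("LLAMA_PARALLEL", "4"),
    ]
    if env.get("LLAMA_FLASH_ATTN", "true") == "true":
        cmd.append("--flash-attn")
    kv = env.get("LLAMA_KV_CACHE_TYPE", "q8_0")
    cmd += ["--cache-type-k", kv, "--cache-type-v", kv]
    cmd += env.get("LLAMA_EXTRA_ARGS", "").split()  # empty default adds nothing
    return cmd
```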

Streaming

All inference is streaming. The llama_router yields tokens as they’re generated:

async for token in llama_router.chat_stream(model=model, messages=messages):
    await sio.emit("chat:token", {"token": token}, to=sid)

The frontend receives tokens via Socket.IO and renders them incrementally.
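Under the hood, chat_stream would be consuming llama-server's SSE stream, where each `data:` line carries an OpenAI-style delta. A sketch of just the parsing step (the helper name is an assumption; chat_stream's internals are not shown here):

```python
import json

def parse_sse_line(line: str):
    """Extract the token from one llama-server SSE line, or None."""
    if not line.startswith("data: "):
        return None          # comments, keep-alives, blank lines
    payload = line[len("data: "):].strip()
    if payload == "[DONE]":
        return None          # end-of-stream sentinel
    delta = json.loads(payload)["choices"][0].get("delta", {})
    return delta.get("content")
```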

Context Building

Before inference, KULVEX assembles the prompt context from four layers:

  1. System prompt — KULVEX identity, capabilities, current date/time
  2. RAG context — Relevant knowledge from ChromaDB vector store
  3. Conversation history — Recent messages for continuity
  4. User message — The current query

if use_rag:
    messages = await build_context_with_rag(message, history, project=project)
else:
    messages = build_context(message, history)
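A hypothetical sketch of the non-RAG path — build_context's real signature is not shown in this doc; this just assembles the four layers above into an OpenAI-style messages list:

```python
from datetime import datetime, timezone

def build_context(message, history, rag_chunks=None):
    """Assemble system prompt, optional RAG context, history, and the user turn."""
    system = ("You are KULVEX.\n"                      # layer 1: identity
              f"Current date/time: {datetime.now(timezone.utc).isoformat()}")
    if rag_chunks:  # layer 2: knowledge retrieved from the vector store
        system += "\n\nRelevant knowledge:\n" + "\n".join(rag_chunks)
    return [{"role": "system", "content": system},
            *history,                                  # layer 3: recent messages
            {"role": "user", "content": message}]      # layer 4: current query
```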

Health Checks

# Check if llama-server is healthy
curl http://localhost:9100/api/ai/status
 
# Direct llama-server health (internal Docker network)
docker exec kulvex-llama curl http://localhost:8090/health