# Mnemo — AI Engine

Mnemo is KULVEX's local AI inference system. It runs abliterated (uncensored) language models on your GPU via llama.cpp.
## Architecture

```
User Message
│
├── Local path: Mnemo (llama.cpp on GPU)
│   └── Ollama-compatible API → streaming response
│
└── Cloud path: Claude API (Anthropic)
    └── Native tool_use → streaming response
```

The user toggles between local and cloud mode in the chat UI. Both paths support:
- Streaming token-by-token responses
- Tool use (17 domain agents)
- Conversation history with context building
- RAG (Retrieval-Augmented Generation) from knowledge base
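As an illustration of how a turn is assembled on either path, here is a minimal Python sketch of building a streaming chat request from conversation history. The function name, default system prompt, and payload shape are assumptions (the shape follows the OpenAI-style chat API that llama.cpp's server exposes), not KULVEX's actual code:

```python
def build_chat_request(history, user_message, system_prompt="You are Mnemo."):
    """Assemble an OpenAI-style chat payload for the local llama.cpp server.

    `history` is a list of (role, content) tuples from earlier turns.
    `stream=True` asks the server for token-by-token responses.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": role, "content": content} for role, content in history]
    messages.append({"role": "user", "content": user_message})
    return {
        "model": "mnemo",  # branded name; the server maps it to the real model
        "messages": messages,
        "stream": True,
    }
```

The same payload structure works for both paths; only the endpoint and authentication differ between the local server and the Claude API.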
## Mnemo Branding

Users never see base model names. All models are branded as `mnemo`:
| Internal Name | User Sees |
|---|---|
| Qwen 3.5 27B abliterated | mnemo |
| GLM-4.7 Flash 30B abliterated | mnemo:code |
| GLM-OCR 0.9B | mnemo:scanner |
| Whisper large-v3 | mnemo:voice |
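The branding layer amounts to a lookup from internal model identifiers to user-facing names. A sketch, with illustrative internal identifiers (the real mapping mechanism and keys may differ):

```python
# Hypothetical internal identifiers; the user-facing names match the table above.
BRAND_MAP = {
    "qwen3.5-27b-abliterated": "mnemo",
    "glm-4.7-flash-30b-abliterated": "mnemo:code",
    "glm-ocr-0.9b": "mnemo:scanner",
    "whisper-large-v3": "mnemo:voice",
}

def branded_name(internal_name: str) -> str:
    """Return the user-facing model name, falling back to plain 'mnemo'."""
    return BRAND_MAP.get(internal_name, "mnemo")
```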
## Why Abliterated?
Standard LLMs have built-in content filters that refuse certain requests. Abliterated models have these refusal mechanisms removed, giving you an AI that:
- Answers any question without moralizing
- Doesn’t lecture about safety or ethics
- Follows your instructions directly
- Acts as a tool, not a nanny
All models in the KULVEX catalog are pre-abliterated from trusted community sources (huihui-ai, mlabonne, mradermacher).
## Key Components
- Model Catalog — Curated database of abliterated models with GGUF quantizations
- Model Selector — Automatic hardware-aware model selection
- llama-server — llama.cpp HTTP server (Docker container with CUDA)
- Model Router — Routes requests to the right model based on task type
- Context Builder — Constructs conversation context with system prompts and RAG
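To make the Model Router and Model Selector concrete, here is a hedged sketch of hardware-aware selection: route by task type to a branded model, then pick the largest GGUF quantization that fits in available VRAM. The task table, quant sizes, and headroom value are illustrative assumptions, not the actual catalog data:

```python
# Illustrative GGUF quant sizes in GiB for a ~27B model; real values
# come from the model catalog.
QUANT_SIZES_GIB = {"Q8_0": 28.0, "Q6_K": 22.0, "Q4_K_M": 16.0, "Q3_K_M": 13.0}

# Hypothetical router table: task type -> branded model.
TASK_TO_MODEL = {
    "chat": "mnemo",
    "code": "mnemo:code",
    "ocr": "mnemo:scanner",
    "speech": "mnemo:voice",
}

def select_model(task: str, vram_gib: float, headroom_gib: float = 2.0):
    """Route by task type, then pick the largest quant that fits in VRAM."""
    model = TASK_TO_MODEL.get(task, "mnemo")
    budget = vram_gib - headroom_gib  # reserve space for KV cache, etc.
    fitting = [(size, quant) for quant, size in QUANT_SIZES_GIB.items()
               if size <= budget]
    if not fitting:
        raise RuntimeError("no quantization fits in available VRAM")
    _, quant = max(fitting)  # largest quant that still fits
    return model, quant
```

The design choice here is to trade quality for fit automatically: a 24 GiB card gets a higher-precision quant than a 16 GiB card, with no user intervention.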