# Mnemo — AI Engine

Mnemo is KULVEX's local AI inference system. It runs abliterated (uncensored) language models on your GPU via llama.cpp.
## Architecture

```
User Message
│
├── Local path: Mnemo (llama.cpp on GPU)
│   └── Ollama-compatible API → streaming response
│
└── Cloud path: Claude API (Anthropic)
    └── Native tool_use → streaming response
```

The user toggles between local and cloud mode in the chat UI. Both paths support:
- Streaming token-by-token responses
- Tool use (17 domain agents)
- Conversation history with context building
- RAG (Retrieval-Augmented Generation) from knowledge base
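As an illustration of how a turn is assembled on either path, here is a minimal Python sketch of building a streaming chat request from conversation history. The function name, default system prompt, and payload shape are assumptions (the shape follows the OpenAI-style chat API that llama.cpp's server exposes), not KULVEX's actual code:

```python
def build_chat_request(history, user_message, system_prompt="You are Mnemo."):
    """Assemble an OpenAI-style chat payload for the local llama.cpp server.

    `history` is a list of (role, content) tuples from earlier turns.
    `stream=True` asks the server for token-by-token responses.
    """
    messages = [{"role": "system", "content": system_prompt}]
    messages += [{"role": role, "content": content} for role, content in history]
    messages.append({"role": "user", "content": user_message})
    return {
        "model": "mnemo",  # branded name; the server maps it to the real model
        "messages": messages,
        "stream": True,
    }
```

The same payload structure works for both paths; only the endpoint and authentication differ between the local server and the Claude API.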
## Mnemo Branding

Users never see base model names. All models are branded as `mnemo`:
| Internal Name | User Sees |
|---|---|
| Qwen 3.5 27B abliterated | mnemo |
| GLM-4.7 Flash 30B abliterated | mnemo:code |
| GLM-OCR 0.9B | mnemo:scanner |
| Whisper large-v3 | mnemo:voice |
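The branding layer amounts to a lookup from internal model identifiers to user-facing names. A sketch, with illustrative internal identifiers (the real mapping mechanism and keys may differ):

```python
# Hypothetical internal identifiers; the user-facing names match the table above.
BRAND_MAP = {
    "qwen3.5-27b-abliterated": "mnemo",
    "glm-4.7-flash-30b-abliterated": "mnemo:code",
    "glm-ocr-0.9b": "mnemo:scanner",
    "whisper-large-v3": "mnemo:voice",
}

def branded_name(internal_name: str) -> str:
    """Return the user-facing model name, falling back to plain 'mnemo'."""
    return BRAND_MAP.get(internal_name, "mnemo")
```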
## Why Abliterated?
Standard LLMs have built-in content filters that refuse certain requests. Abliterated models have these refusal mechanisms removed, giving you an AI that:
- Answers any question without moralizing
- Doesn’t lecture about safety or ethics
- Follows your instructions directly
- Acts as a tool, not a nanny
All models in the KULVEX catalog are pre-abliterated from trusted community sources (huihui-ai, mlabonne, mradermacher).
## Key Components
- Model Catalog — Curated database of abliterated models with GGUF quantizations
- Model Selector — Automatic hardware-aware model selection
- llama-server — llama.cpp HTTP server (Docker container with CUDA)
- Model Router — Routes requests to the right model based on task type
- Context Builder — Constructs conversation context with system prompts and RAG
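To make the Model Router and Model Selector concrete, here is a hedged sketch of hardware-aware selection: route by task type to a branded model, then pick the largest GGUF quantization that fits in available VRAM. The task table, quant sizes, and headroom value are illustrative assumptions, not the actual catalog data:

```python
# Illustrative GGUF quant sizes in GiB for a ~27B model; real values
# come from the model catalog.
QUANT_SIZES_GIB = {"Q8_0": 28.0, "Q6_K": 22.0, "Q4_K_M": 16.0, "Q3_K_M": 13.0}

# Hypothetical router table: task type -> branded model.
TASK_TO_MODEL = {
    "chat": "mnemo",
    "code": "mnemo:code",
    "ocr": "mnemo:scanner",
    "speech": "mnemo:voice",
}

def select_model(task: str, vram_gib: float, headroom_gib: float = 2.0):
    """Route by task type, then pick the largest quant that fits in VRAM."""
    model = TASK_TO_MODEL.get(task, "mnemo")
    budget = vram_gib - headroom_gib  # reserve space for KV cache, etc.
    fitting = [(size, quant) for quant, size in QUANT_SIZES_GIB.items()
               if size <= budget]
    if not fitting:
        raise RuntimeError("no quantization fits in available VRAM")
    _, quant = max(fitting)  # largest quant that still fits
    return model, quant
```

The design choice here is to trade quality for fit automatically: a 24 GiB card gets a higher-precision quant than a 16 GiB card, with no user intervention.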