Models & Providers
KCode works with local models on your GPU and cloud providers via native APIs. You can mix both and switch instantly with /toggle.
Model Registry
# List all registered models
kcode models list
# Add a model
kcode models add <name> <base-url> [options]
# Set default model
kcode models default <name>
# Remove a model
kcode models rm <name>
# Use a specific model for one session
kcode --model <name>
# Or via environment variable
KCODE_MODEL=<name> kcode
Local Models (Mnemo)
KULVEX includes the Mnemo model family, optimized for coding tasks. The current generation is mark6, built on Google Gemma 4. Mark6 ships quantized variants of the same 31B base model so you can trade quality for VRAM:
Mark6 (Current — Gemma 4 base)
| VRAM | Model | Quant | Context | Use Case |
|---|---|---|---|---|
| 20 GB | mnemo:mark6-31b-q4 | Q4_K_M (~18 GB) | 64K | Single-GPU coding, fast iteration |
| 24 GB | mnemo:mark6-31b-q6 | Q6_K (~25 GB) | 128K | Balanced quality/VRAM |
| 36+ GB | mnemo:mark6-31b-q8 | Q8_0 (~31 GB) | 256K | Maximum quality, dual-GPU tensor split |
All three variants share the same 31B Gemma 4 base and support vision (multimodal mmproj). Pick by how much VRAM you can spare; quality rises with quantization precision, from Q4 through Q6 to Q8.
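As a rough illustration of picking a variant from a VRAM budget, the sketch below encodes the thresholds from the table above exactly as written (20, 24, and 36 GB). `pick_mark6_variant` is a hypothetical helper, not part of KCode; `kcode setup` remains the supported way to get a recommendation.

```shell
# Suggest a mark6 variant for a VRAM budget given in whole GB.
# Thresholds mirror the table above; this helper is illustrative only.
pick_mark6_variant() {
  vram_gb=$1
  if [ "$vram_gb" -ge 36 ]; then
    echo "mnemo:mark6-31b-q8"
  elif [ "$vram_gb" -ge 24 ]; then
    echo "mnemo:mark6-31b-q6"
  elif [ "$vram_gb" -ge 20 ]; then
    echo "mnemo:mark6-31b-q4"
  else
    # Below the smallest tier listed in the table.
    echo "no mark6 variant listed for ${vram_gb} GB" >&2
    return 1
  fi
}
```

For example, `pick_mark6_variant 24` prints `mnemo:mark6-31b-q6`, and a budget under 20 GB returns a non-zero status.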
For dual-GPU setups (e.g. RTX 4090 + RTX 5090), tensor-split Q8 across both cards and push context to 256K:
llama-server \
-m mnemo-mark6-31b-gemma4-abliterated-Q8_0.gguf \
--mmproj mmproj-gemma-4-31B-it-f16.gguf \
--port 8090 \
-ngl 99 --tensor-split 22,32 \
-c 262144 -n 16384 -np 2 \
--alias mnemo:mark6-31b \
--flash-attn on \
--cache-type-k q4_0 --cache-type-v q4_0
Mark5 (Legacy)
The mark5 generation is still downloadable for users who want the multi-tier lineup (pico, nano, mini, mid, max, 80b). It predates mark6 and is kept online for backward compatibility, but new setups should start on mark6.
Mnemo-iOS (Apple Silicon / MLX)
For Macs and iOS devices, KULVEX ships a parallel mnemo-ios line packed for Apple’s MLX runtime and unified memory. The mark6 Apple Silicon build is coming soon; until then, mnemo-ios:mark5-* is the supported path on macOS.
# On macOS / Apple Silicon (current)
kcode --model mnemo-ios:mark5-max "refactor this module"
Downloading Mnemo Models
# Via KCode setup wizard (auto-detects your GPU and suggests the right tier)
kcode setup
# Or pull mark6 directly
curl -fsSL https://kulvex.ai/models/mnemo/mark6-31b-q4.gguf \
-o ~/kulvex-models/mnemo/mark6-31b-q4.gguf
curl -fsSL https://kulvex.ai/models/mnemo/mmproj-gemma-4-31B-it-f16.gguf \
-o ~/kulvex-models/mnemo/mmproj-gemma-4-31B-it-f16.gguf
Starting a Local Model
Models run via llama-server (included with KULVEX):
# Start llama-server with mark6 Q4 on a single GPU
llama-server -m ~/kulvex-models/mnemo/mark6-31b-q4.gguf \
--mmproj ~/kulvex-models/mnemo/mmproj-gemma-4-31B-it-f16.gguf \
--port 8090 \
--ctx-size 65536 \
--n-gpu-layers 99 \
--flash-attn on \
--alias mnemo:mark6-31b
# Register it in KCode
kcode models add mnemo:mark6-31b http://localhost:8090/v1 \
--context 65536 \
--gpu "RTX 4090" \
--default
Multi-GPU Setup
With two GPUs, the recommended setup is to tensor-split the Q8 variant across both cards for a single large-context session. This is the deployment the Kulvex team runs in-house:
# Dual GPU: tensor-split mark6 Q8 across both cards
llama-server \
-m ~/kulvex-models/mnemo/mark6-31b-q8.gguf \
--mmproj ~/kulvex-models/mnemo/mmproj-gemma-4-31B-it-f16.gguf \
--host 0.0.0.0 --port 8090 \
-ngl 99 --tensor-split 22,32 \
-c 262144 -n 16384 -np 2 \
--alias mnemo:mark6-31b \
--flash-attn on \
--cache-type-k q4_0 --cache-type-v q4_0
# Register in KCode
kcode models add mnemo:mark6-31b http://localhost:8090/v1 \
--context 262144 --gpu "RTX 4090 + RTX 5090" --default
If you prefer a two-model setup (one large model plus a fast one for quick questions), you can run a second instance on a different port with a Q4 variant:
CUDA_VISIBLE_DEVICES=0 llama-server -m mark6-31b-q4.gguf --port 8091 --ctx-size 65536 --n-gpu-layers 99
Apple Silicon (Mac)
For Macs with unified memory, the total RAM is your “VRAM”. Mark6 Gemma 4 on MLX is coming; for now, macOS users should run mark5 via llama.cpp with Metal or mnemo-ios:
| Mac | RAM | Current Recommendation |
|---|---|---|
| MacBook Air M1/M2 | 8–16 GB | mnemo-ios:mark5-nano or mark5-mini |
| MacBook Pro M3/M4 | 18–36 GB | mnemo-ios:mark5-mid or mark5-max |
| Mac Studio M2/M4 Ultra | 64–192 GB | mnemo:mark6-31b-q8 via llama.cpp Metal |
| Mac Pro M2/M4 Ultra | 192+ GB | mnemo:mark6-31b-q8 (full 256K context) |
# llama.cpp with Metal acceleration (mark6 on Apple Silicon with enough unified memory)
llama-server -m ~/kulvex-models/mnemo/mark6-31b-q8.gguf --port 8090 --ctx-size 65536 --n-gpu-layers 99
Ollama
If you prefer Ollama:
# Pull a model
ollama pull qwen2.5-coder:32b
# Ollama exposes an OpenAI-compatible API at port 11434
kcode models add qwen2.5-coder:32b http://localhost:11434/v1 \
--context 32768 \
--default
# Or run directly
kcode --model qwen2.5-coder:32b
Popular Ollama models for coding:
| Model | Size | Best For |
|---|---|---|
| qwen2.5-coder:7b | ~5 GB | Quick edits, 8 GB VRAM |
| qwen2.5-coder:14b | ~10 GB | General coding, 12 GB VRAM |
| qwen2.5-coder:32b | ~20 GB | Complex tasks, 24 GB VRAM |
| deepseek-coder-v2:16b | ~10 GB | Multi-language, 12 GB VRAM |
| codellama:34b | ~20 GB | Large codebases, 24 GB VRAM |
| llama3.1:70b | ~40 GB | Maximum local quality, 48 GB VRAM |
Cloud Providers
KCode supports cloud providers with native API integration — no proxies needed. Configure them interactively with /cloud or manually.
Interactive Setup (/cloud)
The easiest way to set up a cloud provider is from inside KCode:
> /cloud
This opens an interactive menu where you can:
- Select a provider (Anthropic, OpenAI, Google Gemini, Groq, DeepSeek, Together AI, xAI)
- Paste your API key (masked for security)
- Save to ~/.kcode/settings.json
The provider’s models are automatically registered and the active model switches immediately.
Anthropic (Claude) — Native API
KCode uses the native Anthropic Messages API (/v1/messages) — not an OpenAI-compatible proxy. This gives you:
- Full tool_use/tool_result support as content blocks
- Proper system prompt handling (top-level system field)
- Native SSE streaming with all event types
# Via /cloud (recommended)
> /cloud → Anthropic → paste your API key
# Or manually
export ANTHROPIC_API_KEY=sk-ant-...
kcode --model claude-sonnet-4-6
Available models: claude-sonnet-4-6, claude-opus-4-6, claude-haiku-4-5
OpenAI
# Via /cloud (recommended)
> /cloud → OpenAI → paste your API key
# Or manually
export OPENAI_API_KEY=sk-proj-...
kcode --model gpt-4o
Google Gemini
> /cloud → Google Gemini → paste your API key
# Or manually
export GEMINI_API_KEY=AIza...
kcode --model gemini-2.5-pro
Groq (Fast Inference)
> /cloud → Groq → paste your API key
# Or manually
export GROQ_API_KEY=gsk_...
kcode --model llama-3.3-70b
DeepSeek
> /cloud → DeepSeek → paste your API key
# Or manually
export DEEPSEEK_API_KEY=sk-...
kcode --model deepseek-chat
Together AI
> /cloud → Together AI → paste your API key
# Or manually
export TOGETHER_API_KEY=tok_...
kcode --model meta-llama/Llama-3.3-70B
xAI (Grok)
> /cloud → xAI (Grok) → paste your API key
# Or manually
export XAI_API_KEY=xai-...
kcode --model grok-4
Available models: grok-4, grok-4-latest, grok-4-fast-reasoning, grok-3, grok-3-mini. Uses the OpenAI-compatible endpoint at https://api.x.ai/v1, so KCode routes Grok requests through the same code path as GPT or DeepSeek.
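Because the endpoint is OpenAI-compatible, one way to sanity-check a base URL and key outside KCode is to request the model list directly. The sketch below assumes the server exposes the usual `/models` route under its `/v1` base; `models_endpoint` and `list_models` are hypothetical helper names, and `curl` is assumed to be installed.

```shell
# Build the model-listing URL for an OpenAI-compatible base URL,
# tolerating a single trailing slash on the base.
models_endpoint() {
  base=${1%/}
  echo "$base/models"
}

# Ask the server which models it serves. The API key is optional so the
# same helper also works for local servers that need no authentication.
list_models() {
  url=$(models_endpoint "$1")
  if [ -n "${2:-}" ]; then
    curl -fsS -H "Authorization: Bearer $2" "$url"
  else
    curl -fsS "$url"
  fi
}
```

Usage would look like `list_models https://api.x.ai/v1 "$XAI_API_KEY"` for a cloud provider, or `list_models http://localhost:8090/v1` for a local llama-server. A non-zero exit status from `curl` indicates the server was unreachable or rejected the request.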
Any OpenAI-Compatible Server
# vLLM
kcode models add my-model http://my-server:8000/v1 --context 32768
# LM Studio
kcode models add lm-studio http://localhost:1234/v1 --context 8192
# Text Generation Inference (TGI)
kcode models add tgi-model http://localhost:8080/v1 --context 16384
# OpenRouter
kcode models add openrouter https://openrouter.ai/api/v1 --context 128000
Switching Models Mid-Session (/toggle)
Inside an active KCode session, use /toggle to open the interactive model switcher:
> /toggle
This shows all registered models grouped by LOCAL and CLOUD:
- Navigate with arrow keys
- Current model shown with a green indicator
- Press Enter to switch, Esc to cancel
- Model, context window, and API key update instantly
You can also switch directly:
> /model mnemo:mark6-31b
> /switch
Aliases: /toggle, /model, /switch
API Key Management
API keys are resolved per-provider. Each provider has its own environment variable:
| Provider | Environment Variable |
|---|---|
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Google Gemini | GEMINI_API_KEY |
| Groq | GROQ_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Together AI | TOGETHER_API_KEY |
| xAI (Grok) | XAI_API_KEY |
| Generic | KCODE_API_KEY |
Keys can also be set via:
# 1. /cloud interactive menu (saves to ~/.kcode/settings.json)
> /cloud
# 2. CLI flag (session only)
kcode --api-key sk-... --model gpt-4o
# 3. Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
# 4. Settings file (~/.kcode/settings.json)
{
"anthropicApiKey": "sk-ant-...",
"apiKey": "sk-proj-..."
}
# 5. .env file (project root)
ANTHROPIC_API_KEY=sk-ant-...
For security, prefer environment variables or /cloud over manual config files. Never commit API keys.
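To see at a glance which provider keys are present in the current shell, a small check along these lines can help. It is only a sketch: the short provider labels (`anthropic`, `xai`, and so on) are made up for this example, treating unknown providers as the generic `KCODE_API_KEY` case is an assumption, and the check looks only at environment variables, not at ~/.kcode/settings.json or .env files.

```shell
# Map a provider label to its environment variable, following the
# per-provider table. The labels themselves are hypothetical.
provider_key_var() {
  case "$1" in
    anthropic) echo ANTHROPIC_API_KEY ;;
    openai)    echo OPENAI_API_KEY ;;
    gemini)    echo GEMINI_API_KEY ;;
    groq)      echo GROQ_API_KEY ;;
    deepseek)  echo DEEPSEEK_API_KEY ;;
    together)  echo TOGETHER_API_KEY ;;
    xai)       echo XAI_API_KEY ;;
    *)         echo KCODE_API_KEY ;;  # assumed fallback for generic servers
  esac
}

# Report whether the key for a provider is set, without printing its value.
provider_key_status() {
  var=$(provider_key_var "$1")
  eval "value=\${$var:-}"
  if [ -n "$value" ]; then
    echo "$1: $var is set"
  else
    echo "$1: $var is missing"
  fi
}
```

Running `provider_key_status anthropic` before starting a session reports the state of the variable without echoing the secret itself.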
Model Capabilities
When registering a model, you can specify capabilities:
kcode models add my-model http://localhost:10091/v1 \
--context 65536 \
--gpu "RTX 5090" \
--caps "code,vision,reasoning" \
--desc "My custom fine-tuned model"
Capabilities are informational and help you organize your models.
Troubleshooting
Model not responding
# Check if the model server is running
curl http://localhost:10091/v1/models
# Run diagnostics
kcode doctor
Empty responses
Some models may not respond to very short inputs. Try being more specific:
# Instead of:
kcode "hello"
# Try:
kcode "hello, explain which files are in this project"
Context window too small
If KCode hits the context limit frequently, use a model with a larger context window or enable auto-compaction:
# In session
> /auto-compact
# Or start with a bigger context model
kcode --model mnemo:mark6-31b-q8 # 256K context
GPU out of memory
Lower the context size or switch to a smaller model:
# Reduce context window
llama-server -m model.gguf --ctx-size 8192 # instead of 32768
# Or use a smaller quantization
# Q4_K_M uses less VRAM than Q5_K_M or Q6_K
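On a dual-GPU setup, it can also help to estimate how much of the model each card has to hold before adjusting `--tensor-split`. The sketch below assumes the values are treated as proportions and applies them to the approximate file size only; the KV cache and compute buffers need additional room on top of this. `split_share` is a hypothetical helper, and the 31 GB figure is the approximate Q8_0 size from the mark6 table.

```shell
# Split a weight size (in GB) across two GPUs in the ratio a:b.
# Arguments: <ratio for GPU 0> <ratio for GPU 1> <model size in GB>
split_share() {
  awk -v a="$1" -v b="$2" -v size="$3" 'BEGIN {
    printf "%.1f %.1f\n", size * a / (a + b), size * b / (a + b)
  }'
}

# The 22,32 split used earlier, applied to the ~31 GB Q8_0 file.
split_share 22 32 31   # prints "12.6 18.4"
```

If one card runs out of memory first, shifting the ratio toward the other card changes these shares accordingly.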