Models & Providers
KCode works with local models on your GPU and cloud providers via native APIs. You can mix both and switch instantly with /toggle.
Model Registry
```bash
# List all registered models
kcode models list

# Add a model
kcode models add <name> <base-url> [options]

# Set default model
kcode models default <name>

# Remove a model
kcode models rm <name>

# Use a specific model for one session
kcode --model <name>

# Or via environment variable
KCODE_MODEL=<name> kcode
```
Local Models (Mnemo)
KULVEX includes the Mnemo model family, optimized for coding tasks. Choose based on your VRAM:
Model Selection by VRAM
| VRAM | Recommended Model | Context | Use Case |
|---|---|---|---|
| 8 GB | mnemo:mark5-nano (~6 GB) | 8K | Quick edits, simple questions |
| 12 GB | mnemo:mark5-mini (~10 GB) | 16K | General coding, bug fixes |
| 16 GB | mnemo:mark5-mid (~14 GB) | 32K | Complex tasks, multi-file edits |
| 24 GB | mnemo:mark5-max (~20 GB) | 64K | Large codebases, deep analysis |
| 48+ GB | mnemo:mark5-80b (~45 GB) | 128K | Maximum quality, enterprise tasks |
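The table above maps cleanly to a small helper. This is an illustrative sketch, not part of KCode: the `pick_model` function is hypothetical, and the commented `nvidia-smi` query is one common way to detect VRAM on NVIDIA GPUs (it reports MiB).

```shell
# Pick a Mnemo model from available VRAM (in GB), following the table above.
pick_model() {
  local vram_gb=$1
  if   [ "$vram_gb" -ge 48 ]; then echo "mnemo:mark5-80b"
  elif [ "$vram_gb" -ge 24 ]; then echo "mnemo:mark5-max"
  elif [ "$vram_gb" -ge 16 ]; then echo "mnemo:mark5-mid"
  elif [ "$vram_gb" -ge 12 ]; then echo "mnemo:mark5-mini"
  else                             echo "mnemo:mark5-nano"
  fi
}

# On NVIDIA hardware you could detect VRAM like this:
#   vram_gb=$(( $(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1) / 1024 ))
pick_model 24   # prints "mnemo:mark5-max"
```

Note that `kcode setup` already performs this detection for you; the sketch only shows the logic behind the table.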
Downloading Mnemo Models
```bash
# Via KCode setup wizard (auto-detects your GPU)
kcode setup

# Or download directly
curl -fsSL https://kulvex.ai/models/mnemo/mark5-nano.gguf -o ~/.kulvex/models/mark5-nano.gguf
```
Starting a Local Model
Models run via llama-server (included with KULVEX):
```bash
# Start llama-server with your model
llama-server -m ~/.kulvex/models/mark5-max.gguf \
  --port 10091 \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --flash-attn

# Register it in KCode
kcode models add mnemo:mark5-max http://localhost:10091/v1 \
  --context 65536 \
  --gpu "RTX 4090" \
  --default
```
Multi-GPU Setup
If you have multiple GPUs, you can run different models on different ports:
```bash
# GPU 0: Large model for complex tasks
CUDA_VISIBLE_DEVICES=0 llama-server -m mark5-80b.gguf --port 10091 --ctx-size 131072 --n-gpu-layers 99

# GPU 1: Small model for quick tasks
CUDA_VISIBLE_DEVICES=1 llama-server -m mark5-nano.gguf --port 10092 --ctx-size 8192 --n-gpu-layers 99

# Register both
kcode models add mnemo:mark5-80b http://localhost:10091/v1 --context 131072 --gpu "GPU 0" --default
kcode models add mnemo:mark5-nano http://localhost:10092/v1 --context 8192 --gpu "GPU 1"

# Use the fast model for quick questions
kcode --model mnemo:mark5-nano "what does this error mean"

# Use the big model for complex tasks (default)
kcode "refactor the entire auth system"
```
Apple Silicon (Mac)
For Macs with unified memory, the total RAM is your “VRAM”:
| Mac | RAM | Recommended Model |
|---|---|---|
| MacBook Air M1/M2 | 8-16 GB | mnemo:mark5-nano or mark5-mini |
| MacBook Pro M3/M4 | 18-36 GB | mnemo:mark5-mid or mark5-max |
| Mac Studio M2/M4 Ultra | 64-192 GB | mnemo:mark5-80b |
| Mac Pro M2/M4 Ultra | 192+ GB | mnemo:mark5-80b (full context) |
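To see where your Mac lands in the table, total memory can be read with `sysctl` (it reports bytes); the `bytes_to_gb` helper below is just an illustrative conversion, not a KCode command:

```shell
# Convert a byte count (as reported by sysctl hw.memsize) to whole GB.
bytes_to_gb() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

# On a real Mac you would run:
#   bytes_to_gb "$(sysctl -n hw.memsize)"
bytes_to_gb 38654705664   # prints "36"
```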
On macOS, use MLX or llama.cpp with Metal:
```bash
# llama.cpp with Metal acceleration
llama-server -m mark5-max.gguf --port 10091 --ctx-size 65536 --n-gpu-layers 99
```
Ollama
If you prefer Ollama:
```bash
# Pull a model
ollama pull qwen2.5-coder:32b

# Ollama exposes an OpenAI-compatible API at port 11434
kcode models add qwen2.5-coder:32b http://localhost:11434/v1 \
  --context 32768 \
  --default

# Or run directly
kcode --model qwen2.5-coder:32b
```
Popular Ollama models for coding:
| Model | Size | Best For |
|---|---|---|
| qwen2.5-coder:7b | ~5 GB | Quick edits, 8 GB VRAM |
| qwen2.5-coder:14b | ~10 GB | General coding, 12 GB VRAM |
| qwen2.5-coder:32b | ~20 GB | Complex tasks, 24 GB VRAM |
| deepseek-coder-v2:16b | ~10 GB | Multi-language, 12 GB VRAM |
| codellama:34b | ~20 GB | Large codebases, 24 GB VRAM |
| llama3.1:70b | ~40 GB | Maximum local quality, 48 GB VRAM |
Cloud Providers
KCode supports cloud providers with native API integration — no proxies needed. Configure them interactively with /cloud or manually.
Interactive Setup (/cloud)
The easiest way to set up a cloud provider is from inside KCode:
```
> /cloud
```
This opens an interactive menu where you can:
- Select a provider (Anthropic, OpenAI, Gemini, Groq, DeepSeek, Together AI)
- Paste your API key (masked for security)
- Save to `~/.kcode/settings.json`
The provider’s models are automatically registered and the active model switches immediately.
Anthropic (Claude) — Native API
KCode uses the native Anthropic Messages API (`/v1/messages`), not an OpenAI-compatible proxy. This gives you:
- Full `tool_use`/`tool_result` support as content blocks
- Proper system prompt handling (top-level `system` field)
- Native SSE streaming with all event types
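To make the difference concrete, here is a minimal sketch of a Messages API request body. The shape (top-level `system`, `tool_use`/`tool_result` as content blocks) follows Anthropic's published API; the conversation content and the tool ID are invented for illustration:

```json
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 1024,
  "system": "You are a coding assistant.",
  "messages": [
    { "role": "user", "content": "Run the tests." },
    {
      "role": "assistant",
      "content": [
        { "type": "tool_use", "id": "toolu_01", "name": "bash", "input": { "command": "npm test" } }
      ]
    },
    {
      "role": "user",
      "content": [
        { "type": "tool_result", "tool_use_id": "toolu_01", "content": "All tests passed." }
      ]
    }
  ]
}
```

An OpenAI-compatible proxy would have to flatten these content blocks into plain text, which is exactly what the native integration avoids.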
```bash
# Via /cloud (recommended)
> /cloud → Anthropic → paste your API key

# Or manually
export ANTHROPIC_API_KEY=sk-ant-...
kcode --model claude-sonnet-4-6
```
Available models: `claude-sonnet-4-6`, `claude-opus-4-6`, `claude-haiku-4-5`
OpenAI
```bash
# Via /cloud (recommended)
> /cloud → OpenAI → paste your API key

# Or manually
export OPENAI_API_KEY=sk-proj-...
kcode --model gpt-4o
```
Google Gemini
```bash
> /cloud → Google Gemini → paste your API key

# Or manually
export GEMINI_API_KEY=AIza...
kcode --model gemini-2.5-pro
```
Groq (Fast Inference)
```bash
> /cloud → Groq → paste your API key

# Or manually
export GROQ_API_KEY=gsk_...
kcode --model llama-3.3-70b
```
DeepSeek
```bash
> /cloud → DeepSeek → paste your API key

# Or manually
export DEEPSEEK_API_KEY=sk-...
kcode --model deepseek-chat
```
Together AI
```bash
> /cloud → Together AI → paste your API key

# Or manually
export TOGETHER_API_KEY=tok_...
kcode --model meta-llama/Llama-3.3-70B
```
Any OpenAI-Compatible Server
```bash
# vLLM
kcode models add my-model http://my-server:8000/v1 --context 32768

# LM Studio
kcode models add lm-studio http://localhost:1234/v1 --context 8192

# Text Generation Inference (TGI)
kcode models add tgi-model http://localhost:8080/v1 --context 16384

# OpenRouter
kcode models add openrouter https://openrouter.ai/api/v1 --context 128000
```
Switching Models Mid-Session (/toggle)
Inside an active KCode session, use /toggle to open the interactive model switcher:
```
> /toggle
```
This shows all registered models grouped by LOCAL and CLOUD:
- Navigate with arrow keys
- Current model shown with a green indicator
- Press Enter to switch, Esc to cancel
- Model, context window, and API key update instantly
You can also switch directly:
```
> /model mnemo:mark5-80b
> /switch
```
Aliases: `/toggle`, `/model`, `/switch`
API Key Management
API keys are resolved per-provider. Each provider has its own environment variable:
| Provider | Environment Variable |
|---|---|
| Anthropic | ANTHROPIC_API_KEY |
| OpenAI | OPENAI_API_KEY |
| Google Gemini | GEMINI_API_KEY |
| Groq | GROQ_API_KEY |
| DeepSeek | DEEPSEEK_API_KEY |
| Together AI | TOGETHER_API_KEY |
| Generic | KCODE_API_KEY |
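A quick way to see which of these variables are set in your current shell before starting KCode. This is an illustrative bash snippet (it uses bash indirect expansion), not a KCode command:

```shell
# Report which provider API keys are set in the current environment.
check_keys() {
  local var
  for var in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY \
             GROQ_API_KEY DEEPSEEK_API_KEY TOGETHER_API_KEY KCODE_API_KEY; do
    if [ -n "${!var}" ]; then
      echo "$var: set"
    else
      echo "$var: not set"
    fi
  done
}

check_keys
```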
Keys can also be set via:
```bash
# 1. /cloud interactive menu (saves to ~/.kcode/settings.json)
> /cloud

# 2. CLI flag (session only)
kcode --api-key sk-... --model gpt-4o

# 3. Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
```

4. Settings file (`~/.kcode/settings.json`):

```json
{
  "anthropicApiKey": "sk-ant-...",
  "apiKey": "sk-proj-..."
}
```

5. `.env` file (project root):

```
ANTHROPIC_API_KEY=sk-ant-...
```

For security, prefer environment variables or /cloud over manual config files. Never commit API keys.
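If you do use a project-level `.env`, make sure git ignores it. This is standard git practice rather than anything KCode-specific:

```shell
# Add .env to .gitignore if it is not already listed (idempotent).
grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore
```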
Model Capabilities
When registering a model, you can specify capabilities:
```bash
kcode models add my-model http://localhost:10091/v1 \
  --context 65536 \
  --gpu "RTX 5090" \
  --caps "code,vision,reasoning" \
  --desc "My custom fine-tuned model"
```
Capabilities are informational and help you organize your models.
Troubleshooting
Model not responding
```bash
# Check if the model server is running
curl http://localhost:10091/v1/models

# Run diagnostics
kcode doctor
```
Empty responses
Some models may not respond to very short inputs. Try being more specific:
```bash
# Instead of:
kcode "hi"

# Try:
kcode "hi, explain which files are in this project"
```
Context window too small
If KCode hits the context limit frequently, use a model with a larger context window or enable auto-compaction:
```bash
# In session
> /auto-compact

# Or start with a bigger context model
kcode --model mnemo:mark5-max   # 64K context
```
GPU out of memory
Lower the context size or switch to a smaller model:
```bash
# Reduce context window
llama-server -m model.gguf --ctx-size 8192   # instead of 32768

# Or use a smaller quantization
# Q4_K_M uses less VRAM than Q5_K_M or Q6_K
```
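As a sanity check before starting a server, you can compare a model's approximate size (from the VRAM table above) against your card. The `fits_in_vram` helper and its ~2 GB headroom for the KV cache are a rough rule of thumb of ours, not a KCode rule; actual overhead grows with `--ctx-size`:

```shell
# Rough check: does a model fit in VRAM with headroom for the KV cache?
fits_in_vram() {
  local model_gb=$1 vram_gb=$2 headroom_gb=2
  if [ $(( model_gb + headroom_gb )) -le "$vram_gb" ]; then
    echo "fits"
  else
    echo "too big: lower --ctx-size or use a smaller quant"
  fi
}

fits_in_vram 20 24   # mark5-max (~20 GB) on a 24 GB card: prints "fits"
fits_in_vram 20 16
```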