Models & Providers

KCode works with local models on your GPU and cloud providers via native APIs. You can mix both and switch instantly with /toggle.

Model Registry

# List all registered models
kcode models list
 
# Add a model
kcode models add <name> <base-url> [options]
 
# Set default model
kcode models default <name>
 
# Remove a model
kcode models rm <name>
 
# Use a specific model for one session
kcode --model <name>
 
# Or via environment variable
KCODE_MODEL=<name> kcode

Local Models (Mnemo)

KULVEX includes the Mnemo model family, optimized for coding tasks. Choose based on your VRAM:

Model Selection by VRAM

VRAM     Recommended Model            Context  Use Case
8 GB     mnemo:mark5-nano (~6 GB)     8K       Quick edits, simple questions
12 GB    mnemo:mark5-mini (~10 GB)    16K      General coding, bug fixes
16 GB    mnemo:mark5-mid (~14 GB)     32K      Complex tasks, multi-file edits
24 GB    mnemo:mark5-max (~20 GB)     64K      Large codebases, deep analysis
48+ GB   mnemo:mark5-80b (~45 GB)     128K     Maximum quality, enterprise tasks
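
The VRAM thresholds above can be sketched as a simple lookup. This is an illustrative helper (the function name and structure are hypothetical, not part of KCode), mirroring the table row for row:

```python
# Thresholds mirror the VRAM table above: (minimum VRAM in GB, model name).
MNEMO_BY_VRAM = [
    (8,  "mnemo:mark5-nano"),   # ~6 GB,  8K context
    (12, "mnemo:mark5-mini"),   # ~10 GB, 16K context
    (16, "mnemo:mark5-mid"),    # ~14 GB, 32K context
    (24, "mnemo:mark5-max"),    # ~20 GB, 64K context
    (48, "mnemo:mark5-80b"),    # ~45 GB, 128K context
]

def recommend_model(vram_gb: float) -> str:
    """Return the largest Mnemo model whose VRAM threshold fits."""
    choice = None
    for threshold, model in MNEMO_BY_VRAM:
        if vram_gb >= threshold:
            choice = model
    if choice is None:
        raise ValueError("Need at least 8 GB of VRAM for mnemo:mark5-nano")
    return choice

print(recommend_model(24))  # → mnemo:mark5-max
```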

Downloading Mnemo Models

# Via KCode setup wizard (auto-detects your GPU)
kcode setup
 
# Or download directly (create the models directory first if needed)
mkdir -p ~/.kulvex/models
curl -fsSL https://kulvex.ai/models/mnemo/mark5-nano.gguf -o ~/.kulvex/models/mark5-nano.gguf

Starting a Local Model

Models run via llama-server (included with KULVEX):

# Start llama-server with your model
llama-server -m ~/.kulvex/models/mark5-max.gguf \
  --port 10091 \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --flash-attn
 
# Register it in KCode
kcode models add mnemo:mark5-max http://localhost:10091/v1 \
  --context 65536 \
  --gpu "RTX 4090" \
  --default

Multi-GPU Setup

If you have multiple GPUs, you can run different models on different ports:

# GPU 0: Large model for complex tasks
CUDA_VISIBLE_DEVICES=0 llama-server -m mark5-80b.gguf --port 10091 --ctx-size 131072 --n-gpu-layers 99
 
# GPU 1: Small model for quick tasks
CUDA_VISIBLE_DEVICES=1 llama-server -m mark5-nano.gguf --port 10092 --ctx-size 8192 --n-gpu-layers 99
 
# Register both
kcode models add mnemo:mark5-80b http://localhost:10091/v1 --context 131072 --gpu "GPU 0" --default
kcode models add mnemo:mark5-nano http://localhost:10092/v1 --context 8192 --gpu "GPU 1"
 
# Use the fast model for quick questions
kcode --model mnemo:mark5-nano "what does this error mean"
 
# Use the big model for complex tasks (default)
kcode "refactor the entire auth system"

Apple Silicon (Mac)

For Macs with unified memory, the total RAM is your “VRAM”:

Mac                      RAM         Recommended Model
MacBook Air M1/M2        8-16 GB     mnemo:mark5-nano or mark5-mini
MacBook Pro M3/M4        18-36 GB    mnemo:mark5-mid or mark5-max
Mac Studio M2/M4 Ultra   64-192 GB   mnemo:mark5-80b
Mac Pro M2/M4 Ultra      192+ GB     mnemo:mark5-80b (full context)

On macOS, use MLX or llama.cpp with Metal:

# llama.cpp with Metal acceleration
llama-server -m mark5-max.gguf --port 10091 --ctx-size 65536 --n-gpu-layers 99

Ollama

If you prefer Ollama:

# Pull a model
ollama pull qwen2.5-coder:32b
 
# Ollama exposes an OpenAI-compatible API at port 11434
kcode models add qwen2.5-coder:32b http://localhost:11434/v1 \
  --context 32768 \
  --default
 
# Or run directly
kcode --model qwen2.5-coder:32b

Popular Ollama models for coding:

Model                   Size     Best For
qwen2.5-coder:7b        ~5 GB    Quick edits, 8 GB VRAM
qwen2.5-coder:14b       ~10 GB   General coding, 12 GB VRAM
qwen2.5-coder:32b       ~20 GB   Complex tasks, 24 GB VRAM
deepseek-coder-v2:16b   ~10 GB   Multi-language, 12 GB VRAM
codellama:34b           ~20 GB   Large codebases, 24 GB VRAM
llama3.1:70b            ~40 GB   Maximum local quality, 48 GB VRAM

Cloud Providers

KCode supports cloud providers with native API integration — no proxies needed. Configure them interactively with /cloud or manually.

Interactive Setup (/cloud)

The easiest way to set up a cloud provider is from inside KCode:

> /cloud

This opens an interactive menu where you can:

  1. Select a provider (Anthropic, OpenAI, Gemini, Groq, DeepSeek, Together AI)
  2. Paste your API key (masked for security)
  3. Save to ~/.kcode/settings.json

The provider’s models are automatically registered and the active model switches immediately.

Anthropic (Claude) — Native API

KCode uses the native Anthropic Messages API (/v1/messages) — not an OpenAI-compatible proxy. This gives you:

  • Full tool_use / tool_result support as content blocks
  • Proper system prompt handling (top-level system field)
  • Native SSE streaming with all event types

# Via /cloud (recommended)
> /cloud → Anthropic → paste your API key
 
# Or manually
export ANTHROPIC_API_KEY=sk-ant-...
kcode --model claude-sonnet-4-6

Available models: claude-sonnet-4-6, claude-opus-4-6, claude-haiku-4-5
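
The native request shape differs from the OpenAI chat format. As a sketch, a /v1/messages body carrying a tool round-trip looks roughly like this (the tool id, tool name, and conversation content are illustrative):

```python
import json

# Anthropic Messages API shape: the system prompt is a top-level field,
# and tool calls/results travel as typed content blocks inside ordinary
# assistant/user messages — there is no separate "tool" role.
payload = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a coding assistant.",
    "messages": [
        {"role": "user", "content": "List the files in src/"},
        {
            "role": "assistant",
            "content": [
                {"type": "tool_use", "id": "toolu_01", "name": "bash",
                 "input": {"command": "ls src/"}},
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": "toolu_01",
                 "content": "main.rs\nlib.rs"},
            ],
        },
    ],
}

print(json.dumps(payload, indent=2))
```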

OpenAI

# Via /cloud (recommended)
> /cloud → OpenAI → paste your API key
 
# Or manually
export OPENAI_API_KEY=sk-proj-...
kcode --model gpt-4o

Google Gemini

> /cloud → Google Gemini → paste your API key
 
# Or manually
export GEMINI_API_KEY=AIza...
kcode --model gemini-2.5-pro

Groq (Fast Inference)

> /cloud → Groq → paste your API key
 
# Or manually
export GROQ_API_KEY=gsk_...
kcode --model llama-3.3-70b

DeepSeek

> /cloud → DeepSeek → paste your API key
 
# Or manually
export DEEPSEEK_API_KEY=sk-...
kcode --model deepseek-chat

Together AI

> /cloud → Together AI → paste your API key
 
# Or manually
export TOGETHER_API_KEY=tok_...
kcode --model meta-llama/Llama-3.3-70B

Any OpenAI-Compatible Server

# vLLM
kcode models add my-model http://my-server:8000/v1 --context 32768
 
# LM Studio
kcode models add lm-studio http://localhost:1234/v1 --context 8192
 
# Text Generation Inference (TGI)
kcode models add tgi-model http://localhost:8080/v1 --context 16384
 
# OpenRouter
kcode models add openrouter https://openrouter.ai/api/v1 --context 128000
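
All of these servers speak the same chat-completions protocol, which is why one `kcode models add` command covers them. As a sketch, this builds (without sending) the kind of request a client would POST to any of them; the base URL, model name, and placeholder key are illustrative:

```python
import json
import urllib.request

# Any OpenAI-compatible server accepts this endpoint and body shape.
base_url = "http://localhost:8000/v1"  # e.g. a local vLLM instance
body = {
    "model": "my-model",
    "messages": [{"role": "user", "content": "Explain this stack trace"}],
    "stream": True,  # interactive clients typically stream tokens
}
req = urllib.request.Request(
    f"{base_url}/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer sk-placeholder"},
    method="POST",
)
print(req.full_url)  # → http://localhost:8000/v1/chat/completions
```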

Switching Models Mid-Session (/toggle)

Inside an active KCode session, use /toggle to open the interactive model switcher:

> /toggle

This shows all registered models grouped by LOCAL and CLOUD:

  • Navigate with arrow keys
  • Current model shown with a green indicator
  • Press Enter to switch, Esc to cancel
  • Model, context window, and API key update instantly

You can also switch directly:

> /model mnemo:mark5-80b
> /switch

Aliases: /toggle, /model, /switch

API Key Management

API keys are resolved per-provider. Each provider has its own environment variable:

Provider        Environment Variable
Anthropic       ANTHROPIC_API_KEY
OpenAI          OPENAI_API_KEY
Google Gemini   GEMINI_API_KEY
Groq            GROQ_API_KEY
DeepSeek        DEEPSEEK_API_KEY
Together AI     TOGETHER_API_KEY
Generic         KCODE_API_KEY
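
Per-provider resolution can be sketched like this. The exact fallback behavior is an assumption (the table only defines the variable names, with KCODE_API_KEY as the generic entry):

```python
import os

# Provider-specific environment variables from the table above.
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "groq": "GROQ_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
    "together": "TOGETHER_API_KEY",
}

def resolve_api_key(provider: str):
    """Try the provider-specific variable, then the generic fallback."""
    specific = os.environ.get(ENV_VARS.get(provider, ""))
    return specific or os.environ.get("KCODE_API_KEY")

os.environ["GROQ_API_KEY"] = "gsk_demo"
print(resolve_api_key("groq"))  # → gsk_demo
```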

Keys can also be set via:

# 1. /cloud interactive menu (saves to ~/.kcode/settings.json)
> /cloud
 
# 2. CLI flag (session only)
kcode --api-key sk-... --model gpt-4o
 
# 3. Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
 
# 4. Settings file (~/.kcode/settings.json)
{
  "anthropicApiKey": "sk-ant-...",
  "apiKey": "sk-proj-..."
}
 
# 5. .env file (project root)
ANTHROPIC_API_KEY=sk-ant-...

For security, prefer environment variables or /cloud over manual config files. Never commit API keys.

Model Capabilities

When registering a model, you can specify capabilities:

kcode models add my-model http://localhost:10091/v1 \
  --context 65536 \
  --gpu "RTX 5090" \
  --caps "code,vision,reasoning" \
  --desc "My custom fine-tuned model"

Capabilities are informational and help you organize your models.

Troubleshooting

Model not responding

# Check if the model server is running
curl http://localhost:10091/v1/models
 
# Run diagnostics
kcode doctor

Empty responses

Some models may not respond to very short inputs. Try being more specific:

# Instead of:
kcode "hi"
 
# Try:
kcode "hi, explain what files are in this project"

Context window too small

If KCode hits the context limit frequently, use a model with a larger context window or enable auto-compaction:

# In session
> /auto-compact
 
# Or start with a bigger context model
kcode --model mnemo:mark5-max  # 64K context
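
A rough way to judge whether a file will fit before hitting the limit is the common ~4 characters per token heuristic. The ratio varies by model and tokenizer, so treat this sketch as an estimate only:

```python
def fits_context(text: str, context_tokens: int, reserve: int = 2048) -> bool:
    """Rough fit check using the ~4 chars/token heuristic, reserving
    room for the system prompt and the model's reply."""
    est_tokens = len(text) / 4
    return est_tokens <= context_tokens - reserve

# A 200 KB source file against mark5-max's 64K window:
print(fits_context("x" * 200_000, 65_536))  # → True
```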

GPU out of memory

Lower the context size or switch to a smaller model:

# Reduce context window
llama-server -m model.gguf --ctx-size 8192  # instead of 32768
 
# Or use a smaller quantization
# Q4_K_M uses less VRAM than Q5_K_M or Q6_K
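
To see why quantization helps, the weights-only footprint scales with bits per weight. The figures below are approximate llama.cpp averages (exact sizes vary by architecture), and note the KV cache and activations add more on top, growing with --ctx-size:

```python
# Approximate average bits-per-weight for common llama.cpp quantizations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.59}

def weights_gb(params_billions: float, quant: str) -> float:
    """Rough size of the model weights alone, in GB."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# Footprint of a 32B-parameter model at each quantization level:
for q in BITS_PER_WEIGHT:
    print(f"{q}: {weights_gb(32, q):.1f} GB")
```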