Models & Providers

KCode works with local models on your GPU and cloud providers via native APIs. You can mix both and switch instantly with /toggle.

Model Registry

# List all registered models
kcode models list
 
# Add a model
kcode models add <name> <base-url> [options]
 
# Set default model
kcode models default <name>
 
# Remove a model
kcode models rm <name>
 
# Use a specific model for one session
kcode --model <name>
 
# Or via environment variable
KCODE_MODEL=<name> kcode

Local Models (Mnemo)

KULVEX includes the Mnemo model family, optimized for coding tasks. The current generation is mark6, built on Google Gemma 4. Mark6 ships quantized variants of the same 31B base model so you can trade quality for VRAM:

Mark6 (Current — Gemma 4 base)

VRAM	Model	Quant	Context	Use Case
20 GB	`mnemo:mark6-31b-q4`	Q4_K_M (~18 GB)	64K	Single-GPU coding, fast iteration
24 GB	`mnemo:mark6-31b-q6`	Q6_K (~25 GB)	128K	Balanced quality/VRAM
36+ GB	`mnemo:mark6-31b-q8`	Q8_0 (~31 GB)	256K	Maximum quality, dual-GPU tensor split

All three variants share the same 31B Gemma 4 base and support vision (multimodal mmproj). Pick by how much VRAM you want to spend; output quality improves as quantization becomes less aggressive (Q4_K_M < Q6_K < Q8_0).
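
The table can be read as a simple lookup from VRAM to variant. The sketch below assumes each VRAM figure is a minimum and that you pass whole gigabytes; `pick_mark6_variant` is a hypothetical helper, not a KCode command.

```shell
# Map available VRAM (whole GB) to a mark6 variant, following the table above.
# Assumption: each VRAM value in the table is treated as a lower bound.
pick_mark6_variant() {
  if [ "$1" -ge 36 ]; then
    echo "mnemo:mark6-31b-q8"
  elif [ "$1" -ge 24 ]; then
    echo "mnemo:mark6-31b-q6"
  elif [ "$1" -ge 20 ]; then
    echo "mnemo:mark6-31b-q4"
  else
    echo "no mark6 variant fits in $1 GB" >&2
    return 1
  fi
}

pick_mark6_variant 24   # prints mnemo:mark6-31b-q6
```

For an automatic recommendation, `kcode setup` detects your GPU for you.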

For dual-GPU setups (e.g. RTX 4090 + RTX 5090), tensor-split Q8 across both cards and push context to 256K:

llama-server \
  -m mnemo-mark6-31b-gemma4-abliterated-Q8_0.gguf \
  --mmproj mmproj-gemma-4-31B-it-f16.gguf \
  --port 8090 \
  -ngl 99 --tensor-split 22,32 \
  -c 262144 -n 16384 -np 2 \
  --alias mnemo:mark6-31b \
  --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0

Mark5 (Legacy)

The mark5 generation is still downloadable for users who want the multi-tier lineup (pico, nano, mini, mid, max, 80b). It predates mark6 and is kept online for backward compatibility, but new setups should start on mark6.

Mnemo-iOS (Apple Silicon / MLX)

For Macs and iOS devices, KULVEX ships a parallel mnemo-ios line packed for Apple’s MLX runtime and unified memory. The mark6 Apple Silicon build is coming soon; until then, mnemo-ios:mark5-* is the supported path on macOS.

# On macOS / Apple Silicon (current)
kcode --model mnemo-ios:mark5-max "refactor this module"

Downloading Mnemo Models

# Via KCode setup wizard (auto-detects your GPU and suggests the right tier)
kcode setup
 
# Or pull mark6 directly
curl -fsSL https://kulvex.ai/models/mnemo/mark6-31b-q4.gguf \
  -o ~/kulvex-models/mnemo/mark6-31b-q4.gguf
curl -fsSL https://kulvex.ai/models/mnemo/mmproj-gemma-4-31B-it-f16.gguf \
  -o ~/kulvex-models/mnemo/mmproj-gemma-4-31B-it-f16.gguf
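
Large GGUF downloads are often interrupted, so a resumable wrapper around the two `curl` calls can help. This is a sketch under assumptions: `fetch_mark6` and the `KULVEX_MODELS_DIR` variable are hypothetical, and the q6/q8 URLs are guessed from the q4 naming pattern shown above. `DRY_RUN=1` prints the commands without downloading.

```shell
# Hypothetical helper: fetch one mark6 quant plus the shared mmproj file.
# The parenthesized body runs in a subshell, so variables stay local.
fetch_mark6() (
  quant=$1                                   # q4, q6, or q8 (q6/q8 URLs are assumed)
  base_url="https://kulvex.ai/models/mnemo"
  dest="${KULVEX_MODELS_DIR:-$HOME/kulvex-models/mnemo}"
  for file in "mark6-31b-${quant}.gguf" "mmproj-gemma-4-31B-it-f16.gguf"; do
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "curl -fL -C - --create-dirs $base_url/$file -o $dest/$file"
    else
      # -C - resumes a partial download; --create-dirs creates the target directory
      curl -fL -C - --create-dirs "$base_url/$file" -o "$dest/$file" || exit 1
    fi
  done
)

# Preview what would be downloaded:
DRY_RUN=1 fetch_mark6 q4
```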

Starting a Local Model

Models run via llama-server (included with KULVEX):

# Start llama-server with mark6 Q4 on a single GPU
llama-server -m ~/kulvex-models/mnemo/mark6-31b-q4.gguf \
  --mmproj ~/kulvex-models/mnemo/mmproj-gemma-4-31B-it-f16.gguf \
  --port 8090 \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --flash-attn on \
  --alias mnemo:mark6-31b
 
# Register it in KCode
kcode models add mnemo:mark6-31b http://localhost:8090/v1 \
  --context 65536 \
  --gpu "RTX 4090" \
  --default
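
Loading a model of this size can take a while, so it may be worth waiting for the server to answer before registering it. The sketch below polls the OpenAI-compatible `/v1/models` endpoint; `wait_for_model_server` and its retry defaults are illustrative assumptions.

```shell
# Poll <base-url>/models until it answers or the attempts run out.
# The parenthesized body runs in a subshell, so `exit` only ends the function.
wait_for_model_server() (
  base_url=$1              # e.g. http://localhost:8090/v1
  tries=${2:-30}           # number of attempts (assumed default)
  delay=${3:-2}            # seconds between attempts (assumed default)
  i=1
  while :; do
    if curl -fsS -o /dev/null --max-time 5 "$base_url/models" 2>/dev/null; then
      exit 0
    fi
    [ "$i" -ge "$tries" ] && break
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no response from $base_url after $tries attempts" >&2
  exit 1
)

# Usage:
# wait_for_model_server http://localhost:8090/v1 && \
#   kcode models add mnemo:mark6-31b http://localhost:8090/v1 --context 65536 --default
```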

Multi-GPU Setup

With two GPUs, the recommended setup tensor-splits the Q8 variant across both cards to serve a single large-context session. The KULVEX team runs this deployment in-house:

# Dual GPU: tensor-split mark6 Q8 across both cards
llama-server \
  -m ~/kulvex-models/mnemo/mark6-31b-q8.gguf \
  --mmproj ~/kulvex-models/mnemo/mmproj-gemma-4-31B-it-f16.gguf \
  --host 0.0.0.0 --port 8090 \
  -ngl 99 --tensor-split 22,32 \
  -c 262144 -n 16384 -np 2 \
  --alias mnemo:mark6-31b \
  --flash-attn on \
  --cache-type-k q4_0 --cache-type-v q4_0
 
# Register in KCode
kcode models add mnemo:mark6-31b http://localhost:8090/v1 \
  --context 262144 --gpu "RTX 4090 + RTX 5090" --default
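
The `-c` and `--context` values are token counts, where 256K means 256 × 1024. A small worked example of that arithmetic follows. The per-slot figure assumes the server divides the total context evenly across its `-np` slots, which may not hold for every llama.cpp build, so treat it as an estimate.

```shell
# Convert a context size in K (1K = 1024 tokens) to a token count.
ctx_tokens() {
  echo $(( $1 * 1024 ))
}

# Assumption: total context is split evenly across the -np parallel slots.
ctx_per_slot() {
  echo $(( $1 / $2 ))
}

ctx_tokens 256                        # prints 262144
ctx_tokens 64                         # prints 65536
ctx_per_slot "$(ctx_tokens 256)" 2    # prints 131072
```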

If you prefer a two-model setup (one large + one fast for quick questions), you can run a second instance on a different port with a Q4 variant:

CUDA_VISIBLE_DEVICES=0 llama-server -m mark6-31b-q4.gguf --port 8091 --ctx-size 65536 --n-gpu-layers 99

Apple Silicon (Mac)

On Macs with unified memory, total RAM acts as your “VRAM”. A mark6 build for MLX is coming; until then, run mnemo-ios:mark5-* on smaller machines, or run mark6 through llama.cpp with Metal if you have enough unified memory:

Mac	RAM	Current Recommendation
MacBook Air M1/M2	8–16 GB	`mnemo-ios:mark5-nano` or `mark5-mini`
MacBook Pro M3/M4	18–36 GB	`mnemo-ios:mark5-mid` or `mark5-max`
Mac Studio M2/M4 Ultra	64–192 GB	`mnemo:mark6-31b-q8` via llama.cpp Metal
Mac Pro M2/M4 Ultra	192+ GB	`mnemo:mark6-31b-q8` (full 256K context)

# llama.cpp with Metal acceleration (mark6 on Apple Silicon with enough unified memory)
llama-server -m ~/kulvex-models/mnemo/mark6-31b-q8.gguf --port 8090 --ctx-size 65536 --n-gpu-layers 99

Ollama

If you prefer Ollama:

# Pull a model
ollama pull qwen2.5-coder:32b
 
# Ollama exposes an OpenAI-compatible API at port 11434
kcode models add qwen2.5-coder:32b http://localhost:11434/v1 \
  --context 32768 \
  --default
 
# Or run directly
kcode --model qwen2.5-coder:32b

Popular Ollama models for coding:

Model	Size	Best For
`qwen2.5-coder:7b`	~5 GB	Quick edits, 8GB VRAM
`qwen2.5-coder:14b`	~10 GB	General coding, 12GB VRAM
`qwen2.5-coder:32b`	~20 GB	Complex tasks, 24GB VRAM
`deepseek-coder-v2:16b`	~10 GB	Multi-language, 12GB VRAM
`codellama:34b`	~20 GB	Large codebases, 24GB VRAM
`llama3.1:70b`	~40 GB	Maximum local quality, 48GB VRAM

Cloud Providers

KCode supports cloud providers with native API integration — no proxies needed. Configure them interactively with /cloud or manually.

Interactive Setup (`/cloud`)

The easiest way to set up a cloud provider is from inside KCode:

> /cloud

This opens an interactive menu where you can:

Select a provider (Anthropic, OpenAI, Gemini, Groq, DeepSeek, Together AI, xAI)
Paste your API key (masked for security)
Save to ~/.kcode/settings.json

The provider’s models are automatically registered and the active model switches immediately.

Anthropic (Claude) — Native API

KCode uses the native Anthropic Messages API (/v1/messages) — not an OpenAI-compatible proxy. This gives you:

Full tool_use / tool_result support as content blocks
Proper system prompt handling (top-level system field)
Native SSE streaming with all event types

# Via /cloud (recommended)
> /cloud → Anthropic → paste your API key
 
# Or manually
export ANTHROPIC_API_KEY=sk-ant-...
kcode --model claude-sonnet-4-6

Available models: claude-sonnet-4-6, claude-opus-4-6, claude-haiku-4-5

OpenAI

# Via /cloud (recommended)
> /cloud → OpenAI → paste your API key
 
# Or manually
export OPENAI_API_KEY=sk-proj-...
kcode --model gpt-4o

Google Gemini

> /cloud → Google Gemini → paste your API key
 
# Or manually
export GEMINI_API_KEY=AIza...
kcode --model gemini-2.5-pro

Groq (Fast Inference)

> /cloud → Groq → paste your API key
 
# Or manually
export GROQ_API_KEY=gsk_...
kcode --model llama-3.3-70b

DeepSeek

> /cloud → DeepSeek → paste your API key
 
# Or manually
export DEEPSEEK_API_KEY=sk-...
kcode --model deepseek-chat

Together AI

> /cloud → Together AI → paste your API key
 
# Or manually
export TOGETHER_API_KEY=tok_...
kcode --model meta-llama/Llama-3.3-70B

xAI (Grok)

> /cloud → xAI (Grok) → paste your API key
 
# Or manually
export XAI_API_KEY=xai-...
kcode --model grok-4

Available models: grok-4, grok-4-latest, grok-4-fast-reasoning, grok-3, grok-3-mini. xAI exposes an OpenAI-compatible endpoint at https://api.x.ai/v1, so KCode routes Grok requests through the same code path as GPT or DeepSeek.

Any OpenAI-Compatible Server

# vLLM
kcode models add my-model http://my-server:8000/v1 --context 32768
 
# LM Studio
kcode models add lm-studio http://localhost:1234/v1 --context 8192
 
# Text Generation Inference (TGI)
kcode models add tgi-model http://localhost:8080/v1 --context 16384
 
# OpenRouter
kcode models add openrouter https://openrouter.ai/api/v1 --context 128000
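
If you maintain several endpoints, the `kcode models add` calls can be driven from a short inventory. This is a sketch only: `register_models` and the three-column inventory format are hypothetical, and just the `--context` option from the examples above is used. `DRY_RUN=1` prints the commands instead of running them.

```shell
# Read "name base-url context" triples from stdin and register each one.
register_models() {
  while read -r name url ctx; do
    # Skip blank lines and comment lines
    case "$name" in ''|'#'*) continue ;; esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "kcode models add $name $url --context $ctx"
    else
      # </dev/null keeps kcode from consuming the rest of the inventory
      kcode models add "$name" "$url" --context "$ctx" </dev/null
    fi
  done
}

DRY_RUN=1 register_models <<'EOF'
# name      base-url                    context
my-model    http://my-server:8000/v1    32768
lm-studio   http://localhost:1234/v1    8192
EOF
```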

Switching Models Mid-Session (`/toggle`)

Inside an active KCode session, use /toggle to open the interactive model switcher:

> /toggle

This shows all registered models grouped by LOCAL and CLOUD:

Navigate with arrow keys
Current model shown with a green indicator
Press Enter to switch, Esc to cancel
Model, context window, and API key update instantly

You can also switch directly:

> /model mnemo:mark6-31b
> /switch

Aliases: /toggle, /model, /switch

API Key Management

API keys are resolved per-provider. Each provider has its own environment variable:

Provider	Environment Variable
Anthropic	`ANTHROPIC_API_KEY`
OpenAI	`OPENAI_API_KEY`
Google Gemini	`GEMINI_API_KEY`
Groq	`GROQ_API_KEY`
DeepSeek	`DEEPSEEK_API_KEY`
Together AI	`TOGETHER_API_KEY`
xAI (Grok)	`XAI_API_KEY`
Generic	`KCODE_API_KEY`
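
Before launching a cloud model, it can be handy to check that the matching variable is set. The sketch below mirrors the table, plus `XAI_API_KEY` from the xAI section. The lowercase provider labels and both function names are hypothetical, and mapping unknown providers to `KCODE_API_KEY` is an assumption about the "Generic" row.

```shell
# Map a provider label to its environment variable name.
provider_key_var() {
  case "$1" in
    anthropic) echo ANTHROPIC_API_KEY ;;
    openai)    echo OPENAI_API_KEY ;;
    gemini)    echo GEMINI_API_KEY ;;
    groq)      echo GROQ_API_KEY ;;
    deepseek)  echo DEEPSEEK_API_KEY ;;
    together)  echo TOGETHER_API_KEY ;;
    xai)       echo XAI_API_KEY ;;
    *)         echo KCODE_API_KEY ;;   # "Generic" row (assumed fallback)
  esac
}

# Succeed only if the provider's key variable is set and non-empty.
has_provider_key() {
  var=$(provider_key_var "$1")
  eval "value=\${$var:-}"   # var always comes from the fixed list above
  [ -n "$value" ]
}

# Usage:
# has_provider_key anthropic || echo "set ANTHROPIC_API_KEY or run /cloud" >&2
```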

Keys can also be set via:

# 1. /cloud interactive menu (saves to ~/.kcode/settings.json)
> /cloud
 
# 2. CLI flag (session only)
kcode --api-key sk-... --model gpt-4o
 
# 3. Environment variable
export ANTHROPIC_API_KEY=sk-ant-...
 
# 4. Settings file (~/.kcode/settings.json)
{
  "anthropicApiKey": "sk-ant-...",
  "apiKey": "sk-proj-..."
}
 
# 5. .env file (project root)
ANTHROPIC_API_KEY=sk-ant-...

For security, prefer environment variables or /cloud over manual config files. Never commit API keys.
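
One way to reduce the risk of committing a project-level `.env` file is to make sure it is listed in `.gitignore`. The helper below is an illustrative sketch; `ensure_gitignored` is a hypothetical name and is not part of KCode.

```shell
# Append an entry to .gitignore unless an identical line is already present.
# The parenthesized body runs in a subshell, so variables stay local.
ensure_gitignored() (
  entry=$1
  file=${2:-.gitignore}
  if [ -f "$file" ] && grep -qxF -- "$entry" "$file"; then
    exit 0
  fi
  # Add a newline first if the file does not already end with one
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    printf '\n' >> "$file"
  fi
  printf '%s\n' "$entry" >> "$file"
)

# Usage (run from the project root):
# ensure_gitignored .env
```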

Model Capabilities

When registering a model, you can specify capabilities:

kcode models add my-model http://localhost:10091/v1 \
  --context 65536 \
  --gpu "RTX 5090" \
  --caps "code,vision,reasoning" \
  --desc "My custom fine-tuned model"

Capabilities are informational and help you organize your models.

Troubleshooting

Model not responding

# Check if the model server is running (use the port you started it on)
curl http://localhost:8090/v1/models
 
# Run diagnostics
kcode doctor

Empty responses

Some models may not respond to very short inputs. Try being more specific:

# Instead of:
kcode "hello"
 
# Try:
kcode "hello, explain which files are in this project"

Context window too small

If KCode hits the context limit frequently, use a model with a larger context window or enable auto-compaction:

# In session
> /auto-compact
 
# Or start with a bigger context model
kcode --model mnemo:mark6-31b-q8  # 256K context

GPU out of memory

Lower the context size or switch to a smaller model:

# Reduce context window
llama-server -m model.gguf --ctx-size 8192  # instead of 32768
 
# Or use a smaller quantization
# Q4_K_M uses less VRAM than Q5_K_M or Q6_K
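
Before changing settings, it helps to see how much VRAM is actually free on each card. The sketch below assumes NVIDIA GPUs with `nvidia-smi` installed; `free_vram_mib` is a hypothetical helper that only parses the CSV output.

```shell
# Print "<gpu index> <free MiB>" for each input line of the form
# "index, memory.total, memory.used" (values in MiB).
free_vram_mib() {
  awk -F', *' '{ print $1, $2 - $3 }'
}

# On a machine with NVIDIA drivers:
# nvidia-smi --query-gpu=index,memory.total,memory.used \
#   --format=csv,noheader,nounits | free_vram_mib

# Example with sample numbers:
printf '0, 24564, 1200\n' | free_vram_mib   # prints 0 23364
```

Compare the free figure with the approximate file size of the quant you plan to load, and leave headroom for the KV cache.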
