Voice Pipeline
KULVEX includes a full voice interface with speech-to-text (STT), text-to-speech (TTS), voice activity detection (VAD), and intent recognition.
Architecture
Microphone → VAD → STT → Intent Detection → Domain Agent / LLM → TTS → Speaker
Speech-to-Text (STT)
KULVEX tries STT providers in priority order:
| Priority | Provider | Where | Latency |
|---|---|---|---|
| 1 | mnemo:voice | GPU node (Whisper large-v3 CUDA) | ~1-2s |
| 2 | Deepgram | Cloud API | ~0.5-1s |
| 3 | Whisper CPU | Local (slow) | ~5-10s |
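The priority-ordered fallback can be sketched as a simple chain: try each provider in turn and move on when one fails. The provider list shape and the `transcribe` signature below are illustrative assumptions, not KULVEX's real API.

```python
from typing import Callable, Optional

def transcribe_with_fallback(
    audio: bytes,
    providers: list[tuple[str, Callable[[bytes], str]]],
) -> Optional[str]:
    """Try STT providers in priority order; return the first successful transcript."""
    for name, transcribe in providers:
        try:
            return transcribe(audio)
        except Exception:
            # Provider unavailable (GPU node down, missing API key, timeout):
            # fall through to the next one in the priority list.
            continue
    # All providers failed.
    return None
```

In this sketch, a dead GPU node simply raises, and the request falls through to Deepgram and then the local CPU model without the caller having to know which backend answered.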
mnemo:voice
GPU-accelerated Whisper running on a dedicated node. Configure the node in the KULVEX dashboard under Settings > Nodes.
Deepgram
Cloud STT with excellent accuracy. Set DEEPGRAM_API_KEY in Settings.
Whisper CPU
Local fallback using OpenAI’s Whisper model on CPU. Slow but works offline.
Text-to-Speech (TTS)
Default: EdgeTTS (Microsoft Edge voices, free, no API key).
Alternative: Piper (fully local, open-source voices).
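A minimal sketch of how the default/alternative choice might be resolved, assuming engines are registered as callables keyed by name (the registry and names here are hypothetical, not KULVEX's actual configuration API):

```python
from typing import Callable, Dict

def pick_tts_engine(
    engines: Dict[str, Callable[[str], bytes]],
    preferred: str = "edge-tts",
    fallback: str = "piper",
) -> Callable[[str], bytes]:
    """Return the preferred engine if registered, else the fully local fallback."""
    if preferred in engines:
        return engines[preferred]
    return engines[fallback]
```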
Intent Detection
Before sending a transcript to the LLM, KULVEX matches it against 12 regex patterns for common voice intents:
- Time queries (“what time is it”)
- Presence (“who’s home”)
- Solar status (“how much energy”)
- Security (“arm/disarm alarm”)
- Home control (“turn on lights”)
- Weather, email, news, search, system info
Matched intents are answered with template responses (no LLM call, instant) or routed to domain agents with minimal LLM synthesis.
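The intent check can be sketched as a first-match scan over compiled patterns. The three patterns and the template response below are illustrative stand-ins, not KULVEX's actual 12 patterns:

```python
import re
from datetime import datetime

# Illustrative subset of voice-intent patterns (first match wins).
INTENTS = [
    ("time",     re.compile(r"\bwhat time is it\b", re.I)),
    ("presence", re.compile(r"\bwho'?s home\b", re.I)),
    ("lights",   re.compile(r"\bturn (on|off) (the )?lights?\b", re.I)),
]

def detect_intent(transcript: str):
    """Return the first matching intent name, or None to fall through to the LLM."""
    for name, pattern in INTENTS:
        if pattern.search(transcript):
            return name
    return None

def template_response(intent: str) -> str:
    # Template answers need no LLM round-trip, so they return instantly.
    if intent == "time":
        return datetime.now().strftime("It's %H:%M.")
    return ""
```

Anything that matches no pattern returns `None` and is forwarded to the LLM as usual.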
Voice Memory
KULVEX remembers voice conversations:
- Extracts facts from conversations
- Retrieves relevant memories for context
- Builds a personal knowledge base over time
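A toy sketch of the fact-extraction step. Production memory systems typically use an LLM for extraction; the two patterns and key names below are purely illustrative:

```python
import re

# Hypothetical patterns mapping utterances to knowledge-base keys.
FACT_PATTERNS = [
    (re.compile(r"\bmy name is (\w+)", re.I), "user_name"),
    (re.compile(r"\bi live in ([\w\s]+)", re.I), "user_location"),
]

def extract_facts(transcript: str) -> dict:
    """Pull simple key/value facts out of a transcript for later retrieval."""
    facts = {}
    for pattern, key in FACT_PATTERNS:
        match = pattern.search(transcript)
        if match:
            facts[key] = match.group(1).strip()
    return facts
```

Extracted facts would then be stored and retrieved as context for later voice turns.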
Socket.IO Events
| Event | Direction | Description |
|---|---|---|
| voice:start | client → server | Start voice session |
| voice:audio_chunk | client → server | Audio data chunk |
| voice:stop | client → server | End voice session |
| voice:playback_done | client → server | TTS playback finished |
| voice:transcript | server → client | STT result |
| voice:response | server → client | AI response text |
| voice:audio | server → client | TTS audio data |