Voice Pipeline

KULVEX includes a full voice interface with speech-to-text (STT), text-to-speech (TTS), voice activity detection (VAD), and intent recognition.

Architecture

Microphone → VAD → STT → Intent Detection → Domain Agent / LLM → TTS → Speaker
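
The flow above can be sketched as a chain of stages. Everything here is a hypothetical stand-in for KULVEX's actual VAD/STT/intent/TTS components, wired together only to show how one utterance moves through the pipeline:

```python
# Minimal sketch of the voice pipeline stages; all implementations are stubs.

def vad(audio: bytes) -> bytes:
    """Voice activity detection: keep only speech frames (stubbed)."""
    return audio

def stt(audio: bytes) -> str:
    """Speech-to-text (stubbed)."""
    return "what time is it"

def detect_intent(text: str):
    """Return an intent name if a pattern matches, else None (stubbed)."""
    return "time" if "time" in text else None

def llm(text: str) -> str:
    """LLM fallback for unmatched intents (stubbed)."""
    return f"LLM answer to: {text}"

def respond(text: str, intent) -> str:
    """Template response for known intents, otherwise fall back to the LLM."""
    if intent == "time":
        return "It is 3:00 PM."
    return llm(text)

def tts(text: str) -> bytes:
    """Text-to-speech (stubbed)."""
    return text.encode()

def handle_utterance(audio: bytes) -> bytes:
    """Run one utterance through the full pipeline: VAD -> STT -> intent -> response -> TTS."""
    speech = vad(audio)
    transcript = stt(speech)
    reply = respond(transcript, detect_intent(transcript))
    return tts(reply)

print(handle_utterance(b"...").decode())  # -> It is 3:00 PM.
```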

Speech-to-Text (STT)

KULVEX tries STT providers in priority order:

Priority  Provider     Where                              Latency
1         mnemo:voice  GPU node (Whisper large-v3 CUDA)   ~1-2s
2         Deepgram     Cloud API                          ~0.5-1s
3         Whisper CPU  Local (slow)                       ~5-10s
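
The priority order can be sketched as a simple fallback chain: try each provider in turn and move on when one fails. The provider functions here are hypothetical stubs (the first simulates an unreachable GPU node), not KULVEX's real client code:

```python
# Hedged sketch of the STT priority/fallback chain.

class STTError(Exception):
    pass

def mnemo_voice_stt(audio: bytes) -> str:
    raise STTError("GPU node unreachable")  # simulate a down provider

def deepgram_stt(audio: bytes) -> str:
    return "turn on the lights"  # simulated cloud transcript

def whisper_cpu_stt(audio: bytes) -> str:
    return "turn on the lights"  # slow local fallback

PROVIDERS = [
    ("mnemo:voice", mnemo_voice_stt),  # priority 1
    ("deepgram", deepgram_stt),        # priority 2
    ("whisper-cpu", whisper_cpu_stt),  # priority 3
]

def transcribe(audio: bytes) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, transcript)."""
    errors = []
    for name, fn in PROVIDERS:
        try:
            return name, fn(audio)
        except STTError as exc:
            errors.append(f"{name}: {exc}")
    raise STTError("all providers failed: " + "; ".join(errors))

provider, text = transcribe(b"...")
print(provider, "->", text)  # deepgram -> turn on the lights
```

Because the first provider raises, the chain falls through to Deepgram; Whisper CPU is only reached if both faster options fail.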

mnemo:voice

GPU-accelerated Whisper running on a dedicated node. Configure the node in the KULVEX dashboard under Settings > Nodes.

Deepgram

Cloud STT with excellent accuracy. Set DEEPGRAM_API_KEY in Settings.

Whisper CPU

Local fallback using OpenAI’s Whisper model on CPU. Slow but works offline.

Text-to-Speech (TTS)

Default: EdgeTTS (Microsoft Edge voices, free, no API key).

Alternative: Piper (fully local, open-source voices).
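
Backend selection between the two can be sketched as a single switch: prefer EdgeTTS by default and use Piper when fully-local synthesis is wanted. The function names and the boolean flag are illustrative, not KULVEX's actual API:

```python
# Sketch of TTS backend selection; both synthesizers are stubs.

def edge_tts_synthesize(text: str) -> bytes:
    return b"EDGE:" + text.encode()   # stands in for Microsoft Edge voices

def piper_synthesize(text: str) -> bytes:
    return b"PIPER:" + text.encode()  # stands in for local Piper voices

def synthesize(text: str, prefer_local: bool = False) -> bytes:
    """Default to EdgeTTS; switch to Piper for offline/local-only setups."""
    backend = piper_synthesize if prefer_local else edge_tts_synthesize
    return backend(text)

print(synthesize("Hello"))                     # b'EDGE:Hello'
print(synthesize("Hello", prefer_local=True))  # b'PIPER:Hello'
```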

Intent Detection

Before sending to the LLM, KULVEX checks 12 regex patterns for common voice intents:

  • Time queries (“what time is it”)
  • Presence (“who’s home”)
  • Solar status (“how much energy”)
  • Security (“arm/disarm alarm”)
  • Home control (“turn on lights”)
  • Weather, email, news, search, system info

Matched intents are answered with template responses (no LLM call, effectively instant) or routed to domain agents with minimal LLM synthesis.
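
The matching step can be sketched with a small pattern table checked before any LLM call. The patterns below are illustrative examples covering a few of the intents listed above, not KULVEX's actual 12 patterns:

```python
import re

# Sketch of regex-based intent detection (illustrative patterns only).
INTENT_PATTERNS = [
    ("time",     re.compile(r"\bwhat time is it\b", re.I)),
    ("presence", re.compile(r"\bwho'?s home\b", re.I)),
    ("solar",    re.compile(r"\bhow much energy\b", re.I)),
    ("security", re.compile(r"\b(arm|disarm)\b.*\balarm\b", re.I)),
    ("lights",   re.compile(r"\bturn (on|off)\b.*\blights?\b", re.I)),
]

def detect_intent(text: str):
    """Return the first matching intent name, or None to fall through to the LLM."""
    for name, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return name
    return None

print(detect_intent("What time is it?"))           # time
print(detect_intent("please turn on the lights"))  # lights
print(detect_intent("tell me a joke"))             # None
```

A `None` result is the signal to skip templates and hand the transcript to the full LLM path.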

Voice Memory

KULVEX remembers voice conversations:

  • Extracts facts from conversations
  • Retrieves relevant memories for context
  • Builds a personal knowledge base over time
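
The extract/retrieve cycle can be sketched as below. Simple keyword overlap stands in for whatever extraction and retrieval KULVEX actually uses; the class and its heuristics are purely illustrative:

```python
# Minimal sketch of voice memory: store facts, retrieve the relevant ones.

class VoiceMemory:
    def __init__(self):
        self.facts: list[str] = []

    def extract(self, transcript: str) -> None:
        """Naively keep 'my ...' statements as facts (illustrative heuristic)."""
        for sentence in transcript.split("."):
            if "my" in sentence.lower().split():
                self.facts.append(sentence.strip())

    def retrieve(self, query: str, limit: int = 3) -> list[str]:
        """Rank stored facts by word overlap with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(f.lower().split())), f) for f in self.facts]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [f for score, f in scored[:limit] if score > 0]

memory = VoiceMemory()
memory.extract("My car is a red hatchback. The weather is nice.")
print(memory.retrieve("what color is my car"))  # ['My car is a red hatchback']
```

Each new conversation adds facts, so relevant context accumulates into a personal knowledge base over time.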

Socket.IO Events

Event                Direction        Description
voice:start          client → server  Start voice session
voice:audio_chunk    client → server  Audio data chunk
voice:stop           client → server  End voice session
voice:playback_done  client → server  TTS playback finished
voice:transcript     server → client  STT result
voice:response       server → client  AI response text
voice:audio          server → client  TTS audio data
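
The client-side lifecycle can be sketched with the event names above. A real client would emit these over Socket.IO (e.g. with the python-socketio library); here `emit` just records them so the ordering is visible, and the `VoiceSession` class is purely illustrative:

```python
# Sketch of the client-side voice session lifecycle (events recorded, not sent).

class VoiceSession:
    def __init__(self):
        self.sent: list[str] = []

    def emit(self, event: str, payload=None) -> None:
        self.sent.append(event)  # a real client would send this to the server

    def run(self, chunks: list[bytes]) -> None:
        self.emit("voice:start")                   # open the session
        for chunk in chunks:
            self.emit("voice:audio_chunk", chunk)  # stream microphone audio
        self.emit("voice:stop")                    # server runs STT -> AI -> TTS
        # ...server replies with voice:transcript, voice:response, voice:audio...
        self.emit("voice:playback_done")           # report TTS playback finished

session = VoiceSession()
session.run([b"chunk1", b"chunk2"])
print(session.sent)
# ['voice:start', 'voice:audio_chunk', 'voice:audio_chunk', 'voice:stop', 'voice:playback_done']
```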