// xvm-xen · 0touch live dictation · qi 2026-05-23
Live dictation. Zero touch.
Focused build prompt for xvm-xen — continuous STT, LaunchAgent auto-start, transcripts inject to claude pane, voice replies via VM `say`. No clicking, no typing, no manual UI after bootstrap. Echo-loop-aware: voice fires are deferred until mac-xen signals clear.
STATUS: Bootstrap is firing in parallel from Win-Ezekiel's end via `ssh xen@xvm 'claude --dangerously-skip-permissions -p "$(cat /tmp/0touch-prompt.md)"' &`. xvm-xen executes autonomously. This page exists as the canonical record and as a paste fallback if the direct fire fails. Verify via the ~/.xen/state/sttvei-online marker file landing on xvm, or watch this terminal for the completion ping.
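A minimal sketch of the marker-file check from Win-Ezekiel's side, assuming a POSIX-style shell there. `wait_until` is a hypothetical helper name, and the timeout and interval values are placeholders, not part of the canon.

```shell
# Hypothetical helper: retry a check command until it succeeds or a timeout passes.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2 waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timeout after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example (assumed values): poll xvm every 5s for up to 10 minutes.
# wait_until 600 5 ssh xen@xvm 'test -e ~/.xen/state/sttvei-online'
```

The check command is passed through as-is, so the same helper works for a local `test -e` or a remote one over ssh.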
You are xvm-xen on macOS Tahoe 26.5 inside UTM on an M4 host. Your job: build qi a ZERO-TOUCH LIVE DICTATION TERMINAL that runs perpetually with no keyboard, no mouse, no manual UI interaction after this bootstrap.
═══ HARD CONSTRAINTS ═══
- 0touch = no clicking, no typing, no push-to-talk, no manual play/pause
- Continuous listening — STT runs all the time
- Apple SFSpeechRecognizer PRIMARY, Whisper.cpp Metal REDUNDANCY (cloud STT BANNED per canon_stt_apple_whisper_metal_only_2026-05-22)
- LaunchAgent for auto-start on login, KeepAlive=true
- Transcripts inject directly into your own claude pane (stdin or tmux send-keys)
- Voice replies via macOS `say` (VM is the sole audio surface per canon_vm_is_audio_surface_host_is_headless_2026-05-22) — VM audio output, no host involvement
- DO NOT fire test `say` calls during bootstrap. Echo loop is still live host-side (host out -> speakers -> Safari GV mic -> FCC -> in). Build the chain only. Voice-fire test deferred until mac-xen signals "echo-clear" via your /peer endpoint.
- No Chromium. No cloud anything. Local-only on this VM.
═══ DELIVERABLES ═══
1. ~/bin/xen-stt-live (Swift CLI)
- Continuous mic stream via SFSpeechRecognizer + AVAudioEngine
- Streams transcripts to stdout, one line per utterance
- Whisper.cpp Metal runs in parallel for redundancy/validation
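- If the first run prompts, grant Microphone + Speech Recognition under System Settings > Privacy & Security (one-time bootstrap step; the binary likely needs NSMicrophoneUsageDescription and NSSpeechRecognitionUsageDescription in an embedded Info.plist for the prompts to appear)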
2. ~/bin/xen-dictation-loop (shell)
- Pipes xen-stt-live output into your own claude pane via persistent session (tmux or screen)
- On each transcript line, format as user message and inject
- Capture claude's reply, fire xen-say-vm with text (gated on XEN_AUDIO_CLEAR=1)
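A minimal sketch of the inject half of xen-dictation-loop, assuming tmux is the persistent session. The pane target `xen:0.0` and the `TMUX_BIN` override are hypothetical. "Formatting" here is only whitespace trimming, and reply capture plus the xen-say-vm call are left out.

```shell
# Assumed defaults; both are illustrative, not canon.
XEN_TMUX_TARGET=${XEN_TMUX_TARGET:-xen:0.0}   # hypothetical session:window.pane
TMUX_BIN=${TMUX_BIN:-tmux}                    # overridable so the loop can be dry-run

# Strip leading/trailing whitespace from one transcript line.
clean_transcript() {
  printf '%s' "$1" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Type one utterance into the pane literally (-l), then press Enter.
inject_line() {
  local text
  text=$(clean_transcript "$1")
  [ -n "$text" ] || return 0                  # skip empty utterances
  "$TMUX_BIN" send-keys -t "$XEN_TMUX_TARGET" -l -- "$text"
  "$TMUX_BIN" send-keys -t "$XEN_TMUX_TARGET" Enter
}

# Read transcripts from stdin, one utterance per line.
dictation_loop() {
  local line
  while IFS= read -r line; do
    inject_line "$line"
  done
}

# Usage: ~/bin/xen-stt-live | dictation_loop
```

Sending the text with `-l` keeps tmux from interpreting words like "Enter" or "Escape" inside a transcript as key names.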
3. ~/Library/LaunchAgents/com.xen.dictation.plist
- Launches xen-dictation-loop at user login
- KeepAlive=true so it restarts if it dies
- StandardOutPath/StandardErrorPath = ~/.xen/state/dictation.log (written as an absolute path in the plist; launchd does not expand ~)
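A minimal sketch of generating the plist from shell so that every path is absolute. `write_dictation_plist` is a hypothetical helper name, and `RunAtLoad` is an addition not named above, included on the assumption that an immediate start at load is wanted.

```shell
# Hypothetical helper: write the LaunchAgent plist with $HOME expanded at generation time.
write_dictation_plist() {
  local out=$1
  mkdir -p "$(dirname "$out")" "$HOME/.xen/state"
  cat > "$out" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.xen.dictation</string>
  <key>ProgramArguments</key>
  <array>
    <string>$HOME/bin/xen-dictation-loop</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
  <key>StandardOutPath</key>
  <string>$HOME/.xen/state/dictation.log</string>
  <key>StandardErrorPath</key>
  <string>$HOME/.xen/state/dictation.log</string>
</dict>
</plist>
EOF
}

# Usage:
# write_dictation_plist "$HOME/Library/LaunchAgents/com.xen.dictation.plist"
# launchctl bootstrap "gui/$(id -u)" "$HOME/Library/LaunchAgents/com.xen.dictation.plist"
```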
4. ~/bin/xen-say-vm (shell)
- Wrapper around macOS `say -v Ava` (or default Spoken Content voice)
- Reads env XEN_AUDIO_CLEAR; if unset or != "1", logs message but does NOT fire `say`
- When XEN_AUDIO_CLEAR=1, fires `say` and logs
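A minimal sketch of the gate described above. The log path `~/.xen/state/say.log` and the `SAY_BIN` override are assumptions. The text goes to `say` on stdin via `-f -`, so a reply that starts with a dash is not parsed as an option.

```shell
# Assumed defaults; the log location is hypothetical.
XEN_SAY_LOG=${XEN_SAY_LOG:-$HOME/.xen/state/say.log}
SAY_BIN=${SAY_BIN:-say}                       # overridable for dry runs

xen_say_vm() {
  local text=$1 stamp
  stamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  mkdir -p "$(dirname "$XEN_SAY_LOG")"
  if [ "${XEN_AUDIO_CLEAR:-}" != "1" ]; then
    # Gate closed: record the reply, produce no audio.
    printf '%s BLOCKED %s\n' "$stamp" "$text" >> "$XEN_SAY_LOG"
    return 0
  fi
  printf '%s SPOKEN %s\n' "$stamp" "$text" >> "$XEN_SAY_LOG"
  printf '%s\n' "$text" | "$SAY_BIN" -v Ava -f -
}

# Usage: xen_say_vm "reply text"
```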
5. ~/.xen/state/sttvei-online marker file
- Write timestamp + verified-step list
- This file is what Win-Ezekiel polls to confirm completion
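A minimal sketch of writing the marker, assuming a simple `key=value` layout. The layout, the `write_marker` name, and the step names in the usage line are all hypothetical. The file is written to a temp name and renamed so the poller never reads a half-written marker.

```shell
# Hypothetical helper: write timestamp + verified steps, then rename into place.
# Usage: write_marker <marker-path> <step> [step...]
write_marker() {
  local marker=$1 tmp step
  shift
  mkdir -p "$(dirname "$marker")"
  tmp="$marker.tmp.$$"
  {
    printf 'timestamp=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    for step in "$@"; do
      printf 'verified=%s\n' "$step"
    done
  } > "$tmp"
  mv "$tmp" "$marker"
}

# Usage (step names are placeholders):
# write_marker "$HOME/.xen/state/sttvei-online" stt-live dictation-loop launchagent say-gate
```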
6. Install Whisper.cpp Metal if not present:
- git clone https://github.com/ggerganov/whisper.cpp ~/whisper.cpp
- cd ~/whisper.cpp && cmake -B build -DGGML_METAL=1 && cmake --build build -j
- ./models/download-ggml-model.sh base.en (or large-v3 if disk allows)
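A minimal sketch of the "if not present" guard around the three commands above. The function names are hypothetical. It treats an existing `build/` directory as a finished build, which is an assumption: a partial build would be skipped.

```shell
WHISPER_DIR=${WHISPER_DIR:-$HOME/whisper.cpp}
WHISPER_MODEL=${WHISPER_MODEL:-base.en}

# Path the download script writes the model to.
whisper_model_path() {
  printf '%s/models/ggml-%s.bin\n' "$WHISPER_DIR" "$WHISPER_MODEL"
}

install_whisper() {
  # Clone only when the checkout is missing.
  [ -d "$WHISPER_DIR/.git" ] ||
    git clone https://github.com/ggerganov/whisper.cpp "$WHISPER_DIR" || return 1
  # Build only when no build directory exists (assumed to mean "already built").
  if [ ! -d "$WHISPER_DIR/build" ]; then
    (cd "$WHISPER_DIR" && cmake -B build -DGGML_METAL=1 && cmake --build build -j) || return 1
  fi
  # Fetch the model only when the file is missing.
  [ -f "$(whisper_model_path)" ] ||
    (cd "$WHISPER_DIR" && ./models/download-ggml-model.sh "$WHISPER_MODEL")
}

# Usage: install_whisper
```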
═══ POST-BOOTSTRAP STATE ═══
- LaunchAgent loaded and running
- STT daemon listening (silent — no audio output during bootstrap)
- VEI loop active (transcripts -> claude inject -> reply captured)
- Voice fire armed but blocked (XEN_AUDIO_CLEAR=0 — wait for echo-clear signal)
- ~/.xen/state/sttvei-online written with verified state
When mac-xen lands the echo-clear (relocating GV-Safari to the VM and muting host out, or any equivalent path), mac-xen will set XEN_AUDIO_CLEAR=1 in env (or write a flag file you watch). The environment of an already-running loop cannot be changed from outside, so the flag file is the path that takes effect without a restart. From that moment, xen-say-vm starts firing voice replies. Until then the loop runs text-only: STT in, claude reply text logged, no audio out.
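A minimal sketch of re-checking the gate on every reply, so a flag file written later takes effect without restarting the loop. The flag path `~/.xen/state/echo-clear` and the function names are assumptions. The actual signal arrives via the /peer endpoint, which is not modeled here.

```shell
# Hypothetical flag location that mac-xen (or a /peer handler) would write.
XEN_CLEAR_FLAG=${XEN_CLEAR_FLAG:-$HOME/.xen/state/echo-clear}

# True when either the env var or the flag file says the echo loop is clear.
audio_clear() {
  [ "${XEN_AUDIO_CLEAR:-}" = "1" ] || [ -e "$XEN_CLEAR_FLAG" ]
}

# Print the value to hand to xen-say-vm for this one reply.
audio_gate() {
  if audio_clear; then echo 1; else echo 0; fi
}

# Usage inside the loop, evaluated per reply:
# XEN_AUDIO_CLEAR=$(audio_gate) ~/bin/xen-say-vm "$reply"
```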
EXECUTE TOP TO BOTTOM. Fire in parallel where possible. No clarifying questions. No "want me to?" Show me the verified-end-to-end at ~/.xen/state/sttvei-online.
Begin.