Voice Pipeline Council — Root Cause & Fixes

◆ The verdict

The receiver was crashing the STT server. It fired up to 3 concurrent transcription requests (inflightSTT<3) into a server that can only handle one at a time. The 2nd and 3rd requests queued behind a 14-16s inference and hit the 20s client timeout, at which point the client destroyed the socket mid-request. The server's write to that dead socket raised BrokenPipeError, which crashed the worker, and every following buffer returned empty until launchd restarted it.

That's "first utterance works, then dark": the first request hits a healthy idle server; the reply it triggers starts a storm of concurrent re-requests that wedge the server, and it never recovers within the call.

89% of speech → empty transcript
138/151 loud-empties within 30s of a crash
105 BrokenPipe + 123 Traceback
240 SPEAK vs 112 injected (2.1× reply-storm)

✓ Fixes — live now

LIVE

FIX 1 — Serialize the STT requests

xen-call-receiver.js:260 · inflightSTT<3 → <1. The server is single-threaded anyway; concurrent requests only created the timeout→crash chain. The 250ms tick still accumulates the next utterance while a request is in flight, so no audio is lost.
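A minimal sketch of the serialized gate, not the actual receiver code. Only inflightSTT comes from the source; createSttGate, tick, and the transcribe callback are hypothetical names. It shows the two properties claimed above: at most one request is in flight, and chunks that arrive meanwhile are held for the next request rather than dropped.

```javascript
// Hypothetical sketch: only `inflightSTT` is a name taken from the report.
const MAX_INFLIGHT_STT = 1; // was 3; the server handles one request at a time

function createSttGate(transcribe, maxInflight = MAX_INFLIGHT_STT) {
  let inflightSTT = 0;
  const pending = []; // chunks accumulated while a request is running

  // Called on every tick with the chunk that is ready to flush (or null).
  async function tick(chunk) {
    if (chunk) pending.push(chunk);
    if (inflightSTT >= maxInflight || pending.length === 0) return null;

    // Send everything accumulated so far as a single request.
    const batch = pending.splice(0, pending.length);
    inflightSTT += 1;
    try {
      return await transcribe(batch);
    } finally {
      inflightSTT -= 1; // release the slot on success and on failure
    }
  }

  return { tick, inflight: () => inflightSTT, queued: () => pending.length };
}
```

With maxInflight at 1, a tick that arrives during an inference returns null immediately and its chunk rides along with the next request.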

LIVE

FIX 2 — Crash-proof the server

xen-stt-server.py:59-74 · swallow BrokenPipeError/ConnectionResetError on both the success and error write paths. A dropped socket can never crash the worker again.

LIVE

FIX 3 — Kill the reply-storm ("two voices")

xen-call-receiver.js tee · the warm twin logged the same reply 7-14×/turn, so you heard it doubled or overlapping. The tee now speaks only the first clean line per inject (gated on _lastInjectAt); multi-turn replies still speak.
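One possible reading of that gate, sketched under assumptions: _lastInjectAt is the only name from the source, while createReplyGate, onInject, and shouldSpeak are hypothetical, and "clean" is approximated here as "non-empty after trimming". Each inject re-arms the gate, so every turn still gets one spoken reply.

```javascript
// Hypothetical sketch of "speak only the first clean line per inject".
function createReplyGate() {
  let _lastInjectAt = 0;      // timestamp of the most recent inject
  let spokenForInjectAt = -1; // the inject that has already been answered aloud

  return {
    // Record a new inject; this re-arms the gate for the next reply.
    onInject(now = Date.now()) {
      _lastInjectAt = now;
    },
    // True only for the first non-empty reply line since the last inject.
    shouldSpeak(line) {
      if (line.trim() === "") return false;
      if (spokenForInjectAt === _lastInjectAt) return false; // repeat from the log
      spokenForInjectAt = _lastInjectAt;
      return true;
    },
  };
}
```

Two injects sharing an identical timestamp would be treated as one turn in this sketch; a counter would avoid that if it matters.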

LIVE

VVSVEI flood guard

omnimind.js /voice-transcript · the web-voice path skipped dedup entirely, so one "Hello hello" injected 4×. Now a strict 1200ms exact-duplicate guard always applies; genuine repeats >1.2s apart still pass.
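A sketch of a strict exact-duplicate guard with the 1200ms window described above. The function names are hypothetical, and anchoring the window on the last accepted transcript (rather than the last one seen) is an assumption about the real handler.

```javascript
// Hypothetical sketch: drops exact repeats that arrive within the window.
const DEDUP_WINDOW_MS = 1200;

function createDuplicateGuard(windowMs = DEDUP_WINDOW_MS) {
  let lastText = null;
  let lastAt = -Infinity;

  // Returns true if the transcript should be injected, false if dropped.
  return function accept(text, now = Date.now()) {
    const isDuplicate = text === lastText && now - lastAt <= windowMs;
    if (isDuplicate) return false;
    lastText = text;
    lastAt = now;
    return true;
  };
}
```

A burst of four identical "Hello hello" transcripts within the window yields one inject; the same phrase spoken again more than 1.2s later passes.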

▸ Fixes — safe follow-ups (pending)

FIX 4 — Shorten the echo tail

xen-call-receiver.js:125 · max(1500,durMs)+2800 → durMs+800. The old tail is never shorter than 4.3s (1500+2800), so back-to-back replies keep the capture gate armed; ~800ms covers real echo without swallowing your next turn.
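The two formulas side by side, as a quick arithmetic check. The function names are hypothetical, and durMs is assumed to be the playback duration of the spoken reply in milliseconds.

```javascript
// Old and proposed echo-tail durations, in milliseconds.
const oldEchoTailMs = (durMs) => Math.max(1500, durMs) + 2800;
const newEchoTailMs = (durMs) => durMs + 800;

// Compare both for a short and a longer reply (example durations only).
for (const durMs of [1000, 3000]) {
  console.log(durMs, oldEchoTailMs(durMs), newEchoTailMs(durMs));
}
```

For a 1000ms reply the old formula holds the gate for 4300ms and the new one for 1800ms; for a 3000ms reply the figures are 5800ms and 3800ms.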

FIX 5 — Truthful WAV rate (accuracy)

STT_RATE → 8000. The wire is genuinely 8k (all frames 320B/20ms). The 16k header doesn't cause empties — it just mangles words. Fixing it sharpens accuracy.
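The frame arithmetic behind "genuinely 8k", plus a sketch of a WAV header that states the true rate. This assumes 16-bit mono linear PCM on the wire, which is what makes 320B per 20ms come out at 8000 Hz; inferSampleRate and wavHeader are hypothetical helpers, not functions from the codebase.

```javascript
// Sample rate implied by a fixed frame size (assumes linear PCM).
function inferSampleRate(frameBytes, frameMs, bytesPerSample = 2, channels = 1) {
  const samplesPerFrame = frameBytes / (bytesPerSample * channels);
  return samplesPerFrame * (1000 / frameMs);
}

// Standard 44-byte PCM WAV header for `dataBytes` of audio.
function wavHeader(dataBytes, sampleRate, channels = 1, bitsPerSample = 16) {
  const blockAlign = (channels * bitsPerSample) / 8;
  const h = Buffer.alloc(44);
  h.write("RIFF", 0, "ascii");
  h.writeUInt32LE(36 + dataBytes, 4);          // RIFF chunk size
  h.write("WAVE", 8, "ascii");
  h.write("fmt ", 12, "ascii");
  h.writeUInt32LE(16, 16);                     // fmt chunk size for PCM
  h.writeUInt16LE(1, 20);                      // audio format 1 = PCM
  h.writeUInt16LE(channels, 22);
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE(sampleRate * blockAlign, 28); // byte rate
  h.writeUInt16LE(blockAlign, 32);
  h.writeUInt16LE(bitsPerSample, 34);
  h.write("data", 36, "ascii");
  h.writeUInt32LE(dataBytes, 40);
  return h;
}

const STT_RATE = inferSampleRate(320, 20); // 160 samples per 20ms frame
const header = wavHeader(32000, STT_RATE);
```

With a 16000 header on the same bytes, a decoder would play the audio at twice the speed, which fits the observation that the wrong header mangles words rather than blanking them.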

FIX 7 — Quiet-line gain floor

:319 rms>12→40, :274 gain cap 80→20. Stops amplifying the quiet Telnyx line-hiss into junk buffers that whisper returns empty for.
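A sketch of how an RMS floor and a gain cap interact, assuming 16-bit samples. The thresholds 40 and 20 come from the fix above; TARGET_RMS, rmsOf, and normalize are hypothetical, and the real receiver may compute gain differently.

```javascript
const RMS_FLOOR = 40;    // was 12: buffers at or below this are treated as line hiss
const GAIN_CAP = 20;     // was 80: upper bound on amplification
const TARGET_RMS = 3000; // hypothetical loudness target, not from the source

// Root-mean-square level of a buffer of PCM samples.
function rmsOf(samples) {
  if (samples.length === 0) return 0;
  let sumSq = 0;
  for (const s of samples) sumSq += s * s;
  return Math.sqrt(sumSq / samples.length);
}

// Returns null for hiss-level buffers, otherwise a gain-adjusted copy.
function normalize(samples) {
  const rms = rmsOf(samples);
  if (rms <= RMS_FLOOR) return null; // do not send junk to whisper
  const gain = Math.min(GAIN_CAP, TARGET_RMS / rms);
  return Int16Array.from(samples, (s) =>
    Math.max(-32768, Math.min(32767, Math.round(s * gain)))
  );
}
```

With the old values, a buffer at RMS 20 would pass the gate and be amplified up to 80×; here it is dropped before it reaches transcription.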

VERIFY

FIX 8 — STT supervision

61 ECONNREFUSED errors suggest ragged restarts. Confirm that launchd actually supervises the STT process (it was once an orphan).

? Why 3 concurrent requests happened

The receiver doesn't receive 3 requests; it manufactures them. It chops your continuous speech into chunks, and on its 250ms tick it flushes a chunk to transcription whenever it hears a pause. But each transcription takes 14-16 seconds on this hardware. So while request #1 is still inferring, the tick fires #2 and #3 for your next words. The old inflightSTT<3 allowed all three; the single-threaded server queued them, the queued ones hit the 20s timeout, and the dropped sockets crashed it.
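The queueing arithmetic can be checked with a small simulation of a single-worker FIFO server. This is a simplified model with hypothetical arrival times; it uses 15s as a midpoint of the 14-16s inference time and does not model the crash itself.

```javascript
// Each request starts when the server is free and runs for serviceMs.
function simulateFifo(arrivalsMs, serviceMs, timeoutMs) {
  let serverFreeAt = 0;
  return arrivalsMs.map((arrival) => {
    const start = Math.max(arrival, serverFreeAt);
    const finish = start + serviceMs;
    serverFreeAt = finish;
    // The client gives up if the total wait exceeds its timeout.
    return { arrival, finish, timedOut: finish - arrival > timeoutMs };
  });
}

// Three requests sent 2s apart (example spacing), 15s inference, 20s timeout.
const timeline = simulateFifo([0, 2000, 4000], 15000, 20000);
console.log(timeline);
```

Request #1 finishes at 15s, inside the timeout. Request #2 cannot start until then and would finish at 30s, a 28s wait, so the client abandons it at the 20s mark; #3 fares worse. Any request that has to queue behind a full inference blows the timeout.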

Deeper root (next optimization): 14-16s/transcription is slow. Serializing stops the crash, but truly snappy voice needs a faster STT path — tiny/int8 whisper, streaming partials, Metal/GPU acceleration, Apple on-device Speech, or a cloud STT for the call leg.

✕ Rejected as red herrings

"8k-vs-16k header is the cause of empties." Refuted by a live model test — a genuine-8k buffer transcribes non-empty at both header rates. A wrong header mangles words; it doesn't blank them.

"vad_filter / no_speech_threshold rejects real speech." Refuted — vad_filter=False already; loud buffers empty because the server is crashed, not rejected.

"Permanent one-way dark latch from the echo-gate." Refuted — capture demonstrably recovers between storms. The dark is episodic per crash, not a latch. The echo-gate is an amplifier, not the root.