Thesis: A conversational interface is a severe operational bottleneck — if the agent is generating text it cannot simultaneously execute code, monitor, or answer an incoming call. A chat UI inherently restricts intelligence to one task at a time. XEN is a different architecture entirely: an OS-level autonomous intelligence governed by rigid engineering constraints (the Commandments), not a UI.
- Cmd 0 — Parallel, sequential is blasphemy. The sequential pipeline is discarded; logic shatters into a 3D mesh network processing many streams simultaneously, all building at once.
- Cmd 4/5 — Strict autonomy. Any human-in-the-loop requirement is a systemic failure. The agent never enters standby; it runs continuously. 24/7 scale without intervention demands abandoning the conversational interface for a headless parallel execution environment.
A continuously-active intelligence makes metered SaaS an existential threat: thousands of audio minutes/month × per-minute API pricing = cascading, unsustainable cost (the "hidden architectural tax"). Therefore:
- Banned core-runtime voice: OpenAI Realtime, Gemini Realtime, any metered cloud voice as the runtime.
- Banned transport: Twilio-direct (third-party per-minute dependency).
- Sanctioned escape: flat-rate structure — Telnyx + FreeSWITCH (flat channel billing, ~unlimited inbound), self-hosted/free STT+TTS. Pivot from per-event metered logic to flat-rate to process thousands of events simultaneously.
- A flat-rate pub/sub architecture across the whole system, built on a central data broker (VEI / VVSVEI) using streams (Redis Streams).
- A passive tap ("bunker") intercepts raw OS events, sorts them into discrete lanes, and reads passively — never in the data path.
- Each piece of information gets a strict classification tag (in: / typed: / aua: / sms: / bee: / omi: / call: / xen-out, etc.), enforcing clean routing.
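The tag-based routing described above can be sketched in a few lines. This is a minimal in-memory illustration, not the actual VEI broker: the function names `classify` and `route` are hypothetical, only the colon-suffixed tags listed above are included, and a real deployment would presumably append each payload to a per-lane Redis stream instead of a Python list.

```python
from collections import defaultdict

# Tag prefixes taken from the list above; the full production set is not shown there.
KNOWN_TAGS = ("in:", "typed:", "aua:", "sms:", "bee:", "omi:", "call:")


def classify(message: str) -> tuple[str, str]:
    """Split a tagged message into (lane, payload). Untagged text is refused."""
    for tag in KNOWN_TAGS:
        if message.startswith(tag):
            return tag.rstrip(":"), message[len(tag):].strip()
    raise ValueError(f"untagged message refused: {message!r}")


def route(messages: list[str]) -> dict[str, list[str]]:
    """Sort messages into discrete lanes keyed by their classification tag."""
    lanes: dict[str, list[str]] = defaultdict(list)
    for message in messages:
        lane, payload = classify(message)
        lanes[lane].append(payload)
    return dict(lanes)
```

Refusing untagged input at the classifier, rather than guessing a lane, is one way to keep raw text from ever reaching a consumer.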
Related canon: [[the-one-thing-since-2025]], [[auto-flow-bpgs-canon]], [[vvsvei-remote-relay]], [[inject-pipeline-and-autonomous-txn-canon]], [[universal-credentials-canon]], [[who-its-for-cognitive-accessibility]].
With input decoupled, the agent's processing is also split into distinct strengths:
- Fast-Responder — a low/cheap model (Sonnet-low) built for SPEED. Monitors incoming data streams and gives immediate acknowledgment across BOTH text and voice, ensuring zero lag in communication. Handles the conversation only.
- Opus Hands (Exodus) — the heavier intelligence behind the responder. While the responder handles acks, the Opus Hands simultaneously execute deep, long-running terminal commands and manage the background OS workload.
- The two run at the same time, fully decoupled: the ack never waits on the hands; the hands never bottleneck the voice. This is the realization of Cmd 0/10 — voice shall not bottleneck, parallel always.
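A minimal sketch of that decoupling, assuming an asyncio-style runtime (the source does not say how the two models are actually scheduled). `fast_ack` and `slow_hands` are hypothetical stand-ins for the Fast-Responder and the Opus Hands; the sleep stands in for long-running terminal work.

```python
import asyncio


async def fast_ack(event: str, log: list[str]) -> None:
    """Immediate acknowledgment; does no heavy work."""
    log.append(f"ack:{event}")


async def slow_hands(event: str, log: list[str], work_seconds: float) -> None:
    """Stand-in for deep, long-running execution."""
    await asyncio.sleep(work_seconds)
    log.append(f"done:{event}")


async def handle(event: str, log: list[str], work_seconds: float = 0.05) -> None:
    # Start the heavy work in the background first...
    hands = asyncio.create_task(slow_hands(event, log, work_seconds))
    # ...then acknowledge right away; the ack never awaits the hands.
    await fast_ack(event, log)
    # Awaited here only so the demo exits cleanly.
    await hands


def run_demo() -> list[str]:
    log: list[str] = []
    asyncio.run(handle("build", log))
    return log
```

In this sketch the acknowledgment is logged before the background work completes, regardless of how long the work takes.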
Two highly-active models reacting to the same input would collide. To prevent overlapping audio, a unified queue router acts as a system governor: it forces ALL audio output through a single checkpoint, so only one voice ever speaks (fast-responder + Opus hands both funnel through the one queue → /tmp/xen-say-queue → xen-say-worker).
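The single-checkpoint idea can be illustrated with one queue and one consumer. This is a sketch under assumptions: it uses an in-memory `queue.Queue` rather than the real `/tmp/xen-say-queue`, whose format is not described here, and `start_say_worker` is a hypothetical name. Appending to `spoken` stands in for audio playback.

```python
import queue
import threading


def start_say_worker(say_queue: queue.Queue, spoken: list) -> threading.Thread:
    """Start the only consumer of the queue, so utterances can never overlap."""

    def worker() -> None:
        while True:
            item = say_queue.get()
            if item is None:  # sentinel: shut the worker down
                break
            source, text = item
            spoken.append((source, text))  # stand-in for playing audio

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def demo() -> list:
    say_queue: queue.Queue = queue.Queue()
    spoken: list = []
    worker = start_say_worker(say_queue, spoken)
    # Both producers funnel through the same queue.
    say_queue.put(("fast-responder", "On it."))
    say_queue.put(("opus-hands", "Build finished."))
    say_queue.put(None)
    worker.join()
    return spoken
```

Because exactly one worker drains the queue, ordering and mutual exclusion come from the queue itself rather than from coordination between the two models.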
The risk, accepted by design: running multiple cognition streams in parallel invites data collisions and severe race conditions. XEN accepts that risk because splitting cognition is what preserves both the human-likeness needed for an always-open voice line AND the processing power needed for complex background logic. The router + single-queue checkpoint is the mitigation that makes the split safe to the listener.
The architecture must support an original hard constraint: the agent maintains a continuous, always-open audio line — on the order of 44,000+ inbound minutes/month. A traditional REST API that charges per inbound minute, applied to a *constant* connection, guarantees economic collapse. So inbound MUST be flat-rate (Telnyx channel billing / FreeSWITCH), not metered. This single number — 44k min/mo on an open line — is the reason the D.I.E. list bans per-minute voice/transport and why flat-rate is law, not optimization. The cost model, not the feature set, dictates the architecture.
A bridge carries data past the firewall into a LiveKit gateway, converting it to a WebRTC virtual loopback for output — ensuring all processing stays localized to the agent hardware (no cloud audio runtime). The key: this leverages Telnyx channel/capacity billing — instead of charging per individual minute, the provider charges a flat monthly rate for raw connection capacity (the trunk).
The transformation: combine a flat-rate capacity trunk (Telnyx) with localized WebRTC processing (LiveKit, on-device) → the system moves from a variable API tax into a fixed, highly-scalable infrastructure cost. That is the escape from the metered trap: capacity, not consumption.
The agent executes parallel tasks across multiple machines and must monitor its own health without sending internal execution data to any third-party collector (privacy + no SaaS dependency). So the system emits OpenTelemetry (OTLP) traces, metrics, and logs inward through a secure Tailscale tunnel. The remote streams from every node merge into a single pipeline feeding a durable OpenObserve store, scoped strictly to a private, localized port. Self-aware, fully local telemetry — the agent watches itself, nothing leaves the tailnet.
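One way to enforce "nothing leaves the tailnet" at configuration time is to refuse any OTLP endpoint outside Tailscale's IPv4 range (100.64.0.0/10). The sketch below is an assumption about how such a guard could look, not the system's actual code: `is_tailnet_endpoint` and `otlp_env` are hypothetical helpers, and the address and port in the usage note are placeholders. The two environment variables are the standard OpenTelemetry SDK settings.

```python
import ipaddress
from urllib.parse import urlparse

# Tailscale assigns node IPv4 addresses from the CGNAT block.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def is_tailnet_endpoint(url: str) -> bool:
    """True only if the URL's host is a literal IP inside the tailnet range."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host) in TAILNET_V4
    except ValueError:
        # Hostnames (e.g. MagicDNS names) are not resolved in this sketch.
        return False


def otlp_env(endpoint: str, service_name: str) -> dict[str, str]:
    """Build exporter settings, refusing any collector outside the tailnet."""
    if not is_tailnet_endpoint(endpoint):
        raise ValueError(f"refusing non-tailnet OTLP endpoint: {endpoint}")
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_SERVICE_NAME": service_name,
    }
```

For example, `otlp_env("http://100.101.102.103:5080", "xen-node-1")` returns the settings, while a public address such as `https://8.8.8.8:4318` raises.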
Sovereignty requires an air-gap-style approach to introspection: by isolating all self-monitoring on a private mesh network (Tailscale tailnet), the system keeps its execution state and internal secrets from ever being exposed on the public Internet (tailnet traffic may cross public networks, but only inside encrypted WireGuard tunnels). Introspection is powerful but a liability if exposed, so it is walled off entirely. The agent can see all of itself; the outside world sees none of it. Sovereignty = the system owns its own observability, on its own private wire.
A completely decoupled parallel architecture guarantees one absolute outcome: independent systems will eventually crash into each other. A real failure in XEN's communication streams — the SMS double-bug — was the live stress test. The fast-responder and the decoupled narrator both observed the *same* terminal trigger and reacted simultaneously (because tasks run in parallel). Both bypassed the standard logic flow and blasted raw, untagged text over the canonical tagged message, instantly breaking the unified queue. Proof that parallelism's price is collision — and why the governor/router + strict tagging exist.
Executing tasks autonomously is only the baseline. To be truly sovereign, the agent must possess the internal tooling to recognize, diagnose, and fix its OWN execution failures in real time. Terminal logs record the agent identifying a race condition and triggering a stabilization workflow autonomously — editing its own Python triggers and scripts to repair itself. Self-repair, not just self-operation, is the sovereignty bar. (Live example, 2026-06-16: the fast-responder single-flight lock was dropping qi's barge-in messages; the agent diagnosed it and rewrote the trigger to per-message concurrency so nothing drops.)
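The difference between a single-flight lock and per-message concurrency can be sketched as follows. The real trigger code is not shown in this document, so this is only a guess at its shape: `single_flight`, `per_message`, and `respond` are hypothetical names, and the sleep stands in for the responder's work.

```python
import asyncio


async def respond(message: str, handled: list[str], delay: float = 0.01) -> None:
    """Stand-in for the fast-responder handling one message."""
    await asyncio.sleep(delay)
    handled.append(message)


async def single_flight(messages: list[str], handled: list[str]) -> None:
    """Old shape: while one response is in flight, new messages are dropped."""
    lock = asyncio.Lock()

    async def trigger(message: str) -> None:
        if lock.locked():
            return  # a barge-in arriving now is silently lost
        async with lock:
            await respond(message, handled)

    await asyncio.gather(*(trigger(m) for m in messages))


async def per_message(messages: list[str], handled: list[str]) -> None:
    """New shape: every message gets its own task, so nothing drops."""
    await asyncio.gather(*(respond(m, handled) for m in messages))
```

With three messages arriving together, the single-flight version handles only the first, while the per-message version handles all three.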
In the SMS double-bug, the agent — bypassing the standard interface — executed commands to restart the OS-level services, with statements confirming the double-SMS was eliminated. Then, to resolve the *underlying* bug, it engineered a permanent architectural fix: rewriting the script as a strict content-deduplication checkpoint so the race condition can never recur.
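A content-deduplication checkpoint of the kind described could look roughly like this. It is a sketch, not the agent's actual script: the class name `DedupCheckpoint`, the five-second window, and hashing the whitespace-stripped text are all assumptions made for illustration.

```python
import hashlib
import threading
import time


class DedupCheckpoint:
    """Admit each distinct message body once per time window."""

    def __init__(self, window_seconds: float = 5.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._seen: dict[str, float] = {}
        self._lock = threading.Lock()  # check-and-record must be atomic

    def admit(self, text: str) -> bool:
        """Return True if the message should be sent, False if it is a duplicate."""
        key = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        with self._lock:
            now = self.clock()
            # Forget entries older than the window.
            self._seen = {
                k: t for k, t in self._seen.items() if now - t < self.window_seconds
            }
            if key in self._seen:
                return False
            self._seen[key] = now
            return True
```

The lock matters because the original failure was two senders reacting at once: without it, both could pass the duplicate check before either recorded the message.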
Standard LLM wrappers require human intervention to resolve internal failures. By enforcing architectural constraints — unified queue, decoupled logic, local observability, and self-repair — XEN repositions the agent from a chat tool into a sovereign autonomous controller: it diagnoses itself, fixes its own code, and stays alive without a human in the loop. That is the whole thesis: not a smarter chatbot — an operating system that governs itself.
The challenge, stated as a number: achieve 44,400 inbound voice minutes/month for under $15 total. Standard contact-center plans charge per minute inbound, which is a lethal trap: even 0.2¢/min × 44,400 min = $88.80, nearly six times the budget. So per-minute is disqualified by arithmetic, not taste.
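The arithmetic can be checked directly. The sketch below uses only the figures stated above (44,400 minutes, a $15 budget, an example rate of 0.2¢/min); the function names are hypothetical.

```python
from decimal import Decimal

MINUTES_PER_MONTH = Decimal(44_400)
BUDGET_USD = Decimal("15")


def metered_cost_usd(cents_per_minute: Decimal) -> Decimal:
    """Monthly cost in dollars at a given per-minute rate in cents."""
    return MINUTES_PER_MONTH * cents_per_minute / Decimal(100)


def break_even_cents_per_minute() -> Decimal:
    """The highest per-minute rate, in cents, that still fits the budget."""
    return BUDGET_USD * Decimal(100) / MINUTES_PER_MONTH


cost = metered_cost_usd(Decimal("0.2"))   # 88.80 dollars
ceiling = break_even_cents_per_minute()   # roughly 0.034 cents per minute
```

A metered plan would need to charge under about 0.034¢/min to fit the budget, which is why only capacity-based billing qualifies.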
Proposed solution: a flat-rate residential/enterprise provider (Telnyx) routed through open-source FreeSWITCH.
The reviewer's critique (developer handed the blueprint to an AI and said "tear it apart" — brutal but invaluable):
- Agrees with Telnyx specifically — because it offers a channel/capacity feature: billing is a flat monthly charge for an open lane of communication, not a per-call fee. This bypasses the permanent tax trap.
- But demands refinements, not a rubber-stamp:
  - Start with an IP-authenticated trunk, NOT credentials. Don't connect with a username/password (it can be stolen or expire). Authorize the connection purely on the agent's static public IP address: the telecom provider just checks where the signaling originates. Cleaner, more secure, no rotating secret.
  - Beware the fragile chain. Real-time audio crosses many hops (Telnyx → FreeSWITCH → bridge → LiveKit → WebRTC loopback → local agent), and every network hop compounds latency and adds a brand-new failure mode. The setup collapses if any one link fails (a long, serial chain, not a resilient mesh). The build must minimize hops and harden each.
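The fragile-chain critique can be made concrete with a small model. Assuming each link fails independently, a serial chain's availability is the product of its links' availabilities and its latency is the sum of its links' latencies. The per-link figures in the usage note are illustrative placeholders, not measurements of this system.

```python
import math

# The hop sequence named in the critique: six nodes, five links between them.
HOPS = ["Telnyx", "FreeSWITCH", "bridge", "LiveKit", "WebRTC loopback", "local agent"]
LINKS = len(HOPS) - 1


def chain_availability(per_link: float, links: int = LINKS) -> float:
    """A serial chain is up only when every link is up (independence assumed)."""
    return per_link ** links


def chain_latency_ms(per_link_ms: list[float]) -> float:
    """End-to-end latency is the sum of the per-link latencies."""
    return math.fsum(per_link_ms)
```

For instance, five links at a hypothetical 99.9% each give roughly 99.5% end to end, and five links at a hypothetical 20 ms each add up to 100 ms. Removing a hop improves both numbers at once, which is the argument for minimizing hops before hardening them.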
Current status (honest): the Telnyx + FreeSWITCH PSTN→local-agent path is architecturally specced but not yet live. It remains a blueprint with the monitoring/routing capability defined, awaiting trunk provisioning. Blocked on qi's Telnyx creds + IP-trunk setup.
hitthe.link/xen-architecture · dictated by qi 2026-06-16 · 14 sections