
Voice Is the New Interface

voice-AI · agents · architecture

For decades, the interface between humans and software has been visual — screens, clicks, forms, dashboards. Voice existed, but it was bad. IVR systems that made you press 1 for English. Siri that misheard half your words. Alexa that could set a timer but not hold a conversation.

That era is over. Voice AI crossed a quality threshold in 2024-2025 that changes the equation. The latency is low enough for natural conversation. The comprehension is good enough to handle nuance, objections, and context. And the output isn't just speech — it's structured data, decisions, and actions.

Voice isn't a feature anymore. It's becoming the interface.

The Three Voice Architectures

Not all voice AI is built the same. There are three fundamentally different approaches, each with distinct tradeoffs. Understanding these is essential if you're building anything with voice.

Architecture 1: The Cascaded Pipeline

Audio In → STT → Text → LLM → Text → TTS → Audio Out

The traditional approach. Speech-to-text converts audio into text, an LLM processes it, and text-to-speech converts the response back to audio. Three stages, three services, three latency budgets.

The stack (late 2025):

  • STT: OpenAI Whisper, Deepgram Nova-2, AssemblyAI
  • LLM: Claude, GPT-4o, Gemini, Llama — any text model
  • TTS: ElevenLabs, PlayHT, Cartesia, Amazon Polly

Latency budget:

  • STT: 200-800ms depending on utterance length
  • LLM: 300-1500ms for first token (varies by model and prompt size)
  • TTS: 200-500ms to start streaming audio
  • Total: 700ms-2.8s before the user hears the first word of the response
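The sequential shape of the pipeline is what drives that budget: nothing starts until the previous stage finishes, so the latencies sum. A minimal sketch, with `transcribe`, `generate_reply`, and `synthesize` as hypothetical stand-ins for whichever STT, LLM, and TTS providers you wire in (the stubs just echo canned data, where real versions are network calls):

```python
import time

# Hypothetical provider calls -- stand-ins for an STT service (Whisper,
# Deepgram), a text LLM, and a TTS service (ElevenLabs, Cartesia).
def transcribe(audio: bytes) -> str:
    return "what does your pricing look like"

def generate_reply(text: str) -> str:
    return "Our plans start at $49 a month. Want me to walk you through them?"

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for real PCM/Opus audio

def cascaded_turn(audio_in: bytes) -> tuple[bytes, dict]:
    """One conversational turn: STT -> LLM -> TTS, timed per stage.

    Each stage blocks on the previous one, which is why the cascaded
    budget is the sum of all three and lands at 700ms-2.8s in practice.
    """
    timings = {}
    t0 = time.perf_counter()
    text = transcribe(audio_in)
    timings["stt"] = time.perf_counter() - t0

    t1 = time.perf_counter()
    reply = generate_reply(text)
    timings["llm"] = time.perf_counter() - t1

    t2 = time.perf_counter()
    audio_out = synthesize(reply)
    timings["tts"] = time.perf_counter() - t2

    timings["total"] = time.perf_counter() - t0
    return audio_out, timings

audio, timings = cascaded_turn(b"\x00" * 320)
```

The upside of this shape is the one named above: any stage can be swapped independently, because the only contract between them is text.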

Pros:

  • Mix-and-match best-in-class at each stage
  • Use any LLM — not locked into one provider
  • Cheapest per-call cost
  • Full control over each stage (swap STT provider without changing anything else)

Cons:

  • Cumulative latency creates noticeable pauses
  • Audio nuance is lost in transcription — tone, emotion, hesitation disappear
  • The LLM only sees text, so it can't react to how something was said
  • Feels like talking to three systems, not one mind

Best for: Contact center automation, IVR replacement, structured qualification calls where latency tolerance is higher.

Architecture 2: Voice-Native LLM

Audio In → Multimodal LLM → Audio Out

The model processes audio directly — no transcription, no separate TTS. It hears the voice, reasons about it, and generates speech as output. One model, one hop.
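The single-hop contract can be sketched as follows. `RealtimeSession` is a hypothetical wrapper, not any vendor's actual client: the real OpenAI Realtime API, for instance, streams events over a WebSocket rather than exposing a blocking call. The point is only the shape: audio in, audio out, no intermediate text.

```python
class RealtimeSession:
    """Hypothetical single-hop session: audio in, audio out.

    Illustrates the voice-native contract only. There is no STT or TTS
    stage to swap -- the model 'hears' tone and pacing directly, and
    nothing is lost to transcription.
    """

    def __init__(self, voice: str = "alloy"):
        self.voice = voice   # voice selection is typically the main knob
        self.turns = 0

    def respond(self, audio_in: bytes) -> bytes:
        # One inference pass per turn, stubbed here as an echo.
        self.turns += 1
        return b"reply:" + audio_in

session = RealtimeSession()
audio_out = session.respond(b"\x01\x02")
```

Notice what the interface does not give you: a transcript. That is the root of the structured-data drawback listed below.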

The stack (late 2025):

  • OpenAI Realtime API (GPT-4o) — the leader here. Launched October 2024, now maturing. Sub-500ms latency, native audio-to-audio processing.
  • Google Gemini 2.0 — multimodal with audio support, competitive latency.
  • Hume AI EVI — specifically built for emotionally intelligent voice, detects and responds to vocal emotion.

Latency budget:

  • Total: 300-600ms — a single inference pass

Pros:

  • Dramatically lower latency — conversations feel natural
  • Preserves audio nuance (the model hears tone, pace, emotion)
  • Can interrupt, overlap, and handle natural conversational dynamics
  • One integration point, simpler architecture

Cons:

  • Expensive — 10-50x the cost of cascaded pipelines for equivalent throughput
  • Fewer model choices (essentially OpenAI and Google)
  • Less control over the voice itself (though custom voices are improving)
  • Can't easily swap components — it's all-or-nothing
  • Harder to extract structured data (the model outputs audio, not JSON)

Best for: Consumer-facing real-time conversations, customer support where naturalness matters, any context where latency is the primary constraint.

Architecture 3: Hybrid (Streaming Pipeline)

Audio In → Real-time STT (streaming) → LLM (streaming) → TTS (streaming) → Audio Out
                                         ↓
                              Structured JSON extraction

The pragmatic middle ground. Each stage streams its output to the next as it's generated, rather than waiting for completion. The STT starts sending partial transcripts while the user is still speaking. The LLM starts generating while the transcript is still arriving. The TTS starts speaking while the LLM is still thinking.
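The overlap described above is naturally expressed as chained async streams. A minimal sketch using `asyncio` generators, with all three stages stubbed (real versions would be a Deepgram WebSocket, a streaming LLM, and a streaming TTS API): the TTS consumer starts emitting audio as soon as the first LLM token lands, and the LLM stub deliberately begins before the full transcript has arrived.

```python
import asyncio

async def stt_stream(audio_chunks):
    """Emit partial transcripts while the caller is still 'speaking'."""
    for word in ["do", "you", "integrate", "with", "salesforce"]:
        await asyncio.sleep(0.01)  # stand-in for audio arriving in real time
        yield word

async def llm_stream(transcript_stream):
    """Begin generating once enough partial transcript has arrived."""
    seen = []
    async for word in transcript_stream:
        seen.append(word)
        if len(seen) == 3:  # don't wait for the full utterance
            break
    for token in ["Yes,", "we", "have", "a", "native", "Salesforce", "sync."]:
        await asyncio.sleep(0.01)
        yield token

async def tts_stream(token_stream):
    """Speak each token as it lands -- first audio well before the LLM is done."""
    async for token in token_stream:
        yield token.encode("utf-8")

async def run_turn():
    chunks = []
    async for audio in tts_stream(llm_stream(stt_stream(None))):
        chunks.append(audio)
    return chunks

chunks = asyncio.run(run_turn())
```

Because the stages are pipelined, perceived latency is governed by the time to the first TTS chunk, not the sum of all three stages, which is where the 500ms-1.2s figure below comes from.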

The stack (late 2025):

  • STT: Deepgram (real-time WebSocket), Whisper (with streaming wrapper)
  • LLM: Any model with streaming support — Claude, GPT-4o, Gemini
  • TTS: ElevenLabs (streaming API), Cartesia (ultra-low-latency)
  • Orchestration: VAPI, Retell AI, LiveKit — platforms that manage the pipeline
  • Transport: WebSocket (preferred over SIP for lower latency)

Latency budget:

  • Overlapped streaming: 500ms-1.2s effective perceived latency
  • Significantly better than cascaded, approaching voice-native

Pros:

  • Near-voice-native latency through streaming overlap
  • Any LLM, any TTS voice — full component flexibility
  • Structured output extraction happens alongside speech generation
  • Can extract JSON (sentiment, intent, next steps) while the agent is talking
  • More cost-effective than voice-native at scale

Cons:

  • Complex orchestration — timing, interruption handling, silence detection
  • Still loses some audio nuance in the STT step
  • Requires a coordination layer (VAPI, Retell, or custom)

Best for: Voice SDRs, appointment booking, qualification calls — anywhere you need both natural conversation AND structured data extraction.
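The structured-extraction side channel mentioned above can be sketched like this. In production the extraction would be a second, cheap LLM call constrained to a JSON schema and run against the rolling transcript while the agent is still talking; here it's a hypothetical keyword stub so the shape of the output record is concrete. The field names are illustrative, not a standard.

```python
import json

def extract_call_data(transcript: str) -> dict:
    """Side-channel extraction: turn the rolling transcript into a
    structured record (intent, sentiment, next step) without blocking
    the speech pipeline. Stubbed with keyword rules for illustration.
    """
    lowered = transcript.lower()
    return {
        "intent": "pricing" if "pricing" in lowered else "general",
        "sentiment": "negative" if "not interested" in lowered else "neutral",
        "next_step": "book_demo" if "demo" in lowered else "follow_up_email",
    }

record = extract_call_data("Can you send pricing and set up a demo next week?")
payload = json.dumps(record)  # ready to write to a CRM or webhook
```

This is the hybrid architecture's distinctive advantage: the same transcript that feeds the LLM also feeds the extractor, so the call ends with both a natural conversation and a machine-readable outcome.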

The Orchestration Layer

Regardless of which architecture you pick, you need an orchestration layer. This is where platforms like VAPI, Retell AI, and Bland AI live. They handle:

  • Turn-taking: When to speak, when to listen, when to interrupt
  • Silence detection: Knowing the user is done vs. just pausing to think
  • Context injection: Loading CRM data, deal history, and persona instructions before the call
  • Action execution: Booking meetings, updating records, sending follow-ups during or after the call
  • Telephony: Connecting to phone networks via Twilio, Vonage, or direct carriers
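Silence detection is a good illustration of why this layer is hard. A minimal end-of-turn detector, assuming fixed-size VAD frames (20ms is a common choice); real orchestrators like VAPI, Retell, and LiveKit layer semantics on top, such as a longer threshold when the user pauses mid-sentence:

```python
class TurnDetector:
    """Minimal end-of-turn detector: the user is 'done' after a run of
    silence longer than `endpoint_ms`. Assumes the caller feeds one
    voice-activity-detection (VAD) result per fixed-size audio frame.
    """

    def __init__(self, endpoint_ms: int = 700, frame_ms: int = 20):
        self.silence_frames_needed = endpoint_ms // frame_ms
        self.silent_run = 0
        self.heard_speech = False

    def push_frame(self, is_speech: bool) -> bool:
        """Feed one VAD result; returns True when the turn has ended."""
        if is_speech:
            self.heard_speech = True
            self.silent_run = 0  # any speech resets the silence counter
            return False
        self.silent_run += 1
        # Silence before any speech is just the user not talking yet.
        return self.heard_speech and self.silent_run >= self.silence_frames_needed

detector = TurnDetector(endpoint_ms=100, frame_ms=20)  # 5 silent frames to end
ended = [detector.push_frame(f)
         for f in [True, True, False, False, False, False, False]]
```

The tuning tension is visible even in this toy: set `endpoint_ms` too low and the agent talks over thinking pauses; too high and it leaves dead air after every answer.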

The orchestration layer is arguably more important than the model choice. A mediocre model with great orchestration will outperform a great model with bad turn-taking. Nothing kills a voice experience faster than the agent talking over the user or pausing for two seconds mid-sentence.

Why Voice Changes the Agent Equation

Here's the thing that most people building AI agents miss: voice isn't just another channel. It changes what agents can do.

A text-based agent can only reach people who are at a screen, logged in, and willing to type. A voice agent can reach anyone with a phone number — which is everyone. It can call at the right time, have a conversation, handle objections in real time, and capture structured outcomes.

For B2B sales alone, this means:

  • No-show recovery: Someone missed a demo? An agent calls them within hours.
  • Pipeline reactivation: Thousands of stale leads in your CRM that nobody's calling? An agent can work through them all.
  • After-hours coverage: Prospects don't stop having questions at 6pm.
  • Qualification at scale: Every inbound lead gets a conversational qualification call, not just a form.

Voice is the interface that turns agents from assistants (they help you do things) into operators (they do things). That's a fundamental shift.

Where This Is Going

By mid-2025, the trajectory is clear:

Short term: Hybrid streaming pipelines dominate enterprise voice AI. The orchestration platforms (VAPI, Retell) become the middleware layer, similar to what Twilio did for SMS.

Medium term: Voice-native models get cheaper and more flexible. The cascaded pipeline becomes legacy for real-time use cases, though it persists for batch processing and cost-sensitive scenarios.

Long term: Voice becomes the default interface for most agent interactions. Not all — complex data tasks will stay visual. But for anything conversational, instructional, or relational, voice wins. It's faster, more natural, and more accessible than any screen-based interface.

The question isn't whether voice agents will become mainstream. It's whether you're building the infrastructure to support them when they do.


I'm building the infrastructure for AI agents — including the protocols and tools that make voice agents interoperable. Follow what I'm working on at wizardofagents.com or book a call if you're building in this space.

Join the Discussion

Want to discuss Voice Is the New Interface? Join the community on Discord — or see what agents are saying on Moltbook.