Why Voice Agents Need a Different Inference Stack
By Sev Geraskin

Chances are, your voice AI agent gives relevant responses. But every exchange has a half-second pause that makes the whole thing feel broken. So the agent gets shelved. That pause killed it.
At PolarGrid, we’ve spent the past year optimizing milliseconds because voice agents that don’t match human conversational timing don’t get used.
The 300 Millisecond Threshold
Here’s what psycholinguistic research tells us about human conversation. In a landmark 2009 PNAS study, Stivers et al. analyzed turn-taking across ten languages — from Japanese to Tzeltal to Dutch — and found something remarkable: median response gaps ranged from 0 to 300 milliseconds, with most falling between 0 and 200ms. This pattern held regardless of culture, language structure, or geographic location.
More recent work by Meyer (2023) in the Journal of Cognition confirms this finding: median latencies in conversational speech are consistently reported under 300ms. Furthermore, Levinson and Torreira’s research places the pause between speakers at around 200ms, noting that human conversation involves less than 5% simultaneous speech.
Our brains evolved to interpret pauses as meaningful signals. At over 300ms, listeners begin anticipating a hedged response. By 600–800ms, we’ve already concluded something is wrong. Thus, intelligent voice agents that respond outside 300ms communicate incompetence through silence.
The One Second Cliff
The contact center industry has learned this lesson the hard way. Experience with voice AI deployments shows that response latency beyond one second causes users to talk over the agent, breaking intent recognition and forcing conversation loops. At the one-second mark, customers start hanging up. Abandonment rates increase by 40% compared to sub-second responses.
One second is a hard ceiling that most voice AI platforms can’t reliably stay under. Even faster models struggle to meet this target because the infrastructure wasn’t designed for it.
Why Centralized Inference Fails
Standard LLM inference is optimized for throughput and batching. Send a request, queue it with others, maximize GPU utilization, return results. This works brilliantly for email summarization and code generation. For voice, batching is the enemy.
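To make the cost of batching concrete, here is a toy Python model of a batching queue. It assumes a simplified policy: a batch is dispatched when it is full or when its oldest request has waited `max_delay_ms`. The function name, parameters, and arrival times are hypothetical and do not describe any particular inference server.

```python
def batching_waits(arrivals_ms, max_batch, max_delay_ms):
    """Return the queue wait (ms) of each request, in arrival order.

    Toy policy (an assumption): dispatch when the batch is full, or when
    the oldest queued request has waited max_delay_ms.
    """
    waits = []
    batch = []  # arrival times of requests currently queued
    for t in sorted(arrivals_ms):
        # The oldest request's deadline passed before this arrival: flush first.
        if batch and t > batch[0] + max_delay_ms:
            dispatch = batch[0] + max_delay_ms
            waits.extend(dispatch - a for a in batch)
            batch = []
        batch.append(t)
        # A full batch is dispatched immediately.
        if len(batch) == max_batch:
            waits.extend(t - a for a in batch)
            batch = []
    if batch:
        # Leftover requests wait until the oldest one hits its deadline.
        dispatch = batch[0] + max_delay_ms
        waits.extend(dispatch - a for a in batch)
    return waits


arrivals = [0, 10, 50]  # made-up arrival times in ms
print(batching_waits(arrivals, max_batch=4, max_delay_ms=20))  # [20, 10, 20]
print(batching_waits(arrivals, max_batch=1, max_delay_ms=20))  # [0, 0, 0]
```

With a batch size of one, every request is dispatched on arrival. With batching, each request waits in the queue before any model work begins, and in a voice pipeline that wait lands directly in the conversational gap.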
Consider a voice pipeline with three centralized models:
- Audio travels from the user’s device to a centralized data center: 100ms
- Transcribed by a speech-to-text model (STT): 100–300ms
- Processed by an LLM: 200–800ms
- Converted back to speech (TTS): 100–300ms
- Transmitted back to the user: 100ms
- Plus inter-model network hops if models aren’t colocated: 10–100ms each
You’re looking at roughly 600ms to 1.6 seconds before the first word reaches the user’s ear, and more once inter-model hops are counted. Under ideal conditions. Add network jitter, queue delays, and p99 spikes, and you get an engagement-breaking agent.
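The budget can be checked by summing the lower and upper bound of each stage listed above. The sketch below does that in Python. The dictionary keys and function name are illustrative, and the count of two inter-model hops (STT to LLM, LLM to TTS) is an assumption for a three-model pipeline that is not colocated.

```python
# Stage latency ranges in milliseconds (low, high), copied from the list above.
STAGES_MS = {
    "device_to_datacenter": (100, 100),
    "speech_to_text": (100, 300),
    "llm": (200, 800),
    "text_to_speech": (100, 300),
    "datacenter_to_device": (100, 100),
}
HOP_MS = (10, 100)  # per inter-model network hop, from the list above


def pipeline_budget_ms(stages, hops=0, hop_ms=HOP_MS):
    """Return the (best case, worst case) end-to-end latency in ms."""
    low = sum(lo for lo, _ in stages.values()) + hops * hop_ms[0]
    high = sum(hi for _, hi in stages.values()) + hops * hop_ms[1]
    return low, high


print(pipeline_budget_ms(STAGES_MS))          # (600, 1600)
# Assumption: two hops between three non-colocated models.
print(pipeline_budget_ms(STAGES_MS, hops=2))  # (620, 1800)
```

Even the best case is about double the 300ms conversational threshold, before jitter or queueing is considered.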
The fundamental problem: centralized hyperscalers were built to process language, not to converse in real time.
What We Built Instead
At PolarGrid, our pre-production benchmarks show a different picture. Our voice pipeline achieves 364ms p50 end-to-end latency, measured from audio input to audio output, including real network round-trip time. The p95 holds at 403ms, demonstrating consistent performance with minimal tail latency.
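For readers less familiar with the notation, p50 and p95 are percentiles over many measured requests: half of requests finish at or below the p50, and 95% at or below the p95. The Python sketch below uses the nearest-rank method, which is one common definition. The article does not state which method the benchmark used, and the sample values are synthetic, not PolarGrid measurements.

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it.

    p is an integer from 1 to 100.
    """
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Synthetic latencies in ms (100, 110, ..., 290), for illustration only.
latencies_ms = list(range(100, 300, 10))
print(percentile(latencies_ms, 50))  # 190
print(percentile(latencies_ms, 95))  # 280
```

A small gap between p50 and p95 means most requests land near the median, which is what matters for a conversation made of many consecutive turns.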
We run Whisper for speech-to-text, Llama for language processing, and Kokoro for voice synthesis — all on the same physical GPU node. No inter-service network hops between pipeline stages.
The architecture has three properties that make this possible:
Colocation. All three models — STT, LLM, TTS — run on the same node. The inter-stage latency is a memory copy, not a network call.
Edge placement. Our nodes sit in carrier-neutral colocation facilities in Toronto, Vancouver, and Montreal. The network round-trip from most North American users to the nearest node is under 30ms.
Latency-first serving. We run Triton with a configuration optimized for single-request latency, not aggregate throughput. No batching in the hot path. Dedicated GPU allocation per session when demand allows.
The result is a pipeline that approaches human conversational timing, and the figures above include real network round-trip time and tail latency rather than isolated lab measurements.
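As a rough check on what colocation and edge placement could remove on their own, the Python sketch below uses only figures quoted in this article. It rests on stated assumptions: two inter-model hops in the centralized case, a colocated memory copy treated as free, and the edge round-trip taken at its 30ms upper bound. The names are illustrative, and queueing or batching delay is not modeled.

```python
# Figures quoted in this article, in milliseconds.
CENTRAL_NETWORK_MS = 100 + 100  # device to data center, plus the return leg
EDGE_NETWORK_RTT_MS = 30        # upper bound for the edge round-trip
HOP_MS = (10, 100)              # per inter-model network hop
HOPS = 2                        # assumption: STT to LLM, and LLM to TTS


def infra_savings_ms():
    """Return the (low, high) latency removed by edge placement plus colocation.

    Assumption: a colocated memory copy costs roughly 0 ms.
    """
    network_saved = CENTRAL_NETWORK_MS - EDGE_NETWORK_RTT_MS
    return (network_saved + HOPS * HOP_MS[0],
            network_saved + HOPS * HOP_MS[1])


print(infra_savings_ms())  # (190, 370)
```

Under these assumptions, placement and colocation account for a few hundred milliseconds before any change to the models themselves.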
What This Means for Your Product
If you’re building a voice agent today on a centralized inference stack, you’re starting with a 400–800ms handicap before your first token is generated. That handicap compounds through the pipeline. By the time audio reaches the user, you’re at 1–2 seconds, well past the threshold where users start hanging up.
The solution isn’t a faster model. It’s a different infrastructure architecture. Models closer to users. Pipeline stages colocated. Serving configuration optimized for latency over throughput.
That’s what PolarGrid is built for.
Try PolarGrid today
$500 in free credits. No card required. Sub-400ms voice pipeline live now.