Why Voice Is the Hardest Problem in AI

This article appears in Voxworks' 2026 State of AI Voice report. We wrote it for executives who need to understand the engineering constraints of generating real-time speech, and why you cannot simply assemble an AI voice application from off-the-shelf components and deploy it safely in front of customers.

If you have used ChatGPT or any other text-based AI, you might assume an AI voice agent is just a chatbot with a microphone and speaker bolted on. That approach will work, but you will wait a few seconds for every response, which hardly qualifies as natural conversation.

Voice is a fundamentally harder problem than chat because you are optimising not only the content of an LLM's output but also its timing. Suddenly you have crossed into the technical domains of signal processing, systems theory, and telecommunications networking.

Quite often we speak with AI developers who walk into a voice project thinking it will be simple, then reach out asking why their voice agent is slow or buggy. The unfortunate reality is that there is no silver bullet. AI voice agents are finely tuned complex systems riddled with feedback loops, and there is no point tweaking one component in isolation. You need a whole-of-system approach.

So if you are about to embark on a voice project yourself, or evaluate AI voice vendors with any rigour, you need to understand the fundamental engineering constraints.

The Basic Cascading Pipeline

At its most basic level, when you speak to an AI voice agent, data flows through three separate AI models that generate their outputs in sequence:

Speech recognition (ASR or STT). Your voice is converted to text. The system has to figure out what you said despite background noise, accents, crosstalk, poor phone-line quality, and inconsistent diction.
Language model reasoning (LLM). The text goes to a large language model, which figures out what you meant, decides what to do about it, and generates a response. This is the "thinking" step.
Speech synthesis (TTS). The response text gets converted back into spoken audio. The system has to produce something that sounds human, with appropriate pacing, intonation, and emphasis.

This is clearly an oversimplification. Any time the pipeline is interrupted, we need to reset the flow, pick up midway through a flow, or reroute to a completely separate flow. But it is still the right place to start if you want to build a picture of system latency before diving into the surrounding complexity.
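To make the sequential nature of the pipeline concrete, here is a minimal Python sketch. The three stage functions (recognise_speech, generate_reply, synthesise_speech) are hypothetical stubs that only pass text around; they stand in for real models and are not any vendor's API. The point is the shape: each stage waits for the previous one to finish, so the time for a turn is the sum of the three stage times.

```python
import time
from dataclasses import dataclass, field


# Hypothetical stand-ins for real models. Each one only transforms its input
# so the sketch can run without any external service.
def recognise_speech(audio: bytes) -> str:
    """ASR/STT stub: pretend the audio bytes are UTF-8 text."""
    return audio.decode("utf-8")


def generate_reply(transcript: str) -> str:
    """LLM stub: echo the transcript back as the 'response'."""
    return f"You said: {transcript}"


def synthesise_speech(reply: str) -> bytes:
    """TTS stub: pretend the encoded reply text is audio."""
    return reply.encode("utf-8")


@dataclass
class TurnResult:
    audio: bytes
    stage_seconds: dict = field(default_factory=dict)

    @property
    def total_seconds(self) -> float:
        # In a strictly sequential pipeline the stage times simply add up.
        return sum(self.stage_seconds.values())


def run_turn(audio_in: bytes) -> TurnResult:
    """Run one conversational turn through ASR -> LLM -> TTS, timing each stage."""
    stages = [
        ("asr", recognise_speech),
        ("llm", generate_reply),
        ("tts", synthesise_speech),
    ]
    timings = {}
    data = audio_in
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)  # the next stage cannot start until this returns
        timings[name] = time.perf_counter() - start
    return TurnResult(audio=data, stage_seconds=timings)


result = run_turn(b"book a table for two")
print(result.audio)
print(result.stage_seconds, result.total_seconds)
```

With real models in place of the stubs, the per-stage timings recorded here are the starting point for reasoning about where a turn spends its time. The sketch deliberately ignores interruptions and rerouting, which are exactly the complications noted above.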

The Latency Budget

Humans are remarkably sensitive to conversational timing. Research shows the typical gap between one person finishing a sentence and another starting their response is about 200 ms. Most people cannot consciously perceive a gap that short, but their brains track it.
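One way to picture a latency budget is to add up the time each stage takes and compare the total with the roughly 200 ms human gap. The sketch below does only that arithmetic. The function name budget_report and the stage figures in the example are illustrative assumptions, not measurements from any real system, and treating 200 ms as a reference point is our simplification.

```python
# Reference point: the typical human turn-taking gap cited in the text.
HUMAN_GAP_MS = 200.0


def budget_report(stage_ms: dict, reference_ms: float = HUMAN_GAP_MS) -> dict:
    """Sum per-stage latencies and compare the total with a reference gap."""
    total = float(sum(stage_ms.values()))
    return {
        "total_ms": total,
        "reference_ms": reference_ms,
        # How far the response lags behind the reference gap (never negative).
        "overshoot_ms": max(0.0, total - reference_ms),
        # How many times longer than the reference gap the response takes.
        "ratio": total / reference_ms,
    }


# Invented placeholder values, chosen only to make the arithmetic visible.
example_stage_ms = {"asr": 300, "llm": 700, "tts": 400, "network": 100}

report = budget_report(example_stage_ms)
print(report)
```

Whatever the real numbers are for a given system, the structure is the same: every stage and every network hop draws from a single shared budget, so time saved in one place can be lost again in another.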
