This article appears in Voxworks' 2026 State of AI Voice report, now available here. We wrote it for executives who need to understand the engineering constraints of generating real-time speech, and why you cannot just assemble an AI voice application from off-the-shelf components and safely deploy it in front of customers.
If you have used ChatGPT or any text-based AI, you might assume an AI voice agent is just a chatbot with a microphone and speaker bolted on. That will work, but you will be waiting a few seconds for every response, which does not exactly qualify as a natural conversation.
Voice is a fundamentally harder problem than chat because you are not only optimising the content of an LLM's output, you also care about the timing of that output. Suddenly you have crossed into the technical domain of signal processing, systems theory, and telecommunications networking.
Quite often we speak with AI developers who walk into a voice project thinking it will be simple, then reach out asking why their voice agent is slow or buggy. The unfortunate reality is that there is no silver bullet. AI voice agents are finely tuned complex systems riddled with feedback loops, and there is no point tweaking one component in isolation. You need a whole-of-system approach.
So if you are about to embark on a voice project yourself, or evaluate AI voice vendors with any rigour, you need to understand the fundamental engineering constraints.
The Basic Cascading Pipeline
At its most basic level, an AI voice agent passes data through three interconnected AI models. When you speak to it, those models generate outputs in sequence:
Speech recognition (ASR or STT). Your voice is converted to text. The system has to figure out what you said despite background noise, accents, crosstalk, poor phone-line quality, and inconsistent diction.
Language model reasoning (LLM). The text goes to a large language model, which figures out what you meant, decides what to do about it, and generates a response. This is the "thinking" step.
Speech synthesis (TTS). The response text gets converted back into spoken audio. The system has to produce something that sounds human, with appropriate pacing, intonation, and emphasis.
This is clearly an oversimplification. Any time the pipeline is interrupted, we need to reset the flow, pick up midway through a flow, or reroute to a completely separate flow. But it is still the right place to start if you want to build a picture of system latency before diving into the surrounding complexity.
The Latency Budget
Humans are remarkably sensitive to conversational timing. Research shows the typical gap between one person finishing a sentence and another starting their response is about 200ms. Most people cannot consciously perceive a gap that short, but their brain tracks it.
In our anecdotal experience, conversations can still feel natural at 500-1,000ms. At around 1,000ms or above, your subconscious starts telling you the conversation feels slow. Above 1,500ms, the caller is consciously aware and conversational dynamics start to break down.
So human biology dictates that your AI voice system has a latency budget of roughly 1,000-1,300ms from the moment the caller stops speaking to the moment they hear the first syllable of the response.
Traditional cascaded systems typically land at 1,000-2,000ms end-to-end latency based on the sum of the component parts. The engineering effort is overwhelmingly focused on compressing this pipeline, either by overlapping sequential components, by reducing the latency of individual components, or ideally both.
The main technique is streaming: rather than running STT, then LLM, then TTS as three sequential steps, a well-built system pipelines them as real-time data streams. LLM text chunks flow into TTS as they arrive, and audio frames are sent to the caller's phone immediately. The goal is for the caller to hear the first syllable of the response while the LLM is still generating the rest of the sentence. In practice, this is difficult to achieve while maintaining the illusion of natural speech.
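As a rough illustration of the LLM-to-TTS hand-off, the sketch below buffers streamed tokens and releases each sentence the moment it is complete, so synthesis can begin while the model is still generating. The function name and the punctuation-based boundary rule are assumptions made for this example, not a description of any particular vendor's pipeline. A production chunker would also need to handle abbreviations, numbers, and responses that never reach a full stop.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as the token stream contains it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            yield sentence
        # The last part may still be mid-sentence, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever is left when the LLM stream ends.
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Sure", ", I", " can", " help", ".", " What", " is", " your", " postcode", "?"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)  # each sentence would be handed to TTS here
```

In this sketch the first sentence is released after six of the ten tokens, which is the point at which a TTS engine could start speaking.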
In the processing flow, the LLM step consumes the most time and is also the most volatile. Choosing an LLM with fast inference matters more than almost any other architectural decision. Time-to-first-token, the delay before the model produces the first token of its response, accounts for more than half of total latency in most voice pipelines and is the typical benchmarking metric used by model providers. But what you actually care about is time-to-first-sentence, since that is the minimum input most TTS models require to voice a sentence naturally rather than speaking in disjointed words.
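The distinction between the two metrics can be made concrete with a small measurement sketch. It assumes you have logged each streamed token with the milliseconds elapsed since the request was sent. The function name, the punctuation rule for "end of sentence", and the sample timings are all illustrative assumptions, not measurements from a real model.

```python
from typing import Iterable, Optional, Tuple


def first_token_and_sentence_ms(
    events: Iterable[Tuple[float, str]],
) -> Tuple[Optional[float], Optional[float]]:
    """events are (ms since request sent, token text) pairs in arrival order.

    Returns (time-to-first-token, time-to-first-sentence); either may be None.
    """
    ttft: Optional[float] = None
    ttfs: Optional[float] = None
    for elapsed_ms, token in events:
        if ttft is None:
            ttft = elapsed_ms  # the first token of any kind
        # Assumed rule: the first sentence ends at the first token ending in . ! or ?
        if token.rstrip().endswith((".", "!", "?")):
            ttfs = elapsed_ms
            break
    return ttft, ttfs


# Illustrative timings only.
sample = [(300, "Sure"), (320, ","), (340, " I"), (360, " can"), (380, " help"), (400, ".")]
ttft, ttfs = first_token_and_sentence_ms(sample)
print(ttft, ttfs)
```

With these made-up timings the model looks like a 300ms model on a time-to-first-token benchmark, but TTS cannot start on a full sentence until 400ms. The gap widens with longer opening sentences and slower token rates.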
Engineers who have built these systems understand that the difference between a naive sequential pipeline and a properly streamed one is the difference between 1,600ms and 800ms. This tiny sliver of time also happens to be the gap between feeling like you are talking to a machine versus feeling like a natural conversation. In effect, signal-processing engineering becomes the core differentiator of a workable voice product.
Turn Endpointing
Turn endpointing is the system's ability to distinguish between when the caller has finished their sentence and when they have merely paused for half a second to think about what they want to say next. It is usually built on Voice Activity Detection (VAD), which flags whether each slice of audio contains speech. Get it wrong in one direction and the AI jumps in too early, cutting the caller off. Get it wrong in the other and there is an awkward silence while the system waits for more input that is not coming.
Humans handle this subconsciously through a combination of prosodic cues, syntactic completion, and pragmatic context. AI voice systems have to approximate all of that from an audio signal, in real time, over a phone line that may contain compression artefacts and background noise.
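The simplest form of endpointing is a silence timeout, sketched below to show why the tuning is a tradeoff. It assumes an upstream detector has already labelled each audio frame as speech or non-speech. The 20ms frame length and 600ms threshold are illustrative assumptions, and real systems layer prosodic and semantic cues on top of a rule like this.

```python
from typing import Iterable, Optional

FRAME_MS = 20          # assumed frame length
END_SILENCE_MS = 600   # assumed threshold; this is the value being tuned


def end_of_turn_ms(
    is_speech_frames: Iterable[bool],
    frame_ms: int = FRAME_MS,
    end_silence_ms: int = END_SILENCE_MS,
) -> Optional[int]:
    """Return the time (ms from stream start) at which the turn is declared over.

    Returns None if the caller never spoke or never paused for long enough.
    """
    heard_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(is_speech_frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= end_silence_ms:
                return (i + 1) * frame_ms
    return None


# 1s of speech, a 500ms thinking pause, 1s more speech, then silence.
frames = [True] * 50 + [False] * 25 + [True] * 50 + [False] * 40
print(end_of_turn_ms(frames))                      # waits for the real end of turn
print(end_of_turn_ms(frames, end_silence_ms=400))  # fires during the thinking pause
```

With the 600ms threshold the turn ends at 3,100ms, which includes 600ms of dead air charged to the latency budget. Dropping the threshold to 400ms recovers some of that time but ends the turn at 1,400ms, in the middle of the caller's pause.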
Interruptions
People interrupt each other constantly in normal conversation. They also backchannel: "uh huh", "right", "yeah", spoken while the other person is still talking. These are signals that the listener is engaged in the social sense.
An AI voice system has to constantly differentiate between interruptions and backchannels in real time. When a caller genuinely interrupts, the system needs to stop talking immediately, process what the caller said, discard whatever it was about to say, and generate a new response.
This is called barge-in handling, and it is harder than just "stop and listen". The system has to cancel the in-flight LLM generation, tear down the TTS synthesis, and flush any buffered audio queued for playback without splicing words, all simultaneously. Miss any one of those and the barge-in feels broken: the caller hears a stray syllable from the old response, or there is a long pause while the system cleans up after itself.
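The three clean-up steps can be sketched with Python's asyncio. The Turn class, its attribute names, and the fake LLM and TTS coroutines are hypothetical stand-ins for this example. A real system would also have to stop audio already handed to the telephony layer, which is not modelled here.

```python
import asyncio
from typing import Optional, Tuple


class Turn:
    """Hypothetical holder for one agent response that is still in flight."""

    def __init__(self) -> None:
        self.llm_task: Optional[asyncio.Task] = None
        self.tts_task: Optional[asyncio.Task] = None
        self.playback = asyncio.Queue()  # audio frames waiting to be played

    async def barge_in(self) -> None:
        tasks = [t for t in (self.llm_task, self.tts_task) if t is not None]
        # Steps 1 and 2: request cancellation of generation and synthesis together.
        for task in tasks:
            task.cancel()
        # Step 3: drop buffered audio so no stray syllable is played.
        while not self.playback.empty():
            self.playback.get_nowait()
        # Wait until both tasks have actually finished unwinding.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> Tuple[int, bool, bool]:
    turn = Turn()

    async def fake_llm() -> None:
        await asyncio.sleep(10)  # stands in for a long generation

    async def fake_tts() -> None:
        while True:
            await turn.playback.put(b"\x00" * 320)  # stands in for an audio frame
            await asyncio.sleep(0.01)

    turn.llm_task = asyncio.create_task(fake_llm())
    turn.tts_task = asyncio.create_task(fake_tts())
    await asyncio.sleep(0.05)  # let some audio queue up
    await turn.barge_in()      # the caller interrupts
    return turn.playback.qsize(), turn.llm_task.cancelled(), turn.tts_task.cancelled()


print(asyncio.run(demo()))
```

The flush happens before the final await, so nothing can be added to the queue between cancellation and clean-up. This relies on the tasks not catching and ignoring the cancellation.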
The Mobile Network Problem
Speech-recognition benchmarks are measured on clean, curated audio with clear speakers in controlled conditions. The word-error rates you see on transcription vendor websites, typically 3-7%, reflect a best case. In production telephony, with speakerphone echo, background noise, low-bitrate codec compression, and unstable mobile signal, those numbers blow out to 15-25%.
In a three-minute call of roughly 450 words, even a 15-20% error rate means the system mistranscribes 70-90 words. More importantly, these errors cluster around the less predictable and therefore more important markers within the conversation.
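For reference, word-error rate is the word-level edit distance between what was said and what was transcribed, divided by the number of words actually spoken. The arithmetic above is simply 450 × 0.15 ≈ 68 and 450 × 0.20 = 90. The sketch below is a minimal implementation for checking a vendor's claims against your own call recordings. The function name and the lowercase, whitespace-split normalisation are assumptions of this example, and production scoring tools normalise text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate(
    "book a table for sixty people",
    "book a table for fifty people",
)
print(f"{wer:.1%}")
```

One wrong word in six is a 16.7% error rate, and in this example it is the only word that matters to the booking.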
When it comes to generating good output content, the biggest challenge by far is accurate transcription. So it pays to understand why running AI voice over telephony is so constrained.
Australian Telco Networks
In Australia, most residential and business NBN "landline" services default to the G.711 codec (A-law). This codec was designed in 1972 to fit human speech into the smallest possible digital pipe while remaining intelligible. It is why you hear that characteristic tinny sound on Australian phone lines.
Australian mobile networks primarily use the AMR-WB (Adaptive Multi-Rate Wideband) codec and the more modern EVS (Enhanced Voice Services) codec. While these are upgrades over the narrowband 2G and 3G era, they are still problematic for AI voice agents and transcription services because of how aggressively they compress audio to fit within mobile bandwidth constraints.
On a phone call, your audio signal quality defaults to the lowest common quality standard. It does not matter if your business uses HD voice over VoIP if the other end of the call is on a landline.
AI transcription models are trained on high-fidelity, uncompressed data. When you feed them landline or mobile audio, multiple issues arise:
Frequency clipping. Human speech contains subtle fricatives, such as "s", "f", "th", and "sh", whose energy often extends above 7 kHz. AMR-WB cuts these off at 7 kHz and G.711 cuts them at 3.4 kHz. An AI might easily confuse "sixty" with "fifty" because the high-frequency distinction between "s" and "f" has been digitally flattened.
Artefact aliasing. To save space, these codecs use lossy compression. They remove data the human ear will not miss, but AI models look for mathematical patterns. The metallic or bubbly artefacts introduced by compression can be interpreted by AI as background noise or actual speech, leading to hallucinations in the transcript.
Double compression. First the AI generates high-quality audio, often using Opus. Then the telephony gateway transcodes it to G.711. Then the mobile network transcodes it again to AMR-WB or EVS. Each hop degrades the quality further. Studies show that transcribing G.711 audio can result in a 10-20% drop in accuracy compared with high-definition wideband audio.
Packet loss. Australian mobile networks use frame error concealment to hide dropped packets. While this makes audio sound smooth to a human, it creates tiny timing shifts that can cause an AI agent to interrupt the user or fail to detect when the user has finished speaking.
Latency Theft
Australian telco networks were built for live human conversation where, as we learned earlier, the average person can respond within 200ms but tolerate delays of up to 1,000ms. That means somewhere deep within the network stack, providers have appropriated this excess latency budget for audio signal processing and buffering, helping maximise throughput over fixed infrastructure without most people noticing.
That does not work for AI voice, where you are already operating at the limit of the latency budget.
Comparing raw waveforms between sent and received data, we have observed firsthand how telco networks alter the audio packets in transit. The result is that Australian mobile networks absorb a minimum of 300-500ms of your latency budget out of the gate. On our estimates, the lowest latency AI voice can achieve over Australian telephony with a single LLM call is roughly 1,000ms.
The Speed vs Accuracy Tradeoff in Speech Recognition
There is a tension in speech recognition: the most accurate models cannot run in real time, and the ones that can run in real time are less accurate.
The models with the lowest published error rates achieve their accuracy by processing complete audio files with full context. They can look backwards and forwards across the entire recording. That is great for transcribing a podcast. It is useless for a live phone call where the system needs to produce a transcript within milliseconds of each word being spoken, with no ability to look ahead.
Streaming speech-recognition models sacrifice some accuracy for the speed that live conversation demands. They commit to a transcript word by word, in real time, with no opportunity to reconsider once they have moved on. If the model mishears "Melbourne" as "Melbin" in the first ten seconds of a call, that error enters the conversation history. The language model downstream receives the garbled transcript and has to work with it.
The standard AI chat interface is based on a user-assistant, turn-by-turn transcript, and the LLM predicts each new response from that transcript. If the transcript is wrong, everything downstream is wrong: the model's understanding, its response, its tool calls, and ultimately the outcome of the call. And the error compounds. By turn three or four, the language model is responding based on a conversation history that contains multiple uncorrected transcription mistakes, and the whole interaction starts to drift.
This is one of the least visible but most consequential quality differences between STT vendors. Ask your transcription provider what its real-world word-error rate is on Australian telephony audio, not the published benchmark on clean American English.
The LLM Reliability Problem
The language model sits at the centre of the pipeline and most AI voice systems treat it as a single point of trust. If the model is slow, the caller waits. If the model fails, the system retries with the same model. If the model produces an incorrect response, the caller hears it spoken with full confidence.
LLM APIs are not as reliable as most users assume. They experience queue delays during peak load, cold starts on first requests, occasional timeouts, and significant latency variance. The same prompt can return in 150ms one time and 800ms the next. Most voice AI platforms run on shared GPU pools serving thousands of customers simultaneously. Your latency at 2pm on a Tuesday is determined not by your own usage, but by total demand across the provider's customer base. For a voice conversation where any perceptible pause erodes trust, this variance is a serious operational risk.
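One common way to bound this variance, offered here as an illustration rather than something the article prescribes, is to give the primary model a hard deadline and switch to a different model instead of retrying the same one. In the sketch below, the function names, the fake providers, and the timeout value are all hypothetical. The real client calls would replace slow_primary and fast_fallback.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed client shape: an async function that takes a prompt and returns text.
LLMCall = Callable[[str], Awaitable[str]]


async def generate_with_fallback(
    prompt: str, primary: LLMCall, fallback: LLMCall, timeout_s: float
) -> str:
    """Use the primary model unless it misses the deadline or the connection fails."""
    try:
        # wait_for cancels the primary call if the deadline passes.
        return await asyncio.wait_for(primary(prompt), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return await fallback(prompt)


async def slow_primary(prompt: str) -> str:
    await asyncio.sleep(0.5)  # simulates a queue delay on a shared GPU pool
    return "primary: " + prompt


async def fast_fallback(prompt: str) -> str:
    return "fallback: " + prompt


async def demo() -> str:
    return await generate_with_fallback(
        "hello", slow_primary, fast_fallback, timeout_s=0.05
    )


print(asyncio.run(demo()))
```

The deadline itself comes out of the latency budget, so it has to be set against the fallback model's own response time rather than chosen in isolation.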
The error rates of LLMs are well understood and outside the scope of this article. Suffice it to say, enterprises typically deal with hallucinated outputs by layering multiple LLM calls: a first call generates the response, then a second call checks that response against a policy rubric. This introduces latency that is unacceptable in most real-time voice applications unless the rest of the architecture is already tightly optimised.
The intuitive solution is to use a slower, more powerful model with more reasoning capability. This does not reliably help. You cannot solve a context-awareness failure by spending more on inference. This is a structural limitation of single-model architectures and again underscores the almost impossible tradeoff between latency and accuracy.
The Australian Accent Problem
Most voice AI models are trained predominantly on American English data. The largest speech datasets are American because that is where the money and research labs have historically been. When these models encounter Australian English, their accuracy drops noticeably for the range of voices that actually call Australian contact centres.
The Australian accent is highly nuanced. Seemingly only Australians can detect fake Australian accents. Offshore model providers like ElevenLabs purport to provide Australian accents, but we still hear the artefacts, presumably because the engineers training those voices cannot.
The way Australians speak is also different from the way LLMs generate text. The vowel system is different. The way Australians shorten and modify words, such as "arvo" for afternoon, often does not appear in American training data. Place names are a constant source of errors. An ASR model that has never seen "Woolloomooloo" in its training data will either mishear it or ask the caller to repeat themselves, breaking conversational flow.
A TTS model that constantly mispronounces words will trigger the subconscious uncanny-valley effect, eroding customers' trust. This shows up in call outcomes. Every "sorry, could you repeat that?" adds seconds to handle time, erodes caller confidence, and increases the probability that the caller will ask for a human agent. Over thousands of calls, a few percentage points of accuracy difference between an American-tuned model and an Australian-tuned one compounds into a measurable gap in resolution rates and customer satisfaction.
The Data Sovereignty Question
When a caller speaks to a voice AI agent, their voice data travels through multiple systems: the telephony provider, the ASR model, the LLM, the TTS engine, and whatever logging and analytics infrastructure sits behind it all. Where those systems are physically located matters.
If any part of that pipeline routes through US servers, you have created a cross-border data transfer. Under the Privacy Act, that transfer is permissible only if the overseas recipient is subject to equivalent privacy protections, or you have obtained the individual's consent, or you have taken reasonable steps to ensure compliance. In practice, this means most Australian enterprises handling sensitive customer data in health, finance, or government either need their AI infrastructure hosted in Australia or need a robust legal framework around offshore processing.
There is also a latency cost. A voice call routed from Sydney to a US-based LLM inference server adds 150-200ms of round-trip network latency before the model even starts processing. That is up to a fifth of a 1,000ms latency budget gone to geography alone. For a voice application where every millisecond counts, that is unacceptable.
Australian-hosted infrastructure addresses both problems. Data stays onshore, which simplifies compliance. And the network latency between the telephony gateway and the AI system can drop to single-digit milliseconds, leaving more of the budget available for actual processing.
Why Quality Matters
The logical outcome of the uncanny-valley phenomenon is that small differences in the quality of your AI voice system can produce dramatically different results in the real world.
The minimum standard of low latency is only achievable today with the right systematic approach across a broad spectrum of disciplines: signal processing, latency under load, transcription accuracy across Australian accents and on Australian mobile networks, barge-in handling, and VAD tuning. This is before we even start talking about the actual content outputs of the LLM.
For AI voice systems deployed in Australian enterprises, serving Australian callers under Australian regulation, there is a measurable performance and compliance advantage to infrastructure, accent models, and regulatory frameworks that are purpose-built for this market. That advantage shows up in STT accuracy, call-resolution rates, compliance posture, and latency.