Voice Activity Detection (VAD) is the technology that tells AI when you've stopped talking.
In an AI voice call, VAD decides the moment to hand the conversation back to the AI. Get it wrong, and the AI either cuts you off mid-sentence or sits in awkward silence. Get it right, and users don't notice it at all, which is exactly the point.
Why VAD Makes or Breaks Conversation Flow
VAD does three things that matter:
1. Turn-Taking
VAD determines when you've finished speaking. It's the invisible traffic cop of conversation.
Too aggressive? The AI cuts you off mid-thought. Too conservative? The AI waits forever to respond.
Neither feels natural. Both frustrate callers.
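One common way to make this trade-off concrete is a silence timeout: the turn is treated as finished once speech is followed by a set amount of continuous silence. The sketch below assumes the VAD already produces one speech/non-speech flag per frame. The function name, the 20 ms frame size, and the 600 ms default are illustrative assumptions, not values from any particular system.

```python
def find_end_of_turn(speech_flags, frame_ms=20, silence_timeout_ms=600):
    """Return the frame index at which the turn is declared over, or None.

    speech_flags: per-frame booleans from any VAD (True = speech).
    The turn ends once speech has been heard and is then followed by
    silence_timeout_ms of continuous non-speech.
    """
    frames_needed = silence_timeout_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= frames_needed:
                return i
    return None


# A caller who pauses briefly (3 frames, about 60 ms) in the middle of a sentence.
flags = [False] * 5 + [True] * 10 + [False] * 3 + [True] * 10 + [False] * 40

aggressive = find_end_of_turn(flags, silence_timeout_ms=40)      # fires inside the pause
balanced = find_end_of_turn(flags, silence_timeout_ms=600)       # fires after the real ending
conservative = find_end_of_turn(flags, silence_timeout_ms=2000)  # never fires in this clip
```

With the 40 ms timeout the function returns frame 16, which is inside the mid-sentence pause: the caller gets cut off. With 600 ms it returns frame 57, after the caller has really finished. With 2000 ms it returns None, because the clip ends before enough silence has built up.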
2. Processing Efficiency
Without VAD, systems would process all audio—including dead air and background noise. VAD ensures only actual speech gets sent for transcription and AI processing.
That saves compute costs and reduces unnecessary work.
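The gating idea can be sketched in a few lines. Here `is_speech` is a hypothetical predicate standing in for whatever VAD is in use, and the peak-amplitude check is only a placeholder to make the example runnable.

```python
def gate_frames(frames, is_speech):
    """Keep only frames the VAD marks as speech; also report how many were skipped."""
    kept = [frame for frame in frames if is_speech(frame)]
    return kept, len(frames) - len(kept)


# Placeholder VAD (assumption): a frame counts as speech if any sample is loud enough.
def loud_enough(frame):
    return max(abs(s) for s in frame) > 100


frames = [[0, 0, 0], [900, -800, 700], [1, -1, 0]]
to_transcribe, skipped = gate_frames(frames, loud_enough)
```

Only the middle frame would be passed on to transcription; the two near-silent frames are dropped before any further processing.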
3. Latency Reduction
Good VAD detects speech endings quickly. The faster it knows you've stopped talking, the faster the AI can start responding.
Every millisecond counts in an AI voice call.
See also: What Is Latency in AI Voice Calls?
How VAD Actually Works
Energy-Based Detection
The simplest approach: speech is louder than silence.
- How it works: Compare audio energy to a threshold.
- Upside: Fast and computationally light.
- Downside: Fooled by loud background noise.
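A minimal sketch of the energy check, assuming frames are lists of 16-bit PCM sample values. The threshold of 500 is an arbitrary illustrative number; a real deployment would tune it to the microphone and noise floor.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(frames, threshold=500.0):
    """Flag each frame as speech when its RMS energy exceeds the threshold."""
    return [rms(frame) > threshold for frame in frames]


quiet_room = [10, -10] * 80        # low-level hiss
talking = [3000, -2500] * 80       # loud signal
flags = energy_vad([quiet_room, talking])
```

Note that the second frame is flagged purely because it is loud. A slammed door at the same level would be flagged too, which is the weakness described above.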
Zero-Crossing Rate
Speech crosses zero amplitude at characteristic rates: low for voiced sounds such as vowels, high for unvoiced sounds such as "s".
- How it works: Count zero-crossings per time window.
- Upside: Helps distinguish speech from noise.
- Downside: Not reliable on its own.
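Counting zero-crossings takes only a few lines. This sketch normalises the count to a rate between 0 and 1 so that it does not depend on frame length; that normalisation is a design choice made here, not something the method requires.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


alternating = zero_crossing_rate([1, -1, 1, -1, 1])  # sign flips on every sample
steady = zero_crossing_rate([1, 2, 3, 4])            # never crosses zero
```

The first call gives 1.0 and the second 0.0. Real audio falls somewhere in between, which is why the rate is normally used alongside energy rather than alone.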
Spectral Analysis
Speech has specific frequency patterns that noise doesn't.
- How it works: Analyse frequency content of audio.
- Upside: More accurate than energy alone.
- Downside: More computationally expensive.
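One simple spectral feature is the share of a frame's energy that falls inside a band where speech is concentrated. The sketch below assumes 8 kHz audio and uses the traditional 300 to 3400 Hz telephone voice band as that range; both are illustrative choices. It computes a naive DFT in pure Python to stay self-contained, where a production system would use an FFT library.

```python
import cmath
import math


def band_energy_ratio(frame, sample_rate=8000, low_hz=300.0, high_hz=3400.0):
    """Share of the frame's spectral energy between low_hz and high_hz."""
    n = len(frame)
    total = 0.0
    in_band = 0.0
    for k in range(1, n // 2):  # skip the DC bin, stop below Nyquist
        # Naive DFT for bin k: O(n) work per bin, O(n^2) per frame.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        power = abs(coeff) ** 2
        total += power
        if low_hz <= k * sample_rate / n <= high_hz:
            in_band += power
    return in_band / total if total > 0 else 0.0


def tone(freq_hz, n=160, sample_rate=8000):
    """Generate n samples of a pure sine tone (test signal only)."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


voice_band = band_energy_ratio(tone(1000))  # inside 300-3400 Hz
low_hum = band_energy_ratio(tone(100))      # below the band
```

A 1000 Hz tone puts essentially all of its energy in the band, while a 100 Hz hum puts essentially none there. The nested loop also shows where the extra computational cost comes from compared with a single energy sum.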
Machine Learning VAD
Modern systems use neural networks trained on millions of examples.
- How it works: Neural network classifies audio frames as speech or non-speech.
- Upside: Most accurate, handles complex scenarios.
- Downside: Requires more computation.
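The frame-in, label-out shape of a learned VAD can be illustrated with a deliberately tiny stand-in. The sketch below is not a trained neural network: it is a single logistic unit over two hand-computed features, and its weights are made up for illustration. A real model would learn far more parameters from labelled audio.

```python
import math

# Hypothetical, hand-picked parameters (assumption); a real model learns these.
WEIGHTS = (1.2, -4.0)  # (log-energy weight, zero-crossing-rate weight)
BIAS = -7.0


def features(frame):
    """Compute two simple features for one non-empty frame: log energy and ZCR."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / max(len(frame) - 1, 1)
    return (math.log(energy + 1.0), zcr)


def speech_probability(frame):
    """Map a frame's features to a score between 0 and 1 with a logistic unit."""
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, features(frame)))
    return 1.0 / (1.0 + math.exp(-z))


def ml_vad(frames, threshold=0.5):
    """Classify each frame as speech (True) or non-speech (False)."""
    return [speech_probability(frame) > threshold for frame in frames]


hiss = [5, -5] * 80  # quiet, sign flips on every sample
voiced = [3000 * math.sin(2 * math.pi * 200 * t / 8000) for t in range(160)]
labels = ml_vad([hiss, voiced])
```

The quiet, rapidly alternating frame scores close to 0 and the loud, low-frequency frame scores close to 1. A trained network replaces the two hand-built features and two weights with learned ones, which is where both the accuracy and the extra computation come from.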
Hybrid Approaches
The best systems combine multiple methods:
- Fast energy check first
