Definition
Time to First Byte (TTFB) measures the interval between sending a text request to a TTS API and receiving the first chunk of audio data back. It is the most important latency metric for real-time voice applications: with streaming, playback can begin as soon as the first bytes arrive, so TTFB rather than total generation time determines when the user first hears audio.
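The definition above can be sketched as a small timing helper. This is a minimal sketch, not any provider's API: `measure_ttfb`, `start_stream`, and `fake_tts_stream` are hypothetical names, and the fake stream stands in for a real streaming TTS request.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def measure_ttfb(start_stream: Callable[[], Iterable[bytes]]) -> Optional[float]:
    """Return seconds from issuing the request to the first non-empty chunk.

    start_stream is a hypothetical callable that sends the text request and
    returns an iterable of audio chunks (for example, a wrapper around a
    streaming HTTP response). The timer starts before it is called.
    """
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty chunks; only real audio bytes count
            return time.perf_counter() - started
    return None  # the stream ended without delivering any audio


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a real TTS call: waits, then yields two audio chunks."""
    time.sleep(delay_s)  # simulated network plus synthesis delay (assumption)
    yield b"\x00\x01"
    yield b"\x02\x03"


ttfb = measure_ttfb(lambda: fake_tts_stream(0.05))
print(f"TTFB: {ttfb * 1000:.0f} ms")
```

The timer starts before the request is issued, so connection setup is included in the result. If connections are reused in production, measure with a warm connection as well.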
Why TTFB matters
Turn-taking gaps in human conversation typically last 200-300ms. A TTS system serving a voice agent should therefore contribute no more than 100-200ms to the total response latency. If TTFB exceeds 500ms, users consciously notice the delay and the interaction feels unnatural.
TTFB benchmarks (March 2026)
Leading TTS providers report sub-200ms TTFB under optimized conditions. These vendor-reported benchmarks reflect ideal server conditions, not production load, so always measure TTFB yourself under realistic concurrency and network conditions.
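One way to measure under concurrency is to fire several streaming requests at once and look at the distribution rather than a single best-case number. The sketch below assumes an async streaming client; `fake_tts_stream` is a hypothetical stand-in that only sleeps, and the nearest-rank p95 is one of several reasonable percentile definitions.

```python
import asyncio
import math
import statistics
import time
from typing import AsyncGenerator, Dict, List


async def fake_tts_stream(text: str) -> AsyncGenerator[bytes, None]:
    """Hypothetical stand-in for a streaming TTS request; swap in a real client."""
    await asyncio.sleep(0.02)  # simulated time until first audio (assumption)
    yield b"\x00" * 320
    yield b"\x00" * 320


async def ttfb_once(text: str) -> float:
    """Time from starting one request to its first non-empty audio chunk."""
    started = time.perf_counter()
    stream = fake_tts_stream(text)
    try:
        async for chunk in stream:
            if chunk:
                return time.perf_counter() - started
    finally:
        await stream.aclose()  # release the stream once the first chunk is seen
    raise RuntimeError("stream ended without audio")


async def ttfb_under_load(text: str, concurrency: int) -> List[float]:
    """Run `concurrency` requests at the same time and collect each TTFB."""
    results = await asyncio.gather(*(ttfb_once(text) for _ in range(concurrency)))
    return list(results)


def summarize(samples: List[float]) -> Dict[str, float]:
    """Median, nearest-rank 95th percentile, and worst case."""
    ordered = sorted(samples)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "p50": statistics.median(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


samples = asyncio.run(ttfb_under_load("Hello there.", concurrency=8))
print({name: f"{value * 1000:.0f} ms" for name, value in summarize(samples).items()})
```

Tail values (p95, max) tend to be more informative than the median for voice agents, since a single slow response is what a user notices.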
Frequently Asked Questions
What is TTFB in TTS?
TTFB (Time to First Byte) measures how long after sending text to a TTS API the first audio bytes arrive. A TTFB under 500ms is considered conversational-grade. A TTFB over 2 seconds makes real-time interaction impractical.
What is a good TTFB for TTS?
For voice agents and conversational AI, TTFB under 200ms is excellent and under 500ms is acceptable. Between 500ms and 2 seconds, users consciously notice the delay; over 2 seconds is unusable. Leading providers report 40-90ms under optimal conditions as of 2026.
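These thresholds can be encoded as a small helper for dashboards or test assertions. This is a sketch: `classify_ttfb` is a hypothetical name, and the "noticeable delay" label for the 500ms to 2s range is an assumption drawn from the "Why TTFB matters" section, since the tiers above do not name that band.

```python
def classify_ttfb(ttfb_ms: float) -> str:
    """Map a TTFB in milliseconds to the tiers described in this article."""
    if ttfb_ms < 0:
        raise ValueError("TTFB cannot be negative")
    if ttfb_ms < 200:
        return "excellent"
    if ttfb_ms < 500:
        return "acceptable"
    if ttfb_ms <= 2000:
        # Assumed label: the article only says delays over 500ms are noticed.
        return "noticeable delay"
    return "unusable"


for sample_ms in (75, 350, 900, 2500):
    print(sample_ms, "ms ->", classify_ttfb(sample_ms))
```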
How is TTFB different from total latency?
TTFB measures only the time until the first audio chunk arrives. Total latency includes the entire generation time plus network transmission and client-side audio pipeline setup. With streaming, perceived latency can be much lower than total generation time, because playback starts at the first chunk while the rest of the audio is still being generated.
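The difference can be seen by timing one stream twice: once at the first chunk and once at the end. The sketch below uses a hypothetical `fake_tts_stream` with made-up delays. It covers only the time spent receiving the stream, not client-side audio pipeline setup.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_stream(
    start_stream: Callable[[], Iterable[bytes]],
) -> Tuple[Optional[float], float]:
    """Return (ttfb, total) in seconds for a single streamed response."""
    started = time.perf_counter()
    ttfb = None
    for chunk in start_stream():
        if chunk and ttfb is None:
            ttfb = time.perf_counter() - started  # first audio bytes received
    total = time.perf_counter() - started  # last chunk received
    return ttfb, total


def fake_tts_stream(
    first_delay_s: float = 0.03, chunk_gap_s: float = 0.02, n_chunks: int = 5
) -> Iterator[bytes]:
    """Hypothetical stand-in: one initial delay, then evenly spaced chunks."""
    time.sleep(first_delay_s)
    yield b"\x00" * 320
    for _ in range(n_chunks - 1):
        time.sleep(chunk_gap_s)
        yield b"\x00" * 320


ttfb, total = time_stream(fake_tts_stream)
print(f"TTFB: {ttfb * 1000:.0f} ms, full stream: {total * 1000:.0f} ms")
```

With streaming playback the listener waits roughly the TTFB. Without it, they wait for the full stream before hearing anything.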