Streaming TTS is a delivery method in which synthesized audio is transmitted to the client incrementally, chunk by chunk, as the synthesis engine generates it. Instead of waiting for the complete audio file to be produced before sending any data, the server begins transmitting the first audio frames within tens of milliseconds of receiving the text input.

The concept borrows from media streaming (audio and video), but applies specifically to the synthesis pipeline. Early TTS systems had no need for streaming because synthesis was fast relative to network speeds. Modern neural TTS models are computationally heavier, making the synthesis-to-playback gap noticeable. Streaming closes that gap by overlapping synthesis with playback.

Why it matters

For conversational AI and voice agents, perceived latency is the primary user experience metric. A voice agent that takes two seconds to begin speaking after a user finishes talking feels broken. Streaming TTS reduces the time-to-first-audio to under 100 milliseconds for most providers, making the interaction feel natural.

TTFB (Time to First Byte) is the standard metric for measuring streaming TTS performance. A TTFB of 40-80ms is competitive as of March 2026. Without streaming, total latency equals full synthesis time plus download time, which can exceed one second for sentences of moderate length.
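Measuring TTFB amounts to timing how long the first chunk takes to arrive. A minimal sketch, assuming any iterable chunk source (an HTTP response iterated in streaming mode, a WebSocket receive loop, or the fake generator used here for illustration):

```python
import time
from typing import Iterable, Iterator, Tuple

def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[float, bytes, Iterator[bytes]]:
    """Return (seconds until first chunk, first chunk, rest of stream)."""
    start = time.monotonic()
    it = iter(chunks)
    first = next(it)  # blocks until the synthesis engine emits audio
    return time.monotonic() - start, first, it

# Hypothetical stream that takes ~50 ms to produce its first frame:
def fake_stream():
    time.sleep(0.05)      # simulated synthesis startup delay
    yield b"\x00" * 320   # first audio frame
    yield b"\x00" * 320   # subsequent frame

ttfb, first, rest = measure_ttfb(fake_stream())
```

The same function works against a real provider by passing the response's chunk iterator instead of `fake_stream()`.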

Streaming protocols

| Protocol | How it works | Best for |
| --- | --- | --- |
| Chunked HTTP | Audio sent as chunks in a single HTTP response | Simple integrations, browser playback |
| WebSocket | Bidirectional connection, lowest overhead | Conversational AI, voice agents |
| Server-Sent Events (SSE) | Unidirectional server push | Event-driven architectures |
| gRPC streaming | Bidirectional, protocol buffer encoding | High-performance backend services |
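To make the chunked-HTTP row concrete, here is a sketch of the wire framing that HTTP/1.1 chunked transfer encoding uses: each chunk is prefixed with its size in hexadecimal, and a zero-size chunk terminates the stream. Real servers emit this framing for you; the sketch only illustrates what travels over the socket.

```python
def encode_chunked(chunks):
    """Frame audio chunks as an HTTP/1.1 chunked transfer-encoded body.

    Each chunk is preceded by its size in hex and followed by CRLF;
    a zero-size chunk marks the end of the stream (RFC 9112, §7.1).
    """
    body = b""
    for chunk in chunks:
        if chunk:  # a zero-length chunk would prematurely end the stream
            body += f"{len(chunk):x}\r\n".encode("ascii") + chunk + b"\r\n"
    return body + b"0\r\n\r\n"
```

A client that understands chunked encoding can begin playback as soon as the first framed chunk arrives, without knowing the total body length in advance.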

Streaming vs batch synthesis

| Aspect | Streaming | Batch (non-streaming) |
| --- | --- | --- |
| Latency | 40-100ms to first audio | Full synthesis time (500ms-3s+) |
| Use case | Real-time interaction | Pre-rendered content, caching |
| Complexity | Requires chunked playback handling | Simpler: request, receive file, play |
| Caching | Difficult (audio arrives in pieces) | Easy (store complete file on CDN) |
| Cost | Same per-character pricing | Same, but cached audio avoids repeat calls |
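The caching row is where batch synthesis pays off: because the output is a complete file, identical requests can be served from a cache instead of triggering a repeat API call. A minimal sketch, where `synthesize` stands in for any non-streaming TTS call that returns complete audio bytes (it is a hypothetical parameter, not a real provider API):

```python
import hashlib

_cache: dict = {}

def cached_tts(text: str, voice: str, synthesize) -> bytes:
    """Batch synthesis with caching: repeat requests skip the API call.

    `synthesize(text, voice)` is a placeholder for any non-streaming
    TTS call returning a complete audio file as bytes.
    """
    key = hashlib.sha256(f"{voice}\x00{text}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = synthesize(text, voice)  # only on a cache miss
    return _cache[key]
```

In production the dict would typically be a CDN or object store keyed the same way; the hash key guarantees that a change to either the text or the voice produces a fresh synthesis.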

Implementation considerations

Streaming introduces complexity that batch synthesis avoids. The client must handle partial audio buffers, manage playback of incomplete data, and gracefully handle connection interruptions mid-utterance. For web applications, the Web Audio API or MediaSource Extensions are typically required to play chunked audio in the browser.
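The partial-buffer problem above comes down to the fact that network chunk boundaries rarely align with audio frame boundaries. A minimal sketch of the client-side reassembly logic, independent of any particular playback API (the 640-byte default is an assumption, roughly 20 ms of 16-bit, 16 kHz mono audio):

```python
class PlaybackBuffer:
    """Reassemble arbitrary network chunks into fixed-size audio frames."""

    def __init__(self, frame_bytes: int = 640):
        self.frame_bytes = frame_bytes
        self._pending = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a network chunk; return any complete frames ready to play."""
        self._pending += chunk
        frames = []
        while len(self._pending) >= self.frame_bytes:
            frames.append(bytes(self._pending[:self.frame_bytes]))
            del self._pending[:self.frame_bytes]
        return frames

    def flush(self) -> bytes:
        """End of stream (or interruption): return the final partial frame."""
        tail, self._pending = bytes(self._pending), bytearray()
        return tail
```

Calling `flush()` on a connection interruption lets the client play out whatever audio it already holds rather than discarding it mid-utterance.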

For EAA-compliant implementations, streaming audio players must still provide pause, stop, and volume controls that work correctly even while audio is still being received. The WCAG 1.4.2 audio control requirement applies regardless of delivery method.

Frequently Asked Questions

What is streaming TTS?

Streaming TTS delivers synthesized audio in chunks as it is generated, rather than waiting for the entire utterance to finish before sending any audio. This allows playback to start within tens of milliseconds, reducing the perceived delay between request and audible output.

Why does streaming matter for TTS latency?

Without streaming, the user waits for the full audio file to be synthesized and downloaded before hearing anything. With streaming, the first audio chunk arrives in as little as 40-100 milliseconds, making the interaction feel responsive even for long utterances.

How is streaming TTS delivered technically?

Most providers use one of two methods: chunked HTTP transfer encoding (the audio is sent as a series of chunks over a single HTTP response) or WebSocket connections (bidirectional, lower overhead, preferred for real-time conversational applications). Some providers also support Server-Sent Events (SSE).
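For the SSE case, the framing is text-based: each event is one or more `data:` lines terminated by a blank line, so binary audio is typically base64-encoded into the data field. A minimal parser sketch (the payloads in the example are placeholder base64 strings, not real audio):

```python
def parse_sse_events(raw: str) -> list:
    """Extract the data payload of each event from an SSE stream.

    An event is one or more `data:` lines; a blank line ends it.
    Because SSE is a text protocol, audio bytes are usually
    base64-encoded into the data field.
    """
    events = []
    for block in raw.split("\n\n"):
        data = [line[5:].lstrip() for line in block.split("\n")
                if line.startswith("data:")]
        if data:
            events.append("\n".join(data))
    return events
```

A real client would decode each payload and hand it to the playback buffer as it arrives rather than collecting events into a list.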

Which TTS providers support streaming?

Most major commercial providers support streaming as of March 2026, including Amazon Polly, Google Cloud TTS, Azure AI Speech, ElevenLabs, Deepgram Aura, and Cartesia. Open-source models like Piper can also stream when served behind a compatible API wrapper.

When should I use streaming vs non-streaming TTS?

Use streaming for any interactive or real-time application: voice agents, chatbots, live narration, accessibility widgets. Use non-streaming (batch synthesis) for pre-rendering content like article audio, podcast generation, or audiobook production where latency is irrelevant and caching is preferred.