Text-to-Speech (TTS) is a technology that converts written text into spoken audio. The term covers a broad range of approaches, from early concatenative systems that spliced pre-recorded phonemes to modern neural models that generate raw audio waveforms from learned patterns.
TTS is one of the oldest areas of computing research. Bell Labs demonstrated the first electronic speech synthesizer in 1939. The technology remained a niche accessibility tool for decades, until neural generation transformed it into core AI infrastructure starting in 2016.
Why it matters
TTS is the output layer of every voice AI system. Without it, conversational agents, accessibility tools, and content automation pipelines cannot produce audio. The quality of TTS directly determines whether a voice interaction feels natural or robotic, which in turn affects user satisfaction, call completion rates, and accessibility compliance.
The European Accessibility Act, enforced from June 2025, requires digital services to provide audio alternatives, making TTS a compliance requirement for businesses operating in the EU.
How it works
Modern TTS passes text through a pipeline of neural networks that ultimately generate a raw audio waveform. The process typically involves four stages:
- Text normalization: expanding abbreviations, numbers, and special characters into pronounceable words.
- Phoneme conversion: mapping normalized text to phonetic representations.
- Spectrogram generation: predicting a mel-spectrogram (a visual representation of the audio frequency content) from the phonemes.
- Vocoding: converting the spectrogram into a raw audio waveform.
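As a rough illustration of the first stage, text normalization can be sketched with simple substitution rules. This is a hypothetical minimal example; production front ends use far more elaborate rules for dates, currencies, ordinals, and full numbers.

```python
import re

# Hypothetical minimal normalizer: expands a few abbreviations and
# spells out single digits. Real TTS front ends verbalize complete
# numbers, dates, currencies, and much more.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def normalize(text: str) -> str:
    for abbr, expansion in ABBREVIATIONS.items():
        text = text.replace(abbr, expansion)
    # Replace each digit with its spoken form, then collapse whitespace.
    text = re.sub(r"\d", lambda m: f" {ONES[int(m.group())]} ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Dr. Smith lives at 4 Main St."))
```

The normalized text then flows into phoneme conversion, so anything the normalizer misses tends to be mispronounced downstream.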
In 2016, Google DeepMind’s WaveNet demonstrated this approach could reduce the quality gap with human speech by over 50%, achieving a Mean Opinion Score (MOS) of 4.21 out of 5.0 (van den Oord et al., 2016). By 2017, Tacotron 2 achieved MOS of 4.53 out of 5.0, compared to 4.58 for professional recordings (Shen et al., 2018).
Concatenative vs neural TTS
| Aspect | Concatenative TTS | Neural TTS |
|---|---|---|
| Method | Splices pre-recorded phoneme fragments | Generates waveforms from neural networks |
| Quality | Robotic seams between segments | Natural prosody and intonation |
| Data required | 10-50 hours of studio recordings | 1-30 hours (some models: 3 seconds) |
| Compute cost | Minimal (lookup + concatenation) | GPU inference required |
| Era | 1990s through mid-2010s | 2016 to present |
| Used by | Legacy IVR systems | Most commercial APIs |
Commercial TTS APIs
Major providers offer TTS as a cloud API with per-character or subscription pricing. As of March 2026, rates span roughly 50x between the cheapest and most expensive providers listed below:
- Amazon Polly: $4 per million characters (Standard), $16 per million characters (Neural).
- Google Cloud TTS: $4 per million characters (Standard), $16 per million characters (WaveNet).
- Azure AI Speech: $16 per million characters (Neural).
- ElevenLabs: approximately $120 to $200 per million characters depending on plan tier as of March 2026.
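Given per-character rates like those above, projecting monthly spend is straightforward. A minimal sketch using the listed rates (ElevenLabs approximated at the midpoint of its $120 to $200 range):

```python
# Per-million-character rates from the list above (USD, March 2026).
# The ElevenLabs figure is an assumed midpoint, not a published rate.
RATES_PER_MILLION = {
    "Amazon Polly (Standard)": 4.0,
    "Amazon Polly (Neural)": 16.0,
    "Google Cloud TTS (WaveNet)": 16.0,
    "Azure AI Speech (Neural)": 16.0,
    "ElevenLabs (assumed midpoint)": 160.0,
}

def monthly_cost(chars_per_month: int, rate_per_million: float) -> float:
    """Cost in USD for a given monthly character volume."""
    return chars_per_month / 1_000_000 * rate_per_million

# Example: 50 million characters per month (very roughly on the
# order of 1,000 hours of synthesized audio).
for provider, rate in RATES_PER_MILLION.items():
    print(f"{provider}: ${monthly_cost(50_000_000, rate):,.2f}")
```

At that volume, the same workload costs $200 on the cheapest listed tier and thousands of dollars on the most expensive, which is why provider selection matters for high-volume applications.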
Common issues
- Latency: time to first byte (TTFB) ranges from roughly 40ms to 500ms depending on provider and model. For conversational use, sub-200ms TTFB is generally considered necessary.
- Pricing at scale: the roughly 50x spread in per-character rates means provider selection has significant cost implications for high-volume applications.
- Voice consistency: long-form content can exhibit prosody drift or pronunciation inconsistencies across segments.
- Edge cases: abbreviations, foreign names, mixed-language content, and domain-specific terminology remain challenging. SSML can help control pronunciation.
- Ethical concerns: voice cloning capabilities raise consent and identity protection issues. Several jurisdictions now regulate synthetic voice creation.
Frequently Asked Questions
What is TTS?
TTS stands for Text-to-Speech. It is a technology that converts written text into spoken audio. Modern TTS systems use neural networks to generate human-like speech from text input.
How does TTS work?
Modern TTS works by passing text through a pipeline of neural networks that generate an audio waveform. The process typically involves text normalization, phoneme conversion, spectrogram generation, and vocoding to produce the final audio output.
What is a TTS API?
A TTS API is a cloud service that accepts text input and returns synthesized audio. Providers like Amazon Polly, Google Cloud TTS, and ElevenLabs offer TTS APIs with per-character or subscription pricing.
How much does TTS cost?
TTS API pricing ranges from $4 per million characters for basic voices (Amazon Polly Standard) to approximately $120 to $200 per million characters for premium voices (ElevenLabs, depending on plan tier) as of March 2026.
What is the difference between neural TTS and concatenative TTS?
Concatenative TTS splices pre-recorded phoneme fragments to form speech. Neural TTS uses deep learning to generate audio waveforms directly from text. Neural TTS produces more natural-sounding speech but requires more compute. Most commercial APIs now use neural models.