Voice AI & TTS Glossary

Definitions of key terms in text-to-speech, voice AI, and speech synthesis. A reference for developers, CTOs, and accessibility professionals.

A

ARIA (Accessible Rich Internet Applications): a W3C specification that defines HTML attributes for making dynamic web content accessible to assistive technologies. ARIA provides roles, states, and properties that communicate the purpose and current state of interactive elements to screen readers and other tools. ARIA is essential for custom audio players, dropdown menus, modals, and any interactive widget that goes beyond native HTML semantics.
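
As a rough illustration of roles, states, and properties on a custom audio player control, the sketch below renders a seek bar built from a plain div. The function name `render_seek_slider` and its parameters are hypothetical; the attribute names (`role="slider"`, `aria-valuemin`, `aria-valuemax`, `aria-valuenow`, `aria-valuetext`, `aria-label`) come from the ARIA specification. A real widget would also need keyboard handling, which is not shown.

```python
from html import escape


def render_seek_slider(position_s: int, duration_s: int, label: str = "Seek") -> str:
    """Render a non-native seek bar as a div that carries ARIA slider semantics."""

    def mmss(seconds: int) -> str:
        # Human-readable time for screen readers, e.g. 75 -> "1:15".
        return f"{seconds // 60}:{seconds % 60:02d}"

    attrs = {
        "role": "slider",            # role: what the element is
        "tabindex": "0",             # make the div keyboard-focusable
        "aria-label": label,         # property: accessible name
        "aria-valuemin": "0",
        "aria-valuemax": str(duration_s),
        "aria-valuenow": str(position_s),  # state: current position
        "aria-valuetext": f"{mmss(position_s)} of {mmss(duration_s)}",
    }
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    return f"<div {rendered}></div>"


markup = render_seek_slider(position_s=75, duration_s=200)
print(markup)
```

Where a native element such as a range input fits the design, it is usually the simpler choice, since it provides these semantics without extra attributes.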

C

Concatenative synthesis: a TTS method that assembles speech by selecting and splicing pre-recorded phoneme fragments from a large audio database. Also called unit selection synthesis, this approach dominated commercial TTS from the 1990s through the mid-2010s. While it could produce natural-sounding speech for well-covered phoneme combinations, it suffered from audible artifacts at the splice points and required extensive recording sessions. Neural TTS models have largely replaced concatenative systems in production, though some legacy IVR deployments still use them.

D

Deepfake: synthetic media generated by AI that convincingly replicates a real person's appearance, voice, or both. Audio deepfakes use voice cloning technology to reproduce a specific person's speech patterns from as little as a few seconds of sample audio. The EU AI Act (Regulation (EU) 2024/1689) requires disclosure when synthetic media could be mistaken for authentic content. Deepfake detection remains an active area of research, with no universally reliable detection method available as of March 2026.

DPA (Data Processing Agreement): a legally binding contract between a data controller and a data processor, required under GDPR Article 28. A DPA defines what personal data is processed, how it is handled, where it is stored, and when it is deleted. For TTS providers, a signed DPA is a prerequisite for processing text that may contain personal data such as names, addresses, medical information, or financial details.

E

EAA (European Accessibility Act): Directive (EU) 2019/882, enforceable since 28 June 2025, that requires digital products and services sold to EU consumers to meet accessibility standards. The EAA covers e-commerce, banking, telecom, media, transport, and consumer hardware. It references EN 301 549, which maps to WCAG 2.1 Level AA. Non-EU companies targeting EU consumers are in scope. Microenterprises providing services (fewer than 10 employees and an annual turnover or balance sheet total at or below EUR 2 million) are exempt from the service requirements but not the product requirements.

EN 301 549: the European standard for ICT accessibility, published by ETSI, CEN, and CENELEC. EN 301 549 defines accessibility requirements for information and communication technology products and services. For websites and digital interfaces, it maps to WCAG 2.1 Level AA. The current version is V3.2.1 (March 2021). The European Accessibility Act references EN 301 549 as its technical compliance baseline.

G

GDPR (General Data Protection Regulation): the EU regulation governing the processing of personal data of individuals within the European Union. GDPR (Regulation (EU) 2016/679) has been enforceable since 25 May 2018 and applies to any organization processing EU residents' data regardless of where the organization is located. When text submitted for synthesis contains personal data, GDPR requires TTS providers to sign a Data Processing Agreement (DPA) and to define data retention limits. It also restricts transfers of personal data outside the European Economic Area, which is why many providers offer EU data residency options.

I

IVR (Interactive Voice Response): an automated phone system that interacts with callers using pre-recorded or synthesized speech and accepts input via keypad or voice commands. IVR systems were the original commercial market for TTS technology, routing millions of calls daily for banks, airlines, and healthcare providers. Modern IVR platforms increasingly use neural TTS and conversational AI to replace rigid menu trees with natural language interactions, with the aim of reducing call handling times and improving customer satisfaction.

N

Neural TTS: a speech synthesis approach that uses deep neural networks to generate audio waveforms directly from text. Unlike concatenative synthesis, which splices pre-recorded phoneme fragments, neural TTS learns the full mapping from text to audio during training. This produces natural-sounding speech with realistic intonation and prosody. Google DeepMind's WaveNet demonstrated the approach in 2016, reducing the quality gap with human speech by over 50% in Mean Opinion Score tests. Neural TTS now powers most commercial voice APIs.

S

SSML (Speech Synthesis Markup Language): an XML-based markup language defined by the W3C for controlling how TTS engines pronounce and deliver text. SSML tags let developers specify pauses, emphasis, pronunciation, speaking rate, pitch, and language switching within a single utterance. Most commercial TTS APIs support SSML, including Amazon Polly, Google Cloud TTS, and Azure AI Speech. SSML is essential for IVR systems, multilingual content, and any application where default pronunciation is insufficient.
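
A minimal sketch of what an SSML document can look like, built and checked for well-formedness in Python. The helper `build_ssml` and the sample prompts are hypothetical; the `speak`, `break`, and `prosody` elements and the namespace are taken from the W3C SSML 1.1 recommendation. Providers differ in which elements and attribute values they accept, so treat this as an illustration rather than a payload guaranteed to work with any particular API.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(greeting: str, menu_prompt: str, pause_ms: int = 400) -> str:
    """Build an SSML document: a greeting, a pause, then a slowly spoken prompt."""
    return (
        f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang="en-US">'
        f"{escape(greeting)}"                      # escape &, <, > in user text
        f'<break time="{pause_ms}ms"/>'            # explicit pause
        f'<prosody rate="slow">{escape(menu_prompt)}</prosody>'  # slower delivery
        "</speak>"
    )


ssml = build_ssml("Thanks for calling R&D support.", "Press 1 for billing.")

# Parsing raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
root = ET.fromstring(ssml)
pause = root.find(f"{{{SSML_NS}}}break")
print(ssml)
```

Escaping the text before embedding it matters in practice: an unescaped ampersand or angle bracket in user-supplied text makes the whole document invalid XML.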

Streaming TTS: a method of delivering synthesized speech where audio is sent to the client in chunks as it is generated, rather than waiting for the full utterance to complete. Streaming reduces perceived latency by allowing playback to begin as soon as the first chunk arrives. Most commercial TTS APIs support streaming via chunked HTTP responses or WebSocket connections. Streaming is essential for conversational AI, voice agents, and any application where response time directly affects user experience.
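
The chunk-by-chunk pattern can be sketched without a network connection. In the example below, `fake_tts_stream` is a hypothetical stand-in for a streaming API response (it simply yields slices of the input bytes after a short delay), and `play_streaming` shows the client side: each chunk is handled as it arrives instead of after the whole utterance is available.

```python
import time
from typing import Iterator, Tuple


def fake_tts_stream(text: str, chunk_size: int = 8, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response.

    Yields 'audio' chunks one at a time, pausing to imitate generation time.
    """
    data = text.encode("utf-8")
    for start in range(0, len(data), chunk_size):
        time.sleep(delay_s)
        yield data[start:start + chunk_size]


def play_streaming(chunks: Iterator[bytes]) -> Tuple[int, int]:
    """Consume chunks as they arrive and return (chunk_count, total_bytes)."""
    count = 0
    total = 0
    for chunk in chunks:
        # A real client would push each chunk into an audio output buffer here,
        # so playback starts before the remaining chunks exist.
        count += 1
        total += len(chunk)
    return count, total


sample_text = "Hello, thanks for calling."
chunk_count, total_bytes = play_streaming(fake_tts_stream(sample_text))
print(f"received {total_bytes} bytes in {chunk_count} chunks")
```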

T

TTFB (Time to First Byte): the delay between sending text to a TTS API and receiving the first audio bytes back. In voice applications, TTFB directly determines how long users wait before hearing speech. Real-time conversational AI typically targets a TTFB under 300 milliseconds. Streaming TTS APIs reduce perceived latency by sending audio chunks before the full utterance is generated. TTFB varies by provider, model complexity, and network conditions, making it a critical metric when evaluating TTS services.
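
One way to measure this delay is to start a timer just before the request and stop it when the first non-empty chunk arrives. The sketch below assumes the response is exposed as an iterable of byte chunks; `simulated_stream` is a hypothetical stand-in that sleeps before yielding, and the 300 ms budget is the figure quoted above, not a universal threshold.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_ttfb(chunks: Iterable[bytes], started_at: float) -> Optional[float]:
    """Return seconds from started_at until the first non-empty chunk, or None."""
    for chunk in chunks:
        if chunk:
            return time.perf_counter() - started_at
    return None  # the stream ended without delivering any audio


def simulated_stream(first_chunk_delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS response that is slow to start."""
    time.sleep(first_chunk_delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


TTFB_BUDGET_S = 0.300  # target taken from the definition above

start = time.perf_counter()  # start timing just before the 'request'
ttfb_s = measure_ttfb(simulated_stream(0.05), start)
within_budget = ttfb_s is not None and ttfb_s < TTFB_BUDGET_S
print(f"TTFB: {ttfb_s * 1000:.0f} ms, within budget: {within_budget}")
```

When comparing providers, it helps to repeat the measurement many times and look at percentiles, since a single sample is dominated by network jitter.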

TTS (Text-to-Speech): technology that converts written text into spoken audio. Modern TTS systems use neural networks to generate human-like speech from text input, replacing older concatenative methods that spliced pre-recorded phonemes. TTS powers accessibility tools, voice assistants, IVR phone systems, and content automation pipelines. Commercial APIs from providers like Amazon Polly, Google Cloud TTS, and ElevenLabs offer per-character pricing, with costs ranging from $4 to $200 per million characters as of March 2026.
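
Per-character pricing makes cost estimates a one-line calculation. The sketch below applies the two ends of the price range quoted above to a hypothetical monthly volume of 250,000 characters; the function name and the volume are illustrative assumptions, and real invoices may include free tiers or minimums that this ignores.

```python
def tts_cost_usd(characters: int, usd_per_million_chars: float) -> float:
    """Estimate synthesis cost under simple per-character pricing."""
    return characters / 1_000_000 * usd_per_million_chars


monthly_chars = 250_000  # hypothetical monthly volume

low = tts_cost_usd(monthly_chars, 4.0)     # cheapest end of the quoted range
high = tts_cost_usd(monthly_chars, 200.0)  # most expensive end of the quoted range

print(f"${low:.2f} to ${high:.2f} per month")  # prints "$1.00 to $50.00 per month"
```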

V

Voice cloning: the process of creating a synthetic replica of a specific person's voice from audio samples. Modern voice cloning systems can produce a usable voice model from as little as three seconds of reference audio, with professional-quality results from 15 to 30 seconds of clean audio as of March 2026. The technology enables personalized TTS, dubbing, and accessibility applications, but raises ethical concerns around consent and deepfake misuse. Several jurisdictions now regulate voice cloning under identity protection laws.

W

WaveNet: a deep neural network developed by Google DeepMind in 2016 that generates raw audio waveforms sample by sample. WaveNet reduced the quality gap between synthetic and human speech by over 50% in Mean Opinion Score tests, marking the turning point from concatenative to neural TTS. Google Cloud TTS offers WaveNet voices commercially. The architecture influenced many of the major TTS systems that followed, including Tacotron, VITS, and the models behind ElevenLabs and OpenAI's voice products.

WCAG (Web Content Accessibility Guidelines): an international standard published by the W3C that defines how to make web content accessible to people with disabilities. WCAG 2.1 Level AA is the baseline referenced by the European Accessibility Act (EAA) via EN 301 549. The guidelines are organized around four principles: perceivable, operable, understandable, and robust (POUR). WCAG 2.2 was published in October 2023, and WCAG 3.0 is in working draft as of March 2026.

16 terms