Definition

Speech Synthesis Markup Language (SSML) is a W3C standard that provides an XML-based way to control text-to-speech (TTS) output. It lets developers annotate text with instructions for pronunciation, pauses, emphasis, speaking rate, pitch, and language switching.
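As a rough illustration of what such annotations look like, the sketch below holds a small SSML document in a Python string and parses it with the standard library to confirm it is well-formed XML. The element names (speak, break, emphasis, prosody, phoneme, lang) come from the W3C specification; the sentence text, attribute values, and IPA transcription are made up for the example, and which elements a given TTS engine honors is not something this sketch can tell you.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Illustrative document: all text and attribute values are invented for this example.
ssml = f"""<speak version="1.1" xmlns="{SSML_NS}" xml:lang="en-US">
  Your order has shipped.
  <break time="500ms"/>
  It should arrive <emphasis level="strong">tomorrow</emphasis>.
  <prosody rate="slow" pitch="low">Please keep your receipt.</prosody>
  You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.
  <lang xml:lang="fr-FR">Merci beaucoup.</lang>
</speak>"""

# Parsing raises ParseError if the markup is not well-formed XML.
root = ET.fromstring(ssml)

# List the annotation elements in document order, without the namespace prefix.
tags = [el.tag.split("}")[1] for el in root]
print(tags)
```

Parsing only checks well-formedness; it does not validate the document against the SSML schema or against any provider's supported subset.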

SSML support levels

TTS providers offer varying levels of SSML support:

  • Full SSML: pauses, pronunciation (IPA/phonemes), language switching, prosody control
  • Basic SSML: pauses and emphasis only
  • None: plain text input only; the engine makes all prosody decisions
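One way to cope with these differences is to author full SSML once and downgrade it per provider. The sketch below is a minimal version of that idea, not any provider's API: adapt_ssml and the level names "full", "basic", and "none" are hypothetical and simply mirror the list above, and the set of tags a "basic" provider accepts is an assumption taken from that list.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Assumption: a "basic" provider accepts only pauses and emphasis.
BASIC_TAGS = {"break", "emphasis"}

def _render_basic(el):
    """Serialize el's content, keeping BASIC_TAGS and unwrapping every other element."""
    parts = [escape(el.text or "")]
    for child in el:
        inner = _render_basic(child)
        name = child.tag.split("}")[-1]  # drop any namespace prefix
        if name in BASIC_TAGS:
            # Keep plain (un-namespaced) attributes such as time= and level=.
            attrs = "".join(
                f" {k}={quoteattr(v)}"
                for k, v in child.attrib.items()
                if not k.startswith("{")
            )
            parts.append(f"<{name}{attrs}>{inner}</{name}>" if inner else f"<{name}{attrs}/>")
        else:
            parts.append(inner)  # unsupported element: keep its text, drop the tag
        parts.append(escape(child.tail or ""))
    return "".join(parts)

def adapt_ssml(ssml, level):
    """Hypothetical helper: downgrade an SSML string to a provider's support level."""
    root = ET.fromstring(ssml)
    if level == "full":
        return ssml
    if level == "basic":
        return f"<speak>{_render_basic(root)}</speak>"
    if level == "none":
        # Plain text only: concatenate all text nodes and normalize whitespace.
        return " ".join("".join(root.itertext()).split())
    raise ValueError(f"unknown support level: {level!r}")

doc = (
    '<speak>Hello <prosody rate="slow">there</prosody>.<break time="300ms"/> '
    '<emphasis level="strong">Bye</emphasis> <lang xml:lang="fr-FR">ami</lang></speak>'
)
basic = adapt_ssml(doc, "basic")
plain = adapt_ssml(doc, "none")
```

Downgrading is lossy: once a pronunciation or language tag is removed, the engine falls back to its own guess for that span, so the output should be spot-checked by ear.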

When SSML matters

SSML is most valuable for handling edge cases that neural models still struggle with: abbreviations (is “Dr.” a doctor or a drive?), foreign proper nouns, precise pause timing in IVR menus, and mixed-language content.
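The edge cases above map onto a handful of standard SSML elements: sub for abbreviations, break for pause timing, and lang for mixed-language content. The following sketch assembles one such prompt; the helper names sub and ivr_menu, the prompt text, and the 400 ms pause are all invented for illustration, and the final parse only confirms the result is well-formed XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"

def sub(written, spoken):
    """Hypothetical helper: tell the engine to read `written` aloud as `spoken`."""
    return f"<sub alias={quoteattr(spoken)}>{escape(written)}</sub>"

def ivr_menu(options, pause_ms=400):
    """Hypothetical helper: join menu options with a fixed pause between them."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    return pause.join(escape(option) for option in options)

body = (
    # The same abbreviation resolved two different ways.
    f"{sub('Dr.', 'Doctor')} Lee's office is at 12 Oak {sub('Dr.', 'Drive')}. "
    # Explicit pause between menu options.
    + ivr_menu(["Press 1 for appointments.", "Press 2 for billing."])
    # Mixed-language content marked with its own language tag.
    + ' <lang xml:lang="es-ES">Para español, oprima tres.</lang>'
)
ssml = f'<speak version="1.1" xmlns="{SSML_NS}" xml:lang="en-US">{body}</speak>'

# Parse to confirm the assembled markup is well-formed.
root = ET.fromstring(ssml)
aliases = [el.get("alias") for el in root.iter(f"{{{SSML_NS}}}sub")]
```

Escaping the user-facing text before embedding it matters here, since characters such as & or < in a menu option would otherwise break the document.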

Frequently Asked Questions

What is SSML?

SSML (Speech Synthesis Markup Language) is an XML-based markup language, standardized by the W3C, that controls how TTS engines process and speak text. It lets developers specify pronunciation, pauses, emphasis, speaking rate, pitch, and language switching.

Do all TTS APIs support SSML?

Support varies. Some providers offer full SSML support (pauses, pronunciation, language switching, prosody control), others support only basic features (pauses and emphasis), and some accept plain text only.

Is SSML still relevant with neural TTS?

Neural TTS models handle prosody and pronunciation better than traditional systems, reducing the need for explicit SSML control. However, SSML remains useful for edge cases like abbreviations, foreign words, and precise timing control.