Mistral AI launches Voxtral TTS, a 4B-parameter multilingual voice model

On March 26, 2026, Mistral AI released Voxtral TTS, a 4B-parameter text-to-speech model supporting nine languages. The model is available via API at $0.016 per 1,000 characters and as open weights on Hugging Face under a CC BY-NC 4.0 license.
What Voxtral TTS delivers
Voxtral TTS is a transformer-based, autoregressive, flow-matching model built on Ministral 3B. It consists of a 3.4B-parameter transformer decoder backbone, a 390M-parameter flow-matching acoustic transformer, and a 300M-parameter in-house neural audio codec operating at a 12.5 Hz frame rate.
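As a quick sanity check on those figures, the short Python sketch below adds up the three stated component sizes and converts the codec frame rate into a per-frame duration. The component sizes and the 12.5 Hz rate come from the announcement; the function names and the choice to round frame counts up are illustrative assumptions, not anything Mistral AI has published.

```python
import math

# Component sizes as stated in the announcement (parameter counts).
COMPONENTS = {
    "transformer decoder backbone": 3_400_000_000,
    "flow-matching acoustic transformer": 390_000_000,
    "neural audio codec": 300_000_000,
}

# Codec frame rate as stated in the announcement.
CODEC_FRAME_RATE_HZ = 12.5


def total_parameters(components):
    """Sum the parameter counts of all listed components."""
    return sum(components.values())


def frame_duration_ms(frame_rate_hz):
    """Duration of audio covered by one codec frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def frames_for(seconds, frame_rate_hz=CODEC_FRAME_RATE_HZ):
    """Number of codec frames spanning `seconds` of audio.

    Rounding up is an assumption made for this sketch.
    """
    return math.ceil(seconds * frame_rate_hz)


if __name__ == "__main__":
    print(f"total parameters: {total_parameters(COMPONENTS):,}")
    print(f"ms per frame: {frame_duration_ms(CODEC_FRAME_RATE_HZ)}")
```

The components sum to roughly 4.09B parameters, which matches the rounded 4B headline figure, and each codec frame covers 80 ms of audio.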
The model supports English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. It adapts to a custom voice from as little as 3 seconds of reference audio, capturing accent, intonation, rhythm, and emotional tone. It also supports zero-shot cross-lingual voice adaptation: a French voice prompt with English text produces natural French-accented English.
Preset voices are available in American English, British English, and French dialects. Custom voice libraries can be connected through the API.
How Mistral AI frames performance
Mistral AI reports that, in human evaluations by native speakers, Voxtral TTS was rated more natural than ElevenLabs Flash v2.5 while maintaining a similar time-to-first-audio. The company also claims quality parity with ElevenLabs v3, along with support for emotion steering.
The evaluation used two recognizable voices per language in a zero-shot custom voice context. Three annotators per pair performed side-by-side preference tests on naturalness, accent adherence, and acoustic similarity to the original reference.
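To make the setup concrete, here is a minimal Python sketch of how side-by-side preference votes from three annotators per pair could be turned into a win rate. Mistral AI has not described its aggregation method in the material cited here, so the majority-vote rule, the "tie" label, and the function names are all assumptions, and the votes in the test data are placeholders rather than real results.

```python
from collections import Counter


def majority_preference(votes):
    """Return the label most annotators chose for one pair.

    Returns "tie" when no single label leads (an assumed rule).
    """
    if not votes:
        raise ValueError("each pair needs at least one vote")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


def win_rate(pairs, system):
    """Share of decided pairs (ties excluded) won by `system`."""
    decisions = [majority_preference(votes) for votes in pairs]
    decided = [d for d in decisions if d != "tie"]
    if not decided:
        return 0.0
    return decided.count(system) / len(decided)


if __name__ == "__main__":
    # Placeholder votes only: three annotators per pair, labels "A", "B" or "tie".
    example_pairs = [
        ["A", "A", "B"],
        ["B", "B", "A"],
        ["A", "B", "tie"],
        ["A", "A", "A"],
    ]
    print(win_rate(example_pairs, "A"))
```

With only two voices per language and three annotators per pair, a single flipped vote can change a pair's outcome, which is one reason independent replication matters.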
VoiceRadar has not independently verified these claims. The benchmark methodology is Mistral AI's own, and independent third-party evaluations have not yet been published as of March 26, 2026. A detailed analysis of Voxtral TTS, including independent testing, will follow in a dedicated blog post.
Pricing and availability
Voxtral TTS is available now via the Mistral API at $0.016 per 1,000 characters. At that rate, one million characters cost $16, which prices the model well below ElevenLabs and most commercial TTS providers.
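For budgeting, the arithmetic can be sketched in a few lines of Python. The rate is the published $0.016 per 1,000 characters; the assumption that billing scales linearly with character count, with no minimum charge or rounding, is ours and should be checked against Mistral AI's actual billing terms. The function name is hypothetical.

```python
from decimal import Decimal

# Published API rate: USD per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_cost_usd(num_chars, price_per_1k=PRICE_PER_1K_CHARS_USD):
    """Estimate API cost for `num_chars` characters.

    Assumes strictly linear billing with no minimums or rounding.
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return price_per_1k * Decimal(num_chars) / Decimal(1000)


if __name__ == "__main__":
    for chars in (1_000, 250_000, 1_000_000):
        print(chars, "characters ->", estimate_cost_usd(chars), "USD")
```

Decimal is used instead of float so that small per-request amounts do not pick up binary rounding noise when summed.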
The model can be tested in Mistral Studio and Le Chat. Open weights with several reference voices are published on Hugging Face under a CC BY-NC 4.0 license, which permits non-commercial use only.
Why it matters
This is Mistral AI’s first text-to-speech model and the first TTS release from a major European AI lab. The combination of open weights, aggressive pricing, and a compact 4B-parameter architecture targets enterprises that want to self-host or reduce their dependency on American TTS providers.
The model also completes Mistral AI’s audio pipeline: paired with Voxtral Transcribe, its speech-to-text model, it lets enterprises build full speech-to-speech systems within a single vendor stack.
For the broader TTS market, Voxtral TTS adds a credible open-weights competitor at a price point that could pressure margins across the industry.
Source: Mistral AI announcement, March 26, 2026.
VoiceRadar’s take on Voxtral TTS
Watch our video analysis of Voxtral TTS, covering what the model delivers, how the pricing compares, and what it means for the TTS market.

