Definition

WaveNet is a deep generative model for raw audio, introduced by Google DeepMind in September 2016. It generates audio waveforms one sample at a time (typically at 16,000 or 24,000 samples per second), using a deep stack of causal dilated convolutions to model the probability distribution of each audio sample conditioned on all previous samples.
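The key property of the causal dilated stack is that each output depends only on past samples, while doubling dilations make the receptive field grow exponentially with depth. The sketch below illustrates this with a minimal NumPy implementation of a kernel-size-2 causal dilated convolution (the weights and layer count are illustrative, not WaveNet's actual parameters): an impulse at sample 0 influences exactly the first 1 + (1+2+4+8) = 16 outputs.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated convolution with kernel size 2:
    output[t] depends only on x[t] and x[t - dilation]."""
    # Left-pad so the output has the same length as the input
    # and no future sample can leak into output[t].
    pad = np.zeros(dilation, dtype=x.dtype)
    xp = np.concatenate([pad, x])
    return w[0] * xp[:-dilation] + w[1] * xp[dilation:]

# A stack with dilations 1, 2, 4, 8 has a receptive field of
# 1 + (1 + 2 + 4 + 8) = 16 samples.
dilations = [1, 2, 4, 8]
x = np.zeros(32)
x[0] = 1.0  # impulse: trace which outputs can "see" sample 0
y = x
for d in dilations:
    y = causal_dilated_conv(y, np.array([1.0, 1.0]), d)

print(int(np.nonzero(y)[0].max()))  # → 15: sample 0 reaches outputs t = 0..15
```

Doubling the dilation at each layer is what lets a WaveNet-style stack cover thousands of past samples with only a few dozen layers, which matters at 16,000 samples per second.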

The breakthrough

WaveNet’s significance was not just technical but conceptual. Instead of assembling pre-recorded fragments or using mathematical speech models, it learned the statistical patterns of human speech from hundreds of hours of recordings. The result was speech with natural prosody, intonation, and breathing patterns that previous approaches could not achieve.

From research to production

Early WaveNet generated audio far slower than real time, making interactive use impractical. Google's engineering team solved this with Parallel WaveNet, a distillation technique that achieved a roughly 1,000x speedup by 2017. The optimized system could generate one second of speech in 50 milliseconds.

Frequently Asked Questions

What is WaveNet?

WaveNet is a deep neural network developed by Google DeepMind in 2016 that generates raw audio waveforms sample by sample. It was the first AI model to produce speech that sounded significantly more natural than existing concatenative or parametric systems.
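"Sample by sample" means autoregressive generation: the model outputs a probability distribution over the 256 mu-law quantization levels used for 8-bit audio, a sample is drawn from it, and that sample is appended to the history conditioning the next prediction. The loop below is a minimal sketch of this process; `toy_model` is a hypothetical stand-in for a trained network, returning random (untrained) probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_model(history):
    """Hypothetical stand-in for a trained WaveNet. A real model would
    condition on `history`; here we just return a softmax over the
    256 mu-law quantization levels."""
    logits = rng.normal(size=256)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Autoregressive sampling: each new sample is drawn conditioned on all
# previously generated samples, then fed back as input.
samples = []
for _ in range(100):
    probs = toy_model(samples)
    samples.append(int(rng.choice(256, p=probs)))

print(len(samples))  # 100 generated 8-bit sample indices
```

This feedback loop is also why the original model was slow: generating one second of 16 kHz audio requires 16,000 sequential passes through the network.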

How much better is WaveNet than traditional TTS?

WaveNet reduced the gap between synthetic and human speech quality by over 50% according to subjective listener evaluations. This was not an incremental improvement but a paradigm shift in speech synthesis.

Is WaveNet still used?

WaveNet itself has been succeeded by newer architectures, but it established the neural approach that modern TTS systems build on. Google Cloud TTS still offers WaveNet voices at $16 per million characters.