Tuesday, August 29, 2017

Deep Learning for Siri’s Voice

Siri Team:

Recently, deep learning has gained momentum in the field of speech technology, largely surpassing conventional techniques such as hidden Markov models (HMMs). Parametric synthesis has benefited greatly from deep learning technology. Deep learning has also enabled a completely new approach for speech synthesis called direct waveform modeling (for example, using WaveNet [4]), which has the potential to provide both the high quality of unit selection synthesis and the flexibility of parametric synthesis. However, given its extremely high computational cost, it is not yet feasible for a production system.

In order to provide the best possible quality for Siri’s voices across all platforms, Apple is now taking a step forward to utilize deep learning in an on-device hybrid unit selection system.
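At its core, unit selection synthesis stitches together recorded speech segments by searching for the sequence that minimizes a target cost (how well a unit fits the desired sound) plus a concatenation cost (how smoothly adjacent units join), typically via dynamic programming. The sketch below is purely illustrative, not Apple's implementation; the function names and cost functions are hypothetical, and in a hybrid system like the one described here the costs would come from a deep learning model rather than hand-crafted distances.

```python
# Illustrative sketch of a unit-selection search (hypothetical, not Apple's code).
# candidates[i] holds the recorded units available for target position i;
# we pick one unit per position, minimizing target cost + concatenation cost
# with a Viterbi-style dynamic program.

def select_units(candidates, target_cost, concat_cost):
    n = len(candidates)
    # best[i][j] = (cumulative cost of ending at unit j of position i, backpointer)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            # Cheapest way to reach this unit from any unit at the previous position
            c, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((c + target_cost(i, u), back))
        best.append(row)
    # Backtrace from the cheapest final unit
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]


# Toy usage: units are plain numbers, costs are simple distances.
targets = [1, 2, 3]
candidates = [[1, 5], [2, 9], [3, 7]]
picked = select_units(
    candidates,
    target_cost=lambda i, u: abs(u - targets[i]),
    concat_cost=lambda a, b: abs(a - b),
)
print(picked)  # [1, 2, 3]
```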
For iOS 11, we chose a new female voice talent with the goal of improving the naturalness, personality, and expressivity of Siri’s voice. We evaluated hundreds of candidates before choosing the best one. Then, we recorded over 20 hours of speech and built a new TTS voice using the new deep learning-based TTS technology. As a result, the new US English Siri voice sounds better than ever. Table 1 contains a few examples of the Siri deep learning-based voices in iOS 11 and 10 compared to a traditional unit selection voice in iOS 9.

Update (2017-09-11): John Gruber:

It’s the voice assistant equivalent to getting a better UI font or retina graphics for a visual UI. But: if given a choice between a Siri that sounds better but works the same, or a Siri that sounds the same but works better, I don’t know anyone who wouldn’t choose the latter.
