May 30, 2021
Best text to speech services
Text to speech (TTS) began as a way to let people with sight or reading difficulties consume content by listening. That is still a key role for these applications, but TTS is now being applied in many new ways. Education services use it to deliver their content and courses in audio form. Businesses use it to communicate their services and products, both online via bots or podcasts and in real-world spaces. In travel and tourism, many businesses and tourist bodies use TTS to create easy-to-consume audio guides for cities, museums, and similar attractions. In marketing, it lets clients extend the reach of their product and marketing information, since businesses can turn their online content into podcasts that customers access whenever and wherever is convenient. We are just at the beginning of the TTS revolution, and many significant industries and social enterprises are already finding valuable roles for custom text to speech applications that engage audiences in ways more convenient to end users.
All of the major technology players now provide custom voice text to speech services, including Amazon, Google, Microsoft, and IBM. These providers have been active in this market for a number of years and have built robust and scalable services, each with different strengths and weaknesses. Overall, however, it is fair to say that Amazon and Google are the stronger services, as their footprint in the smart speaker market has enabled them to create accurate and sophisticated text to audio services.
Amazon Polly
Polly’s Text To Speech service uses advanced deep learning technologies to synthesize natural sounding human speech. With dozens of lifelike voices across a broad set of languages, you can build speech-enabled applications, services and audio content platforms that work in many different countries.
In addition to Standard Text to Speech voices, Amazon Polly offers Neural Text to Speech voices that improve speech quality through a new machine learning approach. Polly’s Neural Text to Speech technology also supports two speaking styles that let you better match the delivery style of the speaker to the application: a Newscaster reading style tailored to news narration use cases, and a Conversational speaking style suited to two-way communication such as telephony applications.
Because Amazon Polly supports a variety of languages, you can select a suitable voice and distribute your speech-enabled applications, services or audio content in many countries.
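As a rough illustration of how a speaking style might be requested, the sketch below builds the SSML that wraps text in Polly's `amazon:domain` tag. The helper name `polly_style_ssml` and the style keys are hypothetical, the mapping of Newscaster to the `news` domain is an assumption based on Polly's SSML tag documentation, and the snippet only builds a string; it does not call AWS.

```python
from xml.sax.saxutils import escape

# Assumed mapping from the style names used in this article to the
# domain names accepted by Polly's <amazon:domain> SSML tag.
POLLY_STYLES = {"newscaster": "news", "conversational": "conversational"}


def polly_style_ssml(text: str, style: str) -> str:
    """Wrap plain text in SSML that requests a neural speaking style."""
    if style not in POLLY_STYLES:
        raise ValueError(f"unknown style: {style!r}")
    domain = POLLY_STYLES[style]
    # escape() protects characters such as & and < inside the spoken text.
    return (
        "<speak>"
        f'<amazon:domain name="{domain}">{escape(text)}</amazon:domain>'
        "</speak>"
    )


if __name__ == "__main__":
    print(polly_style_ssml("Markets closed higher today.", "newscaster"))
```

The resulting string would typically be sent as SSML input to a neural voice that supports the chosen style; which voices support which style should be checked against the voice list linked below.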
For a full list of voices go to:
https://aws.amazon.com/polly/features/?nc=sn&loc=3
Pros
- Quality of the voices
- Accuracy of transcoding
Cons
- Limited number of voices in neural text to speech voices
- Limited range of voices outside the five major world languages
Google Text to Speech
Google Cloud Text-to-Speech enables developers to synthesise natural-sounding speech with 220+ voices in multiple languages. It applies DeepMind’s ground-breaking research in WaveNet and Google’s powerful neural networks to deliver the highest fidelity possible. Because it is an easy-to-use API, developers can create lifelike interactions with their users across many applications and devices.
The 220+ voices go across 40+ languages and variants, including Mandarin, Hindi, Spanish, Arabic, Russian, and more.
For a full list of voices go here:
https://cloud.google.com/text-to-speech/docs/voices
The platform provides 90+ WaveNet voices, based on DeepMind’s research, that significantly close the gap with human performance. The pitch of a selected voice can be personalised by up to 20 semitones above or below the default. Speech can also be customised with Speech Synthesis Markup Language (SSML) tags that let a customer add pauses, number, date and time formatting, and other pronunciation instructions.
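To make the pitch range and SSML tags more concrete, here is a minimal sketch that clamps a pitch value to the 20 semitone limit and assembles a request body with a pause and a formatted date. The function names are hypothetical, the voice name `en-US-Wavenet-D` is only an example, and the JSON field names are assumed from the REST API's `text:synthesize` request shape. Nothing is sent over the network.

```python
import json
from xml.sax.saxutils import escape

PITCH_LIMIT = 20.0  # semitones either side of the default voice pitch


def clamp_pitch(semitones: float) -> float:
    """Keep a requested pitch shift inside the supported range."""
    return max(-PITCH_LIMIT, min(PITCH_LIMIT, semitones))


def build_ssml(intro: str, iso_date: str, pause_ms: int = 500) -> str:
    """Speak an intro, pause, then read a date such as 2021-05-30."""
    return (
        "<speak>"
        f"{escape(intro)}"
        f'<break time="{pause_ms}ms"/>'
        '<say-as interpret-as="date" format="yyyymmdd" detail="1">'
        f"{escape(iso_date)}</say-as>"
        "</speak>"
    )


def build_request(ssml: str, voice_name: str, language_code: str,
                  pitch: float) -> dict:
    """Assemble a request body (field names are an assumption)."""
    return {
        "input": {"ssml": ssml},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {"audioEncoding": "MP3", "pitch": clamp_pitch(pitch)},
    }


if __name__ == "__main__":
    ssml = build_ssml("Your tour starts on", "2021-05-30")
    body = build_request(ssml, "en-US-Wavenet-D", "en-US", pitch=35)
    print(json.dumps(body, indent=2))
```

Clamping on the client side is a design choice rather than a requirement: it avoids a rejected request when a caller passes a value outside the range.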
Pros
• Number of voices to choose
• Number of WaveNet voices deployed
Cons
• Voice quality can vary across the full range of voices
IBM Watson
IBM Watson’s text to speech offering is a cloud API service that enables a customer to convert written text into natural-sounding audio in a variety of languages and voices, either within an existing application or within Watson Assistant.
It has controllable speech attributes that let a user adjust pronunciation, volume, pitch, and speed using Speech Synthesis Markup Language. It can also personalise voice quality through attributes such as strength, pitch, breathiness, rate, timbre, and more. IBM also provides strong data governance, based on its long-standing corporate governance practices.
IBM Watson provides 32 different voices. To find out more go here:
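The pitch and speed adjustments described above are usually expressed with the standard SSML `prosody` element. The sketch below builds such markup with Python's standard library. The helper name `prosody_ssml` is hypothetical, the attribute values follow the W3C SSML specification, and which attributes a given Watson voice honours is an assumption to verify against IBM's documentation.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def prosody_ssml(text: str, pitch: Optional[str] = None,
                 rate: Optional[str] = None) -> str:
    """Return SSML that applies optional pitch and rate changes to text."""
    speak = ET.Element("speak")
    attrs = {}
    if pitch is not None:
        attrs["pitch"] = pitch
    if rate is not None:
        attrs["rate"] = rate
    # Only add a <prosody> wrapper when there is something to adjust.
    node = ET.SubElement(speak, "prosody", attrs) if attrs else speak
    node.text = text  # ElementTree escapes special characters on output
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(prosody_ssml("Welcome back.", pitch="+2st", rate="slow"))
```

Building the markup with an XML library rather than string concatenation means the spoken text is escaped automatically.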
https://cloud.ibm.com/docs/text-to-speech?topic=text-to-speech-voices
Pros
- Data security
- Ability to personalise the voice
Cons
- Accuracy when transcoding Text to Speech
- Limited number of voices to choose from
- Processing of text to speech can be slow
Microsoft
Microsoft’s platform enables clients to build apps and services that speak naturally, choosing from more than 215 voices across 60 languages and variants. A customer can differentiate its brand with a customized voice and access voices with different speaking styles and emotional tones to fit its use case, from text readers to customer support chatbots.
It is flexible to deploy, as Text to Speech can run anywhere: in the cloud or at the edge in containers. It provides fine-grained audio controls, so voice output can be tuned for different customer scenarios by adjusting rate, pitch, pronunciation, pauses, and more. Microsoft’s neural Text to Speech supports several speaking styles, including chat, newscast, and customer service, and emotions like cheerfulness and empathy.
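Speaking styles and rate adjustments of this kind are typically requested through SSML. The following sketch assembles markup using the `mstts:express-as` element. The helper name `azure_style_ssml` is hypothetical, the voice name `en-US-AriaNeural` and the style value `cheerful` are examples assumed from Microsoft's SSML documentation, and not every voice supports every style.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def azure_style_ssml(text: str, voice: str, style: str,
                     rate: str = "medium", lang: str = "en-US") -> str:
    """Build SSML that asks a neural voice for a speaking style and rate."""
    ssml = (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</mstts:express-as></voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError if the markup is malformed
    return ssml


if __name__ == "__main__":
    print(azure_style_ssml("Thanks for calling!", "en-US-AriaNeural",
                           "cheerful"))
```

Parsing the string before returning it is a cheap local check that catches malformed markup before it reaches the service.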
There are over 215 voices to choose from in Azure Neural TTS.
Pros
• Range of voices
• Flexibility of deployment
Cons
• Accuracy of transcoding
• Voice quality, especially for regional voices
Summary
All of the providers have robust and scalable services. However, each has different strengths and weaknesses, so it is best to choose the service that most closely matches your use case.
The most important first step is therefore to establish the role TTS plays in your business or social enterprise. That will ensure both that you select the application that best meets your needs and that you get appropriate value from it.
Overall, Amazon and Google currently provide the most robust applications. That said, all of these providers are well-established players in this market, each with different strengths.