This session will introduce you to Amazon Polly, a new deep learning service that turns text into lifelike speech. Polly enables existing applications to speak as a first-class feature and creates the opportunity for entirely new categories of speech-enabled products – from mobile apps and cars, to devices and appliances. Polly includes 47 lifelike voices and support for 24 languages, so you can select the ideal voice and distribute your speech-enabled applications in many geographies. Polly is easy to use – you just send the text you want converted into speech to the Polly API, and Polly immediately returns the audio stream to your application so you can play it directly or store it in a standard audio file format, such as MP3. Polly supports Speech Synthesis Markup Language (SSML) tags like prosody so you can adjust the speech rate, pitch, or volume. Polly is a secure service that delivers all of these benefits at high scale and at low latency. You can cache and replay Polly’s generated speech at no additional cost. Polly lets you convert 5M characters per month for free during the first year. Polly’s pay-as-you-go pricing, low cost per request, and lack of restrictions on storage and reuse of voice output make it a cost-effective way to enable speech synthesis everywhere. Join this session to learn more and find out how you can get started with Amazon Polly today!
2. What to expect from the session
• Introduction to Amazon Polly
• Features and functionality
• Text-to-speech: Under the hood
• Getting started
• Pricing
• Case studies
• Q&A
4. Why we built Amazon Polly
• Apps using voice to communicate with end-users are
becoming more common every day
• Naturalness of generated speech is a key element of
user experience
• Integration of speech varies across use cases
5. What is Amazon Polly?
• A service that converts text into lifelike speech
• Offers 47 lifelike voices across 24 languages
• Low latency responses enable developers to build
real-time systems
• Developers can store, replay, and distribute
generated speech
6. Amazon Polly: Quality
Natural-sounding speech
A subjective measure of how close TTS output is to human speech.
Accurate text processing
Ability of the system to interpret common text formats such as abbreviations, numerical
sequences, homographs etc.
Today in Las Vegas, NV it's 54°F.
"We live for the music", live from the Madison Square Garden.
Highly intelligible
A measure of how comprehensible speech is.
“Peter Piper picked a peck of pickled peppers.”
7. Amazon Polly: Language Portfolio
Americas:
• Brazilian Portuguese
• Canadian French
• English (US)
• Spanish (US)
A-PAC:
• Australian English
• Indian English
• Japanese
EMEA:
• British English
• Danish
• Dutch
• French
• German
• Icelandic
• Italian
• Norwegian
• Polish
• Portuguese
• Romanian
• Russian
• Spanish
• Swedish
• Turkish
• Welsh
• Welsh English
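To see which voices are available for one of the languages above, the voice list can be queried from the service. The following is a minimal sketch, assuming a boto3 Polly client is passed in; the helper name `voices_for_language` is hypothetical, and pagination of large result sets is ignored.

```python
def voices_for_language(polly_client, language_code):
    """Return the sorted voice IDs reported for one language code.

    `polly_client` is assumed to be a boto3 Polly client, for example
    boto3.client("polly"); the helper name itself is hypothetical.
    """
    response = polly_client.describe_voices(LanguageCode=language_code)
    return sorted(voice["Id"] for voice in response["Voices"])
```

With AWS credentials configured, `voices_for_language(boto3.client("polly"), "pt-BR")` would list the Brazilian Portuguese voices.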
9. Amazon Polly features: SSML
Speech Synthesis Markup Language
is a W3C recommendation, an XML-based markup language for speech
synthesis applications
<speak>
My name is Kuklinski. It is spelled
<prosody rate='x-slow'>
<say-as interpret-as="characters">Kuklinski</say-as>
</prosody>
</speak>
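The SSML above can be built and submitted from Python. This is a sketch under stated assumptions: the helper names `spell_out_ssml` and `synthesize_ssml` are hypothetical, the voice "Joanna" is borrowed from the first-app slide, and the client is assumed to be a boto3 Polly client. The one detail that matters is `TextType="ssml"`, which asks the service to interpret the tags rather than read them aloud.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(sentence, word, rate="x-slow"):
    """Build SSML that says `sentence`, then spells `word` at the given rate."""
    return (
        "<speak>"
        f"{escape(sentence)} "
        f'<prosody rate="{rate}">'
        f'<say-as interpret-as="characters">{escape(word)}</say-as>'
        "</prosody>"
        "</speak>"
    )


def synthesize_ssml(polly_client, ssml, voice_id="Joanna"):
    """Send SSML to Polly and return the MP3 bytes.

    `polly_client` is assumed to be boto3.client("polly").
    """
    response = polly_client.synthesize_speech(
        Text=ssml,
        TextType="ssml",  # without this the markup is treated as plain text
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()
```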
10. Amazon Polly features: Lexicons
Enables developers to customize the pronunciation of
words or phrases
My daughter’s name is Kaja.
<lexeme>
<grapheme>Kaja</grapheme>
<grapheme>kaja</grapheme>
<grapheme>KAJA</grapheme>
<phoneme>"kaI.@</phoneme>
</lexeme>
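The lexeme above is a fragment; to be uploaded it needs to sit inside a complete Pronunciation Lexicon Specification (PLS) document. The sketch below wraps it and stores it with `put_lexicon`. The helper names, the lexicon name, and the choice of `en-US` and the `x-sampa` alphabet are assumptions for illustration; the slide does not state the language of the lexicon.

```python
from xml.sax.saxutils import escape

PLS_NAMESPACE = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(graphemes, phoneme, alphabet="x-sampa", lang="en-US"):
    """Wrap one lexeme (several spellings, one pronunciation) in a PLS document."""
    grapheme_xml = "".join(
        f"    <grapheme>{escape(g)}</grapheme>\n" for g in graphemes
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NAMESPACE}"\n'
        f'         alphabet="{alphabet}" xml:lang="{lang}">\n'
        "  <lexeme>\n"
        f"{grapheme_xml}"
        f"    <phoneme>{escape(phoneme)}</phoneme>\n"
        "  </lexeme>\n"
        "</lexicon>\n"
    )


def upload_lexicon(polly_client, name, content):
    """Store the lexicon; `polly_client` is assumed to be boto3.client("polly")."""
    polly_client.put_lexicon(Name=name, Content=content)
```

Once stored, the lexicon is applied per request by passing `LexiconNames=[name]` to `synthesize_speech`.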
12. Goal: Convert text into intelligible, accurate, and natural speech
Challenges:
• Homographs: words written identically that have different
pronunciation
I live in Las Vegas vs This presentation broadcasts live from Las Vegas
• Text normalization: disambiguation of abbreviations, acronyms, units
‘St.’ expanded as ‘street’ or ‘saint’
• Conversion of text to phonemes (Grapheme-to-Phoneme) in
languages with complex mapping such as English e.g. tough,
through, though
• Foreign words (déjà vu), proper names (François Hollande), slang
(ASAP, LOL) etc.
Main Challenges of Text-to-Speech
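As an illustration of the text normalization challenge, here is a toy normalizer that expands "> 20%" into words. It is a sketch only, not how Polly works internally: the function names are hypothetical, it handles just the "greater than" sign before a number and percentages from 0 to 99, and it makes no attempt at ambiguous cases such as "St.".

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if not 0 <= n <= 99:
        raise ValueError("only 0..99 are supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand '>' before a number and one- or two-digit percentages."""
    # ">" directly before a digit is read as "more than".
    text = re.sub(r">\s*(?=\d)", "more than ", text)
    # "20%" becomes "twenty percent"; longer numbers are left untouched.
    return re.sub(
        r"(?<!\d)(\d{1,2})\s*%",
        lambda m: f"{number_to_words(int(m.group(1)))} percent",
        text,
    )


print(normalize("Market grew by > 20%."))
```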
13. [Diagram: text-to-speech pipeline]
Text: Market grew by > 20%.
Words: Market grew by more than twenty percent
Phonemes: ˈmɑɹ.kət ˈgɹu baɪ ˈmoʊɹ ˈðæn ˈtwɛn.ti pɚ.ˈsɛnt
Stages shown: text processing, prosody contour, unit selection and adaptation (using the speech units inventory), prosody modification, streaming
14. Unit Selection
Conversion of phoneme sequence to waveform
Database of recorded audio
Unit – diphone
Coverage of diphones and various features
e.g. Allophonic variation
• Pin vs Spin vs limping
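A diphone spans the transition from one phoneme to the next, so the same phoneme in different contexts yields different units. The sketch below splits a phoneme sequence into diphones; the function name, the `sil` silence marker, and the informal phoneme symbols are assumptions for illustration.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the adjacent-phoneme pairs, padded with silence at both ends."""
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# The /p/ in "pin" is reached from silence, the /p/ in "spin" from /s/,
# so the two words need different units even though both contain "p-I".
print(to_diphones(["p", "I", "n"]))
print(to_diphones(["s", "p", "I", "n"]))
```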
15. Recording Data for TTS
Tons of text → Automatic selection of texts → Recording script → Few weeks of recordings
Recording script:
• Covers all combinations of diphones and significant features in a language
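The slides do not say how the texts are selected automatically. One common way to build a small script with full coverage is a greedy set-cover pass, sketched below as an assumption rather than as the method actually used: at each step it picks the sentence that adds the most units not yet covered. The function name and the input shape (a dict from sentence to its set of diphones) are hypothetical.

```python
def select_script(candidates):
    """Greedily pick sentences until every unit in the corpus is covered.

    `candidates` maps each sentence to the set of units (e.g. diphones)
    it contains. Returns the chosen sentences in selection order.
    """
    remaining = {sentence: set(units) for sentence, units in candidates.items()}
    uncovered = set().union(*remaining.values())
    script = []
    while uncovered:
        # Choose the sentence contributing the most still-uncovered units.
        best = max(remaining, key=lambda s: len(remaining[s] & uncovered))
        script.append(best)
        uncovered -= remaining.pop(best)
    return script
```

Greedy selection does not guarantee the shortest possible script, but it is simple and keeps the number of sentences to record small.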
16. an error occurred while searching for your route
because snaps weren't all so obedient anymore,
now we say apple again. and we say apple,
general electric soars today. information on general electric
quick breads, zucchini, holiday, crock pot, cake,
so are you still keeping tabs on your old team,
that weighs more than four tons, disrupts the herring's swim
…
An apple a day, keeps …
19. First app
from boto3 import Session
from contextlib import closing

# Create a Polly client using the default credentials and region
polly = Session().client("polly")

# Request MP3 audio for the given text
response = polly.synthesize_speech(
    Text="Hello world!",
    OutputFormat="mp3",
    VoiceId="Joanna")

# Write the returned audio stream to a file
with closing(response["AudioStream"]) as stream:
    with open("speech.mp3", "wb") as file:
        file.write(stream.read())
20. Amazon Polly is cost-effective
• Pay-as-you-go
• $4 for 1M characters
• Free Tier of 5M characters/month - first year
• You can store and reuse generated speech
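The pricing above can be turned into a quick estimate. This sketch uses only the figures on the slide ($4 per 1M characters, 5M free characters per month in the first year); the function and constant names are hypothetical, and current prices may differ.

```python
PRICE_PER_MILLION_CHARS = 4.00          # USD, from the slide
FREE_TIER_CHARS_PER_MONTH = 5_000_000   # first year only, from the slide


def monthly_cost(characters, in_free_tier=False):
    """Estimate one month's charge in USD for the given character count."""
    free = FREE_TIER_CHARS_PER_MONTH if in_free_tier else 0
    billable = max(0, characters - free)
    return billable * PRICE_PER_MILLION_CHARS / 1_000_000


print(monthly_cost(7_000_000, in_free_tier=True))  # 2M billable characters
```

Because generated speech can be stored and reused, only text that is synthesized for the first time counts toward this total.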
25. Why voice matters
• Spoken language crucial for
language learning
• Accurate pronunciation matters
• Faster iteration thanks to TTS
• As good as natural human speech
29. Polly (Vitoria)
Portuguese
Old voice
Winner!
“Just today, I started getting a new voice for my Portuguese lessons! It's SO
much better than the previous one (...) in terms of comprehension it's miles
better.”
34. About GoAnimate
• Do-it-yourself animated video creation platform
• Less resource-intensive than professional video creation
• Companies use GoAnimate for:
• Training and eLearning
• HR
• Marketing
• GoAnimate for Schools supports K–12 educators and
their students
35. Use cases for text-to-speech
• Multi-language communication
• Training or HR professionals who have to create content in
many languages
• Video preproduction
• Video makers who need to iterate and fine-tune before the
text-to-speech is eventually replaced by a professional
voiceover
• K–12 education
• Students who make videos and don’t have access to
professional voices or time for or knowledge of voiceover
41. Duolingo voices its language learning service using Polly
Duolingo is a free language learning service where
users help translate the web and rate translations.
“With Amazon Polly our users benefit from the most lifelike
Text-to-Speech voices available on the market.”
Severin Hacker
CTO, Duolingo
• Spoken language crucial for
language learning
• Accurate pronunciation matters
• Faster iteration thanks to TTS
• As good as natural human speech
42. GoAnimate is a cloud-based, animated video creation
platform.
“Amazon Polly gives GoAnimate users the ability to immediately
give voice to the characters they animate using our platform.”
Alvin Hung
CEO, GoAnimate
• Multi-language communication
• Training or HR professionals who have to create content in
many languages
• Video preproduction
• Video makers who need to iterate and fine-tune before the
text-to-speech is eventually replaced by a professional
voiceover
• K–12 education
• Students who make videos and don’t have access to
professional voices or time for or knowledge of voiceover
With Polly, GoAnimate gives voice to the characters in their animations
43. Royal National Institute of Blind People creates and
distributes accessible information in the form of
synthesized content
“Amazon Polly delivers incredibly lifelike voices
which captivate and engage our readers.”
John Worsfold
Solutions Implementation Manager, RNIB
• RNIB delivers the largest library of
audiobooks in the UK for nearly 2
million people with sight loss
• Naturalness of generated speech is
critical to captivate and engage readers
• The lack of restrictions on speech
redistribution enables RNIB to create
and distribute accessible information in
the form of synthesized content
RNIB provides the largest library in the UK for people with sight loss