AWS re:Invent 2016: NEW LAUNCH! Introducing Amazon Polly (MAC204)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rafal Kuklinski – Amazon Text-to-Speech
November 30, 2016
MAC204
NEW LAUNCH! Introducing
Amazon Polly
A Service that Turns Text into Lifelike
Speech

What to expect from the session
• Introduction to Amazon Polly
• Features and functionality
• Text-to-speech: Under the hood
• Getting started
• Pricing
• Case studies
• Q&A

Why we built Amazon Polly
• Apps using voice to communicate with end-users are
becoming more common every day
• Naturalness of generated speech is a key element of
user experience
• Integration of speech varies across use cases

What is Amazon Polly?
• A service that converts text into lifelike speech
• Offers 47 lifelike voices across 24 languages
• Low latency responses enable developers to build
real-time systems
• Developers can store, replay, and distribute
generated speech

Amazon Polly: Quality
Natural-sounding speech
A subjective measure of how close TTS output is to human speech.
Accurate text processing
Ability of the system to interpret common text formats such as abbreviations, numerical
sequences, homographs etc.
Today in Las Vegas, NV it's 54°F.
"We live for the music", live from the Madison Square Garden.
Highly intelligible
A measure of how comprehensible speech is.
“Peter Piper picked a peck of pickled peppers.”

Amazon Polly: Language Portfolio
Americas:
• Brazilian Portuguese
• Canadian French
• English (US)
• Spanish (US)
A-PAC:
• Australian English
• Indian English
• Japanese
EMEA:
• British English
• Danish
• Dutch
• French
• German
• Icelandic
• Italian
• Norwegian
• Polish
• Portuguese
• Romanian
• Russian
• Spanish
• Swedish
• Turkish
• Welsh
• Welsh English

Amazon Polly features: SSML
Speech Synthesis Markup Language
is a W3C recommendation, an XML-based markup language for speech
synthesis applications
<speak>
My name is Kuklinski. It is spelled
<prosody rate='x-slow'>
<say-as interpret-as="characters">Kuklinski</say-as>
</prosody>
</speak>
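A minimal sketch of sending the SSML above to Polly from Python. The helper names `spell_out_ssml` and `synthesize_ssml` are hypothetical; `TextType="ssml"` is the `synthesize_speech` parameter that marks the input as SSML rather than plain text. The client is passed in as an argument so the helpers can be exercised without AWS credentials.

```python
from xml.sax.saxutils import escape


def spell_out_ssml(name: str) -> str:
    """Build SSML that says a name, then spells it slowly (hypothetical helper)."""
    safe = escape(name)  # escape &, <, > so the markup stays well-formed
    return (
        "<speak>"
        f"My name is {safe}. It is spelled "
        '<prosody rate="x-slow">'
        f'<say-as interpret-as="characters">{safe}</say-as>'
        "</prosody>"
        "</speak>"
    )


def synthesize_ssml(polly, ssml: str, voice_id: str = "Joanna") -> bytes:
    """Send SSML to Polly and return the MP3 bytes.

    `polly` is assumed to be a boto3 Polly client, for example
    boto3.Session().client("polly").
    """
    response = polly.synthesize_speech(
        Text=ssml,
        TextType="ssml",  # the text is SSML, not plain text
        OutputFormat="mp3",
        VoiceId=voice_id,
    )
    return response["AudioStream"].read()
```

Typical use would be `synthesize_ssml(polly, spell_out_ssml("Kuklinski"))`, writing the returned bytes to an `.mp3` file.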

Amazon Polly features: Lexicons
Enables developers to customize the pronunciation of
words or phrases
My daughter’s name is Kaja.
<lexeme>
<grapheme>Kaja</grapheme>
<grapheme>kaja</grapheme>
<grapheme>KAJA</grapheme>
<phoneme>"kaI.@</phoneme>
</lexeme>
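The slide shows only the `<lexeme>` element. The sketch below wraps it in a complete W3C Pronunciation Lexicon Specification (PLS) document and uploads it. Assumptions: the phoneme string on the slide is X-SAMPA, the language is `en-US`, and the helper names and the lexicon name `kaja` are hypothetical. `put_lexicon` is the boto3 Polly call for storing a lexicon.

```python
from xml.sax.saxutils import escape, quoteattr

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(graphemes, phoneme, lang="en-US", alphabet="x-sampa"):
    """Return a PLS document with one lexeme (alphabet and language are assumptions)."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f"alphabet={quoteattr(alphabet)} xml:lang={quoteattr(lang)}>",
        "  <lexeme>",
    ]
    # One <grapheme> per spelling variant, as on the slide
    lines += [f"    <grapheme>{escape(g)}</grapheme>" for g in graphemes]
    lines.append(f"    <phoneme>{escape(phoneme)}</phoneme>")
    lines += ["  </lexeme>", "</lexicon>"]
    return "\n".join(lines)


def upload_lexicon(polly, name, content):
    """Store the lexicon under `name`; `polly` is assumed to be a boto3 Polly client."""
    polly.put_lexicon(Name=name, Content=content)
```

After `upload_lexicon(polly, "kaja", build_lexicon(["Kaja", "kaja", "KAJA"], '"kaI.@'))`, the lexicon can be applied by passing `LexiconNames=["kaja"]` to `synthesize_speech`.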

Text-to-Speech: Under the Hood

Main Challenges of Text-to-Speech
Goal: Convert text into intelligible, accurate, and natural speech
Challenges:
• Homographs: words written identically that have different
pronunciations
“I live in Las Vegas” vs “This presentation broadcasts live from Las Vegas”
• Text normalization: disambiguation of abbreviations, acronyms, units
‘St.’ expanded as ‘street’ or ‘saint’
• Conversion of text to phonemes (Grapheme-to-Phoneme) in
languages with complex mapping such as English e.g. tough,
through, though
• Foreign words (déjà vu), proper names (François Hollande), slang
(ASAP, LOL) etc.
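To make the ‘St.’ example concrete, here is a toy heuristic, not how Polly does it: treat ‘St.’ as “Saint” when a capitalized word follows and as “Street” otherwise. The function name `expand_st` is hypothetical.

```python
import re


def expand_st(text: str) -> str:
    """Toy rule: 'St.' before a capitalized word -> 'Saint', otherwise 'Street'.

    The abbreviation's period is consumed by the replacement.
    """

    def replace(match):
        following = text[match.end():].lstrip()
        return "Saint" if following[:1].isupper() else "Street"

    return re.sub(r"\bSt\.", replace, text)


print(expand_st("St. Patrick lives on Main St."))
```

The rule breaks as soon as a sentence ends in ‘St.’ and the next one starts with a capital letter (“Main St. Then ...”), which is the kind of ambiguity that makes text normalization hard.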

[Diagram: TTS pipeline]
TEXT: Market grew by > 20%.
TEXT PROCESSING → WORDS: Market grew by more than twenty percent
PHONEMES: ˈmɑɹ.kət ˈgɹu baɪ ˈmoʊɹ ˈðæn ˈtwɛn.ti pɚ.ˈsɛnt
PROSODY CONTOUR → UNIT SELECTION AND ADAPTATION (using the speech units inventory) → PROSODY MODIFICATION → STREAMING
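The text-processing step in the diagram turns “Market grew by > 20%.” into plain words. A toy sketch of that single step, assuming only the two patterns in the example (a “>” sign and a one- or two-digit percentage); the function names are hypothetical and a production normalizer handles far more cases.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if not 0 <= n < 100:
        raise ValueError("only 0..99 is supported in this sketch")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text: str) -> str:
    """Expand '>' and one- or two-digit percentages into words."""
    text = re.sub(r">\s*", "more than ", text)
    return re.sub(
        r"\b(\d{1,2})\s*%",
        lambda m: f"{number_to_words(int(m.group(1)))} percent",
        text,
    )


print(normalize("Market grew by > 20%."))
```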

Unit Selection
Conversion of phoneme sequence to waveform
Database of recorded audio
Unit – diphone
Coverage of diphones and various features
e.g. Allophonic variation
• Pin vs Spin vs limping
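A small sketch of how a phoneme sequence maps to the diphone units that would be looked up in the recorded database. The silence label `sil`, the `a-b` naming, and the function name are assumptions for illustration.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the diphone labels for a phoneme sequence, padded with silence."""
    padded = [silence, *phonemes, silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


pin = to_diphones(["p", "ɪ", "n"])
spin = to_diphones(["s", "p", "ɪ", "n"])
print(pin)
print(spin)
```

Both words contain the label `p-ɪ`, yet the /p/ in “pin” is aspirated and the one in “spin” is not. This is why the inventory needs to cover features such as allophonic variation and not only the bare diphone labels.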

Recording Data for TTS
Tons of text → Automatic selection of texts → Recording script → Few weeks of recordings
Recording script:
• Covers all combinations of diphones and significant features in a language
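The slides do not say how the automatic selection works. One common way to approach this kind of coverage problem is a greedy set cover: repeatedly pick the sentence that adds the most diphones not yet covered. The sketch below assumes that approach, with hypothetical names and toy data.

```python
def diphones(phonemes, silence="sil"):
    """Set of diphone labels in a phoneme sequence, padded with silence."""
    padded = [silence, *phonemes, silence]
    return {f"{a}-{b}" for a, b in zip(padded, padded[1:])}


def select_script(candidates):
    """Greedy set cover over diphones.

    `candidates` maps each sentence to its phoneme list. Returns the
    sentences in the order they were picked.
    """
    units = {sentence: diphones(p) for sentence, p in candidates.items()}
    uncovered = set().union(*units.values())
    script = []
    while uncovered:
        # Pick the sentence that covers the most still-uncovered diphones
        best = max(units, key=lambda s: len(units[s] & uncovered))
        gain = units[best] & uncovered
        if not gain:
            break
        script.append(best)
        uncovered -= gain
    return script


corpus = {
    "pin": ["p", "ɪ", "n"],
    "spin": ["s", "p", "ɪ", "n"],
    "in": ["ɪ", "n"],
}
print(select_script(corpus))
```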

an error occurred while searching for your route
because snaps weren't all so obedient anymore,
now we say apple again. and we say apple,
general electric soars today. information on general electric
quick breads, zucchini, holiday, crock pot, cake,
so are you still keeping tabs on your old team,
that weighs more than four tons, disrupts the herring's swim
…
An apple a day, keeps …

First app
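# Minimal "hello world": synthesize one sentence and save it as an MP3 file
from boto3 import Session
from contextlib import closing

polly = Session().client("polly")
response = polly.synthesize_speech(
    Text="Hello world!",
    OutputFormat="mp3",
    VoiceId="Joanna")

# AudioStream is a streaming body; close it once the audio has been written
with closing(response["AudioStream"]) as stream:
    with open("speech.mp3", "wb") as file:
        file.write(stream.read())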
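The first app hard-codes `VoiceId="Joanna"`. A short sketch of looking up the voices available for a language with `describe_voices`; the helper name `voice_ids` is hypothetical, the client is passed in so the function can be tried without credentials, and pagination via `NextToken` is ignored for brevity.

```python
def voice_ids(polly, language_code):
    """Return the sorted voice IDs Polly reports for a language code.

    `polly` is assumed to be a boto3 Polly client. Only the first page of
    results is read.
    """
    response = polly.describe_voices(LanguageCode=language_code)
    return sorted(voice["Id"] for voice in response["Voices"])
```

With a real client, `voice_ids(Session().client("polly"), "en-US")` returns IDs that can be passed as `VoiceId` to `synthesize_speech`.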

Amazon Polly is cost-effective
• Pay-as-you-go
• $4 for 1M characters
• Free Tier of 5M characters/month - first year
• You can store and reuse generated speech
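A back-of-the-envelope cost estimate using the figures on this slide. It assumes billing is strictly proportional to character count with no rounding; the constant and function names are hypothetical.

```python
PRICE_PER_MILLION_USD = 4.00   # from the slide: $4 for 1M characters
FREE_TIER_CHARS = 5_000_000    # from the slide: 5M characters/month, first year


def monthly_cost_usd(characters, free_tier=False):
    """Estimated monthly cost in USD for a number of synthesized characters."""
    allowance = FREE_TIER_CHARS if free_tier else 0
    billable = max(0, characters - allowance)
    return billable * PRICE_PER_MILLION_USD / 1_000_000


print(monthly_cost_usd(6_000_000, free_tier=True))
```

Six million characters in a Free Tier month leave one million billable characters, so the estimate is $4.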

Learning a language with TTS
11/30/2016
Amazon Polly in Duolingo
Severin Hacker, CTO

Efficacy
34h of Duolingo
=
1 college semester
[Vesselinov et al, 2012]

Why voice matters
• Spoken language crucial for
language learning
• Accurate pronunciation matters
• Faster iteration thanks to TTS
• As good as natural human speech

A/B Testing Voices
Voice A: Learning = 5, Engagement = 5
Voice B: Learning = 10, Engagement = 10
All results statistically significant (p=0.05)

Polly (Salli)
English
Old voice
Winner!
“The new voice is a huge improvement! I really like it, the old one was
terrible at times.”

Polly (Vitoria)
Portuguese
Old voice
Winner!
“Just today, I started getting a new voice for my Portuguese lessons! It's SO
much better than the previous one (...) in terms of comprehension it's miles
better.”

Polly (Hans)
German
Old voice
Winner!
“The German male TTS is music to my ears”

We use …
Danish: Naja (female)
Dutch: Ruben (male)
English: Salli (female), Joey (male)
German: Hans (male)
Spanish: Miguel (male)
French: Mathieu (male)
Italian: Carla (female)
Norwegian: Liv (female)
Polish: Maja (female)
Portuguese: Vitoria (female), Ricardo (male)
Swedish: Astrid (female)
Turkish: Filiz (female)
Welsh: Gwyneth (female)

Infrastructure
[Diagram: webserver, worker, Amazon SQS, Amazon DynamoDB, Amazon Beanstalk, Amazon Polly, other TTS providers, Amazon S3, Amazon CloudFront]
Labeled flows: TTS request, TTS meta-data, TTS files, global distribution, download
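The diagram does not show how generated audio is keyed or reused. As an illustration of the “store and reuse” idea only, the sketch below derives a deterministic key from the voice and text and synthesizes only on a cache miss. The key layout, the function names, and the use of a plain dict in place of S3 and DynamoDB are all assumptions, not Duolingo's implementation.

```python
import hashlib


def cache_key(text, voice_id, output_format="mp3"):
    """Deterministic storage key for one (voice, format, text) combination."""
    payload = f"{voice_id}\n{output_format}\n{text}".encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    return f"tts/{voice_id}/{digest}.{output_format}"


def get_or_synthesize(text, voice_id, store, synthesize):
    """Return the key of the audio, calling `synthesize` only on a cache miss.

    `store` is any dict-like object (a stand-in for S3 here) and
    `synthesize(text, voice_id)` is assumed to return the audio bytes.
    """
    key = cache_key(text, voice_id)
    if key not in store:
        store[key] = synthesize(text, voice_id)
    return key
```

Because the key depends only on its inputs, a repeated request for the same sentence and voice is served from storage and does not trigger a second synthesis call.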

GoAnimate Case Study
Stacy Adams, Head of Marketing, GoAnimate
@atl2oz
Using Amazon Polly in
Animated Video

About GoAnimate
• Do-it-yourself animated video creation platform
• Less resource-intensive than professional video creation
• Companies use GoAnimate for:
• Training and eLearning
• HR
• Marketing
• GoAnimate for Schools supports K–12 educators and
their students

Use cases for text-to-speech
• Multi-language communication
• Training or HR professionals who have to create content in
many languages
• Video preproduction
• Video makers who need to iterate and fine-tune before the
text-to-speech is eventually replaced by a professional
voiceover
• K–12 education
• Students who make videos and don’t have access to
professional voices or time for or knowledge of voiceover

Remember to complete
your evaluations!

Duolingo voices its language learning service using Polly
Duolingo is a free language learning service where
users help translate the web and rate translations.
“With Amazon Polly our users
benefit from the most lifelike
Text-to-Speech voices
available on the market.”
Severin Hacker
CTO, Duolingo
• Spoken language crucial for
language learning
• Accurate pronunciation matters
• Faster iteration thanks to TTS
• As good as natural human speech

With Polly, GoAnimate gives voice to the characters in their animations
GoAnimate is a cloud-based, animated video creation
platform.
“Amazon Polly gives
GoAnimate users the ability
to immediately give voice to
the characters they animate
using our platform.”
Alvin Hung
CEO, GoAnimate
• Multi-language communication
• Training or HR professionals who
have to create content in many
languages
• Video preproduction
• Video makers who need to iterate
and fine-tune before the text-to-speech
is eventually replaced by a
professional voiceover
• K–12 education
• Students who make videos and
don’t have access to professional
voices or time for or knowledge of
voiceover

RNIB provides the largest library in the UK for people with sight loss
Royal National Institute of Blind People creates and
distributes accessible information in the form of
synthesized content
“Amazon Polly delivers
incredibly lifelike voices
which captivate and engage
our readers.”
John Worsfold
Solutions Implementation Manager, RNIB
• RNIB delivers the largest library of
audiobooks in the UK for nearly 2
million people with sight loss
• Naturalness of generated speech is
critical to captivate and engage readers
• No restrictions on speech
redistribution enable RNIB to create
and distribute accessible information in
the form of synthesized content
