Fundamentals of Text To
Speech in UC	

Patrick Dexter
Thank you. My name is Patrick Dexter, from a company called Cepstral, and today I’ll be talking about Text To Speech voices. We’ll discuss how a TTS voice is made, the component parts of text to speech software, and how that fits into Unified Communications software like Elastix.
Cepstral
Text To Speech innovator	

Founded in 2001 	

Focus on North and South America	

Elastix Partner since 2011	

To give you some background information, Cepstral is a commercial company spun out of Carnegie Mellon University in Pittsburgh, Pennsylvania. We have customers all around the world, from announcements at train stations in New Zealand and Australia to thousands of concurrent ports in large call centers in Canada. Our main customer base is in North and South America, and we’ve been a proud partner of Elastix since 2011.
@Cepstral_LLC	

Our marketing department wouldn’t let me do this presentation without giving you our Twitter address. It is also useful: if you have any questions about this presentation or TTS in general, tweet them to me and I’ll respond.
What is Text To Speech?
So what is Text To Speech? Text to Speech is the ability to create audio that was never recorded before. There are far too many words to record them all, and new ones are being created every day. We see this all the time in telephone systems: you need to tell a caller the amount of money they have in an account, or that their package will be delivered to a specific address. Information is constantly changing, so you need a way to get it to your callers.
Fun History of TTS
Before we dive into more details about Text To Speech, I want to show you one of the earliest speech synthesis devices.
The machine pictured on this slide is a replica of the first speech synthesizer, originally developed by Wolfgang von Kempelen in the late 1700s. Interestingly, this replica from the 1840s was viewed and studied by Alexander Graham Bell, who created his own version and used many of the ideas when he invented the telephone! So speech synthesis and the telephone have been used together since the very beginning.
Text To Speech	

Technologies
So there are several different competing technologies that are used to create Text To Speech voices.
• Formant	

• Diphone	

• Statistical Parametric
If you’re familiar with Text To Speech you’ve heard of some of these.
Formant synthesis creates mathematical models of the tissue in the mouth and lungs. It has a very small footprint but requires a great deal of computing power to operate. To me it sounds like an opera singer doing scales of aaaaahhhhhhs. Formant synthesizers are good at doing vowel sounds in a range of pitches.
Diphone voices are quite robotic. This is the Stephen Hawking voice: it’s easily understood but doesn’t sound like a person.
Statistical parametric voices are sometimes called HMM voices for their use of Hidden Markov Models: they build a model of speech from a corpus and then use that model to generate new audio.
Unit Selection Synthesis
But the primary technology used in commercial Text To Speech systems today is Unit Selection synthesis. You’re already familiar with this if your mobile phone talks back to you: it’s what Siri and Google Now use.
Unit Selection voices provide the most human-like experience today, and that’s because they are made from recordings of actual people. These recordings are identified and labeled to create a database of sounds.
In this sense Unit Selection is similar to a ransom note
We’ve all seen these in movies and TV shows. You cut up letters and rearrange them to form new words. At the most basic level this is what Unit Selection Synthesis is all about. In English there are 26 letters and in Spanish 29, so all we have to do is record about 30 things and we have a Unit Selection Text to Speech voice, right? Well, no. Unfortunately it’s not that easy.
Unit Selection Synthesis
You can’t just record the alphabet, because human speech is not made up of letters. We use letters to write down speech, but when spoken, speech is made up of sounds which we call phonemes. How these phonemes are pronounced can vary quite a lot depending on what you’re saying and, importantly, where in the sentence that sound occurs. So much so that there are specialized alphabets specifically for phonemes.
agua
So let’s take a look at phonemes. I tried to use Spanish examples in this presentation where possible.
Agua is a word that I can pronounce even with my limited Spanish skills.
a1 g xu0 a0
And here is the phonetic spelling of that word. If you’re curious, this is based on the Carnegie Mellon University phonetic set. There are several other phonetic alphabets (IPA and SAMPA are popular), but we use a variation of the CMU alphabet for our voices.
[a1] g xu0 a0
OK, so looking at this first phoneme, we are depicting the ah sound with a 1, which denotes that the vowel is stressed.
a1 [g] xu0 a0
The guh sound is represented by the g.
a1 g [xu0] a0
And here’s one that should be new: xu with a 0. This is where phonetic alphabets start to differ from a regular alphabet. If we just used the u to describe this sound it wouldn’t work.
aquí
Here’s a very similar word. It has a u in it, but it’s pronounced completely differently: aquí versus agua. Do you hear that whuu sound?
a1 g [xu0] a0
That whuuu sound is identified by this X U phone, and the 0 marks it as being unstressed.
a1 g xu0 [a0]
The a 0 now completes the word to give us the whhhuuaaa sound.
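As a rough illustration of how this notation can be handled in software, here is a minimal Python sketch that splits a phonetic spelling into phones and stress marks. The `Phone` type and the `parse_phones` function are names made up for this example; they are not part of Cepstral's software, and the sketch assumes that a trailing digit is always a stress mark.

```python
from typing import List, NamedTuple, Optional


class Phone(NamedTuple):
    symbol: str
    stress: Optional[int]  # 1 = stressed, 0 = unstressed, None = no mark


def parse_phones(spelling: str) -> List[Phone]:
    """Split a spelling such as 'a1 g xu0 a0' into phones (hypothetical helper)."""
    phones = []
    for token in spelling.split():
        if token[-1].isdigit():
            # Assumption: the last character is the stress digit.
            phones.append(Phone(token[:-1], int(token[-1])))
        else:
            phones.append(Phone(token, None))
    return phones


agua = parse_phones("a1 g xu0 a0")
for phone in agua:
    print(phone.symbol, phone.stress)
```

Keeping the stress mark separate from the symbol makes it easy to treat a stressed and an unstressed a as the same vowel when that is convenient.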
Create a new unit
selection TTS
voice
I think the really cool thing about Unit Selection voices is that we need to select a specific person to record a new voice, and that their voice will, theoretically, live on forever.
So how do we grab all of those phonemes? We lock that lucky voice actor in a sound booth and force them to record hours and hours’ worth of carefully worded scripts. These scripts are designed to capture as many of the phoneme interactions as possible.
Nowadays I wish I
used cheese to
coax them out,
because bacon can
be awkward.
Here’s an actual sentence from our English script. This sentence is grammatically correct, which helps the voice actor to read it in a natural tone. Once this has been recorded, the audio file is segmented into individual phonemes, and the location of each phoneme within the syllables, words, common phrases, and sentences is noted. This allows us to better match sounds when the software runs. We’ll see an example of this in a minute.
Labeling
Pitch

Duration	

Position	

Diphones
The phonemes are then labelled with acoustic parameters or context factors like the fundamental frequency or pitch, time duration, position in the syllable, and neighboring phonemes. Because of all of these different possible interactions, when we’re done we’ll have hundreds of thousands of examples, or units, of phonemes in our database.
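One way to picture the labeled database is as a list of records, one per unit, indexed by phoneme. The sketch below is only an illustration: the `Unit` fields, the `build_index` helper, and every number in it are invented for the example and do not come from a real voice.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Unit:
    phoneme: str                  # e.g. "xu"
    pitch_hz: float               # fundamental frequency (made-up values below)
    duration_ms: float
    syllable_position: str        # "onset", "nucleus" or "coda"
    prev_phoneme: Optional[str]   # neighbor on the left, None at a boundary
    next_phoneme: Optional[str]   # neighbor on the right
    source: str                   # recording the unit was cut from


def build_index(units: List[Unit]) -> Dict[str, List[Unit]]:
    """Group units by phoneme so candidates can be looked up quickly."""
    index = defaultdict(list)
    for unit in units:
        index[unit.phoneme].append(unit)
    return dict(index)


units = [
    Unit("a", 180.0, 95.0, "nucleus", None, "g", "agua"),
    Unit("g", 120.0, 40.0, "onset", "a", "xu", "agua"),
    Unit("xu", 170.0, 60.0, "onset", "g", "a", "agua"),
    Unit("a", 150.0, 110.0, "nucleus", "xu", None, "agua"),
]
index = build_index(units)
print(len(index["a"]))  # two different examples of the same phoneme
```

Even this one word yields two distinct units for a, which hints at why the number of units grows so quickly.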

agua salida sola jamón hora
Now that we have a database filled with units, we can select them to create new audio that was never said by the voice talent. Here’s a group of words, most likely recorded at different times of the day or even days or years apart. We have to use the same voice talent for a single voice, so based on our own testing and customer feedback we’ll record new material and build that into the voice in order to provide more natural synthesis.
Let’s go through this and create a new word by selecting units from these recordings.
a[gua] salida sola jamón hora
g xu a
Starting off with our old friend agua, we’ll take the last bit of that. You can see that we’re grabbing several phonemes at the same time. The TTS engine is looking for units that will match up best, so if it can find phonemes that were already next to each other, the audio will probably sound more natural.
What we want is for the phonemes from different recordings to join together smoothly. If they don’t, there’s a jump that the two sounds have to make, and that’s when you hear the glitches in TTS audio that make it sound robotic. So getting smooth joins is of paramount importance. That’s one of the reasons why we need hundreds of thousands of these phonemes.
agua sali[da] sola jamón hora
g xu a d a
Continuing along with our example, we now add a group of phonemes from the next word.
agua salida so[la] jamón hora
g xu a d a l a
We’ll continue to do this to build out the new word.
agua salida sola [ja]món hora
g xu a d a l a x a
In an actual TTS engine this selection takes only milliseconds. This is how the software can be used in a telephone or unified communications system: it operates faster than real time. The engine also performs a number of other calculations that we’ll look at in a minute. To me it’s still amazing that Text To Speech works at all.
agua salida sola jamón ho[ra]
g xu a d a l a x a r a
And finishing up gives us:
g xu a0 d a0 l a1 x a0 r a
Guadalajara
The Mexican city of Guadalajara
Now this is just a simple example of creating a new word. In real life a TTS engine looks at features like phrase boundaries: does the phoneme occur at the beginning, middle, or end of a word? Going even further, where in the original recording was the word? All of these attributes influence how that phoneme is said.
Hay agua en Marte.
Beber el agua.
The whuuuuuu phonemes from agua in these two sentences are labeled differently. Typically the pitch descends at the end of a sentence: Beber el agua. I know my Spanish is very, very bad, but in el agua the l bleeds into the a. It would be difficult to take that a phoneme and use it in Marte, for example.
The perfect unit required at synthesis time may not be available in the database, so a selection must be performed to choose, from amongst the many
slightly mis-matched units, the best available sequence of units to concatenate. The more units we have the greater the chance that we’ll find that perfect
unit.
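Choosing the best available sequence is a search problem, and one common way to solve that kind of problem is dynamic programming. The sketch below scores only the joins, using the pitch jump between neighboring units as a stand-in for the join cost; a real engine would weigh many more features, including how well each unit matches its target. The function names, unit labels, and pitch values are all hypothetical.

```python
from typing import Dict, List, Tuple

# A candidate unit: (label, pitch at its start in Hz, pitch at its end in Hz).
Unit = Tuple[str, float, float]


def join_cost(left: Unit, right: Unit) -> float:
    """Size of the pitch jump at the join; 0 would be a perfectly smooth join."""
    return abs(left[2] - right[1])


def select_units(targets: List[str], database: Dict[str, List[Unit]]) -> List[Unit]:
    """Pick one candidate per target so the summed join cost is minimal.

    Assumes a non-empty target list with at least one candidate per target.
    """
    # best[i][j] = (cheapest total cost ending in candidate j of target i,
    #               index of the previous candidate on that cheapest path)
    best: List[List[Tuple[float, int]]] = []
    for i, target in enumerate(targets):
        row = []
        for unit in database[target]:
            if i == 0:
                row.append((0.0, -1))
                continue
            previous = database[targets[i - 1]]
            costs = [best[i - 1][k][0] + join_cost(prev, unit)
                     for k, prev in enumerate(previous)]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(database[targets[i]][j])
        j = best[i][j][1]
    return path[::-1]


database = {
    "gua": [("gua@agua", 170.0, 150.0)],
    "da": [("da@salida#1", 190.0, 175.0), ("da@salida#2", 152.0, 148.0)],
    "la": [("la@sola#1", 120.0, 110.0), ("la@sola#2", 150.0, 160.0)],
}
chosen = select_units(["gua", "da", "la"], database)
print([unit[0] for unit in chosen])
```

With these made-up values the second recordings of da and la are chosen, because their pitch lines up most closely with the neighboring units.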
User Lexicons
We’ve been talking about the research end of speech synthesis, but knowing all of this also helps with production applications.
Being familiar with phonemes and phonetic alphabets gives both you and the end users of Text To Speech software the ability to customize the voice through a user lexicon.
User Lexicons
word = phonemes
word = phonemes
word = phonemes
word = phonemes
word = phonemes
User lexicons are lookup tables that replace words in the text with user-defined pronunciations. These can be specialized acronyms that are specific to a company, or people’s names, which is often very useful when using an English voice to pronounce names from Spanish or another language. Lexicons fine-tune the audio to make sure that it’s as understandable as possible.
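A lookup table like this is simple to sketch in Python. The `word = phonemes` layout below just mirrors the slide; it is not meant to be the actual file syntax of Cepstral's lexicons, and `load_lexicon`, `pronounce`, and the letter-by-letter `spell_out` fallback are hypothetical names used only for illustration.

```python
import re
from typing import Callable, Dict, List


def load_lexicon(text: str) -> Dict[str, List[str]]:
    """Parse 'word = phonemes' lines into a table keyed by lowercase word."""
    lexicon = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        word, phonemes = line.split("=", 1)
        lexicon[word.strip().lower()] = phonemes.split()
    return lexicon


def pronounce(sentence: str,
              lexicon: Dict[str, List[str]],
              fallback: Callable[[str], List[str]]) -> List[List[str]]:
    """Use the user's pronunciation when one exists, else the fallback."""
    result = []
    for word in re.findall(r"[^\W\d_]+", sentence):  # runs of letters
        result.append(lexicon.get(word.lower()) or fallback(word))
    return result


def spell_out(word: str) -> List[str]:
    # Placeholder for the engine's built-in pronunciation rules.
    return list(word.lower())


lexicon = load_lexicon("agua = a1 g xu0 a0\n")
print(pronounce("El agua", lexicon, spell_out))
```

The important design point is the order of the lookup: the user's entry wins, and the engine's own rules are only consulted when the word is not in the table.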
Text
Normalization
I mentioned earlier that in milliseconds the engine performs other calculations as well. One of the more important calculations is called Text
Normalization. This actually happens first in the process. So let’s take a look at what that means and why it is a challenge for all Text To Speech engines.
7/10
Let’s say you had a piece of text that said this. What could it mean?
7/10
7 de octubre
Here in Colombia it could be today’s date, October 7th.
7/10
MM/DD/YYYY
In the United States the date format is different.
7/10
July 10th
So this exact same text would be July 10th to me. It’s a bit absurd. I think we’re one of the only countries to use this format; it probably has something to do with our hatred of the metric system as well. I guess we just like to be difficult. Moving on.
7/10
7 dividido por 10
Or we could look at it another way and it would be a math problem. These are the types of issues that Text Normalization has to solve. Unless the
software knows what the sentence means it can’t properly pronounce the words to convey that information clearly. One of the keys to figuring these
things out is identifying and analyzing them in the context of the sentence as a whole.
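The three readings of 7/10 can be written down as a small Python sketch. Here the caller states which reading applies, which is exactly the decision a real engine has to make from context; the function name `read_slash_pair` and the reading labels are invented for this example.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def read_slash_pair(text: str, reading: str) -> str:
    """Expand text such as '7/10' according to the chosen reading."""
    first, second = (int(part) for part in text.split("/"))
    if reading == "day_month":    # e.g. Colombia: day first
        return f"{MONTHS[second - 1]} {first}"
    if reading == "month_day":    # United States: month first
        return f"{MONTHS[first - 1]} {second}"
    if reading == "fraction":     # a math problem
        return f"{first} divided by {second}"
    raise ValueError(f"unknown reading: {reading}")


for reading in ("day_month", "month_day", "fraction"):
    print(reading, "->", read_slash_pair("7/10", reading))
```

The same four characters produce three different spoken outputs, so the hard part is not the expansion itself but choosing the reading.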
This is a courtesy call to
remind Patrick Dexter
of an appointment with
Dr. Steel on 10/07/2015
at 10:30 am.
About a week before my dentist appointment, I receive a phone call reminding me to floss so I don’t get yelled at by the guy with sharp stabby things in my mouth.
If you have a service that provides outbound phone calls, this message is exactly the type of automation that Text To Speech is perfect for.
This is a courtesy call to
remind Patrick Dexter
of an appointment with
Dr. Steel on 10/07/2015
at 10:30 am.
In a high call volume environment there’s no way you could record every name. And on the call you really do want to identify a specific person - in this
case Patrick Dexter. You don’t want the person to show up to the appointment with their son or daughter when it’s actually their appointment.
This is a courtesy call to
remind Patrick Dexter
of an appointment with
Dr. Steel on 10/07/2015
at 10:30 am.
But even after saying the person’s name, there’s so much here that a computer can easily get wrong. The text to speech software needs to identify the text D R period not as durrrr followed by the end of the sentence, but as the abbreviation for Doctor. Doctor Steel is what a person reading this would think, so that’s what the engine has to say.
This is a courtesy call to
remind Patrick Dexter
of an appointment with
Dr. Steel on 10/07/2015
at 10:30 am.
So, getting back to our first example, the text to speech engine should be able to correctly interpret this text as a date. It will look at the sentence and know that, at least in English, on is a preposition that typically comes before a date, so that’s a very good clue, and for a US English voice the format is month, day, year. This gives you an idea of the type of rules that are built into TTS software. Some of the fun research being done in the field of computational linguistics is how to apply more artificial intelligence to this process rather than strict rule-based decision trees.
This is a courtesy call to
remind Patrick Dexter
of an appointment with
Dr. Steel on 10/07/2015
at 10:30 am.
And we’re not done yet. At the end of our sentence there is again something ambiguous: the time of the appointment. If the TTS engine reads this as 10 colon 30 ammmm, it will cause confusion.
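To give a feel for what such rules look like, here is a small Python sketch that handles the three trouble spots in the reminder: the Dr. abbreviation, the date after on, and the time. It is a toy with three hand-written regular expressions, assuming US month/day/year order, and is not how any particular engine implements normalization. All function names are made up for the example.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty"]


def number_words(n: int) -> str:
    """Spell out 0..59, which is enough for hours and minutes."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])


def expand_date(match):
    # Assumption: US order, so the first number is the month.
    month, day, year = int(match.group(1)), int(match.group(2)), match.group(3)
    return f"on {MONTHS[month - 1]} {day}, {year}"


def expand_time(match):
    hour, minute = int(match.group(1)), int(match.group(2))
    if minute == 0:
        spoken = f"{number_words(hour)} o'clock"
    elif minute < 10:
        spoken = f"{number_words(hour)} oh {number_words(minute)}"
    else:
        spoken = f"{number_words(hour)} {number_words(minute)}"
    return f"{spoken} {' '.join(match.group(3).upper())}"  # "am" -> "A M"


def normalize(text: str) -> str:
    # "Dr." directly before a capitalized word is read as a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    # A slash date right after the preposition "on".
    text = re.sub(r"\bon (\d{1,2})/(\d{1,2})/(\d{4})", expand_date, text)
    text = re.sub(r"\b(\d{1,2}):(\d{2})\s*(am|pm)\b", expand_time, text,
                  flags=re.IGNORECASE)
    return text


reminder = ("This is a courtesy call to remind Patrick Dexter of an appointment "
            "with Dr. Steel on 10/07/2015 at 10:30 am.")
print(normalize(reminder))
```

The day and year are left as digits here; a fuller normalizer would spell those out as well.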
Heteronym
Another issue is this lovely thing. Does anyone know what a heteronym is?
It’s an evil part of the English language where a single spelling can mean two different things and be pronounced differently! I don’t believe Spanish has heteronyms, but if it does I’d love to find out more.
@Cepstral_LLC	

This is where that twitter handle becomes relevant.
Bass
The word bass can be a fish.
Bass
Or it can be pronounced like base and, in music, mean the deep low end, as in the bass clef versus the treble clef.
Object
This is a fun word that can be either a verb or a noun.
Me opongo a
ese objeto.
In Spanish this sentence may make sense. I’m really hoping so; I used Google Translate. But you have two different words for the verb and the noun.
I object to that
object.
But in English the sentence is I object to that object. Do you hear the two different pronunciations of the exact same word? ob-JECT versus OB-ject. If you say both the same way to an English speaker, it doesn’t make sense.
I object (verb) to
that object.
The engine can figure out the part of speech to help determine the pronunciation. Here object is a verb,
I object to that
object (noun).
and here it is a noun. This functionality is called a part of speech tagger, and it’s very helpful in a text to speech engine.
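As a toy illustration of the idea, the Python sketch below guesses the part of speech of object from the word before it and picks the pronunciation accordingly. A real part of speech tagger is far more sophisticated; the single rule, the word list, and the names used here are assumptions made only for this example.

```python
import re
from typing import List, Tuple

# Stress patterns for one heteronym, keyed by part of speech.
PRONUNCIATIONS = {"object": {"verb": "ob-JECT", "noun": "OB-ject"}}
SUBJECT_PRONOUNS = {"i", "you", "we", "they"}


def pronounce_heteronyms(sentence: str) -> List[Tuple[str, str, str]]:
    """Return (word, part of speech, pronunciation) for each known heteronym."""
    words = re.findall(r"[a-z']+", sentence.lower())
    result = []
    for i, word in enumerate(words):
        if word not in PRONUNCIATIONS:
            continue
        previous = words[i - 1] if i > 0 else ""
        # Toy rule: right after a subject pronoun it is a verb, otherwise a noun.
        pos = "verb" if previous in SUBJECT_PRONOUNS else "noun"
        result.append((word, pos, PRONUNCIATIONS[word][pos]))
    return result


for entry in pronounce_heteronyms("I object to that object."):
    print(entry)
```

The point is only that the pronunciation is chosen after the tag, so the quality of the tagger directly limits the quality of the speech.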
$10 per day
Currency interpretation is also important and is something we see all of the time in outbound phone call campaigns. Maybe this is a phone call to remind someone of a fine they will have to pay, or a utility bill. We’d read this as 10 dollars or 10 pesos per day.
$10.5 million per
day
We’d read this as 10 point 5 million dollars per day, not 10 period 5 dollars million per day. So the engine has to look at the entire sentence, both backwards and forwards, in order to understand not only what the text is but how all of the words interact.
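The look-ahead can be sketched in Python with one regular expression that reads the amount together with an optional scale word after it, so that the word dollars can be moved to the end. This assumes a dollar voice and only three scale words; the pattern and function names are invented for the example.

```python
import re

# "$", whole part, optional decimals, optional scale word that follows.
CURRENCY = re.compile(r"\$(\d+)(?:\.(\d+))?(?:\s+(thousand|million|billion))?")


def speak_amount(match) -> str:
    whole, fraction, scale = match.groups()
    if scale:
        # "$10.5 million" -> "10 point 5 million dollars"
        number = whole if fraction is None else f"{whole} point {' '.join(fraction)}"
        return f"{number} {scale} dollars"
    dollars = f"{whole} {'dollar' if whole == '1' else 'dollars'}"
    if fraction is None:
        return dollars
    # Treat up to two decimals as cents, so ".5" means 50 cents.
    cents = int(fraction.ljust(2, "0")[:2])
    return f"{dollars} and {cents} cents"


def normalize_currency(text: str) -> str:
    return CURRENCY.sub(speak_amount, text)


print(normalize_currency("$10 per day"))
print(normalize_currency("$10.5 million per day"))
```

Without the optional scale group the second sentence would come out as 10 dollars and 50 cents million, which is the kind of mistake described above.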
Text Normalization occurs first in order to determine what the text as a whole means. Periods and commas are interpreted to add pauses, and numbers and dates are converted into formats that are friendlier to the ear. Abbreviations and so much more are all figured out so that the human-computer interaction can occur.
To put all those pieces together, we have our text to speech software here.
1. The text is sent to it.
2. Text Normalization occurs, working out all of those dates and currency amounts, the parts of speech, and the heteronym information. User lexicons are checked to see if there are custom pronunciations.
3. The best possible units are selected from hundreds of thousands of examples, based on all of those acoustic parameters like pitch, duration, and position relative to other phonemes.
4. The units are joined together to generate a waveform.
5. And the audio is output to the user. Magic!
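The five steps can be read as a chain of functions, each feeding the next. The Python sketch below shows only that ordering: every stage is a deliberately trivial stand-in, and the lexicon entries and sample values are made up.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    # Step 2: toy text normalization (a single abbreviation rule).
    return text.replace("Dr.", "Doctor")


def to_phonemes(text: str, lexicon: Dict[str, List[str]]) -> List[str]:
    # Step 2, continued: look every word up in the lexicon.
    phonemes: List[str] = []
    for word in text.lower().split():
        phonemes.extend(lexicon[word])
    return phonemes


def select_units(phonemes: List[str],
                 database: Dict[str, List[List[float]]]) -> List[List[float]]:
    # Step 3: toy selection that always takes the first candidate.
    return [database[phoneme][0] for phoneme in phonemes]


def join_units(units: List[List[float]]) -> List[float]:
    # Step 4: concatenate the audio samples of the chosen units.
    waveform: List[float] = []
    for samples in units:
        waveform.extend(samples)
    return waveform


def synthesize(text: str, lexicon, database) -> List[float]:
    # Step 1 is the call itself; step 5 is the returned audio.
    return join_units(select_units(to_phonemes(normalize(text), lexicon), database))


lexicon = {"agua": ["a1", "g", "xu0", "a0"]}
database = {"a1": [[0.1, 0.2]], "g": [[0.0]], "xu0": [[0.3]], "a0": [[0.1]]}
audio = synthesize("agua", lexicon, database)
print(audio)
```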
In the world of telephones and Unified Communications, Text To Speech isn’t just magic, though. It’s an incredibly powerful tool that gives you the ability to deliver information to your callers. Let’s take a look at an example of this.
Since Elastix is built on top of Asterisk, we can use existing tools like ODBC database connections for grabbing variable information, and the open source module app_swift for linking Cepstral into Asterisk.
app_swift
We'll be looking at app_swift, which is specific to Cepstral text to speech.
MRCP
But there's also a protocol called MRCP if you want to use other TTS engines, and MRCP is also used to add speech recognition. It’s great for larger installs as well, where you may have multiple Elastix servers that all need to share TTS resources. Again, use Twitter or see me after the talk with any questions on this.
exten=>n,Swift("Hello! Thank you for calling Cepstral."|4000|3)
exten=>n,Set(CALL_TRANSFER=${FILTER(0-9,${SWIFT_DTMF})})
Here’s an app_swift example. Swift is the name of Cepstral’s TTS engine, so the module adds the Swift command to your dialplan. It says a simple greeting and then uses Asterisk functionality to listen for a DTMF tone.
We’ve had customers with hundreds of menu options in their IVR who have the entire thing read in real time by TTS voices, allowing them to make menu updates and changes on the fly. Want to add a new IVR option? Simply reload extensions.conf and the menu changes. No recording of prompts, no uploading wav files. Very easy to maintain.
exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})})
exten => 123,9,Swift(${BALANCE})
Like I said before, with ODBC you can also create very complex database-driven systems right in the dialplan, or you can use AGI to do this.
Here’s a very simple example of querying a database for an account balance.
exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})})
exten => 123,9,Swift(${BALANCE})
And then we use the swift command right in the dialplan to read that back to the caller. Because you have the $ symbol in there, Cepstral will perform the text normalization we discussed and read this off as a person would say it.
You can really expand what’s possible when automating inbound or outbound calls. Do you have a technical support line? Have the caller identify themselves with a ticket number and read back the last note that a customer service rep left for them, or a status update on a known system outage.
https://vimeo.com/84233208
There's a fantastic video available from Elastix training that goes into detail on how to install and configure TTS on Elastix and how to set up an AGI script that makes use of the TTS software like this. The video shows a demo app for employees to find out more information about outstanding loans.
I'll tweet a link to the video, and I do recommend that you watch it. It’s in Spanish, and you can pause and rewatch it as often as you like. Now that you know how text to speech software works, it will make more sense if you’re adding TTS to your Elastix systems.
• Text To Speech automates
the delivery of
information. 	

• Grow IVR usage without
adding call center
employees	

To end: Text To Speech is a powerful tool that can read any information that you have in your systems. It is useful not only for traditional IVR systems; if you have a call center agent or customer service rep reading information to a caller, Text To Speech can automate that, allowing you to grow call volumes without adding agents. It also frees agents up to handle more difficult tasks that can’t be automated.
¡Gracias!
Thank you very much for the opportunity to speak to all of you today.

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 

Dynamic calls with Text To Speech

  • 1. Fundamentals of Text To Speech in UC Patrick Dexter Thank you - my name is Patrick Dexter with a company Cepstral and today I’ll be talking about Text To Speech voices. We’ll discuss how a TTS voice is made, the component parts of text to speech software, and how that fits into Unified Communications software like Elastix
  • 2. Cepstral Text To Speech innovator Founded in 2001 Focus on North and South America Elastix Partner since 2011 To give you some background information Cepstral is a commercial company spun out of Carnegie Mellon University in Pittsburgh Pennsylvania. We have customers all around the world from doing announcements at train stations in New Zealand and Australia to delivering 1000s of concurrent ports in large call centers in Canada. Our main customer base is in North and South America. And we’ve been a proud partner of Elastix since 2011.
  • 3. @Cepstral_LLC Our marketing department wouldn’t let me do this presentation without giving you our Twitter address. But this is also useful: if you have any questions about this presentation or TTS in general, tweet them to me and I’ll respond.
  • 4. What is Text To Speech? So what is Text To Speech? Text to Speech is the ability to create audio that was never recorded before. There’s far too many words to record them all and new ones are being created every day. We see this all the time in Telephone systems. You need to tell a caller the amount of money they have in an account. Or that their package will be delivered to a specific address. Information is constantly changing so you need a way to get it to your callers.
  • 5. Fun History of TTS Before we dive into more details about Text To Speech I want to show you one of the earliest Speech Synthesis devices
  • 6. The machine pictured on this slide is a replica of the first speech synthesizer originally developed by Wolfgang Von Kempelen in the late 1700s. Interestingly this machine from the 1840s was viewed and studied by Alexander Graham Bell who created his own version and used many of the ideas when he invented the telephone! So Speech Synthesis and the Telephone have been used together since the very beginning.
  • 7. Text To Speech Technologies So there are several different competing technologies that are used to create Text To Speech voices.
  • 8. • Formant • Diphone • Statistical Parametric If you’re familiar with Text To Speech you’ve heard of some of these. Formant synthesis creates mathematical models of the tissue in the mouth and lungs. It has a very small footprint but requires a great deal of computing power to operate. To me it sounds like an opera singer doing scales of aaaaahhhhhhs. Formant synthesizers are good at doing vowel sounds in a range of pitches. Diphone voices are quite robotic - this is the Stephen Hawking voice. It’s easily understood but doesn’t sound like a person. Statistical parametric voices are sometimes called HMM voices for their use of Hidden Markov Models: they build a model of speech from a corpus and then use that model to generate the new audio.
  • 9. Unit Selection Synthesis But the primary technology being used in commercial Text To Speech systems today is Unit Selection synthesis. You’re already familiar with this if your mobile phone talks back to you. It’s what Siri and Google Now use. Unit Selection voices provide the most human-like experience today, and that’s because they are made with the recordings of actual people. These recordings are identified and labeled to create a database of sounds. In this sense Unit Selection is similar to a ransom note.
  • 10. We’ve all seen these in movies and TV shows. You cut up letters and rearrange them to form new words. At the most base level this is what Unit Selection Synthesis is all about. In English there are 26 letters in Spanish 29 so all we have to do is record about 30 things and we have a Unit Selection Text to Speech voice, right? Well no - unfortunately it’s not this easy.
  • 11. Unit Selection Synthesis You can’t just record the alphabet because human speech is not made up of letters. We use letters to write down speech, but when spoken, speech is made up of sounds which we call phonemes. And how these phonemes are pronounced can vary quite a lot depending on what you’re saying and, importantly, where in the sentence that sound occurs. So much so that there are specialized alphabets specifically for phonemes.
  • 12. agua So let’s take a look at phonemes. I tried to modify this presentation with Spanish examples when possible. ! Agua is a word that even with my limited Spanish skills I can pronounce.
  • 13. a1 g xu0 a0 And here is the phonetic spelling of that word. If you’re curious, this is based on the Carnegie Mellon University phonetic set. There are several other phoneme dictionaries (IPA and SAMPA are popular) but we use a variation of the CMU alphabet for our voices.
  • 14. a1g xu0 a0 OK so looking at this first phoneme we are depicting the ah sound with a 1 which denotes that the vowel is stressed.
  • 15. a1 gxu0 a0 The guh sound is represented by the g
  • 16. a1 g xu0a0 And here’s one that should be new - xu with a 0. This is where phonetic alphabets start to differ from a regular alphabet. If we just used the u to describe this sound it wouldn’t work.
  • 17. aquí Here’s a very similar word - it has a u in it but it’s pronounced completely differently: aquí versus agua. Do you hear that whuu sound?
  • 18. a1 g xu0a0 That whuuu sound is identified by this X U phone. and the 0 marks it as being unstressed.
  • 19. a1 g xu0 a0 the a 0 now completes the word to give us the whhhuuaaa sound.
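To make the notation on these slides concrete, here is a minimal Python sketch that splits a phonetic spelling such as "a1 g xu0 a0" into phones and stress marks. The function name `parse_phones` and the token pattern are assumptions made for illustration; this is not Cepstral's parser, and it only handles the lowercase phone-plus-optional-digit shape shown on the slides.

```python
import re

def parse_phones(spelling):
    """Split a phonetic spelling into (phone, stress) pairs.

    stress is 1 (stressed), 0 (unstressed), or None when the token
    carries no digit, as with the consonant "g" on the slide.
    """
    units = []
    for token in spelling.split():
        match = re.fullmatch(r"([a-z]+)([01])?", token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        phone, stress = match.groups()
        units.append((phone, int(stress) if stress is not None else None))
    return units

print(parse_phones("a1 g xu0 a0"))
# [('a', 1), ('g', None), ('xu', 0), ('a', 0)]
```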
  • 20. Create a new unit selection TTS voice I think the really cool thing about Unit Selection voices is that we need to select a specific person to record a new voice, and that their voice will theoretically live on forever.
  • 21. So how do we grab all of those phonemes? We lock that lucky voice actor in a sound booth and force them to record hours and hours worth of carefully worded scripts. These scripts are designed to capture as many of the phoneme interactions as possible.
  • 22. Nowadays I wish I used cheese to coax them out, because bacon can be awkward. Here’s an actual sentence from our English script. This sentence is grammatically correct which helps the voice actor to read it in a natural tone. Once this has been recorded the audio file is segmented into individual phonemes, and the location of the phonemes in the syllables, words, common phrases, and then sentences is noted. This allows us to better match sounds when the software runs. We’ll see an example of this in a minute
  • 23. Labeling Pitch Duration Position Diphones The phonemes are then labelled with acoustic parameters or context factors like the fundamental frequency or pitch, time duration, position in the syllable, and neighboring phonemes. Because of all of these different possible interactions, when we’re done we’ll have hundreds of thousands of examples or units of each phoneme in our database.
  • 24. agua salida sola jamón hora Now that we have a database filled with units we can select them to create new audio that was never said by the voice talent. Here’s a group of words - most likely recorded at different times of the day or even days or years apart. We have to use the same voice talent for a single voice. So based on our own testing and customer feedback we’ll record new material and build that into the voice in order to provide more natural synthesis. ! Let’s go through this and create a new word by selecting units from these recordings.
  • 25. a gxuasalida sola jamón hora g xu a Starting off with our old friend agua, we’ll take the last bit of that. You can see that we’re grabbing several phonemes at the same time. The TTS engine is looking for units that will match up best. So if it can find phonemes that were near each other already the audio will probably sound more natural. What we want is for the phonemes from different recordings to join together smoothly. If they don’t there’s a jump that the two sounds will have to make and that’s when you hear the glitches in TTS audio that make it sound robotic. So getting smooth joins is of paramount importance. That’s one of the reasons why we need hundreds of thousands of these phonemes.
  • 26. agua sali dasola jamón hora g xu a d a Continuing along with our example. Now we’ll add in a group of phonemes from the next word.
  • 27. agua salida so lajamón hora g xu a d a l a We’ll continue to do this to build out the new word.
  • 28. agua salida sola xamón hora g xu a d a l a x a In an actual TTS engine this selection will only take milliseconds. This is how the software can be used in a telephone or unified communications system: it operates faster than realtime. The engine will also be performing a number of other calculations that we’ll look at in a minute. To me it’s still amazing that Text To Speech even works at all.
  • 29. agua salida sola jamón ho ra g xu a d a l a x a r a And finishing up gives us
  • 30. g xu a0 d a0 l a1 x a0 r a Guadalajara The Mexican city of Guadalajara ! Now this is just a simple example of creating a new word. In real life a TTS engine is looking at features like phrase boundaries - does the phoneme occur in the beginning, middle, or end of a word. Going even further where in the original recording was the word? All of these attributes influence how that phoneme is said. !
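The Guadalajara walk-through can be sketched as a small search problem. The Python below is a toy illustration under loud assumptions: the `Unit` fields, the pitch values, and the `join_cost` formula are all invented for the example, and a real engine weighs far more features. It uses dynamic programming to pick one unit per target phoneme so that the total cost of the joins is as small as possible, with a free join whenever two units were already neighbors in the same recording.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phone: str
    source: str   # recording (word) the unit was cut from
    index: int    # position of the phone inside that recording
    pitch: float  # invented average pitch in Hz, for illustration only

def units_from(word, phones, pitch):
    return [Unit(p, word, i, pitch) for i, p in enumerate(phones)]

def join_cost(a, b):
    # Units that were adjacent in the same recording join for free.
    if a.source == b.source and b.index == a.index + 1:
        return 0.0
    # Otherwise pay a fixed penalty plus a pitch-mismatch penalty.
    return 1.0 + abs(a.pitch - b.pitch) / 100.0

def select_units(targets, database):
    """Return (total_cost, units) for the cheapest sequence of units."""
    best = {u: (0.0, [u]) for u in database if u.phone == targets[0]}
    for phone in targets[1:]:
        if not best:
            break
        step = {}
        for u in database:
            if u.phone != phone:
                continue
            cost, path = min(
                ((c + join_cost(prev, u), p) for prev, (c, p) in best.items()),
                key=lambda item: item[0],
            )
            step[u] = (cost, path + [u])
        best = step
    if not best:
        raise ValueError("no unit available for some target phoneme")
    return min(best.values(), key=lambda item: item[0])

database = (
    units_from("agua", ["a", "g", "xu", "a"], 120.0)
    + units_from("salida", ["s", "a", "l", "i", "d", "a"], 110.0)
    + units_from("sola", ["s", "o", "l", "a"], 115.0)
    + units_from("jamón", ["x", "a", "m", "o", "n"], 125.0)
    + units_from("hora", ["o", "r", "a"], 105.0)
)
targets = ["g", "xu", "a", "d", "a", "l", "a", "x", "a", "r", "a"]

cost, path = select_units(targets, database)
print([u.source for u in path])
```

With these made-up numbers the search reuses neighboring phonemes wherever it can, which mirrors the slides: three units from agua, then two each from salida, sola, jamón, and hora.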
  • 31. Hay agua en Marte. Beber el agua. The whuuuuuu phonemes from agua in these two sentences are labeled differently. Typically at the end of a sentence the pitch descends: Beber el agua. And I know my Spanish is very very bad, but in el agua the l bleeds into the a. It would be difficult to take that a phoneme and use it in Marte, for example. The perfect unit required at synthesis time may not be available in the database, so a selection must be performed to choose, from amongst the many slightly mis-matched units, the best available sequence of units to concatenate. The more units we have the greater the chance that we’ll find that perfect unit.
  • 32. User Lexicons ! ! We’ve been talking about the research end of speech synthesis but there are production applications that knowing all of this will help with. ! Being familiar with phonemes and phonetic alphabets provides both you and the end users of Text To Speech software with the ability to customize the voice through a user lexicon.
  • 33. ! User Lexicons word = phonemes word = phonemes word = phonemes word = phonemes word = phonemes User lexicons are lookup tables replacing words in the text with user defined pronunciations. These can be specialized acronyms that are specific to a company or peoples names - often very useful when using an English voice to pronounce Spanish or other language names. Lexicons fine tune the audio to make sure that it’s as understandable as possible.
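A user lexicon of the "word = phonemes" kind can be pictured as a plain lookup table. The Python sketch below is only an illustration: the `USER_LEXICON` dictionary and `apply_lexicon` function are hypothetical names, the single entry reuses the agua spelling from the earlier slides, and a real engine's lexicon file format may look quite different.

```python
# Hypothetical user lexicon: word -> phonetic spelling.
USER_LEXICON = {
    "agua": "a1 g xu0 a0",
}

def apply_lexicon(text, lexicon):
    """Pair each word with its custom pronunciation, or None if absent.

    Words without an entry would fall through to the engine's own
    pronunciation rules.
    """
    result = []
    for word in text.split():
        key = word.strip(".,!?").lower()
        result.append((word, lexicon.get(key)))
    return result

print(apply_lexicon("Beber el agua.", USER_LEXICON))
# [('Beber', None), ('el', None), ('agua.', 'a1 g xu0 a0')]
```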
  • 34. Text Normalization I mentioned earlier that in milliseconds the engine performs other calculations as well. One of the more important calculations is called Text Normalization. This actually happens first in the process. So let’s take a look at what that means and why it is a challenge for all Text To Speech engines.
  • 35. 7/10 Let’s say you had a piece of text that said this. What could it mean?
  • 36. 7/10 7 de octubre Here in Colombia it could be todays date October 7th.
  • 37. 7/10 MM/DD/YYYY In the United States the date format is different.
  • 38. 7/10 July 10th So this exact same text would be July 10th to me. It’s a bit absurd. I think we’re one of the only countries to use this format; it probably has something to do with our hatred of the metric system as well. I guess we just like to be difficult. Moving on.
  • 39. 7/10 7 dividido por 10 Or we could look at it another way and it would be a math problem. These are the types of issues that Text Normalization has to solve. Unless the software knows what the sentence means it can’t properly pronounce the words to convey that information clearly. One of the keys to figuring these things out is identifying and analyzing them in the context of the sentence as a whole.
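The three readings of "7/10" can be written down as a short Python sketch. The function names and the `day_first` flag are assumptions for illustration; deciding which reading applies is the hard part and needs the sentence context described above.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def as_date(text, day_first):
    """Read "a/b" as a date; day_first selects DD/MM over MM/DD."""
    first, second = (int(part) for part in text.split("/"))
    day, month = (first, second) if day_first else (second, first)
    if not (1 <= month <= 12 and 1 <= day <= 31):
        raise ValueError(f"{text!r} is not a plausible date")
    return f"{MONTHS[month - 1]} {day}"

def as_fraction(text):
    """Read "a/b" as a division."""
    numerator, denominator = text.split("/")
    return f"{numerator} divided by {denominator}"

print(as_date("7/10", day_first=True))   # October 7 (the reading in Colombia)
print(as_date("7/10", day_first=False))  # July 10 (the US reading)
print(as_fraction("7/10"))               # 7 divided by 10
```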
  • 40. This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am. About a week before my dentist appointment I receive a phone call reminding me to floss so I don’t get yelled at by the guy with sharp stabby things in my mouth. ! If you have a service that provides outbound phone calls this message is exactly the type of automation that Text To Speech is perfect for.
  • 41. This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am. In a high call volume environment there’s no way you could record every name. And on the call you really do want to identify a specific person - in this case Patrick Dexter. You don’t want the person to show up to the appointment with their son or daughter when it’s actually their appointment.
  • 42. This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am. But even with saying the person’s name there’s so much here that a computer can easily make a mistake on. The text to speech software needs to identify this text D R period. Not as durrrr and the end of the sentence. But as the abbreviation for Doctor. Doctor Steel is what a person reading this would think so that’s what the engine has to say.
  • 43. This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am. So getting back to our first example. The text to speech engine should be able to correctly interpret this text as a date. It will look at the sentence and know that, at least in English, on is a preposition that typically comes before a date, so that’s a very good clue, and it’s English so the format is month day year. This gives you an idea of the type of rules that are built into TTS software. Now some of the fun research that’s being done in the field of computational linguistics is how to apply more artificial intelligence to this process rather than strict rule based decision trees.
  • 44. This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am. And we’re not done yet. At the end of our sentence again there’s something ambiguous. We have the time of the appointment. If the TTS engine reads this as 10 colon 30 ammmm it will cause confusion.
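The rules described for this reminder sentence can be approximated with a few regular expressions. This Python sketch is a simplified assumption of how such rules might look, not how any particular engine implements them: it expands "Dr." before a capitalized name, reads a date after "on" as month/day/year, and spells a clock time so the colon and "am" are not read literally.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def expand_date(match):
    # US English order assumed: month/day/year.
    month, day, year = int(match[1]), int(match[2]), match[3]
    return f"on {MONTHS[month - 1]} {day}, {year}"

def expand_time(match):
    hour, minute, half = match[1], match[2], match[3].upper()
    spoken = hour if minute == "00" else f"{hour} {minute}"
    return f"{spoken} {half} M"

def normalize(text):
    # "Dr." followed by a capitalized word is a title, not a sentence end.
    text = re.sub(r"\bDr\.\s+(?=[A-Z])", "Doctor ", text)
    # The preposition "on" is the clue that the digits are a date.
    text = re.sub(r"\bon (\d{1,2})/(\d{1,2})/(\d{4})\b", expand_date, text)
    text = re.sub(r"\b(\d{1,2}):(\d{2})\s*([ap])m\b", expand_time, text,
                  flags=re.IGNORECASE)
    return text

sentence = ("This is a courtesy call to remind Patrick Dexter of an "
            "appointment with Dr. Steel on 10/07/2015 at 10:30 am.")
print(normalize(sentence))
# This is a courtesy call to remind Patrick Dexter of an appointment
# with Doctor Steel on October 7, 2015 at 10 30 A M.
```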
  • 45. Heteronym Another issue is this lovely thing. Does anyone know what a Heteronym is? It’s an evil part of the English language where a single word can mean two different things and is pronounced differently! I don’t believe Spanish has heteronyms but if it does I’d love to find out more.
  • 46. @Cepstral_LLC This is where that Twitter handle becomes relevant.
  • 47. Bass The word bass can be a fish
  • 48. Bass Or it can be pronounced bass and in music mean the deep low end. As in the bass clef versus the treble clef.
  • 49. Object This is a fun word that can be either a verb or a noun
  • 50. Me opongo a ese objeto. In Spanish this sentence may make sense. I’m really hoping. I used Google Translate. But you have two different words for the verb and noun.
  • 51. I object to that object. But in English the sentence is I object to that object. Do you hear the two different pronunciations of the exact same word? object versus OB ject. If you say it with the same pronunciation both times to an English speaker it doesn’t make sense.
  • 52. I object verb to that object. The engine can figure out the part of speech to help determine the pronunciation. Here object is a verb
  • 53. I object to that object noun. and a noun. This functionality is called a part of speech tagger and it’s very helpful in a text to speech engine. !
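As a rough illustration of how a part of speech decision can drive pronunciation, the Python sketch below uses a deliberately tiny rule: a heteronym right after a pronoun is treated as a verb, and one right after a determiner as a noun. The word lists, the respellings "ob-JECT" and "OB-ject", and the noun fallback are all assumptions; real taggers are statistical and look at much more context.

```python
PRONOUNS = {"i", "you", "we", "they"}
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}

# Informal respellings, capital letters marking the stressed syllable.
HETERONYMS = {
    "object": {"verb": "ob-JECT", "noun": "OB-ject"},
}

def tag_heteronyms(sentence):
    """Return (word, part_of_speech, pronunciation) for each heteronym."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    tagged = []
    previous = None
    for word in words:
        if word in HETERONYMS:
            if previous in PRONOUNS:
                pos = "verb"
            elif previous in DETERMINERS:
                pos = "noun"
            else:
                pos = "noun"  # arbitrary fallback for this toy example
            tagged.append((word, pos, HETERONYMS[word][pos]))
        previous = word
    return tagged

print(tag_heteronyms("I object to that object."))
# [('object', 'verb', 'ob-JECT'), ('object', 'noun', 'OB-ject')]
```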
  • 54. $10 per day Currency interpretation is also important and is something we see all of the time in outbound phone call campaigns. Maybe this is a phone call to remind someone of a fine they will have to pay. Or a utility bill. We’d read this as 10 dollars or 10 pesos per day
  • 55. $10.5 million per day We’d read this as 10 point five million dollars per day. Not 10 period 5 dollars million per day. So the engine has to look at the entire sentence, both backwards and forwards, in order to understand not only what the text is but how all of the words interact. Text Normalization occurs first in order to determine what the text as a whole means. Periods and commas are interpreted to add pauses; numbers and dates are converted into formats that are more friendly to the ear. Abbreviations and so much more are all figured out so that the human-computer interaction can occur.
  • 56. 1 To put all those pieces together: we have our text to speech software here. 1. The text is sent to it.
  • 57. 2 2. Text Normalization occurs, trying to figure out all of those dates and currency, the parts of speech and heteronym information. And user lexicons are checked to see if there are custom pronunciations.
  • 58. 3 3. The best possible units are selected from hundreds of thousands of examples based on all of those acoustic parameters like pitch, duration and position relative to other phonemes.
  • 59. 4 4. The units are all joined together to generate a waveform.
  • 60. 5 5. And the audio is output to the user. Magic! In the world of telephones and Unified Communications, Text To Speech isn’t just magic though. It’s an incredibly powerful tool giving you the ability to deliver information to your callers. Let’s take a look at an example of this.
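The currency reordering can be sketched with one regular expression that looks ahead for a scale word, so that "dollars" lands after "million" instead of right after the number. This Python example is a simplified assumption: it only knows the dollar sign and three scale words, leaves the digits as digits, and reads decimal digits one at a time.

```python
import re

CURRENCY = re.compile(r"\$(\d+)(?:\.(\d+))?(?:\s+(thousand|million|billion))?")

def expand_currency(match):
    whole, fraction, scale = match.groups()
    parts = [whole]
    if fraction:
        # Decimal digits are read individually: "5" -> "point 5".
        parts += ["point", " ".join(fraction)]
    if scale:
        parts.append(scale)
    parts.append("dollars")  # the unit moves to the end of the amount
    return " ".join(parts)

def normalize_currency(text):
    return CURRENCY.sub(expand_currency, text)

print(normalize_currency("$10 per day"))           # 10 dollars per day
print(normalize_currency("$10.5 million per day")) # 10 point 5 million dollars per day
```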
  • 61. Since Elastix is built on top of Asterisk we can use the existing tools like ODBC database connections for grabbing variable information. !
  • 62. And the open source module app_swift for linking Cepstral into Asterisk
  • 63. app_swift we'll be looking at app_swift which is specific to Cepstral text to speech.
  • 64. MRCP But there's also a protocol called MRCP if you want to use other TTS engines, and MRCP is also used to add in speech recognition. It’s great for larger installs as well, where you may have multiple Elastix servers that all need to share TTS resources. Again, use Twitter or see me after the talk with any questions on this.
  • 65. exten=>n,Swift("Hello! Thank you for calling Cepstral."|4000|3) exten=>n,Set(CALL_TRANSFER=${FILTER(0-9,${SWIFT_DTMF})}) Here’s an app_swift example - swift is the name of Cepstral’s TTS engine - so that adds the swift command into your dialplan. It says a simple greeting and then uses Asterisk functionality to listen for a DTMF tone. We’ve had customers with 100s of menu options in their IVR that have the entire thing read in realtime by TTS voices, allowing them to make menu updates and changes on the fly. Want to add a new IVR option? Simply reload extensions.conf and the menu changes. No recording of prompts, no uploading wav files. Very easy to maintain.
  • 66. exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})}) exten => 123,9,Swift(${BALANCE}) Like I said before, with ODBC you can also create very complex database driven systems right in the dialplan. Or you can use AGI to do this. Here’s a very simple example of querying a database for an account balance.
  • 67. exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})}) exten => 123,9,Swift(${BALANCE}) And then using the swift command right in the dialplan to read that back to the caller. And because you have the $ symbol in there Cepstral will perform that text normalization we discussed to read this off as a person would say it. You can really expand what’s possible to automate inbound or outbound calls. Do you have a technical support line? Have the caller identify themselves with a ticket number and read back the last note that a customer service rep left for them. Or a status update on a known system outage.
  • 68. https://vimeo.com/84233208 ! There's a fantastic video available from Elastix training that goes into detail on how to install and configure TTS and Elastix and how to set up an AGI script that makes use of the TTS software like this. The video shows you a demo app for employees to find out more information about outstanding loans. ! I'll tweet a link to the video. And I do recommend that you watch it. It’s in Spanish and you can pause and rewatch it over and over. And now that you know how the text to speech software works it will make more sense if you’re adding TTS to your Elastix systems !
  • 69. • Text To Speech automates the delivery of information. • Grow IVR usage without adding call center employees To end - Text To Speech is a powerful tool that can read any information that you have in your systems. Not only is it useful for traditional IVR systems. But if you have a call center agent or customer service rep reading information to a caller then Text To Speech can automate that. Allowing you to grow call volumes without adding agents. It also frees agents up to handle more difficult tasks that can’t be automated.
  • 70. ¡Gracias! Thank you very much for the opportunity to speak to all of you today.