Dynamic calls with Text To Speech

Fundamentals of Text To Speech in UC

Patrick Dexter
Thank you. My name is Patrick Dexter, with a company called Cepstral, and today I’ll be talking about Text To Speech voices. We’ll discuss how a TTS voice is made, the component parts of text to speech software, and how that fits into Unified Communications software like Elastix.

Cepstral
Text To Speech innovator

Founded in 2001

Focus on North and South America

Elastix Partner since 2011

To give you some background information, Cepstral is a commercial company spun out of Carnegie Mellon University in Pittsburgh, Pennsylvania. We have customers all around the world, from announcements at train stations in New Zealand and Australia to thousands of concurrent ports in large call centers in Canada. Our main customer base is in North and South America, and we’ve been a proud partner of Elastix since 2011.

@Cepstral_LLC

Our marketing department wouldn’t let me do this presentation without giving you our Twitter handle. But this is also useful: if you have any questions about this presentation or TTS in general, tweet them to me and I’ll respond.

What is Text To Speech?
So what is Text To Speech? Text To Speech is the ability to create audio that was never recorded before. There are far too many words to record them all, and new ones are being created every day. We see this all the time in telephone systems. You need to tell a caller the amount of money they have in an account, or that their package will be delivered to a specific address. Information is constantly changing, so you need a way to get it to your callers.

Fun History of TTS
Before we dive into more details about Text To Speech, I want to show you one of the earliest speech synthesis devices.

The machine pictured on this slide is a replica of the first speech synthesizer, originally developed by Wolfgang von Kempelen in the late 1700s. Interestingly, this replica from the 1840s was viewed and studied by Alexander Graham Bell, who created his own version and used many of the ideas when he invented the telephone! So speech synthesis and the telephone have been connected since the very beginning.

Text To Speech

Technologies
So there are several different competing technologies that are used to create Text To Speech voices.

• Formant

• Diphone

• Statistical Parametric
If you’re familiar with Text To Speech you’ve heard of some of these.

Formant synthesis creates mathematical models of the tissue in the mouth and lungs. It has a very small footprint but requires a great deal of computing power to operate. To me it sounds like an opera singer doing scales of aaaaahhhhhhs. Formant synthesizers are good at doing vowel sounds in a range of pitches.

Diphone voices are quite robotic. This is the Stephen Hawking voice: it’s easily understood but doesn’t sound like a person.

Statistical parametric voices are sometimes called HMM voices, for their use of Hidden Markov Models to create a model of speech based on a corpus and then use that model to generate the new audio.

Unit Selection Synthesis
But the primary technology being used in commercial Text To Speech systems today is Unit Selection synthesis. You’re already familiar with this if your mobile phone talks back to you. It’s what Siri and Google Now use.

Unit Selection voices provide the most human-like experience today, and that’s because they are made with the recordings of actual people. These recordings are identified and labeled to create a database of sounds.

In this sense Unit Selection is similar to a ransom note.

We’ve all seen these in movies and TV shows. You cut up letters and rearrange them to form new words. At the most basic level this is what Unit Selection Synthesis is all about. In English there are 26 letters, in Spanish 29, so all we have to do is record about 30 things and we have a Unit Selection Text To Speech voice, right? Well, no. Unfortunately it’s not that easy.

Unit Selection Synthesis
You can’t just record the alphabet, because human speech is not made up of letters. We use letters to write down speech, but when spoken, speech is made up of sounds which we call phonemes. How these phonemes are pronounced can vary quite a lot depending on what you’re saying and, importantly, where in the sentence that sound occurs. So much so that there are specialized alphabets specifically for phonemes.

agua
So let’s take a look at phonemes. I tried to modify this presentation with Spanish examples when possible.

Agua is a word that, even with my limited Spanish skills, I can pronounce.

a1 g xu0 a0
And here is the phonetic spelling of that word. If you’re curious, this is based on the Carnegie Mellon University phonetic set. There are several other phonetic alphabets (IPA and SAMPA are popular), but we use a variation of the CMU alphabet for our voices.

a1 g xu0 a0
OK, so looking at this first phoneme, we are depicting the ah sound with a 1, which denotes that the vowel is stressed.

a1 g xu0 a0
The guh sound is represented by the g.

a1 g xu0 a0
And here’s one that should be new: xu with a 0. This is where phonetic alphabets start to differ from a regular alphabet. If we just used the u to describe this sound it wouldn’t work.

aquí
Here’s a very similar word. It has a u in it but it’s pronounced completely differently: aquí versus agua. Do you hear that whuu sound?

a1 g xu0 a0
That whuuu sound is identified by this X U phone, and the 0 marks it as being unstressed.

a1 g xu0 a0
The a 0 now completes the word to give us the whhhuuaaa sound.
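
As a small illustration of the notation just described, the phonetic spelling can be split into symbols and stress marks. This is only a sketch: the function name `parse_phonemes` is made up for this example, and it assumes nothing more than what the slide shows (a trailing 1 means stressed, a trailing 0 means unstressed, consonants carry no digit).

```python
def parse_phonemes(spelling):
    """Split a spelling such as 'a1 g xu0 a0' into (symbol, stress) pairs.

    Stress is 1 (stressed), 0 (unstressed), or None for symbols
    written without a digit, such as consonants.
    """
    parsed = []
    for token in spelling.split():
        if token[-1].isdigit():
            parsed.append((token[:-1], int(token[-1])))
        else:
            parsed.append((token, None))
    return parsed

agua = parse_phonemes("a1 g xu0 a0")
print(agua)
```

The stressed vowel is the first pair in the result, which matches the walk-through of agua above.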

Create a new unit selection TTS voice
I think the really cool thing about Unit Selection voices is that we need to select a specific person to record a new voice, and that their voice will live on, theoretically forever.

So how do we grab all of those phonemes? We lock that lucky voice actor in a sound booth and force them to record hours and hours worth of carefully
worded scripts. These scripts are designed to capture as many of the phoneme interactions as possible.

Nowadays I wish I used cheese to coax them out, because bacon can be awkward.
Here’s an actual sentence from our English script. This sentence is grammatically correct, which helps the voice actor read it in a natural tone. Once this has been recorded, the audio file is segmented into individual phonemes, and the location of the phonemes in the syllables, words, common phrases, and then sentences is noted. This allows us to better match sounds when the software runs. We’ll see an example of this in a minute.

Labeling
Pitch 
Duration

Position

Diphones
The phonemes are then labelled with acoustic parameters or context factors like the fundamental frequency or pitch, time duration, position in the syllable, and neighboring phonemes. Because of all of these different possible interactions, when we’re done we’ll have hundreds of thousands of examples, or units, of each phoneme in our database.

agua salida sola jamón hora
Now that we have a database filled with units, we can select them to create new audio that was never said by the voice talent. Here’s a group of words, most likely recorded at different times of the day or even days or years apart. We have to use the same voice talent for a single voice. So, based on our own testing and customer feedback, we’ll record new material and build that into the voice in order to provide more natural synthesis.

Let’s go through this and create a new word by selecting units from these recordings.

agua salida sola jamón hora
g xu a
Starting off with our old friend agua, we’ll take the last bit of that. You can see that we’re grabbing several phonemes at the same time. The TTS engine is looking for units that will match up best, so if it can find phonemes that were already next to each other in a recording, the audio will probably sound more natural.
What we want is for the phonemes from different recordings to join together smoothly. If they don’t, there’s a jump that the two sounds will have to make, and that’s when you hear the glitches in TTS audio that make it sound robotic. So getting smooth joins is of paramount importance. That’s one of the reasons why we need hundreds of thousands of these phonemes.

agua salida sola jamón hora
g xu a d a
Continuing along with our example, now we’ll add in a group of phonemes from the next word.

agua salida sola jamón hora
g xu a d a l a
We’ll continue to do this to build out the new word.

agua salida sola jamón hora
g xu a d a l a x a
In an actual TTS engine this selection will only take milliseconds. This is how the software can be used in a telephone or unified communications system: it operates faster than realtime. The engine will also be performing a number of other calculations that we’ll look at in a minute. To me it’s still amazing that Text To Speech even works at all.

agua salida sola jamón hora
g xu a d a l a x a r a
And finishing up gives us

g xu a0 d a0 l a1 x a0 r a
Guadalajara
The Mexican city of Guadalajara.

Now, this is just a simple example of creating a new word. In real life a TTS engine is looking at features like phrase boundaries: does the phoneme occur at the beginning, middle, or end of a word? Going even further, where in the original recording was the word? All of these attributes influence how that phoneme is said.

Hay agua en Marte.
Beber el agua.
The whuuuuuu phonemes from agua in these two sentences are labeled differently. Typically at the end of a sentence the pitch descends: Beber el agua. And I know my Spanish is very, very bad, but in el agua the l bleeds into the a. It would be difficult to take that a phoneme and use it in Marte, for example.

The perfect unit required at synthesis time may not be available in the database, so a selection must be performed to choose, from amongst the many slightly mismatched units, the best available sequence of units to concatenate. The more units we have, the greater the chance that we’ll find that perfect unit.
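
The idea of choosing the best available sequence can be sketched as a small search over candidate units. This is a toy illustration, not how any particular engine is implemented: the `Unit` fields, the cost formula, and the three-phoneme database are all assumptions made up for this example, and only a join cost is modelled (a real engine would also score how well each unit fits the requested pitch, duration, and position).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phoneme: str     # phoneme label, e.g. "a"
    pitch_hz: float  # simplified: one pitch value per unit
    source: str      # recording the unit was cut from
    index: int       # position of the unit inside that recording

def join_cost(prev, cur):
    # Units that were adjacent in the same recording join for free.
    if prev.source == cur.source and cur.index == prev.index + 1:
        return 0.0
    # Otherwise penalize the pitch jump at the join (hypothetical weighting).
    return 1.0 + abs(prev.pitch_hz - cur.pitch_hz) / 10.0

def select_units(targets, database):
    """Dynamic-programming search for the cheapest sequence of units."""
    # best maps each candidate for the current phoneme to (total cost, path).
    best = {u: (0.0, [u]) for u in database[targets[0]]}
    for phoneme in targets[1:]:
        nxt = {}
        for cur in database[phoneme]:
            cost, path = min(
                ((c + join_cost(p, cur), pth) for p, (c, pth) in best.items()),
                key=lambda item: item[0],
            )
            nxt[cur] = (cost, path + [cur])
        best = nxt
    return min(best.values(), key=lambda item: item[0])[1]

database = {
    "g": [Unit("g", 120.0, "agua", 1)],
    "xu": [Unit("xu", 118.0, "agua", 2)],
    "a": [Unit("a", 115.0, "agua", 3), Unit("a", 180.0, "hora", 2)],
}
chosen = select_units(["g", "xu", "a"], database)
print([f"{u.phoneme}@{u.source}" for u in chosen])
```

With two candidates for the final a, the search keeps the one that already followed xu in the agua recording, because that join costs nothing.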

User Lexicons
We’ve been talking about the research end of speech synthesis, but there are production applications where knowing all of this will help.

Being familiar with phonemes and phonetic alphabets gives both you and the end users of Text To Speech software the ability to customize the voice through a user lexicon.

User Lexicons
word = phonemes
word = phonemes
word = phonemes
word = phonemes
word = phonemes
User lexicons are lookup tables that replace words in the text with user-defined pronunciations. These can be specialized acronyms that are specific to a company, or people’s names, which is often very useful when using an English voice to pronounce Spanish or other foreign-language names. Lexicons fine-tune the audio to make sure that it’s as understandable as possible.
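
A lookup table like this can be sketched in a few lines. The two phoneme strings below are the ones shown on the earlier slides; everything else (the dictionary layout, the function name `apply_lexicon`, and the rule that unknown words fall back to the engine default) is an assumption for illustration and not the file format of any real product.

```python
import re

# Hypothetical in-memory lexicon: word -> phoneme string.
USER_LEXICON = {
    "agua": "a1 g xu0 a0",
    "guadalajara": "g xu a0 d a0 l a1 x a0 r a",
}

def apply_lexicon(text, lexicon):
    """Return (word, pronunciation) pairs; None means 'use the engine default'."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, accents included
    return [(word, lexicon.get(word.lower())) for word in words]

result = apply_lexicon("Hay agua en Guadalajara", USER_LEXICON)
print(result)
```

Words found in the lexicon come back with their custom pronunciation, and the rest are left for the engine to work out on its own.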

Text Normalization
I mentioned earlier that in milliseconds the engine performs other calculations as well. One of the more important ones is called Text Normalization. This actually happens first in the process. So let’s take a look at what that means and why it is a challenge for all Text To Speech engines.

7/10
Let’s say you had a piece of text that said this. What could it mean?

7/10
7 de octubre
Here in Colombia it could be today’s date, October 7th.

7/10
MM/DD/YYYY
In the United States the date format is different.

7/10
July 10th
So this exact same text would be July 10th to me. It’s a bit absurd. I think we’re one of the only countries to use this format, and it probably has something to do with our hatred of the metric system as well. I guess we just like to be difficult. Moving on.

7/10
7 dividido por 10
Or we could look at it another way and it would be a math problem. These are the types of issues that Text Normalization has to solve. Unless the software knows what the sentence means, it can’t properly pronounce the words to convey that information clearly. One of the keys to figuring these things out is identifying and analyzing them in the context of the sentence as a whole.
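
The three readings of 7/10 can be sketched as a toy rule. This is an illustration only: the function name `read_slash_pair`, the locale codes, and the idea of treating a preceding "on" or "el" as the hint that the token is a date are assumptions for this example, and a real engine uses far richer context.

```python
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def read_slash_pair(token, locale, previous_word=""):
    """Guess how to read a token such as '7/10' (toy rules only)."""
    first, second = (int(part) for part in token.split("/"))
    # A preposition or article right before the token suggests a date.
    looks_like_date = previous_word.lower() in {"on", "el"}
    if not looks_like_date:
        return f"{first} divided by {second}"
    if locale == "en_US":   # month/day order
        month, day = first, second
    else:                   # day/month order, e.g. es_CO
        day, month = first, second
    return f"{MONTHS[month - 1]} {day}"

us_reading = read_slash_pair("7/10", "en_US", "on")
co_reading = read_slash_pair("7/10", "es_CO", "el")
math_reading = read_slash_pair("7/10", "en_US")
print(us_reading, "|", co_reading, "|", math_reading)
```

The same four characters come out as July 10, October 7, or 7 divided by 10, depending only on the locale and the word in front of them.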

This is a courtesy call to remind Patrick Dexter of an appointment with Dr. Steel on 10/07/2015 at 10:30 am.
About a week before my dentist appointment I receive a phone call reminding me to floss so I don’t get yelled at by the guy with sharp stabby things in my mouth.

If you have a service that provides outbound phone calls, this message is exactly the type of automation that Text To Speech is perfect for.

Dr. Steel on 10/07/2015
at 10:30 am.
In a high call volume environment there’s no way you could record every name. And on the call you really do want to identify a specific person, in this case Patrick Dexter. You don’t want the person to show up to the appointment with their son or daughter when it’s actually their own appointment.

Dr. Steel on 10/07/2015
at 10:30 am.
But even after saying the person’s name, there’s so much here that a computer can easily get wrong. The text to speech software needs to read this text, D R period, not as durrrr followed by the end of a sentence, but as the abbreviation for Doctor. Doctor Steel is what a person reading this would think, so that’s what the engine has to say.

Dr. Steel on 10/07/2015
at 10:30 am.
So, getting back to our first example, the text to speech engine should be able to correctly interpret this text as a date. It will look at the sentence and know that, at least in English, on is a preposition that typically comes before a date, so that’s a very good clue. And it’s English, so the format is month, day, year. This gives you an idea of the type of rules that are built into TTS software. Some of the fun research being done in the field of computational linguistics is how to apply more artificial intelligence to this process rather than strict rule-based decision trees.

Dr. Steel on 10/07/2015
at 10:30 am.
And we’re not done yet. At the end of our sentence there’s again something ambiguous: the time of the appointment. If the TTS engine reads this as 10 colon 30 ammmm it will cause confusion.
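
Two of the rules from this appointment reminder, the abbreviation and the time, can be sketched with simple pattern matching. This is a minimal illustration under stated assumptions: the one-entry abbreviation table, the function names, and the choice to spell the meridiem as separate letters are all invented for this example, and turning digits into words is left out.

```python
import re

ABBREVIATIONS = {"Dr.": "Doctor"}  # tiny illustrative table

def expand_time(match):
    hour, minute, meridiem = match.group(1), match.group(2), match.group(3)
    # "am" becomes "a m" so the letters are spelled out rather than read as a word.
    return f"{int(hour)} {minute} {' '.join(meridiem.lower())}"

def normalize(text):
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return re.sub(r"\b(\d{1,2}):(\d{2})\s*([AaPp][Mm])\b", expand_time, text)

spoken = normalize("an appointment with Dr. Steel at 10:30 am.")
print(spoken)
```

The period after Dr is consumed by the abbreviation rule, so only the final period is left to mark the end of the sentence.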

Heteronym
Another issue is this lovely thing. Does anyone know what a heteronym is?
It’s an evil part of the English language where a single spelling can mean two different things and be pronounced differently! I don’t believe Spanish has heteronyms, but if it does I’d love to find out more.

@Cepstral_LLC

This is where that Twitter handle becomes relevant.

Bass
The word bass can be a fish.

Bass
Or it can be pronounced bass and in music mean the deep low end. As in the bass clef versus the treble clef.

Object
This is a fun word that can be either a verb or a noun

Me opongo a
ese objeto.
In Spanish this sentence may make sense. I’m really hoping; I used Google Translate. But you have two different words for the verb and the noun.

I object to that
object.
But in English the sentence is I object to that object. Do you hear the two different pronunciations of the exact same word? ob JECT versus OB ject. If you say it to an English speaker with the same pronunciation both times, it doesn’t make sense.

I object (verb) to that object.
The engine can figure out the part of speech to help determine the pronunciation. Here object is a verb.

I object to that object (noun).
And here it is a noun. This functionality is called a part of speech tagger and it’s very helpful in a text to speech engine.
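
Here is a minimal sketch of how a part of speech tag can pick the pronunciation. The tagger below is deliberately naive and invented for this example (a word after a pronoun is treated as a verb, a word after a determiner as a noun), and the capital letters are just a readable way to mark stress; real taggers are statistical or rule-based systems with much broader coverage.

```python
# Pronunciations keyed by (word, part of speech); capitals mark the stress.
HETERONYMS = {
    ("object", "VERB"): "ob-JECT",
    ("object", "NOUN"): "OB-ject",
}

def tag_words(words):
    """Toy tagger: after a pronoun -> VERB, after a determiner -> NOUN."""
    tags = []
    for i, _word in enumerate(words):
        prev = words[i - 1].lower() if i else ""
        if prev in {"i", "you", "we", "they"}:
            tags.append("VERB")
        elif prev in {"a", "the", "that", "this"}:
            tags.append("NOUN")
        else:
            tags.append("OTHER")
    return tags

words = "I object to that object".split()
tags = tag_words(words)
spoken = " ".join(HETERONYMS.get((w.lower(), t), w) for w, t in zip(words, tags))
print(spoken)
```

Once each occurrence carries its own tag, the same spelling can be given two different pronunciations in one sentence.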

$10 per day
Currency interpretation is also important and is something we see all of the time in outbound phone call campaigns. Maybe this is a phone call to remind someone of a fine they will have to pay, or a utility bill. We’d read this as 10 dollars or 10 pesos per day.

$10.5 million per day
We’d read this as 10 point five million dollars per day, not 10 period 5 dollars million per day. So the engine has to look at the entire sentence, both backwards and forwards, in order to understand not only what the text is but how all of the words interact.

Text Normalization occurs first in order to determine what the text as a whole means. Periods and commas are interpreted to add pauses, numbers and dates are converted into formats that are more friendly to the ear, and abbreviations and so much more are all figured out so that the human-computer interaction can occur.
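
The currency case shows why the engine has to look ahead: the word dollars has to move to after million. A rough sketch of that reordering follows, with the caveat that the function name `read_currency`, the regular expression, and the short list of scale words are assumptions for this example, and that amounts with cents (such as $10.50) would need an extra rule.

```python
import re

def read_currency(text, unit="dollars"):
    """Rewrite '$10.5 million' as '10 point 5 million dollars' (toy rule)."""
    def spoken(match):
        whole, fraction, scale = match.group(1), match.group(2), match.group(3)
        parts = [whole]
        if fraction:
            parts += ["point", fraction]
        if scale:
            parts.append(scale)
        parts.append(unit)  # the currency word goes last, after any scale word
        return " ".join(parts)

    pattern = r"\$(\d+)(?:\.(\d+))?(?:\s+(million|billion))?"
    return re.sub(pattern, spoken, text)

simple = read_currency("$10 per day")
scaled = read_currency("$10.5 million per day")
pesos = read_currency("$10 per day", unit="pesos")
print(simple, "|", scaled, "|", pesos)
```

Passing a different `unit` gives the pesos reading mentioned earlier without changing the rule itself.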

1
To put all those pieces together, we have our text to speech software here.

1. The text is sent to it.

2
2. Text Normalization occurs, trying to figure out all of those dates and currency amounts, the parts of speech, and the heteronym information. User lexicons are checked to see if there are custom pronunciations.

3
3. The best possible units are selected from hundreds of thousands of examples, based on all of those acoustic parameters like pitch, duration, and position relative to other phonemes.

4
4. The units are all joined together to generate a waveform.

5
5. And the audio is output to the user. Magic!
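
The five steps above can be sketched as a chain of functions, mainly to show how data flows from one stage to the next. Every stage here is a stub invented for illustration: the function names, the single abbreviation rule, the made-up phoneme string for "steel", and the fake waveform of silent bytes are all assumptions, not a real engine.

```python
def normalize(text):
    # Step 2: expand abbreviations, dates, currency (stubbed with one rule).
    return text.replace("Dr.", "Doctor")

def to_phonemes(text, lexicon):
    # Step 2, continued: user lexicon lookup; unknown words pass through.
    return [lexicon.get(word.lower(), word) for word in text.split()]

def select_units(phonemes):
    # Step 3: pick one recorded unit per item (stubbed as a labelled tuple).
    return [("unit", item) for item in phonemes]

def join_units(units):
    # Step 4: concatenate units into a waveform (stubbed as silent bytes).
    return b"\x00" * len(units)

def synthesize(text, lexicon):
    # Steps 1 and 5: text comes in, audio goes out.
    return join_units(select_units(to_phonemes(normalize(text), lexicon)))

audio = synthesize("Dr. Steel", {"steel": "s t i1 l"})
print(len(audio))
```

Keeping each stage as its own function mirrors the slide: normalization and lexicon lookup finish before any unit is selected, and joining happens last.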
In the world of telephones and Unified Communications, Text To Speech isn’t just magic though. It’s an incredibly powerful tool, giving you the ability to deliver information to your callers. Let’s take a look at an example of this.

Since Elastix is built on top of Asterisk, we can use existing tools like ODBC database connections for grabbing variable information, and the open source module app_swift for linking Cepstral into Asterisk.

app_swift
We'll be looking at app_swift, which is specific to Cepstral text to speech.

MRCP
But there's also a protocol called MRCP if you want to use other TTS engines, and MRCP is also used to add in speech recognition. It’s great for larger installs as well, where you may have multiple Elastix servers that all need to share TTS resources. Again, reach me on Twitter or see me after the talk with any questions on this.

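; Speak the greeting, then listen for DTMF input
; (the values after the text are read here as a timeout in ms and a maximum digit count)
exten=>n,Swift("Hello! Thank you for calling Cepstral."|4000|3)
; Keep only the digits 0-9 from whatever the caller pressed
exten=>n,Set(CALL_TRANSFER=${FILTER(0-9,${SWIFT_DTMF})})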
Here’s an app_swift example. Swift is the name of Cepstral’s TTS engine, so the module adds the Swift command to your dialplan. It says a simple greeting and then uses Asterisk functionality to listen for a DTMF tone.

We’ve had customers with hundreds of menu options in their IVR that have the entire thing read in realtime by TTS voices, allowing them to make menu updates and changes on the fly. Want to add a new IVR option? Simply reload extensions.conf and the menu changes. No recording of prompts, no uploading wav files. Very easy to maintain.

exten => 123,8,Set(BALANCE=${ODBC_BALANCE(${ACCOUNT})})
exten => 123,9,Swift(${BALANCE})

Like I said before, with ODBC you can also create very complex database-driven systems right in the dialplan. Or you can use AGI to do this.

Here’s a very simple example of querying a database for an account balance

and then using the Swift command right in the dialplan to read that back to the caller. And because you have the $ symbol in there, Cepstral will perform that text normalization we discussed to read this off as a person would say it.
You can really expand what’s possible to automate inbound or outbound calls. Do you have a technical support line? Have the caller identify themselves
with a Ticket number and read back the last note that a customer service rep left for them. Or a status update on a known system outage.
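
The ticket-number idea could also be handled from an AGI script rather than directly in the dialplan. The sketch below is only an outline under stated assumptions: the ticket table, the function names, and the sample header are invented for this example, and a live script would read the header from standard input and write the command to standard output instead of using an in-memory stream. It relies on two general AGI conventions: Asterisk sends `agi_name: value` lines followed by a blank line, and a script can run a dialplan application with `EXEC`.

```python
import io

def read_agi_environment(stream):
    """Parse the 'agi_name: value' header lines sent when an AGI script starts."""
    env = {}
    for line in stream:
        line = line.strip()
        if not line:  # a blank line ends the header
            break
        key, _, value = line.partition(":")
        env[key.strip()] = value.strip()
    return env

def swift_command(text):
    """Build an AGI EXEC line that hands the text to the Swift application."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'EXEC Swift "{escaped}"'

def lookup_last_note(ticket):
    # Hypothetical stand-in for a real database query.
    notes = {"1234": "A technician will call you back today."}
    return notes.get(ticket, "We could not find that ticket.")

# Simulated header; in production this would come from sys.stdin.
header = io.StringIO("agi_request: ticket.py\nagi_arg_1: 1234\n\n")
env = read_agi_environment(header)
command = swift_command(lookup_last_note(env.get("agi_arg_1", "")))
print(command)
```

Keeping the lookup in its own function means the database query can change without touching the part that talks to Asterisk.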

https://vimeo.com/84233208
There's a fantastic video available from Elastix training that goes into detail on how to install and configure TTS and Elastix and how to set up an AGI script that makes use of the TTS software like this. The video shows you a demo app for employees to find out more information about outstanding loans.

I'll tweet a link to the video, and I do recommend that you watch it. It’s in Spanish and you can pause and rewatch it over and over. And now that you know how the text to speech software works, it will make more sense if you’re adding TTS to your Elastix systems.

• Text To Speech automates the delivery of information.

• Grow IVR usage without adding call center employees

To end: Text To Speech is a powerful tool that can read any information that you have in your systems. Not only is it useful for traditional IVR systems, but if you have a call center agent or customer service rep reading information to a caller, then Text To Speech can automate that, allowing you to grow call volumes without adding agents. It also frees agents up to handle more difficult tasks that can’t be automated.

¡Gracias!
Thank you very much for the opportunity to speak to all of you today.
