This document discusses Amazon Polly, a text-to-speech service that converts text into natural-sounding speech in a range of languages and voices. It describes Polly features such as Speech Synthesis Markup Language (SSML), which controls speech properties, and the quality of the speech Polly produces. It also discusses Amazon Rekognition, a deep learning image recognition service that performs tasks such as facial analysis, facial recognition, and object detection. Finally, it presents an example use case: a smart assistant application, built with Rekognition and other AWS services, that detects faces, understands spoken commands, and responds using text-to-speech.
4. What is Amazon Polly?
• A service that converts text into lifelike speech
• Offers 48 lifelike voices across 24 languages
• Low-latency responses enable developers to build real-time systems
• Developers can store, replay, and distribute generated speech
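As a rough illustration of the "convert, then store and replay" idea above, here is a minimal Python sketch using boto3's `synthesize_speech` call. The helper names `build_polly_request` and `synthesize_to_file`, the default voice `Joanna`, and the MP3 output format are assumptions made for this example and do not come from the slides.

```python
from typing import Any, Dict


def build_polly_request(text: str, voice_id: str = "Joanna",
                        output_format: str = "mp3") -> Dict[str, Any]:
    """Build keyword arguments for Polly's SynthesizeSpeech operation.

    The default voice and format are illustrative assumptions.
    """
    if not text:
        raise ValueError("text must not be empty")
    return {"Text": text, "VoiceId": voice_id, "OutputFormat": output_format}


def synthesize_to_file(text: str, path: str, **kwargs: Any) -> None:
    """Call Polly and save the audio so it can be replayed or distributed."""
    import boto3  # imported lazily; AWS credentials are needed at call time

    polly = boto3.client("polly")
    response = polly.synthesize_speech(**build_polly_request(text, **kwargs))
    # AudioStream is a streaming body; read() returns the encoded audio bytes.
    with open(path, "wb") as audio_file:
        audio_file.write(response["AudioStream"].read())
```

Calling `synthesize_to_file("Hello from Polly", "hello.mp3")` would write a replayable file, assuming credentials and a region are configured.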
5. Amazon Polly: Quality
Natural-sounding speech
A subjective measure of how close TTS output is to human speech.
Accurate text processing
Ability of the system to interpret common text formats such as abbreviations, numerical sequences, homographs, etc.
Today in Sydney Australia, it's 26°C
It’s nice to know, we’re going to Nice
Highly intelligible
A measure of how comprehensible speech is.
Peter Piper picked a peck of pickled peppers
6. Amazon Polly: SSML
Speech Synthesis Markup Language (SSML) is a W3C recommendation: an XML-based markup language for speech synthesis applications
<speak>
My name is Alastair. It is spelled
<prosody rate='x-slow'>
<say-as interpret-as="characters">Alastair</say-as>
</prosody>
</speak>
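When SSML like the example above is assembled from user-supplied text, the text needs XML escaping so the markup stays well-formed. The following Python sketch shows one way to do that; the function name `spell_out_ssml` and its parameters are hypothetical, chosen for this example only.

```python
from xml.sax.saxutils import escape, quoteattr


def spell_out_ssml(intro: str, word: str, rate: str = "x-slow") -> str:
    """Return SSML that reads `intro`, then spells `word` letter by letter.

    escape() and quoteattr() guard against characters such as & and <
    that would otherwise break the XML.
    """
    return (
        "<speak>"
        f"{escape(intro)} "
        f"<prosody rate={quoteattr(rate)}>"
        f'<say-as interpret-as="characters">{escape(word)}</say-as>'
        "</prosody>"
        "</speak>"
    )
```

The resulting string can be passed to Polly's `synthesize_speech` with `TextType="ssml"`.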
7. Amazon Polly: SSML
<speak>
This is normal voice,
<amazon:effect name="whispered">
and this is me whispering!
</amazon:effect>
</speak>
8. Polly Voice Synthesis Architecture
Components: Mobile App, IoT Device, Amazon API Gateway, Lambda function, Amazon Polly, Amazon S3
• Calling through API Gateway allows us to implement caching and to use throttling and API keys via Usage Plans
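The Lambda function behind API Gateway in this architecture might look roughly like the Python sketch below. It assumes a Lambda proxy integration, a `text` query parameter, an optional `voice` parameter, and MP3 output; none of these details appear on the slide. The `synthesize` argument is an injection point added here so the handler can be exercised without AWS access.

```python
import base64
import json
from typing import Any, Callable, Dict


def polly_synthesize(text: str, voice_id: str) -> bytes:
    """Default synthesizer: calls Amazon Polly and returns MP3 bytes."""
    import boto3  # imported lazily; AWS credentials are needed at call time

    response = boto3.client("polly").synthesize_speech(
        Text=text, VoiceId=voice_id, OutputFormat="mp3")
    return response["AudioStream"].read()


def handler(event: Dict[str, Any], context: Any,
            synthesize: Callable[[str, str], bytes] = polly_synthesize
            ) -> Dict[str, Any]:
    """Lambda proxy handler: ?text=...&voice=... -> base64-encoded audio."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    text = params.get("text")
    if not text:
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error": "missing 'text' query parameter"}),
        }
    audio = synthesize(text, params.get("voice", "Joanna"))
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "audio/mpeg"},
        "isBase64Encoded": True,
        "body": base64.b64encode(audio).decode("ascii"),
    }
```

Because the response depends only on the query parameters, it is a natural fit for the API Gateway caching mentioned on the slide.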
10. Amazon Rekognition
Deep learning-based image recognition service
Search, verify, and organise millions of images
• Object and Scene Detection
• Facial Analysis
• Face Comparison
• Facial Recognition
19. Scaling to Many Faces
Components: User's Face Image, Amazon S3, Lambda function, Amazon Rekognition, Amazon SNS, Lambda function, Amazon Elasticsearch
Diagram labels:
• Fan-out of Lambda functions via SNS: 1 notification per face detected
• Metadata from DetectFaces() + S3 object reference to face image
• Metadata + location + timestamp
• searchFacesByImage()
• indexFaces()
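The "one notification per face" fan-out can be sketched in Python as below. The message layout (`faceIndex`, `boundingBox`, `confidence`, `s3Object`) and the function names are hypothetical; only the `FaceDetails` list, with its `BoundingBox` and `Confidence` fields, follows the shape of a Rekognition DetectFaces response. The sketch references the source image in S3, which is an assumption about what the slide means by the S3 object reference.

```python
import json
from typing import Any, Dict, List


def face_notifications(detect_faces_response: Dict[str, Any],
                       bucket: str, key: str) -> List[str]:
    """Build one JSON message per detected face.

    Each message carries that face's metadata plus a reference to the
    image in S3, so a downstream function can process faces in parallel.
    """
    messages = []
    faces = detect_faces_response.get("FaceDetails", [])
    for index, face in enumerate(faces):
        messages.append(json.dumps({
            "faceIndex": index,
            "boundingBox": face.get("BoundingBox"),
            "confidence": face.get("Confidence"),
            "s3Object": {"bucket": bucket, "key": key},
        }, sort_keys=True))
    return messages


def publish_all(topic_arn: str, messages: List[str]) -> None:
    """Publish each message to SNS; each one can trigger its own Lambda."""
    import boto3  # imported lazily; AWS credentials are needed at call time

    sns = boto3.client("sns")
    for message in messages:
        sns.publish(TopicArn=topic_arn, Message=message)
```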
21. Smart Assistant - Key Features
• Face detection with OpenCV (http://opencv.org)
• Hot word detection to get the device's attention via Snowboy (https://snowboy.kitt.ai/)
• Silence detection during live speech capture for start/stop using SoX (http://sox.sourceforge.net/)
• Real-time streaming of captured audio via AWS IoT to reduce latency
• NLU provided by Amazon Lex
22. Streaming audio via AWS IoT
Components: AWS IoT, Amazon DynamoDB, Amazon API Gateway, ProcessLexDialog, Amazon Lex
Diagram labels:
• Audio streamed in segments in real time using SoX and a stdout pipe
• Segments keyed and written to DynamoDB as base64 chunks of audio
• On silence detected: ProcessLexDialog
• Lex intent result payload
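The segment handling described here (keyed base64 chunks that are later joined for Lex) can be illustrated with a small Python sketch. The item attribute names `sessionId`, `sequence`, and `audio`, and the 4096-byte segment size, are assumptions for this example; the slides do not give the actual table schema.

```python
import base64
from typing import Any, Dict, Iterable, List


def make_segment_items(session_id: str, audio: bytes,
                       segment_size: int = 4096) -> List[Dict[str, Any]]:
    """Split raw audio into keyed, base64-encoded segment records."""
    items = []
    for sequence, offset in enumerate(range(0, len(audio), segment_size)):
        chunk = audio[offset:offset + segment_size]
        items.append({
            "sessionId": session_id,   # groups segments of one utterance
            "sequence": sequence,      # preserves playback order
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    return items


def reassemble(items: Iterable[Dict[str, Any]]) -> bytes:
    """Rebuild the original audio, tolerating out-of-order arrival."""
    ordered = sorted(items, key=lambda item: item["sequence"])
    return b"".join(base64.b64decode(item["audio"]) for item in ordered)
```

Keying each segment by session and sequence number means the segments can arrive or be read in any order and still be joined correctly once silence is detected.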
23. Smart Assistant
Flowchart steps (as listed on the slide):
• START
• Wait for Hot Word (Snowboy)
• Wait for Face to appear in camera view
• Listen for audio command
24. Smart Assistant
Flowchart steps for "Wait for Face to appear in camera view" (as listed on the slide):
• Capture image from webcam (fswebcam)
• Recognise Face (Amazon Rekognition)
• Resize to improve process efficiency (ImageMagick)
• Detect face on device (OpenCV)
• Known User State
• Replay Audio
• Decision: Is the face in the collection? (YES / NO)
• Run User Speech Dialogue Interaction and NLU
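The "is the face in the collection?" decision can be sketched in Python as follows. The 90% similarity threshold, the function names, and the preference for `ExternalImageId` over `FaceId` are assumptions made for illustration; the `FaceMatches`, `Similarity`, and `Face` fields follow the shape of a Rekognition SearchFacesByImage response.

```python
from typing import Any, Dict, Optional


def best_match(search_response: Dict[str, Any],
               min_similarity: float = 90.0) -> Optional[str]:
    """Return an identifier for the best-matching known face, or None.

    None corresponds to the NO branch: the face is not in the collection.
    """
    matches = search_response.get("FaceMatches", [])
    strong = [m for m in matches
              if m.get("Similarity", 0.0) >= min_similarity]
    if not strong:
        return None
    top = max(strong, key=lambda m: m["Similarity"])
    face = top["Face"]
    # Prefer the caller-supplied label if one was set when indexing.
    return face.get("ExternalImageId") or face["FaceId"]


def recognise(collection_id: str, image_bytes: bytes) -> Optional[str]:
    """Search a Rekognition collection for the face in `image_bytes`."""
    import boto3  # imported lazily; AWS credentials are needed at call time

    rekognition = boto3.client("rekognition")
    response = rekognition.search_faces_by_image(
        CollectionId=collection_id,
        Image={"Bytes": image_bytes},
        MaxFaces=1,
    )
    return best_match(response)
```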
25. Smart Assistant
Flowchart steps for "Run User Speech Dialogue Interaction and NLU" (as listed on the slide):
• Listen for speech input with silence detection (SoX)
• On audio segment recorded: MQTT (AWS IoT)
• On silence: submit to API Gateway based on key
• Decision: Is the interaction ready for fulfillment? (YES / NO)
• Process intent (API Gateway/Lambda)
• Play audio response and loop back to listen for speech input
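One possible reading of this flowchart is a loop that keeps listening and replying until Lex reports the interaction is ready for fulfillment. The Python sketch below assumes that reading: the three callables (`listen`, `post_to_lex`, `play`) and the `max_turns` limit are hypothetical stand-ins for the SoX capture, the API Gateway call, and audio playback, while `dialogState`, `message`, and the value `ReadyForFulfillment` follow the Amazon Lex response format.

```python
from typing import Any, Callable, Dict, Optional


def run_dialogue(listen: Callable[[], bytes],
                 post_to_lex: Callable[[bytes], Dict[str, Any]],
                 play: Callable[[str], None],
                 max_turns: int = 10) -> Optional[Dict[str, Any]]:
    """Loop: listen, send audio to Lex, then reply or hand off the intent.

    Returns the Lex result once the intent is ready for fulfillment,
    or None if that does not happen within `max_turns` turns.
    """
    for _ in range(max_turns):
        audio = listen()               # capture until silence is detected
        result = post_to_lex(audio)    # NLU result for this utterance
        if result.get("dialogState") == "ReadyForFulfillment":
            return result              # YES branch: process the intent
        # NO branch: speak Lex's prompt and listen again.
        play(result.get("message", ""))
    return None
```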
26. Wrap Up
• Amazon AI Services are simple
• Developers can add AI to real world applications quickly
• AI opens up new mediums and interfaces
• Deploy at scale and at low cost
• For more information: https://aws.amazon.com/amazon-ai/