BUILDING A DESKTOP ASSISTANT THAT USES VOICE COMMANDS
AND COMPLETES SPECIFIED TASKS
SUBMITTED TO
KIIT Deemed to be University
In Partial Fulfillment of the Requirement for the Award of
BACHELOR’S DEGREE IN
ELECTRONICS AND COMPUTER SCIENCE ENGINEERING
BY
PRASUN CHAKRABORTY ROLL-1730041
UNDER THE GUIDANCE OF
PROF. CHANDANI KUMARI
SCHOOL OF ELECTRONICS ENGINEERING
KALINGA INSTITUTE OF INDUSTRIAL TECHNOLOGY
BHUBANESWAR, ODISHA - 751024
KIIT Deemed to be University
School of Electronics Engineering
BHUBANESWAR, ODISHA - 751024
CERTIFICATE
This is to certify that the report entitled
“BUILDING A DESKTOP ASSISTANT THAT USES VOICE COMMANDS AND COMPLETES SPECIFIED TASKS”
BY
PRASUN CHAKRABORTY ROLL-1730041
is a record of bona fide work carried out by him, in partial fulfillment of the
requirement for the award of the degree of Bachelor of Engineering in Electronics and
Computer Science Engineering at KIIT Deemed to be University, Bhubaneswar. This
work was done during the year 2020, under my guidance.
Date: / /
(Prof. Guide Name)
CHANDANI KUMARI
ACKNOWLEDGEMENTS
The success and final outcome of this project required a great deal of guidance and
assistance from Prof. CHANDANI KUMARI, and I am extremely privileged to have received
it throughout the completion of my project. All that I have done is due to her
supervision and assistance, and I would not forget to thank her.
I respect and thank Prof. CHANDANI KUMARI for giving me the opportunity to do the
project work from home during this pandemic, and for all the support and guidance
that enabled me to complete the project. I am extremely thankful to her for providing
such support and guidance despite her busy schedule managing academic affairs.
PRASUN CHAKRABORTY
ABSTRACT
A virtual assistant for the desktop, also called a digital assistant, is an application
program that understands natural-language voice commands and completes tasks for the
user. Most digital assistants are operated by the human voice, so they may also be
referred to as voice assistants. To interact with a digital assistant, one must first
say a wake word, which activates the device. Once the wake word has been spoken, the
system is ready to be asked a question. One could then ask “what's the weather like?”
and the system will read the local weather forecast aloud.
As digital assistants become more popular, their capabilities and the tasks they are
able to perform grow as well. Below are a few of the popular activities this desktop
assistant can perform.
Answering basic questions
Searching Google/Wikipedia
Setting an alarm or timer
Getting information about the temperature
Playing a song
Reading and writing text files, and many more
In this project the Python programming language is used to develop the application.
CONTENT
1. Literature Review
1.1 An overview of speech recognition
1.2 History
2. Types of speech recognition
2.1 Isolated speech
2.2 Connected speech
2.3 Continuous speech
2.4 Spontaneous speech
3. Basic speech recognition process
4. Introduction to Python
5. Python libraries used in this project
6. Tools required
7. Use case diagram
8. List of tasks this application performs
9. Uses of speech recognition
10. Applications
10.1 From medical perspective
10.2 From military perspective
10.3 From education perspective
11. Some factors that may disturb functionalities of the application
12. The future of Speech Recognition
13. References
1. Literature Review
1.1 An overview of Speech Recognition
Speech recognition is a technology that enables a computer to capture the words
spoken by a human with the help of a microphone. These words are then recognized by
the speech recognizer, and finally the system acts on the voice input.
The process of speech recognition consists of several steps, which are discussed one
by one in the following sections.
1.2 History
The concept of speech recognition started sometime in the 1940s; in practice, the
first speech recognition program appeared in 1952 at Bell Labs, and recognized
digits in a noise-free environment.
The 1940s and 1950s are considered the foundation period of speech recognition
technology. In this period work was done on the foundational paradigms of speech
recognition, that is, automation and information-theoretic models. The key technologies
developed in this period were filter banks and time normalization methods.
In the 1990s, the key technologies developed were methods for stochastic language
understanding, statistical learning of acoustic and language models, and methods for
implementing large-vocabulary speech understanding systems.
After five decades of research, speech recognition technology has finally entered the
marketplace, benefiting users in a variety of ways. The challenge of designing a
machine that truly functions like an intelligent human remains a major one going forward.
2. Types of Speech Recognition
Speech recognition systems can be divided into a number of classes based on the kinds
of utterances they are able to recognize and the list of words they have. A few of
these classes are described below:
2.1 Isolated Speech
Isolated-word recognition usually requires a pause between utterances; this doesn't mean
that it accepts only a single word, but that it requires one utterance at a time.
2.2 Connected Speech
Connected words, or connected speech, are similar to isolated speech but allow separate
utterances with a minimal pause between them.
2.3 Continuous Speech
Continuous speech allows the user to speak almost naturally. It is also called
computer dictation.
2.4 Spontaneous Speech
At a basic level, this can be thought of as speech that is natural-sounding and not
rehearsed. An ASR system with spontaneous speech ability should be able to handle a
variety of natural speech features such as words being run together, “ums” and
“ahs”, and even slight stutters.
3. Basic Speech Recognition Process
Audio Input : The audio (human voice) is input to the system through the
microphone.
Analog to Digital : The process of converting the analog signal into digital form is
known as digitization. It involves both sampling and quantization.
Acoustic Model : An acoustic model is created by taking audio inputs and their text
transcripts, and using software to create a statistical representation of the sounds
that make up each word.
Language Model : Language modeling is used in many natural language processing
applications such as speech recognition. It tries to capture the properties of a language
and to predict the next word in the speech sequence.
Speech Engine : The job of the speech engine is to convert the input audio into text;
to accomplish this it uses all sorts of data, software algorithms and statistics.
Output : After all the above steps, the output is produced: the system performs
operations according to the voice commands.
[Figure: block diagram of the stages described above: Audio Input, Analog to Digital, Acoustic Model, Language Model, Speech Engine, Output]
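The analog-to-digital step can be illustrated with a small sketch. It is not taken from the project code: the function name, the 8 kHz sampling rate and the 16-bit depth are assumptions chosen for the example. A sine tone stands in for the microphone signal; it is sampled at discrete time instants and each sample is quantized to a signed integer.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate=8000, bits=16):
    """Sample a sine tone and quantize each sample to a signed integer.

    sample_rate and bits are illustrative defaults, not project settings.
    """
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                    # sampling: discrete time instants
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # value in [-1.0, 1.0]
        samples.append(round(amplitude * max_level))      # quantization
    return samples

# 10 ms of a 440 Hz tone gives 80 samples at 8 kHz
pcm = sample_and_quantize(freq_hz=440, duration_s=0.01)
print(len(pcm), min(pcm), max(pcm))
```

A higher sampling rate captures higher frequencies, and more bits per sample reduce the rounding error introduced by quantization.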
4. Introduction to Python
4.1 Python : Python is an interpreted, high-level, general-purpose programming
language. Created by Guido van Rossum and first released in 1991, Python's design
philosophy emphasizes code readability with its notable use of significant whitespace.
It is a high-level programming language which is:
Interpreted: Python is processed at run time by the interpreter.
Interactive: One can use the Python prompt to interact with the interpreter
directly and write programs.
Object-oriented: Python supports the object-oriented style of programming.
Beginner’s language: Python is a great language for beginner-level
programmers and supports the development of a wide range of applications.
In this project different techniques have been used for different functionalities.
These are discussed one by one.
5. Python Libraries used in this project
OS : The os module in Python provides functions for interacting with the
operating system. It is one of Python’s standard utility modules and provides a
portable way of using operating-system-dependent functionality.
Datetime : Python has a module named datetime for working with dates and times.
Random : Sometimes we want the computer to pick a random number in a given
range, pick a random element from a list, pick a random card from a deck, flip a
coin, etc. The random module provides access to functions that support these types
of operations. It is another library of functions that extends the basic features
of Python.
PyOWM : PyOWM is a client Python wrapper library for the OpenWeatherMap web
APIs. It allows quick and easy consumption of OWM data from Python
applications via a simple object model and in a human-friendly fashion.
PyAutoGUI : PyAutoGUI is a cross-platform GUI automation Python module for
human beings. Used to programmatically control the mouse & keyboard.
PyAutoGUI supports Python 2 and 3.
Requests : Requests is an Apache2-licensed HTTP library, written in Python. It is
designed to make HTTP requests easy for humans to write. This means one doesn’t
have to manually add query strings to URLs, or form-encode POST data.
Requests allows one to send HTTP/1.1 requests using Python. With it, one can
add content like headers, form data, multipart files, and parameters via simple
Python data structures. It also gives access to the response data in the
same way.
Twilio.rest : The Twilio Python Helper Library makes it easy to interact with the
Twilio API from a Python application. The library supports applications written in
Python 2.7 and above. Using the Twilio API one can automate sending WhatsApp
messages, making calls and sending verification codes. In this project the Twilio
API has been used to send a help message to a chosen contact in an emergency
situation.
Webbrowser : The webbrowser module provides a high-level interface for
displaying Web-based documents to users. Under most circumstances, simply
calling the open() function from this module will open the URL in the default browser.
One has to import the module and call the open() function:
webbrowser.open("URL", new=2)
If new is 0, the URL is opened in the same browser window if possible. If new is 1, a
new browser window is opened if possible. If new is 2, a new browser page ("tab") is
opened if possible.
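As a minimal sketch of how a spoken query could be turned into a browser action, the example below builds a Google search URL and opens it in a new tab. The helper names google_search_url and open_in_new_tab are hypothetical and are not taken from the project code; only webbrowser.open() and urllib.parse.quote_plus() are standard library calls.

```python
import webbrowser
from urllib.parse import quote_plus

def google_search_url(query):
    """Build a Google search URL for a spoken query (hypothetical helper)."""
    return "https://www.google.com/search?q=" + quote_plus(query)

def open_in_new_tab(url):
    """Open the URL in a new tab of the default browser (new=2)."""
    return webbrowser.open(url, new=2)

url = google_search_url("speech recognition in python")
print(url)
# Call open_in_new_tab(url) to actually launch the browser.
```

Keeping URL construction separate from the call that launches the browser makes the first part easy to check without opening any window.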
Pyttsx : pyttsx is a cross-platform text-to-speech library. The major advantage of
using this library for text-to-speech conversion is that it works offline. However,
pyttsx supports only Python 2.x. Hence pyttsx3 is used instead, which is modified
to work on both Python 2.x and Python 3.x with the same code.
Speech-Recognition : Speech recognition has its roots in research done at Bell Labs
in the early 1950s. Early systems were limited to a single speaker and had limited
vocabularies of about a dozen words. Modern speech recognition systems have
come a long way since their ancient counterparts. They can recognize speech from
multiple speakers and have enormous vocabularies in numerous languages.
The first component of speech recognition is, of course, speech. Speech must be
converted from physical sound to an electrical signal with a microphone, and
then to digital data with an analog-to-digital converter. Once digitized, several
models can be used to transcribe the audio to text.
Most modern speech recognition systems rely on what is known as a Hidden
Markov Model (HMM). This approach works on the assumption that a speech
signal, when viewed on a short enough timescale (say, ten milliseconds), can be
reasonably approximated as a stationary process—that is, a process in which
statistical properties do not change over time.
In a typical HMM, the speech signal is divided into 10-millisecond fragments. The
power spectrum of each fragment, which is essentially a plot of the signal’s
power as a function of frequency, is mapped to a vector of real numbers known
as cepstral coefficients. The dimension of this vector is usually small—sometimes as
low as 10, although more accurate systems may have dimension 32 or more.
The final output of the HMM is a sequence of these vectors.
To decode the speech into text, groups of vectors are matched to one or
more phonemes, the fundamental units of speech. This calculation requires training,
since the sound of a phoneme varies from speaker to speaker, and even varies from
one utterance to another by the same speaker. A special algorithm is then applied
to determine the most likely word (or words) that produce the given sequence of
phonemes.
One can imagine that this whole process may be computationally expensive. In
many modern speech recognition systems, neural networks are used to simplify the
speech signal using techniques for feature transformation and dimensionality
reduction before HMM recognition. Voice activity detectors (VADs) are also used
to reduce an audio signal to only the portions that are likely to contain speech.
This prevents the recognizer from wasting time analyzing unnecessary parts of the
signal.
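Two of the ideas above, cutting the signal into 10-millisecond fragments and discarding the parts unlikely to contain speech, can be sketched in a few lines. This is a deliberately simplified illustration and not how the speech recognition library works internally: real systems compute cepstral features and use trained models, whereas the sketch below uses a plain energy threshold as a stand-in for a voice activity detector. All function names and the threshold value are assumptions.

```python
def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a list of samples into consecutive fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000     # samples per frame
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def speech_frames(samples, sample_rate, threshold):
    """Naive voice activity detection: keep frames whose energy exceeds threshold."""
    return [f for f in split_into_frames(samples, sample_rate)
            if frame_energy(f) > threshold]

# 20 ms of silence followed by 20 ms of a loud alternating signal, at 8 kHz
signal = [0] * 160 + [1000, -1000] * 80
frames = split_into_frames(signal, sample_rate=8000)
voiced = speech_frames(signal, sample_rate=8000, threshold=100.0)
print(len(frames), len(voiced))
```

Only the frames that pass the threshold would be handed to the recognizer, which is the time saving the paragraph above describes.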
Wikipedia : The Internet is the single largest source of information, and it is
therefore important to know how to fetch data from various sources. Wikipedia is
one of the largest and most popular sources of information on the Internet.
It is a multilingual online encyclopedia created and maintained as an open
collaboration project by a community of volunteer editors using a wiki-based
editing system.
Smtplib : Simple Mail Transfer Protocol (SMTP) is a protocol which handles
sending e-mail and routing e-mail between mail servers.
Python provides the smtplib module, which defines an SMTP client session object that
can be used to send mail to any Internet machine with an SMTP or ESMTP
listener daemon.
The SMTP object is created with the following parameters:
Host - This is the host running the SMTP server. One can specify the IP address of
the host or a domain name like facebook.com. This is an optional argument.
Port - If one provides the host argument, then one also needs to specify the port
where the SMTP server is listening. Usually this port is 25.
Local-hostname - If the SMTP server is running on the local machine, then one
can specify just localhost for this option.
An SMTP object has an instance method called sendmail, which is typically used to
do the work of mailing a message. It takes the parameters:
The sender - A string with the address of the sender.
The receivers - A list of strings, one for each recipient.
The message - A message as a string, formatted as specified in the various
RFCs.
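The parameters above can be put together in a short sketch. The addresses, the subject and the helper names build_message and send_message are placeholders invented for the example; the host and port defaults simply mirror the values mentioned above and assume an SMTP server is actually listening there. The message is built with the standard email.message module so that it is formatted according to the mail RFCs.

```python
import smtplib
from email.message import EmailMessage

def build_message(sender, receivers, subject, body):
    """Create an RFC-compliant e-mail message (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(receivers)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

def send_message(sender, receivers, msg, host="localhost", port=25):
    """Send msg through an SMTP server assumed to be listening on host:port."""
    with smtplib.SMTP(host, port) as server:
        server.sendmail(sender, receivers, msg.as_string())

# Placeholder addresses; nothing is sent until send_message() is called
msg = build_message("me@example.com", ["friend@example.com"],
                    "Hello", "Sent by voice command.")
print(msg["To"])
```

Public mail providers usually require an encrypted connection and a login step before sendmail; those details depend on the provider and are left out of this sketch.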
Playsound : The playsound module is the simplest module to use for playing sound.
This module works on both Python 2 and Python 3, and is tested to play wav and
mp3 files only. It contains only one method, named playsound(), with one
argument to take the audio filename for playing.
Plyer : Plyer is a Python library for accessing features of hardware / platforms.
Ctypes : Ctypes is a foreign function library for Python. It provides C compatible
data types, and allows calling functions in DLLs or shared libraries. It can be used
to wrap these libraries in pure Python.
Psutil : Psutil is a Python cross-platform library used to access system details and
process utilities. It is used to keep track of various resources utilization in the
system. Usage of resources like CPU, memory, disks, network, sensors can be
monitored. Hence, this library is used for system monitoring, profiling, limiting
process resources and the management of running processes. It is supported in
Python versions 2.6, 2.7 and 3.4+.
Urllib : urllib is a package that collects several modules for working with URLs:
urllib.request for opening and reading URLs
urllib.error containing the exceptions raised by urllib.request
urllib.parse for parsing URLs
urllib.robotparser for parsing robots.txt files
Pyspeedtest : A Python library to test network bandwidth using Speedtest.net servers.
One can check ping, download and upload speed using this library.
Pandas : pandas is a fast, powerful, flexible and easy to use open source data
analysis and manipulation tool, built on top of the Python programming language.
Matplotlib : Matplotlib is an amazing visualization library in Python for 2D plots
of arrays. Matplotlib is a multi-platform data visualization library built on NumPy
arrays and designed to work with the broader SciPy stack. It was introduced by
John Hunter in the year 2002.
One of the greatest benefits of visualization is that it allows us visual access to huge
amounts of data in easily digestible visuals. Matplotlib consists of several plots like
line, bar, scatter, histogram etc.
Beautifulsoup : Beautiful Soup is a Python package for parsing HTML and XML
documents, including those with malformed markup such as non-closed tags (it is
named after "tag soup"). It creates a parse tree for parsed pages that can be used to
extract data from HTML, which is useful for web scraping.
Tabulate : Pretty-print tabular data in Python, a library and a command-line
utility.
The main use cases of the library are:
printing small tables without hassle: just one function call, formatting is
guided by the data itself
authoring tabular data for lightweight plain-text markup: multiple output
formats suitable for further editing or transformation
readable presentation of mixed textual and numeric data: smart column
alignment, configurable number formatting, alignment by a decimal point
Numpy : NumPy is a Python library used for working with arrays. It also has
functions for working in the domains of linear algebra, Fourier transforms, and
matrices. NumPy was created in 2005 by Travis Oliphant. It is an open source
project and can be used freely. NumPy stands for Numerical Python.
Opencv : OpenCV-Python is a library of Python bindings designed to solve
computer vision problems. ... OpenCV-Python makes use of Numpy, which is a
highly optimized library for numerical operations with a MATLAB-style syntax.
All the OpenCV array structures are converted to and from Numpy arrays.
It is also a free open source library used in real-time image processing. It's used
to process images, videos, and even live streams too.
Wave : The wave module in Python's standard library is an easy interface to the
audio WAV format. The functions in this module can write audio data in raw
format to a file like object and read the attributes of a WAV file.
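A small sketch of the wave module: it writes one second of silence to an in-memory file-like object and then reads back the attributes of the resulting WAV data. The 16 kHz rate, mono channel and 16-bit sample width are assumed values for the example, not settings taken from the project, and the helper names are hypothetical.

```python
import io
import wave

def write_silence(file_obj, seconds=1, sample_rate=16000):
    """Write mono 16-bit silence as WAV data to a file-like object."""
    with wave.open(file_obj, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (seconds * sample_rate))

def read_attributes(file_obj):
    """Return (channels, sample width in bytes, frame rate, frame count)."""
    with wave.open(file_obj, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())

buffer = io.BytesIO()
write_silence(buffer)
buffer.seek(0)                       # rewind before reading
print(read_attributes(buffer))
```

The same two functions work with a file opened on disk in place of the BytesIO buffer.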
6. Tools Required
Hardware : Monitor/Display
Software : Windows 10
Visual Studio Code(IDE)
Google-Chrome browser
Python version 3.7
7. Diagrams
8. List of Tasks This Application Performs
When this application is executed, the user first has to set some input fields.
(i) These are the target WhatsApp number and the body of the message. They have to be
set at the start because an emergency alert feature has been added to this desktop
assistant. The following example illustrates the feature.
E.g. Suppose a person leaves home for some purpose and, on the road, finds himself in
a threatening situation: some people are trying to attack him, force him to do
something, or take him away with them. There is nobody nearby to help him. He shouts
“help me, help me please!” but nobody hears. Using the emergency alert feature, however,
he can ask for help directly from a police station, a hospital, or any other emergency
service. Even if no person hears his cry, the voice assistant running on his laptop or
mobile phone does: when he shouts “help me!”, the assistant listens, recognizes his
voice, and sends a preset help message to an emergency service within a few seconds.
He can thus inform someone about his problem without calling, texting, or even touching
his phone, using only a voice command.
To be able to use this feature in an emergency, he first needs to set the inputs: the
target WhatsApp number (the number he wants to notify about his problem) and the body of
the message (for example, the message he wants to send to the police station). Once these
are set, his digital assistant is ready to help him outside his home too. The recommended
format for the message body is
<Myself XYZ, My address is UXV, My contact number is XXXXXXXXX, I came to
market, now I am in trouble please help me !!>
All of this has to be set before going out, so that in a critical situation the message
can be sent with the “help me!” command alone, because in such a situation he would not
have time to give all the details needed to trace him.
[**NOTE : For this feature to work, the target WhatsApp number must be registered with
the Twilio account, because this project uses the Twilio API for this purpose.**]
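The emergency feature can be sketched as two small functions: one that fills the message format given above, and one that hands the message to Twilio. This is an illustration under assumptions, not the project's actual code: the function names are hypothetical, the credentials and phone numbers must come from the user's own Twilio account, and the twilio package has to be installed for the second function to work.

```python
def build_help_message(name, address, contact, situation):
    """Fill the preset message format described in the text."""
    return (f"Myself {name}, My address is {address}, "
            f"My contact number is {contact}, {situation}, "
            "now I am in trouble please help me !!")

def send_whatsapp_alert(account_sid, auth_token, from_number, to_number, body):
    """Send the message through Twilio's WhatsApp channel.

    All arguments are supplied by the user; nothing here is a real credential.
    """
    from twilio.rest import Client   # imported lazily; requires the twilio package
    client = Client(account_sid, auth_token)
    return client.messages.create(
        body=body,
        from_=f"whatsapp:{from_number}",
        to=f"whatsapp:{to_number}",
    )

message = build_help_message("XYZ", "UXV", "XXXXXXXXX", "I came to market")
print(message)
```

Building the message ahead of time is what allows the single “help me!” command to send it without any further input.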
(ii) When the application starts, it checks whether any of the user’s friends or
acquaintances has a birthday on that day. The user just has to enter the person’s date
of birth in the application; if the current date matches the preset date, the
assistant shows a birthday reminder, otherwise it pushes a notification as shown
in the image below.
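The date comparison behind the birthday reminder can be sketched as follows. The function name, the dictionary layout and the sample names and dates are made up for the example; only the month and day are compared, since the year of birth never matches the current year.

```python
from datetime import date

def birthdays_today(birthdays, today=None):
    """Return the names whose birthday (month and day) falls on `today`."""
    today = today or date.today()
    return [name for name, dob in birthdays.items()
            if (dob.month, dob.day) == (today.month, today.day)]

# Hypothetical stored dates of birth
friends = {"Asha": date(1998, 7, 14), "Ravi": date(1999, 1, 2)}
print(birthdays_today(friends, today=date(2020, 7, 14)))
```

An empty result corresponds to the case where the application pushes the ordinary notification instead of a reminder.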
More Tasks and their Commands
Voice Command : Task Description
Ok bro : Activation point for speech recognition
Jarvis : Checks whether Jarvis is listening to the user
Who are you? : Tells the system name, which is ‘Jarvis’
tell me something about you : Gives a basic description of itself
temperature : Reports the current air temperature
check my battery status : Gives the battery details
check the connection status : Says whether the user is connected to the internet and pushes a notification
check internet speed : Checks and reports the upload, download and ping speed of the user’s network
<user’s query> wikipedia : Searches Wikipedia, then shows the result and speaks it aloud
google search <query> : Opens the Google Chrome browser and displays the results for the query
google : Opens Google for the user
google maps : Opens Google Maps
google drive : Opens Google Drive for the user
google translate : Opens Google Translate for the user
find location <place_name> : Searches for the place on Google Maps and shows the result in the browser
open youtube : Opens the YouTube homepage
search youtube <query> : Searches YouTube and shows the results for the user’s query
open udemy : Opens the Udemy homepage
search udemy <course_name> : Shows the matching courses
find geeksforgeeks <subject> : Opens GeeksforGeeks and shows the matching results
open mail : Opens the user’s personal Gmail account
send mail <message_body> <target_mail_id> : Sends mail to anyone using a voice command
open whatsapp : Opens WhatsApp for the user
open facebook : Opens Facebook for the user
find facebook <query> : Finds people on Facebook
search live train status <train_no> : Shows the live status of the train number provided by the user
open zoom : Opens the Zoom application
open sublime text : Opens the Sublime Text editor
open calculator : Opens the Calculator app
open notepad : Opens Notepad on the desktop
handle file <mode_to_handle> : Reads, writes and saves text files based on the mode selected by the user
play music / I am sad : Plays music from folders
movie <movie_name> : Opens the media player and starts the named movie
take screenshot <file_name> : Takes a screenshot and saves it under the file name provided by the user in the specified folder
change wallpaper : Changes the desktop background
take me to my childhood : Shows a childhood photo of the user, if a folder containing such photos has been specified
set alarm <hours> <minutes> <am/pm> : Sets an alarm and rings the alarm tone when the set time is reached
set timer <seconds> : Sets a timer for the number of seconds provided by the user and rings a warning tone when the deadline is reached
read breaking news : Reads the top 10 global news headlines of the day aloud
india corona cases : Shows results for the top 5 states in India
corona update : Shows global coronavirus totals, such as the number of infected, dead and recovered people
take photo : Captures a picture and saves it to the specified folder
record video : Records video using the desktop camera and saves it to the specified folder
record audio : Records audio using the microphone and saves the recorded file in the specified folder
help me <problem_statement> : Sends WhatsApp messages to the emergency contact through a voice command
police <problem_statement> : Sends mail to the local police station’s mail ID describing the problem, through a voice command
restart my pc : Asks the user to confirm the restart and acts accordingly
shutdown my pc : Asks the user to confirm switching the system off and acts accordingly
exit : Quits the application
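One possible way to route a recognized sentence to a task is a lookup from command prefix to handler function. The sketch below is an assumption about how such a dispatcher could look, not the project's implementation: it assumes the wake word “Ok bro” precedes the command in the same transcript, and the two handlers only return a description instead of performing the task.

```python
WAKE_WORD = "ok bro"

def extract_command(transcript, wake_word=WAKE_WORD):
    """Return the text after the wake word, or None if the wake word is absent."""
    text = transcript.lower().strip()
    if not text.startswith(wake_word):
        return None
    return text[len(wake_word):].strip()

def set_timer(argument):
    # Placeholder handler: a real one would start a countdown
    return f"timer set for {int(argument)} seconds"

def open_site(argument):
    # Placeholder handler: a real one would launch the browser
    return f"opening {argument}"

# Command prefix -> handler; only two of the commands listed above are shown
HANDLERS = {
    "set timer": set_timer,
    "open": open_site,
}

def dispatch(command):
    """Call the first handler whose prefix matches the command."""
    for prefix, handler in HANDLERS.items():
        if command.startswith(prefix):
            return handler(command[len(prefix):].strip())
    return "sorry, I did not understand"

print(dispatch(extract_command("Ok bro set timer 30")))
```

Adding a new voice command then only requires writing a handler and registering its prefix in the dictionary.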
9. Uses of Speech Recognition Programs
Speech recognition is used for two main purposes. The first is dictation, which in
the context of speech recognition means translating spoken words into text; the
second is controlling the computer and its various applications by voice.
Writing by voice lets a person write 150 words per minute or more, provided he or she
can speak quickly. In this way speech recognition programs help users get much more
done in a short time and save effort too.
10. Applications
10.1 From medical perspective :
People with disabilities can benefit from speech recognition programs. Speech
recognition is especially useful for people who have difficulty using their hands;
in such cases they can use it to operate computers. Speech recognition is also used
in deaf telephony, such as voicemail to text.
10.2 From military perspective :
Speech recognition is important from a military perspective; in the air force, speech
recognition has definite potential for reducing pilot workload. Besides the air
force, such programs can also be used in helicopters, battle management and
other applications.
10.3 From education perspective :
Individuals with learning disabilities who have problems with thought-to-paper
communication can benefit from the software. Some other application areas of speech
recognition technology are described above.
11. Some factors that may disturb functionalities of the application :
Homonyms : These are words that are spelled differently and have different
meanings but sound the same, for example ‘to’ and ‘two’, ‘be’ and
‘bee’. It is a challenge for a machine to distinguish between such
words that sound alike.
Overlapping Speech : A second challenge is understanding the
speech uttered by the user; the machine often picks the wrong command because of
the way a word is pronounced.
12. The future of Speech Recognition :
Accuracy will become better and better.
Dictation by speech recognition will gradually become accepted.
Using speech recognition in collaboration with AI, a system might be developed
that is as intelligent as a human.
In future, corporate tasks can probably be automated using speech recognition
and Selenium.
13. References :
1. https://pypi.org/
2. https://www.geeksforgeeks.org/
3. https://github.com/github
4. https://www.kdnuggets.com/2020/06/easy-speech-text-python.html
5. https://www.analyticsvidhya.com/blog/2019/07/learn-build-first-speech-to-text-model-python/