Google Glass Project

GOOGLE GLASS PROJECT
A Final Report
Presented to
The Advisory Committee and Project Sponsors
Senior Design Course of Fall-2013 & Spring 2014
Florida Institute of Technology
In Partial Fulfillment
of the Requirements for the Course
MAE 4190/4193/4194: Mechanical Engineering Design
by
Google Glass Group
Cheng Lu, Kyle Coleman, Luke Glesener, Constantin Jessen, Brandon Mills, Shaopeng
Han, Yanglong Lu
April 14, 2014
Accepted by:
Dr. Beshoy Morkos, Advisory Committee Chair
Mr. Varun Menon

i
1. ABSTRACT
The google glass project was created to enhance the Google Glasses with an interface to allow it to
interpret audio input and output text to the google glass screen. The project motivation was to improve the
overall quality of life for those who are hearing impaired or completely deaf.
In the first semester, the project began by looking into which microphones could be used for the audio
input as well as general research into what made microphones work. Different specifications were taken
into account to include; sensitivity, polar pattern, range, and recording mechanic. Once the microphone
criteria were confirmed, primarily that the group would be utilizing two cardioid condenser microphones,
the microphone team produced a list of viable microphones available for purchase. This list was pared
down until specific microphones were selected and ordered. The POM-3535L-3-R was the best choice out
of four microphones that were sent through the final decision matrix. These findings were submitted to
David Beavers to conduct the ordering process. Once microphone research was completed, the team
focused on the platform for the device as well as the dictation software. The team decided to focus their
efforts on the Raspberry Pi and an Android phone based app. The dictation software to be used would be
PocketSphinx for the Raspberry Pi and iSpeech for the Android phone. The group did additional research
into Sound Source Separation methods (sss) as well as filtering software. Testing software used in this
project for microphone analysis was a software called Praat. The group took the time to look into
additional products similar to the Google Glass in the expectation that they may not be able to acquire the
Google Glass for testing. The front runners for this expansion were; Vuzix M100 Smart Glasses, Meta 3D
Space Glasses, and the GlassUp.
In the second semester, the group was able to acquire a working Google Glass. With the purchase of the
Google Glass, much of the research from the previous semester had to be re-assessed. The google glass
had no support for external microphones, and it was discovered that the programing required to integrate
an external microphone would endanger the health of the Glass and was beyond the groups programming
ability. The Raspberry Pi was also removed from the project because it was too slow and did not interface
with the Google Glass. The group initially focused on programming for both the Android smart phone
and the Google Glass itself. Once a successful app, The Live Card App, was created, most of the group’s
efforts was focused on improving the code, and updating deficiencies. Eventually, the group was able to
create an app that continuously converted speech to text and printed the text to the Google Glass screen.
This application was named The Immersion App, later named Live Subtitles. Live Subtitles is effective
within 3 to 5 feet when used in environments with varying dB. In order to improve range and accuracy,
the group began researching filtering technology, however the expected programming requirements
would push the project beyond its slated timeline. The group has formulated multiple suggestions for
future project advancement. Overall, the group was successful in achieving its prescribed goals; the
following document denotes the specifics of all research and results for the project.

ii
2. ACKNOWLEDGMENTS
The google glass group would like to personally thank their senior advisor Dr. Beshoy Morkos for his
support and guidance on this project. The team would also like to thank David Beavers for his support in
ordering of the microphones and Raspberry Pi and for lending his technical expertise in the field of
electronics. Finally the group would like to thank Dr. Crawford, Varun Menon and Shaun Davenport for
their support and advice on the project.

iii
3. TABLE OF CONTENTS
1. Abstract......................................................................................................................... i
2. Acknowledgments........................................................................................................ii
3. Table of Contents........................................................................................................iii
4. List of Tables .............................................................................................................. vi
5. List of Figures............................................................................................................vii
6. Introduction.................................................................................................................. 1
6.1. Problem Statement................................................................................................ 1
6.2. Motivation............................................................................................................. 1
6.3. Goal....................................................................................................................... 1
6.4. Terminology, Constraints and Requirements ....................................................... 1
6.5. Global, Social, and Contemporary Impact............................................................ 3
6.6. Group Logistics..................................................................................................... 4
7. Background.................................................................................................................. 5
7.1. General Research.................................................................................................. 5
7.2. Google Glass Alternatives .................................................................................. 16
7.3. Patent Search....................................................................................................... 19
7.4. Current State of the Art....................................................................................... 20
7.5. Reverse Engineering ........................................................................................... 21
8. Recommended Solution............................................................................................. 22
8.1. Solution Principles.............................................................................................. 22
8.2. Analysis .............................................................................................................. 23
8.3. FMEA ................................................................................................................. 25
8.4. Decisions............................................................................................................. 27
8.5. QFD .................................................................................................................... 32
8.6. Bill of Material.................................................................................................... 34
9. Programming.............................................................................................................. 36
9.1. Programming for the Google Glass .................................................................... 36
9.2. Programming Process ......................................................................................... 38

iv
9.3. Generated Applications....................................................................................... 39
9.4. Application Operation: ....................................................................................... 39
9.5. Android Smartphone Programming.................................................................... 42
9.6. Programming the RPi.......................................................................................... 43
10. Testing...................................................................................................................... 44
10.1. Introduction....................................................................................................... 44
10.2. Praat Software................................................................................................... 45
10.3. Read Time Research ......................................................................................... 45
10.4. Audio-to-Displayed Text .................................................................................. 46
10.5. L/R Comparison................................................................................................ 47
10.6. Dictation Software Comparison........................................................................ 50
10.7. Audio Test for Google Glass ............................................................................ 51
10.8. Launch Time of “Speech to Text” .................................................................... 51
10.9. Working Range of “Speech to Text” ................................................................ 52
10.10. Incremental Test for “Speech to Text” ........................................................... 52
10.11. Ambient Noise Threshold Test for “Speech to Test” ..................................... 54
10.12. Ambient Noise Threshold Test for “Live Subs”............................................. 56
10.13. Accuracy of “Live Subs” ................................................................................ 57
10.14. Polling............................................................................................................. 57
11. Conclusions and Future Work ................................................................................. 59
11.1. Final State of the Project................................................................................... 59
11.2. Major Accomplishments................................................................................... 59
11.3. System Validation............................................................................................. 60
11.4. Recommendations for Future Work ................................................................. 63
Appendices........................................................................................................................ 65
Appendix A Meeting Minutes ................................................................................... 66
Appendix B Emails.................................................................................................. 102
Appendix C Executive Summaries .......................................................................... 129
Appendix D Raspberry Pi Programming Log.......................................................... 153
Appendix E Dictation Research Initial Research Log ............................................. 156
Appendix F Microphone Search Log....................................................................... 157
Appendix G Experiments......................................................................................... 159

v
Appendix H Smartphone Audio-to-Text ................................................................. 166
Appendix I Filter Tester App................................................................................... 173
Appendix J AGC MATLAB CODE........................................................................ 189
Appendix K Google Glass Application ................................................................... 192
12. References.............................................................................................................. 205

vi
4. LIST OF TABLES
Table 1: Smart Watch Comparison............................................................................................... 18
Table 2: Microphone Requirements.............................................................................................. 24
Table 3: FMEA ............................................................................................................................. 26
Table 4: Platform Decision Matrix ............................................................................................... 28
Table 5: Microphone Decision Matrix.......................................................................................... 31
Table 6: Budget Allocation........................................................................................................... 34
Table 7: Expenses ......................................................................................................................... 34
Table 8: Audio-to-Display Text Results....................................................................................... 46
Table 9: L/R Comparison at 3ft .................................................................................................... 48
Table 10: L/R Comparison Results Chart at 2 ft........................................................................... 49
Table 11: Dictation Software ........................................................................................................ 50
Table 12: Microphone Range........................................................................................................ 51
Table 13: Launch Time of “Speech to Text” ................................................................................ 51
Table 14: Accuracy of “Speech to Text” in Various Distance and Words................................... 53
Table 15: Accuracy of “Speech to Text” in Various Environment & Distance ........................... 54
Table 16: Working Range of “Speech to Text” in Different Environment .................................. 54
Table 17: Accuracy of “Live Subs” in Various Environment & Distance ................................... 56
Table 18: Working Range of “Live Subs” in Different Environment .......................................... 56
Table 19: Accuracy of "Live Subs" in Different Environment..................................................... 57
Table 20: Result of Polling ........................................................................................................... 57
Table 21: System Validation Checklist......................................................................................... 61

vii
5. LIST OF FIGURES
Figure 1: Fiber optic microphone ................................................................................................... 5
Figure 2: Dynamic Microphone...................................................................................................... 5
Figure 3: Ribbon Microphone......................................................................................................... 6
Figure 4: Condenser electrets microphone ..................................................................................... 6
Figure 5 - Polar Pattern................................................................................................................... 7
Figure 6: Raspberry Pi [11]............................................................................................................. 8
Figure 7: Butterworth Filter [20] .................................................................................................. 15
Figure 8: Bessel Filter [20] ........................................................................................................... 15
Figure 9: Chebyshev Filter [20].................................................................................................... 16
Figure 10: Vuzix M100 Smart Glasses [24] ................................................................................. 17
Figure 11: Meta 3D Space Glasses [25] ....................................................................................... 17
Figure 12: GlassUp [26]................................................................................................................ 18
Figure 13: SmartWatch, Pepple, Galaxy Gear (left to right) ........................................................ 18
Figure 14: Functional Flow Block Diagram ................................................................................. 22
Figure 15: Final Concept .............................................................................................................. 23
Figure 16: Approximate frequency ranges [37]............................................................................ 24
Figure 17: QFD............................................................................................................................. 33
Figure 18: Implicit intent [47]....................................................................................................... 36
Figure 19: Live Card Operation [50] ............................................................................................ 37
Figure 20: Low Frequency (Left) and High Frequency (Right) Live Cards [X8]........................ 38
Figure 21: Immersion [51]............................................................................................................ 38
Figure 22: Exporting Code to the Glass [52]................................................................................ 39
Figure 23: Launching Application................................................................................................ 39
Figure 24: App Listening.............................................................................................................. 40
Figure 25: Review of Conversation .............................................................................................. 40
Figure 26: Text History................................................................................................................. 40
Figure 27: Return to Speech Recognition..................................................................................... 41
Figure 28: Tap to Enter Speech Recognition Mode...................................................................... 41
Figure 29: Deleting Files .............................................................................................................. 41
Figure 30 Subtitle Research.......................................................................................................... 45

viii
Figure 31: Audio-to-Display Results............................................................................................ 46
Figure 32: L/R Comparison Results Chart at 3 ft. ........................................................................ 47
Figure 33: L/R Comparison Results Chart at 2 ft. ........................................................................ 48
Figure 34: Speech to Text Accuracy............................................................................................. 52
Figure 35: Accuracy vs. Distance & Words ................................................................................. 53
Figure 36: Final State and Design................................................................................................. 59
Figure 37: Experiment Setup ...................................................................................................... 162

1
6. INTRODUCTION
6.1. Problem Statement
The objective of this project is to modify the Google Glass with a system which will allow people who
are hearing-impaired to participate in spoken conversations. This will be done by transforming that audio
into text through noise filters and voice to text software, and then displaying it on the Google Glass.
6.2. Motivation
The primary motivation for this project was to improve the overall standard of living for the hearing
impaired or completely deaf. Relying on others to interpret what someone is saying for them can be
difficult. So can limiting oneself to only being able to talk to a small community of people who
understand sign language. By modifying the google glass to be used as a voice to text device, we are
allowing the hearing impaired and deaf the freedom to interact with more people in more places without
being hindered by their disabilities. Ideally, science will eventually advance enough to allow the hearing
impaired or deaf to completely recover their faculties. However, until that time comes, it makes sense to
use the technology of today to make that wait just a little easier.
6.3. Goal
The major goal for the team was to determine the applicability potential for the Google Glass and any
secondary software/hardware needed to support the functionality. By the end of the first semester the
group was required to have a working prototype of a voice to text to a remote display and by the end of
the second semester the team needed to have the voice to text working and the text being displayed on the
Google Glass.
6.4. Terminology, Constraints and Requirements
6.4.1. Terminology
 System - Refers to all the combined components of the product from the audio capturing
device to the text being displayed.
 Dictation Software - The software that converts audio signals to textual form.
 User - The person directly operating the system.
 Audio Capture Device - The device used to capture the audio, such as a microphone.
 Filtering - The process of removing undesired noise from a sound signal.
For the following constraints and requirements, “shall” designates requirements that must be fulfilled in
order for the experiment to be deemed a success. Items that contain “should” are items that are desired,
but are not needed in order for the project to be deemed a success.
6.4.2. Constraints
 The system shall use the Google Glass to display text.
 The system shall be housed in a portable package.*

2
 The complete system cost shall remain within the budget assigned by the sponsor.
6.4.3. Requirements
The team was given four initial stakeholder requirements: The system shall do real time speech to text, be
mobile, safe to use, and user friendly. From this the team generated functional and nonfunctional
requirements to meet the key stakeholder requirements. The functional requirements from the list were
later placed in a QFD to determine their relative importance.
6.4.3.1. Real Time Speech to Text (Functional)
1. The audio capturing device shall be able to capture all audio within the human sound
threshold at the point of capture. (Up to 100 decibels sound pressure level).
a. Based on human hearing range from microphone research.
2. The audio capturing device shall have a minimum frequency response range of 300 Hz - 3.4
kHz.
a. Based on human speech range from microphone research.
3. The filtering function of the system should be able to filter out all ambient noises.
a. Ambient noise is defined as any sound other than the desired sound source.
4. The system shall be able to display the text under 3 seconds from the audio being detected.*
5. The system shall capture audio from a nominal distance of 3 ft.**
6.4.3.2. Mobile (Functional)
6. The system shall be under 8” x 3” x 1”.
a. Based on Phablet (Samsung Galaxy Note 3) and Portable gaming devices (PS Vita
and Nintendo 3DS)
7. The system should have a battery life greater than 4 hours.
a. Initial estimate.
8. The system should be weather resistant.
9. The system shall be durable enough to withstand daily use.
a. “Daily Use” shall be defined in the future.
10. The system should weight under 8 oz.
a. Based on average weight of mobile phone.
6.4.3.3. Safety
11. No component of the system shall exceed a temperature of 140F.
a. Based on ASTM C10552 for metallic surfaces.
12. No components of the system shall have sharp edges.
13. All electrical components shall be electrically grounded.
14. The system shall not reuse the user’s personal information without the user’s permission.
6.4.3.4. User Friendly (Feel)
15. The system shall be user friendly.
a. User friendly shall be defined as having an average user friendliness rating of 7 or
greater on a 0-10 scale (10 being the greatest) by a polled group.

3
16. The system should be comfortable to wear.
a. Comfort shall be defined as having an average user comfort rating of 7 or greater on
a 0-10 scale (10 being the greatest) by a polled group.
17. The user should have the ability to select the formatting of the text displayed. (i.e. font, font
size)
18. The system should be aesthetically pleasing.
a. Aesthetically pleasing shall be defined as having an average aesthetics rating of 7 or
greater on a 0-10 scale (10 being best) by a polled group.
19. The dictation software shall be controlled easily.
20. The user should be able to indicate which sound source to be displayed.
a. “Indicate” to be defined in the future.
21. The system should be able to indicate the sound level of the source.
a. This is to give the user some context. Example textual output: “I was driving down
the road and “BAM” there was something in the road.” where “BAM” was shouted.
22. The system should notify the user when it detects audio and begins converting the audio after
the system leaves a dormant state.
a. This feature would notify the user that it is going to be displaying text if the system
had not detected audio for a length of time.
b. The purpose is to get the users attention so they do not miss information coming on
the display.
23. The system should be able to display to non-Google Glass sources.
a. This is to widen the systems customer base.
24. The system should have different mode settings.
a. Example: Have a lecture setting to increase the range and sensitivity of the audio
capturing settings.
25. The system should be able to tell which source the sound is from and store it, so that
information can be relayed to the user.
26. The system shall have a cold start-up time of under 3 mins.
a. This is an initial estimate.
27. The dictation software should have multi language capability.
28. The system should have a data storage feature to allow the user to recall past data
a. This will also aid in debugging the software.
29. The system should indicate to the speaker when it is not able to keep up.
* Based on the read-time research
** Based on the audio-to-text experiment
6.5. Global, Social, and Contemporary Impact
Overall the impact this project will have on the social, global, or contemporary scale is expected to be
very low. The project will have a primarily focus on the social scale. By focusing on the hearing
impaired, our group hopes to allow this sector of the population to more easily interact with the rest of
society. Historically, those who are deaf or hearing impaired have been limited in their ability to interact
with people who do not understand sign language or do not have the time to write down everything they
need to impart to the hearing impaired/deaf. This barrier in communication results in: confusion, social
issues, trouble obtaining or holding certain modes of employment, and various other issues. With the
implementation of our modified Google Glass, or devices similar to what we have proposed, we are
offering the hearing impaired more freedom to interact with their environment and ultimately improving
their standard of living.

4
The following are some instances where the Google Glass would not only help those wearing them, but
also people around them: lectures, presentations, award ceremonies, security checkpoints, or interacting
with police personnel, emergencies, etc. Any instance where information must be transmitted to someone
quickly when there may not be a mediator who knows sign language available, will be positively affected
by this device.
Some issues that may be presented by our project would primarily be financial, as well as the creation of a
social stigma. As far as financial issues are concerned, it is not entirely clear how expensive the finalized
Google Glass will be nor who will be able to access it, however, the app itself will not carry any apparent
additional cost. Those who are unable to acquire this device or similar devices, may be left behind if
performances, lectures, and award ceremonies no longer feel the need to provide interpreters because
everyone is expected to own one of these devices. Though this is a far reaching assumption, which may
not come to fruition anytime soon and hopefully become a “non-issue” with advanced medical
technology. Eventually, the results of this project might be available through health care or insurance; if
this is the case, some insurance or health care providers may have to change their policies and rates. With
regards to social stigma issues, similar to how people may be viewed differently when they use a cane or
animal assistants, or even hearing aids, these people wearing obvious augmented reality (AR) glasses may
be automatically singled out as different or in some cases “weak” or “easy targets” for violent crimes.
Anything that draws attention to someone’s handicap is going to somehow affect how they interact with
the public. It is possible though, since the App provided is attached to a social media device, others who
are not impaired will wear the Glass in abundance and dilute this stigma cue. Although any device that
impairs active or passive situational awareness is going to make someone an easier target. In addition to
this, people who are older or less accepting of technology may have issues with using or wearing these
devices.
6.6. Group Logistics
The group had one formal meeting a week to discuss the direction the project was going, tasks that
needed to be done, who would do them, and to address any questions or concerns. Meeting minutes were
taken for these meetings and are included in Appendix A. There was an Advisory meeting on a near
weekly bases throughout the year, in which the team would present their current works and findings to the
advisory committee and get feedback from said committee. Minutes for the advisory meetings were also
recorded and can be found in Appendix A. Part of keeping the advisory committee informed was writing
an executive summary of everything accomplished during the prevailing week. Any issues and upcoming
projects were also recorded in the executive summary. These summaries are located in Appendix C. The
group communicated via the group email when contacting people and companies external to the group, a
copy of these emails can found in Appendix B.

5
7. BACKGROUND
7.1. General Research
7.1.1. Microphones
A microphone is a sensor or transmitter which converts sound waves into an electric signal. In the world
of microphones, there is a wide variety of different types that are designed for different purposes. The
type of microphone has a major impact on its performance depending on which situation it is used in.
Certain microphones are very unreliable when put under conditions the microphone was not designed for,
this lead to the focus on researching the different types of microphones and what they are designed to do.
7.1.1.1. Types
7.1.1.1.1. Fiber Optic Microphone
A fiber optic microphone replaces the metal wire, which in traditional
microphones is used to pick up sound waves, with an extremely thin glass
strand. A laser travels inside the glass strand. This allows the microphone to
use an optical system to convert acoustic signals into a modulated light
intensity signal [1]. Those signals are very accurate and while not being
electrical they do not cause any magnetic fields [2]. The microphone itself
does not require any metal components and therefore does not interfere with
many other sensitive electronic systems.
The fiber optic microphone is extremely accurate, as well as very light and
small in size, potentially making it very effective in this project. The
problem is that the fiber optic microphones are very expensive with starting
prices around $2000 and are not within the affordable range for this project.
7.1.1.1.2. Dynamic Microphone
A dynamic microphone uses a magnet and a coil of wire to induce a current
through the wire for the electrical output of the microphone. The magnet or
coil is connected to a diaphragm which vibrates as a result of incoming sound
waves. A dynamic microphone induced current follows the electromagnetic
principles of Faraday’s law of induction [3].
The dynamic microphone’s sensitivity is generally very low (around -90 dB)
[4] as well it’s capability to record human speech. The only issue with
dynamic microphones is that dynamic microphones do not come in a small
scale and are generally large.
7.1.1.1.3. Ribbon Microphone
The ribbon microphone uses a metal ribbon suspended in a magnetic field to
absorb the sound vibrations. Inside the magnetic field, the ribbon microphone is connected to the output
of the microphone through an electrical connection. The alteration caused by the vibration of the ribbon
as the result of the sound waves, is transmitted electrically over to the microphone’s output [5].
Figure 1: Fiber optic
microphone
Figure 2: Dynamic
Microphone

6
Though the sound quality of a ribbon microphone is very clear and does not
need any external power supply. The ribbon microphone is sensitive to
movement. The metal ribbon suspended in the magnetic field moves when the
microphone itself starts moving, affecting the output [6]. This ineffectiveness in
a mobile environment makes the ribbon microphone unsuitable for this project.
7.1.1.1.4. Condenser Microphone
A condenser microphone or capacitor microphone consist of two sides of a
capacitor. One side is a stretched diaphragm so that it can achieve a high
resonating frequency. The other side of the condenser is fixed. The condenser
microphone needs a constant current flowing through both plates of the
capacitor to function. Incoming sound waves cause the diaphragm to vibrate
which causes a change in current inside the capacitor. This change in current
then gets transmitted as the output from the microphone [7].
As for Electret Condenser Microphones, a sub-category for condenser
microphones, they use a permanently charged diaphragm, such as a magnet, and
therefore do not need a power input. This allows the electret microphone to be
built in small compressed sizes. They are commonly used in cellphones and
other electronic devices and in high demand. This reduces their price to a very
small amount, starting at $0.50 per microphone.
7.1.1.2. Sound Pressure Level
Sounds exert a pressure on the surroundings which is typically measured in decibels (dB). The dB scale is
based on the range of human hearing. A pressure at zero dB is at the extreme minimum that a human can
hear, whereas the human threshold for pain is around 130 dB.
7.1.1.3. Sensitivity
Sensitivity of microphones is a rather vague topic. It is typically measured with a 1 kHz sine wave at a 94
dB sound pressure level or 1 Pa of pressure. The accepted definition of sensitivity for a microphone,
typically given in negative dB, tells how much voltage will be output from a microphone at a given sound
pressure level. In combination with this, all microphones need a preamp gain which changes based on the
sensitivity of the microphone. The challenge is balancing the gain with the sensitivity of the microphone
[8].
7.1.1.4. Polar Patterns
A polar pattern is the sensitivity of a microphone relating to the direction the sound is coming from,
typically represented as a graph. For example in the graphs below, each band, going from the center
towards the outside, represents increments of 5 dB of sensitivity [9]. What polar pattern a microphone is
using, influences the microphones capability to capture sound from different directions. There are four
main types of polar patterns are presented in Error! Reference source not found. (from left to right):
cardioid, omnidirectional, bidirectional, and shotgun.
Though the choice of polar pattern is not essential, it has to be regarded to optimize the performance of
the microphone. This means, if the condenser microphone has an omnidirectional polar pattern, the
housing of the microphone has to be built accordingly, so that there exist damping towards the not
focused directions, for example the 180 degree mark.
Figure 3: Ribbon
Microphone
Figure 4:
Condenser electrets
microphone

7
7.1.1.5.
7.1.2. Platforms
7.1.2.1. The Raspberry Pi
The Raspberry Pi (RPi) is a small single-board computer developed by the Raspberry Pi Foundation in
the United Kingdom [10]. The RPi is approximately the size of a credit card and most often runs a
modified version of Linux called Raspbian, a cross between Raspberry and Debian. Although Raspbian
is the most common operating system (OS) other lesser known OSs can also be used. Most computers
have a permanent hard drive, which contains the operating system and data storage. RPi, on the other
hand, uses a removable SD card. The main features of the RPi are (Figure 6):
 ARM11 700 MHz processor
 512 MB of RAM
 HDMI and RCA video out
 Ethernet port
 2 USB 2.0 slots
 SD card slot
Figure 5 - Polar Pattern

8
Figure 6: Raspberry Pi [11]
7.1.2.2. Android Devices
The two main smartphone categories currently on the market are Android devices, made by many
different developers and iPhones, made by the Apple Corporation. The actual android language is based
on Java and was created by Google. The Android language is simple and free to develop applications for.
It also has a massive online resource library filled with tutorials and example code. For these reasons, it
was chosen as a strong alternative to programming with the Raspberry Pi.
7.1.3. Speech Enhancement
Speech enhancement is the field that deals with the signal processing associated with speech recognition
and dictatio [12]. Its two main goals are to increase the intelligibility and to increase quality. Intelligibility
is how understandable the words are to a listener. Quality is how good the sound is. These are typically
mutually exclusive meaning when one is increased the other is often decreased. This can be minimized to
some degree by using multiple microphones. The field of speech enhancement consists of three main
fields; noise reduction, signal separation and dereverberation.
7.1.3.1. Noise Reduction
Noise reduction is the process of removing ambient noise from a signal. This can be broken into two
main fields; parametric and nonparametric algorithms.

9
7.1.3.1.1. Parametric
Parametric methods tend to be classified by the fact that they model the speech signal as an autoregressive
(AR) process embedded in Gaussian noise [12]. Parametric techniques use the following steps; they
estimate the coefficients of the AR and noise variances and then apply the Kalman filter to estimate the
clean speech.
7.1.3.1.2. Nonparametric
The leading types of nonparametric denoising techniques are spectral subtraction and signal subspace
decomposition.
Spectral Subtraction
Spectral subtraction is the most commonly used method in real-world applications. It does have the
drawback of introducing artifacts known as musical noise [12].
There are different appoaches of spectral subtraction method for Enhancing the Speech Signal in Noisy
Environments. The introduction of a5oaches is as follows:
1. Basic spectral subtractions algorithm
Assumptions- Noise is additive and its spectrum doesn’t change with time
Principle- Estimate and update the noise spectrum when speech signal is not clear and highly
influenced by noise.
Advantage- Simple and easy to implement.
2. Shortcomings of S.S.
Principle- This Algorithm Lead to negative values, resulting from differences between the
estimated noise frame and actual noise frame.
Advantage- Accuracy.
Disadvantage-Too sensitive. Hard to implement, if subtracted too little, the much noise will
remain.
3. Spectral subtraction with over subtraction
Principle-This method are subtracting an overestimate of the noise power spectrum and prevent
it going below to the minimum spectrum.
Advantage-this mainly focused on lowering the musical noise effect.
4. Non-linear spectral subtraction (NSS)
Principle- NSS is a method by modifying the over subtraction factor dependent on subtraction
process non-linear.

10
Advantage- Since the subtraction factor is various, it can be used in different types of noise.
Disadvantage- This method effect low frequency region more than high frequency.
5. Multi band spectral subtraction (MBSS)
Principle- In MBSS approach, the speech spectrum is divided into different bands and each band
has different subtraction coefficient.
Advantage- Accuracy and easy to operate.
6. MMSE Spectral Subtraction Algorithm
Principle- A method for selecting the best subtractive parameters then apply it.
Advantage- Fast
Disadvantage- May subtract the useful sound.
7. Selective spectral subtraction algorithm
Principles- Due to the spectral differences between vowels and consonant. This method proposed
to treat the voiced and unvoiced segment differently than separates the sound into two band
spectral subtraction.
Advantage- Treating voiced and unvoiced segments differently can make a substantial
improvements in performance
Disadvantage- The accurate and reliable voiced and unvoiced decisions particular at low SNR
(Signal-to-noise ratio) conditions cannot be guaranteed.
8. Spectral subtraction based on perceptual properties
Principles- This method is doing expectation based on the perception of human beings.
Consider the perceptual properties of the auditory system.
Disadvantage: More ideal.
These methods would be a good reference for our filter software’s selection. The main goal of the
approaches are add a subtraction factor which related to Laplace transform and Fourier transform.

11
7.1.3.1.3. Signal Subspace Decomposition
Signal subspace decomposition is the process of decomposing the vector space of the noisy signal into
two orthogonal subspaces; the signal plus the noise and the noise [12]. After which, the clean speech can
be modeled. By applying the Wiener filter the musical noise can be mitigated [12].
7.1.3.1.4. Dereverberation
Dereverberation is the process of removing reverberations from a signal source. Reverberation is the
noise generated by signal source bouncing off of objects on the way to the listening device and reaching
the listening device at different times, resulting in echoes and spectral distortions [12].
7.1.3.1.5. Blind Source Separation (BSS)
Blind source separation is the process of separating the signals in a recording without prior knowledge of
the signals attributes.
7.1.3.1.6. Acoustic Beamforming
Acoustic beamforming is a method for locating the source of noise using a microphone array in a far field
[13]. A far field is defined as having a distance from the signal source greater than that of the microphone
array. Is measured in spatial resolution and dynamic range. Spatial resolution is the minimum distance
two signal sources can be placed and still be distinguishable from one another. Dynamic Range is
difference in sound level between the sound source and its environment.
Advantages
 Size of the microphone array can be smaller than the signal source
 Fast
Disadvantages
 Only typically good for frequencies over 1kHz
 Cannot be used to find the intensity of sources, therefore cannot do ranking
 Tends to use a lot of microphones
Recommendation
Although it only works above a certain frequency it could still come into use. It also does not require a
large microphone array, which is an additional benefit. It may be part of a future upgrade of the system,
which would allow the wearer to be shown who is speaking.
7.1.3.1.7. Near-Field Acoustic Holography (NAH)
Similar to acoustic beam forming, but is used in the near field. Near field is defined as being closer than
one or two wavelengths of the highest frequency [13]. It also calculates the sound pressure of the area.
The microphone array needs to be the same size as the object generating the sound. The microphone
arrangement determines the maximum frequency of interest and the granularity of the signal.

12
Advantages
 Calculates sound pressure.
 Is good over a variety of frequency ranges.
Disadvantages
Must be very close to the sound source.
Requires that the array be the dimensions as the sound source.
Recommendation
Due to the microphones need to be extremely close to the sound source and for the microphone array
shape to match the source, it would be impractical to use it for the google glass project.
7.1.3.1.8. Spherical Beamforming
Spherical beamforming is a technique used for locating sound sources within a cavity, such as a car
interior [14]. It requires a specialized microphone array, typically a 3D acoustic camera.
Recommendation
Due to needing a specialized microphone and that it can only be used in cavities, it has been ruled out for
this project.
7.1.4. Filtering
7.1.4.1. Definition of a Filter
A Filter is any medium through which a sound signal is passed, regardless of whether the signal is in a
digital, electrical or physical form. [15] A physical filter uses the physical shape to alter the signal; an
example of this is the human mouth forming words. An electronic filter uses electrical hardware, such as
resistors and circuits. A digital filter uses an algorithm in code to alter a digital signal. Here the focus is
put on digital filters due to size limitations of the project.
7.1.4.2. Major Categories of Digital Filters
7.1.4.2.1. Time Domain vs. Frequency Domain
These Filters operate in the domain that their name implies, time domain filters operate in the time
domain and frequency domain filters operate in the frequency domain. The time domain filters utilize the
difference equation [16].
𝑦(𝑚) = ∑ 𝑎 𝑘 𝑦(𝑚 − 𝑘)𝑁
𝑘=1 + ∑ 𝑏 𝑘 𝑥(𝑚 − 𝑘)𝑀
𝑘=0 (1)

13
In the above equation ‘y’ is the output, ‘x’ is the input, and ‘m’ is the position or index of the signal being
fed into the equation. The coefficients ‘ak’ and ‘bk’ are dependent on the type of filter and are based on
analog filters [17]. The filter order is the larger of the N or M from the difference equation [16].
Since the audio signal is in the time domain when captured, it requires performing a Fourier transform
(FT) and inverse Fourier transform to convert the signal from time to frequency domain for filtering and
then back to the time domain for the speech to text. The most common method found to do this is a Fast
Fourier Transform (FFT). FFT is a method of performing a FT that reduces the number of calculation
from 2N^2 to 2N Log N [18].
7.1.4.2.2. Infinite vs. Finite Impulse Response
When operating in the time domain and using the difference equation, another choice arises: Infinite
Impulse Response (IIR) or Finite Impulse Response (FIR). The main difference, mathematically, is that
an IIR takes into account the previously calculated values as it processes the data. FIR, on the other hand,
does not. Conceptually, this means that once there is an impulse being passed through an IIR filter, the
filter generates an infinite response with an exponential decay that never truly goes to zero [16]. From a
programming perspective, an IIR requires fewer coefficients to achieve the same user defined frequency
response as the FIR filter. The lower the number of coefficients means lower storage requirements, faster
processing and a higher throughput, but an IIR is more likely to become unstable [16].
7.1.4.2.3. Time-Invariant vs. Time-varying
Time-Invariant filter coefficients do not change with time [16]. Where the Time varying does the opposite
and requires more complex calculation for determining the coefficients over time.
7.1.4.2.4. Linear vs. Non-Linear
Linear filters response is the weighted sum of the filter's individual responses. Non-Linear is as its name
implies, and is not a linear combination of responses and like time-varying filters require more complex
mathematics to derive.
7.1.4.3. Common Types of Digital Filters
For all filters the bandwidth or frequency being targeted cannot be greater than half of the sample rate,
this limiting frequency is called the Nysquist frequency [15]. A frequency of between 5 Hz to 3.7 Hz has
been proven to be ample for speech recognition [19], but some researchers/companies go up to 8 kHz.
This means that the minimum recording sampling frequency must be 8 kHz for the 3.7 kHz and 16 kHz
for the 8 kHz, with recording at a sampling frequency of 8 kHz being most common.

14
7.1.4.3.1. Band Filters
Band filters either allow or prevent frequencies (depending on how they are designed) within a given
bandwidth from passing through the filter [15]. The most common types of these are: Low-Pass, High-
Pass, and Band-Stop. The Low-Pass Filter (LPF) eliminates frequencies greater than the prescribed
frequency. Based on the needed frequencies for speech recognition the prescribed frequency would either
be 4 kHz to 8 kHz. High-Pass Filters (HPF) eliminate frequency less than a prescribed frequency, approx.
5 Hz. Band-Pass filters (BPF) are a combination of the LPF and HPF allowing frequencies within given
ranges or bands to pass. Although an ideal filter eliminates all the undesired frequencies, real filters can
only attenuate the frequencies. A notch filter is a BPF that has a narrow stop band, or band that it is
rejecting.
7.1.4.3.2. Methods of Implementing Band Filters
Band filters come in several flavors, the most common of which are the Butterworth, Bessel and
Chebyshev [20]. These all accomplish the goal of filtering out a band/s, but use different algorithms to
accomplish it, each with their own pro’s and con’s. The Butterworth filter (Figure 7) does not introduce
any major ripples (areas in the band-pass that are attenuated due to the algorithm), but is slow to attenuate
the signal in the band-stop region, in digital signal processing it is said to have a slow roll-off. The Bessel
filters (Figure 8) also do not introduce any major ripples, but begin the roll-off well before the cut-off
frequency. This would cause the desired frequencies in that region to be attenuated, which may result in
missing some parts of the speech. The Chebyshev filter (Figure 9) has a sharp roll-off, but introduces
fairly significant ripples in the band-pass region. Similar to the Bessel filter, this could cause some of the
speech to be distorted or too weak for the speech recognition to work. The Type II Chebyshev eliminates
this, but does not efficiently attenuate the signal after the roll-off region.

15
Figure 7: Butterworth Filter [20]
Figure 8: Bessel Filter [20]

16
Figure 9: Chebyshev Filter [20]
7.1.4.3.3. Other Types of Filters
Bell filters, also called peaking filters, allow all bands to pass, but boast or attenuate frequencies around a
user chosen center frequency [21]. The result of applying the filter when seen on a dB vs. Frequency
graph is reminiscent of the classic “Bell Curve” seen in statistics. A Low-Shelf filter passes all frequency,
similar to the bell filter, but also boasts or attenuates the frequencies below the shelf frequency by a user
specified amount [22]. A High-Shelf filter does the same thing, but above the shelf.
7.1.5. Automatic Gain Control
Another method of manipulating a signal is by simply adjusting its gain. Adjusting the gain can increase
the signal strength of a weak signal or decrease the signal of a strong signal. A very simple way of doing
this digitally is by multiplying the signal by a scalar. A positive scalar will increase the strength and a
negative will decrease it. Making this process automated makes an automatic gain filter (AGC) [23]. An
AGC will automatically adjust the level of the signal to a user defined level. To do this the signal is
sampled finding the power of the signal, and then a scalar is calculated to change the signals strength to
the desired level. This comes in very handy in speech recognition since different speakers may speak at
different sound levels or from different distances. Incorporation of an AGC could potentially increase the
effective range of a speech recognition application.
7.2. Google Glass Alternatives
7.2.1. Vuzix M100 Smart Glasses
Designed to model a bluetooth headset, this display is meant to integrate with a smartphone and show
visual updates on Facebook, email, text messages, and much more. It cost $1,000, but is available right
now. This device is very similar to the Google Glass except in the shape of a headset instead of glasses
[24].

17
Figure 10: Vuzix M100 Smart Glasses [24]
7.2.2. Meta 3D Space Glasses
These glasses are designed to make a 3D virtual overlay on the user’s environment. It uses a 3D graphics
software called Unity 3D to program all of the applications it uses which right now includes chess,
Minecraft, and a 3D graphic design program [25]. It costs $667 which makes it much cheaper than the
Google Glass, and it ships in January; however, it has several drawbacks. First off, being goggles, it is not
very discrete in that it covers the whole front of the face. Also, for the primary application of this project
all of the 3D graphics are unnecessary. It only needs to be able to display text.
Figure 11: Meta 3D Space Glasses [25]
7.2.3. GlassUp
Much simpler than the Meta Space Glasses, GlassUp displays only text and basic diagrams on the screen
through the uses of a small projector. It is also much cheaper and only costs $400; however, it is supposed
to ship in February 2014, but this project might not be able to get one until July 2014 due to the
constraints of the GlassUp indiegogo campaign, which is far too late [26].

18
Figure 12: GlassUp [26]
7.2.4. SmartWatch/ Pebble/ Galaxy Gear
The Smartwatch works as the accessory of the smartphone. The main functions of Smartwatch are
notifications (calls, messages, and reminders), music control, fitting application (distance, speed, and
calorie) and safety application (track the missing phone). In the market, we considered three Smart
Watches as alternatives: Pebble, Sony SmartWatch and Samsung Galaxy Gear.
Figure 13: SmartWatch, Pepple, Galaxy Gear (left to right)
Table 1: Smart Watch Comparison
Product
Name
Price
(USD)
Platform
Compatibility
Input
Types
Dimensions
(mm)
Display (inch) Weight Battery
Life
Pebble [27] 150 iOS/Android buttons 52×36×11.5 1.26×144×168 38g 5~7
days
Smart
Watch [28]
200 Android Multi-
touch
36×36×8 1.6×220×176
OLED display
41.5g 1~7
days
Galaxy
Gear [29]
300 Samsung
Galaxy Note
III
Multi-
touch
55.9×38.1×10.2 1.63×320×320
Super
AMOLED
display
73.5g 1 day

19
As alternative system for Google Glass, Smart Watches are not a preferable choice: their displays are so
small that few words can be shown; they don’t have operating systems to run the software we need; users
don’t like staring at the watch while conversing.
7.3. Patent Search
7.3.1. Patent 1
Title: Captioning glasses [30]
Patent Number: US6005536 A
Publication date: Dec 21, 1999
Key Features:
1. Large Over the frame housing for projection device. Utilizing A projector bounced off a 45
degree mirror onto a partially reflective beam splitter.
2. Can only be mounted over one eye
3. Compatible with prescription lenses if integrated into the mount initially (made to order)
4. Display of subtitles for media or from dictation software (unspecified)
5. No mention of microphones.
7.3.2. Patent 2
Title: Translating eyeglasses [31]
Patent Number: US20020158816 A1
Publication date: Oct 31, 2002
Key Features:
1. Uses a “plurality of microphones” as little as two, specifying that they are all
omnidirectional.
2. Microphones are integrated into the glasses frame itself or can be mounted on the outside.
3. Utilizes a means of directionally filtering received sound using a “sound localization
algorithm” (sources noted)
4. Filtering software: SPHINX, or ASR.
5. Translator: SYSTRAN
6. Signal generator: Using either a prism system or screen
7.3.3. Patent 3
Title: Assistive device for converting an audio signal into a visual representation [32]
Patent Number: WO2013050749 A1

20
Publication date: Apr 11, 2013
Key Features:
1. Specifies at least one receiver, more specifically two microphones; one omnidirectional
and one single direction, used for source localization/ambient noise filtering.
2. Signal processing unit: Mobile phone (unspecified)
3. Converter: Speech recognition (unspecified)
4. Projection: Embedded Grating Structure, with an optical coating adjusted to specific band
width
5. Small projector in the arm or mounted next to the temple
6. Can modify existing prescription glasses
7.3.4. Effects on the project
These patents describe devices similar to the one this project describes. The key points that make this
project different from the patents stated above are that; this project describes modifications made to
existing devices to allow for the ability to turn speech to text. The device itself would be working under
its own patents and any “new” or “revolutionary” technique that is come up within the project would have
to be patented. As it stands, the processes that this project is using falls under common application of the
art and will not have to specifically patented. The result, if sufficiently different from the above patents,
specifically the facial recognition interaction, will have to be patented. The project has utilized these
documents as references for some of the design schemes mainly US20020158816 A1 Translating Glasses,
and similarities to the microphone setup from WO2013050749 A1. In the future, the project will also be
looking into translating capabilities much like Patent 2. However the final design will not be specifically
attributed to any one individual patent that was previously filed.
7.4. Current State of the Art
Currently there is only one known project that is attempting to complete the same project. That is the
SpraakZien (SpeechView in English) being developed by Leiden University Center for Linguistics [33].
Similar to the Google Glass project it is still in its initial phases and does not currently offer much
information besides saying that they are using an augmented reality glasses to display text that a computer
running dictation software is translating. In their initial news release on their website they state that they
are currently wireless and are looking to use a mobile phone for the computing in their next phase. They
were scheduled to release another paper December 2013 [33] but as of yet no paper has been found.
Another project that is similar to the Google Glass project is “Project Glass: Real Time
Translation...Inspired” which was done by Will Powell in the United Kingdom [34]. He created a near
real-time translation system using augmented reality glasses, mobile phones and Raspberry Pies. His
system is extremely accurate but takes around 10 seconds to process. It also requires both of the
individuals in the conversation to have microphones. The system also has limited portability due to the
requirements of the augmented reality glasses that he used (Vuzix 1200 Star glasses) which use a wired
connection to the computing platform. As it currently stands, there does not seem to be any indication that
he will be attempting to increase the portable of the system, or use it aid the hearing impaired. He was
contacted in November through his website email and did not respond.
The last system that could be seen as similar to this project is Sony’s Entertainment Access Glasses [35].
These glasses are currently being sold to movie theaters equipped with Sony’s projectors and show
subtitles for the movie the wearer is watching. It is completely wireless and uses a special holographic
method for displaying the text. The main disadvantage of the glasses is that it does not use real time
dictation to generate the subtitles, but instead the subtitles are pre-programmed for the movies.

21
7.5. Reverse Engineering
Currently there is only one system, Will Powell’s system, which has released enough information to be
useful for reverse engineering. From what he released on his website it gave the group the idea to look
into using the Raspberry Pi as a computing platform and gave some ideas as to what dictation software to
use. It is recommended that when Leiden University releases more information on their project that it
should be evaluated to see if any additional information can be gleaned off of it.

22
8. RECOMMENDED SOLUTION
8.1. Solution Principles
In order to get a better understanding of the project and what needs to be done a system level functional
flow block diagram was implemented (Figure 14).
From this the team held a brainstorming sessions on how each block would be addressed. Most of the
blocks, like capture audio, had limited options when it came to which device to use. Other blocks, like
filter audio, had many options that can work in series with one another. The final concept that was
selected is presented in Figure 15.
Figure 14: Functional Flow Block Diagram

23
This lead to the group breaking the group into subgroups to address the different blocks.
8.2. Analysis
8.2.1. Requirements
Before choosing a microphone for this application, a set of numerical requirements needed to be
developed. The main criteria needed are; maximum sound pressure level, frequency range, sensitivity,
size, and weight.
For the maximum sound pressure level, since it is not as critical of a specification for this project, a
general range was estimated based on some comparisons to real world applications. A normal
conversation at three feet is approximately 60 dB of sound intensity, whereas a jackhammer or propeller
aircraft is around 120 dB [36]. An example of something in between would be a heavy machine shop or a
subway train at about 100 dB [36]. Using these numbers for comparison, the maximum sound pressure
level for this application was chosen to be a5oximately 100 dB to avoid audio clipping and still capture all
the necessary sound while disregarding the extremely loud sounds that can be assumed as not part of the
conversation.
The frequency response range was chosen based on the minimum and maximum frequency range for
human speech. Humans can hear sounds in the range of 20Hz to 20 kHz [36]; however, human speech is a
little more limited in range. The most fundamental range for human speech intelligibility is from 300Hz
to 3.4 kHz [36]. Although this range is ideal for capturing human speech, the ranges for singing and their
harmonics are from 100Hz to 16 kHz [37]. By using this larger frequency band, the system is not limited
Figure 15: Final Concept

24
to just human speech which greatly increases its future expandability. The firm requirement is a minimum
frequency response of 300Hz to 3.4 kHz, but anything larger than that would be preferable.
The only real requirement of the microphone based on sensitivity is that the microphone is sensitive
enough to at least pick up the sound from the required distance. To get some reasonable numbers for the
sensitivity required by the microphone, the sound level of a conversation at 15 ft. needed to be found.
Using the relationship that the sound level of a source decreases by 6 dB for a doubling of distance [36]
and that the sound level of normal conversation at 3 ft. is 60 dB [36], the sound pressure level at 15 ft.
should be about 45 dB.
60 𝑑𝐵 − 6 𝑑𝐵 ∗ 15 𝑓𝑡 / 3 𝑓𝑡 ∗ ½ = 45 𝑑𝐵
This number gives ballpark estimate of the sensitivity value required by the microphone as -45 dB. Since
the sensitivity values are fairly arbitrary, a range around -40 dB was chosen to be conservative. The
weight and dimensional requirements were estimates based on the dimensions of the Google Glass itself.
It was decided, that the microphone should not weigh more than the weight of the glasses. The weight of
the Google Glass is 50g and therefore the weight of the microphone should not exceed 50g. As for the
dimensions of the microphone, the group looked at the dimensions of the Google Glass and approximated
a nominal size (approximately 3in x ¾ in x ¾ in) to be attached to the outside of the glasses without
interfering with the total design.
Table 2: Microphone Requirements
Figure 16: Approximate frequency ranges [37]

25
Requirement Value
Minimum Frequency Response Range 300 Hz - 3.4 kHz
Max Sound Pressure Level ~100 dB
Max Dimensions 3in x ¾ in x ¾ in
Max Weight 50g
Sensitivity -38 to -42 dB
8.3. FMEA
A key aspect of refining ideas is trying to predict future problems that may arise that will need to be
addressed. To accomplish this, the team deployed a failure modes effects analysis (FMEA) to determine
future problem areas. A list of components that are currently known and ways they can fail were
generated. Complete failure tags were given to items that are key to the system working which if that item
were to fail the complete system would also fail. Component failures are given to items that are optional
or extra to the system that do not affect the over system performance. The chance of occurrence (10 being
extremely likely to occur), the severity (10 being extremely severe) and chance of detection (10 being
unable to be detected) were evaluated for each of the failure modes. Since there is no real threat of bodily
harm in this project high severity means that the item is tied to the critical items in the system and could
not be quickly fixed by the user. If when the product of the occurrence, severity and detection was 200 or
greater, then there would need to be an action to remedy the problem. Although these problems were not
addressed, if the project were to further developed for the consumer they would be addressed.

26
Table 3: FMEA
FMEA
Google Glass Project Group: Google Glass Group (G^3)
O: Chance for Occurrence; S: Seriousness, D: Chance of Detection
Function/Item Potential Failure
Potential Reason
for Failure
Effects of
Potential Failure
O S D Score Action
Microphones
Short Out
Exposure to
Water Complete Failure 5 9 5 225
Water Proof
Microphones
Power Surge Complete Failure 1 9 10 90
Break Being Dropped Complete Failure 1 5 10 50
Sound Filter
Data Corruption
Virus
Component
Failure 1 5 10 50
Program Bug
Faulty
Programming
Component
Failure 6 5 5 150
Dictation
Software
Data Corruption Virus Complete Failure 1 9 10 90
Program Bug
Faulty
Programming Complete Failure 6 6 5 180
Display
Short out
Exposure to
Build
waterproofing
enclosure
Break
object being
dropped Complete Failure 2 9 10 180
Data Storage
Data Corruption
Exposure to
magnets
Component
Failure 1 5 10 50
Unmounted
during writing
Component
Failure 4 5 5 100
Break
Being Dropped
Component
Failure 1 5 10 50
Computing
Platform
Break Overheating Complete Failure 4 8 7 224
Incorporate
Heat Sink
Being Dropped Complete Failure 1 8 10 80
Short Out
Exposure to
Water
Proof/resistant
enclosure

27
8.4. Decisions
8.4.1. Platform Decision
When it came to selecting a computing platform to run the dictation and filtering software on, a
preliminary search of available and reasonable platforms was generated: laptop, tablet, cell phone,
Raspberry Pi. These platforms were then evaluated using a decision matrix (on a 1-5 scale with 5 being
the best) based on criteria that were derived by our system requirements and what the system needed to
do. The criteria are as follows:
8.4.1.1. Portability
Portability is defined here as the overall dimensions, weight and battery life of the platform. The
Raspberry Pi and cell phone received 5’s in this category due to their small size and being light
weight. Both also had all day battery life, the Raspberry Pi requires the purchase of a battery pack
though. The tablet got a 3, because depending on the type they are typically have the same size display as
a small laptop, but are fairly light and have good battery life. The laptop received a 2, because compared
to the other devices they are heavy and have poor battery life.
8.4.1.2. Cost
Cost is defined here as the complete cost of the platform. Due to the cost of purchasing a pair of
augmented reality glasses or similar display device, the budget for the computing platform is tighter. The
Raspberry Pi received a 5 because it has a low cost (under $100). Cell phones also received a 5 due to the
fact that most potential users already owning a smartphone and due to having a cost ranging from free to
a5oximately $600 depending on whether the phone was bought through a carrier. Tablets received a 3 due
to its purchase price ($200-$1000 depending on brand and specifications). Laptops received a 1 do to
their high cost ($300+) for a decent laptop.
8.4.1.3. Programmable
Programmable is defined as the platforms ability to either run the required dictation and filtering software
natively or to be able to program applications for the platform capable of doing it. Laptops got a 5 in this
category due to their native ability to support programming and software and can use a wide array of
operating systems. The RPi got a 4 because of its native ability to support programming and software, but
had fewer options in operating systems. Tablets and Cell phones both received 4’s for similar reasons to
the RPi.
8.4.1.4. Display
Display here refers to the ability of the platform to display to an external screen. The Laptop and RPi both
received 5’s because they natively support wired video out options and can use software to wireless
output video. Cell phones and tablets received 4’s because they do support wired video out options, but
typically require special adapters. They also support wireless video output, but have few options of what
they can display to.

28
8.4.1.5. Microphone
Microphone refers to the ability of the platform to have a microphone connected to it. Laptops received a
5 since it has the ability to accept both USB and 3.5 mm microphones without the need for converters.
The Raspberry Pi received a 4 because it can accept USB microphones without converters and can accept
3.5 mm through the use of an external sound card. Cell phones and tablets received 3’s because they can
accept microphones, but require special microphones or require knowledge of programming to use non-
specialty microphones.
8.4.1.6. Evaluation
Table 4: Platform Decision Matrix
Platform Decision Matrix
Platform
Tablet Cell Phone Laptop
Raspberry
Pi
Criteria
Portability 3 5 2 5
Cost 3 5 1 5
Programmable 4 4 5 4
Display 4 4 5 5
Microphone 3 3 5 4
Total 17 21 18 23
From the decision matrix the top two devices (RPi and Cell Phone) were selected for further tests. After
developing a working speech to text app for the RPi, it proved to have insufficient processing power for
the project and was abandoned; this is discussed in greater depth in Problems with Using RPi.
8.4.2. Display Platform Decision
8.4.2.1. Criteria
When it came to selecting a display device the programming team got together to brainstorm criteria. The
main ones were:
 Available Now
o The AR glasses are fairly new to the consumer market there are few of them that are
either currently on the market or will soon be on the market. Since this is a time
sensitive project the team decided that the AR glasses selected would need to be “in-
hand” by the beginning of the spring semester. This criteria greatly reduced the
amount of options
 Developer Support/Base
o To aid in the development of the program the display, it had to have a strong
developer support and developer base to ensure that there would be enough
documentation available to learn how to program for the selected device.

29
8.4.2.2. Devices Reviewed
For this section the following format will be used:
 Name of Device, Manufacturer
 Cost (if known)
 Main Attribute
 Why it was ruled out
 Manufacturer’s Website
At the time that the survey of devices was done, the following devices were found.
 Jet, Recon
o $599
o Built for sports enthusiast (theoretically very durable)
o No concrete release date, the manufacturer only said “spring 2014”
o http://www.reconinstruments.com
 Epiphany Eyewear, Epiphany Eyewear
o $399
o Only used for recording video
o No textual display
o http://www.epiphanyeyewear.com
 GlassUp, GlassUp
o $299
o Simple design
o No official release date
o http://www.glassup.net
 Telepathy One, Telepathy
o Unknown
o Smallest form factor of all options
o No official release date
o http://tele-pathy.us
 Meta 0.1, Meta
o $667
o Holographic overlay
o No official release data
o https://www.spaceglasses.com
 ION Glasses, ION Glasses
o $100
o Indicates when messages are received
o No textual display
o http://www.ionglasses.com
 ORA-S, Optinvent

30
o $949
o Bright see-through display
o No concrete release date (Sometime in March)
o http://optinvent.com
 Oculus Rift, Oculus VR
o $300
o Complete VR system for computer gaming
o Pure virtual reality, no ability to see outside world
o http://www.oculusvr.com
 SmartGoggles, Sensics
o Unknown
o Complete VR system for entertainment
o Currently only available to governments and manufacturers
o http://smartgoggles.net
 M100, Vuzix
o $1000
o Compact size
o Does not have the developer documentation/base that the Google Glass has
o http://www.vuzix.com/consumer/products_m100.html
8.4.2.3. Ideal Selection
If the time constraint was not a factor the Ideal choice was the Meta for this project, because it would
have increased the user friendliness by having the words appear on or near the individual speaking them
limiting eye fatigue of the user.
8.4.2.4. Device Selected
The team decided to still go with the Google Glass because it is one of only two options that are currently
available with an adequate display. Also, due to its popularity and its manufacturer, it has strong
developer support. If the team does not get the Google Glass, then the team will purchase a Vuzix M100,
since it is currently the only other device capable of displaying.
8.4.2.5. Computing Platform Refinement
The use of the Google Glass or Vuzix M100 for the display limited the computing device to any bluetooth
enabled phones [38]. Currently there are four main operating systems for smart phones: Android, iOS,
Windows, and Blackberry OS. The team limited the selection to the most common phone OSs: Android
and iOS [39] [40]. Both OSs have plenty of documentation and support, but Google offers a Glass SDK
add-on for their Android SDK [41]. Similar support for the iOS was not found. This lead the team to
select the Android platform.

31
8.4.3. Microphone Type Decision
The decision on which microphone type the group is going to implement in the project was a major
milestone for the group’s progress. Though some of the decisions of sorting out microphones were easier
than others. For example, while the fiber optic microphones would fit all the criteria for the project,
however it was too expensive. The ribbon microphone was not suitable for mobility and would not make
a good fit if the microphone would have been mounted on the glasses.
The hardest decision was to decide between condenser microphones and the dynamic microphones. Both
of those microphones made it to the final consideration. The speech recognition of a dynamic microphone
is better than the speech recognition of a condenser microphone and for purposes such as focusing on one
person and eliminating surroundings sounds, the dynamic microphones is also better than the condenser
microphone. As it turns out, while the microphone research group looked for suitable microphones, the
majority of microphones found were electret condenser microphones. This lead to the decision that not
only the electret microphones are much cheaper than other microphones, and produced to be included in
mobile electronic devices, such as cell phones, the electret microphones would be the optimal choice of
microphone for this project.
8.4.4. Microphone Final Decision
8.4.4.1. Evaluation
After developing microphone requirements and researching microphones, a list of microphones was
compiled and then compared with these requirements. This list was then narrowed down based on the
requirements to a top four. The results of a decision matrix on the top four are shown below.
Table 5: Microphone Decision Matrix
Criteria Weight SP-MMM-1 MT830R CME-1538-100LB POM-3535L-3-R
Cost 7 4 1 9 10
Sensitivity 5 10 8 7 9
Max pressure 1 7 10 8 9
Size 2 7 8 10 9
Frequency 3 10 10 10 10
Total (unweighted) 38 37 44 47
Total (weighted) 129 103 156 172
According to this matrix the POM-3535L-3-R is the best choice out of these four microphones for this
application. These top four microphones, the original list of microphones, and the requirements developed
in the previous section were all brought to the Electronic Support Lab where David Beavers found similar
microphones and ordered them.

32
8.4.5. Filter Decision
The decision of what category of filter to use was based on processing time constraints, the group’s
programming ability and the difficulty of implementation. When it came to the decision of frequency or
time domain, it came down to the ease of implementation and the programming skill of the group. For the
frequency domain, doing FFT would be required and android does not currently natively support FFT.
This meant the group would either need to program it themselves or find a source [42] which would be a
hurdle in the programming aspect; it would also increase the code length and processing time due to the
extra steps. Time domain filtering, on the other hand, only required the implementation of a difference
equation and did not require any FFTs. Based off of the research, an IIR filter was found to be quicker
and less expensive computationally than the FIR and was therefore chosen. Based off research, Linear,
time-invariant filters did not introduce new spectral components (artifacts from the filtering process) [15]
and were therefore chosen. For the particular types of filters to implement, the Butterworth Low-Pass
Filter and Low Shelf Filter were selected. The Butterworth LPF was chosen because it did not introduce
any major ripples and had a cleaner (defined as closer to the cut-off frequency) roll-off than the Bessel
filter. A low-pass filter was selected since the lower end of the speech range is very narrow and the bulk
of the unwanted frequency would be between the upper speech frequency and the Nysquist limit. The low
shelf filter was selected based off the need to increase the gain of the audio in the speech frequency range
to increase the overall range of the speech recognition.
8.5. QFD
In order to determine the functional requirement, their importance and any interdependencies in those
requirements a quality function deployment was used. The four key stakeholder requirements that were
analyzed were: Real time speech to text, mobile, safety and user friendly. The most important of these
was safety, given a 10, since this will be a consumer product and should not harm the user. The next most
important was the real time speech to text since and the product being mobile, both given 9’s, these are
the essence of the project and must be done in order for the project to be deemed a success. The least
important of the requirements was the user friendliness, given a 4. Although being user friendly is
important for any consumer product it did not have the same import as the other requirements. Functional
requirements were sought as to how, from an engineering standpoint, the stakeholder requirements could
be solved and their dependency and correlations were evaluated both with regard to the stakeholder
requirements and to each other. The template that was used did the mathematics to determine which
requirements were more important than others based on their dependency on the stakeholder requirements
and the weight of the stakeholder requirements. The correlations that are in the QFD shows that the
weight and size are negatively impacted by the weather resistance and battery life. This is due to the fact
that in order to increase battery life, the battery must typically also increase in size and weight. The same
is believed for the weather resistance attribute.
The QFD was also used to benchmark the current prototype against the currently known
competitors. Fields that are blank do not have the required information, due to not being release by the
competitor, to make reasonable estimates.

34
It can be noted from the QFD that the importance percentage of the requirements are fairly evenly split.
The only exceptions to this are the settings requirements to address being user friendly. Due to the latency
time of Mr. Powell’s system it received low marks in “Real-Time Speech to Text” whereas the team’s app
received high marks for being near instantaneous when connected to a strong data connection. For the
usability Mr. Powell’s and SpeechView both received low marks 2 and 3 respectively. For SpeechView
this is due to its dependency on a laptop for its computing and for Mr. Powell’s system it for the amount
of devices and wires required to work. The Glass received high marks on this due to being completely
housed on the Glass and can use a smartphone for a data connection should Wi-Fi not be present. For
safety the competition did not give enough information to make an informed estimate of their ratings,
however the teams project received high marks again since they only added code which does not interfere
with the Glasses hardware and therefore should be no more of a safety concern than any other Glass app.
Finally for user friendly, both the team’s product and Mr. Powell’s received a median mark since both
present the text in a similar and user friendly way, but do not incorporate more in-depth features. From
this benchmarking it can be seen that the current Glass app is ahead of the current known competition.
8.6. Bill of Material
The team divided the budget into the following categories (Table 6) based on the FFBD and Concept.
Table 6: Budget Allocation
Budget Allocation
Category Allotted
Google Glass/Display $1,500
Microphones $500
Software $250
Testing Equipment $100
Printing $100
Computing Platform $250
Cushion Fund @ 10% $300
Total $3,000
From the allotted budget the team purchased a RPi and related items to try as a computing platform, a
microphone to try with an Android phone, and the Google Glass and its associated cables and accessories.
The items and their costs can be found in Table 7 below. The only item that was over budget was the
Google Glass due to its accessories, but this extra cost was absorbed by not needing some of the other
items.
Table 7: Expenses

35
Expenses
Item Price Category
Microphone $93 Microphone
Raspberry Pi (RPi) $35 Computing Platform
SD Card for RPi $7 Computing Platform
RPi Cables $16 Computing Platform
Google Glass w/Accessories $1,918 Google Glass/Display
Total $2,069

36
9. PROGRAMMING
9.1. Programming for the Google Glass
The Google Glass uses the android operating system which employs Java as its main programming
language. Google is the creator of the android operating system and provides a complete open-source
database of programming functions and development tools on the android developer website [43]. The
android developer website also has instructions on how to begin programming in android and free access
to the Android Development Tools (ADT) which is a plug-in for the Eclipse programming IDE
(integrated development environment) [44]. Google also has a complete online guide to creating
applications for the Google Glass on the Glass Developer website [45].
Three of the main structural components when programming an android application are activities, intents,
and processes. Just like their names suggest, activities are where the user does something, intents get
interpreted by the system and then perform an action, and processes are the execution structure of an
application. Activities have a graphical component or window in which to create an interface that the user
interacts with [46], whereas, intents are entirely behind the scenes calling on the system or other activities
(Figure 18) [47]. Whenever an application begins, it creates a Linux process with one thread of operation
that performs all of the application tasks in series [48]. To perform multiple tasks at the same time,
multiple threads need to be created.
Figure 18: Implicit intent [47]
When using these tools to program for the Google Glass, a programmer follows the same principles as
programming for a mobile device but with a few distinct differences. For instance, the main method for
input on mobile devices is the touch screen. However, the Google Glass does not have a touch screen;
instead it uses a combination of voice commands and touch gestures on a directional pad (D-pad), such as
swiping forward, backward, down, or tapping.
Another way the glass differs from normal mobile development is in its use of the timeline. When starting
the Glass, the first screen visible is the “Ok Glass” menu. This is where the user can start applications
with the voice commands by saying “Ok Glass, <voice trigger>” or touch gestures by tapping and

37
swiping. When on the “Ok Glass” menu, swiping to the right will allow the user to see past conversations
or other usage history, and swiping to the left brings up current items or settings.
When coding for this type of structure the programmer has several decisions that need to be made. First is
whether to use the Glass Development Kit (GDK) or the Mirror API. Applications created with the GDK
are implemented on the Glass itself and use the Glass hardware. This allows applications to access the
hardware components on the Glass such as the built in microphone or the camera. GDK applications are
also stand alone and do not require access to a network in order to function. The Mirror API functions
quite different from the GDK. None of the code for applications made with the Mirror API runs on the
Glass. Instead, it runs on an external system and then communicates with the Glass to perform its
functions. The benefit of this is that the Mirror API has platform independence and more built in
functionality; however, a network connection is needed [49].
The GDK has several different options for application structure including static cards, live cards, and
immersions. Static cards show up to the right of the “Ok Glass” menu in the past section of the timeline.
This type of interface is not updated and remains unchanged over time. It is mainly used for things like
message updates, search results, or anything else that does not change after its initial function is
completed. The purpose of this project is to add live subtitles to conversation in order to help the hearing
impaired. This requires that the interface is updated continuously, which means that static cards are not a
viable option. Live cards and immersions, however, are updatable but function differently from each
other. Live cards are in the timeline to the left of the “Ok Glass” menu and can be navigated to and from
without closing them or impacting the use of the Glass as a whole. This is accomplished through the use
of a live card service that runs in the background, the results of which get printed to the card on the screen
(Figure 19). A service is similar to an activity except it does not have a visual component and runs
entirely behind the scenes.
Figure 19: Live Card Operation [50]
Live cards come in two different varieties: low frequency and high frequency (Figure 20). The low
frequency cards update on the order of seconds (e.g. 1 Hz) by using a remote view created by the service
and then printing the remote view to the live card. High frequency live cards use drawing logic that
refreshes many times a second (e.g. 60 Hz) instead of a remote view and then prints to a buffer, referred
to as a backing surface, before pushing the results to the card [50].

38
Figure 20: Low Frequency (Left) and High Frequency (Right) Live Cards [X8]
Live cards do have some limitations, so a different format is needed for some of the more menu intensive
and interactive applications. An immersion does not use the timeline; instead, it takes control of the Glass
and must be exited to access the other Glass functions (Figure 21) [51]. This allows for much more in
depth applications and complex functions.
Figure 21: Immersion [51]
9.2. Programming Process
The following process was used when coding for the Google Glass.
1. Identify application needs
2. Research coding method
a. Find what functions are needed and where they are implemented
3. Write the code in Eclipse using the ADT
4. Revise and check for errors
5. Debug the code
a. Research the problem if necessary
6. Export the code to the Google Glass
a. Generates an Android Package (APK) that is installed on the Glass (Figure 22)
b. Repeat steps 4-5 as needed

39
Figure 22: Exporting Code to the Glass [52]
9.3. Generated Applications
Initially we tried using a low frequency live card as the structure of the Live Subtitles application, but it
had some limitations. It achieved the basic structure and function of the app, but it could only do single
iterations of the speech recognizer. Also, the WakeLock function that maintains control of the screen
could not be used in the live card. To resolve these issues a second application was created using an
immersion instead of the live card. The immersion solved the issues with the WakeLock and the non-
continuous problem.
9.4. Application Operation:
From the main menu say “Ok Glass, Live Subs” to launch the application (Figure 23).
Figure 23: Launching Application

40
The app launches and begins to listen to speech continuously printing the results to the screen (Figure 24).
Figure 24: App Listening
The user can swipe down on the touchpad to close the speech recognizer and review the conversation
after saying “history mode” in the speech recognizer (Figure 25).
Figure 25: Review of Conversation
The user can swipe down to enter the history mode (Figure 26)
Figure 26: Text History

41
Tap to return the speech recognition mode
Figure 27: Return to Speech Recognition
Tap to pause or resume the speech recognition mode.
Figure 28: Tap to Enter Speech Recognition Mode
Say “delete file” in Speech recognition mode in order to the clear the text history.
Figure 29: Deleting Files

42
9.5. Android Smartphone Programming
9.5.1. Speech to Text
The speech to text app was developed to allow testing of some of the speech-to-text and file generation
techniques, methods that are common between the cellphone and Glass. This allowed the programmers to
work in parallel with one another and test preliminary code without needing to be in possession of the
Glass. The code itself was taken from a tutorial on YouTube provided by AndroidDev101 [53] The
original code utilized the “Voice Typing” feature that Android introduced in 2011 [54] to continuously
listen to the speaker for voice commands, translate the voice commands to text, compare the text to
existing commands, and then perform the command’s action. The code was altered to remove the voice
commands and to simply listen for speech, convert that speech to text, and print the text to the screen.
Next a speech history feature was added to the program which continuously wrote the text from the
speech to a file which is stored on the device and can be read by the user on the device or taken off the
device via a USB cable. This was accomplished by using the FileWriter [55] and BufferedWriter [56]
methods that are built into Android. The android phone Speech-to-Text code can be found in Appendix H.
9.5.2. Filter Tester
At the time the filter research and coding began the Glass app had not been completed, so the need arose
to have an app for android that could test the implementation of the filters. From this need the Filter
Tester app was conceived. The app’s purpose was to allow for implementation and testing of filters that
could, in theory, be ported directly to the Google Glass to be implemented into the final Glass app. The
first goal was be able to capture and store audio in a .WAV form. This was accomplished by starting with
the open source code provided by Krishnaraj Varma through their blog [57] and Google Code [58]. Next
the code was modified to include a function for applying filters, the filters were based on the following
sources: Bristown-Johnson [17], Smith [15], and Fisher [20], which were discussed in the filtering
section. In addition to the filters, a simple gain algorithm, a scalar being multiplied by the signal, was
applied. The Filter Tester code can be found in Appendix I. An automatic gain control was coded in
MATLAB and provided in the Appendix J, but was not implemented into the code due to time
constraints. Although the purpose of this app was to be able to port the code to the Glass to be
implemented into the Glass app, the need to record the audio, however briefly, has made it incompatible
with the current speech to text.
9.5.3. Microphone Implementation
To implement an external microphone on Google glass requires 3 conditions to be met:
1. A USB microphone
This is easy to achieve, since USB powered microphones are readily accessible for computers and the
only item needed to connect it would be a USB to micro-USB converter.
2. Android 3.1 above with kernel compiled with USB-host support (Google Glass works on Android
4.0.3 and kernel comes with USB-host support) [59]
To make the microphone work with the Google Glass requires the Glass to act as a USB host, which
needs the Linux kernel to support it when the kernel is compiled. Android OS will also need to support
mounting USB device to the glass. According to the document, android 3.1 and above will do the job.
Google glass comes with Android 4.0.3, thus it also satisfy this condition.

43
3. Driver of the microphone for android
This is the trickiest part of this topic. To make a hardware work, software is needed that supports it; in
this case, it is the driver. It seems that Google has done some modification on the Android OS on Google
glass [60], which makes driving an unknown USB device very difficult. It is very hard to tell if a specific
model of USB microphone will be supported unless you tried it. If it does not work out of box, heavily
modification to the Linux kernel (which is also the kernel of android) will be needed. This process needs
deep expertise in driver development and it can brick the whole device very easily.
9.6. Programming the RPi
Linux is a very common open source OS and Raspbian, a modified form used on the RPi, is gaining
popularity. For this reason documentation on programming is relatively easy to come by. The first hurdle
for using the RPi was getting an USB microphone to work with it. This just required the changing of
some core setting. The next task was to install Carnegie Mellon University's (CMU) PocketSphinx
dictation software and luckily a basic guide was found which provided a step by step procedure. The last
task was to get the RPi to output its results to a remote display. This was accomplished through using
Secure Shell (SSH), a secure data communications program that uses the internet. The RPi natively
supports SSH and this was used to communicate with any other display/device that also can use SSH. A
RPi programming logbook was used to keep track of what was done and what sources were used and is
found in the appendix. Ultimately the RPi was eliminated as a computing platform due to the problem,
which are stated below.
9.6.1. Problems with Using RPi
9.6.1.1. Accuracy
Currently the dictation software uses CMU’s SphinxBase, a speech dictionary used by the dictation
software to know what sound relates to which word. SphinxBase is fairly limited and when the dictation
software hears words that are not in that dictionary it finds the next closest thing. This results is an
inaccurate output. This problem should be fixed by either using a more robust speech dictionary or
creating one through CMU’s SphinxTrain software. Another source of inaccuracy lies in the
settings. These setting would be optimized through trial and error.
9.6.1.2. Latency
The delay between the software hearing the words and outputting the text is also quite significant (often
15-20 seconds). This is a far cry from the project goal of “real-time.” To fix this some core setting of the
program would need to be optimized through trial and error, but due to the processor limitations the
latency will more than likely still be unacceptable.
9.6.1.3. Internet Dependency
Since the current system uses SSH to communicate with a remote display it is dependent on having an
internet connection. Although this does not violate any current requirements, it does limit the mobility of
the system and therefore will be addressed. The easiest remedy for this problem is to use a wired
connection. Other solutions that were under consideration are using Bluetooth and direct wifi.

Google Glass Project

Recomendados

Recomendados

Mais conteúdo relacionado

Destaque

Destaque (20)

Semelhante a Google Glass Project

Semelhante a Google Glass Project (20)

Google Glass Project