The document discusses how to conduct a comparative evaluation of automatic speech recognition (ASR) systems. It outlines selecting a representative evaluation data set, determining human transcriptions, running each ASR engine at optimal settings, and comparing results to measure accuracy. Key steps include choosing a data set with at least 100 recordings per grammar path, transcribing the data, building a test harness to run each ASR engine, analyzing possible error types (missilences, misrejections, misrecognitions), and evaluating errors by running engines at different rejection thresholds to find optimal values. The document demonstrates running two ASR engines (A and B) on the same data set and comparing their accuracy curves.
Comparative ASR Evaluation - Voxeo - SpeechTEK NY 2010
1. Comparative ASR
evaluation
Dan Burnett
Director of Speech Technologies, Voxeo
SpeechTEK New York
August 2010
2. Goals for today
• Learn about data selection
• Learn all the steps of doing an eval by
actually doing them
• Leave with code that runs
3. Outline
• Overview of comparative ASR evaluation
• How to select an evaluation data set
• Why transcription is important and how to
do it properly
• What and how to test
• Analyzing the results
4. Comparative ASR
Evaluation
• How could you compare ASR accuracy?
• Can you test against any dataset?
• What settings should you use?
The optimal ones, right?
5. Today’s approach
• Choose representative evaluation data set
• Determine human classification of each recording
• For each ASR engine
• Determine machine classification of each
recording at “optimal” setting
• Compare to human classification to determine
accuracy
• Intelligently compare results for the two engines
6. Evaluation data set
• Ideally at least 100 recordings per grammar
path for good confidence in results (for large
grammars the minimum may rise to 10,000)
• Must be representative
• Best to take from actual calls (why?)
• Do you need all the calls? Consider
• Time of day, day of week, holidays
• Regional differences
• Simplest is to use every nth call
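As a rough sketch of the "every nth call" idea (the calls.txt file name and the sampling interval below are made up for illustration):

  <?php
  // Sketch: keep every nth recording from a list of call audio paths,
  // one path per line in a hypothetical calls.txt.
  $n = 10;
  $calls = file('calls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  foreach ($calls as $i => $path) {
      if ($i % $n === 0) {
          echo $path . "\n";   // selected for the evaluation set
      }
  }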
7. Lab data set
• Stored in all-data
• In “original” format as recorded
• Only post-endpointed data for today
• 1607 recordings of answers to yes/no
question
• Likely to contain yes/no, but not guaranteed
8. Transcription
• Why is it needed? Why not automatic?
• Stages
• Classification
• Transcription
9. Audio classification
• Motivation:
• Applications may distinguish (i.e. possibly behave
differently) among the following cases:
Case -- Possible behavior
No speech in audio sample (nospeech) -- Mention that you didn’t hear anything and ask for repeat
Speech, but not intelligible (unintelligible) -- Ask for repeat
Intelligible speech, but not in app grammar (out-of-grammar speech) -- Encourage in-grammar speech
Intelligible speech, and within app grammar (in-grammar speech) -- Respond to what person said
11. Lab 1
• Copy yn_files.csv to yn_finaltrans.csv and edit
• For each file, append category of nospeech,
unintelligible, or speech
• Example: all-data/.../utt01.wav,unintelligible
• Append transcription if speech
• Example: all-data/.../utt01.wav,speech,yes
• Transcription instructions in transcription.html
• How might you validate transcriptions?
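One lightweight validation pass, sketched below, just checks that every line of yn_finaltrans.csv uses one of the three categories and that speech lines carry a transcription; the column layout is taken from the examples above. A stronger check would be to have a second person re-transcribe a sample and compare.

  <?php
  // Sketch: sanity-check yn_finaltrans.csv lines of the form
  //   <wavpath>,<category>[,<transcription>]
  // where category is nospeech, unintelligible, or speech.
  $valid = array('nospeech', 'unintelligible', 'speech');
  $lines = file('yn_finaltrans.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
  foreach ($lines as $lineno => $line) {
      $fields = str_getcsv($line);
      $category = isset($fields[1]) ? $fields[1] : '';
      if (!in_array($category, $valid)) {
          echo "Line " . ($lineno + 1) . ": unknown category '$category'\n";
      } elseif ($category === 'speech' && empty($fields[2])) {
          echo "Line " . ($lineno + 1) . ": speech with no transcription\n";
      }
  }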
12. What and how to test
• Understanding what to test/measure
• Preparing the data
• Building a test harness
• Running the test
13. What to test/measure
• To measure accuracy, we need
• For each data file
• the human categorization and transcription,
and
• the recognizer’s categorization, recognized
string, and confidence score
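The exact column layout of the harness output is defined by the lab scripts, but one plausible row combining these fields (audio file, human category, human transcription, ASR classification, recognized string, confidence) might look like:

  all-data/.../utt01.wav,speech,yes,recognized,yes,87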
14. Preparing the data
• Recognizer needs a grammar (typically from
your application)
• This grammar can be used to classify transcribed
speech as In-grammar/Out-of-grammar
15. Lab 2
• Fix the GRXML yes/no grammar yesno.grxml in
the “a” directory
• Copy yn_finaltrans.csv to yn_igog.csv
• Edit yn_igog.csv and change every “yes” or
“no” line to have a category of “in_grammar”
(should be 756 yes, 159 no, for total of 915)
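For reference, a minimal SRGS GRXML yes/no grammar might look like the sketch below; the lab's actual yesno.grxml (and whatever needs fixing in it) may differ.

  <?xml version="1.0" encoding="UTF-8"?>
  <grammar xmlns="http://www.w3.org/2001/06/grammar"
           version="1.0" xml:lang="en-US" mode="voice" root="yesno">
    <rule id="yesno" scope="public">
      <one-of>
        <item>yes</item>
        <item>no</item>
      </one-of>
    </rule>
  </grammar>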
16. Building a test harness
• Why build a test harness? What about
vendor batch reco tools?
• End-to-end vs. recognizer-only testing
• Harness should be
• generic
• customizable to different ASR engines
17. Lab 3
• Complete the test harness harness.php
(see harness_outline.txt)
• The harness must use the “a/scripts” scripts
• A list of “missing commands” is in
harness_components.txt
• Please review (examine) these scripts
• FYI, ASR engine is a/a.php -- treat as black box
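Since harness_outline.txt and the a/scripts helpers aren't reproduced here, the following is only a generic sketch of the shape such a harness might take; the a/a.php command line and the results.csv layout are assumptions, not the lab's actual interfaces.

  <?php
  // Generic batch-recognition harness sketch.
  // Usage: php harness.php <scriptsdir> <datafile> <rundir>
  list(, $scriptsdir, $datafile, $rundir) = $argv;  // $scriptsdir: helper scripts (unused in this sketch)
  @mkdir($rundir, 0777, true);
  $out = fopen("$rundir/results.csv", 'w');
  foreach (file($datafile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
      $fields = str_getcsv($line);                  // wav path, human category, transcription
      $wav = $fields[0];
      // Run the engine on one file and capture its raw output (interface assumed).
      $reco = trim(shell_exec('php a/a.php ' . escapeshellarg($wav)));
      // Record the human labels and the recognizer output side by side.
      fputcsv($out, array_merge($fields, array($reco)));
  }
  fclose($out);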
18. Lab 4
• Now run the test harness:
• php harness.php a/scripts <data file> <rundir>
• Output will be in <rundir>/results.csv
• Compare your output to “def_results.csv”
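A quick way to compare the two files is an ordinary line-by-line diff, for example:

  diff <rundir>/results.csv def_results.csv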
20. Possible ASR Engine
Classifications
• Silence/nospeech (nospeech)
• Reject (rejected)
• Recognize (recognized)
• What about DTMF?
21. Possible outcomes
(Rows: human classification. Columns: ASR result.)
                      ASR nospeech                    ASR rejected          ASR recognized
nospeech              Correct classification          Improperly rejected   Incorrect
unintelligible        Improperly treated as silence   Correct behavior      Assume incorrect
true out-of-grammar   Improperly treated as silence   Correct behavior      Incorrect
in-grammar            Improperly treated as silence   Improperly rejected   Either correct or incorrect
22. Possible outcomes: Misrecognitions
(Same outcome matrix as slide 21, here calling out the incorrect and assumed-incorrect cells in the "recognized" column.)
23. Possible outcomes: “Misrejections”
(Same outcome matrix as slide 21, here calling out the "Improperly rejected" cells in the "rejected" column.)
24. Possible outcomes: “Missilences”
(Same outcome matrix as slide 21, here calling out the "Improperly treated as silence" cells in the "nospeech" column.)
25. Three types of errors
• Missilences -- called silence, but wasn’t
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized
inappropriately or incorrectly
26. Three types of errors
• Missilences -- called silence, but wasn’t
• Misrejections -- rejected inappropriately
• Misrecognitions -- recognized
inappropriately or incorrectly
So how do we evaluate these?
27. Evaluating errors
• Run ASR Engine on data set
• Try every rejection threshold value
• Plot errors as function of threshold
• Find optimal value
28. Try every rejection
threshold value
• Ran data files through test harness with
rejection threshold of 0 (i.e., no rejection),
but recorded confidence score
• Now, for each possible rejection threshold
from 0 to 100
• Calculate number of misrecognitions,
misrejections, and missilences
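A sketch of that sweep is below. The field names ($row['human'], $row['asr'], and so on) and the exact scoring rules are placeholders; the real logic lives in the analysis scripts used in Lab 5.

  <?php
  // Sketch: count the three error types at a given rejection threshold.
  // Results were produced at threshold 0, so every recognized row still
  // carries its confidence score.
  function count_errors(array $rows, $threshold) {
      $e = array('missilences' => 0, 'misrejections' => 0, 'misrecognitions' => 0);
      foreach ($rows as $row) {
          $asr = $row['asr'];
          // A result recognized below the threshold would have been rejected.
          if ($asr === 'recognized' && $row['confidence'] < $threshold) {
              $asr = 'rejected';
          }
          if ($asr === 'nospeech' && $row['human'] !== 'nospeech') {
              $e['missilences']++;          // called silence, but wasn't
          } elseif ($asr === 'rejected' &&
                    ($row['human'] === 'nospeech' || $row['human'] === 'in_grammar')) {
              $e['misrejections']++;        // rejected inappropriately
          } elseif ($asr === 'recognized' &&
                    ($row['human'] !== 'in_grammar' || $row['reco'] !== $row['transcription'])) {
              $e['misrecognitions']++;      // recognized inappropriately or incorrectly
          }
      }
      return $e;
  }

  $rows = array();           // in the lab: parsed from <rundir>/results.csv
  $byThreshold = array();
  for ($t = 0; $t <= 100; $t++) {
      $byThreshold[$t] = count_errors($rows, $t);
  }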
29. Semantic equivalence
• We call “yes” in-grammar, but what about
“yes yes yes”?
• Application only cares about whether it does
the right thing, so
• Our final results need to be semantic results
30. Lab 5
• Look at synonyms.txt file
• Analyze at single threshold and look at the result
• php analyze_csv.php <csv file> 50 synonyms.txt
• Note the difference between raw and semantic results
• Now evaluate at all thresholds and look at the (semantic)
results
• php analyze_all_thresholds.php <csv file> <synonyms file>
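The format of synonyms.txt isn't reproduced here; assuming a simple "variant,canonical" line format, the semantic mapping step might look roughly like this:

  <?php
  // Sketch: collapse raw strings to a canonical semantic value before scoring.
  // Assumes each synonyms line is "<variant>,<canonical>"; the lab's real
  // synonyms.txt format may differ.
  function load_synonyms($path) {
      $map = array();
      foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
          $fields = str_getcsv($line);
          if (count($fields) >= 2) {
              $map[$fields[0]] = $fields[1];
          }
      }
      return $map;
  }

  function semantic_form($string, array $map) {
      return isset($map[$string]) ? $map[$string] : $string;
  }

  // e.g., if synonyms.txt maps "yes yes yes" to "yes", both of these
  // reduce to the same semantic result and are scored as a match.
  $map = load_synonyms('synonyms.txt');
  var_dump(semantic_form('yes yes yes', $map) === semantic_form('yes', $map));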
31. ASR Engine A errors
(Plot: “misrejections”, misrecognitions, and “missilences” vs. rejection threshold, 0 to 100; error counts from 0 to 1000.)
32. ASR Engine A errors
(Plot: ASR Engine A errors vs. rejection threshold, 0 to 100, marking the minimum total error sum.)
33. Lab 6
• You now have engine B in “b” directory
• Change harness and component scripts as necessary to
run the same test
• You need to know that
• The API for engine B is different. Run “php b/b.php” to
find out what it is. It takes ABNF grammars instead of
XML.
• Engine B stores its output in a different file.
• Possible outputs from engine B are
• <audiofilename>: [NOSPEECH, REJECTION,
SPOKETOOSOON, MAXSPEECHTIMEOUT]
• <audiofilename>: ERROR processing file
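Since engine B takes ABNF rather than GRXML, the yes/no grammar would translate to SRGS ABNF roughly as follows (again only a sketch; the lab's actual grammar may differ):

  #ABNF 1.0;
  language en-US;
  mode voice;
  root $yesno;
  public $yesno = yes | no;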
34. ASR Engine B errors
(Plot: “misrejections”, misrecognitions, and “missilences” vs. rejection threshold, 0 to 100; error counts from 0 to 1000.)
35. ASR Engine B errors
(Plot: ASR Engine B errors vs. rejection threshold, 0 to 100, marking the minimum total error sum.)
36. Comparing ASR
accuracy
• Plot and compare
• Remember to compare optimal error rates
of each (representing tuned accuracy)
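Given the per-threshold counts from the sweep sketch earlier, the "optimal" operating point is simply the threshold with the smallest total error sum, e.g.:

  <?php
  // Sketch: pick the threshold minimizing the total error sum, using the
  // hypothetical $byThreshold array from the earlier sweep sketch.
  $best = null;
  $bestTotal = PHP_INT_MAX;
  foreach ($byThreshold as $t => $errors) {
      $total = array_sum($errors);
      if ($total < $bestTotal) {
          $bestTotal = $total;
          $best = $t;
      }
  }
  echo "Optimal threshold: $best ($bestTotal total errors)\n";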
37. Total errors: A vs B
(Plot: total errors for ASR Engine A and ASR Engine B vs. rejection threshold, 0 to 100; error counts from 0 to 1000.)
38. Comparison conclusions
• Optimal error rates are very similar on this
data set
• Engine A is much more sensitive to
rejection threshold changes
39. Natural Numbers
(Plot: errors for ASR Engine A and ASR Engine B vs. rejection threshold, 0 to 100. Note that the optimal thresholds are different!)
40. Today we . . .
• Learned all the steps of doing an eval by
actually doing them
• Collecting data
• Transcribing data
• Running a test
• Analyzing results
• Finished with code that runs
(and some homework . . .)