Natural Language Processing
Politecnico di Milano
Polo di Como
Prof. Licia Sbattella
Student: Lorenzo Monni Sau
Matr.: 771378
AA 2012/2013

Assignment: Text & Speech Analysis
Table of Contents
1. Introduction: Goals of the Assignment and used tools
2. Choice of the dialogue and text to speech alignment with SPPAS
3. Editing the dialogue tiers in Praat and writing a Script for Processing
4. POS Tagging
5. Semantic Analysis with JWNL
6. Results and main statistics
7. Conclusions
8. Appendix: Lines of Code.

1. Introduction: Goals of the Assignment and used tools
The objective of this work is to provide a complete analysis of a piece of conversation,
covering the following aspects:
• the phonological features of the dialogue, with a brief statistical analysis;
• a subdivision into dialogue acts using the DAMSL model;
• the POS tagging of the dialogue;
• a brief semantic analysis;
• a graphical representation of the results.

Given these goals, the first step was to choose a dialogue suited to the analysis. The audio
file of the dialogue, together with its written transcription, was given as input to SPPAS
(Automatic Phonetic Annotation of Speech), a tool that aligns audio with text and also
provides tokenization and phonetization.
SPPAS produced the text aligned with the audio file, which was then used as input to Praat,
a tool for measuring acoustic features of speech such as pitch, intensity and formants. The
alignment was manually edited in Praat to obtain the best match between transcription and
audio, and a Praat script was then written to attach audio features and further annotations
to each word in a .txt file.
POS tagging was carried out with the Stanford University POS Tagger. After this phase the
.txt file looked like a table, with audio, dialogue and syntactic features associated with
each word of the conversation.
The last part of the project was the semantic analysis of the dialogue, which used the JWNL
Java library to query the WordNet lexical database.
The charts were produced by importing the final .txt file into Microsoft Excel.
2. Choice of the dialogue and text to speech alignment with SPPAS
Choosing a suitable dialogue was probably the hardest step of the assignment, because of
the limited processing capabilities of SPPAS. My first idea was to use an artistically
relevant dialogue, so I started with an excerpt from the film Eyes Wide Shut by Stanley
Kubrick and tried to get the best possible alignment.
SPPAS (version 1.4.8) does not perform well with:
• audio files longer than 2 minutes;
• film excerpts, which usually contain considerable background noise;
• realistic, natural dialogues, because of overlapping voices, non-word sounds and other
imperfections.
The dialogue between Bill and Victor had all three of these characteristics, so it was
almost impossible to obtain an adequate alignment, even with subsequent editing in Praat.
I tried to attenuate the noise and bring out only the speech parts of the audio file with a
simple MATLAB script (see the Appendix for the code), but it did not work.
The second attempt was a dialogue from the Italian film Il Divo by Paolo Sorrentino, in
which the speech seemed clearer and more fluid than in the previous one. SPPAS also
supports dialogues in Italian. Unfortunately this audio file showed the same drawbacks as
the previous one, even though I also tried splitting the audio file into shorter fragments,
as can be seen in the project folder.
The last attempt was a linear English educational dialogue between two girls, which SPPAS
processed really well. Despite its simplicity and linear interaction, it had a good level
of emotional speech and was expressive enough for the purpose of the assignment.
To enable a correct alignment with SPPAS I also put hash marks (#) in the .txt file to
signal the pauses in the dialogue. This is another limitation of SPPAS: without the
silences marked in the .txt file it could not provide a precise alignment. The resulting
files are in the project folder “SPPAS Processing”.
3. Editing the dialogue tiers in Praat and writing a Script for Processing
Since the alignment produced by SPPAS was not precise, further editing in Praat was needed,
moving boundaries and tokens to the right positions where necessary. The results of this
editing were saved in the TextGrid file “dialogue-flat-phon_palign”, in the folder
“Editing in Praat”.
Two more tiers were added to the TextGrid file, indicating the class of dialogue act
(following the dialogue act classification proposed in the DAMSL model) and the speaker.
The final TextGrid file featured the following tiers:
• PhonAlign Tier;
• PhnTokAlign Tier;
• TokensAlign Tier;
• DialogueAct Tier;
• Speaker.

In the next phase I moved from the Praat editor view to the Praat scripting language, to
extract the required audio features for each word token in the dialogue. The Praat script
“features.praat” takes the Wave file and the TextGrid file as input and produces a .txt
file which lists, for each token:
• Word token;
• Mean Pitch of token;
• Mean Intensity of token;
• DialogueAct;
• Speaker.

The results were saved in the .txt file “conversation-audio” in the folder “Editing in Praat”.
4. POS Tagging
The part-of-speech tagging of each word in the dialogue was done with the Stanford POS
Tagger (version 3.2.0). The result of the tagging operation was stored in the file
“conversation-tagged.txt”. A pretrained model was used to assign part-of-speech tags to
the unlabeled text: the adopted model was “wsj-0-18-left3words-distsim”, included in the
Stanford POS Tagger package.
After the POS tagging I noticed some tagger mistakes, e.g. some nouns were recognized as
verbs and vice versa, but the majority of words had the right tag.
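
The tagged file pairs every token with its tag in the form word_TAG, and the later steps have to split those pairs again. The following is only a minimal sketch of that step, not the code used in the project: the class name TaggedTokenParser and the sample sentence are hypothetical, and it assumes the underscore separator used by the Java code in the Appendix. It splits on the last underscore, so a token that itself contains an underscore keeps its full text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class TaggedTokenParser {

    /** Splits one "word_TAG" item on its LAST underscore; returns {word, tag}. */
    public static String[] parseToken(String item) {
        int cut = item.lastIndexOf('_');
        if (cut <= 0 || cut == item.length() - 1) {
            throw new IllegalArgumentException("Not a word_TAG item: " + item);
        }
        return new String[] { item.substring(0, cut), item.substring(cut + 1) };
    }

    /** Parses a tagged line whose items are separated by whitespace. */
    public static List<String[]> parseLine(String line) {
        List<String[]> tokens = new ArrayList<String[]>();
        for (String item : line.trim().split("\\s+")) {
            if (!item.isEmpty()) {
                tokens.add(parseToken(item));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative input only, not taken from the analysed dialogue.
        for (String[] t : parseLine("What_WP are_VBP you_PRP doing_VBG ?_.")) {
            System.out.println(t[0] + "\t" + t[1]);
        }
    }
}
```

Splitting on the last underscore rather than the first is a design choice of this sketch: the tag never contains an underscore, while a token might.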

5. Semantic Analysis with JWNL
JWNL is a Java API (Application Programming Interface) for accessing and querying the
WordNet database. In this context JWNL was used to find the domains of each word token. I
used version 2.0 of WordNet, version 1.4 of JWNL, and Eclipse as IDE with the Java 1.7 SDK
and JRE 7 (Java Runtime Environment).
To find the domains of each token I used the CATEGORY pointer type, and for the cases where
no related domains were found I wrote a function which searches recursively for the root
hypernym. The Java project reads the .txt input file “conversation-tagged” in the folder
“POS tagging” and writes the .txt file “dialogue-audio-pos-domains” as output.
One issue in this operation was that the CATEGORY pointer returned nothing for many tokens,
and the search for hypernyms returned base classes like “entity” or “abstraction”, which
are too general for the purpose of a semantic domain search.
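
One possible way around the over-general root, which was not tried in this project, is to keep the whole hypernym chain and pick the ancestor a few levels below the root. The sketch below illustrates the idea on a small hand-written hierarchy rather than on WordNet: the class HypernymChain, the toy map and the parameter levelsBelowRoot are all hypothetical, and the map is a simplification, not the actual WordNet chain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration on a toy hierarchy; it does not call JWNL.
public class HypernymChain {

    /** Returns the chain from the word up to its root, one parent per step. */
    public static List<String> chainToRoot(Map<String, String> parentOf, String word) {
        List<String> chain = new ArrayList<String>();
        String current = word;
        // The contains() check stops the walk if the map has a cycle.
        while (current != null && !chain.contains(current)) {
            chain.add(current);
            current = parentOf.get(current);
        }
        return chain;
    }

    /** Picks the ancestor levelsBelowRoot steps under the root (0 = the root). */
    public static String ancestorBelowRoot(List<String> chain, int levelsBelowRoot) {
        int index = Math.max(0, chain.size() - 1 - levelsBelowRoot);
        return chain.get(index);
    }

    /** Toy child-to-parent map, simplified and NOT copied from WordNet. */
    public static Map<String, String> toyHierarchy() {
        Map<String, String> parentOf = new HashMap<String, String>();
        parentOf.put("dog", "canine");
        parentOf.put("canine", "carnivore");
        parentOf.put("carnivore", "mammal");
        parentOf.put("mammal", "animal");
        parentOf.put("animal", "organism");
        parentOf.put("organism", "entity");
        return parentOf;
    }

    public static void main(String[] args) {
        List<String> chain = chainToRoot(toyHierarchy(), "dog");
        System.out.println("root: " + ancestorBelowRoot(chain, 0));
        System.out.println("two levels below root: " + ancestorBelowRoot(chain, 2));
    }
}
```

With levelsBelowRoot set to 0 the sketch reproduces the behaviour described above (the root is returned); larger values trade generality for specificity, and the right value would have to be chosen empirically.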
The final results of all the processing are stored in the Excel file “Dialogue Data” and in
the flat .txt file “dialogue-audio-pos-domains-def”.

6. Results and main statistics
All the data of the dialogue analysis were imported into the Excel file “Dialogue Data”,
which includes four sheets:
– General Data: table with all fields and values;
– Speaker Pitch-Intensity: pitch and intensity data and charts;
– Dialogue Acts: analysis of dialogue acts;
– Domains: analysis of domains.
Non-word utterances were not taken into account in the analysis, since there is only one
non-word token in the conversation.
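
The per-speaker figures in these sheets come down to grouping the token rows by speaker and averaging a column. The statistics in this project were computed in Excel; the following is only a sketch of the same computation in code, under stated assumptions: the class SpeakerStats is hypothetical and the numeric values are placeholders, not measurements from the dialogue. Only the speaker names are taken from the data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only.
public class SpeakerStats {

    /** Averages values[i] per speakers[i], keeping first-seen speaker order. */
    public static Map<String, Double> meanBySpeaker(String[] speakers, double[] values) {
        if (speakers.length != values.length) {
            throw new IllegalArgumentException("One value per speaker label is required");
        }
        // For each speaker: element 0 holds the running sum, element 1 the count.
        Map<String, double[]> acc = new LinkedHashMap<String, double[]>();
        for (int i = 0; i < speakers.length; i++) {
            double[] a = acc.get(speakers[i]);
            if (a == null) {
                a = new double[2];
                acc.put(speakers[i], a);
            }
            a[0] += values[i];
            a[1] += 1;
        }
        Map<String, Double> means = new LinkedHashMap<String, Double>();
        for (Map.Entry<String, double[]> e : acc.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    public static void main(String[] args) {
        // Placeholder values, NOT the pitch measured in the dialogue.
        String[] speakers = { "Amanda", "Karen", "Amanda", "Karen" };
        double[] pitchHz = { 200.0, 240.0, 220.0, 260.0 };
        for (Map.Entry<String, Double> e : meanBySpeaker(speakers, pitchHz).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The same method can be applied to the intensity column, or grouped by dialogue act instead of speaker by passing the dialogue act labels as the first argument.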

[Figure: Pitch Trend By Speaker. Pitch (Hz) against token number (1 to 81) for Amanda and Karen.]

[Figure: Intensity Trend By Speaker. Intensity (dB) against token number (1 to 81) for Amanda and Karen.]

7. Conclusions
Because of the difficulties with SPPAS processing, the chosen dialogue is a very simple
type of conversation, so the DAMSL analysis and the domain analysis did not show
significant results. The topic of conversation is general, so there is no particular trend
in the semantic domains of the word tokens. The conversation is evenly distributed: the two
speakers have almost the same number of tokens. The conversation shows slight variations in
pitch, and the fundamental frequency of Amanda's voice is quite different from Karen's,
reflecting the different timbre of the two speakers, though both stay within the range of
common female values. In the average pitch results there is a significant outlier
associated with Amanda's expression “on friday”: the values of 97 and 107 Hz seem rather
unrealistic for a female voice. The average intensity of the tokens shows that the volume
of the dialogue remains constant during the conversation: there is no soft speech, and the
two speakers talk at the same volume (a difference of only 2 dB).
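
The “on friday” outlier was spotted by inspecting the results; a simple range check could flag such tokens automatically. The sketch below is an illustration only: the class PitchOutliers is hypothetical, and the 120 to 500 Hz bounds are an assumed plausibility range chosen for the example, not a value used in this analysis. In the sample data only 97 and 107 Hz come from the results above; the other values are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
public class PitchOutliers {

    /** Returns the indices of the values that lie outside [minHz, maxHz]. */
    public static List<Integer> outOfRange(double[] pitchHz, double minHz, double maxHz) {
        List<Integer> flagged = new ArrayList<Integer>();
        for (int i = 0; i < pitchHz.length; i++) {
            // Written as a negation so that NaN, which fails every comparison, is flagged too.
            if (!(pitchHz[i] >= minHz && pitchHz[i] <= maxHz)) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // 97 and 107 Hz are the reported outliers; 215 and 230 Hz are placeholders.
        double[] meanPitch = { 215.0, 97.0, 107.0, 230.0 };
        // Assumed plausibility bounds, chosen for this example only.
        System.out.println(outOfRange(meanPitch, 120.0, 500.0));
    }
}
```

Flagged tokens would still need to be checked by ear or in the Praat editor, since a range check cannot tell a tracking error from a genuinely low-pitched token.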
The Praat analysis is probably the most reliable, together with the POS tagging, whereas
the analysis carried out with JWNL shows evident limits in recognizing the correct domains
of speech. Most of the domains found are clearly wrong for this kind of dialogue, and the
reason is that knowledge of the context in which a word token occurs is needed to reach the
right semantic domain.
The conversation between Amanda and Karen is a Q&A conversation, so it is no surprise that
a high percentage of dialogue acts falls into the Answer and Info-Request types. More
pleasant expressions seem to have higher pitch and intensity, whereas action-directives,
open-options and offers show a lower pitch and sometimes lower intensity, which may mean
that a speaker making a proposal wants to convey modesty and avoid the impression of
imposing it.
8. Appendix: Lines of Code.
MATLAB CODE
function [y_n] = remove_noise(y, win_len, mean_val, atten)
% This function attenuates background noise, provided that the
% loudness difference between noise and original signal is high enough.
%   y        = signal with noise
%   win_len  = frame length used to estimate the noise impact
%   mean_val = threshold which discriminates between noise and signal
%   atten    = attenuation factor applied to frames classified as noise
for n = 1:(length(y)-win_len)
    frame = abs(y(n:(n+win_len-1)));
    % a frame counts as noise when both its mean and its peak are below the threshold
    if (sum(frame) < mean_val*win_len && max(frame) < mean_val)
        for m = n:n+win_len-1
            y(m) = y(m)*atten;
        end
    end
end
y_n = y;
end

PRAAT CODE
##### Script to extract features for each token #####
## print columns of the table ##
echo Token          MeanPitch Intens.   DialogueAct         Speaker
select all
# sound file & TextGrid file to be analyzed #
s = selected("Sound")
tg = selected("TextGrid")
select tg
numIntervals = Get number of intervals... 3
### calculate Pitch and Intensity of Speech ###
select s
To Pitch... 0.0 75 600
select s
To Intensity... 75 0.0
plus Pitch dialogue-flat
pitch = selected ("Pitch")
intensity = selected("Intensity")
space$ = " "
for cont from 1 to numIntervals
    select TextGrid dialogue-flat-phon_palign
    token$ = Get label of interval... 3 cont
    tstart = Get starting point... 3 cont
    tend = Get end point... 3 cont
    dialogueActNum = Get interval at time... 4 tstart+0.01
    dialogueAct$ = Get label of interval... 4 dialogueActNum
    speakerNum = Get interval at time... 5 tstart+0.01
    speaker$ = Get label of interval... 5 speakerNum
    # for each not-silence token extract mean pitch & mean intensity #
    if !startsWith (token$, "#")
        select pitch
        pitchMean = Get mean... tstart tend Hertz
        select intensity
        intensityMean = Get mean... tstart tend dB
        ### configure layout ###
        lenStr = length(token$)
        spaceNum = 15 - lenStr
        print 'token$'
        for lung from 1 to spaceNum
            print 'space$'
        endfor
        print 'pitchMean:2''tab$''intensityMean:2''tab$'
        lenStr2 = length(dialogueAct$)
        spaceNum2 = 20 - lenStr2
        ### configure layout ###
        print 'dialogueAct$'
        for lung from 1 to spaceNum2
            print 'space$'
        endfor
        print 'speaker$'
        printline
    endif
endfor
### Save data in txt file ###
appendFile ("conversation-audio.txt", info$ ())

JWNL CODE
package wordnet;

import java.io.*;
import net.didion.jwnl.JWNL;
import net.didion.jwnl.JWNLException;
import net.didion.jwnl.JWNLRuntimeException;
import net.didion.jwnl.data.IndexWord;
import net.didion.jwnl.data.POS;
import net.didion.jwnl.data.Pointer;
import net.didion.jwnl.data.PointerType;
import net.didion.jwnl.data.Synset;
import net.didion.jwnl.dictionary.Dictionary;

public class WordSem {

    public static void main(String[] args) throws JWNLException, IOException,
            JWNLRuntimeException {
        // Initialize JWNL with the properties file that points to the dictionary files
        JWNL.initialize(new FileInputStream("file_properties.xml"));
        // After initialization, get the Dictionary object that can be queried
        Dictionary wordnet = Dictionary.getInstance();

        // Read the text file and extract the words to be searched in WordNet
        String read_path = "D:\\Ultimo semestre\\Natural Language Processing\\ASSIGNMENT\\conversation\\POS tagging\\conversation-tagged.txt";
        // Open the file reader stream (reads the file with the POS tagging)
        BufferedReader br = new BufferedReader(new FileReader(read_path));
        // Open the file writer stream (writes a "Token POS Domain" line for each token)
        String write_path = "D:\\Ultimo semestre\\Natural Language Processing\\ASSIGNMENT\\conversation\\dialogue-audio-pos-domains.txt";
        FileWriter file_write = new FileWriter(new File(write_path));

        String read_linea; // current line read from the source file
        // While there are lines in the source file, take the word token and its POS tag
        while ((read_linea = br.readLine()) != null) {
            // "_" is the separator between word and tag in the source file
            String[] splits = read_linea.split("_");
            if (splits.length < 2) {
                continue; // skip lines that do not hold a word_TAG pair
            }
            String wordn = splits[0];    // word token from the source file
            System.out.println(wordn);
            String word_POS = splits[1]; // POS tag from the source file
            System.out.println(word_POS);

            // Begin the line of the output txt file
            StringBuilder write_appnd = new StringBuilder();
            write_appnd.append(wordn)
                       .append(" ")
                       .append(word_POS)
                       .append(" ");

            // Translate from POS tag to WordNet word type
            POS wnPOS = getWordNetPOS(word_POS);

            // WordNet analysis: check for the word domain and for hypernyms
            if (wnPOS != null && wordn != null) {
                // An IndexWord is a single word and part of speech; look it up in WordNet
                IndexWord w = wordnet.lookupIndexWord(wnPOS, wordn);
                if (w != null) {
                    Synset[] senses = w.getSenses();
                    for (int i = 0; i < senses.length; i++) {
                        // CATEGORY is the pointer type for the domains
                        Pointer[] domain = senses[i].getPointers(PointerType.CATEGORY);
                        for (int l = 0; l < domain.length; l++) {
                            // Get the synset of the domain and then its first word
                            Synset syndomain = domain[l].getTargetSynset();
                            String strdomain = syndomain.getWord(0).getLemma();
                            // Add it to the output line
                            write_appnd.append(strdomain);
                        }
                    }
                    // For nouns, append the root hypernym
                    if (wnPOS == POS.NOUN) {
                        write_appnd.append(getRootHypernym(w));
                    }
                }
            }
            // Finish the line and move on to the next one
            write_appnd.append("\r\n");
            file_write.write(write_appnd.toString());
        }
        file_write.close();
        br.close();
    }

    // Translate from POS tag to WordNet word type
    public static POS getWordNetPOS(String wPOS) {
        POS wordNetPos;
        switch (wPOS) {
            case "NN": case "NNS": case "NNP":
                wordNetPos = POS.NOUN; break;
            case "VB": case "VBD": case "VBG": case "VBN": case "VBP": case "VBZ":
                wordNetPos = POS.VERB; break;
            case "JJ": case "JJR": case "JJS":
                wordNetPos = POS.ADJECTIVE; break;
            case "RB": case "RBR": case "RBS":
                wordNetPos = POS.ADVERB; break;
            default:
                wordNetPos = null;
        }
        return wordNetPos;
    }

    // Search for the root hypernym, following the first hypernym of the first sense
    public static String getRootHypernym(IndexWord synsetw) throws JWNLException {
        Synset[] senses = synsetw.getSenses();
        if (senses.length == 0) {
            return "";
        }
        Synset syndomain = senses[0];
        Pointer[] hypernyms = syndomain.getPointers(PointerType.HYPERNYM);
        // Climb until a synset without hypernyms (the root) is reached
        while (hypernyms.length > 0) {
            syndomain = hypernyms[0].getTargetSynset();
            hypernyms = syndomain.getPointers(PointerType.HYPERNYM);
        }
        String stringdomain = syndomain.getWord(0).getLemma();
        System.out.println(stringdomain);
        return stringdomain;
    }
}

Text and Speech Analysis

  • 1. Natural Language Processing Politecnico di Milano Polo di Como Prof. Licia Sbattella --Student: Lorenzo Monni Sau Matr.: 771378 AA 2012/2013 Assignment: Text & Speech Analysis
  • 2. Table of contents
    1. Introduction: Goals of the Assignment and used tools
    2. Choice of the dialogue and text to speech alignment with SPPAS
    3. Editing the dialogue tiers in Praat and writing a Script for Processing
    4. POS Tagging
    5. Semantic Analysis with JWNL
    6. Results and main statistics
    7. Conclusions
    8. Appendix: Lines of Code
    1. Introduction: Goals of the Assignment and used tools
    The objective of this work is to provide a complete analysis of a piece of conversation, covering the following aspects:
    • the phonological features of the dialogue, with a brief statistical analysis;
    • a subdivision into dialogue acts using the DAMSL model;
    • the POS tagging of the dialogue;
    • a brief semantic analysis;
    • a graphical representation of the results.
    Given these goals, the first step was choosing a dialogue suited to the analysis. The audio file of the dialogue, together with its written transcription, was given as input to SPPAS (Automatic Phonetic Annotation of Speech), a tool that aligns audio and text and also offers tokenization and phonetization features.
    The SPPAS output, the transcription aligned with the audio file, was used as input to Praat, a tool for measuring acoustic features of speech such as pitch, intensity and formants. The alignment was edited manually in Praat to obtain the best match between transcription and audio, and a Praat script was then written to attach audio features and further annotations to each word in the .txt file. POS tagging was carried out with the Stanford POS Tagger. After this phase the .txt file looked like a table with the audio, dialogue and syntactic features associated with each word of the conversation. The last part of the project was the semantic analysis of the dialogue, which used the JWNL Java library to query the WordNet lexical database. The graphical results were produced by importing the final .txt file into Microsoft Excel.
  • 3. 2. Choice of the dialogue and text to speech alignment with SPPAS
    Choosing a suitable dialogue was probably the hardest step of the assignment, because of the limited processing capabilities of SPPAS. My first idea was to use an artistically relevant dialogue, so I started with an excerpt from the film Eyes Wide Shut by Stanley Kubrick and tried to get the best possible alignment. SPPAS (version 1.4.8) does not perform well with:
    • audio files longer than 2 minutes;
    • film excerpts, which usually have considerable background noise;
    • realistic, natural dialogues, because of overlapping voices, non-word sounds and other imperfections.
    The dialogue between Bill and Victor had all three characteristics, so it was almost impossible to obtain an adequate alignment, even with subsequent editing in Praat. I tried to reduce the noise and isolate the speech parts of the audio file with a simple MATLAB script (see the appendix for the code), but it did not work. The second attempt was a dialogue from the Italian film Il Divo by Paolo Sorrentino, in which the speech seemed clearer and more fluid. SPPAS can also process dialogues in Italian. Unfortunately this audio file showed the same drawbacks as the previous one, even when I split it into shorter fragments for processing, as can be seen in the folder. The last attempt was a simple English educational dialogue between two girls, which SPPAS processed really well. Despite its simplicity and linear interaction, the speech was emotive and expressive enough for the purpose of the assignment. To enable a correct alignment with SPPAS, I also put hash marks (#) in the .txt file to signal the pauses in the dialogue.
    This is another limit of SPPAS: without the silences marked in the .txt file it could not produce a precise alignment. The resulting files are in the project folder “SPPAS Processing”.
  • 4. 3. Editing the dialogue tiers in Praat and writing a Script for Processing
    Since the alignment produced by SPPAS was not precise, further editing in Praat was needed to move boundaries and tokens to the right positions. The result of this editing was saved in the TextGrid file “dialogue-flat-phon_palign”, in the folder “Editing in Praat”. Two more tiers were added to the TextGrid file: one for the class of dialogue act (following the dialogue act classification of the DAMSL model) and one for the speaker. The final TextGrid file has the following tiers:
    • PhonAlign Tier;
    • PhnTokAlign Tier;
    • TokensAlign Tier;
    • DialogueAct Tier;
    • Speaker.
    In the next phase I moved from the Praat editor view to the Praat scripting language, to extract the required audio features for each word token in the dialogue. The Praat script “features.praat” takes the Wave file and the TextGrid file as input and produces a .txt file with one row per token containing:
    • Word token;
    • Mean Pitch of token;
    • Mean Intensity of token;
    • DialogueAct;
    • Speaker.
    The results were saved in the .txt file “conversation-audio” in the folder “Editing in Praat”.
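    As an illustration of how the five-column table described above could be read back for later processing, here is a minimal Java sketch. It is not part of the original project: the class name TokenRow, the assumption that fields are separated by whitespace and contain no spaces, and the sample row in main are all hypothetical. A field that is not a number (for example when no pitch could be measured) is stored as NaN.

```java
/** One row of the per-token table: token, mean pitch, mean intensity, dialogue act, speaker. */
final class TokenRow {
    final String token;
    final double meanPitchHz;     // NaN when the pitch field is not a number
    final double meanIntensityDb; // NaN when the intensity field is not a number
    final String dialogueAct;
    final String speaker;

    TokenRow(String token, double meanPitchHz, double meanIntensityDb,
             String dialogueAct, String speaker) {
        this.token = token;
        this.meanPitchHz = meanPitchHz;
        this.meanIntensityDb = meanIntensityDb;
        this.dialogueAct = dialogueAct;
        this.speaker = speaker;
    }

    /** Parses one row; assumes whitespace-separated fields with no spaces inside a field. */
    static TokenRow parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields but got " + f.length + ": " + line);
        }
        return new TokenRow(f[0], toNumber(f[1]), toNumber(f[2]), f[3], f[4]);
    }

    /** Converts a numeric field, mapping anything unparsable to NaN. */
    private static double toNumber(String field) {
        try {
            return Double.parseDouble(field);
        } catch (NumberFormatException e) {
            return Double.NaN;
        }
    }

    public static void main(String[] args) {
        // Hypothetical row, not taken from the project's data.
        TokenRow row = parse("hello          210.50 68.20 Answer              Karen");
        System.out.println(row.token + " | " + row.meanPitchHz + " Hz | " + row.speaker);
    }
}
```

    Keeping unparsable values as NaN, rather than dropping the row, lets later statistics decide how to treat tokens without a measured pitch.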
  • 5. 4. POS Tagging
    The part-of-speech tag of each word in the dialogue was produced with the Stanford POS Tagger (version 3.2.0). The result of the tagging was stored in the file “conversation-tagged.txt”. A pretrained model was used to assign part-of-speech tags to the unlabeled text: the model “wsj-0-18-left3wordsdistsim”, included in the Stanford POS Tagger package. After tagging I noticed some mistakes, e.g. some nouns were tagged as verbs and vice versa, but the majority of words had the right tag.
    5. Semantic Analysis with JWNL
    JWNL is a Java API (Application Programming Interface) for accessing and querying the WordNet database. Here JWNL was used to find the domains of each word token. I used version 2.0 of WordNet, version 1.4 of JWNL, and Eclipse as IDE with the Java 1.7 SDK and JRE 7 (Java Runtime Environment). To find the domains of each token I used the CATEGORY pointer type, and for the cases where no related domain was found I wrote a function that recursively searches for the root hypernym. The Java project reads the .txt file “conversation-tagged” in the folder “POS tagging” as input and writes the .txt file “dialogue-audio-pos-domains” as output. One issue was that the CATEGORY pointer returned nothing for many tokens, while the recursive hypernym search returned base classes such as “entity” or “abstraction”, which are too general to serve as a semantic domain. The final results of all the processing are stored in the Excel file “Dialogue Data” and in the flat .txt file “dialogue-audio-pos-domains-def”.
    6. Results and main statistics
    All the data of the dialogue analysis were imported into the Excel file “Dialogue Data”, which includes four sheets:
    – General Data: table with all fields and values;
    – Speaker Pitch-Intensity: pitch and intensity data and graphs;
    – Dialogue Acts: analysis of dialogue acts;
    – Domains: analysis of domains.
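    The step from tagger output to a WordNet query can be sketched without any external library. The example below is only an illustration under stated assumptions: it assumes each token is written as word_TAG with Penn Treebank tags, and the names TagMapper, Coarse, splitTagged and coarse are hypothetical, not part of the Stanford tagger or of JWNL. Splitting on the last underscore keeps words that themselves contain an underscore intact, and matching tag prefixes also covers tags such as NNPS.

```java
/** Splits "word_TAG" tokens and maps Penn Treebank tags to coarse word classes. */
final class TagMapper {
    /** The four classes WordNet distinguishes, plus OTHER for everything else. */
    enum Coarse { NOUN, VERB, ADJECTIVE, ADVERB, OTHER }

    /** Returns {word, tag}, splitting on the last underscore of the token. */
    static String[] splitTagged(String tagged) {
        int i = tagged.lastIndexOf('_');
        if (i <= 0 || i == tagged.length() - 1) {
            throw new IllegalArgumentException("not in word_TAG form: " + tagged);
        }
        return new String[] { tagged.substring(0, i), tagged.substring(i + 1) };
    }

    /** Maps a Penn Treebank tag to a coarse class by its prefix. */
    static Coarse coarse(String tag) {
        if (tag.startsWith("NN")) return Coarse.NOUN;      // NN, NNS, NNP, NNPS
        if (tag.startsWith("VB")) return Coarse.VERB;      // VB, VBD, VBG, VBN, VBP, VBZ
        if (tag.startsWith("JJ")) return Coarse.ADJECTIVE; // JJ, JJR, JJS
        if (tag.startsWith("RB")) return Coarse.ADVERB;    // RB, RBR, RBS
        return Coarse.OTHER;                               // determiners, pronouns, WRB, ...
    }

    public static void main(String[] args) {
        // Hypothetical tagged tokens, not taken from the project's data.
        for (String t : new String[] { "party_NN", "going_VBG", "the_DT" }) {
            String[] parts = splitTagged(t);
            System.out.println(parts[0] + " -> " + coarse(parts[1]));
        }
    }
}
```

    Tokens mapped to OTHER have no WordNet entry to look up, so a caller would skip the semantic query for them.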
  • 6. Non-word utterances were not taken into account in the analysis, since there is only one non-word token in the conversation.
    [Figure: Pitch Trend By Speaker. Pitch (Hz), axis from 0 to 600, against Token Number (1 to 81), one series each for Amanda and Karen.]
    [Figure: Intensity Trend By Speaker. Intensity (dB), axis from 0 to 90, against Token Number (1 to 81), one series each for Amanda and Karen.]
  • 7. 7. Conclusions
    Because of the difficulties with SPPAS processing, the chosen dialogue is a very simple conversation, so the DAMSL analysis and the domain analysis did not give significant results. The topic of the conversation is general, so there is no particular trend in the semantic domains of the word tokens. The conversation is evenly distributed: the two speakers have almost the same number of tokens. The pitch varies only slightly, and the fundamental frequency of Amanda's voice is quite different from Karen's, reflecting the different timbre of the two speakers, although both stay within the range of common female values. Among the average pitch results there is a significant outlier in Amanda's expression “on friday”: the values of 97 and 107 Hz seem rather unrealistic for a female voice. The average intensity of the tokens shows that the volume stays constant during the conversation: there is no soft speech, and the two speakers talk at almost the same volume (only 2 dB of difference). The Praat analysis is probably the most reliable one, together with the POS tagging, whereas the analysis carried out with JWNL shows evident limits in recognizing the correct domains of speech. Most of the domains found are clearly wrong for this kind of dialogue, because finding the right semantic domain requires knowledge of the context in which a word token occurs. The conversation between Amanda and Karen is a Q & A conversation, so it is no surprise that a high percentage of the dialogue acts fall into the Answer and Info-Request types.
    More pleasant expressions seem to have a higher pitch and intensity, whereas action-directives, open-options and offers show a lower pitch and sometimes a lower intensity. This suggests that a speaker making a proposal probably wants to convey modesty and avoid giving the impression of imposing.
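    The per-speaker averages and the outlier check discussed in the conclusions were done in Excel; the following Java sketch shows one way the same two computations could be expressed in code. It is an illustration only: the class and method names are hypothetical, the range limits passed to outOfRange are the caller's assumption about plausible pitch values, and in main only the values 97 and 107 Hz come from the report, while the other values and their assignment to speakers are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Per-speaker mean pitch and a simple range check for implausible values. */
final class PitchStats {
    /** Mean pitch of each speaker, ignoring NaN entries (tokens without a measured pitch). */
    static Map<String, Double> meanBySpeaker(List<String> speakers, List<Double> pitches) {
        Map<String, double[]> acc = new LinkedHashMap<>(); // speaker -> {sum, count}
        for (int i = 0; i < speakers.size(); i++) {
            double p = pitches.get(i);
            if (Double.isNaN(p)) continue;
            double[] a = acc.computeIfAbsent(speakers.get(i), k -> new double[2]);
            a[0] += p;
            a[1] += 1;
        }
        Map<String, Double> means = new LinkedHashMap<>();
        acc.forEach((speaker, a) -> means.put(speaker, a[0] / a[1]));
        return means;
    }

    /** Indices of tokens whose pitch lies outside [lowHz, highHz]; NaN entries are skipped. */
    static List<Integer> outOfRange(List<Double> pitches, double lowHz, double highHz) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < pitches.size(); i++) {
            double p = pitches.get(i);
            if (!Double.isNaN(p) && (p < lowHz || p > highHz)) indices.add(i);
        }
        return indices;
    }

    public static void main(String[] args) {
        // 97 and 107 Hz are the outliers named in the report; the rest is hypothetical.
        List<String> speakers = Arrays.asList("Amanda", "Amanda", "Karen", "Karen");
        List<Double> pitches = Arrays.asList(97.0, 107.0, 220.0, 240.0);
        System.out.println(meanBySpeaker(speakers, pitches));
        System.out.println(outOfRange(pitches, 150.0, 400.0)); // limits are an assumption
    }
}
```

    A fixed range is the simplest possible check; it flags values for manual inspection and does not by itself say whether the measurement or the alignment is at fault.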
  • 8. 8. Appendix: Lines of Code.
    MATLAB CODE
    function [y_n] = remove_noise(y, win_len, mean_val, atten)
    % This function performs a background noise attenuation, provided that the
    % loudness difference between noise and original signal is high enough.
    % y        = signal with noise
    % win_len  = frame length used to calculate the noise impact
    % mean_val = threshold which discriminates between noise and signal
    % atten    = attenuation value to cut noise
    for n = 1:(length(y)-win_len)
        % attenuate the frame when both its mean and its peak magnitude are below the threshold
        if (sum(abs(y(n:(n+win_len-1)))) < mean_val*win_len && max(abs(y(n:n+win_len-1))) < mean_val)
            for m = n:n+win_len-1
                y(m) = y(m)*atten;
            end
        end
    end
    y_n = y;
    end
    PRAAT CODE
    ##### Script to extract features for each token #####
    ##print columns of the table##
    echo Token MeanPitch Intens. DialogueAct Speaker
    select all
    #sound file & TextGrid file to be analyzed#
    s = selected("Sound")
    tg = selected("TextGrid")
    select tg
    numIntervals = Get number of intervals... 3
    ### calculate Pitch and Intensity of Speech ###
    select s
    To Pitch... 0.0 75 600
    select s
    To Intensity... 75 0.0
    plus Pitch dialogue-flat
  • 9.
    pitch = selected ("Pitch")
    intensity = selected("Intensity")
    space$ = " "
    for cont from 1 to numIntervals
        select TextGrid dialogue-flat-phon_palign
        token$ = Get label of interval... 3 cont
        tstart = Get starting point... 3 cont
        tend = Get end point... 3 cont
        dialogueActNum = Get interval at time... 4 tstart+0.01
        dialogueAct$ = Get label of interval... 4 dialogueActNum
        speakerNum = Get interval at time... 5 tstart+0.01
        speaker$ = Get label of interval... 5 speakerNum
        # for each not-silence token extract mean pitch & mean intensity #
        if !startsWith (token$, "#")
            select pitch
            pitchMean = Get mean... tstart tend Hertz
            select intensity
            intensityMean= Get mean... tstart tend dB
            ### configure layout ###
            lenStr = length(token$)
            spaceNum = 15 - lenStr
            print 'token$'
            for lung from 1 to spaceNum
                print 'space$'
            endfor
  • 10.
            print 'pitchMean:2' 'intensityMean:2'
            lenStr2 = length(dialogueAct$)
            spaceNum2 = 20 - lenStr2
            ### configure layout ###
            print 'dialogueAct$'
            for lung from 1 to spaceNum2
                print 'space$'
            endfor
            print 'speaker$'
            printline
        endif
    endfor
    ### Save data in txt file ###
    appendFile ("conversation-audio.txt", info$ ())
    JWNL CODE
    package wordnet;
    import java.io.*;
    import net.didion.jwnl.JWNL;
    import net.didion.jwnl.JWNLException;
    import net.didion.jwnl.JWNLRuntimeException;
    import net.didion.jwnl.data.IndexWord;
    import net.didion.jwnl.data.POS;
    import net.didion.jwnl.data.Pointer;
    import net.didion.jwnl.data.PointerType;
    import net.didion.jwnl.data.Synset;
    import net.didion.jwnl.data.Word;
    import net.didion.jwnl.dictionary.Dictionary;
    public class WordSem {
        public static void main(String[] args) throws JWNLException, IOException, JWNLRuntimeException {
            // Initialize JWNL with the properties file to point to dictionary files
            JWNL.initialize(new FileInputStream("file_properties.xml"));
            // Dictionary object
            Dictionary wordnet;
            // After initialization create a Dictionary object that can be queried
            wordnet = Dictionary.getInstance();
            // read text file and extract words to be searched on WordNet
            String read_path = "D:\\Ultimo semestre\\Natural Language Processing\\ASSIGNMENT\\conversation\\POS tagging\\conversation-tagged.txt";
            // Open file reader stream (will read file with POS Tagging)
            FileReader fr = new FileReader(read_path);
            BufferedReader br = new BufferedReader(fr);
            // Open file writer stream (will write txt file with "Token POS Domain"
  • 11.
            // lines for each token
            String write_path = "D:\\Ultimo semestre\\Natural Language Processing\\ASSIGNMENT\\conversation\\dialogue-audio-pos-domains.txt";
            File file = new File(write_path);
            FileWriter file_write = new FileWriter(file);
            String read_linea = ""; // line string variable, read line from source file
            String wordn = ""; // takes token words from source file
            String word_POS = ""; // takes POS tags from source file
            POS wnPOS; // POS tag in WordNet format
            String strdomain = ""; // takes domain string related to word token
            // While there are lines in source file take word token and POS tag
            while (true) {
                read_linea = br.readLine();
                if (read_linea == null) break;
                String[] splits = read_linea.split("_"); // this is separator between word and tag in source file
                wordn = splits[0];
                System.out.println(wordn);
                word_POS = splits[1];
                System.out.println(word_POS);
                // begin write line in output txt file
                StringBuilder write_appnd = new StringBuilder();
                write_appnd.append(wordn).append(" ").append(word_POS).append(" ");
                // translate from POS tag to WordNet word type
                wnPOS = getWordNetPOS(word_POS);
                // WordNet analysis: will check for word domain, and for hypernyms
                if (wnPOS != null && wordn != null) {
                    // An IndexWord is a single word and part of speech. Lookup a SynSet object.
                    IndexWord w = wordnet.lookupIndexWord(wnPOS, wordn);
                    if (w != null) {
                        Synset[] senses = w.getSenses();
                        int domainlen = senses.length;
                        Pointer[] domain = new Pointer[domainlen];
                        for (int i = 0; i < senses.length; i++) {
                            // CATEGORY is the pointer type for the domains
                            domain = senses[i].getPointers(PointerType.CATEGORY);
                            Synset[] syndomain = new Synset[domain.length];
                            for (int l = 0; l < domain.length; l++) {
                                // obtain synset from domain and then an associated word string
                                syndomain[l] = domain[l].getTargetSynset();
                                Word rootWord = syndomain[l].getWord(0);
                                strdomain = rootWord.getLemma();
                                // add to output txt file
                                write_appnd.append(strdomain);
  • 12.
                            }
                        }
                        // get to root hypernym
                        if (wnPOS == POS.NOUN) {
                            strdomain = getRootHypernym(w);
                            write_appnd.append(strdomain);
                        }
                    }
                }
                // finish to write line, and then skip to another
                write_appnd.append("\r\n");
                String write_linea = write_appnd.toString();
                file_write.write(write_linea);
            }
            file_write.close();
            br.close();
        }
        // translate from POS tag to WordNet word type
        public static POS getWordNetPOS(String wPOS) {
            POS wordNetPos;
            switch (wPOS) {
                case "NN":
                case "NNS":
                case "NNP":
                    wordNetPos = POS.NOUN;
                    break;
                case "VB":
                case "VBD":
                case "VBG":
                case "VBN":
                case "VBP":
                case "VBZ":
                    wordNetPos = POS.VERB;
                    break;
                case "JJ":
                case "JJR":
                case "JJS":
                    wordNetPos = POS.ADJECTIVE;
                    break;
                case "RB":
                case "RBR":
                case "RBS":
                    wordNetPos = POS.ADVERB;
                    break;
                default:
                    wordNetPos = null;
            }
            return wordNetPos;
        }
        // search for root hypernym
        public static String getRootHypernym(IndexWord synsetw) throws JWNLException {
            String stringdomain = "";
            Synset syndomain = null;
            Synset[] senses = synsetw.getSenses();
            int domainlen = senses.length;
            Pointer[] domain = new Pointer[domainlen];
            for (int i = 0; i < senses.length; i++) {
                domain = senses[0].getPointers(PointerType.HYPERNYM);
                if (domain.length > 0) {
                    syndomain = domain[0].getTargetSynset();
                    while (syndomain.toString() != null) {
                        domain = syndomain.getPointers(PointerType.HYPERNYM);
                        if (domain.length > 0)
                            syndomain =
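  • 13.
                                domain[0].getTargetSynset();
                        else
                            break;
                    }
                }
            }
            // no hypernym found: return the empty string instead of dereferencing null
            if (syndomain == null) return stringdomain;
            Word rootWord = syndomain.getWord(0);
            stringdomain = rootWord.getLemma();
            System.out.println(stringdomain);
            return stringdomain;
        }
    }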
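    The root-hypernym search can be illustrated independently of JWNL. The sketch below is not the project's code: it replaces WordNet with a small hand-made parent map (the entries in main are an abbreviated, hypothetical chain, not actual WordNet data) and uses hypothetical names throughout. It returns the whole chain from the word up to the root, so a caller can see why the root alone (for example “entity”) is too general and could instead pick an intermediate ancestor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Walks "is-a" links in a parent map, from a word up to its most general ancestor. */
final class HypernymWalk {
    /** Returns the word followed by its ancestors, ending at the root. */
    static List<String> chain(Map<String, String> parentOf, String word) {
        List<String> path = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = word;
        // Stop when there is no parent left, or when a cycle would revisit a node.
        while (current != null && seen.add(current)) {
            path.add(current);
            current = parentOf.get(current);
        }
        return path;
    }

    /** The last element of the chain; the word itself when it has no parent. */
    static String root(Map<String, String> parentOf, String word) {
        List<String> path = chain(parentOf, word);
        return path.get(path.size() - 1);
    }

    public static void main(String[] args) {
        // Hypothetical, abbreviated chain used only for illustration.
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("dog", "canine");
        parentOf.put("canine", "animal");
        parentOf.put("animal", "organism");
        parentOf.put("organism", "entity");
        System.out.println(chain(parentOf, "dog"));
        System.out.println(root(parentOf, "dog"));
    }
}
```

    Unlike a lookup that assumes at least one hypernym exists, this version handles a word with no parent by returning the word itself, and the visited set guards against malformed data containing a loop.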