1. The researcher analyzed quantitative characteristics such as entropy, readability, lexical diversity, frequencies of words, and parts of speech for different text genres including scientific texts, news articles, and student writings.
2. The analysis found that student writings had higher entropy and readability than news articles or scientific texts. News articles had higher lexical diversity and frequencies of common words.
3. To evaluate the accuracy of a developed Old Irish lemmatizer, the researcher applied it to a test corpus of 840 tokens, of which 186 were unknown words. The lemmatizer correctly predicted lemmas for 84 of the unknown words, achieving an accuracy of around 60% for unknown words.
3. Dimension reduction
— Dimension reduction is the process of reducing the
number of random variables in machine learning
tasks:
— Lemmatization – grouping together the inflected
forms of a word. LemmaGen; morpha; pymorphy2;
mystem...
— Stemming – reducing inflected words to their word
stem. The stem need not be identical to
the morphological root of the word. Snowball;
Lovins; Porter; nltk.stem.* ...
— Root extraction – reducing derivatives to their root,
i.e. the part that carries the core meaning.
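The difference between stemming and lemmatization can be shown with a toy sketch. The suffix list and the small lexicon below are invented for illustration and do not come from Snowball, Porter, pymorphy2, or any other tool named on the slide.

```python
# Toy illustration only: the suffix list and the lemma lexicon are
# invented for this example, not taken from a real stemmer or dictionary.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

LEMMAS = {"went": "go", "better": "good", "studies": "study"}  # hypothetical


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Look the form up in the lexicon; fall back to the form itself."""
    return LEMMAS.get(word, word)


for form in ("studies", "walked", "went"):
    print(form, "stem:", stem(form), "lemma:", lemmatize(form))
```

The stem of "studies" is "studi", which is not a word, while the lemma is the dictionary form "study". Irregular forms such as "went" are only reachable through a lexicon.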
7. Realization
— Neural Networks algorithm
— Training data – 749 cases
— Cross validation – 84 cases (10%)
— Test data – 93 cases
— Accuracy ~0.7
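The split sizes can be reproduced with a short sketch. The slide does not say how the cases were drawn, so the shuffling, the seed, and the function name `split_cases` are assumptions. Reading "10%" as the validation share of the training plus validation cases (84 of 833) is also an interpretation.

```python
import random


def split_cases(n_total, n_test, n_val, seed=0):
    """Shuffle case indices and cut them into train / validation / test."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # assumed: a random split
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train, val, test = split_cases(749 + 84 + 93, n_test=93, n_val=84)
val_share = len(val) / (len(train) + len(val))
print(len(train), len(val), len(test), round(val_share, 2))
```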
8. Tasks
— plagiarism detection;
— paraphrase detection;
— textual similarity;
— semantic disambiguation;
— topic modeling;
— text classification;
— text clustering;
— question answering systems;
— building semantic graphs (entities, links and
relationships between them).
9. References
— Ratsiburskaya L.V. Dictionary of Unique Morphemes of the Modern Russian Language. Moscow: Flinta: Nauka, 2009. 160 pp.
— Avanesov R.I., Ozhegov S.I. Morphemic and Orthographic Dictionary. About 100,000 words / A. N. Tikhonov. Moscow: AST: Astrel, 2002. 704 pp.
— Tikhonov A.N. Morphemic and Orthographic Dictionary of the Russian Language, 2002.
— Kuznetsova A. I., Efremova T. F. Dictionary of Morphemes of the Russian Language. About 52,000 words. Moscow: Russkii Yazyk, 1986. 1132 pp.
— http://old.kpfu.ru/infres/slovar1/begall.htm
— http://snowball.tartarus.org/algorithms/russian/stemmer.html,
http://snowballstem.org/demo.html
10. Effective Paraphrase Expansion in Addressing
Lexical Variability
Vasily Konovalov, Meni Adler, Ido Dagan
Department of Computer Science
Bar-Ilan University, Israel
The 5th conference on Artificial Intelligence and Natural
Language
11. Problem
Lexical Variability
From the Negochat negotiation dialogue corpus:
‘Reject’: “I disagree”, “I reject your proposal”, “it’s not
accepted”.
‘Accept’: “I accept your offer”, “I agree to the salary”, “It’s OK”.
‘Offer’: “I offer you a salary of 60,000 USD”, “How about the
programmer position”, “I propose you a pension of 10%”.
13. Our research questions
◮ What is the ‘best’ performing language? Why is it actually
the ‘best’ one?
◮ What is the ‘best’ performing combination of MT engines?
14. Our research settings
Languages: Portuguese, French, German, Hebrew, Russian,
Arabic, Finnish, Chinese, Hungarian.
MT engines: Google Translate API, Microsoft Translator Text
API, Yandex Translate API.
15. Our findings
◮ Among the tested languages, Hungarian is the ‘best’ performing
one.
◮ The performance of a language correlates well with the
averaged smoothed BLEU.
◮ A language that generates the most lexically dissimilar
paraphrases is the ‘best’ performing language.
◮ The differences between MT engines are insignificant
according to the averaged smoothed BLEU and are not
reflected in evaluation.
◮ The language family relations are reflected in averaged
smoothed BLEU.
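A sentence-level smoothed BLEU score can be sketched as follows. The slides do not say which smoothing variant was used or which sentence pairs were averaged, so the add-one smoothing on every n-gram order and the function name `smoothed_bleu` are assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(candidate, reference, max_n=4):
    """Sentence BLEU with add-one smoothing on each n-gram precision."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_counts & ref_counts).values())  # clipped matches
        total = sum(cand_counts.values())
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return brevity * math.exp(sum(log_precisions) / max_n)


original = "I accept your offer"
close = smoothed_bleu("I accept the offer", original)
distant = smoothed_bleu("it is not accepted", original)
print(close > distant)
```

Under this reading, a lower averaged score between an utterance and its paraphrases means the paraphrases are more lexically dissimilar.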
18. ■ For data analysis, we used several text
collections.
■ For scientific texts: collections from the Dialogue
conference (2003-2006) and Corpus Linguistics.
■ For news: a collection of short mass-media
articles from sources such as Lenta.ru, Rossiyskaya
Gazeta, RBC, Nezavisimaya Gazeta, and
Compulenta.
■ To study writings from the Unified State
Examination, we created two collections: a
“reference” collection of writings by
experts, and a second collection of writings by students.
19. ■ For the research we selected the most representative
characteristics: entropy, readability, lexical
diversity, verbal, autosem (all words except
function words), and frequencies (the
share of the words in the text that belong to the
hundred most frequent words of the Russian
language).
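Three of these characteristics can be sketched in a few lines. The slides do not give formulas, so word-level Shannon entropy, the type-token ratio as lexical diversity, and the placeholder `TOP_WORDS` set (standing in for the hundred most frequent Russian words) are all assumptions.

```python
import math
import re
from collections import Counter

# Placeholder only: stands in for the hundred most frequent words.
TOP_WORDS = {"the", "of", "and", "a", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def entropy(tokens):
    """Shannon entropy (bits) of the word distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def lexical_diversity(tokens):
    """Type-token ratio: distinct words divided by all words."""
    return len(set(tokens)) / len(tokens)


def frequent_word_ratio(tokens, top_words=TOP_WORDS):
    """Share of tokens that belong to the frequent-word list."""
    return sum(t in top_words for t in tokens) / len(tokens)


tokens = tokenize("The cat and the dog")
print(entropy(tokens), lexical_diversity(tokens), frequent_word_ratio(tokens))
```

All three functions expect a non-empty token list. Readability, verbal, and autosem are left out because they need a readability formula and part-of-speech tags.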
27. Old Irish: Grammar
• Changes can occur to any part of the word
o beginning: mutations
o middle: infixed pronouns
o end: inflections
caraid ‘he / she / it loves’
rob-car-si ‘she has loved you’
• Forms within one paradigm can look very different (esp. verbal)
do-beir ‘gives, brings’
ní t(h)abair ‘does not give, bring’
28. Old Irish: Orthography
• Inconsistent use of length marks
• Mutations are not always shown in writing
• Complex verb forms can be spelled either with or without a hyphen or a whitespace
• In later texts there are mute vowels to indicate the quality (broad / slender) of consonants next to them
⇨ a great number of possible spellings for every form
Consonant | Mutated spellings
b | bh, mb
c | ch, gc, cc
d | dh, nd
f | fh, ḟ, ḟh, bhf
g | gh, ng
l | ll, l-l
m | mh, mm, m-m
n | nn
p | ph, bp
r | rr
s | sh, ṡ, ss, ts, s-s
t | th, dt
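Undoing an initial mutation can be sketched as a prefix rewrite. The rule list below is read off the mutation table on this slide. The function name `demutate`, the rule order, and the choice to ignore doubled and hyphenated spellings (ll, nn, l-l, s-s and so on) are assumptions, not the authors' procedure.

```python
# Prefix rules taken from the mutation table; longer prefixes come first
# so that "bhf" is tried before "bh". Doubled and hyphenated spellings
# are deliberately left out of this sketch.
RULES = [
    ("bhf", "f"), ("ḟh", "f"),
    ("mb", "b"), ("gc", "c"), ("nd", "d"), ("ng", "g"),
    ("bp", "p"), ("dt", "t"), ("ts", "s"),
    ("bh", "b"), ("ch", "c"), ("dh", "d"), ("fh", "f"), ("gh", "g"),
    ("mh", "m"), ("ph", "p"), ("sh", "s"), ("th", "t"),
    ("ḟ", "f"), ("ṡ", "s"),
]


def demutate(form: str) -> str:
    """Replace a mutated initial spelling with the plain consonant."""
    for mutated, plain in RULES:
        if form.startswith(mutated) and len(form) > len(mutated):
            return plain + form[len(mutated):]
    return form


for form in ("charas", "chuain", "tsean", "caraid"):
    print(form, "->", demutate(form))
```

A purely orthographic rule like this can overcorrect, since it cannot tell a mutated initial from an unmutated word that happens to start with the same letters.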
29. Data
• Dictionary of the Irish Language (DIL)
43,345 entries ⇨ 79,140 unique forms
• Corpus
125 texts, 831,280 tokens
• Gold standard
50 random sentences from the test corpus, 840 tokens
• Not only classical Old Irish
The corpus covers the 7th-16th centuries
30. Problems
• DIL covers only ~ 41% of unique forms in the corpus
• Many contracted forms, but no unified system of contractions
• Inconsistent use of markup and punctuation
caraid
Cite this: eDIL s.v. caraid or dil.ie/8212
Forms: -carim, -cairim, caraim, -caraim, -caru, -cari, carid, caraid, -cara, carthai, caras, charas, caris, carthar, -charam, carait, charaíd, -carat, cartae, cardda, carda, carde, cartar, carad, caram, carid, -carid, -carad, carad, carthae, -chartais, carddais, cardáis, care, -charae, -carae, cara, -rochra, -chara, cara, -carat, -carad, -charad, cechar, -cechra, -cechra, cechras, -chechrat, -cechrainn, carais, carois, -cair, carsait, carsat, charus, rob-car-si, ro-car, arro-car, char, rondob-carsam-ni, charsat, charsad, ros-carsat, serc, carthain, carthi
weak vb. with reduplicated fut. on analogy of canaid (Thurn. Gramm. 402). Ind. pres. 1 s. -carim, Wb. 5c7. -cairim, 23c12. caraim, Thes. ii 293.16. -caraim, Ml. 79d1. -caru, Fél. Ep. 311. 2 s. -cari, Wb. 6c8. 3 s. carid, Wb. 25d5. caraid, Ml. 75c4. -cara, Wb. 27d9. With suff. pron. 3 s. m. carthai, Fráech 10. Rel. caras, Wb. 25c19. Ml. 91b17. charas, 30c3. caris, Thes. ii 247.4. Pass. rel. carthar, Ml. 75c4. Sg. 193b3. 196b4. <…>
(a) loves (persons): nád carad som Iudeiu, Wb. 4d17. carad uir mulierem, 22c19. carsus fiadhu, Snedg. u. Mac R. 11.5. rot charus ar th'airscélaib I have fallen in love with thee, LU 6084 (TBC). nít charadar nít tágedar, TBC 2032 = -chara, LU 5797. car do chomnesam amal no-t-cara fén = dilige proximum, PH 5837. gé no charfuinn fiche fear, KMMisc. 362.7. a fhir Chola charuid mná `beloved of women', Sc.G. St. iv 62 § 10. ní charabh bean tsean ná óg, Dánta Gr. 78.11. <…>
31. Lemmatizer
• Two methods for OOV-words
o Baseline: return a demutated form
o Predict a lemma using modified Damerau-Levenshtein
distance
• Disambiguation
o For homonymous forms, the lemma with the highest lexical
probability is chosen
o Lemma probability equals the sum of probabilities of its forms,
and form probability is its frequency count in the corpus
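The disambiguation rule can be sketched as follows. The form lists and corpus counts are invented toy data, and the helper names `lemma_scores` and `choose_lemma` are hypothetical. Only the rule itself (lemma score = sum of the corpus counts of its forms, highest score wins) comes from the slide.

```python
# Toy data, invented for illustration: lemma -> its forms, and corpus
# frequency counts of individual forms.
LEMMA_FORMS = {
    "caraid": {"caraid", "carid", "cara"},
    "cara": {"cara", "carat"},
}
FORM_COUNTS = {"caraid": 12, "carid": 7, "cara": 5, "carat": 3}


def lemma_scores(lemma_forms, form_counts):
    """Score each lemma by summing the corpus counts of its forms."""
    return {
        lemma: sum(form_counts.get(form, 0) for form in forms)
        for lemma, forms in lemma_forms.items()
    }


def choose_lemma(form, lemma_forms, scores):
    """Among lemmas that list this form, pick the highest-scoring one."""
    candidates = sorted(l for l, forms in lemma_forms.items() if form in forms)
    if not candidates:
        return None
    return max(candidates, key=lambda lemma: scores[lemma])


scores = lemma_scores(LEMMA_FORMS, FORM_COUNTS)
print(scores, choose_lemma("cara", LEMMA_FORMS, scores))
```

Raw counts are used instead of normalized probabilities. Dividing every score by the same corpus size would not change which lemma wins.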
32. Predicting lemmas for OOV-words
• Generate all possible strings at edit distance 1 and 2
• Look them up in the dictionary
• Add real words to the candidate list
• Filter candidates by the first character
“If the unknown word starts with a vowel, the candidate should also start
with a vowel, and if the unknown word starts with a consonant, the
candidate should start with the same consonant”
• The lemma of the candidate with the highest lexical probability (i.e.
frequency count in the corpus) is taken as a lemma for the unknown word
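The steps above can be sketched as follows. The slides only say that the Damerau-Levenshtein distance was modified, without giving details, so this sketch uses the plain edit set (deletion, transposition, substitution, insertion). The form-to-lemma pairs are taken from the "Predicted lemmas" slide, the counts are invented, and all function names are hypothetical.

```python
# Form -> lemma pairs from the "Predicted lemmas" slide; counts invented.
FORM_TO_LEMMA = {"eólas": "eólas", "bréthir": "bríathar", "ceist": "ceist"}
FORM_COUNTS = {"eólas": 40, "bréthir": 15, "ceist": 25}
ALPHABET = set("".join(FORM_TO_LEMMA))  # letters seen in the dictionary
VOWELS = set("aeiouáéíóú")


def edits1(word):
    """All strings one plain Damerau-Levenshtein edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return (deletes | transposes | replaces | inserts) - {word}


def edits_up_to_2(word):
    """Strings at edit distance 1 or 2."""
    first = edits1(word)
    second = {e2 for e1 in first for e2 in edits1(e1)}
    return (first | second) - {word}


def same_initial_class(unknown, candidate):
    """Vowel-initial words need vowel-initial candidates; consonants must match."""
    if unknown[0] in VOWELS:
        return candidate[0] in VOWELS
    return candidate[0] == unknown[0]


def predict_lemma(unknown):
    """Return the lemma of the most frequent surviving candidate, if any."""
    candidates = sorted(
        form for form in edits_up_to_2(unknown)
        if form in FORM_TO_LEMMA and same_initial_class(unknown, form)
    )
    if not candidates:
        return None
    best = max(candidates, key=lambda form: FORM_COUNTS.get(form, 0))
    return FORM_TO_LEMMA[best]


for token in ("eólais", "bréithir", "christ"):
    print(token, "->", predict_lemma(token))
```

With this toy dictionary, "christ" is mapped to "ceist", the same wrong prediction that the deck lists among the errors. The first-character filter cannot rule out a close but unrelated word.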
33. Evaluation
Lexicon | Forms | ‘Recall’
DIL forms only | 79,140 | 74.7 %
DIL + 1000 most frequent OOV-words | 80,206 | 80.0 %
! 4,889 homonymous forms

Method | Lemmatized correctly | Accuracy
Baseline | 483 / 840 | 57.50 %
Predicted lemmas | 552 / 840 | 65.71 %
34. Evaluation
Tokens | 840
Known words | 654
Unknown words | 186
Lemmatized correctly | 552
Lemmas predicted for unknown words | 157
Predicted correctly | 84
Predicted incorrectly | 68
Several lemmas predicted including the correct one, but the wrong one is chosen | 5
~ 60 % of lemmas are predicted correctly
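The reported counts can be turned into ratios with a short calculation. The slides do not state which ratio the "~ 60 %" figure refers to, so the variable names and the choice of denominators below are one possible reading, not the authors' definition.

```python
def pct(part, whole):
    """Percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


TOKENS, UNKNOWN = 840, 186
PREDICTED, CORRECT, WRONG, WRONG_CHOICE = 157, 84, 68, 5

# The three outcome counts add up to the number of predicted lemmas.
assert CORRECT + WRONG + WRONG_CHOICE == PREDICTED

baseline_accuracy = pct(483, TOKENS)                 # 57.5
predicted_accuracy = pct(552, TOKENS)                # 65.71
prediction_coverage = pct(PREDICTED, UNKNOWN)        # 84.41
correct_among_predicted = pct(CORRECT, PREDICTED)    # 53.5
correct_in_candidates = pct(CORRECT + WRONG_CHOICE, PREDICTED)  # 56.69

print(baseline_accuracy, predicted_accuracy, prediction_coverage,
      correct_among_predicted, correct_in_candidates)
```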
35. Predicted lemmas
Mark | Token | Best candidate from closest dictionary forms | Best candidate’s lemma | Chosen lemma
+ | eólais | eólas | eólas | eólas
+ | fiarfaigid | fíarfaigid | fíarfaigid, íarmi-foich | íarmi-foich
+ | cheast | ceist | ceist | ceist
* | déa | dia | dá, de, do, día | de
+ | bréithir | bréthir | bríathar | bríathar
– | n-uaill | aill | aile, aill, all, aille | aile
– | chuain | cain | cain, canaid, cani, caingen | canaid
– | christ | ceist | ceist | ceist
– | caeme | caíme | caíme | caíme
– | chniss | cliss | cles | cles
37. Extraction of Social
Networks from Literary Text
Tsygankova Viktoria,
National Research University
Higher School of Economics, Moscow
38. NovelGraphs
a tool for automatic annotation
of texts and for extracting social
networks of characters from text,
where nodes represent
characters and edges are
relations between them.
It can also analyze structural
balance of the resulting graphs.
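A character graph with signed edges and a simple structural balance measure can be sketched as follows. The slides do not describe how NovelGraphs computes balance, so the triangle-based definition (a triangle is balanced when the product of its edge signs is positive), the invented interaction list, and the function names are assumptions.

```python
from itertools import combinations


def build_graph(interactions):
    """Sum sentiment signs (+1 / -1) per unordered pair of characters."""
    edges = {}
    for a, b, sign in interactions:
        if a == b:
            continue  # ignore self-interactions
        key = frozenset((a, b))
        edges[key] = edges.get(key, 0) + sign
    return edges


def balanced_triangle_share(edges):
    """Share of fully connected triples whose sign product is positive."""
    nodes = sorted({name for edge in edges for name in edge})
    balanced = total = 0
    for trio in combinations(nodes, 3):
        pairs = [frozenset(pair) for pair in combinations(trio, 2)]
        if all(edges.get(pair, 0) != 0 for pair in pairs):
            total += 1
            product = 1
            for pair in pairs:
                product *= 1 if edges[pair] > 0 else -1
            balanced += product > 0
    return balanced / total if total else None


# Invented interactions between characters of "A Study in Scarlet".
interactions = [
    ("holmes", "narrator", +1), ("holmes", "lestrade", +1),
    ("narrator", "lestrade", +1), ("lestrade", "gregson", -1),
    ("holmes", "gregson", +1),
]
graph = build_graph(interactions)
print(balanced_triangle_share(graph))
```

Recomputing the share over a sliding window of the text would give a balance plot over time, which is one way to read the minima and maxima mentioned in the conclusions.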
39. Example graph of “The Picture of Dorian Gray” by Oscar Wilde
Nodes: prince paradox, duke de valentinois, henry wotton, narborough, borgia, filippo, hallward, louis xii, lady henry, erskine, adrian, gian maria visconti, romeo, gray, mercutio, ruxton
40. Example graph of “A Study in Scarlet” by A. Conan Doyle
Nodes: lestrade, gregson, murcher, rance, holmes, narrator, eph stangerson
41. Example graph of “A Study
in Scarlet” by A. Conan Doyle
with sentiment
42. Example graph of “The Picture
of Dorian Gray” by Oscar Wilde
with sentiment
43. Conclusions
The NovelGraphs tool was created for
English-language literary fiction. It
uses a new approach to extracting characters
and the connections between them.
Nodes represent characters found in the text,
and edges connect them to other characters
with whom they interact.
At the moment, combinations of extractors and
aggregators detect characters better than
interactions between them.
Analysis of structural balance identifies key
passages of the text that correspond to the
minima and maxima on the balance plot.
45. Are the results of your corpus
research really reliable?
Getting automatic result analysis on
GICR.
Tatiana Shavrina, Daniil Selegey
AINL FRUCT, SPb, 12.11.2016
46. Big Corpora Problem:
1. Billions of words, mostly coming from
social media
2. Getting just the IPM and search
results in KWIC format doesn’t tell
you if the results are biased
3. A lot of metatext attributes (URLs,
doc IDs, author IDs, region, gender,
genre etc.) are all potential sources
of bias
Users need corpus tools to see all statistics of the
search area to check for homogeneity with the
whole corpus.
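One way to check a search area for homogeneity with the whole corpus is to compare the distribution of a metatext attribute in the hits with its distribution in the corpus. The sketch below uses IPM and a Pearson chi-square statistic. The slides do not say which statistics GICR reports, and the counts, shares, and function names are invented for illustration.

```python
def ipm(hits, corpus_tokens):
    """Instances per million tokens."""
    return hits * 1_000_000 / corpus_tokens


def chi_square(observed, corpus_shares):
    """Pearson chi-square of observed counts against corpus-wide shares."""
    total = sum(observed.values())
    statistic = 0.0
    for category, share in corpus_shares.items():
        expected = total * share
        statistic += (observed.get(category, 0) - expected) ** 2 / expected
    return statistic


# Invented numbers: author gender in the search hits vs. the whole corpus.
corpus_gender_shares = {"f": 0.5, "m": 0.5}
hits_by_gender = {"f": 80, "m": 20}

statistic = chi_square(hits_by_gender, corpus_gender_shares)
print(ipm(250, 1_000_000_000), statistic)
```

With two categories there is one degree of freedom, and the 5% critical value is about 3.84. A statistic of 36 would mean the hits are far from the corpus-wide gender balance, so the raw IPM should be read with that bias in mind.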