Robust Word Vectors for Russian Language
V. Malykh1
1Laboratory of Neural Systems and Deep learning,
Moscow Institute of Physics and Technology (State University)
http://www.mipt.ru/
Artificial Intelligence
and Natural Language Conference, 2016
V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 1 / 27
Outline
1 Word Vectors
2 Our Approach
Our Approach Description
LSTM
BME Representation
Architecture
3 Experiments
1st Corpus
2nd Corpus
Word Vectors
Vector representations of words are a basic idea for enabling computers to
work with words in a more convenient manner than with simple
categorical features.
Figure is adopted from deeplearning4j.com
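The contrast with categorical features can be shown in a toy sketch. This is an illustration only: the dense values below are made up, and the names (`one_hot`, `dense`, `cos`) are not from the talk.

```python
# One-hot (categorical) vectors carry no similarity information:
# every pair of distinct words is orthogonal. Dense word vectors can
# place related words close together. Values here are invented.
import numpy as np

one_hot = {"king": np.array([1.0, 0.0, 0.0]),
           "queen": np.array([0.0, 1.0, 0.0]),
           "apple": np.array([0.0, 0.0, 1.0])}
dense = {"king": np.array([0.9, 0.8]),
         "queen": np.array([0.85, 0.75]),
         "apple": np.array([-0.7, 0.2])}

def cos(x, y):
    """Cosine similarity between two vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cos(one_hot["king"], one_hot["queen"]))  # 0.0: all pairs look unrelated
print(cos(dense["king"], dense["queen"]) > cos(dense["king"], dense["apple"]))  # True
```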
Existing approaches
Two main word-level approaches:
Local context (e.g. Word2Vec, [Mikolov2013])
Co-occurrence matrix decomposition (e.g. GloVe, [Pennington2014])
Drawbacks:
A model needs to recognize the word exactly.
Out-of-vocabulary words cannot be represented.
Existing approaches (2)
Char-level approach:
Read the letters and try to predict the word they represent, e.g.
[Ling2015].
Drawback:
Out-of-vocabulary words are again a problem.
Our Approach Description
In our architecture we do not use any co-occurrence matrices.
There is no notion of a vocabulary in the model.
Context is handled by memorisation in the weights of a neural net.
We use recurrent layers to achieve this property.
Long Short-Term Memory
Figure is adopted from http://deeplearning.net/tutorial/lstm.html
BME Representation
B (begin): first 3 letters, in one-hot form.
M (middle): all letters, as counts over the alphabet.
E (end): last 3 letters, in one-hot form.
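The BME encoding can be sketched in a few lines. This is a minimal sketch, assuming a fixed alphabet and zero-padding for short words; it uses the 26-letter Latin alphabet for readability, while the paper targets Cyrillic, and the names (`bme`, `one_hot_prefix`) are illustrative, not from the talk.

```python
# Hedged sketch of the BME representation: B = first 3 letters one-hot,
# M = per-letter counts over the alphabet, E = last 3 letters one-hot.
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
IDX = {ch: i for i, ch in enumerate(ALPHABET)}
A = len(ALPHABET)

def one_hot_prefix(letters):
    """Concatenated one-hot vectors for up to 3 letters (zero-padded)."""
    v = np.zeros(3 * A)
    for pos, ch in enumerate(letters[:3]):
        v[pos * A + IDX[ch]] = 1.0
    return v

def bme(word):
    """BME vector: [B | M | E], dimension 3*A + A + 3*A."""
    counts = np.zeros(A)
    for ch in word:
        counts[IDX[ch]] += 1.0
    return np.concatenate([one_hot_prefix(word), counts,
                           one_hot_prefix(word[-3:])])

v = bme("vector")
print(v.shape)  # (182,) for a 26-letter alphabet: 78 + 26 + 78
```

The M block deliberately discards letter order, which is what makes the representation tolerant to swapped or shifted letters in the middle of a word.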
Math description
Negative Contrast Estimation loss:

NCE = e^{-s(v, c)} + e^{s(v, c')}    (1)

where v is a word vector, c is a word vector from the word's context,
c' is a word vector from outside the word's context, and s(x, y) is a
scoring function. We use cosine similarity as the scoring function:

cos(x, y) = (x · y) / (|x| |y|)    (2)
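Equations (1) and (2) can be checked numerically. A minimal sketch, with illustrative names (`nce_loss`, `cos`) and made-up example vectors:

```python
# Loss of eq. (1) with the cosine scoring of eq. (2): the first term is
# small when v is close to its true context vector c, the second term is
# small when v is far from the out-of-context vector c'.
import numpy as np

def cos(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def nce_loss(v, c_pos, c_neg):
    """e^{-s(v,c)} + e^{s(v,c')} with s = cosine similarity."""
    return np.exp(-cos(v, c_pos)) + np.exp(cos(v, c_neg))

v = np.array([1.0, 0.0])
loss_good = nce_loss(v, np.array([1.0, 0.1]), np.array([-1.0, 0.0]))
loss_bad = nce_loss(v, np.array([-1.0, 0.0]), np.array([1.0, 0.1]))
print(loss_good < loss_bad)  # True: loss is lower when v matches its context
```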
Corpus Description
News headlines corpus in Russian.
3 classes: strong paraphrase, weak paraphrase, and non-paraphrase.
First introduced in 2015 in [Pronoza2015]; an extended version appeared
in 2016.
Corpus Description
Statistics:
Pairs¹                    7227
Strong Paraphrase Pairs   1668
Weak Paraphrase Pairs     2957
Non-Paraphrase Pairs      2582
¹ Remark: in our experiments we use only the 1 & -1 classes.
Experiment setup
The metric is ROC AUC on cosine similarity interpreted as the
probability of the positive class (strong paraphrase).
We add artificial noise: additional letters, vanishing letters, and
replacement of letters.
10 runs with each noise level.
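The noising procedure can be sketched as follows. This is an assumption-laden illustration: the talk does not specify how the three noise types are mixed, so the even split of `level` across insertion, deletion, and replacement is a guess, and the Latin alphabet stands in for Cyrillic.

```python
# Hedged sketch of letter-level noise: vanishing letters (deletion),
# replacement, and additional letters (insertion), each drawn with
# probability level/3 per character. Mixing scheme is an assumption.
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_noise(word, level, rng=random):
    out = []
    for ch in word:
        r = rng.random()
        if r < level / 3:
            continue                          # vanishing letter
        elif r < 2 * level / 3:
            out.append(rng.choice(ALPHABET))  # replaced letter
        else:
            out.append(ch)                    # kept unchanged
        if rng.random() < level / 3:
            out.append(rng.choice(ALPHABET))  # additional letter
    return "".join(out)

print(add_noise("paraphrase", 0.3, random.Random(0)))
```

In an evaluation like the one above, the noised sentence pairs would be scored with cosine similarity and the scores fed to e.g. `sklearn.metrics.roc_auc_score` against the paraphrase labels.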
Results
Figure: Results on Paraphraser corpus
Corpus Description
Plagiarism detection in scientific papers.
150 pairs of articles' titles & descriptions in Russian.
3 human experts each produce an evaluation score in [0, 1].
Introduced in 2014 in [Derbenev2014].
Results (2)
Table: Results of testing on scientific plagiarism corpus
System Quality
Random Baseline 0.213 ± 0.025
Word2Vec Baseline 0.189
Robust Word2Vec 0.232
Summary
We have introduced an architecture that produces word vectors based on
characters.
It does not store word vectors explicitly, so it has a fixed number of
weights that does not depend on the vocabulary size.
The architecture does not rely on any kind of pre-processing (e.g.
stemming).
The architecture outperforms existing word vector models in noisy
environments.
Future Work
Find more paraphrase corpora, ideally naturally noisy ones (e.g. the
Twitter Paraphrase Corpus for English).
Try the architecture on other languages.
Try to improve quality in the low-noise region by means of deeper
architectures, attention, etc.
Thank you for your attention!
I would be happy to answer your questions.
References I
N. V. Derbenev, D. A. Kozliuk, V. V. Nikitin, V. O. Tolcheev
Experimental Research of Near-Duplicate Detection Methods for
Scientific Papers.
Machine Learning and Data Analysis. Vol. 1 (7), 2014 (in Russian).
E. Pronoza, E. Yagunova, A. Pronoza
Construction of a Russian Paraphrase Corpus: Unsupervised
Paraphrase Extraction.
Proceedings of the 9th Russian Summer School in Information
Retrieval (RuSSIR 2015, Young Scientist Conference), August 24-28,
2015, Saint Petersburg, Russia. Springer CCIS.
T. Mikolov et al.
Distributed representations of words and phrases and their
compositionality.
Advances in neural information processing systems. 2013.
References II
W. Ling et al.
Finding function in form: Compositional character models for open
vocabulary word representation.
In Proceedings of EMNLP 2015.
J. Pennington, R. Socher, and C.D. Manning.
GloVe: Global Vectors for Word Representation.
EMNLP. Vol. 14. 2014.

Mais conteúdo relacionado

Mais procurados

Schema-agnositc queries over large-schema databases: a distributional semanti...
Schema-agnositc queries over large-schema databases: a distributional semanti...Schema-agnositc queries over large-schema databases: a distributional semanti...
Schema-agnositc queries over large-schema databases: a distributional semanti...
Andre Freitas
 

Mais procurados (20)

Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, ZighedAINL 2016: Bastrakova, Ledesma, Millan, Zighed
AINL 2016: Bastrakova, Ledesma, Millan, Zighed
 
Lecture 2: Computational Semantics
Lecture 2: Computational SemanticsLecture 2: Computational Semantics
Lecture 2: Computational Semantics
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Latent Relational Model for Relation Extraction
Latent Relational Model for Relation ExtractionLatent Relational Model for Relation Extraction
Latent Relational Model for Relation Extraction
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
 
RusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRERusProfiling Gender Identification in Russian Texts PAN@FIRE
RusProfiling Gender Identification in Russian Texts PAN@FIRE
 
OUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String ProcessingOUTDATED Text Mining 3/5: String Processing
OUTDATED Text Mining 3/5: String Processing
 
Cognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithmsCognitive plausibility in learning algorithms
Cognitive plausibility in learning algorithms
 
Detecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large CorporaDetecting and Describing Historical Periods in a Large Corpora
Detecting and Describing Historical Periods in a Large Corpora
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensim
 
Schema-agnositc queries over large-schema databases: a distributional semanti...
Schema-agnositc queries over large-schema databases: a distributional semanti...Schema-agnositc queries over large-schema databases: a distributional semanti...
Schema-agnositc queries over large-schema databases: a distributional semanti...
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for Search
 
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015
 
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
 

Destaque

Destaque (20)

AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
AINL 2016: Rykov, Nagornyy, Koltsova, Natta, Kremenets, Manovich, Cerrone, Cr...
 
AINL 2016: Kuznetsova
AINL 2016: KuznetsovaAINL 2016: Kuznetsova
AINL 2016: Kuznetsova
 
AINL 2016: Panicheva, Ledovaya
AINL 2016: Panicheva, LedovayaAINL 2016: Panicheva, Ledovaya
AINL 2016: Panicheva, Ledovaya
 
AINL 2016: Nikolenko
AINL 2016: NikolenkoAINL 2016: Nikolenko
AINL 2016: Nikolenko
 
AINL 2016: Romanova, Nefedov
AINL 2016: Romanova, NefedovAINL 2016: Romanova, Nefedov
AINL 2016: Romanova, Nefedov
 
AINL 2016: Boldyreva
AINL 2016: BoldyrevaAINL 2016: Boldyreva
AINL 2016: Boldyreva
 
AINL 2016: Bodrunova, Blekanov, Maksimov
AINL 2016: Bodrunova, Blekanov, MaksimovAINL 2016: Bodrunova, Blekanov, Maksimov
AINL 2016: Bodrunova, Blekanov, Maksimov
 
AINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, NikolenkoAINL 2016: Galinsky, Alekseev, Nikolenko
AINL 2016: Galinsky, Alekseev, Nikolenko
 
AINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, CoutoAINL 2016: Castro, Lopez, Cavalcante, Couto
AINL 2016: Castro, Lopez, Cavalcante, Couto
 
AINL 2016: Ustalov
AINL 2016: Ustalov AINL 2016: Ustalov
AINL 2016: Ustalov
 
AINL 2016: Muravyov
AINL 2016: MuravyovAINL 2016: Muravyov
AINL 2016: Muravyov
 
AINL 2016: Skornyakov
AINL 2016: SkornyakovAINL 2016: Skornyakov
AINL 2016: Skornyakov
 
AINL 2016: Fenogenova, Karpov, Kazorin
AINL 2016: Fenogenova, Karpov, KazorinAINL 2016: Fenogenova, Karpov, Kazorin
AINL 2016: Fenogenova, Karpov, Kazorin
 
AINL 2016: Alekseev, Nikolenko
AINL 2016: Alekseev, NikolenkoAINL 2016: Alekseev, Nikolenko
AINL 2016: Alekseev, Nikolenko
 
AINL 2016: Goncharov
AINL 2016: GoncharovAINL 2016: Goncharov
AINL 2016: Goncharov
 
AINL 2016: Bugaychenko
AINL 2016: BugaychenkoAINL 2016: Bugaychenko
AINL 2016: Bugaychenko
 
AINL 2016: Kozerenko
AINL 2016: Kozerenko AINL 2016: Kozerenko
AINL 2016: Kozerenko
 
AINL 2016: Proncheva
AINL 2016: PronchevaAINL 2016: Proncheva
AINL 2016: Proncheva
 
AINL 2016: Moskvichev
AINL 2016: MoskvichevAINL 2016: Moskvichev
AINL 2016: Moskvichev
 
AINL 2016: Strijov
AINL 2016: StrijovAINL 2016: Strijov
AINL 2016: Strijov
 

Semelhante a AINL 2016: Malykh

Continuous bag of words cbow word2vec word embedding work .pdf
Continuous bag of words cbow word2vec word embedding work .pdfContinuous bag of words cbow word2vec word embedding work .pdf
Continuous bag of words cbow word2vec word embedding work .pdf
devangmittal4
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationStatistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
Andi Wu
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
csandit
 

Semelhante a AINL 2016: Malykh (20)

Embedding for fun fumarola Meetup Milano DLI luglio
Embedding for fun fumarola Meetup Milano DLI luglioEmbedding for fun fumarola Meetup Milano DLI luglio
Embedding for fun fumarola Meetup Milano DLI luglio
 
Multilingual qa
Multilingual qaMultilingual qa
Multilingual qa
 
Continuous bag of words cbow word2vec word embedding work .pdf
Continuous bag of words cbow word2vec word embedding work .pdfContinuous bag of words cbow word2vec word embedding work .pdf
Continuous bag of words cbow word2vec word embedding work .pdf
 
Designing, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural NetworksDesigning, Visualizing and Understanding Deep Neural Networks
Designing, Visualizing and Understanding Deep Neural Networks
 
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATIONAN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION
 
New word analogy corpus
New word analogy corpusNew word analogy corpus
New word analogy corpus
 
UWB semeval2016-task5
UWB semeval2016-task5UWB semeval2016-task5
UWB semeval2016-task5
 
Word2vec on the italian language: first experiments
Word2vec on the italian language: first experimentsWord2vec on the italian language: first experiments
Word2vec on the italian language: first experiments
 
Latent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text SummarizationLatent Topic-semantic Indexing based Automatic Text Summarization
Latent Topic-semantic Indexing based Automatic Text Summarization
 
Cwn aat talk
Cwn aat talkCwn aat talk
Cwn aat talk
 
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Beyond Word2Vec: Embedding Words and Phrases in Same Vector SpaceBeyond Word2Vec: Embedding Words and Phrases in Same Vector Space
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space
 
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...
[Paper Reading]  Unsupervised Learning of Sentence Embeddings using Compositi...[Paper Reading]  Unsupervised Learning of Sentence Embeddings using Compositi...
[Paper Reading] Unsupervised Learning of Sentence Embeddings using Compositi...
 
Explicit Semantic Analysis
Explicit Semantic AnalysisExplicit Semantic Analysis
Explicit Semantic Analysis
 
Statistically-Enhanced New Word Identification
Statistically-Enhanced New Word IdentificationStatistically-Enhanced New Word Identification
Statistically-Enhanced New Word Identification
 
AINL 2016: Maraev
AINL 2016: MaraevAINL 2016: Maraev
AINL 2016: Maraev
 
Challenges in transfer learning in nlp
Challenges in transfer learning in nlpChallenges in transfer learning in nlp
Challenges in transfer learning in nlp
 
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
 

Mais de Lidia Pivovarova

Mais de Lidia Pivovarova (10)

Classification and clustering in media monitoring: from knowledge engineering...
Classification and clustering in media monitoring: from knowledge engineering...Classification and clustering in media monitoring: from knowledge engineering...
Classification and clustering in media monitoring: from knowledge engineering...
 
Convolutional neural networks for text classification
Convolutional neural networks for text classificationConvolutional neural networks for text classification
Convolutional neural networks for text classification
 
Grouping business news stories based on salience of named entities
Grouping business news stories based on salience of named entitiesGrouping business news stories based on salience of named entities
Grouping business news stories based on salience of named entities
 
Интеллектуальный анализ текста
Интеллектуальный анализ текстаИнтеллектуальный анализ текста
Интеллектуальный анализ текста
 
AINL 2016: Shavrina, Selegey
AINL 2016: Shavrina, SelegeyAINL 2016: Shavrina, Selegey
AINL 2016: Shavrina, Selegey
 
AINL 2016: Khudobakhshov
AINL 2016: KhudobakhshovAINL 2016: Khudobakhshov
AINL 2016: Khudobakhshov
 
AINL 2016:
AINL 2016: AINL 2016:
AINL 2016:
 
AINL 2016: Grigorieva
AINL 2016: GrigorievaAINL 2016: Grigorieva
AINL 2016: Grigorieva
 
AINL 2016: Just AI
AINL 2016: Just AIAINL 2016: Just AI
AINL 2016: Just AI
 
AINL 2016: Filchenkov
AINL 2016: FilchenkovAINL 2016: Filchenkov
AINL 2016: Filchenkov
 

Último

Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
RohitNehra6
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
Sérgio Sacani
 

Último (20)

Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Creating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening DesignsCreating and Analyzing Definitive Screening Designs
Creating and Analyzing Definitive Screening Designs
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 

AINL 2016: Malykh

  • 1. Robust Word Vectors for Russian Language V. Malykh1 1Laboratory of Neural Systems and Deep learning, Moscow Institute of Physics and Technology (State University) http://www.mipt.ru/ Artificial Intelligence and Natural Language Conference, 2016 V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 1 / 27
  • 2. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 2 / 27
  • 3. Word Vectors Vector representations of words is a basic idea to enable computers to work with words in more convenient manner than with simple categorical features. Figure is adopted from deeplearning4j.com V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 3 / 27
  • 4. Existing approaches Two main word-level approaches: Local context (e.g. Word2Vec, [Mikolov2013]) Co-occurence matrix decomposition (e.g. GloVe, [Pennington2014]) Drawbacks: A model need to recognize the word exactly. Out of vocabulary words. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 4 / 27
  • 5. Existing approaches (2) Char-level approach: Read the letters and try to predict the word, which it represents. E.g. [Pennington2015] Drawback: Again out of vocabulary words. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 5 / 27
  • 6. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 6 / 27
  • 7. Our Approach Description In our architecture we do not use any co-occurrence matrices. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 7 / 27
  • 8. Our Approach Description In our architecture we do not use any co-occurrence matrices. There is no entity of vocabulary in the model. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 7 / 27
  • 9. Our Approach Description In our architecture we do not use any co-occurrence matrices. There is no entity of vocabulary in the model. Context is handled by memorising in weights of a neural net. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 7 / 27
  • 10. Our Approach Description In our architecture we do not use any co-occurrence matrices. There is no entity of vocabulary in the model. Context is handled by memorising in weights of a neural net. We use recurrent layers to achieve this property. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 7 / 27
  • 11. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 8 / 27
  • 12. Long Short-Term Memory Figure is adopted from http://deeplearning.net/tutorial/lstm.html V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 9 / 27
  • 13. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 10 / 27
  • 14. BME Representation B (begin): the first 3 letters, one-hot encoded. M (middle): counts of all the word's letters over the alphabet (a bag-of-letters vector). E (end): the last 3 letters, one-hot encoded. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 11 / 27
  • 15. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 12 / 27
  • 16. [Architecture figure] V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 13 / 27
  • 17. Math description Noise Contrastive Estimation (NCE) loss: NCE = e^{-s(v, c)} + e^{s(v, c')} (1) where v is a word vector, c is the vector of a word from the word's context, c' is the vector of a word outside the context, and s(x, y) is a scoring function. We use cosine similarity as the scoring function: cos(x, y) = (x · y) / (|x| |y|) (2) V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 14 / 27
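Equations (1) and (2) can be sketched in plain Python as follows; this is an illustration of the loss on raw vectors, while the actual training of course computes it over the neural net's outputs.

```python
import math

def cos(x, y):
    """Cosine similarity, the scoring function s(x, y) of Eq. (2)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def nce_loss(v, c_pos, c_neg):
    """Eq. (1): low when v is close to a context vector c_pos
    and far from a non-context vector c_neg."""
    return math.exp(-cos(v, c_pos)) + math.exp(cos(v, c_neg))
```

Minimising this loss pushes a word vector toward vectors of words seen in its context and away from vectors of randomly drawn non-context words.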
  • 18. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 15 / 27
  • 21. Corpus Description News headings corpus in Russian. 3 classes: strong paraphrase, weak paraphrase, and non-paraphrase. First introduced in 2015 in [Pronoza2015]; an extended version appeared in 2016. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 16 / 27
  • 22. Corpus Description
  Statistics:
  Pairs [1]                  7227
  Strong Paraphrase Pairs    1668
  Weak Paraphrase Pairs      2957
  Non-Paraphrase Pairs       2582
  [1] Remark: in our experiments we use only the 1 & -1 classes.
  V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 17 / 27
  • 25. Experiment setup The metric is ROC AUC, with cosine similarity interpreted as the probability of the positive class (strong paraphrase). We add artificial noise: inserted, deleted, and replaced letters. 10 runs with each noise level. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 18 / 27
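The noise model (letter insertions, deletions, and replacements at a given rate) can be sketched as below. The slides do not specify the exact corruption procedure, so this per-character scheme with three equiprobable edit operations is an assumption for illustration.

```python
import random

RU_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

def add_noise(text, level, alphabet=RU_ALPHABET):
    """Corrupt each character with probability `level` by one of three edits:
    drop it, keep it and insert a random letter after it, or replace it."""
    out = []
    for ch in text:
        if random.random() < level:
            op = random.choice(("drop", "insert", "replace"))
            if op == "drop":
                continue                        # vanishing letter
            elif op == "insert":
                out.append(ch)                  # additional letter
                out.append(random.choice(alphabet))
            else:
                out.append(random.choice(alphabet))  # replacement
        else:
            out.append(ch)
    return "".join(out)
```

Running the evaluation 10 times per noise level, as on the slide, averages out the randomness of this corruption.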
  • 26. Results Figure: Results on Paraphraser corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 19 / 27
  • 27. Outline 1 Word Vectors 2 Our Approach Our Approach Description LSTM BME Representation Architecture 3 Experiments 1st Corpus 2nd Corpus V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 20 / 27
  • 31. Corpus Description Plagiarism detection in scientific papers. 150 pairs of articles' titles & descriptions in Russian. 3 human experts each produce a similarity score in [0, 1]. Introduced in 2014 in [Derbenev2014]. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 21 / 27
  • 32. Results (2)
  Table: Results of testing on the scientific plagiarism corpus
  System               Quality
  Random Baseline      0.213 ± 0.025
  Word2Vec Baseline    0.189
  Robust Word2Vec      0.232
  V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 22 / 27
  • 33. Summary We have introduced an architecture that produces word vectors based on characters. It does not store word vectors explicitly, so it has a fixed number of weights that does not depend on vocabulary size. The architecture does not rely on any kind of pre-processing (e.g. stemming). The architecture outperforms existing word vector models in noisy environments. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 23 / 27
  • 34. Future Work Find more paraphrase corpora, ideally naturally noisy ones (e.g. the Twitter Paraphrase Corpus for English). Try the architecture on other languages. Improve quality in the low-noise region, e.g. with deeper architectures, attention, etc. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 24 / 27
  • 35. Thank you for your attention! I would be happy to answer your questions. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 25 / 27
  • 36. References I N. V. Derbenev, D. A. Kozliuk, V. V. Nikitin, V. O. Tolcheev. Experimental Research of Near-Duplicate Detection Methods for Scientific Papers. Machine Learning and Data Analysis. Vol. 1 (7), 2014 (in Russian). E. Pronoza, E. Yagunova, A. Pronoza. Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction. Proceedings of the 9th Russian Summer School in Information Retrieval (RuSSIR 2015, Young Scientist Conference), August 24-28, 2015, Saint-Petersburg, Russia. Springer CCIS. T. Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems. 2013. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 26 / 27
  • 37. References II W. Ling et al. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. Proceedings of EMNLP 2015. J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. EMNLP. Vol. 14. 2014. V. Malykh (MIPT) Robust Word Vectors for Russian Language AINL 2016 27 / 27