Using Word Embedding for Cross-Language
Plagiarism Detection
Authors
Jérémy Ferrero Frédéric Agnès Laurent Besacier Didier Schwab
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 1
What is Cross-Language Plagiarism Detection?
Cross-language plagiarism is plagiarism by translation, i.e. a text has been
plagiarized while being translated (manually or automatically).
Given a text in a language L, we must find similar passage(s) among a set of
candidate texts in another language L’ (cross-language textual similarity).
Why is it so important?
Sources:
- McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School.
- Josephson Institute. (2011). What would honest Abe Lincoln say?
Research Questions
• Are Word Embeddings useful for cross-language
plagiarism detection?
• Is syntax weighting in distributed representations of
sentences useful for textual entailment?
• Are cross-language plagiarism detection methods
complementary?
State-of-the-Art Methods
• MT-Based Models: Translation + Monolingual Analysis (T+MA) [Muhr et al., 2010, Barrón-Cedeño, 2012]
• Comparable Corpora-Based Models: CL-KGA, CL-ESA [Gabrilovich and Markovitch, 2007, Potthast et al., 2008]
• Parallel Corpora-Based Models: CL-ASA [Barrón-Cedeño et al., 2008, Pinto et al., 2009], CL-LSI, CL-KCCA
• Dictionary-Based Models: CL-VSM, CL-CTS [Gupta et al., 2012, Pataki, 2012]
• Syntax-Based Models: Length Model, CL-CnG [McNamee and Mayfield, 2004, Potthast et al., 2011], Cognateness
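To make the syntax-based family more concrete, a character n-gram comparison in the spirit of CL-CnG can be sketched as below. This is a minimal sketch under stated assumptions, not the cited authors' implementation: the normalization step, the raw-count weighting, and the function names `char_ngrams`, `cosine` and `cl_cng_similarity` are all choices made for illustration.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after a simple normalization (an assumption)."""
    # Lowercase and collapse every run of non-word characters into one space.
    cleaned = re.sub(r"[^\w]+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_cng_similarity(text_a, text_b, n=3):
    """Compare two texts through their character n-gram counts."""
    return cosine(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

Such a measure needs no translation resource at all, which is why it tends to work best between languages that share many cognates and a common alphabet.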
Augmented CL-CTS
We use DBnary [Sérasset, 2015] as a linked lexical resource.
CL-WES: Cross-Language Word Embedding-based Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Evaluation Dataset
Dataset from [Ferrero et al., 2016], available at
https://github.com/FerreroJeremy/Cross-Language-Dataset:
• French, English and Spanish;
• Parallel and comparable (mix of Wikipedia, conference papers, product reviews,
Europarl and JRC);
• Different granularities: document level, sentence level and chunk level;
• Human and machine translated texts;
• Obfuscated (to make the similarity detection more complicated) and without
added noise;
• Written and translated by multiple types of authors;
• Covers various fields.
Evaluation Protocol
• We compare each English textual unit to its corresponding
French unit and to 999 other randomly selected units;
• We threshold the resulting distance matrix, keeping the threshold
that gives the best F1 score;
• We repeat these two steps 10 times, yielding 10 folds:
- 2 folds for tuning (CL-WESS) and fusion (Decision Tree)
- 8 folds for validation
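The threshold search in the second step can be sketched as follows. This is only an illustration under stated assumptions: it works on similarity scores where higher means more similar (with a distance matrix the comparison would be reversed), it tries every observed score as a candidate threshold, and the names `f1_at_threshold` and `best_threshold` as well as the toy data are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every pair scoring >= threshold is predicted as a match."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best F1."""
    best = (0.0, None)  # (f1, threshold)
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, candidate)
        if f1 > best[0]:
            best = (f1, candidate)
    return best


# Toy data: one source unit compared to its true counterpart (label True)
# and to unrelated candidates (label False). Values are made up.
scores = [0.91, 0.40, 0.35, 0.62, 0.10]
labels = [True, False, False, False, False]
f1, threshold = best_threshold(scores, labels)
```

In the protocol above, the threshold would be chosen on the tuning folds and then applied unchanged to the validation folds.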
Results
Overall (%)
Methods        Chunk-Level  Sentence-Level
State-of-the-Art Methods
CL-C3G         50.76        49.34
CL-CTS         42.84        47.50
CL-ASA         47.32        35.81
CL-ESA         14.81        14.44
T+MA           37.12        37.42
New Proposed Methods
CL-CTS-WE      46.67        50.69
CL-WES         41.95        41.43
CL-WESS        53.73        56.35
Decision Tree  89.15        88.50
Table: Average F1 scores of methods applied on EN→FR sub-corpora.
• CL-CTS-WE improves on CL-CTS (+3.83 points on chunks and +3.19 points on
sentences);
• CL-WESS improves on CL-WES (+11.78 points on chunks and +14.92 points on
sentences);
• CL-WESS is better than CL-C3G (+2.97 points on chunks and +7.01 points on
sentences);
• Decision Tree fusion significantly improves the results.
CL-CTS-WE: Cross-Language Conceptual Thesaurus-based Similarity with
Word-Embedding
CL-C3G: Cross-Language Character 3-Gram
Decision Tree fusion: C4.5 [Quinlan, 1993]
Conclusion
• Augmentation of several baseline approaches using word embeddings
instead of lexical resources;
• Overall, CL-WESS outperforms the previous best state-of-the-art methods;
• Methods are complementary and their fusion significantly helps
cross-language textual similarity detection performance;
• Winning method at SemEval-2017 Task 1 track 4a, i.e. the task on
Spanish-English Cross-lingual Semantic Textual Similarity detection.
Thank you for your attention.
Do you have any questions?
jeremy.ferrero@compilatio.net
@FerreroJeremy
github.com/FerreroJeremy
fr.linkedin.com/in/FerreroJeremy
researchgate.net/profile/Jeremy_Ferrero
Augmented CL-CTS
• CL-CTS-WE uses the 10 closest words in the embedding model to build the
bag of words (BOW) of a word;
• The BOW of a sentence is the merge of the BOWs of its words;
• Two sentences are compared with the Jaccard distance between their BOWs.
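The three steps above can be sketched as follows. This is a minimal sketch, not the authors' code: the neighbour lists in `NEIGHBOURS` are hypothetical stand-ins for queries to a bilingual embedding model, including the word itself in its own bag is an assumption, and the function names are made up for illustration.

```python
def word_bow(word, neighbours, top_k=10):
    """Bag of words for one word: its top_k nearest neighbours.

    Including the word itself is an assumption of this sketch.
    """
    return {word, *neighbours.get(word, [])[:top_k]}


def sentence_bow(words, neighbours, top_k=10):
    """Merge (set union) of the bags of all words in the sentence."""
    bag = set()
    for word in words:
        bag |= word_bow(word, neighbours, top_k)
    return bag


def jaccard_distance(bag_a, bag_b):
    """1 - |A & B| / |A | B|; two empty bags are treated as identical."""
    union = bag_a | bag_b
    if not union:
        return 0.0
    return 1.0 - len(bag_a & bag_b) / len(union)


# Hypothetical neighbour lists; a real system would ask a bilingual
# embedding model for the closest words instead.
NEIGHBOURS = {
    "cat": ["chat", "kitten", "dog"],
    "chat": ["cat", "chaton", "chien"],
    "sleeps": ["dort", "sleep"],
    "dort": ["sleeps", "dormir"],
}

en_bag = sentence_bow(["cat", "sleeps"], NEIGHBOURS)
fr_bag = sentence_bow(["chat", "dort"], NEIGHBOURS)
distance = jaccard_distance(en_bag, fr_bag)
```

With these toy lists the two bags share 4 of 10 distinct words, so the distance is 0.6; a lower distance indicates a more likely translation pair.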
CL-WES: Cross-Language Word Embedding-based Similarity
The similarity between two sentences S and S' is computed as the cosine
similarity between their two vectors V and V', built as follows:
$$V(S) = \sum_{i=1}^{|S|} \mathrm{vector}(u_i) \qquad (1)$$
• u_i is the i-th word of S;
• vector is the function which gives the word embedding vector of a word.
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec).
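Equation 1 followed by a cosine comparison can be sketched in a few lines. This sketch does not use the MultiVec API; the toy 2-dimensional `EMBEDDINGS` table holds made-up values standing in for a bilingual embedding model, and skipping out-of-vocabulary words is an assumption.

```python
import math


def sentence_vector(words, embeddings):
    """Sum the embedding vectors of the words (Equation 1).

    Skipping out-of-vocabulary words is an assumption of this sketch.
    """
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return []
    return [sum(components) for components in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors (0.0 if one is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy bilingual embeddings in a shared space (made-up values).
EMBEDDINGS = {
    "cat": [1.0, 0.0], "chat": [0.9, 0.1],
    "sleeps": [0.0, 1.0], "dort": [0.1, 0.9],
}

v_en = sentence_vector(["the", "cat", "sleeps"], EMBEDDINGS)
v_fr = sentence_vector(["le", "chat", "dort"], EMBEDDINGS)
similarity = cosine_similarity(v_en, v_fr)
```

Because every word contributes equally to the sum, frequent function words can dominate the sentence vector, which is the limitation that a part-of-speech weighting is meant to address.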
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
$$V(S) = \sum_{i=1}^{|S|} \mathrm{weight}(\mathrm{pos}(u_i)) \cdot \mathrm{vector}(u_i) \qquad (2)$$
• u_i is the i-th word of S;
• pos is the function which gives the universal part-of-speech tag of a word;
• weight is the function which gives the weight of a part-of-speech;
• vector is the function which gives the word embedding vector of a word;
• · is the multiplication of a vector by a scalar.
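Equation 2 can be illustrated with the sketch below. It is written under several assumptions: the weights in `POS_WEIGHTS` are made up (the slides tune them on dedicated folds), the input is already tagged with universal part-of-speech tags, words without an embedding are skipped, and unknown tags default to a weight of 1.0.

```python
# Hypothetical weights per universal POS tag, made up for illustration.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "DET": 0.1}

# Toy 2-dimensional embeddings (made-up values).
EMBEDDINGS = {
    "the": [0.5, 0.5],
    "cat": [1.0, 0.0],
    "sleeps": [0.0, 1.0],
}


def weighted_sentence_vector(tagged_words, embeddings, pos_weights):
    """Equation 2: sum of weight(pos(u_i)) * vector(u_i) over the sentence.

    tagged_words is a list of (word, universal_pos_tag) pairs. Words without
    an embedding are skipped and unknown tags get weight 1.0 (assumptions).
    """
    total = None
    for word, tag in tagged_words:
        if word not in embeddings:
            continue
        weight = pos_weights.get(tag, 1.0)
        scaled = [weight * x for x in embeddings[word]]
        if total is None:
            total = scaled
        else:
            total = [a + b for a, b in zip(total, scaled)]
    return total if total is not None else []


tagged = [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
vector = weighted_sentence_vector(tagged, EMBEDDINGS, POS_WEIGHTS)
```

Here the determiner contributes only a tenth of its vector, so the result, roughly [1.05, 0.85], is driven by the noun and the verb.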
Complementarity
Figure: Distribution histograms of CL-CnG (left) and CL-ASA (right) for 1000 positive and
1000 negative (mis)matches.
Fusions
• Weighted Average Fusion
• Decision Tree Fusion: C4.5 [Quinlan, 1993]
Weighted Fusion
$$\mathrm{fus}(M) = \frac{\sum_{j=1}^{|M|} w_j \, m_j}{\sum_{j=1}^{|M|} w_j} \qquad (3)$$
• M is the set of the scores of the methods for one match;
• m_j and w_j are the score and the weight of the j-th method respectively.
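Equation 3 is a weighted mean and can be sketched directly. The scores and weights below are made up for illustration; how the weights are chosen is not shown here.

```python
def weighted_fusion(scores, weights):
    """Equation 3: sum_j(w_j * m_j) / sum_j(w_j)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * m for w, m in zip(weights, scores)) / total_weight


# Made-up scores of three methods for one candidate match, with made-up weights.
method_scores = [0.8, 0.5, 0.6]
method_weights = [2.0, 1.0, 1.0]
fused = weighted_fusion(method_scores, method_weights)
```

With equal weights the formula reduces to a plain average of the method scores.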
Results at Chunk-Level
Methods          Wikipedia (%)  TALN (%)  JRC (%)  APR (%)  Europarl (%)  Overall (%)
CL-C3G           63.04          40.80     36.80    80.69    53.26         50.76
CL-CTS           58.05          33.66     30.15    67.88    45.31         42.84
CL-ASA           23.70          23.24     33.06    26.34    55.45         47.32
CL-ESA           64.86          23.73     13.91    23.01    13.98         14.81
T+MA             58.26          38.90     28.81    73.25    36.60         37.12
CL-CTS-WE        58.00          38.04     31.73    73.13    49.91         46.67
CL-WES           37.53          21.70     32.96    39.14    46.01         41.95
CL-WESS          52.68          34.49     45.00    56.83    57.06         53.73
Average fusion   81.34          65.78     61.87    91.87    79.77         75.82
Weighted fusion  84.61          69.69     67.02    94.38    83.74         80.01
Decision Tree    95.25          74.10     72.19    97.05    95.16         89.15
Table: Average F1 scores of cross-language similarity detection methods applied on chunk-level
EN→FR sub-corpora (validation on 8 folds).
Results at Sentence-Level
Methods          Wikipedia (%)  TALN (%)  JRC (%)  APR (%)  Europarl (%)  Overall (%)
CL-C3G           48.24          48.19     36.85    61.30    52.70         49.34
CL-CTS           46.71          38.93     28.38    51.43    53.35         47.50
CL-ASA           27.68          27.33     34.78    25.95    36.73         35.81
CL-ESA           50.89          14.41     14.45    14.18    14.09         14.44
T+MA             50.39          37.66     32.31    61.95    37.70         37.42
CL-CTS-WE        47.26          43.93     31.63    57.85    56.39         50.69
CL-WES           28.48          24.37     33.99    39.10    44.06         41.43
CL-WESS          45.65          40.45     48.64    58.08    58.84         56.35
Decision Tree    80.45          80.89     72.70    78.91    94.04         88.50
Table: Average F1 scores of cross-language similarity detection methods applied on
sentence-level EN→FR sub-corpora (validation on 8 folds).
References
Barrón-Cedeño, A. (2012).
On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism.
In PhD thesis, València, Spain.
Barrón-Cedeño, A., Rosso, P., Pinto, D., and Juan, A. (2008).
On Cross-lingual Plagiarism Analysis using a Statistical Model.
In Stein, B., Stamatatos, E., and Koppel, M., editors, Proceedings
of the ECAI’08 PAN Workshop: Uncovering Plagiarism, Authorship and Social
Software Misuse, pages 9–13, Patras, Greece.
Berard, A., Servan, C., Pietquin, O., and Besacier, L. (2016).
MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP.
In Proceedings of the Tenth International Conference on Language Resources and
Evaluation (LREC’16), pages 4188–4192, Portorož, Slovenia. European Language
Resources Association (ELRA).
Ferrero, J., Agnès, F., Besacier, L., and Schwab, D. (2016).
A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language
Textual Similarity Detection.
In Proceedings of the Tenth International Conference on Language Resources and
Evaluation (LREC’16), pages 4162–4169, Portorož, Slovenia. European Language
Resources Association (ELRA).
Gabrilovich, E. and Markovitch, S. (2007).
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic
Analysis.
In Proceedings of the 20th International Joint Conference on Artificial Intelligence
(IJCAI’07), pages 1606–1611, Hyderabad, India. Morgan Kaufmann Publishers Inc.
Gupta, P., Barrón-Cedeño, A., and Rosso, P. (2012).
Cross-language High Similarity Search using a Conceptual Thesaurus.
In Information Access Evaluation. Multilinguality, Multimodality, and Visual
Analytics, pages 67–75, Rome, Italy. Springer Berlin Heidelberg.
McNamee, P. and Mayfield, J. (2004).
Character N-Gram Tokenization for European Language Text Retrieval.
In Information Retrieval Proceedings, volume 7, pages 73–97. Kluwer Academic
Publishers.
Muhr, M., Kern, R., Zechner, M., and Granitzer, M. (2010).
External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and
Segmentation System - Lab Report for PAN at CLEF 2010.
In Braschler, M., Harman, D., and Pianta, E., editors, CLEF Notebook, Padua,
Italy.
Pataki, M. (2012).
A New Approach for Searching Translated Plagiarism.
In Proceedings of the 5th International Plagiarism Conference, pages 49–64,
Newcastle, UK.
Pinto, D., Civera, J., Juan, A., Rosso, P., and Barrón-Cedeño, A. (2009).
A Statistical Approach to Crosslingual Natural Language Tasks.
In CEUR Workshop Proceedings, volume 64 of Journal of Algorithms, pages 51–60.
Potthast, M., Barrón-Cedeño, A., Stein, B., and Rosso, P. (2011).
Cross-Language Plagiarism Detection.
In Language Resources and Evaluation, volume 45, pages 45–62.
Potthast, M., Stein, B., and Anderka, M. (2008).
A Wikipedia-Based Multilingual Retrieval Model.
In 30th European Conference on IR Research (ECIR’08), volume 4956 of
Lecture Notes in Computer Science, pages 522–530, Glasgow, Scotland. Springer.
Quinlan, J. R. (1993).
C4.5: Programs for Machine Learning.
The Morgan Kaufmann series in machine learning. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA.
Sérasset, G. (2015).
DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF.
In Semantic Web Journal (special issue on Multilingual Linked Open Data),
volume 6, pages 355–361.

Mais conteúdo relacionado

Mais procurados

Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimEdgar Marca
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Word2vec algorithm
Word2vec algorithmWord2vec algorithm
Word2vec algorithmAndrew Koo
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopiwan_rg
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modelingSerhii Havrylov
 
Interview presentation
Interview presentationInterview presentation
Interview presentationJoseph Gubbins
 

Mais procurados (18)

Word2Vec
Word2VecWord2Vec
Word2Vec
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensim
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Word2vec algorithm
Word2vec algorithmWord2vec algorithm
Word2vec algorithm
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Understanding GloVe
Understanding GloVeUnderstanding GloVe
Understanding GloVe
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling
 
Interview presentation
Interview presentationInterview presentation
Interview presentation
 

Semelhante a Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism Detection

Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...Association for Computational Linguistics
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiHiroyuki Miyoshi
 
Combining Vocabulary Alignment Techniques
Combining Vocabulary Alignment TechniquesCombining Vocabulary Alignment Techniques
Combining Vocabulary Alignment Techniquespancsurka
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsDimitris Kontokostas
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...csandit
 
Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationGennadi Lembersky
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
Sentence analysis
Sentence analysisSentence analysis
Sentence analysiskrukob9
 
Bangla spell checker & suggestion generator
Bangla spell checker & suggestion generatorBangla spell checker & suggestion generator
Bangla spell checker & suggestion generatorMdAlAmin187
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationijnlc
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsTae Hwan Jung
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
 
Doc format.
Doc format.Doc format.
Doc format.butest
 
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...csandit
 
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...cscpconf
 

Semelhante a Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism Detection (20)

Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
 
ijcai11
ijcai11ijcai11
ijcai11
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshi
 
Combining Vocabulary Alignment Techniques
Combining Vocabulary Alignment TechniquesCombining Vocabulary Alignment Techniques
Combining Vocabulary Alignment Techniques
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism Detection

  • 1. Using Word Embedding for Cross-Language Plagiarism Detection Authors Jérémy Ferrero Frédéric Agnès Laurent Besacier Didier Schwab Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 1
  • 2. What is Cross-Language Plagiarism Detection? Cross-language plagiarism is plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically). Given a text in a language L, we must find similar passage(s) among a set of candidate texts in another language L’ (cross-language textual similarity).
  • 3. Why is it so important?
      Sources:
          - McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School.
          - Josephson Institute. (2011). What would honest Abe Lincoln say?
  • 4. Research Questions
      • Are word embeddings useful for cross-language plagiarism detection?
      • Is syntax weighting in distributed representations of sentences useful for text entailment?
      • Are cross-language plagiarism detection methods complementary?
  • 5. State-of-the-Art Methods
      • MT-Based Models: Translation + Monolingual Analysis [Muhr et al., 2010, Barrón-Cedeño, 2012]
      • Comparable Corpora-Based Models: CL-KGA, CL-ESA [Gabrilovich and Markovitch, 2007, Potthast et al., 2008]
      • Parallel Corpora-Based Models: CL-ASA [Barrón-Cedeño et al., 2008, Pinto et al., 2009], CL-LSI, CL-KCCA
      • Dictionary-Based Models: CL-VSM, CL-CTS [Gupta et al., 2012, Pataki, 2012]
      • Syntax-Based Models: Length Model, CL-CnG [Mcnamee and Mayfield, 2004, Potthast et al., 2011], Cognateness
  • 6. Augmented CL-CTS: we use DBNary [Sérasset, 2015] as the linked lexical resource.
  • 9. CL-WES: Cross-Language Word Embedding-based Similarity. This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec).
  • 12. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity. This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec).
  • 16. Evaluation Dataset [Ferrero et al., 2016]
      • French, English and Spanish;
      • Parallel and comparable (mix of Wikipedia, conference papers, product reviews, Europarl and JRC);
      • Different granularities: document level, sentence level and chunk level;
      • Human- and machine-translated texts;
      • Obfuscated (to make similarity detection harder) and without added noise;
      • Written and translated by multiple types of authors;
      • Covers various fields.
      [Ferrero et al., 2016]: A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of LREC 2016. https://github.com/FerreroJeremy/Cross-Language-Dataset
  • 17. Evaluation Protocol
      • We compare each English textual unit to its corresponding French unit and to 999 other randomly selected units;
      • We threshold the resulting distance matrix to find the threshold giving the best F1 score;
      • We repeat these two steps 10 times, yielding 10 folds:
          - 2 folds for tuning (CL-WESS) and fusion (Decision Tree);
          - 8 folds for validation.
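The threshold search in the protocol above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes each candidate pair has already been given a similarity score (higher means more similar, so a distance would have to be negated or inverted first) and a gold label, and the function name `best_f1_threshold` is hypothetical.

```python
def best_f1_threshold(scores, labels):
    """Sweep every observed score as a threshold and keep the one with the best F1.

    Assumption: a pair is predicted as a match when its score >= threshold.
    scores: list of floats, labels: list of booleans (True = real match).
    Returns (threshold, f1); threshold is None if no threshold yields a true positive.
    """
    best_threshold, best_f1 = None, 0.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        if tp == 0:
            continue  # precision and recall are both zero, F1 is zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
    toy_labels = [True, True, False, True, False]
    print(best_f1_threshold(toy_scores, toy_labels))
```

In the protocol described on the slide, such a search would be run once per fold, with the tuning folds kept separate from the validation folds.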
  • 18. Results
      Table: Average F1 scores of methods applied on EN→FR sub-corpora.

      Overall (%)       Chunk-Level   Sentence-Level
      State-of-the-Art Methods
      CL-C3G            50.76         49.34
      CL-CTS            42.84         47.50
      CL-ASA            47.32         35.81
      CL-ESA            14.81         14.44
      T+MA              37.12         37.42
      New Proposed Methods
      CL-CTS-WE         46.67         50.69
      CL-WES            41.95         41.43
      CL-WESS           53.73         56.35
      Decision Tree     89.15         88.50

      • CL-CTS-WE boosts CL-CTS (+3.83 points on chunks and +3.19 points on sentences);
      • CL-WESS boosts CL-WES (+11.78 points on chunks and +14.92 points on sentences);
      • CL-WESS is better than CL-C3G (+2.97 points on chunks and +7.01 points on sentences);
      • Decision Tree fusion significantly improves the results.

      Legend:
          - CL-C3G: Cross-Language Character 3-Gram
          - CL-CTS-WE: Cross-Language Conceptual Thesaurus-based Similarity with Word-Embedding
          - CL-WES: Cross-Language Word Embedding-based Similarity
          - CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
          - Decision Tree fusion: C4.5 [Quinlan, 1993]
  • 22. Conclusion
      • Augmentation of several baseline approaches using word embeddings instead of lexical resources;
      • Overall, CL-WESS outperforms the previous best state-of-the-art methods;
      • The methods are complementary, and their fusion significantly improves cross-language textual similarity detection performance;
      • Winning method at SemEval-2017 Task 1 track 4a, i.e. the task on Spanish-English Cross-lingual Semantic Textual Similarity detection.
  • 23. Thank you for your attention. Do you have any questions?
      Contact: jeremy.ferrero@compilatio.net, @FerreroJeremy, github.com/FerreroJeremy, fr.linkedin.com/in/FerreroJeremy, researchgate.net/profile/Jeremy_Ferrero
  • 24. Augmented CL-CTS
      • CL-CTS-WE uses the 10 closest words in the embedding model to build the bag of words (BOW) of a word;
      • The BOW of a sentence is a merge of the BOWs of its words;
      • Two sentences are compared using the Jaccard distance between their BOWs.
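The three steps above can be sketched in a few lines. This is an illustrative sketch only: the `neighbours` dictionary stands in for a nearest-neighbour lookup in a bilingual embedding model and is hypothetical, and keeping the word itself in its own BOW is an assumption, not something the slide states.

```python
def sentence_bow(words, neighbours, top_n=10):
    """Build the bag of words of a sentence by merging the BOW of each word.

    neighbours: hypothetical mapping word -> list of closest words, nearest first.
    Assumption: each word is also kept in its own BOW.
    """
    bow = set()
    for word in words:
        bow.add(word)
        bow.update(neighbours.get(word, [])[:top_n])
    return bow


def jaccard_distance(bow_a, bow_b):
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B| (0.0 for two empty sets)."""
    union = bow_a | bow_b
    if not union:
        return 0.0
    return 1.0 - len(bow_a & bow_b) / len(union)


if __name__ == "__main__":
    # Toy neighbour lists, invented for illustration.
    toy_neighbours = {
        "cat": ["chat", "felin"],
        "chat": ["cat", "felin"],
    }
    bow_en = sentence_bow(["cat"], toy_neighbours)
    bow_fr = sentence_bow(["chat"], toy_neighbours)
    print(jaccard_distance(bow_en, bow_fr))
```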
  • 25. CL-WES: Cross-Language Word Embedding-based Similarity
      The similarity between two sentences $S$ and $S'$ is the cosine similarity between their vectors $V$ and $V'$, built as follows:
      $$V(S) = \sum_{i=1}^{|S|} \mathrm{vector}(u_i) \qquad (1)$$
      • $u_i$ is the $i$-th word of $S$;
      • $\mathrm{vector}$ is the function which gives the word embedding vector of a word.
      This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec).
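Equation (1) followed by a cosine comparison could look like the sketch below. It does not use the MultiVec API: the `EMBEDDINGS` dictionary is a made-up two-dimensional bilingual space, and skipping out-of-vocabulary words is an assumption of this example.

```python
import math
from typing import Dict, List, Sequence


def sentence_vector(words: Sequence[str], embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """V(S): sum of the embedding vectors of the words of S (Equation 1)."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        total = [t + x for t, x in zip(total, vec)]
    return total


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical bilingual embeddings sharing one 2-dimensional space.
EMBEDDINGS = {
    "cat": [1.0, 0.0], "chat": [0.9, 0.1],
    "sleeps": [0.0, 1.0], "dort": [0.1, 0.9],
    "car": [-1.0, 0.2],
}

v_en = sentence_vector(["cat", "sleeps"], EMBEDDINGS, dim=2)
v_fr = sentence_vector(["chat", "dort"], EMBEDDINGS, dim=2)
v_other = sentence_vector(["car"], EMBEDDINGS, dim=2)

sim_translation = cosine_similarity(v_en, v_fr)
sim_unrelated = cosine_similarity(v_en, v_other)
```

In this toy space the translated pair scores close to 1, while the unrelated sentence scores much lower.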
  • 26. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
      $$V(S) = \sum_{i=1}^{|S|} \mathrm{weight}(\mathrm{pos}(u_i)) \cdot \mathrm{vector}(u_i) \qquad (2)$$
      • $u_i$ is the $i$-th word of $S$;
      • $\mathrm{pos}$ is the function which gives the universal part-of-speech tag of a word;
      • $\mathrm{weight}$ is the function which gives the weight of a part-of-speech;
      • $\mathrm{vector}$ is the function which gives the word embedding vector of a word;
      • $\cdot$ is the multiplication of a vector by a scalar (the part-of-speech weight).
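Equation (2) differs from Equation (1) only by the part-of-speech weight applied to each word vector. The sketch below illustrates that difference under stated assumptions: the tags are supplied by hand instead of a tagger, and the `POS_WEIGHTS` values and the embeddings are invented for the example (the slides do not give the actual weights).

```python
import math
from typing import Dict, List, Sequence, Tuple


def weighted_sentence_vector(tagged_words: Sequence[Tuple[str, str]],
                             embeddings: Dict[str, List[float]],
                             pos_weights: Dict[str, float],
                             dim: int) -> List[float]:
    """V(S): sum over words of weight(pos(u_i)) * vector(u_i) (Equation 2)."""
    total = [0.0] * dim
    for word, pos in tagged_words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        weight = pos_weights.get(pos, 1.0)  # assumption: unknown tags weigh 1.0
        total = [t + weight * x for t, x in zip(total, vec)]
    return total


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is null)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical weights per universal POS tag and hypothetical embeddings.
POS_WEIGHTS = {"NOUN": 1.0, "VERB": 0.8, "DET": 0.1}
EMBEDDINGS = {
    "the": [1.0, 1.0], "cat": [1.0, 0.0],
    "le": [0.9, 1.1], "chat": [0.9, 0.1],
}

v_en = weighted_sentence_vector([("the", "DET"), ("cat", "NOUN")],
                                EMBEDDINGS, POS_WEIGHTS, dim=2)
v_fr = weighted_sentence_vector([("le", "DET"), ("chat", "NOUN")],
                                EMBEDDINGS, POS_WEIGHTS, dim=2)
similarity = cosine_similarity(v_en, v_fr)
```

With a low weight on determiners, the sentence vector is dominated by the noun, which is the intended effect of the syntax weighting.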
  • 27. Complementarity
      Figure: Distribution histograms of CL-CNG (left) and CL-ASA (right) for 1000 positive and 1000 negative (mis)matches.
  • 28. Fusions
      • Weighted Average Fusion
      • Decision Tree Fusion: C4.5 [Quinlan, 1993]
  • 29. Weighted Fusion
      $$\mathrm{fus}(M) = \frac{\sum_{j=1}^{|M|} w_j \, m_j}{\sum_{j=1}^{|M|} w_j} \qquad (3)$$
      • $M$ is the set of the scores of the methods for one match;
      • $m_j$ and $w_j$ are the score and the weight of the $j$-th method respectively.
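Equation (3) is a weighted mean of the per-method scores. A minimal sketch, in which the scores and weights are arbitrary illustration values rather than figures from the experiments:

```python
from typing import Sequence


def weighted_fusion(scores: Sequence[float], weights: Sequence[float]) -> float:
    """fus(M): sum of w_j * m_j divided by the sum of w_j (Equation 3)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * m for w, m in zip(weights, scores)) / total_weight


# Two hypothetical method scores for one candidate match.
fused = weighted_fusion([0.5, 0.8], [1.0, 3.0])    # second method counts three times more
average = weighted_fusion([0.5, 0.8], [1.0, 1.0])  # equal weights give the plain average
```

With equal weights the formula reduces to the plain average of the scores.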
  • 30. Results at Chunk-Level
      Table: Average F1 scores (%) of cross-language similarity detection methods applied on chunk-level EN→FR sub-corpora (8-fold validation).

      Method            Wikipedia   TALN    JRC     APR     Europarl   Overall
      CL-C3G            63.04       40.80   36.80   80.69   53.26      50.76
      CL-CTS            58.05       33.66   30.15   67.88   45.31      42.84
      CL-ASA            23.70       23.24   33.06   26.34   55.45      47.32
      CL-ESA            64.86       23.73   13.91   23.01   13.98      14.81
      T+MA              58.26       38.90   28.81   73.25   36.60      37.12
      CL-CTS-WE         58.00       38.04   31.73   73.13   49.91      46.67
      CL-WES            37.53       21.70   32.96   39.14   46.01      41.95
      CL-WESS           52.68       34.49   45.00   56.83   57.06      53.73
      Average fusion    81.34       65.78   61.87   91.87   79.77      75.82
      Weighted fusion   84.61       69.69   67.02   94.38   83.74      80.01
      Decision Tree     95.25       74.10   72.19   97.05   95.16      89.15
  • 31. Results at Sentence-Level
      Table: Average F1 scores (%) of cross-language similarity detection methods applied on sentence-level EN→FR sub-corpora (8-fold validation).

      Method            Wikipedia   TALN    JRC     APR     Europarl   Overall
      CL-C3G            48.24       48.19   36.85   61.30   52.70      49.34
      CL-CTS            46.71       38.93   28.38   51.43   53.35      47.50
      CL-ASA            27.68       27.33   34.78   25.95   36.73      35.81
      CL-ESA            50.89       14.41   14.45   14.18   14.09      14.44
      T+MA              50.39       37.66   32.31   61.95   37.70      37.42
      CL-CTS-WE         47.26       43.93   31.63   57.85   56.39      50.69
      CL-WES            28.48       24.37   33.99   39.10   44.06      41.43
      CL-WESS           45.65       40.45   48.64   58.08   58.84      56.35
      Decision Tree     80.45       80.89   72.70   78.91   94.04      88.50
  • 32. References I
      Barrón-Cedeño, A. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism. PhD thesis, València, Spain.
      Barrón-Cedeño, A., Rosso, P., Pinto, D., and Juan, A. (2008). On Cross-lingual Plagiarism Analysis using a Statistical Model. In Stein, B., Stamatatos, E., and Koppel, M., editors, Proceedings of the ECAI’08 PAN Workshop: Uncovering Plagiarism, Authorship and Social Software Misuse, pages 9–13, Patras, Greece.
  • 33. References II
      Berard, A., Servan, C., Pietquin, O., and Besacier, L. (2016). MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4188–4192, Portoroz, Slovenia. European Language Resources Association (ELRA).
      Ferrero, J., Agnès, F., Besacier, L., and Schwab, D. (2016). A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4162–4169, Portoroz, Slovenia. European Language Resources Association (ELRA).
  • 34. References III
      Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI’07), pages 1606–1611, Hyderabad, India. Morgan Kaufmann Publishers Inc.
      Gupta, P., Barrón-Cedeño, A., and Rosso, P. (2012). Cross-language High Similarity Search using a Conceptual Thesaurus. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, pages 67–75, Rome, Italy. Springer Berlin Heidelberg.
  • 35. References IV
      Mcnamee, P. and Mayfield, J. (2004). Character N-Gram Tokenization for European Language Text Retrieval. In Information Retrieval Proceedings, volume 7, pages 73–97. Kluwer Academic Publishers.
      Muhr, M., Kern, R., Zechner, M., and Granitzer, M. (2010). External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for PAN at CLEF 2010. In Braschler, M., Harman, D., and Pianta, E., editors, CLEF Notebook, Padua, Italy.
  • 36. References V
      Pataki, M. (2012). A New Approach for Searching Translated Plagiarism. In Proceedings of the 5th International Plagiarism Conference, pages 49–64, Newcastle, UK.
      Pinto, D., Civera, J., Juan, A., Rosso, P., and Barrón-Cedeño, A. (2009). A Statistical Approach to Crosslingual Natural Language Tasks. In CEUR Workshop Proceedings, volume 64 of Journal of Algorithms, pages 51–60.
      Potthast, M., Barrón-Cedeño, A., Stein, B., and Rosso, P. (2011). Cross-Language Plagiarism Detection. In Language Resources and Evaluation, volume 45, pages 45–62.
  • 37. References VI
      Potthast, M., Stein, B., and Anderka, M. (2008). A Wikipedia-Based Multilingual Retrieval Model. In 30th European Conference on IR Research (ECIR’08), volume 4956 of Lecture Notes in Computer Science (LNCS), pages 522–530, Glasgow, Scotland. Springer.
      Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. The Morgan Kaufmann series in machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  • 38. References VII
      Sérasset, G. (2015). DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF. In Semantic Web Journal (special issue on Multilingual Linked Open Data), volume 6, pages 355–361.