What is Word2Vec?
Traian Rebedea
Bucharest Machine Learning reading group
25-Aug-15
Intro
• About n-grams: “simple models trained on huge
amounts of data outperform complex systems
trained on less data”
• Solution: “possible to train more complex models
on much larger data set, and they typically
outperform the simple models”
• Why? “neural network based language models
significantly outperform N-gram models”
• How? “distributed representations of words”
(Hinton, 1986 – not discussed today)
Goal
• “learning high-quality word vectors from huge
data sets with billions of words, and with
millions of words in the vocabulary”
• Resulting word representations
– Similar words tend to be close to each other
– Words can have multiple degrees of similarity
Previous work
• Representation of words as continuous vectors
• Neural network language model (NNLM) (Bengio et al., 2003 – not
discussed today)
• Mikolov previously proposed (MSc thesis, PhD thesis, other papers)
to first learn word vectors “using neural network with a single
hidden layer” and then train the NNLM independently
• Word2vec directly extends this work => “word vectors learned using
a simple model”
• These word vectors were useful in various NLP applications
• Many architectures and models have been proposed for computing
these word vectors (e.g. see Socher’s Stanford group work which
resulted in GloVe - http://nlp.stanford.edu/projects/glove/)
• “these architectures were significantly more computationally
expensive for training than” word2vec (in 2013)
Model Architectures
• Some “classic” NLP methods for estimating continuous
representations of words
– LSA (Latent Semantic Analysis)
– LDA (Latent Dirichlet Allocation)
• Distributed representations of words learned by
neural networks outperform LSA on various tasks
that require preserving linear regularities among
words
• LDA is computationally expensive and cannot be
trained on very large datasets
Model Architectures
• Feedforward Neural Net Language Model
(NNLM)
Model Architectures
• Recurrent Neural Net Language Model
(RNNLM)
• Simple Elman RNN
Word2vec (log-linear) Models
• In the previous models, most of the complexity comes from the
non-linear hidden layer of the model
• Explore simpler models
– Not able to represent the data as precisely as neural networks
– Can be trained on more data
• In earlier work, Mikolov found that a “neural network
language model can be successfully trained in two
steps”:
– Continuous word vectors are learned using a simple model
– The N-gram NNLM is trained on top of these distributed
representations of words
Continuous BoW (CBOW) Model
• Similar to the feedforward NNLM, but
• Non-linear hidden layer removed
• Projection layer shared for all words
– Not just the projection matrix
• Thus, all words get projected into the same position
– Their vectors are just averaged
• Called CBOW (continuous BoW) because the order of
the words is lost
• Another modification is to use words from past and
from future (window centered on current word)
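To make the projection step concrete, here is a minimal sketch of a CBOW forward pass in NumPy; the matrix names, vocabulary size and dimensionality are illustrative assumptions, not the original implementation:

import numpy as np

V, N = 10000, 300                      # vocabulary size and embedding dimension (illustrative)
W_in = 0.01 * np.random.randn(V, N)    # input/projection matrix: one row per vocabulary word
W_out = 0.01 * np.random.randn(N, V)   # output matrix used by the softmax classifier

def cbow_forward(context_ids):
    # Shared projection: the context word vectors are simply averaged,
    # so the order of the words inside the window is lost ("continuous BoW").
    h = W_in[context_ids].mean(axis=0)         # hidden/projection vector, shape (N,)
    scores = h @ W_out                         # one unnormalized score per vocabulary word
    scores -= scores.max()                     # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                               # p(current word | averaged context)

probs = cbow_forward([17, 42, 256, 1024])      # ids of the words around the current position
predicted_word_id = int(np.argmax(probs))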
CBOW Model
Continuous Skip-gram Model
• Similar to CBOW, but instead of predicting the current word based
on the context, it tries to maximize the classification of a word
based on another word in the same sentence
• Thus, uses each current word as an input to a log-linear classifier
• Predicts words within a certain window
• Observations
– Larger window size => better quality of the resulting word vectors,
higher training time
– More distant words are usually less related to the current word than
those close to it
– Give less weight to the distant words by sampling less from those
words in the training examples
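A small sketch of how (input, context) training pairs can be generated with the dynamic window just described; the tokenized corpus and the maximum window size are assumptions for illustration:

import random

def skipgram_pairs(tokens, max_window=5, seed=0):
    """Yield (input word, context word) training pairs with a per-position random window."""
    rng = random.Random(seed)
    for i, center in enumerate(tokens):
        # Drawing r uniformly from 1..max_window means a context word at distance d
        # is kept with probability (max_window - d + 1) / max_window, so more distant
        # words are sampled less often, as described above.
        r = rng.randint(1, max_window)
        for j in range(max(0, i - r), min(len(tokens), i + r + 1)):
            if j != i:
                yield center, tokens[j]

pairs = list(skipgram_pairs("the quick brown fox jumps over the lazy dog".split()))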
Continuous Skip-gram Model
Results
• Training high dimensional word vectors on a large amount of
data captures “subtle semantic relationships between words”
– Mikolov has made a similar observation for the previous models he
has proposed (e.g. the RNN model, see Mikolov, T., Yih, W. T., & Zweig, G. (2013, June).
Linguistic Regularities in Continuous Space Word Representations. In HLT-NAACL (pp. 746-751).)
Results
• “Comprehensive test set that contains five types of
semantic questions, and nine types of syntactic questions”
– 8869 semantic questions
– 10675 syntactic questions
• “For example, we made a list of 68 large American
cities and the states they belong to, and formed about 2.5K
questions by picking two word pairs at random.”
• Methodology
• Input: ”What is the word that is similar to small in the same
sense as biggest is similar to big?”
• Compute: X = vector("biggest") – vector("big") +
vector("small") and then find closest word to X using cosine
Results
• Other results are reported as well
Skip-gram Revisited
• Formally, given a training sequence of words, the skip-gram model
maximizes the average log probability (objective reconstructed after this list)
• Where T = size of the sequence (number of words considered for training)
• c = window/context size
• Mikolov also notes that for each word the model only
uses a random window size r = random(1..c)
– This way, words that are closer to the “input” word have a
higher probability of being used in training than words that
are more distant
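The objective referred to above was shown as an equation image on the original slide; in Mikolov et al. (2013) it is the average log probability to be maximized:

\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)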
Skip-gram Revisited
• As already seen, p(w_{t+j} | w_t) should be the output
of a classifier (e.g. a softmax)
• v_{w_I} is the “input” vector representation of word w_I
• v'_{w_O} is the “output” (or “context”) vector
representation of word w_O
• Computing log p(w_O | w_I) takes O(W) time, where
W is the vocabulary size
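In the basic formulation, this probability is a full softmax over the vocabulary (the equation was an image on the original slide):

p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}

The normalization over all W words is what makes the naive computation expensive.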
Skip-gram Alternative View
• (This slide walked through the training objective as a sequence of equations shown as images, which did not survive in this text version; cf. the derivation in Goldberg & Levy, 2014, listed in the references.)
Skip-gram Improvements
• Hierarchical softmax
• Negative sampling
• Subsampling of frequent words
Hierarchical Softmax
• Computationally efficient approximation of
the softmax
• With W output nodes, only about log2(W) nodes need
to be evaluated to obtain the softmax
probability distribution
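Concretely, the vocabulary is arranged as a binary tree (a Huffman tree in word2vec) and the probability of a word is a product of binary decisions along its root-to-leaf path; in the notation of Mikolov et al. (2013):

p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![ n(w, j+1) = \mathrm{ch}(n(w,j)) ]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)

where n(w, j) is the j-th node on the path from the root to w, L(w) is the path length, ch(n) is an arbitrary fixed child of n, and [[x]] is 1 if x is true and -1 otherwise.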
Negative Sampling
• Noise Contrastive Estimation (NCE) is an alternative to
hierarchical softmax
• NCE – “a good model should be able to differentiate data from
noise by means of logistic regression”
• “While NCE can be shown to approximately maximize the log
probability of the softmax, the Skip-gram model is only
concerned with learning high-quality vector representations,
so we are free to simplify NCE as long as the vector
representations retain their quality. We define Negative
sampling (NEG) by the objective”:
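The NEG objective from the paper, for an observed pair (w_I, w_O) and k negative samples drawn from a noise distribution P_n(w), is:

\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]

with k typically 5-20 for small training sets and 2-5 for large ones, and P_n(w) chosen as the unigram distribution raised to the 3/4 power.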
Subsampling of Frequent Words
• Frequent words provide less information value than rare words
• Moreover, “the vector representations of frequent words
do not change significantly after training on several
million examples”
• Each word wi in the training set is discarded with a
probability depending on the frequency of the word
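The discard probability used in the paper (the formula was an image on the slide) is:

P(\mathrm{discard}\; w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}

where f(w_i) is the frequency of word w_i and t is a threshold around 10^{-5} (chosen empirically; see the editor's note at the end).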
Other remarks
• Mikolov also developed a method to extract relevant
n-grams (bigrams and trigrams) using something
similar to PMI
• Effects of improvements
• Vectors can also be summed
Other Applications
• Dependency-based contexts
• Word2vec for machine translation
Dependency-based Contexts
• Levy and Goldberg, 2014: propose to use
dependency-based contexts instead of linear BoW
contexts (windows of size k)
Dependency-based Contexts
• Why?
– Syntactic dependencies are “more inclusive and more
focused” than BoW
– Capture relations to words that are far apart and that
are not used by small window BoW
– Remove “coincidental contexts which are within the
window but not directly related to the target word”
• A possible problem
– Dependency parsing is still somewhat computationally
expensive
– However, the English Wikipedia can be parsed on a small
cluster and the results can then be persisted
Dependency-based Contexts
• Examples of syntactic contexts
Dependency-based Contexts
• Comparison with BoW word2vec
Dependency-based Contexts
• Evaluation on the WordSim353 dataset with pairs of similar words
– Relatedness (topical similarity)
– Similarity (functional similarity)
– Both (these pairs have been ignored)
• Task: “rank the similar pairs in the dataset above the related ones”
• Simple ranking: Pairs ranked by cosine similarity of the embedded words
Dependency-based Contexts
• Main conclusion
• Dependency-based context is more useful to
capture functional similarities (e.g. similarity)
between words
• Linear BoW context is more useful to capture
topical similarities (e.g. relatedness) between
words
– The larger the size of the window, the better it
captures related concepts
• Therefore, dependency-based contexts would
perform poorly in analogy experiments
Estimating Similarities Across
Languages
• Given a set of word pairs in two languages (or from different
types of corpora) and their associated vector
representations (x_i and z_i)
• The two embedding spaces can even have different dimensions (d_1 and d_2)
• Find a transformation matrix W (of size d_2 × d_1) such that W x_i
approximates z_i as closely as possible, for all pairs i
• Solved using stochastic gradient descent
• The transformation is seen as a linear transformation
(rotation and scaling) between the two spaces
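The training objective from Mikolov, Le & Sutskever (2013) is a least-squares problem:

\min_{W} \sum_{i=1}^{n} \left\| W x_i - z_i \right\|^2

which the paper optimizes with stochastic gradient descent (the same problem also admits a closed-form least-squares solution).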
Estimating Similarities Across
Languages
• The authors also illustrate this with a manual rotation (between
English and Spanish) and a 2D-PCA visualization
Estimating Similarities Across
Languages
• The most frequent 5K words from the source
language and their translations obtained with Google
Translate (GT) are used as training data for learning the
Translation Matrix
• The next 1K words in the source language
and their translations are used as a test set
Estimating Similarities Across
Languages
• Very simple baselines
More Explanations
• CBOW model with a single input word
Update Equations
• Maximize the conditional probability of observing the actual
output word w_O (denote its index in the output layer as j)
given the input context word w_I, with respect to the weights
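A reconstruction of the missing equations, loosely following Rong (2014): with a single input word the hidden layer is simply that word's input vector, u_{j'} = {v'_{w_{j'}}}^{\top} h is the score of the j'-th vocabulary word, and j is the index of the actual output word w_O:

h = v_{w_I}, \qquad p(w_O \mid w_I) = y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}

so the loss to minimize is E = -\log p(w_O \mid w_I) = -u_j + \log \sum_{j'=1}^{V} \exp(u_{j'}), and its gradients with respect to v'_{w_{j'}} and v_{w_I} give the update equations for the output and input vectors.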
CBOW with Larger Context
Skip-gram Model
• The context and the input word swap roles compared to CBOW: the input word now predicts the context
Skip-gram Model
More…
• Why does word2vec work?
• It seems that SGNS (skip-gram negative sampling) is actually performing a
(weighted) implicit matrix factorization
• The matrix is using the PMI between words and contexts
• PMI and implicit matrix factorizations have been widely used in NLP
• It is interesting that the PMI matrix emerges as the optimal solution for
SGNS’s objective
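Concretely, Levy & Goldberg (2014) show that, with a sufficiently large dimensionality, the SGNS objective is optimized when for every word w and context c

\vec{w} \cdot \vec{c} = \mathrm{PMI}(w, c) - \log k

i.e. SGNS implicitly factorizes the word-context PMI matrix shifted by \log k, where k is the number of negative samples.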
Final
• “PMI matrices are commonly used by the traditional approach
to represent words (often dubbed "distributional semantics").
What's really striking about this discovery, is that word2vec
(specifically, SGNS) is doing something very similar to what
the NLP community has been doing for about 20 years; it's
just doing it really well.”
Omer Levy - http://www.quora.com/How-does-word2vec-work
References
Word2vec & related papers:
• Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector
space. arXiv preprint arXiv:1301.3781.
• Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and
phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
• Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic Regularities in Continuous Space Word Representations.
In HLT-NAACL (pp. 746-751).
Explanations
• Rong, X. (2014). word2vec Parameter Learning Explained. arXiv preprint arXiv:1411.2738.
• Goldberg, Y., & Levy, O. (2014). word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding
method. arXiv preprint arXiv:1402.3722.
• Levy, O., & Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural
Information Processing Systems (pp. 2177-2185).
• Dyer, C. (2014). Notes on Noise Contrastive Estimation and Negative Sampling. arXiv preprint arXiv:1410.8251.
Applications of word2vec
• Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv
preprint arXiv:1309.4168.
• Levy, O., & Goldberg, Y. (2014). Dependency based word embeddings. In Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics (Vol. 2, pp. 302-308).
Editor's Notes

1. Where t = 10^-5, chosen empirically (this refers to the subsampling threshold t).