Word2vec from scratch
11/10/2015
Jinpyo Lee
KAIST
Contents
• Introduction
• Previous Methods for Representing Words
• Word2Vec
• Extensions of the skip-gram model / Learning Phrases / Additive
Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References
Introduction
• Example of NLP processing
• EASY
• Spell Checking (e.g. "chekcing" → "checking")
• Keyword Search (Ctrl+F)
• Finding Synonyms
• MEDIUM
• Parsing information from documents, the web, etc.
• HARD
• Machine Translation (e.g. Translate Korean to English)
• Semantic Analysis (e.g. What’s meaning of this query?)
• Co-reference (e.g. What does "it" refer to in this sentence?)
• Question Answering (e.g. IBM Watson)
Introduction
• BUT, most important is
how we represent words
as input for all NLP tasks.
Introduction
• BUT, most important is
how we represent the meaning of words
as input for all NLP tasks.
• At first, most NLP treated words as ATOMIC symbols
• They needed a notion of similarity & difference
• So,
• WordNet: a taxonomy with hypernym (is-a)
relationships and synonym sets
Simple example of WordNet showing synonyms and antonyms
Prev. Methods for Representing Words
- Discrete Representation
• COOL! (see also, Semantic Web)
• Great resource, but missing nuances
Expert == Good? Usually?
→ Probably NO!
* Synonym set of good using nltk lib (CS224d-Lecture note)
How about new words?
: Wicked, ace, wizard, genius, ninja
- Discrete Representation
Prev. Methods for Representing Words
• COOL! (see also, Semantic Web)
• Great resource, but missing nuances
* Synonym set of good using nltk lib (CS224d-Lecture note)
Disadvantage
• Hard to keep up to date
• Requires human labor
• Subjective
• Hard to compute accurate word
similarity
- Discrete Representation
Prev. Methods for Representing Words
• Another problem of discrete representation
• Can't give similarity
• Too sparse
e.g. Horse = [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ]
Zebra = [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
→ "one-hot" representation: typical, simple representation.
All 0s with a single 1
Horse ∩ Zebra
= [ 0 0 0 0 1 0 0 0 0 0 0 0 0 0 ] ∩ [ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ]
= 0 (nothing) (But we know both are mammals)
- Discrete Representation
Mammal
Prev. Methods for Representing Words
• Use neighbors to represent words! (Co-occurrence)
• Conjecture: Words that are related will often appear in the same
documents.
→ A window allows capturing both syntactic and semantic info.
e.g. corpus: "I like deep learning."  "I like NLP."  "I enjoy baseball."
* Co-occurrence Matrix with window size = 1 (CS224d-Lecture note)
("like" co-occurs beside "I" 2 times)
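A minimal sketch of how such a window-based co-occurrence matrix can be built for the toy corpus above (variable names are illustrative, not from the slides):

```python
# Build a window-size-1 co-occurrence matrix for the toy corpus.
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy baseball ."]
tokens = [s.split() for s in corpus]
vocab = sorted({w for sent in tokens for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1  # count each neighbor inside the window

print(vocab)
print(X)  # e.g. row "I", column "like" is 2
```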
Prev. Methods for Representing Words
• Use this matrix for word-embedding (feat. SVD)
• Applying Singular Value Decomposition
for simplicity, SVD: X (co-occurrence matrix) = U·S·Vᵀ
(details can be found in any linear algebra textbook)
• Select k columns from U as k-dimensional word vectors
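A minimal sketch of this SVD step; the toy co-occurrence matrix is rebuilt here so the snippet runs standalone, and the first k columns of U are kept as word vectors:

```python
import numpy as np

# Rebuild the window-1 co-occurrence matrix for the toy corpus (see previous sketch).
corpus = ["I like deep learning .", "I like NLP .", "I enjoy baseball ."]
tokens = [s.split() for s in corpus]
vocab = sorted({w for s in tokens for w in s})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                X[idx[w], idx[sent[j]]] += 1

# SVD: X = U * S * V^T; keep the first k columns of U as k-dimensional word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k]
for w in vocab:
    print(w, word_vectors[idx[w]].round(3))
```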
Prev. Methods for Representing Words
• Result of the SVD-based model (plots of the word vectors for K = 2 and K = 3)
Prev. Methods for Representing Words
• Disadvantage
• Co-occur Matrix is extremely sparse
• Very high dimensional
• Quadratic cost to train (i.e. perform SVD)
• Needs hacks for the imbalance in word frequency
(e.g. "it", "the", "has", etc.)
• Some solutions exist for these problems, but they are not intrinsic
Prev. Methods for Representing Words
Contents
• Introduction
• Previous Methods for Representing Words
• Word2Vec
• Extensions of the skip-gram model / Learning Phrases / Additive
Compositionality & Evaluation
• Conclusion
• Demo
• Discussions
• References
Word2vec (related paper)
• Then how?
Directly learn low-dimensional word vectors, one iteration at a time!
→ Go back to 1986
• Learning representations by back-propagating errors
(Rumelhart et al. 1986)
• A neural probabilistic language model (Bengio et al., 2003)
• NLP from Scratch (Collobert & Weston, 2008)
• Word2Vec (Mikolov et al. 2013)
• Efficient Estimation of Word Representation in Vector Space
• Distributed Representations of words and phrases and their
compositionality
Efficient Estimation of Word
Representation in Vector Space
• Introduces the initial architecture of word2vec (2013)
• Two new models: Continuous Bag-of-Words (CBOW) and Skip-gram
• Empirically shows that these word models have better syntactic and semantic
representations than other models
• Comparison of the two models:
• The skip-gram model works well on semantics, but training is slower.
• The CBOW model works well on syntax, and training is faster.
(P)Review
Word2vec (profile)
• Distributed Representations of words and phrases
and their compositionality
• NIPS 2013 (Submitted on 16 Oct 2013)
• Tomas Mikolov (Facebook, 2014~) et al.
• Includes additional works of “Efficient Estimation of Word
Representation in Vector Space”.
Word2vec (Contents)
• This paper includes,
• Extensions of skip-gram model (fast & accurate)
• Method
• Hierarchical soft-max
• NEG
• Subsampling
• Ability to learn phrases
• Find Additive Compositionality
• Conclusion
• Skip-gram model
• The objective of the skip-gram model is to "find word representations
useful for predicting context words in a sentence."
• Softmax function (see the formulas reproduced below)
• …
Extension of Skip-Gram
T: total number of training words (steps)
c: size of the training context window
w_t, w_{t+j}: the current step's word and its j-th context word
BUT, without understanding the
original model, we are
..going to.. fall ...asleep..
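For reference, the skip-gram objective and the full softmax it uses, as given in Mikolov et al. (2013) (the formulas behind the omitted slide images):

```latex
% Average log probability the skip-gram model maximizes over the corpus:
\frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t)
% where the basic model defines p with a full softmax over the vocabulary of size W:
p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}
                       {\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}
```

Here v_w and v'_w are the "input" and "output" vector representations of w; the hierarchical softmax and negative sampling discussed next are cheaper replacements for this denominator.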
Example
CBOW (Original)
• Continuous Bag-of-Words model
• Idea: Using the context words, we can predict the center word
i.e. Probability( "It is ( ? ) to finish" → "time" )
• Represent each word as a distributed vector of probabilities → low dimension
• Goal: Train weight matrices (W) that satisfy the objective below
• Loss function (using the cross-entropy method)
argmax_W { Minimize ‖ time − softmax( pr( time | it, is, to, finish ; W ) ) ‖ }
* Softmax(): maps a K-dim real vector x to a K-dim vector with entries in (0,1)
E = − log p(w_t | w_{t−C} .. w_{t+C})
context words (window_size=2)
CBOW (Original)
• Continuous Bag-of-Words model
• Input: "one-hot" word vectors for the context words ("It", "is", "to", "finish")
• Remove the nonlinear hidden layer
• Back-propagate the error from the
output layer to the weight matrices
(adjust the Ws)
• Architecture (diagram on slide):
h = average of W_in · x_i over the context words   [N×V]·[V×1] → [N×1]
y(predicted) = W_out^T · h (then softmax)           [V×N]·[N×1] → [V×1]
compare y(predicted) with y(true), the one-hot vector for "time",
and back-propagate to minimize the error: W_in(old), W_out(old) → W_in(new), W_out(new)
• Notation:
W_in, W_out ∈ ℝ^(n×|V|): input and output weight matrices, n is the word-embedding dimension
x_i, y_i: input and output word vectors (one-hot) from vocabulary V
h: hidden vector, the average of W_in·x over the context words
• The one-hot context vectors are the initial input, not results
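A minimal NumPy sketch of one CBOW training step on the slide's example (an illustrative toy reimplementation, not the authors' word2vec C code):

```python
import numpy as np

vocab = ["it", "is", "time", "to", "finish"]
V, N = len(vocab), 3                          # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(N, V))     # input weight matrix  (n x |V|)
W_out = rng.normal(scale=0.1, size=(N, V))    # output weight matrix (n x |V|)

def one_hot(word):
    x = np.zeros(V)
    x[vocab.index(word)] = 1.0
    return x

context, center = ["it", "is", "to", "finish"], "time"
h = np.mean([W_in @ one_hot(w) for w in context], axis=0)  # hidden vector: avg of W_in * x
scores = W_out.T @ h                                       # [V x N] @ [N x 1] -> [V x 1]
y_pred = np.exp(scores) / np.exp(scores).sum()             # softmax over the vocabulary
loss = -np.log(y_pred[vocab.index(center)])                # E = -log p(w_t | context)

# Back-propagate: gradient w.r.t. the scores, then to W_out and (through h) to W_in.
grad_scores = y_pred - one_hot(center)
grad_h = W_out @ grad_scores
lr = 0.1
W_out -= lr * np.outer(h, grad_scores)
for w in context:
    W_in[:, vocab.index(w)] -= lr * grad_h / len(context)
print("loss before update:", round(float(loss), 4))
```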
• Skip-gram model
• Idea: With the center word,
we can predict the context words
• Mirror of CBOW (and vice versa)
i.e. Probability( "time" → "It is ( ? ) to finish" )
• Loss function:
Skip-Gram (Original)
E = − log p(w_{t−C} .. w_{t+C} | w_t)
• Architecture (diagram on slide): the center word "time" is the one-hot input x_i;
h = W_in · x_i   [N×V]·[V×1] → [N×1]
each context word ("It", "is", "to", "finish") is predicted as y_i from W_out^T · h   [V×N]·[N×1] → [V×1]
errors update W_in(old), W_out(old) → W_in(new), W_out(new)
(compare CBOW: E = − log p(w_t | w_{t−C} .. w_{t+C}))
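A minimal sketch of how (center, context) training pairs are generated for the skip-gram model; the forward/backward pass then mirrors the CBOW sketch above with the input and output roles swapped:

```python
# Generate (center, context) skip-gram training pairs with a context window of size c.
sentence = "It is time to finish".split()
c = 2  # context window size

pairs = []
for t, center in enumerate(sentence):
    for j in range(-c, c + 1):
        if j != 0 and 0 <= t + j < len(sentence):
            pairs.append((center, sentence[t + j]))  # predict each context word from the center

print(pairs)
# e.g. ('time', 'It'), ('time', 'is'), ('time', 'to'), ('time', 'finish'), ...
```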
• Hierarchical Soft-max function
• To train the weight matrices at every step, we need to pass the
calculated vector into the loss function
• Soft-max function
• Before calculating the loss function, the
calculated vector should be normalized to real numbers in (0,1)
Extension of Skip-Gram(1)
T: total number of training words (steps)
c: size of the training context window
w_t, w_{t+j}: the current step's word and its j-th context word
(E = − log p(w_{t−C} .. w_{t+C} | w_t))
• Hierarchical Soft-max function (cont.)
• Soft-max function
(we have already calculated this; it's boring …)
Extension of Skip-Gram(1)
Original soft-max function
of skip-gram model
• Hierarchical Soft-max function (cont.)
• Since |V| is quite large, computing log p(w_O | w_I) costs too much
• Idea: Construct a binary Huffman tree over the words
→ Cost: O(|V|) down to O(log |V|)
• Can train faster!
• Assigning
• Each word w is a leaf node n(w, L(w)), reached by a random walk (path) from the root
(* details in "Hierarchical Probabilistic Neural Network Language Model")
Extension of Skip-Gram(1)
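A minimal sketch of the Huffman-tree construction over made-up word frequencies; in word2vec each internal node of this tree gets its own output vector, and p(w | w_I) becomes a product of sigmoids along the roughly O(log |V|) path to w:

```python
# Build a binary Huffman tree so frequent words get short paths, rare words long ones.
import heapq
from itertools import count

freqs = {"the": 50, "korea": 5, "seoul": 4, "word2vec": 2, "huffman": 1}  # toy counts

tiebreak = count()
heap = [(f, next(tiebreak), w) for w, f in freqs.items()]  # leaves: (freq, tiebreak, word)
heapq.heapify(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)    # repeatedly merge the two least-frequent nodes
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

def codes(node, prefix=""):
    """Assign a binary code (= path of left/right decisions) to every leaf word."""
    if isinstance(node, str):
        return {node: prefix or "0"}
    left, right = node
    out = {}
    out.update(codes(left, prefix + "0"))
    out.update(codes(right, prefix + "1"))
    return out

print(codes(heap[0][2]))  # "the" gets a short code, "huffman" a long one
```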
• Negative Sampling (similar to NCE)
• The vocabulary size |V| is computationally huge! → slow to train
• Idea: Just sample several negative examples!
• Do not loop over the full vocabulary, only use the negative samples → fast
• Swap the target word for a negative sample and also learn from
the negative examples → more accurate
• Objective function (see the sketch below)
Extension of Skip-Gram(2)
i.e. "Stock boil fish is toy" ???? → a negative sample
NCE: Noise Contrastive Estimation
E = − log p(w_{t−C} .. w_{t+C} | w_t)
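The negative-sampling objective in the paper replaces the full softmax with log σ(v'_{w_O}ᵀ v_{w_I}) + Σ_{i=1}^{k} E_{w_i∼P_n(w)}[log σ(−v'_{w_i}ᵀ v_{w_I})]. A minimal sketch of this loss for one (center, context) pair, with random stand-in vectors rather than trained ones:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 50, 5
v_center = rng.normal(size=dim)              # input vector of the center word
v_context = rng.normal(size=dim)             # output vector of the true context word
v_negatives = rng.normal(size=(k, dim))      # output vectors of k sampled noise words

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Push the true pair's score up and the k noise pairs' scores down.
loss = -(np.log(sigmoid(v_context @ v_center))
         + np.sum(np.log(sigmoid(-v_negatives @ v_center))))
print("NEG loss for this pair:", round(float(loss), 4))
```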
• Subsampling
• ("Korea", "Seoul") is a helpful pair, but ("Korea", "the") isn't
• Idea: Frequent word vectors (e.g. "the") should not change
significantly after training on several million examples.
• Each word w_i in the training set is discarded with probability
P(w_i) = 1 − sqrt( t / f(w_i) )
• It aggressively subsamples frequent words while preserving the
ranking of the frequencies
• But, this formula was chosen heuristically…
Extension of Skip-Gram(3)
f(w_i): frequency of word w_i
t: chosen threshold, around 10^−5
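A minimal sketch of this subsampling rule with made-up relative frequencies:

```python
# Discard word w_i with probability P(w_i) = 1 - sqrt(t / f(w_i)).
import math, random

t = 1e-5
freq = {"the": 0.05, "korea": 1e-4, "seoul": 5e-5, "word2vec": 2e-6}  # toy relative frequencies

def keep(word):
    p_discard = max(0.0, 1.0 - math.sqrt(t / freq[word]))  # frequent words are discarded often
    return random.random() >= p_discard

random.seed(0)
tokens = ["the", "korea", "the", "seoul", "the", "word2vec", "the"]
print([w for w in tokens if keep(w)])  # most "the" tokens are dropped, rare words survive
```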
• Evaluation
• Task: Analogical reasoning
• Accuracy is tested using cosine similarity to determine whether the model
answers correctly.
i.e. vec(X) = vec("Berlin") – vec("Germany") + vec("France")
the answer is correct if vec("Paris") is the closest vector to vec(X) by cosine similarity
• Model: skip-gram model (word-embedding dimension = 300)
• Data Set: news articles (Google dataset with 1 billion words)
• Comparing methods (w/ or w/o 10^−5 subsampling)
• NEG(Negative Sampling)-5, 15
• Hierarchical Softmax-Huffman
• NCE-5(Noise Contrastive Estimation)
Extension of Skip-Gram
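A minimal sketch of the analogical-reasoning check; the embedding table here is a random stand-in, so only with real trained vectors would the nearest neighbour come out as "paris":

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["berlin", "germany", "france", "paris", "london", "england"]
emb = {w: rng.normal(size=300) for w in words}  # stand-in vectors, not trained ones

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = emb["berlin"] - emb["germany"] + emb["france"]
candidates = [w for w in words if w not in ("berlin", "germany", "france")]
answer = max(candidates, key=lambda w: cosine(x, emb[w]))
print(answer)  # with trained word2vec vectors this should be "paris"
```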
• Empirical Results
• Models w/ NEG outperform HS on the analogical reasoning task
(and are even slightly better than NCE)
• Subsampling improves the training speed several times over and
makes the word representations more accurate
Extension of Skip-Gram
• Word-based models cannot represent idiomatic phrases
• i.e. "New York Times", "Larry Page"
• Simple data-driven approach
• Phrases are formed based on unigram and bigram counts:
score(w_i, w_j) = ( count(w_i w_j) − δ ) / ( count(w_i) × count(w_j) )
• Word pairs with a high score are treated as meaningful phrases
Learning Phrases
δ: discounting coefficient
(prevents too many phrases consisting of infrequent words)
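A minimal sketch of this phrase score with toy unigram/bigram counts (δ written as delta):

```python
# score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))
unigram = {"new": 1000, "york": 400, "times": 500, "the": 50000}          # toy counts
bigram = {("new", "york"): 350, ("york", "times"): 300, ("the", "new"): 40}
delta = 5  # discounting coefficient: penalizes phrases made of infrequent words

def score(wi, wj):
    return (bigram.get((wi, wj), 0) - delta) / (unigram[wi] * unigram[wj])

for pair in bigram:
    print(pair, round(score(*pair), 8))
# ("new", "york") scores far above ("the", "new"), so it survives a score threshold.
```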
• Evaluation
• Task: Analogical reasoning
• Accuracy is tested using cosine similarity to determine whether the model
answers correctly with phrases
• i.e. vec(X) = vec("Steve Ballmer") – vec("Microsoft") + vec("Larry Page")
the answer is correct if vec("Google") is the closest vector to vec(X) by cosine similarity
• Model: skip-gram model (word-embedding dimension = 300)
• Data Set: news articles (Google dataset with 1 billion words)
• Comparing methods (w/ or w/o 10^−5 subsampling)
• NEG-5
• NEG-15
• HS-Huffman
Learning Phrases
• Empirical Results
• NEG-15 achieves better performance than NEG-5
• HS becomes the best-performing method when subsampling is used
• This shows that subsampling can result in faster training and can
also improve accuracy, at least in some cases.
• With a 33-billion-word training set and d = 1000 → 72% accuracy (6B → 66%)
• The amount of training data is crucial!
Learning Phrases
• Simple vector addition (on Skip-gram model)
• Previous experiments showed analogical reasoning (A + B − C)
• The vector values are related logarithmically to the probabilities
→ the sum of two vectors is related to the product of the two context distributions
• Interesting!
Additive Compositionality
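A minimal sketch of this additive compositionality using gensim's KeyedVectors interface; it assumes a pretrained word2vec file (e.g. the GoogleNews binary from the project page) is available locally, and the expected neighbours are examples reported in the paper:

```python
from gensim.models import KeyedVectors

# Path is an assumption: any word2vec-format pretrained vectors will do.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# Sum of two vectors ~ product of the two context distributions, so nearest
# neighbours of the sum tend to be words that fit both contexts.
print(kv.most_similar(positive=["Vietnam", "capital"], topn=3))   # e.g. "Hanoi" in the paper
print(kv.most_similar(positive=["Russian", "river"], topn=3))     # e.g. "Volga_River" in the paper
```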
• Contributions
• Showed the detailed process of training distributed
representations of words and phrases
• The model can be more accurate and faster than the previous
word2vec model thanks to subsampling
• Negative Sampling: extremely simple and accurate for
frequent words (for infrequent ones, such as phrases, HS was better)
• Word vectors can be combined meaningfully by simple vector addition
• Released the code and dataset as an open-source project
Conclusion
• Comparison to other neural network models
<Find the most similar word>
• The skip-gram model trained on a large corpus outperforms
all the other papers' models.
Conclusion
• Very Interesting model
• Simple, short paper
• Easy to read
• Hard to understand the details
• In HS, the way the tree is constructed
• Several heuristic methods
• Pre-processing, such as eliminating stop words
Speaker’s Opinion
• Papers
• Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint
arXiv:1301.3781 (2013).
• Morin, Frederic, and Yoshua Bengio. "Hierarchical probabilistic neural network language model." Proceedings of
the international workshop on artificial intelligence and statistics. 2005.
• Guthrie, David, et al. "A closer look at skip-gram modelling." Proceedings of the 5th international Conference on
Language Resources and Evaluation (LREC-2006). 2006.
• Rong, Xin. "word2vec Parameter Learning Explained." arXiv preprint arXiv:1411.2731 (2014).
• Goldberg, Yoav, and Omer Levy. "word2vec Explained: deriving Mikolov et al.'s negative-sampling word-
embedding method." arXiv preprint arXiv:1402.3722(2014).
• Collobert, Ronan, et al. "Natural language processing (almost) from scratch." The Journal of Machine Learning
Research 12 (2011): 2493-2537.
• Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003):
1137-1155.
• Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating
errors." Cognitive modeling 5 (1988): 3.
• Websites & Courses
• Richard Socher, CS224d: Deep Learning for Natural Language Processing (http://cs224d.stanford.edu/)
• http://alexminnaar.com/word2vec-tutorial-part-i-the-skip-gram-model.html
• http://nohhj.blogspot.kr/2015/08/word-embedding.html
• https://yinwenpeng.wordpress.com/category/deep-learning-in-nlp/
• http://rare-technologies.com/word2vec-tutorial/
• https://code.google.com/p/word2vec/source/browse/trunk/word2vec.c?spec=svn42&r=42#482
• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-2-word-vectors
References
? = Word2vec(“Slide” + “End”)
End


Speaker's Notes

  1. It seems to propose a new notion, but it is an old one. Used the chain rule and evaluated errors at every iteration -> showed a probabilistic model applied to large-scale data -> deep learning can be used for various NLP tasks
  2. Instead of looping over the entire vocabulary, just sample several negative examples; a good model can distinguish bad samples. Build a new objective function that tries to maximize the probability of a (word, context) pair being in the corpus data if it indeed is, and to maximize the probability of its not being in the corpus data if it indeed is not.