Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
1. Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
Transfer Learning and Transformers
2. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
2
3. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
3
4. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
• Let's classify birds! We have 10K labeled images.
• From ImageNet, we know that a deep NN like ResNet-50 works well.
• Problem: the NN is so large that it overfits on our small dataset!
• Solution: take a NN trained on ImageNet (1M images) and fine-tune it on the bird data (see the sketch below).
• Result: better performance than anything else!
4
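A minimal PyTorch sketch of this recipe (the number of bird classes and the learning rate are placeholders, not from the lecture):

import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pretrained on ImageNet.
model = models.resnet50(pretrained=True)

# Freeze the pretrained backbone so our small bird dataset can't overfit it.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with a fresh one for our classes.
num_bird_classes = 200  # placeholder: however many bird species you have
model.fc = nn.Linear(model.fc.in_features, num_bird_classes)

# Fine-tune: only the new head is trainable (optionally unfreeze more layers later).
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)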
5. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
5
A LOT of data
Large model
Traditional Machine Learning:
6. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
6
A LOT of data
Large model
Traditional Machine Learning:
slow training on a lot of data
7. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
7
A LOT of data
Train on much
less data
Large model
Pretrained large model
+ new layers
Traditional Machine Learning:
slow training on a lot of data
Transfer learning:
8. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
8
A LOT of data Much less data
Large model
Pretrained large model
+ new layers
Traditional Machine Learning:
slow training on a lot of data
Transfer learning:
fast training on a little data
9. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
9
Imagenet-
pretrained
model
Keep these Replace these
10. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
10
Imagenet-
pretrained
model
Keep these Replace these
Fine-tuned model
New Learned Weights
Same weights as before
11. Full Stack Deep Learning - UC Berkeley Spring 2021
Transfer Learning
11
https://wandb.ai/wandb/wandb-lightning/reports/Transfer-Learning-Using-PyTorch-Lightning--VmlldzoyODk2MjA
12. Full Stack Deep Learning - UC Berkeley Spring 2021
Model Zoos
12
https://pytorch.org/docs/stable/torchvision/models.html https://github.com/tensorflow/models/tree/master/official
13. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
13
14. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
14
15. Full Stack Deep Learning - UC Berkeley Spring 2021
Converting words to vectors
• In NLP, our inputs are sequences of words, but deep learning needs
vectors.
• How to convert words to vectors?
• Idea: one-hot encoding
15
16. Full Stack Deep Learning - UC Berkeley Spring 2021
One-hot encoding
16
https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2
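To make the idea concrete, a toy sketch (the five-word vocabulary is made up for illustration):

import torch

vocab = ["the", "cat", "is", "yawning", "poetry"]  # toy vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A |V|-dimensional vector: all zeros except a 1 at the word's index.
    vec = torch.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # tensor([0., 1., 0., 0., 0.])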
17. Full Stack Deep Learning - UC Berkeley Spring 2021
Problems with one-hot encoding
• Scales poorly with vocabulary size
• Very high-dimensional sparse vectors -> NN operations work poorly
• Violates what we know about word similarity (e.g. "run" is as far away from "running" as it is from "tertiary" or "poetry")
17
[Figure: one-hot vectors are basis vectors, e.g. (1, 0) and (0, 1) in 2D, or (1, 0, 0), (0, 1, 0), (0, 0, 1) in 3D]
18. Full Stack Deep Learning - UC Berkeley Spring 2021
Map one-hot to dense vectors
• Problem: how do we find the values of the embedding matrix?
18
https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2
[Figure: the VxV matrix of one-hot vectors times the VxE embedding matrix gives dense E-dimensional word vectors]
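In code, the VxE matrix is just a learnable lookup table; a minimal PyTorch sketch (vocabulary size and embedding dimension chosen arbitrarily):

import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300              # V and E
embedding = nn.Embedding(vocab_size, embed_dim)  # the V x E embedding matrix

# Multiplying a one-hot row by the matrix just selects a row,
# so in practice we index with word ids instead of one-hot vectors.
word_ids = torch.tensor([42, 7, 999])
dense_vectors = embedding(word_ids)              # shape (3, 300)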
19. Full Stack Deep Learning - UC Berkeley Spring 2021
Solution 1: Learn as part of the task
19
20. Full Stack Deep Learning - UC Berkeley Spring 2021
Solution 2: Learn a Language Model
• "Pre-train" for your NLP task by learning a really good word embedding!
• How to learn a really good embedding? Train for a very general task on a
large corpus of text.
20
21. Full Stack Deep Learning - UC Berkeley Spring 2021
Solution 2: Learn a Language Model
21
http://jalammar.github.io/illustrated-word2vec/
22. Full Stack Deep Learning - UC Berkeley Spring 2021
• Slide an N-word window through the text, forming a dataset where the task is to predict the last word in each window.
N-Grams
22
http://jalammar.github.io/illustrated-word2vec/
23. Full Stack Deep Learning - UC Berkeley Spring 2021
• Slide an N-word window through the text, forming a dataset where the task is to predict the last word in each window.
N-Grams
23
http://jalammar.github.io/illustrated-word2vec/
24. Full Stack Deep Learning - UC Berkeley Spring 2021
• Slide an N-word window through the text, forming a dataset where the task is to predict the last word in each window.
N-Grams
24
http://jalammar.github.io/illustrated-word2vec/
25. Full Stack Deep Learning - UC Berkeley Spring 2021
• Look on both sides of the target word, and form multiple samples from
each N-gram
Skip-grams
25
http://jalammar.github.io/illustrated-word2vec/
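A small sketch of how such (target, context) training pairs could be generated (the sentence and window size are arbitrary examples):

def skipgram_pairs(tokens, window=2):
    # For each target word, pair it with every neighbor within the window.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "thou shalt not make a machine".split()
print(skipgram_pairs(tokens)[:4])
# [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ('shalt', 'not')]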
26. Full Stack Deep Learning - UC Berkeley Spring 2021
• Make the prediction binary (is this word a real neighbor?) instead of multi-class over the vocabulary (faster training)
Speed up training
26
http://jalammar.github.io/illustrated-word2vec/
27. Full Stack Deep Learning - UC Berkeley Spring 2021
Training
27
http://jalammar.github.io/illustrated-word2vec/
28. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
28
http://jalammar.github.io/illustrated-word2vec/
"king"
29. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
29
http://jalammar.github.io/illustrated-word2vec/
30. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
30
http://jalammar.github.io/illustrated-word2vec/
31. Full Stack Deep Learning - UC Berkeley Spring 2021
Vector math...
31
https://mathinsight.org/vectors_cartesian_coordinates_2d_3d
32. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
32
http://jalammar.github.io/illustrated-word2vec/
33. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
33
http://jalammar.github.io/illustrated-word2vec/
34. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
34
http://jalammar.github.io/illustrated-word2vec/
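The "king - man + woman ≈ queen" arithmetic can be reproduced with off-the-shelf pretrained vectors; a hedged sketch using gensim (assuming its downloader hosts the Google News Word2Vec vectors under this name):

import gensim.downloader as api

# ~300-dimensional Word2Vec vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

# king - man + woman ~= queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))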
35. Full Stack Deep Learning - UC Berkeley Spring 2021
Word2Vec (2013)
35
https://ruder.io/nlp-imagenet/
36. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
36
37. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
37
38. Full Stack Deep Learning - UC Berkeley Spring 2021
Beyond embeddings
• Word2Vec and GloVe embeddings became popular in ~2013-14
• Boosted accuracy on many tasks by low single-digit %
• But these representations are shallow:
• only the first layer gets the benefit of seeing all of Wikipedia
• the rest of the model (LSTMs, etc.) is trained only on the much smaller task dataset
• Why not pre-train more layers, and thus disambiguate words (e.g. "rule" in "to rule" vs. "a rule"), learn grammar, etc.?
38
39. Full Stack Deep Learning - UC Berkeley Spring 2021
ELMo (2018)
• Bidirectional stacked LSTM!
39
https://ruder.io/nlp-imagenet/
40. Full Stack Deep Learning - UC Berkeley Spring 2021
ELMo (2018)
40
41. Full Stack Deep Learning - UC Berkeley Spring 2021
ELMo (2018)
41
https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
42. Full Stack Deep Learning - UC Berkeley Spring 2021
Q&A: SQuAD
• 100K question-answer pairs
• Answers are always spans in the provided passage
42
https://arxiv.org/abs/1606.05250
43. Full Stack Deep Learning - UC Berkeley Spring 2021
Natural Language Inference: SNLI
• What relation holds between a piece of text (the premise) and a hypothesis?
• 570K pairs
43
https://nlp.stanford.edu/pubs/snli_paper.pdf
44. Full Stack Deep Learning - UC Berkeley Spring 2021
ELMo (2018)
44
https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
45. Full Stack Deep Learning - UC Berkeley Spring 2021
GLUE
45
https://mccormickml.com/2019/11/05/GLUE/
46. Full Stack Deep Learning - UC Berkeley Spring 2021
GLUE
• 9 tasks, model score is averaged across them
46
https://mccormickml.com/2019/11/05/GLUE/
47. Full Stack Deep Learning - UC Berkeley Spring 2021
ULMFiT (2018)
47
https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/
48. Full Stack Deep Learning - UC Berkeley Spring 2021
ULMFiT (2018)
48
https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/
49. Full Stack Deep Learning - UC Berkeley Spring 2021
Pre-training is here!
49
50. Full Stack Deep Learning - UC Berkeley Spring 2021
Pre-training is here!
50
51. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
51
52. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
52
53. Full Stack Deep Learning - UC Berkeley Spring 2021
Rise of Transformers
53
54. Full Stack Deep Learning - UC Berkeley Spring 2021
Rise of Transformers
54
55. Full Stack Deep Learning - UC Berkeley Spring 2021
Attention is all you need (2017)
• Encoder-decoder with only attention and
fully-connected layers (no recurrence or
convolutions)
• Set new SOTA on translation datasets.
55
56. Full Stack Deep Learning - UC Berkeley Spring 2021
Attention is all you need (2017)
56
• For simplicity, we can focus on just the encoder
• (e.g. BERT is just the encoder)
57. Full Stack Deep Learning - UC Berkeley Spring 2021
Attention is all you need (2017)
• The components:
• (Masked) Self-attention
• Positional encoding
• Layer normalization
57
58. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
58
59. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
• Input: sequence of tensors
59
http://www.peterbloem.nl/blog/transformers
60. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
• Input: sequence of tensors
• Output: sequence of tensors, each one a weighted sum of the input sequence
60
http://www.peterbloem.nl/blog/transformers
61. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
• Input: sequence of tensors
• Output: sequence of tensors, each one a weighted sum of the input sequence
61
http://www.peterbloem.nl/blog/transformers
(w_ij is not a learned weight, but a function of x_i and x_j)
62. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
• Input: sequence of tensors
• Output: sequence of tensors, each one a weighted sum of the input sequence
62
http://www.peterbloem.nl/blog/transformers
(w_ij is not a learned weight, but a function of x_i and x_j)
(the weights must sum to 1 over j)
63. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
63
http://www.peterbloem.nl/blog/transformers
The Cat is yawning
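Following the peterbloem.nl formulation the slides are based on, a minimal sketch of this unparameterized self-attention (raw weight = dot product of x_i and x_j, softmax over j):

import torch
import torch.nn.functional as F

x = torch.randn(4, 3)  # toy sequence: 4 tokens ("the cat is yawning"), 3-d each

raw_weights = x @ x.T                    # w'_ij = dot(x_i, x_j), shape (4, 4)
weights = F.softmax(raw_weights, dim=1)  # each row now sums to 1 over j
y = weights @ x                          # y_i = sum_j w_ij * x_j, shape (4, 3)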
64. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
• SO FAR:
• No learned weights
• Order of the sequence does not affect result of computations
64
http://www.peterbloem.nl/blog/transformers
65. Full Stack Deep Learning - UC Berkeley Spring 2021
Basic self-attention
• SO FAR:
• No learned weights
• Order of the sequence does not affect result of computations
65
http://www.peterbloem.nl/blog/transformers
Let's learn some weights!
66. Full Stack Deep Learning - UC Berkeley Spring 2021
Query, Key, Value
• Every input vector x_i is used in 3 ways:
• Compared to every other vector to compute
attention weights for its own output y_i (query)
• Compared to every other vector to compute
attention weight w_ij for output y_j (key)
• Summed with other vectors to form the result
of the attention weighted sum (value)
66
http://www.peterbloem.nl/blog/transformers
67. Full Stack Deep Learning - UC Berkeley Spring 2021
Query, Key, Value
• We can process each input vector to fulfill the three roles with matrix multiplication
• Learning the matrices --> learning attention
67
http://www.peterbloem.nl/blog/transformers
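A sketch of single-head self-attention with the three learned matrices (the 1/sqrt(d) scaling follows the original paper; dimensions are arbitrary):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # The learned W_q, W_k, W_v matrices.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):                             # x: (seq_len, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        scores = q @ k.T / math.sqrt(x.size(-1))      # (seq_len, seq_len)
        weights = F.softmax(scores, dim=-1)           # rows sum to 1 over j
        return weights @ v                            # weighted sum of values

y = SelfAttention(dim=64)(torch.randn(10, 64))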
68. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
68
69. Full Stack Deep Learning - UC Berkeley Spring 2021
Multi-head attention
• Multiple "heads" of attention just means learning different sets of W_q,
W_k, and W_v matrices simultaneously.
• Implemented as just a single matrix anyway...
69
http://www.peterbloem.nl/blog/transformers
70. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
• Self-attention layer -> Layer normalization -> Dense layer
70
http://www.peterbloem.nl/blog/transformers
71. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
• Self-attention layer -> Layer normalization -> Dense layer
71
http://www.peterbloem.nl/blog/transformers
72. Full Stack Deep Learning - UC Berkeley Spring 2021
Layer Normalization
• Neural net layers work best when input vectors have a consistent mean and std in each dimension
• (remember input data scaling and weight initialization)
• As inputs flow through the network, means and stds get blown out.
• Layer Normalization is a hack to reset things to where we want them in between layers.
72
https://arxiv.org/pdf/1803.08494.pdf
73. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
• Self-attention layer -> Layer normalization -> Dense layer
73
http://www.peterbloem.nl/blog/transformers
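A minimal sketch of one such block using PyTorch's built-in modules (post-norm ordering as on the slide; dimensions are arbitrary, and dropout is omitted):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=256, heads=8, ff_dim=1024):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.dense = nn.Sequential(
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        attn_out, _ = self.attention(x, x, x)  # self-attention: q = k = v = x
        x = self.norm1(x + attn_out)           # residual, then layer norm
        x = self.norm2(x + self.dense(x))      # dense layer, residual, layer norm
        return x

out = TransformerBlock()(torch.randn(2, 16, 256))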
74. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
• SO FAR:
• Learned query, key, value weights
• Multiple heads
• Order of the sequence does not affect result of computations
74
http://www.peterbloem.nl/blog/transformers
75. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
• SO FAR:
• Learned query, key, value weights
• Multiple heads
• Order of the sequence does not affect result of computations
75
http://www.peterbloem.nl/blog/transformers
Let's encode each vector with position
76. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
76
http://www.peterbloem.nl/blog/transformers
77. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer
• Position embedding: just what it sounds like!
77
http://www.peterbloem.nl/blog/transformers
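A sketch of a learned position embedding: every position index gets its own learned vector, which is added to the token embedding (sizes are arbitrary):

import torch
import torch.nn as nn

vocab_size, max_len, dim = 10_000, 512, 256
token_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)   # one learned vector per position

tokens = torch.randint(0, vocab_size, (1, 16))          # (batch, seq_len)
positions = torch.arange(tokens.size(1)).unsqueeze(0)   # 0, 1, ..., seq_len-1
x = token_emb(tokens) + pos_emb(positions)              # input to the first block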
78. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer: last trick
• Since the Transformer sees all inputs at once, to predict the next vector in a sequence (e.g. to generate text), we need to mask the future.
78
http://www.peterbloem.nl/blog/transformers
79. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformer: last trick
• Since the Transformer sees all inputs at once, to predict the next vector in a sequence (e.g. to generate text), we need to mask the future.
79
http://www.peterbloem.nl/blog/transformers
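A sketch of the masking trick: set attention scores for future positions (j > i) to -inf before the softmax, so they receive zero weight:

import torch
import torch.nn.functional as F

scores = torch.randn(5, 5)  # raw attention scores for a 5-token sequence
mask = torch.triu(torch.ones(5, 5, dtype=torch.bool), diagonal=1)  # True above the diagonal
scores = scores.masked_fill(mask, float("-inf"))  # position i cannot attend to j > i
weights = F.softmax(scores, dim=-1)               # future positions get weight 0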
80. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
80
81. Full Stack Deep Learning - UC Berkeley Spring 2021
Outline
• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMO/ULMFit
• Transformers
• Attention in detail
• BERT, GPT-2, DistillBERT, T5
81
82. Full Stack Deep Learning - UC Berkeley Spring 2021
Attention is all you need (2017)
• Encoder-decoder for translation
82
83. Full Stack Deep Learning - UC Berkeley Spring 2021
Attention is all you need (2017)
83
http://www.peterbloem.nl/blog/transformers
• Encoder-decoder for translation
• Later models made it mostly just
the encoder or just the decoder
• ...but then the latest models are
back to encoder-decoder
84. Full Stack Deep Learning - UC Berkeley Spring 2021
GPT / GPT-2
• Generative Pre-trained Transformer
84
https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc
85. Full Stack Deep Learning - UC Berkeley Spring 2021
GPT / GPT-2
• Generative Pre-trained Transformer
• GPT learns to predict the next word in the sequence, just like ELMo or ULMFiT
85
https://arxiv.org/pdf/1810.04805.pdf
86. Full Stack Deep Learning - UC Berkeley Spring 2021
GPT / GPT-2
• Generative Pre-trained Transformer
• GPT learns to predict the next word in the sequence, just like ELMo or ULMFiT
• Since it conditions only on preceding words, it uses masked self-attention
86
http://jalammar.github.io/illustrated-gpt2/
87. Full Stack Deep Learning - UC Berkeley Spring 2021
GPT / GPT-2
87
GPT-2 comes in four sizes: 117M, 345M, 762M, and 1.5B parameters
Trained on 8M web pages
http://jalammar.github.io/illustrated-gpt2/
88. Full Stack Deep Learning - UC Berkeley Spring 2021
Talk to Transformer
88
https://talktotransformer.com
1.5B parameter model
89. Full Stack Deep Learning - UC Berkeley Spring 2021
BERT
• Bidirectional Encoder Representations from Transformers
• Encoder blocks only (no causal masking of the future)
• BERT involves pre-training on A LOT of text with 15% of all words masked out
• also sometimes predicting whether one sentence follows another
89
https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc
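As a usage sketch, the HuggingFace transformers library exposes exactly this masked-word objective for a pretrained BERT:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The goal of a language model is to predict the next [MASK]."))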
90. Full Stack Deep Learning - UC Berkeley Spring 2021
BERT
• 340M parameters:
• 24 transformer blocks, embedding dim of 1024, 16 attention heads
90
91. Full Stack Deep Learning - UC Berkeley Spring 2021
Transformers
91
https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8
92. Full Stack Deep Learning - UC Berkeley Spring 2021
Number of parameters
92
https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
93. Full Stack Deep Learning - UC Berkeley Spring 2021
T5: Text-to-Text Transfer Transformer
• Feb 2020
• Evaluated most recent transfer learning
techniques
• Input and output are both text strings
• Trained on C4 (Colossal Clean Crawled
Corpus) - 100x larger than Wikipedia
• 11B parameters
• SOTA on GLUE, SuperGLUE, SQuAD
93
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
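A hedged usage sketch of the text-to-text interface with HuggingFace transformers (using the small checkpoint here rather than the 11B one; task prefixes like this are how the released T5 checkpoints were trained):

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is phrased as text in, text out (here, translation via a task prefix).
inputs = tokenizer("translate English to German: The cat is yawning.", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))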
94. Full Stack Deep Learning - UC Berkeley Spring 2021
T5: Text-to-Text Transfer Transformer
94
https://medium.com/@Moscow25/the-best-deep-natural-language-papers-you-should-read-bert-gpt-2-and-looking-forward-1647f4438797
99. GSV Breakfast Club - January 2021 - Sergey Karayev
Really, really good text generation
Provided by me
Generated
100. GSV Breakfast Club - January 2021 - Sergey Karayev
One problem:
perpetuating unfortunate
things
https://twitter.com/an_open_mind/status/1284487376312709120
101. GSV Breakfast Club - January 2021 - Sergey Karayev
Reasonable mental model
102. GSV Breakfast Club - January 2021 - Sergey Karayev
🤯 Useful for more than text?
103. GSV Breakfast Club - January 2021 - Sergey Karayev
🤯 Useful for more than text?
104. GSV Breakfast Club - January 2021 - Sergey Karayev
Image Generation
https://openai.com/blog/dall-e/
105. Full Stack Deep Learning - UC Berkeley Spring 2021
No sign of slowing down
105
107. Full Stack Deep Learning - UC Berkeley Spring 2021
Number of parameters
• The way things are trending, only big
companies can afford to compete!
• But there is another direction to go in:
doing more with less
107
https://gluebenchmark.com/leaderboard
108. Full Stack Deep Learning - UC Berkeley Spring 2021
DistilBERT
108
Knowledge Distillation: a smaller model is
trained to reproduce the output of a larger model
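A sketch of the soft-target part of knowledge distillation (DistilBERT's actual training also combines this with the masked-LM loss and a hidden-state cosine loss):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature**2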
109. Full Stack Deep Learning - UC Berkeley Spring 2021
Papers and Leaderboards
109
paperswithcode.com
110. Full Stack Deep Learning - UC Berkeley Spring 2021
Papers and Leaderboards
110
https://github.com/sebastianruder/NLP-progress
111. Full Stack Deep Learning - UC Berkeley Spring 2021
Implementations
111
https://github.com/huggingface/transformers
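For example, a few lines with this library generate text with GPT-2 (the default "gpt2" checkpoint is the smallest, 117M-parameter one):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning in NLP works by", max_length=40))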
112. Full Stack Deep Learning - UC Berkeley Spring 2021
Questions?
112
113. Full Stack Deep Learning - UC Berkeley Spring 2021
113
http://www.incompleteideas.net/IncIdeas/BitterLesson.html
114. Full Stack Deep Learning - UC Berkeley Spring 2021
Thank you!
114