Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
Transfer Learning and Transformers
Outline

• Transfer Learning in Computer Vision
• Embeddings and Language Models
• "NLP's ImageNet Moment": ELMo/ULMFiT
• Transformers
• Attention in detail
• BERT, GPT-2, DistilBERT, T5
Transfer Learning

• Let's classify birds! We have 10K labeled images.
• From ImageNet, we know that a deep NN like ResNet-50 works well.
• Problem: the NN is so large that it overfits on our small dataset!
• Solution: use a NN trained on ImageNet (1M images), and fine-tune it on the bird data.
• Result: better performance than anything else!
Transfer Learning

• Traditional machine learning: slow training of a large model on A LOT of data.
• Transfer learning: start from a pretrained large model, add new layers, and train fast on much less data.
Transfer Learning

• Start from an ImageNet-pretrained model: keep the early layers (same weights as before), and replace the final layers with new learned weights.

https://wandb.ai/wandb/wandb-lightning/reports/Transfer-Learning-Using-PyTorch-Lightning--VmlldzoyODk2MjA
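A minimal sketch of what this looks like in code (not from the lecture; torchvision's ResNet-50 and a hypothetical 200-class bird dataset are used as the example):

```python
# Sketch: ImageNet transfer learning with torchvision.
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)   # ImageNet-pretrained weights

# Keep the early layers: freeze their weights.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with a new one for our task
# (a hypothetical 200-class bird dataset).
model.fc = nn.Linear(model.fc.in_features, 200)

# Now train as usual -- only the new head's weights will be updated.
```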
Model Zoos

https://pytorch.org/docs/stable/torchvision/models.html
https://github.com/tensorflow/models/tree/master/official
Questions?
Converting words to vectors

• In NLP, our inputs are sequences of words, but deep learning needs vectors.
• How do we convert words to vectors?
• Idea: one-hot encoding
One-hot encoding

https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2
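A minimal sketch of one-hot encoding, with a toy four-word vocabulary (illustrative, not from the slides):

```python
# Sketch: one-hot encoding a tiny vocabulary.
import numpy as np

vocab = ["cat", "dog", "run", "running"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("run"))                       # [0. 0. 1. 0.]
# Every pair of words is equally far apart:
# "run" is no closer to "running" than to "cat".
print(one_hot("run") @ one_hot("running"))  # 0.0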
Problems with one-hot encoding

• Scales poorly with vocabulary size
• Very high-dimensional sparse vectors -> NN operations work poorly
• Violates what we know about word similarity (e.g. "run" is as far away from "running" as from "tertiary" or "poetry")

[Figure: one-hot vectors are all mutually equidistant, e.g. (1, 0) and (0, 1) in 2D; (1, 0, 0), (0, 1, 0), (0, 0, 1) in 3D]
Map one-hot to dense vectors

• Problem: how do we find the values of the embedding matrix?

[Figure: multiplying V-dimensional one-hot vectors (a VxV matrix) by a VxE embedding matrix yields dense E-dimensional vectors]

https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2
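A quick sketch of why this works: multiplying a one-hot vector by a VxE matrix just selects a row, which is why embeddings are implemented as lookups (toy dimensions, illustrative):

```python
# Sketch: one-hot times embedding matrix == row lookup.
import numpy as np

V, E = 4, 3                      # vocabulary size, embedding dim (toy numbers)
W = np.random.randn(V, E)        # the VxE embedding matrix (values to be learned)

x = np.zeros(V)
x[2] = 1.0                       # one-hot vector for word index 2
print(np.allclose(x @ W, W[2]))  # True
```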
Solution 1: Learn as part of the task
Solution 2: Learn a Language Model

• "Pre-train" for your NLP task by learning a really good word embedding!
• How do we learn a really good embedding? Train for a very general task on a large corpus of text.

http://jalammar.github.io/illustrated-word2vec/
N-Grams

• Slide an N-sized window through the text, forming a dataset of predicting the last word.

http://jalammar.github.io/illustrated-word2vec/
Skip-grams

• Look on both sides of the target word, and form multiple samples from each N-gram.

http://jalammar.github.io/illustrated-word2vec/
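A sketch of forming skip-gram training pairs (window size and text are illustrative):

```python
# Sketch: each word predicts its neighbors within a window.
text = "thou shalt not make a machine in the likeness of a human mind".split()
window = 2

pairs = []
for i, target in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            pairs.append((target, text[j]))

print(pairs[:4])  # [('thou', 'shalt'), ('thou', 'not'), ('shalt', 'thou'), ...]
```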
Speed up training

• Binary instead of multi-class prediction (faster training)

Training

http://jalammar.github.io/illustrated-word2vec/
Word2Vec (2013)

"king"

[Figure slides: the "king" word vector, visualizations of word vectors, and vector math on embeddings, e.g. king - man + woman ≈ queen]

Vector math...

http://jalammar.github.io/illustrated-word2vec/
https://mathinsight.org/vectors_cartesian_coordinates_2d_3d
https://ruder.io/nlp-imagenet/
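This vector math can be reproduced with any pretrained word vectors; a sketch using gensim's downloader and its stock small GloVe vectors:

```python
# Sketch: the famous word2vec-style vector arithmetic.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")    # small pretrained word vectors
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', 0.85...)] -- king - man + woman lands near "queen"
```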
Questions?
Beyond embeddings

• Word2Vec and GloVe embeddings became popular in ~2013-14
• They boosted accuracy on many tasks by low single-digit percentages
• But these representations are shallow:
  • only the first layer gets the benefit of seeing all of Wikipedia
  • the rest of the model -- LSTMs, etc. -- is trained only on the task dataset (much smaller)
• Why not pre-train more layers, and thus disambiguate words (e.g. rule in "to rule" vs "a rule"), learn grammar, etc.?
ELMo (2018)

• Bidirectional stacked LSTM!

https://ruder.io/nlp-imagenet/
https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
Q&A: SQuAD

• 100K question-answer pairs
• Answers are always spans in the passage

https://arxiv.org/abs/1606.05250
Natural Language Inference: SNLI

• What relation exists between a piece of text and a hypothesis?
• 570K pairs

https://nlp.stanford.edu/pubs/snli_paper.pdf
GLUE

• 9 tasks; the model score is averaged across them

https://mccormickml.com/2019/11/05/GLUE/
ULMFiT (2018)

https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/
Pre-training is here!
Questions?
Rise of Transformers
Attention is all you need (2017)

• Encoder-decoder with only attention and fully-connected layers (no recurrence or convolutions)
• Set new SOTA on translation datasets
Attention is all you need (2017)

• For simplicity, we can focus just on the encoder
  • (e.g. BERT is just the encoder)
Attention is all you need (2017)

• The components:
  • (Masked) self-attention
  • Positional encoding
  • Layer normalization
Basic self-attention

• Input: a sequence of tensors x_1, ..., x_t
• Output: a sequence of tensors y_1, ..., y_t, each one a weighted sum of the input sequence: y_i = Σ_j w_ij x_j
  • w_ij is not a learned weight, but a function of x_i and x_j (e.g. the softmax over j of the dot products x_i · x_j)
  • w_ij must sum to 1 over j

[Figure: attention weights over the example sentence "The cat is yawning"]

http://www.peterbloem.nl/blog/transformers
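A sketch of this basic, weight-free self-attention in numpy (dimensions are toy values):

```python
# Sketch: w_ij = softmax_j(x_i . x_j), y_i = sum_j w_ij x_j.
import numpy as np
from scipy.special import softmax

X = np.random.randn(4, 8)       # sequence of t=4 input vectors, dim k=8
W = softmax(X @ X.T, axis=-1)   # weights from dot products; each row sums to 1
Y = W @ X                       # y_i = sum_j w_ij x_j
# Permuting the rows of X just permutes the rows of Y: order doesn't matter yet.
```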
Basic self-attention

• SO FAR:
  • No learned weights
  • Order of the sequence does not affect result of computations

Let's learn some weights!

http://www.peterbloem.nl/blog/transformers
Query, Key, Value

• Every input vector x_i is used in 3 ways:
  • Compared to every other vector to compute attention weights for its own output y_i (query)
  • Compared to every other vector to compute the attention weight w_ij for output y_j (key)
  • Summed with other vectors to form the result of the attention-weighted sum (value)

http://www.peterbloem.nl/blog/transformers
Query, Key, Value

• We can process each input vector to fulfill the three roles with matrix multiplication: q_i = W_q x_i, k_i = W_k x_i, v_i = W_v x_i
• Learning the matrices --> learning attention

http://www.peterbloem.nl/blog/transformers
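A sketch of attention with learned projections, in the scaled dot-product form from the paper (toy dimensions; scipy's softmax used for brevity):

```python
# Sketch: scaled dot-product attention with learned Q/K/V projections.
import numpy as np
from scipy.special import softmax

k = 8                                        # model dimension (toy size)
W_q, W_k, W_v = (np.random.randn(k, k) for _ in range(3))  # learned in practice

X = np.random.randn(4, k)                    # input sequence
Q, K, V = X @ W_q, X @ W_k, X @ W_v          # query, key, value roles
A = softmax(Q @ K.T / np.sqrt(k), axis=-1)   # attention weights
Y = A @ V                                    # attention-weighted sum of values
```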
Questions?
Multi-head attention

• Multiple "heads" of attention just means learning different sets of W_q, W_k, and W_v matrices simultaneously.
• Implemented as just a single matrix anyway...

http://www.peterbloem.nl/blog/transformers
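A sketch of the multi-head idea (toy dimensions; in practice the heads are computed as one big matrix multiplication, as the slide notes):

```python
# Sketch: several attentions with different projections, run in parallel
# and concatenated.
import numpy as np
from scipy.special import softmax

def attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

k, h = 8, 2                               # model dim, number of heads
X = np.random.randn(4, k)
heads = [attention(X, *(np.random.randn(k, k // h) for _ in range(3)))
         for _ in range(h)]
Y = np.concatenate(heads, axis=-1)        # (4, k); in practice the concatenated
                                          # heads are mixed by one more learned matrix
```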
Transformer

• Self-attention layer -> Layer normalization -> Dense layer

http://www.peterbloem.nl/blog/transformers
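A minimal sketch of such a block in PyTorch, assuming the common arrangement with residual connections (details like norm placement vary across implementations):

```python
# Sketch: self-attention -> layer norm -> dense, with residuals.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, k, heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(k, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(k), nn.LayerNorm(k)
        self.mlp = nn.Sequential(nn.Linear(k, 4 * k), nn.ReLU(), nn.Linear(4 * k, k))

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])  # attention + residual + norm
        return self.norm2(x + self.mlp(x))         # dense + residual + norm

x = torch.randn(1, 10, 64)                 # (batch, sequence, embedding dim)
print(TransformerBlock(64, 8)(x).shape)    # torch.Size([1, 10, 64])
```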
Layer Normalization

• Neural net layers work best when input vectors have uniform mean and std in each dimension
  • (remember input data scaling and weight initialization)
• As inputs flow through the network, means and stds get blown out
• Layer Normalization is a hack to reset things to where we want them in between layers

https://arxiv.org/pdf/1803.08494.pdf
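A sketch of the computation layer normalization performs (eps and the toy inputs are illustrative):

```python
# Sketch: per-vector standardization plus a learned scale and shift.
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 8) * 50 + 10   # "blown out" activations
print(layer_norm(x).mean(axis=-1))    # ~0 per vector
print(layer_norm(x).std(axis=-1))     # ~1 per vector
```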
Transformer

• SO FAR:
  • Learned query, key, value weights
  • Multiple heads
  • Order of the sequence does not affect result of computations

Let's encode each vector with position

http://www.peterbloem.nl/blog/transformers
Transformer

• Position embedding: just what it sounds like!

http://www.peterbloem.nl/blog/transformers
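A sketch of the sinusoidal position encoding from the original paper; a learned position embedding (an embedding layer indexed by position) works too:

```python
# Sketch: sinusoidal position encodings added to the input vectors.
import numpy as np

def positional_encoding(length, k):
    pos = np.arange(length)[:, None]          # (length, 1)
    i = np.arange(0, k, 2)[None, :]           # even dimensions
    angles = pos / 10000 ** (i / k)
    pe = np.zeros((length, k))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

X = np.random.randn(10, 64)                # sequence of 10 vectors
X = X + positional_encoding(10, 64)        # now order affects the computation
```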
Transformer: last trick

• Since the Transformer sees all inputs at once, to predict the next vector in the sequence (e.g. to generate text) we need to mask the future.

http://www.peterbloem.nl/blog/transformers
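A sketch of masking the future: set the attention scores for later positions to -inf before the softmax, so each output only sees preceding inputs:

```python
# Sketch: causal (future) masking of attention scores.
import numpy as np
from scipy.special import softmax

t = 4
scores = np.random.randn(t, t)                    # raw attention scores
mask = np.triu(np.ones((t, t), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf
W = softmax(scores, axis=-1)                      # rows are lower-triangular
print(np.round(W, 2))                             # zeros above the diagonal
```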
Questions?
Attention is all you need (2017)

• Encoder-decoder for translation
• Later models used mostly just the encoder or just the decoder
• ...but then the latest models are back to encoder-decoder

http://www.peterbloem.nl/blog/transformers
GPT / GPT-2

• Generative Pre-trained Transformer
• GPT learns to predict the next word in the sequence, just like ELMo or ULMFiT
• Since it conditions only on preceding words, it uses masked self-attention

https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc
https://arxiv.org/pdf/1810.04805.pdf
http://jalammar.github.io/illustrated-gpt2/
GPT / GPT-2

• GPT-2 comes in four sizes (117M, 345M, 762M, and 1.5B parameters), trained on 8M web pages

http://jalammar.github.io/illustrated-gpt2/
Talk to Transformer

• A demo of the 1.5B-parameter model

https://talktotransformer.com
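You can sample from the public GPT-2 checkpoint yourself with the Hugging Face transformers library; a sketch (the prompt is illustrative):

```python
# Sketch: text generation with the 124M "gpt2" checkpoint.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture", max_length=40)[0]["generated_text"])
```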
BERT

• Bidirectional Encoder Representations from Transformers
• Encoder blocks only (no future-masking in attention)
• BERT pre-trains on A LOT of text with 15% of all words masked out
  • also sometimes predicting whether one sentence follows another
• 340M parameters: 24 transformer blocks, embedding dim of 1024, 16 attention heads

https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc
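BERT's masked-word objective can be poked at directly with the transformers fill-mask pipeline; a sketch (the sentence is illustrative):

```python
# Sketch: BERT filling in a masked word.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("The cat is [MASK] on the mat.")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```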
Transformers

https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8
Number of parameters

https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
T5: Text-to-Text Transfer Transformer

• Feb 2020
• Evaluated the most recent transfer learning techniques
• Input and output are both text strings
• Trained on C4 (Colossal Clean Crawled Corpus), 100x larger than Wikipedia
• 11B parameters
• SOTA on GLUE, SuperGLUE, SQuAD

https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
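A sketch of T5's text-to-text interface with the small public checkpoint (the "translate English to German:" prefix is one of T5's standard task prefixes):

```python
# Sketch: every task is phrased as an input string with a task prefix.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

ids = tokenizer("translate English to German: The house is wonderful.",
                return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True))
# "Das Haus ist wunderbar."
```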
T5: Text-to-Text Transfer Transformer

https://medium.com/@Moscow25/the-best-deep-natural-language-papers-you-should-read-bert-gpt-2-and-looking-forward-1647f4438797
GPT-3
Really, really good text generation

[Figure: a GPT-3 completion, with labels "Provided by me" / "Generated"]

One problem: perpetuating unfortunate things

https://twitter.com/an_open_mind/status/1284487376312709120

Reasonable mental model

🤯 Useful for more than text?

Image Generation

https://openai.com/blog/dall-e/
No sign of slowing down
Number of parameters

• The way things are trending, only big companies can afford to compete!
• But there is another direction to go in: doing more with less

https://gluebenchmark.com/leaderboard
DistilBERT

• Knowledge Distillation: a smaller model is trained to reproduce the output of a larger model
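A sketch of the distillation loss itself (Hinton-style temperature-softened KL divergence, in the spirit of how DistilBERT was trained; temperature and shapes are illustrative):

```python
# Sketch: the student matches the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened distributions
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

student_logits = torch.randn(8, 100)   # toy batch, 100-class output
teacher_logits = torch.randn(8, 100)
print(distillation_loss(student_logits, teacher_logits))
```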
Papers and Leaderboards

paperswithcode.com
https://github.com/sebastianruder/NLP-progress
Implementations

https://github.com/huggingface/transformers
Questions?
http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Thank you!
Mais conteúdo relacionado

Mais procurados

NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTSuman Debnath
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Jeong-Gwan Lee
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understandinggohyunwoong
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsOVHcloud
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERTshaurya uppal
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...Jinwon Lee
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Databricks
 
Introduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNIntroduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNHye-min Ahn
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRUananth
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingMinh Pham
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMsSylvainGugger
 

Mais procurados (20)

NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
An introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERTAn introduction to the Transformers architecture and BERT
An introduction to the Transformers architecture and BERT
 
Attention is All You Need (Transformer)
Attention is All You Need (Transformer)Attention is All You Need (Transformer)
Attention is All You Need (Transformer)
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
Transformers AI PPT.pptx
Transformers AI PPT.pptxTransformers AI PPT.pptx
Transformers AI PPT.pptx
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Fine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP modelsFine tune and deploy Hugging Face NLP models
Fine tune and deploy Hugging Face NLP models
 
gpt3_presentation.pdf
gpt3_presentation.pdfgpt3_presentation.pdf
gpt3_presentation.pdf
 
NLP State of the Art | BERT
NLP State of the Art | BERTNLP State of the Art | BERT
NLP State of the Art | BERT
 
PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...PR-297: Training data-efficient image transformers & distillation through att...
PR-297: Training data-efficient image transformers & distillation through att...
 
Introduction to Transformer Model
Introduction to Transformer ModelIntroduction to Transformer Model
Introduction to Transformer Model
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
 
Transformers in 2021
Transformers in 2021Transformers in 2021
Transformers in 2021
 
Introduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNNIntroduction For seq2seq(sequence to sequence) and RNN
Introduction For seq2seq(sequence to sequence) and RNN
 
Recurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRURecurrent Neural Networks, LSTM and GRU
Recurrent Neural Networks, LSTM and GRU
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
[Paper review] BERT
[Paper review] BERT[Paper review] BERT
[Paper review] BERT
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
Bert
BertBert
Bert
 

Semelhante a Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)

Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Sergey Karayev
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff
[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff
[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy StuffCONNECT FOUNDATION
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...
Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...
Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...Universitat Politècnica de Catalunya
 
NTU DBME5028 Week8 Transfer Learning
NTU DBME5028 Week8 Transfer LearningNTU DBME5028 Week8 Transfer Learning
NTU DBME5028 Week8 Transfer LearningSean Yu
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingYoung Seok Kim
 
Chapter06 designing class
Chapter06 designing classChapter06 designing class
Chapter06 designing classFajar Baskoro
 
What multimodal foundation models cannot perceive
What multimodal foundation models cannot perceiveWhat multimodal foundation models cannot perceive
What multimodal foundation models cannot perceiveUniversity of Amsterdam
 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
CSE202.pptx
CSE202.pptxCSE202.pptx
CSE202.pptxJoyBoy45
 
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Roelof Pieters
 
Lancaster UCREL Summer School 2017 - Big Data and NLP
Lancaster UCREL Summer School 2017 - Big Data and NLPLancaster UCREL Summer School 2017 - Big Data and NLP
Lancaster UCREL Summer School 2017 - Big Data and NLPDaniel Kershaw
 
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2Dr. Aparna Varde
 

Semelhante a Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021) (20)

Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff
[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff
[부스트캠프 Tech Talk] 안영진_Tackling Complexity with Easy Stuff
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...
Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...
Deep Learning for Computer Vision: Transfer Learning and Domain Adaptation (U...
 
NTU DBME5028 Week8 Transfer Learning
NTU DBME5028 Week8 Transfer LearningNTU DBME5028 Week8 Transfer Learning
NTU DBME5028 Week8 Transfer Learning
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
Chapter06 designing class
Chapter06 designing classChapter06 designing class
Chapter06 designing class
 
What multimodal foundation models cannot perceive
What multimodal foundation models cannot perceiveWhat multimodal foundation models cannot perceive
What multimodal foundation models cannot perceive
 
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
Lecture 11: ML Deployment & Monitoring (Full Stack Deep Learning - Spring 2021)
 
CSE202.pptx
CSE202.pptxCSE202.pptx
CSE202.pptx
 
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
Transfer Learning and Domain Adaptation - Ramon Morros - UPC Barcelona 2018
 
Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!Deep Learning & NLP: Graphs to the Rescue!
Deep Learning & NLP: Graphs to the Rescue!
 
Lancaster UCREL Summer School 2017 - Big Data and NLP
Lancaster UCREL Summer School 2017 - Big Data and NLPLancaster UCREL Summer School 2017 - Big Data and NLP
Lancaster UCREL Summer School 2017 - Big Data and NLP
 
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
Transfer Learning (D2L4 Insight@DCU Machine Learning Workshop 2017)
 
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2
Information to Wisdom: Commonsense Knowledge Extraction and Compilation - Part 2
 

Mais de Sergey Karayev

Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Sergey Karayev
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningSergey Karayev
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningSergey Karayev
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningSergey Karayev
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningSergey Karayev
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSergey Karayev
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningSergey Karayev
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019Sergey Karayev
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Sergey Karayev
 

Mais de Sergey Karayev (16)

Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Data Management - Full Stack Deep Learning
Data Management - Full Stack Deep LearningData Management - Full Stack Deep Learning
Data Management - Full Stack Deep Learning
 
Testing and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep LearningTesting and Deployment - Full Stack Deep Learning
Testing and Deployment - Full Stack Deep Learning
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
 
Setting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep LearningSetting up Machine Learning Projects - Full Stack Deep Learning
Setting up Machine Learning Projects - Full Stack Deep Learning
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
 

Último

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 

Último (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)

  • 18. Full Stack Deep Learning - UC Berkeley Spring 2021 Map one-hot to dense vectors • Problem: how do we find the values of the embedding matrix? 18 https://towardsdatascience.com/why-do-we-use-embeddings-in-nlp-2f20e1b632d2 [Figure: a V x V one-hot matrix multiplied by a V x E embedding matrix yields dense E-dimensional vectors]
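A quick sketch of what slide 18 is doing (toy sizes and names are mine, not the lecture's): multiplying a one-hot row vector by a V x E embedding matrix just selects one row of that matrix, which is exactly the lookup an embedding layer performs.

import torch
import torch.nn.functional as F

V, E = 10, 4                                          # toy vocabulary and embedding sizes
embedding = torch.nn.Embedding(V, E)                  # the V x E embedding matrix

word_id = torch.tensor([3])
one_hot = F.one_hot(word_id, num_classes=V).float()   # 1 x V one-hot vector

# one-hot times the V x E matrix == picking out row 3 directly
via_matmul = one_hot @ embedding.weight               # 1 x E dense vector
via_lookup = embedding(word_id)                       # same 1 x E vector
assert torch.allclose(via_matmul, via_lookup)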
  • 19. Full Stack Deep Learning - UC Berkeley Spring 2021 Solution 1: Learn as part of the task 19
  • 20. Full Stack Deep Learning - UC Berkeley Spring 2021 Solution 2: Learn a Language Model • "Pre-train" for your NLP task by learning a really good word embedding! • How to learn a really good embedding? Train for a very general task on a large corpus of text. 20
  • 21. Full Stack Deep Learning - UC Berkeley Spring 2021 Solution 2: Learn a Language Model 21 http://jalammar.github.io/illustrated-word2vec/
  • 22. Full Stack Deep Learning - UC Berkeley Spring 2021 • Slide an N-sized window through the text, forming a dataset of predicting the last word. N-Grams 22 http://jalammar.github.io/illustrated-word2vec/
  • 23. Full Stack Deep Learning - UC Berkeley Spring 2021 • Slide an N-sized window through the text, forming a dataset of predicting the last word. N-Grams 23 http://jalammar.github.io/illustrated-word2vec/
  • 24. Full Stack Deep Learning - UC Berkeley Spring 2021 • Slide an N-sized window through the text, forming a dataset of predicting the last word. N-Grams 24 http://jalammar.github.io/illustrated-word2vec/
  • 25. Full Stack Deep Learning - UC Berkeley Spring 2021 • Look on both sides of the target word, and form multiple samples from each N-gram Skip-grams 25 http://jalammar.github.io/illustrated-word2vec/
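To make the sliding-window idea concrete, here is a minimal sketch (my toy code, not the lecture's) that turns a token list into skip-gram (center, context) training pairs:

def skipgram_pairs(tokens, window=2):
    # pair each center word with every word within `window` positions of it
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("thou shalt not make a machine".split()))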
  • 26. Full Stack Deep Learning - UC Berkeley Spring 2021 • Binary instead of multi-class (faster training): predict whether a (word, neighbor) pair is real or randomly sampled, rather than predicting the word over the whole vocabulary (word2vec's "negative sampling") Speed up training 26 http://jalammar.github.io/illustrated-word2vec/
  • 27. Full Stack Deep Learning - UC Berkeley Spring 2021 Training 27 http://jalammar.github.io/illustrated-word2vec/
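A minimal sketch of that binary training step (negative sampling), with made-up tensor sizes; real word2vec implementations differ in details:

import torch
import torch.nn.functional as F

V, E = 1000, 64
in_embed = torch.nn.Embedding(V, E)    # center-word ("input") vectors
out_embed = torch.nn.Embedding(V, E)   # context-word ("output") vectors

center = torch.tensor([5])             # one real (center, context) pair
pos_ctx = torch.tensor([42])
neg_ctx = torch.randint(0, V, (5,))    # five randomly sampled fake contexts

# score = dot product; push it up for the real pair, down for the fakes
pos_score = (in_embed(center) * out_embed(pos_ctx)).sum(-1)
neg_score = (in_embed(center) * out_embed(neg_ctx)).sum(-1)

loss = F.binary_cross_entropy_with_logits(
    torch.cat([pos_score, neg_score]),
    torch.cat([torch.ones(1), torch.zeros(5)]),
)
loss.backward()                        # gradients update both embedding tables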
  • 28. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 28 http://jalammar.github.io/illustrated-word2vec/ "king"
  • 29. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 29 http://jalammar.github.io/illustrated-word2vec/
  • 30. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 30 http://jalammar.github.io/illustrated-word2vec/
  • 31. Full Stack Deep Learning - UC Berkeley Spring 2021 Vector math... 31 https://mathinsight.org/vectors_cartesian_coordinates_2d_3d
  • 32. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 32 http://jalammar.github.io/illustrated-word2vec/
  • 33. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 33 http://jalammar.github.io/illustrated-word2vec/
  • 34. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 34 http://jalammar.github.io/illustrated-word2vec/
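The "vector math" on these slides is ordinary arithmetic on embedding rows plus a nearest-neighbor search by cosine similarity. A toy numpy sketch with made-up 2-D vectors, purely to show the mechanics:

import numpy as np

vec = {  # made-up 2-D embeddings, chosen so the analogy works out
    "king": np.array([0.9, 0.8]), "man": np.array([0.5, 0.1]),
    "woman": np.array([0.5, 0.9]), "queen": np.array([0.9, 1.6]),
}

query = vec["king"] - vec["man"] + vec["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# nearest neighbor, excluding the words used in the query
best = max((w for w in vec if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vec[w]))
print(best)  # queen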
  • 35. Full Stack Deep Learning - UC Berkeley Spring 2021 Word2Vec (2013) 35 https://ruder.io/nlp-imagenet/
  • 36. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 36
  • 37. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Transfer Learning in Computer Vision • Embeddings and Language Models • "NLP's ImageNet Moment": ELMo/ULMFit • Transformers • Attention in detail • BERT, GPT-2, DistilBERT, T5 37
  • 38. Full Stack Deep Learning - UC Berkeley Spring 2021 Beyond embeddings • Word2Vec and GloVe embeddings became popular in ~2013-14 • Boosted accuracy on many tasks by low single-digit % • But these representations are shallow: • only first layer would have benefit of seeing all of Wikipedia • rest of the model -- LSTMs, etc -- would be trained only on the task dataset (much smaller) • Why not pre-train more layers, and thus disambiguate words (e.g. rule in "to rule" vs "a rule"), learn grammar, etc? 38
  • 39. Full Stack Deep Learning - UC Berkeley Spring 2021 ELMo (2018) • Bidirectional stacked LSTM! 39 https://ruder.io/nlp-imagenet/
  • 40. Full Stack Deep Learning - UC Berkeley Spring 2021 ELMo (2018) 40
  • 41. Full Stack Deep Learning - UC Berkeley Spring 2021 ELMo (2018) 41 https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
  • 42. Full Stack Deep Learning - UC Berkeley Spring 2021 Q&A: SQuAD • 100K question-answer pairs • Answers are always spans in the passage 42 https://arxiv.org/abs/1606.05250
  • 43. Full Stack Deep Learning - UC Berkeley Spring 2021 Natural Language Inference: SNLI • What relation exists between a piece of text (the premise) and a hypothesis? • 570K pairs 43 https://nlp.stanford.edu/pubs/snli_paper.pdf
  • 44. Full Stack Deep Learning - UC Berkeley Spring 2021 ELMo (2018) 44 https://www.slideshare.net/shuntaroy/a-review-of-deep-contextualized-word-representations-peters-2018
  • 45. Full Stack Deep Learning - UC Berkeley Spring 2021 GLUE 45 https://mccormickml.com/2019/11/05/GLUE/
  • 46. Full Stack Deep Learning - UC Berkeley Spring 2021 GLUE • 9 tasks, model score is averaged across them 46 https://mccormickml.com/2019/11/05/GLUE/
  • 47. Full Stack Deep Learning - UC Berkeley Spring 2021 ULMFit (2018) 47 https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/
  • 48. Full Stack Deep Learning - UC Berkeley Spring 2021 ULMFit (2018) 48 https://humboldt-wi.github.io/blog/research/information_systems_1819/group4_ulmfit/
  • 49. Full Stack Deep Learning - UC Berkeley Spring 2021 Pre-training is here! 49
  • 50. Full Stack Deep Learning - UC Berkeley Spring 2021 Pre-training is here! 50
  • 51. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 51
  • 52. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Transfer Learning in Computer Vision • Embeddings and Language Models • "NLP's ImageNet Moment": ELMo/ULMFit • Transformers • Attention in detail • BERT, GPT-2, DistilBERT, T5 52
  • 53. Full Stack Deep Learning - UC Berkeley Spring 2021 Rise of Transformers 53
  • 54. Full Stack Deep Learning - UC Berkeley Spring 2021 Rise of Transformers 54
  • 55. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention is all you need (2017) • Encoder-decoder with only attention and fully-connected layers (no recurrence or convolutions) • Set new SOTA on translation datasets. 55
  • 56. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention is all you need (2017) 56 • For simplicity, can focus just on the encoder • (e.g. BERT is just the encoder)
  • 57. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention is all you need (2017) • The components: • (Masked) Self-attention • Positional encoding • Layer normalization 57
  • 58. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Transfer Learning in Computer Vision • Embeddings and Language Models • "NLP's ImageNet Moment": ELMo/ULMFit • Transformers • Attention in detail • BERT, GPT-2, DistilBERT, T5 58
  • 59. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention • Input: sequence of tensors x_1, ..., x_t 59 http://www.peterbloem.nl/blog/transformers
  • 60. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention • Input: sequence of tensors • Output: sequence of tensors, each one a weighted sum of the input sequence: y_i = sum_j w_ij x_j 60 http://www.peterbloem.nl/blog/transformers
  • 61. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention • Input: sequence of tensors • Output: sequence of tensors, each one a weighted sum of the input sequence: y_i = sum_j w_ij x_j, where w_ij is not a learned weight, but a function of x_i and x_j 61 http://www.peterbloem.nl/blog/transformers
  • 62. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention • Input: sequence of tensors • Output: sequence of tensors, each one a weighted sum of the input sequence: y_i = sum_j w_ij x_j, where w_ij is not a learned weight but a function of x_i and x_j (e.g. a softmax over the dot products x_i · x_j), and must sum to 1 over j 62 http://www.peterbloem.nl/blog/transformers
  • 63. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention 63 http://www.peterbloem.nl/blog/transformers The Cat is yawning
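Slides 59-63 in code: basic self-attention has no parameters at all, just dot products and a row-wise softmax. A sketch following the peterbloem.nl post the slides cite:

import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8)                 # (batch, sequence length t, dim k)

raw = torch.bmm(x, x.transpose(1, 2))    # w'_ij = x_i . x_j, shape (1, t, t)
w = F.softmax(raw, dim=2)                # each row now sums to 1 over j
y = torch.bmm(w, x)                      # y_i = sum_j w_ij x_j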
  • 64. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention • SO FAR: • No learned weights • Order of the sequence does not affect result of computations 64 http://www.peterbloem.nl/blog/transformers
  • 65. Full Stack Deep Learning - UC Berkeley Spring 2021 Basic self-attention • SO FAR: • No learned weights • Order of the sequence does not affect result of computations 65 http://www.peterbloem.nl/blog/transformers Let's learn some weights!
  • 66. Full Stack Deep Learning - UC Berkeley Spring 2021 Query, Key, Value • Every input vector x_i is used in 3 ways: • Compared to every other vector to compute attention weights for its own output y_i (query) • Compared to every other vector to compute attention weight w_ij for output y_j (key) • Summed with other vectors to form the result of the attention weighted sum (value) 66 http://www.peterbloem.nl/blog/transformers
  • 67. Full Stack Deep Learning - UC Berkeley Spring 2021 Query, Key, Value • We can process each input vector to fulfill the three roles with matrix multiplication • Learning the matrices --> learning attention 67 http://www.peterbloem.nl/blog/transformers
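A sketch of those three learned projections plus scaled dot-product attention (dimension names are mine; this follows the peterbloem.nl post rather than any official code):

import torch
import torch.nn.functional as F

k = 8                                    # embedding dimension
to_q = torch.nn.Linear(k, k, bias=False) # W_q
to_k = torch.nn.Linear(k, k, bias=False) # W_k
to_v = torch.nn.Linear(k, k, bias=False) # W_v

x = torch.randn(1, 4, k)                 # (batch, t, k)
q, key, v = to_q(x), to_k(x), to_v(x)

# scaling by sqrt(k) keeps the softmax inputs from growing with dimension
w = F.softmax(torch.bmm(q, key.transpose(1, 2)) / k ** 0.5, dim=2)
y = torch.bmm(w, v)                      # attention-weighted sum of the values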
  • 68. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 68
  • 69. Full Stack Deep Learning - UC Berkeley Spring 2021 Multi-head attention • Multiple "heads" of attention just means learning different sets of W_q, W_k, and W_v matrices simultaneously. • Implemented as just a single matrix anyway... 69 http://www.peterbloem.nl/blog/transformers
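The "single matrix" point sketched in code (my illustration, with toy sizes): project to Q, K, V once with one weight matrix, then reshape the k-dimensional vectors into h smaller heads.

import torch

b, t, k, h = 1, 4, 8, 2                        # batch, seq len, dim, heads
to_qkv = torch.nn.Linear(k, 3 * k, bias=False) # one matrix yields Q, K, and V

x = torch.randn(b, t, k)
q, key, v = to_qkv(x).chunk(3, dim=-1)

def split_heads(z):
    # split the k features into h heads of size k // h
    return z.reshape(b, t, h, k // h).transpose(1, 2)   # (b, h, t, k/h)

q, key, v = map(split_heads, (q, key, v))
w = torch.softmax(q @ key.transpose(-2, -1) / (k // h) ** 0.5, dim=-1)
y = (w @ v).transpose(1, 2).reshape(b, t, k)            # merge heads back to (b, t, k)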
  • 70. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer • Self-attention layer -> Layer normalization -> Dense layer 70 http://www.peterbloem.nl/blog/transformers
  • 71. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer • Self-attention layer -> Layer normalization -> Dense layer 71 http://www.peterbloem.nl/blog/transformers
  • 72. Full Stack Deep Learning - UC Berkeley Spring 2021 Layer Normalization • Neural net layers work best when input vectors have a consistent mean and std (ideally zero mean, unit std) in each dimension • (remember input data scaling, weight initialization) • As inputs flow through the network, means and std's get blown out. • Layer Normalization is a hack to reset things to where we want them in between layers. 72 https://arxiv.org/pdf/1803.08494.pdf
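What layer normalization computes, per position, in a couple of lines (a sketch; torch.nn.LayerNorm is the real thing and also adds learned scale and shift parameters):

import torch

x = torch.randn(1, 4, 8)                          # (batch, t, k)

mean = x.mean(dim=-1, keepdim=True)               # statistics over the k features
std = x.std(dim=-1, keepdim=True, unbiased=False)
x_norm = (x - mean) / (std + 1e-5)                # roughly zero mean, unit std again

ln = torch.nn.LayerNorm(8)                        # the built-in equivalent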
  • 73. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer • Self-attention layer -> Layer normalization -> Dense layer 73 http://www.peterbloem.nl/blog/transformers
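Putting the recipe on slides 70-73 together, a minimal transformer block (a sketch, not the lecture's code: it adds the residual connections standard transformer blocks use, and leans on PyTorch's built-in attention for brevity):

import torch

class TransformerBlock(torch.nn.Module):
    def __init__(self, k, heads):
        super().__init__()
        self.attention = torch.nn.MultiheadAttention(k, heads, batch_first=True)
        self.norm1 = torch.nn.LayerNorm(k)
        self.norm2 = torch.nn.LayerNorm(k)
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(k, 4 * k), torch.nn.ReLU(), torch.nn.Linear(4 * k, k))

    def forward(self, x):
        attended, _ = self.attention(x, x, x)   # self-attention: q = k = v = x
        x = self.norm1(attended + x)            # residual + layer norm
        return self.norm2(self.ff(x) + x)       # dense layer, residual, norm

block = TransformerBlock(k=8, heads=2)
out = block(torch.randn(1, 4, 8))               # (batch, t, k) in and out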
  • 74. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer • SO FAR: • Learned query, key, value weights • Multiple heads • Order of the sequence does not affect result of computations 74 http://www.peterbloem.nl/blog/transformers
  • 75. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer • SO FAR: • Learned query, key, value weights • Multiple heads • Order of the sequence does not affect result of computations 75 http://www.peterbloem.nl/blog/transformers Let's encode each vector with position
  • 76. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer 76 http://www.peterbloem.nl/blog/transformers
  • 77. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer • Position embedding: just what it sounds! 77 http://www.peterbloem.nl/blog/transformers
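A sketch of a learned position embedding (sizes are made up): a second embedding table indexed by position rather than by token id, added to the word embeddings so that order finally affects the computation.

import torch

V, E, max_len = 1000, 64, 512
tok_emb = torch.nn.Embedding(V, E)
pos_emb = torch.nn.Embedding(max_len, E)    # one learned vector per position

tokens = torch.randint(0, V, (1, 10))       # (batch, t) token ids
positions = torch.arange(10).unsqueeze(0)   # 0, 1, ..., t-1

x = tok_emb(tokens) + pos_emb(positions)    # permuting the input now changes x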
  • 78. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer: last trick • Since the Transformer sees all inputs at once, to predict next vector in sequence (e.g. generate text), we need to mask the future. 78 http://www.peterbloem.nl/blog/transformers
  • 79. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformer: last trick • Since the Transformer sees all inputs at once, to predict next vector in sequence (e.g. generate text), we need to mask the future. 79 http://www.peterbloem.nl/blog/transformers
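Masking the future in code: before the softmax, set every score where j > i to minus infinity, so those positions get zero attention weight (a sketch of the standard trick):

import torch
import torch.nn.functional as F

t = 4
scores = torch.randn(1, t, t)               # raw attention scores w'_ij

# the upper triangle (j > i) is the future: hide it before the softmax
mask = torch.triu(torch.ones(t, t), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

w = F.softmax(scores, dim=-1)               # future positions get weight 0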
  • 80. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 80
  • 81. Full Stack Deep Learning - UC Berkeley Spring 2021 Outline • Transfer Learning in Computer Vision • Embeddings and Language Models • "NLP's ImageNet Moment": ELMo/ULMFit • Transformers • Attention in detail • BERT, GPT-2, DistilBERT, T5 81
  • 82. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention is all you need (2017) • Encoder-decoder for translation 82
  • 83. Full Stack Deep Learning - UC Berkeley Spring 2021 Attention is all you need (2017) 83 http://www.peterbloem.nl/blog/transformers • Encoder-decoder for translation • Later models made it mostly just the encoder or just the decoder • ...but then the latest models are back to encoder-decoder
  • 84. Full Stack Deep Learning - UC Berkeley Spring 2021 GPT / GPT-2 • Generative Pre-trained Transformer 84 https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc
  • 85. Full Stack Deep Learning - UC Berkeley Spring 2021 GPT / GPT-2 • Generative Pre-trained Transformer • GPT learns to predict the next word in the sequence, just like ELMo or ULMFit 85 https://arxiv.org/pdf/1810.04805.pdf
  • 86. Full Stack Deep Learning - UC Berkeley Spring 2021 GPT / GPT-2 • Generative Pre-trained Transformer • GPT learns to predict the next word in the sequence, just like ELMo or ULMFit • Since it conditions only on preceding words, it uses masked self-attention 86 http://jalammar.github.io/illustrated-gpt2/
  • 87. Full Stack Deep Learning - UC Berkeley Spring 2021 GPT / GPT-2 87 GPT-2 model sizes: 117M, 345M, 762M, and 1.5B parameters. Trained on 8M web pages http://jalammar.github.io/illustrated-gpt2/
  • 88. Full Stack Deep Learning - UC Berkeley Spring 2021 Talk to Transformer 88 https://talktotransformer.com 1.5B parameter model
  • 89. Full Stack Deep Learning - UC Berkeley Spring 2021 BERT • Bidirectional Encoder Representations from Transformers • Encoder blocks only (no masking) • BERT involves pre-training on A LOT of text with 15% of all words masked out • also sometimes predicting whether one sentence follows another 89 https://docs.google.com/presentation/d/1fIhGikFPnb7G5kr58OvYC3GN4io7MznnM0aAgadvJfc
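The "15% of all words masked out" objective, sketched on toy token ids (the ids here are made up; real BERT also sometimes swaps in a random token or keeps the original instead of [MASK]):

import torch

MASK_ID = 103                                # stand-in for the [MASK] token id
tokens = torch.randint(1000, 30000, (1, 12)) # fake input token ids

masked = tokens.clone()
is_masked = torch.rand(tokens.shape) < 0.15  # pick ~15% of positions
masked[is_masked] = MASK_ID

# labels: predict the original id at masked positions, ignore the rest
labels = tokens.masked_fill(~is_masked, -100)  # -100 = ignored by the loss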
  • 90. Full Stack Deep Learning - UC Berkeley Spring 2021 BERT • 340M parameters: • 24 transformer blocks, embedding dim of 1024, 16 attention heads 90
  • 91. Full Stack Deep Learning - UC Berkeley Spring 2021 Transformers 91 https://medium.com/huggingface/encoder-decoders-in-transformers-a-hybrid-pre-trained-architecture-for-seq2seq-af4d7bf14bb8
  • 92. Full Stack Deep Learning - UC Berkeley Spring 2021 Number of parameters 92 https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
  • 93. Full Stack Deep Learning - UC Berkeley Spring 2021 T5: Text-to-Text Transfer Transformer • Feb 2020 • Evaluated most recent transfer learning techniques • Input and output are both text strings • Trained on C4 (Colossal Clean Crawled Corpus) - 100x larger than Wikipedia • 11B parameters • SOTA on GLUE, SuperGLUE, SQuAD 93 https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
  • 94. Full Stack Deep Learning - UC Berkeley Spring 2021 T5: Text-to-Text Transfer Transformer 94 https://medium.com/@Moscow25/the-best-deep-natural-language-papers-you-should-read-bert-gpt-2-and-looking-forward-1647f4438797
  • 95. Full Stack Deep Learning - UC Berkeley Spring 2021 95
  • 96. Full Stack Deep Learning - UC Berkeley Spring 2021 GPT-3 96
  • 97. Full Stack Deep Learning - UC Berkeley Spring 2021 97
  • 98. Full Stack Deep Learning - UC Berkeley Spring 2021 98
  • 99. GSV Breakfast Club - January 2021 - Sergey Karayev Really, really good text generation "Provided by me" "Generated"
  • 100. GSV Breakfast Club - January 2021 - Sergey Karayev One problem: perpetuating unfortunate things https://twitter.com/an_open_mind/status/1284487376312709120
  • 101. GSV Breakfast Club - January 2021 - Sergey Karayev Reasonable mental model
  • 102. GSV Breakfast Club - January 2021 - Sergey Karayev 🤯 Useful for more than text?
  • 103. GSV Breakfast Club - January 2021 - Sergey Karayev 🤯 Useful for more than text?
  • 104. GSV Breakfast Club - January 2021 - Sergey Karayev Image Generation https://openai.com/blog/dall-e/
  • 105. Full Stack Deep Learning - UC Berkeley Spring 2021 No sign of slowing down 105
  • 106. Full Stack Deep Learning - UC Berkeley Spring 2021 106
  • 107. Full Stack Deep Learning - UC Berkeley Spring 2021 Number of parameters • The way things are trending, only big companies can afford to compete! • But there is another direction to go in: doing more with less 107 https://gluebenchmark.com/leaderboard
  • 108. Full Stack Deep Learning - UC Berkeley Spring 2021 DistilBERT 108 Knowledge Distillation: a smaller model is trained to reproduce the output of a larger model
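Knowledge distillation in one loss term (a sketch with random stand-in logits): the student is trained to match the teacher's softened output distribution, usually alongside the normal task loss.

import torch
import torch.nn.functional as F

teacher_logits = torch.randn(8, 100)          # stand-ins for real model outputs
student_logits = torch.randn(8, 100, requires_grad=True)

T = 2.0                                       # temperature softens both distributions
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T ** 2                                    # conventional T^2 gradient rescaling
loss.backward()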
  • 109. Full Stack Deep Learning - UC Berkeley Spring 2021 Papers and Leaderboards 109 paperswithcode.com
  • 110. Full Stack Deep Learning - UC Berkeley Spring 2021 Papers and Leaderboards 110 https://github.com/sebastianruder/NLP-progress
  • 111. Full Stack Deep Learning - UC Berkeley Spring 2021 Implementations 111 https://github.com/huggingface/transformers
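With the huggingface library, using these models takes a few lines; for example, generating text with GPT-2 (model name as published on the model hub; weights download on first use):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Deep learning is", max_length=30)[0]["generated_text"])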
  • 112. Full Stack Deep Learning - UC Berkeley Spring 2021 Questions? 112
  • 113. Full Stack Deep Learning - UC Berkeley Spring 2021 113 http://www.incompleteideas.net/IncIdeas/BitterLesson.html
  • 114. Full Stack Deep Learning - UC Berkeley Spring 2021 Thank you! 114