XLNET: GENERALIZED
AUTOREGRESSIVE PRETRAINING
FOR LANGUAGE
UNDERSTANDING
ZHILIN YANG, ZIHANG DAI, YIMING YANG, JAIME CARBONELL, RUSLAN SALAKHUTDINOV, QUOC V. LE
AR and AE
Autoregressive (AR) language modeling:
An autoregressive model's output h_t at time t depends not just on x_t, but on all x_s from previous time steps.
Given a text sequence x = (x_1, ..., x_T), AR language modeling factorizes the likelihood into a forward product:
p(x) = ∏_{t=1}^{T} p(x_t | x_{<t})
Examples:
GPT; ELMo (a simple combination of two AR models, one forward and one backward)
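As a quick illustration, a minimal sketch of the forward factorization with a toy bigram model (hypothetical corpus and add-one smoothing, nothing like GPT's actual architecture):

```python
import math
from collections import Counter

# Toy corpus; purely illustrative, not a trained language model.
corpus = "the cat sat on the mat".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def p_next(prev, tok):
    # p(x_t | x_{t-1}) with add-one smoothing.
    return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab)

def ar_log_likelihood(tokens):
    # log p(x) = sum_t log p(x_t | x_{<t}); a bigram model truncates the
    # history x_{<t} to just the previous token (p(x_1) omitted for brevity).
    return sum(math.log(p_next(prev, tok))
               for prev, tok in zip(tokens, tokens[1:]))

print(ar_log_likelihood("the cat sat on the mat".split()))
```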
AR and AE
Autoencoding (AE) language modeling:
The AE language model aims to reconstruct the original data from a corrupted input.
Corrupted input: corruption here means replacing original tokens with the artificial [MASK] token.
Example:
BERT
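A sketch of the corruption step, assuming BERT's usual 15% masking rate (simplified: every selected token becomes [MASK]; real BERT sometimes keeps the token or substitutes a random one):

```python
import random

def corrupt(tokens, mask_rate=0.15, seed=0):
    # Replace a random subset of tokens with [MASK]; the AE model is then
    # trained to reconstruct the originals at exactly these positions.
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append("[MASK]")
            targets[i] = tok          # position -> original token
        else:
            corrupted.append(tok)
    return corrupted, targets

print(corrupt("i am studying in the university of waterloo".split()))
```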
Pros and Cons of AR and AE
AR:
Advantages: good at generative NLP tasks (generation usually proceeds in one direction, typically left to right, which matches the AR factorization).
Disadvantages: it conditions on only one direction of context (forward or backward).
AE:
Advantages: bidirectional. Downstream language understanding tasks often require bidirectional context information.
Disadvantages:
1. The artificial symbols like [MASK] used by BERT during pretraining are absent from real data at finetuning time, resulting in a pretrain-finetune discrepancy.
2. It assumes the predicted (masked) tokens are independent of each other given the unmasked tokens.
Example: "I am [MASK] in the [MASK] of Waterloo." BERT predicts the two masks separately, so it cannot capture the dependency between them (written out below).
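Written out for the example, assuming the intended completion is "studying ... University" (an assumed fill, for illustration only; c denotes the unmasked context):

```latex
p(\text{studying},\,\text{University} \mid c)
  \;\overset{\text{BERT}}{\approx}\;
  p(\text{studying} \mid c)\,p(\text{University} \mid c),
\qquad \text{AR chain rule:}\;
  p(\text{studying} \mid c)\,p(\text{University} \mid c,\,\text{studying})
```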
Can we combine the two methods so that we keep their pros and avoid their cons?
Yes!
XLNet is a generalized autoregressive method that leverages the best of both AR language modeling and AE while avoiding their limitations.
XLNet
Recall: in AR methods, we maximize the likelihood under a fixed forward or backward factorization order.
Idea:
XLNet maximizes the expected log-likelihood of a sequence w.r.t. all possible permutations of the factorization order.
Permutations:
For example, take a sentence with 4 tokens [x1 x2 x3 x4], and suppose we want to predict x3.
There are 4! (= 24) permutations:
[x1 x2 x3 x4], [x1 x3 x2 x4], [x1 x4 x3 x2], [x2 x1 x3 x4], ...
Every token can appear before x3 in some order, so the forward maximum-likelihood objective, averaged over orders, lets the prediction of x3 draw on every token in the sentence (see the sketch below).
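A small sketch of this point: enumerating all 24 orders of the toy sentence shows that every subset of {x1, x2, x4} eventually serves as the context for predicting x3 (illustrative only; XLNet samples one order per sequence rather than enumerating them):

```python
import itertools

tokens = ["x1", "x2", "x3", "x4"]

def permutation_contexts(target="x3"):
    # For every factorization order z, an AR model predicts the target from
    # the tokens that precede it *in z* -- so across all 4! = 24 orders,
    # every other token eventually serves as context for x3.
    contexts = set()
    for z in itertools.permutations(tokens):
        pos = z.index(target)
        contexts.add(frozenset(z[:pos]))
    return sorted(map(sorted, contexts), key=len)

for ctx in permutation_contexts():
    print(ctx)  # all 8 subsets, from [] up to ['x1', 'x2', 'x4']
```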
XLNet
Maximize the expected log-likelihood under a factorization order sampled from the set of all permutations (formulas below).
Also, XLNet does not rely on data corruption (no [MASK] tokens in XLNet).
FYI: for BERT, which introduces masks, m_t indicates whether token t is masked; only masked positions contribute to the loss.
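The two objectives side by side, as in the paper (Z_T is the set of all length-T factorization orders; x̂ is BERT's corrupted input):

```latex
\text{XLNet:}\quad
\max_{\theta}\;\mathbb{E}_{z \sim \mathcal{Z}_T}
  \Bigl[\,\sum_{t=1}^{T} \log p_{\theta}\bigl(x_{z_t} \mid x_{z_{<t}}\bigr)\Bigr]
\qquad
\text{BERT:}\quad
\max_{\theta}\;\sum_{t=1}^{T} m_t \log p_{\theta}\bigl(x_t \mid \hat{x}\bigr)
```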
Problems:
There is a contradiction in a standard Transformer architecture:
1. To predict the token x_t, the model should use only the position of x_t, not its content.
2. To predict tokens after x_t, the model should encode the content of x_t as part of their context.
A single hidden state per position cannot do both (in Transformers, the word embedding and the position information are fused into one representation).
Solution: two-stream self-attention.
The model only encounters text in its natural order during finetuning, which means we cannot actually reorder the input sentence; the permutation has to be implemented inside the encoder.
Solution: attention masks (sketched below).
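A sketch of how a sampled factorization order becomes attention masks (numpy; permutation_masks is a hypothetical helper name, and the input keeps its natural positional order -- only visibility changes):

```python
import numpy as np

def permutation_masks(order):
    # order: factorization order as 0-based positions, e.g. [2, 1, 3, 0]
    # for z = (x3, x2, x4, x1). mask[i, j] = True -> position i may attend
    # to position j.
    T = len(order)
    rank = np.empty(T, dtype=int)
    rank[np.array(order)] = np.arange(T)           # rank[i] = place of i in z
    content_mask = rank[:, None] >= rank[None, :]  # sees z_{<=t}, incl. itself
    query_mask = rank[:, None] > rank[None, :]     # sees z_{<t} only
    return content_mask, query_mask

c, q = permutation_masks([2, 1, 3, 0])  # the order [x3, x2, x4, x1]
print(q.astype(int))  # row 0 (x1, last in z) attends to x2, x3, x4 but not itself
```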
Two-Stream Self-Attention
• Content stream attention: the standard self-attention in Transformers.
• Query stream attention: used for predicting x_t; it sees the position of x_t but not its content.
Original sequence order: [x1, x2, x3, x4]
Sample a random factorization order: [x3, x2, x4, x1]
Calculate content stream attention (for x1, the last token in this order): KV = [h1, h2, h3, h4], Q = h1
Calculate query stream attention: KV = [h2, h3, h4], Q = g1
The initial values are h_i = e(x_i) (the word embedding) and g_i = w (a shared trainable vector).
Recall: g is taken from the last layer of the query-stream representation.
In this graph, the other parts of the encoder are omitted; the actual model uses the same structure as Transformer-XL.
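A minimal single-layer, single-head sketch of the two streams in numpy (hypothetical dimensions; no multi-head projections or Transformer-XL recurrence; reuses permutation_masks from the sketch above). Both streams read keys/values from the content stream h; only the queries and the masks differ:

```python
import numpy as np

def attend(q, kv, mask):
    # q: (T, d) queries; kv: (T, d) keys/values; mask[i, j] = True means
    # position i may attend to position j.
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.where(mask, np.exp(scores - scores.max()), 0.0)
    norm = weights.sum(-1, keepdims=True)
    # The first token in z sees nothing, so its row normalizes to zeros
    # (real implementations attend to Transformer-XL memory instead).
    weights = np.divide(weights, norm, out=np.zeros_like(weights),
                        where=norm > 0)
    return weights @ kv

T, d = 4, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(T, d))              # content stream, h_i init = e(x_i)
g = np.tile(rng.normal(size=d), (T, 1))  # query stream, g_i init = shared w
content_mask, query_mask = permutation_masks([2, 1, 3, 0])

h_next = attend(h, h, content_mask)  # h_t sees x_{z<=t}, including itself
g_next = attend(g, h, query_mask)    # g_t sees x_{z<t} only, never x_t itself
```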
Partial prediction
Optimizing over all orders is expensive, since T! is very large, and permutation language modeling converges slowly.
Formally, we split z into a non-target subsequence z_{≤c} and a target subsequence z_{>c}, where c is the cutting point, and we only predict the tokens after c.
In practice only about 1/K of the tokens are selected for prediction (K = 6 or 7 is recommended); see the sketch below.
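A sketch of the split, assuming K = 6 as recommended (hypothetical helper; c is chosen so that roughly 1/K of the tokens become targets):

```python
def split_targets(z, K=6):
    # z: a factorization order; keep only its tail as prediction targets.
    c = len(z) - max(1, len(z) // K)   # cutting point: roughly |z| * (1 - 1/K)
    return z[:c], z[c:]                # non-targets z_{<=c}, targets z_{>c}

ctx, targets = split_targets(list(range(12)), K=6)
print(ctx, targets)  # 10 context positions, 2 prediction targets
```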
Some methods from Transformer-XL are incorporated, such as the relative positional encoding scheme and the segment recurrence mechanism (both are helpful for long contexts).
Results:
XLNet outperforms BERT.
RoBERTa was released after XLNet, and it is hard to tell which one is better; XLNet may be better on reading comprehension tasks, especially with longer contexts.
Reference:
https://arxiv.org/pdf/1906.08237.pdf
https://eigenfoo.xyz/deep-autoregressive-models/
https://towardsdatascience.com/what-is-xlnet-and-why-it-outperforms-bert-8d8fce710335
Similar to Ruifeng.pptx
Language Model (D3L1 Deep Learning for Speech and Language UPC 2017)
Conditional Random Fields
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
Detecting Bugs in Binaries Using Decompilation and Data Flow Analysis
Attention is all you need (UPC Reading Group 2018, by Santi Pascual)
pptx - Psuedo Random Generator for Halfspaces
Theory of computing
The Theory of Finite Automata.pptx
Dsm as theory building
Digital communication lab lectures
Adversarial_Examples_in_Audio_and_Text.pptx
Ultra-efficient algorithms for testing well-parenthesised expressions by Tati...
Introduction to Tree-LSTMs
The Concurrent Constraint Programming Research Programmes -- Redux (part2)
Finding similar items in high dimensional spaces locality sensitive hashing
Дмитрий Селиванов, OK.RU. Finding Similar Items in high-dimensional spaces: L...
Matching networks for one shot learning
An overview of Hidden Markov Models (HMM)
Hmm and neural networks