BERT: Bidirectional
Encoder Representations
from Transformers
By: Shaurya Uppal
Defining Language
Language can be divided into 3 parts:
● Syntax
● Semantics
● Pragmatics
Syntax: word ordering and sentence form.
Semantics: the meaning of words.
Pragmatics: the social language skills that we use in our daily interactions with others.
Example of Syntax, Semantics, Pragmatics
+ This discussion is about BERT. [Syntax, Semantics, Pragmatics]
+ The green frogs sleep soundly. [Syntax, Semantics]
+ BERT play football good [None]
Why study BERT?
BERT achieves state-of-the-art performance on many Natural Language Processing tasks. It can perform tasks such as Text Classification, Text Similarity, Next Sentence Prediction, Question Answering, Auto-Summarization, Named Entity Recognition, etc.
What is BERT Exactly?
BERT is a model pretrained by Google that learns bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Dataset used to Pre-train BERT
+ BooksCorpus (800M words)
+ English Wikipedia (2,500M words)
A pretrained model can be applied via a feature-based approach or via fine-tuning (a code sketch of the difference follows below).
+ In fine-tuning, all weights change.
+ In the feature-based approach, only the final layer weights change (the approach taken by ELMo).
This pretrained model is then fine-tuned on different NLP tasks.
Pretraining and fine-tuning: you train a model m on Data A, then train the same model m on Data B starting from that checkpoint. [Slide 17]
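As a rough illustration of the two options (not from the original slides; it assumes the HuggingFace transformers library), the difference comes down to which parameters are left trainable:

```python
from transformers import BertForSequenceClassification

# Load a pretrained BERT with a small classification head on top.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Fine-tuning: all weights, including the pretrained encoder, remain trainable.
for param in model.parameters():
    param.requires_grad = True

# Feature-based (ELMo-style): freeze the pretrained encoder and train only the
# final classification layer on top of the extracted features.
for param in model.bert.parameters():
    param.requires_grad = False
for param in model.classifier.parameters():
    param.requires_grad = True
```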
Language Training Approach
To train a language model, there are two approaches:
Context-free
+ Traditionally we used context-free embeddings such as word2vec or GloVe.
Contextual
+ RNN
+ BERT
How does BERT Work?
BERT weights are learned in advance through two unsupervised tasks: masked
language modeling (predicting a missing word given the left and right context)
and next sentence prediction (predicting whether one sentence follows another).
BERT makes use of Transformer, an attention mechanism that learns contextual
relations between words (or sub-words) in a text.
(See the paper "Attention Is All You Need".)
Multi-headed attention is used in BERT. It uses multiple layers of attention and also incorporates multiple attention "heads" in every layer (12 in BERT-base, 16 in BERT-large). Since model weights are not shared between layers, a single BERT-base model effectively has up to 12 x 12 = 144 different attention mechanisms.
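A quick way to check the layer and head counts above (a sketch assuming the HuggingFace transformers library; not part of the original slides):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# output_attentions=True makes the model return one attention tensor per layer.
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("BERT uses multi-headed attention.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions

print(len(attentions))                           # 12 layers
print(attentions[0].shape)                       # (batch, 12 heads, seq_len, seq_len)
print(len(attentions) * attentions[0].shape[1])  # 144 attention mechanisms in total
```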
What does BERT learn, and how does it tokenize and handle OOV words?
Consider the input example: I went to the store. At the store, I bought fresh strawberries.
BERT uses a WordPiece tokenizer, which breaks an OOV (out-of-vocabulary) word into segments. For example, if play, ##ing, and ##ed are present in the vocabulary but playing and played are OOV words, then they will be broken down into play + ##ing and play + ##ed respectively (## marks a sub-word piece).
BERT also requires a special [CLS] classifier token at the beginning of a sequence and a [SEP] token at the end of each segment.
[CLS] I went to the store. [SEP] At the store I bought fresh straw ##berries. [SEP]
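A small sketch of this tokenization behaviour (assuming the HuggingFace transformers tokenizer; the exact sub-word splits depend on the vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sent_a = "I went to the store."
sent_b = "At the store, I bought fresh strawberries."

# WordPiece splits out-of-vocabulary words into known sub-word pieces.
print(tokenizer.tokenize(sent_b))
# e.g. ['at', 'the', 'store', ',', 'i', 'bought', 'fresh', 'straw', '##berries', '.']

# Encoding a sentence pair automatically inserts the [CLS] and [SEP] tokens.
encoded = tokenizer(sent_a, sent_b)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# e.g. ['[CLS]', 'i', 'went', 'to', 'the', 'store', '.', '[SEP]', 'at', ..., '##berries', '.', '[SEP]']
```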
Attention
An attention probe is a task defined on a pair of tokens (token_i, token_j) where the input is a model-wide attention vector formed by concatenating the entries a_ij from every attention matrix of every attention head in every layer.
Some visual attention patterns, and why we use the attention mechanism
Reason for attention: attention helps in the two main pretraining tasks of BERT, MLM (Masked Language Model) and NSP (Next Sentence Prediction).
Visual patterns from the attention mechanism
● Attention to the next word. [Layer 2, Head 0] | Like a backward RNN
● Attention to the previous word. [Layer 0, Head 2] | Like a forward RNN
● Attention to identical/related words.
● Attention to identical words in the other sentence. | Helps in the next sentence prediction task
● Attention to other words predictive of a word.
● Attention to the delimiter tokens [CLS] and [SEP].
● Attention to a bag of words.
How is the input fed into BERT?
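The original slide shows this as a figure. As a sketch (assuming the HuggingFace implementation of BERT), the input representation of each token is the sum of a WordPiece token embedding, a segment (sentence A/B) embedding, and a position embedding:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("My dog is hairy.", return_tensors="pt")
input_ids, token_type_ids = enc["input_ids"], enc["token_type_ids"]
positions = torch.arange(input_ids.size(1)).unsqueeze(0)

emb = model.embeddings
summed = (emb.word_embeddings(input_ids)              # WordPiece token embedding
          + emb.token_type_embeddings(token_type_ids)  # segment (A/B) embedding
          + emb.position_embeddings(positions))        # position embedding
# The embedding layer then applies LayerNorm and dropout to this sum.
```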
MLM: Masked Language Model
Input: My dog is hairy.
Masking is done randomly: 15% of all WordPiece tokens in each input sequence are masked. We only predict the masked tokens rather than reconstructing the entire input sequence.
Procedure (sketched in code below):
+ 80% of the time: replace the word with [MASK]. My dog is [MASK].
+ 10% of the time: replace the word with a random word. My dog is apple.
+ 10% of the time: keep the word unchanged. My dog is hairy.
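A minimal sketch of this 80/10/10 masking rule (illustrative only; `vocab` is a hypothetical list of WordPiece tokens to sample random replacements from):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style MLM masking to a list of WordPiece tokens."""
    labels = [None] * len(tokens)              # None = token is not predicted
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:        # select ~15% of tokens
            labels[i] = tok                    # the model must predict the original
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                tokens[i] = "[MASK]"
            elif r < 0.9:                      # 10%: replace with a random token
                tokens[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return tokens, labels
```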
Why is MLM best?
NSP: Next Sentence Prediction
Training method:
From unlabelled data, we take an input sequence A. 50% of the time, the sentence that actually follows A is used as B; the remaining 50% of the time, we randomly pick any sequence as B.
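A minimal sketch of how such NSP training pairs can be built (assuming `sentences` is an ordered list of sentences from the unlabelled corpus):

```python
import random

def make_nsp_pair(sentences, i):
    """Return (sentence A, sentence B, is_next_label) for NSP training."""
    a = sentences[i]
    if random.random() < 0.5 and i + 1 < len(sentences):
        b, is_next = sentences[i + 1], 1          # B is the actual next sentence
    else:
        b, is_next = random.choice(sentences), 0  # B is a randomly picked sentence
    return a, b, is_next
```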
BERT Architecture
BERT is a multi-layer bidirectional Transformer encoder.
There are two models introduced in the paper.
● BERT-base – 12 layers (Transformer blocks), 12 attention heads, and 110 million parameters.
● BERT-large – 24 layers, 16 attention heads, and 340 million parameters.
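These sizes can be verified roughly as follows (a sketch assuming the HuggingFace transformers library):

```python
from transformers import BertModel

for name in ("bert-base-uncased", "bert-large-uncased"):
    model = BertModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {model.config.num_hidden_layers} layers, "
          f"{model.config.num_attention_heads} heads, ~{n_params / 1e6:.0f}M parameters")
```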
BERT Pretraining and Fine Tuning Architecture
An illustration of how the BERT pretraining architecture remains the same and only the fine-tuning layer changes for different NLP tasks.
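For illustration (assuming the HuggingFace transformers API; not shown in the original slide), the same pretrained encoder is reused for every task and only a small task-specific head differs:

```python
from transformers import (BertForQuestionAnswering,
                          BertForSequenceClassification,
                          BertForTokenClassification)

# Text classification: a single linear layer on top of the [CLS] representation.
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Question answering: start/end span prediction heads on top of each token.
qa_model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Named entity recognition: a per-token classification head.
ner_model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)
```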
Related Work
ELMo: a pretrained, feature-based model (only the final layer weights change) for NLP tasks. Difference: ELMo uses LSTMs, while BERT uses the Transformer, an attention-based model with positional encodings to represent word positions. ELMo also fell short because it was word-based and could not handle OOV words.
OpenAI GPT: uses a left-to-right architecture where every token can only attend to previous tokens in the self-attention layers of the Transformer. It fell short because it could not capture proper contextual knowledge from both directions.
How BERT Outperforms others?
In the paper Visualizing and Measuring the Geometry of BERT, the authors show how BERT holds semantic and syntactic features of a text.
The paper aims to show that the attention matrices contain grammatical representations. Turning to semantics, using visualizations of the activations created by different pieces of text, it shows suggestive evidence that BERT distinguishes word senses at a very fine level.
BERT’s internals consist of two parts. First, an initial embedding for each token is created by combining a
pre-trained word piece embedding with position and segment information. Next, this initial sequence of
embeddings is run through multiple transformer layers, producing a new sequence of context embeddings
at each step. Implicit in each transformer layer is a set of attention matrices, one for each attention head,
each of which contains a scalar value for each ordered pair (token_i, token_j). [Slide 11]
Experiment for Syntax Representation
The experiment uses a corpus derived from the Penn Treebank (3.1M dependency relations). Grammatical dependencies were extracted with the PyStanfordDependencies library; BERT-base was then run on each sentence to obtain the model-wide attention matrix. [Slide 9]
On this dataset, with a 30% train/test split, we achieve an accuracy of 85.8% on the binary probe and 71.9% on the multiclass probe.
Result: the attention mechanism contains syntactic features.
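A sketch of how the model-wide attention vector for the probe can be assembled (assuming the HuggingFace transformers library; the probe itself is then just a linear classifier over these 144-dimensional vectors):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

def model_wide_attention(sentence):
    """For every token pair (i, j), concatenate a_ij over all heads and layers."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        attentions = model(**inputs).attentions      # 12 tensors of (1, 12, seq, seq)
    stacked = torch.cat(attentions, dim=1)[0]        # (144, seq, seq) for BERT-base
    return stacked.permute(1, 2, 0).numpy()          # (seq, seq, 144)

# A linear probe (e.g. logistic regression) is trained on these 144-dimensional
# vectors to predict whether token i and token j are linked by a dependency
# relation (binary probe) or which relation links them (multiclass probe).
```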
Geometry of Word Sense (Experiment)
On Wikipedia articles containing a query word, we applied a nearest-neighbour classifier where each neighbour is the centroid of a given word sense's BERT-base embeddings in the training data.
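A sketch of this nearest-centroid sense classifier (`train_embeddings` is a hypothetical dict mapping each labelled word sense to the BERT-base embeddings of the query word collected from training sentences):

```python
import numpy as np

def build_centroids(train_embeddings):
    """Compute one centroid per word sense from its training embeddings."""
    return {sense: np.mean(vectors, axis=0) for sense, vectors in train_embeddings.items()}

def classify_sense(embedding, centroids):
    """Assign the sense whose centroid is nearest to the query-word embedding."""
    return min(centroids, key=lambda sense: np.linalg.norm(embedding - centroids[sense]))
```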
Conclusion
BERT is undoubtedly a breakthrough in the use of Machine Learning for Natural Language
Processing. The fact that it’s approachable and allows fast fine-tuning will likely allow a wide
range of practical applications in the future.
We tested BERT on our SupportLen data for text classification.
SupportLen has a priority column in which we manually label whether a customer email is urgent or not.
On this dataset we used BERT-base-uncased.
Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
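The field names above match HuggingFace's BertConfig, so the same configuration can be reproduced in code (a sketch; in practice we start from the pretrained checkpoint rather than a randomly initialised model):

```python
from transformers import BertConfig, BertForSequenceClassification

config = BertConfig(
    attention_probs_dropout_prob=0.1, hidden_act="gelu", hidden_dropout_prob=0.1,
    hidden_size=768, initializer_range=0.02, intermediate_size=3072,
    max_position_embeddings=512, num_attention_heads=12, num_hidden_layers=12,
    type_vocab_size=2, vocab_size=28996,
    num_labels=2,                                   # urgent / not urgent
)
model = BertForSequenceClassification(config)       # randomly initialised with this config

# Starting from the pretrained weights instead:
# model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
```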
Some FAQs on BERT
1. WHAT IS THE MAXIMUM SEQUENCE LENGTH OF THE INPUT?
512 tokens
2. OPTIMAL VALUES OF THE HYPERPARAMETERS USED IN FINE-TUNING
● Dropout – 0.1
● Batch Size – 16, 32
● Learning Rate (Adam) – 5e-5, 3e-5, 2e-5
● Number of epochs – 3, 4
3. HOW MANY LAYERS ARE FROZEN IN THE FINE-TUNING STEP?
No layers are frozen during fine-tuning. All the pre-trained layers along with the task-specific parameters are trained
simultaneously.
4. HOW LONG DID IT TAKE GOOGLE TO PRETRAIN BERT?
Google took 4 days to pretrain BERT with 16 TPUs.
ULMFiT: Universal Language Model Fine-tuning for Text Classification
The ULMFiT paper added an intermediate step in which the model is fine-tuned on text from the same domain as the target task. Combining this step with the pretrained BERT model for the classification task yields better accuracy than using the BERT model alone. We too fine-tune BERT on our custom data; it took around 50 minutes per epoch on a Tesla K80 12 GB GPU (P2).
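A minimal fine-tuning sketch under the hyperparameters listed in the FAQ above (learning rate 2e-5, batch size 32, 3 epochs); the two example emails and their labels are placeholders for our SupportLen data, and the code assumes a recent version of the HuggingFace transformers library:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

train_texts = ["Server is down, please help immediately!", "Thanks for the update."]  # placeholder data
train_labels = [1, 0]                                                                 # 1 = urgent

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(train_texts, truncation=True, padding=True, max_length=512, return_tensors="pt")
dataset = torch.utils.data.TensorDataset(enc["input_ids"], enc["attention_mask"],
                                         torch.tensor(train_labels))
loader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)   # no layers are frozen
model.train()
for epoch in range(3):
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()
        optimizer.step()
```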
Future work and use cases that BERT can solve for us
+ Email Prioritization
+ Sentiment Analysis of Reviews
+ Review Tagging
+ Question-Answering for ChatBot & Community
+ Similar Products problem (we currently use cosine similarity on description text).
Testing of the ULMFiT experiment remains to be done, by fine-tuning BERT on our domain dataset.


Editor's Notes

  1. Language discussion
  2. Examples
  3. BERT's power; what is BERT
  4. Data used and pretrain vs finetune
  5. Talk about: "bank of the river" vs. "bank account" (the same word with different meanings depending on context)
  6. As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
  7. Word Piece Tokenizer: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
  8. Attention Visual:- https://colab.research.google.com/drive/1Nlhh2vwlQdKleNMqpmLDBsAwrv_7NnrB
  9. Understanding the Attention Patterns: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77
  10. A positional embedding is also added to each token to indicate its position in the sequence.
  11. The advantage of this method is that the Transformer does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every token.
  12. NSP helps in Q&A and in understanding the relation between sentences.
  13. State of the Art: the most recent stage in the development of a product, incorporating the newest ideas and features. Parse Tree Embedding Concept- mathematical proof
  14. Miscellaneous: Matthews Correlation Coefficient: https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
  15. Miscellaneous: What is a TPU? https://www.google.com/search?q=tpu+full+form&rlz=1C5CHFA_enIN835IN835&oq=TPU+full+form&aqs=chrome.0.0l6.3501j0j9&sourceid=chrome&ie=UTF-8 Which Activation is used in BERT? https://datascience.stackexchange.com/questions/49522/what-is-gelu-activation Gaussian Error Linear Unit
  16. Demo of Google Collab: Sentiment on Movie Reviews: https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb?authuser=1#scrollTo=VG3lQz_j2BtD Sentence Pairing and Sentence Classification: https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb?authuser=1#scrollTo=0yamCRHcV-nQ BERT FineTune on Data. https://github.com/huggingface/pytorch-pretrained-BERT/tree/master/examples/lm_finetuning