The document summarizes a presentation about state-of-the-art natural language processing (NLP) techniques. It discusses how transformer networks, combined with transfer learning from large pre-trained models, have achieved state-of-the-art results on many NLP tasks. It also describes how Hugging Face's Transformers and Tokenizers libraries provide tokenization and pre-trained transformer models through a simple interface.
3. Agenda
Lysandre DEBUT
Machine Learning Engineer @ Hugging Face,
maintainer and core contributor of
huggingface/transformers
Anthony MOI
Technical Lead @ Hugging Face, maintainer and
core contributor of huggingface/tokenizers
Some slides were adapted from a previous
Hugging Face talk by Thomas Wolf, Victor Sanh, and
Morgan Funtowicz
9. Subjects we’ll dive into today
● NLP: Transfer learning, transformer networks
● Tokenizers: from text to tokens
● Transformers: from tokens to predictions
13. Sequential transfer learning
Learn on one task/dataset, transfer to another task/dataset
Pre-training → Adaptation
Pre-training (the computationally intensive step) yields a general-purpose model:
word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT, ...
Adaptation transfers it to a downstream task:
Text classification, word labeling, question answering, ...
15. Transformer Networks
● Very large networks
● Can be trained on very big datasets
● Better than previous architectures at maintaining
long-term dependencies
● Require a lot of compute to be trained
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. In NAACL, 2019.
18. Model Sharing
Reduced compute, cost, energy footprint
From 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT, by Victor Sanh
19. A deeper look at the inner mechanisms
Pipeline, pre-training, fine-tuning
20. Adaptation
Transfer Learning pipeline in NLP
From text to tokens, from tokens to prediction
Pipeline: Tokenizer → Pre-trained model → Head

Input text: Jim Henson was a puppeteer
Tokenization: Jim | Henson | was | a | puppet | ##eer
Convert to vocabulary indices: 11067 | 5567 | 245 | 120 | 7756 | 9908
Pre-trained model: one contextual embedding vector per token
Task-specific model (head): True 0.7886 | False 0.223
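In code, this pipeline maps almost one-to-one onto the transformers API. A minimal sketch, assuming the public distilbert-base-uncased-finetuned-sst-2-english checkpoint as the task-specific model (the slides do not name one):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed public checkpoint; any sequence-classification model works here.
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Tokenization + conversion to vocabulary indices
inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")

# Pre-trained model + task-specific head -> one score per class
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # class probabilities
```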
21. Pre-training
The rise of language modeling pre-training
Many currently successful pre-training approaches are based on language modeling: learning to predict Pθ(text) or Pθ(text | other text).
Advantages:
- Doesn’t require human annotation: self-supervised
- Many languages have enough text to learn high-capacity models
- Versatile: can be used to learn both sentence and word representations with a variety of objective functions
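A minimal sketch of the language modeling objective, assuming PyTorch and the public distilgpt2 checkpoint (not named in the slides): passing the input ids as labels asks the model for the average negative log-likelihood of each token given the preceding ones.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tokenizer("Jim Henson was a puppeteer", return_tensors="pt")

# Using the inputs as labels makes the model return the causal LM loss,
# i.e. -log Pθ(token | previous tokens), averaged over the sequence.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)
```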
25. Tokenization
Its role in the pipeline
- Convert input strings to a sequence of numbers
Jim Henson was a puppeteer
→ Jim | Henson | was | a | puppet | ##eer
→ 11067 | 5567 | 245 | 120 | 7756 | 9908
- Goal: find the smallest, most meaningful representation possible
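This is exactly what a pre-trained tokenizer does; a brief sketch with the public bert-base-cased checkpoint (the ids you get depend on that checkpoint's vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tokens = tokenizer.tokenize("Jim Henson was a puppeteer")
print(tokens)  # ['Jim', 'Henson', 'was', 'a', 'puppet', '##eer']

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)  # one vocabulary index per token
```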
27. Word-based
Word by word tokenization
Split on spaces: Let’s | do | tokenization!
Split on punctuation: Let | ‘s | do | tokenization | !
▪ Split on spaces, or following specific rules to obtain words
▪ What to do with punctuation?
▪ Requires large vocabularies: dog != dogs, run != running
▪ Out-of-vocabulary (aka <UNK>) tokens for unknown words
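A toy sketch of word-based tokenization; the regex, vocabulary, and <UNK> fallback below are illustrative assumptions, not any library's implementation:

```python
import re

# Hypothetical toy vocabulary; real word-level vocabularies need
# an entry for every surface form (dog, dogs, run, running, ...).
vocab = {"<UNK>": 0, "Let": 1, "'": 2, "s": 3, "do": 4, "tokenization": 5, "!": 6}

def word_tokenize(text):
    # Split on whitespace and peel punctuation off into separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def encode(text):
    # Unknown words all collapse onto the <UNK> index.
    return [vocab.get(token, vocab["<UNK>"]) for token in word_tokenize(text)]

print(word_tokenize("Let's do tokenization!"))  # ['Let', "'", 's', 'do', 'tokenization', '!']
print(encode("Let's do tokenization!"))         # [1, 2, 3, 4, 5, 6]
```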
28. Character
Character by character tokenization
▪ Split on characters individually
▪ Do we include spaces or not?
▪ Smaller vocabularies
▪ But tokens lack meaning: individual characters don’t necessarily mean anything on their own
▪ End up with a huge number of tokens for the model to process
L e t ' s   d o   t o k e n i z a t i o n !
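In code, character-level tokenization is just splitting the string; a trivial sketch:

```python
text = "Let's do tokenization!"

tokens = list(text)  # one token per character, spaces included
print(tokens[:6])    # ['L', 'e', 't', "'", 's', ' ']

ids = [ord(c) for c in text]  # e.g. map each character to its code point
print(len(ids))               # 22 tokens for a 3-word sentence
```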
29. Byte Pair Encoding
Welcome subword tokenization
▪ First introduced by Philip Gage in 1994, as a compression algorithm
▪ Applied to NLP by Rico Sennrich et al. in “Neural Machine Translation of Rare Words with Subword Units”. ACL 2016.
30. Byte Pair Encoding
Welcome subword tokenization
Initial alphabet: A B C ... a b c ... ? ! ...
▪ Start with a base vocabulary using Unicode characters seen in the data
▪ Most frequent pairs get merged to a new token:
1. T + h => Th
2. Th + e => The
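A compact sketch of that merge loop on a toy corpus (the corpus and number of merges are illustrative; the real algorithm from Sennrich et al. also records the learned merges for later use):

```python
from collections import Counter

# Toy corpus: each word is pre-split into characters, with a frequency.
corpus = {("T", "h", "e"): 5, ("t", "h", "e", "n"): 3, ("o", "t", "h", "e", "r"): 2}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])  # e.g. ("T", "h") -> "Th"
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(3):  # learn three merges
    pair = most_frequent_pair(corpus)
    print("merge:", pair)
    corpus = apply_merge(corpus, pair)
```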
32. And a lot more
So many algorithms...
▪ Byte-level BPE as used in GPT-2 (Alec Radford et al. OpenAI)
▪ WordPiece as used in BERT (Jacob Devlin et al. Google)
▪ SentencePiece (Unigram model) (Taku Kudo et al. Google)
33. Tokenizers
Why did we build it?
▪ Performance
▪ One API for all the different tokenizers
▪ Easy to share and reproduce your work
▪ Easy to use any tokenizer, and re-train it on a new language/dataset
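For instance, training a brand-new BPE tokenizer with huggingface/tokenizers takes a few lines; a sketch, where my_corpus.txt is a placeholder for your own text files:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder path

output = tokenizer.encode("Let's do tokenization!")
print(output.tokens, output.ids)
```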
36. The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Pre-tokenization: a set of rules to split the input
- Whitespace use
- Punctuation use
- Something else?
37. The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Actual tokenization algorithm:
- BPE
- Unigram
- Word level
38. The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Add special tokens: for example [CLS], [SEP] with BERT
▪ Truncate to match the maximum length of the model
▪ Pad all sequences in a batch to the same length
▪ ...
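With a tokenizer from the transformers library, these post-processing steps happen when you encode a batch; a sketch using the public bert-base-cased checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

batch = tokenizer(
    ["Jim Henson was a puppeteer", "Let's do tokenization!"],
    padding=True,      # pad every sequence in the batch to the same length
    truncation=True,   # truncate to the model's maximum length
    return_tensors="pt",
)
print(batch["input_ids"])       # [CLS] ... [SEP] added around each text
print(batch["attention_mask"])  # 0 marks the padding positions
```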
47. Transformers
An explosion of Transformer architectures
BERT
▪ WordPiece tokenization
▪ MLM & NSP
ALBERT
▪ SentencePiece tokenization
▪ MLM & SOP
▪ Repeating layers
GPT-2
▪ Byte-level BPE tokenization
▪ CLM
Same API
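That shared API means the same few lines load any of them; a sketch with the public checkpoint names:

```python
from transformers import AutoModel, AutoTokenizer

for checkpoint in ["bert-base-cased", "albert-base-v2", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    inputs = tokenizer("Same API for every architecture", return_tensors="pt")
    outputs = model(**inputs)
    # Each model exposes its final hidden states through the same interface.
    print(checkpoint, outputs.last_hidden_state.shape)
```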
48. Transformers
As flexible as possible
Runs and trains on:
▪ CPU
▪ GPU
▪ TPU
With optimizations:
▪ XLA
▪ TorchScript
▪ Half-precision
▪ Others
Device support covers all models; the optimizations currently target BERT & RoBERTa, with more to come!
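A sketch of what that flexibility looks like in PyTorch (a CUDA device is assumed where noted):

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

model = model.to("cuda")  # move to GPU (assumes a CUDA device is available)
model = model.half()      # half-precision (FP16) weights

# TorchScript: reload with torchscript=True so the model can be traced.
model = AutoModel.from_pretrained("bert-base-cased", torchscript=True)
dummy_input = torch.ones(1, 8, dtype=torch.long)
traced = torch.jit.trace(model, (dummy_input,))
```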
62. Transformers
Training models
Example scripts (TensorFlow & PyTorch)
- Named Entity Recognition
- Sequence Classification
- Question Answering
- Language modeling (fine-tuning & from scratch)
- Multiple Choice
Trains on TPU, CPU, GPU
Example scripts for PyTorch Lightning
63. Transformers
Just scratched the surface
The transformers library covers a lot more ground:
- ELECTRA
- Reformer
- Longformer
- Encoder-decoder architectures
- Translation & Summarization
64. Transformers + Tokenizers
The full pipeline?
Data → 🤗 nlp
Tokenization → Tokenizers
Prediction → Transformers
Metrics → 🤗 nlp
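A hedged end-to-end sketch tying the pieces together (🤗 nlp was later renamed `datasets`; the dataset, metric, and checkpoint names below are the public ones):

```python
import nlp  # the 🤗 nlp library, later renamed `datasets`
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Data and metrics from 🤗 nlp
dataset = nlp.load_dataset("glue", "sst2", split="validation[:8]")
metric = nlp.load_metric("glue", "sst2")

# Tokenization and prediction from Tokenizers + Transformers
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

inputs = tokenizer(dataset["sentence"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    predictions = model(**inputs).logits.argmax(dim=-1)

metric.add_batch(predictions=predictions, references=dataset["label"])
print(metric.compute())  # e.g. {'accuracy': ...}
```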