SlideShare a Scribd company logo
1 of 46
Download to read offline
PaLM: Scaling Language
Modeling with Pathways
Chowdhery, Aakanksha, et al. arXiv preprint arXiv:2204.02311
2023. 02. 19
허정원, 조해창, 박산희
1
Contents
•
•
•
•
•
2
1. Introduction
3
Gopher
LaMDA
GaLM MT NLG
175B 1.2T 137B 280B 530B
1. Introduction
(1)
(2)
(3)
(4)
4
540B
780B Tokens
Achieved through the use of Pathways
PaLM
The key takeaways
•
•
•
•
•
•
5
Model Architecture
6
2. Model Architecture
•
•
•
•
•
•
•
7
8
9
•
SwiGLU = xW·sigmoid(βxW) @ xV
An improvement in quality in compute- equivalent experiments
10
•
The parallel formulation results in roughly 15% faster
training speed at large scales, since the MLP and
Attention input matrix multiplications can be fused.
11
• Multi-Query Attention
12
• RoPE Embeddings
𝑓! 𝑥" ≔ 𝑊
!𝑥"
𝑓# 𝑥$ + 𝑛 ≔ 𝑊#(𝑥$ + (
𝑝%
#
)
𝑓& 𝑥$ + 𝑛 ≔ 𝑊
&(𝑥$ + (
𝑝%
&
)
13
• Vocabulary
A SentencePiece vocabulary with 256k tokens, which was chosen
to support the large number of languages in the training corpus
without excess tokenization.
The vocabulary is completely lossless and reversible.
2. Model Architecture
•
•
• cost savings
•
•
•
•
14
2.1 Model Scale Hyperparameters
15
Model Architecture
16
Training
17
3 Training Dataset
18
4 Training Infrastructure
19
4.1 Training Efficiency
20
5 Training Setup
•
•
•
•
•
•
•
•
21
5 Training Setup
•
22
5 Training Setup
•
23
5 Training Setup
•
24
5 Training Setup
•
25
5 Training Setup
•
26
5 Training Setup
•
•
•
27
5.1 Training Instability
28
Training
29
Evaluation
30
31
6.1 English NLP tasks
6.2 BIG-bench
32
6.3 Reasoning
33
6.4 Code Tasks
34
6.5 Translation
•
•
•
35
6.6 Multilingual Natural Language Generation
•
•
•
• 36
6.7 Multilingual Question Answering
37
6.8 Analysis
38
Discussions
39
7 Memorization
•
•
•
40
8 Dataset Contamination
41
9 Exploring Explanations
•
•
•
42
10 Representational
Bias Analysis
43
13 Open Questions in Scaling
44
14 Conclusion
•
•
•
45
Q & A
46

More Related Content

What's hot

Neo4j - 5 cool graph examples
Neo4j - 5 cool graph examplesNeo4j - 5 cool graph examples
Neo4j - 5 cool graph examples
Peter Neubauer
 

What's hot (20)

Attention Is All You Need
Attention Is All You NeedAttention Is All You Need
Attention Is All You Need
 
Neo4j graph database
Neo4j graph databaseNeo4j graph database
Neo4j graph database
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial Deep Learning for Recommender Systems RecSys2017 Tutorial
Deep Learning for Recommender Systems RecSys2017 Tutorial
 
Graph neural networks overview
Graph neural networks overviewGraph neural networks overview
Graph neural networks overview
 
Image captioning
Image captioningImage captioning
Image captioning
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019
 
Personalizing "The Netflix Experience" with Deep Learning
Personalizing "The Netflix Experience" with Deep LearningPersonalizing "The Netflix Experience" with Deep Learning
Personalizing "The Netflix Experience" with Deep Learning
 
Neural Architectures for Named Entity Recognition
Neural Architectures for Named Entity RecognitionNeural Architectures for Named Entity Recognition
Neural Architectures for Named Entity Recognition
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Neo4j - 5 cool graph examples
Neo4j - 5 cool graph examplesNeo4j - 5 cool graph examples
Neo4j - 5 cool graph examples
 
Diffusion Deformable Model for 4D Temporal Medical Image Generation
Diffusion Deformable Model for 4D Temporal Medical Image GenerationDiffusion Deformable Model for 4D Temporal Medical Image Generation
Diffusion Deformable Model for 4D Temporal Medical Image Generation
 
머신러닝의 자연어 처리기술(I)
머신러닝의 자연어 처리기술(I)머신러닝의 자연어 처리기술(I)
머신러닝의 자연어 처리기술(I)
 
YOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection reviewYOLOv4: optimal speed and accuracy of object detection review
YOLOv4: optimal speed and accuracy of object detection review
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)Deep Learning for Video: Action Recognition (UPC 2018)
Deep Learning for Video: Action Recognition (UPC 2018)
 
Few shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learningFew shot learning/ one shot learning/ machine learning
Few shot learning/ one shot learning/ machine learning
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
 
Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 

More from taeseon ryu

VoxelNet
VoxelNetVoxelNet
VoxelNet
taeseon ryu
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
taeseon ryu
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
taeseon ryu
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
taeseon ryu
 

More from taeseon ryu (20)

VoxelNet
VoxelNetVoxelNet
VoxelNet
 
OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...OpineSum Entailment-based self-training for abstractive opinion summarization...
OpineSum Entailment-based self-training for abstractive opinion summarization...
 
3D Gaussian Splatting
3D Gaussian Splatting3D Gaussian Splatting
3D Gaussian Splatting
 
JetsonTX2 Python
 JetsonTX2 Python  JetsonTX2 Python
JetsonTX2 Python
 
Hyperbolic Image Embedding.pptx
Hyperbolic  Image Embedding.pptxHyperbolic  Image Embedding.pptx
Hyperbolic Image Embedding.pptx
 
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
MCSE_Multimodal Contrastive Learning of Sentence Embeddings_변현정
 
YOLO V6
YOLO V6YOLO V6
YOLO V6
 
Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories Dataset Distillation by Matching Training Trajectories
Dataset Distillation by Matching Training Trajectories
 
RL_UpsideDown
RL_UpsideDownRL_UpsideDown
RL_UpsideDown
 
Packed Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation ExtractionPacked Levitated Marker for Entity and Relation Extraction
Packed Levitated Marker for Entity and Relation Extraction
 
MOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement LearningMOReL: Model-Based Offline Reinforcement Learning
MOReL: Model-Based Offline Reinforcement Learning
 
Scaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language ModelsScaling Instruction-Finetuned Language Models
Scaling Instruction-Finetuned Language Models
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 
mPLUG
mPLUGmPLUG
mPLUG
 
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdfvariBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
variBAD, A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning.pdf
 
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdfReinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
Reinforced Genetic Algorithm Learning For Optimizing Computation Graphs.pdf
 
The Forward-Forward Algorithm
The Forward-Forward AlgorithmThe Forward-Forward Algorithm
The Forward-Forward Algorithm
 
Towards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural NetworksTowards Robust and Reproducible Active Learning using Neural Networks
Towards Robust and Reproducible Active Learning using Neural Networks
 
BRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive SummarizationBRIO: Bringing Order to Abstractive Summarization
BRIO: Bringing Order to Abstractive Summarization
 
ProximalPolicyOptimization
ProximalPolicyOptimizationProximalPolicyOptimization
ProximalPolicyOptimization
 

PaLM Scaling Language Modeling with Pathways - 230219 (1).pdf