Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks
Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, Sanjeev Khudanpur
Center for Language and Speech Processing,
Human Language Technology Center of Excellence,
Johns Hopkins University, Baltimore, MD, USA
University of Chinese Academy of Sciences, Beijing, China
2020/01 陳品媛
Outline
■ Introduction
■ Training with semi-orthogonal constraint
■ Factorized model topologies
■ Experimental setup
■ Experiments
■ Conclusion
Introduction
■ Automatic Speech Recognition - Acoustic modeling
■ The authors propose a factored form of TDNN whose layers are compressed via SVD, together with one additional idea: skip connections.
Introduction - TDNN
■ Time Delay Neural Networks (TDNNs) can be viewed as one-dimensional Convolutional Neural Networks (1-d CNNs).
■ The temporal context width increases as we go to upper layers (see the sketch below).
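As a quick illustration (the splicing offsets below are hypothetical, not the paper's exact configuration), the total context of a stack of 1-d convolutions is the sum of each layer's offsets:

```python
# Sketch: how temporal context grows with depth in a TDNN.
# Each layer splices frames at the given time offsets relative to its
# input; offsets compose additively across layers.
def total_context(layers):
    """layers: list of per-layer offset lists, e.g. [[-1, 0, 1], [-3, 0, 3]]."""
    left = sum(min(offsets) for offsets in layers)
    right = sum(max(offsets) for offsets in layers)
    return left, right

# Three layers: two with offsets {-1, 0, 1}, one with {-3, 0, 3}.
print(total_context([[-1, 0, 1], [-1, 0, 1], [-3, 0, 3]]))  # (-5, 5)
```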
Introduction - SVD
■ Why SVD in DNN?
• DNN has huge computation cost → reduce the model size
• A large portion of weight parameters in DNN are very small.
• Fast computation and small memory usage can be obtained for
runtime evaluation.
[Figure: cumulative percentage of total singular values vs. number of singular values (reference marks at 15% and 40%).]
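A minimal NumPy sketch (sizes and rank are illustrative) of compressing a single trained weight matrix with a truncated SVD:

```python
import numpy as np

# Minimal sketch: compress a trained weight matrix M with a truncated SVD.
# M (m x n) is replaced by A (m x r) times B (r x n), keeping the top-r
# singular values. Sizes and rank here are arbitrary illustrations.
rng = np.random.default_rng(0)
M = rng.standard_normal((700, 2100))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = 250
A = U[:, :r] * s[:r]   # absorb the singular values into A
B = Vt[:r, :]          # B has orthonormal rows (semi-orthogonal)

approx = A @ B
rel_err = np.linalg.norm(M - approx) / np.linalg.norm(M)
print(f"rank {r}: relative error {rel_err:.3f}")
print("params:", M.size, "->", A.size + B.size)
```

Note that B comes out with orthonormal rows (BBᵀ = I); this is exactly the semi-orthogonal property that the training procedure in the next section maintains directly.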
Training with semi-orthogonal constraint
■ Basic case - Update equation
After every few (specifically, every 4) time steps of SGD, we apply an efficient update that brings the parameter matrix M closer to being a semi-orthogonal matrix.
Let P ≡ MMᵀ (M = parameter matrix).
We want to force P = I (for a semi-orthogonal M with orthonormal rows, MMᵀ = I).
Define Q ≡ P − I and minimize
f = tr(QQᵀ),
i.e. the sum of squared elements of Q.
Training with semi-orthogonal constraint
■ Basic case - Update equation (cont.)
With P ≡ MMᵀ, Q ≡ P − I, f = tr(QQᵀ):
• Matrix-calculus convention: the derivative of a scalar w.r.t. a matrix is not transposed w.r.t. that matrix, so
∂f/∂Q = 2Q,  ∂f/∂P = 2Q,  ∂f/∂M = 4QM
• Gradient step with learning rate ν:
M ← M − 4νQM
• ν = 1/8 leads to quadratic convergence, giving
M ← M − ½(MMᵀ − I)M    (1)
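A minimal NumPy sketch of update (1); iterating it alone (sizes are illustrative) shows the fast convergence toward MMᵀ = I:

```python
import numpy as np

# Minimal sketch of update (1): M <- M - 1/2 (M M^T - I) M.
# In training this step runs once every ~4 SGD steps; here we iterate
# it alone to show the (quadratic) convergence toward MM^T = I.
rng = np.random.default_rng(0)
rows, cols = 250, 2100
M = rng.standard_normal((rows, cols)) / np.sqrt(cols)  # Glorot-style init

I = np.eye(rows)
for step in range(5):
    Q = M @ M.T - I
    print(f"step {step}: ||MM^T - I||_F = {np.linalg.norm(Q):.2e}")
    M -= 0.5 * Q @ M

print("final deviation:", np.linalg.norm(M @ M.T - I))
```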
Training with semi-orthogonal constraint
■ Basic case - Weight Initialization
• Update (1), M ← M − ½(MMᵀ − I)M, can diverge if M is too far from being orthonormal to start with, but this does not happen in practice when M uses Glorot-style (Xavier) initialization (σ = 1/√#cols).
Reference: Understanding the difficulty of training deep feedforward neural networks (Xavier Glorot and Yoshua Bengio)
Training with semi-orthogonal constraint
■ Scale case
• Suppose we want M to be a scaled version of a semi-orthogonal matrix, i.e. (1/α)M is semi-orthogonal for some specified constant α.
• Substituting (1/α)M for M in (1) gives:
M ← M − (1/(2α²))(MMᵀ − α²I)M    (2)
Training with semi-orthogonal constraint
■ Floating case
• Goal: control how fast the parameters of the various layers change.
• Apply l2 regularization to the constrained layers.
• Compute the scale α from the current parameters and apply it in (2):
M ← M − (1/(2α²))(MMᵀ − α²I)M    (2)
With P ≡ MMᵀ:
α = √(tr(PPᵀ)/tr(P))    (3)
Training with semi-orthogonal constraint
■ Floating case
Why α = √(tr(PPᵀ)/tr(P))?
M is (up to a scale) a matrix with orthonormal rows. We pick the scale α that makes the update to M orthogonal to M itself (viewed as a vector): writing the update as M := M + X, we want tr(MXᵀ) = 0.
From (2), X is proportional to (MMᵀ − α²I)M; ignoring the constant −1/(2α²) and using P ≡ MMᵀ:
tr(MXᵀ) ∝ tr(MMᵀ(MMᵀ − α²I)) = tr(PPᵀ − α²P) = 0
so α² = tr(PPᵀ)/tr(P), i.e.
α = √(tr(PPᵀ)/tr(P))    (3)
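A minimal NumPy sketch of the floating case: estimate α with (3), then apply update (2). Shapes are illustrative; the actual implementation is in Kaldi's nnet3 code (see the nnet-utils.cc link in the speaker notes).

```python
import numpy as np

# Minimal sketch of the "floating" constraint: compute the scale alpha
# from equation (3), then apply update (2). Shapes are illustrative.
def constrain_semi_orthogonal_floating(M):
    P = M @ M.T
    alpha2 = np.trace(P @ P.T) / np.trace(P)   # equation (3), squared
    Q = P - alpha2 * np.eye(M.shape[0])
    return M - (1.0 / (2.0 * alpha2)) * Q @ M  # equation (2)

rng = np.random.default_rng(0)
M = 3.0 * rng.standard_normal((256, 1280)) / np.sqrt(1280)  # scaled init
for _ in range(4):
    M = constrain_semi_orthogonal_floating(M)

P = M @ M.T
print("rows are orthogonal up to a scale; scale^2 ≈", np.trace(P) / P.shape[0])
```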
Factorized model topologies
1. Basic factorization
M = AB, with B constrained to be semi-orthogonal.
Example: M is 700 x 2100; A is 700 x 250 and B is 250 x 2100.
We call 250 the linear bottleneck dimension (see the parameter-count sketch after this list).
2. Tuning the dimensions
Tuning on the 300-hour setup, we ended up using larger matrix sizes, with a hidden-layer dimension of 1280 or 1536, a linear bottleneck dimension of 256, and more hidden layers.
3. Factorizing the convolution
The example in item 1 uses a constrained 3x1 convolution followed by a 1x1 convolution. We found better results when using a constrained 2x1 convolution followed by a 2x1 convolution.
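To make the savings concrete, a quick parameter count for the example in item 1 (a sketch; biases ignored):

```python
# Parameter count for the basic factorization example (biases ignored).
m, n, r = 700, 2100, 250       # full size and linear bottleneck dim
full = m * n                   # unfactored M: 700 x 2100
factored = m * r + r * n       # A: 700 x 250 plus B: 250 x 2100
print(full, "->", factored, f"({factored / full:.0%} of the original)")
# 1470000 -> 700000 (48% of the original)
```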
Factorized model topologies
4. 3-stage splicing
A constrained 2x1 convolution to dimension 256, followed by another constrained 2x1 convolution to dimension 256, followed by a 2x1 convolution back to the hidden-layer dimension (e.g. 1280). Even better than the two-stage scheme in item 3.
5. Dropout
The dropout mask is shared across time (see the sketch after this list).
Dropout schedule for α (dropout strength): 0 → 0.5 → 0.
Continuous dropout scale: drawn from a uniform distribution on [1 − 2α, 1 + 2α].
6. Factorizing the final layer
Even on very small datasets where factorizing the TDNN layers was not helpful, factorizing the final layer was helpful.
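A minimal NumPy sketch of the time-shared continuous dropout in item 5, using the batch# × dim × seq_len layout shown on the appendix slide:

```python
import numpy as np

# Minimal sketch: continuous dropout with the mask shared across time.
# One scale per (batch, dim) is drawn from U[1 - 2a, 1 + 2a] and applied
# to every time step, instead of an independent binary mask per frame.
def shared_time_dropout(x, alpha, rng):
    # x: (batch, dim, seq_len); alpha: current dropout strength
    batch, dim, _ = x.shape
    scale = rng.uniform(1 - 2 * alpha, 1 + 2 * alpha, size=(batch, dim, 1))
    return x * scale           # broadcasts across the time axis

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1280, 50))
y = shared_time_dropout(x, alpha=0.5, rng=rng)
print(y.shape)  # (8, 1280, 50)
```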
Factorized model topologies
7. Skip connections
• Some layers receive as input not only the output of the previous layer but also selected earlier layers (up to 3), which are appended to the previous layer's output.
• This helps with the vanishing-gradient problem (see the sketch below).
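A minimal sketch (my illustration; dimensions are arbitrary) of the append-style skip connection: earlier layer outputs are concatenated with the previous layer's output along the feature dimension.

```python
import numpy as np

# Minimal sketch: a layer's input is the previous layer's output with up
# to 3 selected earlier layers appended along the feature dimension.
def layer_input(prev, skips):
    # prev: (dim, T); skips: list of earlier outputs, each (dim_i, T)
    return np.concatenate([prev] + skips, axis=0)

T = 50
h1, h2, h3 = (np.ones((256, T)) * i for i in (1, 2, 3))
x4 = layer_input(h3, [h1, h2])  # input to layer 4
print(x4.shape)                 # (768, 50)
```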
Experiments
• Experimental setup
1. Basic factorization, evaluated on three setups:
• Switchboard: 300 hours
• Fisher+Switchboard: 2000 hours
• MATERIAL: two low-resource languages, Swahili and Tagalog, with 80 hours each
Experiments
2. Comparing model types
Conclusions
1. Factorized TDNN (TDNN-F): an effective way to train networks with parameter matrices represented as the product of two or more smaller matrices, with all but one of the factors constrained to be semi-orthogonal.
2. Skip connections help with the vanishing-gradient problem.
3. A dropout mask that is shared across time.
4. Better results, and faster decoding.
Appendix
Factorizing the convolution - time offsets per factor:
• time-stride = 1: the 1024→128 factor uses time-offsets -1, 0; the 128→1024 factor uses time-offsets 0, 1
• time-stride = 3: the 1024→128 factor uses time-offsets -3, 0; the 128→1024 factor uses time-offsets 0, 3
• time-stride = 0: the 1024→128 factor uses time-offset 0; the 128→1024 factor uses time-offset 0
[Diagram: TDNN-F structure, showing time-shared dropout and a weighted sum of the layer's input and output.]
Factorized model topologies (cont.)
5. Dropout
The dropout mask is shared across time.
Dropout schedule for α (dropout strength): 0 → 0.5 → 0.
Continuous dropout scale: drawn from a uniform distribution on [1 − 2α, 1 + 2α].
[Diagram: dropout mask over a batch# × dim × seq_len tensor, shared along the seq_len axis.]
6. Factorizing the final layer
Even on very small datasets where factorizing the TDNN layers was not helpful, factorizing the final layer was helpful.
Speaker notes

  1. 2018 Interspeech.
  2. Subsampling: the contexts seen at adjacent time steps overlap heavily, so we can subsample, keeping only some of the connections; this approximates the original model while greatly reducing its computation. https://www.twblogs.net/a/5c778577bd9eee3399183f67
  3. Eigendecomposition of a matrix: https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix. A low-rank approximation is obtained by discarding the small, non-zero eigenvalues.
  4. Σ is a diagonal matrix with A's singular values on the diagonal, in decreasing order. The m columns of U and the n columns of V are called the left-singular and right-singular vectors of A, respectively.
  5. Previous work applied SVD to an already-trained model; this paper trains directly with the factorized architecture, hence the discussion of how the update is performed.
  6. Product of independent variables; implemented in the Caffe library, not described in the paper. https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
  7. Because we use batchnorm and because the ReLU nonlinearity is scale invariant, l2 does not have a true regularization effect when applied to the hidden layers; but it reduces the scale of the parameter matrix, which makes it learn faster.
  8. tr(MXᵀ) = 0 is the definition of orthogonality. https://github.com/kaldi-asr/kaldi/blob/7b762b1b32140cbf8fbf4c72b713b4bd18c71104/src/nnet3/nnet-utils.cc#L1009
  9. Re item 2: the current implementation uses 1280×128.
  10. In the current Kaldi implementation, the "3-stage splicing" aspect and the skip connections were taken out. https://groups.google.com/forum/#!topic/kaldi-help/gBinGgj6Xy4
  11. Eval2000: the full HUB5'00 evaluation set (also known as Eval2000) and its "switchboard" subset. RT03: test set (LDC2007S10).
  12. In order: 1×1, 2×1, 3×1.