Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks
Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, Sanjeev Khudanpur
Center for Language and Speech Processing,
Human Language Technology Center of Excellence,
Johns Hopkins University, Baltimore, MD, USA
University of Chinese Academy of Sciences, Beijing, China
2020/01 陳品媛
Outline
■ Introduction
■ Training with semi-orthogonal constraint
■ Factorized model topologies
■ Experimental setup
■ Experiments
■ Conclusion
Introduction
■ Automatic Speech Recognition - Acoustic modeling
■ The authors propose a factored form of TDNN whose layers are compressed via SVD, together with one additional idea: skip connections.
Introduction - TDNN
■ Time Delay Neural Networks (TDNNs) can be viewed as one-dimensional Convolutional Neural Networks (1-d CNNs).
■ The temporal context width increases as we go to upper layers (see the sketch below).
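As a quick illustration (the splicing offsets below are hypothetical, not the paper's exact configuration), the total context of a stack of 1-d convolutions is the sum of each layer's offsets:

```python
# Sketch: how temporal context grows with depth in a TDNN.
# Each layer splices frames at the given time offsets relative to its
# input; offsets compose additively across layers.
def total_context(layers):
    """layers: list of per-layer offset lists, e.g. [[-1, 0, 1], [-3, 0, 3]]."""
    left = sum(min(offsets) for offsets in layers)
    right = sum(max(offsets) for offsets in layers)
    return left, right

# Three layers: two with offsets {-1, 0, 1}, one with {-3, 0, 3}.
print(total_context([[-1, 0, 1], [-1, 0, 1], [-3, 0, 3]]))  # (-5, 5)
```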
Introduction - SVD
■ Why SVD in DNN?
• DNN has huge computation cost → reduce the model size
• A large portion of weight parameters in DNN are very small.
• Fast computation and small memory usage can be obtained for
runtime evaluation.
[Figure: cumulative percentage of total singular values vs. number of singular values (reference marks at 15% and 40%).]
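A minimal NumPy sketch (sizes and rank are illustrative) of compressing a single trained weight matrix with a truncated SVD:

```python
import numpy as np

# Minimal sketch: compress a trained weight matrix M with a truncated SVD.
# M (m x n) is replaced by A (m x r) times B (r x n), keeping the top-r
# singular values. Sizes and rank here are arbitrary illustrations.
rng = np.random.default_rng(0)
M = rng.standard_normal((700, 2100))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = 250
A = U[:, :r] * s[:r]   # absorb the singular values into A
B = Vt[:r, :]          # B has orthonormal rows (semi-orthogonal)

approx = A @ B
rel_err = np.linalg.norm(M - approx) / np.linalg.norm(M)
print(f"rank {r}: relative error {rel_err:.3f}")
print("params:", M.size, "->", A.size + B.size)
```

Note that B comes out with orthonormal rows (BBᵀ = I); this is exactly the semi-orthogonal property that the training procedure in the next section maintains directly.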
Training with semi-orthogonal constraint
■ Basic case - Update equation
After every few (specifically, every 4) time steps of SGD, we apply an efficient update that brings the parameter matrix M closer to being a semi-orthogonal matrix.
Let P ≡ MMᵀ (M = parameter matrix).
We want to force P = I (for a semi-orthogonal M with orthonormal rows, MMᵀ = I).
Define Q ≡ P − I and minimize
f = tr(QQᵀ),
i.e. the sum of squared elements of Q.
Training with semi-orthogonal constraint
■ Basic case - Update equation (cont.)
With P ≡ MMᵀ, Q ≡ P − I, f = tr(QQᵀ):
• Matrix-calculus convention: the derivative of a scalar w.r.t. a matrix is not transposed w.r.t. that matrix, so
∂f/∂Q = 2Q,  ∂f/∂P = 2Q,  ∂f/∂M = 4QM
• Gradient step with learning rate ν:
M ← M − 4νQM
• ν = 1/8 leads to quadratic convergence, giving
M ← M − ½(MMᵀ − I)M    (1)
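A minimal NumPy sketch of update (1); iterating it alone (sizes are illustrative) shows the fast convergence toward MMᵀ = I:

```python
import numpy as np

# Minimal sketch of update (1): M <- M - 1/2 (M M^T - I) M.
# In training this step runs once every ~4 SGD steps; here we iterate
# it alone to show the (quadratic) convergence toward MM^T = I.
rng = np.random.default_rng(0)
rows, cols = 250, 2100
M = rng.standard_normal((rows, cols)) / np.sqrt(cols)  # Glorot-style init

I = np.eye(rows)
for step in range(5):
    Q = M @ M.T - I
    print(f"step {step}: ||MM^T - I||_F = {np.linalg.norm(Q):.2e}")
    M -= 0.5 * Q @ M

print("final deviation:", np.linalg.norm(M @ M.T - I))
```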
Training with semi-orthogonal constraint
■ Basic case - Weight Initialization
• Update (1), M ← M − ½(MMᵀ − I)M, can diverge if M is too far from being orthonormal to start with, but this does not happen in practice when M uses Glorot-style (Xavier) initialization (σ = 1/√#cols).
Reference: Understanding the difficulty of training deep feedforward neural networks (Xavier Glorot and Yoshua Bengio)
Training with semi-orthogonal constraint
■ Scale case
• Suppose we want M to be a scaled version of a semi-orthogonal matrix, i.e. (1/α)M is semi-orthogonal for some specified constant α.
• Substituting (1/α)M for M in (1) gives:
M ← M − (1/(2α²))(MMᵀ − α²I)M    (2)
Training with semi-orthogonal constraint
■ Floating case
• Goal: control how fast the parameters of the various layers change.
• Apply l2 regularization to the constrained layers.
• Compute the scale α from the current parameters and apply it in (2):
M ← M − (1/(2α²))(MMᵀ − α²I)M    (2)
With P ≡ MMᵀ:
α = √(tr(PPᵀ)/tr(P))    (3)
Training with semi-orthogonal constraint
■ Floating case
Why α = √(tr(PPᵀ)/tr(P))?
M is (up to a scale) a matrix with orthonormal rows. We pick the scale α that makes the update to M orthogonal to M itself (viewed as a vector): writing the update as M := M + X, we want tr(MXᵀ) = 0.
From (2), X is proportional to (MMᵀ − α²I)M; ignoring the constant −1/(2α²) and using P ≡ MMᵀ:
tr(MXᵀ) ∝ tr(MMᵀ(MMᵀ − α²I)) = tr(PPᵀ − α²P) = 0
so α² = tr(PPᵀ)/tr(P), i.e.
α = √(tr(PPᵀ)/tr(P))    (3)
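A minimal NumPy sketch of the floating case: estimate α with (3), then apply update (2). Shapes are illustrative; the actual implementation is in Kaldi's nnet3 code (see the nnet-utils.cc link in the speaker notes).

```python
import numpy as np

# Minimal sketch of the "floating" constraint: compute the scale alpha
# from equation (3), then apply update (2). Shapes are illustrative.
def constrain_semi_orthogonal_floating(M):
    P = M @ M.T
    alpha2 = np.trace(P @ P.T) / np.trace(P)   # equation (3), squared
    Q = P - alpha2 * np.eye(M.shape[0])
    return M - (1.0 / (2.0 * alpha2)) * Q @ M  # equation (2)

rng = np.random.default_rng(0)
M = 3.0 * rng.standard_normal((256, 1280)) / np.sqrt(1280)  # scaled init
for _ in range(4):
    M = constrain_semi_orthogonal_floating(M)

P = M @ M.T
print("rows are orthogonal up to a scale; scale^2 ≈", np.trace(P) / P.shape[0])
```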
Factorized model topologies
1. Basic factorization
M = AB, with B constrained to be semi-orthogonal.
Example: M is 700 x 2100; A is 700 x 250 and B is 250 x 2100.
We call 250 the linear bottleneck dimension (see the parameter-count sketch after this list).
2. Tuning the dimensions
Tuning on the 300-hour setup, we ended up using larger matrix sizes, with a hidden-layer dimension of 1280 or 1536, a linear bottleneck dimension of 256, and more hidden layers.
3. Factorizing the convolution
The example in item 1 uses a constrained 3x1 convolution followed by a 1x1 convolution. We found better results when using a constrained 2x1 convolution followed by a 2x1 convolution.
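To make the savings concrete, a quick parameter count for the example in item 1 (a sketch; biases ignored):

```python
# Parameter count for the basic factorization example (biases ignored).
m, n, r = 700, 2100, 250       # full size and linear bottleneck dim
full = m * n                   # unfactored M: 700 x 2100
factored = m * r + r * n       # A: 700 x 250 plus B: 250 x 2100
print(full, "->", factored, f"({factored / full:.0%} of the original)")
# 1470000 -> 700000 (48% of the original)
```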
Factorized model topologies
4. 3-stage splicing
A constrained 2x1 convolution to dimension 256, followed by another constrained 2x1 convolution to dimension 256, followed by a 2x1 convolution back to the hidden-layer dimension (e.g. 1280). Even better than the two-stage scheme in item 3.
5. Dropout
The dropout mask is shared across time (see the sketch after this list).
Dropout schedule for α (dropout strength): 0 → 0.5 → 0.
Continuous dropout scale: drawn from a uniform distribution on [1 − 2α, 1 + 2α].
6. Factorizing the final layer
Even on very small datasets where factorizing the TDNN layers was not helpful, factorizing the final layer was helpful.
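A minimal NumPy sketch of the time-shared continuous dropout in item 5, using the batch# × dim × seq_len layout shown on the appendix slide:

```python
import numpy as np

# Minimal sketch: continuous dropout with the mask shared across time.
# One scale per (batch, dim) is drawn from U[1 - 2a, 1 + 2a] and applied
# to every time step, instead of an independent binary mask per frame.
def shared_time_dropout(x, alpha, rng):
    # x: (batch, dim, seq_len); alpha: current dropout strength
    batch, dim, _ = x.shape
    scale = rng.uniform(1 - 2 * alpha, 1 + 2 * alpha, size=(batch, dim, 1))
    return x * scale           # broadcasts across the time axis

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 1280, 50))
y = shared_time_dropout(x, alpha=0.5, rng=rng)
print(y.shape)  # (8, 1280, 50)
```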
Factorized model topologies
7. Skip connections
• Some layers receive as input not only the output of the previous layer but also selected earlier layers (up to 3), which are appended to the previous layer's output.
• This helps with the vanishing-gradient problem (see the sketch below).
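A minimal sketch (my illustration; dimensions are arbitrary) of the append-style skip connection: earlier layer outputs are concatenated with the previous layer's output along the feature dimension.

```python
import numpy as np

# Minimal sketch: a layer's input is the previous layer's output with up
# to 3 selected earlier layers appended along the feature dimension.
def layer_input(prev, skips):
    # prev: (dim, T); skips: list of earlier outputs, each (dim_i, T)
    return np.concatenate([prev] + skips, axis=0)

T = 50
h1, h2, h3 = (np.ones((256, T)) * i for i in (1, 2, 3))
x4 = layer_input(h3, [h1, h2])  # input to layer 4
print(x4.shape)                 # (768, 50)
```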
Experiments
• Experimental setup
1. Basic factorization, evaluated on three setups:
• Switchboard: 300 hours
• Fisher+Switchboard: 2000 hours
• MATERIAL: two low-resource languages, Swahili and Tagalog, with 80 hours each
Experiments
2. Comparing model types
Conclusions
1. Factorized TDNN (TDNN-F): an effective way to train networks with parameter matrices represented as the product of two or more smaller matrices, with all but one of the factors constrained to be semi-orthogonal.
2. Skip connections help with the vanishing-gradient problem.
3. A dropout mask that is shared across time.
4. Better results, and faster decoding.
Appendix
Factorizing the convolution - time offsets per factor:
• time-stride = 1: the 1024→128 factor uses time-offsets -1, 0; the 128→1024 factor uses time-offsets 0, 1
• time-stride = 3: the 1024→128 factor uses time-offsets -3, 0; the 128→1024 factor uses time-offsets 0, 3
• time-stride = 0: the 1024→128 factor uses time-offset 0; the 128→1024 factor uses time-offset 0
[Diagram: TDNN-F structure, showing time-shared dropout and a weighted sum of the layer's input and output.]
Factorized model topologies (cont.)
5. Dropout
The dropout mask is shared across time.
Dropout schedule for α (dropout strength): 0 → 0.5 → 0.
Continuous dropout scale: drawn from a uniform distribution on [1 − 2α, 1 + 2α].
[Diagram: dropout mask over a batch# × dim × seq_len tensor, shared along the seq_len axis.]
6. Factorizing the final layer
Even on very small datasets where factorizing the TDNN layers was not helpful, factorizing the final layer was helpful.
Speaker notes

  1. 2018 Interspeech.
  2. Subsampling: the contexts seen at adjacent time steps overlap heavily, so we can subsample, keeping only some of the connections; this approximates the original model while greatly reducing its computation. https://www.twblogs.net/a/5c778577bd9eee3399183f67
  3. Eigendecomposition of a matrix: https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix. A low-rank approximation is obtained by discarding the small, non-zero eigenvalues.
  4. Σ is a diagonal matrix with A's singular values on the diagonal, in decreasing order. The m columns of U and the n columns of V are called the left-singular and right-singular vectors of A, respectively.
  5. Previous work applied SVD to an already-trained model; this paper trains directly with the factorized architecture, hence the discussion of how the update is performed.
  6. Product of independent variables; implemented in the Caffe library, not described in the paper. https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
  7. Because we use batchnorm and because the ReLU nonlinearity is scale invariant, l2 does not have a true regularization effect when applied to the hidden layers; but it reduces the scale of the parameter matrix, which makes it learn faster.
  8. tr(MXᵀ) = 0 is the definition of orthogonality. https://github.com/kaldi-asr/kaldi/blob/7b762b1b32140cbf8fbf4c72b713b4bd18c71104/src/nnet3/nnet-utils.cc#L1009
  9. Re item 2: the current implementation uses 1280×128.
  10. In the current Kaldi implementation, the "3-stage splicing" aspect and the skip connections were taken out. https://groups.google.com/forum/#!topic/kaldi-help/gBinGgj6Xy4
  11. Eval2000: the full HUB5'00 evaluation set (also known as Eval2000) and its "switchboard" subset. RT03: test set (LDC2007S10).
  12. In order: 1×1, 2×1, 3×1.