LSTM Structured Pruning

Slides for the paper titled "Structured pruning of LSTMs via Eigenanalysis and Geometric Median for Mobile Multimedia and Deep Learning Applications", by N. Gkalelis and V. Mezaris, presented at the 22nd IEEE Int. Symposium on Multimedia (ISM), Dec. 2020.

  1. Structured pruning of LSTMs via Eigenanalysis and Geometric Median for Mobile Multimedia and Deep Learning Applications
     N. Gkalelis, V. Mezaris, CERTH-ITI, Thermi - Thessaloniki, Greece
     IEEE Int. Symposium on Multimedia, Naples, Italy (Virtual), Dec. 2020
     retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
  2. Outline
     • Problem statement
     • Related work
     • Layer's pruning rate computation
     • LSTM unit importance estimation
     • Experiments
     • Conclusions
  3. Problem statement
     • Deep learning (DL) is becoming a game changer in most industries due to breakthrough classification performance in many machine learning tasks
     • Examples: mobile multimedia, self-driving cars, edge computing (image credits: [1], [2], [3])
     [1] V-Soft Consulting: https://blog.vsoftconsulting.com/
     [2] V2Gov: https://www.facebook.com/V2Gov/
     [3] J. Chen, X. Ran, Deep Learning With Edge Computing: A Review, Proc. of the IEEE, Aug. 2019
  4. Problem statement
     • Recurrent neural networks (RNNs) have shown excellent performance in processing sequential data
     • Deploying top-performing RNNs in resource-limited applications such as mobile multimedia devices is still difficult due to their high inference time and storage requirements
     → How can the size of RNNs be reduced while retaining generalization performance?
  5. Related work
     • Pruning is receiving increasing attention because pruning methods achieve a high compression rate while maintaining stable model performance [4, 5]
     • Two main pruning categories: a) unstructured: prune individual network weights; b) structured: prune well-defined network components, e.g., DCNN filters or LSTM units
     → Models derived using structured pruning can be deployed on conventional hardware (e.g., GPUs); no special-purpose accelerators are required
     [4] K. Ota, M.S. Dao, V. Mezaris, F.G.B. De Natale, Deep Learning for Mobile Multimedia: A Survey, ACM Trans. Multimedia Computing Communications & Applications (TOMM), vol. 13, no. 3s, June 2017
     [5] Y. Cheng, D. Wang, P. Zhou, T. Zhang, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126-136, Jan. 2018
  6. Related work
     • Structured pruning of DCNNs has been extensively studied in the literature; structured RNN pruning is a much less investigated topic:
       • In [6], Intrinsic Sparse Structures (ISS) of LSTMs are defined and a Group Lasso-based approach is used for sparsifying the network
       • In [7], LSTM parameters are constrained using an L0 norm penalty and ISSs close to zero are pruned
     → Both [6] and [7] use sparsity-inducing regularizers to modify the loss function, which may lead to numerical instabilities and suboptimal solutions [8]
     [6] W. Wen et al., Learning intrinsic sparse structures within long short-term memory, ICLR, 2018
     [7] L. Wen et al., Structured pruning of recurrent neural networks through neuron selection, Neural Networks, Mar. 2020
     [8] H. Xu et al., Sparse algorithms are not stable: A no-free-lunch theorem, IEEE Trans. Pattern Anal. Mach. Intell., Jan. 2012
  7. Overview of proposed method
     • Inspired by recent advances in DCNN filter pruning [9, 10], we extend [6]:
       • The covariance matrix formed by the layer's responses is used to compute the respective eigenvalues and to quantify the layer's redundancy and pruning rate (as in [9] for DCNN layers)
       • A Geometric Median-based (GM-based) criterion is used to identify the most redundant LSTM units (as in [10] for DCNN filters)
     → The GM-based criterion has shown superior performance over sparsity-inducing ones in the DCNN domain
     [9] X. Suau, U. Zappella, N. Apostoloff, Filter distillation for network compression, IEEE WACV, CO, USA, Mar. 2020
     [10] Y. He et al., Filter pruning via Geometric median for deep convolutional neural networks acceleration, IEEE CVPR, CA, USA, Jun. 2019
  8. Computation of layer's pruning rate
     • Suppose an annotated training set of N sequences and C classes
     • The training set at the LSTM layer's output can be represented as $\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_N]$, $\mathbf{z}_k \in \mathbb{R}^H$
     • $\mathbf{z}_k$ is the hidden state vector of the k-th sequence at the last time step; it has high representational power and is often used to represent the overall input sequence; H is the number of units in the layer
     • The sample covariance matrix S of the responses can be computed as
       $$\mathbf{S} = \sum_{k=1}^{N} (\mathbf{z}_k - \mathbf{m})(\mathbf{z}_k - \mathbf{m})^T, \quad \mathbf{m} = \frac{1}{N} \sum_{k=1}^{N} \mathbf{z}_k$$
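The matrix S of slide 8 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `scatter_matrix` and the three 2-dimensional hidden states are hypothetical, and, following the formula as it appears on the slide, no 1/N factor is applied (the later normalization of the eigenvalues makes such a factor irrelevant).

```python
from typing import List


def scatter_matrix(Z: List[List[float]]) -> List[List[float]]:
    """Return S = sum_k (z_k - m)(z_k - m)^T for the rows z_k of Z."""
    n, h = len(Z), len(Z[0])
    # Mean response m over the N sequences.
    m = [sum(z[i] for z in Z) / n for i in range(h)]
    S = [[0.0] * h for _ in range(h)]
    for z in Z:
        d = [z[i] - m[i] for i in range(h)]  # centered response z_k - m
        for i in range(h):
            for j in range(h):
                S[i][j] += d[i] * d[j]  # accumulate the outer product
    return S


# Hypothetical data: N = 3 sequences, H = 2 units (last-step hidden states).
Z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
S = scatter_matrix(Z)
```

In practice each row of `Z` would be the last-time-step hidden state that the trained LSTM layer produces for one training sequence.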
  9. Computation of layer's pruning rate
     • The eigenvalues of S are computed, sorted in descending order and normalized to sum to one:
       $$\lambda_1, \dots, \lambda_H, \quad \lambda_1 \geq \dots \geq \lambda_H \geq 0, \quad \sum_{i=1}^{H} \lambda_i = 1$$
     • The eigenvalues indicate the redundancy of the LSTM layer: if only a small fraction of them is nonzero, we conclude that many redundant units exist in the layer
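A minimal sketch of the eigenvalue step of slide 9, assuming NumPy is available. The helper name `normalized_eigenvalues` and the 2x2 example matrix are hypothetical; `numpy.linalg.eigvalsh` is used because S is symmetric, and clipping small negative round-off values to zero is an added safeguard, not something stated on the slide.

```python
import numpy as np


def normalized_eigenvalues(S):
    """Eigenvalues of a symmetric matrix, sorted descending, scaled to sum to 1."""
    eig = np.linalg.eigvalsh(np.asarray(S, dtype=float))  # returned in ascending order
    eig = np.clip(eig[::-1], 0.0, None)  # descending; clip negative round-off
    total = eig.sum()
    if total == 0.0:
        raise ValueError("all eigenvalues are zero; the layer output is constant")
    return eig / total


# Hypothetical 2x2 matrix whose two units respond identically (fully redundant).
S = [[8.0, 8.0], [8.0, 8.0]]
lam = normalized_eigenvalues(S)
```

For this rank-one example all of the energy falls on a single eigenvalue, which is the situation the slide describes as high redundancy.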
  10. Computation of layer's pruning rate
     • We further define ζi and δi as:
       $$\zeta_1, \dots, \zeta_H, \quad \zeta_j = \sum_{i=1}^{j} \lambda_i, \qquad \delta_1, \dots, \delta_H, \quad \delta_i = \begin{cases} 1, & \text{if } \zeta_i \leq \alpha \\ 0, & \text{else} \end{cases}$$
     • α: tuning parameter for deriving the required pruning level
     • The pruning rate θ of the LSTM layer is then computed using the δ's:
       $$\theta = 1 - \frac{\sum_{i=1}^{H} \delta_i}{H}$$
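The definitions of ζ, δ and θ on slide 10 translate almost directly into code. The sketch below is illustrative only: the function name `pruning_rate`, the tolerance `tol` and the four eigenvalues are assumptions. The tolerance is added because cumulative sums of floats can land a hair above α (for example 0.9 + 0.05 evaluates to slightly more than 0.95), which would otherwise flip a δ from 1 to 0.

```python
from itertools import accumulate
from typing import Sequence


def pruning_rate(lam: Sequence[float], alpha: float, tol: float = 1e-9) -> float:
    """theta = 1 - (sum_i delta_i) / H, with delta_i = 1 if zeta_i <= alpha else 0."""
    zeta = list(accumulate(lam))  # zeta_j = lambda_1 + ... + lambda_j
    delta = [1 if z <= alpha + tol else 0 for z in zeta]
    return 1.0 - sum(delta) / len(lam)


# Hypothetical normalized eigenvalues of a layer with H = 4 units.
lam = [0.4, 0.3, 0.2, 0.1]
theta = pruning_rate(lam, alpha=0.9)
```

Here three cumulative sums stay within α = 0.9, so one of the four units would be marked for pruning.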
  11. Computation of layer's pruning rate
     • Toy example: 2 LSTM layers with 6 units each
     • We compute the λi's, ζi's and δi's using α = 0.95 (overall energy level to retain):

            1st LSTM layer (left)               2nd LSTM layer (right)
       λi   0.5, 0.3, 0.1, 0.05, 0.03, 0.02     0.93, 0.04, 0.02, 0.01, 0, 0
       ζi   0.5, 0.8, 0.9, 0.95, 0.98, 1        0.93, 0.97, 0.99, 1, 1, 1
       δi   1, 1, 1, 1, 0, 0                    1, 0, 0, 0, 0, 0

     • 1st LSTM layer (left): energy is spread among many eigenvalues; the layer exhibits small redundancy; a low pruning rate is computed (θ[1] = 1 – 4/6 = 33%)
     • 2nd LSTM layer (right): energy is accumulated in only a few eigenvalues; the layer exhibits high redundancy; a high pruning rate is computed (θ[2] = 1 – 1/6 = 83%)
     • The total pruning rate is (33% + 83%)/2 = 58%; alternatively, we can adjust α through grid search in order to achieve a given target pruning rate
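The toy example of slide 11, including the grid search over α that the slide mentions, can be reproduced with the following sketch. The eigenvalues are the ones listed on the slide; the helper names, the tolerance, the grid of α values (0.50 to 1.00 in steps of 0.01) and the 50% target are assumptions made for illustration, since the slide does not specify how the grid search is carried out.

```python
from itertools import accumulate


def pruning_rate(lam, alpha, tol=1e-9):
    """theta = 1 - (number of cumulative sums zeta_i <= alpha) / H."""
    kept = sum(1 for z in accumulate(lam) if z <= alpha + tol)
    return 1.0 - kept / len(lam)


# Normalized eigenvalues of the two toy layers (values from the slide).
layers = [
    [0.5, 0.3, 0.1, 0.05, 0.03, 0.02],   # 1st LSTM layer
    [0.93, 0.04, 0.02, 0.01, 0.0, 0.0],  # 2nd LSTM layer
]


def total_rate(alpha):
    """Average of the per-layer pruning rates (layers have equal size here)."""
    return sum(pruning_rate(lam, alpha) for lam in layers) / len(layers)


rates = [pruning_rate(lam, 0.95) for lam in layers]  # per-layer theta at alpha = 0.95
overall = total_rate(0.95)

# Hypothetical grid search: choose the alpha whose overall rate is closest to a target.
target = 0.5
grid = [a / 100 for a in range(50, 101)]
best_alpha = min(grid, key=lambda a: abs(total_rate(a) - target))
```

Because θ changes only when α crosses one of the cumulative sums, the overall rate is a step function of α, so a target rate can in general only be matched approximately.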
  12. LSTM unit importance estimation
     • Stack all LSTM layer weight matrices to form an overall weight matrix W:
       $$\mathbf{W} = [\mathbf{W}_{ix}, \mathbf{W}_{fx}, \mathbf{W}_{ux}, \mathbf{W}_{ox}, \mathbf{W}_{ih}, \mathbf{W}_{fh}, \mathbf{W}_{uh}, \mathbf{W}_{oh}] \in \mathbb{R}^{H \times Q}$$
     • H: hidden state dimensionality (number of layer units); F: dimensionality of the layer's input vector; Q = 4(H + F), since the four input matrices are H × F and the four recurrent matrices are H × H
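As a small check of the dimensions on slide 12, the sketch below concatenates the eight gate matrices column-wise and confirms that each row has Q = 4(H + F) entries. The function name `stack_lstm_weights`, the sizes H = 3 and F = 2 and the dummy weight values are hypothetical; a real implementation would read the matrices from the trained layer.

```python
from typing import List

Matrix = List[List[float]]


def stack_lstm_weights(input_mats: List[Matrix], hidden_mats: List[Matrix]) -> Matrix:
    """Concatenate the gate matrices column-wise: W = [W_ix, W_fx, W_ux, W_ox, W_ih, W_fh, W_uh, W_oh]."""
    mats = input_mats + hidden_mats
    n_units = len(mats[0])
    # Row r of W collects row r of every gate matrix, i.e. all weights of unit r.
    return [[v for M in mats for v in M[r]] for r in range(n_units)]


H, F = 3, 2  # hypothetical layer sizes
# Dummy weights: four H x F input matrices and four H x H recurrent matrices.
W_x = [[[0.1 * g] * F for _ in range(H)] for g in range(4)]
W_h = [[[1.0 + g] * H for _ in range(H)] for g in range(4)]
W = stack_lstm_weights(W_x, W_h)
Q = len(W[0])
```

Each row of `W` then holds every weight attached to one LSTM unit, which is what the unit-level importance score operates on.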
  13. LSTM unit importance estimation
     • Each row of W is associated with one unit of the layer; rewrite W as:
       $$\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_H]^T, \quad \mathbf{w}_k \in \mathbb{R}^Q$$
     • Derive the GM-based dissimilarity value [10] for each unit of the LSTM layer:
       $$\eta_j = \sum_{k=1}^{H} \| \mathbf{w}_j - \mathbf{w}_k \|$$
     • A small ηj denotes that unit j is highly correlated with the other units in the layer (i.e., redundant); discard the units with the smallest ηj
  14. Experiments
     • Penn Treebank (PTB) [11]: word-level prediction, 1086k tokens, 10k classes (unique tokens), 930k training, 74k validation and 82k testing tokens
     • YouTube-8M (YT8M) [12]: multilabel concept detection, 3862 classes (semantic concepts), more than 6 million videos; 1024- and 128-dimensional visual and audio feature vector sequences are provided for each video
     • The proposed ISS-GM is compared with ISS-GL [6] and ISS-L0 [7]
     [11] M. P. Marcus, M. Marcinkiewicz, B. Santorini, Building a large annotated corpus of English: The Penn Treebank, Comput. Linguist., Jun. 1993
     [12] J. Lee et al., The 2nd YouTube-8M large-scale video understanding challenge, ECCV Workshops, Munich, Germany, Sep. 2018
  15. Experimental setup
     • PTB: as in [13], 2-layer stacked LSTM with 1500 units per layer, output layer of size 10000, dropout keep rate 0.5, sequence length 35, 55 epochs, minibatch averaged SGD, batch size 20, initial learning rate 1, etc.
     • YT8M: 1st BLSTM layer with 512 units per forward/backward layer, 2nd LSTM layer with 1024 units, output layer of size 3862, sequence length 300 frames, 10 epochs, minibatch SGD, batch size 256, initial learning rate 0.0002, etc.
     • Performance is measured using the per-word perplexity (PPL) for PTB and the global average precision at 20 (GAP@20) for YT8M
     [13] W. Zaremba, I. Sutskever, O. Vinyals, Recurrent neural network regularization, CoRR, vol. abs/1409.2329, 2014
  16. Experiments
     • Evaluation results in PTB (top) and YT8M (bottom)

       Method            ISS # in (1st, 2nd)    PPL (valid., test)
       baseline [13]     (1500, 1500)           (82.57, 78.57)
       ISS-GL [6]        (373, 315)             (82.59, 78.65)
       ISS-L0 [7]        (296, 247)             (81.62, 78.08)
       ISS-GM (prop.)    (236, 297)             (81.49, 77.97)

       Method                     GAP@20    Ttr (hours)
       no pruning                 84.33%    6.73
       ISS-GL [6] (θ=30%)         83.20%    7.82
       ISS-GM (prop.) (θ=30%)     84.12%    15.4
       ISS-GL [6] (θ=70%)         82.20%    7.43
       ISS-GM (prop.) (θ=70%)     83.10%    14.5

     • Lower PPL values are better; higher GAP@20 values are better; training time (Ttr) is in hours
     • ISS-GM outperforms all other methods
     • It exhibits a high degree of robustness against large pruning rates (e.g., only a 1.23% drop for θ = 70%)
     • It is approx. 2 times slower than ISS-GL due to the eigenanalysis of the covariance matrix; since training is performed off-line, this limitation is considered insignificant
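The headline figures of slide 16 can be rechecked from the tabulated values with a few lines of arithmetic. The numbers below are transcribed from the two result tables; the dictionary keys and variable names are hypothetical, and the fraction of removed units is a derived quantity that does not appear on the slide itself.

```python
# PTB: number of retained ISS (units) in the 1st and 2nd LSTM layer.
units = {
    "baseline": (1500, 1500),
    "ISS-GL": (373, 315),
    "ISS-L0": (296, 247),
    "ISS-GM": (236, 297),
}
# YT8M: GAP@20 in percent and training time in hours.
gap = {"no pruning": 84.33, "ISS-GL@70": 82.20, "ISS-GM@70": 83.10}
ttr = {"ISS-GL@30": 7.82, "ISS-GM@30": 15.4}

total_units = sum(units["baseline"])
# Fraction of units removed relative to the unpruned baseline.
removed = {name: 1 - sum(u) / total_units for name, u in units.items()}

# Absolute GAP@20 drop of ISS-GM at theta = 70% versus no pruning.
gap_drop = gap["no pruning"] - gap["ISS-GM@70"]

# Training-time ratio of ISS-GM to ISS-GL at theta = 30%.
slowdown = ttr["ISS-GM@30"] / ttr["ISS-GL@30"]
```

Read this way, the 1.23% drop quoted on the slide is a difference in percentage points of GAP@20, and the "approx. 2 times slower" remark corresponds to the ratio of the two training times.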
  17. Summary and next steps
     • A new LSTM structured pruning approach was presented: it uses the sample covariance matrix of the layer's responses and a GM-based criterion to automatically derive pruning rates and discard the most redundant units
     • The proposed approach was evaluated successfully on two popular datasets (PTB, YT8M) for word-level prediction in text and multilabel video classification tasks
     • As future work, we plan to investigate the use of the proposed approach in pruning deeper RNN architectures, e.g., Recurrent Highway Networks [14, 15]
     [14] J. G. Zilly, R. K. Srivastava, J. Koutník, J. Schmidhuber, Recurrent Highway Networks, Proc. ICML, 2017
     [15] G. Pundak, T. Sainath, Highway-LSTM and Recurrent Highway Networks for Speech Recognition, Proc. Interspeech, 2017
  18. Thank you for your attention! Questions?
     Nikolaos Gkalelis, gkalelis@iti.gr
     Vasileios Mezaris, bmezaris@iti.gr
     Code will be publicly available by end of December 2020 at: https://github.com/bmezaris/lstm_structured_pruning_geometric_median
     This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement H2020-780656 ReTV