Operations Research and Management Science ›› 2023, Vol. 32 ›› Issue (10): 102-107. DOI: 10.12005/orms.2023.0326

• Theory Analysis and Methodology Study •

Pruning Approach to Neural Networks Based on Zero-norm Regularization

LIU Zhi

  1. School of Mathematics, South China University of Technology, Guangzhou 510000, China
  • Received: 2021-08-31  Online: 2023-10-25  Published: 2024-01-31
  • About the author: LIU Zhi (1996-), male, born in Chenzhou, Hunan; master's degree; research interests: optimization theory, algorithms and their applications.
  • Supported by: General Program of the National Natural Science Foundation of China (11971177)

Abstract: This paper proposes an effective pruning method for neural networks. The method introduces a zero-norm regularization term into the neural network training model to promote sparsity of the model weights, and compresses the model by removing the weights that take the value zero. For the proposed zero-norm regularized training model, an equivalent locally Lipschitz surrogate is obtained by establishing a global exact penalty for its equivalent MPEC form; the network is then trained and pruned by solving this Lipschitz surrogate with the alternating direction method of multipliers. Tests on the MLP and LeNet-5 models achieve sparsity of 97.43% and 99.50% at errors of 2.2% and 1%, respectively, demonstrating a strong pruning effect.
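
The zero-norm regularized training model referred to above can be sketched in the following generic form; the notation is illustrative and does not reproduce the paper's exact loss or network parameterization:

$$\min_{W}\; \mathcal{L}(W) + \lambda\,\|W\|_0,$$

where $W$ collects the network weights, $\mathcal{L}$ is the training loss, $\|W\|_0$ counts the nonzero entries of $W$, and $\lambda>0$ trades off accuracy against sparsity; weights driven exactly to zero are then removed from the network.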


Abstract: Deep neural networks (DNNs) have become ubiquitous in daily life, with applications ranging from autonomous driving to smart homes, and deploying DNN models on mobile devices and embedded systems has become an inevitable trend. Parameter redundancy, however, has long been the main obstacle to efficient neural network inference and makes such models difficult to deploy on mobile systems.
In recent years, academia and industry have proposed many model compression methods, such as knowledge distillation and network pruning. Neural network pruning, an important means of compressing network models, reduces the number of parameters by removing some neural connections, thereby alleviating the high computational cost and large memory footprint caused by weight redundancy. The method in this article further extends the network pruning model and its solution algorithm.
In this work, we propose an effective pruning method for neural networks to address the high computational cost and considerable memory bandwidth caused by the huge complexity and parameter redundancy of neural network models. The method promotes sparsity of the model weights by introducing a zero-norm regularization term into the training model, and compresses the model by deleting the weights that become zero. For the proposed zero-norm regularized model, we obtain an equivalent locally Lipschitz surrogate by establishing a global exact penalty for its equivalent MPEC (mathematical program with equilibrium constraints) reformulation.
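One standard way to build the MPEC reformulation and its exact penalty, taken from the zero-norm exact-penalty literature (the constants and the precise statement used in the paper may differ), is the following. For $w\in\mathbb{R}^p$ and the all-ones vector $e$,

$$\|w\|_0 \;=\; \min_{0\le v\le e}\;\big\{\langle e,\, e-v\rangle \;:\; \langle v,\, |w|\rangle = 0\big\},$$

so the zero-norm regularized model is equivalent to the MPEC

$$\min_{w,\,v}\; \mathcal{L}(w) + \lambda\,\langle e,\, e-v\rangle \quad \text{s.t.}\quad \langle v,\, |w|\rangle = 0,\; 0\le v\le e,$$

and, for every sufficiently large $\rho>0$,

$$\min_{w,\,0\le v\le e}\; \mathcal{L}(w) + \lambda\,\langle e,\, e-v\rangle + \rho\lambda\,\langle v,\, |w|\rangle$$

is a global exact penalty of it. Minimizing over $v$ componentwise then yields the locally Lipschitz, capped-$\ell_1$ type surrogate

$$\min_{w}\; \mathcal{L}(w) + \lambda\sum_{i}\min\big(1,\; \rho\,|w_i|\big).$$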
Based on this locally Lipschitz surrogate, and observing that when the activation function is the sigmoid the objective of the final optimization model is a combination of a smooth term and a nonsmooth term, where the smooth part can be handled by existing computational-graph frameworks and the subproblem associated with the nonsmooth part can be solved exactly, we design a proximal alternating direction method of multipliers (P-ADMM) to solve the smooth-loss model induced by the sigmoid activation function. Numerical experiments validate the efficiency of P-ADMM: the tests on the MLP and LeNet-5 networks yield 97.43% and 99.50% sparsity, respectively, without loss of accuracy. The results show that our method effectively reduces model complexity and achieves a higher sparsity ratio than other pruning methods, while being convenient to implement and easy to extend.
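To make the splitting concrete, the following is a minimal NumPy sketch of a proximal-ADMM pruning loop of the kind described above. The single-layer logistic loss, the capped-l1 surrogate, the prox formula, the step sizes, and all function names are illustrative assumptions, not the paper's actual algorithm or code.

    # Minimal proximal-ADMM sketch: min_W loss(W) + lam*phi(Z)  s.t.  W = Z,
    # where phi is a capped-l1 surrogate of the zero norm (assumed form).
    import numpy as np

    def loss_and_grad(W, X, y):
        """Smooth logistic-type loss with sigmoid activation (illustrative)."""
        p = 1.0 / (1.0 + np.exp(-(X @ W)))                 # sigmoid output
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        grad = X.T @ (p - y) / X.shape[0]
        return loss, grad

    def prox_capped_l1(v, lam, rho, step):
        """Componentwise prox of step*lam*min(1, rho*|.|), computed exactly by
        comparing the two candidate minimizers of the piecewise subproblem."""
        soft = np.sign(v) * np.maximum(np.abs(v) - step * lam * rho, 0.0)
        out = np.empty_like(v)
        for i in range(v.size):
            cands = (soft[i], v[i])                        # thresholded vs. kept
            vals = [0.5 * (c - v[i]) ** 2 + step * lam * min(1.0, rho * abs(c))
                    for c in cands]
            out[i] = cands[int(np.argmin(vals))]
        return out

    def padmm_prune(X, y, lam=1e-2, rho=10.0, beta=1.0, lr=0.1, iters=200):
        d = X.shape[1]
        W = np.zeros(d)        # weights updated through the smooth loss
        Z = np.zeros(d)        # auxiliary copy carrying the sparse regularizer
        U = np.zeros(d)        # scaled dual variable for the constraint W = Z
        for _ in range(iters):
            _, g = loss_and_grad(W, X, y)
            W = W - lr * (g + beta * (W - Z + U))            # gradient step on smooth part
            Z = prox_capped_l1(W + U, lam, rho, 1.0 / beta)  # exact prox step
            U = U + W - Z                                    # dual update
        return Z               # exactly sparse; zero entries are pruned

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 50))
        w_true = np.zeros(50); w_true[:5] = 1.0
        y = (X @ w_true + 0.1 * rng.normal(size=200) > 0).astype(float)
        print("sparsity:", np.mean(padmm_prune(X, y) == 0.0))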
In summary, this article proposes a proximal ADMM (P-ADMM) method for solving the smooth-loss network pruning model. Because the neural network model is highly nonconvex, the convergence of the algorithm slows down in the later iterations, even though the model is solved by alternating minimization within a computational-graph framework. One future research direction is therefore to develop an acceleration strategy that improves the convergence rate, and to investigate whether the nonconvex nonsmooth model can be solved directly by gradient-based methods built on backpropagation and computational-graph frameworks. Another interesting direction is how to design effective algorithms, and to establish their convergence properties, when the loss function itself is nonsmooth.

Key words: network pruning, zero-norm regularization, ADMM
