We introduce a mixed integer program (MIP) for assigning importance scores to each neuron in
deep neural network architectures; the scores are guided by the impact of simultaneously pruning
the corresponding neurons on the main learning task of the network. By carefully devising the objective function of the MIP,
we drive the solver to minimize the number of critical neurons (i.e., those with a high importance score)
that need to be kept to maintain the overall accuracy of the trained neural network. Further, the
proposed formulation generalizes the recently considered lottery ticket optimization by identifying multiple "lucky" sub-networks, resulting in an optimized architecture that not only performs well
on a single dataset, but also generalizes across multiple ones upon retraining of the network weights.
1. Identifying Critical Neurons in ANN Architectures using Mixed Integer Programming
Mostafa ElAraby Guy Wolf Margarida Carvalho
[OPTML NeurIPS 2020]
2. Motivation
Efficient sub-networks exist that offer faster inference with only a marginal loss in accuracy compared to the original over-parameterized ANN.
Frankle and Carbin (2018) introduced the lottery ticket conjecture and empirically showed the existence of a lucky pruned sub-network, a winning ticket.
5. Linear Programming (LP)
A powerful framework used to solve optimization problems built from the following components:
Linear Objective
An optimization objective, either the minimization or the maximization of a linear expression in the decision variables we are trying to solve for.
Decision Variables
The variables optimized by the LP; at the end of the optimization the solver returns their solved values.
Linear Constraints
A set of constraints on the decision variables that the solver tries to satisfy, narrowing its search space. The solver reports the problem as infeasible if it cannot find a solution satisfying the linear constraints.
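As a toy illustration (my own example, not from the paper), the sketch below solves a small LP with SciPy's linprog; the objective coefficients and constraints are arbitrary.

```python
import numpy as np
from scipy.optimize import linprog

# Toy LP (illustrative numbers only):
#   maximize 5x + 4y
#   subject to 6x + 4y <= 24, x + 2y <= 6, x >= 0, y >= 0
c = np.array([-5.0, -4.0])                 # linprog minimizes, so negate the objective
A_ub = np.array([[6.0, 4.0], [1.0, 2.0]])
b_ub = np.array([24.0, 6.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)                     # optimal decision variables and objective value
```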
6. Mixed-Integer Programming (MIP)
Similar to linear programming, but some decision variables may be required to take integer values, alongside the continuous variables used in linear programming.
It is considered a harder problem, whose integrality requirements can be relaxed to obtain a linear programming problem.
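A minimal sketch of the same toy problem as a MIP, now requiring x to be integer; it uses the PuLP modelling library as one option (the numbers are illustrative, not from the paper).

```python
import pulp

# Toy MIP (illustrative numbers only): same problem as the LP above, but x must be integer.
prob = pulp.LpProblem("toy_mip", pulp.LpMaximize)
x = pulp.LpVariable("x", lowBound=0, cat="Integer")
y = pulp.LpVariable("y", lowBound=0, cat="Continuous")

prob += 5 * x + 4 * y            # linear objective
prob += 6 * x + 4 * y <= 24      # linear constraints
prob += x + 2 * y <= 6

prob.solve()
print(pulp.LpStatus[prob.status], x.value(), y.value())
```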
7. Branch and Bound algorithm
We relax our MIP into an LP; if we are lucky, solving the relaxation already gives the optimal (integer-feasible) solution. Otherwise, which is the usual case, we take an integer variable whose relaxed value is fractional (the branching variable) and add linear constraints excluding that fractional value, resulting in two new MIPs.
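A minimal branch-and-bound sketch on the same toy problem, using SciPy's linprog for the LP relaxations; this illustrates the procedure above and is not the solver used in the paper.

```python
import math
import numpy as np
from scipy.optimize import linprog

# Toy problem (illustrative): maximize 5x + 4y s.t. 6x + 4y <= 24, x + 2y <= 6,
# with x and y non-negative integers.
c = np.array([-5.0, -4.0])                 # negate: linprog minimizes
A_ub = np.array([[6.0, 4.0], [1.0, 2.0]])
b_ub = np.array([24.0, 6.0])

best = {"val": -math.inf, "x": None}       # incumbent (best integer solution so far)

def branch_and_bound(bounds):
    # Prune nodes whose variable bounds are contradictory.
    if any(hi is not None and lo > hi for lo, hi in bounds):
        return
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)   # LP relaxation
    if not res.success:
        return                             # infeasible node: prune
    val = -res.fun
    if val <= best["val"]:
        return                             # bound: relaxation cannot beat incumbent
    frac = [i for i, v in enumerate(res.x) if abs(v - round(v)) > 1e-6]
    if not frac:                           # integral relaxation: update incumbent
        best["val"], best["x"] = val, res.x
        return
    i = frac[0]                            # branching variable
    lo, hi = bounds[i]
    v = res.x[i]
    # Branch: exclude the fractional value, creating two new subproblems.
    branch_and_bound(bounds[:i] + [(lo, math.floor(v))] + bounds[i + 1:])
    branch_and_bound(bounds[:i] + [(math.ceil(v), hi)] + bounds[i + 1:])

branch_and_bound([(0, None), (0, None)])
print(best["val"], best["x"])
```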
9. Introduction
The MIP solver computes a neuron importance score in [0, 1] for the neurons of convolutional and fully connected layers.
Neurons with a small importance score can be safely pruned without loss of accuracy.
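As an illustration of how such scores could be used (a sketch assuming one score per neuron in [0, 1]; the helper and threshold below are hypothetical, not the paper's procedure), pruning a fully connected layer amounts to zeroing the weights of its low-score neurons:

```python
import torch
import torch.nn as nn

def prune_low_importance(layer: nn.Linear, scores: torch.Tensor, threshold: float = 0.1):
    """Zero out the neurons of a fully connected layer whose importance score is below threshold."""
    keep = scores >= threshold
    with torch.no_grad():
        layer.weight[~keep, :] = 0.0       # incoming weights of pruned neurons
        if layer.bias is not None:
            layer.bias[~keep] = 0.0
    return keep

layer = nn.Linear(784, 128)
scores = torch.rand(128)                    # placeholder importance scores
kept = prune_low_importance(layer, scores)
print(f"kept {int(kept.sum())} of {kept.numel()} neurons")
```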
15. Representing Convolutional layers
We convert convolutional layers to flat Toeplitz matrices, turning the convolution into a plain matrix multiplication; this lets us reuse the constraints previously introduced for fully connected layers, with one importance score per filter.
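The same equivalence can be checked numerically with im2col (a sketch, using torch.nn.functional.unfold rather than an explicit Toeplitz matrix; the tensor shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)           # (batch, channels, height, width)
w = torch.randn(4, 3, 3, 3)           # (filters, channels, kH, kW)

cols = F.unfold(x, kernel_size=3)     # (1, 3*3*3, 36): flattened input patches
flat_w = w.view(4, -1)                # (4, 3*3*3): flattened filters
out = (flat_w @ cols).view(1, 4, 6, 6)  # convolution as a matrix multiplication

assert torch.allclose(out, F.conv2d(x, w), atol=1e-5)
```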
16. Objective Function : Softmax
Softmax: a marginal softmax term that penalizes wrong predictions regardless of the logit value. Y is the one-hot encoded true label.
17. Objective Function: Sparsity
I represents the scaled-down importance score (s − 2), which was shown empirically to give non-important neurons a lower score.
When we increase λ, more neurons get a score near zero.
20. MIP Solvers are slow
Representing a deep neural network yields a model that is hard to solve even with commercial solvers, making it difficult for our algorithm to scale to large models.
To address this, we propose two solutions:
- Parallelizing the computation layer-wise
- Parallelizing the computation class-wise
22. Class-wise decoupling
In this experiment, we show that the neuron importance scores can be approximated by 1) solving the MIP for each class with only one data point from that class, and then 2) taking the average of the computed scores for each neuron as the final score estimate. This procedure speeds up our methodology for problems with numerous classes.
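A minimal sketch of this class-wise decoupling (the per-class solve is a hypothetical placeholder for the paper's MIP; only the averaging and parallelization pattern are shown):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def solve_mip_for_class(class_idx):
    """Hypothetical placeholder: build and solve the importance-score MIP using a
    single data point from class `class_idx`, returning one score per neuron
    (random numbers stand in for the real solve here)."""
    rng = np.random.default_rng(class_idx)
    return rng.random(256)                  # 256 = illustrative neuron count

def approximate_importance(num_classes):
    # Solve the per-class MIPs in parallel, then average the score vectors neuron-wise.
    with ProcessPoolExecutor() as pool:
        per_class = list(pool.map(solve_mip_for_class, range(num_classes)))
    return np.mean(per_class, axis=0)

if __name__ == "__main__":
    scores = approximate_importance(num_classes=10)
    print(scores.shape)                     # one averaged score per neuron
```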
25. Robustness Experiments
We show empirically that our framework is robust across different convergence levels of the trained neural network, as shown in the following figure.
26. Generalization Experiments
Cross-dataset generalization: the sub-network mask is computed on a source dataset (d1) and then applied to a target dataset (d2) by retraining with the same early initialization. Test accuracies are presented for the masked and unmasked (REF.) networks on d2, along with the pruning percentage.
27. Conclusion
We proposed a mixed integer program to compute neuron importance scores in
ReLU-based deep neural networks. Our contributions focus on providing
scalable computation of importance scores in fully connected and
convolutional layers.