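Optimal Brain Damage (OBD) is a technique for making a neural network smaller by pruning the weights whose removal is estimated to increase the training error the least (low saliency), which are not necessarily the weights with the smallest magnitude.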
## Idea
* use second-derivative information to make a tradeoff between network complexity and training error
* alternate pruning with training; the smaller network is expected to overfit less, need fewer training examples, and train or classify faster
* **How to choose what to delete**: Delete the weights whose removal has the least impact on the training error. This impact (the saliency) is estimated with a second-order Taylor expansion of the error function around the trained weights.
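
To make the Taylor-series argument concrete, here is a short sketch of the derivation in the paper's notation: $E$ is the error, $u_i$ are the parameters, $g_i$ the gradient components, and $h_{ij}$ the Hessian entries. A perturbation $\delta U$ of the parameters changes the error by

$$
\delta E = \sum_i g_i \,\delta u_i + \frac{1}{2} \sum_i h_{ii} \,\delta u_i^2 + \frac{1}{2} \sum_{i \neq j} h_{ij} \,\delta u_i \,\delta u_j + O(\lVert \delta U \rVert^3)
$$

OBD then makes three simplifying approximations: the network is at a local minimum, so the gradient term vanishes; the cross terms $h_{ij}$ with $i \neq j$ are neglected; and the error is treated as quadratic, so the higher-order terms are dropped. What remains is

$$
\delta E \approx \frac{1}{2} \sum_i h_{ii} \,\delta u_i^2
$$

Deleting parameter $u_k$ corresponds to $\delta u_k = -u_k$, which gives the saliency $s_k = h_{kk} u_k^2 / 2$ used in the recipe.
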
## Recipe
(Directly copied from the paper):
The OBD procedure can be carried out as follows:
1. Choose a reasonable network architecture
2. Train the network until a reasonable solution is obtained
3. Compute the second derivatives $h_{kk}$ for each parameter
4. Compute the saliencies for each parameter: $s_k = h_{kk} u_k^2 /2$
5. Sort the parameters by saliency and delete some low-saliency parameters
6. Iterate to step 2

Deleting a parameter is defined as setting it to 0 and freezing it there. Several variants of the procedure can be devised, such as decreasing the values of the low-saliency parameters instead of simply setting them to 0, or allowing the deleted parameters to adapt again after they have been set to 0.
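
The recipe can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: it assumes a linear least-squares model instead of a neural network, because there the Hessian diagonal is available in closed form ($h_{kk}$ is the mean of the squared $k$-th input feature). The paper computes $h_{kk}$ for a multilayer network with a backpropagation-like pass, which is not shown here. The function names `train`, `saliencies`, and `obd_prune` and the parameters `n_rounds` and `n_delete` are hypothetical.

```python
import numpy as np


def train(X, y, w, mask, lr=0.1, steps=500):
    """Gradient descent on E = ||Xw - y||^2 / (2n); deleted weights stay at 0."""
    n = len(y)
    w = w * mask
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n
        w = (w - lr * grad) * mask  # re-apply the mask to keep deleted weights frozen
    return w


def saliencies(X, w):
    """OBD saliency s_k = h_kk * u_k^2 / 2 for the least-squares error above."""
    h_diag = np.mean(X ** 2, axis=0)  # diagonal of the Hessian X^T X / n
    return h_diag * w ** 2 / 2


def obd_prune(X, y, n_rounds=2, n_delete=1):
    """Alternate training and pruning (assumed toy setup, linear model)."""
    n_features = X.shape[1]
    mask = np.ones(n_features)
    w = train(X, y, np.zeros(n_features), mask)   # step 2: train
    for _ in range(n_rounds):
        s = saliencies(X, w)                      # steps 3-4: h_kk and s_k
        s[mask == 0] = np.inf                     # never pick a deleted weight again
        delete = np.argsort(s)[:n_delete]         # step 5: lowest saliency first
        mask[delete] = 0.0                        # set to 0 and freeze
        w = train(X, y, w, mask)                  # step 6: retrain
    return w, mask


# Toy data: two of the five true weights are zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -3.0, 0.0, 0.5])
w, mask = obd_prune(X, y)
print(mask)
print(w)
```

On this toy problem one would expect the two weights whose true value is zero to be the ones removed. Freezing is implemented by multiplying with `mask` after every update; the variant in which deleted parameters may adapt again would simply skip that multiplication during retraining.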
