Ayinde et al. study the impact of network architecture and weight initialization on learning redundant features. To empirically estimate the number of redundant features, the authors use an agglomerative clustering approach that clusters features based on their cosine similarity. Essentially, given a set of features, these are merged as long as their (average) cosine similarity exceeds some threshold $\tau$. The resulting number of redundant features is then compared across network architectures. Figure 1, for example, shows the number of redundant features for different network depths and different activation functions on MNIST. As can be seen, ReLU activations avoid redundant features, while increasing network depth usually encourages them.
![Number of redundant features for different depths and activation functions.](https://i.imgur.com/ICcCL2u.jpg)
Figure 1: Number of redundant features $n_r$ for networks with $n' = 1000$ hidden units, computed using the given threshold $\tau$. Experiments with different depths and activation functions are shown.
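To make the clustering step concrete, here is a minimal sketch of how the number of redundant features could be estimated. It assumes that features are the weight vectors of a layer's hidden units and approximates the paper's procedure with SciPy's average-linkage agglomerative clustering on cosine distance; the function name, the exact merge criterion, and the example data are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def count_redundant_features(features, tau):
    """Estimate the number of redundant features n_r.

    features: array of shape (n_features, dim), e.g. the weight vectors of
        a layer's hidden units (an assumption; the paper does not fix this here).
    tau: cosine-similarity threshold; features whose average pairwise
        cosine similarity exceeds tau end up in the same cluster.
    """
    # Average-linkage agglomerative clustering on cosine distance
    # (cosine distance = 1 - cosine similarity).
    Z = linkage(features, method="average", metric="cosine")

    # Cut the dendrogram at distance 1 - tau, i.e. keep merging clusters
    # while their average cosine similarity is at least tau.
    labels = fcluster(Z, t=1.0 - tau, criterion="distance")

    # Every feature beyond the first one in a cluster counts as redundant.
    n_clusters = len(np.unique(labels))
    return features.shape[0] - n_clusters


if __name__ == "__main__":
    # Toy example: 1000 "hidden units" on 784-dimensional (MNIST-sized) inputs,
    # where the second half are near-duplicates of the first half.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(1000, 784))
    W[500:] = W[:500] + 0.01 * rng.normal(size=(500, 784))
    print(count_redundant_features(W, tau=0.95))  # roughly 500 redundant features
```

With this counting convention, a cluster of $k$ mutually similar features contributes $k - 1$ to $n_r$, so a network with no two features more similar than $\tau$ has $n_r = 0$.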
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).