Mean(input) = 0 and var(input) = 1 are good for learning. Independent input features are good for learning too.
So:
1) Pre-initialize network weights with (approximately) orthonormal matrices
2) Do a forward pass with a mini-batch
3) Divide each layer's weights by $\sqrt{\mathrm{Var}(\mathrm{output})}$ of that layer
4) PROFIT!
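
The steps above can be sketched in plain NumPy for a stack of fully connected layers. This is a minimal illustration, not the reference implementation: the function names `orthonormal` and `lsuv_init`, the ReLU nonlinearity between layers, the tolerance `tol`, and the `max_iter` cap are all assumptions made for the example. Variance is taken over every element of the layer output for the mini-batch.

```python
import numpy as np


def orthonormal(shape, rng):
    """Return a (rows, cols) matrix with orthonormal rows or columns.

    Hypothetical helper: QR-decomposes a Gaussian matrix and keeps Q.
    """
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, _ = np.linalg.qr(a)  # reduced QR: q has orthonormal columns
    return q if rows >= cols else q.T


def lsuv_init(layer_shapes, batch, tol=0.01, max_iter=10, seed=0):
    """Initialize weights layer by layer so each output has variance ~1.

    layer_shapes: list of (fan_in, fan_out) tuples.
    batch: array of shape (batch_size, fan_in of the first layer).
    tol, max_iter, and the ReLU below are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    weights = []
    x = batch
    for shape in layer_shapes:
        # Step 1: start from an orthonormal matrix.
        w = orthonormal(shape, rng)
        for _ in range(max_iter):
            # Step 2: forward pass of the mini-batch through this layer.
            var = (x @ w).var()
            if abs(var - 1.0) < tol:
                break
            # Step 3: rescale the weights by sqrt(var(output)).
            w = w / np.sqrt(var)
        weights.append(w)
        # Feed the (assumed) ReLU activation to the next layer.
        x = np.maximum(x @ w, 0.0)
    return weights
```

For a purely linear layer a single rescaling already brings the output variance to 1, so the loop exits on its second pass. The loop is kept because layers whose output is not linear in the weights may need more than one pass. A call such as `lsuv_init([(20, 30), (30, 10)], batch)` with a `(256, 20)` mini-batch returns two weight matrices of the requested shapes.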