Math analysis shows residual connections reduce exploding and vanishing gradients by shrinking the Lyapunov spectrum
This paper studies why deep neural networks sometimes suffer from exploding or vanishing gradients and how residual connections (the skip links used in ResNet-style architectures) change that behavior. The author models a deep network as a long composition of blocks and studies the linearized dynamics of small perturbations, which are the same quantities that control gradients during training. Using tools from multiplicative ergodic theory, the paper gives a precise mathematical statement about how residual connections change the long-term growth rates of those perturbations.
Concretely, the author treats the Jacobian matrix of each block (the matrix that describes how small changes in the input map to changes in the output) as an independent random draw from a fixed distribution. The key objects of study are the Lyapunov exponents: the exponential rates at which a small perturbation grows or shrinks as it is multiplied by the sequence of Jacobians. The paper applies an exact characterization of the Lyapunov spectrum due to Furstenberg and Kifer. That characterization gives a decomposition of the input space into subspaces with distinct exponential growth rates.
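To make the notion of a Lyapunov spectrum concrete, here is a minimal numerical sketch. It is not the paper's method, which is analytic and rests on the Furstenberg–Kifer characterization; it uses the standard QR-based estimator for products of random matrices instead. The function name `estimate_lyapunov_spectrum`, the `sample_jacobian` callback, the Gaussian block distribution, and the sizes are all illustrative assumptions.

```python
import numpy as np


def estimate_lyapunov_spectrum(sample_jacobian, dim, depth, seed=0):
    """Estimate Lyapunov exponents of a product of `depth` random Jacobians.

    `sample_jacobian(rng)` is a hypothetical callback returning one
    (dim x dim) block Jacobian; successive draws are independent.
    """
    rng = np.random.default_rng(seed)
    q = np.eye(dim)              # orthonormal frame of perturbation directions
    log_growth = np.zeros(dim)   # running sum of per-layer log stretch factors
    for _ in range(depth):
        # Push the frame through one block, then re-orthonormalize with QR.
        q, r = np.linalg.qr(sample_jacobian(rng) @ q)
        # |diag(R)| records how much each direction was stretched in this layer.
        log_growth += np.log(np.abs(np.diag(r)))
    # Average log growth per layer, sorted from largest to smallest.
    return np.sort(log_growth / depth)[::-1]


# Sanity check on a degenerate "distribution": every block is diag(2, 1/2),
# so the exponents are log 2 and -log 2.
fixed = estimate_lyapunov_spectrum(lambda rng: np.diag([2.0, 0.5]), dim=2, depth=50)

# Assumed toy distribution: i.i.d. Gaussian entries with variance 1/dim (dim = 4).
gaussian = estimate_lyapunov_spectrum(
    lambda rng: rng.standard_normal((4, 4)) / 2.0, dim=4, depth=2000
)
print(fixed, gaussian)
```

A positive exponent corresponds to a direction along which perturbations (and hence gradients) grow exponentially with depth, and a negative exponent to a direction along which they decay.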
The main technical finding, stated as Theorem 4.1 in the excerpt, concerns replacing a block map A by the residual map I + A (where I is the identity matrix): with the residual connection, the Lyapunov spectrum is a smaller perturbation of the spectrum of the identity than it is without the residual connection. In plain terms: adding a residual (skip) connection moves the spectrum closer to neutral growth and so reduces extreme exponential growth or decay of gradients. The analysis uses projective geometry and the Furstenberg–Kifer description of stationary measures on projective space to make this statement precise.
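The qualitative effect of swapping A for I + A can be illustrated with a small experiment. This is a toy sketch under assumed conditions, not a reproduction of Theorem 4.1: the block Jacobians are i.i.d. Gaussian matrices with a small scale, and the helper names `qr_lyapunov` and `compare_plain_and_residual` are hypothetical. In this setting the plain product has strongly negative exponents (vanishing perturbations), while the residual product has exponents close to zero.

```python
import numpy as np


def qr_lyapunov(jacobians):
    """QR-based estimate of the Lyapunov exponents of a stack of Jacobians."""
    dim = jacobians[0].shape[0]
    q = np.eye(dim)
    total = np.zeros(dim)
    for j in jacobians:
        # Propagate the orthonormal frame and re-orthonormalize it.
        q, r = np.linalg.qr(j @ q)
        total += np.log(np.abs(np.diag(r)))
    return np.sort(total / len(jacobians))[::-1]


def compare_plain_and_residual(dim=4, depth=2000, scale=0.1, seed=0):
    """Return (plain, residual) spectra for the same random blocks A."""
    rng = np.random.default_rng(seed)
    # Assumed toy model: Gaussian blocks with entry std scale / sqrt(dim).
    blocks = scale * rng.standard_normal((depth, dim, dim)) / np.sqrt(dim)
    plain = qr_lyapunov(blocks)                      # product of the A's
    residual = qr_lyapunov(np.eye(dim) + blocks)     # product of the (I + A)'s
    return plain, residual


plain, residual = compare_plain_and_residual()
print("plain    A    :", plain)
print("residual I + A:", residual)
```

The comparison uses the same draws of A in both products, so any difference between the two spectra comes from the identity term alone.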
This matters because exploding or vanishing gradients are a core obstacle to training very deep networks. The paper gives a mathematical mechanism that explains how residual connections tame these gradient problems. Compared with prior work, the novelty claimed here is the use of the exact Furstenberg–Kifer characterization to pin down the full Lyapunov spectrum and to describe how residual links change it.