One could easily believe that backpropagation ranks among the three most-computed algorithms in history [1]. Backpropagation (backprop) exploits the chain rule from multivariable calculus to enable optimization of a parameterized function, usually a neural network [2]. In order to support calculus, the function must be differentiable everywhere, and therefore continuous. We’ve already taken a look at the dangers of continuity. In this post, we’ll examine how the mechanics of backprop could harm representation learning.
The vanishing gradient problem is a common issue with backprop. Due to the repeated multiplication of partial derivatives, the average gradient for a neuron in a particular layer decreases exponentially with distance from the output layer. As the network becomes deeper, the gradients for the earlier layers “vanish,” so neurons in those layers don’t learn. Here’s a diagram of what the gradients might look like in a fully-connected network.
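The decay can also be reproduced numerically. What follows is a minimal sketch, not a benchmark: it assumes a small sigmoid network with no biases, weights drawn uniformly from [-1, 1], and the sum of the outputs standing in for a loss. The function name `layer_gradient_magnitudes` and all of its parameters are made up for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_gradient_magnitudes(depth=10, width=4, seed=0):
    """Mean |d(sum of outputs) / d(activation)| for each layer of a sigmoid MLP."""
    rng = random.Random(seed)
    # Assumed setup: weights uniform in [-1, 1], no biases.
    weights = [[[rng.uniform(-1, 1) for _ in range(width)]
                for _ in range(width)]
               for _ in range(depth)]
    # Forward pass, keeping every layer's activations (acts[0] is the input).
    acts = [[rng.uniform(-1, 1) for _ in range(width)]]
    for W in weights:
        prev = acts[-1]
        acts.append([sigmoid(sum(w * a for w, a in zip(row, prev))) for row in W])
    # Backward pass: d(sum of outputs) / d(outputs) is 1 for every output.
    grad = [1.0] * width
    magnitudes = [sum(abs(g) for g in grad) / width]
    for W, out in zip(reversed(weights), reversed(acts[1:])):
        # Chain rule: multiply by sigmoid'(z) = a * (1 - a), then by W transposed.
        delta = [g * a * (1.0 - a) for g, a in zip(grad, out)]
        grad = [sum(W[j][i] * delta[j] for j in range(width)) for i in range(width)]
        magnitudes.append(sum(abs(g) for g in grad) / width)
    return list(reversed(magnitudes))  # index 0 = input layer, last = output layer

for layer, m in enumerate(layer_gradient_magnitudes()):
    print(f"layer {layer:2d}: mean |gradient| = {m:.2e}")
```

Because the sigmoid's derivative never exceeds 0.25, every backward step in this particular setup shrinks the mean gradient, so the printed values fall steadily as you move away from the output layer.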
The standard counter-strategy is to introduce skip connections, which bypass layers to provide gradient paths that don’t include the problematic multiplications. In fact, this technique is effective enough that it made networks with hundreds of layers practical to train.
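To see why a bypass helps, here is a toy sketch that assumes the simplest possible case: a chain of one-neuron sigmoid layers, with the skip connection written in residual form (each layer's input is added to its output). The weights and the helper `input_gradient` are hypothetical; real architectures place their skips differently.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def input_gradient(x, weights, skip):
    """d(output) / d(input) for a chain of one-neuron sigmoid layers."""
    grad = 1.0
    for w in weights:
        s = sigmoid(w * x)
        local = w * s * (1.0 - s)   # derivative of sigmoid(w * x) with respect to x
        if skip:
            grad *= 1.0 + local     # the "1" is the identity path around the layer
            x = x + s
        else:
            grad *= local           # only the multiplicative path exists
            x = s
    return grad

weights = [0.8] * 10  # hypothetical weights, ten layers deep
plain = input_gradient(0.5, weights, skip=False)
residual = input_gradient(0.5, weights, skip=True)
print(f"without skips: {plain:.2e}")
print(f"with skips:    {residual:.2e}")
```

Without skips, the gradient is a product of ten factors that are each at most 0.2, so it collapses. With skips, every factor is at least 1, so the gradient reaching the input cannot vanish in this setup.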
Let’s take a step back and think about how humans represent objects and ideas. Humans tend to group concepts using hierarchical relations: eye (part of) head (part of) body. It would make intuitive sense if neural networks did the same thing, by increasing the level of abstraction at each successive layer. In fact, finding a good hierarchy of abstractions will likely improve the performance of any downstream task.
Convolutional networks have this idea built into their architecture, enforced by multiple levels of pooling. Fully-connected networks have no such architectural bias; even so, one might expect the layers to map somehow to levels of abstraction.
Let’s select a neuron (blue) and find the five neurons that are the most important to this neuron’s activation value. That is, with respect to which neurons' activations is the partial derivative of our blue neuron’s activation the largest?
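As a rough sketch of that question in code, the snippet below runs a forward pass through a small fully-connected sigmoid network, seeds a backward pass with a one-hot vector on the selected neuron, and ranks every earlier neuron by the absolute value of the resulting partial derivative. The network size, the random weights, and the helper `top_influences` are assumptions for illustration, not the setup behind the diagrams.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_influences(weights, x, layer, index, k=5):
    """Rank earlier neurons by |d a[layer][index] / d a[l][i]|."""
    # Forward pass up to the selected layer; acts[0] is the input vector.
    acts = [list(x)]
    for W in weights[:layer]:
        prev = acts[-1]
        acts.append([sigmoid(sum(w * a for w, a in zip(row, prev))) for row in W])
    # Seed the backward pass with a one-hot vector on the selected neuron.
    grad = [1.0 if i == index else 0.0 for i in range(len(acts[layer]))]
    scores = []
    for l in range(layer, 0, -1):
        W, out = weights[l - 1], acts[l]
        # Chain rule through layer l: sigmoid'(z) = a * (1 - a), then W transposed.
        delta = [g * a * (1.0 - a) for g, a in zip(grad, out)]
        grad = [sum(W[j][i] * delta[j] for j in range(len(W)))
                for i in range(len(acts[l - 1]))]
        scores += [(abs(g), l - 1, i) for i, g in enumerate(grad)]
    scores.sort(reverse=True)
    return [(l, i, s) for s, l, i in scores[:k]]

# Hypothetical network: 4 layers of 4 neurons, weights uniform in [-1, 1].
rng = random.Random(0)
width, depth = 4, 4
weights = [[[rng.uniform(-1, 1) for _ in range(width)]
            for _ in range(width)]
           for _ in range(depth)]
x = [rng.uniform(-1, 1) for _ in range(width)]

for l, i, score in top_influences(weights, x, layer=4, index=0):
    print(f"layer {l}, neuron {i}: |partial derivative| = {score:.3e}")
```

Layer 0 here is the input vector itself, and layer 3 is the layer immediately before the selected neuron.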
The most impactful neurons are preferentially located in the layer immediately prior to the selected neuron. This pattern of gradients is what we’d expect if layers represented hierarchical structure in the input. Changing early neurons (which are closely tied to individual elements of the input vector) would not affect the overall representation, and therefore blue’s activation value, very much. But changing later neurons, which represent large-scale features, significantly changes the activation of the blue neuron. Now, let’s look at the same diagram, with an additional skip connection:
Some neurons in the first layer are now more important than those in later layers! This is a natural consequence of solving the vanishing gradient problem. By trying to make the gradients more uniform, we’ve scattered the representational hierarchy across layers. In this case, successive layers don’t get increasingly abstract representations as input; instead, they have to work with data that mixes multiple levels of abstraction.
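The reversal is easy to reproduce in a stripped-down case. This sketch assumes a chain of one-neuron sigmoid layers (a1, a2, a3, then the blue neuron) and an optional skip connection feeding a1 directly into blue. The function `sensitivities` and the weight values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sensitivities(x, w=1.0, skip_weight=0.0):
    """Return (d blue / d a1, d blue / d a3) for a four-neuron chain.

    blue = sigmoid(w * a3 + skip_weight * a1); skip_weight = 0 removes the skip.
    """
    a1 = sigmoid(w * x)
    a2 = sigmoid(w * a1)
    a3 = sigmoid(w * a2)
    blue = sigmoid(w * a3 + skip_weight * a1)
    d_blue = blue * (1.0 - blue)  # sigmoid' at blue's pre-activation
    # Path a1 -> a2 -> a3: product of the two local derivatives.
    d_a3_d_a1 = (w * a3 * (1.0 - a3)) * (w * a2 * (1.0 - a2))
    via_chain = d_blue * w * d_a3_d_a1
    via_skip = d_blue * skip_weight
    return via_chain + via_skip, d_blue * w

no_skip = sensitivities(0.5)
with_skip = sensitivities(0.5, skip_weight=1.0)
print(f"no skip:   d blue/d a1 = {no_skip[0]:.4f}, d blue/d a3 = {no_skip[1]:.4f}")
print(f"with skip: d blue/d a1 = {with_skip[0]:.4f}, d blue/d a3 = {with_skip[1]:.4f}")
```

Without the skip, the first-layer neuron reaches blue only through two extra sigmoid derivatives, so its influence is much smaller than that of a3. With the skip, its direct path dominates, and it becomes at least as influential as the neuron right before blue.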
Although convolutional networks are structured in a hierarchical way, I think similar logic probably applies and inhibits hierarchical learning to some degree. This is pure speculation, but adversarial examples might be a direct consequence of a broken representational structure. In fact, there’s some evidence [3] that using learning algorithms other than backprop actually makes networks more robust to adversarial examples.
To restate the point, it’s not that backpropagation itself is harmful. The problem is that the skip connections, which are necessary to prevent vanishing gradients, induce mixing of representational levels. Although some research exists on alternatives to backprop [4], the proposed algorithms are far from mainstream, and in many cases fail to match backprop’s performance on even simple datasets like MNIST.
[1] My guesses for the other two are SHA-256 (Bitcoin) and Keccak-256 (Ethereum). But of course there’s no way of knowing.
[2] In case you need a refresher on backprop, I really like Michael Nielsen’s explanation.
[3] Akrout, Mohamed. On the Adversarial Robustness of Neural Networks without Weight Transport. arXiv preprint arXiv:1908.03560 (2019).
[4] Duan, Shiyu, and Jose C. Principe. Training Deep Architectures Without End-to-End Backpropagation: A Brief Survey. arXiv preprint arXiv:2101.03419 (2021).