Deep learning with artificial neural networks has several serious limitations that, in my opinion, render it completely incapable of modeling the kind of intelligence that humans possess. In the last few years, deep learning has seen several huge successes: GPT-3, AlphaGo, AlphaFold 2. These achievements were obtained through clever architectures based mostly on artificial neural networks and trained with massive computational scale.
The scaling laws diagram from the OpenAI team has frequently been presented as empirical proof of the power of large models:
[Figure: scaling laws diagram from the OpenAI team]
This graph and GPT-3’s impressive verbal acuity suggest that intelligence is just an emergent property of a sufficiently large model. Increase the parameter count enough, and the network will “wake up.”
Although larger models may achieve greater performance on domain-specific tasks like text generation, I believe that artificial neural networks are profoundly lacking as a substrate for artificial general intelligence (AGI), in ways that scale and architecture cannot fix. This series of posts will explain the limitations and point to alternatives, if they exist.
Artificial neural networks (ANNs) are continuous, differentiable functions, and a well-known result [1] established ANNs as universal approximators. Given any continuous vector-valued function \( f \), it is possible to construct an ANN that approximates \( f \) over a given region [2] with as little error as desired. Decreasing the error means we’ll probably need to enlarge the network, but that’s not an issue in principle.
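To make the approximation claim concrete, here is a minimal sketch in one dimension. It is not the construction from the cited result, which concerns more general networks; it only assumes a single hidden layer of ReLU units whose weights are set by hand so that the network linearly interpolates \( \sin \) on the compact interval \( [0, 2\pi] \). The function names (`build_relu_interpolant`, `max_error`) are made up for this illustration.

```python
import math

def relu(z):
    return z if z > 0.0 else 0.0

def build_relu_interpolant(f, a, b, n_units):
    """One-hidden-layer ReLU network that linearly interpolates f
    at n_units + 1 evenly spaced knots on [a, b]."""
    h = (b - a) / n_units
    knots = [a + i * h for i in range(n_units + 1)]
    values = [f(x) for x in knots]
    slopes = [(values[i + 1] - values[i]) / h for i in range(n_units)]
    # Output weights: the first unit carries the initial slope, and each
    # later unit adds the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    biases = knots[:n_units]  # one hidden unit per left-hand knot
    offset = values[0]

    def network(x):
        return offset + sum(w * relu(x - k) for w, k in zip(weights, biases))

    return network

def max_error(f, net, a, b, samples=2001):
    """Largest absolute error on an evenly spaced test grid."""
    xs = [a + (b - a) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - net(x)) for x in xs)

a, b = 0.0, 2.0 * math.pi
errors = {n: max_error(math.sin, build_relu_interpolant(math.sin, a, b, n), a, b)
          for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"{n:3d} hidden units: max error {err:.5f}")
```

The error shrinks as hidden units are added, which is the sense in which a larger network buys a better approximation of a continuous target.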
However, most functions that we’d like to approximate are not continuous. Suppose we have a cube that is fixed on the origin in 3D space, and can be rotated around the origin freely. A full 3D orientation takes three parameters, but to keep the notation light, consider only the rotations described by two angles; let’s call these \(\phi\) and \(\theta\), as in spherical coordinates. Consider the function that maps the pair \( (\phi, \theta) \) to a visual image of this cube, represented as a grid of pixels. Although rigid transformations of an object in 3D space are continuous, the quantization into pixels introduces discontinuity. The easiest way to imagine this is to take the image and “flatten” it into a linear arrangement (a vector) of pixels. As the cube rotates, its boundaries will align with different pixels, and individual entries of the vector will jump from one value to another. Continuity means that very small changes in a function’s input lead to very small changes in the output. But no matter how slight a rotation is (no matter how small the deltas in the new orientation \((\phi + \Delta \phi, \theta + \Delta \theta)\) are), if the cube’s edges happen to cross a pixel boundary during the rotation, the image vector will change by a large amount. Furthermore, any inverse [3] of the orientation-to-image function that maps an image of the cube to an orientation must also be discontinuous.
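The jump can be seen in a stripped-down setting. The sketch below swaps the rotating cube for a one-dimensional bar sliding along a row of pixels, and it assumes a hard rasterization rule (a pixel is lit exactly when its center lies inside the bar). Both simplifications, and the helper names, are assumptions made for illustration only.

```python
def rasterize_bar(left, width, n_pixels):
    """Binary image of the interval [left, left + width): a pixel is lit
    when its center (i + 0.5) falls inside the interval."""
    return [1 if left <= i + 0.5 < left + width else 0 for i in range(n_pixels)]

def l2_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def jump_across_boundary(delta, n_pixels=8, width=3.0):
    """Distance between the images just before and just after the bar's
    left edge crosses the center of pixel 0."""
    before = rasterize_bar(0.5 - delta, width, n_pixels)
    after = rasterize_bar(0.5 + delta, width, n_pixels)
    return l2_distance(before, after)

jumps = [jump_across_boundary(d) for d in (1e-2, 1e-5, 1e-9)]
print(jumps)
```

However small the shift, the distance between the two image vectors stays the same, because one pixel switches off and another switches on. Shrinking the input change does not shrink the output change, which is exactly what continuity forbids.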
This idea easily generalizes to a wide variety of transformations: translation, scaling, reflection. In fact, almost all motion or deformation in 3D is continuous in terms of the volume of the object, but is discontinuous when approximated by a regular pixel grid. Occlusion, when one object blocks out another, seems discontinuous in both real life and pixel space. However, it’s worth thinking carefully about what occlusion is: the overlap of the projections of two objects. Measured by the visible area of each object, occlusion is actually continuous. Imagine a solar eclipse, where the ambient light smoothly dims and brightens as the moon passes across the sun’s disk. Of course, as with all the other transformations, occlusion is discontinuous in terms of pixels.
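The eclipse picture can be checked numerically. The sketch below treats the sun and moon as two disks in the image plane and uses the standard formula for the area of intersection of two circles. The unit radii and the function names are assumptions chosen for the example.

```python
import math

def overlap_area(r_sun, r_moon, d):
    """Area of intersection of two disks whose centers are d apart."""
    if d >= r_sun + r_moon:
        return 0.0  # disks do not touch
    if d <= abs(r_sun - r_moon):
        return math.pi * min(r_sun, r_moon) ** 2  # one disk inside the other

    def clamp(c):
        # Guard acos against rounding slightly outside [-1, 1].
        return max(-1.0, min(1.0, c))

    a = r_sun ** 2 * math.acos(clamp((d * d + r_sun ** 2 - r_moon ** 2) / (2 * d * r_sun)))
    b = r_moon ** 2 * math.acos(clamp((d * d + r_moon ** 2 - r_sun ** 2) / (2 * d * r_moon)))
    c = 0.5 * math.sqrt(max(0.0, (-d + r_sun + r_moon) * (d + r_sun - r_moon)
                                 * (d - r_sun + r_moon) * (d + r_sun + r_moon)))
    return a + b - c

def visible_fraction(d, r_sun=1.0, r_moon=1.0):
    """Fraction of the sun's disk left uncovered at center distance d."""
    return 1.0 - overlap_area(r_sun, r_moon, d) / (math.pi * r_sun ** 2)

for d in (2.0, 1.5, 1.0, 0.5, 0.0):
    print(f"center distance {d:.1f}: visible fraction {visible_fraction(d):.4f}")
```

A tiny change in the distance between the centers produces a correspondingly tiny change in the visible fraction, so in terms of area nothing jumps as the moon slides across.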
With a little imagination, we can conclude that pretty much every function that involves decomposing, describing, or producing images of scenes containing objects is discontinuous, due to the nature of pixel quantization. This is a problem for the obvious reason: all the images that we train our networks on are made of pixels. So, by attempting to generate or understand images, we try to learn pixel-continuous approximations to discontinuous functions.
Interpolation videos from GANs trained on complex scenes display some negative effects of continuity. When interpolating, objects tend to warp and change color to form new shapes, rather than rotating or translating. Part semantics are usually not preserved (e.g. hands don’t usually transform into hands), which implies that the GAN is not learning a compositional 3D representation in the way humans do. I think that the pixel-continuity requirement actually inhibits the formation of compositional or semantic representations. One would expect that the gradient descent process would settle on the representation that best explains the data; for images of 3D scenes, that representation must be some sort of decomposition into objects and their poses. But this representation is not continuous in pixel-space, so it’s likely ignored in favor of some inferior representation that satisfies the continuity constraint.
The effects of this problem seem relatively minor in practice; after all, convolutional networks seem to work fine for object localization and pose estimation, and GANs can generate images of objects from multiple views. This is easy to explain: since, for image applications, the desired function is something like a sum of 2-dimensional unit step functions, a continuous approximation can become quite good. If the network is large enough, most of the ill effects of continuity can be masked at the expense of wasted network capacity and poor representations. However, the fundamental problem remains. For more complicated functions, which may have infinitely many discontinuities, a continuous neural network will fail to provide a reasonable approximation. For images, discontinuity is easy to identify, but for other domains it may not even be known whether the desired function is discontinuous.
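A one-dimensional sketch shows both halves of this argument. It assumes a single unit step as the target and a logistic sigmoid with steepness `k` (a stand-in for a large weight) as the continuous approximator. Neither choice comes from a particular network. They are just the simplest instance of the situation described above.

```python
import math

def step(x):
    return 1.0 if x >= 0.0 else 0.0

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def errors(k, samples=2001):
    """Mean and worst-case absolute error of sigmoid(k * x) against the
    unit step on an evenly spaced grid over [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    diffs = [abs(step(x) - sigmoid(k * x)) for x in xs]
    return sum(diffs) / len(diffs), max(diffs)

results = {k: errors(k) for k in (1, 10, 100)}
for k, (mean_err, worst_err) in results.items():
    print(f"k = {k:3d}: mean error {mean_err:.4f}, worst error {worst_err:.4f}")
```

The average error falls as the sigmoid steepens, which is why the approximation looks good in practice, while the worst-case error at the jump itself does not improve at all.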
Could we use discontinuous function approximators instead? Possibly, but functions cannot be differentiated at discontinuities, so gradient descent would no longer apply. Since gradient descent and its variants are the most popular and successful training algorithms by a large margin, non-differentiable or discontinuous alternatives to neural networks are relatively underexplored. In fact, as we’ll see in the next post, backpropagation and gradient descent cause a separate set of problems.
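A toy case makes the difficulty visible. The sketch below assumes a model that is a single hard threshold, fitted to a handful of points with a squared-error loss. The data, the true threshold of 0.65, and the finite-difference gradient estimate are all invented for illustration.

```python
def step(z):
    return 1.0 if z >= 0.0 else 0.0

def loss(threshold, xs, ys):
    """Mean squared error of the hard-threshold model step(x - threshold)."""
    return sum((step(x - threshold) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(11)]      # 0.0, 0.1, ..., 1.0
ys = [step(x - 0.65) for x in xs]     # labels from a hypothetical true threshold

def finite_difference(threshold, eps=1e-4):
    """Central-difference estimate of d(loss)/d(threshold)."""
    return (loss(threshold + eps, xs, ys) - loss(threshold - eps, xs, ys)) / (2 * eps)

print("loss at 0.25:    ", loss(0.25, xs, ys))
print("gradient at 0.25:", finite_difference(0.25))
print("loss at 0.65:    ", loss(0.65, xs, ys))
```

The loss is piecewise constant in the threshold, so between jumps the gradient is zero even though the loss is far from its minimum, and at the jumps the gradient is undefined. Either way, gradient descent receives no usable signal.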
[1] Hornik et al., 1989
[2] The region needs to be compact.
[3] The inverse is not unique everywhere, since the cube has symmetries, but we could restrict the domain of the inverse in order to define it formally.