Last year, Deep Learning AI accomplished a task many people thought impossible: DeepMind, Google's deep learning AI system, defeated the world's best Go player after trouncing the European Go champion. The feat stunned the world because the number of potential Go moves exceeds the number of atoms in the universe, and past Go-playing robots performed only as well as a mediocre human player.
But even more astonishing than DeepMind's utter rout of its opponents was how it accomplished the task.
"The big mystery behind neural networks is why they work so well," said study co-author Henry Lin, a physicist at Harvard University. "Almost every problem we throw at them, they crack."
The answer is that the universe is governed by a tiny subset of all possible functions. In other words, when thelaws of physics are written down mathematically, they can all be described by functions that have a remarkable set of simple properties.
So deep neural networks don’t have to approximate any possible mathematical function, only a tiny subset of them.
Not only do Lin and Tegmark’s ideas explain why deep learning machines work so well, they also explain why human brains can make sense of the universe. Evolution has somehow settled on a brain structure that is ideally suited to teasing apart the complexity of the universe.
To put this in perspective, consider the order of a polynomial function, which is the size of its highest exponent. So a quadratic equation like y=x2 has order 2, the equation y=x24 has order 24, and so on.
Obviously, the number of orders is infinite and yet only a tiny subset of polynomials appear in the laws of physics. “For reasons that are still not fully understood, our universe can be accurately described by polynomial Hamiltonians of low order,” say Lin and Tegmark. Typically, the polynomials that describe laws of physics have orders ranging from 2 to 4.
The laws of physics have other important properties. For example, they are usually symmetrical when it comes to rotation and translation. Rotate a cat or dog through 360 degrees and it looks the same; translate it by 10 meters or 100 meters or a kilometer and it will look the same. That also simplifies the task of approximating the process of cat or dog recognition.
There is another property of the universe that neural networks exploit. This is the hierarchy of its structure. “Elementary particles form atoms which in turn form molecules, cells, organisms, planets, solar systems, galaxies, etc.,” say Lin and Tegmark. And complex structures are often formed through a sequence of simpler steps.
This is why the structure of neural networks is important too: the layers in these networks can approximate each step in the causal sequence.
Researchers show how the success of deep learning depends not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can be approximated through "cheap learning" with exponentially fewer parameters than generic ones, because they have simplifying properties tracing back to the laws of physics. The exceptional simplicity of physics-based functions hinges on properties such as symmetry, locality, compositionality and polynomial log-probability, and we explore how these properties translate into exceptionally simple neural networks approximating both natural phenomena such as images and abstract representations thereof such as drawings. They further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine-learning, a deep neural network can be more efficient than a shallow one. They formalize these claims using information theory and discuss the relation to renormalization group procedures. They prove various "no-flattening theorems" showing when such efficient deep networks cannot be accurately approximated by shallow ones without efficiency loss: flattening even linear functions can be costly, and flattening polynomials is exponentially expensive; they use group theoretic techniques to show that n variables cannot be multiplied using fewer than 2^n neurons in a single hidden layer.