Local minima are also a controversial issue. A theoretical proof (Hornik 1989), made under strong assumptions, concluding that neural networks have no local minima is contradicted by experiments on real models with finite sample sets. In this article we analyze the problem from a different direction and reach a decisive conclusion: a neural network may have local minima, but they are not serious.
In the previous article I talked about the limits of mathematics. So what is that limit? It is the limit imposed by its axioms.
PRINCIPLE
"One cannot stand inside and prove what lies outside."
Mathematics is bounded by its axioms; its scope is only a small special case of the problem space in general. For example, a vector space over ℝ must satisfy its 10 axioms. What would you do if, say, only 9 of them were satisfied? You cannot use mathematics to prove things outside that are not bound by those axioms. If you force every problem inside, you will be misguided. We are entering the era of AI, of cognitive programs that simulate the activity of the human brain. Human awareness is very rich and cannot be fully computed. Likewise for AI: the capacity of a neural network is not the same as that of an ordinary program.
LOOK AT THE PARAMETERS, DO NOT CONSIDER THE WEIGHTS
Instead of proving, we stand outside and observe. We do not ask about local minima of the error function in weight space, because by doing so we would have already reduced the problem to finding the minima of a function. Instead, we consider the weights as the network's parameters.
For simplicity, we consider a network with only one input node, one output node, no hidden nodes, and no transfer function (often called an activation function). There are only two weights, a_{1} and a_{0}, and the output of the network is simply the linear function
y = a_{1}x + a_{0}
We also use only one sample (Xs, Ys).
For a_{0} = 0, we have y = a_{1}x
In order for the network to fit the sample, the weight a_{1} must change: the line y = a_{1}x rotates around the origin to the position passing through the sample point (Xs, Ys).
We see that the weight a_{1} alone is enough for the network to learn the sample. The weight a_{0} is left over, which shows that the network is rich in parameters. Changing the weight a_{0} creates a family of output functions, so the network has plenty of capacity to learn to fit the sample.
In reality, a network has one or more hidden layers with hidden nodes, as well as transfer functions. That gives the network a very rich nonlinear output mapping.
Suppose a network has four weights, and suppose the network can operate with three of them. Then it has an error function whose graph is a surface that can be drawn in 3D space. The fourth weight can act as a parameter: it creates a family of error functions instead of a unique error function. That is, the network has a family of error surfaces. If at some step the network falls into a local minimum on one error surface, it can jump down to the error surface below.
That is why gradient descent rarely gets trapped in a local minimum.
The mental image of the minima of a neural network is not the usual minimization picture. A ball rolling around a stadium is not a suitable metaphor. It is more appropriate to imagine a bird in a treetop: step by step it descends through the lower branches until it reaches the ground. Each branch corresponds to an error surface. On each branch, the bird steps to the lowest point of the branch, then hops down to the branch below. Of course, these error surfaces exist only in imagination: since every weight is updated at every step, the network steps down one branch and hops to the next within a family of virtual error surfaces that are never cleanly separated.
HANDLING LOCAL MINIMA
Local minima hardly ever occur. If you build your own model, you may be interested in the following aspects:
1) Add hidden units
This is the basic remedy. As discussed above, adding hidden units means adding parameters to the network, which makes it easy to escape a local minimum.
2) Weight initialization
Random functions are responsible for generating random numbers during training, in particular for weight initialization. Good random functions ensure that the network has a good starting point and that the weights are updated harmoniously. If the initialization is poor, a network facing noisy data may get stuck.
If you use the rand() function, seeding it well with srand() is recommended; the argument may be the current time in microseconds. For example:

#include <stdlib.h>
#include <sys/time.h>

struct timeval tv;
gettimeofday(&tv, NULL);
srand(tv.tv_usec);
In the case of bad samples, it may be necessary to run multiple networks, each with a different set of initial weights.
3) Use mini-batch
Use mini-batches if possible. This counteracts the monotony of the samples and gives the network more freedom to move in the weight space.
4) Increase the learning rate
A learning rate (coefficient of learning) that is too small makes it difficult for the network to jump out of a local groove or to switch states decisively.
Appropriate learning rates usually range from 0.05 to 0.1.
5) Use genetic algorithms
Genetic algorithms can address many of the problems of neural networks, the most prominent being overfitting. They also handle local minima, and not just for neural networks but for all kinds of objects, in a gentle and explicit way: individuals stuck in a local minimum do not improve over several generations and are eliminated from the population despite their high fitness.