Building a neural network without using a test set is a legitimate desire of neural network researchers, because the examples taken away to form a test set are examples the network will never learn. Approaches in this direction were proposed in the past by John Moody, David MacKay, and Vladimir Vapnik. However, none of those proposals produced a solution we can use today.
So the questions "Is the test set needed?" and "Can we do without one?" remain open, and we will answer them here.
The problem of the test set becomes important when the example set is small: losing some rare examples to testing is an expensive price to pay in the learning quality of the network. So is there a way to avoid using a test set while still meeting the network's requirements? To answer this, let us first see what the test set is for and what it checks. If we can achieve what the test checks for, without treating the test set as a mandatory element, then we do not have to pay that price.
OVERFITTING
Overfitting is the phenomenon in which a network fits the examples it learns well but produces large errors on examples it has not learned. In other words, it is not capable of generalization.
[Figure: n1 is the output graph of network 1; n2 is the output graph of network 2]
Network 2 fits the learning examples exactly, but it has a large error on the example it has not learned. It generalizes poorly compared with network 1. We say that network 2 is overfitting.
The causes of overfitting, and the previous remedial methods, are as follows:
1) Noise in the samples, or a lack of samples
Add samples if possible. The sample set is a set of objective events: we add to it when we can, but we do not modify it. The more samples the better. This is a recommendation rather than a method.
2) Early stopping
Stop training when the error on the test set starts to increase. This method has the advantage of needing fewer training steps, but it is very precarious: it can throw away a great learning opportunity in the following steps. Experiments have demonstrated that the test error has many minima at different stages of training, not a unique one. Unfortunately, many places still apply this method. It is rare for a researcher of neural network theory to also be the person who implements the actual model. The researcher presents the method and then moves on to other work, while the implementer uses it carelessly.
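The early-stopping scheme just described can be sketched in a few lines. This is only an illustration; `step` and `val_error` are assumed callbacks (one training step, current test-set error), not part of any particular library:

```python
def train_with_early_stopping(step, val_error, max_steps, patience):
    """Stop when the test-set error has not improved for `patience`
    consecutive steps. This is the classic scheme the text criticizes:
    it can stop at an early minimum and miss a better one later."""
    best, best_step, since = float("inf"), 0, 0
    for i in range(max_steps):
        step(i)                      # one training step
        e = val_error()              # current error on the test set
        if e < best:
            best, best_step, since = e, i, 0
        else:
            since += 1
            if since >= patience:    # error kept rising: stop early
                break
    return best_step, best

# A test-error curve with two minima: stopping with patience 2 keeps
# the first minimum (error 3 at step 2) and never sees the later,
# better one (error 1 at step 6) -- exactly the precariousness noted above.
errs = iter([5, 4, 3, 4, 5, 2, 1])
result = train_with_early_stopping(lambda i: None, lambda: next(errs), 7, 2)
```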
This method and the methods below assume that the network overfits because its capacity is too strong, and thus that it is necessary to reduce the capacity of the network.
But I doubt that reducing the capacity of the network is what overcomes the overfitting situation. We will discuss this issue at the end of the article.
3) Reducing the number of hidden units
This method actually finds the optimal number of hidden units by running multiple networks with different numbers of hidden units. The network with the smallest test error is the selected model. In theory this is good. However, networks are only comparable after they converge, and when does a neural network converge? That is a tough question. The network converges when the derivative of the error function is zero. The derivative approaches 0 over a very long training process but may never reach 0. If you want to stop training at a value close to zero, that value differs greatly between problems. Choose a value manually for each problem?
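The manual search described above amounts to the following sketch; `train_and_score` is a hypothetical function, assumed to train a network with the given number of hidden units to convergence and return its test error:

```python
def pick_hidden_units(train_and_score, candidates):
    """Train one network per candidate hidden-unit count and keep the
    count with the smallest test error -- the manual model selection
    the text describes."""
    return min(candidates, key=train_and_score)

# Toy stand-in for a real trainer: pretend the test error is
# minimal at 12 hidden units.
best_h = pick_hidden_units(lambda h: abs(h - 12), [4, 8, 12, 16])
```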
Football Predictions 3.0 and newer versions also follow the approach of optimizing the number of hidden units, but automatically: genetic algorithms are the answer.
4) Degenerating the weights
The focus of this method is a weight penalty with a positive coefficient λ. The problem becomes finding the coefficient λ, much like optimizing the number of hidden units, except that the network does not converge to the lowest point of the error surface. This method pursues two goals simultaneously: training the network and preventing weight growth (in absolute value), thereby preventing the network from overfitting.
The new target function O adds a penalty function C, with coefficient λ, to the error function:
O = E + λC
where
E is the error function
C = Σ|w|
The partial derivative of the new target function with respect to w is:
∂O/∂w = d + λ, if w > 0
∂O/∂w = d - λ, if w < 0
where d is the partial derivative of the error function with respect to w.
If w equals 0, the derivative is set to zero to remove the weight from the learning process.
Consider w > 0 and d < 0: the weight wants to increase to reduce the error. Normally it would step to the right in proportion to -d, but because of λ the step is shorter, proportional to -(d + λ) < -d (assuming the absolute value of d is greater than λ).
For example
w = 2.0
d = -0.01
λ = 0.001
e = 0.1 (e is the learning rate)
Without the penalty, w increases by e * (-d) = 0.1 * 0.01 = 0.001
The updated w is 2.0 + 0.001 = 2.001
With λ, w increases by e * (-(d + λ)) = 0.1 * 0.009 = 0.0009
The updated w is 2.0 + 0.0009 = 2.0009 < 2.001
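The update rule can be checked with a few lines of Python. This is only a sketch of a single-weight step, not a full training loop:

```python
def l1_update(w, d, lam, lr):
    """One gradient step on weight w with an L1 penalty of coefficient
    lam. d is the partial derivative of the error function with
    respect to w; the penalty adds lam to the gradient for w > 0 and
    subtracts it for w < 0. A weight at exactly 0 is frozen out of
    learning, as described in the text."""
    if w > 0:
        g = d + lam
    elif w < 0:
        g = d - lam
    else:
        g = 0.0
    return w - lr * g

no_penalty = l1_update(2.0, -0.01, 0.0, 0.1)      # 2.001, as in the text
with_penalty = l1_update(2.0, -0.01, 0.001, 0.1)  # 2.0009 < 2.001
```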
The algorithm stops when the derivative is zero, i.e. d = -λ; all subsequent weight updates leave the network unchanged. We can confirm that the method has prevented the network from overfitting. In this case the stop is to the left of (and of course higher than) the lowest point of the error surface.
Indeed, the weight of the network is smaller than the weight value at the lowest point of the error surface, where the network fits the examples best.
If w < 0, the stop occurs at d = λ.
Then the weight of the network is greater than the weight value at the lowest point of the error surface, i.e. its absolute value is smaller than in the best-fitting case.
Thus, this method creates a network that does not fit the training examples and whose weights have smaller absolute values than in the overfitting case. We say that it prevents weight growth.
There is a funny situation: due to the presence of λ in the derivative of the target function, if the absolute value of the derivative d is smaller than λ, the weight moves in the wrong direction, so the more it learns, the more ignorant it becomes.
But this ignorance is sometimes good. Some weights not only stop increasing but decrease. Small weights go toward 0 and disappear when they reach zero. The idea is to remove irrelevant weights, as less important features, and normalize the network around its main features.
That is in theory.
To analyze the actual situation, consider the algorithm stopping where the derivative of the error function satisfies |d| = λ. The derivative d must be small or the error will be large. But, as with the method of reducing the number of hidden units, the value of d varies greatly across problems, so the coefficient λ fluctuates over a very wide range, which makes it difficult to optimize.
The graph below illustrates the errors of two networks for two different problems. The curves e1 and e2 represent the corresponding error surfaces. Visually, because the curvatures of the two error surfaces are very different, their appropriate values of λ are also very different.
For a concrete analysis, assume the two networks use the same coefficient λ; then they stop at errors E1 and E2 that differ greatly from each other. And big errors are not acceptable. Therefore two different networks generally have to use very different coefficients λ. Note that the errors are directly comparable because the outputs are compressed by the transfer function into the range [0, 1].
In addition, this method is not elegant, because the network never fits the training data, whereas a great network might fit both the training data and the test data. Furthermore, preventing weight growth also means preventing learning. It is a forced measure.
This method appears in Google's deep learning framework under the name "L1 regularization".
This method also makes a common mistake, what I call "getting bogged down in math formulas", which we will see below.
5) Weight decay
This method is a variant of the weight-degenerating method, but it penalizes large weights strongly.
The penalty function C is defined:
C = Σw²/2
Therefore ∂O/∂w = d + λw
The weight w is present in the derivative: the larger it is, the more it is penalized. This method significantly reduces the weights of the entire network and is known as "weight decay".
This method appears in Google's deep learning framework under the name "L2 regularization".
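The L2 update can be sketched in the same style as the L1 step above; again this is only a single-weight illustration:

```python
def l2_update(w, d, lam, lr):
    """One gradient step with the L2 penalty C = w**2 / 2, which gives
    dO/dw = d + lam * w: the larger the weight, the stronger the
    penalty term, so big weights decay faster than small ones."""
    return w - lr * (d + lam * w)

# With no error gradient (d = 0), a big weight shrinks more than a
# small one -- the "decay" the name refers to.
big_shrink = 2.0 - l2_update(2.0, 0.0, 0.1, 0.1)    # 0.02
small_shrink = 0.5 - l2_update(0.5, 0.0, 0.1, 0.1)  # 0.005
```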
6) Dropout
This method randomly removes some units from the network at a given rate at each training step. Although simple in practice, it works quite well because it logically assembles an exponential number of small networks.
Dropout is widely used, not just at Google.
Because of this logical combination of small networks, it has the appearance of evolution. However, it is only an idea; it is far from being a genetic evolutionary method, for two reasons:
Therefore, although more effective than the methods above, dropout has modest test results.
People often use dropout when the number of hidden units cannot be determined: they take an excess number of hidden units and then apply the method to compensate. Dropout only fights overfitting; it does not answer the question "How many hidden units does the network need?" In practice, the network configuration and the number of training steps must still be entered manually, from experience, for each problem.
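For reference, the random removal of units can be sketched as inverted dropout on one layer's activations. Frameworks apply this per layer during training only; the scaling by 1/(1 - rate) keeps the expected activation unchanged:

```python
import random

def dropout(activations, rate):
    """Zero each unit with probability `rate` and scale the survivors
    by 1 / (1 - rate), so the expected value of each activation is the
    same with and without dropout. A sketch, not a framework API."""
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0
            for a in activations]
```

Each training step draws a fresh random mask, which is what implicitly trains the exponential ensemble of sub-networks mentioned above.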
Football Predictions uses genetic algorithms to automatically find the number of hidden units and automatically stop training effectively.
PRINCIPLE
We agree that a machine learning algorithm must be an algorithm that is not explicitly formulated.
Clearly, if it can be fully formulated, it is no longer a machine learning algorithm but a classical algorithm.
GETTING BOGGED DOWN IN MATH FORMULAS
I once read a question whose idea was:
"Is a supervised neural network equivalent to determining a function for the given data, and then inferring from it?"
This is an interesting question. It is both right and wrong.
It is right in that, if there is a function to infer from, the problem in general becomes unsupervised learning, and unsupervised learning is a higher level of learning because it generalizes better and is not "imposed on" during the learning process.
It is wrong in that, if such a function can be clearly defined, it is no longer machine learning. Remember that artificial neural networks are derived from the structure of the human brain. Wanting to replace the neural network with a function is like wanting to replace the brain.
The power of a machine learning program is that it has its own inner knowledge, independent of the programmer. We need to let the program learn naturally to solve problems.
Some methods fail due to an attempt to formalize the problem on an uncertain basis.
For neural networks, the error function is only correct when there is no noise. In other words, in theory the sample set must in general be infinite. That is something we cannot have, because we have only a finite set of objective events. Therefore, relying entirely on the error function, or going too deeply into it, gives unintended results. The better the mathematician, the easier it is to make this mistake (except for those who know the limits of mathematics).
The anti-overfitting methods above are actually correcting the subjectivity of the target function, not an excess of network capacity. It is not right to say that too much capacity is bad, just as we cannot say that a country being too civilized is bad, or that the weather being too fine is bad.
The capacity of a network is expressed primarily in the number of hidden units. But the allowed number of hidden units can fluctuate over a fairly wide dynamic range. For example, if a network works well with 20 hidden units, then 16 units or 32 units can give the same results. Therefore network capacity is not the root cause of a bad network.
It is also possible that an excess number of hidden units causes overfitting, but that is because latent errors are magnified, not because the network capacity is too strong. Although the two reasons produce the same phenomenon, their meanings are different.
If we think the capacity of the network is too strong, we will weaken it, adjusting the target function, making the network unable to fit the samples, and performing a reverse movement that blocks the learning process, as in the weight-degenerating method.
But if we think that the problem is not that the network capacity is too strong, but that we instruct the network to learn incorrectly, then we impose less and let it learn naturally according to its own surprising perception.
Genetic algorithms can do this. Football Predictions lets only 10% of the individuals in the neural network population learn by gradient descent; the rest hybridize, mutate, and grow themselves. The population selects its own growth period: it can increase its size to many times the original and reduce itself when needed. Individuals are also assigned an age for their life cycle. Natural selection evolves the population to miraculous results.
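As a rough illustration only, a genetic algorithm can evolve the number of hidden units by selection, crossover, and mutation. The fitness function, rates, and population mechanics below are invented for this sketch and are not Football Predictions' actual implementation:

```python
import random

def evolve(fitness, pop_size=20, generations=30):
    """Evolve an integer hidden-unit count: keep the fittest half each
    generation, breed children by averaging two parents (crossover),
    and occasionally jitter a child's size (mutation)."""
    pop = [random.randint(2, 64) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = (a + b) // 2                  # crossover: average parents
            if random.random() < 0.2:             # mutation: jitter the size
                child = max(2, child + random.randint(-3, 3))
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: assume network quality peaks at 20 hidden units.
random.seed(1)
best = evolve(lambda h: -abs(h - 20))
```

In a real system the fitness would come from training and scoring each candidate network, and the genome would encode more than the hidden-unit count.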
IS THE TEST SET NEEDED?
Use a test set only when there are many samples, so that removing some training samples for testing does not significantly add noise to the network. You can verify this by running training multiple times. If the test error stays the same, the test set is only for reference rather than for making decisions. Then the test set is not needed, and you will have a higher-quality network.
Genetic algorithms can create such a network.
Using ID3 to create the test set
When there are only a few dozen examples, Football Predictions does not use a test set. Instead, it uses ID3. Although ID3 is a legacy method that cannot be compared to neural networks, and it cannot create a standard test set, it is still the number-one candidate for distinguishing important features based on entropy, with a strong theoretical basis. That does not matter, since we have chosen not to use the test set and use ID3 only as an additional measure. ID3 creates fake examples according to its capabilities and limitations. These examples are used only to select models and are not involved at all in the evolution of the neural networks.
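The entropy measure at the heart of ID3's feature ranking can be sketched as follows; the names `rows`, `labels`, and `feature` are illustrative, not from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels -- the impurity
    measure ID3 is built on."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Gain of splitting on `feature` (an index into each row):
    parent entropy minus the weighted entropy of the child subsets.
    ID3 picks the feature with the largest gain."""
    subsets = {}
    for row, y in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(y)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s)
                                 for s in subsets.values())
```

A feature that splits the labels perfectly gains the full parent entropy; an uninformative feature gains nothing.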
CONCLUSION
Most of the anti-overfitting methods for neural networks date from the last century.
Dropout is the first anti-overfitting method of the 21st century. It has the idea of evolution, but incompletely.
The questions "How many hidden units does a network need?", "How many training steps for each problem?", and "Can a model be built without a test set?" remain open.
This practice requires a comprehensive approach, with the idea of letting machine learning work naturally rather than imposing error-function formulas: finding the optimal network by applying natural selection during evolution, and answering all the questions above. That is
TRAINING NEURAL NETWORKS WITH GENETIC ALGORITHMS
Contact: tuyen@omarine.org