We describe the problem as follows:
There are two attributes, X and Y, and each takes one of two values: less or greater.
If x < 1 (or y < 1) then X (respectively Y) is less. If x > 1 (or y > 1) then X (respectively Y) is greater.
The result has two values, corresponding to the two classes On and Off.
There are 10 samples as follows:
Order X Y RESULTS
1 less less Off
2 greater less On
3 greater greater On
4 greater greater On
5 greater greater On
6 greater greater On
7 greater greater On
8 greater greater On
9 greater greater On
10 greater greater On
The samples are shown in the following illustration:
Question: if x < 1 and y > 1, is the result On or Off?
At first glance we may think that the result is On, because samples of the On class with y > 1 are the majority: there are 8 such samples out of a total of 10.
But no! The result is Off.
According to the theory of the ID3 decision tree, when we divide the sample set by attribute X we get two subsets, each containing elements of a single class. That means the split separates the classes completely and the information gain (IG) is at its maximum. So attribute X is the most important attribute, and it becomes the root of the tree. Therefore every case with x < 1, regardless of y, results in Off.
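The comparison of the two attributes can be checked numerically. The following is a minimal sketch, not the id3 program itself: the helper names entropy and information_gain are chosen here for illustration, and the standard Shannon-entropy definition of information gain is assumed.

```python
from collections import Counter
from math import log2

# The 10 samples from the table above, as (X, Y, RESULTS).
samples = (
    [("less", "less", "Off"), ("greater", "less", "On")]
    + [("greater", "greater", "On")] * 8
)

def entropy(rows):
    """Shannon entropy (in bits) of the class label, the last field of each row."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy of the subsets
    obtained by splitting on the attribute at attr_index."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr_index], []).append(row)
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(rows) - remainder

gain_x = information_gain(samples, 0)  # split on X
gain_y = information_gain(samples, 1)  # split on Y
print(f"IG(X) = {gain_x:.3f}, IG(Y) = {gain_y:.3f}")
# prints: IG(X) = 0.469, IG(Y) = 0.269
```

Splitting on X leaves two pure subsets, so IG(X) equals the entropy of the whole set, which is the largest value any attribute can reach here. Splitting on Y leaves a mixed subset (samples 1 and 2), so IG(Y) is smaller.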
This problem can be solved by running the id3 program as follows
with the data file's content as follows:
-----BEGIN DATA DEFINITION-----
X: less greater
Y: less greater
RESULTS: On Off
-----END DATA DEFINITION-----
ID3 data
Order X Y RESULTS
1 less less Off
2 greater less On
3 greater greater On
4 greater greater On
5 greater greater On
6 greater greater On
7 greater greater On
8 greater greater On
9 greater greater On
10 greater greater On
However, if we add a large number of identical samples with x > 1, y > 1 and result On, splitting by attribute Y starts to separate the classes relatively well, and attribute Y gradually becomes more important. It is no longer far behind attribute X. For example, we add 410 such samples, so that the total number of samples is 420.
But the ID3 decision tree does not show that, nor does it reflect how frequently each sample occurs. The result stays the same even though the sample set has changed.
This limitation can be overcome using Football Predictions 2.0. Our first impression is right again: the result is On.
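To see how the gap between the two attributes narrows while the choice of ID3 stays the same, the information gains can be recomputed from class counts alone. This is an illustrative sketch only: the function names entropy and gains are hypothetical, and the standard Shannon-entropy definition of information gain is assumed.

```python
from math import log2

def entropy(*counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gains(extra):
    """Return (IG(X), IG(Y)) after adding `extra` copies of the sample
    (greater, greater, On) to the original 10 samples."""
    on = 9 + extra          # all On samples
    total = on + 1          # plus the single Off sample
    base = entropy(on, 1)
    # Split on X: {less: 1 Off} and {greater: all On} are both pure.
    gain_x = base
    # Split on Y: {less: 1 Off + 1 On} is mixed, {greater: On only} is pure.
    gain_y = base - (2 / total) * entropy(1, 1)
    return gain_x, gain_y

for extra in (0, 410):
    gx, gy = gains(extra)
    print(f"{10 + extra} samples: IG(X) = {gx:.4f}, IG(Y) = {gy:.4f}, "
          f"IG(Y)/IG(X) = {gy / gx:.2f}")
```

Under these assumptions the ratio IG(Y)/IG(X) rises from about 0.57 with 10 samples to about 0.80 with 420 samples, yet IG(X) remains the larger of the two, so ID3 still splits on X first and still answers Off for x < 1.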