I am finding it difficult to calculate entropy and information gain for ID3 in the scenario where there are multiple possible classes and the parent has a lower entropy than the child. Let me use this as an example:
AttrY  AttrX  Result
4      a      0
3      b      0
4      c      0
4      d      0
3      e      0
4      a      1
3      b      1
3      c      1
4      d      1
4      e      1
Now here, I believe the first step is to calculate the entropy of the original dataset, in this case of Result. That is the sum of -p*log(p) over Result being 1 and 0, i.e. -(5/10)*log(5/10) - (5/10)*log(5/10) with log taken to base 2. This is equal to an entropy of 1.
Now for attribute Y, the info gain can be calculated as 1 minus the sum, over each possible value of Y, of (probability of that value) * (entropy of the corresponding subset), i.e. 1 - (5/10)*[-(2/5)*log(2/5) - (3/5)*log(3/5)] - (5/10)*[-(2/5)*log(2/5) - (3/5)*log(3/5)] = 0.029.
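For reference, here is a minimal Python sketch of the two calculations above, plugging in the same numbers I wrote out (the entropy helper is just illustrative, not from any particular library):

    import math

    def entropy(probs):
        # Shannon entropy in bits; skips zero probabilities
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Entropy of the whole dataset over Result: 5 zeros and 5 ones
    print(entropy([5/10, 5/10]))                  # 1.0

    # Info gain for AttrY, using the proportions written above
    branch = entropy([2/5, 3/5])                  # ~0.971 for each subset
    print(1 - (5/10) * branch - (5/10) * branch)  # ~0.029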
My question is: for attribute X, would the info gain likewise be 1 minus the sum of (probability of each outcome) * (entropy of that outcome), i.e. 1 - (5/10)*[-(1/5)*log(1/5) - (1/5)*log(1/5) - (1/5)*log(1/5) - (1/5)*log(1/5) - (1/5)*log(1/5)] - (5/10)*[-(1/5)*log(1/5) - (1/5)*log(1/5) - (1/5)*log(1/5) - (1/5)*log(1/5) - (1/5)*log(1/5)] = -1.32?
But doesn't info gain always have to be positive? Where am I going wrong here?
CodePudding user response:
The right formula for the gain is:
Entropy(S) - sum_i(|Si|/|S| * Entropy(Si))
with

Entropy(S) = -p1 * log(p1) - p0 * log(p0)

where Si is the subset of S of class i, and p0 (or p1) is the proportion of elements in S with Result = 0 (or 1).
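As a minimal sketch of that formula in Python (the function names are just illustrative, not from any specific library):

    import math

    def entropy(labels):
        # Shannon entropy (base 2) of a list of class labels
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    def information_gain(parent_labels, subsets):
        # subsets: the parent's labels partitioned by attribute value
        n = len(parent_labels)
        weighted = sum(len(s) / n * entropy(s) for s in subsets)
        return entropy(parent_labels) - weighted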
So in the case of AttrX, each class appears in 2 of the 10 rows, hence |Si|/|S| = 2/10 = 1/5.
Once we know the class (a, b, c, ...), the probability of Result being 1 or 0 is 1/2 each.
So the entropy of each class is -(1/2)*log(1/2) - (1/2)*log(1/2) = 1, and the gain is 1 - 5 * (1/5 * 1) = 0.
In fact you can see this result intuitively: whatever the class is, the result has a 50% chance of being 1 or 0, so the information gain from knowing AttrX is 0.
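If you plug the AttrX split from your table into a sketch like the one above (each value a..e has exactly one 0 and one 1), you get that zero gain back:

    # assumes the entropy / information_gain helpers sketched earlier
    parent = [0] * 5 + [1] * 5
    subsets_x = [[0, 1], [0, 1], [0, 1], [0, 1], [0, 1]]
    print(information_gain(parent, subsets_x))  # 0.0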