Decision Trees
Some definitions:
1) Suppose P1, P2, ..., Pn are probabilities with P1 + P2 + ... + Pn = 1. We
define the entropy as follows:
Entropy(P1, P2, ..., Pn) = Sum_i -Pi * lg(Pi), where the sum runs from i = 1 to n
Consider partitioning a set S of examples into K classes S1, S2, ..., SK. Let |Si|
denote the number of elements in class Si. We define the probability that an
element is a member of Si as Pi = |Si|/|S|. We define
Entropy(S) = Entropy(P1, P2, ..., PK)
We will frequently consider a set S split into positive examples Sp and
negative examples Sn. Let p = |Sp| and n =|Sn|. Then
Entropy(S) = -( p/(p+n) * lg(p/(p+n)) + n/(p+n) * lg(n/(p+n)) )
For example, if S contains 9 positive examples and 5 negative examples, then
Entropy(S) = -( (9/14)*lg(9/14) + (5/14)*lg(5/14) ) = 0.940
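As a quick sanity check, here is a minimal Python sketch of the entropy
computation (the entropy helper and its name are ours, not from any particular
library); it takes a list of class counts and reproduces the 0.940 above:

import math

def entropy(counts):
    # Entropy of a distribution given as a list of class counts,
    # using base-2 logarithms (lg). Zero counts are skipped because
    # -p*lg(p) -> 0 as p -> 0.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))  # 0.9402..., i.e. the 0.940 computed above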
2) Consider a set S and an attribute A that splits S into finitely many subsets
A1, A2, ..., An. Let |Ai| denote the number of elements in the subset Ai. Each Ai
contains positive and negative examples, so we can define Entropy(Ai). We define
Info(A) as:
Info(A) = Sum_i |Ai|/|S| * Entropy(Ai), where the sum runs from i = 1 to n
For example, consider the following partial dataset:
Temperature | Plays Soccer
High        | Yes
Medium      | Yes
High        | No
Medium      | Yes
Low         | Yes
Low         | No
A_High has 2 elements, 1 positive and 1 negative.
Entropy(A_High)=1.
A_Medium has 2 elements, both positive.
Entropy(A_Medium) = 0
A_Low has 2 elements, 1 positive and 1 negative.
Entropy(A_Low)=1
Info(A) = (2/6)*1 + (2/6)*0 + (2/6)*1 = 2/3 ≈ 0.667
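The same computation in Python, reusing the entropy sketch above (the info
helper and the row format are ours, for illustration only):

def info(rows):
    # rows: list of (attribute value, class label) pairs.
    # Info(A) = Sum_i |Ai|/|S| * Entropy(Ai).
    subsets = {}
    for value, label in rows:
        subsets.setdefault(value, []).append(label)
    total = len(rows)
    return sum(len(labels) / total *
               entropy([labels.count("Yes"), labels.count("No")])
               for labels in subsets.values())

data = [("High", "Yes"), ("Medium", "Yes"), ("High", "No"),
        ("Medium", "Yes"), ("Low", "Yes"), ("Low", "No")]
print(info(data))  # 0.6666..., i.e. Info(A) = 2/3 as above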
3) Consider a set S and an attribute A. We define Gain as:
Gain(S,A) = Entropy(S) - Info(A)
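For example, on the six-row Temperature dataset above, S contains 4 positive
and 2 negative examples, so Entropy(S) = -( (4/6)*lg(4/6) + (2/6)*lg(2/6) ) =
0.9183, and therefore Gain(S, Temperature) = 0.9183 - 0.6667 = 0.2516.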
The decision tree algorithms (ID3, C4.5, C5.0) iteratively pick the attribute
that maximizes Gain. Maximizing Gain is biased toward attributes that have many
values, so it has been proposed to instead pick the attribute that maximizes the
Gain Ratio. The Gain Ratio is defined as follows:
Gain Ratio(S,A) = Gain(S,A) / Split_Info(A)
where Split_Info(A) = Entropy(|A1|/|S|, ..., |An|/|S|) is the entropy of the
split that A induces on S.
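Putting the pieces together in Python, a minimal sketch of Gain, Split_Info,
and Gain Ratio for two-class data, reusing the entropy and info helpers and the
(value, label) row format from the sketches above:

def gain(rows):
    # Gain(S,A) = Entropy(S) - Info(A), for two-class (Yes/No) data.
    labels = [label for _, label in rows]
    return entropy([labels.count("Yes"), labels.count("No")]) - info(rows)

def split_info(rows):
    # Split_Info(A): entropy of the proportions |Ai|/|S| of the split itself.
    values = [value for value, _ in rows]
    return entropy([values.count(v) for v in set(values)])

def gain_ratio(rows):
    return gain(rows) / split_info(rows)

print(gain(data), split_info(data), gain_ratio(data))
# ≈ 0.2516, 1.5850, 0.1588 on the six-row Temperature example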
Example: Consider the following table, also used in our majority rule example
Outlook = {sunny, overcast, rainy}
Temperature = {hot, mild, cool}
Humidity = {high, normal}
Windy = {false, true}
Play = {no, yes}
Outlook  | Temperature | Humidity | Windy | Play
Sunny    | Hot         | High     | False | No
Sunny    | Hot         | High     | True  | No
Overcast | Hot         | High     | False | Yes
Rainy    | Mild        | High     | False | Yes
Rainy    | Cool        | Normal   | False | Yes
Rainy    | Cool        | Normal   | True  | No
Overcast | Cool        | Normal   | True  | Yes
Sunny    | Mild        | High     | False | No
Sunny    | Cool        | Normal   | False | Yes
Rainy    | Mild        | Normal   | False | Yes
Sunny    | Mild        | Normal   | True  | Yes
Overcast | Mild        | High     | True  | Yes
Overcast | Hot         | Normal   | False | Yes
Rainy    | Mild        | High     | True  | No
Note that there are 9 Yes and 5 No, so the initial entropy is
-(9/14)*lg(9/14) - (5/14)*lg(5/14) = 0.940. We compute the following quantities:
Attribute   | Gain                  | Split Info | Gain Ratio
Outlook     | 0.940 - 0.693 = 0.247 | 1.577      | 0.156
Temperature | 0.940 - 0.911 = 0.029 | 1.557      | 0.019
Humidity    | 0.940 - 0.788 = 0.152 | 1.000      | 0.152
Windy       | 0.940 - 0.892 = 0.048 | 0.985      | 0.049
Maximizing Gain would yield Outlook as the clear choice to split the tree
on. Maximizing the Gain Ratio would again pick Outlook, but now the choice
is not as clear because Humidity is almost as good. Further investigation
would be required.
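As a sanity check, the following sketch recomputes the whole table with the
helper functions from the earlier sketches (the dataset is just the 14 rows
above transcribed into tuples):

dataset = [
    ("Sunny",    "Hot",  "High",   "False", "No"),
    ("Sunny",    "Hot",  "High",   "True",  "No"),
    ("Overcast", "Hot",  "High",   "False", "Yes"),
    ("Rainy",    "Mild", "High",   "False", "Yes"),
    ("Rainy",    "Cool", "Normal", "False", "Yes"),
    ("Rainy",    "Cool", "Normal", "True",  "No"),
    ("Overcast", "Cool", "Normal", "True",  "Yes"),
    ("Sunny",    "Mild", "High",   "False", "No"),
    ("Sunny",    "Cool", "Normal", "False", "Yes"),
    ("Rainy",    "Mild", "Normal", "False", "Yes"),
    ("Sunny",    "Mild", "Normal", "True",  "Yes"),
    ("Overcast", "Mild", "High",   "True",  "Yes"),
    ("Overcast", "Hot",  "Normal", "False", "Yes"),
    ("Rainy",    "Mild", "High",   "True",  "No"),
]
for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Windy"]):
    rows = [(r[i], r[4]) for r in dataset]
    print(f"{name}: gain={gain(rows):.3f}, "
          f"split_info={split_info(rows):.3f}, gain_ratio={gain_ratio(rows):.3f}")
# Outlook: gain=0.247, split_info=1.577, gain_ratio=0.156
# Temperature: gain=0.029, split_info=1.557, gain_ratio=0.019
# Humidity: gain=0.152, split_info=1.000, gain_ratio=0.152
# Windy: gain=0.048, split_info=0.985, gain_ratio=0.049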