Session 09. Instructor: Danang Junaedi
Given a collection of records (the training set):
◦ Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
◦ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.
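For illustration (not part of the original slides), a minimal sketch of this train/test workflow using scikit-learn's DecisionTreeClassifier; the numeric encoding of the records is an assumption made for this example, and the rows mirror the Refund / Marital Status / Taxable Income table used later in these slides.

```python
# Minimal sketch: build a model on a training set, measure accuracy on a held-out test set.
# Assumes scikit-learn is installed; the encoding of the toy records is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy records: [refund (1=Yes), marital status (0=Single, 1=Married, 2=Divorced), income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Split the labelled records into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn the model on the training set, then validate it on the test set.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```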
Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (class unknown):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of classification tasks:
◦ Predicting tumor cells as benign or malignant
◦ Classifying credit card transactions as legitimate or fraudulent
◦ Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
◦ Categorizing news stories as finance, weather, entertainment, sports, etc.
[Figure: a model is learned from the training set (Learn Model) and then applied to the test set (Apply Model).]
Classification techniques:
◦ Decision Tree based Methods
◦ Rule-based Methods
◦ Memory based reasoning
◦ Neural Networks
◦ Naïve Bayes and Bayesian Belief Networks
◦ Support Vector Machines
Decision tree induction transforms the data into a decision tree and a set of decision rules.
A decision tree is a chronological representation of a decision problem. Each decision tree has two types of nodes: round nodes correspond to states of nature, while square nodes correspond to decision alternatives. The branches leaving each round node represent the different states of nature; the branches leaving each square node represent the different decision alternatives. At the end of each limb of the tree are the payoffs attained from the series of branches making up that limb.
Typical applications of decision trees:
◦ Diagnosing diseases such as hypertension, cancer, stroke, and others
◦ Choosing a product such as a house, vehicle, or computer
◦ Selecting an exemplary employee according to given criteria
◦ Detecting problems in a computer or a computer network, such as intrusion detection and virus detection (trojans and their variants)
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes -> leaf NO
Refund = No  -> test MarSt
    MarSt = Married            -> leaf NO
    MarSt = Single or Divorced -> test TaxInc
        TaxInc < 80K -> leaf NO
        TaxInc > 80K -> leaf YES
A second tree that fits the same training data (splitting on MarSt first):
MarSt = Married            -> leaf NO
MarSt = Single or Divorced -> test Refund
    Refund = Yes -> leaf NO
    Refund = No  -> test TaxInc
        TaxInc < 80K -> leaf NO
        TaxInc > 80K -> leaf YES
There could be more than one tree that fits the same data!
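For illustration (not from the original slides), the first tree above can be written directly as nested conditionals; the function name, argument types, and the handling of the 80K boundary are assumptions of this sketch.

```python
# A hand-coded version of the first decision tree above (Refund -> MarSt -> TaxInc).
# Names and the income boundary handling are assumptions of this sketch.
def classify_cheat(refund: str, marital_status: str, taxable_income_k: float) -> str:
    if refund == "Yes":
        return "No"
    # Refund == "No": test marital status next.
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test taxable income (slide: < 80K -> NO, > 80K -> YES).
    return "No" if taxable_income_k < 80 else "Yes"

# Example: record 8 of the training data (No, Single, 85K) is classified as "Yes".
print(classify_cheat("No", "Single", 85))
```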
[Figure repeated for decision-tree induction: Learn Model from the training set, then Apply Model (the decision tree) to the test set.]
Apply Model to Test Data. Start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

◦ At the root, test Refund: the record has Refund = No, so follow the No branch to the MarSt node.
◦ At MarSt, the record has Marital Status = Married, so follow the Married branch, which ends in the leaf NO.
◦ Assign Cheat to "No" for this record.
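For illustration (not the lecture's code), the same traversal can be written generically: store the tree as a nested dictionary and apply the model by walking from the root until a leaf is reached. The dictionary encoding and function name below are assumptions of this sketch.

```python
# Generic "apply model": walk the tree from the root until a leaf label is reached.
# The nested-dict encoding of the slide's tree is an assumption of this sketch.
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attr": "MarSt",
            "branches": {
                "Married": "NO",
                "Single": {"attr": "TaxInc<80K", "branches": {True: "NO", False: "YES"}},
                "Divorced": {"attr": "TaxInc<80K", "branches": {True: "NO", False: "YES"}},
            },
        },
    },
}

def apply_model(node, record):
    # A leaf is represented by a plain class label (a string).
    while isinstance(node, dict):
        attr = node["attr"]
        key = record["TaxableIncome"] < 80 if attr == "TaxInc<80K" else record[attr]
        node = node["branches"][key]
    return node

test_record = {"Refund": "No", "MarSt": "Married", "TaxableIncome": 80}
print(apply_model(tree, test_record))  # -> "NO": assign Cheat = "No"
```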
The data is expressed as a table of attributes and records. An attribute is a parameter used as a criterion in building the tree; for example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature. One attribute holds the solution (class) for each data item and is called the target attribute. Each attribute takes values called instances; for example, the weather attribute has the instances sunny, cloudy, and rainy.
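As a concrete illustration (not part of the original slides), such a table can be held as a list of records keyed by attribute name, with the target attribute kept separate; the field names and rows below are invented for this example.

```python
# Records as dictionaries: each key is an attribute, "Play" is the target attribute.
# Attribute values such as "sunny" are the instances of the "Weather" attribute.
records = [
    {"Weather": "sunny",  "Wind": "weak",   "Temperature": "hot",  "Play": "no"},
    {"Weather": "cloudy", "Wind": "strong", "Temperature": "mild", "Play": "yes"},
    {"Weather": "rainy",  "Wind": "strong", "Temperature": "cool", "Play": "no"},
]
target_attribute = "Play"
attributes = [k for k in records[0] if k != target_attribute]
print(attributes)  # ['Weather', 'Wind', 'Temperature']
```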
The decision-tree process:
1. Convert the data (table) into a tree model, using:
   ◦ the ID3 algorithm
   ◦ the C4.5 algorithm
   ◦ etc.
2. Convert the tree model into rules (see the sketch below), expressed with:
   ◦ Disjunction (v, OR)
   ◦ Conjunction (^, AND)
3. Simplify the rules (pruning).
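A small sketch (assumed, not the lecture's code) of step 2, turning a tree into rules: each rule is a conjunction (AND) of the tests along a path, and the rule set as a whole is a disjunction (OR). The nested-dictionary tree encoding is an assumption of this example.

```python
# Sketch: turn a nested-dict decision tree into IF-THEN rules.
# Each rule ANDs the attribute tests along one root-to-leaf path; the rules together form an OR.
def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):            # leaf reached: emit one rule
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attr"], value),))
    return rules

# Simplified version of the Refund/MarSt tree shown earlier (assumed for illustration).
tree = {"attr": "Refund",
        "branches": {"Yes": "NO",
                     "No": {"attr": "MarSt",
                            "branches": {"Married": "NO", "Single": "YES", "Divorced": "YES"}}}}
for rule in tree_to_rules(tree):
    print(rule)
```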
Given a set of examples S, categorised into categories ci:
1. Choose the root node to be the attribute A which scores highest for information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate Sv (the subset of S where A = v). Then:
◦ If Sv is empty, choose the category cdefault which contains the most examples from S, and put this as the leaf-node category which ends that branch.
◦ If Sv contains only examples from a category c, then put c as the leaf-node category which ends that branch.
◦ Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to Sv (note: not relative to S). This new node starts the cycle again (from step 2), with S replaced by Sv in the calculations, and the tree gets built iteratively like this.
The algorithm terminates either when all the attributes have been exhausted, or when the decision tree perfectly classifies the examples.
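A compact sketch of this procedure in Python (an illustrative reconstruction, not the lecture's own code); the record and attribute handling are assumptions of this example.

```python
# Minimal ID3 sketch: records are dicts, `target` names the class attribute,
# and attribute values are discrete. Illustrative reconstruction of the algorithm above.
from collections import Counter
from math import log2

def entropy(records, target):
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(records, attr, target):
    total = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(records, attributes, target):
    classes = [r[target] for r in records]
    if len(set(classes)) == 1:            # only one category left: make a leaf
        return classes[0]
    if not attributes:                    # attributes exhausted: majority category
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(records, a, target))
    remaining = [a for a in attributes if a != best]
    node = {"attr": best, "branches": {}}
    for value in {r[best] for r in records}:   # one branch per observed value of `best`
        subset = [r for r in records if r[best] == value]
        node["branches"][value] = id3(subset, remaining, target)
    return node
```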
Steps:
◦ Specify the problem: determine the attributes and the target attribute from the available data.
◦ Compute the entropy of each criterion over the given sample data.
◦ Compute the information gain of each criterion.
◦ The node selected is the criterion with the highest information gain.
◦ Repeat until the last node, which contains the target attribute, is obtained.
Entropy(S) is the expected number of bits needed to extract a class (+ or -) from a randomly drawn item of sample space S; in other words, entropy can be seen as the number of bits required to express a class. The smaller the entropy, the better the attribute is for extracting a class. The optimal code length for a message that occurs with probability p is -log2 p bits.
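Written out as formulas, consistent with the worked example later in this deck, for a sample S with positive fraction $p_{+}$ and negative fraction $p_{-}$, and an attribute A with values v:

$$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$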
Training Data Set: 8 samples with the attributes Usia (age), Berat (weight), and Jenis Kelamin (gender), and the target attribute Hipertensi (hypertension).
[Figure: given the samples, the attributes, and the target attribute, what is the decision tree?]
Number of instances = 8. Positive instances = 3. Negative instances = 5.
Number of instances = 8. Instances by Usia (age):
◦ Muda (young): 1 positive, 3 negative
◦ Tua (old): 2 positive, 2 negative

Entropy of the whole sample:
$$\mathrm{Entropy}(\mathrm{Hipertensi}) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-} = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = -(0.375 \times -1.415) - (0.625 \times -0.678) = 0.531 + 0.424 = 0.955$$

Entropy of each Usia value:
$$\mathrm{Entropy}(\mathrm{Muda}) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} = 0.811 \qquad \mathrm{Entropy}(\mathrm{Tua}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$$
Information gain of Usia:
$$\mathrm{Gain}(S,\mathrm{Usia}) = \mathrm{Entropy}(S) - \sum_{v\in\{\mathrm{Muda},\mathrm{Tua}\}}\frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) = 0.955 - \frac{4}{8}(0.811) - \frac{4}{8}(1) = 0.049$$

Number of instances = 8. Instances by Berat (weight):
◦ Overweight: 3 positive, 1 negative, so Entropy(Overweight) = 0.811
◦ Average: 0 positive, 2 negative, so Entropy(Average) = 0
◦ Underweight: 0 positive, 2 negative, so Entropy(Underweight) = 0

$$\mathrm{Gain}(S,\mathrm{Berat}) = 0.955 - \frac{4}{8}(0.811) - \frac{2}{8}(0) - \frac{2}{8}(0) = 0.549$$

Number of instances = 8. Instances by Jenis Kelamin (gender):
◦ Pria (male): 2 positive, 4 negative, so Entropy(Pria) = 0.918
◦ Wanita (female): 1 positive, 1 negative, so Entropy(Wanita) = 1

$$\mathrm{Gain}(S,\mathrm{JenisKelamin}) = 0.955 - \frac{6}{8}(0.918) - \frac{2}{8}(1) = 0.016$$

The attribute selected for the root node is Berat, because its information gain is the highest; its branches are Overweight, Average, and Underweight.
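To make the arithmetic easy to re-check, a small sketch (not the lecture's code) that recomputes the gains above directly from the (positive, negative) counts per attribute value:

```python
# Recompute information gain from (positive, negative) counts per attribute value.
from math import log2

def H(p, n):
    """Binary entropy of a node with p positive and n negative instances."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

def gain(parent, splits):
    """splits: list of (positive, negative) counts, one per attribute value."""
    total = sum(p + n for p, n in splits)
    return H(*parent) - sum((p + n) / total * H(p, n) for p, n in splits)

parent = (3, 5)                                          # 3 positive, 5 negative instances
print(round(gain(parent, [(1, 3), (2, 2)]), 3))          # Usia: Muda, Tua          -> 0.049
print(round(gain(parent, [(3, 1), (0, 2), (0, 2)]), 3))  # Berat: Over/Avg/Under    -> 0.549
print(round(gain(parent, [(2, 4), (1, 1)]), 3))          # Jenis Kelamin            -> 0.016
```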
Number of instances for Overweight = 4, for Average = 2, and for Underweight = 2. For each branch, compute the gains again and choose the attribute with the highest gain as the next node.
Number of instances with Berat = Overweight is 4. Within this subset:
◦ (Berat = Overweight) & Usia = Muda: 1 positive, 0 negative
◦ (Berat = Overweight) & Usia = Tua: 2 positive, 1 negative
◦ (Berat = Overweight) & Jenis Kelamin = Pria: 2 positive, 1 negative
◦ (Berat = Overweight) & Jenis Kelamin = Wanita: 1 positive, 0 negative
Classification rule?
Underfitting and Overfitting
◦ Underfitting: when the model is too simple, both training and test errors are large.
◦ Overfitting: results in decision trees that are more complex than necessary.
Missing Values
Costs of Classification

Pre-Pruning (Early Stopping Rule)
◦ Stop the algorithm before it becomes a fully-grown tree.
◦ Typical stopping conditions for a node:
  Stop if all instances belong to the same class.
  Stop if all the attribute values are the same.
◦ More restrictive conditions:
  Stop if the number of instances is less than some user-specified threshold.
  Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
  Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
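As an aside (not from the slides), libraries commonly expose these pre-pruning conditions as hyperparameters; for example, scikit-learn's DecisionTreeClassifier, shown here with illustrative values:

```python
# Pre-pruning expressed as scikit-learn hyperparameters (illustrative values only).
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=4,                 # stop growing below a fixed depth
    min_samples_split=10,        # stop if a node has fewer than 10 instances
    min_impurity_decrease=0.01,  # stop if the split does not improve impurity enough
)
```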
Post-pruning
◦ Grow the decision tree to its entirety.
◦ Trim the nodes of the decision tree in a bottom-up fashion.
◦ If the generalization error improves after trimming, replace the sub-tree by a leaf node.
◦ The class label of the leaf node is determined from the majority class of instances in the sub-tree.
◦ Can use MDL for post-pruning.

Example (pessimistic error with a penalty of 0.5 per leaf). Before splitting, the node holds 30 instances: Class = Yes 20, Class = No 10.
  Training error (before splitting) = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30
After splitting on A into A1, A2, A3, A4, the class counts are:
              A1   A2   A3   A4
Class = Yes    8    3    4    5
Class = No     4    4    1    1
  Training error (after splitting) = 9/30
  Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Since 11/30 > 10.5/30, PRUNE the split.
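A small sketch (not from the slides) that reproduces the pessimistic-error comparison above from the class counts:

```python
# Pessimistic-error check for the post-pruning example above (0.5 penalty per leaf).
def pessimistic_error(errors, num_leaves, total, penalty=0.5):
    return (errors + penalty * num_leaves) / total

total = 30
before = pessimistic_error(10, 1, total)                 # single leaf: 10 errors -> 10.5/30
children = [(8, 4), (3, 4), (4, 1), (5, 1)]              # (Yes, No) counts at A1..A4
errors_after = sum(min(y, n) for y, n in children)       # majority-class errors = 9
after = pessimistic_error(errors_after, len(children), total)   # 11/30

print(round(before, 3), round(after, 3))                 # 0.35 0.367
print("PRUNE" if after >= before else "KEEP")            # pessimistic error got worse -> PRUNE
```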
Missing values affect decision tree construction in three different ways:
◦ They affect how impurity measures are computed.
◦ They affect how to distribute an instance with a missing value to the child nodes.
◦ They affect how a test instance with a missing value is classified.

Computing the impurity measure with a missing value. In the training data below, the Refund value of record Tid = 10 is missing:

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes

Class counts by Refund value:
              Class = Yes   Class = No
Refund = Yes      0             3
Refund = No       2             4
Refund = ?        1             0

Before splitting: Entropy(Parent) = -0.3 log2(0.3) - 0.7 log2(0.7) = 0.8813
Split on Refund (using only the 9 records where Refund is known):
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
  Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
  Gain = 0.9 × (0.8813 - 0.551) = 0.2973
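A quick numerical check of this computation (illustrative only); the factor 0.9 is the fraction of records whose Refund value is known:

```python
# Verify the gain computed on the 9 records with a known Refund value,
# scaled by the fraction of records (0.9) that are not missing.
from math import log2

def H(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

parent = H([3, 7])                                        # 3 Yes, 7 No -> 0.8813
children = (3 / 10) * H([0, 3]) + (6 / 10) * H([2, 4])    # weighted child entropy -> 0.551
gain = 0.9 * (parent - children)
print(round(parent, 4), round(children, 4), round(gain, 4))   # 0.8813 0.551 0.2973
```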
Distributing an instance with a missing value. Record Tid = 10 (Refund = ?, Marital Status = Single, Taxable Income = 90K, Class = Yes) is sent down the Refund split using the other nine records: the probability that Refund = Yes is 3/9 and the probability that Refund = No is 6/9, so the record is assigned to the Refund = Yes child with weight 3/9 and to the Refund = No child with weight 6/9. The class counts at the children become:

              Class = Yes   Class = No
Refund = Yes    0 + 3/9         3
Refund = No     2 + 6/9         4

Classifying a test instance with a missing value. New record: Tid = 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?. At the MarSt node the (weighted) class counts are:

              Married   Single   Divorced   Total
Class = No       3         1        0         4
Class = Yes     6/9        1        1        2.67
Total           3.67       2        1        6.67

Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status = {Single, Divorced} is 3/6.67.
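A sketch (assumed, not from the slides) of how these fractional weights can be combined to classify the new record: weight each branch by its probability and accumulate the leaf labels reached along each branch.

```python
# Fractional classification of the new record (Refund = No, MarSt = ?, TaxInc = 85K).
# Branch weights come from the counts at the MarSt node; leaf labels follow the tree above.
weights = {"Married": 3.67 / 6.67, "Single/Divorced": 3 / 6.67}
# Married leads to leaf NO; Single/Divorced leads to TaxInc, and 85K > 80K leads to leaf YES.
leaf_label = {"Married": "NO", "Single/Divorced": "YES"}

scores = {"NO": 0.0, "YES": 0.0}
for branch, w in weights.items():
    scores[leaf_label[branch]] += w
print(scores)   # NO ≈ 0.55, YES ≈ 0.45 -> NO carries the larger weight
```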
References:
1. Mourad Ykhlef. 2009. Decision Tree Induction System. King Saud University.
2. Achmad Basuki and Iwan Syarif. 2003. Decision Tree. Politeknik Elektronika Negeri Surabaya (PENS) – ITS.
3. Simon Colton. 2004. Decision Tree Learning.
4. Tom M. Mitchell. 1997. Machine Learning. McGraw Hill.