Session 09. Instructor: Danang Junaedi
Given a collection of records (the training set):
◦ Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
◦ A test set is used to determine the accuracy of the model. Usually, the given data set is divided into a training set and a test set: the training set is used to build the model and the test set is used to validate it.
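For illustration (not part of the original slides), a minimal sketch of this train/test workflow using scikit-learn's DecisionTreeClassifier; the numeric encoding of the records is an assumption made for this example, and the rows mirror the Refund / Marital Status / Taxable Income table used later in these slides.

```python
# Minimal sketch: build a model on a training set, measure accuracy on a held-out test set.
# Assumes scikit-learn is installed; the encoding of the toy records is invented for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy records: [refund (1=Yes), marital status (0=Single, 1=Married, 2=Divorced), income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Split the labelled records into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Learn the model on the training set, then validate it on the test set.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```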
Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (class unknown):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Examples of classification tasks:
◦ Predicting tumor cells as benign or malignant
◦ Classifying credit card transactions as legitimate or fraudulent
◦ Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
◦ Categorizing news stories as finance, weather, entertainment, sports, etc.
[Figure: a model is learned from the training set (Learn Model) and then applied to the test set (Apply Model).]
Classification techniques:
◦ Decision Tree based Methods
◦ Rule-based Methods
◦ Memory based reasoning
◦ Neural Networks
◦ Naïve Bayes and Bayesian Belief Networks
◦ Support Vector Machines
Decision tree induction transforms the data into a decision tree and a set of decision rules.
A decision tree is a chronological representation of a decision problem. Each decision tree has two types of nodes: round nodes correspond to states of nature, while square nodes correspond to decision alternatives. The branches leaving each round node represent the different states of nature; the branches leaving each square node represent the different decision alternatives. At the end of each limb of the tree are the payoffs attained from the series of branches making up that limb.
Typical applications of decision trees:
◦ Diagnosing diseases such as hypertension, cancer, stroke, and others
◦ Choosing a product such as a house, vehicle, or computer
◦ Selecting an exemplary employee according to given criteria
◦ Detecting problems in a computer or a computer network, such as intrusion detection and virus detection (trojans and their variants)
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
Refund = Yes -> leaf NO
Refund = No  -> test MarSt
    MarSt = Married            -> leaf NO
    MarSt = Single or Divorced -> test TaxInc
        TaxInc < 80K -> leaf NO
        TaxInc > 80K -> leaf YES
A second tree that fits the same training data (splitting on MarSt first):
MarSt = Married            -> leaf NO
MarSt = Single or Divorced -> test Refund
    Refund = Yes -> leaf NO
    Refund = No  -> test TaxInc
        TaxInc < 80K -> leaf NO
        TaxInc > 80K -> leaf YES
There could be more than one tree that fits the same data!
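For illustration (not from the original slides), the first tree above can be written directly as nested conditionals; the function name, argument types, and the handling of the 80K boundary are assumptions of this sketch.

```python
# A hand-coded version of the first decision tree above (Refund -> MarSt -> TaxInc).
# Names and the income boundary handling are assumptions of this sketch.
def classify_cheat(refund: str, marital_status: str, taxable_income_k: float) -> str:
    if refund == "Yes":
        return "No"
    # Refund == "No": test marital status next.
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test taxable income (slide: < 80K -> NO, > 80K -> YES).
    return "No" if taxable_income_k < 80 else "Yes"

# Example: record 8 of the training data (No, Single, 85K) is classified as "Yes".
print(classify_cheat("No", "Single", 85))
```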
[Figure repeated for decision-tree induction: Learn Model from the training set, then Apply Model (the decision tree) to the test set.]
Apply Model to Test Data. Start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

◦ At the root, test Refund: the record has Refund = No, so follow the No branch to the MarSt node.
◦ At MarSt, the record has Marital Status = Married, so follow the Married branch, which ends in the leaf NO.
◦ Assign Cheat to "No" for this record.
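For illustration (not the lecture's code), the same traversal can be written generically: store the tree as a nested dictionary and apply the model by walking from the root until a leaf is reached. The dictionary encoding and function name below are assumptions of this sketch.

```python
# Generic "apply model": walk the tree from the root until a leaf label is reached.
# The nested-dict encoding of the slide's tree is an assumption of this sketch.
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attr": "MarSt",
            "branches": {
                "Married": "NO",
                "Single": {"attr": "TaxInc<80K", "branches": {True: "NO", False: "YES"}},
                "Divorced": {"attr": "TaxInc<80K", "branches": {True: "NO", False: "YES"}},
            },
        },
    },
}

def apply_model(node, record):
    # A leaf is represented by a plain class label (a string).
    while isinstance(node, dict):
        attr = node["attr"]
        key = record["TaxableIncome"] < 80 if attr == "TaxInc<80K" else record[attr]
        node = node["branches"][key]
    return node

test_record = {"Refund": "No", "MarSt": "Married", "TaxableIncome": 80}
print(apply_model(tree, test_record))  # -> "NO": assign Cheat = "No"
```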
The data is expressed as a table of attributes and records. An attribute is a parameter used as a criterion in building the tree; for example, to decide whether to play tennis, the criteria considered are the weather, the wind, and the temperature. One attribute holds the solution (class) for each data item and is called the target attribute. Each attribute takes values called instances; for example, the weather attribute has the instances sunny, cloudy, and rainy.
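As a concrete illustration (not part of the original slides), such a table can be held as a list of records keyed by attribute name, with the target attribute kept separate; the field names and rows below are invented for this example.

```python
# Records as dictionaries: each key is an attribute, "Play" is the target attribute.
# Attribute values such as "sunny" are the instances of the "Weather" attribute.
records = [
    {"Weather": "sunny",  "Wind": "weak",   "Temperature": "hot",  "Play": "no"},
    {"Weather": "cloudy", "Wind": "strong", "Temperature": "mild", "Play": "yes"},
    {"Weather": "rainy",  "Wind": "strong", "Temperature": "cool", "Play": "no"},
]
target_attribute = "Play"
attributes = [k for k in records[0] if k != target_attribute]
print(attributes)  # ['Weather', 'Wind', 'Temperature']
```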
The decision-tree process:
1. Convert the data (table) into a tree model, using:
   ◦ the ID3 algorithm
   ◦ the C4.5 algorithm
   ◦ etc.
2. Convert the tree model into rules (see the sketch below), expressed with:
   ◦ Disjunction (v, OR)
   ◦ Conjunction (^, AND)
3. Simplify the rules (pruning).
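A small sketch (assumed, not the lecture's code) of step 2, turning a tree into rules: each rule is a conjunction (AND) of the tests along a path, and the rule set as a whole is a disjunction (OR). The nested-dictionary tree encoding is an assumption of this example.

```python
# Sketch: turn a nested-dict decision tree into IF-THEN rules.
# Each rule ANDs the attribute tests along one root-to-leaf path; the rules together form an OR.
def tree_to_rules(node, conditions=()):
    if not isinstance(node, dict):            # leaf reached: emit one rule
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attr"], value),))
    return rules

# Simplified version of the Refund/MarSt tree shown earlier (assumed for illustration).
tree = {"attr": "Refund",
        "branches": {"Yes": "NO",
                     "No": {"attr": "MarSt",
                            "branches": {"Married": "NO", "Single": "YES", "Divorced": "YES"}}}}
for rule in tree_to_rules(tree):
    print(rule)
```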
Given a set of examples S, categorised into categories ci:
1. Choose the root node to be the attribute A which scores highest for information gain relative to S.
2. For each value v that A can possibly take, draw a branch from the node.
3. For each branch from A corresponding to value v, calculate Sv (the subset of S where A = v). Then:
◦ If Sv is empty, choose the category cdefault which contains the most examples from S, and put this as the leaf-node category which ends that branch.
◦ If Sv contains only examples from a category c, then put c as the leaf-node category which ends that branch.
◦ Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to Sv (note: not relative to S). This new node starts the cycle again (from step 2), with S replaced by Sv in the calculations, and the tree gets built iteratively like this.
The algorithm terminates either when all the attributes have been exhausted, or when the decision tree perfectly classifies the examples.
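A compact sketch of this procedure in Python (an illustrative reconstruction, not the lecture's own code); the record and attribute handling are assumptions of this example.

```python
# Minimal ID3 sketch: records are dicts, `target` names the class attribute,
# and attribute values are discrete. Illustrative reconstruction of the algorithm above.
from collections import Counter
from math import log2

def entropy(records, target):
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(records, attr, target):
    total = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r for r in records if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(records, target) - remainder

def id3(records, attributes, target):
    classes = [r[target] for r in records]
    if len(set(classes)) == 1:            # only one category left: make a leaf
        return classes[0]
    if not attributes:                    # attributes exhausted: majority category
        return Counter(classes).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(records, a, target))
    remaining = [a for a in attributes if a != best]
    node = {"attr": best, "branches": {}}
    for value in {r[best] for r in records}:   # one branch per observed value of `best`
        subset = [r for r in records if r[best] == value]
        node["branches"][value] = id3(subset, remaining, target)
    return node
```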
Steps:
◦ Specify the problem: determine the attributes and the target attribute from the available data.
◦ Compute the entropy of each criterion over the given sample data.
◦ Compute the information gain of each criterion.
◦ The node selected is the criterion with the highest information gain.
◦ Repeat until the last node, which contains the target attribute, is obtained.
Entropy(S) is the expected number of bits needed to extract a class (+ or -) from a randomly drawn item of sample space S; in other words, entropy can be seen as the number of bits required to express a class. The smaller the entropy, the better the attribute is for extracting a class. The optimal code length for a message that occurs with probability p is -log2 p bits.
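Written out as formulas, consistent with the worked example later in this deck, for a sample S with positive fraction $p_{+}$ and negative fraction $p_{-}$, and an attribute A with values v:

$$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$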
Training Data Set: 8 samples with the attributes Usia (age), Berat (weight), and Jenis Kelamin (gender), and the target attribute Hipertensi (hypertension).
[Figure: given the samples, the attributes, and the target attribute, what is the decision tree?]
Number of instances = 8. Positive instances = 3. Negative instances = 5.
Number of instances = 8. Instances by Usia (age):
◦ Muda (young): 1 positive, 3 negative
◦ Tua (old): 2 positive, 2 negative

Entropy of the whole sample:
$$\mathrm{Entropy}(\mathrm{Hipertensi}) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-} = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = -(0.375 \times -1.415) - (0.625 \times -0.678) = 0.531 + 0.424 = 0.955$$

Entropy of each Usia value:
$$\mathrm{Entropy}(\mathrm{Muda}) = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} = 0.811 \qquad \mathrm{Entropy}(\mathrm{Tua}) = -\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4} = 1$$
Information gain of Usia:
$$\mathrm{Gain}(S,\mathrm{Usia}) = \mathrm{Entropy}(S) - \sum_{v\in\{\mathrm{Muda},\mathrm{Tua}\}}\frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) = 0.955 - \frac{4}{8}(0.811) - \frac{4}{8}(1) = 0.049$$

Number of instances = 8. Instances by Berat (weight):
◦ Overweight: 3 positive, 1 negative, so Entropy(Overweight) = 0.811
◦ Average: 0 positive, 2 negative, so Entropy(Average) = 0
◦ Underweight: 0 positive, 2 negative, so Entropy(Underweight) = 0

$$\mathrm{Gain}(S,\mathrm{Berat}) = 0.955 - \frac{4}{8}(0.811) - \frac{2}{8}(0) - \frac{2}{8}(0) = 0.549$$

Number of instances = 8. Instances by Jenis Kelamin (gender):
◦ Pria (male): 2 positive, 4 negative, so Entropy(Pria) = 0.918
◦ Wanita (female): 1 positive, 1 negative, so Entropy(Wanita) = 1

$$\mathrm{Gain}(S,\mathrm{JenisKelamin}) = 0.955 - \frac{6}{8}(0.918) - \frac{2}{8}(1) = 0.016$$

The attribute selected for the root node is Berat, because its information gain is the highest; its branches are Overweight, Average, and Underweight.
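To make the arithmetic easy to re-check, a small sketch (not the lecture's code) that recomputes the gains above directly from the (positive, negative) counts per attribute value:

```python
# Recompute information gain from (positive, negative) counts per attribute value.
from math import log2

def H(p, n):
    """Binary entropy of a node with p positive and n negative instances."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c)

def gain(parent, splits):
    """splits: list of (positive, negative) counts, one per attribute value."""
    total = sum(p + n for p, n in splits)
    return H(*parent) - sum((p + n) / total * H(p, n) for p, n in splits)

parent = (3, 5)                                          # 3 positive, 5 negative instances
print(round(gain(parent, [(1, 3), (2, 2)]), 3))          # Usia: Muda, Tua          -> 0.049
print(round(gain(parent, [(3, 1), (0, 2), (0, 2)]), 3))  # Berat: Over/Avg/Under    -> 0.549
print(round(gain(parent, [(2, 4), (1, 1)]), 3))          # Jenis Kelamin            -> 0.016
```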
Number of instances for Overweight = 4, for Average = 2, and for Underweight = 2. For each branch, compute the gains again and choose the attribute with the highest gain as the next node.
Number of instances with Berat = Overweight is 4. Within this subset:
◦ (Berat = Overweight) & Usia = Muda: 1 positive, 0 negative
◦ (Berat = Overweight) & Usia = Tua: 2 positive, 1 negative
◦ (Berat = Overweight) & Jenis Kelamin = Pria: 2 positive, 1 negative
◦ (Berat = Overweight) & Jenis Kelamin = Wanita: 1 positive, 0 negative
Classification rule?
Underfitting and Overfitting
◦ Underfitting: when the model is too simple, both training and test errors are large.
◦ Overfitting: results in decision trees that are more complex than necessary.
Missing Values
Costs of Classification

Pre-Pruning (Early Stopping Rule)
◦ Stop the algorithm before it becomes a fully-grown tree.
◦ Typical stopping conditions for a node:
  Stop if all instances belong to the same class.
  Stop if all the attribute values are the same.
◦ More restrictive conditions:
  Stop if the number of instances is less than some user-specified threshold.
  Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
  Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
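As an aside (not from the slides), libraries commonly expose these pre-pruning conditions as hyperparameters; for example, scikit-learn's DecisionTreeClassifier, shown here with illustrative values:

```python
# Pre-pruning expressed as scikit-learn hyperparameters (illustrative values only).
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=4,                 # stop growing below a fixed depth
    min_samples_split=10,        # stop if a node has fewer than 10 instances
    min_impurity_decrease=0.01,  # stop if the split does not improve impurity enough
)
```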
Post-pruning
◦ Grow the decision tree to its entirety.
◦ Trim the nodes of the decision tree in a bottom-up fashion.
◦ If the generalization error improves after trimming, replace the sub-tree by a leaf node.
◦ The class label of the leaf node is determined from the majority class of instances in the sub-tree.
◦ Can use MDL for post-pruning.

Example (pessimistic error with a penalty of 0.5 per leaf). Before splitting, the node holds 30 instances: Class = Yes 20, Class = No 10.
  Training error (before splitting) = 10/30
  Pessimistic error = (10 + 0.5)/30 = 10.5/30
After splitting on A into A1, A2, A3, A4, the class counts are:
              A1   A2   A3   A4
Class = Yes    8    3    4    5
Class = No     4    4    1    1
  Training error (after splitting) = 9/30
  Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Since 11/30 > 10.5/30, PRUNE the split.
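A small sketch (not from the slides) that reproduces the pessimistic-error comparison above from the class counts:

```python
# Pessimistic-error check for the post-pruning example above (0.5 penalty per leaf).
def pessimistic_error(errors, num_leaves, total, penalty=0.5):
    return (errors + penalty * num_leaves) / total

total = 30
before = pessimistic_error(10, 1, total)                 # single leaf: 10 errors -> 10.5/30
children = [(8, 4), (3, 4), (4, 1), (5, 1)]              # (Yes, No) counts at A1..A4
errors_after = sum(min(y, n) for y, n in children)       # majority-class errors = 9
after = pessimistic_error(errors_after, len(children), total)   # 11/30

print(round(before, 3), round(after, 3))                 # 0.35 0.367
print("PRUNE" if after >= before else "KEEP")            # pessimistic error got worse -> PRUNE
```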
Missing values affect decision tree construction in three different ways:
◦ They affect how impurity measures are computed.
◦ They affect how to distribute an instance with a missing value to the child nodes.
◦ They affect how a test instance with a missing value is classified.

Computing the impurity measure with a missing value. In the training data below, the Refund value of record Tid = 10 is missing:

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   ?       Single          90K             Yes

Class counts by Refund value:
              Class = Yes   Class = No
Refund = Yes      0             3
Refund = No       2             4
Refund = ?        1             0

Before splitting: Entropy(Parent) = -0.3 log2(0.3) - 0.7 log2(0.7) = 0.8813
Split on Refund (using only the 9 records where Refund is known):
  Entropy(Refund = Yes) = 0
  Entropy(Refund = No) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
  Entropy(Children) = 0.3 (0) + 0.6 (0.9183) = 0.551
  Gain = 0.9 × (0.8813 - 0.551) = 0.2973
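A quick numerical check of this computation (illustrative only); the factor 0.9 is the fraction of records whose Refund value is known:

```python
# Verify the gain computed on the 9 records with a known Refund value,
# scaled by the fraction of records (0.9) that are not missing.
from math import log2

def H(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

parent = H([3, 7])                                        # 3 Yes, 7 No -> 0.8813
children = (3 / 10) * H([0, 3]) + (6 / 10) * H([2, 4])    # weighted child entropy -> 0.551
gain = 0.9 * (parent - children)
print(round(parent, 4), round(children, 4), round(gain, 4))   # 0.8813 0.551 0.2973
```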
Distributing an instance with a missing value. Record Tid = 10 (Refund = ?, Marital Status = Single, Taxable Income = 90K, Class = Yes) is sent down the Refund split using the other nine records: the probability that Refund = Yes is 3/9 and the probability that Refund = No is 6/9, so the record is assigned to the Refund = Yes child with weight 3/9 and to the Refund = No child with weight 6/9. The class counts at the children become:

              Class = Yes   Class = No
Refund = Yes    0 + 3/9         3
Refund = No     2 + 6/9         4

Classifying a test instance with a missing value. New record: Tid = 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?. At the MarSt node the (weighted) class counts are:

              Married   Single   Divorced   Total
Class = No       3         1        0         4
Class = Yes     6/9        1        1        2.67
Total           3.67       2        1        6.67

Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status = {Single, Divorced} is 3/6.67.
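A sketch (assumed, not from the slides) of how these fractional weights can be combined to classify the new record: weight each branch by its probability and accumulate the leaf labels reached along each branch.

```python
# Fractional classification of the new record (Refund = No, MarSt = ?, TaxInc = 85K).
# Branch weights come from the counts at the MarSt node; leaf labels follow the tree above.
weights = {"Married": 3.67 / 6.67, "Single/Divorced": 3 / 6.67}
# Married leads to leaf NO; Single/Divorced leads to TaxInc, and 85K > 80K leads to leaf YES.
leaf_label = {"Married": "NO", "Single/Divorced": "YES"}

scores = {"NO": 0.0, "YES": 0.0}
for branch, w in weights.items():
    scores[leaf_label[branch]] += w
print(scores)   # NO ≈ 0.55, YES ≈ 0.45 -> NO carries the larger weight
```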
References:
1. Mourad Ykhlef. 2009. Decision Tree Induction System. King Saud University.
2. Achmad Basuki and Iwan Syarif. 2003. Decision Tree. Politeknik Elektronika Negeri Surabaya (PENS) – ITS.
3. Simon Colton. 2004. Decision Tree Learning.
4. Tom M. Mitchell. 1997. Machine Learning. McGraw Hill.