Association Rule Dengan FP-Tree dan FP Growth

Association Rule Dengan FP-Tree dan FP Growth

Sistem Informasi FT – UPI YAI

Jesa Ariawan, MTI

Latar Belakang • Algoritma Association Rule dengan Apriori kurang baik bila terdapat banyak pola kombinasi data yang sering muncul (banyak frequent pattern), banyak jenis item tetapi pemenuhan minimum support rendah. • Algoritma Association Rule dengan Apriori membutuhkan waktu yang cukup lama karena scanning database dapat dilakukan berulang-ulang untuk mendapatkan frequent pattern yang ideal. • FP-Tree (Frequent Pattern – Tree) merupakan suatu algoritma yang dirancang untuk mengatasi kendala bottleneck pada proses penggalian data dengan algoritma Apriori (Zhao et al. 2003).

Ide gagasan FP-Tree • Pemampatan data dengan model struktur data pohon untuk menghindari pengulangan scanning database. • Tanpa memerlukan candidate generation • Dilanjutkan dengan proses algoritma FP-growth yang dapat langsung mengekstrak frequent Itemset dari FP tree yang telah terbentuk dengan menggunakan prinsip divide and conquer.

Definisi Frequent Pattern – Tree (FP – Tree) • Terdiri atas sebuah root dengan label ‘null’, sekumpulan subtree yang menjadi child dari root dan sebuah tabel frequent header. • Setiap node dalam FP-Tree mengandung tiga informasi penting. yaitu label item, menginformasikan jenis item yang direpresentasikan node tersebut, support count, merepresentasikan jumlah lintasan transaksi yang melalui node tesebut, dan pointer penghubung yang menghubungkan node-node dengan label item sama antar-lintasan, ditandai dengan garis panah putus-putus. • Isi dari setiap baris dalam tabel frequent header terdiri atas label item dan head of nodelink yang menunjuk ke node pertama dalam FP-Tree yang menyimpan label item tersebut.

Proses untuk mendapatkan Frequent Itemset dilakukan dalam 2 tahap • Pembentukan Frequent Pattern – Tree (FP-Tree)

• Ekstrak Frequent Itemset hasil dari FP-Tree dengan menggunakan algoritma FP-Growth

Algoritma FP Growth dibagi dalam 3 tugas utama 1.

2.

3.

Tahap Pembangkitan Conditional Pattern Base Conditional Pattern Base merupakan subdatabase yang berisi prefix path (lintasan prefix) dan suffix pattern (pola akhiran). Pembangkitan conditional pattern base didapatkan melalui FP-tree yang telah dibangun sebelumnya. Tahap Pembangkitan Conditional FP-tree Pada tahap ini, support count dari setiap item pada setiap conditional pattern base dijumlahkan, lalu setiap item yang memiliki jumlah support count lebih besar sama dengan minimum support count akan dibangkitkan dengan conditional FP-tree. Tahap Pencarian frequent itemset Apabila Conditional FP-tree merupakan lintasan tunggal (single path), maka didapatkan frequent itemset dengan melakukan kombinasi item untuk setiap conditional FP-tree. Jika bukan lintasan tunggal, maka dilakukan pembangkitan FP-growth secara rekursif.

Contoh Proses Asosiasi dengan FP-Tree dan FP-Growth • Contoh Tabel Transaksai dengan minimum support count 2

No. 1 2 3 4 5 6 7 8 9 10

Transaksi a,b b,c,d,g,h a,c,d,e,f a,d,e a,b,z,c a,b,c,d a,r a,b,c a,b,d b,c,e

Contoh Proses Asosiasi dengan FP-Tree dan FP-Growth • Frekuensi kemunculan tiap item dapat dilihat pada tabel berikut

Item a b c d e f r z g h

Frekuensi 8 7 6 5 3 1 1 1 1 1

Contoh Proses Asosiasi dengan FP-Tree dan FP-Growth • Kemunculan item yang frequent dalam setiap transaksi, diurut berdasarkan yang frekuensinya paling tinggi.

TID 1 2 3 4 5 6 7 8 9 10

Item {a,b} {b,c,d} {a,c,d,e} {a,d,e} {a,b,c} {a,b,c,d} {a} {a,b,c} {a,b,d} {b,c,e}

Frequent Pattern Growth (FP-Growth) Algorithm

An Introduction Florian Verhein [email protected]

School of Information Technologies, The University of Sydney, Australia

Copyright 2008 Florian Verhein. Most gures derived from [1].

January 10, 2008

Outline

Introduction FP-Tree data structure Step 1: FP-Tree Construction Step 2: Frequent Itemset Generation Discussion

Introduction

uses a generate-and-test approach generates candidate itemsets and tests if they are frequent

I Apriori:

I Generation of candidate itemsets is expensive (in both space

and time) I Support counting is expensive I I

Subset checking (computationally expensive) Multiple Database scans (I/O)

allows frequent itemset discovery without candidate itemset generation. Two step approach:

I FP-Growth:

I Step 1: Build a compact data structure called the I

FP-tree

Built using 2 passes over the data-set.

I Step 2: Extracts frequent itemsets directly from the FP-tree I

Traversal through FP-Tree

Core Data Structure: FP-Tree

I Nodes correspond to items and have a

counter

I FP-Growth reads 1 transaction at a

time and maps it to a path

I Fixed order is used, so paths can

overlap when transactions share items (when they have the same prex ).

I In this case, counters are incremented I Pointers are maintained between

nodes containing the same item, creating singly linked lists (dotted lines)

I The more paths that overlap, the

higher the compression. FP-tree may t in memory.

I Frequent itemsets extracted from the

FP-Tree.

Step 1: FP-Tree Construction (Example)

FP-Tree is constructed using 2 passes over the data-set: I Pass 1: I Scan data and nd support for each item. I Discard infrequent items. I Sort frequent items in decreasing order based on their support. I I

For our example: a, b , c , d , e Use this order when building the FP-Tree, so common prexes can be shared.

Step 1: FP-Tree Construction (Example) I Pass 2:

construct the FP-Tree (see diagram on next slide)

I Read transaction 1: I

Create 2 nodes a and b and the path null → a → b . Set counts of a and b to 1.


I

{b , c , d }

Create 3 nodes for b , c and d and the path null → b → c → d . Set counts to 1. Note that although transaction 1 and 2 share b , the paths are disjoint as they don't share a common prex. Add the link between the b 's.


{a, b }

{a, c , d , e }

It shares common prex item a with transaction 1 so the path for transaction 1 and 3 will overlap and the frequency count for node a will be incremented by 1. Add links between the c 's and d 's.

I Continue until all transactions are mapped to a path in the

FP-tree.

Step 1: FP-Tree Construction (Example)

FP-Tree size I

The FP-Tree usually has a smaller size than the uncompressed data typically many transactions share items (and hence prexes). I

Best case scenario:

all transactions contain the same set of

items. I I

1 path in the FP-tree

Worst case scenario:

every transaction has a unique set of

items (no items in common) I I

Size of the FP-tree is at least as large as the original data. Storage requirements for the FP-tree are higher need to store the pointers between the nodes and the counters.

I The size of the FP-tree depends on how the items are ordered I

Ordering by decreasing support is typically used but it does not always lead to the smallest tree (it's a heuristic).

Step 2: Frequent Itemset Generation I FP-Growth extracts frequent itemsets from the FP-tree. I Bottom-up algorithm from the leaves towards the root I Divide and conquer: rst look for frequent itemsets ending in

e,

then

de ,

etc. . . then

d,

then

cd ,

etc. . .

I First, extract prex path sub-trees ending in an item(set). (hint: use the linked lists)

↑ Complete FP-tree → Example: prex path sub-trees

Step 2: Frequent Itemset Generation I

Each prex path sub-tree is processed recursively to extract the frequent itemsets. Solutions are then merged. prex path sub-tree for e will be used to extract frequent itemsets ending in e , then in de , ce , be and ae , then in cde , bde , cde , etc.

I E.g. the

I Divide and conquer approach

Prex path sub-tree ending in

e.

Example Let minSup = 2 and extract all frequent itemsets containing e . I

I

I

1. Obtain the prex path sub-tree for e :

2. Check if e is a frequent item by adding the counts along the linked list (dotted line). If so, extract it. I Yes, count =3 so {e } is extracted as a frequent itemset. 3. As e is frequent, nd frequent itemsets ending in e . i.e. de , ce , be and ae . I i.e. decompose the problem recursively. I To do this, we must rst to obtain the conditional FP-tree for

e.

Conditional FP-Tree

I

The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions).

I Example:

FP-Tree conditional on e .

Conditional FP-Tree

To obtain the conditional FP-tree for e from the prex sub-tree ending in e : I

Update the support counts along the prex paths (from e ) to reect the number of transactions containing e . I

b

and

c

should be set to 1 and

a

to 2.

Conditional FP-Tree


Remove the nodes containing e information about node e is no longer needed because of the previous step

Conditional FP-Tree


Remove infrequent items (nodes) from the prex paths

b has a support of 1 (note this really means be has a support of 1). i.e. there is only 1 transaction containing b and e so be is infrequent can remove b.

I E.g.

Question: why were c and d not removed?

Example (continued) I

4. Use the the conditional FP-tree for e to nd frequent itemsets ending in de , ce and ae be is for e .

I Note that

FP-tree

not considered as

I For each of them (e.g.

conditional tree for

e,

de ),

b

is not in the conditional

nd the prex paths from the

extract frequent itemsets, generate

conditional FP-tree, etc... (recursive) I Example:

frequent)

e → de → ade ({d , e },{a, d , e }

are found to be

Example (continued) I

4. Use the the conditional FP-tree for e to nd frequent itemsets ending in de , ce and ae I Example:

I

e → ce ({c , e }

is found to be frequent)

etc... (ae , then do the whole thing for b,... etc)

Result

I

Frequent itemsets found (ordered by sux and order in which they are found):

Association Rule Dengan FP-Tree dan FP Growth

Recommend Documents