Association Rule Dengan FP-Tree dan FP Growth
Sistem Informasi FT – UPI YAI
Jesa Ariawan, MTI
Latar Belakang • Algoritma Association Rule dengan Apriori kurang baik bila terdapat banyak pola kombinasi data yang sering muncul (banyak frequent pattern), banyak jenis item tetapi pemenuhan minimum support rendah. • Algoritma Association Rule dengan Apriori membutuhkan waktu yang cukup lama karena scanning database dapat dilakukan berulang-ulang untuk mendapatkan frequent pattern yang ideal. • FP-Tree (Frequent Pattern – Tree) merupakan suatu algoritma yang dirancang untuk mengatasi kendala bottleneck pada proses penggalian data dengan algoritma Apriori (Zhao et al. 2003).
Ide gagasan FP-Tree • Pemampatan data dengan model struktur data pohon untuk menghindari pengulangan scanning database. • Tanpa memerlukan candidate generation • Dilanjutkan dengan proses algoritma FP-growth yang dapat langsung mengekstrak frequent Itemset dari FP tree yang telah terbentuk dengan menggunakan prinsip divide and conquer.
Definisi Frequent Pattern – Tree (FP – Tree) • Terdiri atas sebuah root dengan label ‘null’, sekumpulan subtree yang menjadi child dari root dan sebuah tabel frequent header. • Setiap node dalam FP-Tree mengandung tiga informasi penting. yaitu label item, menginformasikan jenis item yang direpresentasikan node tersebut, support count, merepresentasikan jumlah lintasan transaksi yang melalui node tesebut, dan pointer penghubung yang menghubungkan node-node dengan label item sama antar-lintasan, ditandai dengan garis panah putus-putus. • Isi dari setiap baris dalam tabel frequent header terdiri atas label item dan head of nodelink yang menunjuk ke node pertama dalam FP-Tree yang menyimpan label item tersebut.
Proses untuk mendapatkan Frequent Itemset dilakukan dalam 2 tahap • Pembentukan Frequent Pattern – Tree (FP-Tree)
• Ekstrak Frequent Itemset hasil dari FP-Tree dengan menggunakan algoritma FP-Growth
Algoritma FP Growth dibagi dalam 3 tugas utama 1.
2.
3.
Tahap Pembangkitan Conditional Pattern Base Conditional Pattern Base merupakan subdatabase yang berisi prefix path (lintasan prefix) dan suffix pattern (pola akhiran). Pembangkitan conditional pattern base didapatkan melalui FP-tree yang telah dibangun sebelumnya. Tahap Pembangkitan Conditional FP-tree Pada tahap ini, support count dari setiap item pada setiap conditional pattern base dijumlahkan, lalu setiap item yang memiliki jumlah support count lebih besar sama dengan minimum support count akan dibangkitkan dengan conditional FP-tree. Tahap Pencarian frequent itemset Apabila Conditional FP-tree merupakan lintasan tunggal (single path), maka didapatkan frequent itemset dengan melakukan kombinasi item untuk setiap conditional FP-tree. Jika bukan lintasan tunggal, maka dilakukan pembangkitan FP-growth secara rekursif.
Contoh Proses Asosiasi dengan FP-Tree dan FP-Growth • Contoh Tabel Transaksai dengan minimum support count 2
No. 1 2 3 4 5 6 7 8 9 10
Transaksi a,b b,c,d,g,h a,c,d,e,f a,d,e a,b,z,c a,b,c,d a,r a,b,c a,b,d b,c,e
Contoh Proses Asosiasi dengan FP-Tree dan FP-Growth • Frekuensi kemunculan tiap item dapat dilihat pada tabel berikut
Item a b c d e f r z g h
Frekuensi 8 7 6 5 3 1 1 1 1 1
Contoh Proses Asosiasi dengan FP-Tree dan FP-Growth • Kemunculan item yang frequent dalam setiap transaksi, diurut berdasarkan yang frekuensinya paling tinggi.
TID 1 2 3 4 5 6 7 8 9 10
Item {a,b} {b,c,d} {a,c,d,e} {a,d,e} {a,b,c} {a,b,c,d} {a} {a,b,c} {a,b,d} {b,c,e}
Frequent Pattern Growth (FP-Growth) Algorithm
An Introduction Florian Verhein
[email protected]
School of Information Technologies, The University of Sydney, Australia
Copyright 2008 Florian Verhein. Most gures derived from [1].
January 10, 2008
Outline
Introduction FP-Tree data structure Step 1: FP-Tree Construction Step 2: Frequent Itemset Generation Discussion
Introduction
uses a generate-and-test approach generates candidate itemsets and tests if they are frequent
I Apriori:
I Generation of candidate itemsets is expensive (in both space
and time) I Support counting is expensive I I
Subset checking (computationally expensive) Multiple Database scans (I/O)
allows frequent itemset discovery without candidate itemset generation. Two step approach:
I FP-Growth:
I Step 1: Build a compact data structure called the I
FP-tree
Built using 2 passes over the data-set.
I Step 2: Extracts frequent itemsets directly from the FP-tree I
Traversal through FP-Tree
Core Data Structure: FP-Tree
I Nodes correspond to items and have a
counter
I FP-Growth reads 1 transaction at a
time and maps it to a path
I Fixed order is used, so paths can
overlap when transactions share items (when they have the same prex ).
I In this case, counters are incremented I Pointers are maintained between
nodes containing the same item, creating singly linked lists (dotted lines)
I The more paths that overlap, the
higher the compression. FP-tree may t in memory.
I Frequent itemsets extracted from the
FP-Tree.
Step 1: FP-Tree Construction (Example)
FP-Tree is constructed using 2 passes over the data-set: I Pass 1: I Scan data and nd support for each item. I Discard infrequent items. I Sort frequent items in decreasing order based on their support. I I
For our example: a, b , c , d , e Use this order when building the FP-Tree, so common prexes can be shared.
Step 1: FP-Tree Construction (Example) I Pass 2:
construct the FP-Tree (see diagram on next slide)
I Read transaction 1: I
Create 2 nodes a and b and the path null → a → b . Set counts of a and b to 1.
I Read transaction 2: I
I
{b , c , d }
Create 3 nodes for b , c and d and the path null → b → c → d . Set counts to 1. Note that although transaction 1 and 2 share b , the paths are disjoint as they don't share a common prex. Add the link between the b 's.
I Read transaction 3: I
{a, b }
{a, c , d , e }
It shares common prex item a with transaction 1 so the path for transaction 1 and 3 will overlap and the frequency count for node a will be incremented by 1. Add links between the c 's and d 's.
I Continue until all transactions are mapped to a path in the
FP-tree.
Step 1: FP-Tree Construction (Example)
FP-Tree size I
The FP-Tree usually has a smaller size than the uncompressed data typically many transactions share items (and hence prexes). I
Best case scenario:
all transactions contain the same set of
items. I I
1 path in the FP-tree
Worst case scenario:
every transaction has a unique set of
items (no items in common) I I
Size of the FP-tree is at least as large as the original data. Storage requirements for the FP-tree are higher need to store the pointers between the nodes and the counters.
I The size of the FP-tree depends on how the items are ordered I
Ordering by decreasing support is typically used but it does not always lead to the smallest tree (it's a heuristic).
Step 2: Frequent Itemset Generation I FP-Growth extracts frequent itemsets from the FP-tree. I Bottom-up algorithm from the leaves towards the root I Divide and conquer: rst look for frequent itemsets ending in
e,
then
de ,
etc. . . then
d,
then
cd ,
etc. . .
I First, extract prex path sub-trees ending in an item(set). (hint: use the linked lists)
↑ Complete FP-tree → Example: prex path sub-trees
Step 2: Frequent Itemset Generation I
Each prex path sub-tree is processed recursively to extract the frequent itemsets. Solutions are then merged. prex path sub-tree for e will be used to extract frequent itemsets ending in e , then in de , ce , be and ae , then in cde , bde , cde , etc.
I E.g. the
I Divide and conquer approach
Prex path sub-tree ending in
e.
Example Let minSup = 2 and extract all frequent itemsets containing e . I
I
I
1. Obtain the prex path sub-tree for e :
2. Check if e is a frequent item by adding the counts along the linked list (dotted line). If so, extract it. I Yes, count =3 so {e } is extracted as a frequent itemset. 3. As e is frequent, nd frequent itemsets ending in e . i.e. de , ce , be and ae . I i.e. decompose the problem recursively. I To do this, we must rst to obtain the conditional FP-tree for
e.
Conditional FP-Tree
I
The FP-Tree that would be built if we only consider transactions containing a particular itemset (and then removing that itemset from all transactions).
I Example:
FP-Tree conditional on e .
Conditional FP-Tree
To obtain the conditional FP-tree for e from the prex sub-tree ending in e : I
Update the support counts along the prex paths (from e ) to reect the number of transactions containing e . I
b
and
c
should be set to 1 and
a
to 2.
Conditional FP-Tree
To obtain the conditional FP-tree for e from the prex sub-tree ending in e : I
Remove the nodes containing e information about node e is no longer needed because of the previous step
Conditional FP-Tree
To obtain the conditional FP-tree for e from the prex sub-tree ending in e : I
Remove infrequent items (nodes) from the prex paths
b has a support of 1 (note this really means be has a support of 1). i.e. there is only 1 transaction containing b and e so be is infrequent can remove b.
I E.g.
Question: why were c and d not removed?
Example (continued) I
4. Use the the conditional FP-tree for e to nd frequent itemsets ending in de , ce and ae be is for e .
I Note that
FP-tree
not considered as
I For each of them (e.g.
conditional tree for
e,
de ),
b
is not in the conditional
nd the prex paths from the
extract frequent itemsets, generate
conditional FP-tree, etc... (recursive) I Example:
frequent)
e → de → ade ({d , e },{a, d , e }
are found to be
Example (continued) I
4. Use the the conditional FP-tree for e to nd frequent itemsets ending in de , ce and ae I Example:
I
e → ce ({c , e }
is found to be frequent)
etc... (ae , then do the whole thing for b,... etc)
Result
I
Frequent itemsets found (ordered by sux and order in which they are found):