Volume 1 Number 2 Fall 2009
Internetworking Indonesia Journal The Indonesian Journal of ICT and Internet Development
ISSN: 1942-9703 IIJ © 2009 www.InternetworkingIndonesia.org
Guest Editors’ Introduction: Special Issue on Data Mining, by Anto Satriyo Nugroho & Moch Arif Bijaksana (p. 1)
Enhanced SMART-TV: A Classifier with Vertical Data Structure and Dimensional Projections Pruning, by Taufik Fuadi Abidin & William Perrizo (p. 3)
Using Business Intelligence Solutions for Achieving Organization’s Strategy: Arab International University Case Study, by Mouhib Alnoukari (p. 11)
Feature Selection for Large Scale Data by Combining Class Association Rule Mining and Information Gain: a Hybrid Approach, by Appavu alias Balamurugan, Pramala, Rajalakshmi & Rajaram (p. 17)
Detecting the Originality of Extended Research Articles Using Similarity Techniques - A Proposal, by Shanmugasundaram Hariharan (p. 25)
Prediksi Masa Studi Sarjana dengan Artificial Neural Network, by Muhamad Hanief Meinanda, Metri Annisa, Narendi Muhandri & Kadarsyah Suryadi (p. 31)
Adaptive Content-based Navigation Generating System: Data Mining on Unorganized Multi Web Resources, by Diana Purwitasari, Yasuhisa Okazaki & Kenzi Watanabe (p. 37)
Fuzzy Decision Tree dengan Algoritme ID3 pada Data Diabetes, by F. Romansyah, I. S. Sitanggang & S. Nurdiati (p. 45)
Internetworking Indonesia Journal
The Indonesian Journal of ICT and Internet Development ISSN: 1942-9703 Internetworking Indonesia is a semi-annual electronic journal devoted to the timely study of the Information and Communication Technology (ICT) and Internet development in Indonesia. The journal seeks high-quality manuscripts on the challenges and opportunities presented by information technology and the Internet in Indonesia. Journal mailing address: Internetworking Indonesia Journal, PO Box 397110 MIT Station, Cambridge, MA 02139, USA.
Co-Editors Thomas Hardjono, PhD (MIT Kerberos Consortium, MIT, USA)
Budi Rahardjo, PhD (ITB, Indonesia)
Kuncoro Wastuwibowo, MSc (PT. Telkom, Indonesia)
Editorial Advisory Board
Prof. Edy Tri Baskoro, PhD (ITB, Indonesia)
Mark Baugher, MA (Cisco Systems, USA)
Lakshminath Dondeti, PhD (Qualcomm, USA)
Prof. Svein Knapskog, PhD (NTNU, Norway)
Prof. Merlyna Lim, PhD (Arizona State University, USA)
Prof. Bambang Parmanto, PhD (University of Pittsburgh, USA)
Prof. Wishnu Prasetya, PhD (Utrecht University, The Netherlands)
Prof. Jennifer Seberry, PhD (University of Wollongong, Australia)
Prof. Willy Susilo, PhD (University of Wollongong, Australia)
Prof. David Taniar, PhD (Monash University, Australia)
Focus & Scope: The Internetworking Indonesia Journal aims to become the foremost publication for practitioners, teachers, researchers and policy makers to share their knowledge and experience in the design, development, implementation, and the management of ICT and the Internet in Indonesia. Topics of Interest: The journal welcomes and strongly encourages submissions based on interdisciplinary approaches focusing on ICT & Internet development and its related aspects in the Indonesian context. These include (but not limited to) information technology, communications technology, computer sciences, electrical engineering, and the broader social studies regarding ICT and Internet development in Indonesia. A list of topics can be found on the journal website at www.InternetworkingIndonesia.org. Open Access Publication Policy: The Internetworking Indonesia Journal provides open access to all of its content on the principle that making research freely available to the public supports a greater global exchange of knowledge. This follows the philosophy of the Open Journal Systems (see the Public Knowledge Project at pkp.sfu.ca). The journal will be published electronically and there are no subscription fees. Such access is associated with increased readership and increased citation of an author’s work. Manuscript Language: The Internetworking Indonesia Journal accepts and publishes papers in Bahasa Indonesia and English. Manuscript Submission: Manuscripts should be submitted according to the IEEE Guide for authors, and will be refereed in the standard way. Manuscript pages should not exceed 7 pages of the IEEE 2-column format in a Microsoft-Word file format. Links to the IEEE style files can be found at www.InternetworkingIndonesia.org. Manuscripts submitted to Internetworking Indonesia must not have been previously published or committed to another publisher under a copyright transfer agreement, and must not be under consideration by another journal. Authors of accepted papers are responsible for the Camera Ready Copy (CRC) in the IEEE 2-column format (delivered to the Editor in a Microsoft-Word file). Authors are advised that no revisions of the manuscript can be made after acceptance by the Editor for publication. The benefits of this procedure are many, with speed and accuracy being the most obvious. We look forward to working with your electronic submission which will allow us to serve you more efficiently. Please email manuscripts or inquiries to the editors at the following address:
[email protected]. Submission Guidelines: Internetworking Indonesia accepts a variety of manuscripts in either English or Bahasa Indonesia. Please review the descriptions below and identify the submission type best suited to your intended submission: • Research Papers: Research papers report on results emanating from research projects, both theoretical and practical in nature. • Short papers: Short research papers provide an introduction to new developments or advances regarding on-going work. • Policy Viewpoints: Policy Viewpoints explore competing perspectives in the Indonesian policy debate that are informed by academic research. • Teaching Innovation: Teaching Innovation papers explore creative uses of information technology tools and the Internet to improve learning and education in Indonesia. • Book Reviews: A review of a book, or other book-length document, such as a government report or foundation report.
Guest Editors’ Introduction: Special Issue on Data Mining
The Guest Editors can be reached at the following email addresses. Dr. Anto Satriyo Nugroho is at [email protected], while Moch Arif Bijaksana is at [email protected].

EDISI khusus IIJ ini memilih datamining sebagai topik utama. Datamining merupakan salah satu topik menarik dalam bidang komputasi yang akhir-akhir ini mendapatkan perhatian yang makin meningkat, baik dari kalangan akademisi maupun praktisi di berbagai sektor. Hal ini tidak terlepas dari ketersediaan data yang berskala besar pada berbagai bidang, sehingga menuntut pengembangan teknik-teknik komputasi untuk mengekstrak informasi signifikan dari lautan data tersebut. Ketersediaan data dalam skala besar ini dijumpai di berbagai bidang, antara lain bioteknologi, meteorologi, akademik dan finansial. Terlebih lagi dewasa ini penelitian lintas bidang seolah sudah menjadi keharusan, karena pemecahan masalah yang demikian kompleks tidak akan tercapai tanpa adanya kerjasama berbagai disiplin ilmu. Hal ini menyebabkan data yang diperoleh tidak hanya berskala besar, tetapi juga semakin kompleks dan heterogen. Tantangan-tantangan inilah yang menyebabkan datamining menjadi salah satu tren teknologi terkini dalam bidang komputasi. Datamining diharapkan mampu menemukan informasi yang tersuruk dalam lautan data itu untuk dimanfaatkan dalam berbagai aplikasi, baik di dunia kedokteran, marketing maupun bidang yang lain. Di Indonesia, minat terhadap datamining terlihat makin meningkat, sebagaimana tercermin pada diskusi di berbagai komunitas maya. Berbagai perguruan tinggi telah memasukkan datamining dalam kurikulumnya, di samping mata kuliah yang secara konten berkaitan erat seperti statistika, pengenalan pola dan machine learning. Pelatihan datamining pun telah dilakukan di berbagai instansi, seperti perbankan misalnya. Kebutuhan akan praktisi TI yang menguasai teknologi datamining pun mulai sering terlihat dari proses perekrutan pegawai, misalnya perekrutan pegawai perbankan di Indonesia. Hal lain yang menggembirakan adalah mulai ditemuinya buku-buku teks datamining berbahasa Indonesia, sehingga memudahkan para pelajar, mahasiswa dan peneliti untuk melakukan studi dan penelitian di bidang ini. Hal-hal ini yang melatarbelakangi dipilihnya topik datamining dalam edisi khusus IIJ kali ini. Dari berbagai makalah yang diterima, dilakukan proses review oleh para pakar, dan akhirnya terpilih 7 makalah untuk dimuat pada edisi kali ini. Paper pertama memperkenalkan algoritma Enhanced SMART-TV (SMall Absolute diffeRence of ToTal Variation) yang merupakan varian dari nearest neighbor yang memiliki kelebihan pada sisi kecepatan, dibandingkan dengan pendekatan nearest neighbor konvensional. Paper kedua mengambil topik Business
Intelligence untuk perancangan strategi, dengan studi kasus pada Arab International University. Paper ketiga mengupas teknik feature selection untuk aplikasi pada data skala besar. Paper keempat merupakan aplikasi datamining untuk informasi tekstual, yaitu mendeteksi originality sebuah paper terhadap paper yang telah dipublikasikan sebelumnya. Dengan demikian, bisa diketahui apakah seorang peneliti memang benar-benar telah berhasil mengembangkan ide penelitiannya, ataukah sekedar mengulang ide yang telah dipresentasikan, yang bukan lain adalah bentuk plagiarism. Paper kelima adalah pemenang kompetisi datamining nasional bagi mahasiswa S1 yang diselenggarakan oleh Dikti pada GeMasTIK II tahun 2009, bertemakan prediksi masa studi sarjana berdasarkan data-data akademis. Paper keenam mengajukan teknik baru untuk membangun sistem generasi navigasi berdasar konten secara otomatis. Paper ketujuh mengetengahkan aplikasi datamining pada kesehatan, yaitu dalam menganalisa data diabetes. Teknik datamining yang dibahas dalam ketujuh tema tersebut cukup beragam, dengan target aplikasi pada bidang akademik dan kesehatan. Edisi khusus ini terbit dengan dukungan para reviewer yang merupakan pakar data mining Indonesia. Untuk itu kami ucapkan terima kasih dan penghargaan yang tinggi kepada beberapa kolega yang telah melakukan review: Dr. Budi Santosa, Dr. Yudho Giri Sucahyo, Dr. G.A. Putri Saptawati, Prof. The Houw Liong, Dr. Yudhi Agusta, Dr. Iko Pramudiono, dan terutama kepada Prof. David Taniar yang juga telah mendorong adanya edisi khusus ini. Dan tentunya kepada editor IIJ Dr. Thomas Hardjono yang telah mengawal dan mendukung dari awal hingga akhir edisi ini. Kami berharap agar paper yang diterbitkan pada Edisi Khusus kali ini dapat bermanfaat bagi penelitian dan pengembangan aplikasi datamining pada berbagai sektor di tanah air. Semoga pada masa mendatang semakin banyak tulisan berkaitan dengan datamining yang dapat dimuat pada IIJ. ____________ This Special Issue of the IIJ focuses on data mining. Data mining is an interesting topic within the broad field of computation that has received increased attention recently, not only from academics but also from practitioners in various sectors. This is not unrelated to the availability of large scale data within different areas, which has therefore necessitated the development of new computational techniques to extract significant information from this ocean of data. The availability of large scale data can be found in various areas, including biotechnology, meteorology, the academic fields and the financial sector. Furthermore, in recent times multidisciplinary research has in reality become
unavoidable (and even mandatory) due to the fact that solving many of these complex problems requires collaboration across various disciplines of study. This has resulted in data not only being large scale, but also more complex and heterogeneous. It is these challenges that have made data mining a new technological trend of interest within the field of computation. The expectation is that data mining can be deployed to find useful information buried within the vast ocean of data, which in turn can be beneficial to diverse areas of application such as medicine, marketing and others. In Indonesia there has recently been an increase in interest in data mining, as reflected by the discussion traffic on the Internet. A number of universities and higher education institutions have introduced the subject of data mining into their curricula as an addition to subjects whose contents are already closely related to data mining, such as statistics, pattern recognition and machine learning. Training in data mining has also been provided by a number of institutions, such as those in the banking sector. The increasing demand for IT practitioners who are familiar with data mining technologies is also evident from increasing recruitments seeking these practitioners, such as in the area of banking. Most pleasing also is the increasing availability of textbooks on data mining written in Bahasa Indonesia, which allows a broader range of students and researchers alike to focus on the study of data mining and conduct research in this area. It is all these factors that have provided the backdrop for this IIJ Special Issue on Data Mining. This Special Issue carries seven (7) papers, which have been reviewed by experts in this field. The first paper introduces the Enhanced SMART-TV algorithm (SMall Absolute diffeRence of ToTal Variation), which is a variation of the nearest neighbor approach. However, this algorithm variation provides an increase in speed compared to the conventional nearest neighbor approach. The second paper discusses the role of business intelligence in achieving the business strategy of an organization, using the Arab International University as its case study. The third paper focuses on a feature selection technique for large scale data. The fourth paper uses data mining on textual information for the purpose of detecting the originality of the contents of a new paper as compared to existing published papers. The aim here is to determine whether a researcher has truly expanded a given research idea, or if he or she has simply repeated the same idea using a different presentation, something considered by many to be equivalent to plagiarism. The fifth paper is the winner of the Indonesian national data mining competition for S1 (undergraduate) students held by the DIKTI organization (Directorate General for Higher Education) at the GeMasTIK II event in 2009, which is a student event focusing on information and communications technologies. The theme of this fifth paper is the predictive techniques that can be used to predict a student’s length of study based on their previous academic data. The sixth paper is on a navigation-generating system based on adaptive content, while the seventh paper discusses the application of data mining within the field of healthcare, for the purpose of analyzing diabetes-related data.
All in all, the data mining techniques discussed within these seven papers cover a broad range, with the target areas of application primarily being the academic field and healthcare. This Special Issue on Data Mining received tremendous support from reviewers who are experts in the field of data mining in Indonesia. We would like to thank and express our deepest appreciation to these colleagues who did the reviews: Dr. Budi Santosa, Dr. Yudho Giri Sucahyo, Dr. G.A. Putri Saptawati, Prof. The Houw Liong, Dr. Yudhi Agusta, and Dr. Iko Pramudiono. Special appreciation goes to Prof. David Taniar, who came up with the idea of an IIJ Special Issue and strongly supported its creation. We also thank the IIJ Editor Dr. Thomas Hardjono, who has provided editorial guidance and support from the beginning until the completion of this Special Issue. We hope the papers appearing in this IIJ Special Issue on Data Mining will be of benefit to research and development on data mining applications within the various sectors in Indonesia. We hope other papers related to data mining can appear in the IIJ in the future.
Anto Satriyo Nugroho Moch Arif Bijaksana
Dr. Anto Satriyo Nugroho is a researcher working for the Agency for the Assessment & Application of Technology (BPPT), Indonesia. He received his B.Eng degree in 1995, M.Eng in 2000 and Dr.Eng degree in 2003, all in Electrical & Computer Engineering from the Nagoya Institute of Technology, Japan. From 2003 to 2007 he worked for Chukyo University as a Visiting Professor in the School of Cognitive and Computer Science (2003-2004) and in the School of Life System Science & Technology (2004-2007). He received the First Prize award of the ’99 Fog Forecasting Contest conducted by the Neurocomputing Technical Group of the Institute of Electronics, Information and Communication Engineers (IEICE), Japan, and also the best paper award at the Konferensi Nasional Sistem & Informatika 2008 at the Inna Sindhu Beach Hotel, Sanur, Bali. His research interests are in the areas of pattern recognition, computational biology and data mining. Since 2007 he has been serving as the vice president of the Indonesian Society for Softcomputing. Dr. Nugroho is a member of the Institute of Electrical & Electronics Engineers (IEEE).
Moch Arif Bijaksana is a lecturer working for Institut Teknologi Telkom (ITTelkom), Indonesia. He received his B.Eng degree in 1991 from Gadjah Mada University and his M.Tech in 1998 from RMIT University. He is an active member of the Indonesian Data Mining Community. His main research interest is text mining.
Enhanced SMART-TV: A Classifier with Vertical Data Structure and Dimensional Projections Pruning

Taufik Fuadi Abidin
Data Mining and IR Research Group, Department of Mathematics, Syiah Kuala University, Banda Aceh, NAD 23111 Indonesia
[email protected]

William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105 USA
[email protected]

Abstract—Classification is the process of assigning class membership to newly encountered and unlabeled objects based on some notion of closeness between the objects in the training set and the new objects. In this work, we introduce an enhancement of the SMART-TV (SMall Absolute diffeRence of ToTal Variation) classifier by introducing a dimensional projections process to prune the neighbors in the candidate set that are far away from the unclassified object in terms of distance but are close in terms of total variation. SMART-TV is a scalable nearest-neighbors classifier. It uses a vertical data structure and approximates a set of potential candidates of neighbors by means of vertical total variation. The total variation of a set of objects about a given object is computed using an efficient and scalable Vertical Set Squared Distance algorithm. In our previous work on SMART-TV, the dimensional projections were not introduced, and thus candidates that are far away from the unclassified object but close in terms of total variation also voted. This, in some cases, can reduce the accuracy of the algorithm. Our empirical results using both real and synthetic datasets show that the proposed dimensional projections effectively prune the superfluous neighbors that are far away from the unclassified object. In terms of speed and scalability, the enhanced SMART-TV is very fast and comparable to the other vertical nearest neighbor algorithms, and in terms of accuracy, the enhanced SMART-TV with dimensional projections pruning is very comparable to the KNN classifier.

Index Terms—Data Mining, Classification, Vertical Data Structure, Vertical Set Squared Distance.

I. INTRODUCTION

Classification on massive datasets has become one of the most important research challenges in data mining. Classification involves predicting the class label of newly encountered objects using the feature attributes of a set of pre-classified objects. The main goal of classification is to determine the data membership of the unclassified objects and to discover the group of the new objects.

The K-nearest neighbor (KNN) classifier is one of the methods used for classification [5]. The KNN classifier does not build a model in advance like decision tree induction, Neural Networks, and Support Vector Machines; instead, it invests all the effort when new objects arrive. The class assignment is made locally based on the features of the new objects. The KNN classifier searches for the most similar objects in the training set and assigns class labels to the new instances based on the plurality of class of the k-nearest neighbors. The notion of closeness between the training objects and the new object is determined using a distance measure, e.g. Euclidean distance. KNN has shown good performance on various datasets. However, when the training set is very large, i.e. millions of objects, it will not scale. The brute-force search to find the k-nearest neighbors increases the classification time significantly. Several new approaches have been developed to accelerate the classification process by improving the KNN search, such as [9]-[16], just to mention a few. Grother [9] accelerated the nearest neighbor search using a kd-tree method, while Vries [16] improved the efficiency of KNN search in high dimensional spaces by treating it as a physical database design problem.

Most of the state-of-the-art classification algorithms represent the data in the traditional horizontal, row-based structuring. This structure, combined with a sequential scan-based approach, is known to scale poorly to very large datasets. Gray [8], in his talk at the SIGMOD 2004 conference, emphasized that vertical column-based structuring, as opposed to horizontal row-based structuring, is a good alternative to speed up query processing and ensure scalability.

In this work, we introduce the enhancement of the SMART-TV classifier with dimensional projections to prune the neighbors in the candidate set that are far away from the unclassified object, even though in terms of absolute difference of total variation their values can be small. SMART-TV is an efficient and scalable nearest-neighbors classifier that approximates a set of candidates of neighbors by means of vertical total variation [1]. The total variation of a set of objects about a given object is computed using an efficient and scalable Vertical Set Squared Distance algorithm [2].

Manuscript received September 30, 2009. T. Abidin is a faculty member of Syiah Kuala University and the founder of the Data Mining and Information Retrieval Research Group at the Department of Mathematics, College of Science (e-mail: [email protected]). W. Perrizo is a Professor of Computer Science at North Dakota State University, Fargo, ND, 58105, USA (e-mail: [email protected]).
Fig. 1. The construction of P-tree. (The figure shows attribute A1 with values 4, 2, 2, 7, 5, 1, 6, 3 decomposed into vertical bit slices P11, P12, and P13, and the corresponding compressed P-trees with levels L0 to L3.)
We use a vertical data structure that has been experimentally proven to address the curse of scalability and to facilitate efficient data mining over large datasets. We demonstrate the scalability of the algorithm empirically through several experiments.
II. P-TREE VERTICAL DATA STRUCTURE

Vertical data representation arranges and processes the data differently from horizontal data representation. In vertical data representation, the data is structured column-by-column and processed horizontally through logical operations like AND or OR, while in horizontal data representation, the data is structured row-by-row and processed vertically through scanning or using some indexes. The P-tree vertical data structure is one choice of vertical data representation [13]. The P-tree is a lossless, compressed, and data-mining-ready data structure. It is lossless because the vertical bit-wise partitioning guarantees that the original data values can be retained completely. It is compressed because when the segments of bit sequences are either pure-1 or pure-0, they can be represented in a single bit. It is data-mining ready because it addresses the curse of cardinality, or the curse of scalability, one of the major issues in data mining. The P-tree vertical data structure has been intensively exploited in various domains and data mining algorithms, ranging from classification [1], [2], [11] and clustering [12] to association rule mining [14] and outlier analysis [15]. In September 2005, P-tree technology was patented in the United States by North Dakota State University (US patent number: 6,941,303). The construction of the P-tree vertical data structure begins by converting the dataset, normally arranged in a relation of horizontal records, into binary. Each attribute in the relation is vertically partitioned, and for each bit position in the attribute, vertical bit sequences (containing 0s and 1s) are subsequently created. During partitioning, the relative order of the data is retained to ensure convertibility. In 0-dimensional P-trees, the vertical bit sequences are left uncompressed and are not constructed into predicate trees. The size of 0-dimensional P-trees is equal to the cardinality of the dataset. In 1-dimensional compressed P-trees, the vertical bit sequences are constructed into predicate trees. In this compressed form, AND operations can be accelerated. The 1-dimensional P-trees are constructed by recursively halving the vertical bit sequences and recording the truth of the "purely 1-bits" predicate in each half. A predicate 1 indicates that the bits in that half are all 1s, and a predicate 0 indicates otherwise. The complete set of (uncompressed) vertical bit slices (for numeric attributes) or bitmaps (for categorical attributes) will be called the 0-dimensional P-trees for the given relation of horizontal records. A 1-dimensional P-tree set involves a binary tree compression of those 0-dimensional P-trees, as shown in Fig. 1. A 2-dimensional, 3-dimensional or multidimensional P-tree set can be created as well, depending upon the nature of the dataset. In this work, we used the uncompressed P-tree set. In the P-tree data structure, range queries, values, or any other patterns can be obtained using a combination of the Boolean algebra operators AND, OR, and NOT. Besides those operators, the COUNT operation is also very important in this structure; it counts the number of 1-bits in a basic or complement P-tree. Ding [7] gives a complete description of the P-tree algebra and its operations.
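To make the bit-slice idea concrete, here is a minimal, hedged sketch in Python of uncompressed (0-dimensional) vertical bit slices and the COUNT/AND operations. It is an illustration only, not the patented P-tree implementation; the names bit_slices and count_ones are ours.

# Minimal sketch of uncompressed (0-dimensional) vertical bit slices.
# This is an illustration only, not the patented P-tree structure.

def bit_slices(column, width=8):
    """Decompose a list of non-negative integers into `width` vertical
    bit slices. Slice j holds bit j (LSB = 0) of every value, packed
    into a single Python integer so that bitwise AND works in bulk."""
    slices = [0] * width
    for pos, value in enumerate(column):
        for j in range(width):
            if (value >> j) & 1:
                slices[j] |= 1 << pos   # set the bit for this row
    return slices

def count_ones(bits):
    """COUNT operation: number of 1-bits in a slice or mask."""
    return bin(bits).count("1")

# Example: attribute A1 from Fig. 1
a1 = [4, 2, 2, 7, 5, 1, 6, 3]
p11, p12, p13 = bit_slices(a1, width=3)[::-1]   # P11 = most significant bit

# How many rows have both the highest and the lowest bit set (values 5 and 7)?
print(count_ones(p11 & p13))   # -> 2

The point of the sketch is that a range or pattern query reduces to bitwise AND between packed slices followed by a COUNT, which is exactly the kind of operation the compressed P-trees accelerate further.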
III. VERTICAL SET SQUARED DISTANCE

Let R(A1, A2, …, Ad) be a relation in d-dimensional space. A numerical value v of attribute Ai can be written in b-bit binary representation as:

v = v_i(b-1) … v_i0 = Σ_{j=b-1}^{0} 2^j · v_ij   (1)

where v_ij can either be 0 or 1. Now, let x be a vector in d-dimensional space. The binary representation of x in b bits can be written as:

x = ( x_1(b-1) … x_10, x_2(b-1) … x_20, …, x_d(b-1) … x_d0 )   (2)

Let X be a set of vectors in R, or let X be the relation R itself, x ∈ X, and let a be the input vector. The total variation of X about a, denoted as TV(X, a), measures the cumulative squared separation of the vectors in X about a, which can be computed quickly and scalably using the Vertical Set Squared Distance (VSSD) formula, derived as follows:

TV(X, a) = (X − a) ∘ (X − a) = Σ_{x∈X} (x − a) ∘ (x − a) = Σ_{x∈X} Σ_{i=1}^{d} (x_i − a_i)²
         = Σ_{x∈X} Σ_{i=1}^{d} x_i² − 2 · Σ_{x∈X} Σ_{i=1}^{d} x_i a_i + Σ_{x∈X} Σ_{i=1}^{d} a_i²
         = T1 − 2·T2 + T3   (3)

T1 = Σ_{x∈X} Σ_{i=1}^{d} x_i² = Σ_{x∈X} Σ_{i=1}^{d} ( Σ_{j=b-1}^{0} 2^j · x_ij )² = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · Σ_{x∈X} x_ij · x_ik   (4)

Let PX be a P-tree mask of set X that can quickly identify data points in X. PX is a bit pattern containing 1s and 0s, where bit 1 indicates that a point at that bit position belongs to X, while 0 indicates otherwise. Using this mask, (3) can be simplified by substituting:

Σ_{x∈X} x_ij · x_ik   with   COUNT(PX ∧ P_ij ∧ P_ik)   (5)

where COUNT is the aggregate operation that counts the 1-bits in a given pattern and P_ij is the P-tree at the i-th dimension and j-th bit position. Thus, T1 can be simplified to be:

T1 = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · COUNT(PX ∧ P_ij ∧ P_ik)   (6)

Similarly for terms T2 and T3, we get:

T2 = Σ_{x∈X} Σ_{i=1}^{d} x_i a_i = Σ_{i=1}^{d} a_i Σ_{j=b-1}^{0} 2^j · Σ_{x∈X} x_ij = Σ_{i=1}^{d} a_i Σ_{j=b-1}^{0} 2^j · COUNT(PX ∧ P_ij)   (7)

T3 = Σ_{x∈X} Σ_{i=1}^{d} a_i² = COUNT(PX) · Σ_{i=1}^{d} a_i²   (8)

Therefore:

TV(X, a) = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · COUNT(PX ∧ P_ij ∧ P_ik) − 2 · Σ_{i=1}^{d} a_i · Σ_{j=b-1}^{0} 2^j · COUNT(PX ∧ P_ij) + COUNT(PX) · Σ_{i=1}^{d} a_i²   (9)

Let us consider X as the relation R itself. In this case, the mask PX can be removed since all objects now belong to R. The formula can then be rewritten as:

TV(X, a) = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · COUNT(P_ij ∧ P_ik) − 2 · Σ_{i=1}^{d} a_i · Σ_{j=b-1}^{0} 2^j · COUNT(P_ij) + |X| · Σ_{i=1}^{d} a_i²   (10)

where |X| is the cardinality of R. Note that the COUNT operations are independent of a (the input value). For a classification problem where X is the training set (or the relation R itself), this independence allows us to pre-compute the count values in advance, retain them, and use them repeatedly when computing total variations. The reusability of the count values expedites the computation of total variation significantly.
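As a numerical sanity check of Eq. (10), the following hedged sketch precomputes the COUNT(P_ij ∧ P_ik) and COUNT(P_ij) tables once and then evaluates TV(X, a) for an arbitrary input vector a, comparing the result against the direct horizontal computation. It assumes uncompressed bit slices and invented names; it is not the authors' implementation.

# Numerical sketch of Eq. (10): TV(X, a) from precomputed COUNT values.
# Uncompressed bit slices are used; names are ours, not the authors' API.

def slices_per_attribute(rows, d, b):
    """rows: list of d-dimensional integer vectors. Returns, per attribute,
    a list of b packed bit slices ordered from bit b-1 down to bit 0."""
    out = []
    for i in range(d):
        col = [r[i] for r in rows]
        sl = []
        for j in range(b - 1, -1, -1):
            bits = 0
            for pos, v in enumerate(col):
                if (v >> j) & 1:
                    bits |= 1 << pos
            sl.append(bits)
        out.append(sl)
    return out

def popcount(x):
    return bin(x).count("1")

def total_variation(counts_jk, counts_j, n, a, b):
    """Eq. (10): sum_i sum_jk 2^(j+k) COUNT(Pij ^ Pik)
                 - 2 sum_i a_i sum_j 2^j COUNT(Pij) + n * sum_i a_i^2."""
    d = len(a)
    t1 = sum(2 ** (j + k) * counts_jk[i][ji][ki]
             for i in range(d)
             for ji, j in enumerate(range(b - 1, -1, -1))
             for ki, k in enumerate(range(b - 1, -1, -1)))
    t2 = sum(a[i] * sum(2 ** j * counts_j[i][ji]
                        for ji, j in enumerate(range(b - 1, -1, -1)))
             for i in range(d))
    t3 = n * sum(ai * ai for ai in a)
    return t1 - 2 * t2 + t3

# Tiny dataset: 4 points in 2-D, 3-bit attributes.
X = [(1, 2), (3, 3), (5, 1), (7, 6)]
d, b, n = 2, 3, len(X)
S = slices_per_attribute(X, d, b)

# Precompute the counts once (independent of the query vector a).
counts_j = [[popcount(S[i][j]) for j in range(b)] for i in range(d)]
counts_jk = [[[popcount(S[i][j] & S[i][k]) for k in range(b)]
              for j in range(b)] for i in range(d)]

a = (2, 2)
tv_vertical = total_variation(counts_jk, counts_j, n, a, b)
tv_direct = sum((x[i] - a[i]) ** 2 for x in X for i in range(d))
assert tv_vertical == tv_direct   # both give 54
print(tv_vertical)

The count tables depend only on the training data, so for repeated classification only the two cheap a-dependent terms need to be recomputed, which is the reuse the paragraph above refers to.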
IV. ENHANCED SMART-TV WITH DIMENSIONAL PROJECTIONS PRUNING

The SMART-TV classifier finds the candidates of neighbors by forming a total variation contour around the total variation of the unclassified object [1]. The objects within the contour are then considered as the superset of nearest neighbors. These objects are identified efficiently using the P-tree range query algorithm without the need to scan the total variation values. The proposed algorithm further prunes the neighbor set by means of dimensional projections. After pruning, the k-nearest neighbors are searched from the set, and then they vote to determine the class label of the unclassified object. The algorithm consists of two phases: preprocessing and classifying. In the preprocessing phase, all processes are conducted only once, while in the classifying phase, the processes are repeated for every new unclassified object. Let R(A1,…,An, C) be a training space and X(A1,…,An) = R[A1,…,An] be the features subspace. TV(X, a) is the total variation of X about a, and let f(a) = TV(X, a) − TV(X, µ).
Recall that:

TV(X, a) = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · COUNT(P_ij ∧ P_ik) − 2 · Σ_{i=1}^{d} a_i · Σ_{j=b-1}^{0} 2^j · COUNT(P_ij) + |X| · Σ_{i=1}^{d} a_i²
         = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · COUNT(P_ij ∧ P_ik) − 2·|X| · Σ_{i=1}^{d} a_i µ_i + |X| · Σ_{i=1}^{d} a_i²
         = Σ_{i=1}^{d} Σ_{j=b-1}^{0} Σ_{k=b-1}^{0} 2^{j+k} · COUNT(P_ij ∧ P_ik) + |X| · ( −2 · Σ_{i=1}^{d} a_i µ_i + Σ_{i=1}^{d} a_i² )   (11)

Since f(a) = TV(X, a) − TV(X, µ), we get:

f(a) = |X| · ( −2 · Σ_{i=1}^{d} a_i µ_i + Σ_{i=1}^{d} a_i² ) − |X| · ( −2 · Σ_{i=1}^{d} µ_i µ_i + Σ_{i=1}^{d} µ_i² )
     = |X| · ( −2 · Σ_{i=1}^{d} (a_i µ_i − µ_i µ_i) + Σ_{i=1}^{d} (a_i² − µ_i²) )
     = |X| · ( Σ_{i=1}^{d} a_i² − 2 · Σ_{i=1}^{d} µ_i a_i + Σ_{i=1}^{d} µ_i² )
     = |X| · ‖a − µ‖²   (12)

In the preprocessing phase, the function f(a) is applied to each training object, and derived P-trees of those functional values are created to incorporate a fast and efficient way to determine the candidates of neighbors. Since in large training sets the values of f(a) can be very large, and representing these large values in binary would require an unnecessarily large number of bits, we let g(a) = ln(f(a)) = ln|X| + ln‖a − µ‖² to reduce the bit width. By taking the first derivative of g(a) with respect to the d-th dimension, we find that the gradient of g at a is ∂g(a)/∂a_d = (2 / ‖a − µ‖²) · (a − µ)_d. The gradient is zero if and only if a = µ, and the gradient length depends only on the length of vector (a − µ). This indicates that the isobars are hyper-circles. Note also that to avoid singularity when a = µ, we add a constant 1 to the function f(a) such that g(a) = ln(f(a) + 1). The shape of the graph is a funnel, as shown in Fig. 2.

Fig. 2. The shape of graph g(a).

A. Preprocessing Phase

In the preprocessing phase, we compute g(x), ∀x ∈ X, and create derived P-trees of the functional values g(x). The derived P-trees are stored together with the P-trees of the dataset. The complexity of this computation is O(n) since the computation is applied to all objects in X. Furthermore, because the vector mean µ = (µ_1, µ_2, …, µ_d) is used in the function, the vector mean must be computed first. We compute the vector mean efficiently using the P-tree vertical structure. The aggregate function COUNT is used to count the number of 1s in the bit pattern such that the sum of 1s in each vertical bit pattern is acquired first. The following formula shows how to compute the element of the vector mean at the i-th dimension vertically:

µ_i = (1/|X|) · Σ_{x∈X} x_i = (1/|X|) · Σ_{x∈X} Σ_{j=b-1}^{0} 2^j · x_ij = (1/|X|) · Σ_{j=b-1}^{0} 2^j · Σ_{x∈X} x_ij = (1/|X|) · Σ_{j=b-1}^{0} 2^j · COUNT(P_ij)
The complexity to compute the vector mean is O(d·b), where d is the number of dimensions and b is the bit width.

B. Classifying Phase

The following processes are performed on each unclassified object a. For simplicity, consider the objects in 2-dimensional space.

1. Determine vector (a − µ), where a is the new object and µ is the vector mean of the features space X.

2. Given an epsilon of the contour (ε > 0), determine two vectors located in the lower and upper side of a by moving the ε unit inward toward µ along vector (a − µ) and moving the ε unit outward away from a. Let b and c be the two vectors in the lower and upper side of a respectively; then vectors b and c can be determined using the following equations:
b = µ + ( 1 − ε / ‖a − µ‖ ) · (a − µ)   (13)

c = µ + ( 1 + ε / ‖a − µ‖ ) · (a − µ)   (14)
3. Calculate g(b) and g(c) such that g(b)≤ g(a)≤ g(c) and determine the interval [g(b), g(c)] that creates a contour over the functional line. The contour mask of the interval is created efficiently using the P-tree range query algorithm without having to scan the functional values one by one. The mask is a bit pattern containing 1s and 0s, where bit 1 indicates that the object is in the contour while 0 indicates otherwise. The objects within the pre-image of the contour in the original feature space are considered as the superset of neighbors (ε neighborhood of a or Nbrhd(a, ε)).
Fig. 3. The pre-image of the contour of interval [g(b), g(c)] creates a Nbrhd (a, ε).
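The following small sketch, under the assumption that the vector mean µ and |X| are already available, illustrates steps 1-3: the derived functional g(a) = ln(f(a) + 1) of Eq. (12), the contour endpoints b and c of Eqs. (13)-(14), and the resulting interval [g(b), g(c)]. The function names and the sample numbers are ours, invented for illustration.

# Sketch of steps 1-3 of the classifying phase (Eqs. 13-14), assuming the
# vector mean mu and |X| are already known; function names are ours.
import math

def g(a, mu, n):
    """g(a) = ln(f(a) + 1) with f(a) = |X| * ||a - mu||^2 (Eq. 12)."""
    sq = sum((ai - mi) ** 2 for ai, mi in zip(a, mu))
    return math.log(n * sq + 1)

def contour_interval(a, mu, n, eps):
    """Return (g(b), g(c)) for the lower/upper contour vectors of Eqs. 13-14."""
    dist = math.sqrt(sum((ai - mi) ** 2 for ai, mi in zip(a, mu)))
    b = [mi + (1 - eps / dist) * (ai - mi) for ai, mi in zip(a, mu)]
    c = [mi + (1 + eps / dist) * (ai - mi) for ai, mi in zip(a, mu)]
    return g(b, mu, n), g(c, mu, n)

mu, n = (3.0, 3.0), 1000          # assumed feature-space mean and cardinality
a = (5.0, 1.0)                    # unclassified object
lo, hi = contour_interval(a, mu, n, eps=0.1)
print(lo <= g(a, mu, n) <= hi)    # True: a's functional value lies in the contour

In the full algorithm this interval is not checked object by object as above; it is handed to the P-tree range query, which returns the contour mask in one vertical operation.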
4. Since the total variation contour is annular around the mean, the candidate set may contain neighbors that are actually far from the unclassified object a, e.g. located within the contour but on the opposite side of a. Therefore, an efficient pruning technique is required to eliminate the superfluous neighbors in the candidate set. We prune those neighbors using dimensional projections. For each dimension, a dimensional projection contour is created around the vector element of the unclassified object in that dimension. The size of the contour is specified by moving the ε unit away from the element of vector a in that dimension on both sides. It is important to note that because the training set was converted into the P-tree vertical structure, no additional derived P-trees are needed to perform the projection. The P-trees of each dimension can be used directly with no extra cost. Furthermore, a parameter ms (manageable size of candidate set) is needed for the pruning. This parameter specifies the upper bound of neighbors in the candidate set such that if the manageable size of neighbors is reached, the pruning process is terminated, and the number of neighbors in the candidate set is considered small enough to be scanned to search for the k-nearest neighbors. The pruning technique consists of two major steps. First, it obtains the count of neighbors in the pre-image of the total variation contour (candidate set) relative to a particular dimension. The rationale is to retain the neighbors that are predominant (close to the unclassified object) in most dimensions so that, when the Euclidean distance is calculated (step 5 of the classifying phase), they are the true closest neighbors. The process of obtaining the count starts from the first dimension. The dimensional projection contour around the unclassified object a is formed, and the contour mask is created. Again, the mask is simply a bit pattern containing 1s and 0s, where bit 1 indicates that the object belongs to the candidate set when projected on that dimension while 0 indicates otherwise. The contour mask is then AND-ed with the mask of the pre-image of the total variation contour, and the total number of 1s is counted. Note that no neighbors are pruned at this point; only the count of 1s is obtained. The process continues for all dimensions, and at the end of the process the counts are sorted in descending order. The second step of the pruning is to intersect each dimensional projection contour with the candidate set. The intersection proceeds in the order of the counts. The dimension with the highest count is intersected first, followed by the dimension with the second-highest count, and so forth. In each intersection, the number of neighbors in the candidate set is updated. From the implementation perspective, this intersection is simply a logical AND operation between the mask of the total variation contour and the mask of the dimensional projection contour. The second step of the pruning technique continues until a manageable size of neighbors is reached or all dimensional projection contours have been intersected with the candidate set. Fig. 4 illustrates the intersection between the pre-image of the total variation contour and the dimensional projection contours; a small illustrative sketch of the two-step pruning follows Fig. 4 below.
Fig. 4. Pruning the contour using dimensional projections.
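Below is a hedged sketch of the two-step dimensional projection pruning of step 4, with the candidate and projection contours represented as plain bit masks (Python integers) rather than real P-tree range-query masks; the names and the toy masks are invented for illustration.

# Sketch of the dimensional-projection pruning of step 4. Candidate and
# projection contours are plain bit masks (one bit per training object);
# in the real system these would be P-tree range-query masks. Names ours.

def popcount(x):
    return bin(x).count("1")

def prune(candidate_mask, projection_masks, ms):
    """First pass: count, per dimension, how many candidates survive that
    dimension's projection contour. Second pass: AND the projections into
    the candidate set, best dimension first, until <= ms candidates remain."""
    counts = [(popcount(candidate_mask & pm), dim)
              for dim, pm in enumerate(projection_masks)]
    counts.sort(reverse=True)                 # highest overlap first
    pruned = candidate_mask
    for _, dim in counts:
        if popcount(pruned) <= ms:
            break
        pruned &= projection_masks[dim]       # logical AND = intersection
    return pruned

# Toy example: 8 training objects, candidate set {0, 1, 2, 4, 5, 7}.
candidate = 0b10110111
projections = [0b10110011,   # dimension 0 contour around the query
               0b00010111]   # dimension 1 contour around the query
print(bin(prune(candidate, projections, ms=3)))   # -> 0b10011, three survivors

The design choice mirrored here is that pruning never touches the raw data: it is a sequence of AND and COUNT operations over masks, which is why the extra step stays cheap even on very large training sets.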
5. Find the k-nearest neighbors from the remaining neighbors by calculating the Euclidean distance, ∀x ∈ Nbrhd(a, ε):

d²(x, a) = Σ_{i=1}^{d} (x_i − a_i)²
6. Let the k nearest neighbors vote using a weighted vote w = e^(−d·d) to influence the vote of the k nearest neighbors. The weighting function is a Gaussian function that gives a nice drop-off based on the distance of the neighbors to the unclassified object, as shown in Fig. 5. The closer the neighbor, the higher the weight, and vice versa.

Fig. 5. Weighting function w = e^(−d·d). (The figure, omitted here, plots the weight, from 0 to 1, against the distance d, from 0 to 4.)
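A compact sketch of steps 5 and 6, assuming the pruned candidate set is already materialized as (vector, label) pairs: exact Euclidean k-nearest-neighbor search followed by the Gaussian-weighted vote w = e^(−d·d). This is an illustration with invented data, not the authors' implementation.

# Sketch of steps 5-6: exact k-NN search over the pruned candidate set and
# the Gaussian-weighted vote w = exp(-d*d). Not the authors' implementation.
import math
from collections import defaultdict

def classify(query, candidates, k=3):
    """candidates: list of (feature_vector, class_label) left after pruning."""
    scored = sorted(
        (math.dist(query, x), label) for x, label in candidates)[:k]
    votes = defaultdict(float)
    for d, label in scored:
        votes[label] += math.exp(-d * d)     # closer neighbors weigh more
    return max(votes, key=votes.get)

neighbors = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"),
             ((4.0, 4.2), "high"), ((3.9, 4.1), "high"), ((2.0, 3.0), "high")]
print(classify((1.1, 1.0), neighbors, k=3))   # -> "low"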
V. EXPERIMENTAL RESULTS

The experiments were conducted on an Intel Pentium 4 CPU 2.6 GHz machine with 3.8 GB RAM, running Red Hat Linux version 2.4.20-8smp. We tested the proposed classifier using several datasets that have been previously used for classification problems, including datasets from the Machine Learning Databases Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). We also incorporated real-life datasets, such as the Remotely Sensed Imagery (RSI) dataset, and a few benchmark datasets used to evaluate classification algorithms, such as the OPTICS and Iris datasets. The RSI dataset is a set of aerial photographs from the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA) near Oakes, North Dakota, taken in 1998. The images contain three bands: red, green, and blue. Each band has values in the range of 0 and 255, which in binary can be represented using 8 bits. The corresponding synchronized data for soil moisture, soil nitrate, and crop yield were also used, and the crop yield was selected as the class attribute. Combining all the bands and synchronized data, a dataset with 6 dimensions (five feature attributes and one class attribute) was obtained. To simulate different classes, the crop yield was divided into four different categories: low yield having intensity between 0 and 63, medium low yield having intensity between 64 and 127, medium high yield having intensity between 128 and 191, and high yield having intensity between 192 and 255. Three synthetic datasets were created to study the scalability and running time of the proposed classifier. The cardinality of these datasets is varied from 32 to 96 million. The KDDCUP 1999 dataset is the network intrusion dataset used in KDDCUP 1999 [10]. The dataset contains more than 4.8 million samples from the TCP dump. Each sample identifies a type of network intrusion such as Normal, IP Sweep, Neptune, Port Sweep, Satan, and Smurf. The Wisconsin Diagnostic Breast Cancer (WDBC) dataset contains 569 diagnosed breast cancers with 30 real-valued features. The task is to predict two types of diagnoses, either Benign or Malignant. The OPTICS dataset has eight different classes, containing 8,000 points in 2-dimensional space. The Iris dataset is the Iris plants dataset, created by R.A. Fisher [6]. The task is to classify Iris plants into one of the three varieties: Iris Setosa, Iris Versicolor, and Iris Virginica. The Iris dataset contains 150 samples in a 4-dimensional space.

A. Speed and Scalability Comparison

We compared the performance in terms of speed using only the RSI datasets. As mentioned previously, the cardinality of these datasets varies from 32 to 96 million. In this experiment, we compared the classification time of the enhanced SMART-TV (SMART-TV with P-tree) against SMART-TV with scan, P-KNN with the HOBBIT metric, and a brute-force KNN. Note that SMART-TV with scan is the original SMART-TV [1] and is slightly different from the enhanced SMART-TV. SMART-TV with scan does not have derived P-trees of the total variation values of the training set, and it determines the superset of neighbors in the total variation contour by scanning the total variation values of the training objects one by one. While scanning, the relative indexes of the objects in the contour are stored in an array. In this experiment, the same pruning technique was also applied to the original SMART-TV after the scanning process is completed. The candidate mask of the neighbors is created using the P-tree vertical structure API, and the parameter passed to the API is the array that holds the indexes of objects within the total variation contour. The results show that even though the scan was conducted on a single dimension containing the total variation values of the training objects, the original SMART-TV takes more time than the enhanced SMART-TV that uses the P-tree range query to determine the objects in the contour. Figs. 6 and 7 depict the classification time and scalability comparison of the algorithms. The speed was measured by observing the time required by the CPU to complete the classification process. We learned that the enhanced SMART-TV and P-KNN are very comparable in terms of speed. Both algorithms are faster than the other two algorithms. For example, P-KNN takes 12.37 seconds on average to classify using a dataset of size 96 million, while the enhanced SMART-TV takes 17.42 seconds. For the same dataset, SMART-TV with scan takes about 106.76 seconds to classify and KNN takes 891.58 seconds. KNN with sequential scan is almost three orders of magnitude slower than the enhanced SMART-TV and P-KNN algorithms. From this observation, it is clear that when the cardinality increases, the
classification time of KNN increases linearly. The algorithms that use the P-tree vertical data structure are much faster. SMART-TV with scan is faster than KNN because it only scans a single dimension and compares the functional values of the unclassified object with the functional values of the training objects. In addition, the pruning technique is also incorporated in SMART-TV with scan, whereas KNN has to scan and compute the Euclidean distance at the same time. However, SMART-TV with scan is still slower than the enhanced SMART-TV and P-KNN. The enhanced SMART-TV devotes much of its time to preprocessing, as discussed previously in the Preprocessing Phase section. The complexity to create P-trees and derived P-trees of the functional values is O(n) since the computation is applied to all objects in the training set. The enhanced SMART-TV also needs to compute a vector mean. The complexity to compute the mean is O(d·b), where d is the number of dimensions and b is the bit width of the P-trees. In the classification phase, the time is used to prune the superfluous neighbors in the candidate set. However, since the pruning is done using P-tree vertical masks and the AND bit operation, the process is very fast. This is also true for the other vertical algorithms such as P-KNN. The KNN classifier, on the other hand, does not build a model or structure in advance. It invests all the effort in the classification phase when new objects arrive. Much of its time is used to calculate the distance between the training and the unclassified objects. Therefore, the computational complexity for KNN to classify each new unclassified sample is O(n).

Fig. 6. Trend of classification time on RSI dataset. (The figure, omitted here, plots the classification time in seconds per sample against dataset cardinalities of 32, 64, and 96 million for SMART-TV with P-tree, SMART-TV with scan, P-KNN with HOBBIT, and KNN.)

Fig. 7. Classification time on KDDCUP dataset. (The figure, omitted here, compares the classification time in seconds per sample of the same four algorithms on the KDDCUP dataset.)

B. Classification Accuracy Comparison

We examined the classification accuracy (F-score) using the experimental datasets with a 5-fold cross validation model. We used ε = 0.1 and ms = 200, with k = 3, k = 5, and k = 7, and compared the enhanced SMART-TV with P-KNN using HOBBIT, P-KNN using Equal Interval Neighborhood (EIN-ring), and KNN with linear search. The results are summarized in Table 1. From the results, it is clear that the classification accuracy of the enhanced SMART-TV is very comparable to that of the KNN algorithm for all datasets. The performance of P-KNN using HOBBIT degrades significantly when classifying the dataset with high dimensions, as shown on the WDBC dataset.

TABLE 1. CLASSIFICATION ACCURACY (F-SCORE) COMPARISON ON VARIOUS DATASETS

Dataset | Enhanced SMART-TV | P-KNN with HOBBIT | P-KNN with EIN | KNN
KDDCUP  | 0.93              | 0.91              | n/a            | 0.93
OPTICS  | 0.99              | 0.99              | 0.99           | 0.99
IRIS    | 0.97              | 0.89              | 0.96           | 0.96
WDBC    | 0.94              | 0.39              | 0.96           | 0.97

VI. CONCLUSION

In this paper, we have presented an enhancement of the SMART-TV classifier by introducing dimensional projections pruning. The evaluations show that the dimensional projections can effectively prune the superfluous neighbors that are far away from the unclassified object, even though in terms of absolute difference of total variation their values can be small. In terms of speed and scalability, the enhanced SMART-TV algorithm is very fast and comparable to the other vertical nearest neighbor algorithms, and in terms of accuracy, the enhanced SMART-TV is very comparable to the KNN classifier.
REFERENCES

[1] Abidin, T., & Perrizo, W. (2006). SMART-TV: A fast and scalable nearest neighbor based classifier for data mining. Proceedings of the 21st Annual ACM Symposium on Applied Computing (pp. 536-540). Dijon, France.
[2] Abidin, T., Perera, A., Serazi, M., & Perrizo, W. (2006). A vertical approach to computing set squared distance. International Journal of Computers and their Applications (IJCA), 13(2), 94-102.
[3] Abidin, T., Dong, A., Li, H., & Perrizo, W. (2006). Efficient image classification on vertically decomposed data. Proceedings of the 1st IEEE International Workshop on Multimedia Databases and Data Management. Atlanta, Georgia: IEEE Computer Society.
[4] Ankerst, M., Breunig, M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. Proceedings of the ACM SIGMOD (pp. 49-60). Philadelphia, PA.
[5] Cover, T. M., & Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13, 21-27.
[6] Demsar, J., Zupan, B., & Leban, G. (2004). Orange: from experimental machine learning to interactive data mining. White paper (www.ailab.si/orange). Faculty of Computer and Information Science, University of Ljubljana.
[7] Ding, Q., Khan, M., Roy, A., & Perrizo, W. (2002). The P-tree algebra. Proceedings of the 17th Annual ACM Symposium on Applied Computing (pp. 426-431). Madrid, Spain.
[8] Gray, J. (2004). The next database revolution. Proceedings of the ACM SIGMOD (pp. 1-4). Paris, France.
[9] Grother, P. J., Candela, G. T., & Blue, J. L. (1997). Fast implementations of nearest neighbor classifiers. Pattern Recognition, 30, 459-465.
[10] Hettich, S., & Bay, S. D. (1999). The UCI KDD archive (http://kdd.ics.uci.edu). Irvine, CA: University of California, Department of Information and Computer Science.
[11] Khan, M., Ding, Q., & Perrizo, W. (2002). K-nearest neighbor classification on spatial data streams using P-trees. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 517-528). Taipei, Taiwan.
[12] Perera, A., Abidin, T., Serazi, M., Hamer, G., & Perrizo, W. (2005). Vertical set squared distance based clustering without prior knowledge of k. Proceedings of the 14th International Conference on Intelligent and Adaptive Systems and Software Engineering (pp. 72-77). Toronto, Canada.
[13] Perrizo, W. (2001). Peano count tree technology lab notes (Tech. Rep. NDSU-CS-TR-01-1). Computer Science Department, North Dakota State University.
[14] Rahal, I., Ren, D., & Perrizo, W. (2004). A scalable vertical model for mining association rules. Journal of Information & Knowledge Management (JIKM), 3(4), 317-329.
[15] Ren, D., Wang, B., & Perrizo, W. (2004). RDF: a density-based outlier detection method using vertical data representation. Proceedings of the 4th IEEE International Conference on Data Mining (pp. 503-506).
[16] Vries, A. P., Mamoulis, N., Nes, N., & Kersten, M. (2002). Efficient k-NN search on vertically decomposed data. Proceedings of the ACM SIGMOD (pp. 322-333). Madison, Wisconsin, USA.
Taufik F. Abidin received his B.Sc from the Sepuluh Nopember Institute of Technology, Indonesia in 1993. Upon completing his B.Sc degree, he joined Syiah Kuala University in Banda Aceh, Indonesia. He received his M.Tech degree in Computer Science from RMIT University, Australia in 2000, and completed his Ph.D. in Computer Science at North Dakota State University (NDSU), USA in 2006. He received the ND EPSCoR Doctoral Dissertation Award from NDSU in 2005. He has been a Senior Software Engineer at Ask.com in New Jersey, developing algorithms and implementing efficient production-level programs to improve web search results. His research interests include Data Mining, Database Systems, and Information Retrieval.

William Perrizo is a Professor of Computer Science at North Dakota State University. He holds a Ph.D. degree from the University of Minnesota, a Master’s degree from the University of Wisconsin and a Bachelor’s degree from St. John's University. He has been a Research Scientist at the IBM Advanced Business Systems Division and the U.S. Air Force Electronic Systems Division. His areas of expertise are Data Mining, Knowledge Discovery, Database Systems, Distributed Database Systems, High Speed Computer and Communications Networks, Precision Agriculture and Bioinformatics. He is a member of ISCA, ACM, IEEE, IAAA, and AAAS.
Using Business Intelligence Solutions for Achieving Organization’s Strategy: Arab International University Case Study

Mouhib Alnoukari
Arab International University, Damascus, Syria
Abstract—Business Intelligence (BI) is becoming an important IT framework that can help organizations manage, develop and communicate their intangible assets, such as information and knowledge. Thus it can be considered an imperative framework in the current knowledge-based economy arena. In this paper, we explain the role BI plays in providing organizations with a way to plan and achieve their business strategy. We examine this role using a case study in the field of higher education, in particular helping one of the new private universities in Syria (the Arab International University) plan and achieve its business strategy.

Index Terms—Business Intelligence, Data Mining, Strategic Management.
I. INTRODUCTION
Business Intelligence is becoming vital for many organizations, especially those that have extremely large amounts of data. Decision makers depend on detailed and accurate information when they have to make decisions. BI can provide decision makers with such accurate information, and with the appropriate tools for data analysis. BI is an umbrella term that combines architectures, tools, databases, applications, practices, and methodologies [20, 6]. The Gartner Group (1996), the first company to use the term BI in the market in the mid-1990s, defined BI as “information and applications available broadly to employees, consultants, customers, suppliers, and the public. The key to thriving in a competitive marketplace is staying ahead of the competition. Making sound business decisions based on accurate and current information takes more than intuition. Data analysis, reporting, and query tools can help business users dig in the mine of data to extract and/or synthesize valuable information from it – today these tools collectively fall into a category called Business Intelligence” [9]. Many organizations that developed successful BI solutions, such as Continental Airlines, have seen investment in BI generate increases in revenue and cost savings equivalent to a 1000% return on investment (ROI) [22].

Manuscript received September 30, 2009. Mouhib Alnoukari is with the Arab International University, Damascus, Syria (phone: +963-015-2050; fax: +963-015-860385; e-mail: [email protected]).

An important question raised by many researchers [16, 18] is what was the main reason pushing companies to search for BI solutions, and what differentiates BI from Decision Support System (DSS) systems? In fact, over the last decades organizations developed a lot of Operational Information Systems (OIS), resulting in a huge amount of disparate data located in different geographic locations, on different storage platforms, and in different forms. This situation prevents organizations from building common, integrated, correlated, and immediate access to information at the global level. DSS evolved during the 1970s, with the objective of providing an organization's decision makers with the required data to support the decision-making process. In the 1980s, Executive Information Systems (EIS) evolved to provide executive officers with the information needed to support the strategic decision-making process. BI evolved during the 1990s as data-driven DSS, sharing some of the objectives and tools of DSS and EIS systems. BI architectures include data warehousing, business analytics, business performance management, and data mining. Most BI solutions deal with structured data [1]. However, many application domains require the use of unstructured data (or at least semi-structured data), e.g. customer e-mails, web pages, competitor information, sales reports, research paper repositories, and so on [4, 21].

Any BI solution can be divided into the following three layers [1]: the data layer, which is responsible for storing structured and unstructured data for decision support purposes. Structured data is usually stored in Operational Data Stores (ODS), Data Warehouses (DW), and Data Marts (DM). Unstructured data are handled by using Content and Document Management Systems. Data are extracted from operational data sources, e.g. SCM, ERP, CRM, or from external data sources, e.g. market research data; they are transformed and loaded into the DW by ETL tools. The logic layer provides functionality to analyze data and provide knowledge; this includes OLAP and data mining. Finally, the access layer is realized by some sort of software portal (BI portal).

Our main focus in this paper is to explain the role of a BI solution in helping organizations formulate, implement, and achieve their strategy.
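As a toy illustration of the three layers just described (and not the system deployed at the Arab International University), the sketch below loads a few records into a miniature data mart, performs a simple OLAP-style aggregation in the logic layer, and prints what an access-layer portal widget might display; all table and column names are invented.

# Tiny illustration of the three BI layers (data -> logic -> access).
# Table and column names are invented; this is not the AIU system.

# Data layer: records extracted from an operational source and loaded
# into a small "data mart" structure (here, just a list of dicts).
data_mart = [
    {"term": "2008-1", "faculty": "Informatics", "enrolled": 120, "passed": 96},
    {"term": "2008-1", "faculty": "Business",    "enrolled": 150, "passed": 105},
    {"term": "2008-2", "faculty": "Informatics", "enrolled": 130, "passed": 110},
]

# Logic layer: a simple OLAP-style aggregation (pass rate by a dimension).
def pass_rate_by(dimension):
    totals = {}
    for row in data_mart:
        key = row[dimension]
        enrolled, passed = totals.get(key, (0, 0))
        totals[key] = (enrolled + row["enrolled"], passed + row["passed"])
    return {k: passed / enrolled for k, (enrolled, passed) in totals.items()}

# Access layer: what a BI portal widget would render for decision makers.
for faculty, rate in pass_rate_by("faculty").items():
    print(f"{faculty}: {rate:.0%} pass rate")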
implementing, and achieving their strategy. Many researchers [5, 17, 10, 12] were highlighting the IT alignment in general, with business, without specifying what are the technologies, and tools that can help organizations in achieving their strategy.
Fig. 1. IT alignment with Business Strategy [5].

The rest of this paper is organized as follows. The next section explains the role BI plays as an IT enabler in achieving an organization's strategy; this role is highlighted using the strategic alignment model proposed by Henderson and Venkatraman (1993), explaining how this alignment can help organizations become flexible organizations and concluding with how a BI solution can provide organizations with sustainable competitive advantages. The role of BI as a strategic solution is then examined through a case study in the higher education field (Arab International University), explaining how the solution implemented at this university helped achieve its main strategic goal: improving quality in its higher education system.

II. BUSINESS INTELLIGENCE AS AN IT ENABLER TO ACHIEVE ORGANIZATION'S STRATEGY

In recent years, IT in general, and BI as a strategic framework in particular, has become increasingly important in strategic management, supporting business strategies. IT-enabled strategic management addresses the role of IT in the strategy formulation and implementation processes [19]. Drucker, the pioneer of management by objectives, was one of the first to recognize the dramatic changes IT brought to management. Strategic management theories were largely geared towards gaining competitive advantages. Porter (1979) proposed a number of very influential strategic analysis models, such as the five-forces model of competition, the value chain, and generic competitive strategies. Porter (1979) said, "The essence of strategy formulation is coping with competition" [14]. Many researchers have indicated the importance of IT alignment with business strategy in order to enhance corporate strategy [5, 17] (Figure 1). The Strategic Alignment Model developed by Henderson and Venkatraman (1993) was one of the first models to describe in an explicit way the interrelationships between business strategies and IT strategies [10]. This model is based on two main concepts (Figure 2): strategic fit, which recognizes the need to position the firm in an external marketplace where growth can take place, and functional integration, which addresses how best to structure internal systems to execute the business strategy of the firm [12].
Fig. 2. Strategic Alignment Model
IT alignment is not simply a matter of formulating an IT strategy to fit the business strategy; it also has to consider external forces and environmental uncertainty. Such alignment helps organizations become flexible organizations. As a result of accelerating rates of innovation and technological change, markets evolve rapidly, product life cycles get shorter, and innovation becomes the main source of competitive advantage; organizations therefore seek flexibility to meet market demands. Drnevich et al. (2006) explained that flexibility-based perspectives evolved from Schumpeter's concept of creative destruction [8]. These perspectives are operationalized in strategic management through the dynamic capabilities and real options views. The dynamic capabilities view refers to a firm's ability to maintain and adapt its internal resources to environmental changes in order to sustain its competitive advantages; it refers to the capability of acquiring new sources of competitive advantage and involves continuous search, innovation, and adaptation of firm resources and capabilities to uncover and tap new sources of competitive advantage. The real options view is effective in dealing with issues of uncertainty, as it allows the firm to defer investment decisions until uncertainties are resolved. The adoption of new IT by organizations facilitates the transition into flexible organizations, and Business Intelligence is one of the new IT frameworks that can help in such a transition: BI technologies become a source of competitive advantage and differentiation [13, 11]. Tang and Walters (2006) note that competitive advantage has become a hot strategic management topic [19]; they also argue that continuously generating new knowledge is the only way to obtain competitive advantage. There are many reasons for an organization to adopt business intelligence systems in order to achieve the organization's
strategy:
• Business Intelligence is considered an extension of corporate strategy activities. Herring (1988) considered that "Strategy can be no better than the information from which it is derived" [11].
• Data analytics can be used effectively to build future business strategy.
• Data analytics and data mining can reveal hidden reasons for some deficiencies as well as possible high-yielding new investments.
• Corporations need to be sure that they are receiving the right information related to their long-term strategy.

Herring (1988) considered that business intelligence can help organizations in [11]:
• Supporting the strategic decision-making process of the corporation.
• Supporting the corporation's SWOT analysis.
• Supporting strategic planning and processes.

All of the above benefits should provide organizations with sustainable competitive advantages.

III. ARAB INTERNATIONAL UNIVERSITY BI SOLUTION

Arab International University (AIU) is a new private university in Syria. Being a three-year-old university, it began to find difficulties in managing the huge amount of data produced by its different information systems. The academic system, financial system, HR system, and Quality Assurance Automated System (QAAS) are at the core of the university's daily operations [3]. As most of the university information systems were supplied from different sources, there was an urgent need to integrate data from all these sources into one data warehouse in a manner that could help the university make use of all the data to assure quality. The data gathered in order to build the university BI solution and help enhance education quality are:
• Academic data (registration, examination, enrollment, etc.).
• Financial data (student fees, staff salaries, orders, sales, etc.).
• Human resources data (staff personal information).
• QAAS data (student feedback, GPA differences, drop ratio, plan completion, industry feedback, etc.).

An enterprise data warehouse was built to hold all of the above data. The data came from different kinds of sources (Oracle databases, SQL Server databases, and Excel files), and the data warehouse was built using an Oracle 11g database. The ETL process was built using the ETL package in Oracle Warehouse Builder. The solution, developed by AIU developers, followed the ASD-DM paradigm for BI solutions, which suggests three agile modeling steps (specifically Adaptive Software Development, ASD) to develop a successful BI
model [2]. The BI solution provides the university's top management with reports that uncover trends in student and instructor performance in a manner that would be impossible, or at least extremely tedious, without data warehouses and data mining.
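As an illustration of the integration step described above, the following is a minimal sketch of a generic extract-transform-load pass in Python. It is only an assumption-laden stand-in for the Oracle Warehouse Builder package actually used at AIU: the connection strings, table names, and columns are invented for the example.

    # Hedged sketch of a generic ETL step (not the AIU/OWB implementation).
    # Connection strings, table names, and columns are hypothetical.
    import pandas as pd
    from sqlalchemy import create_engine

    academic = create_engine("oracle+cx_oracle://user:pwd@academic-db")   # hypothetical source
    warehouse = create_engine("oracle+cx_oracle://user:pwd@dw")           # hypothetical target

    def load_student_facts():
        # Extract: pull registration records from the operational source
        students = pd.read_sql(
            "SELECT student_id, faculty, agpa, credit_hours FROM registration", academic)
        # Transform: basic cleansing and conformance to the warehouse schema
        students = students.dropna(subset=["student_id"])
        students["faculty"] = students["faculty"].str.strip().str.upper()
        # Load: append the conformed rows to a warehouse fact table
        students.to_sql("student_fact", warehouse, if_exists="append", index=False)

    if __name__ == "__main__":
        load_student_facts()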
Fig. 3. AIU students clustered according to their accumulative GPA & their credit hours.
One of the main strategic goals of AIU is to enhance students' GPAs. Figure 3 shows the correlation between students' cumulative GPAs and their credit hours: students with more accumulated credit hours achieve higher cumulative GPAs. It was also very clear that there is a strong correlation between students' level of English and their cumulative GPA, as shown in Figure 4.

Fig. 4. Relationship between AIU students' accumulative GPA & their English level.

Clustering each faculty's students according to their cumulative GPAs and their completed hours helps the university's academic advisors focus on special groups, especially the group of students that are likely to drop out (Figure 5). The correlation between credit hours and AGPA changes gives a clear picture of the optimal number of credit hours students should take to increase their AGPA; Figure 5 shows that this ranges between 2-12 and 20-22 hours. This provides the AIU decision makers with grounds to find out how to help students enhance their AGPAs while getting the required credit hours in each
semester (which varies between 16 and 19 hours). These results helped AIU update its education system to push students to improve their English level by adding more English teaching hours in the early phases. The AIU BI solution is able to provide AIU managers with a set of reports that can help them evaluate the university's current strategy. The AIU BI solution's reporting includes:
• Predicting students' GPAs. Different algorithms were used for prediction, and evaluating the results of each algorithm permitted choosing the method with the highest predictive confidence; the SVA algorithm was chosen, with more than 70% predictive confidence (Figure 6 shows the GPA prediction deviation errors). This also helps predict the average GPA per faculty, which helps AIU prepare plans to enhance overall performance.
• Analysis of the total number of students present in different time ranges per day, which helps AIU achieve one of its strategic goals by enhancing the services provided to its students and optimizing costs (Figure 8). Such analysis helps AIU prepare optimal plans for transportation, restaurants, library services, and other facilities.
• Identifying students likely to drop out, using the students' predicted GPA results (Figure 6).
• Classification of students' results according to different subjects (Figure 4).
• Market basket analysis reports that help in preparing the timetable for each semester. The resulting timetable contains sets of highly interrelated courses that students require: two courses with a high association correlation should not overlap in the timetable, and such courses should be enrolled in the same semester rather than postponed for one or two semesters. This analysis helps achieve one of the main AIU strategic goals by increasing the total number of enrolled courses (Figure 7), which has an immediate positive effect on revenue.

Fig. 5. Credit hours relation with the AGPA changes.

Fig. 6. GPA prediction deviation errors chart.

Fig. 7. Currently enrolled university courses.

Fig. 8. AIU students' attendance in different time ranges per day.
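Several of the reports above (Figures 3 and 5) rest on clustering students by cumulative GPA and completed credit hours. A minimal sketch of how such a clustering could be produced is shown below; it assumes scikit-learn and an invented list of student records, and is not the actual AIU implementation.

    # Hedged sketch: cluster students by (cumulative GPA, completed credit hours).
    # The records and the choice of four clusters are illustrative assumptions only.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # hypothetical records: (AGPA, completed credit hours)
    students = np.array([
        [3.6, 120], [2.1, 40], [2.8, 75], [1.7, 30],
        [3.2, 100], [2.4, 55], [3.8, 130], [1.9, 25],
    ])

    # put both features on a comparable scale before clustering
    X = StandardScaler().fit_transform(students)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    for (agpa, hours), label in zip(students, labels):
        print(f"AGPA={agpa:.1f}, hours={hours:.0f} -> cluster {label}")

Advisors could then concentrate on the cluster with low AGPA and few completed hours, which the case study identifies as the group most at risk of dropping out.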
IV. CONCLUSION

In this paper, we explained the use of a BI solution in formulating, implementing, and achieving an organization's strategy. We also demonstrated how a BI solution can provide organizations with sustainable competitive advantages. This work can be extended by integrating knowledge management with BI solutions, as this can help derive more
value (knowledge) from the explosion of textual information, which can add more inputs to strategic decisions. Another important factor is the use of agile methodologies in order to manage the current high-speed, high-change environment; such a complex and dynamic environment strongly affects an organization's strategy.
REFERENCES
[1] M. Alnoukari and W. Alhussan, "Using Data Mining Techniques for Predicting Future Car Market Demand", International Conference on Information & Communication Technologies: From Theory to Applications, IEEE Conference, Syria, 2008.
[2] M. Alnoukari, Z. Alzoabi, and S. Hanna, "Applying Adaptive Software Development (ASD) Agile Modeling on Predictive Data Mining Applications: ASD-DM Methodology", International Symposium on Information Technology, ITSim 08, Malaysia, 2008.
[3] Z. Alzoabi, F. Diko, and M. Alnoukari, "Enhancing Education Quality Assurance Using Information Systems: QAAS System", International Symposium on Information Technology, ITSim 08, Malaysia, 2008.
[4] H. Baars and H. G. Kemper, "Management Support with Structured and Unstructured Data: An Integrated Business Intelligence Framework", Information Systems Management, 25, pp. 132-148, 2007.
[5] D. Boddy, A. Boonstra, and G. Kennedy, Managing Information Systems: An Organisational Perspective, 2nd edition, Harlow: Pearson, 2005.
[6] F. Cody, J. T. Kreulen, V. Krishna, and W. S. Spangler, "The Integration of Business Intelligence and Knowledge Management", Systems Journal, Vol. 41, No. 4, pp. 697-713.
[7] T. Corbitt, "Business Intelligence and Data Mining", Management Services Magazine, November 2003.
[8] P. Drnevich, J. Hahn, and M. Shanley, "Toward a Strategic Perspective of Information Technology", IT-Enabled Strategic Management, Idea Group Publishing, pp. 16-37, 2006.
[9] Gartner Group (September 1996). Retrieved November 12, 2005, from http://www.innerworx.co.za/products.htm.
[10] W. V. Grembergen, S. D. Haes, and E. Guldentops, "Structures, Processes and Relational Mechanisms for IT Governance", Strategies for Information Technology Governance, W. Van Grembergen (ed.), Idea Group Publishing, 2004.
[11] J. P. Herring, "Building a Business Intelligence System", Journal of Business Strategy, May/June 1988.
[12] J. D. Katz, "The Integral Role of Information Technology in Achieving Business Strategy Success: Managing the Information Resources of Global Competitors", Advanced Topics in Global Information Management, F. B. Tan (ed.), Idea Group Publishing, pp. 42-62, 2002.
[13] M. Pérez-Valls, J. M. Ortega-Egea, and J. A. Plaza-Úbeda, "Relationship between New Information Technologies and Flexible Organizational Forms", IT-Enabled Strategic Management, Idea Group Publishing, pp. 1-16, 2006.
[14] M. E. Porter, "How Competitive Forces Shape Strategy", Harvard Business Review 57, April/May 1979, pp. 137-145.
[15] M. E. Porter, Competitive Strategy: Techniques for Analyzing Industries and Competitors, New York: Free Press, 1980.
[16] D. J. Power, "A Brief History of Decision Support Systems", DSSResources.COM, http://DSSResources.COM/history/dsshistory.html, version 4.0, March 10, 2007.
[17] R. Sabherwal and Y. E. Chan, "Alignment Between Business and IS Strategies: A Study of Prospectors, Analyzers, and Defenders", Information Systems Research, 12(1), pp. 11-33, 2001.
[18] M. Shariat and R. Hightower Jr., "Conceptualizing Business Intelligence Architecture", Marketing Management Journal, Volume 17, Issue 2, pp. 40-46, 2007.
[19] Z. Tang and B. Walters, "The Interplay of Strategic Management and Information Technology", IT-Enabled Strategic Management, Idea Group Publishing, pp. 1-16, 2006.
[20] E. Turban, J. E. Aronson, T. P. Liang, and R. Sharda, Decision Support and Business Intelligence Systems, 8th edition, Pearson Prentice Hall, 2007.
[21] C. H. Wee and M. L. Leow, "Competitive Business Intelligence in Singapore", Journal of Strategic Marketing 1, pp. 112-139, 1994.
[22] J. Zack, R. K. Rainer, and T. E. Marshall, "Business Intelligence: An Analysis of the Literature", Information Systems Management, 25, pp. 121-131, 2007.
Feature Selection for Large Scale Data by Combining Class Association Rule Mining and Information Gain: a Hybrid Approach
Appavu alias Balamurugan, Pramala, Rajalakshmi and Rajaram
Department of Information Technology, Thiagarajar College of Engineering, Madurai, India

Abstract— Feature selection is a fundamental problem in data mining: the task is to select relevant features and cast away irrelevant and redundant features from an original feature set based on some evaluation criterion. In this paper, we propose a filter method that finds attributes associated with the class and ranks them using information gain. Highly associated features are found using class association rule mining, and information gain is used to remove redundancy and for further pruning. The efficiency and effectiveness of our method are demonstrated through extensive comparisons with other methods using real-world data of high dimensionality.

Index Terms— Association mining, Data mining, Feature selection, Information gain
I. INTRODUCTION
Data mining deals with the discovery of hidden knowledge, unexpected patterns and new rules from large databases. It often involves datasets with a large number of attributes, many of which are redundant or simply irrelevant to the purpose of discovering interesting patterns. Attribute reduction, also known as feature selection, selects relevant attributes in the dataset prior to performing data mining. High dimensional data (i.e., data sets with hundreds or thousands of features) can contain a high degree of irrelevant and redundant information, which may greatly degrade the performance of learning algorithms; feature selection therefore becomes necessary for machine learning tasks that face high dimensional data, and the dimensionality of the datasets has to be reduced. Selection of features is one of the most important tasks in designing a good classifier. Before the learning phase it is often said that as many features as possible should be collected to improve performance, yet irrelevant and correlated features can degrade the performance of the classifier, and a large number of features may yield complex models that are difficult to interpret. This work aims to enhance feature selection methodology and, correspondingly, the accuracy and performance of the learning algorithm.
II. RELATED WORK

Several feature selection techniques have been proposed in the literature, including important surveys of feature selection algorithms such as Molina et al. [2] and Guyon and Elisseeff [3]. Feature selection methods can be categorized into the wrapper [4], filter [5-9] and embedded models [1, 5]. The wrapper model uses the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subset; its main drawback is its higher computational cost [4]. The filter model separates feature selection from classifier learning and selects feature subsets that are independent of any learning algorithm, relying on various measures of the general characteristics of the training data such as distance, information, dependency and consistency [15]. Filter methods are found to run faster than wrapper methods and are therefore widely used to deal with high dimensional datasets. Well-known filter techniques include Correlation-based Feature Selection (CFS), Principal Component Analysis (PCA), and Gain Ratio attribute evaluation (GR) [17]. Most of these methods are usually combined with some other method to find the appropriate number of attributes [18, 19]. There are also proposals on how to find optimal feature selection algorithms; attribute selection based on entropy metrics has been discussed comprehensively [10, 11, 12], and several new feature selection methods have been derived from entropy metrics such as symmetrical uncertainty, information gain and gain ratio [13], and from mutual information [14]. The concept of two-dimensional discriminant rules initially grew out of association rule generation, with support for classification and regression [7]; the core of the operation lies in x-monotone region optimization. Correlation for numeric and discrete classes has been discussed in detail, where accuracy is considered for single attributes, and the need to accommodate pairwise attributes as a two-dimensional feature has been applied as a pattern [18]. Other recent research also explores pairwise attribute selection from the perspective of noise detection [19].
III. PROBLEM DEFINITION
The enormous number of instances causes serious problems for many machine learning algorithms. High dimensional data (i.e., data sets with hundreds or thousands of features) can contain a high degree of irrelevant and redundant information which may greatly degrade the performance of learning algorithms, and in general any learning algorithm faces the problem of selecting a subset of features upon which attention must be focused. Feature selection therefore becomes necessary. In this paper, feature selection is performed using class association rules combined with information gain. Our proposed method filters out attributes that are irrelevant to the target class by generating rules according to class-based association rule mining [20] and then selecting features with adequate support and confidence values; redundancy among the selected features is then removed using the information gain algorithm.

IV. PROPOSED METHODOLOGY

In this work, we propose a filter approach to select relevant features based on the associations among them. Such associations can be discovered using class-based association rule mining. The proposed associative feature selection approach is based on the heuristics discussed above to separate relevant and irrelevant terms: the occurrence of a term in many association rules means that it is associated with many other terms, and such terms should be assigned a high score and considered relevant; on the contrary, terms that occur infrequently in association rules should be assigned the low score of irrelevant terms. The assignment of scores to features comprises the following three steps: (1) we first determine the constraint on the association rules; (2) we search for association rules satisfying the constraint; (3) we further reduce the features by setting a threshold and removing redundant features using information gain as the ranking criterion.

In the first step, we determine the constraint F for the association rules, F: ARs → Boolean, where ARs is the set of all association rules on the set of terms. Conventionally, the constraint for association rules is that the values of support and confidence should be greater than the minimum thresholds min_supp and min_conf; the constraint F can also include other measures for mining association rules. The second step searches for association rules that satisfy the constraint F. The typical approach for searching association rules is the Apriori algorithm [16]: the support measure is first used to filter out most of the rules that do not satisfy min_supp, and the confidence measure (or other measures) is then applied to the remaining rules to check min_conf. Finally, the threshold is set for the selected features by retrieving the percentage of selected features from the set of original features. Redundancy is also removed from those
selected features that have similar rankings. Pruning is thus carried out in two stages: the first prunes low-scoring features based on the confidence and support values, and the second prunes redundant attributes among those that pass the threshold.

PROPOSED ALGORITHM

Input: S(a1, a2, a3, ..., aN, C) // a training dataset with N features, M instances, and S classes
       min_sup, min_conf // minimum support and confidence
Output: Sbest // selected features

Find relevant features:
begin
  Ck : candidate item set of size k
  Lk : frequent item set of size k
  L1 : frequent item set of size 1
  C1 : class label
  L2 : combine L1 and C1 and check against min_sup and min_conf
  for k = 2 while Lk != null do begin
    Lk+1 : combine Lk with Lk
    Ck+1 : candidates generated from Lk+1
    for each transaction t in database D do begin
      increment the count of every candidate in Ck+1 that is also contained in t
      Lk+1 : candidates in Ck+1 satisfying min_sup and min_conf
    end
  end
  Slist : features selected from the generated rules
  selected percentage = (selected features / total features) * 100

Remove redundant features:
  find the gain value of each feature in Slist using the information gain algorithm:
    Info(D) = - sum_{i=1..m} p_i log2(p_i)        // m = number of distinct classes
    for each feature A with v distinct values
      Info_A(D) = sum_{j=1..v} (|D_j| / |D|) * Info(D_j)
      Gain(A) = Info(D) - Info_A(D)
    end
  if two or more features have the same gain value
    remove the redundant features, leaving only one of them
  end
  number of features to select = (selected percentage / 100) * selected features
  Sbest : final selected features, i.e. the "number of features to select" features taken in order of information gain value
end
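To make the two pruning stages concrete, the following is a small, self-contained Python sketch of the same idea. It is not the authors' implementation: the association stage is simplified to single-feature rules of the form (attribute = value) → class, and all function names and thresholds are hypothetical.

    # Hedged sketch of the hybrid approach: class association rules for relevance,
    # information gain for redundancy removal. Simplified to one-item antecedents.
    from collections import Counter
    from math import log2

    def relevant_by_class_association(rows, labels, min_sup=0.1, min_conf=0.6):
        """Keep a feature if at least one rule (feature = value) -> class
        meets the minimum support and confidence thresholds."""
        n = len(rows)
        selected = set()
        for feature in rows[0]:                      # rows are dicts: feature -> value
            value_counts = Counter(r[feature] for r in rows)
            pair_counts = Counter((r[feature], c) for r, c in zip(rows, labels))
            for (value, _cls), count in pair_counts.items():
                support = count / n
                confidence = count / value_counts[value]
                if support >= min_sup and confidence >= min_conf:
                    selected.add(feature)
        return selected

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, feature):
        n = len(labels)
        groups = {}
        for r, c in zip(rows, labels):
            groups.setdefault(r[feature], []).append(c)
        conditional = sum(len(g) / n * entropy(g) for g in groups.values())
        return entropy(labels) - conditional

    def hybrid_select(rows, labels, min_sup=0.1, min_conf=0.6):
        relevant = relevant_by_class_association(rows, labels, min_sup, min_conf)
        # rank by information gain; drop features whose gain duplicates one already kept
        ranked = sorted(relevant, key=lambda f: info_gain(rows, labels, f), reverse=True)
        kept, seen = [], set()
        for f in ranked:
            g = round(info_gain(rows, labels, f), 6)
            if g not in seen:
                seen.add(g)
                kept.append(f)
        return kept

    # toy usage with an invented dataset
    rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
            {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
    labels = ["play", "stay", "play", "stay"]
    print(hybrid_select(rows, labels))    # only "outlook" survives both stages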
Let the dataset have N features, M instances, and S target classes. At the end of the proposed algorithm, association rules have been generated, from which we select the corresponding features. To remove redundancy from the relevant features we use information gain: the gain value of each attribute with respect to the class label is calculated, and if two or more features have the same information gain value this is treated as redundancy and the redundant features are removed. Sbest is the final selection, based on the percentage calculation. The proposed algorithm is based on the following two concepts, namely class-based and correlation-based
association rule mining.

A. Class Based Association

Association rule mining has gained a great deal of attention. Formally, an association rule R is an implication X → Y, where X and Y are sets of items in a given dataset. The confidence of the rule, confidence(R), is the percentage of records that contain Y among the records containing X; the support of the rule, support(R), is the percentage of records containing both X and Y with respect to the total number of records. Let P[S] be the probability that an itemset S is present in a transaction of the database; P[S] can be taken as the support of the itemset S as defined above (also denoted support(S)). The antecedent support of rule R is denoted by P[X] and the consequent support by P[Y]. Assume that the database contains N records, and that the numbers of records containing X, Y, and both X and Y are a, b, and c respectively. From the definitions of support and confidence it follows that support(R) = c/N and confidence(R) = c/a, and from the definition of itemset support that P[X] = a/N, P[Y] = b/N and P[X ∧ Y] = c/N. Thus, the values of supp(R) and conf(R) can be computed using P[X ∧ Y] and P[X] as follows:
Support(R) = P[X ∧ Y]                                    (1)

Confidence(R) = P[X ∧ Y] / P[X]                          (2)
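As a small numerical illustration of these two definitions (with invented figures): in a database of N = 10 records where the antecedent X occurs in a = 4 records and X and Y occur together in c = 2 records, support(R) = P[X ∧ Y] = 2/10 = 0.2 and confidence(R) = 0.2 / (4/10) = 0.5, so the rule would survive thresholds of min_supp = 0.1 and min_conf = 0.5 but would be pruned at min_conf = 0.6.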
The confidence of a rule R measures the strength of the implication from the antecedent (X) to the consequent (Y), which is the actual interestingness of the rule: it indicates how well Y is predicted when X occurs. Association rules were originally proposed for resolving market basket problems on transactional data. However, when all rule consequents are constrained to be class labels, association rules become class (or constraint) association rules and can be used for classification purposes.

B. Information Gain

In our solution we adopt an approach based on the information-theoretical concept of entropy, a measure of the uncertainty of a random variable. The entropy of a variable X is defined as

H(X) = - sum_i P(x_i) log2 P(x_i)                                    (3)

and the entropy of X after observing values of another variable Y is defined as

H(X|Y) = - sum_j P(y_j) sum_i P(x_i|y_j) log2 P(x_i|y_j)             (4)

where P(x_i) is the prior probability of each value of X, and P(x_i|y_j) is the posterior probability of X given the values of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the information gain, given by

IG(X|Y) = H(X) - H(X|Y)                                              (5)
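As a small numerical illustration (again with invented figures): for a binary class variable X with P(x1) = 0.6 and P(x2) = 0.4, equation (3) gives H(X) = -(0.6 log2 0.6 + 0.4 log2 0.4) ≈ 0.971. If observing a feature Y reduces this uncertainty to H(X|Y) = 0.55 according to equation (4), then equation (5) gives an information gain of IG(X|Y) ≈ 0.971 - 0.55 = 0.42.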
According to this measure, a feature Y is regarded as more correlated to feature X than feature Z is, if IG(X|Y) > IG(Z|Y). Symmetry is a desired property for a measure of correlation between features. However, information gain is biased in favor of features with more values, and the values have to be normalized to ensure that they are comparable and have the same effect. If two or more features have the same value of information gain, all but one are removed from the dataset to reduce redundancy.

V. IMPLEMENTATION OF THE PROPOSED ALGORITHM

The proposed algorithm is implemented in Java using NetBeans. The Oracle ODBC driver is first created, and then 'Oracle ODBC driver' is selected in the 'Create New Data Source' window; the 'Data Source Name', 'Description' and 'User ID' properties are entered in the 'ODBC Setup' window. The user interface (GUI) is designed to be simple and user friendly and consists of several buttons. Button 1, 'Read Input from Arff Format', opens a file chooser to select an ARFF file and loads the selected input file into Oracle. Button 2, 'Features based on Class Association', opens a new window to obtain the support and confidence values, after which the dataset undergoes attribute reduction using class association rule mining, which selects the relevant features subject to the minimum support and confidence values. Button 3, 'Features based on Info Gain', takes the features selected via Button 2 and applies a further attribute reduction using information gain, which removes redundant attributes by calculating the gain values. Button 4, 'Display Features', displays the features selected, and Button 5 drops the tables. Links are provided for each button so that each option opens in a separate window, making the tool easy to understand and access.

VI. PERFORMANCE EVALUATION

In order to evaluate the efficiency and performance of our proposed methodology, we conducted tests on various datasets. Table 1 lists the datasets chosen for evaluation along with the number of features and instances in each domain. These domains were then subjected to various existing feature selection methods; the results obtained from the existing methods and from our proposed methodology are tabulated in Table 2. The domains with the reduced numbers of features were then evaluated with classifiers such as J48 and Naïve Bayes, and their performance measures are shown in Table 3 and Table 4 respectively. It is seen from the tables that our proposed method selects fewer features than the other methods, so the time taken for classification is reduced, consequently increasing the efficiency of the classification algorithms. Table 5 gives the performance on the full set of features. Figure 1 depicts the improvement in accuracy when NB is applied to the sets of features selected by each feature selection algorithm. The improvement in accuracy when
Information Gain, association rule mining and NB are applied sequentially to a set of selected features is shown in Figure 2. A comparative analysis of the accuracies obtained by applying C4.5 and NB after information gain and ARM, over a set of selected features, is shown in Figure 3. From the above observations, we show the following results.
1. A reduced (or equal) number of selected features when compared to other methods in datasets such as Vote, Spectheart, Shuttle, Monk, Hayes Roth, Car evaluation, Contact lenses, Postoperative, Weather, and Parity (refer to Table 2).
2. Improved (or equal) performance when compared to the performance of other feature selection methods:
   A. With the J48 classifier in datasets such as Vote, Spectheart, Shuttle, Monk, Contact lenses, and Postoperative (refer to Table 3).
   B. With the Naïve Bayes classifier in datasets such as Vote, Spectheart, Shuttle, Parity, Weather, Nursery, Monk, Contact lenses, and Postoperative (refer to Table 4).
3. Improved (or equal) performance when compared to the performance with the total set of features (refer to Tables 5 and 7):
   A. With the Naïve Bayes classifier in datasets such as Vote, Shuttle, Parity, Weather, Monk, Contact lenses, Tic-tac, and Car evaluation.
   B. With the J48 classifier in datasets such as Shuttle, Contact lenses, and Postoperative.
   C. With the ID3 classifier in datasets such as Spectheart, Shuttle, Parity, Monk, and Contact lenses.

VII. CONCLUSION AND FUTURE WORK

The objective of this paper is to describe a novel approach for attribute reduction. In data mining, many classification algorithms lose accuracy due to the presence of irrelevant and redundant attributes in the dataset. A new attribute reduction algorithm using class-based association rule mining has been combined with information gain and evaluated through extensive experiments. We used the Apriori algorithm to identify the relevant attributes and remove irrelevant attributes from the dataset, and then removed redundant features using information gain. Our approach demonstrates its efficiency and effectiveness in dealing with high dimensional data for classification. Our future work involves combining this approach with feature discretization algorithms to smoothly handle data of different feature types.

REFERENCES
[1] De Sousa, E.P.M., Traina, C., Traina, A.J.M., Wu, L.J., Faloutsos, C.: A fast and effective method to find correlations among attributes in databases. Data Mining and Knowledge Discovery 14 (2007) 367-407.
[2] Molina, L.C., Belanche, L., Nebot, A.: Attribute Selection Algorithms: A survey and experimental evaluation. Proceedings of the 2nd IEEE KDD 2002 (2002) 306-313.
[3] Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003) 1157-1182.
[4] Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97 (1997) 273-324.
[5] Bhavani, S.D., Rani, T.S., Bapi, R.S.: Feature selection using correlation fractal dimension: Issues and applications in binary classification problems. Applied Soft Computing 8 (2008) 555-563.
[6] Liu, H., Motoda, H., Yu, L.: A selective sampling approach to active feature selection. Artificial Intelligence 159 (2004) 49-74.
[7] Liu, H., Yu, L., Dash, M., Motoda, H.: Active feature selection using classes. Advances in Knowledge Discovery and Data Mining 2637 (2003) 474-485.
[8] Haindl, M., Somol, P., Ververidis, D., Kotropoulos, C.: Feature selection based on mutual correlation. Progress in Pattern Recognition, Image Analysis and Applications, Proceedings 4225 (2006) 569-577.
[9] Yu, L., Liu, H.: Feature Selection for High-Dimensional Data: A Fast Correlation-based Filter Solution. Proc. Int. Conference ICML 2003 (2003) 856-863.
[10] Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis: An International Journal 1 (1997) 131-156.
[11] Koller, D., Sahami, M.: Toward Optimal Feature Selection. Proc. Int. Conference ICML'96 (1996) 170-178.
[12] Bakus, J., Kamel, M.S.: Higher order feature selection for text classification. Knowledge and Information Systems 9 (2006) 468-491.
[13] Liu, H., Yu, L.: Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17 (2005) 491-502.
[14] Peng, H.C., Long, F.H., Ding, C.: Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1226-1238.
[15] Liu, H.A., Setiono, R.: Incremental feature selection. Applied Intelligence 9 (1998) 217-230.
[16] Liu, W.Z., White, A.P.: The Importance of Attribute Selection Measures in Decision Tree Induction. Machine Learning 15 (1994) 25-41.
[17] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd Edition, 2005.
[18] Harol, A., Lai, C., Pezkalska, E., Duin, R.P.W.: Pairwise feature evaluation for constructing reduced representations. Pattern Analysis and Applications 10 (2007) 55-68.
[19] Van Hulse, J.D., Khoshgoftaar, T.M., Huang, H.Y.: The pairwise attribute noise detection algorithm. Knowledge and Information Systems 11 (2007) 171-190.
[20] Tin Dung Do, Siu Cheung Hui, Alvis C.M. Fong: Associative Feature Selection for Text Mining. International Journal of Information Technology, Vol. 12, No. 4 (2006).
APPENDIX

Table I: Dataset Description

Dataset | Instances | Features | Classes
Vote | 435 | 17 | 2
Spectheart | 267 | 23 | 2
Shuttle landing | 15 | 7 | 2
Parity | 100 | 11 | 2
Monk | 124 | 6 | 2
Nursery | 12960 | 9 | 3
Postoperative | 57 | 9 | 3
Hayes Roth | 132 | 6 | 3
Tic-tac | 958 | 10 | 2
Car evaluation | 1728 | 7 | 4
Contact lenses | 24 | 5 | 3
Weather | 14 | 5 | 2
Soybean (Small) | 35 | 47 | 4
Table II: Number of selected features for each feature selection algorithm

Dataset | Cfs | Chi square | Gain | Info gain | OneRAttr | Symmetric | Relief | Association | Class Association & Info gain
Vote | 4 | 6 | 7 | 7 | 6 | 7 | 8 | 3 | 2
Spectheart | 12 | 8 | 10 | 9 | 16 | 9 | 8 | 11 | 6
Shuttle | 2 | 6 | 3 | 5 | 4 | 4 | 4 | 6 | 1
Parity | 3 | 6 | 6 | 6 | 6 | 6 | 6 | 5 | 3
Monk | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 4 | 2
Nursery | 1 | 1 | 1 | 1 | 5 | 5 | 3 | 5 | 4
Postoperative | 5 | 5 | 5 | 5 | 7 | 7 | 5 | 6 | 4
Hayes Roth | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 1
Tic-tac | 5 | 1 | 1 | 1 | 1 | 1 | 5 | 5 | 4
Car evaluation | 1 | 6 | 6 | 6 | 6 | 6 | 5 | 3 | 2
Contact lenses | 1 | 2 | 2 | 2 | 3 | 1 | 3 | 3 | 2
Weather | 2 | 2 | 2 | 2 | 3 | 2 | 2 | 3 | 2
Soybean (Small) | 22 | 35 | 35 | 35 | 35 | 35 | 35 | 40 | 28
Table III: Accuracy of C4.5 on selected features for each feature selection algorithm

Dataset | Cfs | Chi square | Gain | Info gain | OneRAttr | Symmetric | Relief | Association | Class Association & Info gain
Vote | 96.092 | 95.6322 | 91.7241 | 91.7241 | 95.6322 | 91.7241 | 96.092 | 95.6322 | 95.6322
Spectheart | 81.6479 | 79.4007 | 75.6554 | 75.6554 | 79.0262 | 75.655 | 79.4007 | 79.4007 | 79.4007
Shuttle | 53.3333 | 53.3333 | 60 | 53.3333 | 60 | 53.333 | 53.3333 | 53.333 | 53.3333
Parity | 44 | 40 | 40 | 40 | 50 | 40 | 48 | 50 | 46
Monk | 91.9355 | 91.9355 | 91.9355 | 91.9355 | 91.9355 | 91.935 | 91.9355 | 90.322 | 91.9355
Nursery | 70.9722 | 70.9722 | 70.9722 | 70.9722 | 90.7407 | 70.972 | 89.213 | 78.603 | 78.248
Postoperative | 71.1864 | 71.1864 | 71.1864 | 71.1864 | 71.1864 | 71.186 | 71.1864 | 71.929 | 71.9298
Hayes Roth | 37.8788 | 50 | 50 | 50 | 70.4545 | 70.454 | 70.4545 | 70.454 | 54.5455
Tic-tac | 79.4363 | 69.9374 | 69.9374 | 69.9374 | 69.9374 | 69.937 | 79.4363 | 79.436 | 76.3048
Car evaluation | 70.0231 | 92.3611 | 92.3611 | 92.3611 | 92.3611 | 92.3611 | 93.2292 | 80.324 | 76.5625
Contact lenses | 70.8333 | 87.5 | 87.5 | 87.5 | 58.3333 | 70.833 | 83.3333 | 83.333 | 87.5
Weather | 42.8571 | 42.8571 | 42.8571 | 42.8571 | 50 | 42.857 | 42.8571 | 64.285 | 42.8571
Soybean (Small) | 96.31 | 98.15 | 98.15 | 98.15 | 98.15 | 98.15 | 98.15 | 91.33 | 98.23
Table IV: Accuracy of NB on selected features for each feature selection algorithm

Dataset | Cfs | Chi square | Gain | Info gain | OneRAttr | Symmetric | Relief | Association | Class Association & Info gain
Vote | 96.092 | 91.0345 | 91.7241 | 91.7241 | 91.0345 | 91.7241 | 93.5633 | 92.8736 | 95.6322
Spectheart | 82.0225 | 76.779 | 80.1498 | 79.0262 | 79.0262 | 79.0262 | 76.779 | 77.1536 | 77.9026
Shuttle | 80 | 80 | 80 | 73.3333 | 73.3333 | 73.3333 | 73.3333 | 80 | 80
Parity | 50 | 46 | 46 | 46 | 47 | 46 | 43 | 47 | 51
Monk | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.1935 | 100
Nursery | 70.9722 | 70.9722 | 70.9722 | 70.9722 | 88.8426 | 70.9722 | 87.6543 | 77.2531 | 77.3848
Postoperative | 72.8814 | 72.8814 | 72.8814 | 72.8814 | 67.966 | 72.8814 | 76.2712 | 71.9298 | 75.4386
Hayes Roth | 37.8788 | 59.8485 | 59.8485 | 59.8485 | 81.8182 | 59.8485 | 81.8182 | 81.8182 | 54.5455
Tic-tac | 72.4426 | 69.9374 | 69.9374 | 69.9374 | 69.9374 | 69.9374 | 72.4426 | 72.4426 | 70.5637
Car evaluation | 70.0231 | 85.5324 | 85.5324 | 85.5324 | 85.5324 | 85.5324 | 85.3588 | 79.5718 | 76.8519
Contact lenses | 70.8333 | 87.5 | 87.5 | 87.5 | 54.1667 | 70.8333 | 83.3333 | 83.3333 | 87.5
Weather | 78.5714 | 78.5714 | 78.5714 | 78.5714 | 71.4286 | 78.5714 | 78.5714 | 50 | 78.5714
Soybean (Small) | 94.47 | 92.94 | 92.94 | 92.94 | 92.94 | 92.94 | 92.94 | 87.34 | 92.94

Figure 1: Accuracy of NB on selected features for each feature selection algorithm
Table V: Accuracy of ID3, C4.5 and NB on full set of features

Dataset | Total features | ID3 (%) | C4.5 (%) | Naive Bayes (%)
Vote | 16 | - | 96.3218 | 90.1149
Spectheart | 22 | 70.0375 | 80.8989 | 79.0262
Shuttle landing | 6 | 60 | 53.3333 | 80
Parity | 10 | 45 | 48 | 40
Monk | 5 | 95.9677 | 90.3226 | 99.1935
Nursery | 8 | 98.1867 | 97.0525 | 90.3241
Postoperative | 8 | 69.4915 | 71.1864 | -
Hayes Roth | 5 | 0 | 72.7273 | 80.303
Tic-tac | 9 | 83.4029 | 85.0731 | 69.6242
Car evaluation | 6 | 89.3519 | 92.3611 | 85.5324
Contact lenses | 4 | 70.8333 | 83.3333 | 70.8333
Weather | 4 | 85.7143 | 50 | 57.1429
Soybean (Small) | 35 | 91.50 | 92.09 | 92.97
Figure 2: Improvement in performance of Naive Bayes Classifier
Figure 3: Improvement in performance of C4.5 and Naive Bayes Classifier
Detecting the Originality of Extended Research Articles Using Similarity Techniques – A Proposal
Shanmugasundaram Hariharan
B. S. Abdur Rahman University, Chennai, Tamilnadu, India

Abstract— The increased growth of research around the world, together with the personal interest of academics, industrialists and other professionals, has paved the way to huge knowledge repositories of research articles. The outcome of such knowledge bases can cut both ways: if the knowledge is used properly, it leads to the innovation of new ideas or further enhancements, while when such knowledge is reused repetitively in a reformulated fashion it results in plagiarism of research articles, pulling down the quality of research. This article proposes a method to detect the originality of research articles published as extensions of previous work using similarity techniques. We also propose a method to summarize scientific papers, so as to reduce the time involved in reading a document, thereby presenting the originality of the research articles. From the preliminary study carried out using the corpus collected, the approach seems promising. We plan to implement the proposed system and provide a comparison of the methods discussed in this paper, analyzing the necessary parameters.

Index Terms— Similarity measures, Cosine metric, scientific articles, summarization, extended research papers
I. INTRODUCTION

Text summarization is the technique of automatically creating a summary from one or more texts. Initial interest in automatic summarization started in the 1960s in American research libraries, where a large number of scientific papers and books were to be digitally stored and made searchable. Research on automatic summarizing, taken as including extracting, abstracting, etc., has a long history, with an early burst of effort in the sixties following Luhn's pioneering work [4], followed by marked growth of activity since the mid eighties and especially very recently [5]. Research on summarization thus has roots reaching back to the 1950s and 1960s and has become a steady subject of interest in the research community [4, 7]. Text summarization has several branches or dimensions [13]; its history can be traced through several approaches, such as surface-level approaches in the 1950s, entity-level approaches in the 1960s, extensive entity-level approaches in the 1970s, and the 1990s renaissance of all three approaches [14]. In addition, we are seeing the emergence of new areas such as multi-document summarization, multilingual summarization and multimedia summarization [13]. Thanks to strenuous efforts and improvements over the years [15], summarization has remained a major challenge and has drawn attention from the research community.

With the rapid advancement of Internet technologies, documents have been digitized and made available online, and researchers regularly browse through the literature to update their existing knowledge or for use in their research work. The challenge lies in summarizing the research articles. This paper mainly focuses on methods to detect the similarity among research articles. The proposed idea is well suited to identifying the originality of papers submitted to conferences and journals, and more specifically of extended research articles that are submitted for publication in more than one venue. The proposed system leads to the following:
- It prevents the malpractice of repeating the same article in reformulated forms.
- It helps researchers to think in a broader sense.
- It improves the quality of the paper and helps academicians to enhance the work in a better way.

The paper is organized as follows: Section II discusses the related research. Section III explains the proposed system, the sample corpus, and the modules involved. Finally, Section IV gives the conclusions.

Shanmugasundaram Hariharan is with the Department of Information Technology, B.S. Abdur Rahman University, Chennai, Tamilnadu, India (Phone: 04422751347, Mobile: +91-9884204036, E-mail: [email protected]). He is working as Assistant Professor and is currently pursuing his doctoral programme in the area of Information Retrieval.

II. RELATED WORK

Simone Teufel et al. [1] proposed a system for scientific articles that concentrates on statements that have rhetorical status for summarization and highlighted summaries. Several experiments were conducted on a substantial corpus of conference articles with human judgments of rhetorical status and sentence relevance. Content from unseen articles is extracted and classified into a fixed set of seven rhetorical categories, which can then be viewed as a single-document summary.

Dain Kaplan et al. [2] designed an automatic method based on coreference chains for extracting citations from research papers. The authors considered the spans of text that describe the work being cited as "c-sites". The system is designed
to extract all c-sites that refer to the target paper, which are later aggregated to form the summary. It differs from traditional summarization techniques in that the generated summary contains multiple points of view. The authors conducted several surveys relating to parsing and extraction with several pre-existing components.

Vahed Qazvinian et al. [3] developed a model that summarizes a single article, which can be summarized further based on the topic. The authors found that this model reduces the difficulty researchers face in exploring the vast amount of scientific literature in each field of study. They used effective clustering approaches to study the citation summary network based on the views given to an article by others.

Balage Filho et al. [5] presented experiments on scientific text summarization. The authors adopted simple extractive summarizers that generate a small version of the main part of the complete source text. Their investigation also found that by considering the specificity of the text genre the results can be improved; the authors validated the summarization process by considering only text structures.

Maher Jaoua et al. [6] proposed a system which creates indicative summaries for scientific papers and differs from conventional methods. Summaries are extracted in two steps: the first step produces a population of extracts, and the second classifies them and selects the best one based on global criteria defined for the whole extract rather than for individual sentences. The authors deployed a summarization system for the French language called "ExtraGen", developed as a
prototype that performs the generation and classification mechanism by implementing a genetic algorithm.

Stephen Wan et al. [8] developed a new research tool called the Citation-Sensitive In-Browser Summarizer (CSIBS) to speed up the browsing task of keeping up to date with research articles according to user requirements. Building such a comprehensive summary helps readers decide whether it is worth spending time exploring a cited article, thereby alleviating overload. The authors retrieve sentences from the cited document by exploiting the citation context and the associated meta-data. They found that the relevance judgment task is facilitated by CSIBS, and that users' self-reported confidence in making such judgments is thereby increased.

III. EXPERIMENTAL SETUP

This section describes the proposed system to detect the originality of extended research articles. Fig. 1 gives the proposed system architecture, with each module discussed in the subsections below. The input source article is mostly in PDF or HTML format; we convert documents in either form to text for further processing. Table I presents the details of the samples investigated for the proposed system. The corpus was collected from papers submitted by different authors to international conferences and journals, which we surveyed personally for project and research activities (the proposed system is currently under implementation). A sample set S1 is shown in Figures 2A, 2B and 2C for illustration. The proposed system would provide a solution to point out which of the articles is better or worth reading, which document has been plagiarized, and so on.
TABLE I: DATA CORPUS AND STATISTICS

Set ID | Title | Year of publication | Authors | Publisher
S1 | Centroid Based summarization | 2000 | Dragomir Radev | ACL
S1 | Centroid Based summarization of multiple documents | 2004 | Dragomir Radev | Elsevier
S1 | Centroid Based summarization of multiple documents using timestamps | 2007 | Nedunchelian | IEEE
S2 | A Bottom up approach for sentence ordering in multi document summarization | 2007 | Danuksha Bollegola | IEEE
S2 | A Bottom up approach for sentence ordering in multi document summarization | 2009 | Danuksha Bollegola | Elsevier

A. Pre-processing of Documents

Pre-processing is a most significant task in data mining, information retrieval and several other tasks. In this phase the documents are pre-processed using the following steps:
a. Converting all uppercase letters to lowercase.
b. Removing unwanted words (stop words).
c. Stemming the terms occurring in the document.

To measure the content similarity (discussed in Section B), we remove the stop words from the text document and then stem the samples; the same process is applied whenever similarity is measured. Stop words are common words occurring in the document that do not carry significant meaning (e.g. connector words, conjunctions, single-letter words); we eliminate such unwanted words using a stop word corpus [10]. We also eliminate special symbols that play no significant part in text processing (e.g. ",", "/", "-", and in general symbols other than characters and numbers). Truncation, also called stemming, is a technique that allows us to search for various word endings and spellings simultaneously. Stemming algorithms [11] are used in many types of language processing and text analysis systems, and are also widely used in summarization, information retrieval and database search systems. A stemmer is a program that determines the stem form of a given word. Terms with a common stem will usually have similar
meanings. An example of words reducing to a common stem is shown below:

ACCUSES, ACCUSER, ACCUSED, ACCUSERS → ACCUS

The suffix stripping process reduces the total number of terms in the IR system, and hence reduces the size and
complexity of the data in the system, which is advantageous. We have adopted suffix stripping because it helps in the term frequency calculations during sentence scoring; each stem term and its equivalents are stored in the database and retrieved whenever a comparison is required [11].
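A minimal sketch of this pre-processing pipeline is given below; it assumes NLTK's stop word list and Porter stemmer as stand-ins for the stop word corpus and stemming component described above, and is only illustrative.

    # Hedged sketch of the pre-processing steps: lowercasing, stop word removal, stemming.
    # NLTK is an assumed substitute for the stop word list and stemmer used by the authors.
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP = set(stopwords.words("english"))      # requires: nltk.download("stopwords")
    STEM = PorterStemmer()

    def preprocess(text):
        # lowercase and drop symbols other than letters and digits
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # remove stop words and single-letter tokens, then stem what remains
        return [STEM.stem(t) for t in tokens if t not in STOP and len(t) > 1]

    print(preprocess("Heavy earth quake affects Indonesia."))
    # e.g. ['heavi', 'earth', 'quak', 'affect', 'indonesia']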
Fig. 1: Proposed system architecture (source articles are converted from PDF/HTML to text, pre-processed by special character elimination, stop word removal and stemming, and then passed to the similarity analyzer, summarizer and statisticer modules to detect articles based on originality).

B. Similarity Analysis
The system designed to detect similarity among text documents calculates the content similarity between the specified documents. The similarity between articles is estimated using the cosine metric. Though several choices exist, such as the Dice, Jaccard and Hellinger measures, we adopt the cosine measure, which is quite popular and yields good results [9, 12]; it is given by expression (1):

Cosine(t_i, t_j) = sum_{h=1..k} t_ih t_jh / sqrt( sum_{h=1..k} t_ih^2 × sum_{h=1..k} t_jh^2 )        (1)
A document is treated as a vector, with t_ih representing the first vector (each term being normalized by the total number of words in the document), t_jh the second vector, and k the number of terms in the document. The statisticer acts as an agent for finding the number of sentences, the word count and the frequency of words. Based on the term frequencies of both documents, the cosine metric yields a value in the interval 0-1: identical documents give a value of 1, while documents with no terms in common give 0.
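A short, hedged sketch of expression (1) in Python is shown below (an illustrative stand-in, not the authors' implementation); it builds term-frequency vectors with Counter and applies the cosine formula directly.

    # Hedged sketch of the cosine metric over term-frequency vectors.
    from collections import Counter
    from math import sqrt

    def cosine(doc_i, doc_j):
        ti, tj = Counter(doc_i.lower().split()), Counter(doc_j.lower().split())
        common = set(ti) & set(tj)
        numerator = sum(ti[t] * tj[t] for t in common)
        denominator = sqrt(sum(v * v for v in ti.values())) * sqrt(sum(v * v for v in tj.values()))
        return numerator / denominator if denominator else 0.0

    s1 = "Heavy earth quake affects Indonesia"
    s2 = "Indonesia rocked at 6.5 Richter scale"
    print(round(cosine(s1, s2), 2))   # about 0.18, matching the worked example below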
Consider an example to illustrate the calculation. If S1 and S2 denote two different sentences, the cosine measure is computed as shown below.

S1: Heavy earth quake affects Indonesia.
S2: Indonesia rocked at 6.5 Richter scale.

All the terms in each sentence have a term frequency of 1, and the only shared term is "Indonesia". Hence the cosine value = (1*1)/sqrt(5*6) = 1/(2.24 × 2.45) ≈ 0.18. A suitable threshold can then be fixed to identify the similarity between extended and original articles.

This section also discusses the issue of measuring the similarity among documents. The number of papers available online and the lack of commercially available tools to detect plagiarism effectively have made us focus on this issue. Our focus lies on measuring the similarity of documents under the following categories:
• Similarity measured on the document as a whole (Category I), taking the entire content.
• Similarity measured under each subject/criterion (Category II). Each technical paper is structured in different ways; however, papers have sections such as "Abstract", "Introduction", "Experimental Section", "Results" and "Conclusion/Future Work".
• Similarity measured from the summaries of the articles (Category III), which generates a summary for each criterion and measures the ratio of similarity under each category.

Through these categories we can set penalties for sentence scoring. Consider a scenario to illustrate each of the categories discussed above. Category I, which measures the relevancy of the entire contents, may not reflect the originality of the articles, because authors who write research articles abstract the contents from the original source. Hence we may go for Category II and measure the importance at each level, which would lead to better conclusions. This would allow researchers or academicians interested in reading the literature to skip a paper immediately or to read whichever one is worthwhile without wasting time.
Category III would become more useful in the coming years, as the number of literature papers increases year by year; determining the similarity among research articles based on a summary of each individual article may be deemed useful, provided the quality of the generated summary is good.

C. Summarizer

The algorithm designed for summarizing scientific articles consists of the following steps:
a. Scoring each sentence in the document.
b. Calculating a weight for each scored sentence.
c. Ranking the results based on the user requirements.

Sentences within each document are ranked according to their sentence weights using a term frequency approach [8]. To generate the weight for each sentence we adopted an extraction method in which we measure the frequency of terms in each document, with special weights for particular categories (bold or italic terms, words matching the title or a subtitle, etc.). Finally, the summary is generated according to the score each sentence obtains.
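As an illustration of steps (a)-(c), a minimal frequency-based sentence scorer is sketched below; the flat bonus for words shared with the title is an invented stand-in for the special-category weights mentioned above, and the function and parameter names are hypothetical.

    # Hedged sketch of term-frequency sentence scoring with a simple title bonus.
    from collections import Counter

    def rank_sentences(sentences, title, top_n=2, title_bonus=2.0):
        words = [w for s in sentences for w in s.lower().split()]
        freq = Counter(words)
        title_words = set(title.lower().split())

        def score(sentence):
            tokens = sentence.lower().split()
            base = sum(freq[t] for t in tokens) / max(len(tokens), 1)
            bonus = title_bonus * sum(1 for t in tokens if t in title_words)
            return base + bonus

        return sorted(sentences, key=score, reverse=True)[:top_n]

    doc = ["Centroid based summarization ranks sentences by centroid scores.",
           "The method was evaluated on multiple document clusters.",
           "Stop words are removed before scoring."]
    print(rank_sentences(doc, "Centroid based summarization of multiple documents"))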
IV. CONCLUSIONS
The paper has focused on the issue of research articles published in different publications. We have identified a solution to detect the similarity of research articles, so that the proposed system can provide assistance in finding out the originality of the content. The system also serves as a platform to detect plagiarism, especially for papers submitted to conferences, and can be adapted to measure the relatedness of articles of any nature, depending on the user's choice, by modifying the threshold. The paper has not focused on the implementation part, as it is currently being implemented. We also intend to identify the main theme of each article based on its summary.
Fig 2A: Paper published in the Year 2000 – In Proceedings of ANLP/NAACL Workshop
Fig 2B: Paper published in the year 2004 by Elsevier
Fig 2C: Paper published in the year 2008 by IEEE

REFERENCES
[1] Simone Teufel and Marc Moens, "Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status", Computational Linguistics, Vol. 28, No. 4, pp. 409-445, 2002.
[2] Dain Kaplan, Ryu Iida and Takenobu Tokunaga, "Automatic Extraction of Citation Contexts for Research Paper Summarization: A Coreference-chain based Approach", Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, ACL-IJCNLP 2009, pp. 88-95, Suntec, Singapore, 2009.
[3] Vahed Qazvinian and Dragomir R. Radev, "Scientific Paper Summarization Using Citation Summary Networks", Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 689-696, Manchester, 2008.
[4] Luhn, H. P., "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development, 2(2), pp. 159-165, 1958.
[5] Balage Filho, P.P., Salgueiro Pardo, T.A. and das Gracas Volpe Nunes, M., "Summarizing Scientific Texts: Experiments with Extractive Summarizers", Proceedings of the Seventh IEEE International Conference on Intelligent Systems Design and Applications (ISDA 2007), pp. 520-524, 2007.
[6] Maher Jaoua and Abdelmajid Ben Hamadou, "Automatic Text Summarization of Scientific Articles Based on Classification of Extract's Population", Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, Springer Berlin / Heidelberg, Volume 2588/2008, pp. 363-377, 2008.
[7] H.P. Edmundson, "New Methods in Automatic Extracting", Journal of the ACM, Vol. 16, No. 2, pp. 264-285, 1969.
[8] Shanmugasundaram Hariharan and Rengaramanujam Srinivasan, "Investigations in Single Document Summarization by Extraction Method", Proceedings of the IEEE International Conference on Computing, Communication and Networking (ICCCN'08), pp. 1-5, 2008.
[9] Shanmugasundaram Hariharan and Rengaramanujam Srinivasan, "A Comparison of Similarity Measures for Text Documents", Journal of Information & Knowledge Management, Vol. 7, No. 1, pp. 1-8, 2008.
[10] http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words (last accessed 2 December 2009).
[11] M.F. Porter, "An algorithm for suffix stripping", Program, 14(3), pp. 130-137, 1980.
[12] Michael W. Berry and Murray Brown, "Lecture Notes in Data Mining", World Scientific Publishing, 2006.
[13] Mani, I. and M.T. Maybury, "Advances in Automatic Summarization", MIT Press, Cambridge, MA, 1999.
[14] Karen Sparck-Jones, "Automatic Summarizing: The State of the Art", Information Processing and Management, Issue 43, pp. 1449-1481, 2007.
[15] Karel Jezek and Josef Steinberger, "Automatic Summarization (The State of the Art 2007 and New Challenges)", Vaclav Snasel, Znalosti, pp. 112, 2008.
Prediksi Masa Studi Sarjana dengan Artificial Neural Network Muhamad Hanief Meinanda, Metri Annisa, Narendi Muhandri, dan Kadarsyah Suryadi Fakultas Teknologi Industri Institut Teknologi Bandung (ITB), Indonesia
Abstract— A prediction of study duration is needed by university management to formulate preventive policies against drop-out (DO) cases as early as possible. This study aims to determine the academic factors that influence study duration and to build the best prediction model using data mining techniques. The model selection criterion used is the minimization of the Sum of Squared Error (SSE). The best model for predicting study duration is the one built with an Artificial Neural Network (ANN) using a Multilayer Perceptron (MLP) architecture. The study finds that study duration is influenced by the cumulative grade point average (Indeks Prestasi Kumulatif, IPK), the number of courses taken, the number of repeated courses, and the number of times certain courses are taken.
Keywords— Artificial Neural Network, Multilayer Perceptron, study duration prediction.
I. INTRODUCTION
The increasingly tight competition for jobs requires universities to produce graduates who are of high quality and competitive. For this reason, every university regularly evaluates student performance, and the evaluation results are stored in an academic database. These data can be used to support decision making by university management. One indicator of the efficiency of the educational process is information about the students' study duration.
Muhamad Hanief Meinanda is an undergraduate student in Industrial Engineering, Faculty of Industrial Technology, Institut Teknologi Bandung. He currently serves as Coordinator of Assistants at the Industrial System Planning and Optimization Laboratory (LPOSI) ITB. The author can be contacted by e-mail: [email protected]. Views and information about the author are available at http://www.hanief.com.
Metri Annisa Arrum is an undergraduate student in Industrial Engineering, Faculty of Industrial Technology, Institut Teknologi Bandung. She is currently an assistant at the Industrial System Planning and Optimization Laboratory (LPOSI) ITB. The author can be contacted by e-mail: [email protected].
Narendi Muhandri is an undergraduate student in Industrial Engineering, Faculty of Industrial Technology, Institut Teknologi Bandung. He has served as an assistant for the Programming Logic and Computing course and laboratory of Industrial Engineering ITB. The author can be contacted by e-mail: [email protected].
Kadarsah Suryadi is a lecturer in the Industrial Engineering study program at ITB. He currently serves as Head of the Master and Doctoral Program in Industrial Engineering & Management, and is a lecturer at the Information and Decision Systems Laboratory within the Industrial Management expertise group. The author can be contacted by e-mail: [email protected].
Since its introduction around 1940, the Artificial Neural Network (ANN) has been applied in many fields of science, and it is widely used for prediction and forecasting [1]. Williams and Li (2008) studied the use of an ANN trained with the back-propagation algorithm to predict horse races in Jamaica; the feed-forward (back-propagation) network used in that study proved to give good results for prediction purposes [2]. Al Cripps (1996) studied the use of ANNs to predict academic performance in the form of graduation percentage, study duration, and GPA; that study did not use academic data obtained while the students were enrolled, and its predictor variables were age, gender, American College Testing (ACT) score, race, and reading ability [3]. Bijayananda Naik and Srinivasan Ragothaman (1998) studied the use of a neural network to predict the success of MBA students, with the undergraduate GPA as predictor [4]. Taking the research opportunities left by these previous studies as a starting point, this research investigates which predictor variables from academic data influence study duration and builds an ANN model to predict study duration. The prediction model can then be used to set policies for students who are predicted to exceed the allowed study duration. A multiple regression model is also tried as a comparison model for predicting study duration.

II. LITERATURE REVIEW
A. Neural Network Structure
The artificial neural network (ANN) is inspired by the recognition that the brain is a complex learning system consisting of sets of closely interconnected neurons. A network of neurons is able to perform very complex tasks such as classification and pattern recognition. An ANN can approximate a fairly wide range of statistical models and is flexible in representing both linear and nonlinear models [5]. ANNs can be used for the same problems as multivariate statistics, such as multiple regression, discriminant analysis, and cluster analysis; in many cases the results obtained with an ANN are comparable to those of multivariate statistical models [6].
There are three main types of ANN: the Multilayer Perceptron, the Radial Basis Function network, and the Kohonen network. The Multilayer Perceptron is the model most widely used for prediction; the Radial Basis Function network can perform the same tasks as the Multilayer Perceptron, while the Kohonen network is well suited to clustering problems [7]. This study uses the Multilayer Perceptron because it is the model commonly applied to prediction problems. The Multilayer Perceptron maps a set of input data onto a set of outputs using nonlinear activation functions. In a Multilayer Perceptron, both the independent and the dependent variables may have metric or non-metric measurement levels. The Multilayer Perceptron is a feedforward neural network in which information moves in only one direction, from the input nodes through the hidden nodes to the output nodes [8].

B. Learning Algorithm
A neural network obtains its weight values from a particular learning algorithm. These weights are used to transform values from the input nodes to the output nodes. The learning algorithm adjusts the weights, which are initially generated at random. The weight update is generally formulated as

w_ij(n+1) = w_ij(n) + Δw_ij(n)    (1)

where Δw_ij(n) is computed by the learning algorithm and w_ij(n) is the initial weight determined randomly at the initialization stage [9].

C. Back-Propagation Algorithm
Inputs from the input nodes are passed to the hidden layer and then forwarded to the output nodes. Each connection from unit i to unit j has a weight w_ij that indicates the strength of the connection. The weighted sum a_j for inputs x_i and weights w_ij is defined as

a_j = Σ_{i=1..n} w_ij · x_i    (2)

where n is the number of inputs to the neuron. The activation function used is the logistic sigmoid:

g(a) = 1 / (1 + e^(−a))    (3)

The error E_j(n) between the actual output y_j(n) and the desired output of the neuron d_j(n) is computed as

E_j(n) = d_j(n) − y_j(n)    (4)

The back-propagation learning rule, with momentum, is

Δw_ij = −η (∂E_j / ∂w_ij) + α Δw_ij    (5)

where η is the learning rate and α is the momentum factor. These parameters determine how strongly the previous update influences the direction of the new parameter change.

III. RESEARCH METHODOLOGY
The research methodology is adopted from the Cross-Industry Standard Process for Data Mining (CRISP-DM), developed in 1996 by analysts from Daimler Chrysler, SPSS, and NCR. CRISP-DM has six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment [10]. The first stage of the research is understanding the problem to be solved, namely estimating study duration based on the similarity of patterns between historical data and actual data. At this stage the researchers studied the data and looked for patterns and relationships between the data variables and each research objective. The researchers then preprocessed the data, among other things by building a cross-tabulation, correcting misclassified entries, and removing missing values and outliers. After that, the models were built. Because study duration is measured on a metric scale, as are its predictors, the researchers used artificial neural network and multiple regression models. After selecting the model type, the researchers built the model using the training data, choosing the model by considering how well the model assumptions were met and the resulting error. The model was then applied to the testing data set and the results were analyzed.

IV. DATA PROCESSING
The input data are hypothetical data from the Data Mining contest of the 2009 national student ICT competition Gemastik (Pagelaran Mahasiswa Nasional Bidang Teknologi Informasi dan Komunikasi). The data are derived from actual academic records at a university and consist of the academic records of 1289 students, with the variables listed in Table I.

TABLE I
DESCRIPTION OF THE VARIABLES IN THE INPUT DATA

Field | Description | Data type
ID | Student identifier (primary key) | String
Masa Studi | Length of time the student spent enrolled | Integer
Kode Mata Kuliah | Unique code for each course | String
Nama Mata Kuliah | Course name | String
Ambil Ke | Number of times the student has taken the course | Integer
Nilai | Grade for each course taken by the student | String
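Referring back to the learning rules of Section II, the following is a minimal NumPy sketch of one training step for a single sigmoid neuron, illustrating Eqs. (1)–(5); it is only an illustration of the formulas, not the SPSS 16 MLP procedure actually used in this study. The learning rate and momentum simply mirror the values reported later (η = 0.4, α = 0.9), and the toy input and target are invented.

```python
import numpy as np

def sigmoid(a):
    # Eq. (3): logistic sigmoid activation
    return 1.0 / (1.0 + np.exp(-a))

def train_step(w, x, d, prev_dw, eta=0.4, alpha=0.9):
    """One gradient update with momentum for a single sigmoid neuron."""
    a = np.dot(w, x)                      # Eq. (2): weighted sum of inputs
    y = sigmoid(a)                        # neuron output
    e = d - y                             # Eq. (4): desired minus actual output
    grad = -e * y * (1 - y) * x           # dE/dw for squared error with sigmoid output
    dw = -eta * grad + alpha * prev_dw    # Eq. (5): update with momentum
    return w + dw, dw                     # Eq. (1): w(n+1) = w(n) + Δw(n)

rng = np.random.default_rng(0)
w, dw = rng.normal(size=3), np.zeros(3)   # random initial weights
x, d = np.array([1.0, 0.5, -0.2]), 0.8    # toy input and target
for _ in range(100):
    w, dw = train_step(w, x, d, dw)
print(sigmoid(np.dot(w, x)))              # output moves toward the target 0.8
```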
TABLE II
DESCRIPTION OF THE VARIABLES IN THE CROSS-TABULATION

Variable | Description | Processing | Measurement level
Bobot | Student grades codified into numbers | Grades transformed as A=4, B=3, C=2, D=1, E=0 | Ordinal
SKS | Number of semester credit units taken by the student | Sum of the credit units of every course taken by the student | Metric
IPK | Cumulative grade point average of the student | The Bobot column divided by the SKS column | Metric
Jumlah Mata Kuliah | Number of courses taken by the student | Total count of courses taken by the student | Metric
Jumlah Mengulang | Number of course repetitions | Total count of repetitions of courses taken by the student | Metric
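A small pandas sketch of how per-course records in the shape of Table I could be aggregated into the per-student cross-tabulation of Table II is shown below. The column names follow Table I; the per-course credit column 'SKS' and the credit-weighted accumulation of Bobot (so that IPK = Bobot/SKS behaves like a GPA) are assumptions of this sketch, not details given in the paper.

```python
import pandas as pd

# toy per-course records; 'SKS' (credits per course) is assumed for illustration
records = pd.DataFrame({
    'ID':        ['S1', 'S1', 'S1', 'S2', 'S2'],
    'MasaStudi': [48, 48, 48, 60, 60],
    'KodeMK':    ['MK101', 'MK102', 'MK102', 'MK101', 'MK103'],
    'AmbilKe':   [1, 1, 2, 1, 1],
    'Nilai':     ['A', 'C', 'B', 'D', 'B'],
    'SKS':       [3, 2, 2, 3, 4],
})

grade_weight = {'A': 4, 'B': 3, 'C': 2, 'D': 1, 'E': 0}          # Bobot codification
records['Bobot'] = records['Nilai'].map(grade_weight) * records['SKS']

cross = records.groupby('ID').agg(
    MasaStudi=('MasaStudi', 'first'),
    Bobot=('Bobot', 'sum'),
    SKS=('SKS', 'sum'),
    JumlahMataKuliah=('KodeMK', 'nunique'),
    JumlahMengulang=('AmbilKe', lambda s: (s > 1).sum()),
)
cross['IPK'] = cross['Bobot'] / cross['SKS']                     # GPA = Bobot / SKS
print(cross)
```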
From this input data, a preprocessing stage was then carried out, consisting of data transformation and cleaning. The data were transformed by building a cross-tabulation so that the data have a unique key and columns that correspond to the hypothesized predictor variables (Table II). Next, misclassified entries were corrected; misclassification occurred because curriculum changes caused the same course name to be written differently. Missing values and records with unreasonable values (outliers) were then removed. The experiments were run on an Intel Atom N280 based computer using the ANN Multilayer Perceptron, linear regression, and Spearman correlation procedures of SPSS 16 (Windows XP x32). Data preprocessing and construction of the cross-tabulation were done on the same platform using Microsoft Excel 2007 SP0.

V. EXPERIMENTS AND ANALYSIS
A. Selection of Predictor Variables
The study-duration variable in the training data takes 47 distinct values in the interval from 41 to 88 months, so study duration is a variable with a metric measurement scale. The predictor variables for study duration are also measured on metric scales. To predict study duration, the researchers therefore use a dependence model with a metric dependent variable. The next step is to determine the predictors that may influence study duration. The available data are each student's academic record, so the predictors entered into the prediction model are data related to the student's academic performance. The researchers proposed an a priori hypothesis to determine the independent variables used as predictors of study duration: the cumulative grade point average (IPK), the number of courses taken, and the number of repeated courses are related to study duration. The IPK predictor is computed by dividing the grade weight (Bobot) by the total credits (SKS) taken. The researchers did not include IPK in the model, because IPK is a variable derived from Bobot and SKS; in building the model, IPK is replaced by the Bobot and SKS variables.

B. Preprocessing
Data preprocessing is carried out so that the modeling stage can produce an effective model. The preprocessing steps include transforming the data into a more informative form using the cross-tabulation, followed by removal of missing values and outliers.

C. Exploratory Analysis
The predictors chosen a priori were tested with the Spearman correlation coefficient to confirm a relationship between these predictors and study duration. Spearman correlation was chosen because it does not require assumptions about the distribution or normality of the data [11]. The correlation results are shown in Table III. In the table, the significance of each correlation is below the critical value (α = 0.05), so based on the correlation test all predictors are significantly correlated with study duration, and these variables are therefore included in the prediction model.

TABLE III
SPEARMAN CORRELATION WITH STUDY DURATION (MASA STUDI)

Spearman's rho | Bobot | Jumlah MataKul | Jumlah Mengulang | SKS
Correlation coefficient | -0.60 | 0.76 | 0.78 | 0.43
Sig. (2-tailed) | 0 | 0 | 0 | 0
N | 1221 | 1221 | 1221 | 1221
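As an illustration of the exploratory step above, the sketch below computes the Spearman correlation and its two-tailed significance for each candidate predictor against study duration using SciPy; the paper itself ran this test in SPSS 16, and the DataFrame columns here are placeholders for the cross-tabulated data.

```python
from scipy.stats import spearmanr

def spearman_screen(df, target='MasaStudi', alpha=0.05):
    """Keep predictors whose Spearman correlation with the target is significant."""
    selected = {}
    for col in df.columns.drop(target):
        rho, p_value = spearmanr(df[col], df[target])
        if p_value < alpha:                   # compare against the critical value α
            selected[col] = (round(rho, 2), round(p_value, 4))
    return selected

# usage sketch: df is the cross-tabulation with Bobot, SKS, JumlahMataKuliah,
# JumlahMengulang and MasaStudi columns
# print(spearman_screen(df))
# 80% training / 20% testing split for validation, as described in the text:
# train = df.sample(frac=0.8, random_state=1); test = df.drop(train.index)
```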
For validation purposes, the data were split into two groups (split-sample): training data and testing data. The training data are used to build the model and were selected at random, comprising 80% of all the data; the testing data are used for validation.

D. Building the Study-Duration Prediction Model
Models that can be used to predict a metric variable from metric predictors are multiple regression and neural networks. The first model built and tested was the regression model. The assumptions of a regression model can only be tested once the model has been formed. A regression model assumes normality of the errors, constant error variance (homoscedasticity), and independence of the errors [7]; all of these assumptions must hold for the regression model to be unbiased. After fitting the regression model using the least-squares criterion,
it was found that the variance of the errors is not constant (Fig. 1). In the scatterplot of the standardized residuals against the standardized predicted values, the spread of the residuals grows as the predicted value increases. The regression model is therefore not appropriate, because the homoscedasticity assumption is not satisfied.

Fig. 1. Plot of the homoscedasticity test of the errors.

Fig. 2. Plot of the number of added variables vs. the sum of squared errors (SSE).

From Fig. 2 (discussed further in Section V.E below), the SSE drops markedly once 22 variables have been added. Therefore, 22 variables were added to the model, containing the number of times a given student took each of the courses most frequently taken by students overall. The performance of the improved model is shown in Table VI.
Another model used to predict study duration is the Artificial Neural Network (ANN). The type of ANN used is the Multilayer Perceptron (MLP), which can be used to predict real-valued data with supervised training. The architecture was selected automatically by SPSS 16. The training type chosen is batch training, which produced the smallest error compared with the other training methods. The initial learning rate was set to 0.4 and the momentum to 0.9; the number of training epochs was determined automatically. The performance of the model is shown in Table IV.

TABLE IV
SUMMARY OF THE MLP STUDY-DURATION MODEL
Training: Sum of Squares Error = 184.346; Relative Error = 0.357.
Dependent variable: MASASTUDI.
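The study builds the MLP with SPSS 16, which selects the architecture automatically. Purely as an illustration of the same idea, the sketch below fits a small multilayer-perceptron regressor with scikit-learn and reports the sum of squared errors on the training data; the hidden-layer size is an arbitrary choice of this sketch, not the architecture chosen by SPSS.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_mlp(X_train, y_train):
    # feedforward MLP trained with full-batch gradient descent and momentum,
    # mirroring the batch training, learning rate 0.4 and momentum 0.9 above
    model = MLPRegressor(hidden_layer_sizes=(10,), activation='logistic',
                         solver='sgd', batch_size=len(X_train),
                         learning_rate_init=0.4, momentum=0.9,
                         max_iter=2000, random_state=1)
    model.fit(X_train, y_train)
    sse = np.sum((model.predict(X_train) - y_train) ** 2)   # Sum of Squares Error
    return model, sse

# usage sketch: X_train holds the predictor columns of the training split,
# y_train the MasaStudi column
# model, sse = fit_mlp(X_train, y_train)
```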
E. Improving the Study-Duration Model
To reduce the Sum of Squares Error (SSE), the researchers re-evaluated the predictors and added, as extra predictors, the number of times a student took certain courses. The courses chosen are those taken most often by students; a course taken very often indicates a course that many students repeat. These variables were entered into the model one by one, and the resulting SSE was evaluated each time (Table V).

TABLE V
MODEL PERFORMANCE (SSE) AFTER ADDING VARIABLES

Variables added | Sum of Squared Error
15 | 150.5
20 | 93.3
25 | 89.1
30 | 86.4
35 | 81.0
TABLE VI
SUMMARY OF THE IMPROVED MLP STUDY-DURATION MODEL
Training: Sum of Squares Error = 97.004; Relative Error = 0.159.
Dependent variable: MASASTUDI.
The model was then validated using the testing data set. Validation was done by testing whether there is a significant difference between the actual and the predicted study durations. Based on a Kolmogorov-Smirnov normality test of the predictions at the 95% confidence level, the predicted data are not normally distributed, so the difference test was carried out with the Wilcoxon signed-rank test (Table VII).

TABLE VII
DIFFERENCE TEST BETWEEN ACTUAL AND PREDICTED STUDY DURATION
Test statistics (b): Pred - MasaStudi
Z = -1.842 (a)
Asymp. Sig. (2-tailed) = .065
a. Based on negative ranks. b. Wilcoxon Signed Ranks Test.
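A sketch of this validation step with SciPy is given below; it mirrors the normality check and the Wilcoxon signed-rank test reported in Table VII, although the original analysis was done in SPSS 16, and applying the normality check to the paired differences is a simplifying choice of this sketch.

```python
from scipy import stats

def validate(actual, predicted, alpha=0.05):
    """Compare actual and predicted study durations on the testing set."""
    diff = predicted - actual
    # normality check; a plain one-sample K-S test against a fitted normal is used here
    ks_stat, ks_p = stats.kstest(diff, 'norm', args=(diff.mean(), diff.std()))
    if ks_p < alpha:
        # not normal: use the non-parametric Wilcoxon signed-rank test
        stat, p_value = stats.wilcoxon(actual, predicted)
        test = 'Wilcoxon signed-rank'
    else:
        stat, p_value = stats.ttest_rel(actual, predicted)
        test = 'paired t-test'
    return test, stat, p_value        # p_value > alpha: no significant difference

# usage sketch:
# test, stat, p = validate(y_test, model.predict(X_test))
```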
The difference test yields a p-value of 0.065, which is greater than the critical value of 0.05. Therefore, at the 95% confidence level, there is no significant difference between the actual study durations and the study durations predicted by the model.

VI. CONCLUSIONS
Based on the experiments, evaluation, and analysis carried out, it can be concluded that (a) the cumulative grade point average, the number of courses taken, the number of repeated courses, and the number of times certain courses are taken influence study duration; (b) in predicting study duration, a regression model produces biased predictions because the homoscedasticity assumption on the errors cannot be satisfied; and (c) an Artificial Neural Network with a
Multilayer Perceptron architecture is, in this study, the best model for predicting study duration. Since the data used are hypothetical data from the Data Mining contest of Gemastik 2009, the conclusions may differ when the models are applied to actual academic data. Further research on predicting student academic performance could take the form of: (a) research using actual academic data from a particular university; (b) building a prediction model for drop-out status so that it can be used as an early warning; and (c) designing software that implements the model so that it can be used practically in a real system.

VII. REFERENCES
[1] J. Williams and L. Yan, "A case study using neural network algorithms: horse racing prediction in Jamaica," in International Conf. on Artificial Intelligence (ICAI'08), Las Vegas, 2008.
[2] A. Lapedes and R. Farber, "How neural nets work," Evolution, Learning, and Cognition, pp. 331-345, 1998.
[3] A. Cripps, "Using Artificial Neural Nets to Predict Academic Performance," in ACM Symposium on Applied Computing, 1996.
[4] B. Naik and S. Ragothaman, "Predicting M.B.A. student performance: An empirical comparison of neural network vis-à-vis statistical models," in Midwest Decision Sciences, Lincoln Institute, 1998.
[5] Y. Bar-Yam, Dynamics of Complex Systems. 2008.
[6] P.V. Balakrishnan, M.C. Cooper, V.S. Jacob, and P.A. Lewis, "A study of the classification capabilities of neural networks using unsupervised learning: A comparison with k-means clustering," Psychometrika, vol. 59, 1994.
[7] J. Hair and R. Anderson, Multivariate Data Analysis. New York: Prentice Hall, 1998.
[8] D. Pyle, Data Preparation for Data Mining. Morgan Kaufmann Publishers, 1999.
[9] C.M. Bishop, Neural Networks for Pattern Recognition, 3rd ed. Oxford: Oxford University Press, 1995.
[10] D.T. Larose, Discovering Knowledge In Data – An Introduction to Data Mining. New Jersey: John Wiley & Sons, 2005.
[11] E.L. Lehmann and H.J.M. D'Abrera, Nonparametrics: Statistical Methods Based on Ranks, rev. ed. NJ: Prentice-Hall, 1998.
Adaptive Content-based Navigation Generating System: Data Mining on Unorganized Multi Web Resources Diana Purwitasari1, Yasuhisa Okazaki2, and Kenzi Watanabe2 1 Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia 2 Department of Information Science, Saga University, Saga, Japan
Abstract— The rapid growth of the Internet has made Web-based applications a new means of learning. A learning application must treat the creation of navigation as an important task, to help users grasp the structured idea of the learning topics within its collection of materials. Let multiple resources form the collection of a learning application. The problem that then arises is how to structure such abundant resources as navigation in the application so that it satisfies the different interests of users. Our adaptive content-based navigation generating system provides a domain of subjects, called topics, extracted from the collection. The system automatically produces an organized structure of topics that offers users guidance for learning. Since the collection consists of Web pages, the system equally exploits information from contents and hyperlinks in order to generate navigation. When the context of a user's interests changes, for example when the user clicks a topic to know more about it, the content-based navigation adapts; the system exploits a user model for that adaptation.
Index Terms— navigation generating system, text mining, unorganized web resources
I. INTRODUCTION

WITH such an abundance of resources available on the Internet, there are many learning sites confined to a certain coverage of a knowledge area, and some subjects in one knowledge area may overlap with another area's coverage. One may want to put together existing learning materials from multiple resources in order to develop a learning system that accommodates the different subjects of users' interests. The scopes of these subjects are hidden topics in the whole collection of learning materials gathered from multiple resources. Some users might not yet have a sufficiently structured idea of the learning topics. Navigation, an organized structure derived from a collection of learning materials, offers guidance for learning to help such users. When dealing with such abundant resources, a manually constructed navigation map of learning topics would quickly become out of date; it is easier to maintain the structure of multiple resources by handing the navigation-creation task over to a system. Content-based navigation is defined as a sequence list of topics and subtopics hierarchically structured from a collection of documents. In our system we use data mining techniques followed by hypergraph partitioning to extract topics. The mapping of topics is treated as hypergraph construction, with vertices representing words and edges representing the strength of the relation between words. The extracted topics are then restructured into a hierarchical topic list using modified agglomerative clustering. The system utilizes information from contents and hyperlinks in the collection of Web pages to provide the list. Implementations in educational settings, which particularly utilize services on the Web, interpret the term adaptive as information filtering, that is, finding items relevant to user interests in a large collection of documents. Therefore the content-based navigation adapts to users when the context of their interests changes, for example when a user clicks a topic to know more about it.

D. Purwitasari is with the Department of Informatics, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia, e-mail: [email protected].
Y. Okazaki and K. Watanabe are with the Department of Information Science, Saga University, Saga, Japan.
II. OUR PREVIOUS WORKS AND RELATED WORKS
The variety of educational resources available to users on the Web for almost every domain is changing rapidly. However, this abundance of resources has created the problem of finding and organizing resources that match the individual goals, interests, and current knowledge of users. Web-based systems for learning, with the benefits of adaptivity and intelligence, provide an alternative to the traditional "just-put-it-on-the-Web" approach; this makes personalized access for users the central issue. Web-based learning systems should behave differently for different users in order to provide the adaptation effect. So far, the only techniques that demonstrate a good ability to provide personalized access to information in the learning context are
Fig. 1. Framework of adaptive content-based navigation generating system.
adaptive navigation support [1], such as curriculum sequencing. The goal is to provide the user with an "optimal path" through the materials, that is, the most suitable individually planned sequence of topics for learning. This is very important because of its ability to guide users through the volume of available resources. Sequencing can be implemented as a list of recommended links forming a suggested learning path. We design a generating system that produces a learning path appropriate to the current interest of users. A Web-page document is the representation form of at least one identified topic in the learning path. When users want to browse more of a topic, the system translates the user's click into an inquiry, using feature terms within the Web page as search keywords for the selected topic. Our scoring schema exploits information from both the contents and the hyperlinks of Web pages [2]. Many research efforts that influence our framework for generating content-based navigation have been devoted to bringing a structured representation to a large collection of documents on the Web. To fulfill this function there are two main approaches: (i) organize documents into a domain of subjects [3] [4] [5] [6], and (ii) provide scenarios of topics for learning particular subjects [7] [8]. We carried out a thorough literature study to propose a framework that makes the best use of both models [9], and a prototype implementation of the framework shows the system's uniqueness compared with the previously mentioned works. Starting from data mining techniques for topic identification [3], we adapt those techniques and employ the similarity function used for merging topics in other tasks, such as structuring and representing documents as topics within the content-based navigation. Documents may have membership in several domains of subjects. Although fuzzy clustering might be more suitable for partitioning a collection of subject-overlapping documents [4], we show that even agglomerative clustering can be applied. We also exploit hyperlink information [5] [6] to generate the learning-path navigation, because the documents in the materials collection are Web pages. Evaluations were done to verify the validity of the methods selected in the framework [10]. Some evaluations concern preparing Web contents sufficient for a collection, so that the system can suitably generate navigation; another compares the results of several distance measures as partitioning criteria for the agglomerative clustering; and another observes the navigation generated from the contents of documents close to the user's current topic of interest, where the documents are retrieved with our scoring model [2]. After these evaluations, the proposed framework was implemented [11]; however, it was still a model system that generates a single content-based navigation. The adaptive effect of generating and re-generating navigation, which supports the claim of producing a learning path appropriate to the current interest of users, is emphasized in the current work.
III. CONTENT-BASED NAVIGATION GENERATING SYSTEM
Given a collection of documents, the system outlines the learning topics and then provides learning scenarios that offer guidance to users. It first extracts the topics that are frequently discussed in the collection with data mining techniques (Topics Detecting Module, Fig. 1), and then clusters the extracted topics in order to produce a hierarchical sequence list of topics (Navigation Creating Module, Fig. 1).
A. Overview
The basic cognitive process behind topic identification is that, for each topic, there exist sets of common words (frequent itemsets). Frequent itemsets are mapped into a hypergraph whose vertices represent words and whose edges represent the strength of the relation between words (Topics Detecting Module, Fig. 1). Extracting topics means partitioning the hypergraph into sub-graphs. The extracted topics are then restructured to produce a hierarchical list of topics and subtopics using agglomerative clustering with some adjustments. This hierarchical list is meant to give an implicit path, a kind of learning direction for users (Navigation Creating Module, Fig. 1). When a user clicks on a topic, the system makes an inquiry using feature terms within the Web page as search keywords for the selected topic. The inquiry results, retrieved with our scoring scheme [2], become the new resources from which the next sequence of topics is produced (Query Searching Module, Fig. 1). This module provides the adaptive effect in the adaptive content-based navigation generating system.
B. Topics Detecting Module
If the important terms (frequent itemsets) in the collection are mapped into a graph of word vertices, sub-graphs indicate implicit topics. A hypergraph is a graph in which an edge can connect more than two vertices. This module partitions the hypergraph to single out hyperedges that identify the most frequently discussed topics. The shmetis program of the hMetis tool is executed to perform the hypergraph partitioning [12]. Based on the input-output file format defined by shmetis, we provide ways of encoding the frequent 2-itemsets of the hypergraph into an input file and of decoding the output file into a list of unnamed topics, each consisting of the common words of that topic. The Topics Detecting Module begins by preprocessing the collection [13], which includes indexing, removing stop words, stemming, and TF-IDF (term frequency - inverse document frequency) weighting, in order to retrieve candidate frequent itemsets. Data mining then lists the frequent itemsets that satisfy minimum values of the support and confidence filters. We introduce two additional filters to set apart significant frequent 2-itemsets: the first is the TF-IDF term weight (Eq. 1), and the second is our own weighting scheme of term entropy (Eq. 2).
ω_{d,t} = ( tf_{d,t} / max_i tf_{d,i} ) · log( N / n_t ) ≥ δ_w        (1)
Let the collection D contain N Web pages, |D| = N, and let n_t be the number of documents containing term t. A weight ω_{d,t} is assigned to each term t in a document d; it depends on the number of occurrences of the term in the document, tf_{d,t}, normalized by the frequency of the most frequent term in that document. The term weight ω_{d,t} is attenuated by the factor idf_t = log(N/n_t) if the term occurs too often in the collection. For the TF-IDF filter we discard term weights ω_{d,t} smaller than δ_w.
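A minimal sketch of the TF-IDF filter of Eq. (1) is shown below; the collection is represented simply as a list of token lists, and the threshold value is an arbitrary placeholder.

```python
import math
from collections import Counter

def tfidf_filter(docs, delta_w=0.1):
    """Return, per document, the terms whose weight ω(d,t) (Eq. 1) passes δw.
    docs: list of token lists representing the Web-page collection."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # n_t: document frequency
    kept = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values())
        weights = {t: (tf[t] / max_tf) * math.log(N / df[t]) for t in tf}
        kept.append({t: w for t, w in weights.items() if w >= delta_w})
    return kept

# usage sketch with a toy three-document collection
docs = [['statistic', 'population', 'sample', 'statistic'],
        ['order', 'statistic', 'rank'],
        ['probability', 'distribution', 'random', 'variable']]
print(tfidf_filter(docs, delta_w=0.1))
```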
TABLE I
A CONTINGENCY TABLE TO HELP DETERMINE THE IMPORTANCE STATE OF TERMS IN A DOCUMENT

              | belongs to C_D | belongs to C_NOT_D |
from set S_lt |       a        |         b          | |S_lt| = a + b
from set S_gt |       c        |         d          | |S_gt| = c + d
The entropy value measures a state of uncertainty. We consider the entropy value of a term to reflect how effective the term is at distinguishing a certain document from the others; this reasoning is derived from the selection of the most useful attribute to test at each node in the decision tree algorithm [14]. For the calculation of term entropy we assume that there are only two target classes: (i) the term is one of the important terms of the currently observed document, or (ii) on the contrary, the term is important for any other document except the currently observed one. The attribute used to classify a term into (i) or (ii) is the test condition of whether the frequency of the term in the currently observed document (its term frequency) is less or greater than the average term frequency over all documents in the collection. In a collection D, for a document d, a term t is thus classified into one of two classes: (i) the important terms of document d, C_d, or (ii) the important terms of other documents, C_not_d. Whether the value tf_{d,t} is less or greater than the average number of occurrences of term t over the documents in the collection, avg(tf_{d,t}), becomes the classifying attribute. The term entropy value ent_{d,t} in document d is defined as:

ent_{d,t} = Σ_{S_x} ( |S_x| / N ) · ent_p_{d,t}(S_x) ≥ δ_e        (2)

where S_x is either S_lt, the set of documents in which the occurrences of the term are less than the average avg(tf_{d,t}), or S_gt, the set of documents in which they are greater; S_x ∈ {S_lt, S_gt} and D = S_lt ∪ S_gt (Table I shows the contingency table that helps determine the importance state of terms in a document). For the entropy filter we discard term entropy values smaller than δ_e. Given a set S_x containing only the positive and negative documents of the two-class problem C_d and C_not_d, the entropy of S_x relative to this simple binary classification is defined as:

ent_p_{d,t}(S_x) = −( p_p · log_2 p_p + p_n · log_2 p_n )        (3)

where p_p is the proportion of documents in S_x in which term t belongs to C_d and p_n is the proportion of documents belonging to C_not_d. We then list the frequent itemsets that satisfy the data mining filters, namely minimum support (Eq. 4) and minimum confidence (Eq. 5) [15]. The support value s gives the percentage of the N documents that must contain terms t_i and t_j, a pair of common words occurring together in
multiple documents. The support filter ignores any 2-itemset of words that occurs in only a few documents, so that only itemsets that occur often are considered worthy of attention.

s_{i,j} = p(t_i, t_j) ≥ δ_s        (4)

Confidence c, also called mutual information, measures the dependence or correlation between the occurrences of terms t_i and t_j. The confidence filter further analyses whether the presence of a 2-itemset that occurs often (i.e., one sufficient for the support filter) is significant enough to become a potential frequent itemset.

c_{i,j} = log_2 [ p(t_i, t_j) / ( p(t_i) × p(t_j) ) ] ≥ δ_c        (5)

Note that p(x) is defined as a probability function of x, i.e. p(t) = n_t / N, while p(x, y) is the joint probability function of x and y.

C. Navigation Creating Module
To learn the relations between topics and subtopics, agglomerative clustering is used to organize topics into nested groups of merged topics [3]. The clustering is initialized with the topics extracted by hypergraph partitioning as cluster seeds. In the first iteration a cluster contains a single topic, but in later iterations a cluster can contain several merged topics. Whether merging some topics into a cluster results in a well-comprehensible combined topic is an appraising question; we therefore require that within-cluster distances among members of the same cluster be minimized while between-cluster distances be maximized. The smaller the variance ratio of between- and within-cluster distances, the better the clustering result. If a certain iteration has a large variance ratio even though the ratio was decreasing in previous iterations, clustering cuts off the topics-subtopics tree at that point. Our assumptions during the clustering process are that each document can belong to exactly one cluster of topics and that any cluster without document members is removed. Because extracted topics are used as cluster seeds in the initialization, it is possible that some initial clusters have no document members and are left out during tree construction. Our proximity matrix P is a matrix whose ij-th entry denotes the similarity between the topics in the i-th and j-th clusters. Before merging topic a, tpc_a, and topic b, tpc_b, we recalculate the similarity between every document d and the soon-to-be-merged topics based on their common words (sim(d, tpc_a) and sim(d, tpc_b)); there may be words that are not only frequently found in documents close to tpc_a but also exist in documents of tpc_b. The document-to-topic similarity is derived from the TF-IDF formula of the term-weighting process (Eq. 6) [3]. The accumulated document-similarity values for every pair of topics are combined in a mutual-information style to update the proximity matrix of topic clusters at each clustering iteration. Eq. 7 defines sim_ab as the similarity of topic a, tpc_a, and topic b, tpc_b [3]. In the first iteration, tpc_a and tpc_b each contain a single topic, the cluster seed taken from the topics extracted by the Topics Detecting Module; in later iterations tpc_a or tpc_b can be a cluster of topics resulting from earlier merges. This modification handles the issue that articles are very likely to have blended topics.

sim(d_i, tpc_t) = [ Σ_{k∈t} tf_{ik} · (log(N/n_k))² ] / sqrt[ ( Σ_{j∈t} (log(N/n_j))² ) · ( Σ_{j∈t} (tf_{ij})² · (log(N/n_j))² ) ]        (6)

sim_ab = [ Σ_{i∈docs} sim(d_i, tpc_a) · sim(d_i, tpc_b) / N ] / [ ( Σ_{j∈docs} sim(d_j, tpc_a) / N ) × ( Σ_{k∈docs} sim(d_k, tpc_b) / N ) ]        (7)

The distance between two clusters of topics signifies an overlapping domain of subjects. The similarity function used to define the likeness of a document and a cluster also works for calculating distances between clusters [3]. Our implementation demonstrates that the document-cluster similarity function is not only for merging topics [9] [11]: the design for cutting off the topics-subtopics tree described above uses the same similarity function to work out the variance-ratio value. In the end, each cluster of a single topic or of merged topics is given a document representation. For that purpose, this module searches for the representation in a list of documents similar to the clusters of topics, ordered descendingly by depth-first traversal. The similarity function measures the level of likeness between a document and a cluster of topics based on the common words that exist within them. We compute the similarity of each traversed document, ignoring any document that has already become the representation of another cluster. The similarity between the document and the sub-clusters of the currently visited cluster is also checked, because the document may be more suitable as the representation of one of those sub-clusters. To obtain the order of the traversed topics in the list, we consider the implicit link information of the Web pages: PageRank [16] is selected to perform link analysis and acquire an importance weight for each document representation.
Fig. 2. Browser displays first page from a generated navigation.
Fig. 3. Map of topics from a generated navigation on a Web-based Application.
IV. ADAPTIVE CONTENT-BASED NAVIGATION GENERATING SYSTEM
The content-based navigation generating system changes the focal point of the learning path when the current knowledge state of a user shifts. The Query Searching Module pinpoints documents in the collection whose contents are relevant to the information in the user model (see Fig. 1). The system then adaptively re-generates a new navigation for learning those documents.
A. Query Searching Module
We use PageRank [16] to pick out the most linked-to Web pages whose contents are most suitable for the selected topic.
The generated navigation is derived from a mapping of clustered topics in which each single topic may be a term phrase. Since the position of combined terms can lead to different meanings, finding suitable contents requires a term-weighting scheme that considers term positions [17]. Our searching process therefore applies combined ranking factors: link analysis with PageRank scoring, and content analysis that retains the spatial information (position of terms) of the search keywords,
Fourier Domain Scoring [2]. The search results act as a dummy collection that replaces the original one, from which the system produces the next navigation for a more focused collection to learn further. The adaptive content-based navigation generating system uses the Vector Space Model (VSM, i.e. TF-IDF weighting), Fourier Domain Scoring (FDS) and the PageRank Score (PRS) in combination to obtain the similarity score of a document with the selected material. The selected material is the Web page on which the user clicks the [Generate] button (see Fig. 2). The system uses the weights of the important terms that exist in the document and query Web pages to obtain the VSM score, and it calculates the FDS score if and only if the title of the selected material consists of more than a single term. Finally, the multiplication of these scores (including the FDS score when it exists) defines the final similarity between the document and query Web pages.
B. System Implementation on a Web-based Application
As an experiment to test the functionality of each module, the adaptive content-based navigation generating system has been implemented. A small collection of documents containing English Wikipedia articles without images¹ was prepared. The collection of Web pages is prepared beforehand, for example by HTML scraping based on XML files used as patterns to extract only the learning contents from the raw Web pages. For the time being it is easier to observe the adaptive effect with a small experimental collection (±30 documents). The adaptation is concerned with topic mapping, to recognize the domain of subjects, and then with spotting the topics of user interest, to provide scenarios for learning the subjects. We use the libraries of Oracle Text² for indexing terms and the Porter stemmer algorithm for stemming. The list of stop words contains the words provided by Oracle Text, HTML tags, and some common words in Wikipedia articles. The browser displays the Web page of Fig. 2 for users. Highlighted links indicate that the referenced Web pages are documents in the collection. The system processes the contents of those pages through the routine processes of the aforementioned modules to derive the map of topics for generating navigation. The first generated navigation is illustrated in Fig. 3, which also shows a graph of words³, although creating the illustration graph is not part of the navigation generating system. Note that the graph describes the map of topics in the experimental collection of Wikipedia articles. The navigation has two sections, shown as the two areas of the map of topics divided by a dashed line. Each divided area in the graph contains several groups of vertices in gradated colors; nodes with the same gradation color are the common words of a single topic, or sub-section, of the generated navigation. Suppose the user clicks one topic of the first generated navigation in
¹ Crawled from the main page of the category Statistic, http://en.wikipedia.org/wiki/Statistic
² Oracle Text in Oracle Database, http://www.oracle.com/technology/products/text
³ The hypergraph is written in the Dot language and viewed with ZGRViewer, http://zvtm.sourceforge.net/zgrviewer.html
Fig. 3 (see the rectangle area of a sub-section in the navigation). The selected topic has words, shown in dotted ellipses, within its coverage concept of subjects. The Query Searching Module recognizes that the retrieved documents have subjects related to the selected topic, as illustrated by the ellipse area drawn with a solid line. For the second generated navigation in this case, the system produces a structure from the pinpointed topics within the solid-line ellipse, and not from the whole map of topics; that is, the system adaptively changes the focal point of the learning path for users based on their current interest. Fig. 2 illustrates a Statistic course in a sample Web-based learning setting. It is suggested that the user learn topics about [Statistic Population] followed by [Order Statistic] from the course. If the user clicks the link [Probability Distribution], the system analyzes the topic mapping in Fig. 3 and generates a new learning path focused on the topics in the solid-line ellipse area. The second generated path does not concern all topics in the mapping.
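Purely as an illustration of the score combination described in Section IV-A, the sketch below multiplies a vector-space (TF-IDF) score, a Fourier Domain Scoring value and a PageRank value per candidate page, applying FDS only when the selected material's title has more than one term. The FDS and PageRank values are assumed to be computed elsewhere; the page names and numbers here are placeholders.

```python
def combined_score(pages, query_title):
    """Rank candidate pages by VSM x FDS x PageRank, per the description above.
    pages: dict page_id -> {'vsm': float, 'fds': float, 'pagerank': float};
    'fds' is only used when the selected title consists of more than one term."""
    use_fds = len(query_title.split()) > 1
    ranking = {}
    for page_id, s in pages.items():
        score = s['vsm'] * s['pagerank']
        if use_fds:
            score *= s['fds']          # spatial (term-position) evidence [2][17]
        ranking[page_id] = score
    return sorted(ranking.items(), key=lambda kv: kv[1], reverse=True)

# usage sketch with hypothetical precomputed scores for three pages
pages = {'probability_distribution': {'vsm': 0.62, 'fds': 0.80, 'pagerank': 0.015},
         'order_statistic':          {'vsm': 0.40, 'fds': 0.35, 'pagerank': 0.010},
         'random_variable':          {'vsm': 0.55, 'fds': 0.60, 'pagerank': 0.008}}
print(combined_score(pages, 'Probability Distribution'))
```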
V. CONCLUSIONS AND FUTURE WORKS
In previous works we studied the literature and proposed a framework for a navigation generating system, evaluated the selected methods in the framework to confirm their validity, and then implemented the framework, with one aspect still missing for producing an appropriate learning path for users: the adaptive effect. This work brings the proposed framework for generating navigation, based solely on the contents of the document collection, to completion. With the addition of the adaptive effect, the system can generate navigation as guidance for users without neglecting the context of their interests in certain learning subjects. The adaptive content-based navigation generating system produces a structured list of topics: not only a hierarchical list across different levels but also a reading sequence within the same level. Instead of following links within the Web pages of learning materials and reading them in a hypertextual manner, users can follow the path in our generated navigation and stay away from hypertext disorientation. The system implementation in this paper takes the term adaptive from a user model holding information about the selected topic. As a note, it is assumed that the collection of learning materials consists of topics, and relations between topics, determined by the frequency of words in the documents of the collection. The topics selected by the user should be recorded to reflect the user's preferences among learning topics; the preference of a user is not yet adequately reflected in this paper. More elaborate studies are needed to represent the user model: information about previously visited links, the time spent on each link, or other signals could become references for modeling users, and we would like the system to also consider those aspects. The framework discussed so far involves analyzing hidden topics that we believe affect the optimal path offered by the navigation support system and, in turn, the reaction of the user to the system during the learning process. Although evaluating the framework's functionality has been
done, more quantitative evaluations of the effectiveness of the system are still expected. Such an evaluation must face the question of how to directly evaluate user reactions to the navigation generating system; our next work will also consider this unanswered question.
REFERENCES
[1] P. Brusilovsky and E. Millan, "The Adaptive Web: Methods and Strategies of Web Personalization", ser. Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2007, vol. 4321, ch. User Models for Adaptive Hypermedia and Adaptive Educational Systems, pp. 3–53.
[2] D. Purwitasari, Y. Okazaki, and K. Watanabe, "A Study on Web Resources' Navigation for e-Learning: Usage of Fourier Domain Scoring on Web Pages Ranking Method," in ICICIC '07: Proc. of the Second Intl. Conf. on Innovative Computing, Information and Control, 2007.
[3] C. Clifton, R. Cooley, and J. Rennie, "TopCat: Data Mining for Topic Identification in a Text Corpus," IEEE Trans. on Knowledge and Data Engineering, vol. 16, no. 8, pp. 949–964, 2004.
[4] M. E. S. Mendes, W. Jarrett, O. Prnjat, and L. Sacks, "Flexible Searching and Browsing for Telecoms Learning Material," in IST '2003: Proc. of the 2003 Intl. Symp. on Telecomm., 2003.
[5] M. Halkidi, B. Nguyen, I. Varlamis, and M. Vazirgiannis, "THESUS: Organizing Web Document Collections based on Link Semantics," The VLDB Journal, vol. 12, no. 4, pp. 320–332, 2003.
[6] J. Zhu, J. Hong, and J. G. Hughes, "PageCluster: Mining Conceptual Link Hierarchies from Web Log Files for Adaptive Web site Navigation," ACM Trans. Internet Technol., vol. 4, no. 2, pp. 185–208, 2004.
[7] A. D. Wissner-Gross, "Preparation of Topical Reading Lists from the Link Structure of Wikipedia," in ICALT '06: Proc. of the Sixth IEEE Intl. Conf. on Advanced Learning Technologies, 2006, pp. 825–829.
[8] S. Reinhold, "WikiTrails: Augmenting Wiki Structure for Collaborative, Interdisciplinary Learning," in WikiSym '06: Proc. of the 2006 Intl. Symp. on Wikis, 2006, pp. 47–58.
[9] D. Purwitasari, Y. Okazaki, and K. Watanabe, "Content-based Navigation in Web-based Learning Applications," in ICCE '08: Proc. of the 16th Intl. Conf. on Computers in Education, 2008, pp. 557–564.
[10] D. Purwitasari, Y. Okazaki, and K. Watanabe, "A Study on Adaptive Content-based Navigation Generating System for Web-based Learning," in SIG-ALST '53: Proc. of the 53rd Conf. on Special Interest Group - Adv. Learning Science and Tech., The Japanese Society for Artificial Intelligence, IEEE Education Japan Chap., 2008, pp. 31–36.
[11] D. Purwitasari, Y. Okazaki, and K. Watanabe, "Data Mining for Navigation Generating System with Unorganized Web Resources," in KES '08: Proc. of the 12th Intl. Conf. on Knowledge-Based Intelligent Information and Engineering Systems, 2008, pp. 598–605.
[12] G. Karypis, "Multilevel Hypergraph Partitioning," Comput. Sci. and Eng. Dept., Univ. Minnesota, Minneapolis, Tech. Rep. 02-25, 2002.
[13] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, 1999.
[14] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81–106, 1986.
[15] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," ACM SIGMOD, vol. 22, no. 2, pp. 207–216, 1993.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Stanford Digital Library Technologies Project, Tech. Rep., 1998.
[17] L. A. Park, K. Ramamohanarao, and M. Palaniswami, "Fourier Domain Scoring: A Novel Document Ranking Method," IEEE Trans. on Knowledge and Data Engineering, vol. 16, no. 5, pp. 529–539, May 2004.
Fuzzy Decision Tree dengan Algoritme ID3 pada Data Diabetes F. Romansyah, I. S. Sitanggang, S. Nurdiati
Abstract— The decision tree is one of the most widely used methods for developing classification models. In order to handle uncertainty, a fuzzy approach is used. This work applies a classification technique using the fuzzy decision tree method to a diabetes dataset to obtain classification rules for predicting new data. Fuzzy ID3 (fuzzy Iterative Dichotomiser 3) is used to develop a fuzzy decision tree with high accuracy. The result is a fuzzy decision tree with a highest accuracy of 94.15% for a fuzziness control threshold (θr) = 75% and a leaf decision threshold (θn) = 8% or 10%.
Index Terms— Fuzzy decision tree, Fuzzy ID3
This research aims to: 1) apply one classification technique, namely the Fuzzy ID3 (Iterative Dichotomiser 3) decision tree, to patients' laboratory examination data; 2) find classification rules in the diabetes data that describe and distinguish classes or concepts, so that they can be used to predict diabetes based on the values of other known attributes. The model produced in this research is expected to be usable by interested parties to predict the potential of a person or patient to develop diabetes, so that the onset of this disease can be predicted as early as possible and anticipatory action can be taken.
I. INTRODUCTION
Data mining is the process of extracting important information or patterns from large databases [3]. This research applies one of the techniques in data mining, namely classification, to diabetes data. The diabetes data used are the results of patients' laboratory examinations at a hospital, covering GLUN (fasting blood glucose), GPOST (2-hour postprandial blood glucose), TG (triglycerides), HDL (HDL cholesterol), and the patient's diagnosis based on the GLUN, GPOST, HDL, and TG values. Classification is a data mining method for predicting the class label of a record in the data. The method used in this research is classification with a fuzzy decision tree. The use of fuzzy techniques makes it possible to predict an object that belongs to more than one class. By applying data mining techniques to the diabetes data, it is hoped that classification rules can be found that can be used to predict a person's potential to develop diabetes.
Naskah diterima pada 2 Desember 2009. F. Romansyah adalah Associate Business Consultant (IT Consultant) di PT AGIT (Astra Graphia Information T echnology) 22/F ANZ T ower Jl. Jend Sudirman kavling 33A Jakarta 10220 (e-mail:
[email protected]). I. S. Sitanggang adalah staf pengajar di Departemen Ilmu Komputer FMIPA Institut Pertanian Bogor, Jl. Meranti, Wing 20 Level V, Kampus IPB Darmaga, Bogor 16680 – Indonesia (e-mail:
[email protected]). S. Nurdiati adalah staf pengajar di Departemen Ilmu Komputer FMIPA Institut Pertanian Bogor, Jl. Meranti, Wing 20 Level V, Kampus IPB Darmaga, Bogor 16680 – Indonesia (e-mail:
[email protected])
A. Peubah Linguistik dan Linguistic Term

Peubah linguistik merupakan peubah verbal yang dapat digunakan untuk memodelkan pemikiran manusia yang diekspresikan dalam bentuk himpunan fuzzy. Peubah linguistik dikarakterisasi oleh quintuple (x, T(x), X, G, M), dengan x adalah nama peubah, T(x) adalah kumpulan dari linguistic term, X menunjukkan interval nilai x, G adalah aturan sintaks, dan M adalah aturan semantik yang bersesuaian dengan setiap nilai peubah linguistik. Sebagai contoh, jika umur diinterpretasikan sebagai peubah linguistik, maka himpunan dari linguistic term T(umur) menjadi:

T(umur) = {sangat muda, muda, tua}

Setiap term dalam T(umur) dikarakterisasi oleh himpunan fuzzy. Aturan semantik menunjukkan fungsi keanggotaan dari setiap nilai pada himpunan linguistic term [1]. Linguistic term didefinisikan sebagai kumpulan himpunan fuzzy yang didasarkan pada fungsi keanggotaan yang bersesuaian dengan peubah linguistik.

Misalkan D adalah kumpulan record yang terdiri dari himpunan atribut I = {I_1, ..., I_n}, dengan I_v, v = 1, ..., n. Atribut I_v dapat berupa atribut numerik atau kategorikal. Untuk setiap record d elemen D, d[I_v] menotasikan nilai i_v dalam record d untuk atribut I_v. Kumpulan linguistic term dapat didefinisikan pada seluruh domain dari atribut kuantitatif. L_vr, r = 1, ..., s_v, menotasikan linguistic term yang berasosiasi dengan atribut I_v, sehingga himpunan fuzzy dapat didefinisikan untuk setiap L_vr. Himpunan fuzzy L_vr, r = 1, ..., s_v, didefinisikan sebagai:

$$L_{vr} = \begin{cases} \sum_{i_v \in dom(I_v)} \mu_{L_{vr}}(i_v)/i_v & \text{jika } I_v \text{ diskret} \\ \int_{dom(I_v)} \mu_{L_{vr}}(i_v)/i_v & \text{jika } I_v \text{ kontinu} \end{cases}$$

untuk semua i_v ∈ dom(I_v), dengan dom(I_v) = {i_v1, ..., i_vn}.
Derajat keanggotaan dari nilai i_v ∈ dom(I_v) pada suatu linguistic term L_vr dinotasikan oleh µ_Lvr.

B. Fuzzy Decision Tree (FDT)

Decision tree merupakan suatu pendekatan yang sangat populer dan praktis dalam machine learning untuk menyelesaikan permasalahan klasifikasi. Metode ini digunakan untuk memperkirakan nilai diskret dari fungsi target, yang mana fungsi pembelajaran direpresentasikan oleh sebuah decision tree [5]. Decision tree merupakan himpunan aturan IF...THEN. Setiap path dalam tree dihubungkan dengan sebuah aturan, di mana premis terdiri atas sekumpulan node yang ditemui, dan kesimpulan dari aturan terdiri atas kelas yang terhubung dengan leaf dari path tersebut [6].

Dalam pohon keputusan, leaf node diberikan sebuah label kelas. Non-terminal node, yang terdiri atas root dan internal node lainnya, mengandung kondisi-kondisi uji atribut untuk memisahkan record yang memiliki karakteristik yang berbeda. Edge-edge dapat dilabelkan dengan nilai-nilai numeric-symbolic. Sebuah atribut numeric-symbolic adalah sebuah atribut yang dapat bernilai numeric ataupun symbolic yang dihubungkan dengan sebuah variabel kuantitatif. Sebagai contoh, ukuran seseorang dapat dituliskan sebagai atribut numeric-symbolic: dengan nilai kuantitatif, dituliskan dengan "1,72 meter", ataupun sebagai nilai numeric-symbolic seperti "tinggi" yang berkaitan dengan suatu ukuran (size). Nilai-nilai seperti inilah yang menyebabkan perluasan dari decision tree menjadi fuzzy decision tree [8].

Penggunaan teknik fuzzy memungkinkan melakukan prediksi suatu objek yang dimiliki oleh lebih dari satu kelas. Fuzzy decision tree memungkinkan penggunaan nilai-nilai numeric-symbolic selama konstruksi maupun saat mengklasifikasikan kasus-kasus baru. Manfaat dari teori himpunan fuzzy dalam decision tree ialah meningkatkan kemampuan dalam memahami decision tree ketika menggunakan atribut-atribut kuantitatif. Bahkan, dengan menggunakan teknik fuzzy dapat meningkatkan ketahanan saat melakukan klasifikasi kasus-kasus baru [6].

C. Fuzzy ID3 Decision Tree

ID3 (Iterative Dichotomiser 3) merupakan salah satu algoritme yang banyak digunakan untuk membuat suatu decision tree. Algoritme ini pertama kali diperkenalkan oleh Quinlan dan menggunakan teori informasi untuk menentukan atribut mana yang paling informatif, namun ID3 sangat tidak stabil dalam melakukan penggolongan berkenaan dengan gangguan kecil pada data pelatihan. Logika fuzzy dapat memberikan suatu peningkatan dalam melakukan penggolongan pada saat pelatihan [5].
Algoritme fuzzy ID3 merupakan algoritme yang efisien untuk membuat suatu fuzzy decision tree. Algoritme fuzzy ID3 adalah sebagai berikut [5]:

1. Create a root node that has a set of fuzzy data with membership value 1.
2. If a node t with a fuzzy set of data D satisfies the following conditions, then it is a leaf node and is assigned the class name:
   • the proportion of class C_k is greater than or equal to θ_r, yaitu |D^Ci| / |D| ≥ θ_r;
   • the number of elements in the data set is less than θ_n; or
   • there are no attributes for more classifications.
3. If a node D does not satisfy the above conditions, then it is not a leaf node, and a new sub-node is generated as follows:
   • For each A_i (i = 1, ..., L), calculate the information gain, and select the test attribute A_max that maximizes it.
   • Divide D into fuzzy subsets D_1, ..., D_m according to A_max, where the membership value of the data in D_j is the product of the membership value in D and the value of F_max,j of the value of A_max in D.
   • Generate new nodes t_1, ..., t_m for the fuzzy subsets D_1, ..., D_m and label the fuzzy sets F_max,j to the edges that connect the nodes t_j and t.
   • Replace D by D_j (j = 1, 2, ..., m) and repeat from step 2 recursively.
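Sebagai gambaran, berikut sketsa sederhana algoritme di atas dalam Python. Ini bukan kode MATLAB yang digunakan pada penelitian ini; nama fungsi, struktur data, dan nilai default threshold hanyalah asumsi ilustratif (nilai default mengikuti contoh θr = 80% dan θn = 3 pada Bab III).

```python
import math

def fuzzy_entropy(memberships, labels, classes):
    """Entropy fuzzy: proporsi tiap kelas dihitung dari jumlah derajat keanggotaan."""
    total = sum(memberships)
    if total == 0:
        return 0.0
    h = 0.0
    for c in classes:
        p = sum(m for m, y in zip(memberships, labels) if y == c) / total
        if p > 0:
            h -= p * math.log2(p)
    return h

def build_fid3(data, memberships, labels, attributes, terms, classes,
               theta_r=0.80, theta_n=3.0):
    """data: list berisi dict atribut -> {linguistic term: derajat keanggotaan}."""
    total = sum(memberships)
    if total == 0:
        return {"leaf": True, "class": None}
    # proporsi tiap kelas pada node ini
    props = {c: sum(m for m, y in zip(memberships, labels) if y == c) / total
             for c in classes}
    best_class, best_prop = max(props.items(), key=lambda kv: kv[1])
    # kondisi leaf: proporsi kelas >= theta_r, jumlah data < theta_n,
    # atau tidak ada atribut tersisa (catatan: di sini "jumlah data" diukur
    # dari kardinalitas fuzzy; pada contoh di makalah digunakan cacah record)
    if best_prop >= theta_r or total < theta_n or not attributes:
        return {"leaf": True, "class": best_class, "proportions": props}
    h_node = fuzzy_entropy(memberships, labels, classes)
    def information_gain(attr):
        # gain = entropy node dikurangi rata-rata tertimbang entropy tiap subset
        g = h_node
        for term in terms[attr]:
            sub = [m * rec[attr][term] for m, rec in zip(memberships, data)]
            g -= (sum(sub) / total) * fuzzy_entropy(sub, labels, classes)
        return g
    a_max = max(attributes, key=information_gain)  # atribut dengan gain terbesar
    node = {"leaf": False, "attribute": a_max, "children": {}}
    remaining = [a for a in attributes if a != a_max]
    for term in terms[a_max]:
        # derajat keanggotaan baru = derajat lama x derajat keanggotaan term terpilih
        sub = [m * rec[a_max][term] for m, rec in zip(memberships, data)]
        node["children"][term] = build_fid3(data, sub, labels, remaining, terms,
                                            classes, theta_r, theta_n)
    return node
```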
D. Fuzzy Entropy dan Information Gain

Information gain adalah suatu nilai statistik yang digunakan untuk memilih atribut yang akan mengekspansi tree dan menghasilkan node baru pada algoritme ID3. Suatu entropy dipergunakan untuk mendefinisikan nilai information gain. Entropy dirumuskan sebagai berikut [5]:

$$H_s(S) = \sum_{i=1}^{N} -P_i \log_2 P_i \qquad (1)$$

dengan P_i adalah rasio dari kelas C_i pada himpunan contoh S = {x_1, x_2, ..., x_k}:

$$P_i = \frac{\sum_{j=1}^{k} x_j \in C_i}{|S|} \qquad (2)$$

Information gain digunakan sebagai ukuran seleksi atribut, yang merupakan hasil pengurangan entropy dari himpunan contoh setelah membagi ukuran himpunan contoh dengan jumlah atributnya. Information gain didefinisikan sebagai berikut [5]:

$$G(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v) \qquad (3)$$

dengan bobot W_i = |S_v|/|S| adalah rasio dari data dengan atribut v pada himpunan contoh. Pada himpunan data fuzzy, terdapat penyesuaian rumus untuk menghitung nilai entropy untuk atribut dan information gain karena adanya ekspresi data fuzzy. Berikut adalah persamaan untuk mencari nilai fuzzy entropy dari keseluruhan data [5]:

$$H_f(S) = H_s(S) = \sum_{i=1}^{N} -P_i \log_2 P_i \qquad (4)$$
Untuk menentukan fuzzy entropy dan information gain dari suatu atribut pada algoritme fuzzy ID3 (FID3) digunakan persamaan sebagai berikut [5]:
$$H_f(S, A) = -\sum_{i=1}^{C} \frac{\sum_{j=1}^{N} \mu_{ij}}{|S|} \log_2 \frac{\sum_{j=1}^{N} \mu_{ij}}{|S|} \qquad (5)$$

$$G_f(S) = H_f(S) - \sum_{v \subseteq A} \frac{|S_v|}{|S|} H_f(S_v, A) \qquad (6)$$
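Sebagai ilustrasi numerik persamaan (5), ambil sebuah sub-node yang jumlah derajat keanggotaannya 2.1 untuk kelas 1 dan 1 untuk kelas 2 (angka yang sama dipakai pada contoh perhitungan proporsi kelas di Bab III), sehingga |S_v| = 3.1 dan entropinya:

$$H_f \approx -\frac{2.1}{3.1}\log_2\frac{2.1}{3.1} - \frac{1}{3.1}\log_2\frac{1}{3.1} \approx 0.91$$

Nilai 0.91 ini hanyalah ilustrasi cara menghitung, bukan angka yang dilaporkan pada makalah.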
dengan µ_ij adalah nilai keanggotaan dari pola ke-j untuk kelas ke-i. H_f(S) menunjukkan entropy dari himpunan S dari data pelatihan pada node. |S_v| adalah ukuran dari subset S_v ⊆ S dari data pelatihan x_j dengan atribut v. |S| menunjukkan ukuran dari himpunan S.

E. Threshold dalam Fuzzy Decision Tree

Jika proses learning dari FDT dihentikan hanya setelah semua data contoh pada masing-masing leaf-node menjadi anggota sebuah kelas, akan dihasilkan akurasi yang rendah. Oleh karena itu, untuk meningkatkan akurasinya, proses learning harus dihentikan lebih awal atau dilakukan pemotongan tree secara umum. Untuk itu diberikan 2 (dua) threshold yang harus terpenuhi jika tree akan diekspansi, yaitu [5]:

• Fuzziness control threshold (FCT) / θr
Jika proporsi dari himpunan data dari kelas Ck lebih besar atau sama dengan nilai threshold θr, maka hentikan ekspansi tree. Sebagai contoh: jika pada sebuah sub-dataset rasio dari kelas 1 adalah 90%, kelas 2 adalah 10%, dan θr adalah 85%, maka hentikan ekspansi tree.

• Leaf decision threshold (LDT) / θn
Jika banyaknya anggota himpunan data pada suatu node lebih kecil dari threshold θn, hentikan ekspansi tree. Sebagai contoh: sebuah himpunan data memiliki 600 contoh dengan θn adalah 2%. Jika jumlah data contoh pada sebuah node lebih kecil dari 12 (2% dari 600), maka hentikan ekspansi tree.
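Sebagai ilustrasi, kedua kriteria penghentian di atas dapat dituliskan sebagai fungsi sederhana berikut (sketsa; nama fungsi dan parameter hanyalah asumsi, dengan angka contoh yang diambil dari teks).

```python
# Sketsa pemeriksaan kedua threshold FDT (ilustrasi, bukan kode asli penulis).
def berhenti_ekspansi(proporsi_kelas, jumlah_data, total_data,
                      theta_r=0.85, theta_n=0.02):
    """True jika ekspansi tree dihentikan pada node ini."""
    # FCT: proporsi kelas terbesar sudah >= theta_r
    if max(proporsi_kelas) >= theta_r:
        return True
    # LDT: jumlah data pada node < theta_n x total data
    if jumlah_data < theta_n * total_data:
        return True
    return False

# Contoh dari teks: rasio kelas 90%/10% dengan theta_r = 85% -> berhenti;
# 600 contoh dengan theta_n = 2% -> node dengan < 12 data juga berhenti.
print(berhenti_ekspansi([0.90, 0.10], jumlah_data=300, total_data=600))  # True
print(berhenti_ekspansi([0.60, 0.40], jumlah_data=10, total_data=600))   # True
```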
III. TAHAPAN PENELITIAN

Proses dasar sistem mengacu pada proses dalam Knowledge Discovery in Databases (KDD). Proses tersebut dapat diuraikan sebagai berikut:
a. Pembersihan data, membuang data dengan nilai yang hilang dan data yang duplikat.
b. Transformasi data ke dalam bentuk data fuzzy.
c. Data dibagi menjadi training set dan test set dengan menggunakan metode k-fold cross validation [2].
d. Aplikasi teknik data mining, merupakan tahap yang penting karena pada tahap ini teknik data mining diaplikasikan terhadap data. Untuk menemukan aturan klasifikasi digunakan metode fuzzy decision tree. Langkah-langkah pada metode tersebut yaitu:
1. Menentukan atribut yang akan digunakan.
2. Menentukan banyaknya fuzzy set untuk masing-masing atribut.
3. Menentukan banyaknya training set yang akan digunakan.
4. Menghitung membership value.
5. Memilih besarnya threshold yang akan digunakan.
6. Membangun fuzzy decision tree dengan algoritme Fuzzy ID3.

Spesifikasi perangkat keras dan perangkat lunak yang digunakan untuk penelitian ini adalah sebagai berikut: perangkat keras berupa komputer personal dengan spesifikasi Prosesor AMD Athlon 64 2800+, Memori DDR 512 MB, dan Harddisk 80 GB; perangkat lunak yang digunakan adalah sistem operasi Windows XP Professional, MATLAB 7.0 sebagai bahasa pemrograman, dan Microsoft Excel 2003.

A. Transformasi Data

Dari 5 atribut yang digunakan, 4 di antaranya merupakan atribut yang kontinu, yaitu GLUN, GPOST, HDL, dan TG, sedangkan atribut Diagnosa adalah atribut kategorik. Berdasarkan referensi hasil laboratorium, range normal untuk atribut GLUN, GPOST, HDL, dan TG diperlihatkan pada Tabel 1.

TABLE 1
NILAI REFERENSI HASIL LABORATORIUM

  Kode Pemeriksaan   Satuan   Nilai Normal
  GLUN               mg/dL    70 ~ 100
  GPOST              mg/dL    100 ~ 140
  HDL                mg/dL    40 ~ 60
  TG                 mg/dL    50 ~ 150
Atribut yang telah ditransformasi ke dalam bentuk fuzzy antara lain:

• Atribut GLUN
Atribut GLUN dibagi menjadi 4 kelompok atau linguistic term, yaitu rendah (GLUN < 70 mg/dL), sedang (70 mg/dL <= GLUN < 110 mg/dL), tinggi (110 mg/dL <= GLUN < 140 mg/dL), dan sangat tinggi (GLUN >= 140 mg/dL) [4]. Dari pembagian itu dapat ditentukan membership function dari himpunan fuzzy rendah, sedang, tinggi, dan sangat tinggi untuk atribut GLUN secara terpisah, yaitu:

$$\mu_{rendah}(x) = \begin{cases} 1 & x < 65 \\ (75 - x)/10 & 65 \le x < 75 \\ 0 & x \ge 75 \end{cases}$$

$$\mu_{sedang}(x) = \begin{cases} 0 & x < 65 \\ (x - 65)/10 & 65 \le x < 75 \\ 1 & 75 \le x < 105 \\ (115 - x)/10 & 105 \le x < 115 \\ 0 & x \ge 115 \end{cases}$$

$$\mu_{tinggi}(x) = \begin{cases} 0 & x < 105 \\ (x - 105)/10 & 105 \le x < 115 \\ 1 & 115 \le x < 135 \\ (145 - x)/10 & 135 \le x < 145 \\ 0 & x \ge 145 \end{cases}$$

$$\mu_{sangatTinggi}(x) = \begin{cases} 0 & x < 135 \\ (x - 135)/10 & 135 \le x < 145 \\ 1 & x \ge 145 \end{cases}$$
Himpunan fuzzy untuk setiap linguistic term menggunakan kurva dengan bentuk trapesium seperti pada Gambar 1.
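Sebagai ilustrasi, fungsi keanggotaan trapesium untuk atribut GLUN di atas dapat dihitung dengan sketsa Python berikut (bukan kode MATLAB yang digunakan penulis; nama fungsi hanyalah asumsi).

```python
# Sketsa fungsi keanggotaan trapesium untuk atribut GLUN.
def trapesium(x, a, b, c, d):
    """Naik linear pada [a,b], bernilai 1 pada [b,c], turun linear pada [c,d]."""
    if x < a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x < c:
        return 1.0
    return (d - x) / (d - c)

def mf_glun(x):
    """Derajat keanggotaan GLUN pada keempat linguistic term."""
    return {
        "rendah":        1.0 if x < 65 else (75 - x) / 10 if x < 75 else 0.0,
        "sedang":        trapesium(x, 65, 75, 105, 115),
        "tinggi":        trapesium(x, 105, 115, 135, 145),
        "sangat_tinggi": 0.0 if x < 135 else (x - 135) / 10 if x < 145 else 1.0,
    }

# Contoh: GLUN = 108 mg/dL berada pada transisi sedang-tinggi.
print(mf_glun(108))  # {'rendah': 0.0, 'sedang': 0.7, 'tinggi': 0.3, 'sangat_tinggi': 0.0}
```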
Gambar 1. Himpunan fuzzy atribut GLUN (kurva keanggotaan trapesium; sumbu mendatar: GLUN, sumbu tegak: MF, dengan kurva µ_rendah, µ_sedang, µ_tinggi, dan µ_sangatTinggi)

Gambar 3. Himpunan fuzzy atribut HDL (kurva keanggotaan trapesium; sumbu mendatar: HDL, sumbu tegak: MF, dengan kurva µ_rendah, µ_sedang, dan µ_tinggi)

• Atribut GPOST
Atribut GPOST dibagi menjadi 4 kelompok atau linguistic term, yaitu rendah (GPOST < 100 mg/dL), sedang (100 mg/dL <= GPOST < 140 mg/dL), tinggi (140 mg/dL <= GPOST < 200 mg/dL), dan sangat tinggi (GPOST >= 200 mg/dL) [4]. Dari pembagian itu dapat ditentukan membership function dari himpunan fuzzy rendah, sedang, tinggi, dan sangat tinggi untuk atribut GPOST secara terpisah, yaitu:

$$\mu_{rendah}(x) = \begin{cases} 1 & x < 95 \\ (105 - x)/10 & 95 \le x < 105 \\ 0 & x \ge 105 \end{cases}$$

$$\mu_{sedang}(x) = \begin{cases} 0 & x < 95 \\ (x - 95)/10 & 95 \le x < 105 \\ 1 & 105 \le x < 135 \\ (145 - x)/10 & 135 \le x < 145 \\ 0 & x \ge 145 \end{cases}$$

$$\mu_{tinggi}(x) = \begin{cases} 0 & x < 135 \\ (x - 135)/10 & 135 \le x < 145 \\ 1 & 145 \le x < 195 \\ (205 - x)/10 & 195 \le x < 205 \\ 0 & x \ge 205 \end{cases}$$

$$\mu_{sangatTinggi}(x) = \begin{cases} 0 & x < 195 \\ (x - 195)/10 & 195 \le x < 205 \\ 1 & x \ge 205 \end{cases}$$

Himpunan fuzzy untuk setiap linguistic term menggunakan kurva dengan bentuk trapesium seperti pada Gambar 2.
• Atribut TG
Atribut TG dibagi menjadi 3 kelompok atau linguistic term, yaitu rendah (TG < 50 mg/dL), sedang (50 mg/dL <= TG < 150 mg/dL), dan tinggi (TG >= 150 mg/dL) [4]. Dari pembagian itu dapat ditentukan membership function dari himpunan fuzzy rendah, sedang, dan tinggi untuk atribut TG secara terpisah, yaitu:

$$\mu_{rendah}(x) = \begin{cases} 1 & x < 45 \\ (55 - x)/10 & 45 \le x < 55 \\ 0 & x \ge 55 \end{cases}$$

$$\mu_{sedang}(x) = \begin{cases} 0 & x < 45 \\ (x - 45)/10 & 45 \le x < 55 \\ 1 & 55 \le x < 145 \\ (155 - x)/10 & 145 \le x < 155 \\ 0 & x \ge 155 \end{cases}$$

$$\mu_{tinggi}(x) = \begin{cases} 0 & x < 145 \\ (x - 145)/10 & 145 \le x < 155 \\ 1 & x \ge 155 \end{cases}$$
Himpunan fuzzy untuk setiap linguistic term menggunakan kurva dengan bentuk trapesium seperti pada Gambar 4.
Gambar 2. Himpunan fuzzy atribut GPOST (kurva keanggotaan trapesium; sumbu mendatar: GPOST, sumbu tegak: MF, dengan kurva µ_rendah, µ_sedang, µ_tinggi, dan µ_sangatTinggi)

Gambar 4. Himpunan fuzzy atribut TG (kurva keanggotaan trapesium; sumbu mendatar: TG, sumbu tegak: MF, dengan kurva µ_rendah, µ_sedang, dan µ_tinggi)
• Atribut HDL
Atribut HDL dibagi menjadi 3 kelompok atau linguistic term, yaitu rendah (HDL < 40 mg/dL), sedang (40 mg/dL <= HDL < 60 mg/dL), dan tinggi (HDL >= 60 mg/dL) [4]. Dari pembagian itu dapat ditentukan membership function dari himpunan fuzzy rendah, sedang, dan tinggi untuk atribut HDL secara terpisah, yaitu:

$$\mu_{rendah}(x) = \begin{cases} 1 & x < 35 \\ (45 - x)/10 & 35 \le x < 45 \\ 0 & x \ge 45 \end{cases}$$

$$\mu_{sedang}(x) = \begin{cases} 0 & x < 35 \\ (x - 35)/10 & 35 \le x < 45 \\ 1 & 45 \le x < 55 \\ (65 - x)/10 & 55 \le x < 65 \\ 0 & x \ge 65 \end{cases}$$

$$\mu_{tinggi}(x) = \begin{cases} 0 & x < 55 \\ (x - 55)/10 & 55 \le x < 65 \\ 1 & x \ge 65 \end{cases}$$

Himpunan fuzzy untuk setiap linguistic term menggunakan kurva dengan bentuk trapesium seperti pada Gambar 3.
Data dari atribut GLUN, GPOST, HDL, dan TG kemudian akan ditransformasi ke dalam bentuk fuzzy dengan menghitung derajat keanggotaan fuzzy pada tiap-tiap himpunan dari domain setiap atribut linguistik.

• Atribut Diagnosa
Atribut Diagnosa selanjutnya akan disebut sebagai CLASS dan direpresentasikan oleh dua buah linguistic term, yaitu "negatif diabetes" dan "positif diabetes". Kedua linguistic term tersebut didefinisikan sebagai berikut:
"negatif diabetes" = 1
"positif diabetes" = 2
Nilai atribut CLASS yang akan dikategorikan sebagai positif diabetes adalah diagnosa dengan label E10 (Insulin-dependent diabetes mellitus), E11 (Non-insulin-dependent diabetes mellitus), E12 (Malnutrition-related diabetes mellitus), dan E13 (Unspecified diabetes mellitus); selainnya dikategorikan sebagai negatif diabetes.
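Sebagai ilustrasi, pemetaan label diagnosa ke nilai atribut CLASS dapat dituliskan seperti sketsa berikut; kode "E78" pada contoh hanyalah contoh kode selain E10-E13, bukan data dari makalah.

```python
# Sketsa pemetaan label diagnosa ke kelas (1 = negatif diabetes, 2 = positif diabetes).
KODE_POSITIF = {"E10", "E11", "E12", "E13"}  # kode diagnosa diabetes mellitus

def kelas_diagnosa(kode_diagnosa):
    return 2 if kode_diagnosa in KODE_POSITIF else 1

print(kelas_diagnosa("E11"))  # 2 (positif diabetes)
print(kelas_diagnosa("E78"))  # 1 (negatif diabetes); contoh kode di luar E10-E13
```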
B. Data Mining

Pada tahap ini dilakukan teknik data mining menggunakan algoritme FID3 untuk membangun fuzzy decision tree (FDT). Data yang telah ditransformasi akan dibagi menjadi training set dan test set. Pembagian data ini menggunakan metode 10-fold cross validation. Data akan dibagi menjadi 10 subset (S1, ..., S10) yang berbeda dengan jumlah yang sama besar. Setiap kali sebuah subset digunakan sebagai test set maka 9 buah partisi lainnya akan dijadikan sebagai training set.

Fase training dilakukan untuk membangun FDT dengan menggunakan algoritme FID3. Proses training harus melalui berbagai tahapan; sebagai contoh akan dijelaskan pembentukan FDT dengan menggunakan contoh training set yang terdiri dari 15 data atau record. Pada contoh training set tersebut akan diterapkan algoritme fuzzy ID3 untuk menemukan model atau aturan klasifikasi. Langkah-langkah pembentukan aturan klasifikasi dengan algoritme FID3, yaitu:
1. Membuat root node dari semua data training yang ada dengan memberi nilai derajat keanggotaan untuk semua record sama dengan 1.
2. Menghitung fuzzy entropy dari training set yang ada. Dari hasil perhitungan diperoleh nilai fuzzy entropy sebesar 0.9968. Nilai ini akan digunakan untuk menghitung nilai information gain dari masing-masing atribut.
3. Menghitung information gain dari atribut GLUN, GPOST, HDL, dan TG, masing-masing diperoleh nilai 0.2064, 0.3330, 0.0304, dan 0.0050. Dari hasil tersebut dipilih atribut dengan nilai information gain terbesar yaitu GPOST. Atribut GPOST selanjutnya digunakan untuk mengekspansi tree atau menjadi root-node, tapi pada sub-node selanjutnya atribut GPOST tidak dapat digunakan untuk mengekspansi tree.
4. Data training diekspansi berdasarkan atribut GPOST sehingga diperoleh hasil seperti pada Gambar 5.
5. Menghitung proporsi dari tiap kelas yang ada pada tiap-tiap node. Misalkan, untuk sub-node dengan nilai keanggotaan atribut rendah, proporsi kelasnya adalah:
   C1 = 0.1 + 1 + 1 = 2.1
   C2 = 1
   Proporsi kelas 1 (negatif diabetes) = C1 / (C1 + C2) × 100% = 67.74%
   Proporsi kelas 2 (positif diabetes) = C2 / (C1 + C2) × 100% = 32.26%
6. Pada contoh ini digunakan fuzziness control threshold (θr) sebesar 80% dan leaf decision threshold (θn) sebesar 20% × 15 = 3. Kedua threshold tersebut akan menentukan apakah sub-node akan terus diekspansi atau tidak. Misalkan sub-node dengan nilai atribut rendah yang memiliki data sebanyak 4; berdasarkan nilai proporsi kelas 1 (67.74%) dan kelas 2 (32.26%) yang lebih kecil dari θr (80%), dan banyaknya data atau record pada sub-node tersebut yang lebih besar dari θn, maka sub-node tersebut akan terus diekspansi. Lain halnya jika θr yang digunakan adalah 65%, maka sub-node tersebut tidak akan diekspansi.
7. Lakukan terus ekspansi dari sub-node yang ada sampai tidak ada lagi data yang dapat diekspansi atau tidak ada lagi atribut yang dapat digunakan untuk mengekspansi tree, yaitu ketika tree yang terbentuk sudah mencapai kedalaman maksimum atau sub-node tidak memenuhi syarat dari threshold yang diberikan. Jika sub-node sudah tidak dapat diekspansi maka nilai proporsi kelas terbesar merupakan kesimpulan dari sekumpulan aturan yang diperoleh dengan menghubungkan setiap node yang dilewati dari root node hingga leaf node. Gambar 6 menunjukkan fuzzy decision tree yang diperoleh dari training set.
Gambar 5. Hasil ekspansi training set berdasarkan atribut GPOST

Gambar 6. Fuzzy decision tree untuk contoh training set
Nilai derajat keanggotaan yang baru untuk masing-masing record pada sub-node diperoleh dari hasil perkalian antara derajat keanggotaan pada root node dan derajat keanggotaan atribut yang digunakan untuk mengekspansi tree. Misalkan, untuk sub-node dengan nilai atribut rendah, nilai derajat keanggotaan dari data no. 38 adalah µl = 0.1 dan derajat keanggotaan dari data no. 38 pada root node adalah µold = 1, maka nilai derajat keanggotaannya pada sub-node, µnew, adalah:
µnew = µl ⊗ µold = 0.1 ⊗ 1 = 0.1
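Perhitungan pada langkah 5 dan perkalian derajat keanggotaan di atas dapat diperiksa ulang dengan beberapa baris Python berikut (angka diambil dari contoh di teks).

```python
# Ilustrasi perhitungan pada sub-node "GPOST rendah" dari contoh di atas.
mu_baru = 1 * 0.1             # mu_new = mu_old x mu_term = 1 x 0.1 = 0.1
C1 = 0.1 + 1 + 1              # total keanggotaan kelas 1 (negatif diabetes) = 2.1
C2 = 1.0                      # total keanggotaan kelas 2 (positif diabetes)
prop1 = C1 / (C1 + C2) * 100  # 67.74%
prop2 = C2 / (C1 + C2) * 100  # 32.26%
print(mu_baru, round(prop1, 2), round(prop2, 2))
```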
IV. HASIL DAN PEMBAHASAN

Berdasarkan langkah-langkah algoritme FID3 dalam Bab III, diperoleh sebuah model yang terdiri atas 9 aturan dengan menggunakan training set. Model atau aturan klasifikasi yang diperoleh:
1. IF GPOST rendah AND HDL rendah THEN Negatif Diabetes
2. IF GPOST rendah AND HDL sedang THEN Negatif Diabetes
3. IF GPOST rendah AND HDL tinggi THEN Positif Diabetes
4. IF GPOST sedang THEN Negatif Diabetes
5. IF GPOST tinggi AND GLUN rendah THEN Negatif Diabetes
6. IF GPOST tinggi AND GLUN sedang THEN Negatif Diabetes
7. IF GPOST tinggi AND GLUN tinggi THEN Negatif Diabetes
8. IF GPOST tinggi AND GLUN sangat tinggi THEN Positif Diabetes
9. IF GPOST sangat tinggi THEN Positif Diabetes
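Sebagai gambaran kasar bagaimana aturan semacam ini dipakai untuk memprediksi sebuah record, berikut sketsa penyederhanaan dalam Python. Penulis menggunakan FIS Mamdani [7] pada fase testing; sketsa ini hanya memakai agregasi product untuk AND dan max antar-aturan sebagai ilustrasi, bukan implementasi Mamdani yang sebenarnya.

```python
# Sketsa penyederhanaan penerapan aturan fuzzy (bukan FIS Mamdani yang dipakai penulis).
RULES = [
    ({"GPOST": "rendah", "HDL": "rendah"},          "Negatif Diabetes"),
    ({"GPOST": "rendah", "HDL": "sedang"},          "Negatif Diabetes"),
    ({"GPOST": "rendah", "HDL": "tinggi"},          "Positif Diabetes"),
    ({"GPOST": "sedang"},                           "Negatif Diabetes"),
    ({"GPOST": "tinggi", "GLUN": "rendah"},         "Negatif Diabetes"),
    ({"GPOST": "tinggi", "GLUN": "sedang"},         "Negatif Diabetes"),
    ({"GPOST": "tinggi", "GLUN": "tinggi"},         "Negatif Diabetes"),
    ({"GPOST": "tinggi", "GLUN": "sangat tinggi"},  "Positif Diabetes"),
    ({"GPOST": "sangat tinggi"},                    "Positif Diabetes"),
]

def prediksi(derajat):
    """derajat: dict atribut -> {linguistic term: derajat keanggotaan} hasil fuzzifikasi."""
    skor = {}
    for premis, kelas in RULES:
        # kekuatan aturan = hasil kali derajat keanggotaan setiap kondisi premis (AND)
        kekuatan = 1.0
        for atribut, term in premis.items():
            kekuatan *= derajat.get(atribut, {}).get(term, 0.0)
        # agregasi antar-aturan dengan kelas yang sama memakai nilai maksimum
        skor[kelas] = max(skor.get(kelas, 0.0), kekuatan)
    return max(skor, key=skor.get)
```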
Berdasarkan metode uji 10-fold cross validation, proses training dilakukan sebanyak 240 kali. Untuk setiap training set, proses training dilakukan sebanyak 24 kali, dengan mengubah nilai θr sebanyak 6 kali yaitu 75%, 80%, 85%, 90%, 95%, dan 98%, dan untuk masing-masing nilai θr yang sama diberikan nilai θn yang berbeda-beda yaitu 3%, 5%, 8%, dan 10%. Rata-rata jumlah aturan yang dihasilkan pada proses training dan waktu eksekusi yang dibutuhkan dapat dilihat pada Tabel 2 dan Tabel 3.
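Skema percobaan tersebut (10 fold × 6 nilai θr × 4 nilai θn = 240 proses training) dapat digambarkan secara kasar sebagai berikut; isi loop hanyalah placeholder ilustratif, bukan kode penelitian yang sebenarnya.

```python
# Sketsa kasar skema percobaan: 240 kombinasi fold x theta_r x theta_n.
THETA_R = [0.75, 0.80, 0.85, 0.90, 0.95, 0.98]
THETA_N = [0.03, 0.05, 0.08, 0.10]

jumlah_run = 0
for fold in range(10):              # 10-fold cross validation
    for theta_r in THETA_R:
        for theta_n in THETA_N:
            # di sini model dibangun pada training set fold ini (misalnya dengan
            # build_fid3 pada sketsa sebelumnya) lalu diuji pada test set-nya
            jumlah_run += 1
print(jumlah_run)  # 240
```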
TABLE 2
RATA-RATA JUMLAH ATURAN

  θn \ θr    75%   80%   85%   90%   95%   98%
  3%          4     7    11    12    20    27
  5%          4     7    10    11    18    24
  8%          4     7    10    10    15    20
  10%         4     6     8     8    11    16
Nilai-nilai θr dan θn yang digunakan dipilih berdasarkan hasil percobaan, karena dengan nilai-nilai tersebut jumlah model atau aturan yang dihasilkan mengalami perubahan yang cukup signifikan. Dengan nilai θr 70%, aturan yang diperoleh tidak berbeda dengan nilai θr 75% sehingga nilai θr hanya dipilih sampai 75%.
Berdasarkan hasil percobaan, semakin tinggi nilai θr semakin banyak pula aturan yang dihasilkan. Hal ini terjadi karena sebelum suatu node didominasi oleh sebuah kelas dan proporsi untuk kelas tersebut di atas atau sama dengan nilai θr, tree akan terus diekspansi. Pada Tabel 2 terlihat bahwa peningkatan yang paling signifikan terjadi pada saat nilai θr dinaikkan dari 90% menjadi 95% dan dari 95% menjadi 98%. Kondisi semacam ini disebabkan karena pada saat ekspansi training set yang pertama kali dilakukan, banyak sub-node yang proporsi salah satu kelasnya sudah mencapai nilai di atas 90%, sehingga sub-node tersebut tidak perlu diekspansi lagi. Ketika nilai θr ditingkatkan sampai 95%, baru terjadi ekspansi pada beberapa sub-node yang lain, dan hasil proporsi kelas pada node di bawahnya pun ternyata banyak yang lebih rendah dari nilai θr, yang mengakibatkan training set akan terus diekspansi sampai seluruh atribut terpakai.

Jika diamati dengan seksama pada Tabel 2, walaupun nilai θn ditingkatkan, jumlah aturan yang dihasilkan tidak mengalami penurunan secara signifikan. Berdasarkan pengamatan yang dilakukan, ternyata karakteristik data pada training set yang digunakan tidak terlalu berbeda; pada saat terjadi ekspansi tree, data tidak akan terlalu menyebar, karenanya jumlah himpunan data yang ada pada sub-node tidak berbeda jauh dengan jumlah himpunan data yang ada pada root-node. Dengan adanya situasi yang demikian, syarat untuk menghentikan ekspansi tree, yaitu jumlah data atau record pada sub-node harus lebih kecil dari nilai θn, sulit untuk tercapai.

Nilai θr yang terlalu rendah dan/atau θn yang terlalu tinggi akan menghasilkan tree dengan ukuran yang kecil sehingga jumlah aturan yang dihasilkan juga sangat sedikit; hal ini terjadi karena tree yang sedang dibangun mengalami pemotongan (pruning) pada saat model masih mempelajari struktur dari training set. Sebaliknya, nilai θr yang terlalu tinggi dan/atau θn yang terlalu rendah kadang kala akan menyebabkan fuzzy decision tree berperilaku seperti decision tree biasa yang tidak memerlukan adanya threshold, sehingga menghasilkan tree dengan ukuran sangat besar dan jumlah aturan yang juga sangat banyak, karena tree akan terus diekspansi sampai leaf-node terdalam.

TABLE 3
RATA-RATA WAKTU EKSEKUSI DALAM DETIK

  θn \ θr    75%    80%    85%    90%    95%    98%
  3%        0,125  0,239  0,344  0,404  0,717  0,955
  5%        0,128  0,227  0,334  0,361  0,594  0,842
  8%        0,120  0,214  0,308  0,328  0,492  0,712
  10%       0,127  0,183  0,244  0,264  0,356  0,544

Gambar 7. Perbandingan rata-rata jumlah aturan (sumbu mendatar: θr; sumbu tegak: jumlah aturan; kurva untuk θn = 3%, 5%, 8%, dan 10%)
Gambar 7 menunjukkan perbandingan rata-rata jumlah aturan yang dihasilkan pada proses training. Dapat terlihat bahwa semakin tinggi nilai θr, semakin banyak pula jumlah aturan yang dihasilkan; sebaliknya, semakin tinggi nilai θn, aturan yang dihasilkan cenderung berkurang.
V. EVALUASI KINERJA FID3
Evaluasi kinerja algoritme FID3 dapat diketahui dengan cara menghitung rata-rata akurasi dari seluruh proses testing pada 10 test set yang berbeda. Evaluasi kinerja dari algoritme FID3 pada nilai θr dan θn yang berbeda dapat dilihat pada Tabel 4.
Gambar 8. Perbandingan rata-rata waktu eksekusi proses training (sumbu mendatar: θr; sumbu tegak: waktu dalam detik; kurva untuk θn = 3%, 5%, 8%, dan 10%)
Dengan melihat Gambar 7 dan Gambar 8, dapat disimpulkan bahwa semakin tinggi nilai θr yang digunakan, semakin banyak jumlah aturan yang dihasilkan, sehingga waktu yang dibutuhkan untuk menghasilkan aturan-aturan tersebut juga meningkat; hal ini terjadi karena proses yang harus dilakukan untuk membangun tree semakin banyak.

Untuk mengukur akurasi dari model yang dihasilkan pada fase training, proses testing dilakukan sebanyak 240 kali. Proses testing dilakukan dengan cara memasukkan aturan yang diperoleh dari proses training ke dalam sebuah FIS Mamdani [7] untuk menentukan kelas dari masing-masing record pada test set. Untuk satu kali proses training dilakukan satu kali proses testing.

Dengan melihat hasil testing, mengubah nilai θr dan θn tidak menyebabkan adanya perubahan pada nilai akurasi dari model yang ada, walaupun jumlah aturan yang dihasilkan proses training mengalami peningkatan. Hal ini terjadi karena secara kebetulan semua aturan pada kedua model tersebut memiliki kelas target yang sama yaitu negatif diabetes, dan test set yang digunakan juga berasal dari kelas target yang sama, sehingga hasil testing dari kedua buah model dengan test set yang sama tidak mengalami perubahan. Namun model dengan aturan yang seragam seperti ini tidak dapat dipakai untuk melakukan prediksi karena berapapun nilai atribut yang diberikan hasilnya akan selalu negatif diabetes. Hasil testing akan berbeda jika aturan-aturan pada model yang dihasilkan memiliki kelas target yang berbeda. Model untuk training set pertama dengan θr (75%) dan θn (3%):
IF GPOST rendah THEN Negatif Diabetes
IF GPOST sedang THEN Negatif Diabetes
IF GPOST tinggi THEN Negatif Diabetes
IF GPOST sangat tinggi THEN Negatif Diabetes
Pada test set ketujuh dan kedelapan, nilai akurasi mengalami penurunan saat θr ditingkatkan menjadi 80%. Berdasarkan pengamatan yang dilakukan, keadaan seperti ini dinamakan overfitting karena terlalu tingginya θr untuk training set tersebut, sehingga tree akan terus diekspansi sampai betul-betul sesuai dengan training set. Akibatnya tree memiliki node-node yang mengandung data yang mengalami kesalahan klasifikasi.
TABLE 4
AKURASI FUZZY DECISION TREE (DALAM %)

  θn \ θr    75%    80%    85%    90%    95%    98%
  3%        94,14  92,07  92,07  92,07  90,69  90,69
  5%        94,14  92,07  92,07  92,07  91,73  91,73
  8%        94,15  93,45  93,45  93,45  93,10  93,10
  10%       94,15  93,45  93,45  93,45  93,45  93,45
Dari Tabel 4 dapat dilihat bahwa kinerja algoritme FID3 mengalami penurunan jika nilai θr semakin besar dan/atau nilai θn semakin kecil, walaupun penurunan yang terjadi tidaklah signifikan sehingga masih dapat ditoleransi. Seperti telah dijelaskan sebelumnya, kondisi ini disebabkan oleh terjadinya fenomena overfitting. Nilai akurasi terbaik yaitu 94,15% diperoleh pada θr = 75% dan θn = 8% atau 10%.

VI. KESIMPULAN

Dari berbagai percobaan yang dilakukan terhadap data diabetes didapat kesimpulan bahwa algoritme FID3 memiliki kinerja yang baik dalam membentuk fuzzy decision tree untuk data diabetes yang ada. Nilai akurasi terbaik dari model yaitu 94,15% diperoleh pada fuzziness control threshold (θr) = 75% dan leaf decision threshold (θn) = 8% atau 10%. Nilai θr dan θn sangat berpengaruh terhadap jumlah aturan yang dihasilkan; nilai θr yang terlalu tinggi akan menyebabkan turunnya nilai akurasi. Di lain pihak, nilai θn yang terlalu rendah juga dapat menyebabkan akurasi menurun.

REFERENCES
[1] E. Cox. Fuzzy Modeling and Algorithms for Data Mining and Exploration. USA: Academic Press, 2005.
[2] L. Fu. Neural Networks in Computer Intelligence. Singapura: McGraw-Hill, 1994.
[3] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Simon Fraser University. USA: Morgan Kaufmann, 2006.
[4] Herwanto. Pengembangan Sistem Data Mining untuk Diagnosis Penyakit Diabetes Menggunakan Algoritme Classification Based Association [Tesis]. Bogor: Departemen Ilmu Komputer, Fakultas Matematika dan Ilmu Pengetahuan Alam, Institut Pertanian Bogor, 2006.
[5] G. Liang. A Comparative Study of Three Decision Tree Algorithms: ID3, Fuzzy ID3 and Probabilistic Fuzzy ID3. Informatics & Economics, Erasmus University Rotterdam, the Netherlands, 2005.
[6] C. Marsala. Application of Fuzzy Rule Induction to Data Mining. University Pierre et Marie Curie, 1998.
[7] L. Ormos. Soft Computing Method on Thom's Catastrophe Theory for Controlling of Large-Scale Systems. University of Miskolc, Department of Automation, 2004.
[8] Y. Yuan and M. J. Shaw. "Induction of fuzzy decision trees," Fuzzy Sets and Systems, vol. 69, 1995.
Firat Romansyah dilahirkan di Indonesia pada tahun 1985. Dia memperoleh gelar Sarjana Ilmu Komputer dari Departemen Ilmu Komputer, Institut Pertanian Bogor, Indonesia pada tahun 2007. Saat ini dia bekerja sebagai Associate Business Consultant (IT Consultant) di PT AGIT (Astra Graphia Information Technology).

Imas Sukaesih Sitanggang dilahirkan di Indonesia pada tahun 1975. Dia memperoleh gelar Sarjana Matematika dari Departemen Matematika, Institut Pertanian Bogor, Indonesia pada tahun 1997 dan mendapat gelar Master Ilmu Komputer dari Universitas Gadjah Mada, Indonesia pada tahun 2002. Saat ini dia bekerja sebagai dosen di Departemen Ilmu Komputer, Institut Pertanian Bogor, Indonesia. Topik utama penelitiannya adalah spatial data mining.

Sri Nurdiati dilahirkan di Indonesia pada tahun 1960. Dia memperoleh gelar Sarjana Statistika dari Institut Pertanian Bogor, Indonesia pada tahun 1984 dan mendapat gelar Master Ilmu Komputer dari The University of Western Ontario di Canada pada tahun 1991. Pada tahun 2005 dia memperoleh gelar Doktor Matematika Terapan dari Twente University di Belanda. Saat ini dia bekerja sebagai dosen di Departemen Ilmu Komputer, Institut Pertanian Bogor, Indonesia. Topik utama penelitiannya adalah scientific computing.
Internetworking Indonesia Journal
The Indonesian Journal of ICT and Internet Development ISSN: 1942-9703
About the Internetworking Indonesia Journal

The Internetworking Indonesia Journal (IIJ) was established out of the need to address the lack of an Indonesia-wide independent academic and professional journal covering the broad area of Information and Communication Technology (ICT) and Internet development in Indonesia. The broad aims of the Internetworking Indonesia Journal (IIJ) are therefore as follows:

• Provide an Indonesia-wide independent journal on ICT: The IIJ seeks to be an Indonesia-wide journal independent from any specific institutions in Indonesia (such as universities and government bodies). Currently in Indonesia there are numerous university-issued journals that publish papers only from the respective universities. Often these university journals experience difficulty in maintaining sustainability due to the limited number of internally authored papers. Additionally, most of these university-issued journals do not have an independent review and advisory board, and most do not have referees and reviewers from the international community.

• Provide a publishing venue for graduate students: The IIJ seeks also to be a publishing venue for graduate students (such as Masters/S2 and PhD/S3 students) as well as working academics in the broad field of ICT. This includes graduate students from Indonesian universities and those studying abroad. The IIJ provides an avenue for these students to publish their papers to a journal which is international in its reviewer scope and in its advisory board.

• Improve the quality of research & publications on ICT in Indonesia: One of the long term goals of the IIJ is to promote research on ICT and Internet development in Indonesia, and over time to improve the quality of academic and technical publications from the Indonesian ICT community. Additionally, the IIJ seeks to be the main publication venue for various authors worldwide whose interest include Indonesia and its growing area of information and communication technology.

• Provide access to academics and professionals overseas: The Editorial Advisory Board (EAB) of the IIJ is intentionally composed of international academics and professionals, as well as those from Indonesia. The aim here is to provide Indonesian authors with access to international academics and professionals who are aware of and sensitive to the issues facing a developing nation. Similarly, the IIJ seeks to provide readers worldwide with easy access to information regarding ICT and Internet development in Indonesia.

• Promote the culture of writing and authorship: The IIJ seeks to promote the culture of writing and of excellent authorship in Indonesia within the broad area of ICT. It is for this reason that the IIJ is bilingual in that it accepts and publishes papers in either English or Bahasa Indonesia. Furthermore, the availability of an Indonesia-wide journal with an international advisory board may provide an incentive for young academics, students and professionals in Indonesia to develop writing skills appropriate for a journal. It is hoped that this in-turn may encourage them to subsequently publish in other international journals in the future.
Internetworking Indonesia Journal ISSN: 1942-9703 www.InternetworkingIndonesia.org