KLASTERISASI DOKUMEN MENGGUNAKAN ALGORITMA K-MEANS DAN SINGULAR VALUE DECOMPOSITION
TUGAS AKHIR
Sebagai Persyaratan Guna Meraih Gelar Sarjana Strata 1 Teknik Informatika Universitas Muhammadiyah Malang
Sielvy Ayuni 201110370311327
JURUSAN TEKNIK INFORMATIKA FAKULTAS TEKNIK UNIVERSITAS MUHAMMADIYAH MALANG 2016
KATA PENGANTAR Assalamu’alaikum. Wr. Wb Dengan puji syukur penulis panjatkan kehadirat Allah SWT yang dengan rahmat dan ridho-Nya lah sehingga Tugas Akhir dengan judul Klasterisasi Dokumen Menggunakan Algoritma K-Means dan Singular Value Decomposition. Penyusunan Tugas Akhir ini dimaksudkan untuk memenuhi persyaratan guna menyelesaikan studi pada program S-1 Teknik Informatika di Universitas Muhammadiyah Malang. Pada kesempatan ini penulis ingin menyampaikan ucapan terima kasih kepada berbagai pihak yang telah membantu penyelesaian Tugas Akhir ini. Ucapan terima kasih penulis sampaikan kepada: 1. Bapak
DR.
Muhajir
Effendy,M,AP,
selaku
Rektor
Universitas
Muhammadiyah Malang yang telah menyediakan fasilitas guna lancarnya pembelajaran. 2. Bapak Yuda Munarko, S.Kom., M.Sc., selaku ketua jurusan Teknik Informatika di Universitas Muhammadiyah Malang. 3. Bapak Yufis Azhar, S.Kom., M.Kom dan Ibu Gita Indah Marthasari, S.T., M.Kom selaku dosen pembimbing Tugas Akhir yang senantiasa sabar dalam membimbing dan memotivasi dalam Tugas Akhir ini. 4. Bapak Yuda Munarko, S.Kom., M.Sc. dan Bapak Mahar Faiqurahman, S.Kom., M.T, selaku dosen penguji yang memberikan masukan yang membangun. 5. Orang Tua sekaligus sayap pelindungku, terimakasi tak terhingga atas segalanya. Semoga Allah berikan Kesembuhan untuk Mamak, Kesehatan dan Umur panjang untuk Mamak dan Bapak. Aamiin… 6. Jagoan-jagoan kakak, semoga Allah menjaga kalian dimanapun dan menjadi laki-laki yang bertanggung jawab terhadap keluarga dan 4 wanitanya kelak. Aamiin… 7. Sahabat sholehaku Indah, Zulfa, Nisa, Imel, Masiti, Qisthiya, Putri, Andri, Dina, Qiqi, Diah Eka, dan lainnya yang tidak disebutkan terimakasih selalu mengingatkan dalam setiap kebaikan dan mengajak untuk lebih mendekat
kepada Allah SWT, semoga Allah menjaga kami selalu dalam iman dan islam. 8. Sahabat sholehku Reza, Syarif, Agung, Reno, Endru, Riris, Affrizal, Luwi, Robbi dan lainnya, terimakasih sudah saling menjaga selama dimalang. Semoga Allah menjaga kami dalam iman dan islam. 9. Teman-teman seperjuangan teknik Informatika UMM, khusunya angkatan 2011 dan kelas G, terimakasih atas kenangan indah selama masa perkuliahan, semoga Alah menjaga kami selalu. Semoga budi baik yang mereka berikan kepada penulis mendapat balasan dari Allah SWT. Penulis menyadari bahwa Tugas Akhir ini masih jauh dari sempurna, mengingat keterbatasan pengetahuan penulis. Dengan segala kerendahan hati penulis memohon kepada semua pihak untuk memberikan kritik dan saran untuk kesempurnaan Tugas Akhir ini. Penulis berharap semoga Tugas Akhir ini bermanfaat bagi peneliti pada khususnya dan bagi pembaca pada umumnya. Wassalamu’alaikum. Wr. Wb.
Malang, 11 Januari 2016 Penulis,
Sielvy Ayuni
DAFTAR ISI LEMBAR PERSETUJUAN ................................................................................... ii LEMBAR PENGESAHAN ................................................................................... iii LEMBAR PERNYATAAN ....................................................................................iv ABSTRAK. ..............................................................................................................v ABSTRACT ..............................................................................................................vi KATA PENGANTAR .......................................................................................... vii LEMBAR PERSEMBAHAN .................................................................................ix DAFTAR ISI ............................................................................................................x DAFTAR GAMBAR ........................................................................................... xiii DAFTAR TABEL. .................................................................................................xv DAFTAR GRAFIK ...............................................................................................xvi BAB I PENDAHULUAN 1.1. Latar Belakang Masalah .............................................................................1 1.2. Rumusan Masalah ....................................................................................... 2 1.3. Tujuan ........................................................................................................2 1.4. Batasan Masalah ......................................................................................... 3 1.5. Metodologi Penelitian .................................................................................3 1.6. Sistematika Penelitian .................................................................................5 BAB II LANDASAN TEORI 2.1. Definisi Data Mining ..................................................................................7 2.2. Text Mining .................................................................................................8 2.2.1. Case Folding ...................................................................................... 9 2.2.2. Tokenizing .......................................................................................... 9 2.2.3. Stopword Removal atau Filtering..................................................... 10 2.2.4. Stemming .......................................................................................... 10 2.3. Pembobotan TF-IDF ................................................................................12 2.4. Analisis Klaster ....................................................................................... 13 2.5. Algoritma K-Means ................................................................................. 14 2.5.1. Kelebihan K-Means ........................................................................ 16 2.5.2. Kekurangan K-Means .....................................................................16 2.6. Singular Value Decomposition (SVD) .................................................... 17 2.7. Java ..........................................................................................................18 2.7.1. Kelebihan Java ................................................................................18 2.7.2. Kekurangan Java ............................................................................19 BAB III ANALISA DAN RANCANGAN SISTEM 3.1. Analisa Sistem .......................................................................................... 20 3.2. Analisa Kebutuhan .................................................................................... 20 3.2.1. Functional Requirement ................................................................ 22 3.2.1.1. Fungsi Search ......................................................................22 3.2.1.2. Fungsi Preprocessing .......................................................... 23 3.2.1.3. Fungsi Pembobotan TF-IDF ............................................... 24 3.2.1.4. Fungsi Mulai Klaster ........................................................... 25 3.2.2. Non-Functional Requirement ........................................................ 25
3.3.
3.4.
3.5.
3.6.
3.2.3. Flowchart ....................................................................................... 26 3.2.3.1. Flowchart Preprocessing .................................................... 26 3.2.3.2. Flowchart TF-IDF ............................................................... 28 3.2.3.3. Flowchart SVD ...................................................................29 3.2.3.4. Flowchart K-Means ............................................................. 29 3.2.4. Usecase Diagram ...........................................................................30 3.2.5. Activity Diagram ...........................................................................31 3.2.5.1. Menu Search .......................................................................31 3.2.5.2. Menu Klaster .......................................................................31 3.2.5.3. Menu Lihat Data .................................................................32 Perancangan Sistem .................................................................................. 33 3.3.1. Sequence Diagram .........................................................................33 3.3.2. CDM (Conceptual Data Model) ..................................................... 34 3.3.3. PDM (Phisical Data Model) ........................................................... 35 Contoh Kasus dan Penyelesaian ............................................................... 35 3.4.1. Tahap Preprocessing .....................................................................35 3.4.2. Pembobotan TF-IDF ......................................................................37 3.4.3. Singular Value Decomposition (SVD) .......................................... 39 3.4.4. Klastering ....................................................................................... 40 Perancangan Pengujian .............................................................................46 3.5.1. Pengujian Berdasarkan Functional Requirement ........................... 46 3.5.2. Pengujian Klaster ...........................................................................46 Desain Interface ....................................................................................... 46
BAB IV IMPLEMENTASI DAN PENGUJIAN 4.1. Implementasi Software .............................................................................49 4.1.1. Implementasi Preprocessing ......................................................... 49 4.1.2. Implementasi TFID .......................................................................51 4.1.3. Implementasi SVD ........................................................................52 4.1.4. Implementasi K-Means .................................................................54 4.1.5. Implementasi Search Dokumen .................................................... 56 4.1.6. Implementasi Lihat Dokumen ....................................................... 57 4.1.7. Implementasi Detail Dokumen ..................................................... 57 4.2. Implementasi Interface Sistem..................................................................57 4.2.1. Interface Menu Utama ..................................................................58 4.2.2. Interface Menu Search ..................................................................58 4.2.3. Interface Menu lihat Dokumen ..................................................... 59 4.2.4. Interface Menu Preprocessing dan TDIF ..................................... 60 4.2.5. Interface Menu Klaster dengan K-Means dan SVD ..................... 60 4.3. Pengujian Sistem....................................................................................... 61 4.3.1. Data Set ......................................................................................... 61 4.3.2. Metode Pengujian .........................................................................63 4.3.3. Hasil Uji Coba Klaster dengan Metode Purity ............................. 64 4.3.3.1. Klastering dengan K-Means ..............................................64 4.3.3.2. Klasterng dengan K-Means dan SVD ............................... 66 4.3.3.3. Klastering dengan K-Means Centroid ditentukan ............. 67 4.3.3.4. Klastering dengan K-Means dan SVD Centroid ditentukan ...........................................................................69
4.3.3.5. Klastering dengan K-Means Range Data 250-350 Karakter..............................................................................70 4.3.3.6. Klastering dengan K-Means dan SVD Range Data 250-350 Karakter..............................................................................72 4.3.3.7. Klastering dengan K-Means Range Data 400-430 Karakter..............................................................................73 4.3.3.8. Klastering dengan K-Means dan SVD Range Data 400-430 Karakter..............................................................................75 4.3.3.9. Klastering dengan K-Means Range Data 500-550 Karakter..............................................................................77 4.3.3.10. Klastering dengan K-Means dan SVD Range Data 500-550 Karakter..............................................................................78 4.3.3.11. Klastering dengan K-Means Range Data 600-700 Karakter..............................................................................80 4.3.3.12. Klastering dengan K-Means dan SVD Range Data 600-700 Karakter..............................................................................81 4.4. Pembahasan .............................................................................................. 83 BAB V PENUTUP 5.1. Kesimpulan ............................................................................................... 88 5.2. Saran .........................................................................................................89 DAFTAR PUSTAKA ........................................................................................... 90
DAFTAR GAMBAR
1.1. Metodologi Penelitian ...................................................................................... 3 1.2. Rancangan Sistem ........................................................................................... 4 2.1. Tahap Proses KDD .......................................................................................... 7 2.2. Proses Case Folding ........................................................................................ 9 2.3. Proses Tokenizing ........................................................................................... 10 2.4. Proses Filtering .............................................................................................. 11 2.5. Proses Stimming ............................................................................................. 10 3.1. Rancangan Sistem .......................................................................................... 21 3.2. Flowchart Case Folding ................................................................................. 26 3.3. Flowchart Tokenizing ..................................................................................... 27 3.4. Flowchart Filtering ........................................................................................ 27 3.5. Flowchart Stimming ....................................................................................... 28 3.6. Flowchart Pembobotan TF IDF .....................................................................28 3.7. Flowchart SVD .............................................................................................. 29 3.8. Flowchart Algoritma K-Means ......................................................................30 3.9. Use Case Diagram Sistem ..............................................................................30 3.10. Activity Diagram Menu Search ...................................................................31 3.11. Activity Diagram Menu Klaster ...................................................................32 3.12. Activity Diagram Menu Lihat Dokumen ...................................................... 32 3.13. Sequence Diagram Menu Search .................................................................33 3.14. Sequence Diagram Menu Klaster ................................................................ 33 3.15. Sequence Diagram Menu Lihat Dokumen ................................................... 34 3.16. CDM (Conceptual Data Model) ..................................................................34 3.17. PDM (Phisical Data Model) ........................................................................35 3.18. Matrik Hasil TF-IDF .................................................................................... 39 3.19. Singular Matrik A ........................................................................................ 39 3.20. Singular Matrik A dengan Rank k=3 ........................................................... 40 3.21. Hasil Perkalian Singular Matrik u,s,v,t ........................................................ 40 3.22. Interface Menu Search ................................................................................. 47 3.23. Interface Menu Klaster ................................................................................47 4.1. Implementasi Case Folding ...........................................................................50 4.2. Implementasi Tokenizing ...............................................................................50 4.3. Implementasi Filtering ................................................................................... 51 4.4. Implementasi Stemming ................................................................................. 51 4.5. Implementasi TF ............................................................................................ 52 4.6. Implemetntasi IDF ......................................................................................... 52 4.7. Implementasi TF IDF ..................................................................................... 52 4.8. Implementasi SVD ......................................................................................... 53 4.9. Perkalian Matrik SVD .................................................................................... 53 4.10. Implementasi Nilai K ................................................................................... 54 4.11. Implementasi Centroid Awal .......................................................................54 4.12. Implementasi Hitung Jarak Klaster .............................................................. 55 4.13. Implementasi Centroid Baru ........................................................................55 4.14. Implementasi Cek Anggota Klaster ............................................................. 56
4.15. Implementasi Search Dokumen ...................................................................56 4.16. Implementasi Lihat Dokumen ......................................................................57 4.17. Implementasi Detail Dokumen ....................................................................57 4.18. Interface Menu Utama ................................................................................. 58 4.19. Interface Menu Search ................................................................................. 58 4.20. Interface Menu Lihat Dokumen ...................................................................59 4.21. Interface Menu Detail Lihat Dokumen ........................................................ 59 4.22. Interface Menu Processing dan IDF ............................................................ 60 4.23. Interface Menu Klaster dengan K-Means dan SVD .....................................60 4.24. Hasil Klaster K-Means I ..............................................................................64 4.25. Hasil Klaster K–Means II ............................................................................65 4.26. Hasil Klaster K-Means III ............................................................................65 4.27. Hasil Klaster K-Means dan SVD I ............................................................... 66 4.28. Hasil Klaster K-Means dan SVD II ............................................................. 66 4.29. Hasil Klaster K-Means dan SVD III ............................................................ 67 4.30. Hasil Klaster K-Means I (Centroid ditentukan)............................................67 4.31. Hasil Klaster K–Means II (Centroid ditentukan)..........................................68 4.32. Hasil Klaster K-Means III (Centroid ditentukan) .........................................68 4.33. Hasil Klaster K-Means dan SVD I (Centroid ditentukan) ............................ 69 4.34. Hasil Klaster K-Means dan SVD II (Centroid ditentukan)........................... 69 4.35. Hasil Klaster K-Means dan SVD III (Centroid ditentukan) ......................... 70 4.36. Hasil Klaster K-Means I (Range 250-350 Karakter) ....................................70 4.37. Hasil Klaster K–Means II (Range 250-350 Karakter) ..................................71 4.38. Hasil Klaster K-Means III (Range 250-350 Karakter)..................................71 4.39. Hasil Klaster K-Means dan SVD I (Range 250-350 Karakter) .................... 72 4.40. Hasil Klaster K-Means dan SVD II (Range 250-350 Karakter) ................... 72 4.41. Hasil Klaster K-Means dan SVD III (Range 250-350 Karakter) .................. 73 4.42. Hasil Klaster K-Means I (Range 400-430 Karakter) ....................................73 4.43. Hasil Klaster K–Means II (Range 400-430 Karakter) ..................................74 4.44. Hasil Klaster K-Means III (Range 400-430 Karakter)..................................74 4.45. Hasil Klaster K-Means dan SVD I (Range 400-430 Karakter) .................... 75 4.46. Hasil Klaster K-Means dan SVD II (Range 400-430 Karakter) ................... 75 4.47. Hasil Klaster K-Means dan SVD III (Range 400-430 Karakter) .................. 76 4.48. Hasil Klaster K-Means I (Range 500-550 Karakter) ....................................77 4.49. Hasil Klaster K–Means II (Range 500-550 Karakter) ..................................77 4.50. Hasil Klaster K-Means III (Range 500-550 Karakter)..................................78 4.51. Hasil Klaster K-Means dan SVD I (Range 500-550 Karakter) .................... 78 4.52. Hasil Klaster K-Means dan SVD II (Range 500-550 Karakter) ................... 79 4.53. Hasil Klaster K-Means dan SVD III (Range 500-550 Karakter) .................. 79 4.54. Hasil Klaster K-Means I (Range 600-700 Karakter) ....................................80 4.55. Hasil Klaster K–Means II (Range 600-700 Karakter) ..................................80 4.56. Hasil Klaster K-Means III (Range 600-700 Karakter)..................................81 4.57. Hasil Klaster K-Means dan SVD I (Range 600-700 Karakter) .................... 81 4.58. Hasil Klaster K-Means dan SVD II (Range 600-700 Karakter) ................... 82 4.59. Hasil Klaster K-Means dan SVD III (Range 600-700 Karakter) .................. 82
DAFTAR TABEL
3.1. Functional Requirement .........................................................................22 3.2. Fungsi Search ......................................................................................... 22 3.3. Fungsi Preprocessing .............................................................................23 3.4. Fungsi Pembobotan TF-IDF ..................................................................23 3.5. Fungsi Mulai Klaster ..............................................................................24 3.6. Fungsi Lihat Data ................................................................................... 25 3.7. Non-Functional Requirement .................................................................26 3.8. Contoh Data Dokumen ..........................................................................35 3.9. Proses Case Folding dan Tokenizing ..................................................... 36 3.10. Proses Stopword Removal ......................................................................36 3.11. Proses Steaming ..................................................................................... 37 3.12. Proses Pembobotan TFIDF ................................................................... 38 3.13. Proses Perkalian TF-IDF ........................................................................ 38 3.14. (a) Dokumen Hasil SVD ........................................................................41 3.14. (b) Penentuan Centroid Awal ............................................................... 41 3.15. Hasil Perhitungan Jarak ke Centroid I dan Centroid II ......................... 42 3.16. Jarak Anggota Tiap Klaster ....................................................................42 3.17. (a) Centroid Baru Untuk Klaster I ......................................................... 43 3.17. (b) Centroid Baru Untuk Klaster II ........................................................ 43 3.18. Hasil Perhitungan Jarak Iterasi I ............................................................ 44 3.19. Jarak dan Anggota Klaster Iiterasi ke-0 dan Iiterasi ke-1 ...................... 44 3.20. (a) Centroid Baru Klaster I Iterasi ke-2 ................................................. 44 3.20. (b) Centroid Baru Klaster II Iterasi ke-2 ................................................44 3.21. Jarak Data ke Centroid Iterasi ke-2 ........................................................ 46 3.22. Jarak dan Anggota Klaster Iterasi ke-0, ke-1, ke-2 ................................ 44 4.1. Pengujian Functional Requirement ........................................................ 61 4.2. Pengujian Non-Functional Requirement ................................................ 61 4.3. Data Set Klasterisasi Manual Dokumen TA ..........................................62 4.4. Hasil Klaster K-Means ...........................................................................83 4.5. Hasil Klaster K-Means dan SVD ........................................................... 83 4.6. Hasil Klaster K-Means (Centroid Ditentukan) ....................................... 84 4.7. Hasil Klaster K-Means dan SVD (Centroid Ditentukan) ....................... 84 4.8. Hasil Klaster K-Means (Range 250-350 Karakter).................................84 4.9. Hasil Klaster K-Means dan SVD (Range 250-350 Karakter) ................. 85 4.10. Hasil Klaster K-Means (Range 400-430 Karakter).................................85 4.11. Hasil Klaster K-Means dan SVD(Range 400-430 Karakter) .................. 85 4.12. Hasil Klaster K-Means (Range 500-550 Karakter)................................. 85 4.13. Hasil Klaster K-Means dan SVD (Range 500-550 Karakter) ................. 86 4.14. Hasil Klaster K-Means (Range 600-700 Karakter).................................86 4.15. Hasil Klaster K-Means dan SVD (Range 600-700 Karakter) ................. 86
DAFTAR GRAFIK
4.1. Hasil Klasterisasi Dokumen ......................................................................87
DAFTAR PUSTAKA [1]
Hermawari, Fajar Astuti. 2009. Data Mining. Yogyakarta: penerbit ANDI
[2]
Affandi dan Catur supriyanto. 2011. Kombinasi Teknik Chi Squere dan Singular Value Decomposition untuk Reduksi Fitur Pada Pengelompokan Dokumen.seminar nasional teknologi dan komuniasi terapan 2011 (semantik 2011). ISBN 979-26-255-0
[3]
Prasetyo, Eko. 2012. Data Mining : Konsep dan Aplikasi Menggunakan Matlab. Yogyakarta: penerbit ANDI
[4]
Abbas, Irfan dkk,. 2012. Pengolahan Awal Data dan Penerapan Algoritma
Singular
Value
Decomposition
(SVD)
untuk
Memaksimalkan Varience Scores Princial Component dan Efisiensi
Proses
pada
Algortma
Principal
Component
Analysis(PCA). Semarang: Universitas Dian Nuswantoro. [5]
Umran, Munzir dan Taufi Faudi Abidin. 2009. Pengelompokan Dokumen
Menggunakan
K-Means
dan
Singular
Value
Decomposition : Studi Kasus Menggunakan Data Blog. Jurusan Matematika FMIPA, Universitas Syiah Kuala. [6]
Abdurrahman. 2014. Pengelompokan Buku Berbahasa Indonesia dengan mengimplementasikan Metode Text Mining dan Algoritma Artificial Bee Colony K-Means. Tugas Akhir Teknik Informatika. Malang : Universitas Muhammadiyah Malang.
[7]
Isanta, S. Andika. 2012. Pengelompokan Dengan Menggunakan Algoritma Agglomerative Hierarchical Clustering dan Bisecting K-Means Serta Pencarian Cerdas Berbasis Semantic Web Pada Studi Kasus Dokumen Tugas Akhir Jurusan Teknik Informatika Universitas Muhammadiyah Malang. Tugas Akhir Teknik Informatika. Malang: Universitas Muhammadiyah Malang.
[8]
Agusta, Ledy. Perbandingan Algoritma Stemming Porter Dengan Algoritma Nazief & Andriani untuk Stemming Dokumen Text Berbahasa Indonesia. Universitas Kristen Satya Wacana. BALI, 14 November 2009:196-201.
[9]
Lindawati. 2008. Data Mining dengan Teknik Clustering dalam Pengklasifikasian Data Mahasiswa : Studi Kasus Prediksi Lama Studi Mahasiswa Universitas Bina Nusantara. Seminar Nasional Informatika 2008 (semnasIF 2008). UPN “veteran” Yogyakarta, 24 mei 2008. ISSN: 1979-2328
[10] Tajunisha, Saravanan. Performance Analysis of K-Means with Different Initialization Methods for High Dimensional Data. International journal of Artificial Intelegence & Applications (IJAIA), vol. 1, no. 4, pp. 45-52, October 2010 [11] Wu, Junjie. 2012. Advances in K-Means Clustering, A data Mining Thinking. Spinger : Heidelberg New York, London. ISBN 9783-624-29807-3 (Ebook) [12] Gan, Guojun, Chaoqun Ma, Jianhong Wu. 1979. Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM Series on Statistic and Applied Probability ; 20) ISBN: 978-0-898716-238 (alk. Paper) [13] Li Yanjun, Congnan Luo, Soon M, Chung. May 2008. Text Clustering with Feature Selection by Using Statistical Data. IEEE. Vol : 20, No. 5. [14] Karmayasa, Oka dan Bagus Mahendra. 2012. Implementasi Vector Space Model dan Beberapa Notasi Metode Term Frequency Inverse Document Frequency (TF-IDF) pada Sistem Temu Kembali Informasi. Bali : Universitas Udayana. [15] Pebtian, Arzico. 2014. Aplikasi Pengamanan Hak Cipta untuk Data Gambar
Digital
Menggunakan
Metode
Singular
Value
Decomposition (SVD). Pelita Informatika Budi Darma, vol. VII, no. 1, juli 2014