Data Mining: Concepts and Techniques - Chapter 1 - Syahril Efendi, S.Si., MIT, Department of Mathematics & Department of Computer Science, FMIPA USU
October 10, 2012
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and the Data Mining Society
Summary
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, the Web, a computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, scientific simulation, …
Society: news, digital cameras, YouTube
We are drowning in data but starving for knowledge
“Necessity is the mother of invention.”
Data mining: automated analysis of massive data sets
Evolution of Sciences
Before 1600: empirical science
1600-1950: theoretical science
Each discipline has grown a theoretical component. Theoretical models are often motivated by experience and generalize our understanding.
1950-1990: computational science
Over the last 50 years, several disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of the inability to find closed-form solutions for complex mathematical models.
1990-present: data science
The flood of data from new scientific instruments and simulations
The ability to store and manage petabytes of data online economically
The Internet and computing grids that make all these archives universally accessible
Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volume. Data mining is a major new challenge!
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Data mining: a misnomer?
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything “data mining”? The following are not:
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
This is a typical view from the database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Example: A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored in a knowledge base
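The steps listed above can be sketched as a chain of small functions. This is a minimal illustration, not a real Web mining system: the click-log records, their field names, and every function name are assumptions made up for the example, and the "mining" step is just a frequency count.

```python
from collections import Counter

# Hypothetical raw click logs from two sources; the field names are assumptions.
SERVER_A = [
    {"user": "u1", "url": "/Home", "status": 200},
    {"user": "u2", "url": "/cart", "status": 200},
    {"user": None, "url": "/home", "status": 200},   # missing user: dropped in cleaning
]
SERVER_B = [
    {"user": "u3", "url": "/home", "status": 200},
    {"user": "u1", "url": "/cart", "status": 500},   # failed request: not selected
    {"user": "u2", "url": "/home", "status": 200},
]

def clean(records):
    """Data cleaning: drop incomplete records and normalize URL case."""
    return [
        {**r, "url": r["url"].lower()}
        for r in records
        if r["user"] is not None and r["url"]
    ]

def integrate(*sources):
    """Data integration: merge records from multiple sources into one store."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged

def select(records):
    """Data selection: keep only task-relevant records (successful requests)."""
    return [r for r in records if r["status"] == 200]

def mine(records):
    """Data mining (deliberately trivial here): count visits per page."""
    return Counter(r["url"] for r in records)

def present(page_counts):
    """Presentation: render the mined result as readable lines."""
    return [f"{url}: {n} visits" for url, n in page_counts.most_common()]

warehouse = integrate(clean(SERVER_A), clean(SERVER_B))
page_counts = mine(select(warehouse))
report = present(page_counts)
```

Keeping each stage as a separate function mirrors the slide: any stage can be replaced (for example, a real mining algorithm in place of the counter) without touching the others.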
Data Mining in Business Intelligence
Increasing potential to support business decisions, from the bottom layer to the top (responsible role in parentheses):
Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Exploration: statistical summary, querying, and reporting (Data Analyst)
Data Mining: information discovery (Data Analyst)
Data Presentation: visualization techniques (Business Analyst)
Decision Making (End User)
Example: Mining vs. Data Exploration
Business intelligence view
Warehouse, data cube, reporting, but not much mining
Business objects vs. data mining tools
Supply chain example: tools
Data presentation
Exploration
KDD Process: A Typical View from ML and Statistics
Input Data → Data Preprocessing → Data Mining → Postprocessing
Data Preprocessing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Postprocessing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is a view from the machine learning and statistics communities
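Two of the preprocessing steps named on this slide, normalization and feature selection, can be sketched in a few lines. The slide does not say which methods are meant, so min-max scaling and a variance threshold are assumptions chosen for illustration, as are the table and its column names.

```python
from statistics import pvariance

def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: nothing to rescale
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def select_features(columns, min_variance):
    """Keep only the columns whose population variance reaches the threshold."""
    return {name: col for name, col in columns.items()
            if pvariance(col) >= min_variance}

# Hypothetical input table, stored column-wise.
raw = {
    "age":    [20, 30, 40, 60],
    "income": [1000, 3000, 2000, 5000],
    "flag":   [1, 1, 1, 1],          # constant, so it carries no information
}

normalized = {name: min_max_normalize(col) for name, col in raw.items()}
selected = select_features(normalized, min_variance=0.01)
```

Normalizing before selection puts all columns on the same scale, so a single variance threshold is meaningful across features.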
Example: Medical Data Mining
Health care and medical data mining often adopts such a view from statistics and machine learning
Preprocessing of the data (including feature extraction and dimension reduction)
Classification and/or clustering processes
Postprocessing for presentation
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks
Knowledge to be mined (or: data mining functions)
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and the multidimensional data model
Data cube technology
Scalable methods for computing (i.e., materializing) multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
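Materializing multidimensional aggregates can be illustrated with a brute-force sketch that computes a SUM for every subset of the dimensions. The fact table (region, season, rainfall) is hypothetical, and this naive approach is not one of the scalable methods the slide refers to; it only shows what a fully materialized cube contains.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall_mm).
FACTS = [
    ("north", "dry", 10),
    ("north", "wet", 120),
    ("south", "dry", 30),
    ("south", "wet", 200),
    ("north", "wet", 80),
]
DIMENSIONS = ("region", "season")

def build_cube(facts, dimensions):
    """Materialize SUM(measure) for every subset of the dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for row in facts:
                key = tuple(row[i] for i in dims)   # group-by key for this cuboid
                cuboid[key] += row[-1]              # the measure is the last column
            names = tuple(dimensions[i] for i in dims)
            cube[names] = dict(cuboid)
    return cube

cube = build_cube(FACTS, DIMENSIONS)
```

Here `cube[("season",)]` gives the dry vs. wet contrast, `cube[("region",)]` the per-region totals, and `cube[()]` the grand total. With d dimensions there are 2^d cuboids, which is why scalable cube computation is a topic in its own right.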
Data Mining Function: (2) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in a shopping center?
Association, correlation vs. causality
A typical association rule
Diaper → Beer [0.5%, 75%] (support, confidence)
Are strongly associated items also strongly correlated?
How to mine such patterns and rules efficiently in large data sets?
How to use such patterns for classification, clustering, and other applications?
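The two numbers attached to an association rule, support and confidence, follow directly from counting transactions. The sketch below uses a tiny hypothetical basket data set, so its values differ from the 0.5% and 75% quoted on the slide; the function names are illustrative only.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) as support(A and B) / support(A).

    Assumes the antecedent occurs at least once; otherwise this divides by zero.
    """
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

rule_support = support({"diaper", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"diaper"}, {"beer"}, TRANSACTIONS)
```

In this toy data the rule Diaper → Beer has support 2/5 (two of five baskets contain both items) and confidence 2/3 (two of the three diaper baskets also contain beer).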
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown class labels
Typical methods
Decision trees, Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
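As a minimal sketch of building a model from training examples, the code below fits a decision stump, i.e., a decision tree with a single split on one numeric feature. The car mileage figures and the 0/1 labels ("economical" or not) are hypothetical, and a stump is only the simplest member of the decision-tree family listed above.

```python
def train_stump(values, labels):
    """Fit a one-level decision tree on a single numeric feature.

    Tries a threshold midway between each pair of adjacent distinct values and
    keeps the one with the fewest training errors. Labels are 0 or 1. The rule
    is "predict 1 when value > threshold", or the reverse when flipped is True.
    """
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct feature values")
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        for flipped in (False, True):
            errors = sum(
                1 for v, y in zip(values, labels)
                if ((v > threshold) != flipped) != bool(y)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, flipped)
    _, threshold, flipped = best
    return threshold, flipped

def predict(model, value):
    """Apply the learned rule to one unseen value."""
    threshold, flipped = model
    return int((value > threshold) != flipped)

# Hypothetical training data: mileage in km per liter, 1 = "economical".
mileage = [8, 9, 10, 15, 16, 18]
economical = [0, 0, 0, 1, 1, 1]

model = train_stump(mileage, economical)
```

The learned threshold falls between the two groups (12.5 for this data), and `predict(model, 20)` then labels an unseen car as economical.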
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: maximize intra-class similarity & minimize interclass similarity
Many methods and applications
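The slide does not name a specific clustering method, so the sketch below uses k-means (Lloyd's algorithm) on one-dimensional data as one common choice. The house prices and the initial centers are hypothetical; starting from fixed centers keeps the example deterministic.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's k-means on 1-D data, starting from the given initial centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # converged: assignments can no longer change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical house prices (in thousands).
prices = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(prices, centers=[100, 320])
```

Each iteration reduces the distance between points and their own cluster center, which is one concrete reading of "maximize intra-class similarity".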
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? One person's garbage could be another person's treasure
Methods: by-product of clustering or regression analysis, …
Useful in fraud detection, rare events analysis
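Outliers as a by-product of regression analysis can be sketched by fitting a least-squares line and flagging points with unusually large residuals. The data and the cutoff of two standard deviations are assumptions for illustration; real applications would choose the cutoff with care.

```python
from statistics import mean, pstdev

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def residual_outliers(xs, ys, k=2.0):
    """Return indices whose residual exceeds k standard deviations of all residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]

# Hypothetical data following y = 2x, except for one anomalous reading at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 30, 12, 14, 16]

outliers = residual_outliers(xs, ys)
```

Note that the anomalous point also pulls the fitted line toward itself, so with very few points an outlier can partly mask its own residual.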
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
Sequence, trend and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining
E.g., first buy a digital camera, then buy large SD memory cards
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite data streams
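The "first a digital camera, then a memory card" example can be made concrete by counting, for every ordered pair of items, how many purchase histories contain the first item before the second. This is a deliberately small sketch limited to patterns of length two; the purchase histories and the support threshold are hypothetical, and full sequential pattern mining algorithms handle longer patterns.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, each ordered by time.
SEQUENCES = [
    ["digital camera", "memory card", "tripod"],
    ["digital camera", "tripod", "memory card"],
    ["memory card", "digital camera"],
    ["laptop", "digital camera", "memory card"],
]

def ordered_pair_support(sequences):
    """Count, for each ordered pair (a, b), the sequences where a occurs before b."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves input order, so a always precedes b in seq.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        counts.update(pairs)   # a set, so each sequence counts a pair at most once
    return counts

def frequent_pairs(sequences, min_support):
    """Keep the ordered pairs found in at least min_support sequences."""
    counts = ordered_pair_support(sequences)
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(SEQUENCES, min_support=3)
```

Unlike the itemsets used for association rules, order matters here: the pair (memory card, digital camera) is counted separately and is supported by only one history.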
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
E.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family, classmates, …
Links carry a lot of semantic information: link mining
Web mining
The Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
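PageRank, mentioned above, treats the Web as a directed graph and scores each page by the pages linking to it. The sketch below is a plain power-iteration version on a hypothetical four-page graph; the damping factor of 0.85 and the fixed iteration count are conventional choices assumed here, and every link target is assumed to appear as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping each page to its outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web graph.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(LINKS)
```

Page "c" ends up with the highest score because three pages link to it, while "d" has no incoming links and keeps only the baseline share.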
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous amount of patterns and knowledge
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness …
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Visualization, High-Performance Computing, Database Technology, Algorithm, and Applications]
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle, e.g., terabytes of data
High dimensionality of data
Micro-arrays may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Handling complex types of data
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
1991-1994 Workshops on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining
KDD conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Int. Conf. on Web Search and Data Mining (WSDM)
Other related conferences
DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
Web and IR conferences: WWW, SIGIR, WSDM
ML conferences: ICML, NIPS
PR conferences: CVPR,
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology, CD-ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems,
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 (3rd ed., 2011)
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009
B. Liu, Web Data Mining, Springer 2006.
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
Summary
Data mining: discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining