Spatial Data Mining using SAR-Kriging Model
Atje Setiawan Abdullah
Lecturer, Informatics Engineering Study Program, Department of Computer Science, FMIPA, Universitas Padjadjaran, Jl. Raya Bandung Sumedang Km 21, Jatinangor. E-mail:
[email protected],
[email protected]
SEAMS School Spatio Temporal Data Mining and Optimization Modeling UTC-Bandung, August 9-19, 2016
1. Introduction
In this paper we combine the Expansion Spatial Autoregressive (Expansion SAR) model, an extension of the SAR model, with the Kriging technique to predict the quality of education in elementary schools. The quality of education is defined as student learning outcomes, measured by the national final examination (UAN). In Indonesia, UAN scores still vary widely, because education services differ from one location to another.
Elementary and secondary education is a school-based learning process intended to bring students to a certain level of cognitive, psychomotor, and affective ability, as specified by the elementary and secondary education curriculum. Quality of education is defined as the achievement reached by students, measured by the national final examination (UAN) score.
1.1 Problems
Research on the quality of education is still limited: it focuses on measuring educational outcomes through school UAN results, and the analysis methods are limited to descriptive analysis. Given the vast area covered by the education system in Indonesia and the social, economic, and cultural conditions that differ from one location to another, problems related to the quality of education in schools at various locations in Indonesia are an interesting subject for study with spatial data mining methods.
One spatial data mining model that can be used for description and prediction is the Expansion Spatial Autoregressive (Expansion SAR) model. The Expansion SAR model is used to predict observations at sampled locations, measuring spatial heterogeneity based on the coordinates of each location. A weakness of the SAR model is that it cannot be used to predict at unsampled locations. Kriging is a spatial analysis method that can be used for prediction at unsampled locations. We therefore combine the SAR and Kriging methods into SAR-Kriging, which predicts at unsampled locations by using the SAR parameters as input to the Kriging method.
1.2 The Aims of Research
• To study a model that combines the Expansion SAR model and the Kriging method (SAR-Kriging).
• To apply spatial data mining concepts, using the SAR-Kriging method, for prediction at unsampled locations.
As a case study we use the SDPN 2003 database to predict the quality of education for elementary, junior high, and senior high schools in Indonesia.
[Figure: Spatial data mining process using SAR-Kriging]
Preprocessing:
• SDPN 2003 database → data cleaning and transformation to ratios
• Cleaned and transformed data → indicator selection using factor analysis and SEM
• Factor analysis and SEM selection results, with external data (kecamatan coordinates) → integration of spatial and non-spatial data
Data mining:
• Prepared data → SAR model → SAR results and Moran index
• Surveyed quality data → Expansion SAR model → Expansion SAR results and Kriging calculation graphs
• Expansion SAR quality data → SAR-Kriging model → SAR-Kriging equations and SAR-Kriging quality results
Postprocessing:
• Comparison of actual and predicted data
• Interpretation, evaluation, and visualization → pattern → knowledge
[Figure: Data mining process for the SDPN 2003 database]
Data selection → data cleaning → data transformation → data integration → data mining → application development → interpretation and visualization of results → knowledge
SDPN 2003 DATABASE
• Scalability: data size 3.91 GB (4,178,499,369 bytes); the data are organized into table structures for SD/SMP/SMA (elementary, junior high, and senior high school).
• High dimensionality: 203,590 records in total; 569 variables.
• Heterogeneity and complex data: involves both non-spatial and spatial data. The non-spatial data are education quality indicators; the spatial data are kecamatan (sub-district) coordinates.
• Data ownership and distribution: geographically distributed across provinces, kabupaten (districts), kecamatan (sub-districts), and villages.
• Non-traditional analysis: involves location coordinates and maps of kecamatan, kabupaten, and provinces in Indonesia; the analysis uses spatial models.
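DATA SELECTION
SDPN 2003 database, by data group:
| Data group | Records |
|---|---|
| School data (Data Persekolahan) | 257660 |
| Out-of-school education data (Data Pendidikan Luar Sekolah) | 3047 |
| Non-education data (Data Non Pendidikan) | 240 |
| Higher education data (Data Perguruan Tinggi) | 13202 |

School data, by level:
| Level | Records |
|---|---|
| TK | 54226 |
| SD | 158590 |
| SMP | 28949 |
| SMA | 10810 |
| SMK | 4753 |

Research data:
| Level | Records | Variables |
|---|---|---|
| SD | 158,590 | 122 |
| SMP | 28,949 | 138 |
| SMA | 10,810 | 142 |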
TRANSFORMATION OF DATA FROM VARIABLES TO INDICATORS
SELECT
  Left(SD_Sarana.ID, 7) AS kdkec,
  Sum(jbkips_1 + jbkips_2 + jbkips_3 + jbkips_4 + jbkips_5 + jbkips_6
    + jbkPPKN_1 + jbkPPKN_2 + jbkPPKN_3 + jbkPPKN_4 + jbkPPKN_5 + jbkPPKN_6
    + jbkINDO_1 + jbkINDO_2 + jbkINDO_3 + jbkINDO_4 + jbkINDO_5 + jbkINDO_6
    + jbkMat_1 + jbkMat_2 + jbkMat_3 + jbkMat_4 + jbkMat_5 + jbkMat_6
    + jbkipa_1 + jbkipa_2 + jbkipa_3 + jbkipa_4 + jbkipa_5 + jbkipa_6)
  / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
    + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSBKTS,
  Sum(Lbangun)
  / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
    + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSLBTS,
  Sum(Ltanah)
  / Sum(jsisK_tk1l + jsisK_tk1p + jsisK_tk2l + jsisK_tk2p + jsisK_tk3l + jsisK_tk3p
    + jsisK_tk4l + jsisK_tk4p + jsisK_tk5l + jsisK_tk5p + jsisK_tk6l + jsisK_tk6p) AS RSLTTS,
  Sum(jrng_baik) / Sum(jrng_baik + jrng_rr + jrng_rb + jrng_bm) AS RSRB,
  Sum(jprg_ppkn + jprg_indo + jprg_mat + jprg_ipa + jprg_ips)
  / Sum(jrng_baik + jrng_rr + jrng_rb + jrng_bm) AS RSPRGTK
FROM SD_Sarana
INNER JOIN SD_SISWA ON SD_Sarana.ID = SD_SISWA.ID
GROUP BY Left(SD_Sarana.ID, 7);
The query result for the aggregated facilities (Sarana) data is as follows:
RESEARCH INDICATORS FOR ELEMENTARY SCHOOL (SD) EDUCATION QUALITY: INDICATOR SELECTION
Number of variables or indicators retained at each selection stage:
| Stage | SD | SMP | SMA |
|---|---|---|---|
| Base data (variables) | 122 | 138 | 142 |
| Transformation of base data to indicator data (query) | 21 | 19 | 20 |
| Indicator selection using factor analysis | 14 | 16 | 14 |
| Indicator selection using SEM | 7 | 10 | 13 |

Indicators shown in the figure, which groups them under input, process (proses), and quality (Mutu):
• RSLBTS: ratio of building area to number of students
• RSLTTS: ratio of land area to number of students
• RSTRB: ratio of number of students to number of classes
• RSBKTS: ratio of number of books to number of students
• RSPRGTK: ratio of number of teaching aids to number of classes
• RSTGR: ratio of number of students to number of teachers
• RSRB: ratio of classrooms in good condition to number of classrooms
• RSBR7: ratio of students aged 7 to number of students
• RSUM712: ratio of students aged 7-12 to number of students
• RSB: ratio of new students to number of students
• RSDFTK: ratio of applicants from kindergarten (TK) to number of applicants
• RSULGTJS: ratio of students repeating a grade to number of students
• RSPTSTD: ratio of students dropping out to number of students
• RSGTTG: ratio of permanent teachers to number of teachers
• RSGKTG: ratio of class teachers to number of teachers
• RSGLTG: ratio of teachers with a D2 qualification or higher to number of teachers
• RSGATRB: ratio of religion teachers to study groups (rombel)
• RSGINTROM: ratio of English teachers to study groups (rombel)
• TKTLLS: average student pass rate
• TOTUAS: average total UAS score
[Figure: SEM path diagram linking INPUT, PROSES, and MUTU to their indicators; labels in the diagram include RSLAB, RSTGR, RSBR12, RSRB, RSUM1315, RSGUAN, RSGLTG, RSPTSTS, RSDFSD, and TOTUAN]
Model fit: Chi-Square=32.88, df=27, P-value=0.20104, RMSEA=0.023
DATA INTEGRATION
• Link the kecamatan on the spatial map with the kecamatan surveyed in SDPN 2003.
• Remove kecamatan that were not surveyed in SDPN 2003 by editing the spatial data.
• Merge the non-spatial data with the selected spatial data in the spatial map table, matching each kecamatan.
• Run the MATLAB program using the appropriate method.
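The integration steps can be illustrated with a minimal Python sketch. It is not the MATLAB program mentioned in the source: the record layout, the coordinate field names `x` and `y`, and all values are hypothetical, and only the key name `kdkec` (the 7-character kecamatan code) is taken from the transformation query.

```python
# Hypothetical records: codes, coordinates, and indicator values are made up.
spatial = [
    {"kdkec": "3201010", "x": 106.70, "y": -6.50},
    {"kdkec": "3201020", "x": 106.80, "y": -6.60},
    {"kdkec": "3201030", "x": 106.90, "y": -6.55},  # not surveyed
]
indicators = [
    {"kdkec": "3201010", "RSRB": 0.62, "RSGLTG": 0.48},
    {"kdkec": "3201020", "RSRB": 0.71, "RSGLTG": 0.55},
]


def integrate(spatial_rows, indicator_rows, key="kdkec"):
    """Inner-join spatial and non-spatial rows on the kecamatan code."""
    by_code = {row[key]: row for row in indicator_rows}
    merged = []
    for row in spatial_rows:
        match = by_code.get(row[key])
        if match is None:
            continue  # drop kecamatan that were not surveyed
        merged.append({**row, **match})
    return merged


merged = integrate(spatial, indicators)
for row in merged:
    print(row)
```

An inner join is used here because the steps above discard kecamatan without survey data; a left join would keep them with missing indicators instead.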
SAR-KRIGING MODEL
[Figure: Literature map for the SAR-Kriging model]
• Data and variable selection (input, process, output; factor analysis, SEM): SDPN 2003 database; Sihombing (2002); Nababan (2003)
• Spatial processes: Cliff and Ord (1975); Anselin (1988); Cressie (1993); Armstrong (1998); Lazarevic (2000); Lichstein et al. (2002); Sekhar et al. (2003); LeSage (1999); LeSage and Pace (2004); Van Beers and Kleijinen (2004); Celik et al. (2005); Bronnenberg (2005); Kanazaki et al. (2006); Kumar and Remadevi (2006); Bakkali, S. and Amrani, M. (2008); Lu et al. (2008); Zhao Lu et al. (2008)
• Data mining: Koperski et al. (1997); Berry and Linoff (2000); Soukup and Davidson (2002); Giudici et al. (2003); Han and Kamber (2006); Tan et al. (2006); Olson and Shi (2007); Refaat (2007); Giannotti and Pedreschi (2008); Maimon and Rokach (2008)
• Spatial data mining:
  - Description: Moran index and Moran plot
  - Causal model: SAR model, Expansion SAR model
  - Prediction: ordinary Kriging
  - SAR-Kriging model
1.3 Variables of Research
In this research we use the SDPN 2003 database from Balitbang-Depdiknas (2003), in particular its base variables and indicator variables. Base variables are the variables in the raw data of individual schools. Indicator variables are derived from the base variables. The base variables cover school identity, student indicators, facility indicators, teacher indicators, and the total UAN score. From these indicators we build an input-process-output system for the quality of education: the input consists of the student indicators, the process consists of the facility and teacher indicators, and the output (quality of education) consists of the total UAN score and the average pass rate. Indicators are selected using factor analysis and the Structural Equation Model (SEM).
[Figure: Indicator variables for elementary school (SD) education quality after reduction with the Structural Equation Model]
• Input: ratio of number of students to number of classes (RSTRB); ratio of students aged 7 to number of students (RSBR7); ratio of new students to number of students (RSB)
• Process: ratio of classrooms in good condition to number of classrooms (RSRB); ratio of teachers with a D2 qualification or higher to number of teachers (RSGLTG)
• Quality: average total UAS score (TOTUAS); average student pass rate (TKTLLS)
Figure 1.1 Variables Reduction Process
Figure 1.1 shows the indicator variables that affect quality after reduction using factor analysis and SEM. The input consists of 3 indicators: the ratio of students to classes, the ratio of students aged 7 to students in the first grade, and the ratio of new students to all students. The process consists of 2 indicators: the ratio of classrooms in good condition to all classrooms, and the ratio of qualified teachers to all teachers. The output consists of 2 indicators: the total UAN score and the pass rate. The UAN output indicator is then analyzed with the Expansion SAR model.
2. Modeling at Spatial Data Mining
2.1 The Expansion SAR Model
Whereas the SAR model measures spatial heterogeneity based on neighborhood, the Expansion SAR model is a locally linear spatial model that measures spatial heterogeneity based on the coordinates of each location. This kind of spatial model was first introduced by Casetti (1972, 1992, in Anselin, 1988 and LeSage, 1999). Consider the following regression model:
$$ y = \beta_0 + \beta_1 x \tag{2.1} $$
where $\beta_0$ and $\beta_1$ are regression coefficients and $x$ is the vector of observations of the independent variable. To let the regression coefficient express spatial heterogeneity across observation units, it is expanded in terms of a number of expansion variables, for example $z_1$ and $z_2$, so that:
$$ \beta_1 = \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 \tag{2.2} $$
Substituting equation (2.2) into equation (2.1) gives:
$$ y = \beta_0 + \gamma_0 x + \gamma_1 (z_1 x) + \gamma_2 (z_2 x) $$
In general, the Casetti model is formulated as follows:
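$$ y = X\beta + \varepsilon, \qquad \beta = Z J \beta_0 \tag{2.3} $$
where
$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} x_1' & 0 & \cdots & 0 \\ 0 & x_2' & & \vdots \\ \vdots & & \ddots & \\ 0 & \cdots & & x_n' \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{pmatrix}, \quad
\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$
$$
Z = \begin{pmatrix} Z_{x1} \otimes I_k & Z_{y1} \otimes I_k & 0 & \cdots \\ \vdots & & \ddots & \\ 0 & \cdots & Z_{xn} \otimes I_k & Z_{yn} \otimes I_k \end{pmatrix}, \quad
J = \begin{pmatrix} I_k & 0 \\ 0 & I_k \\ \vdots & \vdots \\ 0 & I_k \end{pmatrix}, \quad
\beta_0 = \begin{pmatrix} \beta_x \\ \beta_y \end{pmatrix}
$$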
The parameters of the model are estimated by the least squares method. Given these parameter estimates, the estimates for individual points in space are obtained from the second equation of (2.3). The distance of observation $i$ from the central observation is:
$$ d_i = \sqrt{(z_{xi} - z_{xc})^2 + (z_{yi} - z_{yc})^2} \tag{2.4} $$
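so the Expansion SAR model can be written as:
$$ y = \alpha + X\beta + XD\beta_0 + \varepsilon \tag{2.5} $$
In equation (2.5), the influence of the variables can be separated into a non-spatial part and a spatial part:
$$ y = \alpha + \underbrace{X\beta}_{\text{non-spatial}} + \underbrace{XD\beta_0}_{\text{spatial}} + \varepsilon $$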
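As an illustration of equations (2.4) and (2.5), the following Python sketch builds the distance-expanded regressor and fits the coefficients by ordinary least squares for a single independent variable. It is a simplified sketch under stated assumptions, not the author's implementation: the data are synthetic, the helper names `distances`, `solve`, and `fit_expansion` are hypothetical, and no spatial lag term is included.

```python
import math


def distances(coords, center):
    """Euclidean distance of each location from the central location (eq. 2.4)."""
    cx, cy = center
    return [math.hypot(zx - cx, zy - cy) for zx, zy in coords]


def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_expansion(x, d, y):
    """Least-squares fit of y = alpha + beta*x + beta0*(d*x) via normal equations."""
    rows = [[1.0, xi, di * xi] for xi, di in zip(x, d)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic example: the response is generated exactly from known coefficients.
coords = [(0, 0), (1, 0), (0, 2), (3, 4), (2, 1), (4, 0)]
x = [1.0, 2.0, 1.5, 3.0, 2.5, 0.5]
d = distances(coords, center=(0, 0))
y = [2.0 + 0.5 * xi + 0.1 * di * xi for xi, di in zip(x, d)]

alpha, beta, beta0 = fit_expansion(x, d, y)
print(alpha, beta, beta0)
```

Here `beta` plays the role of the non-spatial coefficient and `beta0` the spatial (distance-expansion) coefficient, so the marginal effect of `x` at distance `d` is `beta + beta0 * d`.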
The parameters β and β0 describe the marginal non-spatial and spatial influences, respectively. The influence of each independent variable on the dependent variable can also be described graphically through the equation
xi i Z x xi yi i Z y yi di i D 0 i
(2.6)
2.2 Ordinary Kriging Method
Kriging is a method of calculating estimates of a regionalized variable at a point, over an area, or within a volume, using minimization of the estimation variance as its criterion. Kriging interpolation is commonly used to generate images of reservoir properties and to visualize reservoir heterogeneities. However, Kriging techniques are not well suited to reproducing geological reservoir patterns when the number of data points is very limited. Using the Kriging technique, we can predict observations at unsampled locations (Armstrong, 1998).
Assume that the regionalized variable under study has values $Z_i = Z(x_i)$, each representing the value at a point $x_i$. Also assume that this regionalized variable is second-order stationary, with:
Expectation:
$$ E[Z(x)] = m $$
Covariance:
$$ E[Z(x+h)\,Z(x)] - m^2 = C(h) $$
Variogram:
$$ E\left[\left(Z(x+h) - Z(x)\right)^2\right] = 2\gamma(h) $$
A kriged estimator $Z_V^*$ is a linear combination of $n$ values of the regionalized variable:
$$ Z_V^* = \sum_{i=1}^{n} \lambda_i Z_i \tag{2.7} $$
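For two sample locations, the Kriging weights that minimize the estimation variance are (Armstrong, 1998):
$$ \lambda_1 = \frac{1}{2}\left(1 + \frac{\bar{\gamma}_{2V} - \bar{\gamma}_{1V}}{\gamma_{12}}\right), \qquad \lambda_2 = \frac{1}{2}\left(1 + \frac{\bar{\gamma}_{1V} - \bar{\gamma}_{2V}}{\gamma_{12}}\right) $$
To obtain $\lambda_1$ and $\lambda_2$ with the ordinary Kriging method we need the values of $\gamma_{12}$, $\bar{\gamma}_{1V}$, and $\bar{\gamma}_{2V}$. The value $\gamma_{12}$ is the experimental semivariogram between the two sample points, and $\bar{\gamma}_{1V}$ is the semivariogram between the first sample point and the unsampled point to be predicted; $\bar{\gamma}_{2V}$ is defined in the same way for the second sample point.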
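A small Python sketch of the two-location case follows. It assumes the weights come from the ordinary Kriging system with two sample points, a zero semivariogram at zero distance, and the constraint that the weights sum to one; the function name and the numeric inputs are hypothetical.

```python
def two_point_weights(gamma_1v, gamma_2v, gamma_12):
    """Ordinary Kriging weights for two sample points.

    gamma_1v, gamma_2v: semivariogram between each sample point and the target.
    gamma_12: semivariogram between the two sample points (must be positive).
    """
    if gamma_12 <= 0:
        raise ValueError("gamma_12 must be positive")
    lam1 = 0.5 * (1.0 + (gamma_2v - gamma_1v) / gamma_12)
    lam2 = 0.5 * (1.0 + (gamma_1v - gamma_2v) / gamma_12)
    return lam1, lam2


# Made-up semivariogram values: the target is closer (in semivariogram terms)
# to sample point 1, so point 1 receives the larger weight.
lam1, lam2 = two_point_weights(gamma_1v=0.3, gamma_2v=0.5, gamma_12=0.8)
print(lam1, lam2)
```

When the target is equally related to both sample points, each weight is 0.5; the weights always sum to one by construction.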
For the case study we use the spherical semivariogram for two locations:
$$ \gamma(h) = \begin{cases} \hat{\gamma}(r)\,\dfrac{h}{r}, & h \le r \\[4pt] \hat{\gamma}(r), & h > r \end{cases} \tag{2.9} $$
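The semivariogram used for two locations can be sketched in Python as below. The first function assumes that equation (2.9) is piecewise: proportional to distance up to the range r and constant beyond it. The second function is the standard textbook spherical model, shown only for comparison. The function and parameter names are hypothetical.

```python
def two_location_semivariogram(h, r, gamma_r):
    """Assumed reading of eq. (2.9): gamma_r * h / r up to range r, gamma_r beyond."""
    if h < 0 or r <= 0:
        raise ValueError("h must be non-negative and r positive")
    return gamma_r * h / r if h <= r else gamma_r


def spherical_semivariogram(h, a, c):
    """Standard spherical model with range a and sill c (no nugget)."""
    if h < 0 or a <= 0:
        raise ValueError("h must be non-negative and a positive")
    if h >= a:
        return c
    ratio = h / a
    return c * (1.5 * ratio - 0.5 * ratio ** 3)


# Made-up range and sill values.
print(two_location_semivariogram(5.0, r=10.0, gamma_r=2.0))
print(spherical_semivariogram(5.0, a=10.0, c=2.0))
```

Both functions start at zero, increase with distance, and level off at the sill once the range is reached; they differ only in the shape of the rise.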
2.3 SAR-Kriging Method
The SAR-Kriging method in this study combines the Expansion SAR model with the Kriging technique in order to predict the quality of education at unsampled locations. The stages of the SAR-Kriging model are as follows (Abdullah, A. S., 2009):
• Determine the dependent and independent variables of the Expansion SAR model, which incorporates regional data through the distance between the center location and each observation location.
• Estimate the parameters of the Expansion SAR model with the Maximum Likelihood method.
• Determine the unsampled location, lying around two sample locations, together with its coordinates and its distance to the sample locations.
• Use the parameter estimates of the Expansion SAR model as input to the Kriging method to obtain the weights at the location where the quality of education is to be predicted.
• The Kriging weights give the parameter estimates at the unsampled location.
• The parameters obtained through the Kriging weights become the coefficients of the Expansion SAR model at the unsampled location.
• Because the Expansion SAR model is a model for cross-sectional data, the SAR-Kriging method can be used to predict the quality of education once the values of the independent variables are known.
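The stages above can be sketched in Python as follows. This is one possible reading, stated as an assumption: the coefficients at the unsampled location are taken as the Kriging-weighted combination of the Expansion SAR coefficients estimated at the two sample locations, and the prediction then follows the form of equation (2.5). All function names, coefficient values, and weights are hypothetical.

```python
def krige_coefficients(coef_a, coef_b, lam_a, lam_b):
    """Combine two coefficient vectors with Kriging weights (assumed to sum to 1)."""
    if len(coef_a) != len(coef_b):
        raise ValueError("coefficient vectors must have the same length")
    return [lam_a * a + lam_b * b for a, b in zip(coef_a, coef_b)]


def predict(coef, x, d):
    """Evaluate y = alpha + sum(beta_j * x_j) + sum(beta0_j * d * x_j).

    coef is laid out as [alpha, beta_1..beta_k, beta0_1..beta0_k].
    """
    k = len(x)
    if len(coef) != 1 + 2 * k:
        raise ValueError("coef must hold 1 + 2k values")
    alpha = coef[0]
    betas = coef[1:1 + k]
    beta0s = coef[1 + k:1 + 2 * k]
    non_spatial = sum(b * xj for b, xj in zip(betas, x))
    spatial = sum(b0 * d * xj for b0, xj in zip(beta0s, x))
    return alpha + non_spatial + spatial


# Made-up coefficients at two sample locations (one regressor) and made-up weights.
coef_site_a = [20.0, 1.0, 0.5]
coef_site_b = [30.0, 3.0, 1.5]
coef_new = krige_coefficients(coef_site_a, coef_site_b, lam_a=0.25, lam_b=0.75)

# Prediction at the unsampled location for a known regressor value and distance.
y_new = predict(coef_new, x=[2.0], d=0.4)
print(coef_new, y_new)
```

The independent variables and the distance at the unsampled location must be supplied by the user, which matches the last stage above: prediction is possible only when the independent variable values are known.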
3. The Result of SAR-Kriging
In this paper, we implemented spatial data mining using the SAR-Kriging method to predict the quality of education at 13 provinces in Indonesia, including Aceh Province. Aceh was not included as a survey location in the 2003 national basic education survey (SDPN 2003), because conditions there were very dangerous at the time. The SAR-Kriging method can therefore be used to predict the quality of education there.
For the SAR-Kriging method, input, process, and quality data were selected for the elementary, junior high, and senior high school levels from two provinces in Indonesia: Banten Province and South Sulawesi Province.
Figure 3.1 Maps of Provinces in Indonesia http://zulfadli.files.wordpress.com/2008/01/indonesia-50-provinsi-gif.gif
Following the SAR-Kriging procedure, we have:
(1) The unsampled locations are the coordinates of 13 provinces around Banten and South Sulawesi.
(2) The parameter estimates of the Expansion SAR model are carried through the Kriging technique to the 13 new locations using their coordinates.
(3) The 13 locations are positioned between Banten and South Sulawesi Provinces.
(4) Based on the Kriging weights from step 2, a prediction model (Expansion SAR through Kriging) for the quality of education is expressed for the 13 unsampled locations for elementary school.
Figure 3.2 Kriging Weight and Prediction of Quality Education at 13 Provinces
Based on these results, a prediction model for the quality of elementary school education was obtained with the SAR-Kriging method for the 13 locations between Banten and South Sulawesi. If the values of the input and process variables and the coordinates of each location are known, the quality of education, measured by the total UAN score, can be predicted. The prediction models for the quality of education at the 13 locations between Banten and South Sulawesi are given in the following table:
Table 3.1 Prediction of Quality Education for Elementary School in Indonesia using SAR-Kriging
Table 3.1 shows that the quality of education in the 13 provinces is influenced by a non-spatial component with five variables and a spatial component with the same five variables combined with the distance from the observation location to the center location. Taking Aceh Province as one of the selected locations between Banten and South Sulawesi, the following Expansion SAR model is obtained from the SDPN 2003 data:
Quality of Education at Aceh = 25.61 + 0.02 RSTRB + 5.88 RSB 2.87 RSBR7 – 6.31 RSRB + 1.77 RSGLTG + 0.22 d-RSTRB – 7.81 d-RSB – 11.39 d-RSBR7 1.53 d-RSRB + 0.57 d-RSGLTG
For the prediction of the quality of education at elementary, junior high, and senior high schools in 13 provinces in Indonesia, the comparison between the actual values and the SAR-Kriging predictions is as follows:
Table 3.2 Comparison of Quality Education Actual and Prediction SAR-Kriging At Elementary School
| NO | PROVINCE | ACTUAL | PREDICTION | ERROR | APE (%) |
|---|---|---|---|---|---|
| 1 | DKI | 26.85 | 23.81 | 3.04 | 11.32 |
| 2 | JABAR | 31.73 | 26.04 | 5.69 | 17.93 |
| 3 | JATENG | 26.15 | 27.44 | -1.29 | 4.93 |
| 4 | DIY | 26.47 | 26.76 | -0.29 | 1.10 |
| 5 | JATIM | 26.83 | 28.19 | -1.36 | 5.07 |
| 6 | ACEH | 25.94 | 24.27 | 1.67 | 6.44 |
| 7 | SUMUT | 24.22 | 24.54 | -0.32 | 1.32 |
| 8 | SUMBAR | 23.13 | 29.13 | -6.00 | 25.94 |
| 9 | SULUT | 24.95 | 25.96 | -1.01 | 4.05 |
| 10 | SULBAR | 25.39 | 25.48 | -0.09 | 0.35 |
| 11 | KALBAR | 24.09 | 24.10 | -0.01 | 0.04 |
| 12 | KALTENG | 23.43 | 26.52 | -3.09 | 13.19 |
| 13 | KALTIM | 23.57 | 26.68 | -3.11 | 13.19 |
| | MAPE | | | | 8.07 |
Table 3.3 Comparison of Quality Education Actual and Prediction SAR-Kriging At Junior High School
| NO | PROVINCE | ACTUAL | PREDICTION | ERROR | APE (%) |
|---|---|---|---|---|---|
| 1 | DKI | 18.54 | 16.99 | 1.55 | 8.36 |
| 2 | JABAR | 17.85 | 16.82 | 1.03 | 5.77 |
| 3 | JATENG | 17.65 | 18.00 | -0.35 | 1.98 |
| 4 | DIY | 18.99 | 17.98 | 1.01 | 5.31 |
| 5 | JATIM | 16.46 | 16.97 | -0.51 | 3.10 |
| 6 | ACEH | 14.47 | 15.23 | -0.76 | 5.25 |
| 7 | SUMUT | 18.53 | 15.11 | 3.42 | 18.46 |
| 8 | SUMBAR | 19.20 | 16.57 | 2.63 | 13.69 |
| 9 | SULUT | 14.13 | 17.30 | -3.17 | 22.43 |
| 10 | SULBAR | 18.02 | 17.36 | 0.66 | 3.66 |
| 11 | KALBAR | 16.15 | 16.07 | 0.08 | 0.50 |
| 12 | KALTENG | 18.20 | 16.94 | 1.26 | 6.92 |
| 13 | KALTIM | 16.42 | 16.71 | -0.29 | 1.77 |
| | MAPE | | | | 7.48 |
Table 3.4 Comparison of Quality Education Actual and Prediction SAR-Kriging At Senior High School
| NO | PROVINCE | ACTUAL | PREDICTION | ERROR | APE (%) |
|---|---|---|---|---|---|
| 1 | DKI | 36.74 | 16.90 | 19.84 | 54.00 |
| 2 | JABAR | 36.30 | 31.20 | 5.10 | 14.04 |
| 3 | JATENG | 39.54 | 29.92 | 9.62 | 24.33 |
| 4 | DIY | 40.30 | 29.25 | 11.05 | 27.43 |
| 5 | JATIM | 45.34 | 29.55 | 15.79 | 34.82 |
| 6 | ACEH | 17.16 | 28.93 | -11.77 | 68.61 |
| 7 | SUMUT | 31.90 | 38.66 | -6.76 | 21.19 |
| 8 | SUMBAR | 33.22 | 35.46 | -2.24 | 6.73 |
| 9 | SULUT | 45.48 | 38.54 | 6.94 | 15.26 |
| 10 | SULBAR | 20.78 | 37.17 | -16.39 | 78.87 |
| 11 | KALBAR | 16.58 | 16.70 | -0.12 | 0.72 |
| 12 | KALTENG | 39.09 | 37.96 | 1.13 | 2.89 |
| 13 | KALTIM | 25.48 | 33.33 | -7.85 | 30.81 |
| | MAPE | | | | 29.21 |
From the three tables above, we can conclude that the Mean Absolute Percentage Error (MAPE) of the predicted quality of education at 13 provinces in Indonesia is less than 10% for elementary school (8.07%) and junior high school (7.48%), but more than 10% for senior high school (29.21%). This means that the SAR-Kriging method gives a good model for predicting the quality of education at unsampled locations for elementary and junior high schools in Indonesia.
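The error measures in the tables can be recomputed with a short Python sketch. It assumes that ERROR is the actual value minus the prediction and that APE is the absolute error as a percentage of the actual value, which is consistent with the tabulated figures; the data are the actual and predicted values from Table 3.2, and the function names are hypothetical.

```python
# (province, actual, predicted) for elementary school, copied from Table 3.2.
rows = [
    ("DKI", 26.85, 23.81), ("JABAR", 31.73, 26.04), ("JATENG", 26.15, 27.44),
    ("DIY", 26.47, 26.76), ("JATIM", 26.83, 28.19), ("ACEH", 25.94, 24.27),
    ("SUMUT", 24.22, 24.54), ("SUMBAR", 23.13, 29.13), ("SULUT", 24.95, 25.96),
    ("SULBAR", 25.39, 25.48), ("KALBAR", 24.09, 24.10), ("KALTENG", 23.43, 26.52),
    ("KALTIM", 23.57, 26.68),
]


def ape(actual, predicted):
    """Absolute percentage error of one prediction."""
    return abs(actual - predicted) / actual * 100.0


def mape(records):
    """Mean absolute percentage error over (name, actual, predicted) records."""
    return sum(ape(a, p) for _, a, p in records) / len(records)


for name, actual, predicted in rows:
    print(f"{name:8s} error={actual - predicted:6.2f} APE={ape(actual, predicted):6.2f}")

mape_sd = mape(rows)
print(f"MAPE = {mape_sd:.2f}")
```

The same functions can be applied to the junior and senior high school tables by replacing the contents of `rows`.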
4. Conclusion
1) The SAR-Kriging model is a spatial data mining tool that combines the Expansion SAR model and the Kriging method.
2) Applying the SAR-Kriging model to predict the quality of education at unsampled locations in Indonesia gave good results for elementary and junior high schools in 13 provinces located between the two selected provinces.
References
• Abdullah, A. S. 2009. Spatial Data Mining using SAR-Kriging Model (Spatial Autoregressive-Kriging) for Mapping Quality of Education in Indonesia. Unpublished Dissertation. Yogyakarta: Universitas Gadjah Mada.
• Anselin, L. 1988. Spatial Econometrics: Methods and Models. London: Kluwer Academic Publishers.
• Armstrong, M. 1998. Basic Linear Geostatistics. New York: Springer Verlag.
• Balitbang Depdiknas. 2003. Survei Dasar Pendidikan Nasional Tahun 2003. Jakarta.
• Han, J., and Kamber, M. 2006. Data Mining: Concepts and Techniques. USA: Academic Press.
• LeSage, J. P. 1999. The Theory and Practice of Spatial Econometrics. University of Toledo.