CHAPTER 4 EVALUATION
4.1 System Specification
For testing purposes, the algorithm was implemented in PHP (5.3.0) with a MySQL database (5.5.8) and an Apache web server, running in a localhost environment. The experiment application consists of a text box that accepts one Indonesian word and a button that processes the given word. The test machine is equipped with an Intel Core i5-3317U (1.7 GHz, 4 CPUs) and 4 GB of RAM.
4.2 Testing
The test data was collected manually from Kompas.com, one of the biggest news companies in Indonesia. The 25 articles collected are online articles published on Kompas.com between 1 November 2012 and 15 January 2013, distributed evenly across 10 news categories. Before lemmatization, the articles were parsed to satisfy these conditions:
1. The minimum length of a tested word is 4 characters.
2. Numbers and special characters are removed, leaving only letters and hyphens (‘-’).
3. The data is supplied as one word per lemmatization process.
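The three parsing conditions above can be illustrated with a minimal Python sketch. This is not the PHP code used in the experiment; the function name `parse_words`, the lowercasing step, and the trimming of leading and trailing hyphens are assumptions made for illustration.

```python
import re

MIN_LENGTH = 4  # condition 1: minimum length of a tested word


def parse_words(text):
    """Split raw article text into test words, one word per entry.

    Hypothetical helper: lowercasing and hyphen trimming are assumptions.
    """
    words = []
    for token in text.lower().split():
        # Condition 2: remove everything except letters and hyphens.
        cleaned = re.sub(r"[^a-z-]", "", token).strip("-")
        # Condition 1: skip words shorter than the minimum length.
        if len(cleaned) >= MIN_LENGTH:
            # Condition 3: each entry is a single word.
            words.append(cleaned)
    return words
```

For example, `parse_words("Pada 2012, anak-anak di desa-desa itu berharap!")` keeps `pada`, `anak-anak`, `desa-desa`, and `berharap`, and drops the number and the short words.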
The parsed data contains 64,372 words with an average of 6.68 characters per word, of which 10,579 are unique. The data is stored in a MySQL table to ease the testing process. The tests are performed on the following categories:
1. All / Combined category
Table 4.1 Sample test case from All / Combined Category
Full            Unique
terjadi         rosmaniar
terutama        diduga
meluruh         membuktikan
rusia           mohammad
yang            batas
dihebohkan      diskriminasi
adalah          berharap
malaysia        menegaskan
peningkatan     juli
malaysia        disimpan
mikro           biji
akan            gangguan
bangsa          kuncinya
desa            ujang
persen          smyczek
nomor           langsung
desa            pirang
merebut         ditemukan
This is the combined version of all categories. ‘Full’ means that the test data is not filtered for duplicate words; ‘Unique’ means that the test data contains no duplicate words. The 18 sample test cases displayed in the table above were taken at random.
2. Business category
Table 4.2 Sample test case from Business Category
Full            Unique
terjadi         nantinya
kondisi         jasa
izin            minggu
terakhir        pembangunan
untuk           konsumer
daya            menanggung
keperluan       kecil
inklusif        eksekutif
untuk           subsidi
membidik        dibanding
usaha           umkm
bidang          kelas
memperkaya      bermotor
termasuk        mereka
bermotor        dinner
tiga            muntahiyah
cara            prospek
memberikan      layanan
The Business category covers the development of Indonesia’s economy, fiscal and monetary matters, stock market developments, business analyses, important figures in the Indonesian business world, business planning, and opportunities.
3. Regional category
Table 4.3 Sample test case for Regional Category
Full            Unique
timur           terkait
surat           hakim
juga            syech
surat           nomor
jalan           rangkaian
pemberhentian   masih
disampaikan     diajukan
kekaguman       sore
desember        waiwerang
tetapi          berkah
rekomendasi     islami
untuk           dicampur
jalan           effendi
garut           anak-anak
jawabannya      melanggar
ruas            berupa
bupati          setelah
tanah           mengangkut
The Regional category mostly covers key events in the highlighted regions, such as Sumatra, Jawa, Kalimantan, and Eastern Indonesia. The news content is mixed; it may cover crime, politics, or social events.
4. Education category
Table 4.4 Sample test case for Education Category
Full            Unique
dalam           pada
yang            terbuka
transisi        melaksanakan
sadar           undang-undang
rusak           meminta
menjadikan      mewujud
bayar           waktu
anak            berpesta
baru            seluruh
rupiah          meniadakan
oleh            dibatalkan
pihak           kurikulum
namun           ekstra
sorotan         memang
jelas           memliki
dalam           masih
mila            triliun
sekolah         baik
The Education category contains news about education development in Indonesia, including important figures in Indonesia’s education world, scholarships, study guides, the education agenda, and general information.
5. Science category
Table 4.5 Sample test case for Science Category
Full            Unique
institut        spesies
memantau        laut
diketahui       pusat
penduduk        kelautan
tidak           serius
selatan         ilmu
besar           besar
hijau           perilaku
hujan           jagielonian
rambut          menghargai
melebihi        korban
rekannya        untuk
struktur        berkumpul
bmkg            branicki
data            ekor
membuat         lancashire
gelombang       pernah
hingga          tiga
The Science category mostly covers energy conservation, global warming, general science, astrology, archeology, biology, laboratory tests, scientific phenomena, and other news that can be categorized as science.
6. Sports category
Table 4.6 Sample test case for Sports Category
Full            Unique
digelar         ivan
melakukan       peringkat
sozonov         merebut
menjabat        bertemu
solidarity      roth
hasil           meningkatkan
pekan           abdul
jadi            bulu
dengan          kejuaraan
punggung        rafael
akan            tangkis
adalah          kalinya
mengalahkan     melaju
salah           menjabat
kurniawan       barunya
kualifikasi     untuk
slam            secara
sementara       senin
The Sports category covers events in Indonesia involving major, popular sports such as racing, tennis, badminton, celebrity sports, and other sports.
7. International category
Table 4.7 Sample test case for International category
Full            Unique
tawaran         jendela
warga           mengatasi
mengirim        inggris
tentang         abou
pengandilan     pakistan
alasan          berniat
tahun           penyelamat
militer         khovanskoye
mengimpor       teroris
palsu           peralatan
tenda-tenda     nyawa
kata            kondisi
udara           mengkaji
oleh            diabaly
rights          silvio
warga           beku
mereka          senapan
angin           entv
The International category mostly covers important events and issues that occur outside Indonesia. The category also features important international figures, unique stories, and tragedies.
8. ‘Oasis’ category
Table 4.8 Sample test case for Oasis category
Full            Unique
yang            ternama
kota            mesti
keresidenan     tergenang
tanah           menghadirkan
dengan          tawanan
relatif         khususnya
menjadi         memorakporandakan
tinutuan        press
tambah          ngeri
hostel          bertubi-tubi
mampu           risiko
kepala          terumbu
film            tepung
andalan         esthy
kami            menyengat
yoseph          nilai
tungka          larangan
meningkat       namanya
The Oasis category provides complementary and entertainment news from Indonesia. This category covers poetry, short stories, novels, serial stories, and arts.
9. Megapolitan category
Table 4.9 Sample test case for Megapolitan category
Full            Unique
biasanya        pencarian
bukan           calon
keberadaan      desa-desa
tidak           kegiatan
fatmawati       tambang
juga            di-mark
matinya         nilai
diperbolehkan   bernomor
anak            supriyati
kepada          penyelewengan
masalah         keluarga
menegaskan      salah
balik           sisi
tanggul         pinggir
banjir          pantas
sekarang        lakukan
untuk           berpendapat
jokowi          permohonan
The Megapolitan category mostly covers lifestyle, crime stories, urban life, and important events and issues in the major cities of Indonesia. News about trending celebrities is mostly posted here.
10. Travel category
Table 4.10 Sample test case for Travel category
Full            Unique
hidup           melahirkan
agar            fotografer
selamanya       hhhh
berbau          lendang
menjadi         konservasi
lady            membelenggu
dilestarikan    hal-hal
tadi            denting
bebas           afrika
serius          rata
semua           menimbulkan
ratri           seolah
pelaku          karakter
brazil          orang-orang
setiap          menulis
penitipan       pelak
orangorang      pikul
keringat        karana
The Travel category highlights news about traveling. Its subcategories include travel stories, food stories, travel tips, hotel stories, and stories in which Indonesians take part in international travel events.
11. National category
Table 4.11 Sample test case for National category
Full            Unique
pembunuhan      hrwg
setelah         citra
larangan        prasarana
senilai         terdiri
aceh            digambarkan
diminta         amir
korupsinya      hudep
sebagai         berwarna
menuturkan      maka
sudah           mengalir
minustah        meliputi
setelah         urus
juga            tentang
tahu            mempercepat
agung           kubu
bathoegana      tanya-tanya
hujan           widjojanto
partai          tengah-tengah
The National category provides news about events within Indonesia. The news content may be mixed; articles may cover politics, the economy, business, crime stories, and celebrity issues.
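The ‘Full’ and ‘Unique’ conditions used in Tables 4.1 to 4.11, and the random selection of 18 sample cases, can be sketched in Python as follows. The function names, the fixed random seed, and the in-memory word list are assumptions for illustration; in the experiment the data is stored in a MySQL table.

```python
import random


def full_and_unique(words):
    """Return the unfiltered list ('Full') and a duplicate-free list ('Unique')."""
    full = list(words)
    # dict.fromkeys keeps the first occurrence of each word, in order.
    unique = list(dict.fromkeys(words))
    return full, unique


def sample_cases(words, k=18, seed=0):
    """Pick up to k entries at random; the seed is only for repeatability."""
    rng = random.Random(seed)
    return rng.sample(words, min(k, len(words)))
```

Sampling from the ‘Full’ list can return the same word more than once when that word occurs several times in the data, which is why words such as ‘malaysia’ and ‘desa’ can appear twice in a ‘Full’ column.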
4.2.1 Validity
In analyzing the test data, several constraints determine which lemmatization results are considered a success, which are considered an error, and which cases are outside the current algorithm’s scope. A lemmatization is considered successful if the correct lemma is produced from the input word. Cases in which an incorrect lemma is produced fall into the error category. Out-of-scope cases are considered invalid and are therefore counted as neither a failure nor a success. The out-of-scope cases are:
1. Proper nouns and abbreviations, which include the names of people, areas, or companies (Microsoft, Bandung, PT.KAI, etc.). They are out of scope because they do not exist in the dictionary.
2. Foreign words, meaning words that are not Indonesian. As with point 1, they are not listed in the Indonesian dictionary.
3. Infixes, which are affixes inserted inside a word, for example the infix ‘-er-’ in ‘gigi’, which produces ‘gerigi’. Words that contain an infix are already listed as lemmas; therefore this algorithm does not support infix removal.
4. Non-standard words and affixes, meaning words that are not defined in the Indonesian dictionary, or slang words and affixes. Examples are ‘nyambung’, ‘gue’, and ‘pikirin’ with its ‘-in’ suffix.
Lemmatization errors can be classified into three categories:
1. Overlemmatized: similar to overstemming; too many affixes are removed, so the produced lemma is not as expected. For example, ECS overstems ‘penyidikan’ to ‘sidi’, whereas the correct result is ‘sidik’.
2. Underlemmatized: similar to understemming; too few affixes are removed, so the produced lemma is not as expected. For example, ECS reduces ‘mengalami’ to ‘alami’, whereas the correct result is ‘alam’.
3. Incorrect rule: the affix is removed incorrectly because of an ineffective or incorrect rule. For example, ‘mengatakan’ may become ‘katak’ by removing the ‘-an’ suffix and the ‘meng-’ prefix.
4.2.2 Process
The test algorithm fetches all parsed data and processes the words one by one. The results are saved in a separate table. When the input word is immediately returned as the lemma, the algorithm returns an “input_is_lemma” exception message; this does not affect the result in any way. When the algorithm fails to obtain a lemma from the input word, it returns the original word with a “lemma_not_found” exception message. However, not every result produced with an exception message is classified as an error; the message can also indicate proper nouns and foreign words.
After the results are stored in the database, they are inspected manually to analyze lemmatization errors. The algorithm itself does not know when it is overlemmatizing or underlemmatizing the input word; whenever it finds a lemma, the result is returned as a success. These cases need to be classified manually.
Several improvements were made so that the lemmatization algorithm performs better. The most notable improvement over the previous algorithm is lemma phrase checking. Some phrases are considered lemmas, for example ‘tanggung jawab’. When such a lemma phrase is given a confix, its words are joined into one word; for example, the lemma phrase ‘tanggung jawab’ with the ‘per- -an’ confix becomes ‘pertanggungjawaban’. The previous stemming algorithm does not handle this case because the lemma consists of more than one word. Other minor improvements include rule precedence, rule order, and recoding path optimization.
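The idea behind lemma phrase checking can be sketched in Python as follows. This is only an illustration under stated assumptions, not the thesis implementation: the phrase dictionary holds a single toy entry, only the ‘per- -an’ confix from the example is handled, and the names `LEMMA_PHRASES`, `JOINED_PHRASES`, and `lookup_lemma_phrase` are hypothetical.

```python
# Toy phrase dictionary (assumption); a real one would come from the
# lemma dictionary used by the algorithm.
LEMMA_PHRASES = {"tanggung jawab"}

# Index each phrase by its joined spelling, e.g. 'tanggungjawab'.
JOINED_PHRASES = {phrase.replace(" ", ""): phrase for phrase in LEMMA_PHRASES}

# Only the confix from the example in the text is listed here.
CONFIXES = [("per", "an")]


def lookup_lemma_phrase(word):
    """Return the lemma phrase hidden inside a confixed word, or None."""
    for prefix, suffix in CONFIXES:
        if word.startswith(prefix) and word.endswith(suffix):
            stem = word[len(prefix):len(word) - len(suffix)]
            # The stem is the phrase written without the space.
            if stem in JOINED_PHRASES:
                return JOINED_PHRASES[stem]
    return None
```

With this sketch, `lookup_lemma_phrase("pertanggungjawaban")` strips ‘per-’ and ‘-an’, finds ‘tanggungjawab’ in the joined index, and returns the two-word lemma ‘tanggung jawab’.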
4.2.3 Results
A sample result from the testing process is shown below:
Table 4.12 Sample Result from Business Category
input          output       time (s)   Issue
perancis       perancis     0.00469    lemma_not_found
gencar         gencar       0.00184
investasi      investasi    0.00219
semakin        semakin      0.0032
bersemangat    semangat     0.00465
untuk          untuk        0.00339
berinvestasi   investasi    0.01076
indonesia      Indonesia    0.00346
terutama       utama        0.00447
dalam          dalam        0.00308
penjualan      jual         0.00558
pelaku         laku         0.00244
usaha          usaha        0.00139
perancis       perancis     0.00779
melihat        lihat        0.00438
perekonomian   ekonomi      0.00701
cukup          cukup        0.00341
baik           baik         0.00337
menjanjikan    janji        0.0057     lemma_not_found
These results were captured from the Business category, and duplicate entries are ignored. The test results are summarized below by category and by the conditions specified earlier:
Table 4.13 Test Result Summary
               FULL                                   UNIQUE
Category       T      V      S      E    P            T      V     S     E    P
Business       6344   5627   5550   77   0.98632      1868   1580  1559  21   0.98671
Regional       6470   4802   5846   81   0.98313      1213   1011  995   16   0.98417
Education      4165   5927   3598   32   0.99460      868    637   623   14   0.97802
Science        6246   5504   5398   73   0.98674      874    643   630   13   0.97978
Sports         6231   3242   5522   42   0.98705      838    608   604   4    0.99342
International  10953  3630   9917   75   0.97934      2037   1593  1575  18   0.98870
Megapolitan    3998   5471   3214   28   0.99488      610    302   297   5    0.98344
National       5499   5564   4764   38   0.99317      559    326   324   2    0.99387
Oasis          6087   9992   5462   42   0.99580      820    528   524   4    0.99242
Travel         8379   7502   7457   45   0.99400      892    611   607   4    0.99345
All            64372  57261  56728  533  0.99069      10579  7839  7738  101  0.98712
Where:
T = Total data count
V = Valid test data count
S = Successful lemmatizations
E = Errors / failures
P = Precision, obtained by computing
$$P = \frac{V - E}{V}$$
The average processing time for each word is 0.00411 seconds. This average is obtained by calculating:
$$\bar{t} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{time}(i)$$
where n is the total data count in the test result, and time(i) is the processing time taken to lemmatize a word, taken from the ‘time’ column on the ith row of the result table (the table format follows Table 4.12).
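As a worked check of the two summary measures, the following Python sketch computes the precision and the average processing time. It assumes that precision is the share of valid test data lemmatized without error, (V - E) / V, and that the average is the arithmetic mean of the ‘time’ column; the function names are hypothetical.

```python
def precision(valid, errors):
    """Share of valid test data lemmatized without error: (V - E) / V."""
    return (valid - errors) / valid


def average_time(times):
    """Arithmetic mean of the per-word processing times, in seconds."""
    return sum(times) / len(times)


# 'Full' values for the Business and All rows of Table 4.13.
business = round(precision(5627, 77), 5)
combined = round(precision(57261, 533), 5)

# First three 'time' values from Table 4.12, for illustration only.
sample_average = average_time([0.00469, 0.00184, 0.00219])
```

Under these assumptions, `business` and `combined` evaluate to 0.98632 and 0.99069, which agree with the P values reported for those rows.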
4.2.4 Errors
Of the 533 failure cases, 421 are overlemmatization cases, 54 are underlemmatization cases, and the remaining 58 result from incorrect rule application or rule order. The error analysis is presented below:
Table 4.14 Error Analysis
Issue             Sample Case
Overlemmatized    berupa => upa
                  kebijakan => bija
Underlemmatized   meminimalisasi (not lemmatized)
Incorrect Lemma   berilmu => beril
‘berupa’ is overlemmatized because, according to affix removal rule 1 (Table 3.1), the default output is produced by always removing the ‘ber-’ prefix first, while ‘be-’ is reserved as the recoding path. Before recoding is performed, the word ‘upa’ (which means present/gift) is found in the dictionary lookup, and thus an overlemmatized result is produced. It is possible to swap the default output with its recoding path, i.e. to remove ‘be-’ first and reserve ‘ber-’ as the recoding path, which produces the expected output ‘rupa’. However, swapping the recoding path introduces new errors: ‘berombak’ would be wrongly lemmatized to ‘rombak’, while the expected output is ‘ombak’. The same goes for ‘berubah’, which would be mistakenly lemmatized to ‘rubah’.
‘kebijakan’ is overlemmatized to ‘bija’ because the derivational suffix removal process detects ‘-kan’ without knowing that the ‘k’ is actually part of the lemma, and the word ‘bija’ (which means seed) exists in the dictionary.
‘meminimalisasi’ is underlemmatized because the lemmatization algorithm does not process the ‘-isasi’ suffix; the original input is therefore returned with a “lemma_not_found” message.
‘berilmu’ is incorrectly lemmatized to ‘beril’ because of rule precedence (step execution order). The ‘-mu’ fragment is identified as a possessive pronoun, and the word ‘beril’ (a type of mineral; Be3Al2Si6O18) exists in the dictionary, which produces the error.
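The trade-off between the default output and the recoding path for the ‘ber-’ prefix can be reproduced with a small Python sketch. It is a deliberate simplification: the dictionary is a toy set containing only the words from the examples above, the function `strip_ber` is hypothetical, and the real algorithm applies many more rules before and after this step.

```python
# Toy dictionary (assumption) holding only the words from the examples.
DICTIONARY = {"upa", "rupa", "ombak", "rombak"}


def strip_ber(word, default="ber", recoding="be"):
    """Remove the default prefix first; fall back to the recoding path.

    The first candidate found in the dictionary is returned as the lemma,
    which is what makes the order of the two prefixes matter.
    """
    for prefix in (default, recoding):
        if word.startswith(prefix):
            candidate = word[len(prefix):]
            if candidate in DICTIONARY:
                return candidate
    return word  # no lemma found; return the input unchanged


# Default order ('ber-' first, 'be-' as recoding path).
default_berupa = strip_ber("berupa")
default_berombak = strip_ber("berombak")

# Swapped order ('be-' first, 'ber-' as recoding path).
swapped_berupa = strip_ber("berupa", default="be", recoding="ber")
swapped_berombak = strip_ber("berombak", default="be", recoding="ber")
```

With the default order, ‘berupa’ yields ‘upa’ and ‘berombak’ yields ‘ombak’; with the swapped order, ‘berupa’ yields ‘rupa’ but ‘berombak’ yields ‘rombak’. Neither order handles both words, which matches the behaviour described above.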
4.3 Demo Implementation
The algorithm was also implemented in a simple web application built in PHP with a MySQL database. The demo is mainly illustrative: it lets the user input a word directly without preparing batch files. An input is supplied to the application, and the output is shown as the result.
Figure 4.1 Main Display of Application
The demo application accepts only one Indonesian word; if the input contains characters other than letters or hyphens (‘-’), the word is not processed. When a valid word is supplied, it is processed immediately. When lemmatization succeeds, the screen in Figure 4.2 is shown; when lemmatization returns an error, such as lemma not found, the screen in Figure 4.3 is shown instead.
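The input check described above could look like the following Python sketch. The demo itself is written in PHP; the pattern and the name `is_valid_input` are assumptions, and the sketch accepts only unaccented Latin letters and hyphens.

```python
import re

# One word made only of letters and hyphens (assumed pattern).
VALID_WORD = re.compile(r"[A-Za-z-]+")


def is_valid_input(word):
    """Return True if the whole input consists of letters and hyphens."""
    return VALID_WORD.fullmatch(word) is not None
```

An input such as ‘anak-anak’ passes this check, while inputs containing digits, spaces, or other symbols are rejected before lemmatization.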
Figure 4.2 Example of Successful Output
The output screen also shows the processing time taken to lemmatize the input word, along with a convenience link for lemmatizing another word.
Figure 4.3 Example of Failed Output
When the lemmatization process fails, the processing time is also displayed to show how long it takes to process a faulty input. The processing time is the difference between the start time, marked right after the input is supplied to the lemmatizer, and the stop time, marked when the lemmatizer produces the result.
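The timing described above, together with the “input_is_lemma” and “lemma_not_found” messages from Section 4.2.2, can be sketched in Python as follows. The stand-in lemmatizer `toy_lemmatize`, its three-word dictionary, and the result layout are assumptions for illustration; they are not the PHP implementation used in the demo.

```python
import time

# Toy dictionary (assumption) for the stand-in lemmatizer below.
TOY_DICTIONARY = frozenset({"semangat", "usaha", "investasi"})


def toy_lemmatize(word):
    """Stand-in lemmatizer that only knows the 'ber-' prefix."""
    if word in TOY_DICTIONARY:
        return word, "input_is_lemma"
    if word.startswith("ber") and word[3:] in TOY_DICTIONARY:
        return word[3:], ""
    # Failure: return the original word with an exception message.
    return word, "lemma_not_found"


def timed_lemmatize(word, lemmatize=toy_lemmatize):
    """Run the lemmatizer and record the elapsed time in seconds."""
    start = time.perf_counter()   # marked right after the input is supplied
    output, issue = lemmatize(word)
    stop = time.perf_counter()    # marked when the result is produced
    return {"input": word, "output": output, "time": stop - start, "issue": issue}
```

Each returned record mirrors the columns of the sample result table (input, output, time, and issue), so both successful and failed inputs carry a processing time.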